Big data platforms today routinely run batch jobs, and batch jobs inevitably rely on scheduled tasks: periodically extracting data from business databases, periodically running Hive/Spark jobs, periodically pushing daily and monthly metric data. A task scheduling system has therefore become an indispensable part of any big data processing platform.
I. Original task scheduling
I remember the first time I took part in building a big data platform from scratch. In the beginning we used Crontab for task scheduling, with daily, weekly, and monthly task scripts all configured on a single host. Crontab is very easy to use and easy to configure, and with only a few tasks it worked well enough; I just checked the logs every morning. But as the number of tasks grew, some tasks could no longer finish within their planned windows. Some tasks depended on others: if a downstream task started before its upstream task had finished, there was no data yet and the task failed, or the two tasks ran in parallel and produced wrong results. Tracking down the causes of task failures became more and more troublesome, and the dependencies among tasks grew more and more complex; in the end, every day felt like untangling a knotted rope, tracing problems one strand at a time. Crontab is simple and stable, but as tasks multiplied and dependencies grew more complex, it could no longer meet our needs at all. At that point, we had to build our own scheduling system.
II. Scheduling system
There are strong dependencies between task units: a downstream task can execute only after its upstream task has executed successfully. For example, if downstream tasks 2 and 3 both need to combine the result of upstream task 1, they must not start until task 1 has finished and produced its result. To guarantee the accuracy of data processing results, these tasks must be executed in order, and efficiently, according to their upstream-downstream dependencies, so that business metrics are ultimately generated on time.
Airflow
Apache Airflow is a powerful workflow tool for directed acyclic graph (DAG) orchestration, task scheduling, and task monitoring. Airflow manages execution dependencies between jobs in a DAG and handles job failures, retries, and alerting. Developers write Python code to express their data pipelines as operations in a workflow.
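For a concrete picture, here is a minimal sketch of an Airflow DAG; the dag_id, schedule, and echo commands are placeholder assumptions. It encodes the dependency pattern from the earlier example: tasks 2 and 3 start only after task 1 succeeds.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal DAG: tasks 2 and 3 both depend on task 1, mirroring the
# upstream/downstream example above. All names and commands are placeholders.
with DAG(
    dag_id="daily_metrics",
    schedule_interval="0 2 * * *",   # run at 02:00 every day
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    task1 = BashOperator(task_id="task1_extract", bash_command="echo extract")
    task2 = BashOperator(task_id="task2_transform", bash_command="echo transform")
    task3 = BashOperator(task_id="task3_report", bash_command="echo report")

    task1 >> [task2, task3]  # downstream starts only after upstream succeeds
```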
It mainly consists of the following components:
- Web Server: workflow configuration, monitoring, management, and other operations
- Scheduler: the workflow scheduling process, which triggers workflow execution and status updates
- Message queue: stores task execution commands and task execution status reports
- Worker: executes tasks and reports their status
- MySQL: stores workflow and task metadata
Specific execution process:
- The Scheduler scans DAG files into the database and determines whether execution should be triggered
- A dag_run is generated, and task_instance records are stored in the database
- The Scheduler sends task execution commands to the message queue
- A Worker retrieves a task from the queue and executes the command to run it
- The Worker reports the task execution status to the message queue
- The Scheduler obtains the task execution status and decides the next step
- The Scheduler updates the database based on the status
Kettle
In Kettle, users drag and drop task components onto a workspace canvas. Kettle supports a wide range of common data transformations, and users can also drop custom Python, Java, JavaScript, and SQL scripts onto the canvas. Kettle accepts many file types as input and can connect to more than 40 databases, as sources or targets, via JDBC or ODBC. The community edition is free but offers fewer features than the paid edition.
XXL-JOB
XXL-JOB is a distributed task scheduling platform whose core design goals are rapid development, easy learning, light weight, and easy extension. Scheduling behavior is abstracted into a common platform, the "scheduling center", which carries no business logic itself and is only responsible for initiating scheduling requests. Tasks are abstracted into scattered JobHandlers, managed by "executors"; an executor receives scheduling requests and runs the business logic in the corresponding JobHandler. "Scheduling" and "task" are thus decoupled from each other, improving the overall stability and scalability of the system. (It turns out that "XXL" is the pinyin initials of the author's name.)
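The decoupling can be pictured with a rough sketch in plain Python and HTTP (not XXL-JOB's actual Java API): the scheduling center only knows a job name and the executor's address, while the business logic lives in handler functions on the executor side. The job name and port below are made up.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Executor side: maps job names to business logic. The scheduling center
# never contains this logic; it only POSTs a trigger request.
JOB_HANDLERS = {
    "demoJobHandler": lambda: print("running business logic..."),  # placeholder
}

class Executor(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        handler = JOB_HANDLERS.get(request.get("jobHandler"))
        if handler:
            handler()                # run the JobHandler's business logic
            self.send_response(200)
        else:
            self.send_response(404)  # unknown job name
        self.end_headers()

if __name__ == "__main__":
    # The scheduling center would POST {"jobHandler": "demoJobHandler"} here.
    HTTPServer(("0.0.0.0", 9999), Executor).serve_forever()
```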
There are many open source scheduling tools; choose and extend one according to your team's familiarity with them and your company's needs.
III. How to design a scheduling system
A scheduling platform actually needs to solve three problems: task orchestration, task queuing, and task scheduling.
- **Task orchestration** is delegated to an external orchestration service. The main consideration is that orchestration must be driven by certain business attributes, so the volatile business part is kept separate from the job scheduling platform; if the orchestration logic is adjusted or modified later, no change is required on the scheduling platform itself.
- **Task queuing** supports multi-queue configuration. Different types of developers can later be assigned different queues and resources; for example, different developers need different service queues, and different tasks need different queue priorities. Isolating scheduling through queues better serves users with different requirements, and since each queue has its own resources, using those resources properly maximizes the value of the service.
- **Task scheduling** schedules a task together with the group of subtasks belonging to it. For simplicity and control, each job is resolved into an ordered task list, and each task is then scheduled in turn. A small complication is that tasks contain subtasks, which are processing components such as field conversion and data extraction; a subtask must be referenced by its parent task to be scheduled. The task is the basic unit of scheduling: a scheduled task is sent to the message queue, where the task coordination and computing platform consumes and runs it. The scheduling platform then only has to wait for the task completion message, update the status of the job and task, and decide the next task to schedule based on the actual status (a minimal sketch of this dispatch loop follows this list).
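Here is that dispatch loop as a minimal in-process sketch, assuming a local queue in place of a real message queue; the job and task names are placeholders.

```python
import json
import queue

# In a real deployment the queue would be Kafka/RabbitMQ and the worker
# a separate process on the coordination/computing platform.
task_queue: "queue.Queue[str]" = queue.Queue()
status = {}

def schedule(job_id: str, ordered_tasks: list) -> None:
    """Resolve a job into an ordered task list and enqueue each task."""
    for task in ordered_tasks:
        task_queue.put(json.dumps({"job": job_id, "task": task}))

def worker_loop_once() -> None:
    """Consume one task, 'run' it, and report its status back."""
    msg = json.loads(task_queue.get())
    # ... the computing platform would actually run the task here ...
    status[(msg["job"], msg["task"])] = "success"

schedule("job-1", ["extract", "field_convert", "load"])
while not task_queue.empty():
    worker_loop_once()
print(status)  # the scheduler updates job/task state from these reports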
The following points also deserve attention when designing a scheduling platform:
- Scheduled tasks need timeout handling. For example, if poor design by a developer makes a task run too long, a maximum execution time can be set for it; once exceeded, the task should be killed promptly so that it does not occupy too many resources and affect normal task runs.
- The number of jobs scheduled at the same time must be controlled. Cluster resources are limited, so task concurrency has to be capped. Once there are thousands of tasks, start times need to be staggered in good time to avoid launching a large batch of tasks simultaneously and to reduce the pressure on scheduling and computing resources.
- Job priority must be controlled. Every business has a level of importance, and the most important ones should be executed first and given scheduling resources first. When tasks back up, run the higher-priority tasks first to minimize the impact on the business. (A sketch combining these three controls follows this list.)
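Here is a minimal sketch combining the three controls, under the assumption that tasks are shell commands; the commands, limits, and priorities are placeholders, and a real platform would kill cluster jobs rather than local processes.

```python
import heapq
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 4      # concurrency control: at most 4 tasks at once
TIMEOUT_SECONDS = 3600   # timeout control: kill a task after one hour

tasks = []  # priority control: a min-heap ordered by priority number
heapq.heappush(tasks, (0, "python run_core_metrics.py"))  # 0 = most important
heapq.heappush(tasks, (5, "python run_adhoc_report.py"))

def run_task(command: str) -> str:
    try:
        # subprocess.run kills the child process once the timeout expires
        subprocess.run(command, shell=True, timeout=TIMEOUT_SECONDS, check=True)
        return "success"
    except subprocess.TimeoutExpired:
        return "killed: exceeded max execution time"
    except subprocess.CalledProcessError:
        return "failed"

with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
    while tasks:
        _, cmd = heapq.heappop(tasks)  # highest priority (lowest number) first
        pool.submit(run_task, cmd)
```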
IV. Summary
ETL development is one of the essential skills of a data engineer and plays an important role in data warehousing, BI, and other scenarios. Yet many practitioners cannot even say what ETL stands for in English, let alone analyze it in depth, which is hardly competent. ETL can be developed in any programming language, whether shell, Python, Java, or even database stored procedures, as long as the result is data being extracted (E), transformed (T), and loaded (L). Because ETL is an extremely complex process and hand-written programs are difficult to manage, more and more visual scheduling tools have appeared.
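As a tiny illustration of those three steps, here is a hand-written ETL sketch in Python; the CSV file, field names, and SQLite target are all placeholder assumptions.

```python
import csv
import sqlite3

def extract(path: str) -> list:
    with open(path, newline="") as f:
        return list(csv.DictReader(f))            # E: read raw rows

def transform(rows: list) -> list:
    return [
        (r["user_id"], float(r["amount"]) * 100)  # T: convert yuan to cents
        for r in rows
        if r.get("amount")                        # drop rows missing a value
    ]

def load(rows: list) -> None:
    con = sqlite3.connect("dw.db")                # L: write to the warehouse
    con.execute("CREATE TABLE IF NOT EXISTS orders (user_id TEXT, cents REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract("orders.csv")))
```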
It doesn't matter whether a cat is black or white, as long as it catches mice. Likewise, no matter what kind of tool you use, as long as it is efficient and easy to maintain, it is a good tool.
Recommended past articles

- How to build a big data platform from zero to one
- Log collection components: Flume, Logstash, and Filebeat comparison
- Are you an analyst or a number puller?
- Talk about data quality in ETL