Recently, we have met many partners who are studying ETL and its tools, and they often ask us the same thing: they use the same Kettle and start from the same point, so why do others do ETL so quickly and so well while their own jobs keep failing?

In fact, open-source tools such as Kettle cover most of the functions needed in daily work. Deploying a set of such tools is enough to meet basic enterprise requirements.

Today, we will make a simple comparison and evaluation of one of the more popular kinds of companion tool, the scheduling tool, to help you quickly pick up a new way of doing ETL with open-source tools.



Why do you need a scheduling system?

Let’s start with the basics.

We all know that big data computation, analysis, and processing generally consists of multiple task units (Hive, Spark SQL, Spark, Shell, etc.), and each task unit implements a specific piece of data processing logic.

There are strong dependencies between task units: a downstream task can run only after its upstream task has succeeded. For example, if the upstream task produces result A and the downstream task needs result A to produce result B, the downstream task must start only after the upstream task has run successfully and produced its result.
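To make the upstream and downstream idea concrete, here is a minimal sketch using Python's standard `graphlib` module (Python 3.9+). The task names are made up for illustration. It only computes one valid execution order and is not how any particular scheduler works internally.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of upstream tasks it needs.
dependencies = {
    "extract_orders": set(),
    "extract_users": set(),
    "join_orders_users": {"extract_orders", "extract_users"},
    "daily_report": {"join_orders_users"},
}


def execution_order(deps):
    """Return one valid order in which every task comes after its upstream tasks."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The two extract tasks may appear in either order, since they do not depend on each other, but the join always comes after both and the report always comes last.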

To guarantee correct results, these tasks must be executed in an orderly and efficient way according to their upstream and downstream dependencies. A fairly basic approach is to estimate the processing time of each task, calculate the start and end time of each task from the sequence, and keep the whole system running stably by launching tasks at fixed times.
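As a rough illustration of this time-based approach, the sketch below derives fixed start times from estimated durations. The task names, the durations, and the safety buffer between tasks are all assumptions made for the example. In practice the resulting times would typically end up in something like crontab entries.

```python
from datetime import datetime, timedelta

# Hypothetical tasks, listed in dependency order, with estimated minutes each.
estimated_minutes = [
    ("extract", 30),
    ("transform", 45),
    ("load", 15),
]


def fixed_schedule(tasks, first_start, buffer_minutes=10):
    """Give each task a start time: previous start + estimated duration + buffer."""
    schedule = {}
    start = first_start
    for name, minutes in tasks:
        schedule[name] = start
        start = start + timedelta(minutes=minutes + buffer_minutes)
    return schedule


schedule = fixed_schedule(estimated_minutes, datetime(2024, 1, 1, 1, 0))
for name, start in schedule.items():
    print(f"{start:%H:%M} {name}")
```

Note that nothing in this schedule checks whether the previous task actually finished or succeeded. It relies entirely on the estimates being right.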

A complete data analysis job is executed at least once. For low-frequency processing with little data and simple dependencies, this time-based approach fully meets the requirements. In enterprise scenarios, however, jobs usually need to run every day. When there are many tasks, calculating their start times takes a lot of effort. And if an upstream task runs past its expected time or fails, the approach cannot cope at all and leads to repeated manual rework. For enterprise data development, therefore, a complete and efficient workflow scheduling system plays a vital role.
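The core difference of a workflow scheduler is that a task starts because its upstream tasks succeeded, not because the clock reached a certain time. The following is a deliberately simplified sketch of that idea, with hypothetical task names and a simulated failure. Real schedulers add retries, parallel execution, alerting, and persistence on top of this.

```python
from graphlib import TopologicalSorter


def run_workflow(deps, actions):
    """Run tasks in dependency order; skip any task whose upstream did not succeed."""
    status = {}
    for task in TopologicalSorter(deps).static_order():
        if any(status[up] != "success" for up in deps.get(task, ())):
            status[task] = "skipped"
            continue
        try:
            actions[task]()
            status[task] = "success"
        except Exception:
            status[task] = "failed"
    return status


def failing_transform():
    # Simulates an upstream task that breaks during the nightly run.
    raise RuntimeError("simulated upstream failure")


# Hypothetical three-step pipeline: extract -> transform -> load.
deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
actions = {
    "extract": lambda: None,
    "transform": failing_transform,
    "load": lambda: None,
}

status = run_workflow(deps, actions)
print(status)
```

Here the failed `transform` step prevents `load` from running on incomplete data, which is exactly what a fixed timetable cannot guarantee.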

Scheduling Tool Comparison

Oozie

Oozie: the elephant trainer (it schedules MapReduce). Oozie is an open-source framework built on a workflow engine. It runs as a Java servlet web application and schedules tasks in a logical order.

It has the following features:

  • Unified scheduling of common Hadoop tasks: starting MapReduce jobs, HDFS operations, shell scripts, Hive operations, etc.;
  • Complex dependencies, time triggers, and event triggers are expressed in XML, which is meant to make development more efficient (this is debatable; personally I dislike XML and do not find it efficient);
  • A group of tasks is represented as a DAG, so the process is expressed graphically and clearly;
  • Supports many task types and can handle most Hadoop tasks;
  • Workflow definitions support EL constants and functions, which makes expressions rich;
  • Sends email notifications when a job completes;
  • Supports operation through the Web, the REST API, and the Java API, whereas Azkaban is operated through the Web.

Azkaban

Azkaban is a batch workflow task scheduler open-sourced by LinkedIn. It is used to run a set of jobs and processes in a particular order within a workflow. Azkaban defines a key-value (KV) file format to establish dependencies between tasks and provides an easy-to-use web user interface for maintaining and tracking your workflows.
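To give a feel for the KV format, the sketch below reads a few job definitions written in the style of Azkaban's classic `.job` files, where the file name is the job name and a `dependencies` key lists upstream jobs. The file contents are invented for this example, and the parser is a simplified illustration, not Azkaban's own loader.

```python
# Hypothetical job definitions in the style of Azkaban's classic .job files.
job_files = {
    "extract.job": "type=command\ncommand=echo extract\n",
    "transform.job": "type=command\ncommand=echo transform\ndependencies=extract\n",
    "load.job": "# load step\ntype=command\ncommand=echo load\ndependencies=transform\n",
}


def parse_job(text):
    """Parse key=value lines, ignoring blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def dependency_graph(files):
    """Map each job name (file name without .job) to its set of upstream jobs."""
    graph = {}
    for filename, text in files.items():
        name = filename.removesuffix(".job")
        deps = parse_job(text).get("dependencies", "")
        graph[name] = {d.strip() for d in deps.split(",") if d.strip()}
    return graph


graph = dependency_graph(job_files)
print(graph)
```

Because each job only names its direct upstream jobs, the whole workflow graph falls out of a handful of small text files, which is a large part of why this format feels lighter than XML.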

It has the following features:

  • Web User Interface
  • Easy workflow upload
  • Easy setup of relationships between tasks
  • Workflow scheduling
  • Authentication and authorization (permission management)
  • Ability to kill and restart workflows
  • Modular and pluggable plug-in mechanism
  • Project workspace
  • Logging and auditing of workflow and tasks

TASKCTL

TASKCTL is a comprehensive management tool for automated job scheduling. With TASKCTL, you can quickly organize jobs, manage them effectively, and control how they run through parameters. In the industry, this technology is commonly called job scheduling, and its technical essence is the automated control and management of job execution.

TASKCTL is a one-stop big data tool platform and community for individuals, business owners, and independent data application developers. The basic package is free forever. With TASKCTL, individuals and enterprises can integrate and develop data from their own multi-source business systems, turn it into data assets, and easily build their own data platforms in the cloud for their own scenarios, without paying much attention to the complex installation, configuration, and daily operation and maintenance of the underlying big data storage and computing engines.

The TASKCTL scheduling functions are as follows:

  • Adaptive scheduling for more than 20 data sources: MySQL, Oracle, Hive, HBase, Redis, MongoDB, ODPS, PostgreSQL, Elasticsearch, WebService, GBase, etc.
  • Modular and pluggable plug-in mechanism: shields the technical differences between application platforms and adapts them to unified interfaces for execution, stopping, status and log queries, and access.
  • Visual workflow configuration: supports drag-and-drop graph editing and automatic layout with minimal edge crossings, and clearly shows the serial and parallel relationships between job nodes. Job icons can be customized per type, and the currently executing job node can be located quickly.
  • Task alarms: multi-channel subscription through email, SMS, WeChat, DingTalk, and others, with multi-level push of platform, process, and operation messages.
  • Diverse manual intervention: normal scheduling, free scheduling, and virtual scheduling, plus force break, force pass, disable pass, preset breakpoints, ignore conditions, etc.
  • Job priority configuration: parallelism control at platform, process, and job level, resource weight settings, and dynamic promotion of a job to top priority.
  • Workflow assembly: scheduling metadata can be organized at several levels, for example project → workflow (nestable) → module (nestable) → job.
  • Workflow test runs: supports the whole development process, with life-cycle management covering coding → compiling → debugging → release → running.
  • Fast fault location: automatically locates job nodes that are executing or in an abnormal state.

Conclusion

Apache Oozie is a heavyweight task scheduling system with comprehensive functions. However, it is difficult to deploy and configure, and the jump from plain crontab to Oozie is steep. Azkaban sits between Oozie and crontab, but it is not as robust as Oozie: in case of failure, Azkaban loses all of its workflows, whereas Oozie can continue to run. Compared with these two tools, TASKCTL avoids complex configuration and deployment, is easy to extend, and offers more convenient development, operation, and maintenance functions for workflows.

TASKCTL is more than just a full-featured workflow scheduling tool. It is a one-stop big data platform for everything from simple ETL tasks to complex data middle-platform tasks. The basic version is free forever. Whatever problem you run into, customer service is there to help you solve it, and we think the experience is far better than with open-source products. Are you sure you don't want to try it?