  • Original article: What we learned migrating off Cron to Airflow
  • Author: Katie Macias
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: cf020031308
  • Proofreader: Yqian1991

Last fall, VideoAmp's data engineering department was going through major changes. The team consisted of three data engineers and a systems engineer who worked closely with us. Together we identified the technical debt that most needed to be paid down.

At the time, the data team was the sole owner of all batch jobs, which pulled data from our real-time bidding warehouse and loaded it into the Postgres database that feeds the UI. If these jobs failed, the UI went stale, and our internal traders and external customers had nothing but outdated data to rely on. Meeting the service level agreements of these jobs is therefore critical to the success of the platform. Most of these jobs are written in Scala and run on Spark, and they were carefully orchestrated with Cron, the scheduler built into Linux.

The advantages of Cron

We eventually found that Cron's major pain points outweighed its benefits. Cron is built into Linux and requires no installation, and it is quite reliable, which makes it an attractive option. That makes Cron an excellent choice for proof-of-concept projects, but it does not scale well.

Crontab file: how our jobs were scheduled previously

The disadvantages of Cron

The first problem with Cron is that changes to the crontab file are hard to track. The crontab holds the schedule for every job running on the machine, spanning multiple projects, yet it lives outside source control and outside any single project's deployment process. Instead, engineers edit it as needed, leaving no record of when it changed or of the dependencies between projects and jobs.

The second problem is opacity. Cron logs are written to the server where the job runs, not to a central location. How does a developer know whether a job succeeded or failed? Digging those details out of the logs is expensive, whether developers do it for themselves or have to expose the results to downstream engineering teams.

Finally, rerunning failed jobs is ad hoc and difficult. By default, Cron sets only a handful of environment variables. Novice developers are often surprised to find that the Bash commands stored in the crontab do not produce the same output as their terminal, because the settings in their Bash profile do not exist in the Cron environment. The developer is left reconstructing every dependency of the user environment in which the command runs.
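To make the environment problem concrete, here is a tiny sketch (the script name and the particular variables are made up for illustration) that prints a few settings a Spark job typically relies on. Run it from an interactive shell and then from a crontab entry, and the difference is immediately visible:

```python
# env_check.py: a minimal sketch of why a job behaves differently under Cron.
# Cron starts with a nearly empty environment, so anything your shell profile
# exports (PATH additions, JAVA_HOME, credentials, etc.) is simply missing.
import os

for key in ("PATH", "JAVA_HOME", "SPARK_HOME", "AWS_PROFILE"):
    # In an interactive shell these usually come from ~/.bashrc or ~/.bash_profile;
    # under Cron they are typically unset unless declared in the crontab itself.
    print(f"{key}={os.environ.get(key, '<unset>')}")
```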

It was clear that a lot of tooling would need to be built on top of Cron to make it work at scale. We hacked around some of this, but we knew there were far more capable open-source options available. Across the team we had used orchestration tools ranging from Luigi to Oozie to various custom solutions, and each had left us unsatisfied. Airflow, built at Airbnb and since accepted into the Apache Incubator, looked like the obvious next step.

Setup and Migration

In true hacker fashion, we quietly carved resources out of the existing technology stack and set up an Airflow metadata database (metadb) and host to prove out the idea.

The metadb holds critical information such as the directed acyclic graphs (DAGs) and their task lists, job runs and their results, and the variables used to signal other tasks. Airflow's metadb can be backed by a relational database such as PostgreSQL or SQLite, and this separation lets us scale beyond a single instance. We appropriated a PostgreSQL instance on Amazon RDS that another team had been developing against (their development sprint had ended and they no longer used it).

The Airflow host is installed on our Spark development virtual machine, which is part of our Spark cluster. The host is configured with the LocalExecutor and runs Airflow's scheduler and UI. Because it sits on an instance inside our Spark cluster, our jobs have access to the Spark dependencies they need to execute. This was the key to a successful migration, and the reason previous attempts had failed.

Moving from Cron to Airflow posed its own challenges. Because Airflow has its own scheduling model, we had to modify our applications to take their inputs in the new format. Fortunately, Airflow Variables provide the meta-information our scheduling scripts need. We were also able to strip out most of the tooling we had built into the applications, such as push alerts and signalling. Finally, we broke many of our applications into smaller tasks to follow the DAG paradigm.
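As a rough illustration of what a converted job can look like, here is a minimal sketch of an Airflow DAG in the style of the 1.x releases we started on. The DAG id, Variable name, class names, and spark-submit arguments are all made up for illustration; this is not VideoAmp's actual code. The point is that the scheduling knobs live in an Airflow Variable rather than in the application, and the old monolithic job is split into smaller tasks the scheduler can retry independently:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="hourly_bid_aggregation",  # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
)

# A Variable holding, for example, the warehouse path the Spark job should read.
input_path = Variable.get("bid_warehouse_path", default_var="s3://example-bucket/bids")

extract = BashOperator(
    task_id="extract_bids",
    bash_command=(
        "spark-submit --class com.example.ExtractBids "  # hypothetical class name
        "/opt/jobs/batch-jobs.jar {{ ds }} " + input_path
    ),
    dag=dag,
)

load = BashOperator(
    task_id="load_postgres",
    bash_command="spark-submit --class com.example.LoadPostgres /opt/jobs/batch-jobs.jar {{ ds }}",
    dag=dag,
)

extract >> load  # the dependency between tasks, which Cron could never express
```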

Airflow UI: from the UI, developers can see at a glance which DAGs are stable and which are less robust and need hardening.

Lessons learned

  1. Application bloat. With Cron, scheduling logic had to be tightly coupled to the application: each application had to do the work of an entire DAG. That extra logic obscured the unit of work at the heart of the application, which made it hard to debug and to develop in parallel. Since moving to Airflow, organizing tasks into DAGs has let the team write focused, robust scripts around a single unit of work while minimizing our overall workload.
  2. Better visibility into batch jobs. Through the Airflow UI, the data team's batch jobs are transparent both to the team and to the other engineering departments that rely on our data.
  3. Data engineers with different skill sets can build pipelines together. Our data engineering team runs Spark jobs written in both Scala and Python. Airflow gives the team a familiar middle layer for establishing contracts between Python and Scala applications, letting us support engineers with either skill set.
  4. A clear scaling path for batch job scheduling. With Cron, our scheduling was confined to a single machine, and scaling out would have meant building a coordination layer ourselves. Airflow gives us that scaling out of the box: the path from the LocalExecutor to the CeleryExecutor, and from a CeleryExecutor on one machine to multiple Airflow workers, is clear.
  5. Rerunning failed jobs is easy. With Cron, we had to dig out the exact Bash command that was executed and hope our user environment matched Cron's closely enough to reproduce the problem for debugging. Now any data engineer can read the logs in Airflow, understand the error, and rerun the failed task directly.
  6. Appropriate alert levels. Before Airflow, every alert from a batch job landed in the alert inbox of our streaming application. With Airflow, the team built a Slack operator that all DAGs call uniformly to push notifications (a minimal sketch of this kind of notification callback follows this list). That separation keeps the urgent failure notifications from our real-time bidding stack distinct from the important but non-urgent notifications from our batch jobs.
  7. A badly behaved DAG can take down an entire machine. Establish conventions for your data team to monitor the external dependencies of their jobs, so that one job does not jeopardize the service level agreements of the others.
  8. Rotate your Airflow logs. This should go without saying, but Airflow stores logs for every application it invokes. Make sure those logs are rotated properly, or the whole machine can go down when the disk fills up.
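The sketch below shows one way the Slack alerting described in lesson 6 can be wired up, using a plain incoming-webhook callback rather than VideoAmp's actual operator. The Variable name, channel, and function name are illustrative assumptions:

```python
import json
import urllib.request

from airflow.models import Variable


def notify_slack_on_failure(context):
    """Post a non-urgent batch-job failure notice to the team's Slack channel."""
    task_instance = context["task_instance"]
    message = {
        "channel": "#data-eng-batch-alerts",  # hypothetical channel
        "text": (
            f"Task {task_instance.task_id} in DAG {task_instance.dag_id} "
            f"failed for {context['ds']}. Log: {task_instance.log_url}"
        ),
    }
    webhook_url = Variable.get("slack_webhook_url")  # hypothetical Variable
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
```

Each DAG can then opt in by setting on_failure_callback=notify_slack_on_failure in its default_args, which keeps these batch alerts out of the streaming stack's urgent alert inbox.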

Five months later, VideoAmp's data engineering team has almost tripled in size. We manage 36 DAGs and counting! Airflow has scaled so that all of our engineers can contribute to and support our batch processes, and the simplicity of the tool makes onboarding new engineers relatively painless. The team is rapidly iterating on improvements such as unified push alerts to our Slack channel, upgrading to Airflow on Python 3, moving to the CeleryExecutor, and taking further advantage of what Airflow provides.

Feel free to ask questions or leave comments here, and to share your own experience below.
