It has been more than half a year since I last wrote a post here. A few things have happened in the meantime, the biggest being that I changed jobs at the beginning of this month, moving from Palo Alto to Mountain View. I have been at the new job for about three weeks and am very pleased to have started on a new project whose core is a Directed Acyclic Graph (DAG) scheduling tool, Airflow. I read a lot of blog posts, stepped in plenty of pitfalls large and small, and finally got it running on a server; the next step is to build something on top of it. Now that it is working, I figured I should write up a short Airflow example describing how to run the program as a service on a Linux machine.

Installation

I created a dedicated airflow user to install, run, and configure Airflow, and granted it root (sudo) access so the setup can proceed smoothly. Installing Airflow is the easiest step and is described in detail on the official website. My environment is Python 3.6.6 with Airflow 1.10.0; to avoid conflicts with existing packages I installed it into a virtualenv. Run the following commands under /home/airflow:

  • virtualenv venv -p `which python3`
  • source venv/bin/activate
  • pip install "apache-airflow[postgres,crypto,gcp_api]==1.10.0"

The square brackets are optional dependencies. Here I use PostgreSQL as the Airflow metadata database (the default is SQLite), I want to encrypt connection parameters such as passwords, and I need to interact with Google Cloud services, so I install these three extras. Pick the extras that match your own situation; see the official documentation for details.

As an aside, if you want to offer optional dependencies (the square brackets) in a Python package of your own, you can do so by defining extras_require in setup.py, as described in the setuptools documentation.
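A minimal sketch of what that looks like (the package name and extra groups below are made up for illustration):

  • # setup.py: declare optional dependency groups via extras_require
  • from setuptools import setup, find_packages
  • setup(
  •     name="mytool",                      # hypothetical package name
  •     version="0.1.0",
  •     packages=find_packages(),
  •     install_requires=["requests"],      # always installed
  •     extras_require={
  •         "postgres": ["psycopg2"],       # pip install "mytool[postgres]"
  •         "crypto": ["cryptography"],     # pip install "mytool[crypto]"
  •     },
  • )

Users who install the bare package get only install_requires; asking for an extra pulls in the corresponding list as well.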

The configuration file

Since we want Airflow to run as a long-lived service that we can keep developing against and maintaining, rather than something we run once and forget, I use systemd to manage the Airflow processes.

The Airflow documentation has a brief introduction to the systemd configuration. Specifically, in my setup I set the environment variable AIRFLOW_HOME to /etc/airflow and AIRFLOW_CONFIG to /etc/airflow/airflow.cfg, so my /etc/sysconfig/airflow environment file contains only these two variables:

  • AIRFLOW_CONFIG=/etc/airflow/airflow.cfg
  • AIRFLOW_HOME=/etc/airflow

For this to work, two services are required: a webserver that serves the Web UI and a scheduler that executes the tasks in the DAGs. Fortunately, Airflow ships sample unit files for both. The only line that needs to be modified is ExecStart, since we are running Airflow inside a virtualenv. The two files are shown below.

  • airflow-webserver.service
  • #
  • # Licensed to the Apache Software Foundation (ASF) under one
  • # or more contributor license agreements. See the NOTICE file
  • # distributed with this work for additional information
  • # regarding copyright ownership. The ASF licenses this file
  • # to you under the Apache License, Version 2.0 (the
  • # "License"); you may not use this file except in compliance
  • # with the License. You may obtain a copy of the License at
  • #
  • # http://www.apache.org/licenses/LICENSE-2.0
  • #
  • # Unless required by applicable law or agreed to in writing,
  • # software distributed under the License is distributed on an
  • # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  • # KIND, either express or implied. See the License for the
  • # specific language governing permissions and limitations
  • # under the License.
  • [Unit]
  • Description=Airflow webserver daemon
  • After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
  • Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
  • [Service]
  • EnvironmentFile=/etc/sysconfig/airflow
  • User=airflow
  • Group=airflow
  • Type=simple
  • ExecStart=/bin/bash -c 'source /home/airflow/venv/bin/activate ; airflow webserver --pid /run/airflow/webserver.pid'
  • Restart=on-failure
  • RestartSec=5s
  • PrivateTmp=true
  • [Install]
  • WantedBy=multi-user.target
  • airflow-scheduler.service
  • #
  • # Licensed to the Apache Software Foundation (ASF) under one
  • # or more contributor license agreements. See the NOTICE file
  • # distributed with this work for additional information
  • # regarding copyright ownership. The ASF licenses this file
  • # to you under the Apache License, Version 2.0 (the
  • # "License"); you may not use this file except in compliance
  • # with the License. You may obtain a copy of the License at
  • #
  • # http://www.apache.org/licenses/LICENSE-2.0
  • #
  • # Unless required by applicable law or agreed to in writing,
  • # software distributed under the License is distributed on an
  • # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  • # KIND, either express or implied. See the License for the
  • # specific language governing permissions and limitations
  • # under the License.
  • [Unit]
  • Description=Airflow scheduler daemon
  • After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
  • Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
  • [Service]
  • EnvironmentFile=/etc/sysconfig/airflow
  • User=airflow
  • Group=airflow
  • Type=simple
  • ExecStart=/bin/bash -c 'source /home/airflow/venv/bin/activate ; airflow scheduler'
  • Restart=always
  • RestartSec=5s
  • [Install]
  • WantedBy=multi-user.target

The exact location of bash varies from system to system; the one thing to note is that it must be invoked with an absolute path. Place both unit files under /etc/systemd/system. There is also an essential airflow.conf file, which is copied to /etc/tmpfiles.d/ so that the /run/airflow directory referenced by the PID file above exists.
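If memory serves, the airflow.conf shipped with Airflow is a single tmpfiles.d line along these lines (the exact mode and owner may differ in your copy):

  • # airflow.conf: have systemd-tmpfiles create /run/airflow owned by the airflow user
  • D /run/airflow 0755 airflow airflow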

That completes the systemd part, but we are not done yet: we still need an airflow.cfg to tell Airflow how it should be configured. Everyone's situation is different, so I will not go through every option; here are just a few of the more important ones.

  • sql_alchemy_conn = postgresql+psycopg2://<user>:<password>@<host>:<port>/<db>. We use PostgreSQL as the metadata database in our production environment.
  • load_examples = False. The example DAGs are a good reference for development and testing, but they obviously do not belong in production, so turn them off.
  • fernet_key = <some base64 string>. This is absolutely necessary; without it Airflow stores sensitive connection parameters such as passwords in clear text. The official docs describe how to generate one, and a quick sketch follows this list.
  • executor = LocalExecutor. LocalExecutor makes the most of a single machine's parallelism by running multiple processes to execute different tasks at the same time, which is enough for my current needs; scaling out horizontally with Redis + Celery (the CeleryExecutor) can be considered later.
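Generating a Fernet key is a one-liner with the cryptography package, which the crypto extra already pulls in; a minimal sketch:

  • # prints a base64 string suitable for the fernet_key setting in airflow.cfg
  • python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"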

Running

The first step is to initialize the metadata database, which I do manually, again as the airflow user:

  • source ~/venv/bin/activate
  • export AIRFLOW_HOME=/etc/airflow
  • airflow initdb

I then use systemd to control starting and stopping Airflow:

  • sudo systemctl [start|stop|restart|status] airflow-webserver
  • sudo systemctl [start|stop|restart|status] airflow-scheduler
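To have both services come up automatically after a reboot (and to make systemd pick up the newly added unit files), you will most likely also want something like:

  • sudo systemctl daemon-reload
  • sudo systemctl enable airflow-webserver airflow-scheduler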

Each time you want to add a new DAG, simply place its Python file in /etc/airflow/dags.
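For completeness, here is a minimal sketch of what such a file might look like on Airflow 1.10 (the DAG id, schedule, and command are made up):

  • from datetime import datetime
  • from airflow import DAG
  • from airflow.operators.bash_operator import BashOperator
  • # a trivial DAG with one task that echoes a greeting once a day
  • dag = DAG(
  •     dag_id="hello_airflow",
  •     start_date=datetime(2018, 1, 1),
  •     schedule_interval="@daily",
  •     catchup=False,
  • )
  • say_hello = BashOperator(task_id="say_hello", bash_command="echo hello", dag=dag)

The scheduler picks the file up automatically, after which the DAG appears in the Web UI.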

References

  • airflow.readthedocs.io/en/latest/
  • github.com/apache/incu…
  • wecode.wepay.com/posts/airfl…
  • medium.com/@vando/airf…
  • robinhood.engineering/…robinho…
