This article is published by the Cloud + community
Author: maxluo
1. Introduction to Azkaban
Azkaban is an open source task scheduling framework from LinkedIn, similar to workflow frameworks such as jBPM and Activiti in Java EE.
Azkaban features:
1. Task dependency processing.
2. Task monitoring and failure alerting.
3. Visualization of task flow.
4. Task permission management.
Common task scheduling frameworks include Apache Oozie, LinkedIn Azkaban, Apache Airflow, and Alibaba Zeus. Azkaban's advantages are that it is lightweight and pluggable, with a user-friendly WebUI, SLA alerting, complete access control, and easy secondary development, and it is widely used. The following figure shows the Azkaban architecture, which consists of three parts: the Azkaban Web Server, the Azkaban Executor, and the DB.
The Web Server is mainly responsible for permission verification, project management, and job flow management;
the Executor is responsible for executing job flows and collecting execution logs;
MySQL stores metadata such as the execution status of jobs and job flows. The figure shows the single-executor scenario, but in practice most projects use multiple executors.
1.1 Job flow execution process
The Azkaban Web Server selects an appropriate executor node based on the collected Executor status and pushes the workflow to that node, which then runs and manages all of the workflow's jobs.
1.2 Deployment Mode
Azkaban supports three deployment modes: solo-server for learning and testing, two-server for simple production use, and distributed multiple-executor for high availability.
Solo-server mode
The DB is an embedded H2, and the Web Server and Executor Server run in the same process. This mode contains all of Azkaban's features but is generally used only for learning and testing.
Two-server mode
The DB is MySQL (with master-slave support), and the Web Server and Executor Server run in separate processes.
Distributed multiple-executor mode
The DB is MySQL (with master-slave support). The Web Server and Executor Servers run on different machines, and there are multiple Executor Servers.
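For the multiple-executor mode, a minimal sketch of the relevant azkaban.properties entries might look like this (key names follow the Azkaban 3.x documentation; the MySQL host and credentials are placeholders):

```properties
# Enable multiple-executor dispatch (Azkaban 3.x)
azkaban.use.multiple.executors=true
# Filters applied when picking an executor for a flow
azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus
# Shared MySQL metadata store used by the Web Server and all Executors
database.type=mysql
mysql.host=<mysql-host>
mysql.database=azkaban
mysql.user=azkaban
mysql.password=azkaban
```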
1.3 Compilation and Deployment
Compile environment
yum install git
yum install gcc-c++
yum install java-1.8.0-openjdk-devel
Download source & decompress
mkdir -p /data/azkaban/install
cd /data/azkaban
wget https://github.com/azkaban/azkaban/archive/3.42.0.tar.gz
mv 3.42.0.tar.gz azkaban-3.42.0.tar.gz
tar -zxvf azkaban-3.42.0.tar.gz
Compile
cd azkaban-3.42.0
./gradlew build installDist -x test
Deploy in solo-server mode
To simplify the deployment test, the solo-server mode is used.
cd /data/azkaban/install
tar -zxvf ../azkaban-3.42.0/azkaban-solo-server/build/distributions/azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz -C .
Modify the time zone
cd /data/azkaban/install/azkaban-solo-server-0.1.0-SNAPSHOT
tzselect    # choose Asia/Shanghai
vim ./conf/azkaban.properties
default.timezone.id=Asia/Shanghai    # change the time zone
Start
./bin/azkaban-solo-start.sh
Note: the start/stop scripts must be run from the /data/azkaban/install/azkaban-solo-server-0.1.0-SNAPSHOT/ directory.
Login
http://ip:port/
The listener port is configured in ./conf/azkaban.properties: jetty.port=8081
IP indicates the server address.
Users are configured in ./conf/azkaban-users.xml. The default username and password of the admin role are both azkaban.
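For reference, a minimal azkaban-users.xml matching those defaults might look like the sketch below (based on the stock configuration; change the password and roles before real use):

```xml
<azkaban-users>
  <!-- default admin account shipped with solo-server -->
  <user username="azkaban" password="azkaban" roles="admin"/>
  <role name="admin" permissions="ADMIN"/>
</azkaban-users>
```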
For detailed configuration options, see: azkaban.github.io/azkaban/doc…
2. Network connectivity between Azkaban and warehouse cluster
Communication between Azkaban and Snova currently relies on two facts: 1. The Azkaban Executor servers can access the Internet or the Snova server IP. 2. Snova provides an external IP for access. The following figure is a schematic diagram of the network connectivity:
When Azkaban Executor executes a job, its scripts or commands access Snova through public IP addresses.
The following steps explain how to use azkaban-based workflows.
3. Preparatory Work
3.1 Creating an External IP address for the Snova Cluster
On the basic configuration page of the Snova cluster console, click Apply for an External IP address. After the request succeeds, the external IP address for accessing the cluster is displayed.
3.2 Adding the Snova Access Address whitelist
On the Snova console, create a whitelist on the cluster Details page and configuration page, as shown in the following figure.
Why is this access whitelist needed?
For system security, Snova by default denies database access to addresses or users that are not on the whitelist.
Set the whitelist CIDR to xx.xx.xx.xx/xx so that it covers the IP addresses or network segments of all Azkaban Executors.
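Before saving the whitelist, it can help to double-check that each Executor's IP actually falls inside the CIDR you plan to enter. A small sketch of such a check (all addresses here are hypothetical examples):

```shell
#!/bin/bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# in_cidr <ip> <cidr> -- succeed if <ip> lies inside <cidr>.
in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Example: is a hypothetical executor at 10.0.1.23 covered by 10.0.0.0/16?
in_cidr 10.0.1.23 10.0.0.0/16 && echo "covered" || echo "not covered"
```
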
3.3 User Authorization
Following section 3.2, it is recommended to create a separate user for SCF task scheduling and computation, so you need to grant this user permission to access the relevant databases and tables.
Create a user
CREATE USER scf_visit WITH LOGIN PASSWORD 'scf_passwd';
This creates the user and sets its access password.
Database table Authorization
GRANT ALL on t1 to scf_visit;
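If scf_visit needs to reach more than a single table, access can also be granted at the database and schema level. A sketch, assuming the default public schema and the postgres database (adjust names to your cluster):

```sql
-- Hypothetical broader grants for the scheduling user
GRANT CONNECT ON DATABASE postgres TO scf_visit;
GRANT USAGE ON SCHEMA public TO scf_visit;
GRANT SELECT, INSERT ON t1 TO scf_visit;
```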
4. Scheduled Tasks
http://node1:8081/index
Log in to Azkaban and proceed step by step: Create Project => Upload the zip package generated in the previous step => Execute Flow.
Specific steps can be found in the reference documentation:
www.cnblogs.com/qingyunzong…
4.1 Creating a Project
4.2 Creating a Job
job1
Each job is defined in a file whose name must end with .job (for example, job1.job). The content is as follows:
type=command
command=echo "job1"
retries=5
Note: for job types and their usage, see azkaban.github.io/azkaban/doc…
job2
type=command
dependencies=job1
retries=5
command=echo "job2 xx"
command.1=ls -al
Note: dependencies lists the names of the jobs this job depends on (without the .job suffix). Separate multiple dependencies with commas, for example dependencies=job2,job5.
job3
type=command
dependencies=job2,job5
command=sleep 60
job5
type=command
dependencies=job1
command=pwd
job6
type=command
dependencies=job3
command=sh /data/shell/admin.sh psqlx
/data/shell/admin.sh can encapsulate user code in functions. The script content is as follows; it reads the data in the table and prints it (the trailing $1, added here so that `admin.sh psqlx` works, invokes the function named by the first argument):
function psqlx() {
    result=$(PGPASSWORD=scf_passwd psql -h xx.xx.xx.xx -p xx -U scf_visit -d postgres <<EOF
select * from t1;
EOF
)
    echo $result
}
$1
4.3 Uploading the Job Package
Compress all job files into a zip package. Note: all files must sit at the root of the archive, with no subdirectories, as follows:
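The packaging step can be sketched with the standard zip tool; the -j ("junk paths") flag strips directory prefixes so every .job file lands at the archive root, which is what Azkaban expects. The directory and file contents below are illustrative only:

```shell
#!/bin/bash
# Build a throwaway working directory with two sample job files.
workdir=$(mktemp -d)

cat > "$workdir/job1.job" <<'EOF'
type=command
command=echo "job1"
EOF

cat > "$workdir/job2.job" <<'EOF'
type=command
dependencies=job1
command=echo "job2 xx"
EOF

# -j strips paths: entries appear as bare file names at the zip root.
zip -j "$workdir/flow.zip" "$workdir"/*.job
unzip -l "$workdir/flow.zip"   # every entry should be a bare .job name
```
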
4.4 Run
Query the execution process and results.
4.5 Configuring Periodic Scheduling
After debugging, you can set a periodic schedule, for example running the workflow once a day.
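Recent Azkaban releases accept a Quartz-style cron expression in the schedule dialog; for example, a daily 02:00 run could be expressed as follows (verify against your version's documentation before relying on it):

```
0 0 2 ? * *    # sec min hour day-of-month month day-of-week
```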
5. Summary of practice
This section compares in detail the two most popular schedulers on the market: Azkaban and the well-known Apache Oozie.
5.1 Comparison
Comparing functionality
Both can schedule workflows composed of Linux commands, MapReduce, Spark, Pig, Hive, Java programs, and script tasks
Both can perform workflow tasks on a regular basis
Comparing workflow definition
Azkaban uses the Properties file to define the workflow
Oozie uses XML files to define workflow
Comparing parameter passing
Azkaban supports direct parameter substitution, such as ${input}
Oozie supports parameters and EL expressions, such as ${fs:dirSize(myInputDir)}.
Comparing scheduled execution
1. Azkaban's scheduled tasks are based on time only
2. Oozie executes scheduled tasks based on time and input data
Comparing permission management
1. Azkaban has strict permission control, such as users’ read/write/execute operations on workflow
2. Oozie does not have strict permission control
5.2 Application Scenarios
Data analysis can be summarized into three steps: 1. Data import. 2. Data calculation. 3. Data export.
The three kinds of tasks may run concurrently and depend on one another, so Azkaban can meet the scheduling, management, and operational requirements of such scenarios.
First create job1 for data import, for example from COS; its task content executes the following SQL command:
insert into gp_table select * from cos_table;
Data can also be imported into the Snova warehouse periodically with other tools, such as DataX: simply deploy DataX to the corresponding directory on the Azkaban Executor machines and invoke it from a job.
Next, create job2 for data calculation and analysis. This step can consist of multiple jobs, run serially or concurrently.
Finally, export the results to the application database:
insert into cos_table select * from gp_table;
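The import → compute → export chain above can be wired together as three job files, one per file. A sketch, where the admin.sh function names are hypothetical placeholders:

```properties
# import.job -- load data (e.g. from COS) into the warehouse
type=command
command=sh /data/shell/admin.sh import_cos

# compute.job -- aggregate once the import finishes
type=command
dependencies=import
command=sh /data/shell/admin.sh compute

# export.job -- push results to the application database
type=command
dependencies=compute
command=sh /data/shell/admin.sh export_result
```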
5.3 Shortcomings
1. When a job in a flow fails, Azkaban supports re-running the flow: the new execution instance starts from the failed job, and jobs that already ran successfully are skipped.
2. When a job starts a complex program via a shell command, a successful shell return code does not necessarily mean the program itself ran successfully.
3. Fault tolerance in job management is limited: if the Web Server restarts or the Executor process dies while a task is running, the task's status is marked as failed.
This article has been published by Tencent Cloud + community in various channels