This article is published by the Cloud + community
Author: maxluo
1. Introduction to Azkaban
Azkaban is an open source task scheduling framework from LinkedIn, similar to workflow frameworks such as jBPM and Activiti in Java EE.
Azkaban features:
1. Task dependency processing.
2. Task monitoring and failure alerting.
3. Visualization of task flow.
4. Task permission management.
Common task scheduling frameworks include Apache Oozie, LinkedIn Azkaban, Apache Airflow, and Alibaba Zeus. Azkaban's advantages are that it is lightweight and pluggable, with a user-friendly WebUI, SLA alerting, complete access control, and easy secondary development, and it is widely used. The following figure shows the Azkaban architecture, which consists of three parts: the Azkaban Web Server, the Azkaban Executor, and the DB.
The Web Server is mainly responsible for permission verification, project management, and job flow management;
the Executor is responsible for executing job flows and collecting execution logs;
MySQL stores metadata such as the execution status of jobs and job flows. The figure shows the single-executor scenario, but in practice most projects use multiple executors.
1.1 Job flow execution process
The Azkaban Web Server selects an appropriate executor node based on the collected Executor status and pushes the workflow to that node, which then runs and manages all of the workflow's jobs.
1.2 Deployment Mode
Azkaban supports three deployment modes: solo-server for learning and testing, two-server for simple production use, and distributed multiple-executor for high availability.
Solo-server mode
The DB is an embedded H2, and the Web Server and Executor Server run in the same process. This mode contains all of Azkaban's features but is generally used only for learning and testing.
Two-server mode
The DB is MySQL (with master-slave support), and the Web Server and Executor Server run in separate processes.
Distributed multiple-executor mode
The DB is MySQL (with master-slave support). The Web Server and Executor Servers run on different machines, and there are multiple Executor Servers.
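For the multiple-executor mode, a minimal sketch of the relevant azkaban.properties entries might look like this (key names follow the Azkaban 3.x documentation; the MySQL host and credentials are placeholders):

```properties
# Enable multiple-executor dispatch (Azkaban 3.x)
azkaban.use.multiple.executors=true
# Filters applied when picking an executor for a flow
azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus
# Shared MySQL metadata store used by the Web Server and all Executors
database.type=mysql
mysql.host=<mysql-host>
mysql.database=azkaban
mysql.user=azkaban
mysql.password=azkaban
```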
1.3 Compilation and Deployment
Compile environment
yum install git
yum install gcc-c++
yum install java-1.8.0-openjdk-devel
Download source & decompress
mkdir -p /data/azkaban/install
cd /data/azkaban
wget https://github.com/azkaban/azkaban/archive/3.42.0.tar.gz
mv 3.42.0.tar.gz azkaban-3.42.0.tar.gz
tar -zxvf azkaban-3.42.0.tar.gz
Compile
cd azkaban-3.42.0
./gradlew build installDist -x test
Deploy in solo-server mode
To simplify the deployment test, the solo-server mode is used.
cd /data/azkaban/install
tar -zxvf ../azkaban-3.42.0/azkaban-solo-server/build/distributions/azkaban-solo-server-0.1.0-SNAPSHOT.tar.gz -C .
Modify the time zone
cd /data/azkaban/install/azkaban-solo-server-0.1.0-SNAPSHOT
tzselect    # choose Asia/Shanghai
vim ./conf/azkaban.properties
default.timezone.id=Asia/Shanghai    # change the time zone
Start
./bin/azkaban-solo-start.sh
Note: the start/stop scripts must be run from the /data/azkaban/install/azkaban-solo-server-0.1.0-SNAPSHOT/ directory.
Login
http://ip:port/
The listener port is configured in ./conf/azkaban.properties: jetty.port=8081
IP indicates the server address.
Users are configured in ./conf/azkaban-users.xml. The default username and password of the admin role are both azkaban.
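For reference, a minimal azkaban-users.xml matching those defaults might look like the sketch below (based on the stock configuration; change the password and roles before real use):

```xml
<azkaban-users>
  <!-- default admin account shipped with solo-server -->
  <user username="azkaban" password="azkaban" roles="admin"/>
  <role name="admin" permissions="ADMIN"/>
</azkaban-users>
```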
For detailed configuration options, see: azkaban.github.io/azkaban/doc…
2. Network connectivity between Azkaban and warehouse cluster
Communication between Azkaban and Snova currently relies on two facts: 1. The Azkaban Executor servers can access the Internet or the Snova server IP. 2. Snova provides an external IP for access. The following figure is a schematic diagram of the network connectivity:
When Azkaban Executor executes a job, its scripts or commands access Snova through public IP addresses.
The following steps explain how to use azkaban-based workflows.
3. Preparatory Work
3.1 Creating an External IP address for the Snova Cluster
On the basic configuration page of the Snova cluster console, click Apply for an External IP address. After the request succeeds, the external IP address for accessing the cluster is displayed.
3.2 Adding the Snova Access Address whitelist
On the Snova console, create a whitelist on the cluster Details page and configuration page, as shown in the following figure.
Why is this access whitelist needed?
For system security, Snova by default denies database access to addresses or users that are not on the whitelist.
Set the whitelist CIDR to xx.xx.xx.xx/xx so that it covers the IP addresses or network segments of all Azkaban Executors.
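Before saving the whitelist, it can help to double-check that each Executor's IP actually falls inside the CIDR you plan to enter. A small sketch of such a check (all addresses here are hypothetical examples):

```shell
#!/bin/bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# in_cidr <ip> <cidr> -- succeed if <ip> lies inside <cidr>.
in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Example: is a hypothetical executor at 10.0.1.23 covered by 10.0.0.0/16?
in_cidr 10.0.1.23 10.0.0.0/16 && echo "covered" || echo "not covered"
```
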
3.3 User Authorization
Following section 3.2, it is recommended to create a separate user for SCF task scheduling and computation, so you need to grant this user permission to access the relevant databases and tables.
Create a user
CREATE USER scf_visit WITH LOGIN PASSWORD 'scf_passwd';
This creates the user and sets its access password.
Database table Authorization
GRANT ALL on t1 to scf_visit;
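If scf_visit needs to reach more than a single table, access can also be granted at the database and schema level. A sketch, assuming the default public schema and the postgres database (adjust names to your cluster):

```sql
-- Hypothetical broader grants for the scheduling user
GRANT CONNECT ON DATABASE postgres TO scf_visit;
GRANT USAGE ON SCHEMA public TO scf_visit;
GRANT SELECT, INSERT ON t1 TO scf_visit;
```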
4. Scheduled Tasks
http://node1:8081/index
Log in to Azkaban and proceed step by step: Create Project => Upload the zip package generated in the previous step => Execute Flow.
Specific steps can be found in the reference documentation:
www.cnblogs.com/qingyunzong…
4.1 Creating a Project
4.2 Creating a Job
job1
Each job is defined in a file whose name must end with .job (for example, job1.job). The content is as follows:
type=command
command=echo "job1"
retries=5
Note: for job types and their usage, see azkaban.github.io/azkaban/doc…
job2
type=command
dependencies=job1
retries=5
command=echo "job2 xx"
command.1=ls -al
Note: dependencies lists the names of the jobs this job depends on (without the .job suffix). Separate multiple dependencies with commas, for example dependencies=job2,job5.
job3
type=command
dependencies=job2,job5
command=sleep 60
job5
type=command
dependencies=job1
command=pwd
job6
type=command
dependencies=job3
command=sh /data/shell/admin.sh psqlx
/data/shell/admin.sh can encapsulate user code in functions. The script content is as follows; it reads the data in the table and prints it (the trailing $1, added here so that `admin.sh psqlx` works, invokes the function named by the first argument):
function psqlx() {
    result=$(PGPASSWORD=scf_passwd psql -h xx.xx.xx.xx -p xx -U scf_visit -d postgres <<EOF
select * from t1;
EOF
)
    echo $result
}
$1
4.3 Uploading the Job Package
Compress all job files into a zip package. Note: all files must sit at the root of the archive, with no subdirectories, as follows:
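The packaging step can be sketched with the standard zip tool; the -j ("junk paths") flag strips directory prefixes so every .job file lands at the archive root, which is what Azkaban expects. The directory and file contents below are illustrative only:

```shell
#!/bin/bash
# Build a throwaway working directory with two sample job files.
workdir=$(mktemp -d)

cat > "$workdir/job1.job" <<'EOF'
type=command
command=echo "job1"
EOF

cat > "$workdir/job2.job" <<'EOF'
type=command
dependencies=job1
command=echo "job2 xx"
EOF

# -j strips paths: entries appear as bare file names at the zip root.
zip -j "$workdir/flow.zip" "$workdir"/*.job
unzip -l "$workdir/flow.zip"   # every entry should be a bare .job name
```
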
4.4 Run
Query the execution process and results.
4.5 Configuring Periodic Scheduling
After debugging, you can set a periodic schedule, for example running the workflow once a day.
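Recent Azkaban releases accept a Quartz-style cron expression in the schedule dialog; for example, a daily 02:00 run could be expressed as follows (verify against your version's documentation before relying on it):

```
0 0 2 ? * *    # sec min hour day-of-month month day-of-week
```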
5. Summary of practice
This section compares in detail the two most popular schedulers on the market: Azkaban and the well-known Apache Oozie.
5.1 Comparison
Comparing functionality
Both can schedule workflows composed of Linux commands, MapReduce, Spark, Pig, Hive, Java programs, and script tasks
Both can perform workflow tasks on a regular basis
Comparing workflow definition
Azkaban uses the Properties file to define the workflow
Oozie uses XML files to define workflow
Comparing parameter passing
Azkaban supports direct parameter substitution, such as ${input}
Oozie supports parameters and EL expressions, such as ${fs:dirSize(myInputDir)}.
Comparing scheduled execution
1. Azkaban's scheduled tasks are based on time only
2. Oozie executes scheduled tasks based on time and input data
Comparing permission management
1. Azkaban has strict permission control, such as users’ read/write/execute operations on workflow
2. Oozie does not have strict permission control
5.2 Application Scenarios
Data analysis can be summarized into three steps: 1. Data import. 2. Data calculation. 3. Data export.
The three kinds of tasks may run concurrently and depend on one another, so Azkaban can meet the scheduling, management, and operational requirements of such scenarios.
First create job1 for data import, for example from COS; its task content executes the following SQL command:
insert into gp_table select * from cos_table;
Data can also be imported into the Snova warehouse periodically with other tools, such as DataX: simply deploy DataX to the corresponding directory on the Azkaban Executor machines and invoke it from a job.
Next, create job2 for data calculation and analysis. This step can consist of multiple jobs, run serially or concurrently.
Finally, export the results to the application database:
insert into cos_table select * from gp_table;
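The import → compute → export chain above can be wired together as three job files, one per file. A sketch, where the admin.sh function names are hypothetical placeholders:

```properties
# import.job -- load data (e.g. from COS) into the warehouse
type=command
command=sh /data/shell/admin.sh import_cos

# compute.job -- aggregate once the import finishes
type=command
dependencies=import
command=sh /data/shell/admin.sh compute

# export.job -- push results to the application database
type=command
dependencies=compute
command=sh /data/shell/admin.sh export_result
```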
5.3 Shortcomings
1. When a job in a flow fails, Azkaban supports re-running the flow: the new execution instance starts from the failed job, and jobs that already ran successfully are skipped.
2. When a job starts a complex program via a shell command, a successful shell return code does not necessarily mean the program itself ran successfully.
3. Fault tolerance in job management is limited: if the Web Server restarts or the Executor process dies while a task is running, the task's status is marked as failed.
This article has been published by Tencent Cloud + community in various channels