1. Overview
Azkaban schedules tasks via configuration files uploaded through the Web UI. It has two important concepts:
- Job: The scheduling task you need to perform.
- Flow: A set of jobs, together with the dependencies between them, makes up a Flow.
Azkaban 3.x currently supports both Flow 1.0 and Flow 2.0. This article will focus on Flow 1.0 and the next article will cover Flow 2.0.
2. Basic task scheduling
2.1 Creating a Project
You can create the corresponding project on the Azkaban main screen:
2.2 Task Configuration
Create a task configuration file hello-azkaban.job with the following content. The task here simply prints 'Hello Azkaban!':
#command.job
type=command
command=echo 'Hello Azkaban!'
2.3 Uploading a Package
Package hello-azkaban.job as a zip file:
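If you are packaging from the command line, here is a minimal sketch (assuming the zip utility is installed and the job file is in the current directory):

# compress the job file into a zip archive for upload
zip hello-azkaban.zip hello-azkaban.job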
Upload through the Web UI:
After the upload succeeds, the Flow is displayed:
2.4 Executing Tasks
Click Execute Flow on the page to execute the task:
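Besides the Web UI, Azkaban also exposes an AJAX API for triggering executions. A minimal sketch using curl; the host, port, credentials, and project name below are all placeholders, not values from this article:

# log in and obtain a session id from the JSON response (placeholder credentials)
curl -k -X POST --data "action=login&username=azkaban&password=azkaban" http://localhost:8081
# trigger the hello-azkaban flow, substituting the session.id returned above
curl -k "http://localhost:8081/executor?ajax=executeFlow&session.id=<session.id>&project=Azkaban-Demo&flow=hello-azkaban"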
2.5 Execution Result
Click detail to view the task execution log:
3. Multi-task scheduling
3.1 Dependency Configuration
Here we assume that we have five tasks (Task-A through Task-E). Task-D needs to be executed after Task-A, Task-B, and Task-C have completed, and Task-E needs to be executed after Task-D has completed. In this case, we need to define the dependencies property. The configuration of each task is as follows:
Task-A.job :
type=command
command=echo 'Task A'
Task-B.job :
type=command
command=echo 'Task B'
Task-C.job :
type=command
command=echo 'Task C'
Task-D.job :
type=command
command=echo 'Task D'
dependencies=Task-A,Task-B,Task-C
Task-E.job :
type=command
command=echo 'Task E'
dependencies=Task-D
3.2 Compressed Upload
Note that a Project can hold only one compressed package at a time. Here I continue to use the Project created above; by default, a newly uploaded package overwrites the previous one:
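A minimal packaging sketch from the command line, using the five job files defined above (the archive name tasks.zip is arbitrary):

# bundle all five job definitions into a single archive
zip tasks.zip Task-A.job Task-B.job Task-C.job Task-D.job Task-E.job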
3.3 Dependencies
When multiple tasks have dependencies, the file name of the final task is used as the Flow name by default. The dependency relationships are shown below:
3.4 Execution Result
As this case shows, Flow 1.0 cannot define multiple tasks in a single job file; Flow 2.0 solves this problem.
4. Schedule HDFS jobs
The procedure is the same as above. This section takes listing files on HDFS as an example. It is recommended to use the full path to the hadoop binary. The configuration file is as follows:
type=command
command=/usr/app/hadoop-2.6.0-cdh5.15.2/bin/hadoop fs -ls /
Execution Result:
5. Schedule MR jobs
MR Job configuration:
type=command
command=/usr/app/hadoop-2.6.0-cdh5.15.2/bin/hadoop jar /usr/app/hadoop-2.6.0-cdh5.15.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.15.2.jar pi 3 3
Execution Result:
6. Schedule Hive jobs
Job configuration:
type=command
command=/usr/app/hive-1.1.0-cdh5.15.2/bin/hive -f 'test.sql'
The test.sql script creates an employee table and views its structure:
CREATE DATABASE IF NOT EXISTS hive;
use hive;
drop table if exists emp;
CREATE TABLE emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- view the structure of the emp table
desc emp;
Package the job file and the SQL file together:
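A minimal sketch, where hive-task.job is a hypothetical name for the job file above:

# the archive must contain both the job definition and the SQL script it references
zip hive-task.zip hive-task.job test.sql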
The result is as follows:
7. Modify job configuration online
During testing, we might need to change the configuration frequently, which would be cumbersome if we had to repackage and upload each change. Therefore, Azkaban supports online configuration modification. Click the Flow that needs to be modified to enter the details page:
Click the Edit button on the details page to enter the edit page:
On the edit page, you can add or modify configurations:
PS: Possible problems
If the following exception occurs, it is most likely due to insufficient memory on the executor host. Azkaban requires the executor host to have more than 3 GB of free memory before it will run tasks:
Cannot request memory (Xms 0 kb, Xmx 0 kb) from system for job
If you cannot add more memory to the executor host, you can disable the memory check by modifying the commonprivate.properties file in the plugins/jobtypes/ directory as follows:
memCheck.enabled=false
For more articles in the big data series, see the GitHub open source project: Getting Started with Big Data.