1. Introduction

Azkaban schedules tasks through configuration files uploaded via the Web UI. It has two important concepts:

  • Job: A scheduled task to be executed.
  • Flow: A graph of jobs and the dependencies between them.

Azkaban 3.x currently supports both Flow 1.0 and Flow 2.0. This article will focus on Flow 1.0 and the next article will cover Flow 2.0.

2. Basic Task Scheduling

2.1 Creating a Project

You can create the corresponding project on the Azkaban main screen:

2.2 Task Configuration

Create a task configuration file named hello-azkaban.job with the following content. The task simply prints 'Hello Azkaban!':

# command.job
type=command
command=echo 'Hello Azkaban!'

2.3 Uploading a Package

Package hello-azkaban.job as a zip file:
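For example, from the directory containing the job file (a sketch, assuming the `zip` utility is installed):

```shell
# Create the job file and package it; Azkaban expects the .job
# file at the top level of the zip archive
printf "type=command\ncommand=echo 'Hello Azkaban!'\n" > hello-azkaban.job
zip hello-azkaban.zip hello-azkaban.job
```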

Upload through the Web UI:

Once the upload succeeds, the Flow appears under the Flows tab:

2.4 Executing tasks

Click Execute Flow on the page to execute the task:

2.5 Execution Result

Click detail to view the task execution log:

3. Multi-Task Scheduling

3.1 Dependency Configuration

Here we assume that we have five tasks (Task-A through Task-E). Task-D needs to run after Task-A, Task-B, and Task-C have completed, and Task-E needs to run after Task-D has completed. In this case, we define the dependencies property. The configuration of each task is as follows:

Task-A.job :

type=command
command=echo 'Task A'

Task-B.job :

type=command
command=echo 'Task B'

Task-C.job :

type=command
command=echo 'Task C'

Task-D.job :

type=command
command=echo 'Task D'
dependencies=Task-A,Task-B,Task-C

Task-E.job :

type=command
command=echo 'Task E'
dependencies=Task-D

3.2 Compressed Upload

Note that a Project holds only one archive at a time. I reuse the Project from above; by default, a newly uploaded archive overwrites the previous one:
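The five job files can be generated and packaged with a short script. This is a sketch that assumes the `zip` utility is available; the archive name `multi-task.zip` is arbitrary:

```shell
# Generate Task-A..Task-C (no dependencies)
for t in A B C; do
  printf 'type=command\ncommand=echo "Task %s"\n' "$t" > "Task-$t.job"
done

# Task-D runs after A, B and C; Task-E runs after D
printf 'type=command\ncommand=echo "Task D"\ndependencies=Task-A,Task-B,Task-C\n' > Task-D.job
printf 'type=command\ncommand=echo "Task E"\ndependencies=Task-D\n' > Task-E.job

# Package all five job files into a single archive for upload
zip multi-task.zip Task-A.job Task-B.job Task-C.job Task-D.job Task-E.job
```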

3.3 Dependencies

When multiple tasks have dependencies, the file name of the final task is used as the Flow name by default. The dependency graph looks like this:

3.4 Execution Result

As this example shows, Flow 1.0 cannot define multiple tasks in a single job file; Flow 2.0 solves this problem.

4. Schedule HDFS jobs

The procedure is the same as above. This section lists the files on HDFS as an example. Using the full path to the hadoop binary is recommended. The configuration file is as follows:

type=command
command=/usr/app/hadoop-2.6.0-cdh5.15.2/bin/hadoop fs -ls /

Execution Result:

5. Schedule MR jobs

MR Job configuration:

type=command
command=/usr/app/hadoop-2.6.0-cdh5.15.2/bin/hadoop jar /usr/app/hadoop-2.6.0-cdh5.15.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.15.2.jar pi 3 3

Execution Result:

6. Schedule Hive jobs

Job configuration:

type=command
command=/usr/app/hive-1.1.0-cdh5.15.2/bin/hive -f 'test.sql'

The SQL script creates an employee table and views its structure:

CREATE DATABASE IF NOT EXISTS hive;
use hive;
drop table if exists emp;
CREATE TABLE emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- View the emp table structure
desc emp;

Package the job file and the SQL file together:
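As a sketch (assuming the job file is named hive.job, the script is named test.sql, and the `zip` utility is available), the two files can be packaged like this:

```shell
# hive.job references test.sql by relative path, so both files
# must sit at the top level of the same archive
printf "type=command\ncommand=/usr/app/hive-1.1.0-cdh5.15.2/bin/hive -f 'test.sql'\n" > hive.job

# Placeholder SQL; use the full script from the section above
cat > test.sql <<'EOF'
CREATE DATABASE IF NOT EXISTS hive;
use hive;
desc emp;
EOF

zip hive-task.zip hive.job test.sql
```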

The result is as follows:

7. Modify job configuration online

During testing, we might need to change the configuration frequently, which would be cumbersome if we had to repackage and upload each change. Therefore, Azkaban supports online configuration modification. Click the Flow that needs to be modified to enter the details page:

Click the Edit button on the details page to enter the edit page:

On the edit page, you can add or modify configurations:

PS: Possible Problems

If the following exception occurs, it is most likely due to insufficient memory on the execution host; by default, Azkaban requires more than 3 GB of available memory on the execution host before it will run tasks:

Cannot request memory (Xms 0 kb, Xmx 0 kb) from system for job

If you cannot add memory to the execution host, you can disable the memory check by editing the commonprivate.properties file in the plugins/jobtypes/ directory as follows:

memCheck.enabled=false
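Before disabling the check, you may want to see how much memory is actually available. A Linux-only sketch using `free` (the column positions assume a modern procps `free`):

```shell
# Azkaban's default check needs roughly 3 GB of free memory;
# print the "available" column (in MB) reported by free
avail_mb=$(free -m | awk '/^Mem:/ {print $7}')
echo "Available memory: ${avail_mb} MB"
if [ "${avail_mb}" -lt 3072 ]; then
  echo "Less than 3 GB available - the memory check would fail"
fi
```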

For more articles in the big data series, see the GitHub open-source project Getting Started with Big Data.