Hadoop YARN Introduction

Apache YARN (Yet Another Resource Negotiator) is the cluster resource management system introduced in Hadoop 2.0. Users can deploy various service frameworks on YARN for unified management and resource allocation.

2. YARN architecture

1. ResourceManager

ResourceManager usually runs as a daemon on a dedicated machine and acts as the central coordinator and manager of cluster resources, allocating resources to all applications submitted by users. It makes decisions based on application priority, queue capacity, ACLs, and data locality, and then schedules cluster resources in a shared, secure, and multi-tenant manner.
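
As a rough illustration of the queue information these scheduling decisions draw on, the sketch below uses Hadoop's YarnClient API to ask the ResourceManager for a queue's configured and current capacity. This is an illustrative sketch, not part of the original article; the queue name "default" is an assumption.

```java
import org.apache.hadoop.yarn.api.records.QueueInfo;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class QueueInfoExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager about the "default" queue (an assumed queue name).
        QueueInfo queue = yarnClient.getQueueInfo("default");
        System.out.println("Queue: " + queue.getQueueName()
                + ", configured capacity: " + queue.getCapacity()
                + ", current capacity: " + queue.getCurrentCapacity());

        yarnClient.stop();
    }
}
```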

2. NodeManager

NodeManager is the manager of a specific node in a YARN cluster. It is responsible for managing the life cycle of all containers on that node, monitoring resource usage, and tracking node health. Its main responsibilities are as follows:

  • On startup, registers with the ResourceManager, sends heartbeat messages to it periodically, and waits for its instructions;
  • Maintains the life cycle of Containers and monitors Container resource utilization;
  • Manages the dependencies needed at task runtime; when starting a Container as requested by the ApplicationMaster, copies the required programs and their dependencies to the local node.
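
As a small illustration of the node status that NodeManagers report to the ResourceManager through their heartbeats, the following sketch uses Hadoop's YarnClient API to list the running nodes together with their capacity, current usage, and health report; it is an assumption-laden sketch rather than part of the article's example.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeReportExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Each report reflects the state a NodeManager has reported to the ResourceManager.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " capability=" + node.getCapability()
                    + " used=" + node.getUsed()
                    + " health=" + node.getHealthReport());
        }

        yarnClient.stop();
    }
}
```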

3. ApplicationMaster

When a user submits an application, YARN starts a lightweight process called the ApplicationMaster. The ApplicationMaster coordinates resources from the ResourceManager, works with the NodeManagers to monitor resource usage within containers, and tracks tasks and handles their fault tolerance. Details are as follows:

  • Dynamically computes resource requirements according to the running status of the application;
  • Requests resources from the ResourceManager and monitors the usage of the requested resources (a sketch of this follows the list below);
  • Tracks task status and progress, and reports resource usage and application progress;
  • Handles fault tolerance for tasks.
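
The sketch below illustrates the resource-request side of an ApplicationMaster using Hadoop's AMRMClient API: it registers with the ResourceManager, asks for a container, and reads the allocation back from the heartbeat response. This is a simplified sketch under assumptions, not the article's own code: the memory/vcore values are illustrative, and a real ApplicationMaster would call allocate() repeatedly in a heartbeat loop rather than once.

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmResourceRequestSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Register this ApplicationMaster with the ResourceManager.
        rmClient.registerApplicationMaster("", 0, "");

        // Ask for one container with 1024 MB of memory and 1 vcore (illustrative values).
        Resource capability = Resource.newInstance(1024, 1);
        Priority priority = Priority.newInstance(0);
        rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

        // Heartbeat to the ResourceManager; allocated containers come back in the response.
        AllocateResponse response = rmClient.allocate(0.1f);
        for (Container container : response.getAllocatedContainers()) {
            System.out.println("Allocated: " + container.getId()
                    + " on node " + container.getNodeId());
        }

        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    }
}
```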

4. Container

Container is the resource abstraction in YARN. It encapsulates multi-dimensional resources on a node, such as memory, CPU, disk, and network. When an ApplicationMaster requests resources from the ResourceManager, the resources returned by the ResourceManager are represented as Containers. YARN assigns each task a Container, and the task can use only the resources described in that Container. An ApplicationMaster can run any type of task inside a Container. For example, the MapReduce ApplicationMaster requests a container to start a Map or Reduce task, while the Giraph ApplicationMaster requests a container to run a Giraph task.
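
Continuing the sketch above: once the ResourceManager has handed back a Container, the ApplicationMaster asks the NodeManager that owns it to launch a process inside it via the NMClient API. The sketch below is illustrative only; the shell command is an assumption, and a MapReduce ApplicationMaster would instead launch a YarnChild JVM for a map or reduce task.

```java
import java.util.Collections;
import java.util.HashMap;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerLaunchSketch {
    // 'container' is assumed to be one of the Containers returned by AMRMClient.allocate().
    static void launch(Container container) throws Exception {
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(new YarnConfiguration());
        nmClient.start();

        // Describe what should run inside the Container: local resources, environment
        // variables, and the command line. Only a trivial command is set here.
        ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
                new HashMap<String, LocalResource>(),   // local resources (jars, files) to download
                new HashMap<String, String>(),          // environment variables
                Collections.singletonList("echo hello-from-yarn-container"),
                null, null, null);

        // The NodeManager that owns the Container starts the process and enforces
        // the memory/CPU limits described by the Container.
        nmClient.startContainer(container, ctx);
    }
}
```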

3. Description of YARN working principles

  1. The Client submits the job to YARN.

  2. The ResourceManager selects a NodeManager, starts a Container on it, and runs an ApplicationMaster instance inside that Container.

  3. The ApplicationMaster requests more Container resources from the ResourceManager as needed (if the job is small, the ApplicationMaster may choose to run the tasks in its own JVM).

  4. The ApplicationMaster uses the obtained Container resources to perform the distributed computation.

4. Detailed working principles of YARN

1. Job submission

The client calls the job.waitForCompletion method to submit a MapReduce job to the cluster (step 1). The ResourceManager assigns a new job ID (application ID) to the job (step 2). The job client verifies the output specification of the job, computes the input splits, and copies the job's resources (including the JAR package, configuration files, and split information) to HDFS (step 3). Finally, it submits the job by calling the ResourceManager's submitApplication() method (step 4).
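
A minimal driver sketch for step 1 is shown below, assuming a simple word-count job; the mapper, reducer, and input/output paths are placeholders I introduce for illustration, but the submission call is the job.waitForCompletion() method mentioned above.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);       // the JAR that is copied to HDFS (step 3)
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2);                       // equivalent to mapreduce.job.reduces=2

        FileInputFormat.addInputPath(job, new Path(args[0]));   // splits are computed from this input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path is verified before submit

        // Step 1: submit the application to the ResourceManager and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```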

2. Initialize the job

When the ResourceManager receives the submitApplication() request, it hands the request to the scheduler, which allocates a Container. The ResourceManager then starts the ApplicationMaster process inside that Container, where it is monitored by the NodeManager (step 5).

The ApplicationMaster for a MapReduce job is a Java application whose main class is MRAppMaster. It initializes the job by creating a number of bookkeeping objects with which it will track the job's progress, as it receives progress and completion reports from the tasks (step 6). It then retrieves the input splits computed by the client from the distributed file system (step 7), creates a map task for each input split, and creates a number of reduce task objects determined by mapreduce.job.reduces.

3. Assign tasks

If the job is small, the ApplicationMaster chooses to run the tasks in its own JVM.
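
Whether a job counts as "small" (often called an uber job, one that runs entirely inside the ApplicationMaster's JVM) is governed by the mapreduce.job.ubertask.* properties. The snippet below is a configuration sketch; the threshold values shown are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;

public class UberJobConfig {
    public static Configuration smallJobConf() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true); // allow in-ApplicationMaster execution
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // at most this many map tasks
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);    // at most this many reduce tasks
        return conf;
    }
}
```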

If the job does not qualify as a small job, the ApplicationMaster requests Containers from the ResourceManager for all of the map and reduce tasks (step 8). These requests are piggybacked on heartbeat messages and include data-locality information for each map task, such as the hosts and racks on which its input split is stored. The scheduler uses this information to make scheduling decisions, placing tasks, as far as possible, on the nodes where the data is stored, or otherwise on nodes in the same rack as the input split.

4. Task running

After a task is assigned a Container by the ResourceManager's scheduler, the ApplicationMaster starts the Container by contacting the NodeManager (step 9). The task is run by a Java application whose main class is YarnChild. Before the task can run, the resources it needs, such as the job configuration, JAR files, and any files in the distributed cache, are localized (step 10). Finally, the map or reduce task is run (step 11).

YarnChild runs in a dedicated JVM, but YARN does not support JVM reuse.

5. Progress and status updates

In YARN, a task reports its progress and status (including counters) to the ApplicationMaster. The client polls the ApplicationMaster for progress updates every second (configured via mapreduce.client.progressmonitor.pollinterval) and displays them to the user.
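
A rough client-side sketch of this polling loop, assuming an already submitted org.apache.hadoop.mapreduce.Job, might look like the following; the one-second sleep mirrors the default value of mapreduce.client.progressmonitor.pollinterval.

```java
import org.apache.hadoop.mapreduce.Job;

public class ProgressMonitorSketch {
    // 'job' is assumed to be a MapReduce Job that has already been submitted.
    static void printProgress(Job job) throws Exception {
        while (!job.isComplete()) {
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(1000); // roughly mapreduce.client.progressmonitor.pollinterval (1s by default)
        }
        System.out.println("Job " + (job.isSuccessful() ? "succeeded" : "failed"));
        System.out.println(job.getCounters());
    }
}
```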

6. Job completion

In addition to polling the ApplicationMaster for job progress, the client checks whether the job has completed by calling waitForCompletion() every five seconds; the interval can be set via mapreduce.client.completion.pollinterval. After the job completes, the ApplicationMaster and the task Containers clean up their working state, and the OutputCommitter's job cleanup method is called. Job information is stored by the job history server for later inspection by users.

5. Submit the job to YARN

In this example, we submit the MapReduce program from Hadoop Examples that calculates Pi. The JAR package is located in the share/hadoop/mapreduce directory under the Hadoop installation directory:

# Submission format: hadoop jar <jar path> <main class name> <main class arguments>
hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.15.2.jar pi 3 3
