In Hadoop 1, the MapReduce framework is responsible both for scheduling cluster resources and for running MapReduce programs. Because resource scheduling and computation are tightly coupled in this architecture, a Hadoop cluster can run only MapReduce jobs and nothing else, which drives up maintenance costs.
In Hadoop 2, the architecture is split into Yarn + MapReduce: resource scheduling is handed to Yarn, and MapReduce is responsible only for computation. A Hadoop cluster can then run not only MapReduce jobs but also any computing framework that supports Yarn resource scheduling, such as Spark and Storm.
The Yarn architecture is as follows:
1. Yarn composition
The figure shows that Yarn consists of two parts:
1. ResourceManager. Responsible for resource management and allocation across the entire cluster.
2. NodeManager. It usually runs on the same servers as the HDFS DataNode process and is responsible for resource and task management on its own server (the sketch after this list shows how ResourceManager tracks each NodeManager's resources).
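To make the ResourceManager/NodeManager split concrete, here is a minimal sketch (not from the original article) that uses Hadoop's Java client API to ask the ResourceManager for the NodeManagers it knows about and the resources each one offers; it assumes a Yarn configuration on the classpath pointing at a running cluster.

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodeManagersSketch {
    public static void main(String[] args) throws Exception {
        // YarnClient talks to the ResourceManager, which tracks every NodeManager in the cluster.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Each NodeReport describes one NodeManager: its node, total resources, and resources in use.
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId()
                    + " capability=" + node.getCapability()
                    + " used=" + node.getUsed());
        }

        yarnClient.stop();
    }
}
```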
ResourceManager has two important components:
1. Scheduler. It is essentially a resource-scheduling algorithm: it allocates resources based on the resource requests submitted by applications and the resources currently available in the cluster. Yarn allocates resources in units of Containers, and each Container carries a certain amount of computing resources, such as memory and CPU. Containers are allocated by the Scheduler and then started and managed by NodeManagers; each NodeManager monitors the running status of the Containers on its node and reports to ResourceManager.
2. ApplicationsManager. It is mainly responsible for accepting application submissions, monitoring application status, and so on. When an application starts, an ApplicationMaster is launched for it in the cluster; the ApplicationMaster itself also runs in a Container. The ApplicationMaster then applies to ResourceManager for resources according to the application's requirements (a sketch of such a resource request follows this list), and once the resources are granted, it distributes the application code to Containers on the various nodes for distributed computing.
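The following is a minimal sketch, not part of the original text, of what a Container request looks like using the Resource and ContainerRequest classes from Hadoop's Java client API; the memory and vcore numbers are arbitrary placeholders.

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerRequestSketch {
    public static void main(String[] args) {
        // A Container is defined by the computing resources it carries:
        // here 2048 MB of memory and 2 virtual cores (placeholder values).
        Resource capability = Resource.newInstance(2048, 2);

        // The request an ApplicationMaster hands to the Scheduler (via AMRMClient.addContainerRequest).
        // Passing null for nodes and racks means the Container may be placed anywhere in the cluster.
        ContainerRequest request =
                new ContainerRequest(capability, null, null, Priority.newInstance(0));

        System.out.println("Requesting container with " + request.getCapability());
    }
}
```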
2. Yarn workflow
The following uses a MapReduce program as an example to analyze the working process of Yarn:
1. We submit our application to the Yarn cluster, including the MapReduce ApplicationMaster, the MapReduce program, and the command that starts the MapReduce program (a submission sketch in code follows this list).
2. ResourceManager communicates with the NodeManagers and, based on available cluster resources, allocates the first Container on one NodeManager in the cluster, which starts the Container.
3. ResourceManager distributes the MapReduce ApplicationMaster to this Container and starts it there.
4. Immediately after the MapReduce ApplicationMaster starts, it registers with ResourceManager and applies for the resources MapReduce needs.
5. As soon as the Containers are granted, the MapReduce ApplicationMaster communicates with the corresponding NodeManagers and distributes the MapReduce program to those Containers, where the Map or Reduce tasks then run.
6. The Map and Reduce tasks report their running status to the MapReduce ApplicationMaster. When the job completes, the MapReduce ApplicationMaster deregisters from ResourceManager and releases all Container resources.
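The first three steps can be sketched with Hadoop's YarnClient API. This is a minimal, hedged example rather than the actual MapReduce client code; the application name, ApplicationMaster main class, and resource numbers are placeholders.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitApplicationSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: the client submits the application through YarnClient.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("example-app");

        // The command that will start the ApplicationMaster inside the first Container
        // (the main class here is a placeholder, not the real MapReduce ApplicationMaster).
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "$JAVA_HOME/bin/java com.example.MyApplicationMaster"
                        + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
                        + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
        appContext.setAMContainerSpec(amContainer);

        // Resources for the first Container, where the ApplicationMaster will run (steps 2-3).
        appContext.setResource(Resource.newInstance(1024, 1));

        // ResourceManager picks a NodeManager, allocates the Container, and launches the AM there.
        yarnClient.submitApplication(appContext);
    }
}
```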
As the figure shows, Yarn itself is not coupled to MapReduce in any way. Yarn communicates only with the MapReduce ApplicationMaster, which acts as the bridge between MapReduce and Yarn; the MapReduce ApplicationMaster is implemented by MapReduce against the Yarn interface specification.
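That interface specification boils down to a small set of calls every ApplicationMaster makes: register with the ResourceManager, request Containers, launch work on NodeManagers, and deregister. The skeleton below, a sketch rather than the real MapReduce ApplicationMaster, shows those calls using Hadoop's AMRMClient and NMClient APIs; the task command is a placeholder and the heartbeat loop and error handling are omitted.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class ApplicationMasterSkeleton {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // Register with the ResourceManager (step 4 in the workflow above).
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, "");

        // NMClient launches granted Containers on their NodeManagers (step 5).
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(conf);
        nmClient.start();

        // Ask the Scheduler for one Container (placeholder size of 1024 MB / 1 vcore).
        rmClient.addContainerRequest(new ContainerRequest(
                Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));

        // One allocate() heartbeat: the ResourceManager responds with any Containers it has
        // granted so far; a real ApplicationMaster repeats this in a loop until satisfied.
        AllocateResponse response = rmClient.allocate(0.0f);
        for (Container container : response.getAllocatedContainers()) {
            ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
            ctx.setCommands(Collections.singletonList("echo placeholder-task"));
            nmClient.startContainer(container, ctx);
        }

        // Step 6: deregister when the job is done so Yarn can reclaim all Containers.
        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    }
}
```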