This is the 17th day of my participation in the Gwen Challenge
Yarn
Introduction to Yarn
Yarn is the resource management system of a Hadoop cluster. Hadoop 2.0 completely redesigned and rebuilt the MapReduce framework; the MapReduce in Hadoop 2.0 is called MRv2 or Yarn.
Another goal of Yarn is to extend Hadoop so that it supports not only MapReduce computation but also makes it easy to manage applications such as Hive, HBase, Pig, and Spark/Shark. The new architecture allows various types of applications to run on Hadoop and be centrally managed at the system level through Yarn. In other words, with Yarn, different applications can run in the same Hadoop cluster without interfering with each other while sharing the cluster's resources.
Background of YARN
Yarn evolved from MRv1 and overcomes some of its limitations. Let's start with MRv1's limitations:
- Poor scalability: in MRv1, the JobTracker handles both resource management and job control. This made it the biggest bottleneck in the system and severely limited the scalability of a Hadoop cluster.
- Poor reliability: MRv1 uses a Master/Slave architecture, and the Master is a single point of failure. Once it fails, the entire cluster becomes unavailable.
- Low resource utilization: MRv1 allocates resources in coarse-grained units called task slots. Typically a task does not use all of the resources of its slot, yet other tasks cannot use the idle remainder. In addition, slots are divided into Map slots and Reduce slots, which cannot be shared: one kind may be exhausted while the other sits idle.
- No support for multiple computing frameworks: MapReduce is a disk-based offline computing framework, and as data volumes and speed requirements grew, it could no longer meet every application's needs. New memory-based and streaming computing frameworks emerged, but MRv1 cannot host other frameworks.
To address these shortcomings, MRv2 abstracts the resource management function into an independent, general-purpose system: YARN. In a MapReduce-centric software stack, the resource management layer is pluggable and replaceable (for example, Mesos could stand in for YARN), but once the MapReduce interface changes, every resource management implementation has to change with it. A YARN-based software stack works the other way around: every framework implements the external interfaces defined by YARN in order to run on YARN, which creates an ecosystem built around YARN.
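As a rough illustration of what "implementing YARN's external interfaces" looks like from the client side, here is a minimal, simplified sketch of submitting an application through YARN's Java client API. The application name, launch command, and resource sizes are placeholder assumptions, not a definitive implementation:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        // Connect to the ResourceManager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");

        // Describe how to launch the ApplicationMaster container.
        // The launch command here is a hypothetical placeholder.
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.emptyMap(),                    // local resources
                Collections.emptyMap(),                    // environment
                Collections.singletonList("./run_am.sh"),  // AM command (hypothetical)
                null, null, null);
        ctx.setAMContainerSpec(amContainer);

        // Resources requested for the AM container: 1 GB memory, 1 vcore.
        ctx.setResource(Resource.newInstance(1024, 1));

        yarnClient.submitApplication(ctx);
    }
}
```

Note that nothing here is MapReduce-specific: any framework can go through the same interface, which is exactly the point of the YARN-based stack.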
MapReduce processing flow and problems in earlier versions of Hadoop
- The JobTracker is the heart of the MapReduce framework. It communicates regularly with the machines in the cluster via heartbeats, decides which programs should run on which machines, and manages all job failures and restarts.
- A TaskTracker runs on every machine in a MapReduce cluster and monitors the resources of its own machine.
- The TaskTracker also monitors the health of the tasks on its machine. It sends this information to the JobTracker via heartbeats, and the JobTracker collects it to decide which machines a newly submitted job should run on (a simplified sketch of such a heartbeat follows this list).
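The real MRv1 heartbeat travels over Hadoop's internal InterTrackerProtocol RPC interface and carries much more state; the following hypothetical, simplified Java sketch only illustrates the kind of information a TaskTracker reports and what it gets back:

```java
// Hypothetical, simplified model of an MRv1 heartbeat exchange;
// the real implementation lives in org.apache.hadoop.mapred
// (InterTrackerProtocol) and carries far more state.
public class HeartbeatSketch {

    static class TaskTrackerStatus {
        String host;
        int runningMapTasks;
        int runningReduceTasks;
        int maxMapSlots;       // slots are fixed per TaskTracker in MRv1
        int maxReduceSlots;
        long availableMemory;  // reported, but not used for scheduling in MRv1
    }

    interface JobTrackerStub {
        // The JobTracker replies with actions: launch a task, kill a task, etc.
        String[] heartbeat(TaskTrackerStatus status);
    }

    static void heartbeatLoop(JobTrackerStub jobTracker, TaskTrackerStatus status)
            throws InterruptedException {
        while (true) {
            // Report slot usage and task health; receive new work in return.
            String[] actions = jobTracker.heartbeat(status);
            for (String action : actions) {
                System.out.println("JobTracker says: " + action);
            }
            Thread.sleep(3000); // heartbeat interval (illustrative value)
        }
    }
}
```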
As cluster sizes and workloads grew, the problems with the original framework surfaced:
- The JobTracker is the centralized processing point of MapReduce and a single point of failure.
- The JobTracker takes on too many tasks. When there are too many MapReduce jobs, the memory overhead becomes large and the risk of JobTracker failure rises. This is behind the common industry conclusion that old Hadoop's MapReduce can only scale to about 4,000 hosts.
- On the TaskTracker side, using the number of Map/Reduce tasks to represent resources is too simplistic, because it ignores CPU and memory usage. If two memory-hungry tasks are scheduled onto the same machine, OOM errors can easily occur.
- On the TaskTracker side, resources are rigidly divided into Map task slots and Reduce task slots. If the system has only Map tasks or only Reduce tasks at a given moment, half the resources are wasted; this is the cluster resource utilization problem mentioned above (the snippet below shows how these fixed slot counts were configured).
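For reference, the slot counts in MRv1 were static, per-TaskTracker settings (normally set in mapred-site.xml). A small sketch of the same properties set programmatically, with arbitrary example values:

```java
import org.apache.hadoop.conf.Configuration;

public class SlotConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // In MRv1 each TaskTracker advertises a fixed number of map and
        // reduce slots; example values:
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        // A map task can never use an idle reduce slot and vice versa,
        // which is the resource-utilization problem described above.
    }
}
```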
Principles and operating mechanism of the new Hadoop YARN framework
The basic idea is to split the JobTracker's two main functions, resource management and task scheduling/monitoring, into separate components. A new global ResourceManager manages the allocation of computing resources for all applications, while a per-application ApplicationMaster handles the corresponding scheduling and coordination.
An application is either a single traditional MapReduce job or a DAG (directed acyclic graph) of jobs. The ResourceManager, together with the node management server (NodeManager) on each machine, manages the user processes on that machine and organizes the computation.
Each application's ApplicationMaster is responsible for requesting appropriate resource containers from the scheduler, running tasks, tracking the application's state, monitoring progress, and handling task failures.
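To make those responsibilities concrete, here is a minimal sketch of an ApplicationMaster registering with the ResourceManager and requesting one container through YARN's AMRMClient API. The resource size, priority, and status message are placeholder assumptions:

```java
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Register this ApplicationMaster with the ResourceManager.
        rmClient.registerApplicationMaster("", 0, "");

        // Ask the scheduler for one container: 512 MB, 1 vcore,
        // anywhere in the cluster (node and rack constraints left null).
        Resource capability = Resource.newInstance(512, 1);
        Priority priority = Priority.newInstance(0);
        rmClient.addContainerRequest(
                new ContainerRequest(capability, null, null, priority));

        // ... in a real AM, allocate() is called in a loop to receive
        // containers, launch tasks in them, and track their progress ...

        rmClient.unregisterApplicationMaster(
                FinalApplicationStatus.SUCCEEDED, "done", null);
    }
}
```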
Comparison of old and new Hadoop MapReduce frameworks
- The client remains the same, and most of its API and interfaces stay compatible, to keep the change transparent to users; no major changes to existing code are required. However, the JobTracker and TaskTracker of the old framework are gone, replaced by three components: ResourceManager, ApplicationMaster, and NodeManager.
- ResourceManager is a central service. It schedules and starts the ApplicationMaster that each Job belongs to, and monitors the ApplicationMaster's existence. Job task monitoring and restarting are no longer its responsibility; that is why the ApplicationMaster exists. The ResourceManager schedules jobs and resources: it receives jobs submitted by the JobSubmitter, then starts the scheduling process based on the job's context and the status information collected from NodeManagers, allocating a Container for the ApplicationMaster to run in.
- NodeManager maintains Container state and sends heartbeat messages to the ResourceManager.
- ApplicationMaster is responsible for all work within a Job's life cycle, similar to the JobTracker in the old framework, but each Job (not each type of job) has its own ApplicationMaster, which can run on a machine other than the ResourceManager (a sketch of launching an allocated Container through the NodeManager follows this list).
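Once the ResourceManager has allocated a Container, the ApplicationMaster contacts the NodeManager on the target machine to actually start it. A minimal sketch using YARN's NMClient API, with a hypothetical launch command:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LaunchContainer {
    // 'container' would come from an AMRMClient.allocate() response.
    static void launch(Container container) throws Exception {
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(new YarnConfiguration());
        nmClient.start();

        // Describe the process to run inside the allocated container.
        ContainerLaunchContext launchCtx = ContainerLaunchContext.newInstance(
                Collections.emptyMap(),                     // local resources
                Collections.emptyMap(),                     // environment
                Collections.singletonList("./run_task.sh"), // hypothetical command
                null, null, null);

        // The NodeManager on the container's host starts the process and
        // reports Container status back to the ResourceManager via heartbeats.
        nmClient.startContainer(container, launchCtx);
    }
}
```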
Advantages of Hadoop YARN
- Greatly reduces the resource consumption of the JobTracker (now the ResourceManager), and distributes the programs that monitor the status of each Job's subtasks across the cluster.
- Resources are represented as memory rather than as a number of remaining slots (see the snippet after this list).
- In the old framework, one of the JobTracker's big burdens was monitoring the health of the tasks under each Job. This is now handled by the ApplicationMaster. The ResourceManager contains a module called ApplicationsManager that monitors the ApplicationMaster's health and restarts it on another machine if anything goes wrong.
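As a small illustration of the memory-based resource model mentioned above, here is a sketch using YARN's Resource record. The quantities are arbitrary examples; vcores were added alongside memory in later YARN releases:

```java
import org.apache.hadoop.yarn.api.records.Resource;

public class ResourceModel {
    public static void main(String[] args) {
        // YARN describes capacity as concrete quantities instead of slots.
        Resource taskNeeds = Resource.newInstance(2048, 2); // 2 GB, 2 vcores

        // Any application can request any mix of such resources, so a
        // machine is never stuck with idle "reduce slots" while map
        // work is waiting, as happened in MRv1.
        System.out.println(taskNeeds); // e.g. <memory:2048, vCores:2>
    }
}
```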