< 1 > Hadoop 1.0
The first generation of Hadoop consists of HDFS, a distributed file storage system, and MapReduce, a distributed computing framework. HDFS consists of one NameNode and multiple DataNodes, and MapReduce consists of one JobTracker and multiple TaskTrackers.
< 2 > Hadoop 2.0
The second generation of Hadoop addresses several problems with HDFS and MapReduce in Hadoop 1.0. The problem with HDFS: a single NameNode limits the scalability of HDFS. HDFS Federation was proposed in response; it lets multiple NameNodes each manage different directories, which provides access isolation and horizontal scaling. The problem with MapReduce: it scales poorly and cannot support multiple computing frameworks. A new resource management framework, YARN, was proposed in response. YARN splits the JobTracker's resource management and job control functions into two separate components, ResourceManager and ApplicationMaster. ResourceManager allocates resources for all applications, while each ApplicationMaster manages only one application. Note: YARN is a subproject of Hadoop, alongside MapReduce.
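The division of labor between ResourceManager and ApplicationMaster can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: the class and method names (`request`, `release`, `run`) and the idea of counting "containers" as plain integers are invented here for illustration and are not YARN's actual API.

```python
class ResourceManager:
    """Toy stand-in for the cluster-wide allocator (hypothetical, not YARN's API)."""

    def __init__(self, total_containers):
        self.free = total_containers
        self.allocated = {}  # app_id -> number of containers currently held

    def request(self, app_id, wanted):
        """Grant up to `wanted` containers from the pool shared by all applications."""
        granted = min(wanted, self.free)
        self.free -= granted
        self.allocated[app_id] = self.allocated.get(app_id, 0) + granted
        return granted

    def release(self, app_id):
        """Return every container held by one application to the pool."""
        self.free += self.allocated.pop(app_id, 0)


class ApplicationMaster:
    """Toy per-application controller: it knows about exactly one job."""

    def __init__(self, app_id, tasks, rm):
        self.app_id = app_id
        self.pending = list(tasks)
        self.finished = []
        self.rm = rm

    def run(self):
        """Ask the ResourceManager for containers and run the tasks in waves."""
        while self.pending:
            granted = self.rm.request(self.app_id, len(self.pending))
            if granted == 0:
                raise RuntimeError("no containers available")
            wave, self.pending = self.pending[:granted], self.pending[granted:]
            self.finished.extend(wave)  # pretend each task ran to completion
            self.rm.release(self.app_id)
        return self.finished


if __name__ == "__main__":
    rm = ResourceManager(total_containers=2)
    am = ApplicationMaster("job-1", ["map-1", "map-2", "reduce-1"], rm)
    print(am.run())
```

In this sketch the ResourceManager never looks at tasks, and the ApplicationMaster never looks at other applications, which mirrors the separation described above.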
< 3 > MapReduce 1.0
The first-generation MapReduce distributed computing framework consists of two parts: a programming model and a runtime environment. Its basic programming model abstracts a problem into a Map phase and a Reduce phase. The Map phase parses input data into key/value pairs, while the Reduce phase aggregates the values that share the same key and writes the final result to HDFS. Its runtime environment consists of two types of services: JobTracker, which is responsible for resource management and for controlling all jobs, and TaskTracker, which receives and executes tasks assigned by JobTracker.
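The Map and Reduce phases described above can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not Hadoop code: the function names are chosen here for clarity, and the `shuffle` step stands in for the grouping by key that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: parse one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values that share the same key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop yarn", "hadoop mapreduce"]))
```

In a real cluster the map and reduce functions run as many parallel tasks on different machines, and the final result is written to HDFS rather than returned as a dictionary.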
< 4 > MapReduce 2.0
MapReduce 2.0 and MapReduce 1.0 share the same programming model; the only difference is the runtime environment. MapReduce 2.0 (MRv2) is MapReduce 1.0 (MRv1) reworked to run on the resource management framework YARN. Its runtime no longer consists of JobTracker and TaskTracker; instead, job control is handled by an ApplicationMaster process, and each ApplicationMaster manages only one job. Resource management is left to YARN: MapReduce is responsible for computing, and YARN is responsible for resource control. In short, MRv1 is a standalone offline computing framework, and MRv2 is MRv1 running on YARN.
<5>Hadoop-MapReduce
Hadoop MapReduce is an offline computing framework. It is suited to batch processing and does not offer low latency. Hadoop is an open source implementation of Google's distributed computing framework MapReduce and distributed storage system GFS; accordingly, it consists of the distributed computing framework MapReduce and the Hadoop Distributed File System (HDFS). It offers high fault tolerance, high scalability, and simple programming interfaces, and has been widely adopted by Internet companies.
<6>Hadoop-YARN
YARN is a branch of Hadoop 2.0 and a subproject of Hadoop, alongside MapReduce. It is in essence a unified resource management system on which various computing frameworks can run, including MapReduce, Spark, Storm, and Flink.
Conclusion
Hadoop 1.0: consists of HDFS, a distributed file system, and MapReduce, an offline computing framework. Hadoop 2.0: consists of HDFS with support for NameNode scale-out, the resource management system YARN, and MapReduce, an offline computing framework running on YARN. Hadoop 2.0 is more powerful than Hadoop 1.0, with better scalability and performance and with support for multiple computing frameworks.
From the open source perspective, the arrival of YARN eased the rivalry among computing frameworks to some extent. YARN itself evolved from Hadoop MapReduce.
In the MapReduce era, MapReduce was criticized for being poorly suited to iterative and streaming computing. As a result, distributed computing frameworks such as Spark (for iterative computing) and Storm (for streaming computing) emerged, and the developers of these systems compared them against MapReduce on their websites or in papers. Once YARN came along, the picture became clear: MapReduce is just one class of application running on YARN. Spark and Storm were each developed for different types of applications, so neither is inherently superior to the other, and both can also be deployed on the YARN resource management system. In this way, an ecosystem was born with YARN as the underlying resource management platform and multiple distributed computing frameworks running on top of it. MapReduce, Spark, Storm, and Flink are therefore all applications running on YARN.
Spark is a popular in-memory distributed computing framework. It came as welcome news at a time when MapReduce was widely criticized for its inefficiency.
From the perspective of architecture and application, Spark is a development library that contains only computing logic and no implementation of resource management or scheduling. This lets Spark run flexibly on mainstream resource management systems such as Mesos and YARN, setups known as Spark on Mesos and Spark on YARN. Running Spark on a resource management system brings several benefits: cluster resources are shared with other computing frameworks, and resources are allocated on demand, which improves cluster resource utilization.
Which frameworks run on the YARN resource management system?
Frameworks running on YARN include MapReduce-on-YARN, Spark-on-YARN, Storm-on-YARN, and Tez-on-YARN.
(1) MapReduce-on-YARN: offline computing on YARN;
(2) Spark-on-YARN: in-memory computing on YARN;
(3) Storm-on-YARN: real-time/streaming computing on YARN;
(4) Tez-on-YARN: DAG computing on YARN.