Hi, I’m Alex the dreamer. I’ve actually written quite a bit about big data technology components before, such as:
High energy ahead | Do you really understand the HDFS architecture?
Have you mastered all the key points of MapReduce?
Learning Presto from 0 to 1: this one article is enough
Those posts, however, mostly described theory and lacked the underlying architectural ideas. Big data technology is, at its core, an innovative application of distributed systems to data processing: it pools many computers into a cluster to provide more computing resources and thereby handle greater computing loads. Big data technology was born to solve the problems of storing and computing over massive data. I have recently been reading Li Zhihui's book and Geek Time column, and I hope to distill some practical takeaways from them. In this article, I would like to walk through and analyze the architecture of the three core Hadoop components, HDFS, MapReduce, and Yarn, to help you learn and grow.
Introduction
Big data means gathering all kinds of data together, computing over it, and mining its value. This data includes database records, log data, and purpose-built user behavior data; it includes data generated by the enterprise itself, data purchased from third parties, and all sorts of public Internet data collected with web crawlers…
Faced with such a huge volume of data, how to store it, and how to use large-scale server clusters to process and compute over it effectively, is the core of big data technology.
HDFS: Distributed File Storage Architecture
As we know, the first of the papers in Google's big data "troika" was GFS (Google File System), and Hadoop's first product was HDFS. It is fair to say that distributed file storage is the foundation of distributed computing, which shows how important it is. If we compare big data computing to cooking, then the data is the ingredients and the Hadoop distributed file system HDFS is the pot.
Chefs come and go, ingredients come and go, dishes come and go, but the one constant is the pot. The same is true of big data: in recent years computing frameworks, algorithms, and application scenarios have kept changing in dazzling ways, but HDFS remains the king of big data storage.
Why is HDFS so entrenched? In any big data system, the most valuable and irreplaceable asset is the data, and all big data work revolves around that data. As the earliest big data storage system, HDFS holds these valuable data assets. To gain wide adoption, every new algorithm and framework has to support HDFS in order to access the data stored in it. So the more big data technology develops and the more new technologies appear, the more HDFS gets supported and the more we depend on it. HDFS may not be the best big data storage technology, but it is still the most important one.
In an earlier article, High energy ahead | Do you really understand the HDFS architecture?, we discussed the HDFS architecture, shown below. HDFS integrates thousands of servers into a unified file storage system. The NameNode server acts as the file control block, managing file metadata such as file names, access permissions, and data storage addresses, while the actual file data is stored on DataNode servers.
DataNodes store data in blocks. All block information, such as the block ID and the IP address of the server holding the block, is recorded on the NameNode, while the block data itself is stored on the DataNodes. In theory, the NameNode can allocate blocks on every DataNode to a single file, which means one file can use the disk space of all servers in the cluster.
In addition, HDFS replicates data blocks to prevent file loss caused by disk or server failures: each block is stored on multiple servers, and even across multiple racks. For how HDFS handles file management and fault tolerance, take a look at this article: Dry goods | How does HDFS handle file management and fault tolerance?
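To make the NameNode/DataNode division concrete, here is a minimal sketch of a client reading a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem). The NameNode address and file path are illustrative assumptions; the replication factor is controlled by the cluster's dfs.replication setting (3 by default).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative NameNode address; in practice this usually comes from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

    // The client asks the NameNode for the file's metadata (block IDs and the
    // DataNodes holding each block), then streams the block data directly
    // from the DataNodes.
    try (FileSystem fs = FileSystem.get(conf);
         FSDataInputStream in = fs.open(new Path("/data/example.txt"));
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```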
MapReduce: Big Data Computing Architecture
The core idea of big data computing is that moving computation is cheaper than moving data. Since this differs from the traditional approach, where data is moved to the computation rather than computation to the data, traditional programming models run into many difficulties with big data. Hadoop therefore adopted a programming model called MapReduce for big data computing.
The MapReduce programming model was not invented by Hadoop, or even by Google, but Google and Hadoop applied it to big data computing in an innovative way, and it immediately worked wonders: seemingly complex big data computations such as machine learning, data mining, and SQL processing became simpler and clearer.
For example, the ultimate purpose of storing data in HDFS is to compute on it, producing useful results through data analysis or machine learning. But if a traditional application treated HDFS as an ordinary file system, read the data out, and then computed on it, a job that needs to process hundreds of terabytes at a time might never finish in any reasonable amount of time.
The classic computing framework for big data processing is MapReduce. The core idea of MapReduce is to compute on data in shards. Since the data is already distributed in blocks across a cluster of many servers, why not run the computation on each block, on the server where that block lives?
In fact, MapReduce runs the same computing program on many servers of a distributed cluster, and each process reads and processes the data blocks stored on its own server, so large amounts of data can be computed in parallel. But then each data block is processed independently; what if the blocks need to be combined in the computation? MapReduce divides the computation into two phases. One is the Map phase, in which multiple map processes are started on each server. The other is the Reduce phase, in which MapReduce starts multiple reduce processes and shuffles the <key, value> pairs output by all the map processes. Shuffling means sending pairs with the same key to the same reduce process, so the related data can be combined there.
To illustrate this process more concretely, let's use the classic WordCount example, which counts the frequency of each word across all the data, to understand the Map and Reduce steps.
Assume the original data is split into two blocks, so the MapReduce framework starts two map processes to handle it. The map function splits its input into words and outputs a <key, value> pair for each word, in the form <word, 1>. The MapReduce framework then performs the shuffle, sending pairs with the same key to the same reduce process. The reduce input has the structure <key, list of values>, that is, all values for the same key are merged into a list.
In this example, the value list is a list of many 1s. Reduce sums these 1s and obtains the word frequency for each word.
Example code is as follows:
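(This is a sketch of the classic WordCount against Hadoop's standard org.apache.hadoop.mapreduce API; the combiner line is an optional optimization.)

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: split each input line into words and emit <word, 1>
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: the framework has shuffled all values with the same key here; sum the 1s
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```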
The source code above shows how the map and reduce functions cooperate to process data. But how do these processes get started on a distributed cluster of servers, and how does the data flow through them to complete the calculation?
Let’s take a look at the process using Hadoop 1 as an example.
MapReduce 1 consists mainly of a JobTracker and TaskTrackers. Only one JobTracker is deployed per MapReduce cluster, while a TaskTracker runs on every server in the cluster, alongside the DataNode.
After the MapReduce application starts, its JobClient submits a job to the JobTracker. Based on the input file paths in the job, the JobTracker works out which servers should run map processes and sends task commands to the TaskTrackers on those servers. When a TaskTracker receives a task, it starts a TaskRunner process, which downloads the program for the task, loads the map function via reflection, reads the data block assigned in the task, and performs the map computation. After the map computation completes, the TaskTracker shuffles the map output, and the TaskRunner loads the reduce function to carry out the subsequent computation.
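For illustration, here is a minimal sketch of submitting a job through the old org.apache.hadoop.mapred API used in Hadoop 1. It uses the built-in identity mapper and reducer so the focus stays on the JobClient-to-JobTracker submission path; the input and output paths are assumptions passed on the command line.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitToJobTracker {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitToJobTracker.class);
    conf.setJobName("identity-job");
    // Identity mapper/reducer simply pass records through;
    // with the default TextInputFormat the key is the byte offset, the value is the line.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // JobClient packages the job and submits it to the JobTracker, which then
    // assigns map tasks to TaskTrackers close to the input data blocks.
    RunningJob job = JobClient.runJob(conf);
    System.out.println("Job completed: " + job.isSuccessful());
  }
}
```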
Yarn: Resource Scheduling Framework
During the startup process of MapReduce applications, the most important thing is to distribute MapReduce programs to servers in the big data cluster. In Hadoop 1, this process is mainly accomplished through the communication between TaskTracker and JobTracker.
But what are the downsides to this architectural approach?
In this design, server cluster resource scheduling and management is coupled to MapReduce execution. If you want to run other computing frameworks, such as Spark or Storm, on the same cluster, the cluster's resources cannot be shared and managed in a unified way.
In the early days, Hadoop was the only big data technology and this drawback was not obvious. But as big data technology developed and new computing frameworks kept appearing, it became impractical to deploy a separate server cluster for each framework, and even if you could, the data would still sit in HDFS on the original cluster. Resource management therefore had to be separated from the MapReduce computing framework. This is the main change in Hadoop 2: Yarn was split out of MapReduce and became an independent resource scheduling framework.
Yarn's design ideas are also interesting:
First, to avoid tight coupling of functionality, the responsibilities of the original JobTracker had to be split up.
Second, multiple frameworks should be able to share one cluster: a single unified resource scheduling framework, Yarn, is deployed on the cluster, and the various computing frameworks run on top of Yarn.
The overall architecture of Yarn is as follows:
As the figure shows, Yarn consists of two kinds of processes: the ResourceManager and the NodeManager. The ResourceManager schedules and manages the resources of the whole cluster and is deployed on a dedicated server. The NodeManager is responsible for resource and task management on an individual server; it runs on every compute server in the cluster, alongside the HDFS DataNode process.
Specifically, the ResourceManager contains two main components: the scheduler and the applications manager. The scheduler is essentially a resource allocation algorithm: it allocates resources according to the resource requests submitted by applications (clients) and the current resource status of the server cluster.
Yarn ships with several resource scheduling algorithms, such as the Fair Scheduler and the Capacity Scheduler, and you can also develop your own scheduling algorithm for Yarn to invoke. Yarn allocates resources in units of containers. Each container holds a certain amount of computing resources, such as memory and CPU; by default, a container contains one CPU core. Containers are started and managed by the NodeManager process, which monitors the containers running on its node and reports their status to the ResourceManager.
The applications manager is responsible for submitting applications, monitoring their status, and so on. When an application starts, it needs to run an ApplicationMaster in the cluster, and the ApplicationMaster itself runs inside a container. Each application starts its own ApplicationMaster, which then requests container resources from the ResourceManager according to the application's needs. Once the containers are granted, the ApplicationMaster distributes the application code to them, starts it, and the distributed computation begins.
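To make this flow concrete, here is a hedged sketch of the ApplicationMaster-side calls using Yarn's AMRMClient API. It only works when launched by Yarn as an actual ApplicationMaster, and the resource size, priority, and error handling are simplified assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();

    // The ApplicationMaster registers itself with the ResourceManager...
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, "");

    // ...and asks the scheduler for containers with a given amount of
    // memory and CPU (here 1024 MB and 1 vcore, purely illustrative).
    Resource capability = Resource.newInstance(1024, 1);
    ContainerRequest ask =
        new ContainerRequest(capability, null, null, Priority.newInstance(0));
    rmClient.addContainerRequest(ask);

    // The allocate() heartbeat returns containers granted by the scheduler;
    // the ApplicationMaster would then ask the NodeManagers (via NMClient)
    // to launch its task code inside each container.
    AllocateResponse response = rmClient.allocate(0.0f);
    for (Container container : response.getAllocatedContainers()) {
      System.out.println("Got container " + container.getId()
          + " on " + container.getNodeId());
    }

    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    rmClient.stop();
  }
}
```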
Shoulders of giants
1. Learning Big Data from Scratch
2. Self-cultivation of Architects
3. www.cnblogs.com/yszd/p/1088…
4. hadoop.apache.org/
Summary
This article briefly introduced the architectural ideas and principles behind the three major Hadoop components. Some non-essential details were left out; you can study them on your own, or add me on WeChat (WX: ZWJ_bigDataer) to discuss and learn together. If you found this article helpful, don't forget to show your support
👇 “Big Data Visionary” 🔥 :
A big data enthusiast who enjoys reading, writing, and reflecting. Besides sharing big data fundamentals, technical practice, architecture design, and prototype implementations, I also like to write about interesting, practical programming topics and my reading notes……