I previously wrote "(2) Hadoop MapReduce Principle Analysis", but after rereading it a few times even I found it boring and hard to follow, which says something about the writing =. =. So I've decided to reintroduce the working principle of MapReduce, building on the previous chapter. Here we go!

MapReduce

Overview:

MapReduce is a computing model that breaks a large data-processing task into individual tasks that can be executed in parallel across a cluster of servers; the results of these tasks are then combined to compute the final result. In short, Hadoop MapReduce is a software framework that is easy to program against and can rapidly process large amounts of data in parallel across large clusters (thousands of nodes) of commodity machines in a reliable, fault-tolerant manner. The term MapReduce comes from its two basic data transformation operations: the Map process and the Reduce process.

Map:

The Map operation converts the elements of a collection from one form to another; here, each input key-value pair is converted into zero or more output key-value pairs.
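For instance, the map function of the classic word-count job turns each line of text into one <word, 1> pair per word. Below is a minimal sketch using Hadoop's org.apache.hadoop.mapreduce API; the class and variable names are my own, chosen for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // One input pair <byte offset, line of text> becomes zero or more output pairs <word, 1>.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}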

Reduce:

All key-value pairs that share a key are routed to the same Reduce operation; specifically, the key and all of the values associated with it are passed to the same Reducer. The purpose of the reduce step is to collapse a collection of values into one value (such as a sum or an average) or into another collection. For each key, the Reducer ultimately produces a key-value pair.
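Continuing the word-count example, a reduce function that sums the 1s collected for each word might look like the sketch below (again, the class names are mine).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // All values for one key arrive together; collapse them into a single sum.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total); // e.g. <hello, 2>
    }
}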

Task:

Hadoop provides the basic machinery to handle most of this hard work and ensure that tasks execute successfully. For example, Hadoop decides how a submitted job is broken up into independent Map and Reduce tasks, schedules those tasks, and allocates appropriate resources to them. It also decides where in the cluster to run each task (usually, when possible, where the data the task will process is located, to minimize network overhead). Hadoop monitors each task to make sure it completes successfully and restarts tasks that fail.

MapReduce architecture

In MapReduce, two machine roles are used to perform MapReduce jobs: JobTracker and TaskTracker. JobTracker is used to schedule work, and TaskTracker is used to perform work. There is only one JobTracker in a Hadoop cluster.

The client submits a job to JobTracker, which breaks it into tasks and assigns them to TaskTrackers for execution. Each TaskTracker sends heartbeat messages to JobTracker at regular intervals. If JobTracker does not receive a TaskTracker's heartbeat for a period of time, it assumes that TaskTracker has died and reassigns its tasks to other TaskTrackers.

MapReduce execution process

1. Start a job on the client

2. Request a JobID from JobTracker

3. Copy the resource files required to run the job to HDFS, including the JAR file packaging the MapReduce program, the configuration files, and the input split information calculated by the client. These files are stored in a folder that JobTracker creates specifically for the job, named after the job's JobID. The JAR file is replicated 10 times by default, and the input split information tells JobTracker how many map tasks should be started for the job.

4. After JobTracker receives the job, it puts it in a job queue to wait for scheduling. When JobTracker schedules the job according to its scheduling algorithm, it creates a map task for each input split and assigns each map task to a TaskTracker for execution. For data locality, a map task is assigned, when possible, to a TaskTracker on the node that holds the data it will process, and the JAR package is copied to that TaskTracker to run. Data locality is not considered when reduce tasks are allocated, however.

5. TaskTracker sends JobTracker a heartbeat at regular intervals to tell JobTracker that it is still running; the heartbeat also carries information such as the progress of the current map task. When JobTracker receives the completion message for the last task of the job, it sets the job status to "success", and JobClient passes the message on to the user.
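From the client's point of view, steps 1-3 boil down to configuring and submitting a job. Below is a driver sketch using the org.apache.hadoop.mapreduce API; it reuses the word-count classes sketched earlier, and the input/output paths are placeholders taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // The JAR containing this class is what gets copied to HDFS at submission time.
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input splits are computed from the input path; one map task is started per split.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait, printing progress, until the last task completes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}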

Map and Reduce in MapReduce

The MapReduce framework operates only on key-value pairs. MapReduce treats different types of job inputs as key-value pairs and generates a set of key-value pairs as output.

For example, suppose we want to count all the books in a library. You count shelf one and I count shelf two; that is Map. The more of us there are, the faster we count. Then we get together and add up everyone's totals; that is Reduce. In simple terms, Map means "divide" and Reduce means "combine".

(input) <k1,v1> -> map -> <k2,v2> -> combine -> <k2,v2> -> reduce -> <k3,v3> (output)

map

Each map task in MapReduce can be subdivided into four phases: Record Reader, Mapper, Combiner, and Partitioner. The output of the map task, called intermediate keys and intermediate values, is sent to the reducer for subsequent processing.

(1) Read the files from HDFS. Each line is resolved into a <k,v> pair, where k is the byte offset of the line, and the map function is called once for each pair: <0, hello you> <10, hello me>

(2) The overridden map() receives the <k,v> pairs generated in (1), processes them, and emits new <k,v> pairs: <hello,1> <you,1> <hello,1> <me,1>

(3) Partition the <k,v> output from (2). By default everything goes into a single partition (a sketch of the default hash partitioning appears after this list).

(4) Sort and group the data within each partition by key. Grouping means that the values of the same key are collected into one collection. Sorted: <hello,1> <hello,1> <me,1> <you,1>; after grouping: <hello,{1,1}> <me,{1}> <you,{1}>

(5) Optionally, merge and reduce (combine) the grouped data locally.
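Step (3) is performed by a Partitioner, which decides which reduce task each intermediate key is sent to. A sketch of the logic of Hadoop's default HashPartitioner:

import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the default hash-partitioning logic: the same key always lands in the
// same partition, so all of its values end up at the same reduce task.
public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take the modulus.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

With a single reduce task, the default described above, every key maps to partition 0. For step (5), word count can reuse the reducer as a combiner by adding job.setCombinerClass(WordCountReducer.class) to the driver.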

reduce

Reduce tasks can be divided into four stages: Shuffle, Sort, Reducer, and Output Format.

(1) The outputs of the multiple map tasks are copied over the network to different reduce nodes according to their partitions (shuffle).

(2) Merge and sort the outputs of the multiple map tasks. The overridden reduce function receives the grouped data, applies its own business logic, and after processing emits new <k,v> output: <hello,2> <me,1> <you,1>

(3) Write the <k,v> output of reduce to HDFS.
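For the running example, the final file that reduce writes to HDFS (e.g. part-r-00000 with the default TextOutputFormat, which separates key and value with a tab) would contain:

hello	2
me	1
you	1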