I previously wrote a two-part analysis of Hadoop MapReduce internals, but after rereading it several times even I found it tedious, and honestly not very well written =. =, so I decided to re-explain how MapReduce works by simplifying those earlier articles. Here goes!

MapReduce

What MapReduce is:

MapReduce is a computing model that breaks a large data-processing job into many small tasks that can be executed in parallel across a cluster of servers; the results of those tasks are then combined to compute the final result. In short, Hadoop MapReduce is a software framework that is easy to program with and can quickly process huge amounts of data in parallel on large clusters (thousands of nodes) of commodity machines in a reliable, fault-tolerant manner. The term MapReduce comes from its two basic data transformation operations: the Map process and the Reduce process.

Map:

The map operation converts the elements of a collection from one form to another; here, each input key-value pair is converted into zero or more output key-value pairs.
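As a minimal sketch of a map step (assuming the classic word-count example used later in this post and Hadoop's org.apache.hadoop.mapreduce Java API), a Mapper that turns one input line into zero or more <word,1> pairs could look like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line itself.
// Output: one <word,1> pair per word, i.e. zero or more outputs per input pair.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}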

Reduce:

All key-value pairs for a particular key are routed to the same reduce operation; specifically, the key and all of its values are passed to the same Reducer. The purpose of the reduce process is to convert a set of values into a single value (such as a sum or an average) or into another collection. The Reducer finally produces a key-value pair.
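Continuing the same word-count sketch (same assumptions as above), a Reducer that collapses all the 1s for one word into a single count could look like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives one key together with all of its values, e.g. <hello,{1,1}>,
// and emits a single <hello,2> pair.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}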

Task:

Hadoop provides a basic framework that handles most of the hard work needed to make sure jobs run successfully. For example, Hadoop decides how a submitted job is broken down into individual Map and Reduce tasks, and it schedules those tasks and allocates appropriate resources to them. It decides where in the cluster to run each task (usually, if possible, on the node that stores the data the task will process, to minimize network overhead). Hadoop also monitors each task to ensure it completes successfully and restarts tasks that fail.

MapReduce architecture

In MapReduce, there are two machine roles used to run MapReduce jobs: the JobTracker and the TaskTracker. The JobTracker schedules work and the TaskTrackers execute it. There is only one JobTracker in a Hadoop cluster.

The client submits a job to the JobTracker, and the JobTracker splits the job into tasks and assigns them to TaskTrackers for execution. Each TaskTracker sends a heartbeat message to the JobTracker at regular intervals. If the JobTracker does not receive a heartbeat from a TaskTracker within a certain period of time, it assumes that the TaskTracker has died and reassigns that TaskTracker's tasks to other TaskTrackers.

The MapReduce execution process



1. The client starts a Job. (A minimal driver sketch follows this list.)

2. The client asks the JobTracker for a new Job ID.

3. The client copies the resource files needed to run the job to HDFS, including the JAR file of the packaged MapReduce program, the configuration files, and the input split information computed by the client. These files are stored in a folder that the JobTracker creates specifically for this job, named after the Job ID. By default the JAR file is stored with 10 replicas, and the input split information tells the JobTracker how many Map tasks should be started for the job.

4. After the JobTracker receives the job, it places it in a job queue and waits for the job scheduler to pick it up. When the scheduler schedules the job according to its scheduling algorithm, it creates one Map task for each input split and assigns the Map tasks to TaskTrackers for execution. Note that a Map task is not assigned to just any TaskTracker: for data locality, a Map task is assigned, when possible, to a TaskTracker on the node that holds the data block the task will process, and the application JAR is copied to that TaskTracker so it can run the task. Data locality is not considered, however, when assigning Reduce tasks.

5. The TaskTracker sends a heartbeat to the JobTracker at regular intervals to tell the JobTracker that it is still running; the heartbeat also carries a lot of information, such as the progress of the current Map task. When the JobTracker receives the completion message for the last task of the job, it marks the job as “successful”, and the JobClient relays the message to the user.
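The client side of steps 1 to 3 is just a small driver program. Here is a minimal sketch, again for the word-count example (class names and argument handling are illustrative; job.waitForCompletion() hides the Job ID request and the copying of the JAR and split information to HDFS):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory, must not exist yet

        // Submits the job and polls its progress until it finishes
        // (steps 1-3 above happen inside waitForCompletion/submit).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}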

Map and Reduce in MapReduce

The MapReduce framework operates only on key-value pairs. MapReduce treats the different types of input to a Job as key-value pairs and produces a set of key-value pairs as output.

For example, let’s count all the books in the library. You count shelf one, I count shelf two. That’s the Map. The more people we are, the faster we count books. Now let’s get together and add up the numbers of all the people. This is “Reduce.” Simply put, Map is “divide” and Reduce is “combine”.

(input) <k1,v1> -> map -> <k2,v2> -> combine -> <k2,v2> -> reduce -> <k3,v3> (output)
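For the small word-count example walked through in the rest of this post (assuming both input lines are handled by the same map task, so the combine step can merge them), the flow looks like this:

(input) <0,hello you> <10,hello me> -> map -> <hello,1> <you,1> <hello,1> <me,1> -> combine -> <hello,2> <you,1> <me,1> -> reduce -> <hello,2> <me,1> <you,1> (output)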

map

Each Map task in MapReduce can be divided into four stages: Record Reader, Mapper, Combiner, and Partitioner. The output of the map task, called intermediate keys and intermediate values, is sent to the reducer for subsequent processing.

(1) Read the file from HDFS. Each line is parsed into a <k,v> pair (the byte offset of the line and the line itself), and the map function is called once for each pair. For example: <0,hello you> <10,hello me>

(2) Override map(), receive the <k,v> pairs produced in (1), process them, and emit new <k,v> pairs as output. For example: <hello,1> <you,1> <hello,1> <me,1>

(3) Partition the <k,v> pairs output by (2). By default there is a single partition.

(4) Sort and group the data within each partition by key. Grouping means that the values of the same key are put into one collection. After sorting: <hello,1> <hello,1> <me,1> <you,1>; after grouping: <hello,{1,1}> <me,{1}> <you,{1}>

(5) (Optionally) combine the grouped data locally before it is sent to the reducers (a configuration sketch for the combiner and partitioner follows this list).
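Stages (3) and (5) are pluggable. Here is a minimal sketch of wiring a combiner and a custom partitioner into the driver shown earlier (the FirstLetterPartitioner class is hypothetical; HashPartitioner is what Hadoop uses by default):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: routes each word to a reduce task by its first character.
// The default HashPartitioner hashes the key instead.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        int firstChar = word.isEmpty() ? 0 : word.charAt(0);
        return firstChar % numPartitions;
    }
}

// In the driver shown earlier:
// job.setCombinerClass(WordCountReducer.class);         // summing is associative, so the reducer can combine
// job.setPartitionerClass(FirstLetterPartitioner.class);
// job.setNumReduceTasks(2);                              // two partitions -> two reduce tasks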

reduce

Each Reduce task can be divided into four stages: shuffle, sort, reducer, and output format.



(1) The output of the many map tasks is copied over the network to the appropriate reduce nodes according to its partition. (shuffle)

(2) Merge and sort the outputs of the multiple map tasks. Override the reduce function, receive the grouped data, apply your own business logic, and produce new <k,v> pairs as output; after processing, for example: <hello,2> <me,1> <you,1>

(3) Write the <k,v> pairs output by reduce to HDFS.
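With the default TextOutputFormat, each reduce task writes its <k,v> pairs to HDFS as tab-separated lines. For the example above (assuming a single reduce task, so a single output file such as part-r-00000), the result would be:

hello	2
me	1
you	1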