MapReduce is Hadoop’s solution to large-scale distributed computing. It is both a programming model and a computing framework: developers write programs against the MapReduce programming model, and the MapReduce computing framework then distributes those programs to run across the Hadoop cluster. Let’s start with MapReduce as a programming model.


MapReduce programming model

MapReduce is a very simple and very powerful programming model.

The simplicity lies in the fact that the programming model contains only two steps: map and reduce. The main input to map is a <key, value> pair, and after the map computation another <key, value> pair is output. The framework then merges pairs with the same key into a <key, value set>, feeds that <key, value set> into reduce, and reduce outputs zero or more <key, value> pairs after its computation.

However, MapReduce is also very powerful. From relational algebra (SQL computation) to matrix computation (graph computation), almost all computing requirements in the field of big data can be realized by MapReduce programming.

Let’s take the WordCount program as an example. WordCount counts the number of occurrences of each word in a body of text. If you only need word frequencies for a single article, say tens of KB to a few MB of data, you can simply write a program that reads the data into memory and builds a hash table to record the frequency of each word, as shown in the figure below.


Word frequency statistics for small data
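For data at that scale, a minimal single-process sketch along the following lines is enough. The class and method names (LocalWordCount, countWords) are purely illustrative, not part of any library.

import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class LocalWordCount {

    // Build an in-memory hash table mapping each word to its frequency.
    public static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> freq = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                // Increment the count for this word, starting from 1 if unseen.
                freq.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return freq;
    }
}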


But if you want to count word frequencies across the trillions of web pages on the Internet (a typical requirement for a search engine like Google), no single program can read all the world’s web pages into memory. That is where MapReduce comes in.

WordCount’s MapReduce program is as follows.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Split each input line into words and emit <word, 1> for every word.
        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        // Sum the 1s for each word and emit <word, total count>.
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}

Its core is a map function and a Reduce function.

The input to the map function is mainly a <key, value> pair. In this case, value is one line of the text to be counted; the key is not important here, so we ignore it.

public void map(Object key, Text value, Context context)

The map function extracts the words in the line and outputs a <key, value> pair for each word, namely <word, 1>.

The MapReduce computing framework collects these <word, 1> pairs and puts identical words together, forming <key, value set> data such as <word, <1,1,1,1,1…>>, which is then fed into the reduce function.

public void reduce(Text key, Iterable<IntWritable> values, Context context)

Here, the reduce input parameter values is a collection of many 1s, and key is the word itself.

The reduce function simply sums the 1s in the collection and outputs the word and the sum as a <key, value> pair, that is, <word, sum>. Each output is therefore a word together with its frequency count.

Assume that word frequency statistics are required for two blocks of text data. The MapReduce calculation process is as follows.


MapReduce calculation process


Because a map function only needs to operate on part of the data, a large dataset can be split into many blocks (which is exactly what HDFS does), and the MapReduce computing framework assigns a map task to each block. This is how MapReduce achieves distributed computation over big data.

As mentioned above, the MapReduce programming model divides the big data computation into a map phase and a reduce phase. In the map phase, a map task is assigned to each data block; the keys output by all the map tasks are then merged, and each key, together with its corresponding values, is sent to the same reduce task for processing.

There are two key issues to address in this process:

  • How to assign a map task to each data block: how to send the map code to the server where the block is stored, how to start it, and how the started task knows which part of the file it should process (that is, the block ID).


  • How to aggregate the <key, value> pairs output by map tasks on different servers and send all pairs with the same key to the same reduce task.

These two key problems correspond to the two “MapReduce framework processing” stages highlighted in the “MapReduce calculation process” figure above.

The two MapReduce framework processing stages involved in a MapReduce computation



Let’s take a look at how MapReduce starts processing a big data computing application.

Starting and running mechanism of MapReduce jobs

Taking Hadoop 1 as an example, a MapReduce job involves the following key processes:

  • Big data application process: the main entry point of the MapReduce program. It specifies the Map and Reduce classes and the input and output file paths, and submits the job to the Hadoop cluster (see the driver sketch after this list).

  • JobTracker: decides how many map and reduce tasks to start based on the amount of input data to be processed, and schedules and monitors the tasks throughout the job life cycle. The JobTracker process is globally unique in a Hadoop cluster.

  • TaskTracker: starts and manages the Map and Reduce processes. Because every data block needs a map task, TaskTracker usually runs on the same server as the HDFS DataNode, which means that most servers in a Hadoop cluster run both a DataNode and a TaskTracker.
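To make the role of the application process concrete, here is a minimal WordCount driver sketch using Hadoop’s Job API (Job.getInstance is the Hadoop 2+ form; in Hadoop 1 the Job constructor plays the same role). The job name “word count” and the use of command-line arguments for the input and output paths are assumptions for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");            // job name is illustrative
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);      // Map class from the WordCount example above
        job.setReducerClass(WordCount.IntSumReducer.class);       // Reduce class from the WordCount example above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input path taken from the command line
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output path taken from the command line
        // Submit the job to the cluster and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, such a driver would typically be submitted with the hadoop jar command (the JAR name and paths being whatever your deployment uses).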

As shown in the figure below.

Starting and running mechanism of MapReduce jobs



The specific job startup and computation process is as follows:

  • The application process stores the user’s job JAR package in HDFS, from where it will later be distributed to the servers in the Hadoop cluster that perform the MapReduce computation.

  • The application submits the job to JobTracker.

  • JobTracker creates a JobInProcess tree according to its job scheduling policy; each job has its own JobInProcess tree.

  • JobInProcess creates a corresponding number of TaskInProcess nodes based on the number of input splits (usually the number of data blocks) and the configured number of reduce tasks.

  • The TaskTracker process and the JobTracker process communicate periodically.

  • If a TaskTracker has free computing resources (idle CPU cores), JobTracker assigns it tasks. When assigning tasks, JobTracker matches the TaskTracker’s server name against the locations of the data blocks, so that the task it starts processes data stored on that same machine.

  • After the TaskTracker receives a task, its TaskRunner starts the Map or Reduce process according to the task type (Map or Reduce) and the task parameters (such as the path of the job JAR package, the path of the input data file, the start position and offset of the data to be processed within that file, and the DataNode host names of the data block replicas).

  • After the Map or Reduce process starts, it checks whether the job JAR file already exists on the local machine; if not, it downloads the JAR from HDFS, then loads the Map or Reduce code and executes the task.

  • A Map process reads its data from HDFS (usually the blocks it needs to read happen to be stored on the local machine); a Reduce process writes its result data to HDFS.

Through this process, MapReduce distributes big data computing tasks across the entire Hadoop cluster, and each Map task can read the data it processes from the local disk. All the user has to do is write a map function and a reduce function, without worrying about how these two functions are distributed to the cluster or how data blocks are assigned to computing tasks; all of that is handled by the MapReduce computing framework.

MapReduce data combination and connection mechanism

In the WordCount example, we want to count how many times each word appears across all the input data. However, a map task only processes part of the data, and a common word will appear in almost every map task’s output, so the counts for the same word must be brought together to get the correct result.

In fact, almost every big data computing scenario has to deal with this data association problem: it can be as simple as WordCount, which merely merges values by key, or as complex as a database join, which joins two (or more) kinds of data by key.

The MapReduce computing framework performs this merging and joining of data between the map output and the reduce input, in a process known by a special name: shuffle.

MapReduce shuffle process

Each map task writes its computation results to the local file system. When a map task is nearly complete, the MapReduce computing framework starts the shuffle process and invokes a Partitioner interface on the map side: a reduce partition is chosen for every <key, value> pair the map produces, and the pair is sent to the corresponding Reduce process over HTTP. In this way, no matter which server node a map runs on, pairs with the same key always end up at the same Reduce process. The reduce side sorts and merges the received <key, value> pairs, puts identical keys together to form <key, value set> records, and passes them to the reduce function.

The default Partitioner of the MapReduce framework takes the hash value of the key modulo the number of reduce tasks, so the same key always lands on the same reduce task ID. Implementing such a Partitioner requires only one line of code, as shown below.

/** Use {@link Object#hashCode()} to partition. */
public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
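If the default hash partitioning is not appropriate for a particular job, a custom Partitioner can be plugged in. The sketch below is illustrative only: the class name FirstLetterPartitioner and the idea of routing words to reduce tasks by their first character are assumptions for the example, not anything required by Hadoop.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative custom Partitioner: the first character of a word decides its reduce partition.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String word = key.toString();
        int first = word.isEmpty() ? 0 : word.charAt(0);
        return first % numReduceTasks;
    }
}

In the driver, such a class would be registered with job.setPartitionerClass(FirstLetterPartitioner.class).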

Shuffle is where the magic of big data computing happens. Whether it is MapReduce or Spark, any batch computation over big data needs a shuffle step to associate the data and reveal its internal relationships and value. If you do not understand shuffle, you will be lost when writing map and reduce code and will not know how to design the map output and the reduce input correctly. Shuffle is also the most difficult part of the whole MapReduce process; half of the early MapReduce code was about shuffle processing.