

MapReduce programming model

The MapReduce programming model is simple to develop with yet powerful, and is designed specifically for processing large amounts of data in parallel. The overall MapReduce process is depicted in the figure below.

For more details on the MapReduce programming model, see my blog post — What is the MapReduce Programming Model?

The whole process

In the figure above, the MapReduce workflow can be divided into five steps:

Split and format the data source

The data source that is fed into the Map phase must first be split and formatted.

  • Splitting: the source file is divided into equally sized chunks, or splits (128 MB by default in Hadoop 2.x). Hadoop creates one Map task for each split, and that task runs the user-defined map() function to process every record in the split;
  • Formatting: each split is formatted into <key, value> pairs, where key is the byte offset of a line and value is the content of that line (these pairs are what the Mapper sketched below consumes).
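With Hadoop's default TextInputFormat, each record handed to the map() function is exactly such an offset/line pair. A minimal word-count style Mapper that consumes these pairs might look like this (the class and field names are illustrative, not part of any fixed API):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key = byte offset of the line, input value = line content,
// as produced after splitting and formatting by TextInputFormat.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit <word, 1> for every token in the line.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```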

Perform MapTask

Each Map task has a memory buffer (100 MB by default) into which the intermediate results produced from its input split are written. When the data in the buffer reaches the spill threshold (80% of the buffer, i.e. 80 MB), a background thread starts spilling the data to disk, while the Map task continues writing new intermediate results into the buffer. During each spill, the MapReduce framework sorts the records by key. If the intermediate results are large, this happens repeatedly and multiple spill files are created; when the Map task finishes, any remaining buffered data is spilled as well, and all spill files are merged into a single file.
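The buffer size and spill threshold are configurable. A hedged sketch using Hadoop 2.x property names, set programmatically (the values shown are simply the documented defaults; verify them against your own cluster version):

```java
import org.apache.hadoop.conf.Configuration;

public class MapSpillTuning {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Size of the in-memory sort buffer used by each MapTask, in MB (default 100).
        conf.setInt("mapreduce.task.io.sort.mb", 100);
        // Fraction of the buffer at which a background spill to disk starts (default 0.80).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        // Maximum number of spill files merged at once into the final output file (default 10).
        conf.setInt("mapreduce.task.io.sort.factor", 10);
        return conf;
    }
}
```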

Perform the Shuffle process

During a MapReduce job, how the data processed in the Map phase is transferred to the Reduce phase is a key part of the framework; this process is called Shuffle. Shuffle distributes the output of each MapTask to the ReduceTasks, and during this distribution the data is partitioned and sorted by key.
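The partitioning step decides which ReduceTask receives each key. By default Hadoop uses HashPartitioner; a minimal custom Partitioner written along the same lines (the class name is illustrative) could look like this:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reducer the same way HashPartitioner does:
// hash the key and take it modulo the number of ReduceTasks.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

It would be registered on the job with job.setPartitionerClass(HashLikePartitioner.class).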

Perform ReduceTask

The data that flows into a ReduceTask has the form <key, {value list}>. You can implement the reduce() method with your own logic and output the results as <key, value> pairs.
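Continuing the word-count sketch from above, a Reducer that consumes such <key, {value list}> groups and sums the values might look like this (again, the names are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives one key together with all values grouped under it,
// i.e. the <key, {value list}> form described above.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        // Emit the final <key, value> pair for this word.
        context.write(word, total);
    }
}
```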

Write to the file

The MapReduce framework automatically passes each <key, value> pair produced by the ReduceTask to the write() method of the configured OutputFormat, which writes it to the output file.
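Tying the steps together, a driver could wire up the illustrative Mapper and Reducer above and select TextOutputFormat, which writes one <key, value> pair per line. This is a sketch of one way to configure such a job, not the only one:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Splitting and formatting: TextInputFormat produces <offset, line> pairs.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Writing: the framework hands each <key, value> to TextOutputFormat's RecordWriter.
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```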

MapTask

  1. Read phase: the MapTask parses key/value pairs from the input InputSplit using the user-provided RecordReader.
  2. Map phase: each parsed key/value pair is handed to the user-written map() function, which produces a series of new key/value pairs.
  3. Collect phase: when the map() function finishes processing a record, it usually calls OutputCollector.collect() (context.write() in the newer API) to output the result. Inside this call, the generated key/value pair is assigned a partition (by calling the Partitioner) and written into a ring memory buffer (100 MB by default).
  4. Spill phase: when the buffer is about to overflow (80% of the buffer size by default), a spill file is created on the local file system and the buffered data is written to it.

Before the data is written to the local disk, it is sorted, and merged or compressed if necessary. Also before the write, a thread partitions the data according to the number of ReduceTasks, so that each Reduce job corresponds to the data of one partition; the goal is to avoid the awkward situation in which some ReduceTasks receive a large amount of data while others receive little or none. If a Combiner has been set, the sorted results are combined at this point, so that as little data as possible is written to disk.

  5. Combine phase: when all data has been processed, the MapTask merges all of its temporary spill files once, ensuring that only one data file is ultimately produced.

Sorting and combining continue throughout the merge process (a Combiner is sketched below). The purpose is, first, to minimize the amount of data written to the local disk each time and, second, to minimize the amount of data transferred over the network in the following copy phase. The spill files are finally merged into a single file that is both partitioned and sorted.
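When the reduce logic is associative and commutative, as in the running word-count sketch, the Reducer class itself can often be reused as the Combiner; this is a common pattern rather than a requirement of the framework:

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    // Reuse the (illustrative) WordCountReducer as a Combiner: partial per-MapTask
    // sums are merged locally before the Shuffle, reducing spill and network traffic.
    public static void enableCombiner(Job job) {
        job.setCombinerClass(WordCountReducer.class);
    }
}
```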

ReduceTask

  1. Copy phase: the ReduceTask remotely copies a piece of data from each MapTask (the data produced by each MapTask is already sorted). If a copied piece exceeds a certain threshold it is written to disk; otherwise it is kept directly in memory.
  2. Merge phase: while data is being copied, the ReduceTask starts two background threads that merge files in memory and on disk respectively, to prevent excessive memory usage and too many disk files (see the tuning sketch at the end of this section).
  3. Sort phase: the user-written reduce() method expects its input to be grouped by key, i.e. a set of values aggregated under each key.

To bring data with the same key together, Hadoop uses a sort-based strategy. Since each MapTask has already sorted its own output locally, the ReduceTask only needs to perform a merge sort over all the copied data.

  4. Reduce phase: the reduce() method is invoked once for each group of key/value pairs with the same key in the sorted data; each invocation produces zero or more key/value pairs.
  5. Write phase: the reduce() function writes the computed results to HDFS.

Many intermediate files are written to disk during these merges, but MapReduce keeps the amount of data written to disk as small as possible, and the result of the last merge is not written to disk; it is fed directly into the reduce() function.
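The copy and merge behaviour described above is tunable as well. A hedged sketch using Hadoop 2.x property names (the values shown are simply the documented defaults; verify them against your own distribution):

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceShuffleTuning {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Number of parallel threads copying map outputs to the ReduceTask (default 5).
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 5);
        // Fraction of the reducer's heap used to buffer copied map outputs (default 0.70).
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
        // Usage threshold of that buffer at which an in-memory merge begins (default 0.66).
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
        return conf;
    }
}
```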