Preface: This article introduces the idea of MapReduce in a simple way, explains how it is designed to solve IO problems, and briefly shows how MapReduce moves computation to the data.

Why is it called MapReduce?

What does Map do?

Suppose we now have five records:

Name       Major   Gender  Address
Zhang San  Java    1       Beijing, Shanghai
Li Si      Java    0       Beijing
Zhang San  Scala   1       Beijing, Shanghai
Cathy      Java    1       Shanghai
Li Si      Python  0       Beijing

If you want to filter out the records whose gender is 0, you can read the first record, extract the gender column, and decide from its value whether to keep the record. Then you read the second record, and so on. In the end you are left with the following data:

Name       Major   Gender  Address
Zhang San  Java    1       Beijing, Shanghai
Zhang San  Scala   1       Beijing, Shanghai
Cathy      Java    1       Shanghai

If you instead need to convert the gender code to its dictionary value (1 = Man, 0 = Woman), you end up with the following data:

Name       Major   Gender  Address
Zhang San  Java    Man     Beijing, Shanghai
Li Si      Java    Woman   Beijing
Zhang San  Scala   Man     Beijing, Shanghai
Cathy      Java    Man     Shanghai
Li Si      Python  Woman   Beijing

If you need to expand the compound value of the address field, you end up with the following data:

Name       Major   Gender  Address
Zhang San  Java    1       Beijing
Zhang San  Java    1       Shanghai
Li Si      Java    0       Beijing
Zhang San  Scala   1       Beijing
Zhang San  Scala   1       Shanghai
Cathy      Java    1       Shanghai
Li Si      Python  0       Beijing

As you can see, each of the above operations reads the records one by one and, while processing one record, pays no attention to the others: a record is either kept, dropped, transformed, or expanded. Semantically speaking, all of the above operations are transformation mappings applied per record.

To summarize: Map is a per-record mapping. Each input record is independently transformed into zero, one, or more output records.
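The three operations above (filtering, code-to-dictionary conversion, and expanding a compound field) can each be sketched as a per-record map function. A minimal Python sketch, using record tuples made up to mirror the tables above:

```python
# Each record: (name, major, gender_code, addresses)
records = [
    ("Zhang San", "Java",   1, ["Beijing", "Shanghai"]),
    ("Li Si",     "Java",   0, ["Beijing"]),
    ("Zhang San", "Scala",  1, ["Beijing", "Shanghai"]),
    ("Cathy",     "Java",   1, ["Shanghai"]),
    ("Li Si",     "Python", 0, ["Beijing"]),
]

# Filter: drop records whose gender code is 0 (each input yields 0 or 1 records).
kept = [r for r in records if r[2] != 0]

# Transform: map the gender code to its dictionary value (each input yields 1 record).
GENDER = {1: "Man", 0: "Woman"}
translated = [(n, m, GENDER[g], a) for (n, m, g, a) in records]

# Expand: flatten the compound address field (each input yields 1 or more records).
expanded = [(n, m, g, addr) for (n, m, g, addrs) in records for addr in addrs]

print(len(kept), len(translated), len(expanded))  # → 3 5 7
```

In every case the function sees one record at a time and never depends on the other records, which is exactly what makes the Map phase trivially parallel.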

What does Reduce do?

Again, here is the data:

Name       Major   Gender  Address
Zhang San  Java    1       Beijing, Shanghai
Li Si      Java    0       Beijing
Zhang San  Scala   1       Beijing, Shanghai
Cathy      Java    1       Shanghai
Li Si      Python  0       Beijing

If you want to count how many students study each major, you can read the records one by one. When you find that the major is Java, you emit an intermediate key-value pair (Java, 1). After all records have been processed and grouped, you get the following data set:

Key Value
Java 1
Java 1
Scala 1
Java 1
Python 1

The Java, Python, and Scala groups can then be processed in parallel to carry out the statistical calculation.

To summarize: Reduce is computation over groups. Based on Map's output, records with the same feature are first grouped together to form key-value pairs, and then each group is computed, possibly in parallel. That is the Reduce process.
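The grouping-then-computing step can be sketched in a few lines of Python, using the intermediate (major, 1) pairs from the table above:

```python
from collections import defaultdict

# Intermediate key-value pairs emitted by the map phase: (major, 1)
pairs = [("Java", 1), ("Java", 1), ("Scala", 1), ("Java", 1), ("Python", 1)]

# Shuffle/group: collect all values that share a key into one group.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce: each group is computed independently (and could run in parallel).
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # → {'Java': 3, 'Scala': 1, 'Python': 1}
```

Each group's computation touches only that group's values, which is why the groups can be distributed to different workers.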

Conclusion

The MapReduce process is as follows:

Map

  • Implements mapping, transformation, and filtering
  • One record maps to zero, one, or many records

Reduce

  • Implements decomposition, reduction, and summarization
  • One group of data produces one or more results

Map and Reduce are connected through key-value pairs. Establishing the key-value pairs is the basis for dividing the data into groups, and the Reduce computation takes the output of the Map computation as its input.

Distributed computing for MapReduce

Let's start with a few terms:

MapTask: In the figure, each dashed box on the left is a MapTask, which covers the split, the map method, sorting and grouping, and so on.

ReduceTask: In the figure above, each dashed box on the right is a ReduceTask, which covers merging the grouped data, the reduce method, and writing out the final data.

Split: At the storage layer, HDFS splits files into blocks. By default a split equals one block (the relationship between splits and blocks can be 1:1, 1:N, or N:1), but a split is a logical concept that exists to decouple the computation from the storage.

Parallelism of MapTask

We know that the HDFS block size is configurable. A small block size suits IO-intensive computation; the reverse holds for CPU-intensive computation. Different project teams have different requirements, so no single block size fits all. Splits, on the other hand, let you set the relationship to blocks, so that parallelism can be tuned to a project's IO-intensive or CPU-intensive needs. In other words, the split controls the parallelism of the MapTasks.

A split records the block's location and offset range, which is what enables moving the computation to the data.
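In Hadoop's FileInputFormat, the split size is derived from the block size clamped between a configurable minimum and maximum, which is how the block-to-split relationships above are realized. A sketch of that rule:

```python
def compute_split_size(block_size, min_size=1, max_size=float("inf")):
    # Mirrors Hadoop FileInputFormat: max(minSize, min(maxSize, blockSize)).
    return max(min_size, min(max_size, block_size))

MB = 1024 * 1024

# Default: split size equals block size (split:block = 1:1).
print(compute_split_size(128 * MB) == 128 * MB)                     # → True
# Raising min_size merges blocks into fewer, larger splits (N:1).
print(compute_split_size(128 * MB, min_size=256 * MB) == 256 * MB)  # → True
# Lowering max_size cuts one block into several splits (1:N).
print(compute_split_size(128 * MB, max_size=64 * MB) == 64 * MB)    # → True
```

Since one split feeds one MapTask, adjusting these two knobs directly adjusts MapTask parallelism without touching the stored blocks.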

ReduceTask (Partition) parallelism

In the figure above, each record is mapped to a key-value pair by the map method. Records with the same key (also called a group) form one group, and the reduce method is called once per group to iterate over and compute that group's data.

So the parallelism of the ReduceTasks (that is, the number of partitions) is set by the developer.

Relationships between multiple roles

Block to split

  • 1:1
  • N:1
  • 1:N

Split to map

  • 1:1

Map to reduce

  • N:1
  • N:N
  • 1:1
  • 1:N (note: a single group output by map cannot be split across multiple partitions)

Group to partition

  • N:1
  • N:N
  • 1:1
  • 1:N (again, a single group cannot be split across multiple partitions)

Detailed MapReduce processing process

Step 1: In a MapTask, one split corresponds to one map method. The split formats the raw bytes into records and calls the map method once per record.

Step 2: The output of map is a key-value pair. The key participates in the partition calculation: from the key, a partition number P is computed, forming a (K, V, P) triple.
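Hadoop's default HashPartitioner computes that partition number from the key. A Python sketch of the same idea (Python's hash differs from Java's hashCode, so the actual numbers differ, but the property is the same):

```python
def partition(key, num_reduce_tasks):
    # Mirrors Hadoop's HashPartitioner:
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    return (hash(key) & 0x7FFFFFFF) % num_reduce_tasks

# Every occurrence of the same key gets the same partition number,
# so one ReduceTask is guaranteed to see the entire group for that key.
p = partition("Java", 3)
assert all(partition("Java", 3) == p for _ in range(10))
assert 0 <= partition("Scala", 3) < 3
```

The masking with 0x7FFFFFFF keeps the hash non-negative before the modulo, which is what guarantees P is a valid partition index.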

The output of a MapTask is a file on the local file system. If every generated (K, V, P) triple were written straight to the file, the frequent IO would trigger kernel calls, each a switch from user mode to kernel mode, which is expensive. Therefore an in-memory buffer is used, 100 MB by default.

When the buffer is full, a single system call spills its contents to a file. By the time the map finishes producing output, many small spill files need to be merged into one. But the partitions inside these files are unordered: if the first Reduce needs to pull partition 0, it has to scan the files end to end, which is O(n). Therefore, while the (K, V, P) triples are still in memory, they are sorted by partition first, and the spill files are then combined on disk with a merge sort. With that, the cost for a Reduce to read one partition becomes O(1).

But there is another problem. A Reduce pulls, from every map, the data belonging to its partition. Even after these pulls, the keys within the partition would still be unordered across files, so reading all the values for one key would again cost O(n). So the buffer also needs to be sorted by key.

That is, the third step is introduced:

Step 3: When the in-memory buffer spills to disk, perform a two-level (secondary) sort, by partition first and then by key within each partition, so that partitions are in order and keys within each partition are in order. This way, when Reduce reads partition data from multiple MapTask output files, the cost is O(1).

Step 4: One ReduceTask may need to receive output files from multiple MapTasks, but since those files are already sorted by key, the ReduceTask can combine them with a merge sort. Moreover, with the support of the iterator pattern, the merging can proceed concurrently with the computation of the reduce method, reducing IO.
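The merge step in Step 4 can be sketched as a k-way merge over the already-sorted per-map files. Because each input is key-sorted, the reduce computation can stream over the merged iterator group by group instead of waiting for a full sort (the two map outputs below are made up for the example):

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Each MapTask's output for this partition, already sorted by key (Step 3).
map_output_1 = [("Java", 1), ("Java", 1), ("Scala", 1)]
map_output_2 = [("Java", 1), ("Python", 1)]

# k-way merge of sorted runs: this is what the ReduceTask's merge phase does.
merged = heapq.merge(map_output_1, map_output_2, key=itemgetter(0))

# Iterator-based reduce: each group is consumed as it is produced,
# so merging and reducing overlap instead of requiring a separate pass.
result = {key: sum(v for _, v in group)
          for key, group in groupby(merged, key=itemgetter(0))}
print(result)  # → {'Java': 3, 'Python': 1, 'Scala': 1}
```

`heapq.merge` and `groupby` are both lazy, which mirrors how the ReduceTask never needs the whole merged file in memory at once.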

Solve problems with MapReduce

Example 1: The duplicate-row problem from the Hadoop enlightenment article

Example 2: WordCount

Example 3: Count the number of words with the same frequency
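As an illustration of Example 2, here is a single-process sketch of WordCount that follows the map → shuffle/sort → reduce pipeline described above (the input lines are made up for the example):

```python
from collections import defaultdict

lines = ["hello world", "hello mapreduce", "hello world"]

# Map: one line in, many (word, 1) pairs out.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: order pairs by key and collect each key's values,
# as the framework does between the two phases.
groups = defaultdict(list)
for word, one in sorted(pairs):
    groups[word].append(one)

# Reduce: sum each word's group of ones.
word_count = {word: sum(ones) for word, ones in groups.items()}
print(word_count)  # → {'hello': 3, 'mapreduce': 1, 'world': 2}
```

Example 3 would then run a second MapReduce job over this output, mapping each (word, count) pair to (count, 1) and reducing again, which is a typical example of chaining jobs through key-value pairs.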