MR shuffle

Brush up on MR Shuffle:

MapTask

The Map output is not simply written to disk, but to the buffer first, and the contents of the buffer are sorted when the buffer is spilled.

Each MapTask (calculates a split) has a ring buffer (100MB by default, which is a tuning advantage, but should never write MR again), and when the buffer reaches a threshold (80%, which is also tunable), a background thread takes care of spilling the buffer contents onto spill disk.

Notice that this is parallel execution, where one thread is writing to the buffer all the time, while another thread is writing to disk. When the buffer is filled before the overflow writer thread has time to write, the writing buffer thread is blocked.

Before writing data from the buffer to the disk, data will be partitioned according to the key and sorted according to the key. At this time, the partition will be called, which can be implemented by itself. After sorting, if combiner is available, the data in the same group will be aggregated to adjust advantages. Reduced disk and network IO. Each partition will write one file. For example, if there are four partitions, four files will be written to disk.

Finally, these small files will be merged into a file, which is characterized by orderly partition and orderly key in the partition.

To summarize, MapTask outputs that each buffer overwrite produces a merged file on disk. If there are at least 3 overwritten files, combiner is performed again to compress the data, according to the authoritative guide. . This is also a parameter tuning mapreduce.map.com bine minspills, when less than 3, indicates that the map output is not much, also not worth calling a combiner. The Map output is not compressed by default; you can turn the compression on and specify the appropriate compression format.

ReduceTask

Reducer Removes map output from the network. Graphs. Shuffle. Max. Threads decided to open many threads to pull. The default value is 0, representing twice the number of CPU cores used. These parameters are no longer mentioned, really up in production, absolutely engraved on my heart.

How does Reduce know where the file output from Map is? After the map task is completed, Reduce communicates with AM, the task scheduler, and then asks AM when it wants to pull the file, so that Reduce knows where to get the data.

Reduce will merge the partitioned data to be computed. Since the map end has done sorting (quick sorting), at this time, merge and sort them to obtain a set of ordered keys, and send each group of data to the reducer reduce method we wrote ourselves for corresponding logical calculation.

If you have enough memory, reduce the number of disk overwrites.

Spark

Recall again that MR. In Reduce we can take a set of data and operate on it. This is because the entire shuffle of MR is done by sorting keys. I was taught to sort upstream to save downstream efficiency. In fact, if you think about it carefully, reduce is a protocol operation on the same key, that is, the same key must meet.

Spark shuffle also sorts by key (SortShuffleWrite), but provides a ByPass mode that does not require sorting. Think about it here

ShuffleWrite

SortShuffleWrite

ExternalSorter

Let’s combine that with updateFunc here

groupByKey

GroupByKey calls the combineByKey operator at the bottom. This operator is very useful. Sometimes we can use this operator to achieve some special logic. Further down is ShuffledRDD

mergeCombiner
groupByKey

Groupbykey is not aggregated on the map side. If you think about it, if it does add overhead, this is a bit of spark optimization.

reduceByKey

ReduceByKey, the underlying is also combineByKey

// todo … Shuffle Write doesn’t finish the shuffle write.

ShuffleReader

Take a look at how the Reduce end receives data and what operations it does. Let’s look at it from here

The third function.

If map side aggregation is not enabled, groupByKey is typical. All aggregation logic occurs on the Reduce side, and the Reduce side uses those three functions.

conclusion

In terms of protocol operations, MR has the overhead of sorting, whereas Spark uses HashTable to let identical keys meet.