The MapReduce paper, "MapReduce: Simplified Data Processing on Large Clusters", was published in 2004. Google already had enormous amounts of data to process every day, and MapReduce let programmers focus on implementing business logic over large data volumes, while only having to follow the rules of the MapReduce programming model. At the time, MapReduce was already being used to compute URL access frequencies, distributed grep, inverted indexes, and distributed sorts.
The core idea of MapReduce is to split the large input into pieces, process each piece, and emit intermediate data; this step is called Map. The intermediate data is then combined to produce the final result; this step is called Reduce.
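To make the Map/Reduce split concrete, here is a minimal single-machine sketch in Python using word count, the canonical example from the paper. The function names and the in-memory grouping are my own simplifications, not Google's API.

```python
from collections import defaultdict

def map_fn(document: str):
    # Map: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield word, 1

def reduce_fn(key: str, values: list) -> int:
    # Reduce: combine all values that share the same key.
    return sum(values)

def map_reduce(documents: list) -> dict:
    intermediate = defaultdict(list)
    for doc in documents:                  # Map phase
        for key, value in map_fn(doc):
            intermediate[key].append(value)
    return {key: reduce_fn(key, values)    # Reduce phase
            for key, values in intermediate.items()}

print(map_reduce(["the quick fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```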
This process can be implemented in many ways, depending on the actual business scenario, the available hardware, and so on. At Google it usually runs on a huge cluster of hundreds or thousands of machines, which are constantly doing Map and Reduce work in parallel.
- Splitting: the input data is first split into small pieces, usually 16 to 64 megabytes each. This size is tied to the data-locality strategy mentioned below and is meant to save network bandwidth (bandwidth was far more limited back then, after all).
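As a rough illustration of the splitting step; the 64 MB constant and the file handling are assumptions for the sketch, and real splitting also has to respect record boundaries:

```python
SPLIT_SIZE = 64 * 1024 * 1024  # 64 MB, the upper end of the 16-64 MB range

def split_input(path: str, split_size: int = SPLIT_SIZE):
    # Cut one input file into fixed-size splits, recording (file, offset, length)
    # for each split so workers can read their piece independently.
    splits = []
    with open(path, "rb") as f:
        offset = 0
        while chunk := f.read(split_size):
            splits.append((path, offset, len(chunk)))
            offset += len(chunk)
    return splits
```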
- Task assignment: a master program starts multiple worker copies across the cluster. The master then distributes M Map tasks and R Reduce tasks among these workers.
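A sketch of the bookkeeping the master might keep for those M + R tasks, assuming the idle / in-progress / completed task states described in the paper; the class and method names are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Task:
    kind: str            # "map" or "reduce"
    index: int           # which of the M Map / R Reduce tasks this is
    state: str = "idle"  # idle -> in_progress -> completed
    worker: str = ""     # id of the worker currently assigned, if any

class Master:
    def __init__(self, M: int, R: int):
        self.tasks = ([Task("map", i) for i in range(M)] +
                      [Task("reduce", i) for i in range(R)])

    def assign(self, worker_id: str):
        # Hand the next idle task to an idle worker. Reduce tasks only become
        # runnable once every Map task has completed.
        maps_done = all(t.state == "completed"
                        for t in self.tasks if t.kind == "map")
        for t in self.tasks:
            if t.state == "idle" and (t.kind == "map" or maps_done):
                t.state, t.worker = "in_progress", worker_id
                return t
        return None
```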
- Map phase: each Map worker reads its input split, processes it with the user-defined Map function, and writes the resulting key-value pairs to intermediate files.
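A sketch of a single Map worker under the partitioning scheme the paper describes, hash(key) mod R, so that each intermediate bucket later feeds exactly one Reduce task; the file naming and JSON format are made up for the example:

```python
import json

def run_map_task(split_text: str, map_fn, task_id: int, R: int):
    # Apply the user-defined Map function and partition its output into R
    # buckets; a real system needs a hash that is stable across machines,
    # unlike Python's built-in hash() with randomization enabled.
    buckets = [[] for _ in range(R)]
    for key, value in map_fn(split_text):
        buckets[hash(key) % R].append((key, value))
    locations = []
    for r, pairs in enumerate(buckets):
        path = f"mr-{task_id}-{r}.json"  # intermediate file on local disk
        with open(path, "w") as f:
            json.dump(pairs, f)
        locations.append(path)
    return locations  # file locations are reported back to the master
```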
- Reduce phase 1: the master collects the locations of the intermediate files produced in the Map phase (step 3) and forwards them to the Reduce workers.
- Reduce phase 2: the Reduce worker reads the corresponding intermediate files and sorts them by key, so that all records with the same key end up next to each other. I think of this step as consolidation. Sorting a huge number of records on a single worker would be very expensive in both time and space, so the data should already be sorted before this point, and the consolidation is really just the merge stage of a merge sort (see the sketch after the example below). For example,
# input
[{k3: v1}, {k1: v2}, {k2: v1}, {k1: v3}, {k2: v2}, {k1: v1}]
# sorted by key
[{k1: v1}, {k1: v2}, {k1: v3}, {k2: v1}, {k2: v2}, {k3: v1}]
# values with the same key grouped together
[{k1: [v1, v2, v3]}, {k2: [v1, v2]}, {k3: [v1]}]
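Under the assumption that each intermediate run is already sorted by key, the consolidation above is just a k-way merge followed by grouping; a small Python sketch of that merge stage:

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_and_group(sorted_runs):
    # sorted_runs: iterables of (key, value) pairs, each already sorted by key
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    for key, pairs in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in pairs]

runs = [[("k1", "v1"), ("k2", "v1"), ("k3", "v1")],
        [("k1", "v2"), ("k2", "v2")],
        [("k1", "v3")]]
print(list(merge_and_group(runs)))
# [('k1', ['v1', 'v2', 'v3']), ('k2', ['v1', 'v2']), ('k3', ['v1'])]
```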
- Reduce phase 3: after step 5, the grouped content is passed, key by key, to the user-defined Reduce function. For example, the Reduce function might sum the list [v1, v2, v3] from step 5.
- Wrap-up: when all Map and Reduce workers have finished, the master wakes up the user program (usually the function that made the initial call), and that call returns.
The master is unique in the whole system, which means that if the master dies, the cluster is considered unavailable from the outside. Workers, on the other hand, are numerous, so it is acceptable for some of them to be unavailable. The master pings the workers periodically and keeps track of their state. When a worker dies, the master treats its in-progress work as not done and hands it to another worker; the benefit of this is that each worker's unit of work is atomic. Late in a MapReduce run there is usually still plenty of Reduce work left, and in that case the master can turn a worker that previously ran Map tasks into a Reduce worker.
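A rough sketch of that ping-and-reassign behavior, reusing the Master/Task sketch from the task-assignment step above; the timeout value and the "last seen" timestamps are assumptions for illustration:

```python
import time

PING_TIMEOUT = 10.0  # seconds without a reply before a worker is presumed dead

def check_workers(master, last_seen):
    # last_seen: worker_id -> timestamp of the last successful ping reply
    now = time.monotonic()
    for worker_id, seen_at in last_seen.items():
        if now - seen_at > PING_TIMEOUT:
            # Worker presumed dead: its unfinished tasks are rolled back as a
            # unit (each task is atomic) and become eligible for reassignment.
            # (In the real system, completed Map tasks on the dead machine are
            # also re-run, since their output lived on its local disk.)
            for task in master.tasks:
                if task.worker == worker_id and task.state == "in_progress":
                    task.state, task.worker = "idle", ""
```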