Master-slave architecture
- Master: JobTracker
- Slave: TaskTracker
MapReduce stages:
1. Split
- Data is uploaded to HDFS in blocks, and these blocks serve as the split input. WordCount, for example, splits the data by line and hands each line to a Map for processing.
2. Mapper: key-value pairs (objects)
- The split data is fed to the map as key-value pairs. In WordCount, the byte offset of the first character of each line is the key and the line itself is the value. The map output is also a key-value pair (see the sketch after this list).
3. Shuffle
- A) Partition (HashPartitioner by default: the partition number is the key's hashCode modulo the number of Reduce tasks; a custom partitioner can be supplied). Partitioning is fast, but data skew and Reduce load balancing must be considered.
- B) Sort: keys are sorted lexicographically by default; the order can be customized with a WritableComparator (compare).
- C) Merge/Combine: combine the current mapper's output locally, merging the values of records that share the same key (compare).
- D) Grouping: keys that compare as equal are grouped together, and their values form one collection (merge).
4. Reduce
- A) Input data: a key plus an iterator over its values
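A minimal Java sketch of the WordCount Mapper and Reducer described in steps 2 and 4, using the org.apache.hadoop.mapreduce API; the class names and the simple whitespace tokenization are illustrative choices, not the only way to write it.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Step 2 (Mapper): the framework feeds each line as <byte offset, line text>;
// the mapper emits <word, 1> for every word in the line.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // the map output is also a key-value pair
        }
    }
}

// Step 4 (Reduce): after the shuffle, each reduce() call receives one key plus
// an iterator over all values grouped under that key.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```

When the job reads text input with TextInputFormat, the framework supplies the byte-offset key and line value automatically, which is the key-value form described in step 2.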
1. The following figure shows the computing architecture of MapReduce
The map output is shuffled so that all values sharing the same key are brought together as the input to Reduce
2. Calculation process of WordCount
3. Mapper
The idea of MapReduce is divide and conquer.
- The Mapper is responsible for the “dividing” step, breaking a complex task into “simple tasks”.
“Simple task” means several things:
- The size of the data or the computation is much smaller than that of the original task
- Locality: the computation is assigned to the node that stores the data it needs, moving computation rather than moving data
- The small tasks are independent of one another and can be computed in parallel
4. Shuffler
- A step between the Mapper and the Reducer, consisting of four processes: partitioning, sorting, merging, and grouping
- The mapper output is thus partitioned and combined by key, and keys in a given range are delivered to a specific Reduce for processing (see the partitioner sketch after this list)
- This simplifies the Reduce process
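As a rough illustration of the partitioning step, the sketch below mirrors what the default HashPartitioner does (key hashCode modulo the number of Reduce tasks); the class name is made up for this example, and a real job would only need this if it wanted custom routing.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Each map output record is assigned a partition number; all records in the
// same partition are fetched by the same Reduce task.
public class WordPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Records that land in the same partition all go to the same Reduce task, which is how a range of keys ends up being handled by one specific Reduce.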
Shuffler’s detailed flow chart is shown below
- The Shuffler carries the map output to the Reduce input
- Shuffler on the map side: the map output is first buffered in memory. Because memory is limited, once the buffer reaches a certain threshold the data spills to disk. Before the spill, the data is partitioned (each map output key-value pair is assigned a partition number so that it is later processed by the corresponding Reduce task) and sorted (by default, keys are sorted lexicographically); it is then spilled and written to disk as a temporary file. When the map task finishes, the Shuffler merges all the temporary files.
- Shuffler on the reduce side: each Reduce fetches, from the merged on-disk files of every map, the data belonging to its own partition. The fetched data is buffered in memory first and may also spill to disk, so it has to be merged and sorted again; sorting therefore happens twice during the shuffle. After sorting, the records are grouped, and each group is handed to one reduce() call (see the driver sketch below for where these stages are configured). That is the end of the Shuffler.
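Each of the shuffle stages has a corresponding hook on the job driver. Below is a hedged sketch that wires the illustrative WordCount classes from above into a job; the commented-out comparator classes are hypothetical placeholders, and the input/output paths are taken from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);       // Map
        job.setPartitionerClass(WordPartitioner.class);  // A) Partition
        // B) Sort: map output keys use the key type's default comparator
        //    (lexicographic for Text) unless one is set explicitly:
        // job.setSortComparatorClass(MyComparator.class);        // hypothetical class
        job.setCombinerClass(WordCountReducer.class);    // C) Merge/combine on the map side
        // D) Grouping: decides which keys count as equal for one reduce() call:
        // job.setGroupingComparatorClass(MyGroupComparator.class); // hypothetical class
        job.setReducerClass(WordCountReducer.class);     // Reduce
        job.setNumReduceTasks(2);                        // number of partitions

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```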
5. Split size of MapReduce
- max.split (e.g. 100 MB)
- min.split (e.g. 10 MB)
- block (64 MB in Hadoop 1.x, 128 MB in Hadoop 2.x)
- These three determine the split size: splitSize = max(min.split, min(block, max.split)). By this formula, a split is either exactly one block or a fragment of one block; a split can never consist of multiple blocks (see the sketch after this list):
- If two blocks are not on the same machine, the data would have to be copied across the network, which goes against the MapReduce philosophy: move the computation, not the data.
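The formula above can be written as a one-line computation. This small standalone sketch simply evaluates max(min.split, min(block, max.split)) with the example sizes from the list; the class and method names are illustrative.

```java
public class SplitSize {
    // splitSize = max(min.split, min(block, max.split))
    static long computeSplitSize(long blockSize, long minSplit, long maxSplit) {
        return Math.max(minSplit, Math.min(maxSplit, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Example values from the list above: block = 128 MB (Hadoop 2.x default),
        // min.split = 10 MB, max.split = 100 MB.
        System.out.println(computeSplitSize(128 * mb, 10 * mb, 100 * mb) / mb);
        // Prints 100: the 128 MB block is split into fragments no larger than 100 MB.
    }
}
```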
Differences between MapReduce in Hadoop 1.x and in Hadoop 2.x:
Hadoop 1.x uses the master/slave architecture of JobTracker and TaskTracker. Hadoop 2.x runs on YARN and uses the master/slave architecture of ResourceManager and NodeManager.