The MapReduce process

  • The Map operation
  1. Map: when a MapTask runs, its input comes from HDFS blocks. For example, if a directory contains three files of 5 MB, 10 MB, and 150 MB and the block size is 128 MB, four MapTasks are created to process 5 MB, 10 MB, 128 MB, and 22 MB of data respectively (the split arithmetic is sketched after this list).
  2. Partition: after the Mapper runs, MapReduce's Partitioner interface decides which ReduceTask each output key-value pair should go to, based on the key (or value) and the number of reducers. By default the key is hashed and taken modulo the number of ReduceTasks; this default simply spreads keys evenly across reducers to avoid data skew (a custom Partitioner sketch follows this list). The partitioned data is then written to an in-memory buffer, which collects Map results in batches to reduce the impact of disk I/O.
  3. Spill: this important step is performed by the Spill thread, which starts working as soon as it receives the “command” from the Map task; the operation is called SortAndSpill. Map output is handled by the Collector: each Map task continuously writes key-value pairs into a circular in-memory structure called Kvbuffer, whose default size is 100 MB and which is adjusted with the mapreduce.task.io.sort.mb parameter (default: 100). Tune it to the hardware, especially the amount of memory: a larger buffer means fewer spills to disk, and when memory is plentiful this improves performance significantly. Spilling usually starts when the buffer is 80% full (because other threads may still be writing into the buffer while it spills), a threshold controlled by mapreduce.map.sort.spill.percent (default: 0.80). See the Map-side configuration sketch after this list.
  4. Combine (optional): when a Combiner is defined, Map results are combined according to the function it specifies. When does the Combiner run? It runs in the same JVM as the Map task and is governed by min.num.spills.for.combine (default: 3): by default, the combine is applied when at least three spill files exist, which reduces the amount of data ultimately written to disk. A Combiner sketch follows this list.
  5. Merge: a Map task produces many spill files while it runs, and these spill files are merged before the Map task ends; this step is called Merge. mapreduce.task.io.sort.factor (default: 10) is the maximum number of files merged in one pass. With 100 spill files the merge cannot finish in a single pass; in that case, increase mapreduce.task.io.sort.factor to reduce the number of merge passes and, with them, the disk operations.
  6. Compress (optional): compressing the spill and merge files reduces disk I/O and network I/O. Compression is very useful when the intermediate results are large and I/O is the bottleneck; it is enabled with mapreduce.map.output.compress (default: false). When set to true, data is compressed before being written to disk and decompressed when read back. In practice, the bottleneck of Hive running on Hadoop is usually I/O rather than CPU. Candidate codecs include LZO, BZip2, and LZMA; LZO is a fairly balanced choice and is selected with mapreduce.map.output.compress.codec (default: org.apache.hadoop.io.compress.DefaultCodec). Because compression costs CPU, it suits jobs with a heavy I/O bottleneck. The configuration sketch after this list shows these settings.
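
  As a rough sketch of the split arithmetic in step 1: FileInputFormat splits each file independently, using splitSize = max(minSize, min(maxSize, blockSize)). The class and helper below are illustrative simplifications, not the exact Hadoop internals.

  ```java
  // Sketch of how input files map to MapTasks (step 1), assuming the
  // simplified rule splitSize = max(minSize, min(maxSize, blockSize)).
  public class SplitMathSketch {
      static long computeSplitSize(long blockSize, long minSize, long maxSize) {
          return Math.max(minSize, Math.min(maxSize, blockSize));
      }

      public static void main(String[] args) {
          long blockSize = 128L * 1024 * 1024;                      // 128 MB HDFS block
          long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
          long[] fileSizes = {5L * 1024 * 1024, 10L * 1024 * 1024, 150L * 1024 * 1024};

          int mapTasks = 0;
          for (long size : fileSizes) {
              // Each file is split on its own; the 150 MB file yields 128 MB + 22 MB.
              mapTasks += (int) Math.ceil((double) size / splitSize);
          }
          System.out.println("MapTasks: " + mapTasks);              // prints 4
      }
  }
  ```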
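
  A minimal custom Partitioner for step 2, assuming Text keys and IntWritable values; it reproduces the default hash-and-modulo behaviour, and the class name is illustrative.

  ```java
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  // Mirrors the default logic from step 2: hash the key,
  // then take it modulo the number of ReduceTasks.
  public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
          // Mask off the sign bit so the result is never negative.
          return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
  }
  ```

  Register it in the driver with job.setPartitionerClass(KeyHashPartitioner.class); without a custom class, Hadoop's built-in HashPartitioner applies the same logic.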
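
  A typical Combiner for step 4 is simply a Reducer applied on the Map side. The summing class below is a sketch assuming a word-count-style job with Text keys and IntWritable counts (the names are illustrative).

  ```java
  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Sums the values for each key on the Map side so that fewer
  // key-value pairs are spilled to disk and shipped to the reducers.
  public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
              sum += v.get();
          }
          result.set(sum);
          context.write(key, result);
      }
  }
  ```

  Enable it with job.setCombinerClass(SumCombiner.class); this is only safe when the operation is associative and commutative, such as summing counts.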
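
  The spill, merge, and compression parameters from steps 3, 5, and 6 can all be set on the job Configuration. A minimal sketch, with illustrative values rather than recommendations:

  ```java
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  // Illustrative Map-side shuffle tuning (values are examples only).
  public class MapShuffleTuning {
      public static Job buildJob() throws Exception {
          Configuration conf = new Configuration();

          // Step 3: size of the circular Kvbuffer and the spill threshold.
          conf.setInt("mapreduce.task.io.sort.mb", 200);            // default 100
          conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // default 0.80

          // Step 4: run the combiner during merge once at least 3 spill files exist.
          conf.setInt("min.num.spills.for.combine", 3);             // default 3

          // Step 5: how many spill files are merged per pass.
          conf.setInt("mapreduce.task.io.sort.factor", 50);         // default 10

          // Step 6: compress intermediate map output.
          conf.setBoolean("mapreduce.map.output.compress", true);   // default false
          conf.set("mapreduce.map.output.compress.codec",
                   "org.apache.hadoop.io.compress.DefaultCodec");   // LZO needs hadoop-lzo

          return Job.getInstance(conf, "shuffle-tuning-sketch");
      }
  }
  ```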
  • The Reduce operation
  1. Copy: the Reduce side simply pulls data. The Reduce task starts several copy threads (Fetchers) that request each MapTask's output file over HTTP from the NodeManager where that MapTask ran. Because the MapTasks finished long ago, those files are managed by the NodeManager.
  2. Merge: this works like the Map-side merge, except that the buffer now holds data copied from different Map tasks. Copied data is first stored in an in-memory buffer; when the amount of data in memory reaches a threshold, a merge is triggered. As on the Map side, this is again a spill process: many spill files are written to disk and then merged.
  3. The Reducer's output file: continuous merging eventually produces a “final file”, which may reside on disk or in memory; by default it is written to disk. Once the Reducer's input file has been determined, the Shuffle process ends. The Reduce-side configuration sketch below shows the relevant parameters.
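
  The Reduce-side copy and merge behaviour is configurable as well. A minimal sketch using standard Hadoop properties, with illustrative values:

  ```java
  import org.apache.hadoop.conf.Configuration;

  // Illustrative Reduce-side shuffle settings (values are examples only).
  public class ReduceShuffleTuning {
      public static Configuration tune(Configuration conf) {
          // Step 1 (Copy): number of parallel Fetcher threads pulling map output.
          conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);      // default 5

          // Step 2 (Merge): fill fraction of the copy buffer that triggers an in-memory merge.
          conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);  // default 0.66

          // Step 3 (Output): fraction of heap allowed to retain map outputs during reduce;
          // the default of 0.0 is why the final merged file normally lives on disk.
          conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.0f);    // default 0.0

          return conf;
      }
  }
  ```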