
This article introduces MapReduce, an important computing framework in Hadoop.

The complete MapReduce framework consists of two parts:

  1. The algorithm logic level, i.e. the three key components map, shuffle, and reduce, which this article introduces;
  2. The actual execution level, i.e. in what form and through what process the algorithmic logic runs as jobs on distributed hosts. Since MapReduce version 2, jobs are submitted to YARN, so this part is not covered in this article.

Other articles in the series are:

  • Hadoop Learning Series (1) Hadoop pseudo-distributed Environment setup
  • Hadoop learning series (2) HDFS detailed analysis
  • Hadoop Learning Series (3) YARN detailed analysis

First, What is MapReduce?

MapReduce is a Java-based parallel, distributed computing framework. Data processing applications written with MapReduce can run on large clusters of commodity hardware to process large data sets in parallel. The data can be unstructured (stored in a file system) or structured (stored in a database). MapReduce exploits data locality, processing data near where it is stored to minimize communication overhead.

The MapReduce framework organizes distributed servers, runs the various tasks in parallel, and manages all communication and data transfer between the different parts of the system. It automatically parallelizes computation, partitions the data and the computing tasks, assigns and executes tasks on cluster nodes, and collects the results. Low-level details of parallel computing, such as distributed data storage, data communication, and fault tolerance, are handled by the framework itself, greatly reducing the burden on developers.

MapReduce is also a parallel programming model and methodology. Drawing on the design ideas of the functional programming language Lisp, it provides a simple parallel programming method that abstracts the complex parallel computation running on a large-scale cluster into two functions, Map and Reduce, which implement the basic parallel computing tasks. It offers abstract operations and a parallel programming interface that make large-scale data programming and computation convenient.
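
To make this functional-programming lineage concrete, here is a tiny sketch in plain Java (using the standard streams API, not Hadoop) of a computation expressed as a map step followed by a reduce step. The word list and the length-summing logic are illustrative assumptions, not part of MapReduce itself:

import java.util.Arrays;
import java.util.List;

public class FunctionalSketch {
  public static void main(String[] args) {
    List<String> words = Arrays.asList("hadoop", "mapreduce", "hdfs");

    // "Map": transform every element independently (word -> its length).
    // "Reduce": combine the mapped values into one result (the total length).
    int totalLength = words.stream()
        .map(String::length)       // map step: String -> Integer
        .reduce(0, Integer::sum);  // reduce step: fold the values together

    System.out.println("Total length: " + totalLength); // prints 19
  }
}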

Second, The Algorithm

The MapReduce framework typically consists of three operations (or steps):

  1. Map: Each worker node applies the map function to its local data and writes the output to temporary storage. A master node ensures that only one copy of redundant input data is processed.
  2. Shuffle: Worker nodes redistribute data according to the output keys (produced by the map function), sorting, grouping, and copying the mapped data so that all data belonging to the same key ends up on the same worker node.
  3. Reduce: Worker nodes now process each group of output data, one group per key, in parallel.

MapReduce flowchart:

MapReduce allows Map operations to be run in a distributed manner. Each Map operation can be executed in parallel as long as it is independent of other Map operations.

Another, more detailed, interpretation of MapReduce as a five-step process is (a plain-Java walk-through of these steps follows the list):

  1. Prepare the Map() input: the MapReduce framework first designates the Map processors, assigns each one the input key K1 it will work on, and provides that processor with all the input data associated with that key;
  2. Run the user-provided Map() code: Map() is run exactly once for each K1 key-value pair, generating output organized as key-value pairs keyed by K2;
  3. Shuffle the Map output to the Reduce processors: the K2 key-value pairs produced in the previous step are redistributed so that pairs with the same key end up on the same worker node;
  4. Run the user-provided Reduce() code: Reduce() is run exactly once for each group of K2 key-value pairs on each worker node;
  5. Produce the final output: the MapReduce framework collects all the Reduce output and sorts it by K2 to produce the final result.
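
To see these five steps end to end, here is a toy, single-process walk-through in plain Java (no Hadoop classes). The input lines and the word-count logic are illustrative assumptions; in a real job the same flow is distributed across worker nodes by the framework:

import java.util.*;

public class FiveStepSketch {
  public static void main(String[] args) {
    // Step 1: prepare the Map() input as (K1, V1) pairs, here (line number, line text).
    Map<Integer, String> input = new LinkedHashMap<>();
    input.put(1, "hello world");
    input.put(2, "hello mapreduce");

    // Step 2: run Map() once per (K1, V1) pair, emitting (K2, V2) = (word, 1).
    List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
    for (String line : input.values()) {
      for (String word : line.split("\\s+")) {
        mapOutput.add(new AbstractMap.SimpleEntry<>(word, 1));
      }
    }

    // Step 3: shuffle - group all values that share the same K2.
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> kv : mapOutput) {
      grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
    }

    // Step 4: run Reduce() once per K2 group, summing the values.
    // Step 5: collect the output sorted by K2 (the TreeMap keeps keys in order).
    for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
      int sum = 0;
      for (int v : group.getValue()) {
        sum += v;
      }
      System.out.println(group.getKey() + "\t" + sum);
    }
    // Expected output: hello 2, mapreduce 1, world 1
  }
}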

In a real production environment, data may be scattered across many servers. The traditional approach to big-data processing sends the data to wherever the code runs, which is inefficient and consumes a lot of bandwidth. To cope with this, the MapReduce framework instead sends the Map() or Reduce() operation to the server where the data resides, letting "moving computation" replace "moving data" and speeding up the whole framework. Most of the computation happens on nodes whose local disks hold the data, which reduces network traffic.

Mapper

A Map function performs a specified operation on each element of a conceptual list of independent elements. Each element is processed independently, and the original list is not modified because a new list is created to hold the results. In other words, Map operations can be highly parallelized.

The Map and Reduce functions of the MapReduce framework are defined over data structured as (key, value) pairs. Map takes a key-value pair in one data domain and returns a list of key-value pairs:

Map(k1, v1) → list(k2, v2)

The Map function is called in parallel for every key-value pair (keyed by k1) in the input dataset, and each call returns a list of pairs keyed by k2. The MapReduce framework then collects all key-value pairs with the same key (here, k2) from all the lists and groups them together, creating one group per key.
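
As a concrete illustration of Map(k1, v1) → list(k2, v2), here is a minimal Hadoop Mapper skeleton. It assumes the default text input format, where k1 is the byte offset of a line (LongWritable) and v1 is the line itself (Text); the whitespace tokenization is a simplified assumption (the full WordCount job in the next section uses StringTokenizer):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// (k1, v1) = (byte offset, line of text), (k2, v2) = (word, 1)
public class LineToWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Called once per input (k1, v1) pair; each call may emit many (k2, v2) pairs.
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE); // emit (k2, v2)
      }
    }
  }
}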

Reducer

Reduce is an appropriate merging of the elements of a list. Although it is not as parallelizable as the Map function, Reduce is still useful in highly parallel environments, because the merging always produces a simple answer and the large-scale operations are relatively independent. The Reduce function is applied in parallel to each group, producing a collection of values in the same data domain:

Reduce(k2, list(v2)) → list(v3)

The Reduce side receives ordered groups of data from the various Map tasks. Reduce() then performs the corresponding operation according to the logic written by the programmer, for example counting or summing the values of the same key. If the amount of data received on the Reduce side is small, it is kept in memory; once it exceeds a certain proportion of the buffer size, it is merged and spilled to disk.
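
The corresponding Reducer skeleton for Reduce(k2, list(v2)) → list(v3) might look like the sketch below. The summing logic is an assumption that matches the word-count example; other jobs would combine the grouped values differently:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives (k2, list(v2)) = (word, all the 1s emitted for that word)
// and produces list(v3) = (word, total count).
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Called once per k2 group; the framework has already shuffled and grouped the values.
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    result.set(sum);
    context.write(key, result); // emit (k2, v3)
  }
}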

Partitioner

As mentioned earlier, the Map phase has an operation that splits the data into groups. This process is called Partition, and the Java class that divides the data is called the Partitioner.

The Partitioner component lets the Map side partition its output keys, so that keys in different partitions are handed to different Reduce tasks; the number of partitions therefore equals the number of Reducers. Each partition corresponds to one Reduce task and can be regarded as that Reducer's input slice. Partitioning can be controlled programmatically according to the actual business logic, to improve Reduce efficiency or to balance the load. The default partitioner in MapReduce is HashPartitioner.
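
As a hedged sketch of how partitioning can be customized, the class below routes keys whose first letter falls in a-m to the first half of the reduce tasks and everything else to the rest. The routing rule is purely illustrative; the default HashPartitioner simply computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions <= 1) {
      return 0;
    }
    String k = key.toString();
    char first = k.isEmpty() ? 'z' : Character.toLowerCase(k.charAt(0));
    int half = numPartitions / 2;
    int hash = key.hashCode() & Integer.MAX_VALUE; // non-negative hash
    // Keys a-m go to the first half of the reduce tasks, everything else to the rest.
    return (first >= 'a' && first <= 'm') ? hash % half : half + hash % (numPartitions - half);
  }
}

Such a partitioner would be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class), together with job.setNumReduceTasks(...), so that the number of reduce tasks matches the intended partitions.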

Having many splits is generally good, because the time to process a single split is short compared with the time to process the whole input, and smaller splits make load balancing easier. However, splits should not be too small: if they are, the overhead of managing the splits and creating tasks starts to dominate the total execution time.

The following is a schematic diagram of map and Reduce tasks:

Third, WordCount Example

Here is the Java code for a word-frequency counting example:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Inherit Mapper class to implement its own map function
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // The map function must be implemented
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Inherit the Reducer class to implement the reduce function
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // Initialize the Configuration and read the Configuration information of the MapReduce system
    Configuration conf = new Configuration();

    // Build the Job and load the calculator wordcount.class
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);

    // Set the Mapper, Combiner, and Reducer, i.e. the classes implemented above
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);

    // Set the input and output data
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A Combiner class is specified along with the Mapper and Reducer. The Combiner is a localized reduce operation (which is why this example registers IntSumReducer as the Combiner). It runs right after the Map operation, on the same host as the Map task, and merges records with duplicate keys before the Map output is written to intermediate files, keeping those files small. This reduces the amount of data transferred during the Shuffle phase and improves network efficiency.
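
The effect of the Combiner can be illustrated with a small plain-Java sketch (not the Hadoop API): it pre-aggregates records that share a key on the map side, so fewer (key, value) pairs have to travel across the network during the Shuffle. The sample records are illustrative only:

import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerEffectSketch {
  public static void main(String[] args) {
    // Raw map output on one node before the combiner runs.
    List<Map.Entry<String, Integer>> mapOutput = Arrays.asList(
        new AbstractMap.SimpleEntry<>("hello", 1),
        new AbstractMap.SimpleEntry<>("hello", 1),
        new AbstractMap.SimpleEntry<>("world", 1));

    // Local reduce (the combiner): the same summing logic as the reducer, applied per node.
    Map<String, Integer> combined = new TreeMap<>();
    for (Map.Entry<String, Integer> kv : mapOutput) {
      combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
    }

    // 3 records in, 2 records out -> less data to shuffle: {hello=2, world=1}
    System.out.println(combined);
  }
}

Note that reusing the Reducer as the Combiner is only safe when the reduce logic is commutative and associative, as summation is; an operation such as averaging would give incorrect results.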

The command to submit the MR job:

hadoop jar {program jar package} {main class} {input path} {output path}

For example:

hadoop jar hadoop-mapreduce-wordcount.jar WordCount /sample/input /sample/output

Schematic diagram of the above code:

Intermediate results in the Map -> Shuffle -> Reduce pipeline are stored on local disk, while the final output is written to the configured output path (typically on HDFS).

Fourth, Advantages & Limitations of MapReduce

MapReduce has two major advantages:

1) Parallel processing:

In MapReduce, the job is divided among multiple nodes, and each node works on its part of the job simultaneously. MapReduce is thus based on the divide-and-conquer paradigm, which lets us process data on different machines. Because the data is processed in parallel by many machines rather than a single machine, the time required to process it is greatly reduced.

2) Data locality:

Instead of moving the data to the computation, the MapReduce framework moves the computation to the data. The data is distributed among multiple nodes, and each node processes the portion of the data that resides on it.

This gives the following advantages:

  • Moving the processing unit to where the data lives reduces network cost;
  • Since every node processes its part of the data in parallel, the total processing time is shortened;
  • Each node handles only a portion of the data, so no single node risks being overloaded.

However, MapReduce has its limitations:

  1. It is only suitable for offline batch computation, not for streaming or real-time computation.
  2. Intermediate results are stored on disk, which increases the disk I/O load and makes reads slow.
  3. Development is cumbersome; for example, a wordcount job requires a lot of setup and code, whereas the same job in Spark is very simple.