Table of Contents

  • 1. Overview of MapReduce
    • 1.1 What MapReduce is
    • 1.2 Advantages and disadvantages of MapReduce
      • 1.2.1 Advantages
      • 1.2.2 Disadvantages
    • 1.3 Core Ideas of MapReduce
    • 1.4 MapReduce Processes
    • 1.5 Official WordCount source code
    • 1.6 Common Data serialization types
    • 1.7 MapReduce programming specifications
      • 1. The Mapper stage
      • 2. The Reducer stage
      • 3. The Driver stage

1. Overview of MapReduce

1.1 What MapReduce is

  • MapReduce is a programming framework for distributed computing programs, and the core framework for developing Hadoop-based data analysis applications.
  • Its core function is to combine the user's business logic code with its own built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.

1.2 Advantages and disadvantages of MapReduce

1.2.1 Advantages

1) MapReduce is easy to program

  • By simply implementing a few interfaces, you can complete a distributed program, and that program can be distributed across a large number of cheap PC machines to run. In other words, writing a distributed program feels just like writing a simple serial program. This property is what made MapReduce programming so popular.

2) Good scalability

  • When your computing resources are insufficient, you can simply add more machines to expand computing power.

3) High fault tolerance

  • MapReduce is designed to be deployed on inexpensive PCs, which requires high fault tolerance. For example, if one machine goes down, its computing tasks can be transferred to another node so that the job does not fail. This process requires no human intervention; it is handled entirely by Hadoop internally.

4) Suitable for offline processing of massive data at the PB level and above

  • It can put clusters of thousands of servers to work concurrently, providing very large data processing capacity.

1.2.2 Disadvantages

1) Not good at real-time computing

  • MapReduce cannot return results in milliseconds or seconds like MySQL does.

2) Not good at stream computing

  • The input data of stream computing is dynamic, whereas the input data set of MapReduce is static and cannot change dynamically. This is because MapReduce itself is designed for static data.

3) Not good at DAG (directed acyclic graph) computing

  • In a DAG, multiple applications have dependencies: the input of one application is the output of another. It is not that MapReduce cannot run such workloads, but that the output of every MapReduce job is written to disk, causing a large amount of disk I/O and poor performance.

1.3 Core Ideas of MapReduce





(1) A distributed computing program often needs to be divided into at least two phases.

(2) The concurrent MapTask instances in the first phase run fully in parallel and are independent of each other.

(3) The concurrent ReduceTask instances in the second phase are also independent of each other, but their data depends on the output of all MapTask instances from the previous phase.

(4) The MapReduce programming model can contain only one Map phase and one Reduce phase. If the user's business logic is very complex, the only option is to run multiple MapReduce programs in series.

Summary: analyze the data flow of WordCount to deeply understand the core idea of MapReduce.
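To make the two-phase data flow concrete, here is a minimal plain-Java sketch (not the Hadoop API) of the WordCount flow: independent map tasks emit (word, 1) pairs from their splits, the pairs are grouped by key, and the reduce phase sums each group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountFlow {
    // Map phase: each "MapTask" independently turns its split into (word, 1) pairs.
    static List<String[]> mapTask(String split) {
        List<String[]> pairs = new ArrayList<>();
        for (String word : split.split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new String[]{word, "1"});
        }
        return pairs;
    }

    // Shuffle: group all map outputs by key; Reduce phase: sum each group.
    static Map<String, Integer> run(List<String> splits) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String split : splits) {                  // MapTasks are independent of each other
            for (String[] kv : mapTask(split)) {
                grouped.computeIfAbsent(kv[0], k -> new ArrayList<>())
                       .add(Integer.parseInt(kv[1]));
            }
        }
        Map<String, Integer> result = new TreeMap<>(); // one reduce call per key group
        grouped.forEach((word, ones) ->
                result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hello hadoop", "hello mapreduce")));
        // {hadoop=1, hello=2, mapreduce=1}
    }
}
```

Note how the reduce step depends on the output of all map tasks, mirroring point (3) above.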

1.4 MapReduce Processes

  • A complete MapReduce program running in distributed mode has three types of instance processes:

    (1) MrAppMaster: responsible for process scheduling and state coordination of the whole program.

    (2) MapTask: responsible for the entire data processing flow of the Map phase.

    (3) ReduceTask: responsible for the entire data processing flow of the Reduce phase.

1.5 Official WordCount source code

  • Decompiling the source code with a decompiler shows that the WordCount example consists of a Map class, a Reduce class, and a driver class, and that the data types are serialization types encapsulated by Hadoop itself.

1.6 Common Data serialization types

  • Except for Text (which corresponds to Java's String), each Hadoop serialization type is the corresponding Java type name with the suffix Writable appended (for example, long becomes LongWritable).
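The naming rule can be summarized in code. The table below lists the common Java-type-to-Writable mapping as a plain-Java map (the Writable classes themselves live in org.apache.hadoop.io):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WritableTypes {
    // Common Java type -> Hadoop serialization type.
    // Rule: append "Writable" to the type name; String is the exception (-> Text).
    static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put("boolean", "BooleanWritable");
        MAPPING.put("byte",    "ByteWritable");
        MAPPING.put("int",     "IntWritable");
        MAPPING.put("float",   "FloatWritable");
        MAPPING.put("long",    "LongWritable");
        MAPPING.put("double",  "DoubleWritable");
        MAPPING.put("String",  "Text");          // the one exception to the rule
        MAPPING.put("map",     "MapWritable");
        MAPPING.put("array",   "ArrayWritable");
        MAPPING.put("null",    "NullWritable");
    }

    public static void main(String[] args) {
        MAPPING.forEach((java, hadoop) -> System.out.println(java + " -> " + hadoop));
    }
}
```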

1.7 MapReduce programming specifications

  • The program compiled by the user is divided into three parts: Mapper, Reducer and Driver.

1. The Mapper stage

(1) A user-defined Mapper must inherit its parent class.

(2) The input data of the Mapper is a KV pair (K: the offset of the line; V: the content of the line).

(3) The business logic of the Mapper is written in the map() method.

(4) The output data of the Mapper is also in KV-pair form (the KV types can be customized).

(5) The map() method (in the MapTask process) is called once for each input KV pair.
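As a sketch of rules (2), (4), and (5): for each input pair (offset, line), the map logic emits one (word, 1) pair per token. This is plain Java illustrating the contract, not the Hadoop Mapper API.

```java
import java.util.ArrayList;
import java.util.List;

public class MapStage {
    // One map() call per input KV pair: K = byte offset of the line, V = line text.
    // Output: a (word, 1) KV pair for every token in the line.
    static List<String[]> map(long offset, String line) {
        List<String[]> out = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) out.add(new String[]{token, "1"});
        }
        return out;
    }

    public static void main(String[] args) {
        // The input pair (0, "hello world hello") produces three output pairs.
        for (String[] kv : map(0L, "hello world hello")) {
            System.out.println("(" + kv[0] + ", " + kv[1] + ")");
        }
    }
}
```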

2. The Reducer stage

(1) A user-defined Reducer must inherit its parent class.

(2) The input data type of the Reducer corresponds to the output data type of the Mapper, and is also a KV pair.

(3) The business logic of the Reducer is written in the reduce() method.

(4) The ReduceTask process calls the reduce() method once for each group of KV pairs with the same K.
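Correspondingly, a plain-Java sketch of rule (4): reduce() is invoked once per key group, receiving the key and all values that share it (again an illustration of the contract, not the Hadoop Reducer API).

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceStage {
    // One reduce() call per group of values sharing the same key K.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;   // WordCount: total occurrences of the word
        return sum;
    }

    public static void main(String[] args) {
        // Shuffled map output, already grouped by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("hello", List.of(1, 1));
        grouped.put("world", List.of(1));
        grouped.forEach((k, vs) -> System.out.println(k + "\t" + reduce(k, vs)));
    }
}
```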

3. The Driver stage

  • Equivalent to a YARN cluster client, used to submit the entire program to the YARN cluster; it encapsulates the run parameters of the MapReduce program into a Job object.