About the author: Fan Donglai, co-founder of Panshan Technology. This article is excerpted from “Spark-on-Learning-as-you-Go Talk 44”
Hello, I’m Fan Donglai. Today we are going to talk about a fairly basic but important topic: MapReduce. MapReduce counts as fundamental because it was introduced a long time ago and is no longer new.
01 Google’s troika
US News divides computer science into four areas: artificial intelligence, programming languages, systems, and theory. The systems field has two top conferences: OSDI (USENIX Symposium on Operating Systems Design and Implementation) and SOSP (ACM Symposium on Operating Systems Principles). Both carry great weight in the industry. If the important papers from these two conferences over recent decades were collected into one book, it could serve as a textbook on operating systems and distributed systems.
From 2003 to 2006, Google published three papers at OSDI and SOSP that sparked wide discussion of distributed systems in the industry. The three papers are:
- SOSP 2003: The Google File System;
- OSDI 2004: MapReduce: Simplified Data Processing on Large Clusters;
- OSDI 2006: Bigtable: A Distributed Storage System for Structured Data.
In 2006, Schmidt, then CEO of Google, put forward the term cloud computing. These three papers are also known as Google’s troika: they represent the cornerstone of Google’s big data processing and the foundation of cloud computing. It is worth noting, however, that although Google, as an industry leader, often open-sources its own technology, what it releases is, objectively speaking, not the latest technology it uses internally and may even lag by a generation, which also reflects Google’s technical strength.
The first paper focuses on a distributed file system, the second on a distributed computing framework, and the third on distributed data storage. Together they lifted the veil on distributed systems and made important contributions to big data processing technology. Building on the theory in these three papers and a series of later ones, and on the strong engineering ability of the open source community, Hadoop, HBase, Spark, and others soon took the stage, and big data technology began to flourish.
Click here to check out The Education column “Learning and Using Spark in Action 44”
02 MapReduce programming model and MapReduce computing framework
In the second paper, Google explicitly states that MapReduce is a distributed computing framework it implemented, and that the programming model behind it is also called MapReduce. Based on this paper, the open source community implemented its own distributed computing framework, likewise named MapReduce. Some books and online sources do not distinguish between these meanings when they mention MapReduce, which leads to confusion.
Google has named a computing framework directly after its programming model in other cases too, such as Google Dataflow. MapReduce therefore has two meanings. The first is the computing framework: generally, when people talk about the MapReduce computing framework, they mean the open source community’s implementation. However, with the rise of a new generation of computing frameworks such as Spark and Flink, the open source MapReduce framework is used less and less in production and is gradually fading from the stage.
The second meaning of MapReduce is a programming model. It derives from the long-established ideas of functional programming, which were implemented in older languages such as Lisp, and it has been given new life in distributed computing as single-core CPU performance and the number of cores have grown rapidly.
The MapReduce model abstracts data processing into two operations, Map and Reduce. Map means mapping: as the name implies, it maps data one-to-one and usually performs data conversion, as shown in the following figure:
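As a minimal sketch of the Map idea in Python (the sample names and the `to_upper` function are hypothetical, chosen only for illustration), the built-in `map` applies a user-defined function to every element and produces exactly one output per input:

```python
# Hypothetical input collection; any iterable would work the same way.
names = ["alice", "bob", "carol"]


def to_upper(name: str) -> str:
    """User-defined function: converts a single element."""
    return name.upper()


# map applies to_upper to each element: one output for each input.
mapped = list(map(to_upper, names))
print(mapped)  # ['ALICE', 'BOB', 'CAROL']
```

The output collection has the same number of elements as the input, which is the one-to-one property described above.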
Reduce, known as reduction, represents another type of mapping that typically does the work of aggregation, as shown below:
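A comparable sketch for Reduce, assuming Python's standard `functools.reduce` (the numbers and the `add` function are hypothetical): the user-defined function folds the whole collection into a single aggregated value.

```python
from functools import reduce

# Hypothetical input collection.
numbers = [3, 1, 4, 1, 5]


def add(acc: int, x: int) -> int:
    """User-defined function: merges one element into the running result."""
    return acc + x


# reduce folds the collection from left to right, starting from 0.
total = reduce(add, numbers, 0)
print(total)  # 14
```

Unlike Map, the result here is one value for the entire collection rather than one value per element.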
The rounded box can be regarded as a collection, each box inside it as one piece of data to be processed, and each arrow as the mapping, that is, the custom function to be executed. With the MapReduce programming model, we can:
- Abstract the input data set into a collection;
- Express the data processing flow with Map and Reduce;
- Implement our own logic in custom functions.
Together, these steps describe the whole processing flow (mapping) from the input data to the result data.
03 Concurrency and parallelism
Generally speaking, the simpler the building blocks at the bottom, the richer the variations that can be built on top of them. In the MapReduce programming model, combining Map and Reduce with user-defined functions is very expressive for business logic. Here is an example of grouped aggregation, as shown below:
On the Map side, the user-defined function and the map operator convert the original data (names) into group labels (gender). On the Reduce side, the user-defined function and the reduce operator aggregate the data by label.
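The grouped aggregation described above can be sketched in Python as follows. The `people` records, the `to_label` and `merge` functions, and the choice of counting as the aggregation are all assumptions made for illustration, since the original figure is not available here.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical records of the form (name, gender).
people = [
    ("Alice", "F"),
    ("Bob", "M"),
    ("Carol", "F"),
    ("Dave", "M"),
    ("Eve", "F"),
]


def to_label(person):
    """Map side: replace each record with its group label and a count of 1."""
    _name, gender = person
    return (gender, 1)


def merge(counts, pair):
    """Reduce side: add the pair's count to the running total for its label."""
    label, n = pair
    counts[label] += n
    return counts


labeled = map(to_label, people)
counts = dict(reduce(merge, labeled, defaultdict(int)))
print(counts)  # {'F': 3, 'M': 2}
```

Swapping `merge` for a function that sums ages or collects names would give a different aggregation without changing the Map and Reduce structure.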
In the MapReduce view, any data processing flow, however complex, is nothing more than a combination of these two kinds of mapping, such as Map + Map + Reduce, or Reduce followed by Map. A relatively complex combination is shown in the figure below:
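A small sketch of one such combination, Map + Map + Reduce, using Python's built-in `map` and `functools.reduce`. The input strings and the three user-defined functions are hypothetical and do not reproduce the pipeline in the original figure.

```python
from functools import reduce

# Hypothetical raw input: numbers stored as strings.
raw = ["3", "4", "5"]


def parse(text: str) -> int:
    """First Map: convert each string to an integer."""
    return int(text)


def square(x: int) -> int:
    """Second Map: transform each integer."""
    return x * x


def add(acc: int, x: int) -> int:
    """Reduce: aggregate the transformed values into one result."""
    return acc + x


# Map + Map + Reduce: each stage consumes the output of the previous one.
total = reduce(add, map(square, map(parse, raw)), 0)
print(total)  # 50
```

Because each stage only depends on the collection produced by the previous one, stages can be added, removed, or reordered without touching the others.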
Many languages that support functional programming provide Map and Reduce operators for their own collection data structures. It is easy to picture the first rounded box as a collection of a few dozen items held in a variable in memory; performing the transformations above on it is not difficult for a computer. Even with a larger amount of data, we can assign different boxes and their computations to different CPU cores of the same machine. This is what we call parallelism and concurrency.
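As a rough sketch of spreading the work across CPU cores of one machine, the example below splits a collection into chunks and hands each chunk to a worker from Python's standard `concurrent.futures` module. The function names, the sum-of-squares workload, and the chunking strategy are assumptions for illustration, not something prescribed by the MapReduce paper.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import reduce


def split(data, n_chunks):
    """Split data into at most n_chunks contiguous, non-empty chunks."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """Work done by one worker: Map (square) then Reduce (sum) on one chunk."""
    return reduce(lambda acc, x: acc + x, map(lambda x: x * x, chunk), 0)


def parallel_sum_of_squares(data, n_chunks=4, executor_cls=ProcessPoolExecutor):
    """Process each chunk in its own worker, then combine the partial results.

    executor_cls is a hypothetical knob: the default uses separate processes
    (one per chunk), and any concurrent.futures executor class can be
    passed instead.
    """
    chunks = split(data, n_chunks)
    with executor_cls(max_workers=n_chunks) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)  # final Reduce over the per-chunk results
```

When the default process pool is used, the call should sit under an `if __name__ == "__main__":` guard, for example `parallel_sum_of_squares(list(range(1000)))`, so that worker processes can import the module safely on platforms that start new interpreters.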
Conclusion
The main purpose of this lesson is to briefly introduce the relevant technologies, paradigms, and abstractions before the in-depth introduction to Spark, laying a foundation for later study. That’s all for today’s lesson; stay tuned for more details on Spark.
Copyright notice: The copyright of this article belongs to Lagou Education and the columnist. No media outlet, website, or individual may reproduce, link to, repost, or otherwise copy and publish it without authorization; violators will be held liable.