Special instructions
This series is organized by the simviso team as a translation of the MIT distributed systems course; Zhiqiu led the team in translating the lectures and the papers covered by the course.
Contributors to this article:
Zhiqiu | review
A virtual less | translation
Abstract
MapReduce is a programming model and an associated implementation for processing and generating large data sets. The user specifies a map function that processes a key/value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks can be expressed in this model, as the concrete cases presented in this paper will show.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The runtime system takes care of partitioning the input data, scheduling execution across the machines of the cluster, handling machine failures, and managing the required inter-machine communication. This allows programmers with no experience in parallel computing or distributed systems to use the resources of a large distributed system effectively.
Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data across thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented at Google, and more than a thousand MapReduce jobs are executed on Google's clusters every day.
1 Introduction
Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents and web page request logs, to compute various kinds of derived data: inverted indexes, various graph representations of web documents, summaries of the number of pages crawled per host, the set of most frequent queries on a given day, and so on. Most of these computations are conceptually simple. However, the input data is usually very large, and the computation has to be distributed across hundreds or thousands of machines to finish in a reasonable amount of time. The questions of how to parallelize the computation, how to distribute the data, and how to handle failures intertwine, and a great deal of code is needed to deal with them. As a result, the originally simple computation becomes extremely complex and unwieldy.
To cope with this complexity, we designed a new abstraction that lets us express the simple computations we were trying to perform while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involve applying a map operation to each logical record in the input to compute a set of intermediate key-value pairs, and then applying a reduce operation to all values that share the same key in order to combine the derived data appropriately. Using a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
The main contribution of this work is a simple and powerful interface that enables automatic parallelization and distributed execution of large-scale computations, together with an implementation of this interface that achieves high performance on large clusters of commodity machines.
Section 2 of this paper describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored to our cluster-based computing environment. Section 4 describes several useful refinements we found for the programming model. Section 5 reports performance measurements of our implementation on a series of tasks. Section 6 explores the use of MapReduce within Google, including our experience using it to rewrite our indexing system. Section 7 discusses related and future work.
2 Programming Model
The computation takes a set of key-value pairs as input and produces a set of key-value pairs as output. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.
The Map function, written by the user, takes an input pair and produces a set of intermediate key-value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key and a set of values for that key as input. It merges these values together to form a possibly smaller set of values; typically, each call to Reduce produces zero or one output value. The intermediate values are supplied to the user's Reduce function through an iterator, which allows us to handle lists of values that are too large to fit in memory.
2.1 Example
Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudocode:
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
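For readers who want to try the idea outside the MapReduce library, the pseudocode above can be sketched in plain C++. This is a minimal single-machine sketch: the function names are illustrative (not the library's real API), and an in-memory map stands in for the shuffle that the real library performs between EmitIntermediate and Reduce.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Intermediate <word, "1"> pairs grouped by word -- a stand-in for
// the library's shuffle phase.
using Intermediate = std::map<std::string, std::vector<std::string>>;

// map(): emit <w, "1"> for each word in the document contents.
void MapWordCount(const std::string& value, Intermediate* inter) {
  std::istringstream in(value);
  std::string w;
  while (in >> w) (*inter)[w].push_back("1");
}

// reduce(): sum the counts collected for one word.
int ReduceWordCount(const std::vector<std::string>& values) {
  int result = 0;
  for (const std::string& v : values) result += std::stoi(v);
  return result;
}

// Driver: run map over every document, then reduce once per key.
std::map<std::string, int> WordCount(const std::vector<std::string>& docs) {
  Intermediate inter;
  for (const std::string& doc : docs) MapWordCount(doc, &inter);
  std::map<std::string, int> counts;
  for (const auto& [word, values] : inter)
    counts[word] = ReduceWordCount(values);
  return counts;
}
```

For example, WordCount({"a b a", "b"}) produces the counts a = 2 and b = 2; in the real system the driver loop is replaced by thousands of map and reduce tasks running in parallel.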
The map function emits each word plus an associated count of occurrences (just 1 in this example). The reduce function sums all the counts emitted for a particular word.
In addition, the user writes code to create a specification object conforming to the MapReduce model, filling in the names of the input and output files and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked against the MapReduce library, which is implemented in C++. The complete code for this example is provided in Appendix A.
2.2 Types
Although the input and output types in the previous pseudocode are strings, conceptually the map and reduce functions supplied by the user have associated types:
map    (k1, v1)       --> list(k2, v2)
reduce (k2, list(v2)) --> list(v2)
That is, the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.
Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user's code to convert between strings and the appropriate types.
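The type signatures above can be written out as C++ alias templates. This is purely illustrative (the alias names are invented here, not part of the library's interface), but it makes the string specialization described in the text concrete:

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// map takes (k1, v1) and produces a list of intermediate (k2, v2)
// pairs; reduce takes one k2 and the list of all v2 values for that
// key, and produces a (usually smaller) list of v2 values.
template <typename K1, typename V1, typename K2, typename V2>
using MapFn =
    std::function<std::vector<std::pair<K2, V2>>(const K1&, const V1&)>;

template <typename K2, typename V2>
using ReduceFn = std::function<std::vector<V2>(const K2&, const std::vector<V2>&)>;

// In the C++ implementation described in the text, all four type
// parameters are strings, and user code does the type conversions.
using StringMapFn = MapFn<std::string, std::string, std::string, std::string>;
using StringReduceFn = ReduceFn<std::string, std::string>;
```

A word-count reduce, for instance, would be a StringReduceFn that parses each value with something like ParseInt and emits the sum back as a string.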
2.3 More Examples
Here are a few simple programs that can easily be expressed using the MapReduce model:
Distributed grep: the map function emits a line if it matches a given pattern. The reduce function is an identity function (f(x) = x) that simply copies the supplied intermediate data to the output.
Count of URL access frequency: the map function processes logs of web page requests and outputs (URL, 1) pairs. The reduce function adds together all values for the same URL and emits a (URL, total count) pair.
Reverse web-link graph: the map function finds every link to a target URL in a source page and outputs a (target, source) pair for each one. The reduce function concatenates all source URLs associated with a given target URL into a list and emits the pair (target, list(source)).
Term vector per host: a term vector summarizes the most important words that occur in a document or a set of documents, as a list of (word, frequency) pairs. For each input document, the map function emits a (hostname, term vector) pair, where the hostname is extracted from the URL of the document. The reduce function receives the per-document term vectors of every document for a given host; it adds these term vectors together, throws away the less frequent terms, and emits a final (hostname, term vector) pair.
Inverted index: the map function parses each document and emits a sequence of (word, document ID) pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a (word, list(document ID)) pair. The set of all output pairs forms a simple inverted index, and it is easy to augment this computation to keep track of word positions.
Distributed sort: the map function extracts the key from each record and emits a (key, record) pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities (see Section 4.1) and the ordering properties (see Section 4.2).
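As one concrete instance of these examples, the inverted index can be sketched in the same single-machine style as the earlier word count; the names here are illustrative, and an in-memory map again simulates the shuffle between the map and reduce phases.

```cpp
#include <algorithm>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// map(): for the document numbered doc_id, emit a <word, doc_id>
// pair for every word in its contents.
void MapInvert(int doc_id, const std::string& contents,
               std::map<std::string, std::vector<int>>* inter) {
  std::istringstream in(contents);
  std::string word;
  while (in >> word) (*inter)[word].push_back(doc_id);
}

// reduce(): sort and deduplicate the document IDs for one word.
std::vector<int> ReduceInvert(std::vector<int> doc_ids) {
  std::sort(doc_ids.begin(), doc_ids.end());
  doc_ids.erase(std::unique(doc_ids.begin(), doc_ids.end()), doc_ids.end());
  return doc_ids;
}

// Driver: build <word, list(doc_id)> from a numbered document list.
std::map<std::string, std::vector<int>> InvertedIndex(
    const std::vector<std::string>& docs) {
  std::map<std::string, std::vector<int>> inter;
  for (int i = 0; i < static_cast<int>(docs.size()); ++i)
    MapInvert(i, docs[i], &inter);
  std::map<std::string, std::vector<int>> index;
  for (auto& [word, ids] : inter) index[word] = ReduceInvert(ids);
  return index;
}
```

Given the documents {"to be", "be or not"}, this yields "be" -> [0, 1] and "to" -> [0]; tracking word positions would only require emitting (doc_id, offset) pairs from the map function instead of bare IDs.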