Let’s start with M’s interview question

The problem

You manage a community and now need to build the ability to like messages and display like counts. The community has a large user base (more than 1,000,000 daily active users) and a high like volume (about 100 likes per user per day).

Given that like events are written to a message queue, how would you design a distributed system to count likes?

Let’s analyze the problem

  1. With roughly one million daily users each giving about 100 likes, that is on the order of 100 million like events per day. A single machine cannot handle this, so a distributed design is a must.
  2. The storage layer needs to support persistence; MySQL plus a cache layer is a reasonable choice.
  3. Fault tolerance and idempotence need to be considered.
  4. Messages should be consumed in batches; processing them one at a time is too expensive.

The answer the candidate gave in the interview

  1. Consume messages in batches, and acknowledge (ack) them only after consumption has completed.
  2. Make the processing atomic, rolling back any side effects when a computation fails.
  3. Make the processing idempotent, so that network problems (such as retried deliveries) do not affect the result (see the sketch after this list).
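To make the batching and idempotence points concrete, here is a minimal Go sketch of batched, idempotent consumption. The `LikeEvent` type and the in-memory maps are illustrative assumptions; a real system would persist the processed IDs and counters (for example in MySQL plus a cache) and ack through the actual message-queue client.

```go
// Minimal sketch of batched, idempotent consumption of like events.
package main

import "fmt"

type LikeEvent struct {
	MsgID  string // unique message id, used for deduplication
	PostID string // the message/post being liked
}

var (
	processed = map[string]bool{}  // message ids already counted
	counts    = map[string]int64{} // postID -> like count
)

// consumeBatch applies a batch of events exactly once; the caller acks the
// batch only after this returns, so a crash before the ack just causes a
// harmless redelivery.
func consumeBatch(batch []LikeEvent) {
	for _, ev := range batch {
		if processed[ev.MsgID] { // idempotence: a redelivered message changes nothing
			continue
		}
		counts[ev.PostID]++
		processed[ev.MsgID] = true
	}
}

func main() {
	batch := []LikeEvent{{"m1", "post-1"}, {"m2", "post-1"}, {"m1", "post-1"}} // m1 duplicated
	consumeBatch(batch)
	consumeBatch(batch)           // whole batch redelivered, e.g. because the ack was lost
	fmt.Println(counts["post-1"]) // still 2
}
```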

Problems with the candidate's answer

  1. Ack-based confirmation alone may fail to trigger recomputation when a message is acked but not actually consumed correctly.
  2. Making the program atomic is expensive: rollbacks require exception handling everywhere, and real transactional guarantees are hard to achieve.

Possible solutions

  1. A master node schedules the workers and records consumption progress.
  2. The worker's heartbeat is used to decide whether a task needs to be reassigned.
  3. A worker keeps temporary data in memory or local storage, writes to the backend cache/DB only after it has completely finished its part of the work, and then notifies the master that the data has been processed. (A sketch of the heartbeat-based reassignment follows this list.)
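Here is a minimal sketch of how a master could use heartbeats to decide when a task should be handed to another worker. The `Task` fields and the 10-second timeout are illustrative assumptions, not taken from any particular framework.

```go
// Minimal sketch of heartbeat-based task reassignment by the master.
package main

import (
	"fmt"
	"time"
)

type Task struct {
	ID       int
	Worker   string    // worker currently assigned, "" if unassigned
	LastBeat time.Time // last heartbeat received for this task
	Done     bool
}

const heartbeatTimeout = 10 * time.Second

// reassignStalled returns the tasks whose worker has stopped heartbeating,
// so the master can hand them to another worker. This is safe because a
// worker only publishes results (and notifies the master) once fully done.
func reassignStalled(tasks []*Task, now time.Time) []*Task {
	var stalled []*Task
	for _, t := range tasks {
		if !t.Done && t.Worker != "" && now.Sub(t.LastBeat) > heartbeatTimeout {
			t.Worker = "" // mark as unassigned so it can be handed out again
			stalled = append(stalled, t)
		}
	}
	return stalled
}

func main() {
	now := time.Now()
	tasks := []*Task{
		{ID: 1, Worker: "w1", LastBeat: now.Add(-30 * time.Second)}, // silent worker
		{ID: 2, Worker: "w2", LastBeat: now},
	}
	for _, t := range reassignStalled(tasks, now) {
		fmt.Println("reassigning task", t.ID)
	}
}
```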

So what is the core problem? When the data volume is 100 times (or more) what a single machine can handle, the processing problem is solved with sharding plus scheduling: split the data into shards and schedule workers to process them. (A tiny sketch of the sharding half follows.)
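As a small illustration of sharding, assuming like events are partitioned by hashing the post ID across a fixed set of workers (the worker count and hashing scheme here are made up for the example):

```go
// Partition like events across workers by hashing the post ID,
// so each worker owns a disjoint slice of the data.
package main

import (
	"fmt"
	"hash/fnv"
)

const numWorkers = 8

// shardFor decides which worker is responsible for a given post.
func shardFor(postID string) int {
	h := fnv.New32a()
	h.Write([]byte(postID))
	return int(h.Sum32() % numWorkers)
}

func main() {
	for _, id := range []string{"post-1", "post-2", "post-3"} {
		fmt.Printf("%s -> worker %d\n", id, shardFor(id))
	}
}
```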

Get to the point

Understanding the questions above will help you understand MapReduce, and today we will focus on MapReduce's distributed processing idea. Only the underlying theory is discussed here, not any particular framework or its usage.

Paper links

The original English paper: static.googleusercontent.com/media/resea…

A Chinese translation: developer.aliyun.com/article/318…

Core problem: how to process data when the volume is very large.

Core concepts

  1. Data is the source data to be computed; it flows through the system as <key, value> pairs.
  2. The master program is responsible for assigning tasks; making those assignments reasonable is what makes it the most complex component.
  3. A worker is the program that performs tasks, executing the map or reduce functions.
  4. The map function: map(k1, v1) -> list(k2, v2). The combine step groups the map output: list(k2, v2) -> (k2, list(v2)).
  5. The reduce function processes the grouped map output: reduce(k2, list(v2)) -> list(v2). (These shapes are written out as Go types after this list.)
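To make these shapes concrete, here is one way they might be written as Go types. `KeyValue` and the function signatures are a sketch of the paper's model, not the API of any real framework.

```go
// The three function shapes from the list above, expressed as Go types.
package mr

// KeyValue is the <key, value> pair flowing through the pipeline.
type KeyValue struct {
	Key   string
	Value string
}

// MapFunc: map(k1, v1) -> list(k2, v2)
type MapFunc func(k1, v1 string) []KeyValue

// CombineFunc: list(k2, v2) -> (k2, list(v2)), i.e. group values by key
type CombineFunc func(pairs []KeyValue) map[string][]string

// ReduceFunc: reduce(k2, list(v2)) -> list(v2)
type ReduceFunc func(k2 string, v2s []string) []string
```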

The process

  1. The data to be processed is stored on disk, split into input shards.
  2. The master assigns map tasks to workers; each map task reads data from its assigned input location.
  3. The map results are stored in memory or on local disk.
  4. The map results are aggregated in the combine step, which groups values with the same key.
  5. Reduce tasks are assigned to run the reduce function over the aggregated map output.
  6. The reduce results are written to disk as the final output.
  7. Done.

An example

Following the example in the paper, counting the occurrences of each word in a body of text, let's see what the complete process looks like. (A runnable sketch of the word-count map and reduce functions follows.)
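Below is a self-contained Go sketch of that word count, run sequentially so the data flow is easy to follow. The in-memory grouping stands in for the distributed combine/shuffle step, and the shard contents are made up for illustration.

```go
// Word count in the MapReduce style, run as a single process:
// read shards, map, group by key, reduce, print the output.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type KeyValue struct {
	Key   string
	Value string
}

// mapF: for each word in the document, emit (word, "1").
func mapF(docName, contents string) []KeyValue {
	var out []KeyValue
	for _, w := range strings.Fields(contents) {
		out = append(out, KeyValue{Key: w, Value: "1"})
	}
	return out
}

// reduceF: sum the emitted counts for one word.
func reduceF(word string, values []string) string {
	return strconv.Itoa(len(values))
}

func main() {
	shards := map[string]string{ // step 1: input split into shards
		"shard-0": "the quick brown fox",
		"shard-1": "the lazy dog and the fox",
	}

	// steps 2-3: run the map function over every shard
	var intermediate []KeyValue
	for name, contents := range shards {
		intermediate = append(intermediate, mapF(name, contents)...)
	}

	// step 4: combine - group values with the same key
	grouped := map[string][]string{}
	for _, kv := range intermediate {
		grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
	}

	// steps 5-6: reduce each key and write out the result
	for word, values := range grouped {
		fmt.Println(word, reduceF(word, values))
	}
}
```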

Summary

We will not delve into the implementation of MapReduce for the moment, but focus on the idea of using sharding logic to process large amounts of data.