Let’s start with M’s interview question

The problem

You manage a community and now need to build the ability to like messages and display like counts. The community has a large user base (more than 1,000,000 daily active users) and a high like volume (about 100 likes per user per day).

Given that like events are written to a message queue, how would you design a distributed system to count likes?

Let’s analyze the problem

  1. With roughly one million daily users each giving about 100 likes, that is on the order of 100 million like events per day. A single machine cannot handle this, so a distributed design is a must.
  2. The storage layer needs to support persistence; MySQL plus a cache layer is a reasonable choice.
  3. Fault tolerance and idempotence need to be considered.
  4. Messages should be consumed in batches; processing them one at a time is too expensive.

The answer the candidate gave in the interview

  1. Consume messages in batches, and acknowledge (ack) them only after consumption has completed.
  2. Make the processing atomic, rolling back any side effects when a computation fails.
  3. Make the processing idempotent, so that network problems (such as retried deliveries) do not affect the result (see the sketch after this list).
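To make the batching and idempotence points concrete, here is a minimal Go sketch of batched, idempotent consumption. The `LikeEvent` type and the in-memory maps are illustrative assumptions; a real system would persist the processed IDs and counters (for example in MySQL plus a cache) and ack through the actual message-queue client.

```go
// Minimal sketch of batched, idempotent consumption of like events.
package main

import "fmt"

type LikeEvent struct {
	MsgID  string // unique message id, used for deduplication
	PostID string // the message/post being liked
}

var (
	processed = map[string]bool{}  // message ids already counted
	counts    = map[string]int64{} // postID -> like count
)

// consumeBatch applies a batch of events exactly once; the caller acks the
// batch only after this returns, so a crash before the ack just causes a
// harmless redelivery.
func consumeBatch(batch []LikeEvent) {
	for _, ev := range batch {
		if processed[ev.MsgID] { // idempotence: a redelivered message changes nothing
			continue
		}
		counts[ev.PostID]++
		processed[ev.MsgID] = true
	}
}

func main() {
	batch := []LikeEvent{{"m1", "post-1"}, {"m2", "post-1"}, {"m1", "post-1"}} // m1 duplicated
	consumeBatch(batch)
	consumeBatch(batch)           // whole batch redelivered, e.g. because the ack was lost
	fmt.Println(counts["post-1"]) // still 2
}
```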

Problems with the candidate's answer

  1. Ack-based confirmation alone may fail to trigger recomputation when a message is acked but not actually consumed correctly.
  2. Making the program atomic is expensive: rollbacks require exception handling everywhere, and real transactional guarantees are hard to achieve.

Possible solutions

  1. A master node schedules the workers and records consumption progress.
  2. The worker's heartbeat is used to decide whether a task needs to be reassigned.
  3. A worker keeps temporary data in memory or local storage, writes to the backend cache/DB only after it has completely finished its part of the work, and then notifies the master that the data has been processed. (A sketch of the heartbeat-based reassignment follows this list.)
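Here is a minimal sketch of how a master could use heartbeats to decide when a task should be handed to another worker. The `Task` fields and the 10-second timeout are illustrative assumptions, not taken from any particular framework.

```go
// Minimal sketch of heartbeat-based task reassignment by the master.
package main

import (
	"fmt"
	"time"
)

type Task struct {
	ID       int
	Worker   string    // worker currently assigned, "" if unassigned
	LastBeat time.Time // last heartbeat received for this task
	Done     bool
}

const heartbeatTimeout = 10 * time.Second

// reassignStalled returns the tasks whose worker has stopped heartbeating,
// so the master can hand them to another worker. This is safe because a
// worker only publishes results (and notifies the master) once fully done.
func reassignStalled(tasks []*Task, now time.Time) []*Task {
	var stalled []*Task
	for _, t := range tasks {
		if !t.Done && t.Worker != "" && now.Sub(t.LastBeat) > heartbeatTimeout {
			t.Worker = "" // mark as unassigned so it can be handed out again
			stalled = append(stalled, t)
		}
	}
	return stalled
}

func main() {
	now := time.Now()
	tasks := []*Task{
		{ID: 1, Worker: "w1", LastBeat: now.Add(-30 * time.Second)}, // silent worker
		{ID: 2, Worker: "w2", LastBeat: now},
	}
	for _, t := range reassignStalled(tasks, now) {
		fmt.Println("reassigning task", t.ID)
	}
}
```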

So what is the core problem? When the data volume is 100 times (or more) what a single machine can handle, the processing problem is solved with sharding plus scheduling: split the data into shards and schedule workers to process them. (A tiny sketch of the sharding half follows.)
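As a small illustration of sharding, assuming like events are partitioned by hashing the post ID across a fixed set of workers (the worker count and hashing scheme here are made up for the example):

```go
// Partition like events across workers by hashing the post ID,
// so each worker owns a disjoint slice of the data.
package main

import (
	"fmt"
	"hash/fnv"
)

const numWorkers = 8

// shardFor decides which worker is responsible for a given post.
func shardFor(postID string) int {
	h := fnv.New32a()
	h.Write([]byte(postID))
	return int(h.Sum32() % numWorkers)
}

func main() {
	for _, id := range []string{"post-1", "post-2", "post-3"} {
		fmt.Printf("%s -> worker %d\n", id, shardFor(id))
	}
}
```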

Get to the point

Understanding the questions above will help you understand MapReduce, and today we will focus on MapReduce's distributed processing idea. Only the underlying theory is discussed here, not any particular framework or its usage.

Paper links

The original English paper: static.googleusercontent.com/media/resea…

A Chinese translation: developer.aliyun.com/article/318…

Core problem: how to process data when the volume is very large.

Core concepts

  1. Data is the source data to be computed; it flows through the system as <key, value> pairs.
  2. The master program is responsible for assigning tasks; making those assignments reasonable is what makes it the most complex component.
  3. A worker is the program that performs tasks, executing the map or reduce functions.
  4. The map function: map(k1, v1) -> list(k2, v2). The combine step groups the map output: list(k2, v2) -> (k2, list(v2)).
  5. The reduce function processes the grouped map output: reduce(k2, list(v2)) -> list(v2). (These shapes are written out as Go types after this list.)
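To make these shapes concrete, here is one way they might be written as Go types. `KeyValue` and the function signatures are a sketch of the paper's model, not the API of any real framework.

```go
// The three function shapes from the list above, expressed as Go types.
package mr

// KeyValue is the <key, value> pair flowing through the pipeline.
type KeyValue struct {
	Key   string
	Value string
}

// MapFunc: map(k1, v1) -> list(k2, v2)
type MapFunc func(k1, v1 string) []KeyValue

// CombineFunc: list(k2, v2) -> (k2, list(v2)), i.e. group values by key
type CombineFunc func(pairs []KeyValue) map[string][]string

// ReduceFunc: reduce(k2, list(v2)) -> list(v2)
type ReduceFunc func(k2 string, v2s []string) []string
```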

The process

  1. The data to be processed is stored on disk, split into input shards.
  2. The master assigns map tasks to workers; each map task reads data from its assigned input location.
  3. The map results are stored in memory or on local disk.
  4. The map results are aggregated in the combine step, which groups values with the same key.
  5. Reduce tasks are assigned to run the reduce function over the aggregated map output.
  6. The reduce results are written to disk as the final output.
  7. Done.

An example

Following the example in the paper, counting the occurrences of each word in a body of text, let's see what the complete process looks like. (A runnable sketch of the word-count map and reduce functions follows.)
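Below is a self-contained Go sketch of that word count, run sequentially so the data flow is easy to follow. The in-memory grouping stands in for the distributed combine/shuffle step, and the shard contents are made up for illustration.

```go
// Word count in the MapReduce style, run as a single process:
// read shards, map, group by key, reduce, print the output.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type KeyValue struct {
	Key   string
	Value string
}

// mapF: for each word in the document, emit (word, "1").
func mapF(docName, contents string) []KeyValue {
	var out []KeyValue
	for _, w := range strings.Fields(contents) {
		out = append(out, KeyValue{Key: w, Value: "1"})
	}
	return out
}

// reduceF: sum the emitted counts for one word.
func reduceF(word string, values []string) string {
	return strconv.Itoa(len(values))
}

func main() {
	shards := map[string]string{ // step 1: input split into shards
		"shard-0": "the quick brown fox",
		"shard-1": "the lazy dog and the fox",
	}

	// steps 2-3: run the map function over every shard
	var intermediate []KeyValue
	for name, contents := range shards {
		intermediate = append(intermediate, mapF(name, contents)...)
	}

	// step 4: combine - group values with the same key
	grouped := map[string][]string{}
	for _, kv := range intermediate {
		grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
	}

	// steps 5-6: reduce each key and write out the result
	for word, values := range grouped {
		fmt.Println(word, reduceF(word, values))
	}
}
```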

Summary

We will not delve into the implementation of MapReduce for the moment, but focus on the idea of using sharding logic to process large amounts of data.