
Data sharding

➤ Let’s look at an example

We often use Redis as a cache, putting hot data in it to reduce the pressure on the database.

When the amount of data is small and the access pressure is low, a single Redis instance is usually enough; for high availability, a master-slave setup will do.

When the data volume and concurrency grow, it becomes hard to fit all the cached data on one machine; after all, a single machine's resources are limited. Usually we build a cluster so the data can be spread evenly across several Redis instances. Say our cluster has four Redis instances.

So how do we distribute the data across the four Redis instances as evenly as possible? The simplest approach is the modulo algorithm:

hash(key) % N, where N is the number of Redis instances; here N = 4.

[Figure: data sharding with hash(key) % 4]

This looks pretty good: the data is spread evenly across the four instances, and when a new request comes in we can compute exactly which Redis holds the key, so the cached data can be located precisely.
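As a sketch, modulo sharding might look like the Java below; the class and node names are illustrative, not taken from any particular client library.

```java
import java.util.List;

// A minimal sketch of modulo sharding (names are hypothetical).
public class ModuloSharding {
    private final List<String> nodes;

    public ModuloSharding(List<String> nodes) {
        this.nodes = nodes;
    }

    // Pick the Redis instance for a key: hash(key) % N.
    public String nodeFor(String key) {
        int hash = key.hashCode() & 0x7FFFFFFF; // drop the sign bit so the modulo is non-negative
        return nodes.get(hash % nodes.size());
    }

    public static void main(String[] args) {
        ModuloSharding shards = new ModuloSharding(
                List.of("redis-0", "redis-1", "redis-2", "redis-3"));
        System.out.println(shards.nodeFor("user:1001")); // the same key always maps to the same node
    }
}
```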

Problems with data sharding

But what if four Redis instances are no longer enough, and we need to add four more?

Then the algorithm becomes hash(key) % 8.

[Figure: the number of shards changes]

As you can imagine, most of the existing cache entries now map to the wrong instance. In extreme cases this causes a cache avalanche: a flood of misses hits the database all at once.
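A quick way to see the scale of the problem is to count, for a batch of made-up keys, how many map to a different shard once N changes from 4 to 8; the exact fraction depends on the hash function, but it is always large.

```java
// Counts how many keys land on a different shard when N goes from 4 to 8.
public class ReshardDemo {
    public static void main(String[] args) {
        int total = 100_000;
        int moved = 0;
        for (int i = 0; i < total; i++) {
            int hash = ("key:" + i).hashCode() & 0x7FFFFFFF;
            if (hash % 4 != hash % 8) { // same key, different shard after scaling out
                moved++;
            }
        }
        // Roughly half the keys move for 4 -> 8; for N -> N+1 the fraction is even larger.
        System.out.printf("%d of %d keys (%.1f%%) changed shards%n",
                moved, total, 100.0 * moved / total);
    }
}
```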

Consistent Hash algorithm

The consistent Hash algorithm solves this problem well. Its general process looks like this:

Take 0 as the starting point and 2^32 - 1 as the ending point and draw a line; then overlap the starting point with the ending point, so the line becomes a circle whose values increase clockwise.

The first point to the right of 0 is 1, then 2, and so on.

Suppose we have three servers: hash each server's IP address (or some other keyword) and take the result modulo 2^32, so each server must fall somewhere on the circle; denote these points Node1, Node2, Node3.

[Figure: the hash circle with Node1, Node2, and Node3]

Then do the same for the data key: it will likewise fall somewhere on the circle. Walking clockwise from that point, the first Node you reach is the server where the key is stored.

[Figure: a key finds its Redis node by walking clockwise]
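Here is a minimal sketch of such a ring in Java, using a sorted TreeMap as the circle; the 32-bit FNV-1a hash and all of the names are my own choices for illustration, not a reference implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// A minimal consistent-hash ring (no virtual nodes yet).
public class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // 32-bit FNV-1a hash, masked to an unsigned value in [0, 2^32 - 1].
    static long hash(String s) {
        long h = 0x811C9DC5L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h = (h * 0x01000193L) & 0xFFFFFFFFL;
        }
        return h;
    }

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    public void removeNode(String node) {
        ring.remove(hash(node));
    }

    // "Walk clockwise": take the first node at or after the key's position,
    // wrapping back to the smallest position when nothing lies ahead.
    public String nodeFor(String key) {
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        ring.addNode("Node1");
        ring.addNode("Node2");
        ring.addNode("Node3");
        System.out.println("user:1001 -> " + ring.nodeFor("user:1001"));
    }
}
```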

If a server is added or removed, only part of the data is affected: the keys on the arc between the changed node and its nearest counter-clockwise neighbor.

[Figure: a node change affects only part of the data]

However, if there are too few nodes, or the nodes are not evenly spread around the circle, data skew occurs easily; that is, most of the data ends up concentrated on one server.

[Figure: data skew]

To solve the data skew problem, the consistent Hash algorithm introduces virtual nodes: each physical node is hashed multiple times, placing many points for it at different locations on the circle, so every node owns many small arcs instead of one large one.

[Figure: virtual nodes]
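Extending the sketch above (this reuses the hash helper from HashRing), virtual nodes might look like the following; the replica count of 160 per node is an assumption for illustration, not a fixed rule.

```java
import java.util.Map;
import java.util.TreeMap;

// A consistent-hash ring with virtual nodes (replica count is hypothetical).
public class VirtualNodeRing {
    private static final int REPLICAS = 160; // virtual nodes per physical node

    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node) {
        // "Node1#0", "Node1#1", ... each replica lands at a different point
        // on the circle, smoothing out the distribution of keys.
        for (int i = 0; i < REPLICAS; i++) {
            ring.put(HashRing.hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < REPLICAS; i++) {
            ring.remove(HashRing.hash(node + "#" + i));
        }
    }

    public String nodeFor(String key) {
        Map.Entry<Long, String> e = ring.ceilingEntry(HashRing.hash(key));
        return (e != null ? e : ring.firstEntry()).getValue(); // wrap around the circle
    }
}
```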

Of course, we can also see that the consistent Hash algorithm only shrinks the problem rather than eliminating it: when nodes change, the affected keys still have to be remapped and reloaded; it just keeps that affected portion small.
