Data sharding
➤ Let’s look at an example
We often use Redis as a cache, putting some data in it to reduce the pressure on the database.
When the amount of data is small and the access pressure is low, a single Redis instance is usually enough; for high availability, adding a master-slave setup will do.
When the amount of data and the concurrency grow, it becomes difficult to keep all the cached data on one machine; after all, a single machine's resources are limited. Usually we build a cluster so that the data can be spread evenly across several Redis instances. Say our cluster has four Redis instances.
So how do we distribute the data across the four Redis instances as evenly as possible? The simplest approach is the modulo algorithm:
hash(key) % N, where N is the number of Redis instances; here N = 4.
This looks pretty good: the data is spread roughly evenly across the four Redis instances, and when a new request comes in we can compute exactly which instance holds the data, so the cached value can be looked up precisely.
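As a minimal sketch, modulo routing might look like this in Java (the instance names and keys are just placeholders):

```java
import java.util.List;

// Sketch of modulo-based sharding: pick a Redis instance with hash(key) % N.
public class ModuloSharding {

    static String route(String key, List<String> nodes) {
        int hash = key.hashCode();
        // floorMod keeps the index non-negative even if hashCode() is negative
        int index = Math.floorMod(hash, nodes.size());
        return nodes.get(index);
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("redis-0", "redis-1", "redis-2", "redis-3"); // N = 4
        System.out.println("user:1001 -> " + route("user:1001", nodes));
        System.out.println("order:42  -> " + route("order:42", nodes));
    }
}
```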
Problems with data sharding
But suppose 4 Redis instances are no longer enough and we need to add 4 more;
Then the algorithm becomes hash(key) % 8;
As you can imagine, most of the existing cached data would now map to the wrong instance; in extreme cases this causes a cache avalanche.
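A quick simulation makes the impact concrete. The keys below are made up, but with hash(key) % N going from N = 4 to N = 8, a large share of keys end up on a different instance, so their old cached values are effectively lost:

```java
import java.util.stream.IntStream;

// Estimate how many keys move when the cluster grows from 4 to 8 nodes
// under hash(key) % N routing.
public class ReshardImpact {
    public static void main(String[] args) {
        int total = 1_000_000;
        long moved = IntStream.range(0, total)
                .filter(i -> {
                    int hash = ("key:" + i).hashCode();
                    return Math.floorMod(hash, 4) != Math.floorMod(hash, 8);
                })
                .count();
        // A large fraction of keys now map to a different instance,
        // so lookups for them miss the cache right after the change.
        System.out.printf("moved: %d / %d (%.1f%%)%n", moved, total, 100.0 * moved / total);
    }
}
```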
Consistent Hash algorithm
A consistent Hash algorithm can solve this problem well, and its general process looks like this:
Take 0 as the starting point and 2^32-1 as the end point, draw a line, then join the two ends so the line becomes a circle, with values increasing clockwise.
The first point to the right of 0 is 1, then 2, and so on.
Suppose we have three servers. Hash each server's IP address (or some other identifier) and take the result modulo 2^32, so each server lands somewhere on the circle; call these points Node1, Node2 and Node3.
Then do the same for a data key, which will also fall somewhere on the circle. Walking clockwise from that position, the first Node you reach is the server where the key is stored.
If a server is added or removed, only part of the data is affected.
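Here is a minimal sketch of such a ring in Java. The hash function (first 4 bytes of an MD5 digest) and the node names are illustrative choices, not a fixed part of the algorithm:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// A minimal consistent-hash ring: keys and nodes are mapped onto 0 .. 2^32-1,
// and a key belongs to the first node found walking clockwise from it.
public class ConsistentHashRing {

    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Map a string onto the 0 .. 2^32-1 circle using the first 4 bytes of its MD5 digest.
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            return ((long) (d[0] & 0xFF) << 24 | (d[1] & 0xFF) << 16
                    | (d[2] & 0xFF) << 8 | (d[3] & 0xFF)) & 0xFFFFFFFFL;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    void addNode(String node) {
        ring.put(hash(node), node);
    }

    // Clockwise walk: the first node at or after the key's position owns the key;
    // if there is none, wrap around to the first node on the circle.
    String locate(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        Long slot = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(slot);
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        List.of("Node1", "Node2", "Node3").forEach(ring::addNode);
        System.out.println("user:1001 -> " + ring.locate("user:1001"));
        System.out.println("order:42  -> " + ring.locate("order:42"));
    }
}
```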
However, if there are too few nodes, or the nodes are unevenly distributed on the circle, data skew can easily occur: most of the data ends up concentrated on one server.
To solve the data skew problem, the consistent Hash algorithm introduces virtual nodes: each physical node is hashed multiple times and placed at several different positions on the circle.
Of course, consistent hashing only solves most of the problem: when nodes are added or removed, a small portion of the data still has to be remapped.
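Building on the ring sketch above, virtual nodes can be added by hashing each physical node several times under different labels (the "#i" suffix and the replica count are arbitrary illustrative choices):

```java
// Add to the ConsistentHashRing sketch above: one physical node becomes
// several points on the circle, which spreads its share of keys more evenly.
void addNode(String node, int virtualNodes) {
    for (int i = 0; i < virtualNodes; i++) {
        // "Node1#0", "Node1#1", ... all map back to the same physical node
        ring.put(hash(node + "#" + i), node);
    }
}
```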