Hello, I’m Rich


Some people in a technical group were discussing consistent hashing, and since I didn't have a better topic to write about, let me briefly walk through its principle. Using the classic distributed caching scenario, plus a question that often comes up in interviews, let's look at what the consistent hash algorithm is and what makes it so good.

Setting the scene

Suppose we have three cache servers, numbered node0, node1, and node2, and 30 million keys that we want to cache evenly across the three machines. What solution would you come up with?

The first thing that comes to mind is the modulo algorithm hash(key) % N, where N is the number of machines. Hashing a key and taking the result modulo 3 always gives 0, 1, or 2, which maps to node0, node1, or node2, so we can read and write the data directly on the matching server. Simple and crude, and it completely solves the problem above.
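As a rough sketch of this modulo approach (assuming CRC32 as the hash function, which is just one possible choice), the routing could look something like this:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ModuloRouter {
    private static final String[] NODES = {"node0", "node1", "node2"};

    // hash(key) % N: hash the key and take the remainder by the node count.
    static String route(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return NODES[(int) (crc.getValue() % NODES.length)];
    }

    public static void main(String[] args) {
        // The same key always lands on the same node, as long as N stays at 3.
        System.out.println(route("key-1") + ", " + route("key-2"));
    }
}
```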

The problem with modulo hashing

Although the algorithm is simple to use, it has real limitations when the cluster is scaled out or in, and adjusting the number of servers to match business volume is common in production. If the number of servers N changes, the result of hash(key) % N changes with it.

For example, if one server node goes down, the formula changes from hash(key) % 3 to hash(key) % 2, and the result for most keys changes with it. When you then try to access a key, it will most likely have moved to a different cache location, so the data already cached loses its effect and its point.
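To get a feel for how bad the damage is, here is a small sketch (again assuming CRC32 and an arbitrary sample of 100,000 keys) that counts how many keys land on a different server when the cluster shrinks from 3 nodes to 2:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class RehashImpact {
    static long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        int total = 100_000, moved = 0;
        for (int i = 0; i < total; i++) {
            String key = "key-" + i;
            // Route the same key against 3 nodes, then against 2 nodes.
            if (hash(key) % 3 != hash(key) % 2) moved++;
        }
        // Roughly two thirds of the keys change servers, i.e. most lookups now miss.
        System.out.printf("moved %d of %d keys (%.1f%%)%n",
                moved, total, 100.0 * moved / total);
    }
}
```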

When a large number of cached keys become invalid at the same time, the resulting cache avalanche can make the whole cache system unavailable, which is basically unacceptable. The consistent hash algorithm was created to solve exactly this situation.

So how does the consistent hash algorithm solve these problems?

Consistent hashing

The consistent hash algorithm is essentially still a modulo algorithm, but instead of taking the modulo by the number of servers, it takes the modulo by a fixed value: 2^32.

An IPv4 address consists of four groups of 8-bit binary numbers, 32 bits in total, so a space of 2^32 positions is enough to give every IPv4 address its own point.

Hash ring

We can picture these 2^32 values as a circle ⭕️: the point at the very top represents 0, and the values 1, 2, 3, 4, 5, 6… are arranged clockwise, all the way around to 2^32 - 1. This circle of 2^32 points is called the hash ring.
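Before mapping anything concrete onto it, here is a tiny helper sketch (assuming CRC32, whose output is already a 32-bit value, so the % 2^32 is implicit) that turns an arbitrary string into a position on this ring:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class RingPosition {
    // Map an arbitrary string to a point on the ring, i.e. an integer in [0, 2^32 - 1].
    static long ringPosition(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue(); // CRC32 is a 32-bit value, so % 2^32 is implicit
    }

    public static void main(String[] args) {
        System.out.println(ringPosition("node0"));  // a server identifier
        System.out.println(ringPosition("key-1"));  // an object key
    }
}
```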

What does the hash ring have to do with the consistent hash algorithm? Let's go back to the example: three cache servers numbered node0, node1, and node2, and 30 million keys.

Mapping servers onto the hash ring

The formula changes from hash(key) % N to hash(server IP) % 2^32: take the server's IP address, hash it, and take the result modulo 2^32. The result is always an integer between 0 and 2^32 - 1, and the position of that integer on the hash ring represents the server. In this way the cache servers node0, node1, and node2 are mapped onto the hash ring one after another.
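In code, the ring is commonly represented as a sorted map from ring position to server. A minimal sketch, with made-up IP addresses for the three nodes and the same assumed CRC32 helper as above:

```java
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class ServerRing {
    // Sorted map: ring position -> server IP. Keeping positions ordered is what
    // lets us walk the ring "clockwise" later.
    static final TreeMap<Long, String> RING = new TreeMap<>();

    static long ringPosition(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        // Hypothetical IPs for node0, node1, and node2.
        for (String ip : new String[]{"10.24.23.225", "10.24.23.226", "10.24.23.227"}) {
            RING.put(ringPosition(ip), ip); // each server lands at a fixed point on the ring
        }
        System.out.println(RING);
    }
}
```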

Mapping object keys onto the hash ring

Next, the object keys are mapped onto the same ring with hash(key) % 2^32. Now that both the server nodes and the cached keys sit on the hash ring, which server should a given key be cached on?

Mapping object keys to servers

Starting from the position of the cached object key and walking clockwise, the first server we encounter is the server on which that object will be cached.

Since the hash values of both the cached object and the servers are fixed, as long as the servers stay unchanged, an object key is always cached on the same server. Following this rule, the mapping in the figure is:

  • key-1 -> node-1
  • key-3 -> node-2
  • key-4 -> node-2
  • key-5 -> node-2
  • key-2 -> node-0

When you want to access a key, you run the same calculation to find out which server the key is cached on.
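Putting the pieces together, the clockwise lookup boils down to "find the first server position at or after the key's position, wrapping around to the start of the ring if necessary". A minimal sketch, with the same assumed CRC32 hash and made-up IPs:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class ConsistentHashRouter {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    private static long ringPosition(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    void addServer(String ip) {
        ring.put(ringPosition(ip), ip);
    }

    // Walk clockwise from the key's position: the first server at or after it owns
    // the key; past the largest position we wrap around to the first server.
    String route(String key) {
        Map.Entry<Long, String> e = ring.ceilingEntry(ringPosition(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        ConsistentHashRouter router = new ConsistentHashRouter();
        router.addServer("10.24.23.225");
        router.addServer("10.24.23.226");
        router.addServer("10.24.23.227");
        System.out.println(router.route("key-1"));
    }
}
```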

Advantages of consistent hashing

We have seen how consistent hashing works, but how does it improve on the common modulo algorithm when nodes are added to or removed from the cluster, and avoid the large-scale cache failures that the modulo approach causes?

Suppose business volume surges and the system needs to scale out, so we add a server node-4, which happens to map to a point between node-1 and node-2. Walking the ring clockwise, the objects key-4 and key-5 that used to be cached on node-2 are now mapped to node-4; only the small slice of data between node-1 and node-4 is affected by the expansion.

Conversely, if node-1 goes down, objects are remapped clockwise to the next node: key-1, which was cached on node-1, is remapped to node-4, and only the small slice of data between node-0 and node-1 is affected.

From these two cases we can see that when the number of servers in the cluster changes, consistent hashing only affects a small portion of the data, so the cache system as a whole can keep serving requests.
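The following sketch makes the claim concrete: it builds the ring twice, with three and then four (made-up) nodes, and counts how many of a sample of keys change owner. Only the keys falling on the arc claimed by the new node should move:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class RebalanceDemo {
    static long pos(String s) {
        CRC32 c = new CRC32();
        c.update(s.getBytes(StandardCharsets.UTF_8));
        return c.getValue();
    }

    // Clockwise lookup on a ring represented as a sorted map of position -> node.
    static String route(TreeMap<Long, String> ring, String key) {
        Map.Entry<Long, String> e = ring.ceilingEntry(pos(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        TreeMap<Long, String> before = new TreeMap<>(), after = new TreeMap<>();
        for (String ip : new String[]{"10.24.23.225", "10.24.23.226", "10.24.23.227"}) {
            before.put(pos(ip), ip);
            after.put(pos(ip), ip);
        }
        after.put(pos("10.24.23.228"), "10.24.23.228"); // a fourth node joins the cluster

        int total = 10_000, moved = 0;
        for (int i = 0; i < total; i++) {
            String key = "key-" + i;
            if (!route(before, key).equals(route(after, key))) moved++;
        }
        // Only the keys between the new node and its predecessor move; the plain
        // modulo scheme would have remapped the majority of them.
        System.out.printf("moved %d of %d keys%n", moved, total);
    }
}
```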

Data skew problem

To make the principle easy to follow, the nodes in the diagrams above are idealized and distributed fairly evenly. Reality, however, often differs a lot from the ideal; it's like getting an annual gym membership and only going twice, just to take a shower.


When the number of server nodes is too small, an uneven distribution of nodes easily leads to data skew. As shown in the figure below, most of the cached objects end up on the node-4 server, which wastes the resources of the other nodes and concentrates most of the system's load on node-4. A cluster like this is very unhealthy.

The solution to data skew is also simple: we need a way to map the nodes onto the hash ring so that they are spread out relatively evenly.

Consistent hashing therefore introduces a virtual node mechanism: multiple hash values are computed for each server node, and all of them are placed on the hash ring. Object keys are mapped to these virtual nodes in the same clockwise way, but they are ultimately cached on the corresponding real node.

A virtual node can be created by appending a numeric suffix to the node's IP address (for example 10.24.23.227#1). Say node-1's IP address is 10.24.23.227; its own hash value is still calculated as before.

  • Hash (10.24.23.227) % 2^32

Suppose we set up three virtual nodes for node-1: node-1#1, node-1#2, and node-1#3. Each of them is hashed in the same way.

  • Hash (10.24.23.227#1) % 2^32
  • Hash (10.24.23.227#2) % 2^32
  • Hash (10.24.23.227#3) % 2^32

As the figure below shows, after virtual nodes are added the original nodes are distributed much more evenly on the hash ring, and the load that used to fall on a single node is shared with the others.

Note, however, that the more virtual nodes you allocate, the more uniform the mapping on the hash ring becomes; with too few, the effect is hard to see.

Introducing virtual nodes also brings a new problem: we have to maintain the mapping between virtual nodes and real nodes, so every lookup goes through the conversion object key -> virtual node -> real node.
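One common way to keep that conversion trivial is to store the real node as the value for each virtual node's ring position, so the virtual-to-real mapping is resolved automatically at lookup time. A minimal sketch (the replica count and the second IP are made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class VirtualNodeRing {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // position -> real node IP
    private final int replicas; // virtual nodes per real node

    VirtualNodeRing(int replicas) {
        this.replicas = replicas;
    }

    private static long pos(String s) {
        CRC32 c = new CRC32();
        c.update(s.getBytes(StandardCharsets.UTF_8));
        return c.getValue();
    }

    // Hash "ip#i" for each virtual node, but store the real IP as the value, so a
    // key that lands on any virtual node is served by the real node behind it.
    void addServer(String ip) {
        for (int i = 1; i <= replicas; i++) {
            ring.put(pos(ip + "#" + i), ip);
        }
    }

    String route(String key) {
        Map.Entry<Long, String> e = ring.ceilingEntry(pos(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        VirtualNodeRing r = new VirtualNodeRing(3);  // 3 virtual nodes per server, as in the example
        r.addServer("10.24.23.227");                 // node-1 from the article
        r.addServer("10.24.23.228");                 // a hypothetical second server
        System.out.println(r.route("key-1"));
    }
}
```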

Application scenarios of consistent hashing

Consistent hashing is arguably the preferred load balancing algorithm in distributed systems. Its implementation is flexible: it can live in the client or in middleware. The memcached and Redis clusters we use every day are typical examples.

The memcached cluster is a special case. Strictly speaking it is a pseudo-cluster, because its servers cannot communicate with each other; request routing depends entirely on the client calculating which server a cached object should live on, and that routing algorithm is consistent hashing.

Redis Cluster has the related concept of hash slots. Although the implementation is different, the underlying idea is much the same, and after reading through consistent hashing in this article, the Redis hash slot will be much easier to understand.

There are many other application scenarios:

  • The RPC framework Dubbo uses it to select service providers
  • Sharding in distributed relational databases (splitting databases and tables): mapping data to nodes
  • The LVS load balancing scheduler
  • …

Conclusion

No technology is perfect, and the consistent hash algorithm has its own potential pitfalls. If there are very many nodes on the hash ring, or they are updated frequently, lookup performance drops. In addition, the whole distributed cache needs a routing service to do the load balancing; once that routing service fails, the entire cache becomes unavailable, so its own high availability has to be considered as well.

But then again, as long as a technology solves the problem well, a few side effects can be tolerated.