preface
Consistent Hashing algorithm (Consistent Hashing) is widely used in distributed systems. This paper tries to explain the application of Consistent Hashing algorithm and its related topics in combination with business scenarios.
1 Distributed Cache
With the expansion of services and the sharp increase of traffic, single projects are gradually divided into distributed systems. For frequently used data, we can use Redis as a caching mechanism to reduce the pressure on the data layer. Therefore, the reconstructed system architecture is shown in the figure below:
The simplest strategy for optimization is to save the frequently used data in Redis, and 3 Redis are used for high availability (no clustering, at least 6 are required). Each Redis request is sent to one of them randomly, but this strategy causes two problems:
- The same data may exist in multiple Redis databases, resulting in data redundancy
- A certain data exists in one of the Redis databases, but the Redis database is accessed again, and the database does not match the existing data. There is no guarantee that all accesses to the same key will be sent to the same Redis
To solve the above problem, we need to change the rules for storing keys into Redis slightly: Use the hash algorithm. For example, if there are three Redis, we can calculate the hash value for each access. For example, if h=hash(key)%3, we set the Redis number to 0,1, and 2 to store the hash value. The value of h is equal to the corresponding Redis number. However, hash algorithms also face problems of fault tolerance and extensibility. Fault tolerance means that when a service in the system fails, it does not affect other systems. Scalability means that when new servers are added, the entire system works correctly and efficiently.
Now suppose a Redis server is down, and to fill the gap, remove the down server from the numbering list, move the following servers one bit ahead and subtract their numbering value by one, and then recalculate each key according to h = Hash(key) % 2.
Similarly, if a new server is added, the rule also needs to be recalculated, h = Hash(key) % 4. Therefore, if a server changes in the system, the Hash value will be directly affected, and a large number of keys will be redirected to other servers, resulting in lower cache hit ratio, which is very bad in a distributed system.
A well-designed distributed hashing scheme should have good monotonicity, that is, changes to service nodes do not cause a lot of hash relocation. Hence the consistent hashing algorithm
2. Consistent hashing algorithm
Uniform hash is a special hash algorithm. After using the consistent hash algorithm, a change in the slot number (size) of the hash table requires an average remapping of K/ N keywords, where K is the number of keywords and n is the number of slots. In traditional hash tables, however, adding or removing a slot requires almost all keywords to be remapped.
To put it simply, consistent hash is to organize the entire hash space into a virtual ring. For example, if the value space of hash function H is 0-2^32-1 (the hash value is a 32-bit unsigned integer), the entire hash space ring is as follows:
Next, the server is hashed by IP or host name as a key so that its position in the hash ring can be determined.
For example, we have four data objects A, B, C and D. After hashing, their positions in the ring space are as follows:
3 Fault tolerance and scalability
What about fault tolerance and scalability using consistent hashing algorithms?
3.1 fault tolerance
What happens if RedisService2 goes down?
Then, the node corresponding to data B is saved to RedisService3. Therefore, when one of them goes down, only the previous data is disturbed (the original data is saved to the next server clockwise), and the rest of the data is not disturbed.
3.2 scalability
Let’s consider another case. Suppose we add a server, Redis4, as shown below:
Therefore, the consistent hash algorithm only needs to relocate a small part of the change space for the increase or decrease of nodes, and has good fault tolerance and scalability
4 Virtual Nodes
The previous part is all about the situation that Redis has more nodes and more balanced node distribution. If there are fewer nodes, there will be unbalanced node distribution resulting in data skew.
For example, our system has two Redis, and the locations of the distributed rings are shown in the figure below:
In order to solve the problem of unbalanced data storage, the consistent hash algorithm introduces the virtual node mechanism, that is, multiple hash values are calculated for each node, and each computed result position is placed in the corresponding nodes, which are called virtual nodes.
This can be done by adding a number after the server IP address or host name. For example, three virtual nodes can be added to each service node. RedisService1#1, RedisService1#2, RedisService1#3, RedisService2#1, RedisService2#2, RedisService2#3, RedisService2#1, RedisService2#2, RedisService2#3
The hash algorithm for data location remains the same, but the mapping of virtual nodes to real nodes has been added. For example, data C is saved to the virtual node Redis1#2, where the actual data is saved to Redis1. In this way, the problem of uneven data when there are few service nodes can be solved. In practical applications, the number of virtual nodes is usually set to 32 or more, so even data distribution can be achieved even with a few service nodes.
conclusion
This paper briefly introduces the consistent hash algorithm, the current consistent hash algorithm has basically become the standard configuration of distributed system components, therefore, we are very necessary to understand the algorithm.
Room cleaning
Guangzhou Reed Technology Java development team
Reed Technology – Guangzhou professional Internet software service company
Seize every detail, create every beauty
Follow our official account to learn more
Want to fight with us? Lagou search for “Reed Technology” or join us by sending your resume to [email protected]
Follow us, your comments and likes support us the most