This article appeared in:
Distributed lock based on Redis and Redlock algorithm
Wechat official Account: Back-end technology compass
1 introduction
Interested readers can review the previous 4 articles related to Redis’ underlying implementation and engineering architecture:
- Redis interview hot spot bottom implementation part 1
- Redis interview hot spot of the bottom implementation part -2
- Redis interview hot topics of engineering architecture part 1
- Redis interview hot topics of engineering architecture part 2
Today, I will start to learn some practical applications of Redis. I will write several common applications of Redis.
In my opinion, the most typical application of Redis is as a distributed cache system. Other applications are not killer features in nature, but are implemented based on the data types and distributed architecture Redis supports, which are small but beautiful applications.
Combined with the author’s daily work, today we study some things based on Redis distributed lock and Redlock algorithm.
2. The first lock
1. Double-sided lock
In the read-only scenario, there is no need to lock, because the data is consistent. In the read-write mixed or write scenario, if there are no restrictions and constraints, it will cause the situation of writing chaos and data inconsistency.
Any amount of concurrency is meaningless if business security and correctness cannot be guaranteed.
This reminds me of an interesting picture:
High concurrency is mostly a test of the strength of your company’s infrastructure, and the proper use of locks is a measure of individual ability.
Basically everything is two-sided. Locks provide some degree of data consistency, but they also mean complexity in maintenance and use, and of course, performance costs. The biggest lock I’ve seen is probably the CPython interpreter’s global interpreter lock GIL.
There’s no way. It’s terrible. That lock’s not right.
Improper use of locks not only does not solve the data clutter problem, but can lead to more problems such as deadlock, colloquial deadlock phenomenon:
A few years ago this would have been the case: In different places, I need to buy train tickets to go back to my hometown, but I can’t buy tickets if I lose my ID card, and I need to go back to the household registration office of my hometown by train to get a replacement ID card, so it is very difficult to live in this way.
2. Lockless programming
Since locks are so difficult to control, you have to think about high concurrency without locks.
Lockless programming is also a very interesting topic, and I can write an article about it in the future, but I’ll just mention it this time. Open your mind and don’t get stuck in the mindset that all concurrent applications must be locked.
In certain scenarios, a parallel to serial approach is chosen to avoid locking as much as possible. For example:
Post request:
. ABC def/setdata? The req…
If the URI part of the above post request is an overwrite operation, reqid=abc123789def, the service is deployed on multiple machines, and the traffic is forwarded to Nginx on the large front end and then hashed according to the reqID, the configuration of Nginx would look something like this:
upstream myservice
{
Hash assignment based on parameters
hash $urlkey;
server localhost:5000;
server localhost:5001;
server localhost:5002;
}
Copy the code
Nginx load balanced requests with the same reqID will be forwarded to one machine, of course you might say what if the cluster’s machines are dynamically tuned? All I can say is don’t think so much about it, engineering and design it.
However, forwarding to a machine still cannot guarantee serial processing, because the single machine is still multi-threaded, we still need to put all reqID data into the same thread for processing, and finally ensure intra-thread serialization, this needs to rely on the thread pool manager Disper according to the reqID hash module to carry out multi-threaded load balancing.
After Nginx and intra-thread load balancing, the same reqIDS will be processed in serial within the thread, effectively avoiding the use of lock. Of course, this design may cause thread starvation when reqIDS are not balanced, but it can be used in the case of high concurrency and large number of requests.
To describe without drawing is to say nothing:
3. Single lock and distributed lock
Lock according to the scope of use can be simply divided into: single lock and distributed lock.
Linux provides system-level single-machine locks that allow threads to synchronize and mutually exclusive resources to be shared. Single-machine locks allow concurrent control of shared resources between threads within a machine.
In a distributed deployment with high concurrency, mutually exclusive access to resources is often encountered. The most effective and common method is to add a lock to shared resources or operations on shared resources.
A distributed lock is a means of controlling synchronous access to shared resources between distributed systems and is used to coordinate their actions in a distributed system.
3. Distributed lock
1. Introduction to the implementation of distributed locking
Distributed CAP theory tells us that we need to make trade-offs:
Any distributed system has three major features: Consistency, Availability, and Partition Tolerance. However, since network partitioning is not controlled by human, when network partitioning occurs, we must choose between Availability and Consistency.
In most scenarios in the Internet domain, it is necessary to sacrifice strong consistency for high availability of the system, and the system often only guarantees the final consistency. In many scenarios, in order to ensure the final consistency of data, a lot of technical solutions are needed to support, such as distributed transaction, distributed lock and so on.
Distributed locking can be implemented in three ways:
- Create a table in the database based on the database, the table contains fields such as method name, and create a unique index on the method name field. To perform a certain method, data needs to be inserted into the table using this method name. The lock is obtained when the insert succeeds, and the lock is released when the corresponding row data is deleted after the execution
- Based on the cache database, Redis has good performance and convenient implementation, but the single-node distributed lock will cause security problems in the case of failure migration. Redlock is a cluster mode distributed lock proposed by Antirez, the author of Redis, and high availability of distributed lock is realized based on N completely independent Redis nodes
- ZooKeeper is a distributed application coordination service based on Paxos algorithm and an open source component that provides consistency services for distributed applications
2. Conditions for distributed locks
Distributed lock is more complex than single-machine lock in the application of distributed system environment, this article describes the redis-based distributed lock implementation, the lock needs to have some features:
- Mutex At any given time, only one client can hold the lock and all other clients attempting to acquire the lock will fail and return or block and wait
- Robustness when a client crashes while holding a lock and does not actively release it, it also needs to ensure that other clients can successfully lock it, much like C++ ‘s smart Pointers to avoid memory leaks
- A unique locking and unlocking must be done by the same client. A client cannot release a lock that has been added by another client, nor can a lock held by itself be released by another client
- High availability does not depend on all Redis nodes working properly, as long as most Redis nodes are working properly, clients can lock and unlock
3. Distributed lock based on single Redis node
This paper focuses on the Redlock algorithm based on multiple Redis nodes, but before expanding this algorithm, it is necessary to mention the single Redis node distributed locking principle and evolution, because the Redlock algorithm is based on this improvement.
Initially, the lock is distributed using the setnx and expire commands, but these commands are not atomic. If the lock is acquired after setnx but the client is dead, the lock cannot be released after expire is set. In version 2.8, Antirez added an extension to setNx to make setNx and EXPIRE atomic.
In the Redis system with a single matster-slave, normally the Client obtains the lock from the Master and synchronizes it to the Slave. If the Master node fails after the Client succeeds in obtaining the lock and the lock is not synchronized to the Slave, After that, with the help of Sentinel, the Slave is upgraded to Master, but there is no information about the lock that has not been synchronized before. At this time, if a new Client wants to acquire the lock from the new Master, it may occur that two clients hold the same lock. Let’s see the diagram to think about this process:
In order to ensure that its own locks can only be released by itself, it is necessary to add unique verification. In summary, the simple process of acquiring and releasing locks based on a single Redis node is implemented using lua script as follows:
SET resource_name unique_value NX PX 30000 // Release the lock and compare the unique_value to the unique_value to avoid release by errorif redis.call("get",KEYS[1]) == ARGV[1] then
return redis.call("del",KEYS[1])
else
return 0
endCopy the code
These are a few points of distributed locking based on single Redis.
4.Redlock algorithm process
Redlock algorithm is a high availability mode introduced by Antirez on the basis of single Redis node.
In the distributed Redis environment, we assume that there are N completely independent Redis nodes, and we use the same methods to acquire and release locks on the N Redis instances as in the Redis single instance.
Now assume that there are 5 Redis primary nodes (an odd number greater than 3), so that they don’t all go down at the same time. During lock acquisition and lock release, the client will do the following:
- 1. Obtain the current Unix time in milliseconds
- 2. Try to obtain the lock from five instances using the same key and unique value. When requesting the lock from Redis, the client should set a network connection and response timeout period, which should be less than the lock invalid period, to avoid client death
- 3. The client uses the current time minus the lock acquisition time to obtain the lock acquisition time. A lock is successfully acquired if and only if the lock is obtained from more than half of the Redis nodes and is used for less than the lock expiration time
- 4. If a lock is acquired, the true valid time of the key is equal to the valid time minus the time it took to acquire the lock. This is important
- 5. If for some reason, failed to get the lock (no more than half in the instance to lock or lock time was more than effective time), the client should be conducted on all instances of Redis unlock, whether Redis instance lock is successful, because may be the server response message lost but actually succeeded, released, after all, there will be no problem at a time
The above 5 steps are the important process of Redlock algorithm, is also the hot spot of interview, interested readers or record it!
5. The debate about whether Redlock algorithm is safe
1. About Dr. Martin Kleipman
On February 8, 2016, Dr. Martin Kleppmann, an expert on distributed systems, pointed out some principles of distributed lock design in an article “How to Do Distributed Locking” and raised some doubts about Antirez’s Redlock algorithm.
I found Dr. Martin Kleipman’s personal website and a brief introduction.
Check it out with Sogou translator:
1. I am a Senior Research Assistant and attached Lecturer in the Department of Computer Science and Technology, University of Cambridge, funded by the Leffulm Trust Early Career Scholarship and the Isaac Newton Trust. I work on locally first collaborative software and distributed system security. 2. I am also a researcher and director of computer Science research at Corpus Christi College, Cambridge, where I do undergraduate teaching. 3. In 2017, I published a book for O ‘Reilly called Designing Data-intensive Applications. It covers a wide range of databases and distributed data processing system architectures and is one of the publisher’s best sellers. 4. I speak regularly at conferences, and recordings of my speeches have been viewed more than 150,000 times. 5. I have worked on various open source projects, including Automerge, Apache Avro and Apache Samza. 6. From 2007 to 2014, I worked as an industrial software engineer and entrepreneur. I co-founded Rapportive(acquired by linkedin in 2012) and Go Test(acquired by RedDoor Software in 2009). 7. I have composed several musical works, including Death in February (in German), a musical theatrical adaptation of the book by Dunk Delacter, which premiered in 2007 with 150 participants.
A good person is a good person. They can teach, publish books, write open source software, start a business, and write musicals. Good people are good at everything.
2. The main points of Dr. Martin’s article
In the article, Martin Klepman talked about many basic problems of distributed systems, especially the asynchronous model of distributed computing. The article is divided into two parts: The first half describes some principles of distributed locking, and the second half puts forward some views on Redlock:
- Martin points out that even if we had a perfectly implemented distributed lock, we wouldn’t be able to achieve sufficient security without sharing resources involved to provide some kind of fencing mechanism
- Martin pointed out that since Redlock is essentially built on a synchronous model, it has a strong requirement on the system’s time, and its own security is not enough
Martin gives a timing diagram for the Fencing mechanism:
Acquiring a lock of the client when the lock is held may suspend for a long time, despite a timeout lock, to avoid the collapse of the client may never hold locks and will never release it, but if the client’s suspension for longer than a lock at the end of time, and customers don’t realize it expired, it will probably continue to do some unsafe changes, In other words, a held lock expires due to client blocking without being aware of it.
In this case, Martin pointed out that the fencing mechanism should be added, specifically the fencing token isolation mechanism, and also gave a timing diagram:
Client 1 acquires the lock and acquires the token with serial number 33, but then it goes into a long pause until the lock timeout expires, and client 2 acquires the lock and acquires the token with serial number 34 and writes it to the storage service. Client 1 then revives and sends its write to the storage service, however the storage server remembers that it has processed the write 34 with the higher token, so it rejects the request for token 33. The Redlock algorithm does not have such a unique and increasing fencing token generation mechanism, which means that the Redlock algorithm cannot avoid the operation problems after the lock expires due to client blocking, so it is not safe.
This idea I think is not completely solve the problem, because if the client 1 write operation is to perform a success, but due to the blocking timeout can no longer write also creates a mistake as a result, the client 2 will be operated on the results of this error, then any operation is doomed to be wrong.
3. Dr. Martin’s doubts about Redlock
Martin Klepman points out that Redlock is a very time-dependent algorithm, which can lead to a lot of inconsistencies. Here’s an example:
Suppose A multi-node Redis system has five nodes A/B/C/D/E and two clients C1 and C2. What happens if the clock on one of the Redis nodes jumps forward?
- Client C1 obtains A lock on nodes A, B, and C, and due to network problems, method reaches nodes D and E
- The clock on node C jumps forward, causing the lock to expire in advance
- Client C2 is locked on nodes C, D, and E and cannot reach A and B due to network problems
- Both client C1 and client C2 now think they own the lock
Distributed asynchronous model: The above situation is possible because the security of Redlock is strongly dependent on the system clock of Redis node. Once the system clock becomes inaccurate, the security of the algorithm cannot be guaranteed.
Martin is actually pointing out some fundamental problems in the study of distributed algorithms. A good distributed algorithm should be based on an asynchronous model, and its security should not depend on any timing assumptions.
Processes and messages can be delayed for any length of time in a distributed asynchronous model, and the system clock can fail in any way. These factors should not affect its safety, only could affect its activity, even in very extreme cases, most algorithms cannot be given for a limited time as a result, and should not give incorrect results, the algorithm exists in reality such as Paxos/Raft, according to the standards Redlock level of security is unattainable.
4. Dr. Martin’s conclusions and basic views
Martin expressed his views, dividing locks into two uses:
- Efficiency First using distributed locks is only for the purpose of coordinating some simple tasks across multiple clients. The occasional failure of a lock can have other adverse consequences, as if you were sending and receiving the same email twice
- Proper first use of distributed locks requires that under no circumstances should lock failure occur, which could mean data inconsistency, data loss, file corruption, or other serious problems, as with repeated doses of medication
Martin concluded as follows:
- Using distributed locking for efficiency a single Redis node locking scheme is sufficient Redlock is a heavy and expensive design
- Redlock is not a strong enough algorithm to build on an asynchronous model, and its assumptions about the system model contain a lot of dangerous elements
Martin thinks the Redlock algorithm is a bad choice because it is neither fish nor fish: it is too heavy and expensive for efficiency and not safe enough for correctness.
6. Antirez counterattack
Martin’s article was published on June 28, 2016 and Antirez was quick to respond. He published “Is Redlock Safe?” The article is addressed to antirez.com/news/101
Antirez believes that Martin’s criticism of Redlock can be summarized in two aspects:
- A distributed lock with automatic expiration must provide a fencing mechanism to ensure true mutual exclusion of shared resources. The Redlock algorithm does not provide such a mechanism
- Redlock algorithm is built on an insecure system model. It has strong requirements for the timing assumptions of the system, which cannot be guaranteed in the real system
Mr Antirez argues in detail against both.
About Fencing Mechanism
Antirez questioned why use a distributed lock and require such strong security guarantees when an fencing mechanism exists to maintain exclusive access to resources even if the lock fails.
Taking a step back, although Redlock does not provide an increasing Fencing token to isolate tokens, the same effect can be achieved by using the random string generated by Redlock, which is not increasing, but is unique.
The timing hypothesis
Antirez refuted the algorithm in the time model hypothesis set, and Martin believed that there were three main failure cases of Redlock:
- 1. The clock jumps
- 2. Long GC pause
- 3. Long network latency
In the latter two cases, Redlock has been designed and considered at the beginning and is somewhat resistant to the consequences caused by these two problems. Clock jump has a great impact on Redlock, and Redlock will not work properly once this happens. Antirez pointed out that Redlock’s requirements for the system clock do not need to be completely accurate. As long as the error does not exceed a certain range, it will not affect the system, which is completely reasonable in the actual environment, and large clock beats can be completely avoided through proper operation and maintenance.
7. Martin’s summary and reflection
The distributed system itself is very complex, and the effect of the mechanism and theory needs to be based on certain mathematical derivation. Martin and Antirez are both experts in this field, and they have their own views and reflections on some problems. More importantly, there is no perfect solution to the problem in many cases.
The debate was a very good clash of ideas in the field of distributed systems. Many people posted their views and opinions, and Dr. Martin expressed some of his views again some time after Antirez’s reaction:
For me, this is the most important point: I don’t care who is right or wrong in this debate — I care about learning from others’ work, so that we can avoid repeating old mistakes, and make things better in future. So much great work has already been done for us: by standing on the shoulders of giants, we can build better software.
By all means, test ideas by arguing them and checking whether they stand up to scrutiny by others. That’s part of the learning process. But the goal should be to learn, not to convince others that you are right. Sometimes that just means to stop and think for a while.
For Martin, it doesn’t matter who is right or wrong. He is more concerned with learning from the work of others to avoid repeating his own mistakes, just as we do better when we stand on the shoulders of giants.
In addition, only through others’ arguments and tests can we make our own ideas stand the test. Our goal is to learn from each other rather than persuade others to believe that you are right. The so-called one person is short, thinking and refuting can be closer to the truth.
After Antirez’s article was published, distributed system experts and enthusiasts all over the world actively expressed their opinions, and I found a familiar name in the comments:
8. Giant shoulders
- Martin.kleppmann.com/2016/02/08/…
- antirez.com/news/101
- Zhangtielei.com/posts/blog-…
- Zhangtielei.com/posts/blog-…
The two articles written by Tielei Dashen are very good, and this paper makes a lot of references. It is also tielei Dashen’s articles that let the author know about this wonderful huashan jian. those who are interested can directly search and read reference 3 and 4.