Last week I spent some time studying RedLock, the algorithm proposed by the author of Redis for implementing a distributed lock. At the bottom of the official document I found this passage:
Analysis of RedLock
Martin Kleppmann analyzed Redlock here. I disagree with the analysis and posted my reply to his analysis here.
Suddenly things did not seem so simple, so I clicked through to have a look. I read the article carefully and discovered a whole new world. So I settled down to study Martin’s criticism of RedLock and the rebuttal from RedLock’s author, Antirez.
Martin’s criticism
Martin opens by asking why we need locks at all. He gives two reasons:
- Efficiency. A lock ensures that a task, such as a very expensive calculation, is not executed twice.
- Correctness. A lock ensures that tasks execute properly, preventing the file conflicts and data loss that occur when two nodes operate on the same data at the same time.
In the first case we can tolerate an occasional locking failure. Even if two nodes do the same work at the same time, the only cost is some extra computation. Here a single Redis instance solves the problem well, and there is no need to use RedLock and maintain several Redis instances, which only raises the maintenance cost of the system.
In the second case, where correctness is strictly required (such as orders or payments), the RedLock algorithm does not guarantee that the lock is correct.
Let’s take a look at the flaws in RedLock:
Martin gave us this graph. As we said last time, locks in RedLock carry an expiration time to prevent deadlocks. Martin seized on exactly this expiration time.
- Client 1 acquires the lock, then stalls in a full GC (FGC) pause that outlasts the lock’s expiration time. The lock is released.
- Client 2 then acquires the lock and commits its data.
- Client 1 wakes up from the FGC pause and commits its data too.
That is bad: the data is now corrupted. RedLock guarantees only that the lock is highly available, not that it is correct.
At this point you might say that Client 1 could avoid the problem by checking that it still owns the lock before committing. It cannot: an FGC pause can occur at any time, and if it occurs after the check but before the commit, the same scenario plays out.
How about a programming language without GC? That does not help either: FGC is only one cause of process pauses. Blocking IO and network congestion or jitter can stall a process in the same way.
At this point, I was in despair. Fortunately, Martin gave me a solution:
Add a fencing token to the lock.
- When acquiring the lock, Client 1 also obtains a monotonically increasing token. In the figure above, Client 1 obtains a fencing token with a value of 33.
- While Client 1 is stalled by the FGC pause described above, Client 2 acquires the lock with token=34.
- When data is submitted, the storage side compares tokens. A write whose token is smaller than the last token submitted is rejected.
The fencing token can be understood as a form of optimistic locking, or CAS.
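The token check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: `FencedStorage` and its `write` method are hypothetical names for an in-memory stand-in, not part of Redis or of Martin's article, and the token values 33 and 34 are taken from the example.

```python
class FencedStorage:
    """Hypothetical in-memory storage service that checks fencing tokens."""

    def __init__(self):
        self.last_token = 0   # highest token accepted so far
        self.value = None

    def write(self, token, value):
        # Reject any write whose token is older than one already accepted.
        if token < self.last_token:
            return False
        self.last_token = token
        self.value = value
        return True


storage = FencedStorage()

# Client 1 acquired the lock with token 33, then stalled in a long pause.
# Client 2 acquired the lock with token 34 and writes first.
client2_ok = storage.write(34, "from client 2")

# Client 1 wakes up and tries to write with its stale token.
client1_ok = storage.write(33, "from client 1")

print(client2_ok, client1_ok, storage.value)
```

Note that the rejection happens on the storage side, so it still works when Client 1 has no idea that its lock expired.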
Martin also pointed out that RedLock relies heavily on the system clock.
The problem again comes down to the expiration time. Suppose a clock error on one Redis master causes its lock to expire and be released early:
- Of the five nodes A, B, C, D, and E, Client 1 obtains the lock on A, B, and C. That is a majority, so Client 1 is considered to hold the lock.
- B’s system clock runs ahead of the others, so B releases the lock before the other two nodes do.
- Client 2 can now obtain the lock on nodes B, D, and E. The result is that two clients hold the lock at the same time across the distributed system.
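The scenario above can be replayed with a small simulation. It is a sketch under assumptions: `Node` is a hypothetical stand-in for a Redis master with its own clock, the 10-second TTL and 11-second clock jump are made-up values, and Client 1 is assumed never to reach D and E.

```python
class Node:
    """Hypothetical stand-in for one Redis master with its own clock."""

    def __init__(self, name):
        self.name = name
        self.clock_offset = 0.0   # seconds this node's clock runs ahead
        self.owner = None
        self.expires_at = 0.0

    def try_lock(self, client, ttl, real_time):
        local = real_time + self.clock_offset   # time as this node sees it
        if self.owner is not None and local < self.expires_at:
            return False                        # still held by someone else
        self.owner = client
        self.expires_at = local + ttl
        return True


nodes = {name: Node(name) for name in "ABCDE"}
TTL = 10.0
majority = len(nodes) // 2 + 1

# Real time 0: Client 1 locks A, B, and C.
got1 = [n for n in "ABC" if nodes[n].try_lock("client1", TTL, real_time=0.0)]

# B's clock jumps ahead by more than the TTL.
nodes["B"].clock_offset = 11.0

# Real time 1: Client 2 tries B, D, and E. B believes the lock has expired.
got2 = [n for n in "BDE" if nodes[n].try_lock("client2", TTL, real_time=1.0)]

# Only 1 second of real time has passed, so Client 1 still trusts its lock.
client1_thinks_held = len(got1) >= majority
client2_thinks_held = len(got2) >= majority
print(got1, got2, client1_thinks_held and client2_thinks_held)
```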
At this point, Martin raised another important design point about distributed systems:
A good distributed system should be asynchronous and should not depend on timing. Distributed systems suffer from process pauses, network latency, and system clock errors. These factors should affect only a system’s liveness property, never its safety property. In other words, in extreme cases the worst a distributed system may do is fail to produce a result within a bounded time; it must never produce a wrong result.
So to summarize Martin’s criticisms of RedLock:
- RedLock is too heavyweight for efficiency scenarios.
- RedLock does not guarantee correctness in scenarios that strictly require it.
At this point I felt enlightened. It was so well argued.
Antirez, the author of both RedLock and Redis, responded to Martin’s article with an equally clear line of reasoning.
Antirez’s response
Antirez saw Martin’s article and wrote an article in response. Would the plot be reversed?
Antirez summarizes Martin’s charges against RedLock:
- Distributed locks with automatic release are mutually exclusive only within the expiration period. Once a lock expires and is released, more than one client may believe it holds the lock.
- RedLock is built on a system model that real systems cannot guarantee: it assumes that time is synchronous and reliable.
On the first charge, Antirez wrote at length and in great detail, but even after a long, careful read the question in my mind was not resolved. Recall RedLock’s steps for acquiring the lock:
1. Record the start time.
2. Try to obtain the lock on each node.
3. Record the time again.
4. Compute how long acquisition took and check that it is shorter than the lock’s expiration time.
5. Hold the lock and do whatever you need to do.
If the program blocks anywhere between steps 1 and 3, RedLock detects that the lock has already expired, so there is no problem. But what if the program blocks after step 4? Antirez’s answer is that no other distributed lock with automatic release solves this problem either.
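The acquisition steps can be sketched as follows. This is a simplified illustration, not the reference implementation: `FakeNode` is a hypothetical in-memory stand-in for a Redis instance, `acquire` is a made-up helper, and the clock-drift margin that the real algorithm also subtracts from the validity time is left out.

```python
import time
import uuid


class FakeNode:
    """Hypothetical in-memory stand-in for a single Redis instance."""

    def __init__(self, delay=0.0):
        self.delay = delay        # simulated network delay in seconds
        self.owner = None
        self.expires_at = 0.0

    def set_if_absent(self, token, ttl):
        time.sleep(self.delay)
        now = time.monotonic()
        if self.owner is not None and now < self.expires_at:
            return False
        self.owner, self.expires_at = token, now + ttl
        return True

    def release(self, token):
        if self.owner == token:
            self.owner = None


def acquire(nodes, ttl):
    """Return (token, remaining_validity) on success, or None on failure."""
    token = uuid.uuid4().hex
    start = time.monotonic()                                  # step 1
    locked = sum(n.set_if_absent(token, ttl) for n in nodes)  # step 2
    elapsed = time.monotonic() - start                        # step 3
    validity = ttl - elapsed                                  # step 4
    if locked >= len(nodes) // 2 + 1 and validity > 0:
        return token, validity                                # step 5
    for n in nodes:                                           # give up: undo
        n.release(token)
    return None


fast = acquire([FakeNode() for _ in range(5)], ttl=1.0)
slow = acquire([FakeNode(delay=0.05) for _ in range(5)], ttl=0.1)
print(fast is not None, slow is None)
```

The `slow` case models a stall between steps 1 and 3: acquisition takes longer than the expiration time, so the attempt is discarded. Nothing in this sketch can detect a stall that happens after `acquire` returns, which is the gap discussed above.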
On the second charge, Antirez argues that in real systems the timing assumption can break in two ways:
- Process pauses and network delays.
- Jumps (steps) in the system clock.
On the first point: as mentioned above, RedLock takes some small precautions but cannot rule the problem out completely. Neither can other distributed locks with automatic release.
On the second point, Martin believes that jumps in the system clock come mainly from two sources:
- Manual modification.
- A clock update from the NTP service that steps the time.
What can you say about manual modification? There is no defense against deliberate sabotage. As for stepped NTP updates, preventing them is the job of operations. When a server’s clock needs a large correction, apply it in small increments: many adjustments, each as small as possible.
As an aside, I suddenly understand the email the operations team sent:
So strictly speaking, RedLock is indeed built on the assumption that time can be trusted. In theory time can be wrong, but in practice good operations and engineering mechanisms can make time as reliable as possible.
Finally, Antirez lands a critical blow: if the system Martin proposes already uses fencing tokens to ensure that data is processed in order, why does it need RedLock, or any other distributed lock, at all?
Review
Reading the two blog posts felt like watching two masters duel in a martial arts film: thoroughly satisfying. Both think very clearly. Martin spots RedLock’s weak point and hammers it; Antirez parries successfully. As for who is right and who is wrong, in my opinion every system design has its own focus and its own limitations. There is no perfect solution in real engineering. We need to understand how a solution works, what its pros and cons are, and where its limits lie, and then decide whether the consequences of those limits are acceptable. Architecture is an art of balance.
Finally
Martin recommended using ZooKeeper for distributed locking. What is the difference between ZooKeeper locks and Redis locks? Does ZooKeeper fix the problem that Redis does not? Stay tuned for the next installment.
References:
- Distributed locks with Redis
- How to do distributed locking
- Is Redlock safe?
- Is Distributed Lock based on Redis secure?