[Full version]
There are countless articles online about Redis distributed locks; just type "Redis distributed lock" into any search engine and you will see. The ideas are all similar and the algorithms seem sound, but once you set out to implement them, the more you think, the more doubts creep in.
In fact, about a year ago there was a debate about the safety of Redis distributed locks between distributed systems researcher Martin Kleppmann and Redis author Antirez. Having been interested in this issue for a long time, I recently read through the whole exchange. The story goes like this: to standardize the implementation of distributed locks on top of Redis, the author of Redis proposed a more robust implementation called Redlock. Then one day Martin Kleppmann wrote a blog post analyzing some safety problems with Redlock. The Redis author immediately wrote a rebuttal disputing Martin's analysis, but Martin maintained his original position. The issue then sparked a heated discussion on Twitter and Hacker News, with a number of distributed systems experts weighing in.
For anyone interested in distributed systems, this exchange is worth studying. Whether you are new to distributed systems or a veteran of many years of distributed development, you will probably learn something from the analysis and commentary. Antirez, having personally built a complex system like Redis Cluster, is certainly an expert in the distributed world. Yet when analyzing the problems of distributed locks, different experts reached very different conclusions, which gives us a glimpse of how complex distributed systems problems are. Indeed, in distributed system design it often happens that ideas which look flawless at first glance turn out to be less than perfect on closer examination.
Here we walk through the debate from start to finish, and along the way explore the technical details that affect the safety of distributed locks. It is a long story, and inevitably a little "gossip" is involved.
Redlock algorithm
As mentioned at the beginning of this article, many people have tried to implement a Distributed Lock using Redis. The purpose of such distributed locks is to provide mutually exclusive access to some shared resources.
However, while these implementations are broadly similar, they differ in implementation details and the security and usability they can provide. Therefore, Antirez, the author of Redis, proposed a better implementation, called Redlock, which is the official Redis guideline for implementing distributed locks. Redlock’s algorithm is described on the Redis website:
- redis.io/topics/dist…
Prior to Redlock, many people implemented distributed locks based on a single Redis node. Redlock is an implementation based on multiple Redis nodes (all Master). To understand Redlock, we first need to describe the simple algorithm based on a single Redis node, which is the basis of Redlock.
Distributed lock based on single Redis node
First, the Redis client issues the following command to the Redis node to acquire the lock:
SET resource_name my_random_value NX PX 30000
If the preceding command is executed successfully, the client successfully obtains the lock and can access the shared resource. If the preceding command fails to be executed, the lock fails to be obtained.
Note the following about the SET command above:
- my_random_value is a random string generated by the client. It must be unique across all lock requests from all clients over a sufficiently long period of time.
- NX means the SET succeeds only if the key resource_name does not already exist. This ensures that only the first client can acquire the lock; no other client can acquire it until it has been released.
- PX 30000 gives the lock an automatic expiration time of 30 seconds. Of course, 30 seconds is just an example; the client can choose an appropriate expiration time.
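For concreteness, a minimal sketch of this acquisition step using the redis-py client might look like the following (the host, key name, and the use of uuid4 for the random value are illustrative assumptions, not part of the algorithm):

import uuid
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed single Redis node

resource_name = "resource_name"
my_random_value = str(uuid.uuid4())  # unique across all lock requests

# Equivalent to: SET resource_name my_random_value NX PX 30000
# nx=True  -> succeed only if the key does not already exist
# px=30000 -> let the lock expire automatically after 30 seconds
acquired = r.set(resource_name, my_random_value, nx=True, px=30000)

if acquired:
    pass  # lock held: access the shared resource, then release it (see below)
else:
    pass  # someone else holds the lock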
Finally, when the client is done with the shared resource, execute the following Redis Lua script to release the lock:
if redis.call("get",KEYS[1]) == ARGV[1] then
return redis.call("del",KEYS[1])
else
return 0
end
The Lua script executes by passing the previous my_random_value as ARGV[1] and resource_name as KEYS[1].
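A matching sketch of the release step, reusing the variables from the sketch above and sending the same Lua script through redis-py's eval (again just one way the script might be invoked):

RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""

# KEYS[1] = resource_name, ARGV[1] = my_random_value
released = r.eval(RELEASE_SCRIPT, 1, resource_name, my_random_value)
# released == 1: we deleted our own lock; released == 0: the lock was not ours (or already gone)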
At this point, the distributed lock algorithm based on a single Redis node is described. There are several problems that need to be analyzed.
First, the lock must have an expiration time. Otherwise, if a client that has acquired the lock crashes, or can no longer reach the Redis node because of a network partition, it will hold the lock forever and no other client will ever be able to acquire it. Antirez emphasized this point in his later analysis and called this expiration time the lock validity time. The client that acquires the lock must finish accessing the shared resource within this time.
The second problem: many online articles implement the lock-acquisition step as two separate Redis commands:
SETNX resource_name my_random_value
EXPIRE resource_name 30
Although these two commands perform the same as the SET command described in the previous algorithm, they are not atomic. If the client crashes after executing SETNX, there is no chance to execute EXPIRE, causing it to hold the lock forever.
The third problem, also pointed out by Antirez, is that it is necessary to set a random string my_random_value to ensure that the lock that a client releases must be the same lock that it owns. If the lock is acquired with a fixed value rather than a random string, the following execution sequence might occur:
- Client 1 successfully obtains the lock.
- Client 1 is blocking an operation for a long time.
- When the expiration date is up, the lock is released automatically.
- Client 2 has obtained the lock corresponding to the same resource.
- Client 1 recovers from the block and releases the lock held by client 2.
After that, client 2 has no lock to protect it from accessing the shared resource.
Fourth, the release operation must be implemented as a Lua script. Releasing the lock actually involves three steps: GET, compare, and DEL, and the Lua script guarantees the atomicity of these three steps. Otherwise, if the three steps run as client-side logic, a sequence similar to the one in problem 3 can occur:
- Client 1 successfully obtains the lock.
- Client 1 accesses shared resources.
- To release the lock, client 1 performs a GET to fetch the stored random string.
- Client 1 finds that the value equals the random string it set.
- Client 1 is blocked for a long time for some reason.
- When the expiration date is up, the lock is released automatically.
- Client 2 has obtained the lock corresponding to the same resource.
- Client 1 recovers from the block and executes DEL, releasing the lock now held by client 2.
In fact, in the analysis of problem 3 and problem 4 above, a similar execution sequence could have occurred if the client had not blocked but had a large network delay.
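To make problem 4 concrete, this is what the unsafe, non-atomic client-side release would look like (a sketch reusing the names from the examples above); a pause anywhere between the GET/compare and the DEL opens exactly the window described in the sequence above:

# UNSAFE: three separate round trips executed in client logic
value = r.get(resource_name)               # 1. GET the stored random string
if value == my_random_value.encode():      # 2. compare it with our own value
    # ... if the client pauses here long enough for the lock to expire
    # and be acquired by another client ...
    r.delete(resource_name)                # 3. DEL now releases the other client's lock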
The previous four problems can be handled correctly as long as the implementation pays attention to them. However, Antirez also points out a problem caused by failover that a distributed lock based on a single Redis node cannot solve. It was this problem that led to the emergence of Redlock.
The problem is this: if the Redis node goes down, no client can acquire the lock and the service becomes unavailable. To improve availability, we can attach a Slave to the Redis node; when the Master becomes unavailable, the system fails over to the Slave. However, because Redis master/slave replication is asynchronous, this can break the safety of the lock during failover. Consider the following sequence:
- Client 1 has obtained the lock from the Master.
- The Master is down and the key that stores the lock has not been synchronized to the Slave.
- The Slave is promoted to Master.
- Client 2 has obtained the lock for the same resource from the new Master.
Thus, both client 1 and client 2 hold the lock on the same resource. The security of the lock was breached. Antirez designed the Redlock algorithm for this problem, which we will discuss next.
[Other Questions]
What is the appropriate lock validity time that appears in the previous algorithm? If set too short, the lock may expire before the client finishes accessing the shared resource and lose protection. If the setting is too long, if one client holding the lock fails to release the lock, all other clients will be unable to acquire the lock and thus cannot function properly for a long time. It seems to be a dilemma.
Furthermore, in the previous analysis of the random string my_random_value, Antirez acknowledged in the article that the case of lock expiration due to prolonged client blocking should indeed be considered. If that happens, is the shared resource unprotected? Can Antirez’s redesigned Redlock solve these problems?
Distributed lock Redlock
Because the distributed lock based on a single Redis node introduced previously may cause security problems that cannot be solved during failover, Antirez proposes a new distributed lock algorithm Redlock, which is based on N completely independent Redis nodes (N can be set to 5 in normal cases).
The client running the Redlock algorithm performs the following steps to acquire the lock:
- Gets the current time in milliseconds.
- Execute the lock acquisition operation on each of the N Redis nodes in sequence. This acquisition is the same as the single-node process described earlier: it uses the random string my_random_value and includes the expiration time (e.g. PX 30000, the lock validity time). To keep the algorithm running even if a Redis node is unavailable, each per-node acquisition also has a timeout that is much shorter than the lock validity time (on the order of tens of milliseconds). If the client fails to acquire the lock from one Redis node, it should immediately try the next. Failure here includes any kind of failure, such as the Redis node being unreachable, or the lock on that node already being held by another client.
- Compute how long the whole acquisition took by subtracting the time recorded in step 1 from the current time. If the client obtained the lock from a majority of the Redis nodes (>= N/2+1) and the total time spent did not exceed the lock validity time, the client considers the lock successfully acquired. Otherwise, acquisition has failed.
- If the lock is acquired successfully, the remaining lock validity time should be recalculated: it equals the original validity time minus the time spent acquiring the lock, as computed in step 3.
- If the lock acquisition ultimately fails (perhaps because the number of Redis nodes acquiring the lock is less than N/2+1, or the entire lock acquisition process takes longer than the initial validity of the lock), the client should immediately initiate lock release for all Redis nodes (the Redis Lua script described earlier).
Of course, the process described above is to acquire the lock, and the process of releasing the lock is relatively simple: the client initiates the lock release operation to all Redis nodes, regardless of whether they were successful in acquiring the lock at the time.
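Putting the five steps together, a rough sketch of the acquisition loop might look like this (the node addresses, the 50 ms per-node timeout, and the reuse of RELEASE_SCRIPT from the earlier sketch are assumptions for illustration; the Redlock spec also subtracts a clock-drift allowance, which is omitted here):

import time
import uuid
import redis

NODES = [redis.Redis(host=h, port=6379, socket_timeout=0.05)   # ~50 ms per-node timeout
         for h in ("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5")]
VALIDITY_MS = 30000

def redlock_acquire(resource):
    value = str(uuid.uuid4())
    start = time.monotonic()                          # step 1: record the current time
    acquired = 0
    for node in NODES:                                # step 2: try each node in sequence
        try:
            if node.set(resource, value, nx=True, px=VALIDITY_MS):
                acquired += 1
        except redis.RedisError:
            pass                                      # any failure: move on to the next node
    elapsed_ms = (time.monotonic() - start) * 1000    # step 3: time spent acquiring
    remaining_ms = VALIDITY_MS - elapsed_ms           # step 4: recomputed validity time
    if acquired >= len(NODES) // 2 + 1 and remaining_ms > 0:
        return value, remaining_ms                    # lock held for roughly remaining_ms
    for node in NODES:                                # failure: release on every node
        try:
            node.eval(RELEASE_SCRIPT, 1, resource, value)
        except redis.RedisError:
            pass
    return None, 0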
Since Redlock works as long as a majority of the N Redis nodes are working, it is in theory more available. The failover problem discussed earlier for a single Redis node does not exist in Redlock, but if a node crashes and restarts, lock safety can still be affected. The degree of impact depends on how Redis persists its data.
Suppose there are five Redis nodes: A, B, C, D, and E. Imagine the following sequence of events:
- Client 1 successfully locked A, B, and C, and obtained the lock (but D and E were not locked).
- Node C crashes and restarts, but the lock added by client 1 on node C is not persisted and is lost.
- After node C is restarted, client 2 locks C, D, and E, and the lock succeeds.
Thus, both client 1 and client 2 acquire the lock (for the same resource).
By default, Redis AOF persistence flushes to disk (fsync) once per second, so in the worst case one second of data can be lost. To minimize data loss, Redis can be configured to fsync on every modification, but that degrades performance. Even then, an fsync can still be lost (this depends on the operating system, not on the Redis implementation), so the lock failure caused by a node restart, as analyzed above, remains possible. To deal with this, Antirez proposed the idea of delayed restarts: after a node crashes, it is not restarted immediately, but only after a period longer than the lock validity time. That way, any locks the node participated in will have expired by the time it restarts, and the restart cannot affect existing locks.
One more detail of Redlock is worth a closer look: Antirez states explicitly in the algorithm description that when the lock is finally released, the client should release it on all Redis nodes, including nodes on which it did not successfully acquire the lock at the time. Why? Imagine a lock request from the client reaching a Redis node successfully, the node performing the SET, but the response packet back to the client getting lost. From the client's point of view the request failed (timed out), but from the Redis node's point of view the lock was granted. So when releasing the lock the client must also contact those Redis nodes on which acquisition appeared to fail. This can indeed happen in an asynchronous communication model: the client-to-server direction works fine while the reverse direction fails.
[Other Questions]
In the previous discussion of distributed locks for single Redis nodes, we finally raised the question that if the client blocks for a long time and the lock expires, then it is not safe to access the shared resource (without the protection of the lock). Does this problem improve in Redlock? Obviously, this problem still exists in Redlock.
Furthermore, after a successful acquisition in step 4 of the algorithm, if acquiring the lock took a long time, the recalculated remaining validity time may be very short. Do we still have enough time to access the shared resource? If we decide it is too short, should we release the lock immediately? And how short is too short? Another dilemma of choice.
Martin's analysis
Martin Kleppmann published a blog post entitled "How to do distributed locking" on February 8, 2016, at the following address:
- Martin.kleppmann.com/2016/02/08/…
In this article, Martin discusses many of the fundamental issues of distributed systems (especially the asynchronous model of distributed computing), which is worth reading for practitioners of distributed systems. This article can be roughly divided into two parts:
- The first half is not specifically about Redlock. Martin points out that even a perfectly implemented distributed lock (with automatic expiration) does not provide sufficient safety unless the shared resource itself participates by providing some kind of fencing mechanism.
- The second half is a critique of Redlock itself. Martin argues that because Redlock is essentially built on a synchronous model and makes very strong timing assumptions about the system, its safety is insufficient.
First let’s discuss the key points of the first half. Martin gave the following sequence diagram:
In this sequence diagram, assume the lock service itself works correctly: it always guarantees that at most one client holds the lock at any moment (a lease is equivalent to a lock with automatic expiration). Client 1 acquires the lock and then experiences a long GC pause, during which its lock expires and client 2 acquires the lock. When client 1 recovers from the GC pause, it does not know that its lock has expired and still issues a write request to the shared resource (a storage service in the figure). But the lock is now held by client 2, so the two clients' write requests may conflict (the lock's mutual exclusion has failed).
At first glance one might say: since client 1 does not know its lock has expired after recovering from the GC pause, it could simply check whether the lock has expired before accessing the shared resource. But that does not help at all, because a GC pause can happen at any time, perhaps right after the check.
One might also argue: if the client is implemented in a language without GC, does the problem go away? Martin points out that the system environment is too complex and there are many other causes of pauses, such as page faults from virtual memory or contention for CPU resources. And even without any pauses, network latency can produce the same result.
To sum up, even if the lock service itself is flawless, a long pause or network delay on the client alone can still cause two clients to access the shared resource at the same time. This is exactly the "long client blocking causes the lock to expire" problem we raised earlier.
So how do you solve this problem? Martin came up with a method called fencing token. The fencing token is a monotonically increasing number that is returned to the client when the client successfully obtains the lock. When a client accesses a shared resource, it carries the fencing token, so that the service that provides the shared resource can check it and reject delayed access requests (avoiding conflicts). The diagram below:
In the figure above, client 1 acquired the lock first and therefore had a smaller fencing token equal to 33, while client 2 acquired the lock later and had a larger fencing token equal to 34. After recovering from GC Pause, client 1 still sends access requests to the storage service, but with a fencing token = 33. The storage service discovers that it has already processed request 34, so it will reject request 33. In this way, conflict is avoided.
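A toy sketch of the check Martin expects the storage service to perform: remember the largest token seen for each resource and reject anything not larger (the in-memory dict standing in for real storage is purely illustrative):

class FencedStore:
    """Toy storage service that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self._data = {}
        self._max_token = {}   # resource -> largest fencing token seen so far

    def write(self, resource, token, value):
        # Assumed to run as one serialized step on the storage service.
        if token <= self._max_token.get(resource, -1):
            raise ValueError("stale fencing token: %d" % token)
        self._max_token[resource] = token
        self._data[resource] = value

store = FencedStore()
store.write("file", 34, "written by client 2")   # token 34 processed first
store.write("file", 33, "written by client 1")   # raises: 33 arrives after 34 and is rejected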
Now let’s talk about the second half of Martin’s paper.
Martin constructs some sequences of events in which Redlock fails (two clients holding the lock at the same time) in order to show Redlock's over-reliance on timing. Here is the first example (again assuming five Redis nodes A, B, C, D, E):
- Client 1 successfully acquired the lock from Redis nodes A, B, and C (most nodes). The communication with D and E fails due to network problems.
- The clock on node C jumps forward, causing locks maintained on it to expire quickly.
- Client 2 successfully acquired the lock of the same resource from Redis nodes C, D, and E (most nodes).
- Both client 1 and client 2 now think they hold the lock.
This situation is possible because Redlock's safety property depends heavily on the system clock. If the system clock becomes inaccurate, the safety of the algorithm can no longer be guaranteed. A good distributed algorithm should be based on the asynchronous model, and its safety should not depend on any timing assumptions. In the asynchronous model, processes may pause for arbitrarily long, messages may be delayed in the network arbitrarily long or even lost, and the system clock may fail in arbitrary ways. These factors should affect only the algorithm's liveness property, not its safety property: even in extreme cases (such as a badly wrong system clock), the worst the algorithm may do is fail to produce a result within bounded time; it should never produce a wrong result. Such algorithms exist in the real world, for example the well-known Paxos and Raft. By this standard, Redlock's level of safety clearly falls short.
Martin then gives an example of Redlock failing because of a client GC pause:
- Client 1 initiates lock requests to Redis nodes A, B, C, D, and E.
- Each Redis node has returned the result of the request to client 1, but client 1 enters a long GC pause before receiving the result.
- Locks expire on all Redis nodes.
- Client 2 has obtained the lock on A, B, C, D, and E.
- Client 1 recovers from GC Pause and receives the result of the previous step 2 request from each Redis node. Client 1 thinks it has successfully acquired the lock.
- Both client 1 and client 2 now think they hold the lock.
Martin's example actually has a small problem. In the Redlock algorithm, after completing the acquisition requests to all Redis nodes, the client computes how long the process took and checks whether it exceeded the lock validity time. So in step 5 of the example above, client 1, after recovering from the GC pause, would discover through this check that the lock had expired and would not consider itself to have acquired the lock. Antirez pointed this out in his rebuttal, but Martin argued that this detail has no fundamental impact on Redlock's overall safety.
Setting that detail aside, let us consider what Martin wants this example to show. At first glance it looks much like the GC-pause sequence diagram from the first half of his article, except that there the GC pause happened after client 1 had acquired the lock, whereas here the pause happens before client 1 finishes acquiring it. The emphasis, however, is different. Martin constructs this example to stress that in a distributed, asynchronous environment, a long GC pause or message delay (replace the GC pause with a message delay between the Redis nodes and client 1 and the logic is unchanged) can cause a client to obtain a lock that has already expired. From client 1's point of view, Redlock's safety is broken, because by the time client 1 receives the lock it is already invalid, yet Redlock has also handed the lock to client 2. In other words, the lock expires while the Redis servers are handing it to the client, and the client has no effective way to notice. In the earlier example, by contrast, the lock was still valid when client 1 received it; the lock service itself could be considered intact, and the later trouble arose from the interaction between client 1 and the resource server.
Another insightful point in Martin’s article is the distinction between the uses of locks. He divided the use of locks into two categories:
- For efficiency, coordinate clients to avoid duplication of work. Even if the lock fails occasionally, it is just possible to do some operations more than once without other adverse consequences. Like sending the same email over and over again.
- For correctness. Lock invalidation should never happen under any circumstances, as this could mean data inconsistency, data loss, file corruption, or other serious problems.
Finally, Martin came to the following conclusion:
- If the distributed lock is used for efficiency and occasional lock failures are tolerable, then a lock on a single Redis node is sufficient, simple, and efficient; Redlock is overkill.
- If the distributed lock is used for correctness, do not use Redlock. It is not a strong enough algorithm under the asynchronous model: its assumptions about the system model contain dangerous timing elements, and it offers no mechanism for generating fencing tokens. What should be used instead? Martin suggests considering something like ZooKeeper, or a database that supports transactions.
Martin describes Redlock’s algorithm as follows:
There will be neither fish nor horse.
[Other Questions]
- The fencing token scheme proposed by Martin needs to modify the service that provides shared resources. Is this feasible in reality?
- According to Martin, it appears that if the resource server implements a Fencing token, it can maintain mutually exclusive access to the resource even if the distributed lock fails. Does this mean that distributed locks have no purpose at all?
- The resource server needs to check the size of the fencing token. If the resource access service has multiple nodes (distributed), how to ensure that the fencing token increases on multiple nodes?
- In Martin’s example for fencing tokens, the order of the two fencing tokens arriving at the resource server was reversed (the small fencing token arrived later), and the resource server detected the problem. If GC pause occurs on both client 1 and client 2, both fencing tokens are delayed, and they arrive at the resource server almost simultaneously, but in the same order, the resource server will not be able to check the problem. Is access to the resource in conflict?
- Is the distributed fencing scheme absolutely correct? Can you prove it?
(End of part one)
Since finishing the first half of this topic, I have had many small voices in my head, lingering for a long time, as if quarrelling with each other over trifles. That is what distributed systems topics are like: full of fiddly details, and everyone's argument sounds plausible.
Today we continue with the second half. In this post we start with Antirez's rebuttal to Martin Kleppmann, then move on to some of the discussion on Hacker News, then look at distributed locks based on ZooKeeper and Chubby and compare them with Redlock. Finally, we refer to Martin's own summary of the whole episode.
Antirez's rebuttal
After Martin published "How to do distributed locking", the post was widely discussed on Twitter and Hacker News. But the most interesting thing is what Redlock's author, Antirez, had to say about it.
Martin’s article was published on February 8, 2016, but according to Martin, he sent the draft to Antirez for review a week before it was published, and they discussed it via email. The day after Martin’s article was published, Antirez posted a rebuttal on his blog entitled “Is Redlock Safe?” , the address is as follows:
- antirez.com/news/101
This is a contest between top players. Antirez's article is also very clearly written and goes into great detail. Antirez summarizes the criticism of Redlock in Martin's article in two points (corresponding to the first and second halves of Martin's post):
- Distributed locks with automatic expiration must provide some fencing mechanism to ensure true mutually exclusive protection of shared resources. Redlock does not provide such a mechanism.
- Redlock is built on an insecure system model. It has strong requirements on timing assumptions of the system, and these requirements cannot be guaranteed in real systems.
Antirez refutes both aspects separately.
First, the fencing mechanism. Antirez questions Martin's argument: if a fencing mechanism already exists that keeps resource access mutually exclusive even when the lock fails, why use a distributed lock at all, let alone demand such strong safety from it? And even though Redlock does not provide the incrementing fencing token Martin describes, the random string it generates (my_random_value) can achieve the same effect. The string is not incrementing, but it is unique, and can be called a unique token. Antirez gives an example: it can be used to implement a "Check and Set" operation, described as follows:
When starting to work with a shared resource, we set its state to "<token>", then we operate the read-modify-write only if the token is still the same when we write.
(That is: when we start interacting with the shared resource, we set its state to "<token>", and then perform the read-modify-write only if the token is still unchanged when we write back.)
When I first read this description I did not really understand it. "Check and Set" presumably refers to the familiar CAS operation, but Antirez does not elaborate on how it applies in this scenario (we will return to it later when discussing Hacker News).
Then, Antirez’s rebuttal focuses on the second aspect: model assumptions about timing. As mentioned in our previous analysis of Martin’s article, there are three main situations where Martin believes Redlock will fail:
- The clock jumped.
- Long GC pauses.
- Long network latency.
Antirez clearly realized that the first of these three situations is the most deadly for Redlock: clock jumps. If they happen, Redlock cannot work. For the latter two cases, Redlock was designed with them in mind and is largely immune to their consequences. So Antirez concentrates on arguing that large clock jumps can be avoided with proper operations, and that Redlock's clock requirements can be met by real systems.
Martin gave two specific examples of how the clock might jump:
- The system administrator manually changes the clock.
- A large clock adjustment is received from an NTP service.
Antirez counters:
- Do not manually change the clock; this is simply a matter of operations discipline. Otherwise, if someone manually edits a Raft node's persisted log, even Raft stops working.
- Use an ntpd that does not "jump" the system clock (perhaps through proper configuration), so that clock changes are applied as many small adjustments.
Redlock doesn’t need the clock to be perfectly accurate, it just needs the clock to be nearly accurate. For example, if you were supposed to keep a time of five seconds, you might actually remember 4.5 seconds, and then another 5.5 seconds, with some margin of error. But as long as the error is within a certain range, this doesn’t matter to Redlock. Antirez argues that such demands on clock accuracy are not too high and are perfectly reasonable in a real-world environment.
Well, that’s it. If you believe What Antirez says about clocks here, then the rest of Antirez’s analysis is pretty much in order.
As for the latter two situations in which Martin claims Redlock can fail, Martin's analysis contains exactly the small error mentioned in the first half of this article. In his example of Redlock failing because of a client GC pause, the pause is equivalent to a long message delay between the lock service and the client, and Redlock can handle that. Recall the Redlock algorithm, whose use can be roughly divided into five steps:
- Gets the current time.
- Complete the lock acquisition process (interacting with N Redis nodes).
- Get the current time again.
- Subtract the two times and calculate whether the lock acquisition process took so long that the lock expired. If it doesn’t expire,
- The client holds the lock to access the shared resource.
In Martin’s example, GC pause, or network delay, actually occurs between steps 1 and 3 above. Whatever is causing the big delay between step 1 and step 3 (process pauses, network latency, etc.) can be detected in step 4 and will not allow the client to get a lock that it thinks is valid but actually has expired. Of course, this check depends on the system clock without big jumps. This is why Antirez defended the clock condition earlier.
Some people say, after step 3, there may still be a delay. Yes, Antirez acknowledges this, and he has a very interesting argument for it, which goes like this:
The delay can only happen after steps 3, resulting into the lock to be considered ok while actually expired, that is, we are back at the first problem Martin identified of distributed locks where the client fails to stop working to the shared resource before the lock validity expires. Let me tell again how this problem is common with all the distributed locks implementations, and how the token as a solution is both unrealistic and can be used with Redlock as well.
(The delay can only occur after step 3, which results in the lock being considered valid although it has actually expired. In other words, we are back at the first problem Martin identified: the client fails to finish its work on the shared resource before the lock validity expires. Let me repeat that this problem is common to all distributed lock implementations, and that the token as a solution is both unrealistic and usable with Redlock as well.)
What is this "first problem Martin pointed out" that Antirez refers to? As mentioned in the first part of this article, Martin's post has two halves; the first half is not directly about Redlock, but points out that any distributed lock with automatic expiration can fail without a fencing mechanism. So Antirez is referring to the first half of Martin's post. In other words, the impact of large delays on Redlock is the same as Martin's analysis of all distributed locks in that first half, not something specific to Redlock; Redlock's implementation already makes it as safe as any other distributed lock. Compared with other, supposedly better distributed locks, Redlock does not provide the incrementing token Martin proposes, but Antirez has already argued that demanding such a token is itself "unrealistic", or at the very least that the unique token Redlock can provide achieves the same effect.
Also, Antirez and Martin have the following conversation on Twitter about the impact of big delays on Redlock:
antirez: @martinkl so I wonder if after my reply, we can at least agree about unbound messages delay to don't cause any harm.
Martin: @antirez Agree about message delay between app and lock server. Delay between app and resource being accessed is still problematic.
(Antirez asks: I wonder whether, after my reply, we can at least agree that unbounded message delay does not cause any harm. Martin replies: I agree as far as message delay between the client and the lock server goes; the delay between the client and the resource being accessed is still problematic.)
Martin agrees with the lock validity check Redlock performs in step 4, but he believes the delay between the client and the resource server remains a problem. Martin is a bit vague here; as Antirez analyzed above, the delay between the client and the resource server affects all distributed lock implementations, not just Redlock.
That’s what Antirez said in his blog post. There are a few points worth noting:
- Antirez agrees that large system clock jumps will cause Redlock to fail. He differed from Martin in that he believed that large clock jumps could be avoided in real systems. Of course, it depends on the infrastructure and operation.
- Antirez designed Redlock with network latency and application pauses in mind. However, Antirez acknowledges that there is no good way to deal with the delay between the client and the resource server (that is, the delay that occurs after step 3 of the algorithm).
At this point in the discussion, it doesn’t really matter who’s right or wrong between Martin and Antirez. As long as we have a good understanding of the level of security that Redlock (or any other distributed lock) can provide, we can make our own choices.
Some discussion on Hacker News
There was a lot of discussion on Hacker News about Martin’s and Antirez’s blogs. These discussions are held at the following address:
- According to Martin’s blog discussion: news.ycombinator.com/item?id=110…
- The debate on antirez blog: news.ycombinator.com/item?id=110…
On Hacker News, Antirez actively participates in the discussion, while Martin stays out of it.
Let me share with you some interesting points from these discussions (focusing on the fencing token mechanism).
Antirez didn’t elaborate on the “Check and Set” operation in his blog. Sure enough, someone asked on Hacker News. Antirez responded as follows:
You want to modify locked resource X. You set X.currlock = token. Then you read, do whatever you want, and when you write, you “write-if-currlock == token”. If another client did X.currlock = somethingelse, the transaction fails.
Suppose you want to modify resource X. The steps are:
- Set X.currlock = token.
- Read resource X (including its value and the accompanying X.currlock).
- Write the new value of resource X following the "write-if-currlock == token" logic: if X.currlock is still equal to the token set in step 1, perform the write; if X.currlock has already become some other value, another client is modifying the resource concurrently, and the current modification is abandoned to avoid a conflict.
Then a Hacker News user named Viraptor took issue with this sequence:
- A: X.currlock = Token_ID_A
- A: resource read
- A: is X.currlock still Token_ID_A? yes
- B: X.currlock = Token_ID_B
- B: resource read
- B: is X.currlock still Token_ID_B? yes
- B: resource write
- A: resource write
In the last two steps, clients A and B both write, and their writes conflict. However, this user probably misread Antirez's procedure. As Antirez intends it, checking whether X.currlock has been modified and writing to the resource should together form one atomic operation; only then does the logic hold, otherwise the procedure is badly flawed. This is also why Antirez questioned the fencing mechanism in the first place: if the resource server itself can provide mutually exclusive atomic operations, why is a distributed lock needed at all? So Antirez finds the fencing mechanism cumbersome, and he brought up the "Check and Set" operation only to show that, as far as providing a fencing token goes, Redlock can do the same thing. There is still some ambiguity, though. If "write-if-currlock == token" is an atomic operation, it must execute on the resource server; in that case, why is the second step, "read resource X", needed at all? Unless "read resource X" also executes on the resource server and is contained within the same atomic judge-and-write-back operation. Understood any other way, with the read, the judgement, and the write-back all performed on the client side, it is hard to see how they could be atomic. For the discussion below we will set the "read resource X" step aside.
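Under that second reading, i.e. assuming the compare and the write really are one atomic step on the resource server, a sketch of the "Check and Set" idea might look like this (the threading.Lock simply stands in for whatever server-side mechanism provides that atomicity; it is an assumption, not something Antirez specified):

import threading

class Resource:
    """Toy resource exposing Antirez's 'write if currlock == token' idea."""

    def __init__(self, value):
        self.value = value
        self.currlock = None
        self._mutex = threading.Lock()   # stands in for server-side atomicity

    def set_currlock(self, token):
        with self._mutex:
            self.currlock = token        # step 1: mark the resource with our unique token

    def read(self):
        with self._mutex:
            return self.value            # step 2: read the current value

    def write_if_currlock(self, token, new_value):
        with self._mutex:                # step 3: write back only if the mark is unchanged
            if self.currlock != token:
                return False             # someone else marked it meanwhile: abandon the write
            self.value = new_value
            return True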
This random-token-based "Check and Set" differs from Martin's incrementing fencing token in at least two ways:
- "Check and Set" requires two steps for a write (setting the token, then the judge-and-write-back), whereas the incrementing fencing token mechanism needs only one step (sending the write request to the resource server together with the token).
- The incrementing fencing token preserves the final ordering of operations on the shared resource: an operation that has been delayed too long can no longer touch the resource. The random-token "Check and Set" does not guarantee this ordering; a late-arriving operation may still operate on the shared resource (in a mutually exclusive way, of course).
Regarding the first difference: as we will see later, if the resource server is itself distributed, using an incrementing fencing token also becomes a two-step process.
Antirez believes this difference in ordering is meaningless; what matters is mutually exclusive access. He wrote:
So the goal is, when race conditions happen, to avoid them in some way. …… Note also that when it happens that, because of delays, the clients are accessing concurrently, the lock ID has little to do with the order in which the operations were indented to happen.
(Our goal is to avoid race conditions in some way when they arise. ... Note also that when this happens, for example when clients access concurrently because of delays, the ordering of lock IDs has little to do with the order in which the operations were intended to happen.)
The lock ID here is the same thing as Martin’s incremental token.
Antirez then gives an example of an “add a name to a list” operation:
- T0: Client A receives new name to add from web.
- T0: Client B is idle
- T1: Client A is experiencing pauses.
- T1: Client B receives new name to add from web.
- T2: Client A is experiencing pauses.
- T2: Client B receives a lock with ID 1
- T3: Client A receives a lock with ID 2
You see, two clients (actually web servers) each perform an "add name" operation. A's request arrived before B's, yet B ends up ahead of A: B gets lock ID 1 and completes first, while A gets lock ID 2. So, Antirez says, the ordering of lock IDs has nothing to do with the order in which the operations were actually meant to happen. What matters is serializing the operations into mutually exclusive access; whether the lock ID is an incrementing number or a random token makes no difference.
The fencing token mechanism Martin proposed leaves many questions open, mainly because his description lacks technical detail. From the discussion above, Antirez's view is that it is no different from a random token, and that it requires the resource server itself to provide some mutual exclusion mechanism, which almost makes the distributed lock pointless. On Hacker News there are two other interesting threads of questions about the fencing token:
- (1) Architectural details of the resource server itself.
- (2) The implementation details of checking fencing token by the resource server, such as whether to provide an atomic operation.
On Hacker News, a user named Dwenzek posted the following comment on the above question:
… the issue around the usage of fencing tokens to reject any late usage of a lock is unclear just because the protected resource and its access are themselves unspecified. Is the resource distributed or not? If distributed, does the resource has a mean to ensure that tokens are increasing over all the nodes? Does the resource have a mean to rollback any effects done by a client which session is interrupted by a timeout?
(The issue of using fencing tokens to reject delayed requests is unclear, simply because the protected resource and access to it are not specified. Is the resource service distributed? If so, does it have a way to ensure that tokens increase across all nodes? Does it have a way to roll back the effects of a client whose session was interrupted by a timeout?)
These questions aren’t answered on Hacker News. Flavio Junqueira, another expert on distributed systems, wrote about how a distributed resource server architecture handles fencing tokens in a blog post (we’ll get to that later).
On Hacker News, a user named Reza_n posts the following question:
I understand how a fencing token can prevent out of order writes when 2 clients get the same lock. But what happens when those writes happen to arrive in order and you are doing a value modification? Don’t you still need to rely on some kind of value versioning or optimistic locking? Wouldn’t this make the use of a distributed lock unnecessary?
(I understand how the fencing token prevents out-of-order writes when two clients acquire the same lock. But what happens if the writes happen to arrive in order and they modify the same value? Don't we still have to rely on some kind of version number or optimistic locking? And wouldn't that make the distributed lock unnecessary?)
A Hacker News user named Terr_ answers:
I believe the “first” write fails, because the token being passed in is no longer “the lastest”, which indicates their lock was already released or expired.
(Translation: I think the "first" write request will fail, because the token it carries is no longer "the latest", which means its lock has already been released or has expired.)
Is Terr_ right or wrong? It is hard to say; it depends on how the resource server implements the fencing token check. Let us analyze it briefly.
For simplicity, assume a file server (ignoring the distributed case for now) accessed remotely over RPC, which by itself cannot provide mutually exclusive access to files (otherwise we would not need a distributed lock). Now we add fencing-token checking logic to it, following Martin. Since Martin gave no specifics, we can guess at least two possibilities.
First possibility: we modify the file server code so that it accepts an extra fencing token parameter, and add a simple check in front of all processing: a request is allowed through only if the fencing token it carries is larger than every previously seen value. Once that check passes, the rest of the processing is unchanged.
Now imagine the scenario reza_n describes: GC pauses occur on both client 1 and client 2, both fencing tokens are delayed, and they arrive at the file server almost simultaneously and still in the original order. Our new checking logic would let both requests through, they would then operate on the file almost simultaneously, and they would still conflict. Since Martin claims the fencing token guarantees the correctness of the distributed lock, this first guess is probably not what he meant.
The second possibility, of course, is that we make a bigger change to the file server, so that the token check and the subsequent file processing sit inside a single atomic operation. This is probably closer to Antirez's understanding. In that case, in the scenario reza_n described, both writes would succeed, one after the other, serialized by the atomic operation.
Are ZooKeeper-based distributed locks more secure?
Many people (including Martin) believe that if you want a more reliable distributed lock, you should use ZooKeeper rather than Redis. So, for comparison, let us step away from the main thread for a moment and look at whether a ZooKeeper-based distributed lock provides absolute safety, and whether it too needs a fencing token to protect it.
Here we have to mention a blog post by distributed systems expert Flavio Junqueira, entitled "Note on fencing and distributed locks", at the following address:
- fpj.me/2016/02/10/…
Flavio Junqueira is one of the authors of ZooKeeper; he wrote this post a few days after the Martin/Antirez exchange. In it he describes how to build a distributed lock on ZooKeeper (though it is not the only way):
- The client attempts to create a znode, for example /lock. The first client to do so succeeds and is considered to hold the lock; other clients fail to create it (the znode already exists) and therefore fail to acquire the lock.
- When the client holding the lock is done accessing the shared resource, it deletes the znode so that other clients can acquire the lock.
- The znode should be created as ephemeral. This znode feature guarantees that if the client that created it crashes, the znode is deleted automatically, which in turn guarantees that the lock is eventually released.
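As a rough illustration of these steps (not Flavio's code, just the same idea expressed with the kazoo Python client; the ZooKeeper address is an assumption), acquiring and releasing such a lock could look like this:

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")   # assumed ZooKeeper ensemble address
zk.start()

def try_acquire():
    try:
        # ephemeral=True: the znode disappears automatically if our session dies
        zk.create("/lock", b"", ephemeral=True)
        return True          # we created /lock first, so we hold the lock
    except NodeExistsError:
        return False         # someone else holds the lock

def release():
    zk.delete("/lock")       # done with the shared resource: let others acquire it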
The lock looks perfect, with no Redlock expiration issues and the ability to automatically release the lock when needed. But on closer inspection, not so much.
How does ZooKeeper detect that a client has crashed? Each client maintains a session with one of the ZooKeeper servers, kept alive by periodic heartbeats. If ZooKeeper does not receive the client's heartbeat for a long time (longer than the session expiration time), it considers the session expired, and all ephemeral znodes created through that session are automatically deleted.
Imagine the following execution sequence:
- Client 1 creates the znode /lock and acquires the lock.
- Client 1 enters a long GC pause.
- The session between client 1 and ZooKeeper expires, and the znode /lock is automatically deleted.
- Client 2 creates the znode /lock and thereby acquires the lock.
- Client 1 recovers from the GC pause and still believes it holds the lock.
Finally, both client 1 and client 2 think they have a lock and conflict. This is similar to the situation Martin described earlier in his article when a distributed lock fails due to GC pause.
It seems, then, that a distributed lock implemented with ZooKeeper is not necessarily safe either; it has its own problems. But ZooKeeper, as a framework designed to support distributed applications, does provide some very nice features that a Redis-based solution lacks, such as the automatic deletion of ephemeral znodes mentioned above.
Another useful feature is ZooKeeper's watch mechanism. When a client tries to create /lock and finds it already exists, the creation fails, but the client does not have to treat this as a final failure to acquire the lock. It can enter a waiting state and ask ZooKeeper, through the watch mechanism, to notify it when the /lock node is deleted, at which point it retries the creation (i.e. tries to acquire the lock again). This lets a distributed lock be used on the client like a local lock: a failed lock attempt blocks until the lock is acquired. Redlock has no such facility.
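A sketch of that blocking behaviour, continuing the kazoo example above (the retry loop here is only illustrative and suffers from the herd effect mentioned below; see the link after the summary for the recommended recipe):

import threading

def acquire_blocking():
    while True:
        if try_acquire():                 # from the sketch above
            return
        released = threading.Event()
        # One-shot watch: the callback fires when /lock changes (e.g. is deleted).
        if not zk.exists("/lock", watch=lambda event: released.set()):
            continue                      # it vanished between our attempts: retry immediately
        released.wait()                   # block until ZooKeeper notifies us, then retry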
To summarize, zooKeeper-based locks have two different implementation features compared to Redis-based locks:
- Under normal circumstances, a client can hold the lock for as long as it needs, which guarantees it finishes all required resource access before releasing it. This avoids the dilemma, in the Redis-based lock, of how long to set the lock validity time. In effect the ZooKeeper-based lock relies on sessions (heartbeats) to maintain lock ownership, while Redis has no such session concept.
- A ZooKeeper-based lock supports waiting, via events, for the lock to be released after a failed acquisition attempt, which gives clients more flexibility in how they use the lock.
By the way, the ZooKeeper-based distributed lock described above is not an optimal implementation; it suffers from the "herd effect", which hurts lock acquisition performance. See the link below for a better implementation:
- Zookeeper.apache.org/doc/r3.4.9/…
Let's return to Flavio Junqueira's analysis of fencing tokens. He points out that the essence of the fencing mechanism is that the client "marks" the shared resource each time it accesses it, before performing any operation, and this mark ensures that requests carrying an old lock (if delayed) cannot operate on the resource. The mark can take many forms, and the fencing token is a typical one.
Flavio Junqueira then discusses protecting a shared resource with an increasing epoch number (equivalent to Martin's fencing token). For a distributed resource, assume for the sake of discussion that it is a small replicated data store whose writes go to all nodes. The simplest approach is to mark every resource node with the epoch number before performing any operation on the resource; each node then guarantees that an old (i.e. smaller) epoch number can no longer manipulate the data.
Of course, a fuller discussion would get into the implementation details of this data store. In a real system, for example, fault tolerance might be achieved by requiring the marking and the writes to succeed on a majority of nodes (Flavio Junqueira does not elaborate). What matters most here is how this marking operation works. It resembles the Paxos protocol, which requires every proposal to carry an increasing number and a prepare phase to run before the accept phase. The random token Antirez proposes clearly does not satisfy this notion of a "mark", because it cannot distinguish a new token from an old one; only increasing numbers guarantee eventual convergence to the result of the latest operation.
In this distributed data store example (the shared resource), when the client writes after marking, each storage node must check whether the epoch number is up to date before deciding whether to allow the write. Following the line of analysis in the previous section: are the epoch check and the subsequent write one atomic operation? From Flavio Junqueira's description, we believe they are. Given that the resource itself can provide atomic, mutually exclusive operations, is the distributed lock still meaningful? Arguably yes: clients can use the distributed lock to avoid contention and wait for their turn to write, which is useful for efficiency, especially for a distributed resource with multiple nodes.
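A sketch of that marking-then-writing flow for the small replicated store, with the per-node epoch check and write assumed to be atomic on each node (majority/fault-tolerance details, which Flavio also leaves open, are omitted):

class StorageNode:
    def __init__(self):
        self.epoch = -1
        self.data = {}

    def mark(self, epoch):
        # Accept the mark only if it is at least as new as anything seen before.
        if epoch >= self.epoch:
            self.epoch = epoch
            return True
        return False

    def write(self, epoch, key, value):
        # Assumed atomic on the node: check the epoch, then apply the write.
        if epoch < self.epoch:
            return False        # an old (smaller) epoch can no longer touch the data
        self.data[key] = value
        return True

nodes = [StorageNode() for _ in range(3)]

def replicated_write(epoch, key, value):
    for n in nodes:             # mark every node before operating on the resource
        n.mark(epoch)
    return all(n.write(epoch, key, value) for n in nodes)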
How does Chubby's distributed lock do fencing?
You can’t mention distributed locks without mentioning Google’s Chubby.
Chubby is a distributed lock service used inside Google, somewhat similar to ZooKeeper but also different in many ways. The main public material on Chubby is the paper "The Chubby lock service for loosely-coupled distributed systems", available here:
- Research.google.com/archive/chu…
In addition, there is a nice talk about Chubby on YouTube.
- www.youtube.com/watch?v=PqI…
Chubby, of course, also considered lock failure due to delays. One passage in the paper describes it as follows:
a process holding a lock L may issue a request R, but then fail. Another process may acquire L and perform some action before R arrives at its destination. If R later arrives, it may be acted on without the protection of L, and potentially on inconsistent data.
A process holding lock L makes request R, but the request fails. Another process acquires lock L and performs some action before request R reaches the destination. If request R later arrives, it is possible to operate without the protection of lock L, with the potential risk of inconsistent data.
This is very similar to Martin’s analysis.
The mechanism Chubby proposes to solve (or at least mitigate) this problem is called a sequencer, which is similar to the fencing token. A lock holder can request a sequencer at any time; it is a byte string consisting of three parts:
- Lock name.
- Lock acquisition mode (exclusive or shared).
- Lock generation number (a 64-bit monotonically increasing number), which serves the same purpose as a fencing token or epoch number.
The client passes the sequencer to the resource server when it operates on the resource, and the resource server is responsible for checking the sequencer's validity. There are two ways to check:
- Call the API Chubby provides, CheckSequencer(), passing the whole sequencer in for checking. This verifies that the lock the client holds is still valid during resource access.
- Compare the sequencer sent by the client with the latest sequencer the resource server has observed. This can be understood as something like the fencing token check Martin describes.
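A sketch of that second, simpler kind of check, with the sequencer reduced to a plain named tuple (CheckSequencer() is Chubby's own API and is not reproduced here; this comparison logic is an illustration, not Chubby's implementation):

from collections import namedtuple

# lock name, acquisition mode, 64-bit monotonically increasing generation number
Sequencer = namedtuple("Sequencer", ["lock_name", "mode", "generation"])

class ResourceServer:
    def __init__(self):
        self._latest = {}       # lock name -> largest generation number observed so far

    def handle(self, seq, request):
        # Reject requests whose generation is older than one already observed.
        if seq.generation < self._latest.get(seq.lock_name, -1):
            return "rejected: stale sequencer"
        self._latest[seq.lock_name] = seq.generation
        return "processed: " + request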
Of course, if the resource service itself cannot easily be modified, for compatibility reasons, Chubby also provides a third mechanism:
- lock-delay. Chubby lets a client specify a lock-delay value (default 1 minute) for the locks it holds. When Chubby detects that a client has lost contact, it does not release that client's locks immediately; instead it blocks other clients from acquiring them for the lock-delay period. This gives requests already in flight time to drain before the lock is handed to a new client, minimizing the harm that late-arriving, unprocessed requests can do.
So Chubby offers three ways of coping with lock failure: the CheckSequencer() check, comparing against the latest observed sequencer, and lock-delay, providing safety from strong to weak. And none of these by itself guarantees absolute correctness. But Chubby does provide a monotonically increasing lock generation number, which the resource server can exploit when it needs stronger safety.
About the clock
In the debate between Martin and Antirez, the most serious conflict is whether the assumption of the system clock is reasonable. Martin believes that the system clock will inevitably jump (which is consistent with the asynchronous model of distributed algorithm), while Antirez believes that in practice, the system clock can ensure that no large jump occurs.
Martin had this to say about the disagreement (quoted):
So, fundamentally, this discussion boils down to whether it is reasonable to make timing assumptions for ensuring safety properties. I say No, Salvatore says yes — but that’s OK. Engineering discussions rarely have one right answer.
(Ultimately, the discussion comes down to whether it is reasonable to make timing assumptions in order to ensure safety properties. I say it is not; Antirez says it is, but that is fine. Engineering discussions rarely have one right answer.)
So can the clock in a real system be trusted? Julia Evans wrote an article, "TIL: clock skew exists", that collects and analyzes a good deal of real-world data on clock skew. The article is here:
- jvns.ca/blog/2016/0…
Julia Evans concludes at the end of her article:
Clock skew is real (clock skew does exist in reality)
Martin’s post-mortem
As we noted earlier, Martin stayed largely on the sidelines while the debate raged. After the event, however, he collected the whole affair into a long summary. If you want the most complete picture of what happened before and after, this summary is recommended reading:
- Storify.com/martinkl/re…
At the end of this story summary, Martin writes a number of sentimental comments:
For me, this is the most important point: I don’t care who is right or wrong in this debate — I care about learning from others’ work, so that we can avoid repeating old mistakes, and make things better in future. So much great work has already been done for us: by standing on the shoulders of giants, we can build better software. …… By all means, test ideas by arguing them and checking whether they stand up to scrutiny by others. That’s part of the learning process. But the goal should be to learn, not to convince others that you are right. Sometimes that just means to stop and think for a while.
The most important thing for me is this: I don’t care who is right or wrong in this debate — I only care about what can be learned from the work of others so that we can avoid repeating the mistakes of the past and make the future better. Great things have been done for us: we can build better software on the shoulders of giants. . Be sure to examine any ideas thoroughly, by proving them and checking that they stand up to scrutiny. That’s part of the learning process. But the goal should be to gain knowledge, not to convince others that you are right. Sometimes it just means to stop and think.
The distributed lock debate has been thoroughly reviewed and analyzed.
Following the two uses of locks: if the lock is only for efficiency, you can choose your own distributed lock implementation, as long as you are clear about its safety shortcomings and their consequences. If the lock is for correctness, please be careful. In this article, the furthest we have gone toward correctness is the analysis of ZooKeeper distributed locks, monotonically increasing epoch numbers, and the marking of distributed resources; please examine those arguments carefully.
Martin has left us many open questions, especially around the fencing token mechanism he proposed. He writes on his blog that he will discuss it in more detail in chapters 8 and 9 of his new book, Designing Data-Intensive Applications, which at the time of writing was still in pre-release. It feels like a book worth waiting for, not a slim volume published for fame or money; you can tell the author has put real effort into it.
Finally, I believe that this discussion is far from over. Distributed Locks and the fencing scheme can be a long-term project that we can think about over time as we learn more about Distributed systems. Think about its deeper nature and its theoretical proof.
(The End)
Thanks to:
I would like to extend my heartfelt thanks to several friends who took the time to review the draft of this article: Fu Lei, author of CacheCloud, Li Weibo of Kuaishou, Li Bo of Ali. Of course, if there are any mistakes or omissions in the article, I am responsible for them.