• How to implement a distributed lock based on Redis?
  • Is Redis Distributed Lock really secure?
  • What’s wrong with Redlock? Is it safe?
  • The industry has debated Redlock: what is the debate about, and which view is right?
  • For distributed locks, should you use Redis or Zookeeper?
  • What are the considerations for implementing a “fault tolerant” distributed lock?

How to implement a distributed lock?

Let’s start with the simplest implementation. To serve as a distributed lock, Redis must provide “mutual exclusion”, and the SETNX command gives us exactly that. SETNX means SET if Not eXists: the key is set only if it does not already exist; otherwise nothing is done. If two client processes execute this command, only one can succeed, and that mutual exclusion is what we use to build a distributed lock.

Client 1 applies for the lock and succeeds:

127.0.0.1:6379> SETNX lock 1    // Client 1 acquires the lock successfully
(integer) 1

Client 2 arrives later, applies for the lock, and fails:

127.0.0.1:6379> SETNX lock 1    // Client 2 fails to acquire the lock
(integer) 0

At this point, the client holding the lock can operate on the “shared resource”, for example modify a row in MySQL or call an API. Once the operation is complete, the lock should be released promptly so that later clients get their chance at the shared resource. How do we release the lock? Simply delete the key with the DEL command:

127.0.0.1:6379> DEL lock    // Release the lock
(integer) 1
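For reference, the same naive acquire/release flow from a client program; a minimal sketch using the redis-py client (host, port and the key name “lock” are illustrative assumptions):

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

def acquire():
    # SETNX: returns True only if the key did not exist yet
    return r.setnx("lock", 1)

def release():
    # DEL: simply remove the key
    r.delete("lock")

if acquire():
    try:
        pass  # operate on the shared resource here
    finally:
        release()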

The logic is very simple, and that is the whole flow. However, there is a big problem: once client 1 has taken the lock, a “deadlock” occurs if either of the following happens:

  1. The program hits an exception in its business logic and never releases the lock
  2. The process crashes before it has a chance to release the lock

In either case, the client will hold the lock forever, and no other client will ever be able to acquire it. How do we solve this?

How do I avoid deadlocks?

An easy solution is to attach a “lease” to the lock when it is acquired; in Redis, that means giving the key an “expiration time”. Assuming the operation on the shared resource will not take longer than 10 seconds, we set the key to expire after 10 seconds when locking:

127.0.0.1:6379> SETNX lock 1    // Acquire the lock
(integer) 1
127.0.0.1:6379> EXPIRE lock 10  // Expires automatically after 10s
(integer) 1

This way, even if the client misbehaves, the lock is “automatically released” after 10 seconds and other clients can still acquire it. But is that really enough? There is still a problem: acquiring the lock and setting its expiration are two separate commands. Could it happen that only the first one executes and the second never gets the chance? For example:

  1. SETNX succeeds, but EXPIRE fails because of a network problem
  2. SETNX succeeds, then Redis crashes and EXPIRE never gets to run
  3. SETNX succeeds, then the client crashes and EXPIRE never gets to run

In short, these two commands are not guaranteed to be atomic (to succeed or fail together), so setting the expiration time may fail and the “deadlock” problem reappears. What to do? Before Redis 2.6.12, we had to find our own ways to make SETNX and EXPIRE execute atomically and to handle the various failure cases. From Redis 2.6.12 onward, the SET command was extended with new options, so a single command does the whole job:

127.0.0.1:6379> SET lock 1 EX 10 NX
OK
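From a client library the same atomic command looks like this; a minimal sketch using redis-py (host, key name and TTL are illustrative):

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# One round trip: create the key only if it does not exist (NX)
# and attach a 10-second expiration (EX) at the same time.
locked = r.set("lock", 1, nx=True, ex=10)
if locked:
    print("lock acquired, auto-expires in 10s")
else:
    print("lock is held by someone else")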

This solves the deadlock problem, and quite simply. Let’s keep analyzing, though: what else could go wrong? Consider this scenario:

  1. Client 1 acquires the lock successfully and starts operating on the shared resource
  2. Client 1’s operation on the shared resource takes “longer” than the lock’s expiration time, so the lock is released automatically
  3. Client 2 acquires the lock successfully and starts operating on the shared resource
  4. Client 1 finishes its operation and releases the lock (which now actually belongs to client 2)

See, there are two serious problems here:

  1. The lock expires early: client 1 takes too long on the shared resource, the lock is released automatically and is then acquired by client 2
  2. Releasing someone else’s lock: after finishing its operation, client 1 releases a lock that actually belongs to client 2

What causes these two problems? Let’s look at them one by one.

The first problem comes from estimating the time needed on the shared resource inaccurately. For example, if the “slowest” run might take 15s but we set the expiration time to 10s, the lock risks expiring early. So should we just add some slack, say set the expiration to 20s? That can “mitigate” the problem and lower its probability, but it cannot “eliminate” it. Why? Because after the client acquires the lock, it may hit all kinds of complications while operating on the shared resource: an internal exception, a network request timing out, and so on. Since the expiration time is only an estimate, it can only ever be approximate, unless you can anticipate and cover every scenario that makes the operation take longer, which is hard. Is there a better solution? Hold that thought; I will explain the corresponding solution in detail later.

Now the second problem: one client releases a lock held by another client. Think about it: what is the key point here? The key point is that each client releases the lock “blindly”, without checking whether the lock is still “owned by itself”, so it risks releasing someone else’s lock. Such an unlock procedure is not “rigorous”. How do we fix it?

What if the lock is released by someone else?

The solution is for the client to set a “unique identifier” that only it knows when acquiring the lock. It can be the client’s own thread ID, or a random, unique UUID. Here we use a UUID as an example:

// Set the lock’s VALUE to a UUID
127.0.0.1:6379> SET lock $UUID EX 20 NX
OK

Here we assume that 20s is enough time to operate on the shared resource, and set the issue of automatic expiration aside for now.
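From a client, a sketch of the same thing with redis-py, generating the token with the uuid module (key name and TTL are illustrative):

import uuid
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

token = str(uuid.uuid4())              # unique identifier known only to this client
r.set("lock", token, nx=True, ex=20)   # SET lock $UUID EX 20 NX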

After that, before releasing the lock, the client needs to check whether the lock is still its own. In pseudo-code:

if redis.get("lock") == $uuid:
    redis.del("lock")

Here the lock is released with GET + DEL, two commands, so the atomicity problem comes up again:

  1. Client 1 executes GET and confirms the lock is its own
  2. The lock happens to expire, and client 2 executes SET and acquires it (the probability is low, but we must account for it when reasoning about the lock’s safety)
  3. Client 1 executes DEL, but the lock it releases is now client 2’s

So these two commands still need to execute atomically. How do we make them atomic? With a Lua script. We can put this logic into a Lua script and have Redis execute it: because Redis processes requests in a “single thread”, other requests have to wait while a Lua script runs, so no other command can slip in between the GET and the DEL. The Lua script that safely releases the lock is as follows:

if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
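For reference, here is how a client might run that script atomically; a minimal sketch with redis-py, where KEYS[1] is the lock key and ARGV[1] is our UUID (key name and connection details are illustrative):

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

UNLOCK_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
"""

def release(lock_key, token):
    # EVAL runs the whole script as one atomic unit inside Redis,
    # so nothing can sneak in between the GET and the DEL.
    return r.eval(UNLOCK_SCRIPT, 1, lock_key, token)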

With all these optimizations in place, the lock and unlock process is now much more “rigorous”. Let’s summarize: a rigorous process for a Redis-based distributed lock looks like this:

  1. Acquire the lock: SET lock_key $unique_id EX $expire_time NX
  2. Operate on the shared resource
  3. Release the lock: with a Lua script that first GETs the key to check the lock is still its own, then DELs it
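Putting the three steps together, a minimal sketch of the whole flow in Python with redis-py (key name, TTL and the context-manager shape are illustrative, not a production-ready implementation):

import uuid
from contextlib import contextmanager

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

UNLOCK_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
"""

@contextmanager
def redis_lock(key="lock", ttl=20):
    token = str(uuid.uuid4())                      # unique identifier for this holder
    if not r.set(key, token, nx=True, ex=ttl):     # 1. SET key $unique_id EX $ttl NX
        raise RuntimeError("failed to acquire lock")
    try:
        yield                                      # 2. operate on the shared resource
    finally:
        r.eval(UNLOCK_SCRIPT, 1, key, token)       # 3. release only if still ours

# usage
with redis_lock():
    pass  # e.g. modify a MySQL row, call an API, ...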

Ok, with this complete lock model, let’s go back to the first problem mentioned earlier.

How do we evaluate the lock’s expiration time? As mentioned earlier, if the expiration time is estimated poorly, the lock risks expiring “early”. The compromise was to add some “redundancy” to the expiration time and lower the chance of early expiry, but that does not truly solve the problem. So what can we do?

Consider a scheme like this: when locking, set an expiration time as before, and also start a “daemon thread” that periodically checks the lock’s remaining lifetime; if the lock is about to expire but the operation on the shared resource has not finished, the thread automatically “renews” the lock by resetting its expiration time.

This is in fact the better solution. If you are on the Java technology stack, there is fortunately a library that already encapsulates all of this: Redisson. Redisson is a Java client SDK for Redis; when you use it for distributed locks, it applies exactly this “automatic renewal” scheme to avoid lock expiry. The daemon thread is commonly called the “watchdog” thread (a minimal sketch of the idea follows the Redisson notes below). In addition, the SDK packages a number of other convenient features:

  • Reentrant lock
  • Optimistic locking
  • Fair lock
  • Read-write lock
  • Redlock (more on that below)

The SDK exposes a friendly API that lets you manipulate distributed locks the same way as local locks. If you are on the Java technology stack, you can use it directly.

We will not go into how to use Redisson here; it is fairly simple, and the official GitHub documentation explains it well.
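To make the “watchdog” idea concrete, here is a simplified sketch of the concept, not Redisson’s actual implementation (key name, intervals and the renewal script are assumptions): a daemon thread keeps extending the TTL while the business logic is still running, and stops as soon as the lock no longer belongs to us.

import threading

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# Extend the TTL only if the lock still belongs to us.
RENEW_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("EXPIRE", KEYS[1], ARGV[2])
else
    return 0
end
"""

def start_watchdog(key, token, ttl=30, interval=10):
    stop = threading.Event()

    def renew():
        # Periodically reset the TTL until told to stop.
        while not stop.wait(interval):
            if r.eval(RENEW_SCRIPT, 1, key, token, ttl) == 0:
                break  # the lock is no longer ours; stop renewing

    threading.Thread(target=renew, daemon=True).start()
    return stop  # call stop.set() once the work is done, before releasing the lock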

Let us briefly summarize the problems we have run into so far when implementing a distributed lock on Redis, and the corresponding solutions:

  • Deadlock: set an expiration time
  • Expiration time hard to estimate, lock expires early: a daemon thread that automatically renews the lock
  • Lock released by someone else: write a unique identifier into the lock, and check that identifier before releasing it

What other scenarios can compromise the safety of a Redis lock? Everything analyzed so far concerned locking on a “single” Redis instance and ignored the deployment architecture. In practice, Redis is usually deployed as master-slave replicas plus sentinels: when the master goes down, the sentinels perform “automatic failover” and promote a slave to master, so the service stays available. But does the distributed lock stay safe across a master/slave switch? Consider this scenario:

  1. Client 1 executes SET on the master and acquires the lock successfully
  2. The master crashes before the SET command is replicated to the slave (master/slave replication is asynchronous)
  3. The slave is promoted by the sentinels to be the new master, and the lock is gone on the new master!

As you can see, once Redis replicas are introduced, the distributed lock can still be broken.

How do we solve this? For exactly this problem, the author of Redis proposed the solution we often hear about: Redlock. Can it really solve the problem above?

Is Redlock really secure?

Okay, we finally reach the main topic of this article. What? All those problems above were just the basics? Yes, those were only the appetizer; the really hard part starts here. If anything above was unclear, I suggest reading it again and making sure the basic lock/unlock flow is clear first. If you already know something about Redlock, you can go through it again with me; if you do not, that is fine, I will walk you through it. It is worth noting that I will not just explain how Redlock works; I will also raise many “distributed systems” questions along the way, so follow along and work out the answers in your head.

Now let us see how Redlock, proposed by the author of Redis, solves the lock-loss problem after a master/slave switch. Redlock is built on two premises:

  1. You no longer deploy slaves or sentinel instances, only masters
  2. But you deploy multiple masters; the official recommendation is at least 5 instances

In other words, to use Redlock you need at least 5 Redis instances, all of them masters, with no relationship to one another: completely independent instances.

Note: this is not a Redis Cluster, just 5 plain, standalone Redis instances.

How does Redlock work?

The overall process is as follows, which is divided into five steps:

  1. The client first gets “current timestamp T1”
  2. The client sends lock requests to the 5 Redis instances one by one (using the SET command described above), and each request carries its own timeout (at the millisecond level, far shorter than the lock’s validity time). If locking fails on one instance (network timeout, lock held by someone else, and so on), it immediately moves on to the next instance
  3. If the client locks 3 or more instances (a majority), it gets “current timestamp T2” again; if T2 - T1 is less than the lock’s expiration time, locking is considered successful, otherwise it is considered failed
  4. If locking succeeded, operate on the shared resource (modify a MySQL row, make an API request, and so on)
  5. If locking failed, issue a release request to “all nodes” (using the Lua unlock script shown earlier)

Let me summarize it for you in four key points:

  1. The client acquires the lock on multiple Redis instances
  2. A majority of nodes must be locked successfully
  3. The total time spent locking the majority must be less than the lock’s expiration time
  4. To release a lock, issue a lock release request to all nodes

It may not be easy to grasp on a first read, so I suggest going through the text above a few times. Keeping these five steps in mind is very important: we will follow this process when examining the various scenarios that could make the lock fail.
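To make the flow concrete, here is a minimal sketch of the acquire step under these five rules, using redis-py (the instance addresses, timeouts and the simplified validity check are illustrative assumptions; real implementations, e.g. Redisson’s RedLock, also handle clock drift and retries):

import time
import uuid

import redis

UNLOCK_SCRIPT = ('if redis.call("GET", KEYS[1]) == ARGV[1] then '
                 'return redis.call("DEL", KEYS[1]) else return 0 end')

# Five independent masters (addresses are illustrative)
NODES = [redis.Redis(host="127.0.0.1", port=p, socket_timeout=0.05)
         for p in (6379, 6380, 6381, 6382, 6383)]

def redlock_acquire(key, ttl_ms=10000):
    token = str(uuid.uuid4())
    t1 = time.monotonic()                            # step 1: timestamp before locking
    acquired = 0
    for node in NODES:                               # step 2: try every instance in turn
        try:
            if node.set(key, token, nx=True, px=ttl_ms):
                acquired += 1
        except redis.RedisError:
            pass                                     # timeout/failure: move on to the next node
    elapsed_ms = (time.monotonic() - t1) * 1000      # step 3: total time spent locking
    if acquired >= len(NODES) // 2 + 1 and elapsed_ms < ttl_ms:
        return token                                 # majority locked within the validity time
    redlock_release(key, token)                      # step 5: on failure, release on ALL nodes
    return None

def redlock_release(key, token):
    for node in NODES:
        try:
            node.eval(UNLOCK_SCRIPT, 1, key, token)  # the Lua unlock script shown earlier
        except redis.RedisError:
            pass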

Ok, now that we understand the Redlock process, let us see why Redlock is designed this way.

1) Why lock multiple instances? Essentially, for “fault tolerance”: even if some instances crash, the remaining instances can still grant the lock, and the lock service as a whole stays available.

2) Why must a majority succeed? Multiple Redis instances used together form a “distributed system”, and in a distributed system there will always be “faulty nodes”, so whenever we discuss distributed systems we have to ask how many faulty nodes the system can tolerate while still behaving “correctly”. That is the “fault tolerance” question, and the conclusion here is: if nodes can only “fail” (crash, rather than misbehave), then as long as a majority of nodes are healthy, the whole system can still provide correct service.

The formal model behind this kind of problem is the “Byzantine Generals” problem we often hear about; interested readers can look into the derivation of that algorithm.

3) Why calculate the total locking time after step 3? Because locking touches multiple nodes, it inevitably takes longer than on a single instance, and network requests can hit delays, packet loss, timeouts and so on; the more network requests, the higher the chance of an anomaly. So even if a majority of nodes were locked successfully, if the accumulated locking time has already “exceeded” the lock’s expiration time, the lock on some instances may have expired already, and the lock is meaningless.

4) Why must the lock be released on all nodes? When locking a Redis node, the request may succeed on the server yet look like a failure to the client because of the network. For example, the client may lock an instance successfully but fail to read the response due to a network problem; the lock actually exists on that Redis instance. So regardless of whether locking was deemed successful, the lock must be released on all nodes to clean up any leftovers.

It looks as though Redlock really does solve the problem of a Redis node going down and the lock being lost, guaranteeing the lock’s “safety”. But is that really the case?

Are ZooKeeper-based locks secure?

If you are familiar with Zookeeper, its distributed lock is implemented like this:

  1. Clients 1 and 2 both try to create an “ephemeral node”, for example /lock
  2. Client 1 arrives first and acquires the lock; client 2 fails to lock
  3. Client 1 operates on the shared resource
  4. Client 1 deletes the /lock node and releases the lock
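A minimal sketch of this flow with the kazoo Python client (the hosts string and the /lock path are illustrative; in practice you would more likely use kazoo’s built-in Lock recipe):

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def try_lock():
    try:
        # Ephemeral node: deleted automatically if our session goes away
        zk.create("/lock", ephemeral=True)
        return True
    except NodeExistsError:
        return False   # someone else holds the lock

def unlock():
    zk.delete("/lock")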

Zookeeper relies on this “ephemeral node” to guarantee that client 1 holds the lock for as long as its connection stays alive; and if client 1 crashes unexpectedly, the ephemeral node is deleted automatically, so the lock is always released eventually.

Not bad: no expiration time to worry about, and the lock is released automatically on failure. Doesn’t that sound perfect? It is not. After client 1 creates the ephemeral node, how does Zookeeper know that the client still holds the lock? The answer is that client 1 maintains a Session with the Zookeeper server, kept alive by periodic heartbeats from the client. If Zookeeper stops receiving heartbeats from the client for long enough, the Session expires and the ephemeral node is deleted.

With that in mind, let us look at how a GC pause affects a Zookeeper lock:

  1. Client 1 creates the ephemeral node /lock successfully and acquires the lock
  2. Client 1 hits a long GC pause
  3. Client 1 fails to send heartbeats to Zookeeper, so Zookeeper deletes the ephemeral node
  4. Client 2 creates the ephemeral node /lock successfully and acquires the lock
  5. Client 1’s GC ends, and it still believes it holds the lock (conflict)

So even Zookeeper cannot guarantee safety in the face of process GC pauses or network delays. This is exactly what the Redis author pointed out in his rebuttal: if a client has acquired the lock but then “loses contact” with the lock server (because of GC, for example), then it is not only Redlock that has a problem; other lock services have the same issue, Zookeeper included! So here we can draw a conclusion: in extreme cases, a distributed lock is not necessarily safe. If your business data is very sensitive, keep this in mind when using distributed locks; you cannot assume a distributed lock is 100% safe.

That said, Zookeeper does have its advantages when used for distributed locks:

  1. There is no need to consider the expiration time of the lock
  2. Watch mechanism: if locking fails, the client can watch the node and wait for the lock to be released, achieving an optimistic-locking style of waiting

But its disadvantages are:

  1. Performance is not as good as Redis
  2. Higher deployment and O&M costs
  3. If the client loses its connection to Zookeeper for too long, the lock is released

My understanding of distributed locks

Having gone through the Redis-based (Redlock) and Zookeeper-based distributed locks in detail, including their safety under all kinds of abnormal conditions, let me share my own views, just for reference.

1) Redlock or not? As discussed above, Redlock only works if the clocks are “correct”; if you can guarantee that, you can use it. But keeping clocks correct is, I think, harder than it looks. First, at the hardware level, clock drift is unavoidable: CPU temperature, machine load, even chip quality can cause a clock to drift. Second, in my own work I have run into both clock errors and operators forcibly changing the clock, each of which affected the correctness of the system; human error is hard to rule out completely. So my personal view on Redlock is to avoid it where possible: its performance is worse than single-instance Redis, its deployment cost is higher, and I would still prefer the master-slave + sentinel setup for distributed locking. Then how is correctness guaranteed? The second point gives the answer.

2) How to use a distributed lock correctly? When analyzing Martin’s point of view, we mentioned the fencing token scheme, which I find very inspiring. Although that scheme has real limitations, it is a very good idea for scenarios that must be “correct”. So we can combine the two:

  1. Use the distributed lock at the upper layer to achieve “mutual exclusion”. Even though the lock can fail in extreme cases, it blocks most concurrent requests at the top and relieves pressure on the resource layer.
  2. For business data that must be absolutely correct, add a “safety net” at the resource layer; the design can follow the fencing token scheme.

Combining the two approaches, I think this is enough for the vast majority of business scenarios.
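To illustrate the “safety net” at the resource layer, here is a sketch of the fencing-token idea (the class and its fields are purely illustrative assumptions): the lock service hands out a monotonically increasing token with every lock grant, and the resource layer rejects any write carrying a token older than one it has already accepted.

class FencedResource:
    """Resource layer that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.last_token = 0

    def write(self, token, data):
        # A client whose lock expired and was re-granted to someone else
        # arrives here with a smaller token and gets rejected.
        if token <= self.last_token:
            raise RuntimeError("stale fencing token, write rejected")
        self.last_token = token
        # ... perform the actual write on the shared resource here ...

With a database as the resource layer, the same idea can be a conditional update that only succeeds when the stored token is smaller than the one carried by the request.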

Conclusion

All right, let us sum up. In this article we mainly discussed whether a Redis-based distributed lock is safe: from the simplest implementation, to handling all kinds of abnormal scenarios, then on to Redlock, the debate between the two distributed-systems experts, and the scenarios where Redlock applies. Finally, we also looked at the problems Zookeeper can run into as a distributed lock, and how it differs from Redis. I have summarized all of this in a mind map for your convenience.

Why do most scenarios use Redis for distributed locking instead of ZK?

ZK favors consistency over availability (CP): leader election takes time, and while an election is in progress it cannot serve requests. Redis (AP) offers higher throughput; its cluster, master-slave and sentinel tooling is mature; and its single-threaded event loop with I/O multiplexing keeps it fast.