Analysis of Redis distributed locking solution

1 background

Our daily in e-commerce sites shopping often meet some high concurrency scenarios, such as electricity, often appear on the App, limited coupons for seconds kill activity, where shall we go and net’s train ticket ticket system, etc., these scenes have a common characteristic is the traffic surge, although in system design, through the optimization of current limiting, asynchronous, wait in line, But the overall concurrency is still several times more than usual, in order to avoid concurrency problems, prevent inventory overselling, to provide users with a good shopping experience, these systems will use the lock mechanism.

For single-process concurrency scenarios, you can avoid concurrency problems by using locks provided by the programming language and corresponding class libraries, such as the Synchronized syntax in Java and the ReentrantLock class.



In a distributed scenario, it is necessary to use distributed locking technology to realize the synchronous access of different client threads to code and resources and ensure the security of processing shared data under multiple threads.



So what is distributed locking? Distributed lock is a kind of lock implementation to control the access of shared resources between distributed systems or different systems. If different systems or different hosts of the same system share a certain resource, they often need to be mutually exclusive to prevent interference with each other and ensure consistency.

A relatively secure distributed lock generally needs the following features:

  • Mutual exclusivity. Mutex is the basic feature of a lock, which can only be held by one thread at a time, performing critical section operations.
  • Timeout release. By releasing the timeout, you can avoid deadlocks, prevent unnecessary thread waiting and waste of resources, similar to the innodBlockWait_TIMEOUT parameter configuration in MySQL’s InnoDB engine.
  • Reentrancy. A thread holding a lock can request the lock again, preventing the lock from being released until the thread has completed a critical section operation.
  • High performance and availability. The performance overhead of locking and releasing processes should be as low as possible, while also ensuring high availability to prevent the accidental failure of distributed locks.

As you can see, the implementation of distributed locking is not only about locking resources, but also needs to meet some additional characteristics to avoid deadlock, lock failure and other problems.

2. Implementation of distributed lock

At present, there are many ways to implement distributed locking, and the common ones are:

  • Memcached distributed lock

Use Memcached’s add command. This command is atomic and can only add successfully if the key does not exist, which means the thread has acquired the lock.

  • Zookeeper distributed locks

Zookeeper’s sequential temporary nodes are used to implement distributed locking and waiting queues. ZooKeeper is a framework designed specifically for distributed applications. It offers excellent features such as the ephemeral zNode automatic deletion function, as well as a watch mechanism. A distributed lock can be used by the client as if it were a local lock: the lock fails and blocks until the lock is acquired.

  • Chubby

Google implements a coarser-grained distributed locking service that is somewhat similar to ZooKeeper, but with many differences. Chubby uses the Sequencer mechanism to solve the lock failure problem caused by delayed requests.

  • Redis distributed locks

The redis-based distributed lock is implemented in a similar way to Memcached, using the Redis SETNX command, which is also atomic and can only be set if the key does not exist. Redlock, a distributed lock based on Redis multi-machine implementation, is a more secure and effective implementation mechanism proposed by Antirez, the author of Redis, to standardize the implementation of Redis distributed lock.

This paper mainly discusses and analyzes several implementation methods and existing problems of distributed lock based on Redis.

3 Redis distributed locks

With Redis as a distributed lock, the goal is essentially that one process occupies the only “pit” in Redis, and when other processes try to occupy the pit, they either give up or wait to try again later.

At present, there are two main types of distributed lock based on Redis, one is based on single machine, the other is based on Redis multi-machine. No matter which implementation mode, the three core elements of distributed lock, namely lock, unlock and lock timeout, need to be realized.

3.1 Distributed lock based on Redis single machine

3.1.1 Using the SETNX instruction

The simplest way to do this is to use the Redis SETNX directive, which sets the value of the key to value only if it does not exist. If the key already exists, the SETNX command does not take any action. The key is the unique identifier of a lock and can be named according to the resource that needs to be locked for services. For example, in a store in the second kill activity to lock a product, then key can be set to lock_resource_id, value can be set to any value, after the resource is used, use DEL to delete the key to release the lock, the whole process is as follows:

SETNX lock_resource_id lock_value
do something
DEL lock_resource_id

Copy the code

Obviously, this way of acquiring a lock is very simple, but there is also a problem, is that we mentioned above distributed lock lock timeout problems, one of the three core elements, namely if get lock process appeared in the process of business logic to handle exceptions, may lead to DEL instructions have not been able to execute, lead to lock cannot be freed, that resource will be locked forever.



Therefore, after obtaining a lock using SETNX, you must set an expiration time for the key to ensure that even if it is not explicitly released, it will be released automatically after the lock is acquired for a certain period of time to prevent the resource from being monopolized for a long time. Since SETNX does not support setting the EXPIRE time, an additional EXPIRE directive is required. The process is as follows:

SETNX lock_resource_id lock_value
EXPIRE lock_resource_id 10
do something
DEL lock_resource_id

Copy the code

There is still a serious problem with this implementation of distributed locking. Since SETNX and EXPIRE are nonatomic, if an exception occurs between SETNX and EXPIRE, SETNX succeeds, but EXPIRE does not, This causes the lock to become “immortal,” which can lead to the aforementioned lock timeout problem, where other processes cannot properly acquire the lock.

3.1.2 Using the SET extension instruction

To solve the problem of SETNX and EXPIRE being nonatomic, you can use the Redis SET extension to make SETNX and EXPIRE atomic. The process is as follows:

SET lock_resource_id lock_value NX EX 10 
do something
DEL lock_resource_id

Copy the code

In the SET directive:

  • NX indicates that the SET succeeds only when the key corresponding to lock_resource_id does not exist. This ensures that only the first client to request the lock can acquire it, and that no other client can acquire the lock until the lock is released.
  • EX 10 Indicates that the lock will expire after 10 seconds. Services can set the time as required.

However, this approach still cannot completely solve the distributed lock timeout problem:

  • The lock was released prematurely. If thread A takes too long to execute logic between locking and releasing the lock (or if thread A is blocked) to release the lock after its expiration, but thread A’s logic in the critical section has not finished executing, thread B can reacquire the lock before it expires. The critical section code cannot be executed strictly serially.
  • The lock was deleted by mistake. If thread A finishes executing, and it doesn’t know that thread B is holding the lock, thread A will continue to execute the DEL instruction to release the lock. If thread B hasn’t finished executing its logic in the critical section, thread A will actually release the lock on thread B.

To avoid the above, it is recommended not to use a Redis distributed lock in a scenario that takes a long time to execute. In addition, a safer practice is to judge the lock before DEL release and verify whether the current lock is owned by yourself.

The specific implementation is to set the lock value to a unique random number (or thread ID), release the lock to determine whether the random number is consistent, and then execute the release operation, to ensure that the lock is not released by other threads, unless the lock expires and is automatically released by the server, the whole process is as follows:

set lock_resource_id random_value nx ex 10
do something
if random_value==lock_resource_id.value 
 del lock_resource_id 

Copy the code

However, judging value and deleting key are separate operations and are not atomic, so you need to use Lua scripts to handle this, because Lua scripts guarantee the atomic execution of multiple instructions in a row.

if redis.call("get",KEYS[1]) == ARGV[1] then
 return redis.call("del",KEYS[1])
else
 return 0
end

Copy the code



Redis single-node distributed locking is basically complete, but it is not a perfect solution, it is relatively complete, because it does not fully solve the problem of other threads taking advantage of the timeout lock when the current thread is released early.

3 Use the Distributed lock of Redisson

How can you solve the problem of locks being released early?

The reentry feature of the lock can be used to enable the thread that obtains the lock to start a daemon thread of the timer. Every expireTime/3 is executed once to check whether the lock of this thread exists. If it exists, the expiration time of the lock is reset to expireTime, that is, the daemon thread is used to “renew” the lock to prevent the lock from being released early due to expiration.

Of course, the logic for the business to implement this daemon is complex, and some unknown problems may arise.

At present, the open source framework Redisson, which is widely used by Internet companies in production environment, has solved this problem well. It is very simple and easy to use, and supports various deployment architectures such as Redis single instance, Redis M-S, Redis Sentinel and Redis Cluster.

Interested friends can consult the official document or source code:

Github.com/redisson/re…

The implementation principle is shown in the figure (Redis cluster is taken as an example) :



Distributed lock based on Redis multi-machine implementation Redlock

In fact, all the above kinds of distributed locking based on Redis single-machine implementation have a problem, that is, the locking only works on one Redis node. Even though Redis guarantees high availability through Sentinel, the replication of Redis is asynchronous. If the Master node fails to synchronize data after acquiring the lock, threads on other clients can still acquire the lock. Therefore, the lock security is lost.

The whole process is as follows:

  • Client A obtains the lock from the Master node.
  • The Master node is faulty, and the key corresponding to the lock is not synchronized to the Slave node during the Master/Slave replication.
  • The Slave node is upgraded to the Master node, but no data is locked in the Master node.
  • Client B requests a new Master node and obtains the lock of the same resource.
  • Multiple clients hold a lock on the same resource, and the lock is not mutually exclusive.

Because of this, in the distributed environment of Redis, The Redis author Antirez provides the RedLock algorithm to implement a distributed lock, which looks like this:

Assuming that there are N (N>=5) Redis nodes that are completely independent of each other and there is no primary/secondary replication or other cluster coordination mechanism, ensure that locks are acquired and released on these N nodes in the same way as under the Redis single instance.

To obtain a lock, the client must perform the following operations:

  • Gets the current Unix time, in milliseconds.

  • Try to acquire locks from five instances using the same key and a unique value (such as a UUID) in sequence. When requesting a lock from Redis, the client should set a network connection and response timeout, which should be less than the lock expiration time. For example, if the lock automatically expires for 10 seconds, the timeout should be between 5 and 50 milliseconds. This prevents the client from waiting for a response if the server Redis has already died. If the server does not respond within the specified time, the client should try to obtain the lock from another Redis instance as soon as possible.

  • The client uses the current time minus the time when the lock was acquired (the time recorded in Step 1) to obtain the time when the lock was acquired. A lock is obtained successfully if and only if the lock is obtained from most (N/2+1, here is 3 nodes) Redis nodes, and the time used is less than the lock failure time.

  • If a lock is acquired, the true validity time of the key is equal to the validity time minus the time taken to acquire the lock (calculated in Step 3).

  • If, for some reason, the lock fails to be acquired (not in at least N/2+1 Redis instances or the lock time has expired), the client should unlock all Redis instances (using the Redis Lua script).

The process of releasing a lock is relatively simple: the client initiates a lock release to all Redis nodes, including those that fail to lock. Antirez makes a point of this in the description of the algorithm. Why?

The reason is that a node may lose the response packet returned to the client after successfully locking. This situation is possible in the asynchronous communication model: it is normal for the client to communicate with the server, but the opposite direction is problematic. Although for the client, the lock fails due to the response timeout, for the Redis node, the SET instruction is successfully executed, which means that the lock is successfully executed. Therefore, when the lock is released, the client should also make a request to the Redis node that failed to acquire the lock at the time.

In addition, antirez also introduced the concept of delayed restart, which means that a node should not restart immediately after a crash, but wait a period of time before restarting. The period of time should be longer than the valid lock time.

For more in-depth study of Redlock, interested friends can refer to the official document:

Redis. IO/switchable viewer/dist…

conclusion

Distributed system design is to achieve a balance of complexity and benefits, both as safe and reliable as possible, but also to avoid overdesign. Redlock does provide more secure distributed locking, but it comes at a cost, requiring more Redis nodes. In the actual business, the general use of Redis based on the single point of distributed lock can meet most of the needs, occasionally data inconsistency, can be solved by manual intervention to fill the data, is the so-called “technology is not enough, manual to gather”! .