What is distributed locking?

When an application is deployed across multiple machines, we sometimes need to operate on a shared resource safely. Usually we either synchronize access to the resource or use CAS-style operations. For synchronization, we can no longer rely on the synchronized keyword or a local Lock the way we would on a single machine, so we need a distributed lock.

Why do we need distributed locks?

Locks can be divided into two categories according to their use:

  • Multiple clients are allowed to operate on the shared resource

In this case, the operation on the shared resource must be idempotent: no matter how many times it is performed, the result is the same. The lock here is only an efficiency optimization, saving clients from repeating work on the shared resource.

  • Only one client may operate on the shared resource at a time

In this case, the operations on the shared resource are generally non-idempotent; if multiple clients operate on the resource concurrently, the result can be inconsistent or lost data. This is where a distributed lock is required for correctness.

How to implement distributed locking

The main idea behind a distributed lock is to introduce an intermediate layer that every service goes through, so that multiple services can contend for the lock correctly. There are three common implementations:

  • Based on relational databases
  • Based on ZooKeeper
  • Based on Redis

Based on relational databases

There are two ways to implement a distributed lock on a database: one relies on primary key uniqueness, the other on row locks.

Primary key uniqueness implementation

Lock

To compete for the lock, write a record to the database. The record mainly contains the lock id, the current holder, the reentrant count, and the creation time. If the insert succeeds, the current thread has acquired the lock; if the insert fails, the lock is held by someone else, so wait a while and compete again until the lock is acquired or the attempt times out. If the lock is already held by the current thread, update the record with count + 1 and execute the logic protected by the lock. If it is held by another thread, retry.

Release the lock

If the lock is held reentrantly, each release decrements the count by one; once the count drops to zero, the record is deleted and the lock is fully released.
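
A minimal JDBC sketch of this idea, assuming a hypothetical lock_record table (the table and column names are illustrative, not from the original):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

// Assumed table: lock_record(lock_key PRIMARY KEY, holder, reentrant_count, create_time)
boolean tryLock(Connection conn, String lockKey, String holder) throws SQLException {
    String insert = "INSERT INTO lock_record(lock_key, holder, reentrant_count, create_time) "
                  + "VALUES (?, ?, 1, NOW())";
    try (PreparedStatement ps = conn.prepareStatement(insert)) {
        ps.setString(1, lockKey);
        ps.setString(2, holder);
        ps.executeUpdate();
        return true;                       // insert succeeded: we hold the lock
    } catch (SQLIntegrityConstraintViolationException e) {
        // primary key conflict: the lock exists; if we already hold it, bump the reentrant count
        String reenter = "UPDATE lock_record SET reentrant_count = reentrant_count + 1 "
                       + "WHERE lock_key = ? AND holder = ?";
        try (PreparedStatement ps = conn.prepareStatement(reenter)) {
            ps.setString(1, lockKey);
            ps.setString(2, holder);
            return ps.executeUpdate() > 0; // 0 rows updated: held by someone else, wait and retry
        }
    }
}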

Existing problems

  • Timeout protection: a thread may hold the lock for a long time, which can starve other threads forever. One mitigation is a scheduled task that scans for and deletes lock records older than some threshold, but deleting them can itself cause concurrency problems when a locked task legitimately takes a long time to run, so the threshold has to be chosen with care.
  • Single point of failure: you can add a standby database, but running a master-slave pair just for a lock is rather wasteful, and the system is still unavailable during the primary/secondary switchover.
  • Concurrency: under high concurrency, a large number of requests keep hitting the database, wasting resources and possibly bringing the database down. Increasing the interval between lock-acquisition attempts helps, but it also reduces system throughput.


Row locking implementation

Lock

To lock, start a transaction and use SELECT ... FOR UPDATE to take an explicit row lock; this row-level exclusive lock serves as the distributed lock.

Release the lock

Commit the transaction to release the lock.
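
A minimal JDBC sketch of the row-lock approach, reusing the hypothetical lock_record table from the previous sketch:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

void withRowLock(Connection conn, String lockKey) throws SQLException {
    conn.setAutoCommit(false);                       // start a transaction
    try (PreparedStatement ps = conn.prepareStatement(
            "SELECT lock_key FROM lock_record WHERE lock_key = ? FOR UPDATE")) {
        ps.setString(1, lockKey);
        ps.executeQuery();                           // blocks until the row lock is granted
        // ... operate on the shared resource here ...
        conn.commit();                               // committing releases the row lock
    } catch (SQLException e) {
        conn.rollback();                             // rolling back also releases the lock
        throw e;
    }
}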

Existing problems

  • Connection pool exhaustion and transaction timeout issues
  • Single point of failure
  • The row lock may escalate to a table lock (for example, when the locking query does not hit an index)
  • High-concurrency problems (same as above)

Based on Redis

Redis-based distributed locks have evolved from a single instance, through master-slave/sentinel deployments, to clusters. Let's look at the implementation and the problems at each stage, following this evolution.

Single Redis

Lock

SET resource_name random_value NX PX 30000
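
As an illustration, the same command issued from Java with the Jedis client; the key name and timeout are the ones from the command above, and the client API shown is an assumption of this sketch rather than something prescribed by the original text:

import java.util.UUID;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Returns the random value on success (needed later to release the lock), or null on failure.
static String tryLock(Jedis jedis, String key) {
    String randomValue = UUID.randomUUID().toString();   // identifies this lock holder
    String reply = jedis.set(key, randomValue, SetParams.setParams().nx().px(30000));
    return "OK".equals(reply) ? randomValue : null;      // null reply means the key already exists
}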

Unlock

A Lua script is used here because it executes atomically. Without it, the following can happen: thread 1's release is delayed by the network, the lock expires in the meantime and is acquired by another thread, and when thread 1's delete finally reaches Redis it deletes the other thread's lock by mistake. The script checks that the stored random value still belongs to the caller before deleting. Anything atomic works, not just Lua: Alibaba's Tair, for example, provides a CAD (compare-and-delete) operation for releasing locks.

if redis.call("get",KEYS[1]) == ARGV[1] then
    return redis.call("del",KEYS[1])
else
    return 0
end
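
The matching release from Java, passing the script above to EVAL so that the check and the delete run atomically (same assumptions as the acquisition sketch; randomValue is the value returned by tryLock):

import java.util.Collections;
import redis.clients.jedis.Jedis;

static final String UNLOCK_SCRIPT =
    "if redis.call('get', KEYS[1]) == ARGV[1] then "
  + "  return redis.call('del', KEYS[1]) "
  + "else "
  + "  return 0 "
  + "end";

// randomValue must be the value stored when the lock was acquired
static boolean unlock(Jedis jedis, String key, String randomValue) {
    Object result = jedis.eval(UNLOCK_SCRIPT,
                               Collections.singletonList(key),
                               Collections.singletonList(randomValue));
    return Long.valueOf(1L).equals(result);   // 1: our lock was deleted; 0: it was not ours
}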

Existing problems

  • How do you choose the expiration time? After acquiring the lock, thread 1 may fail to finish its work within the expiry time because of network problems or pauses; the lock then expires and is acquired by thread 2, and thread 1 may still complete its operation afterwards and even delete thread 2's lock by mistake. In general, the shared resource is unsafe whenever the operations involved are not atomic and delays cannot be bounded.
  • Some suggest checking, after the client has operated on the shared resource, whether the lock still belongs to it: if so, commit the change and release the lock; if not, do not commit. But "check that the lock is still mine" and "commit the change" are two separate steps that are neither atomic nor free of delay: the lock may still be yours at check time yet no longer yours by the time you commit.
  • A single node has weak disaster-recovery capability.

Master-slave/Sentinel Redis

To address the last problem above (weak single-node disaster recovery), Redis can be deployed in a master-slave/sentinel architecture. Locking and unlocking work the same as above.

Existing problems

Because Redis master-slave replication is asynchronous, thread 1 may acquire the lock on the master while the lock key has not yet been replicated to the slave. If the master then goes down and Sentinel promotes the slave, the lock is lost and another thread can acquire it. The RedLock algorithm, run over multiple independent masters, is intended to address this.

Redis cluster

Antirez, the author of Redis, proposed the RedLock algorithm to deal with the lock loss caused by master-slave replication delay in sentinel mode. The process is as follows (a code sketch follows the list):

  • Get the current time in milliseconds.
  • Request the lock on each of the N nodes in turn, using the same key and random value. In this step the client uses a per-node request timeout that is much smaller than the lock's automatic release time: if the automatic release time is 10 seconds, the per-node timeout might be somewhere in the range of 5-50 milliseconds. This keeps the client from blocking for too long on a failed master; if a master is unavailable, the client should move on to the next one as soon as possible.
  • The client computes how long step 2 took. The lock is considered acquired only if the client succeeded on a majority of the master nodes (at least N/2 + 1) and the total elapsed time is less than the lock's validity time.
  • If the lock was acquired, its effective validity time is the initial validity time minus the time spent acquiring it.
  • If acquisition fails, either because fewer than a majority of the locks were obtained or because the total elapsed time exceeded the validity time, the client releases the lock on every master node, including the ones where it believes acquisition failed.
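
As a sketch of what using the algorithm looks like in practice, here is the Redisson client's RedissonRedLock, which runs RedLock over several independent Redis masters. The addresses and key name are placeholders, and the exact API can differ between Redisson versions:

import java.util.concurrent.TimeUnit;
import org.redisson.Redisson;
import org.redisson.RedissonRedLock;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

static RedissonClient client(String address) {
    Config config = new Config();
    config.useSingleServer().setAddress(address);     // each client points at an independent master
    return Redisson.create(config);
}

public static void main(String[] args) throws InterruptedException {
    RLock lock1 = client("redis://127.0.0.1:6379").getLock("resource_name");
    RLock lock2 = client("redis://127.0.0.1:6380").getLock("resource_name");
    RLock lock3 = client("redis://127.0.0.1:6381").getLock("resource_name");

    RedissonRedLock redLock = new RedissonRedLock(lock1, lock2, lock3);
    // wait up to 500 ms to acquire, auto-release after 10 s
    if (redLock.tryLock(500, 10_000, TimeUnit.MILLISECONDS)) {
        try {
            // ... operate on the shared resource ...
        } finally {
            redLock.unlock();                          // releases the lock on all masters
        }
    }
}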

Existing problems

For the debate around RedLock, see Martin's and Antirez's articles:

  • Martin’s question: martin.kleppmann.com/2016/02/08/…
  • Antirez’s response: antirez.com/news/101

Here is how I understand their argument. I do think RedLock has problems, but whether those problems can be tolerated by the business or worked around is debatable.

First of all, Martin’s doubts can be divided into the following two points:

  • The first applies to Redis distributed locks in general: the timeout cannot be guaranteed to be safe. Martin's suggested solution is to obtain something like a version number (a fencing token) when acquiring the lock and have the resource reject stale tokens, similar to locking and unlocking with CAS.
  • The second point concerns expiration: a system clock error on a Redis master can cause the lock to be released early. The essence of this problem is that time cannot be used as a safety guarantee in a distributed system, where all sorts of things can go wrong: process pauses, network delays, system clock errors, and so on.

Antirez's responses to Martin's points:

  • On the first point (the case where a client whose lock has already timed out is still accessing the shared resource and cannot detect it), Antirez argues that if you need the timeout to be guaranteed you have to introduce a version number anyway, and once you have the version number you no longer need a distributed lock: CAS on the version number is enough. (I don't entirely agree: the distributed lock still keeps most clients from attempting the operation at all, and the version number is a supplement for the one corner case the lock cannot cover. Lock failure due to timeout is a problem no other distributed lock solves either.)
  • On the second point, it is a matter of time:
  • For example, with five nodes A, B, C, D, and E, client 1 acquires the lock on A, B, and C, and then the clock on C jumps forward, so the lock held there expires early. Client 2 can then acquire the lock on C, D, and E, and two clients hold the lock at the same time.
  • Antirez's answer to this clock-jump failure is operational: forbid manual changes to the system time and use an ntpd setup that does not "jump" the system clock (i.e. one that slews it gradually).
  • The last problem is that after a node crashes and restarts, having lost its lock state, multiple clients may end up holding the lock. Antirez's solution is delayed restart: after a node crashes, do not bring it back immediately, but wait for a period longer than the lock's validity time before restarting it.

Based on ZooKeeper

Cluster

Lock and release

A distributed lock can be implemented using the ZooKeeper property that a node with a given path cannot be created twice (a sketch follows the list):

  • Check whether the target node already exists. If it does, wait for the lock.
  • If it does not, create an ephemeral node to indicate that the lock is now held.
  • If the creation fails, the lock has just been taken by another client, so wait for it as well.
  • When the lock is released, or when the holder's session times out, the node is deleted and the clients waiting for the lock are woken up to compete for it again.
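
A minimal sketch of this approach with the raw ZooKeeper Java API; the lock path is a placeholder, and production code would also need proper session handling:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

static void lock(ZooKeeper zk, String lockPath) throws Exception {
    while (true) {
        try {
            // ephemeral node: deleted automatically if the holder's session expires
            zk.create(lockPath, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return;                                    // created successfully: we hold the lock
        } catch (KeeperException.NodeExistsException e) {
            CountDownLatch latch = new CountDownLatch(1);
            // watch the existing node and wait for it to disappear, then try again
            if (zk.exists(lockPath, event -> latch.countDown()) != null) {
                latch.await();
            }
        }
    }
}

static void unlock(ZooKeeper zk, String lockPath) throws Exception {
    zk.delete(lockPath, -1);                           // -1: ignore the version check
}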

Existing problems

  • A thundering-herd (herd effect) problem can occur when many clients wait on the same lock: when the node is deleted, the watchers of all the waiting clients are triggered at once.
    • The thundering herd can be avoided with ephemeral sequential nodes. The general locking/unlocking process is as follows (see the sketch after this list):
  1. Client 1, wanting the lock, creates an ephemeral sequential node such as file_0001 and checks whether it has the smallest sequence number; if so, it holds the lock.
  2. Client 2 then creates file_0002 and checks whether it is the smallest. It is not, so client 2 uses the ZooKeeper API to add a watch on the node immediately before its own and waits for that node to be deleted.
  3. When client 1 releases the lock and file_0001 is deleted, client 2's watch fires and client 2 acquires the lock.
  • It is still possible for multiple clients to operate on the shared resource at the same time.
  • For example, if ZooKeeper does not receive a client's heartbeat for a long time (say, during a long GC pause), it considers the session expired and automatically deletes the ephemeral node created by that session, even though the client may still believe it holds the lock.
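
In practice the ephemeral-sequential-node recipe described above is usually not hand-rolled: Apache Curator's InterProcessMutex implements it (each client watches only its predecessor, avoiding the thundering herd). A sketch, with the connection string and lock path as placeholders:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
            "127.0.0.1:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    InterProcessMutex lock = new InterProcessMutex(client, "/locks/resource_name");
    lock.acquire();                     // blocks until the lock is obtained
    try {
        // ... operate on the shared resource ...
    } finally {
        lock.release();
    }
    client.close();
}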


About the choice of distributed locks

Distributed locks are generally implemented with Redis or ZooKeeper. A comparison of the two:

  • Redis has better read/write performance than ZooKeeper. In high-concurrency scenarios, using ZooKeeper as the distributed lock may lead to failed lock acquisitions and become a performance bottleneck.
  • ZooKeeper is more reliable than Redis as a lock service.
  • ZooKeeper makes it straightforward to implement read/write locks; this is much harder to do with plain Redis.
  • With ZooKeeper's watch mechanism, a client that tries to create the lock node and finds it already exists simply fails and waits; when the node is deleted, ZooKeeper notifies the client through the watch so that it can retry the creation and obtain the lock. This lets a distributed lock be used on the client side like a local lock: a failed acquisition blocks until the lock is obtained. Redis itself provides no such notification mechanism.