Recently I ran into an interesting problem around distributed locks, and there was a lot of interesting discussion along the way, so I'm recording it here.

In most scenarios programmers like to use Redis for distributed locks, but our company's cache service recently banned Lua scripts in the name of standardization, so the original lock implementation had to find another way out. We eventually settled on ZK. Since Go-ZooKeeper only supports blocking locks, I made some modifications to support non-blocking acquisition and a lock-wait timeout; readers with similar needs can refer to

Github.com/zhaocheng-d…

Why we all like Redis for distributed locks

We all know that Redis can lose data, so using Redis for distributed locks is risky. Why do we still like it? As I see it, there are two reasons:

  • The service is usually already using Redis for caching. Choosing something safe and reliable like ZK means introducing a whole ZK deployment on top, which carries high development and operations costs. With a DB you typically have to create tables, and scenarios like non-blocking locks are not supported.
  • Redis is unreliable, but in 99 percent of scenarios it works without error and is fast. Many businesses will tell you their reliability requirements are not that strict, and the cost of repairing data after the occasional failure is acceptable.

Why the Lua scripts had to go

The problem with Lua scripts is that they are uncontrollable. Caches are latency-sensitive, and a large script can block the cache service from responding to other requests, so banning Lua scripts is reasonable. For distributed locks specifically, though, I don't agree with the cache team flatly rejecting the use case. There are two basic ideas that could solve the problem:

  • One: cache services usually sit behind proxy nodes, and the proxy could expose lock and unlock as specific custom commands, just as the Lua scripts did. These two operations are a very common need and are not large transactions.
  • The other: update/delete operations with version numbers can implement a distributed lock just as well, and many cache services now provide something similar (see the sketch below).
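For concreteness, here is the kind of thing the banned scripts actually do. The unlock path is the whole reason Lua is needed: deleting the key only if it still holds your token must be atomic, which is exactly what a versioned delete would also give you. This is a minimal sketch using the go-redis client; the key and token handling are illustrative, not our production code:

```go
package main

import (
	"context"
	"time"

	"github.com/go-redis/redis/v8"
)

// The classic unlock script: delete the lock key only if it still holds our
// token, so a client whose lease expired cannot delete someone else's lock.
const unlockScript = `
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end`

// tryLock takes the lock with SET NX plus a TTL; token identifies the owner.
func tryLock(ctx context.Context, rdb *redis.Client, key, token string, ttl time.Duration) (bool, error) {
	return rdb.SetNX(ctx, key, token, ttl).Result()
}

// unlock releases the lock only if we still own it, atomically.
func unlock(ctx context.Context, rdb *redis.Client, key, token string) (bool, error) {
	n, err := rdb.Eval(ctx, unlockScript, []string{key}, token).Int()
	return n == 1, err
}
```

Either idea covers this pattern: a proxy could expose tryLock/unlock as built-in commands, and a compare-and-delete primitive would replace the Eval outright.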

Why ZK

Technology selection is always a pain; there is inevitably a trade-off, so start from the requirement scenario. I'm responsible for a group of services that kick off asynchronous processes: when an external service calls my system, I first write the data to the database, then pick it up for processing in asynchronous goroutines. A distributed lock is necessary if the same data is not to be processed twice.

Of course, for this scenario alone a DB-based lock would also work, as would switching to a consumption queue, but replacing the Redis lock was a problem that had to be solved regardless. ^_^

In a nutshell, we have a bunch of workers competing for a computing resource, so what I really want is a non-blocking lock with timeout and retry.
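Expressed as an API, the lock I'm after looks roughly like this; the interface and error name are illustrative, not an existing library:

```go
package lock

import (
	"context"
	"errors"
)

// ErrLockHeld is returned immediately when someone else holds the lock.
var ErrLockHeld = errors.New("lock held by another worker")

// Locker is the shape I want: non-blocking failure, caller-driven retry,
// and an overall deadline carried by the context.
type Locker interface {
	// TryLock attempts to acquire the lock, returning ErrLockHeld right
	// away if it is taken, retrying up to a configured number of attempts,
	// and giving up when ctx times out or is cancelled.
	TryLock(ctx context.Context) error
	Unlock() error
}
```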

Next, look at the options in light of the company's current resources:

  • Ideally I would like a cache resource with versioned compare-and-set, but none is available.
  • Both ZK and etcd are functionally fine. Go arguably pairs better with etcd, but etcd is not maintained by our infrastructure team, so ZK is basically the better choice.
  • Along the way I also investigated TiKV, a similar distributed KV store. Unfortunately it is not suitable for distributed locks, because a lock is a single heavily contended key and TiKV has hot-spot problems with that access pattern.

Go-ZooKeeper

Of course ZK can do distributed locks; the problem is the fit with Go. Surveying the Go ZK clients on the market, the better one is Go-ZooKeeper, but this library has not been updated in a long time: the last release was in August 2019, and it only supports blocking locks, so I made some changes around these features. While we're at it, let's review the basic principle of ZK distributed locks.

ZK distributed lock implementation principles

The basic ZK distributed lock can be broken down into a few steps (a Go sketch follows the list):

  1. Create an ephemeral sequential node under the lock path.
  2. Get all children of the lock path.
    1. If the node you created has the smallest sequence number among the children, you hold the lock.
    2. If not, watch the node immediately preceding yours; when it disappears, re-check and eventually acquire the lock. By watching only the predecessor, ZK avoids a thundering herd when the lock is released.
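In Go, the acquisition loop looks roughly like the sketch below. It assumes the go-zookeeper-style API (CreateProtectedEphemeralSequential, Children, ExistsW); the seqOf helper is hypothetical, and handling of session loss is omitted:

```go
package zklock

import (
	"strconv"
	"strings"

	"github.com/samuel/go-zookeeper/zk"
)

// seqOf extracts the trailing sequence number from a node name such as
// "_c_<guid>-lock-0000000042". Hypothetical helper; real parsing needs care.
func seqOf(name string) int {
	n, _ := strconv.Atoi(name[strings.LastIndex(name, "-")+1:])
	return n
}

// acquire implements the steps above: create an ephemeral sequential node,
// then wait until it is the smallest child under root.
func acquire(conn *zk.Conn, root string) (string, error) {
	// Step 1: an ephemeral node vanishes with our session, so a crashed
	// holder cannot leave the lock stuck forever.
	path, err := conn.CreateProtectedEphemeralSequential(
		root+"/lock-", nil, zk.WorldACL(zk.PermAll))
	if err != nil {
		return "", err
	}
	mySeq := seqOf(path)

	for {
		// Step 2: list the children and locate our position.
		children, _, err := conn.Children(root)
		if err != nil {
			return "", err
		}
		prev, lowest := "", true
		for _, c := range children {
			if s := seqOf(c); s < mySeq {
				lowest = false
				if prev == "" || s > seqOf(prev) {
					prev = c // track the closest predecessor
				}
			}
		}
		if lowest {
			return path, nil // step 2.1: smallest child, lock acquired
		}
		// Step 2.2: watch only the immediate predecessor, so a release
		// wakes exactly one waiter instead of the whole herd.
		exists, _, ch, err := conn.ExistsW(root + "/" + prev)
		if err != nil {
			return "", err
		}
		if exists {
			<-ch // predecessor deleted or changed: re-check from the top
		}
	}
}
```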

Retry, non-blocking, and timeout mechanisms

Everything in the previous section is already implemented by Go-ZooKeeper. The problem is that it doesn't support a few small features, so I made some small changes to it; you can see the code at the repo linked above.

  • Retry: the original retry policy was a fixed 3 attempts; I changed it to a caller-specified parameter.
  • Non-blocking: in step 2 above, if the current node is not the smallest child, give up immediately instead of waiting.
  • Timeout: while waiting on the watch, also select on the done channel of the current context (see the sketch below).
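Here is a minimal sketch of how the non-blocking and timeout changes might slot into the loop above (the retry change is simply making the attempt count a parameter instead of the literal 3). ErrLockHeld and waitPredecessor are illustrative names, not the actual code in the repo; this continues the zklock sketch, adding context and errors to its imports:

```go
// ErrLockHeld is the illustrative non-blocking failure value.
var ErrLockHeld = errors.New("zklock: lock held by another node")

// waitPredecessor replaces the watch step (2.2) in the sketch above: it
// parks behind the predecessor while honoring non-blocking mode and the
// context's deadline.
func waitPredecessor(ctx context.Context, conn *zk.Conn, myPath, prevPath string, nonBlocking bool) error {
	// Non-blocking: delete our node so nobody queues behind us, and bail.
	if nonBlocking {
		conn.Delete(myPath, -1)
		return ErrLockHeld
	}
	exists, _, ch, err := conn.ExistsW(prevPath)
	if err != nil {
		return err
	}
	if !exists {
		return nil // predecessor already gone; caller loops and re-checks
	}
	// Timeout: race the watch against the context's done channel.
	select {
	case <-ch: // predecessor deleted or changed: caller re-checks
		return nil
	case <-ctx.Done():
		conn.Delete(myPath, -1) // clean up so no one waits on a corpse
		return ctx.Err()
	}
}
```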

Does Go have a reentrant lock?

Another interesting problem came up during this time. Java has the concept of a reentrant lock, but it doesn't seem to translate to Go.

The main reason is that Go uses goroutines, which do not correspond one-to-one to operating system threads, whereas a Java thread does map one-to-one to an OS thread.

When implementing a reentrant lock, Java's usual strategy is keyed on the current thread: if the lock holder is the current thread, increment a counter. You can't do the same with the current goroutine in Go, because Go does not expose a goroutine ID. One workaround might be to set a UUID into the context every time you spawn a goroutine, but whether that's workable depends on the project.
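For what it's worth, here is a minimal sketch of that context-UUID idea: every entry point stamps its context with an owner ID, and the lock counts re-entries by the same owner. All names are illustrative, and Lock ignores context cancellation for brevity:

```go
package reentrant

import (
	"context"
	"sync"

	"github.com/google/uuid"
)

type ctxKey struct{}

// WithOwner stamps the context with a fresh owner ID; call it once at each
// entry point (request handler, goroutine spawn, ...).
func WithOwner(ctx context.Context) context.Context {
	return context.WithValue(ctx, ctxKey{}, uuid.NewString())
}

// Lock is a reentrant mutex keyed by the owner ID in the context, standing
// in for Java's current-thread check.
type Lock struct {
	mu    sync.Mutex
	cond  *sync.Cond
	owner string
	count int
}

func New() *Lock {
	l := &Lock{}
	l.cond = sync.NewCond(&l.mu)
	return l
}

func (l *Lock) Lock(ctx context.Context) {
	id, _ := ctx.Value(ctxKey{}).(string)
	if id == "" {
		id = uuid.NewString() // unstamped context: treat as a unique owner
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	for l.count > 0 && l.owner != id {
		l.cond.Wait() // held by a different owner: wait for release
	}
	l.owner = id
	l.count++ // same owner re-entering just bumps the counter
}

func (l *Lock) Unlock() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.count--; l.count == 0 {
		l.owner = ""
		l.cond.Broadcast()
	}
}
```

Which is exactly the catch: the ID only travels with the context, so this works only if every code path threads the same ctx through.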

On top of that, I wonder whether reentrancy in Java isn't itself problematic. If the thread that acquired the lock for logic A is later reused for some other logic B (as happens in a thread pool), can't B re-enter the lock that really belongs to A? That seems wrong. The safest approach is probably just not to use reentrant locks.