Preface

In distributed systems, the Redis distributed lock is simple and efficient, so it has become the first choice for distributed locking and is used in many business scenarios.

In particular, the emergence of distributed configuration centers such as Apollo and Nacos has diminished ZooKeeper’s standing. ZooKeeper distributed locks are also more complex and less convenient to use.

So let’s use redis distributed locks.

That is not to say you can rest easy once you adopt Redis distributed locks; used carelessly, they can lead to some unexpected trouble.

Today we focus on the common pitfalls of Redis distributed locks, for the reference of anyone who needs it.


1 Non-atomic operation

With Redis distributed locks, the setNx command is probably the first thing that comes to mind.

if (jedis.setnx(lockKey, val) == 1) {
   jedis.expire(lockKey, timeout);
}

Easy. In next to no time, the code is written.

This code does lock successfully, but have you noticed any problems?

The locking operation is separate from the subsequent set timeout and is not atomic.

If setting the lock succeeds but setting the timeout fails (for example, the process crashes in between), the lockKey never expires. In high-concurrency scenarios, if a large number of lockKeys are locked but never expire, Redis may simply run out of memory.

So, is there a locking command that guarantees atomicity?

The answer is: yes, see below.

2 Forgot to release the lock

As we saw, with setNx the lock and the timeout are set separately, not atomically.

There is also the set command in Redis, which can specify multiple parameters.

String result = jedis.set(lockKey, requestId, "NX", "PX", expireTime);
if ("OK".equals(result)) {
    return true;
}
return false;

Among them:

  • lockKey: Lock identifier
  • requestId: the request id
  • NX: Sets the key only when the key does not exist.
  • PX: Sets the key’s expiration time, in milliseconds.
  • expireTime: Expiration time

The set command is an atomic operation, locking and setting a timeout, which can be easily done with a single command.

nice

Using the set command to lock looks fine on the surface. But think about it: is it reasonable to rely solely on expiry to release the lock every time? If the lock is not released promptly once the work is done, many problems follow.

A more reasonable use of distributed locks is:

  1. Manual lock
  2. Business operations
  3. Manually release lock
  4. If manual lock release fails, redis automatically releases the lock when the timeout period is reached.

That is the general flow. So how do you release the lock?

try {
  String result = jedis.set(lockKey, requestId, "NX", "PX", expireTime);
  if ("OK".equals(result)) {
      return true;
  }
  return false;
} finally {
    unlock(lockKey);
}

You need to catch exceptions in the business code and then release the lock in finally. In other words: the lock needs to be released whether the code execution succeeds or fails.

At this point, some friends may ask: what if the system restarts just as the lock is being released, or the network drops, or the machine room goes down? Won’t the lock release fail?

And that’s a good question, because these small probability problems do exist.

But remember earlier when we set timeout for the lock? The lock will be automatically released by Redis after the timeout we set, even if the release fails due to an exception.

But is it enough just to release the lock in finally?

3 Released someone else’s lock

It is not enough to release the lock only in finally, because the position of releasing the lock is still wrong.

What’s wrong?

A: In multithreaded scenarios, it is possible to release someone else’s lock.

Some friends might argue: in a multi-threaded scenario, if thread A has acquired the lock and not yet released it, thread B cannot acquire it at all, so how could anyone release someone else’s lock?

A: Suppose thread A and thread B both lock with the same lockKey. Thread A locks successfully, but its business function takes so long that the timeout is exceeded, and Redis automatically releases the lockKey. Thread B then locks lockKey successfully and starts its business operation. Meanwhile, thread A finishes its business function and releases lockKey. Here is the problem: the lock thread A just released belonged to thread B.

At this point, thread B must be crying in the bathroom, with nowhere to complain.

So, how to solve this problem?

I don’t know if you noticed: when locking, we set the value to a requestId. What is that parameter for?

A: The requestId is used when releasing the lock.

// Note: this GET-then-DEL pair is not atomic; the Lua version below closes that gap.
if (jedis.get(lockKey).equals(requestId)) {
    jedis.del(lockKey);
    return true;
}
return false;

Before releasing the lock, obtain the value of the lock (the value previously set is requestId) and check whether the value is the same as the value previously set. If the value is the same, the lock can be deleted and success is returned. If not, failure is returned.

In other words: a thread can only release its own lock, never a lock added by someone else.

Why requestId? Why not userId?

A: A userId is not unique to the request; multiple requests may carry the same userId. A requestId is globally unique, so locking and releasing cannot get mixed up.

In addition, using lua script, you can also solve the problem of releasing someone else’s lock:

if redis.call('get', KEYS[1]) == ARGV[1] then 
 return redis.call('del', KEYS[1]) 
else 
  return 0 
end

The Lua script ensures that checking for a lock and deleting a lock are atomic operations, and it works better to release the lock.
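For intuition, here is a toy in-memory sketch of the same idea in plain Java: ConcurrentHashMap.remove(key, value) is an atomic compare-and-delete, playing the role the Lua script plays in Redis. The class and names are illustrative, not part of any Redis client.

```java
import java.util.concurrent.ConcurrentHashMap;

public class CompareAndDeleteSketch {
    // Toy in-memory stand-in for Redis, used only to illustrate atomic release.
    static final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    // Mirrors the Lua script: delete the key only if it still holds our requestId.
    // remove(key, value) performs the compare and the delete as one atomic step.
    static boolean unlock(String lockKey, String requestId) {
        return store.remove(lockKey, requestId);
    }

    public static void main(String[] args) {
        store.put("lockKey", "request-A");
        System.out.println(unlock("lockKey", "request-B")); // someone else's lock: false
        System.out.println(unlock("lockKey", "request-A")); // our own lock: true
    }
}
```

If the compare and the delete were two separate calls, the lock could expire and be retaken by another thread in between, which is exactly the race the Lua script prevents.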

Speaking of Lua scripts, a Lua script is also recommended for the locking operation itself:

if (redis.call('exists', KEYS[1]) == 0) then
    redis.call('hset', KEYS[1], ARGV[2], 1);
    redis.call('pexpire', KEYS[1], ARGV[1]);
    return nil;
end;
if (redis.call('hexists', KEYS[1], ARGV[2]) == 1) then
    redis.call('hincrby', KEYS[1], ARGV[2], 1);
    redis.call('pexpire', KEYS[1], ARGV[1]);
    return nil;
end;
return redis.call('pttl', KEYS[1]);

This is the locking script from the Redisson framework. It is well written and worth using as a reference.

4 A large number of failed requests

The above locking method may seem fine, but if you think about it, if there are 10,000 requests competing for the lock at the same time, only one request may be successful and the other 9,999 requests will fail.

In the seckill scenario, what’s the problem?

A: Out of every 10,000 requests, one succeeds; out of the next 10,000, one more succeeds; and so on until the stock runs out. The seckill degenerates into a slow trickle of successes, which is not what we expect.

How to solve this problem?

And here’s another scenario:

For example, two threads upload files to SFTP at the same time, and each must create a directory before uploading. Assume both threads need a directory named with the current date, for example 20210920. If the concurrent creation is not controlled, the second thread will fail.

This can be handled with a Redis distributed lock: if the directory already exists, skip creation; if it does not exist, create it.

The pseudocode is as follows:

try {
  String result = jedis.set(lockKey, requestId, "NX", "PX", expireTime);
  if ("OK".equals(result)) {
    if (!exists(path)) {
      mkdir(path);
    }
    return true;
  }
} finally {
    unlock(lockKey, requestId);
}
return false;

A: A plain Redis distributed lock is not enough here, because when the second request fails to acquire the lock, what should it return: failure, or success?

Obviously it cannot return failure; if it did, the problem would remain unresolved. But returning success when the file has not actually been uploaded is even worse. A headache: how should this be solved?

A: Use a spin lock.

try {
  long start = System.currentTimeMillis();
  while (true) {
     String result = jedis.set(lockKey, requestId, "NX", "PX", expireTime);
     if ("OK".equals(result)) {
        if (!exists(path)) {
          mkdir(path);
        }
        return true;
     }

     long time = System.currentTimeMillis() - start;
     if (time >= timeout) {
         return false;
     }
     try {
         Thread.sleep(50);
     } catch (InterruptedException e) {
         e.printStackTrace();
     }
  }
} finally {
    unlock(lockKey, requestId);
}

Within a specified period, say 500 milliseconds, the spin keeps retrying the lock in a loop and returns as soon as an attempt succeeds. After each failed attempt it sleeps for 50 milliseconds and tries again. If the lock still has not been acquired when the timeout expires, failure is returned.

5 Lock reentrant problem

We all know that Redis distributed locks are mutually exclusive: once we lock a key, then as long as that lock has not expired, locking again with the same key will fail.

Yes, in most scenarios that is fine.

Why only most?

Because there are also scenarios like this:

Suppose that in a request, you need to obtain a menu tree or classification tree that meets the criteria. Taking the menu as an example, you need to start from the root node in the interface, recursively iterate through all the child nodes that meet the criteria, and then assemble a menu tree.

Note that the menu is not static: operations staff can dynamically add, modify, and delete menus in the admin system. To guarantee that the latest data is fetched every time under concurrency, we can add a Redis distributed lock.

Using a Redis distributed lock is the right idea. But here is the problem: the method recurses multiple times, taking the same lock each time. The first level of recursion locks successfully, but won’t the second and third levels of recursion… level N, fail to lock?

The pseudo-code for adding locks to the recursive method is as follows:

private int expireTime = 1000;

public void fun(int level, String lockKey, String requestId) {
  try {
     String result = jedis.set(lockKey, requestId, "NX", "PX", expireTime);
     if ("OK".equals(result)) {
        if (level <= 10) {
           this.fun(++level, lockKey, requestId);
        } else {
           return;
        }
     }
     return;
  } finally {
     unlock(lockKey, requestId);
  }
}

If you just use it this way, it doesn’t seem to be a problem. But when you finally execute the program, there’s only one thing waiting for you: an exception.

Starting from the root node, the first level of recursion locks successfully and immediately enters the second level. Because the lock on lockKey already exists, the second level will most likely fail to lock and fall back to the first level. The first level then releases the lock normally, and the whole recursive method simply returns.

Now, you know what the problem is?

Yes, the recursion method actually only executes the first recursion and then returns. The other recursion levels cannot be executed at all due to locking failure.

So how can this problem be solved?

A: Use a reentrant lock.

Take the Redisson framework as an example, which implements reentrant locking internally.

There is a saying in ancient times: it is useless to call yourself a hero if you do not know Chen Jinnan.

I say: if a distributed lock does not know Redisson, it is useless to call it a good lock. Ha ha ha, just for fun.

This shows that Redisson has a very high status in redis distributed locks.

The pseudocode is as follows:

private int expireTime = 1000;

public void run(String lockKey) {
  RLock lock = redisson.getLock(lockKey);
  this.fun(lock, 1);
}

public void fun(RLock lock, int level) {
  try {
      lock.lock(5, TimeUnit.SECONDS);
      if (level <= 10) {
         this.fun(lock, ++level);
      } else {
         return;
      }
  } finally {
      lock.unlock();
  }
}

The code above may not be perfect; it only sketches the general idea. If you need something like this, use it as a reference.

Next, let’s talk about how redisson reentrant locks are implemented.

Locking is mainly achieved through the following script:

if (redis.call('exists', KEYS[1]) == 0) then
   redis.call('hset', KEYS[1], ARGV[2], 1);
   redis.call('pexpire', KEYS[1], ARGV[1]);
   return nil;
end;
if (redis.call('hexists', KEYS[1], ARGV[2]) == 1) then
  redis.call('hincrby', KEYS[1], ARGV[2], 1);
  redis.call('pexpire', KEYS[1], ARGV[1]);
  return nil;
end;
return redis.call('pttl', KEYS[1]);

Among them:

  • KEYS[1]: the lock key
  • ARGV[1]: expiration time
  • ARGV[2]: uuid + “:” + threadId, which can be thought of as the requestId
  1. If the lock key does not exist, take the lock: store the requestId as a hash field with value 1 and set the expiration time.
  2. If the lock key exists and already holds a field for this requestId, run the hincrby command to add 1 to its count. This is the key to reentrancy: each re-entry adds 1.
  3. If the lock key exists but is held by a different requestId, return its remaining expiration time.
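The three branches above can be modeled in plain Java, with a map of maps standing in for the Redis hash (expiry omitted). This is only an illustrative sketch of the counting logic, not Redisson code:

```java
import java.util.HashMap;
import java.util.Map;

public class ReentrantCountSketch {
    // Toy model of the lock hash: lockKey -> { requestId -> reentrant count }.
    static final Map<String, Map<String, Integer>> store = new HashMap<>();

    static boolean tryLock(String lockKey, String requestId) {
        Map<String, Integer> holders = store.get(lockKey);
        if (holders == null) {                 // branch 1: lock free, take it with count 1
            holders = new HashMap<>();
            holders.put(requestId, 1);
            store.put(lockKey, holders);
            return true;
        }
        if (holders.containsKey(requestId)) {  // branch 2: re-entry, count + 1 (hincrby)
            holders.merge(requestId, 1, Integer::sum);
            return true;
        }
        return false;                          // branch 3: held by another requestId
    }

    public static void main(String[] args) {
        System.out.println(tryLock("menu", "req-A")); // true, count becomes 1
        System.out.println(tryLock("menu", "req-A")); // true, count becomes 2 (reentrant)
        System.out.println(tryLock("menu", "req-B")); // false, held by req-A
    }
}
```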

Lock release is mainly achieved through the following script:

if (redis.call('hexists', KEYS[1], ARGV[3]) == 0) then
  return nil;
end;
local counter = redis.call('hincrby', KEYS[1], ARGV[3], -1);
if (counter > 0) then
    redis.call('pexpire', KEYS[1], ARGV[2]);
    return 0;
else
   redis.call('del', KEYS[1]);
   redis.call('publish', KEYS[2], ARGV[1]);
   return 1;
end;
return nil;
  1. If the lock key does not hold a field for this requestId, return nil.
  2. If it does, decrease the reentrant count by 1.
  3. If the count is still greater than 0 after the decrement, the lock is still held, so just refresh the expiration time.
  4. If the count reaches 0 after the decrement, delete the lock and publish a message so waiting threads can compete for it.

Again, if your system can tolerate temporary data inconsistency, you may not need a lock at all. I am only using this as an example; this section does not apply to every scenario.

6 Lock competition issues

If there is a heavy write scenario, using a normal Redis distributed lock is fine.

However, in some business scenarios, there is less writing and a lot of reading. Would it be bad to use a normal Redis distributed lock?

As we all know, the coarser the lock granularity, the more fierce the competition between multiple threads for the lock, resulting in the longer waiting time of multiple threads, the worse the performance.

Therefore, the first step to improve the performance of Redis distributed locks is to reduce the granularity of the locks.

6.1 Read-write lock

As we all know, the purpose of locking is to ensure the security of reading and writing data in a concurrent environment, that is, without data errors or inconsistencies.

However, in most practical business scenarios, reads are far more frequent than writes, and concurrent reads between threads raise no safety issues, so read operations do not need to exclude one another. As long as read-write and write-write operations remain mutually exclusive, system performance improves.

Let’s take the Redisson framework as an example, which already implements read and write locks internally.

The pseudo-code for the read lock is as follows:

RReadWriteLock readWriteLock = redisson.getReadWriteLock("readWriteLock");
RLock rLock = readWriteLock.readLock();
try {
    rLock.lock();
    // Business operation
} catch (Exception e) {
    log.error(e);
} finally {
    rLock.unlock();
}

The pseudo-code for writing locks is as follows:

RReadWriteLock readWriteLock = redisson.getReadWriteLock("readWriteLock");
RLock rLock = readWriteLock.writeLock();
try {
    rLock.lock();
    // Business operation
} catch (Exception e) {
   log.error(e);
} finally {
    rLock.unlock();
}

The biggest advantage of separating read and write locks is the improved performance of reads, because read locks are shared and do not exclude one another. In our actual business scenarios, most data operations are reads, so improving read performance improves the overall performance of the lock.

Here is a summary of the characteristics of a read-write lock:

  • Read and read are shared, not mutually exclusive
  • Read and write are mutually exclusive
  • Write and write are mutually exclusive
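The same three rules govern Java’s in-process ReentrantReadWriteLock from java.util.concurrent, which makes for a quick local sanity check (this is the JDK lock, not Redisson’s distributed one):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteRules {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock rw = new ReentrantReadWriteLock();

        // Read-read is shared: a second thread can acquire the read lock too.
        rw.readLock().lock();
        Thread reader = new Thread(() -> {
            boolean ok = rw.readLock().tryLock();
            System.out.println("second reader acquired: " + ok);
            if (ok) rw.readLock().unlock();
        });
        reader.start();
        reader.join();
        rw.readLock().unlock();

        // Read-write is mutually exclusive: a reader fails while a writer holds the lock.
        rw.writeLock().lock();
        Thread blockedReader = new Thread(() ->
            System.out.println("reader while writer holds: " + rw.readLock().tryLock()));
        blockedReader.start();
        blockedReader.join();
        rw.writeLock().unlock();
    }
}
```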

6.2 Lock segmentation

In addition, to reduce the granularity of locks, it is common to segment large locks.

In Java, ConcurrentHashMap (in JDK 7 and earlier) divided its data into 16 segments, each with its own lock. Data in different segments does not interfere, which improves lock performance.

In a real business scenario, we could do the following:

For example, in a seckill inventory scenario, there are 2,000 items in stock for users to grab. To prevent overselling, a lock is usually placed on the inventory. If 10,000 users compete for the same lock, system throughput will obviously be very low.

To improve system performance, we can divide the inventory into segments, say 100, so that each segment holds 20 items for the seckill.

During the seckill, hash the user ID and take the result modulo 100. Users who map to 1 access segment 1’s inventory, users who map to 2 access segment 2, users who map to 3 access segment 3, and so on, up to users who map to 100 accessing segment 100.
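The routing step can be sketched in a few lines; the segment count and the key naming below are assumptions for illustration, not from any library:

```java
public class StockSegmentRouter {
    // Assumed layout: 100 segments, keys named "seckill:stock:<n>", n in 1..100.
    static final int SEGMENTS = 100;

    // Hash the user id and take it modulo SEGMENTS to pick a stock segment.
    static String stockKeyFor(String userId) {
        int segment = Math.abs(userId.hashCode() % SEGMENTS) + 1; // abs guards negative hashes
        return "seckill:stock:" + segment;
    }

    public static void main(String[] args) {
        // The same user always routes to the same segment, so retries stay consistent.
        System.out.println(stockKeyFor("user-1001").equals(stockKeyFor("user-1001")));
    }
}
```

Each request then takes the distributed lock on its own segment key instead of one global inventory lock.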

This greatly reduces lock conflicts in a multi-threaded environment. Previously, all threads competed for a single lock at the same time, and in a seckill scenario the competition is ferocious, leaving most threads waiting on the lock. Now the same threads compete for 100 locks, fewer threads wait, and system throughput increases.

One caveat: while segmenting locks can improve system performance, it also increases system complexity, because it requires extra routing logic, cross-segment statistics, and similar features. In a real business scenario, weigh these factors together; segmenting the lock is not always necessary.

7 Lock timeout problem

As mentioned earlier, if thread A locks successfully but its business function takes too long and exceeds the timeout period, Redis automatically releases thread A’s lock.

Some friends may say: when the timeout period expires, the lock is released, and it has no effect on the function.

A: Wrong, wrong, wrong. It actually affects functionality.

The general purpose of locking is to prevent data anomalies when accessing critical resources. For example, suppose thread A is changing the value of data C while thread B is also changing it. Without concurrency control, data C’s value will end up wrong.

To ensure the mutual exclusivity of A method, or piece of code, that if thread A executes A piece of code, other threads are not allowed to execute it at the same time, we can use the synchronized keyword to lock it.

But this kind of lock has great limitation, can only guarantee the mutual exclusion of single node. Redis distributed locks are used if mutual exclusion is required across multiple nodes.

With all that foreshadowing, let’s get back to business.

Assume the code thread A protects with the Redis distributed lock consists of segment 1 and segment 2. Because the business operation is very time-consuming, by the time segment 1 has finished executing, the configured timeout has already been reached and Redis automatically releases the lock, while segment 2 has not yet executed.

At this point, segment 2 is effectively running naked, with no mutual exclusion guaranteed. If it accesses a critical resource that other threads also access, data anomalies may occur. (PS: by access to critical resources I mean not only reads but also writes.)

So, how to solve this problem?

A: If the timeout period has been reached but the business code has not finished executing, the lock needs to be automatically renewed.

We can use the TimerTask class to implement automatic renewal:

Timer timer = new Timer();
timer.schedule(new TimerTask() {
    @Override
    public void run() {
        // Automatic renewal logic
    }
}, 10000, 10000); // initial delay and period, both in milliseconds

After obtaining the lock, a scheduled task is automatically started, and the expiration time is automatically refreshed every 10 seconds. This mechanism has an ambitious name in the Redisson framework: Watch Dog, the legendary guard dog.

Of course, we still prefer to use lua scripts to implement automatic renewal functions, such as:

if (redis.call('hexists', KEYS[1], ARGV[2]) == 1) then
   redis.call('pexpire', KEYS[1], ARGV[1]);
   return 1;
end;
return 0;

Note that when implementing automatic renewal, you also need to set a total expiration time, which can be consistent with Redisson and set to 30 seconds. If the business code reaches this total expiration date and has not finished executing, it is no longer automatically renewed.

The automatic renewal works like this: after obtaining the lock, a scheduled task starts and checks every 10 seconds whether the lock still exists; if it does, the expiration time is refreshed. If the business method has still not finished after three renewals, that is, 30 seconds, no further renewal happens.
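A minimal watchdog sketch, assuming a renew callback that runs the pexpire Lua script above. The 10-second check and 30-second total lease mirror the numbers just described, but everything else here (class, method, and parameter names) is illustrative:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class WatchDogSketch {
    // Renew every renewIntervalMs until maxLifetimeMs has passed, then stop.
    static ScheduledFuture<?> start(ScheduledExecutorService scheduler,
                                    Runnable renew,
                                    long renewIntervalMs,
                                    long maxLifetimeMs) {
        long deadline = System.currentTimeMillis() + maxLifetimeMs;
        ScheduledFuture<?>[] holder = new ScheduledFuture<?>[1];
        holder[0] = scheduler.scheduleAtFixedRate(() -> {
            if (System.currentTimeMillis() >= deadline) {
                holder[0].cancel(false); // past the total lease: stop renewing
            } else {
                renew.run();             // e.g. eval the pexpire script for our lock
            }
        }, renewIntervalMs, renewIntervalMs, TimeUnit.MILLISECONDS);
        return holder[0];
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Tiny intervals just for the demo: renew every 10 ms, give up after 100 ms.
        start(scheduler, () -> System.out.println("renewed"), 10, 100);
        Thread.sleep(60);
        scheduler.shutdown();
    }
}
```

In production, the renew callback should also stop as soon as the business method finishes and releases the lock.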

8 Master/slave replication problems

So much of what we’ve covered above is fine for a single instance of Redis.

But if there are multiple Redis instances, for example in master/slave mode or Sentinel mode, the Redis distributed lock feature can run into problems.

What exactly is the problem?

Assume that Redis currently runs in master-slave mode, with one master node and three slave nodes: the master handles writes and the slaves handle reads, and everything is peaceful and harmonious. All locking is performed on the master; after a lock succeeds, the data is asynchronously replicated to the slaves.

And then one day, for some irreversible reason, the master node dies.

A slave must then be promoted to become the new master; suppose slave1 is elected.

Now suppose some unlucky lock A was added on the master just before it went down, and there was no time to replicate it to slave1.

Lock A is therefore missing on the new master node. If a new thread later locks with lock A’s key, it will still succeed, and the distributed lock has failed.

So, what about solving this problem?

A: To solve this problem, the Redisson framework provides a special class, RedissonRedLock, which implements the RedLock algorithm.

RedissonRedLock solves the problem as follows:

  1. We need to set up several independent Redis environments, let’s say we set up three here.
  2. Each environment has a Redisson node.
  3. Multiple Redisson nodes form a Redisson Redlock.
  4. Environments include single-machine, master-slave, sentinel, and cluster modes, and can be one or a combination of them.

Here we take master-slave environments as an example.

The RedissonRedLock process is as follows:

  1. Loop through and lock every Redisson node; assume the number of nodes is N.
  2. The entire RedissonRedLock is successful if N/2 + 1 of the N nodes are successfully locked.
  3. If less than N/2 + 1 of the N nodes were successfully locked, the entire RedissonRedLock failed.
  4. If the total time spent locking the nodes reaches or exceeds the maximum wait time, failure is also returned.
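The success rule in steps 2 and 3 is easy to express directly. A small sketch of the N/2 + 1 majority check (class and method names are illustrative, not Redisson API):

```java
public class RedLockQuorum {
    // RedLock succeeds only when at least N/2 + 1 of the N nodes were locked.
    static boolean quorumReached(boolean[] lockResults) {
        int successes = 0;
        for (boolean locked : lockResults) {
            if (locked) successes++;
        }
        return successes >= lockResults.length / 2 + 1;
    }

    public static void main(String[] args) {
        // 3 of 5 nodes locked: majority reached.
        System.out.println(quorumReached(new boolean[]{true, true, false, true, false}));
    }
}
```

Requiring a majority is what lets the lock survive the failure of a minority of independent Redis environments.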

As can be seen from the above, the Redlock algorithm can indeed solve the problem of distributed lock failure in multi-instance scenarios if the master node fails.

But it also raises new questions, such as:

  1. We need to build additional sets of environments and apply for more resources. We need to evaluate whether the funds are sufficient.
  2. If there are N Redisson nodes, you must lock N times, and at least N/2 + 1 attempts must succeed before the RedLock is acquired. The extra time cost is obviously significant.

This shows that RedissonRedLock is not used much in real business scenarios, especially in high-concurrency services.

In a distributed environment, CAP is inescapable.

CAP refers to being in a distributed system:

  • Consistency
  • Availability
  • Partition tolerance

These three elements can at most achieve two at the same time, not all at once.

If your business scenario primarily needs to guarantee data consistency, use a CP-type distributed lock such as ZooKeeper. It is disk-based, so its performance may not be great, but it generally does not lose data.

If your business scenario primarily needs to guarantee high availability, use an AP-type distributed lock such as Redis. It is memory-based, so it performs better, but it carries some risk of data loss.

In fact, in most of our distributed business scenarios, a Redis distributed lock is sufficient; there is really no need to make it heavier than necessary. Data inconsistency can be resolved with an eventual-consistency scheme, but an unavailable system is a fatal blow to users.
