Reprinted from: juejin.cn/post/697124…

In this article, I want to talk with you about the “security” of Redis distributed locks.

The topic of Redis distributed locks has been written about to death, so why am I writing yet another article on it?

Because I find that 99% of the articles out there never really get to the bottom of the issue, so readers finish article after article and are still in the clouds. For example, can you answer these questions clearly?

  • How to implement a distributed lock based on Redis?
  • Is a Redis distributed lock really secure?
  • What’s wrong with Redlock? Is it safe?
  • What was the industry debate over Redlock about? Which side is right?
  • Should you use Redis or Zookeeper for distributed locks?
  • What are the considerations for implementing a “fault tolerant” distributed lock?

In this article, I will try to make these issues thoroughly clear.

After reading this article, you will not only have a thorough understanding of distributed locks, but also a deeper understanding of “distributed systems.”

The article is a bit long, but it is packed with substance; I hope you can read it patiently.

Why do we need distributed locks?

Before we start talking about distributed locks, it’s important to briefly explain why distributed locks are needed.

The counterpart of the distributed lock is the “single-machine lock”. When we write multithreaded programs, to avoid data problems caused by concurrent operations on a shared variable, we usually use a lock for “mutual exclusion” to ensure the correctness of the shared variable. Its scope is the “same process”.

If multiple processes need to operate on a shared resource at the same time, how can they be mutually exclusive?

For example, today’s business applications are usually microservice architectures, which means one application is deployed as multiple processes. So if multiple processes need to modify the same row in MySQL, to avoid data errors caused by out-of-order operations, we need to introduce a “distributed lock”.

To implement distributed locking, you must use an external system on which all processes apply for “locking”.

This external system must implement the “mutual exclusion” capability: of two requests arriving at the same time, only one process is told success; the other gets failure (or waits).

This external system can be MySQL, Redis or Zookeeper. But in pursuit of better performance, we usually choose to use Redis or Zookeeper.

Here I will take Redis as the main thread and, going from shallow to deep, walk you through the various “security” problems of distributed locks, to help you understand them thoroughly.

How to implement distributed lock?

Let’s start with the simplest one.

To implement a distributed lock, Redis must provide “mutual exclusion”. We can use the SETNX command, which stands for SET if Not eXists: the key is set only if it does not already exist; otherwise nothing is done.

Two client processes executing this command are mutually exclusive, which gives us a distributed lock.

Client 1 applies for the lock and succeeds:

127.0.0.1:6379> SETNX lock 1
(integer) 1     // Client 1 locks successfully

Client 2 applies later, so its lock attempt fails:

127.0.0.1:6379> SETNX lock 1
(integer) 0     // Client 2 fails to lock

At this point, the client holding the lock can operate on the “shared resource”, for example, modify a row of MySQL data or call an API.

After the operation is complete, the lock should be released promptly so that later arrivals get their chance at the shared resource. How do we release the lock?

Delete the key by using the DEL command:

127.0.0.1:6379> DEL lock     // release the lock
(integer) 1

The logic is very simple, and the overall flow is: SETNX to lock, operate the shared resource, DEL to release.
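To make this concrete, here is a minimal sketch of the naive lock, assuming the redis-py client (the helper names are mine, for illustration only):

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

def naive_lock() -> bool:
    # SETNX: only the first client to arrive gets True; others get False
    return r.setnx("lock", 1)

def naive_unlock() -> None:
    # DEL releases the lock; note that anyone can call this, and a crash
    # before reaching it leaks the lock forever (the problem discussed next)
    r.delete("lock")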

However, it has a big problem: once client 1 takes the lock, a “deadlock” occurs if either of the following happens:

  1. The program hits an exception in its business logic and never releases the lock
  2. The process crashes and has no chance to release the lock

At this point, the client will hold the lock forever, and other clients will never get the lock.

How to solve this problem?

How do I avoid deadlocks?

An easy solution would be to assign a “lease period” to the lock when it is applied for.

Implemented in Redis, this means giving the key an “expiration time”. Assuming the operation on the shared resource never takes more than 10 seconds, we set the key to expire in 10 seconds when locking:

127.0.0.1:6379> SETNX lock 1    // lock
(integer) 1
127.0.0.1:6379> EXPIRE lock 10  // automatically expires after 10 seconds
(integer) 1

This way, whether or not the client misbehaves, the lock is “automatically released” after 10 seconds, and other clients can then acquire it.

But is that really okay?

There are still problems.

But locking and setting the expiration are two separate commands here. Is it possible that only the first executes and the second never gets the chance? For example:

  1. SETNX execution succeeded. EXPIRE execution failed due to network problem
  2. SETNX executed successfully, Redis is down abnormally, EXPIRE did not have a chance to execute
  3. SETNX executes successfully, client crashes abnormally, and EXPIRE does not have a chance to execute

In short, the two commands are not guaranteed to be atomic (to succeed or fail together). There is a risk that setting the expiration time fails, and the “deadlock” problem reappears.

How to do?

Prior to Redis 2.6.12, we had to work out ourselves how to make SETNX and EXPIRE atomic, and how to handle the various failure cases.

Fortunately, since Redis 2.6.12, the SET command has been extended with options, so one command does it all:

// One command, atomic execution guaranteed
127.0.0.1:6379> SET lock 1 EX 10 NX
OK

This solves the deadlock problem and is relatively simple.
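In redis-py, for example, the same atomic command is a single call. A minimal sketch:

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# One atomic call, equivalent to: SET lock 1 EX 10 NX
# Returns True on success, None if the key already exists
locked = r.set("lock", 1, nx=True, ex=10)
if locked:
    pass  # we hold the lock, and it auto-expires after 10 seconds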

Let’s analyze it again. What else is wrong with it?

Consider this scenario:

  1. Client 1 is locked successfully and starts to operate shared resources
  2. The time of client 1 operating on the shared resource “exceeds” the lock expiration time, and the lock is automatically released.
  3. Client 2 is locked successfully and starts to operate shared resources
  4. Client 1 finishes operating the shared resource and releases the lock (but it is actually client 2’s lock it releases)

See, there are two serious problems here:

  1. Lock expiration: Client 1 takes too long to operate the shared resource. As a result, the lock is automatically released and then held by client 2
  2. Release someone else’s lock: Client 1 releases the lock of client 2 after the shared resource operation is complete

What are the causes of these two problems? Let’s look at them one by one.

The first problem stems from an inaccurate estimate of how long operating the shared resource takes.

For example, if the “slowest” time to operate a shared resource might take 15s and we set the expiration time to 10s, there is a risk that the lock will expire early.

If the expiration time is too short, why not add some slack, say set it to 20s? Would that always work?

This can “mitigate” the problem and reduce the probability of a problem, but it still doesn’t “fix” the problem.

Why is that?

The reason is that after the client obtains the lock, it may encounter complex scenarios when operating the shared resources, for example, internal exceptions occur in the program, network requests timeout, and so on.

Since it’s an “estimated” time, it can only be approximated, unless you can anticipate and cover all the scenarios that cause it to take longer, which is hard.

Is there a better solution?

Don’t worry, I will explain the corresponding solution to this problem in detail later.

Let’s move on to the second question.

The second problem is that one client releases a lock held by another client.

Think about it. What’s the key to this problem?

The key point is that each client releases the lock “blindly”, without checking whether the lock is still “owned by itself”, so there is a risk of releasing someone else’s lock. Such an unlock procedure is not “rigorous”.

How to solve this problem?

What if the lock is released by someone else?

The solution is to set a “unique identifier” that only the client knows when locking.

For example, it could be the client’s own thread ID, or a random, unique UUID. Here we use a UUID as an example:

// Set the lock VALUE to a UUID
127.0.0.1:6379> SET lock $uuid EX 20 NX
OK

Here we assume 20s is enough time to operate the shared resource, and set automatic lock expiration aside for now.

After that, before releasing the lock, you need to determine whether the lock is still owned by you. The pseudo-code can be written as follows:

// Release the lock only if it is still ours
if redis.get("lock") == $uuid:
    redis.del("lock")

Here the lock is released with GET + DEL, two commands, and the atomicity problem comes up again:

  1. Client 1 performs GET to determine that the lock is its own
  2. Client 2 executes the SET command to forcibly acquire the lock (although the probability is low, we need to carefully consider the security model of the lock)
  3. Client 1 performs DEL, but releases the lock on client 2

Thus, these two commands still have to be executed atomically.

How do we make it atomic? With a Lua script.

We can write this logic in a Lua script and let Redis execute it.

Because Redis processes every request on a “single thread”, while a Lua script executes, all other requests must wait, so no other command can slip in between the GET and the DEL.

The Lua script for safely releasing locks is as follows:

-- Check that the lock is still ours before releasing it
if redis.call("GET", KEYS[1]) == ARGV[1]
then
    return redis.call("DEL", KEYS[1])
else
    return 0
end

With all of this optimization in place, the whole lock/unlock process is now much more “rigorous”.

Let’s summarize here. A rigorous flow for a Redis-based distributed lock looks like this (a code sketch follows the list):

  1. Lock:SET lock_key $unique_id EX $expire_time NX
  2. Operating shared Resources
  3. Release lock: Lua script, first GET to determine whether the lock belongs to oneself, then DEL release lock
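Put together in code, the rigorous flow might look like the following sketch, again assuming redis-py; the key name and helper names are mine:

import uuid
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# Lua: delete the key only if its value is still our unique id
RELEASE_LUA = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
"""

def acquire(key: str, ttl: int):
    token = str(uuid.uuid4())               # unique per lock holder
    if r.set(key, token, nx=True, ex=ttl):  # atomic SET ... EX ttl NX
        return token
    return None

def release(key: str, token: str) -> bool:
    # GET + DEL executed atomically inside Redis via the Lua script
    return bool(r.eval(RELEASE_LUA, 1, key, token))

token = acquire("lock", 20)
if token:
    try:
        pass  # operate the shared resource
    finally:
        release("lock", token)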

Ok, with this complete lock model, let’s go back to the first problem mentioned earlier.

How to evaluate the lock expiration time?

How to evaluate the lock expiration time?

As mentioned earlier, if the expiration time of the lock is not evaluated properly, the lock is at risk of “premature” expiration.

The compromise was to add “redundancy” to the expiration time, reducing the chance that the lock expires early.

This is not a perfect solution, so what else can we do?

Could we design a scheme like this: when locking, first set an expiration time, then start a “daemon thread” that periodically checks the lock’s remaining lifetime; if the lock is about to expire but the operation on the shared resource is not finished, the thread automatically “renews” the lock by resetting its expiration time.

This is actually a better solution.

If you are on a Java stack, fortunately there is a library that encapsulates all of this: Redisson.

Redisson is a Java language implementation of the Redis SDK client. When using distributed locks, it uses the “automatic renewal” solution to avoid lock expiration. This daemon thread is commonly referred to as the “watchdog” thread.
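Redisson is Java, but the watchdog idea itself is easy to sketch in any language. Below is a minimal, illustrative version assuming redis-py; the one-third-of-TTL renewal interval and all names are my own choices, not Redisson’s internals:

import threading
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# Lua: extend the TTL only if the lock still belongs to us
RENEW_LUA = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("PEXPIRE", KEYS[1], ARGV[2])
else
    return 0
end
"""

def start_watchdog(key: str, token: str, ttl_ms: int) -> threading.Event:
    stop = threading.Event()

    def renew():
        # Wake up at one third of the TTL and push the expiry forward
        while not stop.wait(ttl_ms / 3000.0):
            if not r.eval(RENEW_LUA, 1, key, token, ttl_ms):
                break  # we no longer own the lock; stop renewing

    threading.Thread(target=renew, daemon=True).start()
    return stop  # the caller sets this event after releasing the lock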

In addition, the SDK packages a number of easy-to-use features:

  • Reentrant lock
  • Optimistic locking
  • Fair lock
  • Read-write lock
  • Redlock (more on that below)

The SDK provides a friendly API that can manipulate distributed locks in the same way as local locks. If you are a Java technology stack, you can use it directly.

I won’t go into Redisson usage here; you can check the official GitHub to learn how to use it. It is fairly simple.

Here we briefly summarize the realization of distributed lock based on Redis, the problems encountered in front, and the corresponding solutions:

  • Deadlock: Set the expiration time
  • Hard-to-estimate expiration time, lock expires early: daemon thread with automatic renewal
  • Lock is released by others: the lock is written with a unique identifier. The identifier is checked before the lock is released

What other problem scenarios can compromise the security of Redis locks?

The scenarios previously analyzed were all about the potential problems of locking in a “single” Redis instance and did not go into Redis deployment architecture details.

However, in practice Redis is generally deployed as a master-slave cluster plus Sentinel. The advantage is that when the master goes down, Sentinel performs “automatic failover” and promotes a slave to master to keep serving, ensuring availability.

Will the distributed lock remain secure in the event of a master/slave switch?

Consider this scenario:

  1. Client 1 runs the SET command on the master and locks successfully
  2. Before the SET command is synchronized to the slave, the master crashes (master-slave replication is asynchronous)
  3. The slave is promoted by Sentinel to the new master; the lock held on the old master is lost!

As you can see, when Redis replicas are introduced, distributed locks may still be affected.

How to solve this problem?

To this end, the authors of Redis propose a solution that we often hear about: Redlock.

Can it really solve the above problem?

Is Redlock really secure?

Okay, we have finally arrived at the main story of this article. What? All those problems above were just the basics?

Yes, those were just the appetizers; the real hard stuff starts here.

If anything above was unclear, I suggest reading it again and first getting the basic lock/unlock flow straight.

If you already know Redlock, you can use this as a review; if you don’t, that’s fine, I will introduce it from scratch.

It is worth noting that I will not only cover Redlock’s principles but also raise many questions about “distributed systems” along the way; follow my lead and work out the answers in your head.

Now let’s see how the Redlock solution proposed by the author of Redis solves the lock failure problem after the master/slave switchover.

Redlock’s approach is based on two premises:

  1. You no longer deploy slave libraries and sentinel instances, only master libraries
  2. But you deploy multiple masters; at least 5 instances are officially recommended

In other words, to use Redlock you deploy at least 5 Redis instances, all masters, with no relationship to each other: isolated, independent instances.

Note: this is not a Redis Cluster, but 5 plain, independent Redis instances.

How does Redlock work?

The overall flow is as follows, in five steps (a code sketch follows the list):

  1. The client first gets “current timestamp T1”
  2. The client sends a lock request to each of the 5 Redis instances in turn (using the SET command above), with a timeout on each request (in milliseconds, much shorter than the lock validity time); if one instance fails to lock (network timeout, lock held by someone else, etc.), it immediately moves on to the next Redis instance
  3. If the client locks at least 3 (the majority of) Redis instances successfully, it gets “current timestamp T2” again; if T2 - T1 < the lock expiration time, locking is considered successful, otherwise it is considered failed
  4. On success, go operate the shared resource (e.g., modify a MySQL row or call an API)
  5. On failure, issue a lock release request to “all nodes” (the Lua release script mentioned earlier)
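To make the five steps concrete, here is a condensed sketch of the acquisition phase, assuming five independent redis-py clients; the host addresses are placeholders and the quorum arithmetic and helper names are mine, not the official implementation:

import time
import uuid
import redis

RELEASE_LUA = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
"""

# Five independent masters; socket_timeout keeps each attempt in step 2 short
clients = [redis.Redis(host=h, port=6379, socket_timeout=0.05)
           for h in ("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5")]

def redlock_acquire(key: str, ttl_ms: int):
    token = str(uuid.uuid4())
    quorum = len(clients) // 2 + 1               # 3 of 5
    t1 = time.monotonic()                        # step 1: timestamp T1
    acquired = 0
    for c in clients:                            # step 2: try each instance in turn
        try:
            if c.set(key, token, nx=True, px=ttl_ms):
                acquired += 1
        except redis.RedisError:
            pass                                 # timeout/failure: move to the next one
    elapsed_ms = (time.monotonic() - t1) * 1000  # step 3: T2 - T1
    if acquired >= quorum and elapsed_ms < ttl_ms:
        return token                             # lock held for the remaining validity
    for c in clients:                            # step 5: failed, release everywhere
        try:
            c.eval(RELEASE_LUA, 1, key, token)
        except redis.RedisError:
            pass
    return None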

Let me summarize it for you in four key points:

  1. The client applies for locks on multiple Redis instances
  2. Ensure that most nodes are locked successfully
  3. The total locking time on most nodes is less than the lock expiration time
  4. To release a lock, issue a lock release request to all nodes

This may not be easy to grasp on a first read; I suggest going over the steps above several times to commit them to memory.

Keeping these five steps in mind is very important; following this flow, we will examine the various assumptions that could cause the lock to fail.

Ok, now that we understand the Redlock process, let’s see why Redlock is doing this.

1) Why lock multiple instances?

In essence, for “fault tolerance”: if some instances crash, the remaining instances can still lock successfully, and the lock service as a whole stays available.

2) Why are most locks successful?

Multiple Redis instances together form a “distributed system”.

In distributed system, there will always be “abnormal nodes”, so when talking about distributed system problems, we need to consider how many abnormal nodes can reach, and it will still not affect the “correctness” of the whole system.

This is the “fault tolerance” question for distributed systems, and the conclusion is: if nodes only ever “crash” (fail-stop), then as long as the majority of nodes are up, the whole system can still provide correct service.

The model behind this question is the famous “Byzantine generals” problem; if you are interested, you can study the derivation of the related algorithms.

3) Why do I need to calculate the total lock time after step 3 is completed?

Because the operation is performed on multiple nodes, the operation will definitely take longer than that of a single instance. Moreover, because network requests are complex, there may be delays, packet loss, timeout, and other situations. The more network requests, the greater the probability of exceptions.

So even if the majority of nodes locked successfully, if the cumulative locking time already “exceeds” the lock expiration time, the lock on some instances may have expired, and the lock is meaningless.

4) Why do all nodes need to be operated to release locks?

When locking a Redis node, the client may think locking failed because of the network, even though it actually succeeded on the server.

For example, if a client successfully locks a Redis instance, but fails to read the response result due to a network problem, the lock is actually successfully locked on Redis.

Therefore, locks on all nodes need to be released to clear the remaining locks on nodes, regardless of whether the locks have been successfully added.

It seems that Redlock does solve the problem of Redis node outage and lock failure, ensuring the lock “security”.

But is this really the case?

The Redlock debate: who is right and who is wrong?

No sooner had the Redis author proposed this scheme than it was questioned by a well-known distributed systems expert!

That expert is Martin, a distributed systems researcher at the University of Cambridge in England. Before that he was a software engineer and entrepreneur working on large-scale data infrastructure. He is also a frequent conference speaker, blogger, book author, and open source contributor.

He immediately wrote an article questioning Redlock’s algorithmic model and offering his own thoughts on the design of distributed locks.

Not to be outdone, Antirez, the Redis author, wrote a rebuttal and detailed design details of Redlock’s algorithm model.

Moreover, the debate on this issue also caused a very heated discussion on the Internet at that time.

Both sides argued with clear reasoning and ample evidence. It was a contest of masters and a brilliant collision of ideas in the distributed systems field: two experts in the same area making completely opposite claims about the same problem.

Below, I will extract important points from their controversial articles and present them to you.

Warning: the information density below is very high and may not sink in at once; read slowly.

Distributed systems expert Martin questions Redlock

In his article, he mainly elaborated four arguments:

1) What is the purpose of distributed locks?

Martin says you need to know what you’re doing with distributed locks.

He sees two purposes.

First, efficiency.

Using the mutual exclusion of a distributed lock here is to avoid unnecessarily doing the same work twice (e.g., some expensive computation). Even if the lock occasionally fails, there are no “vicious” consequences, just a duplicate email or the like. No harm done.

Second, correctness.

Locks are used to prevent concurrent processes from interfering with each other. If the lock fails, it will cause multiple processes to operate the same data at the same time, resulting in serious data errors, permanent inconsistencies, data loss and other malignant problems, just like giving a patient a repeated dose of drugs, the consequences are very serious.

If you’re after the former, efficiency, then a single Redis is fine, Martin says; even an occasional lock failure (downtime, master-slave switch) has no serious consequences. Using Redlock for this is too heavy, and unnecessary.

For the sake of correctness, Martin believes that Redlock cannot meet the security requirements at all, and there is still the lock failure problem!

2) Problems encountered by locks in distributed systems

A distributed system, Martin says, is more like a complex “beast,” with all sorts of anomalies you wouldn’t expect.

These exception scenarios fall into three main groups, the three great mountains in the path of distributed systems: NPC.

  • N: Network Delay
  • P: Process Pause
  • C: Clock Drift

Martin illustrated Redlock’s security problem with a process pause (GC) example:

  1. Client 1 requests to lock nodes A, B, C, D, and E
  2. After client 1 gets the lock, it enters GC (takes a long time).
  3. Locks on all Redis nodes have expired
  4. Client 2 has obtained the locks on A, B, C, D, and E
  5. Client 1 successfully obtains the lock when the GC is complete
  6. Client 2 also thinks it has acquired the lock, causing a “conflict”.

According to Martin, GC can happen at any point in the program, and the execution time is not controllable.

Note: Of course, even programming languages without GC can cause Redlock problems with network latency and clock drift. Martin is just using GC as an example.

3) It is unreasonable to assume that the clock is correct

Alternatively, when the “clocks” of several Redis nodes misbehave, the Redlock lock also fails:

  1. Client 1 obtains the locks on nodes A, B, and C, but fails to access NODES D and E due to network problems
  2. The clock on node C jumps forward, causing the lock to expire
  3. Client 2 obtains locks on nodes C, D, and E, but fails to access nodes A and B due to network problems
  4. Both clients 1 and 2 now believe they hold a lock (conflict)

Martin felt that Redlock must “rely heavily” on the clocks of multiple nodes staying synchronized; as soon as any node’s clock goes wrong, the algorithm’s model no longer holds.

A similar problem occurs even if node C does not experience a clock jump but simply “crashes and restarts immediately”.

Martin goes on to explain that it is quite possible for a machine’s clock to fail:

  • The system administrator “manually modified” the machine clock
  • The machine clock made a big “jump” while synchronizing with NTP time

In conclusion, Martin believes Redlock is built on a “synchronous model”, and there is ample research showing that synchronous-model assumptions are problematic in distributed systems.

In a chaotic distributed system, you can’t assume that the system clock is correct, so you have to be very careful with your assumptions.

4) Proposing a fencing token scheme to guarantee correctness

Accordingly, Martin proposed a scheme called the fencing token to guarantee the correctness of distributed locks.

The model flow is as follows (a small sketch follows the list):

  1. When a client acquires the lock, the lock service provides a monotonically “increasing” token
  2. The client uses this token to manipulate the shared resource
  3. Shared resources can reject requests from “latecomers” based on tokens
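Here is a tiny sketch of the check on the resource-server side, under the assumption that the server persists the largest token it has seen (the state and names here are hypothetical, for illustration):

class StaleTokenError(Exception):
    pass

# Hypothetical resource-server state: the largest fencing token seen so far
last_token = 0

def write_resource(token: int, new_value: str) -> None:
    global last_token
    # Reject any request carrying a token no newer than one already seen;
    # in a real server this check-and-set must itself be atomic
    if token <= last_token:
        raise StaleTokenError(f"token {token} is stale (latest: {last_token})")
    last_token = token
    # ... apply new_value to the shared resource ...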

This way, no matter which NPC anomaly occurs, the distributed lock stays safe, because the scheme is built on an “asynchronous model”.

Redlock offers nothing like the fencing token, so it cannot guarantee safety.

He also said that a good distributed lock, however the NPC anomalies unfold, may fail to give a result in time, but must never give a wrong result. That is, anomalies may only affect the lock’s “performance” (liveness), never its “correctness”.

Martin’s conclusion:

1. Redlock is neither fish nor fowl: for efficiency it is heavyweight and unnecessary; for correctness it is not safe enough.

2. Unreasonable clock assumptions: This algorithm makes dangerous assumptions about the system clock (assuming that the machine clocks of multiple nodes are consistent). If these assumptions are not met, the lock will fail.

3. Correctness cannot be guaranteed: Redlock cannot provide anything like the fencing token, so it cannot solve the correctness problem. For correctness, use software built on a “consensus system”, such as Zookeeper.

Ok, so that’s Martin’s argument against Redlock, and it seems valid.

Here’s how Redis author Antirez counters.

Rebuttal by Redis author Antirez

The Redis author’s rebuttal makes three main points:

1) Explain the clock problem

First, the Redis author saw right to the core of the issue: the clock.

According to the Redis author, Redlock does not need perfectly synchronized clocks, just roughly synchronized ones, within an allowable “error”.

For example, if you need to time 5 seconds, recording 4.5s or 5.5s is acceptable; as long as the error stays within the lock’s failure allowance, such clock accuracy is not demanding, and this matches real environments.

In response to the “clock modification” issue, the Redis author retorts:

  1. Manually changing the clock: just don’t do that; by the same token, if you manually tampered with a Raft log, Raft would stop working too…
  2. Clock jumps: with “proper ops practices”, you can ensure a machine clock never jumps too far (apply small adjustments each time)

Why does the Redis author explain the clock issue first? Because the later rebuttals rely on this foundation.

2) Explain network latency and GC problems

Network latency and process GC may cause Redlock to fail.

Let’s return to Martin’s hypothetical scenario:

  1. Client 1 requests to lock nodes A, B, C, D, and E
  2. Client 1 takes the lock and enters the GC
  3. Locks on all Redis nodes have expired
  4. Client 2 obtains the locks on nodes A, B, C, D, and E
  5. Client 1 successfully obtains the lock when the GC is complete
  6. Client 2 also thinks the lock has been acquired and a “conflict” occurs.

The Redis author retorts that this hypothesis is flawed, and that Redlock can guarantee lock safety here.

What’s going on here?

Remember the five steps of the Redlock flow introduced earlier? I’ll repeat them here for review.

  1. The client first gets “current timestamp T1”
  2. The client sends a lock request to each of the 5 Redis instances in turn (using the SET command above), with a timeout on each request (in milliseconds, much shorter than the lock validity time); if one instance fails to lock (network timeout, lock held by someone else, etc.), it immediately moves on to the next Redis instance
  3. If the client locks at least 3 (the majority of) Redis instances successfully, it gets “current timestamp T2” again; if T2 - T1 < the lock expiration time, locking is considered successful, otherwise it is considered failed
  4. On success, go operate the shared resource (e.g., modify a MySQL row or call an API)
  5. On failure, issue a lock release request to “all nodes” (the Lua release script mentioned earlier)

Note that the key lies in steps 1-3. In step 3, why must the client fetch “current timestamp T2” again after locking, and compare T2 - T1 against the lock expiration time?

The Redis author points out that if network delay, process GC, or any other time-consuming anomaly occurs during steps 1-3, it is detected by the T2 - T1 check in step 3; if that exceeds the lock’s expiration time, locking is considered to have failed, and the client simply releases the lock on all nodes.

If the network delay or process GC happens after step 3, that is, after the client has confirmed it holds the lock and while it is operating the shared resource, then the lock can indeed fail. But that is not just Redlock’s problem: any other lock service, Zookeeper included, has the same issue. It is beyond what a lock can address.

Here’s an example to illustrate:

  1. The client successfully acquires the lock via Redlock (passing the majority-locked check and the total-lock-time check)
  2. When the client starts to operate shared resources, network latency and process GC take a long time
  3. At this point, the lock is automatically released when it expires
  4. The client starts to operate MySQL (lock may be taken by someone else, lock invalid)

The Redis author concludes here:

  • Any time-consuming anomaly the client hits before acquiring the lock will be detected by Redlock in step 3
  • If an NPC anomaly occurs after the client has acquired the lock, neither Redlock nor Zookeeper can do anything about it

Therefore, the Redis author believes that as long as the clocks are correct, Redlock can guarantee correctness.

3) Question the fencing token mechanism

The Redis author also questioned the fencing token mechanism, with two main objections. This is the hardest part to follow, so stay with me.

First, this scheme requires the shared-resource server itself to be able to reject old tokens.

For example, to operate MySQL you get an incrementing token from the lock service, and the client then updates a MySQL row carrying this token, relying on MySQL’s “transaction isolation”:

-- Both clients rely on the database's transactions and isolation
-- Note the token condition in the UPDATE
UPDATE table T SET val = $new_val WHERE id = $id AND current_token < $token

But what if you’re not operating MySQL? If you are writing a file to disk, or making an HTTP request, this scheme is powerless; it places higher demands on the resource server being operated on.

That is to say, most resource servers you operate on have no such mutual-exclusion capability.

Furthermore, if the resource server already had “mutual exclusion”, why would you need a distributed lock at all?

So the Redis author argues that the scheme is untenable.

Second, even though Redlock provides no fencing token, the random value (UUID) Redlock does provide can achieve the same effect.

How do you do that?

The Redis author only mentioned that the fencing token’s function can be replicated, without expanding on the details. Based on the material I have checked, the rough flow should be as follows; corrections welcome:

  1. The client uses Redlock to get the lock
  2. Before operating on a shared resource, the client marks the VALUE of the lock on the shared resource to be operated
  3. The client processes the business logic, and finally, when modifying the shared resource, determines whether the tag is the same as before.

MySQL, for example, is like this:

  1. The client uses Redlock to get the lock
  2. Before the client modifies a row in the MySQL table, update the VALUE of the lock to a field in the row (for example, the current_token field)
  3. The client handles the business logic
  4. Finally, when modifying the MySQL row, put the previously written VALUE into the WHERE condition and check it again:
UPDATE table T SET val = $new_val WHERE id = $id AND current_token = $redlock_value

This scheme relies on MySQL’s transaction mechanism and achieves the same effect as the fencing token the other side described.

However, a small problem remains, raised by netizens during the discussion: if two clients each go through this “mark first, then check + modify” flow, the order of their operations on the shared resource cannot be guaranteed, can it?

With the fencing token Martin described, the token increases monotonically, so the resource server can reject requests with a smaller token, guaranteeing “ordered” operations!

The Redis author explains this differently, and his explanation makes sense to me: the essence of a distributed lock is “mutual exclusion”; as long as, of two concurrent clients, one succeeds and the other fails, the lock does not need to care about “ordering”.

Martin stressed this ordering issue throughout his critique, but the Redis author holds a different view.

To sum up, the Redis author concludes:

1. He agrees with the other side about the impact of “clock jumps” on Redlock, but believes clock jumps can be avoided, and that this depends on infrastructure and ops.

2. Redlock’s design fully accounts for the NPC problem: before step 3, NPC anomalies cannot compromise lock correctness; after step 3, NPC anomalies affect not only Redlock but every other distributed lock service, so they are out of scope.

Interesting, isn’t it?

In a distributed system, a humble lock can run into this many problem scenarios affecting its safety!

After reading both sides’ views, whose do you agree with more?

Don’t worry, I will synthesize both sets of arguments later and share my own understanding.

Ok, having presented both sides of the Redis distributed lock debate, you may have noticed that Martin, in his article, recommends Zookeeper for distributed locks, arguing that it is safer. Is that true?

Are ZooKeeper-based locks secure?

If you are familiar with Zookeeper, its distributed lock is implemented like this (a sketch follows the list):

  1. Clients 1 and 2 both try to create an “ephemeral node”, e.g. /lock
  2. Client 1 arrives first and locks successfully; client 2 fails to lock
  3. Client 1 operates the shared resource
  4. Client 1 deletes the /lock node, releasing the lock
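As a sketch, the same flow using the kazoo Python client might look like this (the connection address is a placeholder; production code would usually use kazoo’s ready-made Lock recipe instead):

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")  # placeholder address
zk.start()

def try_lock() -> bool:
    try:
        # Ephemeral node: deleted automatically when our session ends
        zk.create("/lock", b"client-1", ephemeral=True)
        return True
    except NodeExistsError:
        return False  # another client already holds the lock

def unlock() -> None:
    zk.delete("/lock")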

Zookeeper uses the “ephemeral node” to ensure that client 1 holds the lock as long as its connection stays alive.

Moreover, if client 1 crashes unexpectedly, the ephemeral node is deleted automatically, so the lock is guaranteed to be released.

Not bad: no lock-expiration headaches, and the lock is released automatically on failure. Doesn’t that seem perfect?

It’s not.

Think about it: after client 1 creates the ephemeral node, how does Zookeeper decide that the client should keep holding the lock?

The answer: client 1 maintains a Session with the Zookeeper server, kept alive by the client’s periodic heartbeats.

If Zookeeper does not receive the client’s heartbeats for a long time, the Session expires and the ephemeral node is deleted.

With that in mind, let’s look at how a GC problem affects a Zookeeper lock:

  1. Client 1 creates the ephemeral node /lock successfully and acquires the lock
  2. Client 1 hits a long GC pause
  3. Client 1 fails to send heartbeats to Zookeeper, and Zookeeper deletes the ephemeral node
  4. Client 2 creates the ephemeral node /lock successfully and acquires the lock
  5. Client 1’s GC ends, and it still thinks it holds the lock (conflict)

So Zookeeper, too, cannot guarantee safety under process GC or network delay anomalies.

This is exactly what the Redis author said in his rebuttal: if the client has acquired the lock but then “loses contact” with the lock server (e.g., during GC), it is not just Redlock that has a problem; other lock services have the same issue, and Zookeeper is no exception!

So, here we can conclude that a distributed lock, in extreme cases, is not necessarily secure.

If your business data is very sensitive, be aware of this issue when using distributed locks. You cannot assume that distributed locks are 100% secure.

Now let’s summarize the advantages and disadvantages of Zookeeper distributed locks.

Advantages of Zookeeper:

  1. No need to worry about lock expiration times
  2. Watch mechanism: on lock failure, you can watch and wait for the lock to be released, in an optimistic-lock style

But its disadvantages are:

  1. Performance is not as good as Redis
  2. Higher deployment and operations costs
  3. If the client loses its connection to Zookeeper for a while, the lock is released

My understanding of distributed locks

Having walked through the security issues of Redlock and Zookeeper-based distributed locks under all kinds of anomalies, I want to share my own views. They are for reference only; no flames please.

1) Redlock or not?

As mentioned above, Redlock will only work if the clock is “correct”, and if you can guarantee this, you can use it.

But keeping clocks correct is, I think, not as easy as it sounds.

First, from the hardware point of view, the clock offset is inevitable.

For example, CPU temperature, machine load, and chip material can all cause the clock to shift.

Second, in my own work experience I have run into clock errors, and ops staff forcibly modifying clocks, both affecting system correctness. Human error is hard to rule out completely.

So my personal take on Redlock is: avoid it if you can. It performs no better than single-node Redis and costs more to deploy; I would still prefer the master-slave + sentinel setup for distributed locking.

How can correctness be guaranteed? The second point gives you the answer.

2) How to use distributed locks correctly?

When analyzing Martin’s views, we saw the fencing token scheme, which inspired me a lot. Although it has significant limitations, it is a very good idea for scenarios that demand “correctness”.

So, we can combine the two:

1. Use the distributed lock at the upper layer for “mutual exclusion”. Although the lock may fail in extreme cases, it blocks the bulk of concurrent requests at the top layer and relieves pressure on the resource layer.

2. For business that demands absolutely correct data, add a safety net at the resource layer; the design can follow the fencing token scheme.

Combining the two approaches, I think for most business scenarios this is enough.

Conclusion

All right, so to sum up.

In this article, we mainly discuss whether distributed lock based on Redis is safe.

We went from the simplest distributed lock implementation, through handling various abnormal scenarios, to Redlock, the debate between the two distributed systems experts, and the scenarios where Redlock applies.

Finally, we compared the problems Zookeeper may encounter as a distributed lock, and how it differs from Redis.

I’ve summarized these into mind maps for your convenience.

Afterword

This article carries a lot of information; I hope it has made the distributed lock problem thoroughly clear.

If you don’t get it, I suggest you read it a few more times and construct hypothetical scenarios in your head.

While writing this article, I re-read the two debate posts on Redlock, which taught me a great deal, and I want to share two thoughts.

1. In a distributed environment, a seemingly perfect design may not be so “tight”; with a little scrutiny, all kinds of problems surface. So when thinking about distributed systems, you must stay cautious and careful.

2. From the Redlock debate, we should focus less on who is right and more on learning the masters’ way of thinking and their rigor in examining a problem.

Finally, I would like to end with Martin’s reflections after the Redlock debate:

“Great things have been done for us: we can build better software by standing on the shoulders of giants. Anyway, it’s part of the learning process to argue and check whether they stand up to the scrutiny of others. But the goal should be to gain knowledge, not to convince others that you’re right. Sometimes it just means to stop and think.”


Personal website

  • Github Pages
  • Gitee Pages