In many scenarios, ensuring eventual consistency of data requires supporting techniques such as distributed transactions and distributed locks. There are many distributed locking schemes and libraries built on Redis, but some of them do not address the pitfalls of distributed environments.
Distributed locking features
- Mutual exclusion: only one client can hold the lock at any given time. This is the fundamental property of a distributed lock.
- Deadlock freedom: every lock request can eventually obtain the lock, even if a client crashes or hits an exception while holding it.
Different implementations
Many distributed lock implementations are built on distributed consensus algorithms (Paxos, Raft, ZAB, PacificA): Chubby is based on Paxos, ZooKeeper on ZAB, and Consul on Raft. The author of Redis has also proposed a distributed lock algorithm called RedLock.
In the following sections, I'll show how to implement a distributed lock on top of Redis step by step; each step tries to solve a problem that can occur in a distributed environment.
Scenario 1: Single-instance Redis
For simplicity, suppose we have two clients and one Redis instance. A simple implementation would be:
```java
boolean tryAcquire(String lockName, long leaseTime, OperationCallBack operationCallBack) {
    // Try to take the lock
    boolean getLockSuccessfully = getLock(lockName, leaseTime);
    if (getLockSuccessfully) {
        try {
            operationCallBack.doOperation();
        } finally {
            releaseLock(lockName);
        }
        return true;
    } else {
        return false;
    }
}

boolean getLock(String lockName, long expirationTimeMillis) {
    // Create a lockValue that is unique to the current thread
    String lockValue = createUniqueLockValue();
    try {
        // If lockName is not already locked, store it in Redis as the key,
        // with the given expiration time
        String response = storeLockInRedis(lockName, lockValue, expirationTimeMillis);
        // "OK".equalsIgnoreCase(...) is null-safe if the SET was rejected
        return "OK".equalsIgnoreCase(response);
    } catch (Exception exception) {
        // Best-effort cleanup, then rethrow unchecked so callers need not declare it
        releaseLock(lockName);
        throw new RuntimeException(exception);
    }
}

void releaseLock(String lockName) {
    // Assumes createUniqueLockValue() returns the same value for the same thread
    // (e.g. derived from the thread id), i.e. the value stored by getLock
    String lockValue = createUniqueLockValue();
    // Remove the key lockName only if its value is still lockValue
    removeLockFromRedis(lockName, lockValue);
}
```
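For concreteness, here is one possible way to implement storeLockInRedis and removeLockFromRedis. This is only a sketch: the Jedis client, the class name, and the host and port are my assumptions, not a prescribed implementation. Acquisition uses SET with NX and PX so the key and its expiration are set in one atomic step; release uses a Lua script so the value check and the delete are also atomic.

```java
import java.util.Collections;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

class RedisLockOps {
    // Delete the key only if it still holds our lockValue (atomic check-and-delete)
    private static final String UNLOCK_SCRIPT =
        "if redis.call('get', KEYS[1]) == ARGV[1] then "
      + "  return redis.call('del', KEYS[1]) "
      + "else return 0 end";

    private final Jedis jedis = new Jedis("127.0.0.1", 6379);

    String storeLockInRedis(String lockName, String lockValue, long expirationTimeMillis) {
        // SET lockName lockValue NX PX <ms>: creates the key together with its
        // expiration, and only if the key does not already exist
        return jedis.set(lockName, lockValue,
                SetParams.setParams().nx().px(expirationTimeMillis));
    }

    void removeLockFromRedis(String lockName, String lockValue) {
        jedis.eval(UNLOCK_SCRIPT,
                Collections.singletonList(lockName),
                Collections.singletonList(lockValue));
    }
}
```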
What’s wrong with this approach?
Suppose client 1 requests a lock with a given timeout. If the server's response takes longer than the lock's timeout, client 1 ends up holding a lock that has already expired, and client 2 can acquire the same lock for its own operations. This breaks the mutual exclusion that a distributed lock is supposed to guarantee.
To solve this problem, we should set a request timeout on the Redis client that is shorter than the lock timeout.
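With Jedis, for instance, the request timeout can be supplied when the connection is created. This is only a sketch: the host, port, and both timeout values below are assumed, illustrative numbers.

```java
import redis.clients.jedis.Jedis;

class LockClientConfig {
    // Illustrative values: the request timeout must stay below the lock's lease time
    static final int REQUEST_TIMEOUT_MILLIS = 500;  // assumed client-side timeout
    static final long LOCK_LEASE_MILLIS = 10_000;   // assumed lock timeout

    static Jedis newClient() {
        // The third constructor argument sets the connection and socket timeout,
        // so a slow server reply fails fast instead of outliving the lock
        return new Jedis("127.0.0.1", 6379, REQUEST_TIMEOUT_MILLIS);
    }
}
```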
Even so, the problem is not completely solved: suppose the Redis server restarts after a power failure; other problems then appear, so let's look at the second scenario.
Scenario 2: Single point of failure of single instance Redis
If you know anything about Redis data persistence, you know that Redis has two ways to persist data.
- RDB (Redis Database): saves point-in-time snapshots of the dataset to disk at a specified interval.
- AOF (Append-Only File): logs every write command received by the server; on restart, these commands are replayed to rebuild the original data.
By default, only the RDB mode is enabled. The configuration is as follows:
```
save 900 1
save 300 10
save 60 10000
```
For example, the first line means: if at least 1 key changes within 900 seconds (15 minutes), a snapshot is saved to disk.
Therefore, in the worst case, a lock can sit only in memory for up to 15 minutes. If the Redis service restarts after a power failure, the lock data held in memory is lost, and another client can acquire the same lock.
To solve this problem, we must enable AOF with appendfsync always before setting keys in Redis.
Note that enabling this option has some impact on Redis performance, but we need this option for strong consistency.
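The corresponding redis.conf directives are:

```
appendonly yes
appendfsync always
```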
Scenario 3: Primary/secondary Replication
In this configuration, we have one or more instances (often referred to as slave instances or replicas) that are exact copies of the master instance.
By default, replication in Redis is asynchronous; this means the master does not wait for a command to be processed by the replicas before replying to the client.
The problem is that the master can fail, and a failover can occur, before the write is replicated; after that, if another client requests the lock, it will succeed! Or suppose a temporary network problem causes one of the replicas to miss the command; the network then stabilizes, a failover happens quickly, and the node that never received the command becomes the new master. Either way, the lock ends up missing from every instance!
As a mitigation, Redis provides the WAIT command: it blocks until either the previous write commands have been acknowledged by the specified number of replicas or the timeout elapses, and it returns the number of replicas that acknowledged.
For example, if we have two replicas, the following command waits at most 1 second (1000 ms) for acknowledgement from both replicas before returning:
```
WAIT 2 1000
```
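In the acquire path, this could look like the sketch below. It assumes the Jedis client, whose waitReplicas method wraps the WAIT command; the replica count and timeout are the illustrative values from above, and removeLockFromRedis is the helper sketched in Scenario 1.

```java
boolean getLockWithReplication(Jedis jedis, String lockName, String lockValue,
                               long expirationTimeMillis) {
    // Acquire as before: SET lockName lockValue NX PX <ms>
    String response = jedis.set(lockName, lockValue,
            SetParams.setParams().nx().px(expirationTimeMillis));
    if (!"OK".equals(response)) {
        return false;
    }
    // WAIT 2 1000: block until 2 replicas acknowledge the write, or 1s elapses
    long acked = jedis.waitReplicas(2, 1000);
    if (acked < 2) {
        // Not enough replicas confirmed: release the lock and report failure
        removeLockFromRedis(lockName, lockValue);
        return false;
    }
    return true;
}
```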
So far, so good, but there is another problem: replicas can still lose writes in unlucky failure sequences. For example, if a replica fails before persisting the key and the master then fails as well, the failover may promote the restarted replica as the new master. Once the other nodes synchronize with it, neither the new master nor any replica has the key that was written to the old master!
For all masters and replicas to stay exactly in sync, we should enable AOF with appendfsync always on all Redis instances before acquiring locks.
Note: In this approach, we have compromised availability for consistency, and AOF has a performance cost.
Scenario 4: Automatically refreshing the lock
In this scenario, the acquired lock can be held as long as the client is alive and the connection is healthy.
We need a mechanism that refreshes the lock before it expires. We must also handle the case where the lock cannot be refreshed; when that happens, the holder has to exit the critical section immediately.
In addition, when the lock holder releases the lock, other waiting clients should be able to acquire it and enter the critical section; a sketch of the refresh mechanism follows below.
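Here is a minimal watchdog sketch of that idea. The refresh period (one third of the lease time), the class and method names, and the use of a dedicated Jedis connection for the scheduler thread (Jedis is not thread-safe) are all my assumptions; the Lua script extends the TTL only while we still own the lock.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

import redis.clients.jedis.Jedis;

class LockWatchdog {
    // Extend the TTL only if the key still holds our lockValue (atomic check-and-expire)
    private static final String REFRESH_SCRIPT =
        "if redis.call('get', KEYS[1]) == ARGV[1] then "
      + "  return redis.call('pexpire', KEYS[1], ARGV[2]) "
      + "else return 0 end";

    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    // jedis must be a connection used only by this watchdog thread
    ScheduledFuture<?> start(Jedis jedis, String lockName, String lockValue,
                             long leaseTimeMillis) {
        long periodMillis = leaseTimeMillis / 3;  // assumed refresh period
        return scheduler.scheduleAtFixedRate(() -> {
            Object extended = jedis.eval(REFRESH_SCRIPT,
                Collections.singletonList(lockName),
                Arrays.asList(lockValue, String.valueOf(leaseTimeMillis)));
            if (Long.valueOf(0L).equals(extended)) {
                // The lock is gone or owned by someone else: throwing here cancels
                // further runs; the holder must notice (e.g. via the returned
                // future) and leave the critical section immediately
                throw new IllegalStateException("Lost lock: " + lockName);
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }
}
```

Waiting clients can then retry acquisition in a loop with a short sleep, or be notified of a release via Redis pub/sub, which is roughly how libraries such as Redisson handle both concerns.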
Summary
Here, I've solved a new problem at each step, but some important issues remain unresolved, and I want to point them out:
- Clock drift between different nodes;
- Long thread or process pauses (for example, GC pauses) on the client after it obtains the lock;
- Unfairness: one client may wait a long time to acquire the lock while another client gets it immediately.
Many third-party libraries use Redis to provide distributed lock services. We should understand how they work and what can go wrong, and weigh the trade-off between their correctness and performance.
If it helps you, a “like” is the biggest encouragement for me!