While operating a master-slave cluster, I ran into a problem: our cluster had 1 master library, 5 slave libraries, and 3 sentinel instances. During use, we found that some data sent by clients was lost, which directly affected data reliability at the business layer.

After a round of troubleshooting, we found that this was actually caused by the split-brain problem in the master-slave cluster.

Split brain means that a master/slave cluster ends up with two master nodes, both of which can receive write requests. The most immediate effect is that clients do not know which master to write to, so different clients write data to different masters. In severe cases, split brain can lead to further data loss.

So why does split brain occur in master-slave clusters? Why does it cause data loss? And how can we avoid it? In this lesson, I will walk through the real problem I encountered, show how we analyzed and located it, and help you master the causes, consequences, and countermeasures of split brain.

Why does split-brain occur?

As I mentioned earlier, the problem I initially found was that the data sent by the client was lost in the master/slave cluster. So, first of all, why did the data get lost? Is there a data synchronization problem?

Step 1: Determine if there is a problem with data synchronization

The most common cause of data loss in a master/slave cluster is that some data written to the master has not yet been synchronized to the slave when the master fails. After the slave is promoted to master, that unsynchronized data is lost.

As shown in the figure below, the newly written data a:1 and b:3 were lost because they had not been synchronized to the slave before the master failed.

If this is the cause of the data loss, we can confirm it by comparing the replication progress of the master and slave libraries, that is, by calculating the difference between master_repl_offset and slave_repl_offset. If slave_repl_offset on the slave is smaller than master_repl_offset on the original master, we can attribute the data loss to incomplete synchronization.
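If you want to run this check yourself, here is a minimal sketch using redis-py; the host addresses are placeholders, and the field names come straight from the INFO replication output.

```python
# Minimal sketch, assuming redis-py; the addresses below are placeholders.
import redis

old_master = redis.Redis(host="10.0.0.1", port=6379)   # original master
candidate = redis.Redis(host="10.0.0.2", port=6379)    # slave to be promoted

master_info = old_master.info("replication")
slave_info = candidate.info("replication")

# A master reports master_repl_offset; a slave additionally reports slave_repl_offset.
master_offset = master_info["master_repl_offset"]
slave_offset = slave_info.get("slave_repl_offset", slave_info["master_repl_offset"])

gap = master_offset - slave_offset
if gap > 0:
    print(f"{gap} bytes of writes had not reached the slave yet")
else:
    print("offsets match: the loss was not caused by incomplete synchronization")
```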

When we deployed the cluster, we had already been monitoring master_repl_offset on the master and slave_repl_offset on the slaves. When we discovered the missing data, we compared the slave_repl_offset of the new master just before its promotion against the master_repl_offset of the original master. They were identical; in other words, the slave that was promoted to the new master was already fully in sync with the original master at the time of the switchover. So why was the data sent by clients still missing?

At this point, our first hypothesis was overturned. It then occurred to us that all data operations are sent from clients to the Redis instances, so perhaps the client operation logs would reveal the problem. Next, we turned to the client side.

Step 2: Check the client operation logs and discover the split brain

When checking the client operation logs, we found that after the master/slave switchover, one client was still communicating with the original master and never interacted with the newly promoted master. This is equivalent to having two masters in the cluster at the same time, which points to a problem that can occur during failover in a distributed master-slave cluster: split brain.

However, if different clients simply write to two different masters, the new data is merely scattered across them; nothing should be lost outright. So why was our data still missing?

At this point, our line of investigation broke down once again. Still, when analyzing problems, we keep in mind that "going back to first principles is the best way to trace a problem to its source." Split brain arises during the master/slave switchover, so we guessed that we must have overlooked some step in the switchover process. We therefore focused on how the master/slave switchover is actually executed.

Step 3: Discover that the split brain was caused by a false failure of the original master

For a master/slave switchover to happen, a preset number (quorum) of sentinel instances must each find that their heartbeat to the master has timed out, at which point the master is judged to be objectively offline. Only then do the sentinels start the switchover. After the switchover completes, clients should communicate with the new master and send their requests there.
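For context, this is roughly how a client is supposed to find the current master through the sentinels after a switchover. The sketch below uses redis-py's Sentinel support; the sentinel addresses and the service name "mymaster" are placeholders.

```python
# Minimal sketch, assuming redis-py; sentinel addresses and "mymaster" are placeholders.
from redis.sentinel import Sentinel

sentinels = Sentinel(
    [("10.0.0.11", 26379), ("10.0.0.12", 26379), ("10.0.0.13", 26379)],
    socket_timeout=0.5,
)

# Ask the sentinels which instance is currently the master, rather than
# caching the old master's address on the client side.
print(sentinels.discover_master("mymaster"))

# Or get a connection that always routes writes to the current master.
master = sentinels.master_for("mymaster", socket_timeout=0.5)
master.set("a", 1)
```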

However, the fact that a client was still communicating with the original master during the switchover indicates that the original master had not actually failed (for example, its process had not crashed). We suspected that the master was misjudged as offline by the sentinels because, for some reason, it was temporarily unable to process requests and did not respond to the sentinels' heartbeats. Once it recovered, and before the sentinels finished the master/slave switchover, clients could still communicate with it, and any write operations they sent would be written to the original master.

To verify that the original master had only suffered a "false failure", we also checked the resource usage monitoring records of the server hosting it.

Sure enough, we saw a period of particularly high CPU utilization on the machine hosting the original master, caused by a data collection program we had deployed there. Because this program had almost exhausted the machine's CPU, the Redis master could not respond to heartbeats. During this period, the sentinels judged the master to be offline and started the master/slave switchover. However, the data collection program soon returned to normal and CPU utilization dropped, at which point the original master began serving requests again.

Because the original master had not actually failed, we could see clients still communicating with it in the operation logs. Once the slave was promoted to the new master, the cluster contained two masters, and that is exactly how the split brain came about.

To help you understand, let me draw another picture of what happens.

After figuring out the cause of the split brain, we went back over the master/slave switchover procedure and soon found the cause of the data loss.

Why does split-brain cause data loss?

During the master/slave switchover, once the slave is promoted to the new master, the sentinels make the original master execute the slaveof command so that it performs a full synchronization with the new master. In the final phase of the full sync, the original master must clear its local data and load the RDB file sent by the new master. As a result, any new writes the original master accepted during the switchover are lost.
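To make the demotion step concrete, here is a minimal sketch of the command the sentinels issue to the original master. The addresses are placeholders, and the sentinels do this automatically; you would not normally run it by hand.

```python
# Minimal sketch, assuming redis-py; the addresses are placeholders.
import redis

old_master = redis.Redis(host="10.0.0.1", port=6379)

# Equivalent to "SLAVEOF 10.0.0.2 6379" (REPLICAOF in Redis 5 and later).
# This starts a full sync with the new master, which flushes the local data set.
old_master.slaveof("10.0.0.2", 6379)
```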

The following figure visually shows the process of data loss from the original master library.

At this point, we fully understand how and why this problem occurs.

If the original master suffers only a "false failure" during the master/slave switchover, it still triggers the sentinels to initiate the switchover. Once it recovers from the false failure, it starts processing requests again and coexists with the new master, which is the split brain. Then, when the sentinels make the original master resynchronize with the new master, the data it saved during the switchover is lost.

Now, you’re probably wondering, what do we do about data loss caused by split brain?

How to deal with the split brain problem?

As we just saw, the data loss in our master/slave cluster was ultimately caused by split brain. So, we need a strategy for dealing with the split-brain problem.

Since the root of the problem was that the original master could still receive requests after a false failure, we searched the configuration items of the master/slave mechanism for settings that could restrict the master from accepting requests.

Through this search, we found that Redis provides two configuration items that limit the master's request handling: min-slaves-to-write and min-slaves-max-lag (renamed min-replicas-to-write and min-replicas-max-lag in Redis 5 and later).

  • min-slaves-to-write: this setting specifies the minimum number of slave libraries the master must be able to synchronize data to.
  • min-slaves-max-lag: this setting specifies the maximum delay (in seconds) allowed for the ACK messages that slaves send to the master during replication.

With these two configuration items, we can easily deal with the split brain problem. How do you do that?

We can use min-slaves-to-write and min-slaves-max-lag together, setting them to thresholds N and T respectively. Combined, they require that at least N slaves are connected to the master and that the ACK delay of those slaves during replication does not exceed T seconds. Otherwise, the master refuses client write requests.
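As a concrete illustration, here is a minimal sketch of applying the two settings on the master at runtime with redis-py. The address and the thresholds N=1, T=12 are just example values; the same settings can also be written into redis.conf.

```python
# Minimal sketch, assuming redis-py; address and thresholds are example values.
import redis

master = redis.Redis(host="10.0.0.1", port=6379)

# Require at least 1 connected slave whose last ACK is no more than 12 seconds old;
# otherwise the master rejects write requests.
master.config_set("min-slaves-to-write", 1)
master.config_set("min-slaves-max-lag", 12)

print(master.config_get("min-slaves-*"))
```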

Even if the original master is only experiencing a false failure, during that failure it cannot respond to sentinel heartbeats or keep replicating with its slaves, so it receives no timely ACKs from them. The combined requirement of min-slaves-to-write and min-slaves-max-lag therefore cannot be met, the original master is blocked from accepting client requests, and clients cannot write new data to it.

Once the new master comes online, it is the only instance that can receive and process client requests, so newly written data goes straight to the new master. The original master is then demoted to a slave by the sentinels, and even though its local data is wiped during the full sync, no new data is lost.

Let me give you another example.

Suppose we set min-slaves-to-write to 1, min-slaves-max-lag to 12s, and the sentinels' down-after-milliseconds to 10s. For some reason, the master gets stuck for 15s, so the sentinels judge it to be offline and start the master/slave switchover. Meanwhile, because the master has been stuck for 15s, no slave has been able to replicate with it within the 12s window, so the master refuses client requests. After the switchover completes, only the new master can accept requests; there is no split brain, and therefore no data loss.

Summary

In this lesson, we learned about the split-brain problem that can occur during a master/slave switchover. Split brain means the cluster contains two master libraries that can both receive write requests. If split brain occurs during a Redis master/slave switchover, client data may be written to the original master; when the original master is later demoted to a slave, that newly written data is lost.

The main cause of split brain is a false failure of the original master. Let's summarize two common causes of such false failures.

  1. Other programs deployed on the same server as the master temporarily consume a large amount of resources (such as CPU), leaving the master too starved of resources to respond to heartbeats for a short period. Once those programs release the resources, the master returns to normal.
  2. The master itself is blocked, for example while processing a bigkey or during a memory swap (you can review the causes of instance blocking in Lecture 19). It fails to respond to heartbeats for a short time, and resumes normal request processing once the blocking clears.

To guard against split brain, you can configure min-slaves-to-write and min-slaves-max-lag appropriately when deploying the master-slave cluster.

In practice, temporary network congestion may delay the ACK messages between a slave and the master. That alone is not a false failure of the master, and we should not block the master from receiving requests in that case.

So my suggestion is: if there are K slave libraries, set min-slaves-to-write to K/2+1 (or to 1 if K equals 1) and min-slaves-max-lag to a dozen seconds or so (for example 10~20s). With this configuration, if more than half of the slaves' ACK messages are delayed beyond that window, the master stops accepting client write requests.
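Put into code, the recommendation looks roughly like this; again a sketch with redis-py, where K and the address are example values.

```python
# Minimal sketch, assuming redis-py; K and the address are example values.
import redis

K = 5  # number of slave libraries in the cluster
master = redis.Redis(host="10.0.0.1", port=6379)

master.config_set("min-slaves-to-write", max(1, K // 2 + 1))  # 3 when K = 5
master.config_set("min-slaves-max-lag", 15)                   # within the 10~20s range
```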

In this way, we avoid the data loss caused by split brain, while not blocking the master just because a few slaves temporarily cannot reach it due to network congestion. This increases the robustness of the system.

One question per lesson

As usual, here is a quick question: suppose we set min-slaves-to-write to 1, min-slaves-max-lag to 15s, the sentinels' down-after-milliseconds to 10s, and the sentinels' master/slave switchover takes 5s. The master gets stuck for 12s for some reason. Will split brain occur? Will data be lost after the master/slave switchover completes?

Feel free to write down your thoughts and answers in the comments, and we can discuss them together. If you found today's content helpful, please share it with your friends and colleagues. See you next time.