We talked briefly earlier about the role of replication, which simply means keeping the same copy of data on multiple machines. With these copies we can do a lot of things: they can serve reads to spread the load, act as a backup in case the machine holding the original data fails (or is taken down for a deploy), and so on.

The idea is simple, but when we actually make these copies we run into many problems: should replication be synchronous or asynchronous, how do we keep multiple copies consistent, and so on. This article covers these aspects in detail.

Leaders and Followers

We call each node that stores data a replica. Once we have multiple nodes, the most obvious problem is how to ensure that the content of each node stays the same. The most common approach is leader-based replication (also known as master-slave or active/passive mode). In general, it works as follows:

  • One node is the leader. All write operations must go through the leader.
  • The other nodes become followers. Each time the leader writes data, it also sends the relevant change information to each follower (the replication log), and each follower updates its local data according to this log.
  • Read operations can be served by either the leader or a follower. (A minimal sketch of this flow follows the list.)
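
To make the division of labor concrete, here is a minimal, illustrative sketch of leader-based replication. The class and method names are invented for this example, not taken from any real database:

```python
class Follower:
    """Holds a copy of the data and applies replication log entries."""

    def __init__(self):
        self.data = {}

    def apply(self, entry):
        # Each log entry describes one change; here simply (key, value).
        key, value = entry
        self.data[key] = value


class Leader:
    """All writes go through the leader, which forwards them as log entries."""

    def __init__(self, followers):
        self.data = {}
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value              # apply locally first
        for follower in self.followers:     # then ship the change to every follower
            follower.apply((key, value))


followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:1234", "picture_v2.png")

# Reads can be served by the leader or by any follower:
print(followers[0].data["user:1234"])   # -> picture_v2.png
```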

Synchronous vs. Asynchronous Replication

This approach is common in many relational databases. One of the biggest questions here is whether replication should be synchronous or asynchronous.

Consider a simple example: to update the picture of user ID 1234, the client first sends the update request to the leader replica. The leader then forwards the change to the two follower replicas, so that the user’s picture is updated in every replica.

Let’s use this example to analyze the difference between synchronous and asynchronous replication. As shown in the figure below, the leader returns only after the update on Follower 1 has completed, which means Follower 1 is updated synchronously. The leader does not wait for Follower 2 to complete its update before returning, which means Follower 2 is updated asynchronously.

The figure also shows that the updates to Follower 1 and Follower 2 each arrive after some delay. Although this delay is small in most cases, when we hit network problems or other issues (high CPU usage, insufficient memory, etc.), the delay can become very large, and there is no guarantee of when the follower will be updated.

Therefore, the advantage of synchronous updates is that the followers are guaranteed to hold the same data as the leader. If the leader has problems, we can even switch directly to a synchronously updated follower. The downside is equally obvious: every update (write operation) has to wait for the follower to finish, which can delay the entire request for a long time or even cause it to time out. So if you want synchronous updates, you generally don’t make all nodes synchronous; instead you keep the leader synchronized with just one follower (a setup often called semi-synchronous). This strikes a good trade-off between safe leader switchover and write efficiency.
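
As a rough illustration of this semi-synchronous setup, the sketch below has the leader block on one follower’s write and replicate to the rest in the background. The thread-and-queue transport is just a stand-in for a real network:

```python
import queue
import threading


class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, entry):
        key, value = entry
        self.data[key] = value


class SemiSyncLeader:
    def __init__(self, sync_follower, async_followers):
        self.data = {}
        self.sync_follower = sync_follower
        self.async_queues = []
        for follower in async_followers:
            q = queue.Queue()
            # One background thread per asynchronous follower drains its queue.
            threading.Thread(target=self._drain, args=(q, follower), daemon=True).start()
            self.async_queues.append(q)

    @staticmethod
    def _drain(q, follower):
        while True:
            follower.apply(q.get())

    def write(self, key, value):
        self.data[key] = value
        entry = (key, value)
        # Synchronous: do not return until this one follower has the write,
        # so it is always a safe failover target.
        self.sync_follower.apply(entry)
        # Asynchronous: enqueue for the others and return immediately.
        for q in self.async_queues:
            q.put(entry)


leader = SemiSyncLeader(Follower(), [Follower(), Follower()])
leader.write("user:1234", "picture_v2.png")
```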

In reality, however, leader-based replication is often fully asynchronous. The downside is that if the leader fails, any data not yet synced to another node is lost. The upside is that the leader is completely unaffected by the other nodes and can keep running even if all of them fail (usually an alert fires at this point and the on-call engineer springs into action, haha). There are many reasons for this choice in a production environment: for example, the nodes are usually distributed across different regions (data centers), so communication between them is not always reliable, and synchronous replication would run into trouble, and so on.

Setting Up a New Follower

Sometimes we need to create a new follower from scratch. This is quite common in production environments: for example, a machine’s disk fails and all the original follower’s data becomes unusable, or at some point you simply don’t have enough followers and need to build one on a new machine. So how do you build a node from scratch?

The first idea that comes to mind is to copy the data from the leader. But we know the leader is constantly changing, that is, it is being written to all the time, and copying the data is not particularly fast (the main bottleneck is disk write speed; an HDD manages roughly 200 MB/s at present). Blocking writes for the duration of the copy is not a good idea either. So how does it work in general? Roughly, the following steps:

  • Take a snapshot of the leader’s database (the snapshot does not block writes). Most databases support this.
  • Copy this snapshot to the new follower.
  • The follower contacts the leader to request all the operations (the log sequence) that happened after the snapshot.
  • The follower then applies those operations on top of the snapshot, which is called catch-up. (A minimal sketch of these steps follows the list.)
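
Under the assumption that the leader can produce a snapshot tagged with a log position, the four steps might look like this sketch (the APIs are invented for illustration):

```python
class Leader:
    def __init__(self):
        self.data = {}
        self.log = []                      # ordered replication log of (key, value)

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # Step 1: a consistent snapshot plus the log position it corresponds to,
        # taken without blocking writes.
        return dict(self.data), len(self.log)

    def log_since(self, position):
        return self.log[position:]


def bootstrap_follower(leader):
    snapshot, position = leader.snapshot()   # steps 1-2: copy the snapshot over
    follower_data = dict(snapshot)
    for key, value in leader.log_since(position):
        follower_data[key] = value           # steps 3-4: replay and catch up
    return follower_data


leader = Leader()
leader.write("a", 1)
new_follower = bootstrap_follower(leader)    # new_follower == {"a": 1}
```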

Handling Node Outages

As we mentioned above, any node can fail at any time. So what do we do when something does go wrong?

If the faulty node is a follower and the problem is just a service crash or a server reboot, meaning the data on disk is still intact, the common solution is straightforward: once the node recovers, it looks up the last log position it applied (usually stored durably alongside its data), requests from the leader all the changes it missed while it was down, and replays them, just like the catch-up described above.
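
Here is a sketch of that recovery path, assuming the follower durably records the position of the last log entry it applied; the checkpoint handling is deliberately simplified:

```python
def recover_follower(follower_data, checkpoint, leader_log):
    """Replay everything after the last durably recorded log position.

    `checkpoint` is the number of log entries applied before the crash;
    `leader_log` stands in for the entries fetched from the leader.
    """
    for position in range(checkpoint, len(leader_log)):
        key, value = leader_log[position]
        follower_data[key] = value
        checkpoint = position + 1   # persist this durably in a real system
    return follower_data, checkpoint


# The follower had applied 1 entry before rebooting; it catches up on the rest.
log = [("a", 1), ("b", 2), ("c", 3)]
data, checkpoint = recover_follower({"a": 1}, 1, log)
# data == {"a": 1, "b": 2, "c": 3}, checkpoint == 3
```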

If the problem node is the leader, the situation gets slightly more complicated. A follower needs to be promoted to be the new leader, all write operations need to be redirected to the new leader, and the other followers need to start pulling changes from the new leader. We call this process failover. Failover can be manual or automatic. Automatic failover involves the following steps:

  • Determine whether the leader really has a problem. This requires some node-down detection mechanism, such as a heartbeat, to decide whether the leader is truly down before continuing with the following steps. How to do this detection is actually a very interesting topic; the author has wrestled with it in a real production environment: for example, how do you tell whether the problem is with the leader or with yourself, and who gets to decide that the leader is down?
  • Select a new leader. There is usually an election process to decide who the best candidate is; many mechanisms are worth discussing here, such as majority election.
  • Reconfigure the system to use the new leader. Once the new leader is selected, all traffic needs to be switched over so that the new leader handles the writes. (A toy sketch of these three steps follows the list.)
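
Put together, an automatic failover flow might look roughly like the toy sketch below. The 10-second timeout and the “most up-to-date log wins” election rule are illustrative choices, not a prescription:

```python
import time

HEARTBEAT_TIMEOUT = 10.0   # seconds of silence before we suspect the leader


def leader_suspected_down(last_heartbeat_at):
    # Step 1: failure detection. A long silence is only *evidence* of failure;
    # it could also be a network blip (see the discussion below).
    return time.monotonic() - last_heartbeat_at > HEARTBEAT_TIMEOUT


def elect_new_leader(followers):
    # Step 2: pick the follower with the most complete replication log,
    # which minimizes the number of lost writes.
    return max(followers, key=lambda f: f["log_position"])


def failover(followers, reroute_writes):
    new_leader = elect_new_leader(followers)
    reroute_writes(new_leader)   # step 3: all write traffic switches over
    return new_leader


followers = [{"name": "f1", "log_position": 120}, {"name": "f2", "log_position": 97}]
print(failover(followers, lambda leader: None)["name"])   # -> f1
```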

In fact, there are many problems encountered during failover:

  • If we replicate asynchronously, the new leader may be missing some of the data. If the old leader then comes back, the two will conflict. How do we handle those conflicts? A common approach is to discard the old leader’s unreplicated writes, but that means losing the user’s data.
  • Losing writes can be a big problem. For example, if our database is coordinated with other systems, the lost writes may happen to matter a lot to those systems, leading to inconsistencies between them.
  • Sometimes two nodes both believe they are the leader: the original leader still thinks of itself as the leader, while the new leader does too. This is called split brain. It is dangerous because both leaders accept writes, and with no mechanism to resolve the conflicting writes the situation quickly turns bad. You need some safeguard to guarantee that you never have two active leaders at the same time (one common safeguard is sketched after this list).
  • How do you decide that the leader is down? This is a genuinely hard problem, because production networks are sometimes quite unstable: not receiving a heartbeat for 10 seconds may just be a network blip. If you treat 10 s as an unacceptable threshold, you may end up switching leaders constantly. If you set the timeout too long, then when something really does go wrong the system is unusable for a long time. This is a trade-off that sometimes has to be settled pragmatically.
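
For the split-brain bullet above, one widely used safeguard is an epoch (or fencing) number: each newly elected leader gets a strictly larger epoch, and replicas reject writes tagged with an older one. A minimal sketch:

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.highest_epoch_seen = 0

    def write(self, epoch, key, value):
        # A deposed leader still writing under an old epoch is rejected,
        # so two competing leaders can never both have their writes accepted.
        if epoch < self.highest_epoch_seen:
            raise RuntimeError(f"rejected write from stale leader (epoch {epoch})")
        self.highest_epoch_seen = epoch
        self.data[key] = value


replica = Replica()
replica.write(2, "x", "from new leader")       # epoch 2: accepted
try:
    replica.write(1, "x", "from old leader")   # epoch 1: rejected
except RuntimeError as err:
    print(err)
```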

Implementing the Replication Log

We keep saying that followers use the leader’s log to catch up. How is this log generated? A simple implementation is to record every statement the database executes, so that followers can replay the same inserts, updates, and so on; this is known as statement-based replication. It sounds fine at first glance, but a little thought reveals some problems:

  • Any call to a nondeterministic function, such as NOW() or RAND(), is problematic. If you just ship the statement, the results will differ between the two machines.
  • If the statement depends on existing data, such as UPDATE … WHERE, then all statements must be executed in exactly the same order on every replica, with no deviation.
  • Statements with side effects (such as triggers or stored procedures) may produce different results on different replicas.

Of course, these problems can be worked around, for example by having the leader replace calls like RAND() with the actual values it evaluated before shipping the statement.
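
For instance, before shipping a statement the leader can substitute the concrete values it actually used. The toy string rewriting below only shows the idea; real databases do this inside the SQL layer:

```python
import random
from datetime import datetime, timezone


def make_deterministic(statement):
    # Replace nondeterministic calls with the values the leader evaluated,
    # so every follower replays exactly the same statement.
    statement = statement.replace(
        "NOW()", "'" + datetime.now(timezone.utc).isoformat() + "'"
    )
    statement = statement.replace("RAND()", repr(random.random()))
    return statement


print(make_deterministic(
    "UPDATE users SET last_seen = NOW(), nonce = RAND() WHERE id = 1234"
))
# -> UPDATE users SET last_seen = '2024-…', nonce = 0.7391… WHERE id = 1234
```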

Write-Ahead Log (WAL) Shipping

As we discussed earlier, when a database writes data to disk it usually also maintains a write-ahead log, so that it can recover if a disk write goes wrong. We can also send this same log to the followers, so that each follower can catch up by replaying it.
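
A minimal sketch of the idea: the leader appends every change to an append-only log and streams the very same bytes to followers. The length-prefixed record format is an assumption made for this example:

```python
import struct


class WriteAheadLog:
    def __init__(self):
        self.buf = bytearray()

    def append(self, record: bytes):
        # Length-prefixed, append-only. The same bytes that enable crash
        # recovery can be streamed to followers unchanged.
        self.buf += struct.pack(">I", len(record)) + record

    def records(self):
        offset = 0
        while offset < len(self.buf):
            (length,) = struct.unpack_from(">I", self.buf, offset)
            yield bytes(self.buf[offset + 4 : offset + 4 + length])
            offset += 4 + length


wal = WriteAheadLog()
wal.append(b"set user:1234 picture=picture_v2.png")

# A follower replays exactly the bytes the leader wrote:
for record in wal.records():
    print(record)
```

One caveat worth noting: the WAL describes changes at a very low level, which couples the leader and followers to the same storage engine format; this is part of the motivation for the logical logs described next.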

Logical Log Replication

Another common type of log is the logical log, which is decoupled from the storage engine’s internal log format. The basic idea is as follows:

  • If a row is inserted, the log contains the values of all columns of the inserted row.
  • If a row is deleted, the log contains enough information to uniquely identify that row.
  • If a row is updated, the log contains enough information to uniquely identify the row, plus the new values of the columns (or at least the columns that were modified). An illustrative record format follows this list.
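
For concreteness, here is what the three cases above might look like as logical log records; the JSON-like shape is illustrative, not any particular database’s wire format:

```python
insert_record = {
    "op": "insert",
    "table": "users",
    "columns": {"id": 1234, "name": "alice", "picture": "picture_v2.png"},
}

delete_record = {
    "op": "delete",
    "table": "users",
    "key": {"id": 1234},                       # enough to uniquely identify the row
}

update_record = {
    "op": "update",
    "table": "users",
    "key": {"id": 1234},                       # identifies the row...
    "changes": {"picture": "picture_v2.png"},  # ...plus only the modified columns
}
```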

Conclusion

This article summarized how the leader-follower structure works, how failures are handled when they occur, and several common ways to implement the replication log.