TiDB 5.0 was officially released last week, and one of the most exciting parts of this major update is Joint Consensus support in the TiDB cluster's cross-datacenter deployment capabilities. This feature allows TiDB 5.0 to fully tolerate the unavailability of a minority of AZs during cross-AZ scheduling. This article reviews the history of member changes in TiDB, introduces the design of the new feature, and finally discusses the problems we ran into during implementation and how we solved them.

Member changes

As the storage layer of TiDB, TiKV is responsible for data management and read/write operations. TiKV splits data into fragments of roughly equal size, and each fragment has multiple replicas stored in different Availability Zones (AZs). The Raft algorithm is used to provide strongly consistent reads and writes. When balancing is required, PD, TiDB's metadata management component, selects the fragments that need to be adjusted and issues commands to TiKV to relocate their replicas. Since the Raft algorithm was designed with online member changes in mind, moving a replica maps naturally onto the member change algorithm.
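As a rough mental model, the metadata involved looks like the Rust sketch below. It is deliberately simplified and is not the real TiKV or PD data structure; the key range, store IDs, and AZ names are made up for illustration.

```rust
// A highly simplified model of a data fragment and its replicas spread
// across AZs. Not the real TiKV/PD metadata; names and IDs are made up.

struct Peer {
    store_id: u64,    // which TiKV instance holds this replica
    az: &'static str, // which AZ that instance lives in
}

struct Fragment {
    start_key: Vec<u8>,
    end_key: Vec<u8>,
    peers: Vec<Peer>,
}

fn main() {
    let fragment = Fragment {
        start_key: b"a".to_vec(),
        end_key: b"b".to_vec(),
        peers: vec![
            Peer { store_id: 1, az: "az-0" },
            Peer { store_id: 2, az: "az-1" },
            Peer { store_id: 4, az: "az-2" },
        ],
    };

    // PD inspects metadata like this and, when balancing is needed, issues
    // a member change command to the Raft group to relocate one replica.
    let azs: std::collections::HashSet<_> =
        fragment.peers.iter().map(|p| p.az).collect();
    println!(
        "fragment [{:?}, {:?}) on stores {:?} spans {} AZs",
        fragment.start_key,
        fragment.end_key,
        fragment.peers.iter().map(|p| p.store_id).collect::<Vec<_>>(),
        azs.len()
    );
}
```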

The Raft implementation used by TiKV originated from Etcd's Raft. Etcd does not implement the full Joint Consensus algorithm, but a special single-step change (similar to, but not identical to, the single-step change described in Diego Ongaro's doctoral thesis). As a result, replica relocation has to be completed in multiple steps.

For example, when PD decides to move a replica from TiKV 2 to TiKV 3, it first adds a new replica on TiKV 3 via AddNode, and then uses RemoveNode to delete the replica on TiKV 2 to complete the change. It is theoretically possible to RemoveNode first and then AddNode, but that ordering produces an intermediate state with only 2 replicas, which cannot tolerate any node failure and is therefore dangerous.
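To make the trade-off concrete, here is a minimal Rust sketch (not TiKV or PD code) comparing the fault tolerance of the intermediate states produced by the two orderings; the store numbers are made up for illustration.

```rust
// How many replicas are needed to commit (a simple majority).
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Number of simultaneous node failures a member list can survive.
fn fault_tolerance(voters: &[u64]) -> usize {
    voters.len() - quorum(voters.len())
}

fn main() {
    // The Raft group starts with replicas on stores 1, 2 and 4.
    assert_eq!(fault_tolerance(&[1, 2, 4]), 1);

    // AddNode(3) first: the 4-replica intermediate state still tolerates
    // one failure (quorum = 3 of 4).
    assert_eq!(fault_tolerance(&[1, 2, 3, 4]), 1);

    // RemoveNode(2) first: the 2-replica intermediate state tolerates no
    // failure at all (quorum = 2 of 2), which is why this ordering is avoided.
    assert_eq!(fault_tolerance(&[1, 4]), 0);
}
```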

Adding before removing, while producing at most a 4-replica intermediate state that can tolerate a single-node outage, is still not 100% available. Once cross-AZ scenarios are considered, PD may need to relocate a replica to another TiKV in the same AZ. In the figure above, if AZ 2 becomes unavailable while the Raft group is in the 4-replica state, only the two replicas in AZ 0 and AZ 1 are left; they cannot form a quorum, so the whole Raft group becomes unavailable. In pre-5.0 implementations, we introduced the Learner role: before entering the 4-voter state, the new replica is added to the Raft group as a learner with the AddLearner command, and only after it has caught up on enough data do we perform the add-and-remove steps. This greatly shortens the time window in which 4 voters exist (down to milliseconds when everything goes well), but it is still not 100% available.
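The quorum arithmetic behind this can be sketched as follows. This is an illustration only; it assumes the 4-voter placement described above, with both the old and the new replica sitting in AZ 2.

```rust
// Why the 4-voter intermediate state is not 100% available across AZs.
// The placement (two voters in AZ 2) mirrors the scenario in the text.

use std::collections::HashMap;

fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Can the group still commit if every voter in `down_az` is lost?
fn survives_az_outage(voter_az: &HashMap<u64, &str>, down_az: &str) -> bool {
    let total = voter_az.len();
    let alive = voter_az.values().filter(|az| **az != down_az).count();
    alive >= quorum(total)
}

fn main() {
    // 4-voter state while moving a replica within AZ 2.
    let mut voters = HashMap::new();
    voters.insert(1u64, "az-0");
    voters.insert(2u64, "az-1");
    voters.insert(3u64, "az-2"); // old replica
    voters.insert(4u64, "az-2"); // new replica in the same AZ

    // Losing AZ 2 leaves 2 of 4 voters: quorum (3) cannot be formed.
    assert!(!survives_az_outage(&voters, "az-2"));
}
```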

Joint Consensus

The Raft paper already provides a member change algorithm that is 100% available: Joint Consensus. We use C(A, B, C) to denote a Raft group with replicas A, B, and C. To change from C(A, B, C) to C(A, B, D), an intermediate joint state C(A, B, C) & C(A, B, D) is introduced. While the group is in the joint state, a log entry is committed only if it is replicated to a quorum of both member lists. To make the change, the group first switches from C(A, B, C) to C(A, B, C) & C(A, B, D); each node enters the joint state locally as soon as it receives this command. After the command is committed, a second command is proposed to exit the joint state, switching from C(A, B, C) & C(A, B, D) to C(A, B, D). The correctness proof of this algorithm is beyond the scope of this article; see the Raft paper for details.
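The change therefore consists of two ordinary proposals: one to enter the joint configuration and one to leave it. The sketch below models this with a simplified configuration type; it is not the raft-rs data structure, just an illustration of the two transitions.

```rust
// A simplified model of the two proposals that make up a joint change:
// enter the joint state, then leave it once the first proposal commits.

#[derive(Debug, Clone, PartialEq)]
struct ConfState {
    voters: Vec<&'static str>,          // the incoming (new) member list
    voters_outgoing: Vec<&'static str>, // the outgoing (old) list; empty when not joint
}

impl ConfState {
    fn simple(voters: &[&'static str]) -> Self {
        ConfState { voters: voters.to_vec(), voters_outgoing: vec![] }
    }

    /// First proposal: switch to the joint configuration C(old) & C(new).
    fn enter_joint(&self, new_voters: &[&'static str]) -> Self {
        ConfState {
            voters: new_voters.to_vec(),
            voters_outgoing: self.voters.clone(),
        }
    }

    /// Second proposal, made after the first one commits: drop the old list.
    fn leave_joint(&self) -> Self {
        ConfState { voters: self.voters.clone(), voters_outgoing: vec![] }
    }
}

fn main() {
    let start = ConfState::simple(&["A", "B", "C"]);
    let joint = start.enter_joint(&["A", "B", "D"]); // C(A,B,C) & C(A,B,D)
    let end = joint.leave_joint();                   // C(A,B,D)
    assert_eq!(end, ConfState::simple(&["A", "B", "D"]));
}
```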

Since quorum is calculated against two 3-replica member lists, the intermediate joint state, like the multiple single-step changes above, can tolerate any single-node outage. Unlike multiple single-step changes, however, Joint Consensus also achieves 100% availability. For example, in the 4-replica state shown in the figure, losing AZ 2 takes down two nodes, but each of the two 3-replica member lists loses only one of its members, so the joint state remains available.
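A small sketch of the joint commit rule makes the availability argument concrete. It is illustrative only; the assumption that C and D both sit in AZ 2 is taken from the scenario above.

```rust
use std::collections::HashSet;

// Does a single member list have a quorum among the alive nodes?
fn has_quorum(voters: &[&str], alive: &HashSet<&str>) -> bool {
    let acked = voters.iter().filter(|v| alive.contains(*v)).count();
    acked >= voters.len() / 2 + 1
}

/// Joint consensus: both the outgoing and the incoming member list must
/// independently reach a quorum.
fn joint_has_quorum(old: &[&str], new: &[&str], alive: &HashSet<&str>) -> bool {
    has_quorum(old, alive) && has_quorum(new, alive)
}

fn main() {
    let old = ["A", "B", "C"];
    let new = ["A", "B", "D"];

    // C and D live in AZ 2; when it goes down, only A and B are left.
    let alive: HashSet<&str> = ["A", "B"].iter().copied().collect();

    // Each 3-replica list loses only one member, so the joint state can
    // still elect a leader and commit entries.
    assert!(joint_has_quorum(&old, &new, &alive));

    // A flat 4-voter list {A, B, C, D} could not: 2 of 4 < quorum of 3.
    assert!(!has_quorum(&["A", "B", "C", "D"], &alive));
}
```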

Implementation

As mentioned above, Etcd's Raft implementation differs from the single-step change in Diego Ongaro's thesis. Etcd's algorithm was actually implemented before the thesis was published; the main difference is that a member change entry is not applied until it is committed, whereas the thesis applies a change as soon as the entry is received. We had been investigating the feasibility of Joint Consensus since 3.0. Our initial approach was to follow the paper exactly, but that led to too many compatibility issues and adjustments. Around the same time, CockroachDB began adding early Joint Consensus support to Etcd, so we ultimately decided to embrace the community and align with Etcd, optimizing and testing together.

Etcd's Joint Consensus implementation is also not fully consistent with the paper; it keeps the practice mentioned above of applying a member change only after it is committed. The advantage is that member change entries follow the same processing logic as normal log entries, so a unified code path can be used. Because committed entries are never rolled back, there is no need for the special change rollback required by apply-on-receipt, which makes the implementation simpler. However, since only the leader knows the commit status at first, insufficient information synchronization can cause availability bugs in special scenarios. If you are interested, take a look at the issues we submitted to the Etcd project. Here is one simple example. Suppose a joint state C(A, B, C) & C(A, B, D), where A is the leader. If the command to exit the joint state is replicated to A, B, and C, then A can consider the command committed, synchronize the commit index to C, and then crash. C applies the committed entry, concludes that the joint state has been exited, and, since it is not in the new member list, deletes itself. B and D, unaware that the exit command has been committed, still need to collect votes from a quorum of both member lists when an election starts. But A has crashed and C has destroyed itself, so B and D can never obtain a quorum from C(A, B, C), and the group becomes unavailable.

These issues ultimately led us to add two restrictions to the Joint Consensus implementation compared with the original paper:

1. A voter must be demoted to a learner before it can be removed.

2. A commit index synchronization mechanism is enabled during elections.

Revisit the example above with these restrictions in place. With restriction 1, C is demoted to a learner rather than removed outright when the joint state is exited; because a learner is not a voter, C does not delete itself and stays reachable. With restriction 2, when B seeks a vote from C, it is informed of the latest commit index and learns that the joint state has been exited, so it only needs to seek a quorum from C(A, B, D) and finally wins the election.
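The following sketch (illustrative only, not the actual Etcd or raft-rs election logic) replays the crash scenario above and shows how the two restrictions together let B win the election.

```rust
use std::collections::HashSet;

// Does a member list have a quorum among the nodes that granted votes?
fn has_quorum(voters: &[&str], granted: &HashSet<&str>) -> bool {
    let n = voters.iter().filter(|v| granted.contains(*v)).count();
    n >= voters.len() / 2 + 1
}

fn main() {
    let old = ["A", "B", "C"];
    let new = ["A", "B", "D"];

    // Without the restrictions: C applied the committed exit entry and
    // deleted itself, A has crashed, so only B and D can grant votes.
    // B still believes it is in the joint state and needs both quorums.
    let granted: HashSet<&str> = ["B", "D"].iter().copied().collect();
    let joint_ok = has_quorum(&old, &granted) && has_quorum(&new, &granted);
    assert!(!joint_ok); // the group is stuck

    // With the restrictions: C was only demoted to a learner, so it is
    // still around and passes B the latest commit index during the
    // election. B learns the joint state has been exited and needs a
    // quorum only from the new list C(A, B, D), which B and D provide.
    assert!(has_quorum(&new, &granted));
}
```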

Conclusion

In 5.0, we added Joint Consensus support so that cross-AZ scheduling can fully tolerate the unavailability of a minority of AZs. The Raft algorithm itself is clear and simple, but engineering it involves many adjustments and trade-offs. If you are interested in solving similar problems in distributed systems, join us by contributing to TiKV or Raft-RS, or send your resume to [email protected].