Note: this article is original; please contact the author via the WeChat official account or the Juejin app before republishing. You are welcome to follow the WeChat official account "Q's Blog", which regularly publishes practical experience, systematic summaries, source-code walkthroughs, and explanations of technical principles.
This aims to be a clear, systematic, and authoritative article that helps you understand the Raft algorithm in depth, apply it in engineering practice, and uncover the key points that are easily missed or misunderstood.
This article consolidates the Raft theory series. It introduces the ideas of the Raft algorithm based on the Raft paper and, following Raft's modular way of thinking, pulls out the details that are hard to understand or easy to misunderstand. On the algorithm side: the leader election mechanism, the log-based replicated state machine mechanism, and how the state machine is kept safe and correct. On the engineering side: the split-brain-proof strategy for cluster membership changes, the solution to log inflation and fast state-machine recovery, read performance optimizations under linear consistency, and more.
1. Overview
1.1 What is Raft?
Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems.
— In Search of an Understandable Consensus Algorithm
In distributed systems, replicas are commonly used for fault tolerance, to eliminate single points of failure and improve availability. But this raises another problem: how do we guarantee consistency across multiple replicas?
Here we only discuss strong consistency, that is, linear consistency (linearizability). Weak consistency covers a wide range of scenarios and involves many trade-offs, which are beyond the scope of this Raft series.
Strong consistency (linear consistency) does not mean that the state of every node in the cluster must be identical at every moment. Rather, the goal is to make the distributed system look as if it held only one copy of the data, with atomic reads and writes, so that the application layer can ignore the synchronization among the underlying replicas. In other words, we can treat a strongly consistent distributed system as a single whole: once a client writes successfully, all clients must be able to read the value just written; even under a network partition or the failure of a minority of nodes, the cluster still serves requests as if it were a single machine.
Consensus algorithms are what make this possible: they ensure that even if a minority of nodes (≤ (n-1)/2) fail, the system keeps providing service. Consensus algorithms are typically built on the Replicated State Machine model, in which all nodes start from the same state, apply the same sequence of operation logs, and therefore end up in a consistent state.
Figure: Replicated State Machine
Consensus algorithms are the cornerstone of strongly consistent distributed systems. Paxos is the classic representative; Raft is a variant its author proposed during his doctoral studies after researching Paxos. Its main advantages are that it is easy to understand and to implement, and the paper even gives pseudocode for the key parts.
1.2 Who is using Raft
The best-known system that uses Raft is etcd, whose core is precisely an implementation of the Raft algorithm. As a distributed KV store, etcd uses Raft to synchronize data across multiple nodes, each of which holds a full copy of the state machine data. Understanding Raft will help you understand why etcd is not suited to storing large volumes of data, why more nodes is not necessarily better, and why a cluster usually has an odd number of nodes.
As a microservice infrastructure, Consul uses Raft underneath to keep data consistent across the Consul servers. After reading chapter 6, we will understand why Consul offers the default, consistent, and stale consistency modes, which scenarios each is suited to, and how Consul implements these modes underneath by changing the way Raft serves reads.
TiKV also uses the Raft algorithm at its core. Although both call themselves "distributed KV storage", TiKV's usage scenario differs from etcd's: its goal is to support 100TB+ of data, which a single Raft cluster like etcd's could never support. TiKV therefore uses Multi-Raft, splitting the data into many regions; each region is a standard Raft cluster, which gives every data partition highly available, multi-replica storage.
Raft is making quite a splash in industry, and we will not go into all the scenarios it is used in here; if you are interested you can look them up yourself: there are Raft implementations in many languages.
1.3 Basic Concepts
Raft uses the Quorum mechanism to reach consensus and tolerate faults. We call an operation on a Raft cluster a proposal; every proposal must be approved by a majority (> N/2) of nodes before it can be committed.
Here a "proposal" can be narrowly understood as a read or write operation on the cluster, and "committed" as the operation having succeeded.
So what actually happens inside a Raft cluster when we issue a series of reads and writes against it? Let's first get an overall picture and then go through each part in detail.
First, a Raft cluster must have a leader, and every operation we initiate against the cluster as clients must be handled by that leader. So the first part of Raft's core algorithm is leader election: a cluster cannot work without a leader, so elect one first and worry about everything else afterwards.
Second, what work does the leader take on? It receives operation requests from clients, replicates each operation to the other nodes as a log entry, and, after making sure that a majority of nodes have replicated the operation, it can safely respond to the client. This part of the job is called log replication in the Raft core algorithm.
However, precisely because the leader's responsibility is so great, the nodes must be careful about whom they elect: only nodes that satisfy certain conditions may become leader. The leader must also be cautious when handling the operation log: to keep the cluster consistent, it must never overwrite or delete operation logs that a previous leader has already committed. All of this "extra care" amounts to restrictions on election and on log commitment, and in the Raft core algorithm it is called safety.
The Raft core algorithm is made up of these three sub-problems: Leader election, Log replication, and Safety. Together, these three parts implement the consensus and fault tolerance mechanisms of the Raft core.
In addition to the core algorithm, Raft also provides solutions to several problems that must be faced in engineering practice.
The first is the unbounded growth of the log. Raft wraps operations as log entries, and every node in the cluster maintains this ever-growing log sequence, from which alone the state machine can be rebuilt by replay. Since the log keeps growing over time, there must be some way to keep it from consuming disk space without limit and from making log replay take too long. This part is called log compaction.
The second is changing cluster members. A Raft cluster is unlikely to keep exactly the same set of nodes forever; there will always be scaling out and scaling in, or crashed nodes that need replacing. Directly switching the cluster membership can lead to serious split-brain problems, so Raft provides a way to change cluster members safely. This part is called cluster membership change.
In addition, we will discuss the definition of linear consistency, why Raft by itself does not equal linear consistency, how to achieve linear consistency on top of Raft, and how to optimize read performance while preserving linear consistency.
That is an overview of most of what the theory part will cover. By now we should have a rough idea of what Raft is, what its parts are, and how they relate to each other.
Next we will discuss each part of the Raft algorithm in detail. Let's start with the first part: leader election.
2. Leader Election
2.1 What is leader election?
Leader election means choosing one node within a distributed system to take charge of certain work. Once the election has run, every node in the cluster recognizes one specific, unique node as the leader.
If a system we develop needs leader election, we usually build it directly on ZooKeeper or etcd, pushing the complexity of this part into a third-party system. However, Raft, which etcd itself is built on, also has its own notion of "leader election", and the two live at different layers. Leader election built on etcd means using a third-party etcd cluster to make our own cluster agree on who the leader is; technically it relies on etcd's consistent state machine, leases, and the watch mechanism, and the same thing could be done with a single-node MySQL/Redis, only without high availability. Leader election within Raft means that, through voting, heartbeats, and other mechanisms, the nodes of the Raft cluster themselves recognize one node as the leader that coordinates all decisions.
When your system uses etcd to record who the leader is, that decision is itself handled inside etcd by the leader that etcd's own cluster elected, and is then synchronized to the other etcd nodes.
2.2 Why does Raft elect a leader?
According to the paper, native Paxos takes a peer-to-peer approach in which all nodes are equal. In theory the algorithm only needs to reach agreement on a single decision, which makes sense for a simplified model, but few systems in industry use it that way. When a series of decisions has to be made, it is simpler and faster to first elect a leader node and let it coordinate all the decisions.
In addition, compared with other consensus algorithms, Raft gives the leader node stronger leadership, which the paper calls a strong leader. For example, log entries can only flow from the leader to the other nodes, never the other way around; this simplifies the log replication logic and makes Raft easier to understand.
2.3 How Raft elects its leader
2.3.1 Node Roles
Each node in the Raft cluster plays one of three roles:
- Leader: handles all requests. The leader receives operation requests from clients, writes them to its local log, and replicates them to the other nodes in the cluster.
- Follower: passively accepts updates. A follower receives log replication requests from the leader and writes them to its local log. If a client sends an operation request to a follower, the follower redirects it to the leader.
- Candidate: if a follower receives no heartbeat from the leader for a certain period, the leader may have failed. The follower then starts the leader election process and switches to the candidate role until the election completes.
2.3.2 Term
Each time a new election starts, a new term begins; each term is identified by a strictly monotonically increasing integer.
The term is incremented every time a candidate starts a leader election, and if a candidate wins, it serves as leader for that term. Not every term necessarily has a leader, though: sometimes no leader can be elected within a term because the election times out, in which case the candidate increments the term number and starts a new election.
A term works like a logical clock that lets nodes discover stale state. Every node keeps a current term value and attaches it to every message it sends.
Nodes communicate with each other through RPC. There are two main types of RPC requests:
- RequestVote RPCs: used by candidates to solicit votes during elections.
- AppendEntries RPCs: used by the leader to replicate log entries to other nodes, and also sent as heartbeats.
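As a rough illustration, here is a minimal Go sketch of what these two request payloads carry, following the field list in the Raft paper; the names are illustrative and do not correspond to any particular library:

```go
// Sketch of the two Raft RPC request payloads, following the paper's field list.
// Field names are illustrative; real implementations (e.g. etcd/raft) differ.

// LogEntry is one slot in the replicated log.
type LogEntry struct {
	Term    int    // term in which the leader received the entry
	Command []byte // opaque state-machine command
}

// RequestVoteArgs is sent by a candidate to solicit votes.
type RequestVoteArgs struct {
	Term         int // candidate's current term
	CandidateID  int // which node is asking for the vote
	LastLogIndex int // index of the candidate's last log entry (used by the election restriction, see chapter 4)
	LastLogTerm  int // term of the candidate's last log entry
}

// AppendEntriesArgs is sent by the leader to replicate entries; with an empty
// Entries slice it doubles as the heartbeat.
type AppendEntriesArgs struct {
	Term         int        // leader's current term
	LeaderID     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of that preceding entry (used by the consistency check, see chapter 3)
	Entries      []LogEntry // entries to store (empty for heartbeat)
	LeaderCommit int        // leader's commit index
}
```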
2.3.3 Node State Transitions
We know that every node in the cluster can only be in one of the three states: leader, follower, or candidate. So when is a node in which state? The figure below shows all the possible state transitions of a node:
Let's discuss in detail the scenarios in which each transition takes place.
2.3.3.1 Follower state transitions
Raft's leader election is driven by a heartbeat mechanism. Every node in the cluster starts up as a follower (Step: starts up), and the leader periodically sends heartbeats to all nodes to maintain its authority. If a follower receives no heartbeat for a while, in other words its election timer expires, it subjectively concludes that the system currently has no usable leader and starts a new election (Step: times out, starts election).
The question is how this "election timeout" should be set. If all nodes started at the same moment and timed out after the same interval, the whole cluster would become inefficient, and in the extreme case no leader could ever be elected. Raft cleverly uses a randomized timer, so that each node's timeout is generated randomly within a certain range, which greatly reduces the chance of multiple nodes starting elections at the same time.
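A minimal Go sketch of such a randomized timer is shown below; the 150–300 ms range is the one suggested in the paper, and a real system would tune it to its own network latency and heartbeat interval:

```go
import (
	"math/rand"
	"time"
)

// resetElectionTimer returns a timer with a fresh, randomized election timeout.
// Randomization is what keeps all followers from starting elections in lockstep.
func resetElectionTimer() *time.Timer {
	base := 150 * time.Millisecond
	jitter := time.Duration(rand.Int63n(int64(150 * time.Millisecond)))
	return time.NewTimer(base + jitter)
}
```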
Figure: Initial state of a five-node Raft cluster, all nodes are followers, term is 1, and each node has a different election timeout timer
When a follower wants to start an election, it first increments its current term and switches to the candidate role. It then sends a "please vote for me" message (RequestVote RPC) to the other nodes in the cluster.
Figure: S1 times out first, becomes a candidate with term + 1, and sends vote requests to the other nodes
2.3.3.2 Candidate state transitions
After a follower has switched to candidate and sent the "please vote for me" message to the other nodes in the cluster, there are three possible outcomes, corresponding to the three arrows leaving the candidate state in the state diagram above.
1) Wins the election (Step: receives votes from majority of servers)
When a candidate receives votes for the same term from a majority (N/2 + 1) of the nodes in the cluster, it wins the election, immediately switches its role to leader, and starts sending heartbeats to the other nodes to maintain its authority.
Figure: “most” nodes give S1 votes
Figure: S1 becomes the leader and starts sending heartbeat to maintain authority
Each node may cast at most one vote per term, on a first-come, first-served basis. This rule ensures that at most one candidate can become leader in a given term.
2) Loses the election (Step: discovers current leader or new term)
While waiting for vote replies, a candidate may suddenly receive a heartbeat from another node claiming to be the leader. If the term carried in that heartbeat is no smaller than the candidate's current term, the candidate acknowledges that leader and switches back to the follower role: another node has already won the election, and we just need to follow it immediately. If the term in the heartbeat is smaller than its own, however, the candidate rejects the request and stays in the candidate state.
Figure: S4 and S2 start election successively
Figure: S4 becomes the leader; when S2 receives S4's heartbeat and sees that its term is not smaller than its own current term, it immediately switches to being a follower of S4
3) Election times out (Step: times out, new election)
The third possible outcome is that the candidate neither wins nor loses. If several followers become candidates at the same time, the votes may be split; if no candidate gains the support of a majority of nodes, every candidate's election will time out. Each candidate then increments its term and starts a new election. Without special handling, the votes could keep being split forever and no leader would ever be elected. The "special handling" is exactly the randomized election timeout described above.
Figure: S1-S5 are all participating in the election
Figure: No node is willing to vote for others
Figure: If there is no randomized timeout, all nodes will continue to initiate elections at the same time…
These are the three possible election outcomes for candidate.
2.3.3.3 Leader state transitions
The last edge in the node state diagram is: discovers server with higher term. Imagine this scenario: the leader node crashes or is cut off from the network; the other followers stop receiving its heartbeats, and because their random timeouts differ, the first node whose timer fires becomes a candidate and starts soliciting votes. Since this candidate's term is larger than the old leader's term, all the followers vote for it and it becomes the new leader. Some time later the old leader recovers, receives a heartbeat from the new leader, and finds that the heartbeat's term is larger than its own; it then immediately switches to being a follower of the new leader.
Animation simulation of the above process is as follows:
Figure: S4 is the leader of term2
Figure: S4 goes down, and S5 is about to time out first
Figure: S5 is elected leader of term3
Figure: after recovering, S4 receives a term3 heartbeat from S5
Figure: S4 immediately switches to being a follower of S5
That is the main logic of Raft's leader election, but some details (such as the extra conditions a node checks before voting for a candidate) depend on other parts of the algorithm, which we describe in the later chapter on safety.
Once the leader has been elected, it must also shoulder the corresponding responsibilities. What exactly are they? That is the subject of the next chapter, "log replication".
3. Log replication
3.1 What is Log Replication
As mentioned earlier, consensus algorithms are generally built on the replicated state machine model, in which all nodes start from the same state and, after applying the same sequence of log operations, end up in a consistent state. In other words, as long as the logs of all nodes in the cluster are consistent, the state machines obtained by applying them will also be consistent.
Raft is responsible for ensuring log consistency across all nodes in the cluster.
We also mentioned that Raft gives the leader node stronger leadership. With that in mind, it is easy to see how Raft keeps the logs consistent: all logs must go through the leader, which replicates them to the other nodes.
This process is called Log replication.
3.2 Raft's Log Replication Mechanism in Detail
3.2.1 Overall process analysis
Once the leader has been elected, it takes on the responsibility of leading the cluster: it receives client requests, wraps the operations into log entries, and replicates them to the other nodes.
The overall process is as follows:
- The leader serves the clients; each client request contains an instruction to be executed by the replicated state machine.
- The leader appends this instruction to its own log as a new entry, and then sends AppendEntries RPCs to the other nodes so that they append the entry to their local logs.
- When the entry has been safely replicated, that is, replicated by a majority (N/2 + 1) of the nodes, the leader applies it to its local state machine and returns the result of the operation to the client.
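How does the leader know that "a majority has replicated it"? A common way is for the leader to track, per follower, the highest log index known to be replicated there (matchIndex in the paper) and take the median. A minimal Go sketch, using 0-based indices and illustrative names:

```go
import "sort"

// highestMajorityIndex returns the largest log index known to be stored on a
// majority of the cluster. matchIndex[i] is the highest index replicated on
// peer i, and the leader's own last index is counted as well. The leader may
// advance its commit index up to this value, subject to the extra
// "current term only" restriction discussed in chapter 4.
func highestMajorityIndex(leaderLastIndex int, matchIndex []int) int {
	all := append([]int{leaderLastIndex}, matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(all)))
	// The value at position len(all)/2 is held by at least a majority of nodes.
	return all[len(all)/2]
}
```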
The log model of the whole cluster can be pictured macroscopically as in the figure below (x ← 3 means x is assigned the value 3):
Besides the state-machine instruction, each log entry also has a unique, monotonically increasing integer index (log index) that marks its position in the log. Each entry also records a term number (the number at the top of each entry box; entries with the same term share a color), which is the leader's term at the moment it received the instruction. All entries with the same term were sent by the same leader during its term in office.
When the leader considers a log entry safe to apply to the state machine, the entry is said to be committed (the committed entries in the figure above). Which entries can be committed? The answer: an entry is committed once the leader learns that it has been successfully replicated on more than half of the nodes in the cluster. So in the figure above, the entry (term3, index7) and everything before it are committed, even though two nodes still have incomplete logs.
Raft ensures that all committed logs are persisted and “eventually” must be applied by the state machine.
Note: the word "eventually" here is subtle, and it points to one fact: Raft only guarantees that the logs are consistent within the cluster; the state-machine consistency we really want, as seen from outside the cluster, requires some extra work, which is highlighted in the chapter on linear consistency and read performance optimization.
3.2.2 Log Replication Process Diagram
We used Raft animation to simulate regular log copying:
As shown in the figure above, S1 is elected leader, and there are no logs yet. We simulate a client making a request to S1.
S1 receives the client request and adds a new log (term2, index1), then initiates AppendEntries RPC to other nodes in parallel.
S2 and S4 receive the request first, attach the log, and respond to S1.
At this point the entry has been appended on all nodes, but since the leader has not yet received any responses, it still does not know whether the entry has been safely replicated.
When S1 receives responses from two nodes, the entry's border turns solid, indicating that it has been safely replicated: in a five-node cluster, the two followers plus the leader itself already make a majority. S1 now responds to the client's request.
The leader then includes the currently committed log index (term2, index1) in the heartbeats it sends to the followers.
From the heartbeat, all followers learn that (term2, index1) has been committed, so the border of this entry becomes solid on every node.
3.2.3 Log Consistency Guarantees
Earlier we used (term2, index1) to identify a log entry. Why include the term and not just the index? The reason is that the term lets nodes detect inconsistencies between their logs, which will be easier to understand after the next section.
Raft guarantees that if two log entries in different node log collections have the same term and index, they must store the same instructions.
Why can Raft give this guarantee? Because Raft requires that, within one term, the leader create at most one entry at a given index and never modify it afterwards.
Raft also guarantees that if two entries in the logs of different nodes have the same term and index, then all the entries before them are identical as well.
This is because every AppendEntries RPC the leader sends also carries the (term, index) of the entry immediately preceding the new ones, and the follower rejects the new entries if it cannot find an entry with that same (term, index) in its own log.
So as long as a follower keeps receiving logs from the leader normally, the guarantees above can be verified by induction.
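To make the consistency check concrete, here is a minimal Go sketch of the follower-side check, reusing the LogEntry type from the earlier sketch and the paper's 1-based log indexing (prevLogIndex 0 means "appending from the very start of the log"); names are illustrative:

```go
// consistencyCheckOK reports whether the follower's log contains an entry at
// prevLogIndex whose term equals prevLogTerm. If not, the follower rejects the
// AppendEntries RPC and the leader will retry with an earlier prevLogIndex.
func consistencyCheckOK(followerLog []LogEntry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true // nothing precedes the entries being appended
	}
	if prevLogIndex > len(followerLog) {
		return false // the follower's log is too short to contain that entry
	}
	return followerLog[prevLogIndex-1].Term == prevLogTerm
}
```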
3.2.4 Possible Log Inconsistency Scenarios
When all nodes are healthy, the leader's and followers' logs always stay consistent and AppendEntries RPCs never fail. But there is always the risk that any node may crash at any moment, and keeping the cluster's logs consistent in that situation is the problem we really have to solve.
The figure above shows how messy the logs of a cluster can be at the moment the leader of term8 takes office. For example, a follower may be missing some entries (a, b), have extra uncommitted entries (c, d), or both at once (e, f).
Note: it is impossible for a follower to have more committed entries than the leader. This is guaranteed by the election restriction, described in the next chapter on safety.
Let's first try to reproduce scenarios a to f above, and then see how Raft resolves these inconsistencies.
Scenarios a and b: the follower's log lags behind the leader's
These scenarios are simply followers that were down for a while: follower a crashed after receiving (term6, index9), and follower b crashed after receiving (term4, index4). There is nothing more to add here.
Scenario c: the follower has more entries than the leader
While the leader of term6 was replicating (term6, index11) to the followers, it crashed, and only follower c had received that AppendEntries RPC. Then came a series of elections: term7 may have timed out, or its leader may have crashed right after taking office. Finally the leader of term8 took office, which is the scenario c we see.
Scenario d: the follower has extra term7 entries that the leader lacks
The leader of term6 crashed right after successfully committing (term6, index10). The leader of term7 then took office and replicated two entries to this follower, but crashed before it had a chance to commit them. The cluster then elected the leader of term8.
Scenario e: the follower is missing the term5–6 entries and has extra term4 entries compared with the leader
The leader of term4 replicated (term4, index7) to this follower and successfully committed (term4, index5) and the entries before it, then crashed; follower e crashed soon afterwards. It therefore missed all the log replication that happened during term5–7. By the time follower e recovered, the leader of term8 was already in office.
Scenario f: the follower is missing the term4–6 entries and has extra term2–3 entries compared with the leader
The leader of term2 replicated some entries (index4–6) to the followers but crashed before it could commit them; it recovered quickly and was elected leader of term3. It went on to replicate more entries (index7–11) to the followers, but again crashed before committing. Follower f then crashed as well, and by the time it woke up the cluster had already advanced to term8.
3.2.5 How to Handle Log Inconsistency
From the scenarios above you can see that real-world clusters can get very messy. So how does Raft handle so many kinds of inconsistency? The answer is simple and rather brute-force; think back to the phrase strong leader.
Raft forces followers to mirror the leader's log in order to resolve inconsistencies.
That is, any entries on a follower that conflict with the leader's log are overwritten by the leader's entries. Thanks to the election restriction this causes no problems: if a follower's entry conflicts with the leader's, that entry must still be uncommitted on the follower, and uncommitted entries have neither been applied to the state machine nor been observed by any external client.
To make a follower's log identical to its own, the leader must first find the latest point at which the two logs agree, because once that entry agrees, all the entries before it must agree as well (recall the guarantee above). This confirmation happens in the consistency-check step of the AppendEntries RPC.
The leader maintains a nextIndex for every follower, which is the index of the next log entry it will send to that follower. When a leader first takes office, it initializes every nextIndex to its own last log index + 1. If a follower's log is inconsistent with the leader's, the consistency check in the next AppendEntries RPC fails; after the follower rejects the RPC, the leader decrements that follower's nextIndex and tries again.
Eventually some nextIndex value is reached at which the leader's and the follower's logs agree up to that point. In the extreme case nextIndex is 1, meaning the follower has no entry in common with the leader and the leader must replicate its log from the very first entry.
Once nextIndex is determined for a follower, the leader starts sending entries from that index onward; the follower deletes its conflicting entries and keeps the latest ones replicated from the leader.
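Putting this nextIndex bookkeeping together, a simplified leader-side retry loop might look like the sketch below. The Leader type and the sendAppendEntries hook are hypothetical stand-ins for a real implementation, which would batch and pipeline these RPCs and often lets the follower hint at a conflicting term/index so the leader can skip the one-entry-at-a-time backoff:

```go
// Leader is a minimal stand-in for the leader's replication state; the log is
// a plain 0-based slice here for simplicity, and sendAppendEntries is a
// hypothetical hook that performs the RPC and reports whether the follower
// accepted it (i.e. whether the consistency check passed).
type Leader struct {
	log               []LogEntry
	nextIndex         map[int]int
	sendAppendEntries func(peer, prevLogIndex, prevLogTerm int, entries []LogEntry) bool
}

// syncFollower backs nextIndex off until the follower accepts an AppendEntries
// RPC, i.e. until the last point of agreement is found; the accepted RPC then
// overwrites the follower's conflicting suffix in one shot.
func (l *Leader) syncFollower(peer int) {
	for {
		next := l.nextIndex[peer]
		prevLogIndex, prevLogTerm := next-1, -1 // -1 means "nothing precedes the entries"
		if prevLogIndex >= 0 {
			prevLogTerm = l.log[prevLogIndex].Term
		}
		if l.sendAppendEntries(peer, prevLogIndex, prevLogTerm, l.log[next:]) {
			l.nextIndex[peer] = len(l.log) // the follower now matches the leader
			return
		}
		l.nextIndex[peer]-- // consistency check failed: step back one entry and retry
	}
}
```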
Under this simple mechanism the logs of the whole cluster automatically converge. Note also that the leader never overwrites or deletes its own entries; it only forces the followers to catch up with it.
This of course requires the leader elected by the cluster to have "correct enough" logs in the first place, which is exactly the election restriction mentioned above.
We’ll discuss this in detail in the next chapter.
4. Safety and Correctness
In the previous chapters we explained how the Raft algorithm elects a leader and replicates logs, but the mechanisms described so far do not yet guarantee that every node's state machine applies the logs in exactly the same order. Imagine the following scenario:
- The leader replicates some log entries to a majority of nodes and crashes after committing them.
- A follower that never received those entries takes part in the election and is elected the next leader.
- The new leader synchronizes its own log and commits entries that overwrite the previously committed entries on the other nodes.
- The state machines of different nodes may then apply different log sequences, and the cluster becomes inconsistent.
We therefore need to add a few extra restrictions on top of "leader election + log replication" to keep the state machines safe; these restrictions make up the safety part of the Raft algorithm.
4.1 Restrictions on elections
Looking back at the scenario where committed entries get overwritten, the root of the problem lies in step 2: a candidate must be qualified before it may become the cluster's leader, otherwise it will cause unexpected errors in the cluster. Whether a candidate is qualified can be checked by adding a small condition to the election:
Every candidate carries the latest (term, index) of its local log in its RequestVote RPC, and a follower refuses to vote for the candidate if it finds that the candidate's log is not at least as up-to-date as its own.
To win the election, a candidate must receive votes from a majority of the cluster, which means its log must be at least as up-to-date as that of a majority of nodes. And because an entry can only be committed after it has been replicated to a majority of nodes, the candidate that wins the election is guaranteed to hold all committed entries.
This is also why we concluded earlier that a follower can never have more committed entries than the leader.
The logic for comparing two (term, index) pairs is very simple: if the terms differ, the log whose last term is larger is more up-to-date; otherwise, the log whose last index is larger is more up-to-date.
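Expressed as code, the comparison looks roughly like the minimal Go sketch below (names are illustrative):

```go
// candidateLogIsUpToDate implements the election restriction: a voter grants
// its vote only if the candidate's last log entry is at least as up-to-date
// as the voter's own last entry.
func candidateLogIsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm // a larger last term wins
	}
	return candLastIndex >= myLastIndex // same last term: the longer log wins
}
```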
4.2 Restrictions on Commitment
Besides the small restriction on elections, we also need a small restriction on commit behavior to complete the last piece of the puzzle at the heart of the Raft algorithm.
Remember what commit is:
When the leader knows that a log has been successfully replicated by more than half of the nodes in the cluster, he can commit the log. The committed log must be applied by the state machine eventually.
Commit simply marks the log to indicate that it can be applied to the state machine and respond to the corresponding client requests.
However, the leader must not commit entries from an older term at just any time, even when they are already replicated on a majority of nodes. Raft's paper gives a classic scenario:
The figure above simulates the problem scenario chronologically from left to right.
Phase a: S1 is the leader. After receiving a request, it replicates (term2, index2) only to S2, and not yet to S3–S5.
Phase b: S1 crashes, and S5 is elected leader of term3 (with votes from S3, S4, and S5). After receiving a request, it writes (term3, index2) locally but has not replicated it to any other node.
Phase c: S5 crashes and S1 recovers; S1 is re-elected leader for term4 and continues replicating (term2, index2) to S3. The entry is now on a majority of nodes, so suppose we commit it.
Phase d: S1 crashes again and S5 recovers; S5 is elected leader once more (with votes from S2, S3, and S4), replicates (term3, index2) to all nodes, and commits it. A fatal error has now occurred: the already committed (term2, index2) has been overwritten by (term3, index2).
To avoid this error, we need to add one more restriction:
The leader is only allowed to commit entries from its current term.
For the scenario above, the problem arises in phase c: even though S1, the leader of term4, has replicated (term2, index2) to a majority of nodes, it must not commit it directly. Instead it has to wait until an entry from term4 has arrived and been successfully replicated, and only then commit, which commits the earlier entries along with it.
Phase e: with this restriction in place, either (term2, index2) is never committed, in which case it is safe for S5 to overwrite it in phase d; or (term2, index2) is committed together with (term4, index3), in which case S5 cannot become leader at all, because a majority of nodes have logs newer than its own, and the earlier problem disappears.
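As a sketch of how this shows up in code, a leader might advance its commit index only when the majority-replicated entry belongs to its own term, as below (reusing LogEntry and highestMajorityIndex from the earlier sketches; 0-based indices, illustrative only):

```go
// maybeAdvanceCommit applies the commit restriction: the leader only counts
// replicas in order to commit an entry from its *current* term; entries from
// earlier terms become committed indirectly once a later current-term entry
// is committed.
func maybeAdvanceCommit(log []LogEntry, currentTerm, commitIndex int, matchIndex []int) int {
	candidate := highestMajorityIndex(len(log)-1, matchIndex)
	if candidate > commitIndex && log[candidate].Term == currentTerm {
		return candidate // safe: the majority-replicated entry is from our own term
	}
	return commitIndex // otherwise leave the commit index unchanged for now
}
```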
These are the two small restrictions added to the algorithm that are critical to ensuring the security of the state machine.
At this point we have covered the core of the Raft algorithm. In the next chapter we will look at two auxiliary techniques that are also described in the paper, cluster membership change and log compaction, both of which are essential in Raft engineering practice.
5. Cluster Membership Change and Log Compaction
Although the previous chapters covered the core of the Raft algorithm, engineering practice still faces some problems that pure algorithm theory does not address. Thoughtfully, the Raft paper also provides solutions to two of the most common ones:
- Cluster membership change: how to safely change the cluster's node membership.
- Log compaction: how to keep the log from growing without bound.
In this article, we’ll look at both techniques separately.
5.1 Changing Cluster Members
In the theory above we assumed that the cluster membership stays fixed, but in practice we sometimes need to replace crashed machines or change the replication level (that is, add or remove nodes). The simplest, brute-force way is to stop the cluster, change the members, and restart the cluster; but the whole cluster is unavailable during the change, and manual operations carry extra risk.
To avoid these problems, Raft offers a way to change cluster members automatically and without downtime. In essence it reuses Raft's core algorithm, propagating the cluster member configuration from the leader to the other nodes as a special kind of log entry.
5.1.1 Directly Switching Cluster Member Configurations
Conclusion: any scheme that switches the cluster directly from the old configuration to the new configuration in one step is unsafe.
So we cannot take it for granted that the new configuration can simply be synchronized to the cluster as a log entry and applied. Because it is impossible for all nodes to switch their member configuration atomically "at the same instant", different nodes may see different views of the cluster during the switch, which can leave the cluster with multiple leaders.
To understand this conclusion, let’s take a look at an actual problem scenario, illustrated below.
Figure 5-1
Phase a. The cluster has three nodes, S1–S3, and the member configuration is C-old; green indicates that a node's current view (member configuration) is C-old.
Phase b. Two new nodes, S4 and S5, join the cluster, and the change is written to the leader. We call the new five-node configuration (S1–S5) C-new; blue indicates that a node's current view is C-new.
Phase c. Suppose S3 goes down briefly, which causes S1 and S5 to hit their election timeouts and start elections.
Phase d. S1 asks S2 and S3 for votes, while S5 asks the other four nodes. Since S2's log is not newer than S1's, S2 may vote for S1, and because S1 believes the cluster has only three nodes, two votes are enough for it to declare itself leader. S5 is certain to get the votes of S3 and S4: S1 is unaware of S4 and never sends it a RequestVote RPC, and S1's log lags behind S3's, so S3 will definitely not vote for S1; with three votes S5 also declares itself leader. The cluster now ends up with multiple leader nodes, the fatal error known as split brain.
Figure 5-2
The figure above is taken from the paper and presents the same problem as figure 5-1 in a different form; the colors mean the same as in figure 5-1. At the moment marked "problem: two disjoint majorities", the cluster may end up with two leaders.
However, not every case in which nodes of the old and new configurations start elections at the same time produces multiple leaders, and some articles in the community get their multi-leader examples wrong. Here is one such example (it comes from an article I learned a great deal from while studying Raft; the article is very good and well worth reading):
Source: zhuanlan.zhihu.com/p/27207160
Figure 5-3
This hypothetical scenario is similar to stage D in Figure 5-1, and the simulation process is as follows:
- S1 is the cluster's original leader; S4 and S5 join the cluster, and the new configuration has been pushed to S3, but S2 has not received it yet.
- S1 goes down temporarily, and S2 and S3 each start an election.
- In the end, S2 gets S1's vote plus its own, S3 gets S4's and S5's votes plus its own, and two leaders appear in the cluster.
The process in figure 5-3 looks little different from phase d of figure 5-1, only with different nodes starting the election, yet in fact the situation in figure 5-3 cannot happen.
Note: in the Raft paper, cluster-change information is also propagated by appending log entries, so it too is subject to the election restriction. Many readers wonder whether the logs compared under the election restriction have to be committed. Look back at the description in the safety chapter:
Every candidate carries the latest (term, index) of its local log in its RequestVote RPC, and a follower refuses to vote for the candidate if it finds that the candidate's log is not at least as up-to-date as its own.
Note that what is compared here is simply the latest (term, index) of each node's local log; it does not matter whether that entry has been committed.
Back to figure 5-3: the reason this situation is impossible is that S1, as the original leader, was the first to persist the new-configuration log entry, while S2 has not received it yet. By the election restriction described in the safety chapter, S1 will not vote for S2, so S2 cannot become leader.
5.1.2 Two-Phase Cluster Membership Change
To avoid the problems described in the previous section, Raft switches the cluster member configuration smoothly using a two-phase approach, as follows:
Phase one
- The client sends C-new to the leader. The leader takes the union of C-old and C-new, written C-old,new, and applies it immediately.
- The leader wraps C-old,new as a log entry and synchronizes it to the other nodes.
- A follower applies the configuration as soon as it receives C-old,new. Once a majority of C-old,new (meaning a majority of C-old and a majority of C-new) has switched, the leader commits this log entry.
Phase two
- The leader then wraps C-new as a log entry and synchronizes it to the other nodes.
- A follower applies the configuration as soon as it receives C-new; if it finds that it is not in the C-new member list, it automatically leaves the cluster.
- Once the leader confirms that a majority of C-new has switched successfully, it returns a success response to the client.
The figure above shows the timeline of the process. Dashed lines represent member configuration logs that have been created but not committed, and solid lines represent committed member configuration logs.
Why does this scheme guarantee that there will not be multiple leaders? Let’s go through the process step by step.
Phase 1. C-old,new has not been committed
In this phase every node's configuration is either C-old or C-old,new, but either way, if the current leader goes down, the new leader must still obtain the votes of a majority of the C-old set.
Taking the scenario in figure 5-1 as an example, S5 has no chance of becoming leader in phase d, because within C-old only S3 votes for it, which is not a majority.
Phase 2. C-old,new has been committed, but C-new has not yet been issued
In this phase C-old,new has been committed, which guarantees that it has been replicated to a majority of C-old,new (again: a majority of C-old and a majority of C-new).
Therefore, if the leader goes down, the newly elected leader must be a node that already has C-old,new, and two leaders cannot appear.
Phase 3. C-new has been issued but not yet committed
In this phase the cluster may contain nodes in all three configurations, C-old, C-old,new, and C-new. However, because the cluster has already gone through phase 2 (C-old,new is committed on a majority), a node that only has C-old can no longer win an election; and whether a C-old,new node or a C-new node starts an election, it needs the agreement of a majority of C-new, so two leaders still cannot appear.
Phase 4. C-new has been committed
In this phase C-new has been committed, so only nodes whose configuration is C-new can win a majority of votes and become leader. The cluster has now safely completed this round of membership change and can move on to the next one.
This verifies, phase by phase, that the two-phase approach is safe; Raft calls this approach Joint Consensus.
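The essence of joint consensus is its quorum rule: while C-old,new is in effect, any election or commit needs both a majority of C-old and a majority of C-new. A minimal Go sketch of that check (names are illustrative):

```go
// jointQuorumReached reports whether a set of affirmative votes (or replication
// acknowledgements) satisfies the joint configuration C-old,new: it must cover
// a majority of C-old AND a majority of C-new.
func jointQuorumReached(votes map[string]bool, cOld, cNew []string) bool {
	return majorityOf(votes, cOld) && majorityOf(votes, cNew)
}

// majorityOf reports whether more than half of the given members voted yes.
func majorityOf(votes map[string]bool, members []string) bool {
	count := 0
	for _, id := range members {
		if votes[id] {
			count++
		}
	}
	return count > len(members)/2
}
```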
There is a more detailed treatment of cluster membership change that takes a different approach: it simply proves that changing one node at a time is safe, and offers optimizations for the availability issues. Interested readers can refer to Consensus: Bridging Theory and Practice.
5.2 Log Compaction
We know that the Raft core algorithm keeps the logs consistent, and that applying the logs yields consistent state machines: the clients' operations are wrapped into log entries and handled by Raft. In a real system, however, client operations keep arriving, and the log cannot be allowed to grow forever: it would take up more and more storage, and every restart would have to replay the entire log to rebuild the latest state machine.
Raft therefore provides a mechanism for clearing out the stale information that accumulates in the log, called log compaction.
Snapshotting is a common and simple log compaction method, used by ZooKeeper and Chubby alike. Simply put, the entire system state at some point in time is dumped and persisted to disk, after which all log entries before that point can be discarded. So do not read too much into the word "compaction": there is no way to "decompress" a state machine snapshot back into a log sequence.
Note that in Raft we can only do snapshots for committed logs because only committed logs are guaranteed to eventually be applied to the state machine.
The figure above shows a node replacing the log entries (term1, index1) through (term3, index5) with a snapshot.
Snapshots generally contain the following contents:
- Log metadata: the term and index of the last log entry covered (applied) by the snapshot
- The state machine: the state machine obtained after applying all log entries up to that point
When the leader needs to send some old log entries to a follower, but those entries have already been compacted into a snapshot and deleted on the leader, the leader sends the snapshot to the follower instead.
Similarly, when a new node joins the cluster, or a node has been down for so long that it is missing too many entries, the leader can send a snapshot directly, which greatly reduces the amount of log transfer and replay.
Synchronizing a snapshot uses a new RPC method: the InstallSnapshot RPC.
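A minimal Go sketch of what a snapshot and the InstallSnapshot request might carry, following the paper's description (field names are illustrative; the paper's InstallSnapshot additionally splits the snapshot data into chunks with an offset and a done flag, which is omitted here):

```go
// Snapshot is the persisted replacement for a committed log prefix.
type Snapshot struct {
	LastIncludedIndex int    // index of the last log entry the snapshot replaces
	LastIncludedTerm  int    // term of that entry
	Data              []byte // serialized state machine after applying entries up to LastIncludedIndex
}

// InstallSnapshotArgs is sent by the leader when the entries a follower needs
// have already been compacted away on the leader.
type InstallSnapshotArgs struct {
	Term     int      // leader's current term
	LeaderID int      // so followers can redirect clients
	Snapshot Snapshot // the snapshot to install
}
```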
At this point we have basically covered the content of the Raft paper. "In Search of an Understandable Consensus Algorithm (Extended Version)" is, after all, only 18 pages long, and it leans more toward theory than engineering practice. If you want to study Raft in greater depth, or to write a solid Raft implementation yourself, Consensus: Bridging Theory and Practice is the right place to look.
Next we’ll talk a little bit more about linear consistency and Raft read performance optimization.
6. Linear consistency and read performance optimization
6.1 What is Linear Consistency?
In the first part of this series, basic concepts, we mentioned that distributed systems often use replicas for fault tolerance in order to eliminate single points of failure and improve availability, but that this raises another problem: ensuring consistency across multiple replicas.
What is consistency? There are many consistency models, and different models judge the correctness of a concurrent system with different degrees of strictness. What we discuss today is strong consistency, that is, linear consistency (linearizability), which is also the C in the CAP theorem.
In fact, we briefly described what linear consistency is in the first article:
Strong consistency (linear consistency) does not mean that the state of every node in the cluster must be identical at every moment. Rather, the goal is to make the distributed system look as if it held only one copy of the data, with atomic reads and writes, so that the application layer can ignore the synchronization among the underlying replicas. In other words, we can treat a strongly consistent distributed system as a single whole: once a client writes successfully, all clients must be able to read the value just written; even under a network partition or the failure of a minority of nodes, the cluster still serves requests as if it were a single machine.
"Serving as if it were a single machine" is an intuitive description of what a linearly consistent system should feel like. So how do we judge whether a system is linearly consistent? In plain terms, stale data must never be read, but there are two cases to distinguish:
- For requests whose invocation times overlap (concurrent requests), the order in which they take effect can be decided arbitrarily.
- For requests invoked one after another (partially ordered), a later request must not violate the result already determined by an earlier one.
Only by checking against these two rules can we judge whether a system is linearly consistent. Let's look at an example of a system that is not linearly consistent.
The example diagrams in this section are taken from Designing Data-Intensive Applications by Martin Kleppmann.
As shown in the figure above, the referee writes the result of the World Cup into the leader database, while the pages Alice and Bob are browsing read from two different follower replicas. Because of replication lag, and because Follower 2 lags further behind than Follower 1, Bob hears Alice's exclamation, refreshes the page, and still sees the match in progress.
The basic idea of linear consistency is simple: make the distributed system appear to hold only one copy of the data. In practice, however, there are still many points that deserve attention. Let's continue with a few examples.
The figure above shows, from the external perspective of the clients, several users concurrently reading and writing the same system. Each bar is a request issued by a user: the left end marks when the request was issued, the right end marks when the response was received. Because network latency and processing time vary, the bars have different lengths.
- x initially has the value 0, and at some point Client C writes x to 1.
- Client A's first read begins before Client C's write, so it must return the old value 0.
- Client A's last read begins after Client C's write completes; if the system is linearly consistent, it must return the new value 1.
- Any other read that overlaps the write may return either 0 or 1, because we cannot tell at exactly which instant the write takes effect; such reads and writes are concurrent.
This alone is not enough to call the system linearly consistent. Suppose Client B's first read returns 1; if Client A's second read then returns 0, the two rules above are not violated, yet the system is still not linearizable, because during the write the clients see the value of x flip back and forth between the old and the new. That does not match our expectation of "appearing to hold only one copy of the data".
So we need to add an additional constraint, as shown in the figure below.
After any client's read has returned the new value, all subsequent reads by any client must also return the new value; only then does the system satisfy linear consistency.
Let's finish with a more complex example to refine the sequence diagram further.
As shown in the figure above, each read and write takes effect atomically at some instant; we mark that instant with a vertical line inside the bar and connect the marks in chronological order. The requirement of linear consistency is that this connecting line always moves rightward along the time axis and never jumps back to the left. In addition, the sequence traced out by the line must be a valid history of register reads and writes: every read by any client must return the value most recently written.
Linear consistency is not something unique to distributed environments; it can simply be understood as the property of a "register" on a single-core, single-machine system.
Client B's last read does not satisfy linear consistency: for the line to keep moving to the right it would have to return a stale value, because Client A has already read the 4 written by Client C. There are a few more details worth pointing out in this diagram, which clear up several common misunderstandings about using linearly consistent systems:
- Client B's first read request is issued before Client D's first write and Client A's first write, yet the value it finally reads is the one written by Client A's later write.
- Before Client A has received the success response for its first write, Client B has already read the value written by Client A.
The above phenomena are all reasonable under the semantics of linear consistency.
This is why linear consistency, besides being called strong consistency, is also known as atomic consistency, immediate consistency, or external consistency; all of these names arguably describe it even more aptly.
6.2 Raft linear consistency read
Now that we know what linear consistency is, let's combine it with Raft and dig deeper. The first question to ask: is every system that uses Raft linearly consistent? No. Raft only provides the foundation; making the whole system linearly consistent still requires some extra work.
Suppose we want to build a linearly consistent distributed KV system on top of Raft. Let's start from the most naive solutions, point out the problems with each, and finally arrive at a system that is linearly consistent as a whole.
6.2.1 Problems with writing to the leader and reading from followers
Write operations are not the focus here: if you have read the theory above, you know that all writes are initiated as proposals on the leader node, so of course all write commands should simply be handed to the leader. The real crux is how reads are handled, and this is where the whole system's consistency trade-offs lie.
In this scheme, assume reads are sent directly to followers. Because of Raft's Quorum mechanism (only a majority of nodes needs to succeed), at any given moment the cluster may be in one of the following states with respect to some proposal:
- The log entry of a write has not yet been replicated to a minority of followers, but the leader has already committed it.
- The log entry of a write has been replicated to all followers, but after committing it the leader has not yet sent the heartbeat carrying the new commit index to some followers.
In either state, a client reading from a follower may read stale data, so the system is clearly not linearly consistent.
6.2.2 Problems with writing to the leader and reading from the leader
In this scheme we require that all reads are also handled by the leader. Surely linear consistency is satisfied once all reads and writes go through the leader? Unfortunately not, and there is more than one problem!
Problem 1: the state machine lags behind the committed log, causing stale reads
Recall that when we explained commit we mentioned when a write can respond to a client:
Commit simply marks the log to indicate that it can be applied to the state machine and respond to the corresponding client requests.
This means a proposal can respond to the client as soon as the leader commits it; Raft does not require the result to have been applied to the state machine before replying. So from the client's point of view, even though our write has succeeded, the very next read may still return the old value.
The fix for this problem is simple: when the leader receives a read request, it records the current commit index, and only when the apply index catches up with that commit index does it return the content of the state machine to the client.
Problem 2: stale reads caused by network partitions
If the cluster becomes partitioned and the old leader ends up in the minority partition, for a while the old leader will not realize that it has lost leadership. When the majority partition elects a new leader and starts processing subsequent writes, clients still connected to the old leader may read stale values.
So even if we only read the leader's state machine directly, the system still does not satisfy linear consistency.
6.2.3 Raft Log Read
To ensure that the leader still holds leadership while handling a read, we can also run the read through Raft as a proposal; when the log entry corresponding to the read request can be applied to the state machine, the leader reads its state machine and returns the result to the user.
This read scheme is called Raft Log Read, or more intuitively, Read as Proposal.
Why does this scheme satisfy linear consistency? Because all reads and writes are now linearized by their commit index: each read request observes the latest state of the state machine after all earlier writes have executed, and reads and writes are applied to the state machine one by one, so the whole system is linearly consistent. The drawback is equally obvious: performance is poor, a read costs almost as much as a write, and because all operations are linearized we cannot read the state machine concurrently.
6.3 Optimization of Raft Read Performance
Next, we’ll look at several optimizations that significantly improve read performance without violating the linear consistency of the system.
6.3.1 Read Index
Compared with Raft Log Read, Read Index saves the cost of writing and synchronizing a log entry, greatly improving read throughput and reducing read latency to some extent. Its rough flow is as follows:
- When the leader receives a read request from a client, it records the current commit index, called the read index.
- The leader sends a round of heartbeats to the followers. This step confirms its leadership and prevents a stale leader stranded in a minority partition from serving the request.
- The leader waits for its state machine to apply at least up to the read index (that is, until apply index >= read index).
- The leader executes the read request and returns the result from the state machine to the client.
The key point is step 3: apply index >= read index. The commit index recorded when the read request arrives serves as a lower bound for the read; as long as what the client reads reflects at least that commit index, the result is linearizable.
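A minimal Go sketch of the Read Index flow on the leader; the function-typed fields are hypothetical hooks standing in for a real heartbeat round and for blocking until the apply index catches up:

```go
import "errors"

var ErrNotLeader = errors.New("not the leader")

// Node is a minimal stand-in for a Raft node; the function-typed fields are
// hypothetical hooks, not part of any real library's API.
type Node struct {
	commitIndex       int
	confirmLeadership func() bool             // one heartbeat round acknowledged by a majority
	waitForApply      func(index int)         // block until applyIndex >= index
	readStateMachine  func(key string) []byte // local state-machine lookup
}

// linearizableRead implements the Read Index flow described above.
func (n *Node) linearizableRead(key string) ([]byte, error) {
	readIndex := n.commitIndex // 1. record the commit index at the moment the read arrives
	if !n.confirmLeadership() { // 2. prove we are still the leader (no stale leader in a minority partition)
		return nil, ErrNotLeader
	}
	n.waitForApply(readIndex) // 3. wait until the state machine has applied at least up to readIndex
	return n.readStateMachine(key), nil // 4. serve the read from the local state machine
}
```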
6.3.2 Lease Read
Compared with Read Index, Lease Read also cuts out the network interaction, further and significantly reducing read latency.
The basic idea is that the leader holds a lease whose duration is shorter than the election timeout. Within the lease period we can be confident that no other node can start an election, so the cluster cannot split-brain, and the leader can serve reads from its state machine directly. Outside the lease, it falls back to the Read Index flow, and the heartbeat round used by Read Index also renews the lease.
Lease Read can be seen as a timestamp-based version of Read Index, and depending on timestamps introduces some uncertainty into the algorithm: if clocks drift, a series of problems can follow.
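A sketch of the Lease Read fast path, reusing the Node type from the Read Index sketch; leaseExpiry is assumed to be tracked by the leader and pushed forward every time a heartbeat round reaches a majority:

```go
import "time"

// leaseRead serves a read locally while the leader's lease is still valid and
// falls back to the full Read Index flow once it has expired. Clock drift is
// exactly what makes this weaker in practice than Read Index.
func (n *Node) leaseRead(key string, leaseExpiry time.Time) ([]byte, error) {
	if time.Now().Before(leaseExpiry) {
		n.waitForApply(n.commitIndex) // the apply-lag problem still has to be handled
		return n.readStateMachine(key), nil
	}
	return n.linearizableRead(key) // lease expired: confirm leadership via Read Index
}
```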
6.3.3 Follower Read
However we twist and turn in the two optimizations above, the core idea boils down to just two points:
- Ensure that the latest Commit index is applied at the time of reading.
- Ensure that the leader still has leadership at read time.
These two guarantees correspond exactly to the two problems described in section 6.2.2.
Both Read Index and Lease Read are aimed at the second problem; in other words, read requests are ultimately still anchored to the leader.
So is reading from followers really incompatible with linear consistency? Not necessarily. Here is a feasible scheme for reading from followers: after receiving a read request from a client, the follower asks the leader for the latest commit index. Since all log entries will eventually be replicated to the follower anyway, it only has to wait until the log at that index has been committed and applied to its own local state machine, and then it can return the result to the client. This scheme is called Follower Read.
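A sketch of the Follower Read flow, again reusing the Node type above; askLeaderCommitIndex is a hypothetical RPC that asks the current leader for its latest commit index:

```go
// followerRead runs on a follower: it asks the leader for the latest commit
// index, waits until its own state machine has caught up to that index, and
// then answers from its local state machine.
func (n *Node) followerRead(key string, askLeaderCommitIndex func() int) ([]byte, error) {
	readIndex := askLeaderCommitIndex() // one round trip to the leader is still required
	n.waitForApply(readIndex)           // wait for the local log to be applied up to readIndex
	return n.readStateMachine(key), nil // serve the read locally
}
```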
Note: Follower Read does not mean we are completely independent of the leader during reads; under the premise of linear consistency, it is theoretically impossible to bypass the leader entirely.
That covers the core of the Raft algorithm and the most important considerations for engineering practice.
If you have stuck with it this far, you should now have a solid understanding of the theory behind the Raft algorithm. Of course, the gap between theory and engineering practice may be larger than expected, and many details only emerge in real implementations. In the following source-code analysis and hands-on articles, we will walk through the code to explain many of the details the theory part does not mention, and share plenty of infrastructure design experience. Stay tuned!
Thanks for reading! Original writing is not easy; likes, follows, and comments are all welcome.