This article introduces the algorithm restrictions Raft puts on master selection and log replication to ensure that it is safe.
Note: This article is original, please indicate the source of reprint.
I’m looking to help you understand the Raft protocol in depth and apply it to your engineering practice. I’m also looking to explain key points that are difficult to understand or misunderstand. This series introduces Raft principles and algorithms from principle, source code and practice:
- The principle part introduces the principle of Raft algorithm based on Raft paper, and the part follows Raft’s modularity idea. Explain Leader election, Log Replication, Safety, Cluster membership change, and Log compaction.
- The source section analyzes Hashicorp/Raft to learn an industry raft implementation (Hashicorp/Raft is Consul’s underlying dependency).
- The practical part ends with a simple distributed KV storage based on Hashicorp /raft.
Links to series of historical articles:
- Raft Protocol Combat Series part 1 – Basic concepts
- Raft Protocol Series (2
- Raft protocol Series 3 – Log replication
In the previous section we talked about how Raft algorithm selects master and copies logs, but the mechanism we have described so far does not guarantee that every node’s state will apply logs in exactly the same order. Imagine the following scenario:
- The Leader copies some logs to most nodes, and an outage occurs after the commit.
- A follower is not copied to these logs, but it participates in the election and is elected the next leader.
- The new leader synchronizes and commits logs that overwrite the previously committed logs on the other nodes.
- The state machines of each node may apply different log sequences, causing inconsistencies.
So we need to put some additional restrictions on the “master pick + log copy” mechanism to ensure that the state machine is secure and the Raft algorithm is correct.
1. Restrictions on elections
Taking a look at the committed log overwritten scenario, the root problem actually occurs in step 2. A Candidate must be qualified to be the cluster leader, otherwise it will cause unexpected errors to the cluster. If a Candidate is qualified for this position, he or she can add a little condition to the election.
Each candidate must carry the latest (term, index) of their local logs in the RequestVote RPC, and followers refuse to vote for the candidate if they find that the candidate’s logs are not yet new.
To win the election as leader, a Candidate must be voted by the majority of the nodes in the cluster, and its log must be at least as good as that of the majority of nodes. Because a log cannot be committed until it has been copied to most nodes, the candidate who wins the election must have all the committed logs.
Therefore, we concluded in the previous article that followers cannot produce more committed logs than the leader.
The logic of comparing two (terms, indexes) is very simple: if the terms are different the larger log updates, otherwise index the larger log updates.
2. Restrictions on submission
In addition to adding a little constraint on elections, we also need to add a little constraint on commit behavior to complete the last piece of the puzzle at the heart of our Raft algorithm.
Remember what commit is:
When the leader knows that a log has been successfully replicated by more than half of the nodes in the cluster, he can commit the log. The committed log must be applied by the state machine eventually.
Commit simply marks the log to indicate that it can be applied to the state machine and respond to the corresponding client requests.
However, the leader cannot commit logs from the old term at any time, even if it has been copied to most nodes. Raft’s paper gives a classic scenario:
*Figure 1
Figure 1 simulates the problem scenario chronologically from left to right.
Phase A: S1 is the leader. After receiving the request, (TERm2, index2) is copied only to S2, but not to S3 ~ S5.
Phase B: S1 breaks down and S5 is elected leader of TERm3 (S3, S4, S5 votes). After receiving the request, IT saves (Term3, index2) and has not been copied to any node.
Phase C: S5 breaks down, S1 recovers, S1 is reelected leader of TERM4, and continues to copy (TERm2, index2) to S3. Most nodes have been satisfied, so we commit it.
Phase D: S1 breaks down again, S5 recovers, S5 is elected leader again (S2, S3 and S4 votes), copies (TERm3, INDE2) to all nodes and commits. Note that a fatal error occurred, and the already committed (term2, index2) was overwritten by (term3, index2).
To avoid this error, we need to add an additional restriction:
The Leader only allows the COMMIT to contain logs for the current term.
For the above scenario, the problem occurs in phase C. Even though S1, the leader of TERM4, replicates (TERm2, Index2) to most nodes, it cannot commit it directly. Instead, it must wait for terM4’s logs to arrive and be successfully copied before committing them.
Phase E: After this restriction is added, either (TERm2, index2) is never committed, so it is safe for S5 to overwrite it in phase D; Either (term2, index2) is committed along with (term4, index3), S5 cannot be the leader at all, because most nodes’ logs are newer than it, and there are no previous problems.
These are the two small restrictions added to the algorithm that are critical to ensuring the security of the state machine.
Now that we’ve covered the core of the Raft algorithm, if you’ve followed through this article you’ll have a good grasp of the core algorithm mechanism of the Raft protocol.
In the next article we will cover two assistive techniques that are also described in this paper: cluster member change and log compression, both of which are essential parts of the Raft engineering practice to help the reader fully grasp the core principles prior to source code analysis.
Welcome to forward and follow the author’s wechat official account:Q’s blog.Irregular delivery of dry goods, practical experience, system summary, source code interpretation, technical principles.