The log-based replicated state machine is the foundation on which Raft presents a unified view of a distributed system and achieves distributed consensus.
Note: This article is original; please credit the source when reprinting.
This series aims to help you understand the Raft protocol in depth and apply it in your engineering practice, with particular attention to the points that are easy to misunderstand. It introduces Raft from three angles: principles, source code, and practice:
- The principles part explains the Raft algorithm based on the Raft paper and follows Raft’s modular decomposition, covering leader election, log replication, safety, cluster membership changes, and log compaction.
- The source-code part analyzes hashicorp/raft to learn from an industrial Raft implementation (hashicorp/raft is the library underlying Consul).
- The practice part concludes by building a simple distributed KV store on top of hashicorp/raft.
Links to earlier articles in this series:
- Raft Protocol Combat Series part 1 – Basic concepts
- Raft Protocol Combat Series part 2
1. What is log replication
As mentioned in the previous article, consensus algorithms are generally based on the replicated state machine model: all nodes start from the same initial state and, after applying the same sequence of log operations, end up in the same final state. In other words, as long as the logs of all nodes in the cluster are consistent, the state machines obtained by applying those logs will also be consistent.
Raft is responsible for ensuring log consistency across all nodes in the cluster.
We also mentioned that Raft adopts a strong-leader model, which makes it easy to see how it keeps logs consistent: all operations (log entries) must go through the leader (a follower that receives a write forwards it to the leader), and the leader replicates them to the other nodes, thereby guaranteeing log consistency across the cluster.
This process is called log replication, and the corresponding system model is the log-based replicated state machine.
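To make the replicated-state-machine idea concrete, here is a minimal Go sketch (the types and names are illustrative only, not hashicorp/raft APIs): two nodes that start from the same empty state and apply the same log end up in the same state.

```go
package main

import "fmt"

// Command is a state-machine instruction such as "x <- 3" (illustrative type).
type Command struct {
	Key   string
	Value int
}

// StateMachine applies commands in log order; two machines that apply the
// same log end up in the same state.
type StateMachine struct {
	kv map[string]int
}

func NewStateMachine() *StateMachine {
	return &StateMachine{kv: make(map[string]int)}
}

func (s *StateMachine) Apply(c Command) {
	s.kv[c.Key] = c.Value
}

func main() {
	log := []Command{{"x", 3}, {"y", 1}, {"y", 9}}

	// Two nodes starting from the same empty state...
	a, b := NewStateMachine(), NewStateMachine()
	for _, c := range log {
		a.Apply(c)
		b.Apply(c)
	}
	// ...reach the same state after applying the same log.
	fmt.Println(a.kv, b.kv) // map[x:3 y:9] map[x:3 y:9]
}
```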
2. Raft log replication mechanism parsing
2.1 Overall process analysis
Once elected, the leader takes on the responsibility of leading the cluster: it receives client requests, wraps the operations into log entries, and replicates them to the other nodes.
The overall process is as follows:
- The leader serves clients; each client request contains an instruction to be executed by the replicated state machine.
- The leader appends this instruction to its own log as a new entry, then issues AppendEntries RPCs to the other nodes asking them to append the entry to their local logs.
- When the entry has been safely replicated, that is, replicated by a majority (N/2+1) of the nodes, the leader applies it to its local state machine and returns the result of the operation to the client.
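The following single-threaded Go sketch walks through these three steps for a 5-node cluster. All names here are illustrative (real replication is concurrent, goes over the network, and is not hashicorp/raft’s API):

```go
package main

import "fmt"

type node struct{ log []string }

// appendEntries imitates a follower handling an AppendEntries RPC: here it
// always succeeds (the consistency check is covered in section 2.3).
func (n *node) appendEntries(cmd string) bool {
	n.log = append(n.log, cmd)
	return true
}

func main() {
	leader := &node{}
	followers := []*node{{}, {}, {}, {}} // a 5-node cluster

	// Step 1-2: the leader wraps the client's instruction into a log entry,
	// appends it locally, then replicates it to the followers.
	cmd := "x <- 3"
	leader.appendEntries(cmd)

	acks := 1 // the leader's own copy counts toward the majority
	for _, f := range followers {
		if f.appendEntries(cmd) {
			acks++
		}
	}

	// Step 3: once a majority (N/2+1) holds the entry, it is committed; the
	// leader applies it to its state machine and replies to the client.
	if acks >= (len(followers)+1)/2+1 {
		fmt.Println("committed and applied:", cmd)
	}
}
```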
The log model of the whole cluster can be represented at a high level by the following figure (x ← 3 means x is assigned the value 3):
*Raft cluster log model
In addition to the state-machine instruction it stores (such as the assignment x ← 3, meaning x is assigned the value 3), each log entry has a unique integer index (the log index) indicating its position in the log. Each entry also stores a term number (the number at the top of each entry box in the figure; entries of the same term share a color), which is the leader’s current term when it received the instruction. Entries with the same term were all sent by the same leader during that term.
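Expressed as a data structure, each entry therefore carries at least the following three fields. This is a minimal sketch; a production implementation such as hashicorp/raft stores the command as opaque bytes along with additional metadata:

```go
// LogEntry is a minimal representation of one Raft log entry as described
// above; real implementations carry extra fields.
type LogEntry struct {
	Term    uint64 // term of the leader that created the entry
	Index   uint64 // position of the entry in the log, starting at 1
	Command []byte // state-machine instruction, e.g. an encoded "x <- 3"
}
```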
When the leader deems a log entry safe to apply to the state machine, the entry is said to be committed (the committed entries in the figure above). Which entries can be committed? The answer: an entry is committed once the leader learns that it has been successfully replicated by more than half of the nodes in the cluster. So in the figure above, the entry (term3, index7) and all entries before it are committed, even though two nodes still have incomplete logs.
Raft guarantees that all committed entries are durable and will “eventually” be applied by the state machine.
Note: the word “eventually” here is subtle, and it points to one thing: Raft only guarantees log consistency; the state-machine consistency that we really want requires some extra work, which will be covered in a follow-up article on Linear Consistency and Performance Optimization.
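To make the commit rule concrete, a leader typically tracks, for every node, the highest log index known to be replicated there (matchIndex in the Raft paper) and commits up to the median of those values. A minimal sketch under those assumptions (not hashicorp/raft code):

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index that is replicated on a majority
// of nodes. matchIndex holds, for every node including the leader itself,
// the highest index known to be stored on that node.
func commitIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// With the values sorted ascending, the element at position (n-1)/2 is
	// guaranteed to be held by a majority of nodes.
	return sorted[(len(sorted)-1)/2]
}

func main() {
	// Example: in a 5-node cluster the leader and two followers hold up to
	// index 7 while two followers lag behind; entries up to 7 are committed.
	fmt.Println(commitIndex([]uint64{7, 7, 7, 5, 4})) // 7
}
```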
2.2 Raft Log Replication Process Diagram
We will use the Raft animation (raft.github.io) to simulate an ordinary log replication run.
*Figure 1
In Figure 1, S1 is elected leader, and there are no logs yet. We simulate a client making a request to S1.
*Figure 2
As shown in Figure 2, S1 receives the client request, appends a new log entry (term2, index1), and then issues AppendEntries RPCs to the other nodes in parallel.
*Figure 3
As shown in Figure 3, S2 and S4 receive the request first, append the entry, and respond to S1.
*Figure 4
As shown in Figure 4, the entry has been appended on all nodes, but since the leader has not yet received any responses, it does not know whether the entry has been safely replicated.
*Figure 5
As shown in Figure 5, once S1 receives responses from two nodes, the border of the log entry turns solid, indicating that the entry has been safely replicated: in a 5-node cluster, two followers plus the leader itself already form a majority. At this point S1 responds to the client’s request.
*Figure 6
As shown in Figure 6, the leader then sends heartbeat packets to the followers carrying the index of the currently committed entry (term2, index1).
*Figure 7
As shown in Figure 7, all followers learn from the heartbeat that the entry (term2, index1) has been committed, so the borders of that entry become solid on every node.
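The heartbeats in Figures 6 and 7 are just AppendEntries RPCs with no entries; what matters to the follower is the leader’s commit index carried in the RPC (leaderCommit in the Raft paper). A minimal follower-side sketch, reusing the LogEntry type from the earlier sketch:

```go
// Follower keeps only the fields this sketch needs; LogEntry is the struct
// from the sketch in section 2.1, and the field names follow the Raft paper.
type Follower struct {
	log         []LogEntry // log index 1 is stored at slice position 0
	commitIndex uint64     // highest log index known to be committed
}

// onHeartbeat shows how a follower advances its commit index when a
// heartbeat (an empty AppendEntries RPC) arrives with the leader's
// commit index.
func (f *Follower) onHeartbeat(leaderCommit uint64) {
	if leaderCommit > f.commitIndex {
		// Never mark more entries committed than we actually hold locally.
		last := uint64(len(f.log))
		if leaderCommit < last {
			f.commitIndex = leaderCommit
		} else {
			f.commitIndex = last
		}
		// Entries up to commitIndex can now be applied to the state machine.
	}
}
```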
2.3 Raft’s guarantee of log consistency
We have been using (term2, index1) to identify a log entry. Why include the term and not just the index? The reason is that the term lets nodes detect log inconsistencies between each other, which will become easier to understand in the next section.
Raft guarantees that if two entries in different nodes’ logs have the same term and index, they store the same instruction.
Why can such a guarantee be given? Because Raft requires that, within a term, the leader create at most one entry at a given index and never modify it afterwards.
Raft also guarantees that if two entries in different nodes’ logs have the same term and index, then all preceding entries are identical as well.
This is because every AppendEntries RPC issued by the leader additionally carries the (term, index) of the entry immediately preceding the new ones, and the follower rejects the new entries if it cannot find an entry with that (term, index) in its local log.
Therefore, as long as followers keep receiving logs from the leader normally, the above guarantee can be verified by induction.
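A sketch of that consistency check on the follower side, reusing the Follower and LogEntry types from the sketches above (prevLogIndex and prevLogTerm are the names used in the Raft paper; this is not hashicorp/raft code):

```go
// acceptEntries performs the AppendEntries consistency check described above:
// the new entries are accepted only if the follower's log contains an entry
// at prevLogIndex whose term is prevLogTerm.
func (f *Follower) acceptEntries(prevLogIndex, prevLogTerm uint64, entries []LogEntry) bool {
	if prevLogIndex > 0 {
		if prevLogIndex > uint64(len(f.log)) ||
			f.log[prevLogIndex-1].Term != prevLogTerm {
			return false // the leader will retry with an earlier prevLogIndex
		}
	}
	// Check passed: replace everything after prevLogIndex with the leader's
	// entries (a real implementation only truncates on an actual conflict).
	f.log = append(f.log[:prevLogIndex], entries...)
	return true
}
```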
2.4 Possible Log Inconsistency Scenarios
When all nodes are healthy, the leader’s and followers’ logs always stay consistent and AppendEntries RPCs never fail. However, any node may crash at any time, and keeping the cluster’s logs consistent in the face of such failures is the real problem we need to solve.
*Log inconsistency scenario diagram
The figure above shows the messy log states a cluster can be in when a new term8 leader has just taken office. For example, a follower may be missing some entries (a to b), may have extra uncommitted entries (c to d), or both (e to f).
Note: it is impossible for a follower to have more committed entries than the leader. This is guaranteed by the election restriction, which is described in the upcoming article on Safety.
We will first try to reproduce scenarios a–f above, and then look at how Raft resolves these inconsistencies.
Scenarios a and b. The follower’s log lags behind the leader’s
In these scenarios the follower is simply down for a while: follower-a crashes after receiving (term6, index9), and follower-b crashes after receiving (term4, index4). There is nothing more to explain here.
Scenario c. The follower has more logs than the leader
While the term6 leader was synchronizing (term6, index11) to the followers, it crashed, and only follower-c had received the AppendEntries RPC for that entry. After a series of elections (term7 may have elected no leader because of timeouts, or its leader may have crashed right after taking office), the term8 leader finally takes office, which produces scenario c.
Scenario d. The follower has extra term7 entries that the leader lacks
After the term6 leader successfully committed (term6, index10), it crashed. The term7 leader then took office and synchronized two entries to the followers, but crashed before it had a chance to commit them. The cluster then elected the term8 leader.
Scenario e. The follower lacks the leader’s term5–6 entries and has extra term4 entries
After the term4 leader synchronized (term4, index7) to the followers and successfully committed (term4, index5) and the entries before it, it crashed, and follower-e then crashed as well. Follower-e therefore missed all log synchronization during term5–7, and by the time it recovered, the term8 leader had taken office.
Scenario f. The follower lacks the leader’s term4–6 entries and has extra term2–3 entries
The term2 leader synchronized some entries (index4–6) to the followers, but crashed before it could commit them. It quickly recovered and was elected the term3 leader; it went on to synchronize more entries (index7–11) to the followers, but again crashed before committing. Follower-f then crashed as well, and by the time follower-f woke up, the cluster had advanced to term8.
2.5 How to Handle Log Inconsistency
As the scenarios above show, real-world clusters can get into very messy states. So how does Raft deal with so many kinds of inconsistency? The approach is simple and brute-force; think of the phrase strong leader.
Raft forces followers to copy the leader’s log set to resolve inconsistencies.
That is, any entries on a follower that conflict with the leader’s are overwritten by the leader’s entries. This causes no problems because of the election restriction: if a follower’s entry conflicts with the leader’s, that entry must be uncommitted on the follower. Uncommitted entries have not been applied to the state machine and have never been visible to external clients.
To make a follower’s log identical to its own, the leader must first find the latest point at which the two logs agree, because once that entry agrees, all earlier entries must agree as well (recall the property above). This confirmation is done by the consistency-check step of the AppendEntries RPC.
The leader maintains a nextIndex for each follower, which is the index of the next log entry to send to that follower. When a leader first takes office, it initializes every nextIndex to the index of its last log entry + 1. If a follower’s log is inconsistent with the leader’s, the consistency check in the next AppendEntries RPC will fail; after the follower rejects that RPC, the leader decrements the follower’s nextIndex and retries.
Eventually there must be a nextIndex at which the leader’s and the follower’s logs agree. In the extreme case nextIndex is 1, meaning the follower has no entries in common with the leader, and the leader must synchronize its log from the very first entry.
Once nextIndex is determined for a follower, the leader starts synchronizing entries from that index; the follower deletes its conflicting entries and keeps the ones the leader sends.
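Putting it together, here is a sketch of the leader-side retry loop for repairing one follower, reusing the Follower type and acceptEntries check from the sketches above (the one-entry-at-a-time backoff mirrors the basic algorithm; real implementations usually skip back faster):

```go
// syncFollower sketches how a leader repairs one follower's log: it walks
// nextIndex backwards until the consistency check passes, then ships all of
// its own entries from that point on. leaderLog is the leader's full log.
func syncFollower(leaderLog []LogEntry, f *Follower) {
	// A new leader starts optimistic: nextIndex = last log index + 1.
	nextIndex := uint64(len(leaderLog)) + 1

	for {
		prevIndex := nextIndex - 1
		var prevTerm uint64
		if prevIndex > 0 {
			prevTerm = leaderLog[prevIndex-1].Term
		}
		// Try to append everything from nextIndex onwards.
		if f.acceptEntries(prevIndex, prevTerm, leaderLog[prevIndex:]) {
			return // logs now match; the follower dropped any conflicting suffix
		}
		// Rejected: back off by one entry and retry.
		nextIndex--
	}
}
```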
With this simple mechanism, the logs of the whole cluster automatically converge. Note also that the leader never overwrites or deletes its own entries; it only forces the followers to catch up with it.
This requires that the leader elected by the cluster have the necessary log safety, which is exactly the election restriction mentioned above.
We’ll discuss this in more detail in the next article.
You are welcome to share and follow the author’s WeChat official account: Q’s Blog, which delivers practical experience, systematic summaries, source-code walkthroughs, and technical deep dives from time to time.