Raft is a consensus algorithm for distributed systems. It is easier to implement than Paxos while offering comparable performance, and it is widely used in systems such as etcd and Consul. I came across Consul while containerizing services, read the Raft paper along the way, and then worked through Lab 2 of the MIT 6.824 distributed systems course to practice Go. Because the experiment uses randomized election timeouts and simulated node crashes, a failure may only show up once in hundreds of runs, so run the tests many times to make sure they really pass. My Raft implementation is at 6.824-2017-Raft; it borrows from several other implementations, see the README. Lab 1 of the course is a simplified MapReduce, which is comparatively easy to implement; see 6.824-2017-MapReduce for the code. Corrections are welcome.
1 Overview
A distributed consensus algorithm lets a group of machines work as a coherent whole, so that the system can keep providing service even if some machines fail. Consensus used to be described in terms of Paxos, but Paxos is hard to understand and harder to implement correctly, which motivated Raft: a much more straightforward consensus algorithm that performs about as well as Paxos.
For understandability, Raft decomposes the problem (leader election, log replication, and safety) and reduces the state space. Compared with earlier consensus algorithms, Raft has some distinctive features:
- Strong leader. Log entries flow in only one direction, from the Leader to the other nodes.
- Leader election. Raft uses randomized election timeouts to elect the Leader, which makes split votes rare and ensures elections converge.
- Membership changes. Raft uses a joint consensus approach so that the cluster keeps serving requests while its configuration changes.
2 Replicated state machines
Consensus algorithms arise in the context of replicated state machines: the servers in a cluster compute identical copies of the same state, so clients can keep operating even if some servers go down. Replicated state machines are used to solve a variety of fault-tolerance problems in distributed systems. Large distributed systems with a single cluster leader, such as GFS and HDFS, typically use a separate replicated state machine to manage leader election and configuration information so they can survive leader crashes; Chubby and ZooKeeper are other examples.
Replicated state machines are implemented with a replicated log, as shown in the figure below. Each server stores a log containing a sequence of commands, and its state machine executes those commands in order. Every log contains the same commands in the same order, so every state machine processes the same command sequence. Because the state machines are deterministic and execute the same commands in the same order, they all end in the same state and produce the same outputs.
Replicated state machine diagram
Keeping the replicated logs consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and appends them to its log. It communicates with the consensus modules on the other servers to ensure that every log eventually contains the same commands in the same order, even if some servers fail in the meantime. Once commands are properly replicated, each server's state machine processes them in log order and returns the output to the client. The result is what appears to be a single, highly available state machine.
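To make the idea concrete, here is a minimal sketch of a deterministic state machine: a tiny key/value store that applies commands in log order. This is my own illustration, not code from any of the systems mentioned above; the Command and KVStateMachine names are hypothetical.

```go
package main

import "fmt"

// Command is a hypothetical log command for a key/value state machine.
type Command struct {
	Op    string // "Put" or "Get"
	Key   string
	Value string
}

// KVStateMachine is deterministic: applying the same commands in the same
// order always produces the same state and the same outputs.
type KVStateMachine struct {
	data map[string]string
}

func (sm *KVStateMachine) Apply(cmd Command) string {
	switch cmd.Op {
	case "Put":
		sm.data[cmd.Key] = cmd.Value
		return cmd.Value
	case "Get":
		return sm.data[cmd.Key]
	}
	return ""
}

func main() {
	sm := &KVStateMachine{data: make(map[string]string)}
	// Every replica that applies this same log ends up with x == "2".
	replicatedLog := []Command{{"Put", "x", "1"}, {"Put", "x", "2"}}
	for _, cmd := range replicatedLog {
		sm.Apply(cmd)
	}
	fmt.Println(sm.Apply(Command{Op: "Get", Key: "x"})) // prints 2
}
```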
Consensus algorithms used in practical systems typically have the following properties:
- Safety. They never return an incorrect result under any non-Byzantine conditions, including network delay, partitions, packet loss, duplication, and reordering.
- Availability. The system is available as long as a majority (more than half) of the servers in the cluster are up. For example, a cluster of five servers can tolerate two servers going down without affecting service.
- Log consistency does not depend on timing. Faulty clocks and extreme message delays can, at worst, cause availability problems.
- A command typically completes as soon as a majority of the cluster has responded to a round of remote procedure calls, so a minority of slow servers does not affect overall system performance.
3 Raft algorithm
The Raft algorithm is covered here in two parts: leader election and log replication. There is a nice animation that illustrates how Raft works (see the references), and for membership changes and log snapshots there is a good in-depth article; see "Raft - Membership Change".
3.1 Leader Election
A Raft node is in one of three states: Follower, Candidate, or Leader. The state transitions are shown in the figure below.
Raft node state transition diagram
- A node in the Follower state becomes a Candidate if it does not receive any RPC (a vote request, a log-replication request, or a heartbeat) within a randomized election timeout.
- A node in the Candidate state starts an election immediately: it votes for itself and sends a Request Vote RPC to every other node. If it receives votes from a majority of nodes, it becomes Leader. If it votes for another node, or sees an RPC with a newer term (whether a vote request or a log-replication request), it reverts to Follower. If no candidate wins a majority, it remains a Candidate and starts a new election.
- A node in the Leader state periodically sends RPCs to the other nodes (this period is the heartbeat timeout, usually much smaller than the election timeout; I set it to 50ms in the experiment). If it sees an RPC with a newer term, it reverts to Follower.
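As a rough sketch of how these states and the randomized election timeout might look in Go (the raftState and electionTimeout names are my own, not necessarily the lab code's):

```go
package raft

import (
	"math/rand"
	"time"
)

// The three node states described above.
type raftState int

const (
	Follower raftState = iota
	Candidate
	Leader
)

// electionTimeout returns a randomized timeout, here 300ms plus up to 100ms
// of jitter, matching the values used later in this post. Randomizing it for
// every election keeps candidates from repeatedly splitting the vote.
func electionTimeout() time.Duration {
	return time.Duration(300+rand.Intn(100)) * time.Millisecond
}
```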
There are two conditions for granting a vote in an election:
- Condition 1: The requesting candidate's term must be greater than or equal to the voter's term, and the voter must not have already voted for another node (including itself) in this term.
- Condition 2: The candidate's log must be at least as up-to-date as the voter's log; this restriction is what guarantees log safety. How do we tell which log is more up-to-date? Compare the terms of the two logs' last entries: if the terms differ, the log whose last entry has the higher term is more up-to-date; if the terms are the same, the longer log is more up-to-date.
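A hedged sketch of these two conditions in code, continuing the package above. The types and field names follow Figure 2 of the Raft paper (currentTerm, votedFor, log, and the RequestVote arguments); this is an illustration, not the exact implementation in my repository.

```go
// LogEntry and the Raft fields below follow the paper's Figure 2.
type LogEntry struct {
	Index   int
	Term    int
	Command interface{}
}

type Raft struct {
	currentTerm int
	votedFor    int // -1 means "has not voted in this term"
	log         []LogEntry
	commitIndex int // highest log index known to be committed
	lastApplied int // highest log index applied to the state machine
}

type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateId  int
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

// isUpToDate implements condition 2: compare the terms of the last entries;
// if they differ, the higher term is more up-to-date, otherwise the longer
// log is more up-to-date. rf.log[0] is a dummy entry (see the notes near the
// end of this post), so the log is never empty.
func (rf *Raft) isUpToDate(args *RequestVoteArgs) bool {
	last := rf.log[len(rf.log)-1]
	if args.LastLogTerm != last.Term {
		return args.LastLogTerm > last.Term
	}
	return args.LastLogIndex >= last.Index
}

// canGrantVote combines condition 1 and condition 2.
func (rf *Raft) canGrantVote(args *RequestVoteArgs) bool {
	if args.Term < rf.currentTerm {
		return false
	}
	if rf.votedFor != -1 && rf.votedFor != args.CandidateId {
		return false
	}
	return rf.isUpToDate(args)
}
```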
3.2 Log Replication
Raft has a strong-leader design: log entries can only be replicated from the Leader to the other nodes. Each log entry (LogEntry) contains an index, a term, and a command: index is the entry's position in the log, term is the term in which it was created, and command is the operation to apply. The log format is as follows:
Sketch of Raft log format
The usual log replication flow looks like this:
- The client sends a request to the Leader.
- The Leader receives the client request and appends the command to its own log as a new LogEntry.
- On the next heartbeat timeout, the Leader sends the entry to the Follower nodes in an AppendEntries RPC. At this point the entry is still uncommitted.
- Once the entry has been appended successfully on a majority of nodes, the Leader commits it, applies it to its state machine, and returns the result of the write to the client.
- In subsequent AppendEntries RPCs the Leader notifies the other nodes that the entry is committed, and the Follower nodes apply it to their own state machines.
- If the Leader crashes, the remaining Follower nodes elect a new Leader after their election timeouts expire. If a Follower is down or slow, the Leader keeps retrying until it succeeds.
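The AppendEntries RPC that carries both heartbeats and log entries might look like the following sketch, with field names taken from Figure 2 of the Raft paper and reusing the LogEntry type from the earlier sketch. A heartbeat is simply a call with an empty Entries slice.

```go
// AppendEntries arguments and reply, following Figure 2 of the paper.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderId     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of the PrevLogIndex entry
	Entries      []LogEntry // entries to store; empty for a heartbeat
	LeaderCommit int        // leader's commitIndex, so followers can commit too
}

type AppendEntriesReply struct {
	Term    int  // the follower's currentTerm, so a stale leader can step down
	Success bool // true if the follower had an entry matching PrevLogIndex/PrevLogTerm
}
```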
Even if a network partition temporarily leaves the cluster with more than one Leader, there is no problem. Suppose a 5-node cluster is split into a 3-node partition and a 2-node partition. The 3-node partition still holds a majority, so it can commit logs; the smaller partition cannot. When the network heals, the eventual Leader must come from the larger partition, because its committed entries make its logs more up-to-date (guaranteed by condition 2 of the election vote), and those entries are then synchronized to the nodes of the previously isolated smaller partition.
A few points about log replication:
- If log entries on different servers have the same index and term, they store the same command, and all preceding entries in the two logs are identical.
- If a log entry is committed, then all entries before it in the log are also committed.
- The Leader never overwrites or deletes its own log entries. If another node's log is inconsistent with the current Leader's, that node's log must be brought into line: missing entries are appended and conflicting entries are deleted.
- Any entry committed by a Leader must appear in the logs of all future Leaders.
- To commit entries safely, the Leader must satisfy both of the following rules (see 4.3 for the unsafe case and 4.4 for the safe case), as sketched in the code after this list:
  - The log entry has been replicated to a majority of Follower nodes.
  - At least one new log entry from the Leader's current term has also been replicated to a majority of Follower nodes.
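A hedged sketch of this commit rule on the Leader, continuing the Raft struct sketched earlier; matchIndex (the highest index known to be replicated on each peer) is passed in here rather than assumed as a field. This illustrates the rule, it is not the exact code from my repository.

```go
// advanceCommitIndex moves commitIndex forward to the highest index N that is
// replicated on a majority of servers AND belongs to the leader's current
// term. Entries from older terms are never committed by counting replicas;
// they become committed indirectly once a current-term entry is committed.
func (rf *Raft) advanceCommitIndex(matchIndex []int, me, peerCount int) {
	for n := len(rf.log) - 1; n > rf.commitIndex; n-- {
		if rf.log[n].Term != rf.currentTerm {
			break // every earlier entry also has an older term
		}
		count := 1 // count the leader itself
		for peer, idx := range matchIndex {
			if peer != me && idx >= n {
				count++
			}
		}
		if count > peerCount/2 {
			rf.commitIndex = n
			return
		}
	}
}
```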
Timing and availability:
One of Raft's properties is that safety does not depend on timing: the system never produces incorrect results because events happen faster or slower than expected. Availability, however, inevitably depends on timing. For example, if message exchanges take longer than the typical time between server crashes, candidates cannot stay up long enough to win an election, and without a stable Leader, Raft cannot make progress. Leader election is the aspect of Raft where timing matters most. Raft can elect and maintain a stable Leader as long as the system satisfies the following timing requirement:
broadcastTime << electionTimeout << MTBF
Here broadcastTime is the average time it takes a server to send RPCs in parallel to every other server in the cluster and receive their responses, electionTimeout is the election timeout described above, and MTBF is the average time between failures for a single server. broadcastTime must be much smaller than electionTimeout so that the Leader can reliably send the heartbeats that keep Followers from starting elections; electionTimeout must be much smaller than MTBF so that the system makes steady progress. When the Leader crashes, the service is unavailable for roughly the election timeout.
broadcastTime is typically 0.5ms to 20ms depending on the storage technology, and electionTimeout is typically 10ms to 500ms (300+ms in my experiment). Server MTBFs are usually several months or more, so this requirement is easy to satisfy.
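To make the relationship concrete, here is a rough sketch of a Leader heartbeat loop using the numbers from this post; stillLeader and broadcastHeartbeat are hypothetical callbacks standing in for the real state check and the parallel AppendEntries broadcast.

```go
// The heartbeat interval (50ms) sits well below the 300ms+ election timeout,
// which in turn is far below a server's MTBF, so a healthy Leader keeps its
// Followers from ever timing out.
const heartbeatInterval = 50 * time.Millisecond

func heartbeatLoop(stillLeader func() bool, broadcastHeartbeat func()) {
	for stillLeader() {
		broadcastHeartbeat() // send AppendEntries (empty Entries) to all peers
		time.Sleep(heartbeatInterval)
	}
}
```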
4 Analysis of Raft log replication scenarios
4.1 Replication succeeds only if the preceding log entry matches
Replication status diagram 1
4.2 An entry from the Leader's current term replicated to a majority of nodes (safe)
Replication status diagram 2
As shown in the figure, S1-S3 have successfully replicated the fourth LogEntry from term 2, the Leader's current term. Any future Leader must therefore contain this entry: neither S4 nor S5 can win a re-election (condition 2 rules them out), so the fourth entry can be safely committed.
4.3 The Leader tries to commit logs from an older term (unsafe)
Replication status diagram 3
As shown in the figure above, it is not safe to commit the third LogEntry at this point: if S5 were elected Leader, it would overwrite the third entry on S1, S2, and S3.
4.4 The Leader commits logs safely
Replication status diagram 4
As shown in the figure above, log entry 4 from the Leader's current term 4 has now been replicated to a majority of nodes (S1-S3), so S5 can no longer be elected, and entries 3 and 4 are both safe. This illustrates the rule above: an entry from an older term may only be committed once at least one entry from the Leader's current term has been replicated to a majority of the Follower nodes.
4.5 Inconsistent Logs Caused by Leader Changes
The Leader changes, causing inconsistent logs
As shown in the preceding figure, Leader changes can leave the nodes' logs inconsistent. The new Leader handles this as follows (a sketch follows the flowchart below):
- The new Leader forces the Followers' logs to match its own: entries a Follower has that conflict with the Leader's log are deleted, and entries the Follower is missing are appended. In the processing flow chart below, (a) appends missing entries and (b) deletes conflicting extra entries before appending new ones.
- The Leader maintains a nextIndex for each Follower: the index of the next log entry it will send to that Follower.
- If a Follower rejects an AppendEntries RPC because its log does not match, the Leader decrements that Follower's nextIndex and retries.
Flowchart for handling inconsistent logs
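A hedged sketch of the nextIndex bookkeeping described above, continuing the earlier types; the syncWithFollower name and the sendAppendEntries callback are hypothetical, and real code also has to handle term changes, matchIndex updates, and locking.

```go
// syncWithFollower keeps retrying AppendEntries for one follower, backing
// nextIndex down by one on each mismatch, until the follower's log matches
// and the missing entries are appended.
func (rf *Raft) syncWithFollower(peer int, nextIndex []int,
	sendAppendEntries func(prevIndex int, entries []LogEntry) bool) {
	for nextIndex[peer] > 0 {
		prevIndex := nextIndex[peer] - 1
		entries := rf.log[nextIndex[peer]:] // everything the follower is missing
		if sendAppendEntries(prevIndex, entries) {
			nextIndex[peer] = len(rf.log) // follower is now up to date
			return
		}
		nextIndex[peer]-- // mismatch at prevIndex: back up and retry
	}
}
```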
5 Notes on implementing Raft
The data structures needed to implement Raft are given in full in the paper, as shown below:
Raft implements data structures and process points
- Leader heartbeats and log replication both use the AppendEntries RPC, which simplifies the code. The only difference is that for a heartbeat the Entries parameter is empty.
- Mind the two timeouts: the election timeout (ElectionTimeout) and the log replication (heartbeat) interval (HeartbeatTimeout). The replication interval must be much smaller than the election timeout, i.e. HeartbeatTimeout << ElectionTimeout. My code uses a random election timeout of (300 + rand(100))ms; note that the timeout must be re-randomized for every election, otherwise elections may keep failing. The replication interval is fixed at 50ms (the paper suggests within 20ms and the lab suggests about 100ms; testing showed that a 300+ms election timeout with a 50ms heartbeat interval passes the tests).
- Mind the locking. Data shared by multiple goroutines must be accessed with rf.mu.Lock() held; remember to release the lock, and defer rf.mu.Unlock() is a good way to do it. Also enable data race detection when testing: go test -race.
- Be careful when applying logs. In the applyLogs() function, apply every committed entry with index greater than lastApplied up to commitIndex, because more than one entry can become committed at a time; applying only one entry leads to errors (see the first sketch after this list).
- The first element of the log array rf.log is unused, mainly for compatibility with the paper, where log indices start from 1. Note that if the first element of the Go slice is nil, gob encoding and decoding will fail, so fill it with an empty LogEntry.
- Call rf.persist() whenever you modify a variable that must be persisted. The variables persisted on each node are currentTerm, voteFor, and log (see the second sketch after this list).
- To reduce the number of AppendEntries RPCs, see the explanation in reference 3.
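A hedged sketch of the applyLogs() point above, continuing the types from the earlier sketches; the simplified ApplyMsg here is an assumption, and the lab framework's real ApplyMsg has more fields.

```go
// A simplified apply message; the lab framework's ApplyMsg has more fields.
type ApplyMsg struct {
	Index   int
	Command interface{}
}

// applyLogs applies every newly committed entry, not just one: commitIndex
// can advance by several entries at once, so loop from lastApplied+1 up to
// commitIndex and send each entry to the service through applyCh.
func (rf *Raft) applyLogs(applyCh chan<- ApplyMsg) {
	for rf.lastApplied < rf.commitIndex {
		rf.lastApplied++
		applyCh <- ApplyMsg{
			Index:   rf.lastApplied,
			Command: rf.log[rf.lastApplied].Command,
		}
	}
}
```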
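And a hedged sketch of persist(): encode the persisted variables with gob, as the lab skeleton's comments suggest. The Persister type and its SaveRaftState method come from the lab framework, error handling is omitted, and the field is spelled votedFor here following the paper.

```go
import (
	"bytes"
	"encoding/gob"
)

// persist saves the state that must survive a crash: the current term, the
// vote for this term, and the log. Call it after every change to any of them.
func (rf *Raft) persist(persister *Persister) {
	w := new(bytes.Buffer)
	e := gob.NewEncoder(w)
	e.Encode(rf.currentTerm)
	e.Encode(rf.votedFor)
	e.Encode(rf.log)
	persister.SaveRaftState(w.Bytes())
}
```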
6 Reference Materials
- Course lab framework code: git://g.csail.mit.edu/6.824-golabs-2017
- Course lab home page: pdos.csail.mit.edu/6.824/labs/…
- The course's student guide: thesquareplanet.com/blog/studen…
- The Raft paper: pdos.csail.mit.edu/6.824/paper…
- A simple Raft animation, highly recommended: thesecretlivesofdata.com/raft/
- A JavaScript implementation of Raft: github.com/ongardie/ra…
- Reference code for the initial version V1: github.com/yuyang0/mit…
- The state-transition part of the revised version V2 references this: github.com/happyer/dis…
- Raft slides; the log replication content and figures come from this deck: "Raft: A Consensus Algorithm for Replicated Logs", Diego Ongaro and John Ousterhout, Stanford University