The Raft consensus algorithm is one of the most widely used consensus algorithms in distributed systems. In *In Search of an Understandable Consensus Algorithm*, the authors point out that the Paxos consensus algorithm is difficult to understand and, furthermore, difficult to apply to real systems. In view of these problems with Paxos, the authors set out to provide an easy-to-understand consensus algorithm. A dedicated section of the paper argues that Raft is a practical, safe, available, efficient and understandable consensus algorithm. This article describes the details of the Raft consensus algorithm; much of the content and many of the cited figures are taken from the original paper.
Summary of Raft
We discuss Raft in the following three parts:
- Leader election — a new Leader must be chosen when an existing Leader fails.
- Log replication — the leader accepts log entries from clients and replicates them across the cluster, forcing the other logs to agree with its own.
- Safety — The key Safety property for Raft.
At runtime, Raft can be divided into two parts: the leader election process, and normal operation under the elected leader, such as log replication.
A Raft cluster typically contains five servers, which allows the system to tolerate two failed servers. Each server is in one of three states: leader, follower or candidate. In normal operation there is exactly one leader and all other servers are followers. Followers are passive: they issue no requests of their own and simply respond to requests from the leader and from candidates. The leader handles all client requests (if a client contacts a follower, the follower forwards the request to the leader). The candidate state is used to elect a new leader. The state transitions are shown in the figure below:
Server nodes are required to store the following status information for leader election and log replication:
State | Persistent state on all servers |
---|---|
currentTerm | The latest term the server has seen (initialized to 0, increases monotonically) |
votedFor | The candidateId that received this server’s vote in the current term (or null if none) |
log[] | The log entries; each entry contains a command for the state machine and the term in which the entry was received by the leader |
State | Volatile state on all servers |
---|---|
commitIndex | Index of the highest log entry known to be committed (initialized to 0, increases monotonically) |
lastApplied | Index of the highest log entry applied to the state machine (initialized to 0, increases monotonically) |
State | Volatile state on leaders (reinitialized after election) |
---|---|
nextIndex[] | For each server, the index of the next log entry to send to that server (initialized to the leader’s last log index + 1) |
matchIndex[] | For each server, the index of the highest log entry known to be replicated on that server |
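As a rough illustration of the three tables above, the sketch below lays the same state out as a Go struct. The type and field names are illustrative, not taken from any particular Raft implementation, and indexes are kept 0-based for simplicity:

```go
package raft

// Role is the state a server is currently in.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// LogEntry holds one client command together with the term in which
// the leader received it.
type LogEntry struct {
	Term    int
	Command interface{}
}

// ServerState mirrors the three groups in the tables above: persistent
// state, volatile state on all servers, and volatile state kept only by
// the leader.
type ServerState struct {
	// Persistent on all servers (must survive restarts).
	CurrentTerm int        // latest term this server has seen
	VotedFor    int        // candidate voted for in the current term (-1 if none)
	Log         []LogEntry // log entries

	// Volatile on all servers.
	CommitIndex int // index of the highest entry known to be committed
	LastApplied int // index of the highest entry applied to the state machine

	// Volatile on leaders, reinitialized after every election.
	NextIndex  []int // per follower: index of the next entry to send
	MatchIndex []int // per follower: highest index known to be replicated

	Role Role
}
```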
Raft guarantees the following properties at all times:
- Election Safety: at most one leader can be elected in a given term;
- Leader Append-Only: a leader never overwrites or deletes entries in its own log; it only appends new entries. (Followers, by contrast, may roll back entries to match the leader.)
- Log Matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index;
- Leader Completeness: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders of all higher terms. (This is guaranteed by the way leaders are elected, as discussed later.)
- State Machine Safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.
Let’s discuss these sections in detail.
Leader Election
A node starts out in the follower state. If a follower receives no heartbeat from the leader within its election timeout, it switches to the candidate state. To avoid election conflicts, this timeout is a random value (typically 150 to 300 ms). When the timeout fires, the new candidate sends RequestVote RPC requests to the other nodes. Assuming a five-node cluster, if a majority of the nodes (at least three, counting the candidate’s own vote) respond with their votes, the candidate is elected leader and begins the next phase of work. If the candidate receives a heartbeat from an existing leader during the election, it changes back to the follower state.
During an election there may be several candidates at once. It is possible that no candidate obtains a majority of votes in the first round; in that case the randomized timeouts fire again and another round of voting takes place, and this repeats until a leader is elected or a heartbeat from a new leader is received, at which point the node returns to the follower state.
In the normal state, the leader continuously broadcasts heartbeat messages, and followers reset their election timeouts whenever they receive one. If the leader crashes or goes offline unexpectedly, the followers stop receiving heartbeats, their election timeouts expire, and the election process starts over.
A few details are worth adding. Each leader can be thought of as having its own term: every term begins with an election and ends when the leader fails or steps down for some other reason. Each follower may vote at most once per term. In the figure, the election in term T3 may fail because no candidate obtains more than half of the votes, in which case another round of election is needed; followers then vote again, for whichever candidate’s request arrives first (first come, first served).
Any request (RequestVote, AppendEntries, and so on) whose term is smaller than the receiving node’s current term is rejected. Conversely, if a candidate in the middle of an election receives an AppendEntries request whose term is at least as large as its own current term, it recognizes the sender as the legitimate leader and converts back to the follower state.
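The following sketch, building on the ServerState struct shown earlier, illustrates one way the randomized election timeout and the follower-side voting rules just described could look in code. The log up-to-date check from the Safety section is deliberately left out here and shown later; names and structure are illustrative only:

```go
package raft

import (
	"math/rand"
	"time"
)

// randomElectionTimeout picks a timeout in the commonly used
// 150-300 ms range so that candidates rarely collide.
func randomElectionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// RequestVoteArgs and RequestVoteReply follow the RPC definition given
// later in this article.
type RequestVoteArgs struct {
	Term         int
	CandidateId  int
	LastLogIndex int
	LastLogTerm  int
}

type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

// HandleRequestVote applies the rules from this section: reject stale
// terms, step down on newer terms, and grant at most one vote per term
// (first come, first served). The log up-to-date check from the Safety
// section is omitted here and shown later.
func (s *ServerState) HandleRequestVote(args RequestVoteArgs) RequestVoteReply {
	if args.Term < s.CurrentTerm {
		// Reject any request with a smaller term.
		return RequestVoteReply{Term: s.CurrentTerm, VoteGranted: false}
	}
	if args.Term > s.CurrentTerm {
		// A newer term always converts this node back to follower.
		s.CurrentTerm = args.Term
		s.VotedFor = -1
		s.Role = Follower
	}
	if s.VotedFor == -1 || s.VotedFor == args.CandidateId {
		s.VotedFor = args.CandidateId
		return RequestVoteReply{Term: s.CurrentTerm, VoteGranted: true}
	}
	return RequestVoteReply{Term: s.CurrentTerm, VoteGranted: false}
}
```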
Log Replication
After a leader is elected, the cluster enters its normal working phase, namely log replication. Log replication proceeds in two stages: recording the log entry and committing it.
The whole process is as follows:
- First, the client issues a command to the leader. (Each command can be considered an entry, or log entry.)
- After receiving the command from the client, the leader appends an entry for it to its local log. At this point the entry is uncommitted, so the leader does not yet update its own state.
- The leader then sends this entry (there may be one entry or several) to all followers in an AppendEntries message. Each follower appends the entries to its local log and replies to the leader with success or failure. (A follower does not unconditionally accept everything the leader sends; its local log may also be inconsistent with the leader’s log, which is explained in detail later. The normal case is described here first.)
- Once the leader receives confirmation from a majority of the followers, the entry changes from uncommitted to committed on the leader node. The leader then applies the entry, i.e. the client’s command, to its state machine and returns the result to the client.
- On the next heartbeat (which in practice is usually just another AppendEntries message carrying new entries and the updated commit information, as described in the field descriptions below), the leader notifies all followers that the entry is committed; the followers apply it when they receive the message, and the client’s command has then taken effect on every node.
In short, every time the client submits a command, the servers append an entry for it to their logs; once the leader confirms that a majority of nodes have stored the entry, it commits the entry and updates its state. If the process still seems confusing, have a look at the Raft animation demo, which visually demonstrates leader election and log replication.
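Building on the earlier ServerState sketch, here is a deliberately simplified, synchronous illustration of the leader-side flow above; a real implementation replicates asynchronously, retries failures, and batches entries. The Peer interface is a made-up stand-in for the RPC layer:

```go
package raft

// Peer abstracts the RPC transport to one follower (hypothetical
// interface, not from any real library).
type Peer interface {
	SendAppendEntries(entries []LogEntry, prevLogIndex, prevLogTerm, leaderCommit int) bool
}

// replicateEntry is a simplified, synchronous version of the flow above:
// append the client command to the leader's own log, send it to every
// follower, and commit once a majority of the cluster (leader included)
// has stored it. Followers learn the new commit index on the next
// AppendEntries message or heartbeat.
func (s *ServerState) replicateEntry(command interface{}, peers []Peer) bool {
	// 1. Append locally; the entry is still uncommitted.
	s.Log = append(s.Log, LogEntry{Term: s.CurrentTerm, Command: command})
	newIndex := len(s.Log) - 1

	prevLogIndex := newIndex - 1
	prevLogTerm := 0
	if prevLogIndex >= 0 {
		prevLogTerm = s.Log[prevLogIndex].Term
	}

	// 2. Replicate to every follower and count acknowledgments.
	acks := 1 // the leader counts itself
	for _, p := range peers {
		if p.SendAppendEntries(s.Log[newIndex:], prevLogIndex, prevLogTerm, s.CommitIndex) {
			acks++
		}
	}

	// 3. Commit once a majority of the cluster has the entry.
	if acks > (len(peers)+1)/2 {
		s.CommitIndex = newIndex
		return true
	}
	return false
}
```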
Safety
The previous sections described how Raft elects a leader and replicates log entries. However, the mechanisms described so far are not sufficient to ensure that every state machine executes the same commands in the same order. We need to consider two further questions:
- First: during leader election, followers receive a candidate’s vote request and respond if they agree, but what exactly are the rules? Could any follower be elected leader?
- Second: the leader can fail at any time. How does a new leader handle log entries from previous terms?
Election restriction
If the current leader fails, a new leader must be elected. At this point the followers may be in different states: some may be as up to date as the leader that just failed, while others may have logs whose last index is far behind the old leader’s, i.e. their state is relatively stale. From the system’s point of view, electing the most up-to-date candidate is optimal; from a correctness point of view, the Leader Completeness property must hold, i.e. if an entry was committed in some term, it must be present in the log of every leader of any higher term. Conversely, a candidate whose log is older than the committed state must never be elected leader. Concretely, the voting rules are: 1) a node votes only for a candidate whose log is at least as up to date as its own; 2) each node may vote only once per term and, subject to rule 1, votes on a first-come, first-served basis.
Let’s look at the definition of the RequestVote RPC (invoked by candidates to collect votes):
Parameter | Description |
---|---|
term | The candidate’s term number |
candidateId | The candidate requesting the vote |
lastLogIndex | Index of the candidate’s last log entry |
lastLogTerm | Term of the candidate’s last log entry |
Return value | Description |
---|---|
term | The receiver’s current term, so that the candidate can update its own term |
voteGranted | True means the candidate received this vote |
Receiver implementation:
- If term < currentTerm, return false.
- If votedFor is null or equal to candidateId, and the candidate’s log is at least as up to date as the receiver’s log, grant the vote.
You can see that the RequestVote request carries lastLogIndex and lastLogTerm so that log states can be compared. This does not guarantee that the most up-to-date candidate becomes leader, but it does guarantee that whichever node is elected leader already holds every committed entry; it does not guarantee that the leader holds every uncommitted entry.
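The “at least as up to date” comparison can be written down in a few lines. The sketch below (again on the ServerState type from the earlier illustration) compares the last entries’ terms first and their indexes second; it is illustrative only:

```go
package raft

// logIsAtLeastAsUpToDate implements the voter's comparison: the log
// whose last entry has the higher term is more up to date; if the terms
// are equal, the longer log wins. lastLogIndex and lastLogTerm come from
// the RequestVote arguments.
func (s *ServerState) logIsAtLeastAsUpToDate(lastLogIndex, lastLogTerm int) bool {
	myLastIndex := len(s.Log) - 1
	myLastTerm := 0
	if myLastIndex >= 0 {
		myLastTerm = s.Log[myLastIndex].Term
	}
	if lastLogTerm != myLastTerm {
		return lastLogTerm > myLastTerm
	}
	return lastLogIndex >= myLastIndex
}
```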
Committing entries from previous terms
The leader knows that an entry from its current term is committed as soon as it is stored on a majority of the servers. However, entries from previous terms that are not yet committed can later be overwritten by the leader of a subsequent term, even if they have already been stored on a majority of nodes. The figure below illustrates the situation:
The time sequence in the figure shows why a leader cannot decide to commit entries from an older term by counting replicas. In (a), S1 is the leader and partially replicates the log entry at index 2. In (b), S1 crashes; S5 wins the election for term 3 with the votes of S3, S4 and itself, then accepts a different entry from a client and places it at index 2. In (c), S5 crashes; S1 restarts, wins the election, and continues replication. At this point the entry from term 2 has been replicated on a majority of the machines in the cluster, but it is still not committed. If S1 crashes again, as in (d), S5 can be re-elected (with the votes of S2, S3 and S4) and overwrite the entries at index 2 with its own. If, on the other hand, S1 replicates an entry from its current term on a majority of the machines before crashing, as in (e), then that entry is committed (S5 can no longer win an election), and all preceding entries are committed along with it.
To rule out the situation described above, Raft never commits log entries from previous terms by counting replicas. Only entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, all earlier entries are committed indirectly because of the Log Matching property.
When the leader replicates entries from previous terms, Raft keeps the original term numbers on all of those entries.
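A sketch of how a leader might apply this commit rule, using the matchIndex values from the state tables: find the highest index replicated on a majority, but only advance commitIndex if that entry carries the leader’s current term. This is an illustration (it assumes MatchIndex holds one value per follower), not code from any specific implementation:

```go
package raft

import "sort"

// advanceCommitIndex sketches the rule above: find the highest index that
// is replicated on a majority of the cluster, but only move commitIndex
// there if that entry carries the leader's current term. Older entries
// are then committed indirectly by Log Matching.
func (s *ServerState) advanceCommitIndex(clusterSize int) {
	// The leader's own "matchIndex" is simply its last log index.
	matched := append([]int{len(s.Log) - 1}, s.MatchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(matched)))

	// After sorting in descending order, the entry at position
	// clusterSize/2 is replicated on at least a majority of servers.
	majorityIndex := matched[clusterSize/2]

	if majorityIndex > s.CommitIndex && majorityIndex >= 0 &&
		s.Log[majorityIndex].Term == s.CurrentTerm {
		s.CommitIndex = majorityIndex
	}
}
```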
Thinking through some specific situations in Raft
What happens if a follower’s log is inconsistent with the leader’s?
Suppose the leader sends an AppendEntries request of the form `<prevLogIndex=7, prevLogTerm=3, entries=[x<-4], leaderCommit=7>`. On receiving it, the follower looks at its own entry at `index=7` and checks whether its term matches `prevLogTerm=3`. If it matches, the follower appends `x<-4` after that entry and the two logs stay consistent. If the follower’s log is shorter than the leader expects (`index < prevLogIndex`), or the entry at index 7 was written in a different term (for example it holds a conflicting `x<-5`), the check on `prevLogIndex, prevLogTerm` fails and the follower rejects the request. The leader then decrements the index and retries until it finds an `index, term` pair on which the two logs agree, after which the follower’s conflicting entries are overwritten by the leader’s `x<-4` and everything that follows it.
Let’s look at the definition of the AppendEntries RPC (invoked by the leader to replicate log entries; also used as the heartbeat):
Parameter | Description |
---|---|
term | The leader’s term number |
leaderId | The leader’s id, so that followers can redirect clients |
prevLogIndex | Index of the log entry immediately preceding the new entries |
prevLogTerm | Term of the entry at prevLogIndex |
entries[] | Log entries to store (empty for a heartbeat; more than one may be sent at a time for efficiency) |
leaderCommit | The leader’s commitIndex |
Return value | Description |
---|---|
term | The receiver’s current term, used by the leader to update itself |
success | True if the follower contained an entry matching prevLogIndex and prevLogTerm |
Receiver implementation:
- If term < currentTerm, return false.
- Return false if the log does not contain an entry at prevLogIndex whose term matches prevLogTerm.
- If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all entries that follow it. (One of Raft’s principles for handling inconsistent followers is: follow the leader.)
- Append any new entries that are not already in the log.
- If leaderCommit > commitIndex, set commitIndex to the smaller of leaderCommit and the index of the last new entry.
To sum up, the core principle for handling inconsistencies is to follow the leader. When the leader sends an AppendEntries request to a follower, the follower performs the consistency check described above. If the check passes, the follower appends the entries and updates its state. If it fails, the follower rejects the request; the leader then decrements nextIndex for that follower and retries the replication request until a point of agreement is found, after which the follower’s conflicting entries are overwritten by the leader’s.
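Putting the receiver rules and the “follow the leader” principle together, a simplified follower-side AppendEntries handler might look like the sketch below (again building on the earlier ServerState sketch; it truncates unconditionally after prevLogIndex, which is a simplification of the conflict rule above, and the names are illustrative):

```go
package raft

// AppendEntriesArgs and AppendEntriesReply follow the RPC definition in
// the tables above.
type AppendEntriesArgs struct {
	Term         int
	LeaderId     int
	PrevLogIndex int
	PrevLogTerm  int
	Entries      []LogEntry
	LeaderCommit int
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}

// HandleAppendEntries sketches the follower side: reject stale terms,
// reject if the log has no matching entry at PrevLogIndex, otherwise
// follow the leader by replacing everything after PrevLogIndex with the
// leader's entries and advancing commitIndex.
func (s *ServerState) HandleAppendEntries(args AppendEntriesArgs) AppendEntriesReply {
	if args.Term < s.CurrentTerm {
		return AppendEntriesReply{Term: s.CurrentTerm, Success: false}
	}
	// A valid AppendEntries from the current or a newer term means
	// this node recognizes the sender as leader.
	s.CurrentTerm = args.Term
	s.Role = Follower

	// Consistency check: we must already hold the entry the new
	// entries are supposed to follow.
	if args.PrevLogIndex >= len(s.Log) ||
		(args.PrevLogIndex >= 0 && s.Log[args.PrevLogIndex].Term != args.PrevLogTerm) {
		return AppendEntriesReply{Term: s.CurrentTerm, Success: false}
	}

	// Follow the leader: drop any conflicting suffix and append the
	// leader's entries. (Truncating unconditionally is a simplification
	// of the paper's "delete only on conflict" rule.)
	s.Log = append(s.Log[:args.PrevLogIndex+1], args.Entries...)

	if args.LeaderCommit > s.CommitIndex {
		if last := len(s.Log) - 1; args.LeaderCommit < last {
			s.CommitIndex = args.LeaderCommit
		} else {
			s.CommitIndex = last
		}
	}
	return AppendEntriesReply{Term: s.CurrentTerm, Success: true}
}
```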
What happens when the leader crashes?
This situation has already been touched on above. First, a new leader is elected. After the election there may still be entries from the previous term that were never committed and remain in the uncommitted state. The new leader does not commit them directly by counting replicas: it only commits entries from its own term. If an old entry had already been replicated to a majority of nodes before the new leader was elected, then as soon as the new leader commits its first entry of the new term, the older uncommitted entries are committed along with it, because of the Log Matching property (if two logs contain an entry with the same index and term, the logs are identical up through that index). If, on the other hand, an old uncommitted entry was not stored on a majority of nodes before the election, it may be overwritten by the new leader’s entries.
How does Raft deal with a network partition?
Network partitions are essentially unavoidable in a distributed system. Suppose a partition occurs and the old leader ends up on one side of it. If a client then sends commands to the old leader, it will still try to replicate the corresponding log entries within its side of the partition, but it cannot gather confirmations from a majority of the nodes, so the client’s entries can only be recorded in its log and remain uncommitted. On the other side of the partition, the nodes stop receiving heartbeats, their election timeouts expire, and a new leader is elected. The new leader handles the requests arriving on its side of the partition and performs log replication as usual; because it can reach a majority of followers, client commands on that side can be committed and applied. When the partition heals, the two leaders receive each other’s heartbeats. The old leader sees a leader with a higher term, steps down to follower, rolls back the uncommitted operations performed on its side of the partition, and catches up with the new leader’s log.
Membership changes
In a distributed system, the number of nodes is not fixed: servers may be added or removed at any time. When nodes are added, two leaders could in principle be elected at the same time, mainly because the old and new configurations are not applied consistently. How should this be handled? The simplest approach is to stop all nodes, update the configuration, and restart everything, but that makes the service unavailable for a while, which is often unacceptable. Raft’s solution was originally the joint consensus approach; a simpler alternative is the single-server change approach. Let’s look at the problem in more detail.
Raft requires that there be at most one leader per term. The difficulty with membership changes is that during the change there may be two leaders. For example, the original cluster has three nodes, [1,2,3], and nodes 4 and 5 are now added. Suppose the network partitions during the change so that [1,2] form one group: node 1 can be elected leader with the votes of nodes 1 and 2 (a majority of the old configuration), while node 5 can be elected leader with the votes of nodes 3, 4 and 5 (a majority of the new configuration).
Because each node applies the new configuration at a different time, there can be a moment at which both an old-configuration majority and a new-configuration majority exist. In the figure above, a majority of the old configuration is two nodes, while a majority of the new configuration is three nodes; at the moment marked by the red line both majorities exist, so if a partition happens and elections are held at that moment, two leaders can emerge.
How can this be prevented? One answer is single-node changes, i.e. membership is changed one node at a time. The key observation is that when only one node is added or removed, a majority of the old configuration and a majority of the new configuration always overlap, so two leaders cannot coexist. For example, to grow a 3-node cluster [A,B,C] into a 5-node cluster, first change it to the 4-node cluster [A,B,C,D], and then change that to the 5-node cluster [A,B,C,D,E].
Why can a single-node change never produce two disjoint majorities? Suppose the original cluster has 2n+1 nodes, so a majority of the old configuration is n+1 nodes. Adding one node gives a new configuration of 2n+2 nodes, whose majority is n+2 nodes. Two disjoint majorities would together need more nodes than the new cluster contains, as the short calculation below shows, so any old majority and any new majority must share at least one node.
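Written out as a single inequality, with n as above:

```latex
\underbrace{(n+1)}_{\text{old majority}} \;+\; \underbrace{(n+2)}_{\text{new majority}}
  \;=\; 2n+3 \;>\; 2n+2 \;=\; \text{size of the new cluster}
```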
Concretely, take the same example of growing a 3-node cluster into a 5-node cluster. Assume a 3-node cluster [A,B,C] in which node A is the leader and the configuration is [A,B,C]. We first add node D, giving the new configuration [A,B,C,D]. The membership change is carried out in two steps:
- In the first step, leader node A synchronizes data to the new node D.
- In step 2, the leader replicates the new configuration [A,B,C,D] as a log entry to all nodes in the new configuration (A, B, C and D) and then applies the new entry to its local state machine, completing the single-node change.
After the change, the cluster configuration is [A,B,C,D]. The same procedure is then used to add node E. In Raft a configuration change is itself carried as a log entry: the process is the same as for a normal log entry, except that applying it to the state machine results in a configuration change rather than an ordinary command.
Log compression
Log compression addresses the conflict between a log that grows without bound and storage space that is limited. Consider a question: do committed log entries have to be kept forever? If not, can some of them be deleted or compacted? Raft’s main answer is to compact the log using snapshots.
As shown in the figure above, the log entries up to log index 5 can be discarded and replaced by a single snapshot (containing the current state plus some metadata, such as the index and term of the last entry the snapshot covers).
In concrete implementations, each node creates snapshots independently, typically triggered once the log exceeds a certain size. The specific implementation and further details will be discussed later.
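As an illustration of what a snapshot might carry, the sketch below keeps the last included index and term alongside the serialized state machine, and drops the covered log prefix once the log passes a threshold. It builds on the earlier ServerState sketch, and the field and function names are hypothetical:

```go
package raft

// Snapshot sketches what a node keeps after compacting its log: the
// serialized state machine plus just enough metadata (the index and term
// of the last entry the snapshot covers) to keep later consistency
// checks working.
type Snapshot struct {
	LastIncludedIndex int    // index of the last log entry replaced by the snapshot
	LastIncludedTerm  int    // term of that entry
	StateMachineData  []byte // serialized application state
}

// maybeSnapshot shows a common trigger: once the log grows past a
// threshold, snapshot the applied prefix and drop it from the log.
func (s *ServerState) maybeSnapshot(threshold int, serialize func() []byte) *Snapshot {
	if len(s.Log) <= threshold || s.LastApplied < 0 {
		return nil
	}
	snap := &Snapshot{
		LastIncludedIndex: s.LastApplied,
		LastIncludedTerm:  s.Log[s.LastApplied].Term,
		StateMachineData:  serialize(),
	}
	// Keep only the entries after the snapshot point. (A real
	// implementation also needs to re-base log indexes or keep an
	// explicit offset.)
	s.Log = append([]LogEntry(nil), s.Log[s.LastApplied+1:]...)
	return snap
}
```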
Client Protocol
The Raft consensus algorithm works together with a client protocol, and the two solve a number of problems jointly. For example: how does a client interact with the cluster? If the client knows which node is the leader, it can send commands to the leader directly; if it does not, it can send the command to any known node in the cluster, and that node forwards the request to the leader. What if the client sends a request (a command) to the leader but receives no response, for instance because the leader has crashed? Retrying is one solution, but if the client times out and resends the command, how do we ensure that the state machine does not execute it twice? The answer is for the client to attach a unique ID to each command; if a command with the same ID already appears in the server’s log, the duplicate is ignored.
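A small sketch of the deduplication idea: the client tags each command with a (client, request) ID, and the state machine remembers which IDs it has already applied. This is only an illustration of the idea, not part of the Raft paper’s client protocol specification, and the names are made up:

```go
package raft

// ClientRequest carries a client-chosen unique ID together with the
// command, so the state machine can detect retries.
type ClientRequest struct {
	ClientID  int64
	RequestID int64 // unique per client, e.g. a monotonically increasing counter
	Command   interface{}
}

type appliedKey struct{ client, request int64 }

// dedupingApply applies a committed command at most once: if the same
// (client, request) pair has been seen before, the cached result is
// returned instead of executing the command again.
func dedupingApply(applied map[appliedKey]interface{}, req ClientRequest,
	apply func(interface{}) interface{}) interface{} {

	key := appliedKey{req.ClientID, req.RequestID}
	if result, ok := applied[key]; ok {
		return result // duplicate: do not apply a second time
	}
	result := apply(req.Command)
	applied[key] = result
	return result
}
```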
Appendix
Attached here is a screenshot from the paper, which summarizes in detail what information each kind of node must maintain, how each message is defined, and how messages should be processed (log compaction and membership changes are not included):
As a side note, the Raft consensus algorithm solves a different problem from PBFT: Raft assumes there are no malicious nodes. Messages may be delayed or lost, but nodes do not forge messages or act maliciously; in other words, there are no Byzantine nodes.
This article has walked through the main components of the Raft consensus algorithm. Some details may still be unclear, and a real engineering implementation involves many more of them; a proof of why the Raft algorithm is correct is also missing here. The article will be revised from time to time as my understanding of the consensus algorithm deepens.