This article is adapted from Ongaro’s YouTube talk.
The goal
Raft’s goal (or the goal of a distributed consensus algorithm) is to ensure that logs are replicated exactly the same across multiple servers.
As long as the logs on each server are the same, the state machines on different servers executing the same commands from the logs in the same order will produce the same results.
The consensus algorithm’s job is to manage these logs.
The system model
We assume:
- Servers may crash, stop running for a while, and then come back up, but they are non-Byzantine (i.e., they do not act maliciously or tamper with data);
- Network communication may be interrupted; messages may be lost, delayed, or reordered; network partitions may occur.
Raft is a Leader-based consensus algorithm, so it mainly considers two cases:
- normal operation with a Leader;
- electing a new Leader when the Leader fails.
Advantage: there is only one Leader, which keeps the protocol simple.
Difficulty: when the Leader changes, the system may be left in an inconsistent state, so the next Leader must clean it up.
We will explain Raft in 6 sections:
- Leader election;
- Normal operation: log replication (the easiest part);
- Safety and consistency when the Leader changes (the trickiest, most critical part);
- Dealing with the old Leader: what if the old Leader has not actually gone offline?
- Client interaction: implementing linearizable semantics;
- Configuration changes: how to add or remove nodes from a cluster.
Before we start
Before diving in, there is some Raft terminology you need to know.
Server states
The server can only be in one of the following three states at any time:
- Leader: handles all client requests and log replication. There can be at most one valid Leader at a time;
- Follower: completely passive (it sends no RPCs and only responds to RPCs it receives); most servers are in this state most of the time;
- Candidate: a temporary state between Follower and Leader, used to elect a new Leader.
When the system is running normally, there is exactly one Leader and all the other servers are Followers.
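As a minimal Go sketch (the type and method here are illustrative, not part of any fixed API), the three states could be modeled as:

```go
package raft

// ServerState is a minimal sketch of the three Raft server states
// described above; the state names follow the paper, the Go type
// itself is just an illustration.
type ServerState int

const (
	Follower  ServerState = iota // passive: only responds to RPCs
	Candidate                    // campaigning for leadership
	Leader                       // handles client requests and replication
)

func (s ServerState) String() string {
	switch s {
	case Follower:
		return "Follower"
	case Candidate:
		return "Candidate"
	case Leader:
		return "Leader"
	}
	return "Unknown"
}
```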
State transition diagram:
Terms
Time is divided into terms, and each term is denoted by a number that increases monotonically and never repeats.
A normal term has exactly one Leader and is usually divided into two parts:
- the election at the beginning of the term;
- the period of normal operation.
Some terms may end without a Leader being elected (as in Term 3 in the figure), in which case the next term begins immediately and another election is attempted.
Each node maintains a `currentTerm` variable representing the latest term it knows of. `currentTerm` must be stored persistently so that it can be restored after the server crashes and restarts.
**Terms are very important! Terms help Raft identify outdated information.** For example, if a node with `currentTerm` = 2 communicates with a node with `currentTerm` = 3, we know that the first node’s information is out of date.
We only use information from the latest term. Later we will encounter various situations in which information from stale terms must be detected and discarded.
Two RPCs
All communication between Raft servers happens through just two RPCs:
- `RequestVote`: used during elections;
- `AppendEntries`: used to replicate log records and send heartbeats.
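For concreteness, here is a rough Go sketch of the two RPC argument types, based on the field lists in the Raft paper; the exact Go representation is illustrative:

```go
package raft

// RequestVoteArgs is sent by a Candidate during an election.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int // candidate requesting the vote
	LastLogIndex int // index of the candidate's last log record
	LastLogTerm  int // term of the candidate's last log record
}

// Entry is one log record; see the log structure in section 2.
type Entry struct {
	Term    int
	Command []byte
}

// AppendEntriesArgs is sent by the Leader to replicate log records;
// with no Entries it serves as a heartbeat.
type AppendEntriesArgs struct {
	Term         int     // leader's term
	LeaderID     int     // lets Followers redirect clients
	PrevLogIndex int     // index of the record preceding the new ones
	PrevLogTerm  int     // term of that preceding record
	Entries      []Entry // records to store (empty for heartbeats)
	LeaderCommit int     // leader's commit index
}
```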
1. Leader election
Startup
- When nodes start up, they are in the Follower state;
- A Follower passively accepts RPCs from the Leader or from Candidates;
- So if the Leader wants to maintain its authority, it must send heartbeats (empty `AppendEntries` RPCs) to the other nodes in the cluster;
- If a Follower receives no RPC within the election timeout (`electionTimeout`, generally 100~500 ms):
  - the Follower concludes that the cluster has no Leader;
  - it starts a new round of elections.
The election
When a node starts campaigning:
- It increments its own `currentTerm`;
- It becomes a Candidate, whose goal is to obtain votes from more than half of the nodes and become the Leader;
- It votes for itself first;
- It sends `RequestVote` RPCs in parallel to the other nodes in the cluster, retrying any node that does not respond, until one of three things happens:
  - It receives more than half of the votes: it becomes the Leader and sends `AppendEntries` heartbeats to the other nodes;
  - It receives an RPC from a valid Leader: it converts back to Follower;
  - Neither of the above happens and nobody wins (`electionTimeout` elapses): it increments `currentTerm` and starts a new election.
The flow chart is as follows:
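In code form, the loop above might look like this rough Go sketch; the channels and helper methods are hypothetical stand-ins, and a real implementation would also handle locking, persistence, and the actual RPC plumbing:

```go
package raft

import "time"

type state int

const (
	follower state = iota
	candidate
	leader
)

type node struct {
	id, currentTerm, votedFor int
	state                     state
	majorityGranted           chan struct{} // signaled when > half of the votes arrive
	leaderDiscovered          chan struct{} // signaled on an RPC from a valid Leader
	electionTimer             *time.Timer
}

func (n *node) startElection() {
	n.currentTerm++           // 1. increment own currentTerm
	n.state = candidate       // 2. become a Candidate
	n.votedFor = n.id         // 3. vote for itself first
	n.requestVotesFromPeers() // 4. RequestVote RPCs in parallel, with retries

	select {
	case <-n.majorityGranted: // more than half of the votes: win
		n.state = leader
		n.sendHeartbeats() // empty AppendEntries to assert authority
	case <-n.leaderDiscovered: // a valid Leader already exists
		n.state = follower
	case <-n.electionTimer.C: // electionTimeout elapsed with no winner
		n.startElection() // increment the term and run a new election
	}
}

func (n *node) requestVotesFromPeers() {} // illustrative stub
func (n *node) sendHeartbeats()        {} // illustrative stub
```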
Election security
There are two properties that the election process must provide: safety and liveness.
Safety: at most one Leader is elected per term. This requires that:
- Each node may vote only once per term. It votes for the first candidate whose request meets the criteria and rejects all later candidates. The vote (`votedFor`) must be stored persistently so that it can be recovered after a crash and restart; otherwise a restarted node that lost `votedFor` could vote for a different candidate;
- A candidate becomes Leader only with more than half of the votes, so two different candidates cannot both win a majority in the same term.
Liveness: ensure that some Leader is eventually elected.
The problem is that votes could in principle split forever: if elections start at the same time, time out at the same time, and restart at the same time, no one ever wins.
The solution is simple:
- Each node randomly selects its timeout, usually in [T, 2T] (where T = `electionTimeout`);
- This makes it unlikely that nodes start campaigning at the same time, and the node whose timeout fires first has enough time to request and collect the other nodes’ votes before anyone else times out;
- This works best when T >> broadcast time (T is much larger than the time needed to send an RPC to every server).
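A minimal Go sketch of the randomized timeout; the 150 ms base value is an assumption (the text only says 100~500 ms, and the rule only requires T >> broadcast time):

```go
package raft

import (
	"math/rand"
	"time"
)

// baseElectionTimeout is T; the value chosen here is an assumption.
const baseElectionTimeout = 150 * time.Millisecond

// randomElectionTimeout picks a fresh value in [T, 2T) before every
// election round, so nodes rarely time out simultaneously.
func randomElectionTimeout() time.Duration {
	// rand.Int63n(n) returns a value in [0, n), so the sum is in [T, 2T).
	return baseElectionTimeout + time.Duration(rand.Int63n(int64(baseElectionTimeout)))
}
```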
2. Log replication
Log structure
Each node stores its own copy of the log (`log[]`), and each log record contains:
- Index: the position of the record in the log;
- Term number: the term in which the record was first created;
- Command: the command for the state machine.
**Logs must be stored persistently.** A node must safely write records to disk before it can return responses to other nodes in the system.
If a log record is stored on more than half of the nodes, we consider it committed: this is a very important property in Raft! If a record is committed, the state machine can safely execute it.
In the figure above, records 1-7 are committed and record 8 has not yet been committed.
Note: “replicated on a majority” is how commitment is usually described, but this definition is imprecise, and we will refine it later.
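A minimal Go sketch of the per-node log described above; storing the index implicitly as the slice position is one common representation choice, and the names here are illustrative:

```go
package raft

// LogEntry is one log record; the index is implicit in the slice position.
type LogEntry struct {
	Term    int         // term in which the record was first created
	Command interface{} // command for the state machine
}

type raftLog struct {
	entries     []LogEntry // must be written to disk before acknowledging
	commitIndex int        // highest index known to be stored on a majority
}

// committed reports whether the record at index can safely be applied
// to the state machine.
func (l *raftLog) committed(index int) bool {
	return index <= l.commitIndex
}
```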
Normal operation
- The client sends a command to the Leader and expects it to be executed by all the state machines;
- The Leader first appends the command to its own log;
- The Leader sends `AppendEntries` RPCs to the other nodes in parallel and waits for responses;
- Once more than half of the nodes have responded, the new log record is considered committed:
  - The Leader applies the command to its own state machine and returns the response to the client;
  - Also, once the Leader knows a record is committed, it notifies the Followers in subsequent `AppendEntries` RPCs;
  - Each Follower applies committed commands to its own state machine;
- If a Follower crashes or times out, the Leader retries the RPC until it succeeds;
- Performance optimization: the Leader does not have to wait for every Follower to respond, only for more than half to succeed (which guarantees the record is stored on a majority), so one slow node does not slow the system down, because the Leader never waits for it.
Log Consistency
Raft tries to maintain a high level of log consistency across the cluster.
The index and term of a Raft log record together uniquely identify it. (This is very important!)
- If two nodes’ logs have a record with the same term at the same index, the records contain the same command, and the two logs are identical from the beginning up to that index;
- If a given record is committed, then all the records before it are committed as well.
AppendEntries consistency check
Raft enforces these two properties through a consistency check in the `AppendEntries` RPC:
- Each `AppendEntries` RPC includes the index and term of the record immediately preceding the new records (`prevLogIndex` and `prevLogTerm`);
- The Follower checks whether its own log matches `prevLogIndex` and `prevLogTerm`: if it matches, it accepts the records; otherwise it rejects the RPC. A sketch of this check follows below.
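A rough Go sketch of that Follower-side check; the 1-based indexing convention (with slot 0 unused) and the function name are assumptions of this sketch:

```go
package raft

// consistencyCheckOK reports whether the Follower's log passes the
// AppendEntries consistency check: the log must contain a record at
// prevLogIndex whose term equals prevLogTerm. logTerms[j] holds the
// term of the record at index j; index 0 is an unused sentinel.
func consistencyCheckOK(logTerms []int, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true // appending from the very beginning of the log
	}
	if prevLogIndex >= len(logTerms) {
		return false // our log is too short: records are missing
	}
	return logTerms[prevLogIndex] == prevLogTerm
}
```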
3. Leader changes
When a new Leader takes over, the logs may not be clean, because the previous Leader may have crashed before finishing log replication. Raft’s approach is that no special handling is required at takeover.
The new Leader does not perform any cleanup immediately upon taking office; cleanup happens during normal operation.
The reason is that a new Leader usually takes office because some machines have failed; those machines may be down or unreachable over the network, so their logs cannot be cleaned up immediately, and the system has to keep running until they come back.
**The key premise is that Raft assumes the Leader’s log is always correct.** So the Leader’s job, over time, is to make all the Followers’ logs eventually match its own.
But the Leader may itself fail before finishing this job, so inconsistencies can pile up over time, leaving something that looks quite messy, as shown below:
Since we already know that (index, term) uniquely identifies a log record, the commands in the log are no longer shown in the figures, here and below.
As shown, this could happen if S4 and S5 were Leaders for terms 2, 3, and 4 but somehow crashed before replicating their records; the system was then partitioned for a while, and S1, S2, and S3 took turns being Leader for terms 5, 6, and 7, with no way to reach S4 and S5 to clean up their logs. The result is the very messy picture we see.
The only thing that matters is that the records at indexes 1 through 3 are committed (they are already stored on a majority), so we must make sure they are kept.
The other records are uncommitted: we have not passed those commands to the state machines, and no client has received results for them, so it does not matter whether they are kept or discarded.
Safety
Once a state machine executes a log command, we must ensure that no other state machine ever executes a different command at the same index.
Raft’s safety property: if a log record is committed in a given term, then it must appear in the logs of the Leaders of all later terms.
This is guaranteed by the following requirements:
- the Leader never overwrites records in its log;
- only records in the Leader’s log can be committed;
- records must be committed before they can be applied to a state machine.
This forces us to revise the election process:
- if a node’s log is missing the right content, prevent that node from becoming Leader;
- modify the definition of committed slightly (the refinement promised earlier): a record stored on a majority is normally committed, but in some cases we must delay commitment until we know it is safe, where “safe” means that every subsequent Leader is guaranteed to have that record.
Delayed commit and choosing the best Leader
The question arises: how do we ensure that the elected Leader is one that holds all of the committed records?
This is tricky. As an example: suppose we want to elect a new Leader in the cluster below, but the third server is unavailable.
In this case, looking only at the logs of the first two nodes, we cannot tell whether the record at index 5 reached a majority, so we cannot tell whether it is committed.
So what to do?
During an election, candidates’ logs are compared to select the candidate most likely to contain all the committed records:
- The Candidate includes its log information in its `RequestVote` RPCs: the index and term of its last record, denoted `lastIndex` and `lastTerm`;
- A server V that receives the vote request rejects it if V’s own log is more complete: `(lastTermV > lastTermC) || ((lastTermV == lastTermC) && (lastIndexV > lastIndexC))`, i.e., V’s last term is newer than C’s, or the last terms are equal but V’s log is longer;
- Whoever wins the election is therefore guaranteed that, among itself and the majority of nodes that voted for it, it has the most complete log, where “most complete” means the largest (term, index) identifier.
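The comparison itself is tiny; a Go sketch of the rule above (function and parameter names are illustrative):

```go
package raft

// voterLogMoreComplete reports whether voter V's log is more complete
// than Candidate C's, in which case V rejects the vote request.
func voterLogMoreComplete(lastTermV, lastIndexV, lastTermC, lastIndexC int) bool {
	if lastTermV != lastTermC {
		return lastTermV > lastTermC // a newer last term wins
	}
	return lastIndexV > lastIndexC // same last term: the longer log wins
}
```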
Examples
Case 1: The Leader decides to commit the log
Leader S1, in term 2, has just replicated the record at index = 4 to S3. The Leader can see that index = 4 is now stored on more than half of the servers, so the record can be committed and safely applied to the state machine.
Now this record is safe: the Leader of the next term must include it, so neither S4 nor S5 can collect votes from the other nodes (S5’s last term is too old, and S4’s log is too short).
Only one of the first three servers can become the new Leader: S1, of course, but S2 and S3 can also become Leader by winning the votes of S4 and S5.
Case 2: The Leader tries to commit a record from a previous term
As shown in the figure, in term 2 the record at index = 3 was written only to S1 and S2. For some reason, S5, the Leader of term 3, never learned of this record; it created three records of its own and then crashed. S1 was then elected Leader for term 4 and tries to bring the other servers’ logs in line with its own, so it replicates the term-2 record to S3.
At this point the record at index = 3 is on a majority, but it is still not safe to commit.
The reason is that S1 might crash right now, and S5 could then win votes from S2, S3, and S4 and become the Leader for term 5. Once S5 becomes the new Leader, it would overwrite the records at indexes 3-5, and the records S1-S3 hold there would disappear.
We also need a new rule to deal with this situation.
New commit rules
The new election rule alone is not enough to guarantee log safety; we also need to change the commit rule.
The Leader commits a record only when:
- the record is stored on more than half of the nodes;
- the Leader has also seen at least one record from its own term stored on more than half of the nodes.
As shown, going back to Case 2 above: when the record at index = 3 & term = 2 is replicated to S3, the Leader cannot commit it yet; it must wait until a record with term = 4 is also stored on more than half of the nodes, at which point both index = 3 and index = 4 are considered committed.
Now S5 cannot win an election: it cannot get votes from S1-S3.
Combining the new election rule with the new commit rule, we can guarantee Raft’s safety.
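A Go sketch of the combined rule, assuming the Leader tracks a matchIndex for every server (a common bookkeeping choice; all names here are assumptions):

```go
package raft

import "sort"

// advanceCommitIndex applies the revised commit rule. matchIndex[i] is
// the highest log index known to be stored on server i (including the
// Leader itself); logTerms[j] is the term of the record at index j
// (1-based, index 0 unused).
func advanceCommitIndex(matchIndex, logTerms []int, currentTerm, commitIndex int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Ints(sorted)
	// The value at this position is stored on more than half of the servers.
	majorityIndex := sorted[(len(sorted)-1)/2]
	// Rule 2: only commit if that record is from the Leader's own term;
	// earlier-term records then become committed indirectly (see Case 2).
	if majorityIndex > commitIndex && logTerms[majorityIndex] == currentTerm {
		return majorityIndex
	}
	return commitIndex
}
```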
Log inconsistency
Leader changes can lead to log inconsistencies, as shown here.
As you can see from the figure, there are typically two types of inconsistent logs in Raft clusters:
- Missing Entries;
- Extraneous Entries;
All we need to do is clean up both types of logs.
Repairing the Follower Log
The new Leader must make the Followers’ logs match its own by:
- deleting Extraneous Entries;
- filling in Missing Entries.
The Leader maintains a `nextIndex` for each Follower:
- the index of the next log record to send to that Follower;
- initialized to (the index of the Leader’s last log record) + 1.
The Leader repairs a Follower’s log using `nextIndex`: whenever the `AppendEntries` consistency check fails, the Leader decrements that Follower’s `nextIndex` and retries. As shown below:
For Follower (a):
- Initially `nextIndex` = 11, so the RPC carries prevLogIndex = 10 & prevLogTerm = 6; the check fails;
- `nextIndex` = 10, prevLogIndex = 9 & prevLogTerm = 6; the check fails;
- And so on, until `nextIndex` = 5: the RPC carries prevLogIndex = 4 & prevLogTerm = 4, which matches, and the Leader’s records from index 5 onward fill in Follower (a)’s missing entries.
For Follower (b): the checks keep failing until `nextIndex` = 4. Note that in (b)’s case, when the Follower accepts the matching records, it overwrites its inconsistent record and deletes all the records after it (any later, now-irrelevant records are discarded as well). As shown below:
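A rough Go sketch of the repair loop; the `nextIndex` bookkeeping and the RPC helper are hypothetical, and real implementations also batch records and can back up more than one record at a time:

```go
package raft

// leaderState holds the Leader's per-Follower repair bookkeeping.
type leaderState struct {
	nextIndex map[int]int // per-Follower: next log index to send
}

// sendAppendEntries stands in for the real RPC; it reports whether the
// Follower's consistency check on (prevLogIndex, prevLogTerm) passed.
func (ls *leaderState) sendAppendEntries(peer, prevLogIndex int) bool {
	return true // stub for illustration
}

// repairFollower walks nextIndex backwards until the consistency check
// passes, at which point the Follower appends (and truncates) from there.
func (ls *leaderState) repairFollower(peer int) {
	for ls.nextIndex[peer] > 1 {
		prev := ls.nextIndex[peer] - 1
		if ls.sendAppendEntries(peer, prev) {
			return // match found: logs now agree up through prev
		}
		ls.nextIndex[peer]-- // check failed: step back one record and retry
	}
}
```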
4. Dealing with the old Leader
In practice, the old Leader may not disappear immediately. For example, a network partition may separate the Leader from the rest of the cluster, and the remainder elects a new Leader. The problem is that if the old Leader reconnects without knowing that a new Leader has been elected, it will try to continue committing records as Leader, and if a client sends it a request, it will try to store the command and replicate it to the other nodes. We must prevent this from happening.
Terms are used to detect stale Leaders (and Candidates):
- every RPC contains the sender’s term;
- if the sender’s term is too old, the RPC is rejected, no matter which RPC it is; the sender then reverts to Follower and updates its term;
- if the receiver’s term is too old, the receiver reverts to Follower, updates its term, and then processes the RPC as normal.
Since electing the new Leader updated the term on more than half of the servers, the old Leader cannot commit any new record: to do so it would have to contact at least one node in that majority, where it would discover that its own term is stale, and it would then revert to Follower and keep working in that role.
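A Go sketch of these rules as they would appear at the top of every RPC handler; the struct and field names are assumed for illustration:

```go
package raft

type peer struct {
	currentTerm int
	votedFor    int
	isLeader    bool
}

// checkTerm compares the sender's term with ours and reports whether
// the RPC must be rejected.
func (p *peer) checkTerm(senderTerm int) (reject bool) {
	if senderTerm < p.currentTerm {
		return true // sender is stale: reject the RPC, reply with our term
	}
	if senderTerm > p.currentTerm {
		p.currentTerm = senderTerm // we are stale: adopt the newer term,
		p.isLeader = false         // revert to Follower,
		p.votedFor = -1            // clear our vote for the new term,
		// ...then process the RPC as normal.
	}
	return false
}
```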
We are not going to discuss other extreme cases here.
5. Client protocol
The client sends commands only to the Leader:
- if the client does not know who the Leader is, it contacts an arbitrary server;
- if that server is not the Leader, it tells the client who the Leader is.
The Leader does not respond to the client until the command has been logged, committed, and executed by the Leader’s state machine.
The problem arises when a Leader crash causes a request to time out:
- the client reissues the command to another server and is eventually redirected to the new Leader;
- it retries the request with the new Leader until the command is executed.
This leaves the risk that a command may be executed twice: the Leader may crash after executing the command but before responding to the client, in which case the client retries the command with the next Leader, causing it to be executed again. This is unacceptable!
The solution is for the client to attach a unique ID to every command it sends to the Leader:
- the Leader writes the unique ID into the log record;
- before accepting a command, the Leader checks whether that ID already appears in its log;
- if it does, the request is a duplicate: the new command is ignored and the response from the earlier execution is returned.
As long as every command is executed exactly once, we have the key element of linearizability.
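A Go sketch of the dedup idea; keeping the ID-to-response table in a map (rebuilt from the log on restart) is an assumption of this sketch, and all names are illustrative:

```go
package raft

// clientCommand carries the client's unique request ID alongside the
// actual state-machine command.
type clientCommand struct {
	ID      string
	Command interface{}
}

// session records the response of every executed request by ID.
type session struct {
	responses map[string]interface{}
}

func newSession() *session {
	return &session{responses: make(map[string]interface{})}
}

// apply executes the command at most once: duplicates get the response
// recorded for the earlier execution instead of re-running.
func (s *session) apply(cmd clientCommand, execute func(interface{}) interface{}) interface{} {
	if resp, done := s.responses[cmd.ID]; done {
		return resp // duplicate request: return the old response
	}
	resp := execute(cmd.Command)
	s.responses[cmd.ID] = resp
	return resp
}
```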
6. Configuration changes
Over time machines will fail and need to be replaced, and we may need to change the number of nodes, so there must be a mechanism to change the system configuration safely and automatically, without stopping the system.
The system configuration refers to:
- the ID and address of each server;
- the configuration determines what counts as a majority, so it is critically important.
The first thing to realize is that we cannot simply switch from the old configuration to the new one all at once, because that could create conflicting majorities.
As shown, the system is running with a three-server configuration when we need to add two servers. If we switch the configuration directly, the nodes cannot all switch at exactly the same moment, so at some point S1 and S2 may form a majority under the old configuration while S3-S5 simultaneously form a majority under the new one, effectively producing two clusters.
This means we have to use a two-phase protocol.
If someone tells you that they can make a decision in a single phase of a distributed system, you should ask them very seriously, because they are either wrong or have discovered something that the rest of the world does not know.
Joint Consensus
Raft implements the two-phase protocol through Joint Consensus: during the transition, decisions require majority votes under both the old and the new configurations.
Stage 1:
- The Leader receives a configuration-change request and writes a C_old+new configuration record into its log. The new configuration takes effect immediately, and the record is replicated to the Followers via `AppendEntries` RPCs; any node that receives the record immediately applies C_old+new as its current configuration.
- When the C_old+new record has been replicated to a majority, it is committed. A committed C_old+new record guarantees that any subsequent Leader must have it, and while C_old+new is in effect, Leader election requires a majority of the old configuration and a majority of the new configuration.
Stage 2:
- Once C_old+new is committed, the Leader immediately writes a C_new configuration record into its log and replicates it to the Followers via `AppendEntries` RPCs; any node that receives the record immediately applies C_new as its current configuration.
- C_new is committed once it has been replicated to a majority of the new configuration; after the C_new record is committed, all subsequent decisions are based on C_new.
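A Go sketch of the vote/ack counting under joint consensus; the config type and names are illustrative:

```go
package raft

// config holds the server sets; oldServers is empty once C_new alone
// is in effect.
type config struct {
	oldServers []int
	newServers []int
}

// majorityOf reports whether more than half of the given servers voted yes.
func majorityOf(servers []int, votes map[int]bool) bool {
	count := 0
	for _, id := range servers {
		if votes[id] {
			count++
		}
	}
	return count > len(servers)/2
}

// decided reports whether the votes carry the decision under the
// current configuration rules.
func (c *config) decided(votes map[int]bool) bool {
	if len(c.oldServers) == 0 {
		return majorityOf(c.newServers, votes) // plain C_new
	}
	// Joint consensus: majorities in BOTH configurations are required.
	return majorityOf(c.oldServers, votes) && majorityOf(c.newServers, votes)
}
```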
Joint Consensus has a few more details:
- During the transition, Leaders may come from either the old or the new configuration;
- If the current Leader is not part of the C_new configuration, it must step down once C_new is committed.
As the figure shows, the old Leader may continue to serve for a short time after it is no longer a member of the new configuration: it may keep acting as Leader (even though it is no longer counted as a member under C_new) until the C_new record is replicated to a majority and committed.
Further reading
Raft questions, how many questions can you get right?
Golang implements the Paxos distributed consensus algorithm
Rambling on about distributed consensus
Understand Paxos (including pseudocode)