Paper link: the Raft paper

2 Replicated state machines

The goal of a consensus algorithm is to keep the state of all nodes in the cluster consistent. The commands a node executes fall into two categories, reads and writes; only write commands change a node's state, so to keep state consistent across the cluster, write commands must be propagated to every node.

Ideally, a write command on any node is expected to immediately change the state on other nodes without any delay, and all nodes are expected to change the state as if it were a single machine.

However, since network communication is far slower than memory operations, a write command cannot take effect on every node at literally the same instant. What can be guaranteed instead is that every node executes all write commands in the same order, which requires two things: first, only one node is allowed to process write commands; second, all nodes maintain logs with a consistent order.

Figure 1: Replicated state machines are typically implemented using a replicated log, as shown in the figure below. Each server stores a log containing a series of commands, and its state machine executes the commands in the log in order. Every log contains the same commands in the same order, so every state machine processes the same sequence of commands and therefore produces the same state and the same sequence of outputs.

Consensus algorithms arise in the context of replicated state machines: state machines on a collection of servers compute identical copies of the same state and can keep operating even when some servers are down, which is how distributed systems achieve fault tolerance.

The job of the consensus algorithm is to keep the replicated log consistent. The consensus module on each server receives commands from clients and adds them to its log. The servers communicate with one another through their consensus modules to ensure that every log eventually contains the same commands in the same order, even if some servers fail. Once commands are properly replicated, each server's state machine processes them in log order and returns the output to the client. The result is a highly available replicated state machine.

Consensus algorithms for practical systems typically have the following properties:

  • They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.
  • They remain available as long as any majority (more than half) of the servers are operational and can communicate with each other and with clients. Thus a typical cluster of five servers can tolerate the failure of any two. Servers that crash can later recover from state on stable storage and rejoin the cluster.
  • They do not depend on timing for log consistency: faulty clocks and extreme message delays can, at worst, cause availability problems.
  • In the common case, a command completes as soon as a majority of the cluster (more than half of the servers) has responded to a single round of remote procedure calls; a minority of slow servers does not impact overall system performance.

5 Raft algorithm

Raft's novel features:

  • Strong leader: in Raft, log entries flow only from the leader to the other servers. This simplifies management of the replicated log and makes Raft easier to understand.
  • Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required by any consensus algorithm, while resolving conflicts simply and quickly.
  • Membership changes: Raft uses a joint consensus approach to handle cluster membership changes, in which the majorities of the two different configurations overlap during the transition, allowing the cluster to keep operating while its membership changes.

Comprehensibility of Raft:

  • The overall problem is decomposed into sub-problems: leader election, log replication, safety, and membership changes.
  • The state space is simplified by reducing the number of states to consider; for example, Raft uses randomized timeouts to simplify the leader election algorithm.
  • A Raft cluster typically contains an odd number of servers; a cluster of 2f + 1 servers tolerates f server failures, since a majority must remain available for committing log entries and electing a leader.
Raft is an algorithm for managing a replicated log of the kind described in Section 2.

Figure 2:

Interpretation of Figure 2:

1. State (the Go sketch after this list shows one way to group these fields):
  • Persistent state on all servers (updated on stable storage before responding to RPCs):
    • currentTerm: the latest term this server has seen (initialized to 0 on first boot, increases monotonically)
    • votedFor: candidateId that received this server's vote in the current term (or null if none)
    • log[]: log entries; each entry contains a command for the state machine and the term in which the leader received the entry (the first index is 1)
  • Volatile state on all servers:
    • commitIndex: index of the highest log entry known to be committed (initialized to 0, increases monotonically)
    • lastApplied: index of the highest log entry applied to the state machine (initialized to 0, increases monotonically)
  • Volatile state on leaders (reinitialized after each election):
    • nextIndex[]: for each server, the index of the next log entry to send to that server (initialized to the leader's last log index + 1)
    • matchIndex[]: for each server, the index of the highest log entry known to be replicated on that server (initialized to 0, increases monotonically)
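
For illustration only, the fields above could be grouped as in the following Go sketch; the struct and field names are my own and not taken from the paper or any particular implementation.

```go
package raft

// LogEntry holds one command for the state machine together with the term
// in which the leader received it. The first real entry sits at index 1.
type LogEntry struct {
	Term    int
	Command interface{}
}

// ServerState groups the variables listed in Figure 2. The persistent fields
// must be flushed to stable storage before the server answers any RPC.
type ServerState struct {
	// Persistent state on all servers.
	CurrentTerm int        // latest term this server has seen (starts at 0)
	VotedFor    *int       // candidateId voted for in CurrentTerm; nil if none
	Log         []LogEntry // log entries; Log[0] is a dummy so entries start at 1

	// Volatile state on all servers.
	CommitIndex int // highest log index known to be committed (starts at 0)
	LastApplied int // highest log index applied to the state machine (starts at 0)

	// Volatile state on leaders, reinitialized after each election.
	NextIndex  map[int]int // per follower: index of the next entry to send
	MatchIndex map[int]int // per follower: highest index known to be replicated
}
```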

2. AppendEntries RPC: invoked by the leader to replicate log entries on followers (§5.3); also used as a heartbeat (§5.2).

  • Arguments (sent from the leader to a follower):
    • term: the leader's term
    • leaderId: so the follower can redirect clients to the leader
    • prevLogIndex: index of the log entry immediately preceding the new ones
    • prevLogTerm: term of the log entry immediately preceding the new ones
    • entries[]: log entries to store (empty for heartbeats; more than one may be sent at a time for efficiency)
    • leaderCommit: the leader's commitIndex (index of the highest log entry the leader knows to be committed)
  • Results (return values):
    • term: currentTerm, for the leader to update itself
    • success: true if the follower contained an entry matching prevLogIndex and prevLogTerm
  • Receiver implementation (i.e., the follower; a Go sketch follows this list):
    • 1. Reply false if term < currentTerm (§5.1)
    • 2. Reply false if the log does not contain an entry at prevLogIndex whose term matches prevLogTerm (§5.3)
    • 3. If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it (§5.3)
    • 4. Append any new entries not already in the log
    • 5. If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, index of last new entry)
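
Putting the receiver rules together, a follower's AppendEntries handler could look roughly like the sketch below; the types and names are illustrative, not from the paper, and error handling, persistence, and locking are omitted.

```go
package raft

type LogEntry struct {
	Term    int
	Command interface{}
}

type AppendEntriesArgs struct {
	Term, LeaderId, PrevLogIndex, PrevLogTerm, LeaderCommit int
	Entries                                                 []LogEntry
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}

// Follower is the subset of Figure 2's state that the handler touches.
// Log[0] is a dummy entry so that real entries start at index 1.
type Follower struct {
	CurrentTerm int
	Log         []LogEntry
	CommitIndex int
}

// HandleAppendEntries follows receiver rules 1-5 above.
func (f *Follower) HandleAppendEntries(args AppendEntriesArgs) AppendEntriesReply {
	// Rule 1: reject requests from stale leaders.
	if args.Term < f.CurrentTerm {
		return AppendEntriesReply{Term: f.CurrentTerm, Success: false}
	}
	if args.Term > f.CurrentTerm {
		f.CurrentTerm = args.Term // rules for all servers: adopt the newer term
	}
	// Rule 2: the log must contain an entry at prevLogIndex with prevLogTerm.
	if args.PrevLogIndex >= len(f.Log) || f.Log[args.PrevLogIndex].Term != args.PrevLogTerm {
		return AppendEntriesReply{Term: f.CurrentTerm, Success: false}
	}
	// Rules 3 and 4: delete conflicting entries, then append any new ones.
	next := args.PrevLogIndex + 1
	for i := range args.Entries {
		if next+i < len(f.Log) && f.Log[next+i].Term != args.Entries[i].Term {
			f.Log = f.Log[:next+i] // conflict: truncate from here on
		}
		if next+i >= len(f.Log) {
			f.Log = append(f.Log, args.Entries[i:]...)
			break
		}
	}
	// Rule 5: advance commitIndex, but never past the last new entry.
	if args.LeaderCommit > f.CommitIndex {
		last := args.PrevLogIndex + len(args.Entries)
		if args.LeaderCommit < last {
			f.CommitIndex = args.LeaderCommit
		} else {
			f.CommitIndex = last
		}
	}
	return AppendEntriesReply{Term: f.CurrentTerm, Success: true}
}
```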

3. RequestVote RPC: invoked by candidates to gather votes (§5.2).

  • Arguments (sent from the candidate to the other servers):
    • term: the candidate's term
    • candidateId: the candidate requesting the vote
    • lastLogIndex: index of the candidate's last log entry
    • lastLogTerm: term of the candidate's last log entry
  • Results (return values):
    • term: currentTerm, for the candidate to update itself
    • voteGranted: true means the candidate received the vote
  • Receiver implementation (a Go sketch follows this list):
    • 1. Reply false if term < currentTerm (§5.1)
    • 2. If votedFor is null or candidateId, and the candidate's log is at least as up-to-date as the receiver's log, grant the vote (§5.2, §5.4)
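
A follower's vote-granting logic might look like the following sketch; type and function names are illustrative, and the usual persistence and locking are omitted.

```go
package raft

type RequestVoteArgs struct {
	Term, CandidateId, LastLogIndex, LastLogTerm int
}

type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

// Voter holds the state a server consults when deciding whether to vote.
type Voter struct {
	CurrentTerm  int
	VotedFor     *int // nil means "has not voted in the current term"
	LastLogIndex int  // index of this server's last log entry
	LastLogTerm  int  // term of this server's last log entry
}

// HandleRequestVote applies receiver rules 1 and 2, including the
// "at least as up-to-date" comparison defined in Section 5.4.1.
func (v *Voter) HandleRequestVote(args RequestVoteArgs) RequestVoteReply {
	// Rule 1: reject candidates from older terms.
	if args.Term < v.CurrentTerm {
		return RequestVoteReply{Term: v.CurrentTerm, VoteGranted: false}
	}
	if args.Term > v.CurrentTerm {
		// Rules for all servers: adopt the newer term and forget the old vote.
		v.CurrentTerm = args.Term
		v.VotedFor = nil
	}
	// Section 5.4.1: the candidate's log must be at least as up-to-date as ours.
	logOK := args.LastLogTerm > v.LastLogTerm ||
		(args.LastLogTerm == v.LastLogTerm && args.LastLogIndex >= v.LastLogIndex)
	// Rule 2: at most one vote per term, first come first served.
	if logOK && (v.VotedFor == nil || *v.VotedFor == args.CandidateId) {
		id := args.CandidateId
		v.VotedFor = &id
		return RequestVoteReply{Term: v.CurrentTerm, VoteGranted: true}
	}
	return RequestVoteReply{Term: v.CurrentTerm, VoteGranted: false}
}
```
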
4. Rules for Servers:
  • All servers:
    • If commitIndex > lastApplied: increment lastApplied and apply log[lastApplied] to the state machine (§5.3)
    • If an RPC request or response contains a term T > currentTerm: set currentTerm = T and convert to follower (§5.1)
  • Followers (§5.2):
    • Respond to RPCs from candidates and leaders
    • If the election timeout elapses without receiving an AppendEntries RPC from the current leader or granting a vote to a candidate: convert to candidate
  • Candidates (§5.2):
    • On conversion to candidate, start an election:
      • Increment currentTerm
      • Vote for self
      • Reset the election timer
      • Send RequestVote RPCs to all other servers
    • If votes are received from a majority of servers: become leader
    • If an AppendEntries RPC is received from the new leader: convert to follower
    • If the election timeout elapses: start a new election
  • Leaders:
    • Upon election: send initial empty AppendEntries RPCs (heartbeats) to each server; repeat during idle periods to prevent election timeouts (§5.2)
    • If a command is received from a client: append the entry to the local log, and respond after the entry has been applied to the state machine (§5.3)
    • If the leader's last log index ≥ nextIndex for a follower: send an AppendEntries RPC with log entries starting at nextIndex
      • If successful: update nextIndex and matchIndex for that follower (§5.3)
      • If AppendEntries fails because of log inconsistency: decrement nextIndex and retry (§5.3)
    • If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm: set commitIndex = N (§5.3, §5.4; a sketch of this rule follows the list)
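
The last leader rule can be sketched as below; the function and parameter names are hypothetical, and the sketch relies on the fact that log terms never decrease along the log, so the largest majority-replicated index is the only candidate worth checking.

```go
package raft

import "sort"

// AdvanceCommitIndex returns the new commitIndex for a leader: the largest N
// such that N > commitIndex, a majority of matchIndex values are >= N, and
// logTerms[N] == currentTerm (logTerms[i] is the term of the entry at index i).
func AdvanceCommitIndex(commitIndex, currentTerm int, matchIndex []int, logTerms []int) int {
	// The leader counts itself as having replicated every entry it stores.
	indexes := append([]int{len(logTerms) - 1}, matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(indexes)))
	// After sorting in descending order, the value at position len/2 is
	// replicated on a majority of the servers.
	n := indexes[len(indexes)/2]
	// Terms are nondecreasing along the log, so if logTerms[n] is older than
	// currentTerm, no smaller index can carry currentTerm either.
	if n > commitIndex && logTerms[n] == currentTerm {
		return n
	}
	return commitIndex
}
```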

The key features

Figure 2: A condensed summary of the Raft consensus algorithm (excluding membership changes and log compaction). The server behavior in the upper-left box is described as a set of rules that are independently and repeatedly triggered. Section numbers such as §5.2 indicate where specific functions are discussed. A formal specification [31] describes the algorithm more precisely.

Figure 3:

  • Election Safety: at most one leader can be elected in a given term. (§5.2)
  • Leader Append-Only: a leader never overwrites or deletes entries in its log; it only appends new entries. (§5.3)
  • Log Matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index. (§5.3)
  • Leader Completeness: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders of all higher-numbered terms. (§5.4)
  • State Machine Safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry at the same index. (§5.4.3)

Raft implements consensus by first electing a distinguished leader and then giving that leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them to the other servers, and tells the other servers when it is safe to apply the entries to their state machines. Having a leader greatly simplifies management of the replicated log: for example, the leader can decide where to place new entries in the log without consulting other servers, and data flows only from the leader to the other servers. If the leader fails or becomes disconnected from the other servers, a new leader is elected.

By electing a leader, Raft decomposes the consensus problem into three relatively independent sub-problems, discussed in the subsections that follow:

  • Leader election: When the current Leader is down, a new Leader must be elected. (section 5.2)

  • Log replication: the Leader must accept log entries from clients and replicate them to the other nodes in the cluster, forcing the other nodes' logs to agree with its own. (Section 5.3)

  • Safety: the key safety property for Raft is the State Machine Safety Property in Figure 3: if any server has applied a particular log entry to its state machine, then no other server may apply a different command at the same log index. Section 5.4 describes how Raft ensures this property; the solution involves an additional restriction on the election mechanism of Section 5.2.

    After presenting the consensus algorithm, this section also discusses some availability issues and the role of timing in the system.

5.1 Raft basics

A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader (Section 5.2).

Figure 4:

Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader. If a candidate wins the election, it serves as leader for the rest of the term. In some situations an election ends in a split vote; in that case the term ends with no leader, and a new term (with a new election) begins shortly afterwards. Raft ensures that there is at most one leader in any given term.

Figure 5:

Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even an entire term. Terms act as a logical clock in Raft, allowing servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Servers exchange current terms whenever they communicate; if one server's current term is smaller than another's, it updates its term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to the follower state. If a server receives a request with a stale term number, it rejects the request.

Raft servers communicate using RPCs, and the basic consensus algorithm requires only two types. RequestVote RPCs are initiated by candidates during elections (Section 5.2), and AppendEntries RPCs are initiated by the leader to replicate log entries and to provide a heartbeat (Section 5.3). Section 7 adds a third RPC for transferring snapshots between servers. Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.

5.2 Leader election

Election mechanism:

Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains a follower as long as it receives valid RPCs from a leader or candidate. The leader sends periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers to maintain its authority. If a follower receives no communication over a period of time called the election timeout, it assumes there is no viable leader and begins an election to choose a new one.

Start the election:

To begin an election, a follower increments its current term and transitions to the candidate state. It then votes for itself and issues RequestVote RPCs in parallel to the other servers in the cluster, asking them to vote for it. A candidate remains in this state until one of three things happens: (a) it wins the election (receives votes from a majority of the servers), (b) another server establishes itself as leader, or (c) a period of time goes by with no winner. These outcomes are discussed separately below.

Win an election:

A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server votes for at most one candidate in a given term, on a first-come, first-served basis (note: Section 5.4 adds an additional restriction on votes). The majority rule ensures that at most one candidate can win the election for a particular term (the Election Safety Property in Figure 3). Once a candidate wins, it becomes leader and sends heartbeat messages to the other servers to establish its authority and prevent new elections.

Another server becomes leader:

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If that leader's term (included in the RPC) is at least as large as the candidate's current term, the candidate recognizes the leader as legitimate and returns to the follower state. If the term in the RPC is smaller than the candidate's current term, the candidate rejects the RPC and stays in the candidate state.

Split votes:

The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, the votes may be split so that no candidate obtains a majority. When this happens, each candidate times out and starts a new election by incrementing its term. However, without an additional mechanism, split votes could repeat indefinitely.

Random election timeout:

Raft uses randomized election timeouts to ensure that split votes are rare and are resolved quickly when they do occur. To prevent split votes in the first place, election timeouts are chosen at random from a fixed interval (for example, 150-300 milliseconds). This spreads out the servers so that in most cases only a single server times out; it wins the election and sends heartbeats before any other server times out. The same mechanism handles split votes: each candidate restarts its randomized election timeout at the start of an election and waits for it to elapse before starting the next election, which reduces the likelihood of another split vote in the new election. Section 9.3 shows that this approach elects a leader quickly.
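
A minimal sketch of how a randomized election timeout might be chosen in Go is shown below; the interval and function name are illustrative.

```go
package raft

import (
	"math/rand"
	"time"
)

// randomElectionTimeout picks a fresh timeout from a fixed interval
// (150-300 ms here, matching the example above). A follower re-arms its
// election timer with a new value like this each time it hears from a
// leader or grants a vote, and becomes a candidate if the timer fires.
func randomElectionTimeout() time.Duration {
	const low, high = 150 * time.Millisecond, 300 * time.Millisecond
	return low + time.Duration(rand.Int63n(int64(high-low)))
}
```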

5.3 Log Replication

Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry and then issues AppendEntries RPCs in parallel to the other servers to replicate the entry. When the entry has been safely replicated (as described below, replicated on a majority of the servers), the leader applies the entry to its state machine (executes the command) and returns the result to the client. If followers crash, run slowly, or network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.

Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers are used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.

Figure 6:

A log consists of sequentially numbered entries. Each entry contains the term in which it was created (the number in each box) and a command for the state machine. An entry is considered committed once it is safe for it to be applied to the state machines.

The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created it has replicated it on a majority of the servers (for example, entry 7 in Figure 6). This also commits all preceding entries in the leader's log, including entries created by earlier leaders. Section 5.4 discusses some subtleties of applying this rule after leader changes, and shows that this definition of commitment is safe. The leader keeps track of the highest index it knows to be committed and includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).

We designed the Raft log mechanism to maintain a high level of coherency between the logs on different servers. Not only does this simplify the system's behavior and make it more predictable, it is also an important component of safety. Raft maintains the following properties, which together constitute the Log Matching Property in Figure 3:

  • If two entries in different logs have the same index and term, they store the same command.
  • If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

A leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log; this guarantees the first property above. The second property is guaranteed by a simple consistency check performed by AppendEntries RPCs: when sending an AppendEntries RPC, the leader includes the index and term of the entry that immediately precedes the new entries (prevLogIndex and prevLogTerm). If the follower does not find an entry in its log with the same index and term, it refuses the new entries. The consistency check acts as an induction step: the initial empty logs trivially satisfy the Log Matching Property, and the consistency check preserves it whenever logs are extended. As a result, whenever an AppendEntries RPC returns successfully, the leader knows that the follower's log is identical to its own up through the new entries.

During normal operation the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log), and these inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which followers' logs may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries may span multiple terms.

Figure 7: When the leader at the top comes to power, any of scenarios (a-f) could exist in a follower's log. Each box represents one log entry; the number inside is its term. A follower may be missing entries (a-b), may have extra uncommitted entries (c-d), or both (e-f). For example, scenario (f) could occur if that server was the leader for term 2, appended several entries to its log, then crashed before committing any of them; it restarted quickly, became leader for term 3, and appended a few more entries; before any of the entries from term 2 or 3 were committed, the server crashed again and remained down for several terms.

Inconsistency problem

In Raft, the leader handles inconsistencies by forcing the followers' logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten by entries from the leader's log. Section 5.4 shows that this is safe when coupled with one more restriction.

To bring a follower's log into agreement with its own, the leader must find the latest log entry on which the two logs agree, delete any entries in the follower's log after that point, and send the follower all of the leader's entries after that point. All of this happens in response to the consistency check performed by AppendEntries RPCs. For each follower, the leader maintains a nextIndex, the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after its last log entry (11 in Figure 7). If a follower's log is inconsistent with the leader's, the consistency check in the next AppendEntries RPC will fail. After the rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex reaches a point where the leader and follower logs match; the AppendEntries RPC then succeeds, which removes any conflicting entries from the follower's log and appends entries from the leader's log (if any are needed). Once an AppendEntries RPC succeeds, the follower's log is consistent with the leader's, and it remains that way for the rest of the term.
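
The following is a minimal, blocking sketch of how a leader might walk nextIndex backwards for a single follower; the names are illustrative, the RPC transport is abstracted behind a function value, and term checking, persistence, and concurrency are omitted.

```go
package raft

type LogEntry struct {
	Term    int
	Command interface{}
}

// replicateTo retries AppendEntries for one follower, decrementing nextIndex
// after each rejection until the consistency check passes. sendAppendEntries
// stands in for the real RPC and reports whether the follower accepted.
// Both logs are assumed to share a dummy entry at index 0, so nextIndex never
// has to fall below 1.
func replicateTo(log []LogEntry, nextIndex int,
	sendAppendEntries func(prevIndex, prevTerm int, entries []LogEntry) bool) (newNextIndex, matchIndex int) {

	for {
		prev := nextIndex - 1
		if sendAppendEntries(prev, log[prev].Term, log[nextIndex:]) {
			// Success: the follower's log now matches the leader's log
			// through the last entry that was sent.
			return len(log), len(log) - 1
		}
		// Consistency check failed: back up one entry and try again.
		// (The optimization described next would instead skip back over an
		// entire conflicting term in a single step.)
		nextIndex--
	}
}
```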

If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term. With this information the leader can decrement nextIndex to skip all of the conflicting entries in that term, so that one AppendEntries RPC is required per term with conflicting entries rather than one per entry. In practice we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.

With this mechanism, a leader does not need to take any special action to restore log consistency when it comes to power. It simply begins normal operation, and the logs converge automatically as the AppendEntries consistency check fails and succeeds. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).

This log replication mechanism exhibits the desired consensus properties: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up. In the normal case a new entry can be replicated with a single round of RPCs to a majority of the cluster, and a single slow follower does not affect overall performance.

The five key properties (restating Figure 3):

  • Election Safety: for a given term, at most one leader can be elected.
  • Leader Append-Only: a leader never deletes or overwrites entries in its own log; it only appends.
  • Log Matching: if two logs contain an entry with the same index and term, the logs are identical in all entries up through that index.
  • Leader Completeness: if a log entry is committed in a given term, it will be present in the logs of the leaders of all later terms.
  • State Machine Safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry at that index.

5.4 Safety

The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that every state machine executes exactly the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries; that follower could then be elected leader and overwrite those entries with new ones, and as a result different state machines might execute different command sequences.

This section completes the Raft algorithm by adding a restriction on which servers may be elected leader. The restriction ensures that the leader for any given term contains all of the entries committed in previous terms (the Leader Completeness Property from Figure 3). Given this election restriction, we then make the rules for commitment more precise. Finally, we present a proof sketch for the Leader Completeness Property and show how it leads to correct behavior of the replicated state machine.

5.4.1 Election restrictions

Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present on at least one of the servers in that majority. If the candidate's log is at least as up-to-date as any other log in that majority ("up-to-date" is defined precisely below), then it will hold all of the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate's log, and a voter denies its vote if its own log is more up-to-date than the candidate's.

Raft determines which of two logs is more up-to-date by comparing the index and term of the last entry in each log. If the last entries have different terms, the log with the later term is more up-to-date. If the last entries have the same term, the longer log is more up-to-date.
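
That comparison is small enough to state as a single function; the sketch below uses hypothetical names and is the same check applied in rule 2 of the RequestVote receiver in Figure 2.

```go
package raft

// atLeastAsUpToDate reports whether log A (described by the index and term of
// its last entry) is at least as up-to-date as log B, per Section 5.4.1:
// compare the terms of the last entries first, and fall back to log length
// (last index) when the terms are equal.
func atLeastAsUpToDate(aLastIndex, aLastTerm, bLastIndex, bLastTerm int) bool {
	if aLastTerm != bLastTerm {
		return aLastTerm > bLastTerm
	}
	return aLastIndex >= bLastIndex
}
```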

5.4.2 Committing entries from previous terms

As described in Section 5.3, a leader knows that an entry from its current term is committed once that entry is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed just because it is stored on a majority of servers. Figure 8 shows a situation where an old log entry that is stored on a majority of servers can still be overwritten by a future leader.

Figure 8: The time sequence in the figure shows why a leader cannot determine commitment using log entries from older terms.

In (a), S1 is the leader, partially copying the log entry at index position 2.

In (b), S1 crashes, then S5 wins the election in term 3 with S3, S4 and its own votes, then receives a different log entry from the client and places it at index 2.

In (c), S5 crashes; S1 restarts, wins a new election, and continues replicating. At this point the log entry from term 2 has been replicated on a majority of the servers, but it is not yet committed.

If S1 then crashes, as in (d), S5 can be elected leader again (with votes from S2, S3, and S4) and overwrite the entry at index 2 on the other servers with its own entry from term 3.

As this example shows, the fact that a log entry from an earlier term has been replicated on a majority of the servers does not mean it is committed; it can still be overwritten by a future leader. Commitment must be conservative, because once an entry is committed it must never be lost or changed. The example demonstrates that a leader cannot safely conclude commitment by counting replicas of entries from previous terms.

However, if S1 replicates an entry from its current term (term 4) on a majority of the servers before crashing, as in (e), then that entry is committed (S5 can no longer win an election). In that case all preceding entries are committed as well.

Recall the election restriction: a voter grants its vote only to a candidate whose last log entry has a term at least as large as the voter's own and, when the terms are equal, an index at least as large.

For this reason, Raft adds an extra restriction on commitment: Raft never commits entries from previous terms by counting replicas. Only entries from the leader's current term are committed by counting replicas; once such an entry is committed, all earlier entries are committed indirectly by the Log Matching Property. In other words, the leader must get at least one entry from its own term replicated on a majority of the nodes.

As in scenario (e) of the figure, S1 as leader replicates an entry from its current term (term 4) at index 3 to a majority of the nodes before crashing. Even if S1 then goes down, S5 cannot win an election: the last log entries of S2 and S3 carry term 4, which is greater than S5's term 3, so S5 cannot gather a majority of votes. Since S5 cannot become leader, the entry at index 2 can never be overwritten.

Therefore, in practice a new leader commits a no-op entry (an empty command) at the start of its term, before serving client writes, replicating an entry that carries its own term to a majority of the cluster; this is what actually makes the election restriction sufficient to protect earlier entries.
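
A sketch of that convention, with illustrative names, might look like this:

```go
package raft

type LogEntry struct {
	Term    int
	Command interface{} // nil marks a no-op entry
}

// onBecomeLeader appends a no-op entry carrying the new leader's term. Once
// this entry has been replicated to a majority and committed, all earlier
// entries are committed as well (indirectly, via the Log Matching Property).
func onBecomeLeader(log []LogEntry, currentTerm int) []LogEntry {
	return append(log, LogEntry{Term: currentTerm, Command: nil})
}
```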

5.4.3 Safety argument

Claim to argue: every leader holds all of the entries committed by previous leaders.

Given the complete Raft algorithm, we can now argue more precisely that the Leader Completeness Property holds: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders of all higher-numbered terms (this argument relates to the safety proof discussed in Section 9.2). We assume that the Leader Completeness Property does not hold and derive a contradiction. Suppose the leader of term T commits a log entry during its term, but that entry is not stored by the leader of some later term; let U be the smallest term greater than T whose leader does not store the entry.

Figure 9: If S1 (the leader for term T) commits a new log entry during its term, and S5 is elected leader during the subsequent term U, then at least one node, such as S3, must have both received the log entry from S1 and voted for S5.

(Proof by contradiction) The following steps show that this assumption contradicts the other properties; call the committed entry E:
  1. At the moment it was elected, the leader of term U (S5 in the figure) must not have contained E in its log (leaders never delete or overwrite entries).
  2. The leader of term T replicated E on a majority of the cluster, and the leader of term U received votes from a majority of the cluster. Therefore at least one server (the voter, S3 in Figure 9) both accepted E from the leader of term T and voted for the leader of term U. This voter is the key to the contradiction.
  3. The voter must have accepted E from the leader of term T before voting for the leader of term U; otherwise it would have rejected the AppendEntries request from the term-T leader, because its current term would already have been higher than T.
  4. The voter still stored E when it voted for the leader of term U, since every intervening leader contained E (U is the smallest term whose leader lacks it) and leaders never remove entries; followers only remove entries that conflict with a leader.
  5. The voter granted its vote to the leader of term U, so the leader of term U's log must have been at least as up-to-date as the voter's. This leads to one of two contradictions.
  6. First, if the voter's last log entry and the leader of term U's last log entry share the same term, then the leader of term U's log must be at least as long as the voter's, so it contains every entry in the voter's log. This is a contradiction, since the voter's log contains the committed entry E and the leader of term U was assumed not to contain it.
  7. Otherwise, the last log term of the leader of term U must be larger than the voter's. Moreover, it is larger than T, since the voter's last log term is at least T (its log contains E, committed in term T). The earlier leader that created the last entry in the leader of term U's log must itself have contained E (by the assumption on U). Then, by the Log Matching Property, the log of the leader of term U must also contain E, which is a contradiction.
  8. Therefore every leader of a term greater than T must contain every entry committed in term T.
  9. The Log Matching Property also guarantees that future leaders contain entries committed indirectly, such as index 2 in Figure 8 (d).
(Chain of reasoning) State machine safety argument:
  1. Let A be the last entry committed in the previous term, and let B be the leader of the current term.
  2. A must have been replicated to a majority of the nodes in the cluster.
  3. B could only become leader by obtaining votes from a majority of the nodes in the cluster.
  4. Therefore at least one of B's voters stores A.
  5. By the election restriction, B can win only if its log is at least as up-to-date as each of its voters' logs.
  6. Therefore B's log must contain A.
  7. By the Log Matching Property, if B contains A then it also contains every entry before A.
  8. Since lastApplied <= commitIndex,
  9. and Raft ensures that committed entries appear in the same order on every node,
  10. applied entries are ordered consistently on all nodes.
  11. Since each state machine executes only the applied prefix of the log, in order,
  12. the state machines on all nodes of the cluster must eventually reach the same state.

Given the Leader Completeness Property, we can prove the State Machine Safety Property from Figure 3: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry at the same index. At the time a server applies an entry, its log must be identical to the leader's log up through that entry, and the entry must be committed.

Now consider the lowest term in which any server applies a given log index; the Leader Completeness Property guarantees that the leaders of all higher terms store the same entry, so servers that apply that index in later terms apply the same value. Thus the State Machine Safety Property holds.

Finally, Raft requires the server to apply log entries in log index order. Coupled with state machine security features, this means that all servers apply the same log entries to their state machines in the same order.

5.5 Follower and candidate crashes

So far, we’ve only looked at the leader crash. The processing mode of a Follower and candidate crash is much simpler than that of the leader crash, and the processing mode of both crashes is the same. If followers or candidates crash, subsequent RequestVote and AppendEntries RPCs sent to them will fail. Raft handles this failure with infinite retries; If the crashed machine is restarted, these RPCS will complete successfully. If a server crashes after completing an RPC without a response, it will receive the same request again after it restarts. Raft’s RPCs are idempotent, so retries like this do no harm. For example, a follower who receives a AppendEntries request but already has them in its log will simply ignore them in the new request.

5.6 Timing and Availability

One of Raft's requirements is that safety must not depend on timing: the system must not produce incorrect results just because some event happens faster or slower than expected. Availability (the ability to respond to clients in a timely manner), however, inevitably depends on timing. For example, if message exchanges take longer than the typical time between server crashes, candidates will not stay up long enough to win an election; and without a steady leader, Raft cannot make progress.

The Leader election is the most critical aspect of timing in Raft. Raft can elect and maintain a stable leader as long as the entire system meets the following time requirements:

BroadcastTime << electionTimeout << MTBF

In this inequality, the broadcast time is the average time it takes a server to send RPCs in parallel to every other server in the cluster and receive their responses; the election timeout is the one described in Section 5.2; and MTBF (mean time between failures) is the average time between failures for a single server.

The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeats needed to keep followers from starting elections; combined with the randomized election timeouts, this inequality also makes split votes unlikely. The election timeout should be a few orders of magnitude less than MTBF so that the system makes steady progress: when the leader crashes, the system is unavailable for roughly the election timeout, and we would like this to represent only a small fraction of overall time.

Broadcast times and mean time between failures are determined by the system, but election timeouts are of our choice. Raft RPCs require the receiver to persist information to stable storage, so the broadcast time can be between 0.5 ms and 20 ms depending on the storage technology. Therefore, the election timeout may need to be between 10 milliseconds and 500 milliseconds. The average time between failures on most servers is a few months or more, which is easy to meet.

6 Cluster member changes

Up to this point we have assumed that the cluster configuration (the set of servers participating in the consensus algorithm) is fixed. In practice it is occasionally necessary to change the configuration, for example to replace servers that have failed or to change the degree of replication. Although this can be done by taking the entire cluster offline, updating configuration files, and restarting the cluster, the cluster would be unavailable during the changeover, and any manual steps risk operator error. To avoid these problems, we decided to automate configuration changes and incorporate them into the Raft consensus algorithm.

For the configuration change mechanism to be safe, there must be no point during the transition at which two leaders could be elected for the same term. Unfortunately, any approach where servers switch directly from the old configuration to the new one is unsafe: it is impossible to switch all of the servers atomically, so the cluster could potentially split into two independent majorities during the transition (see Figure 10).

Figure 10: Switching directly from one configuration to another is unsafe because different servers switch at different times. In this example the cluster grows from three servers to five. Unfortunately there is a point in time at which two different leaders can be elected for the same term, one with a majority of the old configuration (C-old) and another with a majority of the new configuration (C-new).

To ensure safety, configuration changes must use a two-phase approach. There are a variety of ways to implement the two phases; for example, some systems use the first phase to disable the old configuration so it cannot process client requests, and then the second phase enables the new configuration. In Raft, the cluster first switches to a transitional configuration called joint consensus; once the joint consensus configuration has been committed, the system transitions to the new configuration. The joint consensus combines the old and new configurations:

  • Log entries are replicated to all servers in both configurations.
  • Any server from either configuration may serve as leader.
  • Agreement (for elections and for committing entries) requires separate majorities from both the old and the new configuration (see the quorum sketch below).
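
That last rule can be captured in a few lines; the function and parameter names below are illustrative.

```go
package raft

// jointQuorum reports whether a set of votes (or replication acknowledgements)
// satisfies joint consensus: it must form a majority in the old configuration
// AND a majority in the new configuration.
func jointQuorum(votes map[int]bool, oldCfg, newCfg []int) bool {
	return majority(votes, oldCfg) && majority(votes, newCfg)
}

// majority counts how many servers in cfg have voted and compares against
// a strict majority of the configuration's size.
func majority(votes map[int]bool, cfg []int) bool {
	count := 0
	for _, id := range cfg {
		if votes[id] {
			count++
		}
	}
	return count > len(cfg)/2
}
```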

Joint consensus allows individual servers to transition between configurations at different times without compromising safety. Furthermore, it allows the cluster to continue serving client requests throughout the configuration change.

Cluster Configuration Changes

Figure 11: Timeline of a configuration change. Dashed lines show configuration entries that have been created but not committed, and solid lines show the latest committed configuration entry. The leader first creates the C-old,new configuration entry in its log and commits it to C-old,new (a majority of C-old and a majority of C-new). Then it creates the C-new entry and commits it to a majority of C-new. There is no point in time at which C-old and C-new can both make decisions independently.

Cluster configurations are stored and communicated using special entries in the replicated log; Figure 11 illustrates the configuration change process. When the leader receives a request to change the configuration from C-old to C-new, it stores the joint consensus configuration (C-old,new in the figure) as a log entry and replicates it as described previously. Once a server adds the new configuration entry to its log, it uses that configuration for all future decisions (a server always uses the latest configuration in its log, whether or not that entry has been committed). This means the leader uses the rules of C-old,new to determine when the C-old,new entry is committed. If the leader crashes, a new leader may be elected under either C-old or C-old,new, depending on whether the winning candidate has received the C-old,new entry. In no case can C-new make unilateral decisions during this period.

Once C-old,new has been committed, neither C-old nor C-new can make decisions without the other's approval, and the Leader Completeness Property ensures that only servers with the C-old,new entry in their logs can be elected leader. It is now safe for the leader to create a log entry describing C-new and replicate it to the rest of the cluster. Again, this configuration takes effect on each server as soon as it is received. Once the C-new entry has been committed under the rules of C-new, the old configuration is irrelevant and servers not in the new configuration can be shut down. As shown in Figure 11, there is no time at which C-old and C-new can both make unilateral decisions; this guarantees safety.

There are three issues that need to be addressed with regard to configuration changes.

The first issue is that new servers may not initially store any log entries. If they join the cluster in this state, it could take quite a while for them to catch up, during which time it might not be possible to commit new log entries. To avoid this availability gap, Raft introduces an additional phase before the configuration change, in which the new servers join the cluster as non-voting members (the leader replicates log entries to them, but they are not counted for majorities). Once the new servers have caught up with the rest of the cluster, the configuration change proceeds as described above.

The second issue is that the cluster leader may not be part of the new configuration. In this case, the leader steps down (returns to the follower state) once it has committed the C-new log entry. This means there is a period of time (while it is committing C-new) when the leader is managing a cluster that does not include itself; it replicates log entries but does not count itself in majorities. The leader transition happens when C-new is committed, because this is the first point at which the new configuration can operate independently (it will always be possible to elect a leader from C-new). Before this point, it may be the case that only a server from C-old can be elected leader.

The third issue is that removed servers (those not in C-new) can disrupt the cluster. These servers no longer receive heartbeats, so they will time out and start new elections. They send RequestVote RPCs with new term numbers, which causes the current leader to revert to the follower state. A new leader is eventually elected, but the removed servers time out again and the process repeats, resulting in poor availability.

To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from the current leader, it neither updates its term nor grants its vote. This does not affect normal elections, because every server waits at least the minimum election timeout before starting an election anyway. It does, however, help avoid disruptions from removed servers: as long as a leader is able to send heartbeats to its cluster, it will not be deposed by larger term numbers from removed servers.

7 Log compaction

Raft's log grows during normal operation as it absorbs more client requests, but in a practical system it cannot grow without bound. As the log grows longer, it occupies more space and takes more time to replay. Without some mechanism for discarding obsolete information that accumulates in the log, this eventually causes availability problems.

Snapshotting is the simplest approach to compaction. In snapshotting, the entire current system state is written to a snapshot on stable storage, and the log up to that point is discarded. Snapshotting is used in Chubby and ZooKeeper, and the remainder of this section describes snapshotting in Raft.

Incremental approaches to compaction, such as log cleaning and log-structured merge trees (LSM trees), are also possible. These operate on a fraction of the data at a time, spreading the compaction load over time. They first select a region of data that has accumulated many deleted or overwritten objects, then rewrite the live objects from that region more compactly, and finally free the region. This requires significant additional mechanism and complexity compared with snapshotting, which simplifies the problem by always operating on the entire data set. A state machine can implement LSM trees using the same interface as snapshotting, but log cleaning would require modifications to Raft.

Figure 12: A server replaces the committed entries in its log (indexes 1 through 5) with a new snapshot that stores only the current state (the values of the variables x and y). The snapshot's last included index and last included term are saved to position the snapshot in the log before entry 6.

Figure 12 shows the basic idea of snapshotting in Raft. Each server takes snapshots independently, covering only the committed entries in its own log. Most of the work consists of the state machine writing its current state to the snapshot. Raft snapshots also include a small amount of metadata: the last included index is the index of the last log entry the snapshot replaces (the last entry the state machine had applied), and the last included term is the term of that entry. This metadata is kept so that the AppendEntries consistency check can be performed for the first entry following the snapshot, since that check needs the previous index and term. To support cluster membership changes (Section 6), the snapshot also stores the latest configuration in the log as of the last included index. Once a server finishes writing a snapshot, it may delete all log entries up through the last included index, as well as any earlier snapshot.

Although servers normally take snapshots independently, the leader must occasionally send snapshots to followers that lag behind. This happens when the leader has already discarded the next log entry it needs to send to a follower. Fortunately, this situation is unlikely in normal operation: a follower that has kept up with the leader will already have that entry. However, an exceptionally slow follower or a server newly added to the cluster (Section 6) would not. The leader brings such a follower up to date by sending it a snapshot over the network.

Figure 13: Summary of the InstallSnapshot RPC. The snapshot is split into chunks for transmission; each chunk gives the follower a sign of life, so it can reset its election timer.

  • InstallSnapshot RPC: invoked by the leader to send chunks of a snapshot to a follower. Leaders always send chunks in order.
  • Arguments:

    • term: the leader's term
    • leaderId: so the follower can redirect clients
    • lastIncludedIndex: the snapshot replaces all entries up through and including this index
    • lastIncludedTerm: term of the last entry included in the snapshot
    • offset: byte offset of this chunk within the snapshot file
    • data[]: raw bytes of the snapshot chunk, starting at offset
    • done: true if this is the last chunk
  • Results:

    • term: currentTerm, for the leader to update itself
  • Receiver implementation (a Go sketch of steps 6-8 follows below):

      1. Reply immediately if term < currentTerm
      2. Create a new snapshot file if this is the first chunk (offset is 0)
      3. Write the data into the snapshot file at the given offset
      4. If done is false, reply and wait for more chunks
      5. Save the snapshot file, discarding any existing or partial snapshot with a smaller index
      6. If an existing log entry has the same index and term as the snapshot's last included entry, retain the log entries following it and reply
      7. Otherwise, discard the entire log
      8. Reset the state machine using the snapshot contents (and load the snapshot's cluster configuration)
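
For steps 6 through 8, a follower-side sketch might look like the following; the types and names are illustrative, and chunk assembly (steps 1-5) is assumed to have completed already.

```go
package raft

type LogEntry struct {
	Term    int
	Command interface{}
}

type Snapshot struct {
	LastIncludedIndex int
	LastIncludedTerm  int
	Data              []byte // serialized state machine state
}

// handleSnapshotBody applies receiver steps 6-8 after the final chunk has been
// written and the snapshot file saved. It returns the follower's new log, whose
// entry at index 0 acts as a placeholder for the snapshot itself. For
// simplicity this sketch assumes slice positions equal absolute log indexes
// (i.e., no earlier truncation has happened).
func handleSnapshotBody(log []LogEntry, snap Snapshot,
	resetStateMachine func(data []byte)) []LogEntry {

	// Step 6: the snapshot describes a prefix of our log, so retain the
	// entries that follow it; the state machine keeps running as is.
	if snap.LastIncludedIndex < len(log) &&
		log[snap.LastIncludedIndex].Term == snap.LastIncludedTerm {
		tail := log[snap.LastIncludedIndex+1:]
		return append([]LogEntry{{Term: snap.LastIncludedTerm}}, tail...)
	}

	// Steps 7 and 8: the snapshot supersedes the whole log; discard the log
	// and reset the state machine from the snapshot contents.
	resetStateMachine(snap.Data)
	return []LogEntry{{Term: snap.LastIncludedTerm}}
}
```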

The leader uses the InstallSnapshot RPC, summarized in Figure 13, to send snapshots to followers that are too far behind. When a follower receives a snapshot through this RPC, it must decide what to do with its existing log entries. Usually the snapshot will contain new information not already in the follower's log; in this case the follower discards its entire log, which is entirely superseded by the snapshot and may contain uncommitted entries that conflict with it. If instead the snapshot describes a prefix of the follower's log (due to retransmission or by mistake), then the log entries covered by the snapshot are deleted, but the entries following the snapshot are still valid and are retained.

This snapshotting approach departs from Raft's strong-leader principle, since followers can take snapshots without knowledge of the leader. We think this departure is justified. A leader exists to avoid conflicting decisions while reaching consensus, but by the time a snapshot is taken, consensus has already been reached for the snapshotted entries, so no decisions can conflict. Data still flows only from the leader to the followers; followers merely reorganize data they already have.

We considered an alternative leader-based approach to snapshots, in which only the leader would create a snapshot and then send it to each of its followers. This has two drawbacks. First, sending the snapshots wastes network bandwidth and slows down the snapshotting process; each follower already has the information it needs to produce its own snapshot, and it is typically much cheaper for a server to build a snapshot from its local state than to receive one over the network. Second, it would complicate the leader's implementation: for example, the leader would have to send snapshots to followers in parallel with replicating new log entries to them, so as not to block new client requests.

Two further issues affect snapshotting performance. First, servers must decide when to snapshot. If a server snapshots too often, it wastes disk bandwidth and other resources; if it snapshots too infrequently, it risks exhausting its storage capacity and increases the time needed to replay the log on restart. A simple strategy is to take a snapshot when the log reaches a fixed size in bytes; if this threshold is set significantly larger than the expected size of a snapshot, the disk bandwidth overhead of snapshotting will be small.

The second performance issue is that writing a snapshot can take a significant amount of time, and we do not want this to delay normal operations. The solution is to use copy-on-write techniques so that new updates can be accepted without affecting the snapshot being written. For example, state machines built with functional (immutable) data structures naturally support this, and the operating system's copy-on-write support (such as fork on Linux) can be used to create an in-memory snapshot of the entire state machine (this is the approach our implementation uses).

8 Client Interaction

Raft clients send all of their requests to the leader. When a client first starts up, it connects to a randomly chosen server. If the client's first choice is not the leader, that server rejects the request and supplies information about the most recent leader it has heard from (AppendEntries requests include the leader's network address). If the leader crashes, client requests time out; the client then retries with another randomly chosen server.

Our goal for Raft is to implement linearizable semantics: each operation appears to execute instantaneously, exactly once, at some point between its invocation and its response. As described so far, however, Raft can execute a command multiple times: for example, if the leader crashes after committing a log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time. The solution is for clients to assign a unique serial number to every command. The state machine then tracks the latest serial number processed for each client, along with the associated response; if it receives a command whose serial number has already been executed, it responds immediately with the stored result, without re-executing the request.
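
One way to sketch this duplicate-detection scheme is shown below; the types, method names, and the simple string key/value command are illustrative, not from the paper.

```go
package raft

// clientSession remembers, per client, the highest serial number applied and
// the response produced for it.
type clientSession struct {
	lastSeq  int64
	lastResp string
}

// StateMachine applies each (clientId, seq) command at most once.
type StateMachine struct {
	sessions map[string]*clientSession
	data     map[string]string
}

func NewStateMachine() *StateMachine {
	return &StateMachine{
		sessions: make(map[string]*clientSession),
		data:     make(map[string]string),
	}
}

// ApplyPut executes a Put(key, value) command unless this client's serial
// number has already been applied, in which case the cached reply is returned.
func (sm *StateMachine) ApplyPut(clientId string, seq int64, key, value string) string {
	s, ok := sm.sessions[clientId]
	if ok && seq <= s.lastSeq {
		return s.lastResp // duplicate request: return the remembered response
	}
	sm.data[key] = value // execute the command exactly once
	if !ok {
		s = &clientSession{}
		sm.sessions[clientId] = s
	}
	s.lastSeq, s.lastResp = seq, "OK"
	return "OK"
}
```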

Read-only operations can be handled without writing anything to the log. Without additional measures, however, this risks returning stale data, because the leader responding to the request might have been superseded by a newer leader without realizing it. Linearizable reads must not return stale data, so Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information about which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term it may not know which of them are committed; to find out, it needs to commit an entry from its own term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests. Alternatively, the leader could rely on the heartbeat mechanism to provide a form of lease, but this would rely on timing for safety (it assumes bounded clock skew).
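
A hedged sketch of those two precautions is below; the helper functions stand in for the real mechanisms and their names are illustrative.

```go
package raft

import "errors"

// readOnly serves a linearizable read without appending to the log. It assumes
// the leader has already committed an entry (e.g. the start-of-term no-op) in
// its current term, confirms leadership with a heartbeat round to a majority,
// waits until the state machine has applied everything up to the recorded
// commit index, and only then reads from the local state machine.
func readOnly(
	commitIndex int,
	committedInCurrentTerm bool,
	confirmLeadership func() bool, // one heartbeat round acked by a majority
	waitApplied func(index int), // block until lastApplied >= index
	read func() string, // read from the local state machine
) (string, error) {
	if !committedInCurrentTerm {
		return "", errors.New("no entry committed in the current term yet")
	}
	readIndex := commitIndex
	if !confirmLeadership() {
		return "", errors.New("no longer the leader")
	}
	waitApplied(readIndex)
	return read(), nil
}
```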