Abstract

Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to Paxos and is as efficient as Paxos, but its structure is different, which makes Raft easier to understand and provides a better foundation for building practical systems. To improve understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing cluster membership, which uses overlapping majorities to guarantee safety.

1 Introduction

Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, consensus algorithms play a key role in building reliable large-scale software systems. For the past decade, the Paxos algorithm has dominated the discussion of consensus: the vast majority of implementations are based on or influenced by Paxos, and Paxos has become the primary vehicle for teaching consensus.

Unfortunately, despite many attempts to make it more approachable, Paxos remains quite difficult to understand. Furthermore, its structure requires substantial changes before it can be applied to practical systems. As a result, Paxos gives both industry and academia a headache.

After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for building practical systems and for teaching. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos? Furthermore, we wanted the algorithm to facilitate the development of intuitions in system builders. It is important not just for an algorithm to work, but for it to be obvious why it works.

The Raft consensus algorithm is the result of this work. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft is separated into three main modules: leader election, log replication, and safety) and state space reduction (relative to Paxos, Raft reduces the degree of nondeterminism and the ways in which servers can be inconsistent with each other). A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos.

Raft is similar in many ways to existing consensus algorithms (most notably Oki and Liskov’s Viewstamped Replication), but it has several novel features:

  • Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand.
  • Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required by any consensus algorithm, while resolving conflicts simply and rapidly.
  • Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a joint consensus approach in which the majorities of two different configurations overlap during transitions, which allows the cluster to continue operating normally while its membership changes.

We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for practical projects. It is simpler and more understandable than other algorithms; it is described completely enough to implement a realistic system; it has several open-source implementations and is used by several companies; its safety properties have been proven; and its efficiency is comparable to other algorithms.

The remainder of the paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section 3), describes our general approach to understandability (Section 4), presents the Raft consensus algorithm (Sections 5-8), evaluates Raft (Section 9), and discusses related work (Section 10).

2 Replicated state machines

Consensus algorithms typically arise in the context of replicated state machines (cf. 37). In this approach, state machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down. Replicated state machines are used to solve a variety of fault-tolerance problems in distributed systems. For example, large-scale systems that have a single cluster leader, such as GFS, HDFS, and RAMCloud, typically use a separate replicated state machine to manage leader election and store configuration information that must survive leader crashes. Examples of such replicated state machines include Chubby and ZooKeeper.

Figure 1: Replicated state machine architecture. The consensus algorithm manages a replicated log containing commands from clients. The state machines process identical sequences of commands from their logs, so they produce the same results.

Replicated state machines are typically implemented using a replicated log, as shown in Figure 1. Each server stores a log containing a series of commands, which its state machine executes in order. Each log contains the same commands in the same order, so each server executes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.

Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients. As a result, the cluster of servers appears to form a single, highly reliable state machine.
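
To make this architecture concrete, here is a minimal Go sketch that applies a log of commands to a toy deterministic state machine. The LogEntry type and the "set key value" command format are illustrative assumptions, not part of any particular system:

```go
package main

import "fmt"

// LogEntry holds one state machine command; a consensus module
// ensures all servers hold identical logs.
type LogEntry struct {
	Term    int
	Command string
}

// kvStore is a toy deterministic state machine: a key-value map
// that interprets commands of the form "set key value".
type kvStore map[string]string

func (kv kvStore) Apply(cmd string) {
	var key, value string
	fmt.Sscanf(cmd, "set %s %s", &key, &value)
	kv[key] = value
}

func main() {
	log := []LogEntry{{1, "set x 1"}, {1, "set y 2"}, {2, "set x 3"}}
	sm := kvStore{}
	// Every server applies the same log in order, so every copy of
	// the state machine ends in the same state.
	for _, e := range log {
		sm.Apply(e.Command)
	}
	fmt.Println(sm) // map[x:3 y:2]
}
```

Because the state machine is deterministic, any server that replays this log arrives at exactly the same map; that is the whole contract the consensus algorithm must uphold.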

Consensus algorithms for practical systems typically have the following properties:

  • Safety (never returning an incorrect result): correctness is guaranteed under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.
  • Availability: the system is fully functional as long as a majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.
  • Independence from timing for log consistency: faulty clocks and extreme message delays can, at worst, cause availability problems, never incorrect results.
  • In the common case, a command completes as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers does not impact overall system performance.

3 Problems with Paxos

Over the last decade, Leslie Lamport’s Paxos algorithm has become almost synonymous with consensus: it is the protocol most commonly taught in courses, and most implementations of consensus use it as a starting point. Paxos first defines a protocol capable of reaching agreement on a single decision, such as a single replicated log entry; we refer to this subset as single-decree Paxos. Paxos then combines multiple instances of this protocol to facilitate a series of decisions. Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.

Unfortunately, Paxos has two significant drawbacks. The first is that Paxos is exceptionally difficult to understand. The full explanation is notoriously opaque; few people succeed in understanding it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms. These explanations focus only on the single-decree subset, yet they are still challenging. A survey of attendees at NSDI 2012 found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative algorithm, a process that took almost a year.

We hypothesize that Paxos’ opaqueness derives from its choice of the single-decree problem as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that do not have simple intuitive explanations and cannot be understood independently. Because of this, it is difficult to develop intuitions about why the single-decree protocol works. The composition rules for multi-decree Paxos add significant additional complexity. We believe that the overall problem of reaching consensus on multiple decisions (i.e., a log instead of a single entry) can be decomposed in other ways that are more direct and obvious.

The second problem with Paxos is that it does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-decree Paxos. Lamport’s descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-decree Paxos, but many details are missing. There have been several attempts to flesh out Paxos, but they differ from each other and from Lamport’s sketches. Systems such as Chubby have implemented Paxos-like algorithms, but in most cases their details have not been published.

Furthermore, the Paxos architecture is a poor one for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit to choosing a collection of log entries independently and then melding them into a sequential log; this just adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (though it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to first elect a leader, then have the leader coordinate the decisions.

As a result, practical systems bear little resemblance to Paxos. Each implementation begins with Paxos, discovers the difficulties in implementing it, and then develops a significantly different architecture. This is time-consuming and error-prone, and the difficulty of understanding Paxos exacerbates the problem. Paxos has been proven correct in theory, but real implementations are so different from Paxos that the proofs have little value. The following comment from Chubby’s implementers is typical:

There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system; the final system will be based on an unproven protocol.

Because of these problems, we concluded that Paxos does not provide a good foundation either for building practical systems or for education. Given the importance of consensus in large-scale software systems, we decided to see if we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.

4 Designing for understandability

We had several goals in designing Raft: it must provide a complete and practical foundation for system building, so that it significantly reduces the amount of design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal, and our most difficult challenge, was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.

There were numerous points in the design of Raft where we had to choose among alternative approaches. In these situations we evaluated the alternatives based on understandability: how hard is each alternative to explain (for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a reader to completely understand the approach and its implications?

We recognize that there is a high degree of subjectivity in such analysis; nonetheless, we used two techniques that are generally applicable. The first is the well-known approach of problem decomposition: wherever possible, we divided problems into separate pieces that could be solved, explained, and understood relatively independently. For example, Raft is separated into leader election, log replication, safety, and membership changes.

Our second approach was to simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible. Specifically, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with each other. Although in most cases we tried to eliminate nondeterminism, there are some situations where nondeterminism actually improves understandability. In particular, randomized approaches introduce nondeterminism, but they tend to reduce the state space by handling all possible choices in a similar fashion. We used randomization to simplify Raft’s leader election algorithm.

5 The Raft consensus algorithm

Raft is an algorithm for managing a replicated log of the form described in Section 2. Figure 2 summarizes the algorithm in condensed form for reference, and Figure 3 lists key properties of the algorithm; the elements of these figures are discussed piecewise over the rest of this section.

Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. Having a leader greatly simplifies the management of the replicated log. For example, the leader can decide where to place new entries in the log without consulting other servers, and data flows in a simple fashion from the leader to other servers. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.

Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems, which are discussed in the subsections that follow:

  • Leader election: a new leader must be chosen when the existing leader fails (Section 5.2).
  • Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own (Section 5.3).
  • Safety: the key safety property for Raft is the State Machine Safety Property shown in Figure 3: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index. Section 5.4 describes how Raft ensures this property; the solution involves an additional restriction on the election mechanism described in Section 5.2.

After presenting the consensus algorithm, this section discusses the issue of availability and the role of timing in the system.

State:

Persistent state on all servers (updated on stable storage before responding to RPCs):
  currentTerm: latest term the server has seen (initialized to 0, increases monotonically)
  votedFor: candidateId that received this server’s vote in the current term (or null if none)
  log[]: log entries; each entry contains a command for the state machine and the term when the entry was received by the leader

Volatile state on all servers:
  commitIndex: index of the highest log entry known to be committed (initialized to 0, increases monotonically)
  lastApplied: index of the highest log entry applied to the state machine (initialized to 0, increases monotonically)

Volatile state on leaders (reinitialized after election):
  nextIndex[]: for each server, index of the next log entry to send to that server (initialized to the leader’s last log index plus one)
  matchIndex[]: for each server, index of the highest log entry known to be replicated on that server
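
As a rough guide to implementation, the state above maps naturally onto a struct. The following Go sketch (continuing the example from Section 2; field names follow Figure 2, but the layout is an assumption) separates the persistent fields, which must reach stable storage before a server answers an RPC, from the volatile ones:

```go
// State is one of the three server roles.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Raft holds the Figure 2 state for a single server.
type Raft struct {
	me int // this server's id

	// Persistent state on all servers (written to stable storage
	// before responding to RPCs).
	currentTerm int        // latest term this server has seen
	votedFor    int        // candidateId voted for in currentTerm (-1 if none)
	log         []LogEntry // log entries; index 1 is stored at log[0]

	// Volatile state on all servers.
	commitIndex int // highest entry known to be committed
	lastApplied int // highest entry applied to the state machine

	// Volatile state on leaders (reinitialized after each election).
	nextIndex  []int // per follower: next log index to send
	matchIndex []int // per follower: highest index known replicated

	state State
}
```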

AppendEntries RPC:

Invoked by the leader to replicate log entries (Section 5.3); also used as a heartbeat (Section 5.2).

Arguments:
  term: the leader’s term
  leaderId: so followers can redirect client requests to the leader
  prevLogIndex: index of the log entry immediately preceding the new ones
  prevLogTerm: term of the entry at prevLogIndex
  entries[]: log entries to store (empty for heartbeat; may send more than one at a time for efficiency)
  leaderCommit: the leader’s commitIndex

Results:
  term: currentTerm, for the leader to update itself
  success: true if the follower contained an entry matching prevLogIndex and prevLogTerm

Receiver implementation:

  1. Reply false if term < currentTerm (Section 5.1)
  2. Reply false if the log contains no entry at prevLogIndex whose term matches prevLogTerm (Section 5.3)
  3. If an existing entry conflicts with a new one (same index but different term), delete the existing entry and all that follow it (Section 5.3)
  4. Append any new entries not already in the log
  5. If leaderCommit > commitIndex, set commitIndex to the smaller of leaderCommit and the index of the last new entry
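
A direct, simplified translation of these receiver steps into Go might look as follows, continuing the sketch above. A real implementation must also persist currentTerm, votedFor, and the log before replying, reset its election timer, and step down if args.Term is newer; those details are omitted here:

```go
type AppendEntriesArgs struct {
	Term         int
	LeaderId     int
	PrevLogIndex int
	PrevLogTerm  int
	Entries      []LogEntry
	LeaderCommit int
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}

func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	reply.Term = rf.currentTerm

	// 1. Reply false if term < currentTerm (Section 5.1).
	if args.Term < rf.currentTerm {
		reply.Success = false
		return
	}
	// 2. Reply false if there is no entry at prevLogIndex whose term
	//    matches prevLogTerm (Section 5.3). Index 1 lives at log[0].
	if args.PrevLogIndex > 0 &&
		(args.PrevLogIndex > len(rf.log) ||
			rf.log[args.PrevLogIndex-1].Term != args.PrevLogTerm) {
		reply.Success = false
		return
	}
	// 3./4. Delete conflicting entries, then append any new ones.
	for i, e := range args.Entries {
		idx := args.PrevLogIndex + 1 + i // 1-based index of this entry
		if idx <= len(rf.log) && rf.log[idx-1].Term != e.Term {
			rf.log = rf.log[:idx-1] // conflict: truncate the suffix
		}
		if idx > len(rf.log) {
			rf.log = append(rf.log, e)
		}
	}
	// 5. Advance commitIndex to min(leaderCommit, last new entry).
	if args.LeaderCommit > rf.commitIndex {
		rf.commitIndex = min(args.LeaderCommit, args.PrevLogIndex+len(args.Entries))
	}
	reply.Success = true
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}
```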

RequestVote RPC:

Invoked by candidates to gather votes (Section 5.2).

Arguments:
  term: the candidate’s term
  candidateId: the candidate requesting the vote
  lastLogIndex: index of the candidate’s last log entry
  lastLogTerm: term of the candidate’s last log entry

Results:
  term: currentTerm, for the candidate to update itself
  voteGranted: true means the candidate received this server’s vote

Receiver implementation:

  1. Reply false if term < currentTerm (Section 5.2)
  2. If votedFor is null or candidateId, and the candidate’s log is at least as up-to-date as the receiver’s log, grant the vote (Sections 5.2, 5.4)
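
These two rules fit in a short handler. The Go sketch below continues the earlier example; candidateLogUpToDate encodes the up-to-date comparison from Section 5.4.1 (a sketch of it appears there), and persistence and term step-down are again omitted:

```go
type RequestVoteArgs struct {
	Term         int
	CandidateId  int
	LastLogIndex int
	LastLogTerm  int
}

type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
	reply.Term = rf.currentTerm

	// 1. Reply false if term < currentTerm (Section 5.1).
	if args.Term < rf.currentTerm {
		reply.VoteGranted = false
		return
	}
	// 2. Grant the vote only if we have not already voted for someone
	//    else this term and the candidate's log is at least as
	//    up-to-date as ours (Sections 5.2, 5.4).
	if (rf.votedFor == -1 || rf.votedFor == args.CandidateId) &&
		rf.candidateLogUpToDate(args.LastLogIndex, args.LastLogTerm) {
		rf.votedFor = args.CandidateId
		reply.VoteGranted = true
		return
	}
	reply.VoteGranted = false
}
```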

Rules for servers:

All servers:

  • If commitIndex > lastApplied: increment lastApplied and apply log[lastApplied] to the state machine (Section 5.3)
  • If an RPC request or response contains a term T > currentTerm: set currentTerm = T and convert to follower (Section 5.1)

Followers (Section 5.2):

  • Respond to RPCs from candidates and leaders
  • If the election timeout elapses without receiving a heartbeat from the current leader or granting a vote to a candidate: convert to candidate

Candidates (Section 5.2):

  • On conversion to candidate, start an election:
    • Increment currentTerm
    • Vote for self
    • Reset the election timer
    • Send RequestVote RPCs to all other servers
  • If votes are received from a majority of servers: become leader
  • If an AppendEntries RPC is received from the new leader: convert to follower
  • If the election timeout elapses: start a new election

Leaders:

  • Upon election: send initial empty AppendEntries RPCs (heartbeats) to each server; repeat during idle periods to prevent followers from timing out (Section 5.2)
  • If a command is received from a client: append the entry to the local log, then respond to the client after the entry has been applied to the state machine (Section 5.3)
  • If the last log index ≥ nextIndex for a follower: send an AppendEntries RPC with the log entries starting at nextIndex
    • If successful: update the follower’s nextIndex and matchIndex
    • If AppendEntries fails because of log inconsistency: decrement nextIndex and retry
  • If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm: set commitIndex = N (Sections 5.3, 5.4)
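
The last leader rule is the trickiest to read in prose; as a sketch, it can be written as a small scan over the log (continuing the earlier Go example, and assuming matchIndex holds one entry per other server, with the leader counting itself separately):

```go
// advanceCommitIndex finds the largest N > commitIndex such that
// log[N].term == currentTerm and a majority of servers store entry N,
// then sets commitIndex = N (Sections 5.3, 5.4.2).
func (rf *Raft) advanceCommitIndex() {
	for n := len(rf.log); n > rf.commitIndex; n-- {
		if rf.log[n-1].Term != rf.currentTerm {
			break // earlier entries have earlier terms (Section 5.4.2)
		}
		count := 1 // the leader always stores its own entries
		for _, m := range rf.matchIndex {
			if m >= n {
				count++
			}
		}
		if count*2 > len(rf.matchIndex)+1 {
			rf.commitIndex = n
			return
		}
	}
}
```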

Figure 2: A condensed summary of the Raft consensus algorithm (excluding membership changes and log compaction).

Election Safety: at most one leader can be elected in a given term (Section 5.2).
Leader Append-Only: a leader never overwrites or deletes entries in its log; it only appends new entries (Section 5.3).
Log Matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index (Section 5.3).
Leader Completeness: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms (Section 5.4).
State Machine Safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index (Section 5.4.3).

Figure 3: Raft guarantees that each of these properties is true at all times.

5.1 Raft basics

A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader as described in Section 5.2. Figure 4 shows the states and their transitions; the transitions are discussed below.

Figure 4: Server states. Followers only respond to requests from other servers. If a follower receives no communication, it becomes a candidate and initiates an election. A candidate that receives votes from a majority of the cluster becomes the new leader. Leaders typically operate until they fail.

Figure 5: Time is divided into terms, and each term begins with an election. After a successful election, a single leader manages the cluster until the end of the term. Some elections fail, in which case the term ends without choosing a leader. The transitions between terms may be observed at different times on different servers.

Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, it serves as leader for the rest of the term. In some situations an election results in a split vote; in this case the term ends with no leader, and a new term (with a new election) begins shortly. Raft ensures that there is at most one leader in a given term.

Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even an entire term. Terms act as a logical clock in Raft, allowing servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than another’s, it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.
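
This rule is small but ubiquitous; in code, every RPC handler and every reply path would begin with something like the following sketch (continuing the Go example from Figure 2):

```go
// maybeStepDown implements the term rule from Section 5.1: any RPC
// request or response carrying a larger term makes this server adopt
// that term and revert to follower.
func (rf *Raft) maybeStepDown(rpcTerm int) {
	if rpcTerm > rf.currentTerm {
		rf.currentTerm = rpcTerm
		rf.votedFor = -1 // no vote cast yet in the new term
		rf.state = Follower
	}
}
```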

Raft servers communicate using remote procedure calls (RPCs), and the basic consensus algorithm requires only two types of RPCs. RequestVote RPCs are initiated by candidates during elections (Section 5.2), and AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat (Section 5.3). Section 7 adds a third RPC for transferring snapshots between servers. Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.

5.2 Leader election

Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new one.

To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or (c) a period of time goes by with no winner. These outcomes are discussed separately in the paragraphs below.
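
Put together with the candidate rules from Figure 2, the start of an election might look like this Go sketch; sendRequestVote is a placeholder for the RPC transport, and vote tallying and timer resets are omitted:

```go
// startElection increments the term, votes for self, and requests
// votes from every peer in parallel. A real implementation would
// tally the replies, become leader on a majority, and step down on
// seeing a larger term.
func (rf *Raft) startElection(peers []int) {
	rf.state = Candidate
	rf.currentTerm++    // new term for this election
	rf.votedFor = rf.me // vote for self
	args := &RequestVoteArgs{
		Term:         rf.currentTerm,
		CandidateId:  rf.me,
		LastLogIndex: len(rf.log),
	}
	if len(rf.log) > 0 {
		args.LastLogTerm = rf.log[len(rf.log)-1].Term
	}
	for _, p := range peers {
		go func(p int) { sendRequestVote(p, args) }(p)
	}
}

// sendRequestVote is a stub standing in for the network layer.
func sendRequestVote(peer int, args *RequestVoteArgs) *RequestVoteReply {
	return &RequestVoteReply{}
}
```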

A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes). The majority rule ensures that at most one candidate can win the election for a particular term (the Election Safety Property in Figure 3). Once a candidate wins an election, it becomes leader. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If that leader’s term (included in the RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state.

The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term. However, without extra measures, split votes could repeat indefinitely.

Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly when they do occur. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150-300 milliseconds). This spreads out the servers so that in most cases only a single server will time out; it then wins the election and sends heartbeats before any other server times out. The same mechanism is used to handle split votes: each candidate restarts its randomized election timeout at the start of an election and waits for that timeout to elapse before starting the next election, which reduces the likelihood of another split vote in the new election. Section 9.3 shows that this approach elects a leader rapidly.
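
One way to realize the randomized timeout is a small helper like this Go sketch; the 150-300 ms interval is the example range mentioned above:

```go
import (
	"math/rand"
	"time"
)

// electionTimeout picks a fresh timeout uniformly at random from a
// fixed interval, so that in most cases a single server times out
// first and wins the election before anyone else notices the leader
// is gone.
func electionTimeout() time.Duration {
	const base = 150 * time.Millisecond
	return base + time.Duration(rand.Int63n(int64(base)))
}
```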

Elections are an example of how understandability guided our choice between design alternatives. Initially we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between competing candidates. If a candidate discovered another candidate with a higher rank, it would return to follower state so that the higher-ranking candidate could more easily win the next election. We found that this approach created subtle availability issues (if the higher-ranked server failed, a lower-ranked server might need to time out and become a candidate again, but if it did so too soon, it could reset progress towards electing a leader). We made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually we concluded that the randomized retry approach was more obvious and more understandable.

5.3 Log replication

Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.

Figure 6: Logs are composed of sequentially numbered entries. Each entry contains the term in which it was created (the number in each box) and a command for the state machine. An entry is considered committed if it is safe for that entry to be applied to the state machines.

Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.

The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers (e.g., entry 7 in Figure 6). This also commits all preceding entries in the leader’s log, including entries created by previous leaders. Section 5.4 discusses some subtleties when applying this rule after leader changes, and it also shows that this definition of commitment is safe. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).

We designed the Raft log mechanism to maintain a high level of coherency between the logs on different servers. Not only does this simplify the system’s behavior and make it more predictable, but it is an important component of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching Property in Figure 3:

  • If two entries in different logs have the same index and term, then they store the same command.
  • If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries.

During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which followers’ logs may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms.

Figure 7: When the leader at the top comes to power, any of scenarios (a-f) could occur in follower logs. Each box represents one log entry; the number inside is its term. A follower may be missing entries (a-b), may have extra uncommitted entries (c-d), or both (e-f). For example, scenario (f) could occur if that server was the leader for term 2, appended several entries to its log, then crashed before committing any of them; it restarted quickly, became leader for term 3, and added a few more entries to its log; before any of the entries in either term 2 or term 3 were committed, the server crashed again and remained down for several terms.

In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 shows that this is safe when coupled with one more restriction.

To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last entry in its own log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After the rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries succeeds, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.

If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term. With this information, the leader can decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC will be required per term with conflicting entries, rather than one RPC per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.
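
The paper leaves the encoding of this optimization open; one common shape for it, sketched in Go under the assumption that the follower returns a conflictTerm and conflictIndex alongside its rejection, is:

```go
// onAppendEntriesRejected lets the leader skip an entire term of
// conflicting entries instead of decrementing nextIndex one entry at
// a time. conflictTerm is the term of the follower's entry at
// prevLogIndex; conflictIndex is the first index the follower stores
// for that term (both are assumed extensions to the reply).
func (rf *Raft) onAppendEntriesRejected(peer, conflictTerm, conflictIndex int) {
	// If the leader also has entries from conflictTerm, resend starting
	// just past its last entry for that term; otherwise skip the term.
	next := conflictIndex
	for i := len(rf.log); i >= 1; i-- {
		if rf.log[i-1].Term == conflictTerm {
			next = i + 1
			break
		}
	}
	rf.nextIndex[peer] = next
}
```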

With this mechanism, a leader does not need to take any special actions to restore log consistency when it comes to power. It just begins normal operation, and the logs automatically converge in response to failures of the AppendEntries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).

This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry can be replicated to a majority of the cluster with a single round of RPCs; and a single slow follower will not impact overall performance.

5.4 Safety

The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries, then be elected leader and overwrite those entries with new ones; as a result, different state machines might execute different command sequences.

This section completes the Raft algorithm by adding a restriction on which servers may be elected leader. The restriction ensures that the leader for any given term contains all of the entries committed in previous terms (the Leader Completeness Property from Figure 3). Given the election restriction, we then make the rules for commitment more precise. Finally, we present a proof sketch for the Leader Completeness Property and show how it leads to correct behavior of the replicated state machine.

5.4.1 Election restrictions

In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. In some consensus algorithms, such as Viewstamped Replication, a node can be elected leader even if it does not initially contain all of the committed entries. These algorithms contain additional mechanisms to identify the missing entries and transmit them to the new leader, either during the election process or shortly afterwards. Unfortunately, this results in considerable additional mechanism and complexity. Raft uses a simpler approach: it guarantees that all committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their own logs.

Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present on at least one of the servers in that majority. If the candidate’s log is at least as up-to-date as any other log in that majority (where "up-to-date" is defined precisely below), then it will hold all of the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate’s log, and a voter denies its vote if its own log is more up-to-date than that of the candidate.

Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the last entries have different terms, then the log with the later term is more up-to-date. If the last entries have the same term, then the longer log is more up-to-date.
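
This comparison is small enough to state directly in code; the following Go sketch is the helper assumed by the RequestVote handler in Section 5.2 (continuing the earlier example):

```go
// candidateLogUpToDate reports whether a candidate's log is at least
// as up-to-date as ours: its last entry has a larger term, or the
// same term and an equal-or-greater index (Section 5.4.1).
func (rf *Raft) candidateLogUpToDate(lastLogIndex, lastLogTerm int) bool {
	myLastIndex := len(rf.log)
	myLastTerm := 0
	if myLastIndex > 0 {
		myLastTerm = rf.log[myLastIndex-1].Term
	}
	if lastLogTerm != myLastTerm {
		return lastLogTerm > myLastTerm
	}
	return lastLogIndex >= myLastIndex
}
```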

5.4.2 Committing entries from previous terms

As described in Section 5.3, a leader knows that an entry from its current term is committed once that entry is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers. Figure 8 illustrates a situation where an old log entry is stored on a majority of servers, yet can still be overwritten by a future leader.

Figure 8: A time sequence showing why a leader cannot determine commitment using log entries from older terms. In (a) S1 is leader and partially replicates the log entry at index 2. In (b) S1 crashes; S5 is elected leader for term 3 with votes from S3, S4, and itself, and accepts a different entry at log index 2. In (c) S5 crashes; S1 restarts, is elected leader, and continues replication. At this point, the log entry from term 2 has been replicated on a majority of the servers, but it is not committed. If S1 crashes as in (d), S5 could be elected leader (with votes from S2, S3, and S4) and overwrite the entry with its own entry from term 3. However, if S1 replicates an entry from its current term on a majority of the servers before crashing, as in (e), then this entry is committed (S5 cannot win an election). At this point all preceding entries in the log are committed as well.

To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, all prior entries are committed indirectly because of the Log Matching Property. There are some situations where a leader could safely conclude that an older log entry is committed (for example, if that entry is stored on every server), but Raft takes a more conservative approach for simplicity.

Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers when a leader replicates entries from previous terms. In other consensus algorithms, if a new leader re-replicates entries from prior terms, it must do so with its new term number. Raft’s approach makes it easier to reason about log entries, since they maintain the same term number over time and across logs. In addition, new leaders in Raft send fewer log entries from previous terms than in other algorithms (other algorithms must send redundant log entries to renumber them before they can be committed).

5.4.3 Safety argument

Given the complete Raft algorithm, we can now argue more precisely that the Leader Completeness Property holds (this argument is based on the safety proof in Section 9.2). We assume that the Leader Completeness Property does not hold, then we derive a contradiction. Suppose the leader for term T (leader T) commits a log entry from its term, but that entry is not stored by the leader of some future term. Consider the smallest term U > T whose leader (leader U) does not store the entry.

Figure 9: If S1 (leader for term T) commits a new log entry from its term, and S5 is elected leader for a later term U, then there must be at least one server (S3) that accepted the log entry and also voted for S5.

  1. The committed entry must have been absent from leader U’s log at the time of its election (leaders never delete or overwrite entries).
  2. Leader T replicated the entry on a majority of the cluster, and leader U received votes from a majority of the cluster. Thus, at least one server (the "voter") both accepted the entry from leader T and voted for leader U, as shown in Figure 9. The voter is key to reaching a contradiction.
  3. The voter must have accepted the committed entry from leader T before voting for leader U; otherwise it would have rejected the AppendEntries request from leader T (its current term would have been higher than T).
  4. The voter still stored the entry when it voted for leader U, since every intervening leader contained the entry (by assumption), leaders never remove entries, and followers only remove entries when they conflict with the leader.
  5. The voter granted its vote to leader U, so leader U’s log must have been at least as up-to-date as the voter’s. This leads to one of two contradictions.
  6. First, if the voter and leader U shared the same last log term, then leader U’s log must have been at least as long as the voter’s, so its log contained every entry in the voter’s log. This is a contradiction, since the voter contained the committed entry and leader U was assumed not to.
  7. Otherwise, leader U’s last log term must have been larger than the voter’s. Moreover, it was larger than T, since the voter’s last log term was at least T (it contains the committed entry from term T). The earlier leader that created leader U’s last log entry must have contained the committed entry in its log (by assumption, leader U is the first leader not to contain the entry). Then, by the Log Matching Property, leader U’s log must also contain the committed entry, which is a contradiction.
  8. This completes the contradiction. Thus, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.
  9. The Log Matching Property guarantees that future leaders will also contain entries that are committed indirectly, such as index 2 in Figure 8(d).

Given the Leader Completeness Property, we can prove the State Machine Safety Property from Figure 3, which states that if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. At the time a server applies a log entry to its state machine, its log must be identical to the leader’s log up through that entry, and the entry must be committed. Now consider the lowest term in which any server applies a given log index; the Leader Completeness Property guarantees that the leaders for all higher terms will store that same log entry, so servers that apply the index in later terms will apply the same value. Thus, the State Machine Safety Property holds.

Finally, Raft requires servers to apply entries in log index order. Combined with the State Machine Safety Property, this means that all servers apply exactly the same set of log entries to their state machines, in the same order.

5.5 Follower and candidate crashes

Until this point we have focused on leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, the RPCs will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so this causes no harm. For example, if a follower receives an AppendEntries request that includes log entries already present in its log, it simply ignores those entries in the new request.

5.6 Timing and availability

One of our requirements for Raft is that safety must not depend on timing: the system must not produce incorrect results just because some event happens more quickly or slowly than expected. However, availability (the ability of the system to respond to clients in a timely manner) must inevitably depend on timing. For example, if message exchanges take longer than the typical time between server crashes, candidates will not stay up long enough to win an election; without a steady leader, Raft cannot make progress.

Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement:

broadcastTime ≪ electionTimeout ≪ MTBF

In this inequality, broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses; electionTimeout is the election timeout described in Section 5.2; and MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers from starting elections; given the randomized approach to election timeouts, this inequality also makes split votes unlikely. The election timeout should be a few orders of magnitude less than MTBF so that the system makes steady progress. When the leader crashes, the system will be unavailable for roughly the election timeout; we would like this to represent only a small fraction of overall time.

The broadcast time and MTBF are properties of the underlying system, while the election timeout is something we must choose. Raft’s RPCs require the recipient to persist information to stable storage, so the broadcast time may range from 0.5 ms to 20 ms, depending on storage technology. As a result, the election timeout is likely to be somewhere between 10 ms and 500 ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.
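
For illustration only, here is one plausible set of constants satisfying the inequality; the specific numbers are assumptions chosen inside the ranges discussed above, not values mandated by Raft:

```go
import "time"

// Illustrative timing constants for broadcastTime << electionTimeout
// << MTBF. MTBF is measured in months, many orders of magnitude
// larger still, so it needs no constant here.
const (
	broadcastTime      = 5 * time.Millisecond   // one parallel RPC round, with persistence
	electionTimeoutMin = 150 * time.Millisecond // low end of the randomized interval
	electionTimeoutMax = 300 * time.Millisecond // high end of the randomized interval
)
```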

6 Cluster membership changes

Up until now we have assumed that the cluster configuration (the set of servers participating in the consensus algorithm) is fixed. In practice, it will occasionally be necessary to change the configuration, for example to replace servers when they fail or to change the degree of replication. Although this can be done by taking the entire cluster offline, updating all configurations, and then restarting the cluster, this would leave the cluster unavailable during the changeover. In addition, if there are any manual steps, they risk operator error. To avoid these issues, we decided to automate configuration changes and incorporate them into the Raft consensus algorithm.

For the configuration change mechanism to be safe, there must be no point during the transition where it is possible for two leaders to be elected for the same term. Unfortunately, any approach where servers switch directly from the old configuration to the new configuration is unsafe. It is not possible to atomically switch all of the servers at once, so the cluster can potentially split into two independent majorities during the transition (see Figure 10).

Figure 10: Switching directly from one configuration to another is unsafe because different servers will switch at different times. In this example, the cluster grows from three servers to five. Unfortunately, there is a point in time where two different leaders can be elected for the same term, one with a majority of the old configuration and another with a majority of the new configuration.

In order to ensure safety, configuration changes must use a two-phase approach. There are a variety of ways to implement the two phases. For example, some systems use the first phase to disable the old configuration so it cannot process client requests; then the second phase enables the new configuration. In Raft the cluster first switches to a transitional configuration we call joint consensus; once the joint consensus has been committed, the system then transitions to the new configuration. The joint consensus combines both the old and new configurations:

  • Log entries are replicated to all servers in both configurations.
  • Any server from either configuration may serve as leader.
  • Agreement (for elections and entry commitment) requires separate majorities from both the old and the new configuration.

Joint consensus allows individual servers to transition between configurations at different times without compromising safety. Furthermore, joint consensus allows the cluster to continue servicing client requests throughout the configuration change.
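
The dual-majority agreement rule is the heart of joint consensus. As a sketch, a Go helper for checking it might look like this; the server sets and the vote bookkeeping are illustrative:

```go
// majority reports whether the servers marked true in votes form a
// strict majority of the given configuration (a slice of server ids).
func majority(votes map[int]bool, config []int) bool {
	count := 0
	for _, id := range config {
		if votes[id] {
			count++
		}
	}
	return count*2 > len(config)
}

// jointQuorum is the joint consensus agreement rule: a decision (an
// election or a commit) needs a majority of C-old and, separately, a
// majority of C-new.
func jointQuorum(votes map[int]bool, cOld, cNew []int) bool {
	return majority(votes, cOld) && majority(votes, cNew)
}
```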

Cluster configurations are stored and communicated using special entries in the replicated log; Figure 11 illustrates the configuration change process. When the leader receives a request to change the configuration from C-old to C-new, it stores the configuration for joint consensus (C-old,new in the figure) as a log entry and replicates that entry using the mechanisms described previously. Once a given server adds the new configuration entry to its log, it uses that configuration for all future decisions (a server always uses the latest configuration in its log, regardless of whether the entry has been committed). This means that the leader will use the rules of C-old,new to determine when the log entry for C-old,new is committed. If the leader crashes, a new leader may be chosen under either C-old or C-old,new, depending on whether the winning candidate has received C-old,new. In any case, C-new cannot make unilateral decisions during this period.

Once C-old,new has been committed, neither C-old nor C-new can make decisions without approval of the other, and the Leader Completeness Property ensures that only servers with the C-old,new log entry can be elected as leader. It is now safe for the leader to create a log entry describing C-new and replicate it to the cluster. Again, this configuration takes effect on each server as soon as it is seen. When the new configuration has been committed under the rules of C-new, the old configuration is irrelevant and servers not in the new configuration can be shut down. As shown in Figure 11, there is no time when C-old and C-new can both make unilateral decisions; this guarantees safety.

Figure 11: Timeline for a configuration change. Dashed lines show configuration entries that have been created but not committed, and solid lines show the latest committed configuration entry. The leader first creates the C-old,new configuration entry in its own log and commits it to C-old,new (a majority of C-old and a majority of C-new). Then it creates the C-new entry and commits it to a majority of C-new. There is no point in time at which C-old and C-new can both make decisions independently.

There are three more issues to address for reconfiguration. The first issue is that new servers may not initially store any log entries. If they are added to the cluster in this state, it could take quite a while for them to catch up, during which time it might not be possible to commit new log entries. In order to avoid this availability gap, Raft introduces an additional phase before the configuration change, in which the new servers join the cluster as non-voting members (the leader replicates log entries to them, but they are not counted toward majorities). Once the new servers have caught up with the rest of the cluster, the reconfiguration can proceed as described above.

The second issue is that the cluster leader may not be part of the new configuration. In this case, the leader steps down (returns to follower state) once it has committed the C-new log entry. This means that there will be a period of time (while it is committing C-new) when the leader is managing a cluster that does not include itself; it replicates log entries but does not count itself in majorities. The leader transition occurs when C-new is committed, because this is the first point at which the new configuration can operate independently (it will always be possible to choose a leader from C-new). Before this point, it may be the case that only a server from C-old can be elected leader.

The third issue is that removed servers (those not in C-new) can disrupt the cluster. These servers will no longer receive heartbeats, so they will time out and start new elections. They will then send RequestVote RPCs with new term numbers, and this will cause the current leader to revert to follower state. A new leader will eventually be elected, but the removed servers will time out again and the process will repeat, resulting in significantly reduced availability.

To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. This does not affect normal elections, since each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from removed servers: if a leader is able to get heartbeats to its cluster, then it will not be deposed by larger term numbers.

7 Log Compaction

Raft’s log grows during normal operation to absorb client requests, but in a practical system it cannot be allowed to grow without bound. As the log grows longer, it occupies more space and takes more time to replay. Without some mechanism to discard obsolete information that accumulates in the log, this will eventually cause availability problems.

Snapshotting is the simplest approach to compaction. In snapshotting, the entire current system state is written to a snapshot on stable storage, and the log up to that point is discarded. Snapshotting is used in Chubby and ZooKeeper, and the remainder of this section describes snapshotting in Raft.

Incremental approaches to compaction, such as log cleaning and log-structured merge trees, are also possible. These operate on only a fraction of the data at once, spreading the load of compaction over time. They first select a region that has accumulated many deleted or overwritten objects, then rewrite the live objects from that region and free it. This requires significant additional mechanism and complexity compared to snapshotting, which simplifies the problem by always operating on the entire data set. State machines can implement LSM trees using the same interface as snapshotting, but a log cleaning approach would require modifications to Raft.

Figure 12: A server replaces the committed entries in its log (indexes 1 through 5) with a new snapshot, which stores just the current state. The snapshot records the last included index and the last included term, which position it in the log.

Figure 12 shows the basic idea of snapshotting in Raft. Each server takes snapshots independently, covering just the committed entries in its log. Most of the work consists of the state machine writing its current state to the snapshot. Raft also includes a small amount of metadata in the snapshot: the last included index is the index of the last entry in the log that the snapshot replaces (the last entry the state machine had applied), and the last included term is the term of this entry. These are preserved to support the AppendEntries consistency check for the first log entry following the snapshot, since that check needs the previous index and term. To support cluster membership changes (Section 6), the snapshot also stores the latest configuration in the log as of the last included index. Once a server completes writing a snapshot, it may delete all log entries up through the last included index, as well as any prior snapshot.
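
The snapshot metadata and the resulting log truncation might look roughly like the sketch below. The field names follow the text above, but the Log type and its layout are illustrative assumptions, not the paper’s code.

```go
package raft

type Entry struct{ Index, Term int }

// Snapshot carries the metadata described above alongside the serialized
// state machine state.
type Snapshot struct {
	LastIncludedIndex int      // index of the last log entry the snapshot replaces
	LastIncludedTerm  int      // term of that entry
	Config            []string // latest cluster configuration (for Section 6)
	State             []byte   // serialized state machine state
}

type Log struct {
	firstIndex int // index of entries[0]
	entries    []Entry
	snapshot   Snapshot
}

// compact discards all entries covered by a freshly written snapshot;
// the previous snapshot may be deleted afterwards as well.
func (l *Log) compact(snap Snapshot) {
	keepFrom := snap.LastIncludedIndex - l.firstIndex + 1
	if keepFrom > len(l.entries) {
		keepFrom = len(l.entries)
	}
	l.entries = append([]Entry(nil), l.entries[keepFrom:]...)
	l.firstIndex = snap.LastIncludedIndex + 1
	l.snapshot = snap
}
```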

Although servers normally take snapshots independently, the leader must occasionally send snapshots to followers that lag behind. This happens when the leader has already discarded the next log entry that it needs to send to a follower. Fortunately, this situation is unlikely in normal operation: a follower that has kept up with the leader would already have this entry. However, an exceptionally slow follower or a new server joining the cluster (Section 6) would not. The way to bring such a follower up to date is for the leader to send it a snapshot over the network.

InstallSnapshot RPC:

Invoked by the leader to send chunks of a snapshot to a follower. Leaders always send chunks in order.

Parameter | Description
term | the leader’s term number
leaderId | ID of the leader, so the follower can redirect clients
lastIncludedIndex | the snapshot replaces all entries up through and including this index
lastIncludedTerm | term of the last entry included in the snapshot
offset | byte offset where this chunk is positioned in the snapshot file
data[] | raw bytes of the snapshot chunk, starting at offset
done | true if this is the last chunk

Result | Description
term | the current term number, for the leader to update itself

Receiver implementation:

  1. Reply immediately if term < currentTerm
  2. Create a new snapshot file if this is the first chunk (offset is 0)
  3. Write the data into the snapshot file at the given offset
  4. Reply and wait for more data chunks if done is false
  5. Save the snapshot file, discarding any existing or partial snapshot with a smaller index
  6. If an existing log entry has the same index and term as the snapshot’s last included entry, retain the log entries following it and reply
  7. Otherwise, discard the entire log
  8. Reset the state machine using the snapshot contents (and load the snapshot’s cluster configuration)

Figure 13: A summary of the InstallSnapshot RPC. Snapshots are split into chunks for transmission; each chunk gives the follower a sign of life, so it can reset its election timeout timer.

The leader uses a new RPC called InstallSnapshot to send snapshots to followers that are too far behind, as summarized in Figure 13. When a follower receives a snapshot with this RPC, it must decide what to do with its existing log entries. Usually the snapshot will contain new information not already in the follower’s log. In this case, the follower discards its entire log; it is all superseded by the snapshot and may contain uncommitted entries that conflict with it. If instead the follower receives a snapshot that describes a prefix of its log (due to retransmission or by mistake), then the log entries covered by the snapshot are deleted, but entries following the snapshot are still valid and must be retained.
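
Putting Figure 13 and the paragraph above together, a follower’s handler might look roughly like the sketch below. The types and the reset helper are invented for illustration; the chunk assembly, prefix retention, and log discard mirror the receiver steps listed in Figure 13.

```go
package raft

type InstallSnapshotArgs struct {
	Term              int
	LeaderID          string
	LastIncludedIndex int
	LastIncludedTerm  int
	Offset            int
	Data              []byte
	Done              bool
}

type LogEntry struct{ Index, Term int }

type Follower struct {
	currentTerm int
	log         []LogEntry
	pending     []byte // snapshot file being assembled from chunks
}

func (f *Follower) InstallSnapshot(args InstallSnapshotArgs) (term int) {
	if args.Term < f.currentTerm {
		return f.currentTerm // stale leader: reply immediately
	}
	if args.Offset == 0 {
		f.pending = nil // first chunk: start a new snapshot file
	}
	f.pending = append(f.pending, args.Data...) // leaders send chunks in order
	if !args.Done {
		return f.currentTerm // reply and wait for more chunks
	}
	// Snapshot complete. If the log still contains the snapshot's last
	// entry (same index and term), the snapshot describes a prefix of the
	// log: retain the entries that follow it. Otherwise discard the log.
	for i, e := range f.log {
		if e.Index == args.LastIncludedIndex && e.Term == args.LastIncludedTerm {
			f.log = f.log[i+1:]
			f.reset(f.pending)
			return f.currentTerm
		}
	}
	f.log = nil
	f.reset(f.pending)
	return f.currentTerm
}

// reset loads the state machine from snapshot contents (stubbed here).
func (f *Follower) reset(snapshot []byte) {}
```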

This snapshotting approach departs from Raft’s strong-leader principle, since followers can take snapshots without the knowledge of the leader. However, we think this departure is justified. While having a leader helps avoid conflicting decisions while reaching consistency, consistency has already been reached by the time a snapshot is taken, so no decisions conflict. Data still flows only from leaders to followers; followers merely reorganize their own data.

We considered an alternative leader-based approach in which only the leader would create a snapshot and then send it to each of its followers. However, this has two disadvantages. First, sending the snapshot to each follower would waste network bandwidth and slow the snapshotting process. Each follower already has the information needed to produce its own snapshots, and it is typically much cheaper for a server to produce a snapshot from its local state than to send and receive one over the network. Second, the leader’s implementation would be more complex. For example, the leader would need to send snapshots to followers in parallel with replicating new log entries to them, so as not to block new client requests.

Two further issues affect snapshotting performance. First, servers must decide when to snapshot. If a server snapshots too often, it wastes disk bandwidth and other resources; if it snapshots too infrequently, it risks exhausting its storage capacity, and it increases the time required to replay the log during restarts. One simple strategy is to take a snapshot when the log reaches a fixed size in bytes. If this size is set significantly larger than the expected size of a snapshot, the disk bandwidth overhead of snapshotting will be small.
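
The fixed-size strategy amounts to a one-line policy, sketched below; the threshold is an arbitrary example value, not a figure from the paper.

```go
package raft

// snapshotThresholdBytes should be set several times larger than an
// expected snapshot so that compaction overhead stays small.
const snapshotThresholdBytes = 8 << 20 // 8 MiB, an illustrative choice

// shouldSnapshot reports whether the log has grown large enough that
// compacting it into a snapshot is worthwhile.
func shouldSnapshot(logBytes int) bool {
	return logBytes >= snapshotThresholdBytes
}
```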

The second performance issue is that writing a snapshot can take a significant amount of time, and we do not want this to delay normal operations. The solution is to use copy-on-write techniques so that new updates can be accepted without impacting the snapshot being written. For example, state machines built with functional data structures naturally support this. Alternatively, the operating system’s copy-on-write support (such as fork on Linux) can be used to create an in-memory snapshot of the entire state machine (this is the approach our implementation uses).
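
For state machines that cannot rely on fork, one way to approximate copy-on-write in application code is to freeze the current state and divert new writes to an overlay while the snapshot is serialized. The sketch below is an illustrative variant of this idea, not the paper’s implementation (which uses fork on Linux).

```go
package raft

import "sync"

// SnapshotFSM freezes its base map while a snapshot is written; updates
// land in an overlay so the snapshot writer never races with clients.
type SnapshotFSM struct {
	mu           sync.Mutex
	base         map[string]string // frozen while a snapshot is in progress
	overlay      map[string]string // absorbs writes during snapshotting
	snapshotting bool
}

func NewSnapshotFSM() *SnapshotFSM {
	return &SnapshotFSM{base: make(map[string]string)}
}

func (f *SnapshotFSM) Set(key, val string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.snapshotting {
		f.overlay[key] = val
	} else {
		f.base[key] = val
	}
}

func (f *SnapshotFSM) Get(key string) (string, bool) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.snapshotting {
		if v, ok := f.overlay[key]; ok {
			return v, true
		}
	}
	v, ok := f.base[key]
	return v, ok
}

// BeginSnapshot freezes base and returns it; the caller may serialize it
// to disk without holding the lock, since no writes touch base until
// EndSnapshot is called.
func (f *SnapshotFSM) BeginSnapshot() map[string]string {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.snapshotting = true
	f.overlay = make(map[string]string)
	return f.base
}

// EndSnapshot merges the overlay back into base once the snapshot is
// safely on disk.
func (f *SnapshotFSM) EndSnapshot() {
	f.mu.Lock()
	defer f.mu.Unlock()
	for k, v := range f.overlay {
		f.base[k] = v
	}
	f.overlay = nil
	f.snapshotting = false
}
```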

8 Client Interaction

This section describes how clients interact with Raft, including how clients find the cluster leader and how Raft supports linearizable semantics. These issues apply to all consistency-based systems, and Raft’s solutions are similar to those of other systems.

Clients in Raft send all of their requests to the leader. When a client first starts up, it connects to a randomly chosen server. If the client’s first choice is not the leader, that server will reject the client’s request and supply information about the most recent leader it has heard from (AppendEntries requests include the network address of the leader). If the leader crashes, client requests will time out; clients then retry with randomly chosen servers.

Our goal for Raft is to implement linearizable semantics (each operation appears to execute instantaneously, exactly once, at some point between its invocation and its response). However, as described so far, Raft can execute a command multiple times: for example, if the leader crashes after committing a log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time. The solution is for clients to assign unique serial numbers to every command. Then, the state machine tracks the latest serial number processed for each client, along with the associated response. If it receives a command whose serial number has already been executed, it responds immediately without re-executing the request.
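
A sketch of this duplicate-detection scheme at the state machine level; all names are invented, and the essential state is one (last serial number, last response) pair per client.

```go
package raft

type Command struct {
	ClientID string
	Seq      int // client-assigned serial number, increasing per client
	Op       string
}

type session struct {
	lastSeq  int
	lastResp string
}

type StateMachine struct {
	sessions map[string]session
}

func NewStateMachine() *StateMachine {
	return &StateMachine{sessions: make(map[string]session)}
}

// Apply executes a committed command at most once per (client, seq):
// a retried command returns the cached response instead of re-executing.
func (sm *StateMachine) Apply(cmd Command) string {
	if s, ok := sm.sessions[cmd.ClientID]; ok && cmd.Seq <= s.lastSeq {
		return s.lastResp
	}
	resp := sm.execute(cmd.Op)
	sm.sessions[cmd.ClientID] = session{lastSeq: cmd.Seq, lastResp: resp}
	return resp
}

// execute applies the operation to the actual state (stubbed here).
func (sm *StateMachine) execute(op string) string { return "ok" }
```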

Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware. Linearizable reads must not return stale data, and Raft needs two extra precautions to guarantee this without using the log. First, a leader must have the latest information on which entries are committed. The Leader Completeness Property guarantees that a leader has all committed entries, but at the start of its term it may not know which those are. To find out, it needs to commit an entry from its term. Raft handles this by having each leader commit a blank no-op entry into the log at the start of its term. Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests. Alternatively, the leader could rely on the heartbeat mechanism to provide a form of lease, but this would rely on timing for safety (it assumes bounded clock skew).
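
The two safeguards combine into a small amount of leader-side logic, sketched below. The types, error values, and the heartbeatMajority stub are invented for this illustration; heartbeatMajority stands in for a real round of heartbeats to a majority of the cluster.

```go
package raft

import "errors"

var (
	errNotReady = errors.New("no entry from the leader's term committed yet")
	errDeposed  = errors.New("leader deposed by a newer term")
)

type Leader struct {
	currentTerm    int
	lastCommitTerm int // term of the most recently committed entry
}

// heartbeatMajority stands in for exchanging heartbeats with a majority
// of the cluster and reporting whether this leader is still in charge.
func (l *Leader) heartbeatMajority() bool { return true }

// ReadOnly serves a linearizable read without appending to the log.
func (l *Leader) ReadOnly(query func() string) (string, error) {
	// Safeguard 1: a new leader first commits a blank no-op entry for its
	// term; until then it cannot know which entries are committed.
	if l.lastCommitTerm != l.currentTerm {
		return "", errNotReady
	}
	// Safeguard 2: confirm with a majority that no newer leader exists
	// before answering.
	if !l.heartbeatMajority() {
		return "", errDeposed
	}
	return query(), nil
}
```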

9 Implementation and Evaluation

We have implemented Raft as part of a replicated state machine that stores configuration information for RAMCloud and assists in failover of the RAMCloud coordinator. This Raft implementation contains roughly 2000 lines of C++ code, not including tests, comments, or blank lines. The source code is freely available. There are also about 25 independent third-party open-source implementations of Raft, in various stages of development, based on drafts of this paper. Furthermore, various companies are deploying Raft-based systems.

This section evaluates the Raft algorithm in three areas: understandability, correctness, and performance.

9.1 Understandability

To measure Raft’s understandability relative to Paxos, we conducted an experimental study using upper-level undergraduate and graduate students in an Advanced Operating Systems course at Stanford University and a Distributed Computing course at U.C. Berkeley. We recorded a video lecture of Raft and another of Paxos, and created corresponding quizzes. The Raft lecture covered the content of this paper except for log compaction; the Paxos lecture covered enough material to create an equivalent replicated state machine, including single-decree Paxos, multi-decree Paxos, reconfiguration, and a few optimizations needed in practice (such as leader election). The quizzes tested basic understanding of the algorithms and also required students to reason about corner cases. Each student watched one video, took the corresponding quiz, watched the second video, and took the second quiz. About half of the participants did the Paxos portion first and the other half did the Raft portion first, in order to account both for individual differences in performance and for experience gained from the first part of the study. We compared participants’ scores on each quiz to determine whether participants showed a better understanding of Raft.

We tried to make the comparison between Paxos and Raft as fair as possible. The experiment favored Paxos in two ways: 15 of the 43 participants reported having some prior experience with Paxos, and the Paxos video is 14% longer than the Raft video. As summarized in Table 1, we took steps to mitigate potential sources of bias. All of our materials are available for review.

Concern | Steps taken to mitigate bias | Materials for review
Equal lecture quality | Same lecturer for both. The Paxos lecture was based on and improved from materials already in use in several universities. The Paxos lecture is 14% longer. | videos
Equal quiz difficulty | Questions were grouped by difficulty and paired across the two quizzes. | quizzes
Fair grading | A rubric was used. Quizzes were graded in random order, alternating between the two quizzes. | rubric

Table 1: Concerns of possible bias against Paxos in the study, the steps taken to mitigate each, and the corresponding materials available for review.

On average, participants scored 4.9 points higher on the Raft quiz than on the Paxos quiz (out of a possible 60 points, the mean Raft score was 25.7 and the mean Paxos score was 20.8); Figure 14 shows their individual scores. A paired t-test states that, with 95% confidence, the true distribution of Raft scores has a mean at least 2.5 points larger than that of Paxos scores.

Figure 14: A scatter plot comparing 43 participants’ performance on the Raft and Paxos quizzes. Points above the diagonal represent participants who scored higher on Raft.

We also created a linear regression model that predicts a new student’s quiz scores based on three factors: which quiz they took, their degree of prior Paxos experience, and the order in which they learned the algorithms. The model predicts that the choice of quiz produces a 12.5-point difference in favor of Raft. This is significantly higher than the observed difference of 4.9 points, because many of the actual students had prior Paxos experience, which helped Paxos considerably while helping Raft slightly less. Curiously, the model also predicts scores 6.3 points lower on Raft for people who had already taken the Paxos quiz; although we don’t know why, this does appear to be statistically significant.

We also surveyed participants after their quizzes to see which algorithm they felt would be easier to implement or explain; these results are shown in Figure 15. An overwhelming majority of participants reported that Raft would be easier to implement and explain (33 of 41). However, these self-reported feelings may be less reliable than participants’ quiz scores, and participants may have been biased by knowledge of our hypothesis that Raft is easier to understand.

Figure 15: Using a 5-point scale, participants were asked (left) which algorithm they felt would be easier to implement in a functioning, correct, and efficient system, and (right) which would be easier to explain to students.

A more detailed discussion of the Raft user study is available.

9.2 Correctness

We have developed a formal specification and a proof of safety for the consistency mechanism described in Section 5. The formal specification makes the information summarized in Figure 2 completely precise using the TLA+ specification language. It is about 400 lines long and serves as the subject of the proof; it is also useful in its own right for anyone implementing Raft. We have mechanically proven the Log Completeness Property using the TLA proof system. However, this proof relies on invariants that have not been mechanically checked (for example, we have not yet proven the type safety of the specification). Furthermore, we have written an informal proof of the State Machine Safety property that is complete and relatively precise (about 3500 words long).

9.3 Performance

Raft’s performance is similar to that of other consistency algorithms such as Paxos. The most important case for performance is when an established leader is replicating new log entries. Raft achieves this using the minimal number of messages (a single round from the leader to half of the cluster). Further improvements to Raft’s performance are also possible. For example, it easily supports batching and pipelining requests for higher throughput and lower latency. Various optimizations have been proposed in the literature for other consistency algorithms; many of these could be applied to Raft as well, but we leave this to future work.

We used our Raft implementation to measure the performance of Raft’s leader election algorithm and answer two questions. First, does the election process converge quickly? Second, what is the minimum downtime that can be achieved after a leader crash?

Figure 16: The time to detect and replace a crashed leader. The top graph varies the amount of randomness in election timeouts, and the bottom graph scales the minimum election timeout. Each line represents 1000 trials (except for 100 trials for ‘150-150ms’) and corresponds to a particular choice of election timeouts; for example, ‘150-155ms’ means that election timeouts were chosen randomly and uniformly from that range. The measurements were taken on a cluster of five servers with a broadcast delay of roughly 15ms. Results for a cluster of nine servers are similar.

To measure leader election, we repeatedly crashed the leader of a cluster of five servers and timed how long it took to detect the crash and elect a new leader (see Figure 16). To generate a worst-case scenario, the servers had different log lengths in each trial, so some candidates were not eligible to become leader. Furthermore, to encourage split votes, our test script triggered a synchronized broadcast of heartbeats from the leader before terminating its process (this approximates the leader replicating a new log entry prior to crashing). The leader was crashed uniformly at random within its heartbeat interval, which was half of the minimum election timeout in all tests. Thus, the smallest possible downtime was about half of the minimum election timeout.

The top graph in Figure 16 shows that a small amount of randomization in the election timeout is enough to avoid split votes in elections. In the absence of randomness, leader election consistently took longer than 10 seconds in our tests due to many split votes. Adding just 5ms of randomness helped significantly, resulting in a median downtime of 287ms. Using more randomness improves worst-case behavior: with 50ms of randomness, the worst-case completion time (over 1000 trials) was 513ms.

The bottom graph in Figure 16 shows that downtime can be reduced by shrinking the election timeout. With an election timeout of 12-24ms, it takes only 35ms on average to elect a leader (the longest trial took 152ms). However, lowering the timeouts beyond this point violates Raft’s timing requirement: leaders would have difficulty broadcasting heartbeats before other servers start new elections. This can cause needless leader changes and lower overall system availability. We recommend using a conservative election timeout such as 150-300ms; such timeouts are unlikely to cause needless leader changes and will still provide good availability.
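
For concreteness, a randomized election timer over the recommended range might be drawn as follows; the helper and constant names are invented for this sketch.

```go
package raft

import (
	"math/rand"
	"time"
)

// The conservative range recommended above; a fresh random timeout is
// drawn for each election so repeated elections diverge instead of
// re-splitting the vote.
const (
	minTimeout = 150 * time.Millisecond
	maxTimeout = 300 * time.Millisecond
)

func randomElectionTimeout() time.Duration {
	return minTimeout + time.Duration(rand.Int63n(int64(maxTimeout-minTimeout)))
}
```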

10 Related Work

There has been a great deal of published work on consistency algorithms, much of which falls into one of the following categories:

  • Lamport’s original description of Paxos, and attempts to explain it more clearly.
  • Elaborations of Paxos that fill in missing details and modify the algorithm to provide a better foundation for implementation.
  • Systems that implement consistency algorithms, such as Chubby, ZooKeeper, and Spanner. The technical details of the algorithms for Chubby and Spanner have not been published, though both claim to be based on Paxos. ZooKeeper’s algorithm has been published in detail, but it is quite different from Paxos.
  • Performance optimizations that can be applied to Paxos.
  • Oki and Liskov’s Viewstamped Replication (VR), an alternative algorithm similar to Paxos. The original description was intertwined with a protocol for distributed transactions, but the core consistency protocol has been separated in a recent update. VR uses a leader-based approach with many similarities to Raft.

The greatest difference between Raft and Paxos is Raft’s strong leadership: Raft uses leader election as an essential part of the consistency protocol, and it concentrates as much functionality as possible in the leader. This makes the algorithm easier to understand. For example, in Paxos, leader election is orthogonal to the basic consistency protocol: it serves only as a performance optimization and is not required for achieving consistency. However, this results in additional mechanism: Paxos includes both a two-phase protocol for basic consistency and a separate mechanism for leader election. In contrast, Raft incorporates leader election directly into the consistency algorithm and uses it as the first of the two phases of reaching consistency. This results in less mechanism than in Paxos.

Like Raft, VR and ZooKeeper are leader-based and therefore share many of Raft’s advantages. However, Raft has less mechanism than VR or ZooKeeper because it minimizes the functionality in non-leaders. For example, log entries in Raft flow in only one direction: outward from the leader in AppendEntries RPCs. In VR, log entries flow in both directions (leaders can receive log entries during the election process); this results in additional mechanism and complexity. ZooKeeper’s log entries also flow in both directions, but its implementation is apparently more like Raft.

Raft has fewer message types than any other consistency-based log replication algorithm we are aware of. For example, we counted the message types VR and ZooKeeper use for basic consistency and membership changes (excluding log compaction and client interaction, as these are nearly independent of the algorithms). VR and ZooKeeper each define 10 different message types, while Raft has only 4 (two RPC requests and their responses). Raft’s messages are a bit more densely packed than those of the other algorithms, but they are simpler collectively. In addition, VR and ZooKeeper are described as transmitting entire logs during leader changes; additional message types would be required to optimize these mechanisms for practical use.

Raft’s strong leadership approach simplifies the algorithm, but it precludes some performance optimizations. For example, Egalitarian Paxos (EPaxos) can achieve higher performance under some conditions with a leaderless approach. EPaxos exploits commutativity in state machine commands: any server can commit a command with just one round of communication, as long as other commands proposed concurrently commute with it. However, if concurrently proposed commands do not commute with each other, EPaxos requires an additional round of communication. Because any server may commit commands, EPaxos balances load well between servers and can achieve lower latency in WAN settings. However, it adds significant complexity to Paxos.

Several approaches for changing cluster membership have been proposed or implemented in other work, including Lamport’s original proposal, VR, and SMART. We chose the joint consensus approach for Raft because it leverages the rest of the consistency protocol, so very little additional mechanism is required for membership changes. Lamport’s α-based approach was not an option for Raft because it assumes consistency can be reached without a leader. In comparison to VR and SMART, Raft’s reconfiguration algorithm has the advantage that membership changes can occur without limiting the processing of normal requests; in contrast, VR stops all normal processing during configuration changes, and SMART imposes an α-like limit on the number of outstanding requests. Raft’s approach also adds less mechanism than either VR or SMART.

11 Conclusion

Algorithms are often designed with correctness, efficiency, or conciseness as the primary goal. Although these are all worthy goals, we believe that understandability is just as important. None of the other goals can be achieved until developers render an algorithm into a practical implementation, which will inevitably deviate from and expand upon its published form. Unless developers have a deep understanding of the algorithm and can create intuitions about it, it will be difficult for them to retain its desirable properties in their implementations.

In this paper we addressed the issue of distributed consistency, where a widely accepted but impenetrable algorithm, Paxos, has challenged students and developers for many years. We developed a new algorithm, Raft, which we have shown to be more understandable than Paxos. We also believe that Raft provides a better foundation for building practical systems. Using understandability as the primary design goal changed the way we approached the design of Raft; as the design progressed, we found ourselves reusing a few techniques repeatedly, such as decomposing the problem and simplifying the state space. These techniques not only improved Raft’s understandability but also made it easier to convince ourselves of its correctness.

12 Acknowledgments

The user study would not have been possible without the support of Ali Ghodsi, David Mazières, and the students of CS 294-91 at Berkeley and CS 240 at Stanford. Scott Klemmer helped us design the user study, and Nelson Ray advised us on statistical analysis. The Paxos slides for the user study borrowed heavily from a slide deck created by Lorenzo Alvisi. Special thanks go to David Mazières and Ezra Hoch for finding subtle bugs in Raft. Many people provided helpful feedback on the paper and user study materials, including Ed Bugnion, Michael Chan, Hugues Evrard, Daniel Giffin, Arjun Gopalan, Jon Howell, Vimalkumar Jeyakumar, Ankita Kejriwal, Aleksandar Kracun, Amit Levy, Joel Martin, Satoshi Matsushita, Oleg Pesok, David Ramos, Robbert van Renesse, Mendel Rosenblum, Nicolas Schiper, Deian Stefan, Andrew Stone, Ryan Stutsman, David Terei, Stephen Yang, Matei Zaharia, and 24 anonymous conference reviewers (with duplicates), and especially our shepherd Eddie Kohler. Werner Vogels tweeted a link to an earlier draft, which gave Raft significant exposure. This work was supported by the Gigascale Systems Research Center and the Multiscale Systems Center, two research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program, by STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, by the National Science Foundation under Grant No. 0963859, and by grants from Facebook, Google, Mellanox, NEC, NetApp, SAP, and Samsung. Diego Ongaro is supported by The Junglee Corporation Stanford Graduate Fellowship.

References

(Omitted.)