Abstract
Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to Paxos and is as efficient as Paxos, but its structure is different, which makes Raft easier to understand and a better foundation for building practical systems. To improve understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it reduces the number of states that must be considered by enforcing a stronger degree of coherency. Results from a user study show that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing cluster membership, which uses overlapping majorities to guarantee safety.
1 Introduction
Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failure of some of its members. Because of this, they play a key role in building reliable large-scale software systems. For the past decade, the Paxos algorithm has dominated the discussion of consensus: most implementations of consensus are based on Paxos or influenced by it, and Paxos has become the primary vehicle used to teach students about consensus.
Unfortunately, despite many efforts to make it more approachable, Paxos remains very difficult to understand. In addition, its structure requires substantial changes before it can be applied to a practical system. As a result, both industry and academia struggle with Paxos.
After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for building systems and for teaching. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos? Furthermore, we wanted the algorithm to facilitate the development of intuitions that are essential for system builders. It is important not just for an algorithm to work, but for it to be obvious why it works.
The result of this work is the Raft consensus algorithm. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (relative to Paxos, Raft reduces the degree of nondeterminism and the ways in which servers can be inconsistent with each other). A user study with 43 students at two universities showed that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of the students were able to answer questions about Raft better than questions about Paxos.
Raft is similar in many ways to existing consensus algorithms (most notably Oki and Liskov’s Viewstamped Replication), but it has several novel features:
- Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to the other servers. This simplifies the management of the replicated log and makes Raft easier to understand.
- Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required by any consensus algorithm, while allowing conflicts to be resolved simply and rapidly.
- Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a joint consensus approach, in which the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally while its membership changes.
We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation: it is simpler and more understandable than other algorithms; it is described completely enough for a practical system; it has several open source implementations and is used by several companies; its safety properties have been proven; and its efficiency is comparable to other algorithms.
The remainder of this paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section 3), describes our general approach to understandability (Section 4), presents the Raft consensus algorithm (Sections 5-8), evaluates Raft (Section 9), and discusses related work (Section 10).
2 Replicated state machines
Consensus algorithms typically arise in the context of replicated state machines (cf. 37). In this approach, state machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down. Replicated state machines are used to solve a variety of fault-tolerance problems in distributed systems. For example, large-scale systems with a single cluster leader, such as GFS, HDFS, and RAMCloud, typically use a separate replicated state machine to manage leader election and to store configuration information that must survive leader crashes. Chubby and ZooKeeper are examples of such replicated state machines.
Figure 1: Replicated state machine architecture. The consensus algorithm manages a replicated log containing state machine commands from clients. The state machines process identical sequences of commands from the logs, so they produce the same outputs.
Replicated state machines are typically implemented using a replicated log, as shown in Figure 1. Each server stores a log containing a series of commands, which its state machine executes in order. Every log contains the same commands in the same order, so each state machine processes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.
Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.
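As a rough illustration of this architecture, here is a minimal Go sketch of a deterministic state machine applying commands from a replicated log in order. The Command type and the key-value semantics are invented for the example, and the consensus module that actually replicates the log is omitted.

```go
package main

import "fmt"

// Command is one client instruction stored in the replicated log,
// e.g. "set x to 3" for a simple key-value store.
type Command struct {
	Key   string
	Value int
}

// StateMachine applies committed log entries in log order. Because it is
// deterministic, every replica that applies the same sequence of commands
// ends up in the same state and produces the same outputs.
type StateMachine struct {
	data map[string]int
}

func NewStateMachine() *StateMachine {
	return &StateMachine{data: make(map[string]int)}
}

// Apply executes one committed command and returns its result.
func (sm *StateMachine) Apply(c Command) int {
	sm.data[c.Key] = c.Value
	return sm.data[c.Key]
}

func main() {
	// The consensus module's job (not shown) is to make sure every server
	// sees this same log in the same order.
	log := []Command{{"x", 3}, {"y", 1}, {"x", 5}}

	sm := NewStateMachine()
	for _, c := range log {
		fmt.Println(sm.Apply(c))
	}
}
```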
Consensus algorithms for practical systems typically have the following properties:
- Safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.
- Availability as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.
- Independence from timing for log consistency: faulty clocks and extreme message delays can, at worst, cause availability problems, never incorrect results.
- In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers does not impact overall system performance.
3 Problems with Paxos
Over the last ten years, Leslie Lamport’s Paxos protocol has become almost synonymous with consensus: it is the protocol most commonly taught in courses, and most implementations of consensus use it as a starting point. Paxos first defines a protocol capable of reaching agreement on a single decision, such as a single replicated log entry; we refer to this subset as single-decree Paxos. Paxos then combines multiple instances of this protocol to facilitate a series of decisions, such as a log. Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.
Unfortunately, Paxos has two significant drawbacks. The first is that Paxos is exceptionally difficult to understand. The full explanation is notoriously opaque; few people succeed in understanding it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms. These explanations focus on the single-decree subset, yet they are still challenging. An informal survey at the 2012 NSDI conference found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative algorithm, a process that took almost a year.
We hypothesize that Paxos’s opaqueness derives from its choice of the single-decree problem as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that do not have simple intuitive explanations and cannot be understood independently. Because of this, it is difficult to develop intuitions about why the single-decree protocol works. The composition rules for multi-decree Paxos add significant additional complexity. We believe that the overall problem of reaching consensus on multiple decisions (i.e., a log instead of a single entry) can be decomposed in other ways that are more straightforward and obvious.
The second problem with Paxos is that it does not provide a good foundation for building practical systems. One reason is that there is no widely agreed-upon algorithm for the multi-decision problem. Lamport’s descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-decree Paxos, but many details are missing. There have been several attempts to flesh out Paxos, but they differ from each other and from Lamport’s sketches. Systems such as Chubby have implemented Paxos-like algorithms, but in most cases their details have not been published.
Furthermore, the structure of Paxos is a poor fit for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit to choosing a collection of log entries independently and then merging them into a sequential log; this only adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended in a strictly constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (though it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to first elect a leader and then have the leader coordinate the decisions.
As a result, few practical systems resemble Paxos. Each implementation begins with Paxos, discovers the difficulties in implementing it, and then develops a significantly different architecture. This is time-consuming and error-prone, and the difficulty of understanding Paxos makes the problem worse. Paxos has been proven correct in theory, but real-world systems differ so much from Paxos that those proofs have little value. The following comment from Chubby’s implementers is typical:
There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. The final system ends up being based on an unproven protocol.
Because of these problems, we concluded that Paxos does not provide a good foundation either for system building or for education. Given the importance of consensus in large-scale software systems, we decided to see whether we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.
4 Designing for understandability
We had several goals in designing Raft: it must provide a complete and practical foundation for system building, so that it significantly reduces the amount of design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal, and our most difficult challenge, was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.
There were numerous points in the design of Raft where we had to choose among alternative approaches. In these situations we evaluated the alternatives based on understandability: how hard is it to explain each alternative (for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a reader to completely understand the approach and its implications?
We recognize that such analysis is highly subjective; nonetheless, we used two techniques that are generally applicable. The first is the well-known approach of problem decomposition: wherever possible, we divided the problem into separate pieces that could be solved, explained, and understood relatively independently. For example, Raft is decomposed into leader election, log replication, safety, and membership changes.
Our second approach was to simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible. Specifically, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with each other. Although in most cases we tried to eliminate nondeterminism, there are some situations where nondeterminism actually improves understandability. In particular, randomized approaches introduce nondeterminism, but they tend to reduce the state space by handling all possible choices in a similar fashion. We used randomization to simplify Raft’s leader election algorithm.
5 The Raft consensus algorithm
Raft is an algorithm for managing a replicated log of the form described in Section 2. Figure 2 summarizes the algorithm in condensed form for reference, and Figure 3 lists key properties of the algorithm; the elements of these figures are discussed piecewise over the remaining subsections.
Raft implements consensus by first electing a distinguished leader and then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on the other servers, and tells the servers when it is safe to apply log entries to their state machines. Having a leader greatly simplifies the management of the replicated log. For example, the leader can decide where to place new entries in the log without consulting other servers, and data flows in a simple fashion from the leader to the other servers. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.
Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems, which are discussed in the subsections that follow:
- Leader election: a new leader must be chosen when the existing leader fails (Section 5.2).
- Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own (Section 5.3).
- Safety: the key safety property for Raft is the State Machine Safety Property shown in Figure 3: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index. Section 5.4 describes how Raft ensures this property; the solution involves an additional restriction on the election mechanism described in Section 5.2.
After presenting the consensus algorithm, this section discusses the issue of availability and the role of timing in the system.
State:
State | Persistent state on all servers |
---|---|
currentTerm | The latest term this server has seen (initialized to 0, increases monotonically) |
votedFor | The candidateId that received this server’s vote in the current term (or null if none) |
log[] | The log entries; each entry contains a command for the state machine and the term when the entry was received by the leader |
State | Volatile state on all servers |
---|---|
commitIndex | The index of the highest log entry known to be committed |
lastApplied | The index of the highest log entry applied to the state machine (initialized to 0, increases monotonically) |
State | Volatile state on leaders (reinitialized after election) |
---|---|
nextIndex[] | For each server, the index of the next log entry to send to that server (initialized to the leader’s last log index plus one) |
matchIndex[] | For each server, the index of the highest log entry known to be replicated on that server |
AppendEntries RPC:
Invoked by the leader to replicate log entries; also used as a heartbeat.
Parameter | Description |
---|---|
term | The leader’s term |
leaderId | The leader’s id, so that followers can redirect clients |
prevLogIndex | The index of the log entry immediately preceding the new entries |
prevLogTerm | The term of the entry at prevLogIndex |
entries[] | The log entries to store (empty for a heartbeat; more than one may be sent at a time for efficiency) |
leaderCommit | The leader’s commitIndex |
Return value | Description |
---|---|
term | The current term, for the leader to update itself |
success | True if the follower contained an entry matching prevLogIndex and prevLogTerm |
Receiver implementation:
- Reply false if term < currentTerm (Section 5.1)
- Reply false if the log does not contain an entry at prevLogIndex whose term matches prevLogTerm (Section 5.3)
- If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all entries that follow it (Section 5.3)
- Append any new entries not already in the log
- If leaderCommit > commitIndex, set commitIndex to the smaller of leaderCommit and the index of the last new entry
RequestVote RPC:
Invoked by candidates to gather votes (Section 5.2).
Parameter | Description |
---|---|
term | The candidate’s term |
candidateId | The id of the candidate requesting the vote |
lastLogIndex | The index of the candidate’s last log entry |
lastLogTerm | The term of the candidate’s last log entry |
Return value | Description |
---|---|
term | The current term, for the candidate to update itself |
voteGranted | True means the candidate received this server’s vote |
Receiver implementation:
- Reply false if term < currentTerm (Section 5.1)
- If votedFor is null or candidateId, and the candidate’s log is at least as up-to-date as the receiver’s log, grant the vote (Sections 5.2 and 5.4)
Rules for servers:
All servers:
- If commitIndex > lastApplied, increment lastApplied and apply log[lastApplied] to the state machine (Section 5.3)
- If an RPC request or response contains a term T > currentTerm, set currentTerm = T and convert to follower (Section 5.1)
Followers (Section 5.2):
- Respond to RPCs from candidates and leaders
- If the election timeout elapses without receiving an AppendEntries RPC from the current leader or granting a vote to a candidate, convert to candidate
Candidates (Section 5.2):
- On conversion to candidate, start an election:
- Increment currentTerm
- Vote for self
- Reset the election timer
- Send RequestVote RPCs to all other servers
- If votes are received from a majority of the servers, become leader
- If an AppendEntries RPC is received from the new leader, convert to follower
- If the election timeout elapses, start a new election
Leaders:
- Upon election: send an initial empty AppendEntries RPC (heartbeat) to each server; repeat during idle periods to prevent followers from timing out (Section 5.2)
- If a command is received from a client: append the entry to the local log, then respond to the client after the entry has been applied to the state machine (Section 5.3)
- If the index of the last log entry is greater than or equal to nextIndex for a follower: send an AppendEntries RPC with the log entries starting at nextIndex:
- If successful: update nextIndex and matchIndex for that follower
- If AppendEntries fails because of log inconsistency: decrement nextIndex and retry
- If there exists an N such that N > commitIndex, a majority of matchIndex[i] >= N, and log[N].term == currentTerm: set commitIndex = N (Sections 5.3 and 5.4)
Figure 2: A condensed summary of the Raft consensus algorithm (excluding membership changes and log compaction).
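To make Figure 2 concrete, the following Go sketch mirrors the state and RPC message types summarized in the figure. Field names follow the figure; the Role and me fields, the use of -1 as a stand-in for a null votedFor, and the sentinel entry at log index 0 are assumptions of this sketch rather than part of the figure. Later sketches in this section build on these types.

```go
package raft

// LogEntry is one entry in the replicated log.
type LogEntry struct {
	Term    int         // term in which the entry was received by the leader
	Command interface{} // command for the state machine
}

// Role is bookkeeping assumed by the sketches below (not listed in Figure 2).
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// RaftState holds the per-server state from Figure 2.
type RaftState struct {
	// Persistent state on all servers.
	currentTerm int        // latest term this server has seen (initialized to 0)
	votedFor    int        // candidateId voted for in currentTerm (-1 stands in for "null")
	log         []LogEntry // log entries; index 0 is an unused sentinel so real entries start at 1

	// Volatile state on all servers.
	commitIndex int // index of the highest log entry known to be committed
	lastApplied int // index of the highest log entry applied to the state machine

	// Volatile state on leaders (reinitialized after election).
	nextIndex  []int // for each server, index of the next log entry to send to it
	matchIndex []int // for each server, index of the highest log entry known to be replicated on it

	// Bookkeeping assumed by the sketches below (not part of Figure 2).
	role Role
	me   int // this server's own index into nextIndex/matchIndex
}

// AppendEntries RPC arguments and results (also used as heartbeat).
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderId     int        // so followers can redirect clients
	PrevLogIndex int        // index of the log entry immediately preceding the new ones
	PrevLogTerm  int        // term of the entry at PrevLogIndex
	Entries      []LogEntry // entries to store (empty for heartbeat)
	LeaderCommit int        // leader's commitIndex
}

type AppendEntriesReply struct {
	Term    int  // currentTerm, for the leader to update itself
	Success bool // true if the follower matched PrevLogIndex and PrevLogTerm
}

// RequestVote RPC arguments and results.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateId  int // candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

type RequestVoteReply struct {
	Term        int  // currentTerm, for the candidate to update itself
	VoteGranted bool // true means the candidate received this server's vote
}
```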
Property | Description |
---|---|
Election Safety | At most one leader can be elected in a given term (Section 5.2) |
Leader Append-Only | A leader never overwrites or deletes entries in its own log; it only appends new entries (Section 5.3) |
Log Matching | If two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index (Section 5.3) |
Leader Completeness | If a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all later terms (Section 5.4) |
State Machine Safety | If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index (Section 5.4.3) |
Figure 3: Raft guarantees that each of these properties is true at all times.
5.1 Raft basics
A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects the request to the leader). The third state, candidate, is used to elect a new leader as described in Section 5.2. Figure 4 shows the states and their transitions; the transitions are discussed below.
Figure 4: Server states. Followers only respond to requests from other servers. If a follower receives no communication, it becomes a candidate and initiates an election. A candidate that receives votes from a majority of the full cluster becomes the new leader. Leaders typically operate until they fail.
Figure 5: Time is divided into terms, and each term begins with an election. After a successful election, a single leader manages the cluster until the end of the term. Some elections fail, in which case the term ends without choosing a leader. The transitions between terms may be observed at different times on different servers.
Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, it serves as leader for the rest of the term. In some situations an election results in a split vote; in this case the term ends without a leader, and a new term (with a new election) begins shortly thereafter. Raft ensures that there is at most one leader in a given term.
Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even an entire term. Terms act as a logical clock in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than another’s, it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.
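Continuing the Go sketch begun after Figure 2, these term rules might look roughly as follows (helper names are illustrative):

```go
// stepDownIfNewerTerm implements the Figure 2 rule for all servers: if an
// RPC request or response carries a term larger than currentTerm, adopt
// that term and revert to follower.
func (rf *RaftState) stepDownIfNewerTerm(rpcTerm int) {
	if rpcTerm > rf.currentTerm {
		rf.currentTerm = rpcTerm
		rf.votedFor = -1 // no vote has been cast yet in the new term
		rf.role = Follower
	}
}

// isStaleTerm implements the complementary rule: any request carrying a
// term smaller than currentTerm is rejected outright by the receiver.
func (rf *RaftState) isStaleTerm(rpcTerm int) bool {
	return rpcTerm < rf.currentTerm
}
```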
Raft servers communicate using remote procedure calls (RPCs), and the basic consensus algorithm requires only two types of RPCs. RequestVote RPCs are initiated by candidates during elections (Section 5.2), and AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat (Section 5.3). Section 7 adds a third RPC for transferring snapshots between servers. Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.
5.2 Leader election
Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.
To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or (c) a period of time goes by with no winner. These outcomes are discussed separately in the paragraphs below.
A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes). The majority rule ensures that at most one candidate can win the election for a particular term (the Election Safety Property in Figure 3). Once a candidate wins an election, it becomes leader. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.
While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If that leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state.
The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, the votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term. However, without extra measures, split votes could repeat indefinitely.
Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly when they do occur. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (for example, 150-300 milliseconds). This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other server times out. The same mechanism is used to handle split votes: each candidate restarts its randomized election timeout at the start of an election and waits for that timeout to elapse before starting the next election, which reduces the likelihood of another split vote in the new election. Section 9.3 shows that this approach elects a leader rapidly.
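As a small, self-contained sketch of these two ingredients, the randomized timeout and the majority check might look like the following; the 150-300 ms range is simply the example from the text, and the function names are invented.

```go
package raft

import (
	"math/rand"
	"time"
)

// randomElectionTimeout picks a fresh timeout from the 150-300 ms range
// mentioned above, so that servers rarely time out at the same moment.
func randomElectionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}

// wonElection reports whether a candidate has collected votes from a
// majority of the full cluster (its own vote is included in votesGranted).
func wonElection(votesGranted, clusterSize int) bool {
	return votesGranted > clusterSize/2
}
```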
Elections are an example of how understandability guided our choice between design alternatives. Initially we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between competing candidates. If a candidate discovered another candidate with a higher rank, it would return to follower state, so that the higher-ranked candidate could more easily win the next election. We found that this approach created subtle availability issues (if the higher-ranked server failed, a lower-ranked server might time out and become a candidate again, and if it did so too soon, it could reset progress toward electing a leader). We made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually we concluded that the randomized retry approach is more obvious and more understandable.
5.3 Log replication
Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.
Figure 6: Logs are composed of entries, which are numbered sequentially. Each entry contains the term in which it was created (the number in each box) and a command for the state machine. An entry is considered committed when it is safe for that entry to be applied to the state machines.
Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.
The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers (for example, entry 7 in Figure 6). This also commits all preceding entries in the leader’s log, including entries created by previous leaders. Section 5.4 discusses some subtleties when applying this rule after leader changes, and it also shows that this definition of commitment is safe. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).
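A sketch of the leader’s commitment rule and of applying committed entries in log order, again building on the types sketched after Figure 2 (and anticipating the current-term restriction explained in Section 5.4.2); how matchIndex is indexed is an assumption of the sketch:

```go
// advanceCommitIndex applies the leader rule from Figure 2: find the
// largest N > commitIndex such that log[N].term == currentTerm and a
// majority of the cluster stores entry N, then set commitIndex = N.
func (rf *RaftState) advanceCommitIndex(clusterSize int) {
	for n := len(rf.log) - 1; n > rf.commitIndex; n-- {
		if rf.log[n].Term != rf.currentTerm {
			continue // only current-term entries are committed by counting replicas (Section 5.4.2)
		}
		count := 1 // the leader itself stores the entry
		for peer, m := range rf.matchIndex {
			if peer != rf.me && m >= n {
				count++
			}
		}
		if count > clusterSize/2 {
			rf.commitIndex = n
			return
		}
	}
}

// applyCommitted feeds newly committed entries to the state machine in
// log order; the same loop runs on leaders and followers.
func (rf *RaftState) applyCommitted(apply func(LogEntry)) {
	for rf.lastApplied < rf.commitIndex {
		rf.lastApplied++
		apply(rf.log[rf.lastApplied])
	}
}
```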
We designed the Raft log mechanism to maintain a high level of coherency between the logs on different servers. Not only does this simplify the system’s behavior and make it more predictable, but it is also an important component of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching Property in Figure 3:
- If two entries in different logs have the same index and term, then they store the same command.
- If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.
The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries.
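A follower-side sketch of this consistency check, following the receiver rules in Figure 2 and the earlier type definitions (locking and persistence are omitted):

```go
// HandleAppendEntries sketches the follower side of AppendEntries.
func (rf *RaftState) HandleAppendEntries(args AppendEntriesArgs, reply *AppendEntriesReply) {
	// 1. Reject leaders from stale terms (Section 5.1).
	if args.Term < rf.currentTerm {
		reply.Term, reply.Success = rf.currentTerm, false
		return
	}
	rf.stepDownIfNewerTerm(args.Term)
	rf.role = Follower // a valid AppendEntries from the current leader keeps us a follower
	reply.Term = rf.currentTerm

	// 2. Consistency check: our log must contain an entry at PrevLogIndex
	//    whose term matches PrevLogTerm; otherwise the leader retries with
	//    a smaller nextIndex.
	if args.PrevLogIndex >= len(rf.log) || rf.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		reply.Success = false
		return
	}

	// 3. Delete conflicting entries, then append any entries we do not
	//    already have (duplicates are ignored, which keeps the RPC idempotent).
	for i, e := range args.Entries {
		idx := args.PrevLogIndex + 1 + i
		if idx < len(rf.log) && rf.log[idx].Term != e.Term {
			rf.log = rf.log[:idx] // conflict: drop this entry and everything after it
		}
		if idx >= len(rf.log) {
			rf.log = append(rf.log, e)
		}
	}

	// 4. Advance commitIndex as far as the leader allows.
	if args.LeaderCommit > rf.commitIndex {
		lastNew := args.PrevLogIndex + len(args.Entries)
		if args.LeaderCommit < lastNew {
			rf.commitIndex = args.LeaderCommit
		} else {
			rf.commitIndex = lastNew
		}
	}
	reply.Success = true
}
```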
During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which followers’ logs may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms.
Figure 7: When the leader comes to power, any of the scenarios (a-f) could occur in follower logs. Each box represents one log entry; the number inside is its term. A follower may be missing entries (a-b), may have extra uncommitted entries (c-d), or both (e-f). For example, scenario (f) could occur if that server was the leader for term 2, added several entries to its log, then crashed before committing any of them; it restarted quickly, became leader for term 3, and added a few more entries to its log; before any of the entries in either term 2 or term 3 were committed, the server crashed again and remained down for several terms.
In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 shows that this is safe when coupled with one additional restriction.
To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last entry in its own log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries succeeds, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log. Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.
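A leader-side sketch of this retry loop for a single follower, continuing the earlier sketches; the send callback stands in for the actual RPC transport, and concurrency control is omitted:

```go
// replicateTo pushes the leader's log to one follower, backing nextIndex
// up by one entry each time the consistency check is rejected.
func (rf *RaftState) replicateTo(peer int, send func(AppendEntriesArgs) AppendEntriesReply) {
	for {
		next := rf.nextIndex[peer]
		args := AppendEntriesArgs{
			Term:         rf.currentTerm,
			LeaderId:     rf.me,
			PrevLogIndex: next - 1,
			PrevLogTerm:  rf.log[next-1].Term,
			Entries:      rf.log[next:],
			LeaderCommit: rf.commitIndex,
		}
		reply := send(args)
		if reply.Success {
			rf.matchIndex[peer] = args.PrevLogIndex + len(args.Entries)
			rf.nextIndex[peer] = rf.matchIndex[peer] + 1
			return
		}
		if reply.Term > rf.currentTerm {
			rf.stepDownIfNewerTerm(reply.Term)
			return // a newer term exists; stop acting as leader
		}
		rf.nextIndex[peer]-- // consistency check failed: retry one entry earlier
	}
}
```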
If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example, when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first index it stores for that term. With this information, the leader can decrement nextIndex to bypass all of the conflicting entries in that term; one AppendEntries RPC is then required for each term with conflicting entries, rather than one RPC per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently and it is unlikely that there will be many inconsistent entries.
With this mechanism, a leader does not need to take any special actions to restore log consistency when it comes to power. It just begins normal operation, and the logs automatically converge in response to failures of the AppendEntries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).
This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept, replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry can be replicated with a single round of RPCs to a majority of the cluster; and a single slow follower will not impact overall performance.
5.4 Safety
The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries, then be elected leader and overwrite those entries; as a result, different state machines might execute different command sequences.
This section completes the Raft algorithm by adding a restriction on which servers may be elected leader. The restriction ensures that the leader for any given term contains all of the entries committed in previous terms (the Leader Completeness Property from Figure 3). Given the election restriction, we then make the rules for commitment more precise. Finally, we present a proof sketch for the Leader Completeness Property and show how it leads to correct behavior of the replicated state machine.
5.4.1 Election restriction
In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. In some consensus algorithms, such as Viewstamped Replication, a node can be elected leader even if it does not initially contain all of the committed entries. These algorithms contain additional mechanisms to identify the missing entries and transmit them to the new leader, either during the election process or shortly afterwards. Unfortunately, this results in considerable additional mechanism and complexity. Raft uses a simpler approach: it guarantees that all of the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their own logs.
Raft uses the voting process to prevent a candidate from winning an election unless its log contains all of the committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in the log of at least one of those servers. If the candidate’s log is at least as up-to-date as any other log in that majority (where "up-to-date" is defined precisely below), then it will hold all of the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate’s log, and the voter denies its vote if its own log is more up-to-date than the candidate’s.
Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
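A sketch of this up-to-date comparison as it would be used inside a RequestVote handler, continuing the earlier sketches:

```go
// candidateLogIsUpToDate implements the comparison from Section 5.4.1:
// a later last-entry term wins; equal terms fall back to log length.
func (rf *RaftState) candidateLogIsUpToDate(lastLogIndex, lastLogTerm int) bool {
	myLastIndex := len(rf.log) - 1
	myLastTerm := rf.log[myLastIndex].Term
	if lastLogTerm != myLastTerm {
		return lastLogTerm > myLastTerm
	}
	return lastLogIndex >= myLastIndex
}

// HandleRequestVote sketches the voter side of RequestVote: grant the
// vote only if we have not voted for someone else this term and the
// candidate's log is at least as up-to-date as our own.
func (rf *RaftState) HandleRequestVote(args RequestVoteArgs, reply *RequestVoteReply) {
	if args.Term < rf.currentTerm {
		reply.Term, reply.VoteGranted = rf.currentTerm, false
		return
	}
	rf.stepDownIfNewerTerm(args.Term)
	reply.Term = rf.currentTerm

	if (rf.votedFor == -1 || rf.votedFor == args.CandidateId) &&
		rf.candidateLogIsUpToDate(args.LastLogIndex, args.LastLogTerm) {
		rf.votedFor = args.CandidateId // first-come-first-served, one vote per term
		reply.VoteGranted = true
		return
	}
	reply.VoteGranted = false
}
```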
5.4.2 Committing entries from previous terms
As described in Section 5.3, a leader knows that an entry from its current term is committed once that entry is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers. Figure 8 illustrates a situation where an old log entry is stored on a majority of servers, yet can still be overwritten by a future leader.
Figure 8: A time sequence showing why a leader cannot determine commitment using log entries from older terms. In (a), S1 is leader and partially replicates the log entry at index 2. In (b), S1 crashes; S5 is elected leader for term 3 with votes from S3, S4, and itself, and accepts a different entry from a client at log index 2. In (c), S5 crashes; S1 restarts, is elected leader, and continues replication. At this point, the entry from term 2 has been replicated on a majority of the servers, but it is not committed. If S1 crashes as in (d), S5 could be elected leader again (with votes from S2, S3, and S4) and overwrite the entries at index 2 with its own. However, if S1 replicates an entry from its current term on a majority of the servers before crashing, as in (e), then this entry is committed (S5 cannot win an election), and all preceding entries are committed as well.
To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, all prior entries are committed indirectly because of the Log Matching Property. There are some situations where a leader could safely conclude that an older log entry is committed (for example, if that entry is stored on every server), but Raft takes a more conservative approach for the sake of simplicity.
Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers when a leader replicates entries from previous terms. In other consensus algorithms, if a new leader re-replicates entries from prior terms, it must do so with its new term number. Raft’s approach makes it easier to reason about log entries, since they maintain the same term number over time and across logs. In addition, new leaders in Raft send fewer log entries from previous terms than in other algorithms (other algorithms must send redundant log entries to renumber them before they can be committed).
5.4.3 Safety argument
Given the complete Raft algorithm, we can now argue more precisely that the Leader Completeness Property holds (this argument is based on the safety proof in Section 9.2). We assume that the Leader Completeness Property does not hold, and then derive a contradiction. Suppose the leader for term T (leader T) commits a log entry from its term, but that entry is not stored by the leader of some future term. Consider the smallest term U greater than T whose leader (leader U) does not store the entry.
Figure 9: If S1 (the leader for term T) commits a new log entry from its term, and S5 is elected leader for a later term U, then there must be at least one server (S3) that both accepted the log entry and voted for S5.
- The committed entry must have been absent from leader U’s log at the time of its election (leaders never delete or overwrite entries).
- Leader T replicated the entry on a majority of the cluster, and leader U received votes from a majority of the cluster. Thus, at least one server ("the voter") both accepted the entry from leader T and voted for leader U, as shown in Figure 9. The voter is the key to reaching a contradiction.
- The voter must have accepted the committed entry from leader T before voting for leader U; otherwise it would have rejected the AppendEntries request from leader T (its current term would have been higher than T).
- The voter still stored the entry when it voted for leader U, since every intervening leader contained the entry (by the assumption above), leaders never remove entries, and followers only remove entries when they conflict with the leader.
- The voter granted its vote to leader U, so leader U’s log must have been at least as up-to-date as the voter’s. This leads to one of two contradictions.
- First, if the voter and leader U shared the same last log term, then leader U’s log must have been at least as long as the voter’s, so its log contained every entry in the voter’s log. This is a contradiction, since the voter contained the committed entry and leader U was assumed not to.
- Otherwise, leader U’s last log term must have been larger than the voter’s. Moreover, it was larger than T, since the voter’s last log term was at least T (it contains the committed entry from term T). The earlier leader that created leader U’s last log entry must have contained the committed entry in its log (by the assumption above, leader U is the first leader not to contain the entry). Then, by the Log Matching Property, leader U’s log must also contain the committed entry, which is a contradiction.
- This completes the contradiction. Thus, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.
- The Log Matching Property guarantees that future leaders will also contain entries that are committed indirectly, such as index 2 in Figure 8(d).
Given the Leader Completeness Property, we can prove the State Machine Safety Property from Figure 3, which states that if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. At the time a server applies a log entry to its state machine, its log must be identical to the leader’s log up through that entry, and the entry must be committed. Now consider the lowest term in which any server applies a given log index; the Leader Completeness Property guarantees that the leaders for all higher terms will store that same log entry, so servers that apply that index in later terms will apply the same value. Thus, the State Machine Safety Property holds.
Finally, Raft requires servers to apply entries in log index order. Combined with the State Machine Safety Property, this means that all servers apply exactly the same set of log entries to their state machines, in the same order.
5.5 Follower and candidate crashes
Until this point we have focused on leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, the RPC will eventually complete successfully. If a server crashes after completing an RPC but before responding, it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so such retries cause no harm. For example, if a follower receives an AppendEntries request that includes log entries already present in its log, it simply ignores those entries in the new request.
5.6 Timing and availability
One of our requirements for Raft is that safety must not depend on timing: the system must not produce incorrect results just because some event happens more quickly or slowly than expected. However, availability (the ability of the system to respond to clients in a timely manner) must inevitably depend on timing. For example, if message exchanges take longer than the typical time between server failures, candidates will not stay up long enough to win an election; without a steady leader, Raft cannot make progress.
Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement:
broadcastTime << electionTimeout << MTBF
In this inequality, broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses; electionTimeout is the election timeout described in Section 5.2; and MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers from starting elections; given the randomized approach used for election timeouts, this inequality also makes split votes unlikely. The election timeout should be a few orders of magnitude less than MTBF so that the system makes steady progress. When the leader crashes, the system will be unavailable for roughly the election timeout; we would like this to represent only a small fraction of overall time.
The broadcast time and MTBF are properties of the underlying system, while the election timeout is something we must choose. Raft’s RPCs require the recipient to persist information to stable storage, so the broadcast time may range from 0.5 ms to 20 ms, depending on storage technology. As a result, the election timeout is likely to be somewhere between 10 ms and 500 ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.
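As a rough, non-authoritative illustration of these constraints (continuing the earlier sketches; the numbers are merely the examples from the text, and the order-of-magnitude margins are assumptions):

```go
// Illustrative values only; a real deployment should measure its own
// broadcast time and MTBF rather than relying on these assumptions.
const (
	exampleBroadcastTime   = 10 * time.Millisecond  // one parallel RPC round, persisted to stable storage
	exampleElectionTimeout = 300 * time.Millisecond // chosen to be well above the broadcast time
	exampleServerMTBF      = 30 * 24 * time.Hour    // months between failures is typical
)

// timingIsSafe is a rough sanity check of broadcastTime << electionTimeout << MTBF,
// using an order-of-magnitude margin for each "<<".
func timingIsSafe(broadcast, election, mtbf time.Duration) bool {
	return 10*broadcast <= election && 10*election <= mtbf
}
```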
This post was reposted from TopJohn’s Blog with permission from TopJohn