
Distributed systems address the reliability and scalability limits we run into on stand-alone systems: sharding was introduced for scalability, and replication for reliability. But these two mechanisms in turn create new, trickier problems that have to be solved.

In this article, the author introduces the distributed consensus algorithm Raft. It is called an “overview” because it mainly covers the overall structure and processes; if you want more detail, the resources at the end are worth a closer study.

The structure of this article: first, a brief introduction to the problems a consensus algorithm solves; then a summary of the main flow and data structures of the Raft algorithm. Once you have an overall impression of Raft, we focus on the details and pitfalls of the two main processes, leader election and log replication. Finally, we prove the correctness of Raft and describe how Raft handles membership changes.

Problems solved by consensus algorithms

Simply put, consensus is when multiple nodes agree on a piece of information.

The “information” could be the holder of a lock, the leader of a cluster, whether a user name is already registered, the balance of a bank account… If multiple nodes do not agree on such information, different machines may process requests inconsistently, and in critical scenarios such as bank transfers that inconsistency causes real damage.

Here is a simple master-slave replication scenario to show what kind of guarantees a consensus algorithm brings.

Simple master-slave replication is a very common pattern; even in sharded systems, each shard can be viewed as this kind of data flow. In a typical master-slave replication system, data is replicated from the master node to the slave nodes as shown below.

In such a design, all changes are made on the master node. During a failover, an outdated node may be elected, causing successfully written records to be lost. If the network between a slave node and the master node is disconnected, the master may be considered failed and a slave promoted to master, so two masters exist at the same time, a phenomenon known as split brain.

In such scenarios, a consensus algorithm can ensure that records written successfully are never lost afterwards, and that there is at most one master node at any time.

Beyond leader election, consensus algorithms provide strong guarantees for distributed locking, uniqueness constraints, ordering dependencies, and so on. This article mainly describes the details of the Raft algorithm in a KV storage scenario.

Since this article is only an overview, and the author’s ability is limited, omissions are hard to avoid. If you have any questions or comments, please feel free to discuss them in the comments section.

Origin of Raft

It all starts with Paxos

**There is only one consensus protocol, and that’s Paxos. All other approaches are just broken versions of Paxos.** — Mike Burrows (author of Google Chubby)

As Mike Burrows pointed out, Paxos is practically synonymous with distributed consensus, and most other consensus algorithms, including Raft, are variations built on top of it that give up some of its generality.

So why do people keep releasing “broken versions” on top of Paxos? In his 2014 PhD thesis, Raft author Diego Ongaro explained the two major shortcomings of Paxos he had observed:

Difficult to understand

The full Paxos algorithm is hard to understand, partly because Lamport, its author, loves exotic metaphors. Diego Ongaro did not really master Paxos until he read several simplified explanations of the algorithm and began designing his own consensus algorithm, a process that took this capable PhD student about a year.

No engineering implementation details are provided

Lamport’s description only provides a framework protocol and a rigorous theoretical proof, but says little about engineering. There is hardly a widely accepted reference implementation in the open source community, and many implementations of Paxos are only approximations of it. As a result, real-world implementations always deviate more or less from the protocol in the paper; although the proof of Paxos is very rigorous, its engineering implementations may enjoy far fewer theoretical guarantees.

These two problems are what kept distributed consensus algorithms from being adopted more quickly. In his PhD thesis, Diego Ongaro proposed a simplified distributed consensus algorithm, Raft, hoping to make consensus easier to understand and quicker to implement correctly.

An overview of Raft

The main process

The server side of Raft is a master-slave structure in which each replica maintains a log and a state machine. The log is maintained by the consensus module, and the state machine holds the data visible to the client, so any data that exists in the log but has not yet been applied to the state machine is invisible to the client.

In this case, the data is of the simple KV type. The consensus module is the core of the Raft algorithm: it ensures consistency by controlling leader election, the order of the log, and by notifying the state machine which entries to execute. As long as the logs are applied in the same order, the contents of the state machines stay consistent. The main flow of a change operation is as follows:

  1. The client sends the change information to the Leader
  2. The consensus module ensures that logs sent to each replica are consistent
  3. The state machine on each server receives the notification from the consensus module, obtains the corresponding operation from the log, and executes it
  4. Return the operation result to the client

The client may initially send the change request to the Follower. The Follower tells the client which node is the Leader and asks the client to send the change information to the Leader.
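To make the split between the consensus module’s log and the client-visible state machine concrete, here is a minimal Go sketch. All of the Go snippets in this article are illustrative and can be read as pieces of one hypothetical package; none of the names come from a real Raft library. Entries live in the log first, and only entries the consensus module has marked committed are applied to the KV store and become visible to reads.

```go
package raft

// Command is a single client request, here a simple KV write.
type Command struct {
	Key   string
	Value string
}

// KVStateMachine holds the data visible to clients. It only ever
// applies commands that the consensus module has marked committed,
// so an entry that is merely in the log is still invisible to reads.
type KVStateMachine struct {
	data map[string]string
}

func NewKVStateMachine() *KVStateMachine {
	return &KVStateMachine{data: make(map[string]string)}
}

// Apply executes one committed command against the store.
func (sm *KVStateMachine) Apply(cmd Command) {
	sm.data[cmd.Key] = cmd.Value
}

// Get serves a client read; it sees only applied data.
func (sm *KVStateMachine) Get(key string) (string, bool) {
	v, ok := sm.data[key]
	return v, ok
}
```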

term

In the Raft protocol, the timeline is divided into terms, with at most one Leader per term. In the figure above, the blue part of each term indicates that an election is in progress, and the green part indicates that a leader has been elected and is serving requests. A term does not always manage to elect a leader; for example, no election succeeded in t3.

Role transitions

In Raft, the nodes of the server are divided into three different roles: Leader, Follower, and Candidate.

Follower

On initialization, all nodes are Followers. A Follower only passively receives messages from the Leader and Candidates and does not actively send messages to them: it receives change messages from the Leader and returns the result of the change, and it receives vote requests from Candidates and returns whether it agrees to elect that Candidate as Leader.

The results returned by Followers are tagged with their current term, so that Candidates and Leaders can check whether they themselves have fallen behind.

Candidate

If a node finds that it has not received a heartbeat from the Leader for a long time, it nominates itself as the candidate Leader and sends a request for election. This node is the Candidate node until the election result is determined.

For the Candidate node, if the election succeeds, it becomes the Leader; if the campaign fails, it returns to being a Follower.

Leader

The Leader is responsible for receiving all update requests and forwarding them to the Followers, while continuously sending heartbeats to them. When a Leader finds that it has fallen behind, it turns back into a Follower.
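These transitions can be captured in a few lines of the same illustrative package (the type and method names here are mine, not Raft’s):

```go
// Role is the state a Raft node is currently in.
type Role int

const (
	RoleFollower Role = iota
	RoleCandidate
	RoleLeader
)

// Node holds just enough state to show the role transitions.
type Node struct {
	role        Role
	currentTerm int
}

// observeTerm is called for every incoming message. A node that sees
// a higher term knows it has fallen behind and immediately becomes a
// follower of that term; this is how a stale leader steps down.
func (n *Node) observeTerm(msgTerm int) {
	if msgTerm > n.currentTerm {
		n.currentTerm = msgTerm
		n.role = RoleFollower
	}
}

// onElectionTimeout fires when a node has not heard from a leader for
// too long: it increments its term and nominates itself as candidate.
func (n *Node) onElectionTimeout() {
	if n.role != RoleLeader {
		n.currentTerm++
		n.role = RoleCandidate
	}
}
```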

The data structure

Before looking at the Raft algorithm, let’s take a look at the individual data structures.

Log structures

A Raft log can be thought of as an append-only array in which each element contains a term and the content of one client request. The term is used to check whether other nodes’ logs are consistent with one’s own; the content is the concrete request, typically a value of the KV structure. Each entry is identified by its index, i.e. its position in the array.

In the figure above there are entries from three terms. The log in the first row is the Leader’s, and the four rows below it belong to the Followers. Indexes 1-7 are committed, meaning those entries have been replicated to a majority of nodes, are guaranteed by the Raft protocol never to be lost, and will eventually be replicated to all nodes.

Node status

In the Raft protocol, each node keeps a State data structure that holds information such as the node’s log and status. To ensure that a node’s state can be restored after a restart, the important fields are persisted; other information, which can be recomputed by communicating with the other nodes, is kept only in memory.

The State data structure includes:

Persistent information:

  1. CurrentTerm: Term of the current node. If the node becomes a follower, the term is updated to that of the leader.
  2. VotedFor: which candidate the current node voted for during the current term
  3. Log[]: Log list. Each entry contains term and content

In-memory (volatile) information:

  1. CommitIndex: the index of the latest log entry known to have been replicated to a majority of nodes. After a restart it can be recomputed by communicating with the other nodes.
  2. LastApplied: the index of the latest log entry applied to the state machine; it can likewise be recomputed after a restart.

Leader-specific in-memory information:

  1. NextIndex[]: for each follower, the index of the next log entry to send to it
  2. MatchIndex[]: for each follower, the highest index known to match the Leader’s log, defaulting to 0. The entries between matchIndex and nextIndex are the ones currently being synchronized, i.e. entries the Leader has sent in AppendEntries messages that the follower has not yet acknowledged
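Put into Go, this might look like the sketch below, continuing the illustrative package from earlier. The field names follow the Raft paper; as an assumption of this sketch, the log is stored as a plain slice with a dummy entry at index 0 so that real entries are numbered from 1, matching the paper.

```go
// LogEntry is one record in the replicated log: the term in which it
// was created plus the client command it carries.
type LogEntry struct {
	Term    int
	Command Command
}

// State mirrors the fields listed above. In a real implementation the
// persistent fields must be written to stable storage before replying
// to any RPC; the volatile fields live only in memory.
type State struct {
	// Persistent on all servers.
	CurrentTerm int        // latest term this node has seen
	VotedFor    int        // candidate voted for in CurrentTerm (-1 = none)
	Log         []LogEntry // Log[0] is a dummy entry; real entries start at index 1

	// Volatile on all servers.
	CommitIndex int // highest index known to be replicated on a majority
	LastApplied int // highest index applied to the state machine

	// Volatile on the leader, reset after every election.
	NextIndex  []int // per follower: index of the next entry to send
	MatchIndex []int // per follower: highest index known to match the leader
}
```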

Leader election

The trigger condition

The Raft protocol uses a timeout to decide whether a node has failed: if a Follower has not received any request from the Leader for longer than a certain period, it considers the Leader to have failed.

At this point the Leader may actually have failed, or it may just be overloaded, or the network may be having problems. In any case, the Follower actively becomes a Candidate, increments its current term, and tries to send vote requests to the other nodes.

Vote request

Requests for votes include:

  1. Term: Term of the current node, with a value of previous Term +1, indicating that it is a candidate for the next Term
  2. CandidateId: indicates the ID of the current candidate node
  3. LastLogIndex: Position of the last entry in the current (candidate) log
  4. LastLogTerm: Term of the last record in the current (candidate) log

The vote reply includes:

  1. Term: current Term of the node that receives the vote request
  2. VoteGranted: indicates whether the node votes for the current candidate
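In Go, the request and reply might be written as the following structs (again illustrative; the field names mirror the list above):

```go
// RequestVoteArgs is sent by a candidate asking for votes.
type RequestVoteArgs struct {
	Term         int // candidate's term (previous term + 1)
	CandidateId  int // id of the candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

// RequestVoteReply is returned by the node that received the request.
type RequestVoteReply struct {
	Term        int  // receiver's current term, so the candidate can tell if it is behind
	VoteGranted bool // true if the receiver voted for this candidate
}
```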

Candidate sends a vote request

Once a Follower decides that the leader has failed, it concludes it is time to move on to the next term and performs the following operations:

  1. Increment its current term and change its state to Candidate
  2. Vote for itself and update votedFor
  3. Send vote requests to all servers and wait; once more than half of the members agree, it becomes the new leader

Followers receive a vote request

After receiving the vote request, a Follower performs the following checks (a code sketch follows the list):

  1. Check the candidate’s term and return false if it is smaller than the Follower’s own term
  2. Check the votedFor field and return false if the Follower has already voted for someone else in the current term. This step ensures that a Follower votes for at most one node per term
  3. At this point the candidate’s term is new enough and the current node has not voted for anyone else. Use lastLogIndex and lastLogTerm to check whether the candidate’s log is at least as up to date as the Follower’s own. Return true if it is; otherwise return false.
    • The rule for deciding which of two logs is newer: first compare lastLogTerm, and the higher term wins; if lastLogTerm is equal, the larger lastLogIndex wins.
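The three checks can be sketched as a follower-side handler, reusing the State and RequestVote types from the earlier sketches. This is an illustrative sketch, not a complete implementation: locking, persistence, and timer resets are omitted.

```go
// handleRequestVote applies the three rules above, in order.
func (rf *State) handleRequestVote(args RequestVoteArgs) RequestVoteReply {
	reply := RequestVoteReply{Term: rf.CurrentTerm}

	// 1. Reject candidates whose term is older than ours.
	if args.Term < rf.CurrentTerm {
		return reply
	}
	// Seeing a newer term resets our vote for that term.
	if args.Term > rf.CurrentTerm {
		rf.CurrentTerm = args.Term
		rf.VotedFor = -1
	}
	// 2. At most one vote per term.
	if rf.VotedFor != -1 && rf.VotedFor != args.CandidateId {
		reply.Term = rf.CurrentTerm
		return reply
	}
	// 3. Grant the vote only if the candidate's log is at least as up
	// to date as ours: a higher last term wins; with equal terms, the
	// longer log wins.
	lastIndex := len(rf.Log) - 1
	lastTerm := 0
	if lastIndex >= 0 {
		lastTerm = rf.Log[lastIndex].Term
	}
	upToDate := args.LastLogTerm > lastTerm ||
		(args.LastLogTerm == lastTerm && args.LastLogIndex >= lastIndex)
	if upToDate {
		rf.VotedFor = args.CandidateId
		reply.VoteGranted = true
	}
	reply.Term = rf.CurrentTerm
	return reply
}
```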

Possible election results

The result of the vote could be as follows:

elected

Each node votes for the first qualified candidate of the new term. If a candidate receives positive replies from more than half of the nodes, it considers itself elected, changes its state to Leader, and immediately starts sending heartbeats to all nodes to maintain its position and prevent other nodes from standing for election.

Another node is elected first

If, while waiting for a majority of votes, the candidate receives a heartbeat or change message from another server whose term is no smaller than its own, it means another node has already won the election, and no amount of waiting will produce a majority. The Candidate then changes its state back to Follower, updates its term to the term in that heartbeat, and starts accepting commands from the new leader.

No node wins and the election times out

There is also the case where no node succeeds, for example when three candidates request votes at the same time and none of them wins a majority. In that case, if a candidate finds it has not become the leader within a certain period, it automatically starts the next election, until it either becomes the leader or receives a heartbeat from another leader. (Raft randomizes these election timeouts, so repeated split votes quickly become unlikely; see the sketch below.)
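The retry loop, together with the randomized timeout, might look roughly like this. It is a simplified sketch: a real implementation collects votes asynchronously while the timer runs, and requestVotes and heardFromLeader stand in for the actual RPC plumbing. It assumes the term was already incremented when the node became a candidate (see onElectionTimeout above) and needs the time and math/rand packages.

```go
import (
	"math/rand"
	"time"
)

// electionTimeout returns a randomized timeout (here 150-300 ms), so
// that competing candidates rarely keep timing out at the same moment.
func electionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}

// runCandidate keeps starting new elections until this node wins a
// majority or hears from a valid leader.
func (n *Node) runCandidate(
	requestVotes func(term int) (votes, clusterSize int),
	heardFromLeader func() bool,
) {
	for n.role == RoleCandidate {
		votes, total := requestVotes(n.currentTerm)
		switch {
		case votes > total/2:
			n.role = RoleLeader // won: start sending heartbeats
			return
		case heardFromLeader():
			n.role = RoleFollower // someone else won first
			return
		default:
			time.Sleep(electionTimeout()) // split vote: wait, then try again
			n.currentTerm++               // the next attempt runs in a fresh term
		}
	}
}
```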

Log replication

Log states

A log entry passes through several states across the nodes between the moment a request is sent and the moment the update succeeds. Raft’s guarantee is that once an entry has been replicated to a majority of nodes, it will not be lost; and as long as it cannot be lost, the change can be executed on the state machine and seen by the client.

Recalling the node State data structure, Raft tracks two states for log entries to determine how far an entry has progressed: Committed and Applied:

  • Committed: the entry has been persisted on more than half of the nodes. A committed entry is guaranteed not to be lost.

  • Applied: the entry has been executed on the state machine and can be read by clients.

Raft requires that an entry be processed only after all earlier entries have been handled, so instead of maintaining a state for every entry, it only records the latest committed position via commitIndex and the latest applied, client-visible position via lastApplied. Every entry up to and including commitIndex is committed, and every entry up to and including lastApplied is applied.

In the log state shown above, there are five nodes.

For the leader, the entries at indexes 1 through 7 have been replicated to a majority of nodes, so its commitIndex is 7, meaning the first 7 entries can safely be executed on the state machine. Assuming the first six entries have been executed, the leader’s lastApplied is 6.

For the first follower, there is some lag relative to the leader: its latest entry only reaches position 4. So its commitIndex may be 4, and its lastApplied 4 or earlier.

Replication process

As described in the Raft overview, requests from clients are sent to the Leader. After appending the request to its local log, the Leader sends an append command to the Followers. If a Follower fails to execute the command, the Leader retries again and again until it succeeds. As soon as more than half of the Followers report that the entry was appended successfully, the Leader can advance commitIndex to that entry’s index, marking the entry as durably persisted, safely tell the state machine to execute it, and update lastApplied, so the client can see the change.

Why does appending to more than half of the nodes matter? Recall the election process: if a node that missed the entry becomes a candidate, more than half of the nodes will notice during voting that its log is older than their own and refuse to elect it. So only a node belonging to that majority can become leader, which ensures that an entry replicated to a majority survives all future elections. And thanks to the same chained consistency check, if two servers hold the same term at some index, all of their earlier entries are identical as well.

Once the entry is durably persisted, the heartbeats sent by the Leader carry its updated commitIndex (the leaderCommit field). From this, Followers learn that the entries are safely persisted, update their own commitIndex accordingly, and execute the entries between the old lastApplied and the new commitIndex on their state machines.
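On the follower side this boils down to a small apply loop; the sketch below reuses the State and KVStateMachine types from earlier (error handling and locking omitted):

```go
// applyCommitted is called when a heartbeat or append reveals a larger
// leaderCommit: advance our commitIndex (never past our own log) and
// apply every entry between the old lastApplied and the new commitIndex.
func (rf *State) applyCommitted(leaderCommit int, sm *KVStateMachine) {
	if leaderCommit > rf.CommitIndex {
		rf.CommitIndex = leaderCommit
		if last := len(rf.Log) - 1; rf.CommitIndex > last {
			rf.CommitIndex = last
		}
	}
	for rf.LastApplied < rf.CommitIndex {
		rf.LastApplied++
		sm.Apply(rf.Log[rf.LastApplied].Command)
	}
}
```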

In Raft, change messages with empty content are used as heartbeat messages to keep data structures as lean as possible. So the data structures for both the heartbeat message and the change message are shown in the following section.

The change message

Change messages are only sent by a node that believes it is the Leader; whether the Followers recognize it as the Leader is another matter. If a Follower discovers that the “Leader” is further behind than itself, it does not execute the corresponding command.

Change message request data structure:

  1. Term: term of the current (leader) node. This field lets followers determine whether the sender has fallen behind
  2. LeaderId: the ID of the current leader node, so that lagging or newly added nodes know who the current leader is, and so that a Follower can redirect clients that send it change requests
  3. PrevLogIndex: index of the log entry immediately preceding the new entries, which lets followers check whether their log is consistent with the leader’s
  4. PrevLogTerm: term of the entry at prevLogIndex, used together with prevLogIndex in the consistency check
  5. Entries[]: the entries of this change (there can be several, for performance), or empty if it is a heartbeat
  6. LeaderCommit: the leader’s commitIndex, i.e. the latest position the leader knows to be committed

The change message reply:

  1. Term: current Term of follower
  2. Success: Indicates whether the message was successfully executed
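As Go structs (illustrative, mirroring the fields above), the change message and its reply look like this; a heartbeat is simply an AppendEntriesArgs with an empty Entries slice:

```go
// AppendEntriesArgs is both the change message and the heartbeat.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderId     int        // lets followers redirect clients to the leader
	PrevLogIndex int        // index of the entry immediately before Entries
	PrevLogTerm  int        // term of the entry at PrevLogIndex
	Entries      []LogEntry // new entries to append; empty for a heartbeat
	LeaderCommit int        // the leader's commitIndex
}

// AppendEntriesReply is the follower's answer.
type AppendEntriesReply struct {
	Term    int  // the follower's current term
	Success bool // true if the prevLog check passed and entries were appended
}
```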

An example

The figure above shows a simple log scenario with a current Leader and six followers (a)-(f). Digging into the details of election and replication makes the process clearer, but it also takes time; if you only want a quick overview, you can skip the example. To make it easier to follow, I gave the process a magical backstory (just for laughs):

  1. In term 1, a long time ago, node (a) was the leader of all the nodes. It kept sending heartbeats to the other nodes and held on to its position as leader.
    1. During term 1, the Magic Castle received three change requests from clients; node (a) successfully replicated the three request logs to all nodes, and the clients received three successful results.
    2. Then a magic disturbance made (a) temporarily lose contact with everyone. Although contact was soon restored, enough time had passed since the last heartbeat that the remaining nodes held a new election in the meantime.
  2. Node (f) became the leader of term 2 and received three requests. But no sooner had (f) become leader than the dark demon fairy pulled out its network cable, and the other nodes never received the three messages. From the clients’ point of view, these three changes were not executed successfully.
  3. The little magic fairy Balala saw through the plot and plugged the cable back in. Holding the newest lastLogIndex and lastLogTerm, (f) was re-elected as the leader of term 3 and received five term 3 requests. Then disaster struck again: the cable was cut, the five new entries were never sent out, the clients’ five new changes failed once more, and node (f) fell into a deep sleep.
  4. Node (e) could not bear to let the Magic Castle down and became the leader of term 4.
    1. Node (e) received two requests from clients. It diligently replicated the first term 4 request to every node except (f), and the second to every node except (b) and (f), reaching a majority both times. So the clients were told that the first two term 4 changes had succeeded.
    2. While node (e) was still urging nodes (b) and (f) to catch up with the latest log, its own network cable was killed as well. The clients, unaware of what had happened to (e), kept going and sent it two more change requests. Only after updating its local log did (e) discover that it could no longer reach the outside world, and it could only watch the other nodes hold a new election.
    3. In fact, node (b) had already been struck by dark magic before it could apply the second request, and it was frozen into a forgotten server in a corner nobody noticed.
  5. At this point only four nodes could still be reached: (a), (c), (d), and the current Leader. The Magic Castle cannot go a single day without a magic king, so node (a) stood up and ran for leader of term 5. (c), (d), and the current Leader naturally gave it their full support, and node (a) became the leader of term 5.
    1. Living up to its record as the old leader of term 1, node (a) received two new change requests from clients, successfully replicated the entries to the four reachable nodes, and told the clients the changes had succeeded.
    2. But (a)’s old illness flared up again; the magic disturbance around it grew stronger and stronger, and it once failed to send a heartbeat in time.
  6. Because (a)’s heartbeat was late yet again, node (c) decided it would make a more suitable Leader itself and started a new round of voting for term 6. Node (d) and the current Leader saw that (c)’s log was up to date and agreed to make it the new Leader. By the time node (a) recovered, it had already received the vote request from (c). Since (c)’s log was as new as its own, why not let it be the Leader? So node (a) voted yes as well. With the support of 4 of the 7 nodes, (c) had a majority and was elected Leader of term 6.
    1. The clients sent four more change requests; node (c) kept the four entries locally and sent them to the remaining nodes. Apart from the three nodes that could not answer, the other four nodes each applied more or fewer of the new changes.
    2. Node (c) still could not escape the curse, and at some point it failed to get heartbeats out to all the nodes. By the time of the next election, the clients knew that the first two of the four changes had been executed successfully, but not necessarily the last two.
  7. While node (c) was busy getting the other nodes to apply its changes, it received a vote request from node (d). Because (c)’s log was newer than (d)’s, it refused to make (d) the Leader. Just as node (d) was at a loss, the long-forgotten node (b) suddenly broke free of its freezing spell! After waking up, node (b) also received the vote request from (d). Since (d)’s log was newer than its own and its term was higher, node (b) voted for it without hesitation, and node (d) was elected Leader of term 7.
    1. Node (d) happily appended the two new messages to its local log and was just about to send them to the other nodes when the power went out…
    2. Meanwhile, (e) and (f) woke up from their long sleep.
  8. Finally, the current Leader appears. It found that node (d) had not sent a heartbeat for a long time and launched a new campaign. This time its log was newer than that of every node except (c) and (d), so it successfully collected 5 votes and became the leader of term 8.

That concludes the history of the fairy kings’ reigns; if you read through the whole process, please click 18 thumbs up for your patience.

Handling log inconsistencies

So far, so good! In this perfect process, nobody makes a mistake and nobody loses a connection. But if things were really that simple, why would Diego Ongaro need to write a doctoral thesis of 400-plus pages?

We now know that such inconsistencies can happen; what mechanism does Raft use to make the log state of all nodes eventually consistent?

In Raft, the leader forces the Followers’ logs to become exactly the same as its own; when there is a conflict, the leader’s log wins. When the leader sends a message to a Follower, it attaches prevLogIndex and prevLogTerm, the position and term of the entry preceding the new ones.

  1. If the Follower finds that the term of its entry at prevLogIndex does not match prevLogTerm, its log does not match the leader’s, and it returns false.

  2. When the Leader sees the mismatch, it retries with an earlier entry, stepping backwards until the term at the corresponding position of the Follower matches its own.

  3. Once the Follower confirms that the preceding entry matches, it overwrites its local log with the entries sent by the leader and returns true.

  4. The Leader receives the successful append, increments its index for that follower, and repeats until all entries have been synchronized to the Follower (a sketch of the Follower-side check appears below).
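The follower side of steps 1 and 3 can be sketched like this, reusing the types above. It is simplified on purpose: a real implementation only truncates on an actual conflict and handles stale RPCs more carefully.

```go
// handleAppendEntries rejects the append if the entry at PrevLogIndex
// does not carry PrevLogTerm; otherwise it overwrites any conflicting
// suffix with the leader's entries — the leader's log always wins.
func (rf *State) handleAppendEntries(args AppendEntriesArgs) AppendEntriesReply {
	reply := AppendEntriesReply{Term: rf.CurrentTerm}

	// Ignore leaders from an older term.
	if args.Term < rf.CurrentTerm {
		return reply
	}
	rf.CurrentTerm = args.Term

	// Consistency check: we must already hold the entry the leader
	// says precedes the new ones, with the same term.
	if args.PrevLogIndex >= len(rf.Log) ||
		rf.Log[args.PrevLogIndex].Term != args.PrevLogTerm {
		return reply // Success stays false; the leader will step back and retry
	}

	// Check passed: everything after PrevLogIndex is replaced.
	rf.Log = append(rf.Log[:args.PrevLogIndex+1], args.Entries...)

	// Learn what the leader has committed (see the replication section).
	if args.LeaderCommit > rf.CommitIndex {
		rf.CommitIndex = args.LeaderCommit
		if last := len(rf.Log) - 1; rf.CommitIndex > last {
			rf.CommitIndex = last
		}
	}
	reply.Term = rf.CurrentTerm
	reply.Success = true
	return reply
}
```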

In this example, for the current Leader, the stale entries on (a), (b), (e), and (f) will be overwritten to match the current Leader’s log. In addition, once further entries with term 8 appear, the entries at positions 11 and 12 on (c) and (d) will also be overwritten. Because those entries were never replicated to more than half of the nodes, they can never have been applied to a state machine, so overwriting them has no real impact.

Take (f) as an example; its synchronization with the Leader might go through the following steps:

Step 1: Recall the change message. The Leader uses an empty change message as the heartbeat and sends it to (f) with its latest prevLogIndex and prevLogTerm, currently prevLogIndex=10 and prevLogTerm=6.

Step 2: (f) finds that the term at its local position 10 is 3, which does not match the prevLogTerm sent by the Leader. It returns false, indicating that its log is inconsistent with the Leader’s at that position.

Step 3: The Leader receives the false reply from (f) and learns that the entry at position 10 does not yet match its own, so it steps back and resends with prevLogIndex=9, prevLogTerm=6.

Step 4: (f) checks the term at position 9, finds it is still inconsistent with the Leader’s, and returns false.

Step 5: The Leader continues sending the index and term of earlier and earlier entries.

Step 6: They go back and forth like this until (f) finds that prevLogIndex=3, prevLogTerm=1 matches its own log.

Step 7: Since the entry at prevLogIndex=3, prevLogTerm=1 matches, (f) can apply the entries sent by the Leader: it clears everything after that position, appends the entry at index=4, and returns success.

Step 8: The Leader receives the success reply and knows that (f) has synchronized the entry at index=4, so it updates (f)’s slot in the nextIndex array to 5 and sends the change at position 5 to (f).

Step 9: (f) continues synchronizing with the Leader and returns success.

This continues step by step until (f)’s log matches the Leader’s and (f)’s slot in the Leader’s nextIndex array reaches the latest value, 11.
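The leader side of this back-and-forth is essentially a loop that walks nextIndex backwards until the follower accepts, then jumps forward. In the sketch below, send stands in for the real RPC; retries on network failure, term checks on the reply, and locking are all omitted.

```go
// syncFollower keeps sending AppendEntries to one follower, stepping
// prevLogIndex back one entry per rejection, exactly as in the steps
// above, and records the follower as fully matched once it accepts.
func (rf *State) syncFollower(follower int,
	send func(AppendEntriesArgs) AppendEntriesReply) {

	for {
		next := rf.NextIndex[follower] // next entry we think the follower needs
		args := AppendEntriesArgs{
			Term:         rf.CurrentTerm,
			PrevLogIndex: next - 1,
			PrevLogTerm:  rf.Log[next-1].Term,
			Entries:      rf.Log[next:],
			LeaderCommit: rf.CommitIndex,
		}
		reply := send(args)
		if reply.Success {
			// Everything we sent is now on the follower.
			rf.MatchIndex[follower] = len(rf.Log) - 1
			rf.NextIndex[follower] = len(rf.Log)
			return
		}
		// Mismatch at prevLogIndex: step back one entry and retry with
		// an earlier prevLogIndex/prevLogTerm.
		rf.NextIndex[follower]--
	}
}
```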

Handling uncommitted entries from previous terms

Suppose a leader replicates an entry to a majority of machines and has updated commitIndex locally and applied the entry to its state machine, but the other machines have not yet learned this via heartbeat. After the next election, the new leader may not know whether entries from the previous term have been committed. Consider the following scenario:

In the scenario above, the stages are as follows:

A) S1 is the leader of term 2 and successfully replicates an entry to S2

B) S1 crashes; S5 becomes the leader of term 3 and receives a new change request

C) S5 crashes; S1 restarts and becomes the leader of term 4. After receiving a new change request, S1 continues replicating the old entry to S3. The term 2 entry has now reached a majority, and S1 commits it with relief.

D) If S1 then fails again, S5 can be elected leader of term 5 and starts replicating its own old entry to all servers, overwriting the term 2 and term 4 entries that had already been replicated to most of them. The term 2 entry that was committed on S1 is thereby overwritten, violating the rule that committed entries must never be overwritten.

E) If, instead, in case C) S1 had already replicated the term 4 entry to a majority of machines, then S5 could not be elected leader, because S5’s lastLogTerm is smaller than term 4 and it cannot win majority support.

To avoid the situation in D), Raft specifies that a leader only directly commits entries from its own current term; entries from earlier terms are committed indirectly along with them. If a follower finds unapplied entries below the new commitIndex, it applies them to the state machine. In figure C), S1 does not advance commitIndex when the term 2 entry reaches a majority; commitIndex is advanced only after the term 4 entry has been replicated to a majority. Having received the new commitIndex, the followers apply both the term 2 and term 4 entries to their state machines.
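On the leader, this restriction shows up as one extra check when advancing commitIndex: only an entry from the leader’s own term may be committed by counting replicas; older entries become committed along with it. A sketch using the State struct from earlier (the leader counts itself, and MatchIndex here holds one slot per follower):

```go
// maybeAdvanceCommit finds the highest index N replicated on a majority
// and commits it — but only if Log[N] was created in the current term.
func (rf *State) maybeAdvanceCommit(clusterSize int) {
	for n := len(rf.Log) - 1; n > rf.CommitIndex; n-- {
		if rf.Log[n].Term != rf.CurrentTerm {
			break // never commit an old-term entry by counting replicas
		}
		count := 1 // the leader itself holds the entry
		for _, m := range rf.MatchIndex {
			if m >= n {
				count++
			}
		}
		if count > clusterSize/2 {
			rf.CommitIndex = n // entries <= n, including old-term ones, are now committed
			return
		}
	}
}
```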

After doing so, the following situations may occur:

  1. Everything goes well: the term 2 and term 4 entries are successfully committed and applied on most nodes

  2. S1 fails after advancing commitIndex past the term 4 entry and applying the term 2 and term 4 entries to its state machine, while commitIndex has not yet been updated on the other servers. In this case more than half of the servers already hold the term 4 entry, so the next leader is guaranteed not to lose the applied entries.

  3. The term 2 entry has reached a majority but the term 4 entry has not; S1’s commitIndex has not been advanced, and neither entry has been applied to the state machine.

    1. In this case S5, which lacks the term 4 entry, may become leader and overwrite the term 2 and term 4 entries. But because neither entry has been applied on any state machine, no rule is violated and no harm is done.
    2. Alternatively S2, which holds the term 4 entry, may become leader. Once new entries of its own term reach a majority, S2 advances commitIndex and applies the entries, and the other machines then apply the term 2 and term 4 entries as well.
    3. If, in case 3.2, the client never submits another request, the term 2 and term 4 entries may never be applied on any state machine. As in 3.1, this has no material impact.

Consistency proof

This section proves the correctness of the Raft algorithm; if you are not interested in the logical proof, you can skip it.

To recap, Raft provides the following guarantees:

  1. Election safety: at most one node can be elected leader in a given term (by the pigeonhole principle; leaders of two different terms may briefly coexist, but entries proposed by the stale leader of the older term can no longer be committed or applied)

  2. Leader append-only: a Leader only appends new entries to its log; it never deletes or overwrites existing entries

  3. Log matching: if the logs of two replicas contain an entry with the same index and term, then all earlier entries in the two logs are identical (guaranteed by the log replication rules)

  4. Leader completeness: if an entry is committed in some term, it is present in the logs of the leaders of all subsequent terms.

  5. State machine safety: if one server has applied the entry at a given index to its state machine, no server will ever apply a different entry at that index.

Can the process so far ensure that the order on each state machine is the same?

To prove by contradiction:

Assume the fourth property, leader completeness, does not hold; we then derive a contradiction, so the property must in fact hold.

Suppose the leader of term T commits an entry of its current term, but the entry is not stored by the leader of some later term. Let U be the smallest such term whose leader (leader_U) does not store this entry.

  1. The committed entry must already have been absent from leader_U’s log at the time of its election, because a leader never deletes entries from its own log.

  2. The entry must have been replicated to a majority of nodes, otherwise it would not have been committed. Since leader_U was elected by a majority, there must be at least one node that both accepted the entry and voted for leader_U.

  3. That voter must have received the entry before casting its vote: it accepted the entry in term T and granted its vote in the later term U, so the entry was already in its log when it voted.

  4. Since granting a vote requires that the candidate’s log be at least as up to date as the voter’s, leader_U’s log must already have contained the committed entry in order to win that vote. And since a leader never deletes its own entries, the committed entry must still be present in term U, which contradicts the assumption.

Having proved the fourth, we can prove the fifth:

Once a server applies an entry to its state machine, that entry and all earlier entries must be identical to the leader’s, and the entry must be committed. By leader completeness, every future leader also contains the entry, so no node can ever receive and apply a different entry at that index.

Membership changes

The transition process

The discussion so far assumed a fixed set of nodes. When the membership changes, different servers inevitably switch to the new configuration at different times, so there is a risk of split brain (two nodes each believing they have majority support).

To stay safe, the membership update must use a two-phase approach. The cluster first switches to a transitional state containing both the old and the new node sets (every node in either set is included):

  1. Requests in the transition phase are replicated to servers in both the old and the new sets

  2. A server from either set can stand for election as leader

  3. A commit requires replication to more than half of the servers in both the old and the new sets
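The key difference from normal operation is this double-majority rule. As a sketch (assuming servers are identified by plain int IDs), a commit or an election during the transition would be checked roughly like this:

```go
// committedInJointConfig reports whether enough servers have accepted
// an entry (or granted a vote) during the joint-configuration phase:
// it needs a separate majority in BOTH the old and the new server sets.
func committedInJointConfig(acks map[int]bool, oldSet, newSet []int) bool {
	return hasMajority(acks, oldSet) && hasMajority(acks, newSet)
}

func hasMajority(acks map[int]bool, servers []int) bool {
	count := 0
	for _, id := range servers {
		if acks[id] {
			count++
		}
	}
	return count > len(servers)/2
}
```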

When the leader receives a membership-change request, it combines the old and new configurations into a joint configuration, C_old,new, and publishes it to both the old and the new nodes as a special log entry. As soon as a node receives this entry, it starts using the joint server list. If the leader crashes after this point, the new leader may be elected under either C_old or C_old,new; during this period, the nodes of C_new alone cannot make unilateral decisions.

Once C_old,new is committed, only nodes that have received C_old,new can be elected leader. At this point it is safe for the leader to create the C_new entry, because the leader knows that C_old,new has been replicated to a majority. All nodes that receive C_new then use that configuration. Once more than half of the machines in C_new have received it, the nodes that are no longer needed can safely go offline: any later entry the leader produces is created under C_new and also needs a majority of C_new to commit, so the safety conditions are still met.

Initializing new nodes

Before a configuration change can begin, the new nodes need to bring their logs up to date. A newly added node therefore starts as a member without voting rights and passively receives the leader’s log; the configuration-change process above only starts once its log has caught up.

The Raft protocol also has a corresponding snapshot mechanism for log compaction, which is not covered here.

Taking the leader offline

If the leader itself is not a member of C_new, it steps down and goes offline once the C_new entry has been committed.

Resources

In Search of an Understandable Consensus Algorithm (Extended Version) is the Raft paper; most of the figures in this article are taken from it. Reading it will give you a clearer picture of the details, including log compaction, snapshots, and so on.

If you’d like to see more examples, check out the visualization page listed at the end of this article, a GitHub project created specifically for learning the Raft algorithm. The examples are vivid and easy to follow.

In engineering practice, the well-known etcd is based on Raft, while ZooKeeper uses another consensus algorithm, Zab, with similar functionality. For a systematic understanding of consensus algorithms and their engineering, including Paxos, Zab, and Raft, consider Diego Ongaro’s doctoral thesis, Consensus: Bridging Theory and Practice.

Finally, if you are interested in writing your own Raft implementation, take a look at MIT’s distributed systems course 6.824; Lab 2 is an engineering implementation of Raft.


Resources and websites

Raft paper: raft.github.io/raft.pdf

Visualization of the Raft algorithm: thesecretlivesofdata.com/raft/

Raft PhD thesis, Consensus: Bridging Theory and Practice: web.stanford.edu/~ouster/cgi…

MIT distributed systems course 6.824: pdos.csail.mit.edu/6.824/sched…