Raft is a strongly consistent, decentralized, and highly available distributed consensus protocol that is widely used in engineering practice. The emphasis here is on engineering, because in academic circles the most dazzling and famous algorithm is Paxos. But Paxos is easy for the few who truly understand it and hard for everyone else, and most people fall into the second group.

Raft is a consensus algorithm. Consensus means that multiple nodes agree on something even when some nodes fail, the network is slow, or the network partitions. The cryptocurrencies and blockchains that have been so popular in recent years also depend on consensus algorithms, while in distributed systems consensus algorithms are mostly used to improve fault tolerance, for example in the replication layer of distributed storage. Raft is a leader-based consensus algorithm, as opposed to leaderless consensus algorithms.

Overview of the Raft algorithm

Understandability is the number one goal of the Raft algorithm. Of course, while improving understandability, Raft is not inferior to Paxos in terms of performance, reliability, and availability.

Raft put a great deal of work into achieving this goal of being easy to understand, and it did two main things:

  • Problem decomposition
  • State space simplification

Raft first elects a leader, which takes full responsibility for managing the replicated log. The leader receives all client update requests, replicates them to the follower nodes, and applies them when it is “safe” to do so. If the leader fails, the followers elect a new leader.

This brings us to Raft’s two sub-problems: leader election and log replication.

Leader election

In the Raft protocol, a node is in one of three states at any given time:

  • leader
  • follower
  • candidate

The following diagram illustrates the differences between these three states.

All nodes start in the follower state. If a follower receives no heartbeat from the leader within a certain period of time, it switches to candidate and starts an election. If it receives votes from a majority of nodes, it switches to the leader state. If a node discovers that another node has a newer term than its own, it voluntarily steps down to follower.

In short, there is at most one leader in the system. If there is no leader for a period of time, the nodes choose one by election and voting. The leader continually sends heartbeat messages to the followers to signal that it is alive. If the leader fails, the followers switch to candidate and elect a new leader.
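
To make this concrete, here is a minimal Go sketch of the three states and of a follower timing out into candidate. The node type, its fields, and the heartbeatCh channel are hypothetical simplifications for illustration (they are not part of the Raft paper), and later sketches in this article extend the same type.

```go
package raft // a hypothetical package holding the sketches in this article

import "time"

// State of a Raft node at any given moment.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// node is a heavily simplified, hypothetical Raft node.
type node struct {
	id          int
	state       State
	currentTerm int
	votedFor    int           // -1 means "has not voted in this term"
	heartbeatCh chan struct{} // signaled whenever a leader heartbeat arrives
}

// runFollower waits for leader heartbeats; if none arrives within the
// election timeout, the node switches to candidate so it can start an election.
func (n *node) runFollower(electionTimeout time.Duration) {
	for n.state == Follower {
		select {
		case <-n.heartbeatCh:
			// The leader is alive; keep following.
		case <-time.After(electionTimeout):
			n.state = Candidate // no heartbeat in time: stand for election
		}
	}
}
```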

Terms

As mentioned, leaders are chosen by voting. Each leader serves for a period of time, after which a new leader is elected to carry on. Each such period of office is called a term, and the Raft protocol uses exactly this notion.

A term begins with an election, followed by a period of normal operation that may be long or short. As the figure above shows, term numbers increase monotonically and act as a logical clock. Term 3 illustrates the case where no leader is elected and a new election is then started.

The election process

We know that if a follower does not receive a heartbeat from the leader within the election timeout (perhaps no leader has been elected yet and everyone is waiting, perhaps the leader has died, or perhaps there is a network failure between the leader and the follower), it starts an election. The steps are as follows (a code sketch follows this list):

  • Increment the node’s local current term and switch to the candidate state
  • Vote for itself
  • Send RequestVote RPCs to the other nodes in parallel
  • Wait for replies from the other nodes
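
A candidate-side sketch of these four steps, continuing the hypothetical node type from the earlier sketch; lastLogIndex, lastLogTerm, and sendRequestVote are assumed helper methods (the last being an RPC stub), not real library calls:

```go
// RequestVoteArgs is a simplified RequestVote RPC payload.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

// startElection follows the four steps listed above.
func (n *node) startElection(peers []int) {
	n.currentTerm++     // 1. increment the local current term
	n.state = Candidate //    and switch to the candidate state
	n.votedFor = n.id   // 2. vote for itself

	args := RequestVoteArgs{
		Term:         n.currentTerm,
		CandidateID:  n.id,
		LastLogIndex: n.lastLogIndex(),
		LastLogTerm:  n.lastLogTerm(),
	}
	replies := make(chan bool, len(peers))
	for _, p := range peers {
		go func(peer int) { // 3. send RequestVote RPCs to the others in parallel
			replies <- n.sendRequestVote(peer, args) // hypothetical RPC stub
		}(p)
	}

	votes := 1        // its own vote
	for range peers { // 4. wait for the other nodes to reply
		if <-replies {
			votes++
		}
		if votes > (len(peers)+1)/2 { // a majority of the whole cluster
			n.state = Leader
			return
		}
	}
}
```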

Depending on the messages received from other nodes, there are three possible outcomes during this process:

  1. It receives votes from a majority of nodes, wins the election, and becomes the leader
  2. It is told that another node has already been elected, and it switches to follower
  3. No node receives a majority of votes within a certain period of time; it remains a candidate and starts a new election

In the first case, after winning the election, the new leader immediately sends a message to all nodes to announce itself and prevent other nodes from triggering new elections. Turning to the voter’s point of view, how does a node decide whether to grant its vote to an election request? The following constraints apply (a sketch of this check follows the list):

  • Each node may cast at most one vote per term
  • The candidate’s log must be at least as up to date as the voter’s
  • Votes are granted on a first-come, first-served basis
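
As promised, here is a sketch of the voter-side check on the same hypothetical node type (lastLogIndex and lastLogTerm are assumed helpers). It only illustrates the three constraints; persistence and locking are ignored:

```go
// handleRequestVote decides whether to grant a vote, applying the three
// constraints above.
func (n *node) handleRequestVote(args RequestVoteArgs) bool {
	// A request from an older term is never granted.
	if args.Term < n.currentTerm {
		return false
	}
	if args.Term > n.currentTerm {
		// Newer term: adopt it and forget who we voted for in the old term.
		n.currentTerm = args.Term
		n.votedFor = -1
		n.state = Follower
	}
	// Constraints 1 and 3: at most one vote per term, first come first served.
	if n.votedFor != -1 && n.votedFor != args.CandidateID {
		return false
	}
	// Constraint 2: the candidate's log must be at least as up to date as ours
	// (compare the last entry's term first, then the log length).
	if args.LastLogTerm < n.lastLogTerm() ||
		(args.LastLogTerm == n.lastLogTerm() && args.LastLogIndex < n.lastLogIndex()) {
		return false
	}
	n.votedFor = args.CandidateID
	return true
}
```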

For the second case, suppose we have three nodes A, B, and C. A and B start elections at the same time, and A’s election message reaches C first, so C votes for A. When B’s message reaches C, it cannot satisfy the first constraint above, so C does not vote for B, and A and B obviously will not vote for each other. After A wins, it sends heartbeat messages to B and C. B sees that A’s term is not lower than its own, learns that a leader already exists, and converts to follower.

In the third case, when the number of nodes is even, it is possible that no node wins a majority of votes, as shown in the following figure:

If elections keep ending in a tie, the system stays unavailable for an extended period of time (no leader can handle client write requests), so Raft introduces randomized election timeouts to make ties as unlikely as possible. In addition, leader-based consensus algorithms usually deploy an odd number of nodes to ensure that a majority can emerge.
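
A randomized election timeout is usually just a base value plus a random offset. A tiny sketch using the standard time and math/rand packages:

```go
// randomElectionTimeout returns a timeout in [base, 2*base), e.g. 150-300 ms,
// so that two candidates are unlikely to time out at the same moment again
// and again.
func randomElectionTimeout(base time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(base)))
}
```

Because each retry draws a fresh timeout, one candidate almost always wakes up, asks for votes, and wins before the others time out again.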

Log replication

Once there is a leader, the system can start serving external requests. All client requests go to the leader, which decides the order of these concurrent requests and keeps the leader and follower states consistent. Raft’s approach is to inform the followers of the requests and of the order in which to execute them; the leader and followers execute the requests in the same order, which keeps their states consistent.

Replicated state machines

The implementation of consensus algorithms is generally based on replicated state machines.

Simply put: same initial state + same inputs = same final state. The replicated log is persistent and order-preserving, and it is the cornerstone of most distributed systems.

So in Raft, the leader wraps client commands into log entries and replicates these entries to all follower nodes. Each node then applies the commands in the log entries in the same order, and the resulting states are guaranteed to be identical.
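
The apply side of a replicated state machine can be sketched as follows, assuming the hypothetical node from earlier also keeps a log, a commitIndex, and a lastApplied index (all assumed fields):

```go
// LogEntry holds one client command plus the term in which the leader
// received it.
type LogEntry struct {
	Term    int
	Command string // the state-machine instruction, e.g. "x <- 3"
}

// applyCommitted applies every newly committed entry, in log order, to the
// local state machine: same initial state + same ordered input = same state.
func (n *node) applyCommitted(apply func(cmd string)) {
	for n.lastApplied < n.commitIndex {
		n.lastApplied++
		apply(n.log[n.lastApplied].Command)
	}
}
```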

The following diagram illustrates this log-based replicated state machine visually.

Request completion process

From the moment the system (the leader) receives a write request from a client until it replies to the client, the whole process goes through the following steps from the leader’s perspective:

  • The leader appends the log entry to its own log
  • The leader replicates the log entry to the followers in parallel via RPC
  • The leader waits until a majority of nodes reply that replication succeeded
  • The leader applies the entry to its state machine
  • The leader responds to the client
  • The leader notifies the followers to apply the entry

As you can see, the log commit process is similar to two-phase commit (2PC), but unlike 2PC the leader only needs replies from a majority of nodes, so the system remains available as long as more than half of the nodes are working.
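
From the leader’s point of view, the write path sketched below follows the six steps above. It continues the hypothetical node type; peers, replicateTo (an AppendEntries stub), and applyFn (the state-machine apply function) are assumed members, and real replication is asynchronous and retried, which is omitted here:

```go
// propose handles one client write request on the leader. It returns true
// once the entry is committed and applied.
func (n *node) propose(cmd string) bool {
	if n.state != Leader {
		return false // only the leader accepts writes
	}
	// 1. Append the entry to the leader's own log.
	n.log = append(n.log, LogEntry{Term: n.currentTerm, Command: cmd})
	index := len(n.log) - 1

	// 2. Replicate to the followers and 3. wait for a majority of acks
	//    (done sequentially here for brevity).
	acks := 1 // the leader itself
	for _, p := range n.peers {
		if n.replicateTo(p, index) { // hypothetical AppendEntries helper
			acks++
		}
	}
	if acks <= (len(n.peers)+1)/2 {
		return false // no majority yet: the entry stays uncommitted for now
	}

	// 4. Apply the entry to the leader's state machine, then 5. answer the client.
	n.commitIndex = index
	n.applyCommitted(n.applyFn)
	// 6. Followers learn the new commitIndex with the next AppendEntries/heartbeat.
	return true
}
```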

So what does the log on each node look like?

Each log entry stores a command for the state machine, together with the term number at the time the leader received the entry and an index indicating its position in the log.

Raft’s logging mechanism provides two guarantees, collectively known as the Log Matching Property (a sketch of the corresponding check follows the list):

  • If two entries in the logs on different machines have the same index and term number, they store the same command.
  • If two entries in the logs on different machines have the same index and term number, then all preceding entries in the two logs are identical.
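
On the follower side, the check behind this property boils down to a single comparison; a sketch on the hypothetical node type:

```go
// matchesAt reports whether this node's log contains an entry at prevIndex
// whose term is prevTerm. By the Log Matching Property, if it does, the two
// logs are identical up to and including prevIndex.
func (n *node) matchesAt(prevIndex, prevTerm int) bool {
	if prevIndex < 0 {
		return true // the empty prefix always matches
	}
	if prevIndex >= len(n.log) {
		return false // we do not even have an entry at that position
	}
	return n.log[prevIndex].Term == prevTerm
}
```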

This consistency check is maintained in an induction-like way: initially no node has any log, so the Log Matching Property holds trivially, and the check is performed every time a follower is asked to append entries, which preserves the property. Therefore, as long as AppendEntries returns success, the leader knows that the follower’s log is identical to its own.

In normal operation the leader and follower logs stay identical, so the AppendEntries consistency check never fails. However, if the leader crashes, the logs are likely to become inconsistent, and the inconsistency can get complicated if a series of leaders and followers crash. The following figure shows all the kinds of log inconsistency:

As shown above, a follower may be missing entries (a, b), may have extra entries (c, d), or may both be missing entries and have extra entries spanning multiple terms (e, f). In Raft, the leader forces the followers’ logs to become exactly the same as its own, which means entries in a follower’s log may be overwritten by entries newly pushed from the leader.

To force a follower to agree with the leader, the leader must first find the point where its log and the follower’s log diverge, then ask the follower to delete its entries after that point, and then replicate its own entries after that point to the follower. This is implemented through the consistency check in AppendEntries RPCs. The leader keeps a nextIndex for each follower, the index of the next log entry it will send to that follower. When a new leader starts serving, it initializes nextIndex for every follower to the index of its own latest log entry + 1 (11 in the figure). If a follower’s log does not match the leader’s, the AppendEntries RPC fails, and the leader decrements nextIndex and retries until the point of divergence is found. The rest follows naturally: the conflicting entries are removed and the leader’s entries are synchronized to the follower.
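
Here is a leader-side sketch of this nextIndex backoff, with nextIndex and sendAppendEntries as assumed members of the hypothetical node (production implementations persist state and back off more aggressively, which is ignored here):

```go
// syncFollower retries AppendEntries against one follower, stepping nextIndex
// back by one on every failed consistency check until the divergence point is
// found; the successful call then overwrites the follower's conflicting suffix.
func (n *node) syncFollower(peer int) {
	next := n.nextIndex[peer] // initialized to the leader's last log index + 1
	for next >= 0 {
		prevIndex := next - 1
		prevTerm := -1
		if prevIndex >= 0 {
			prevTerm = n.log[prevIndex].Term
		}
		// Hypothetical RPC stub; the follower runs the matchesAt check above.
		if n.sendAppendEntries(peer, prevIndex, prevTerm, n.log[next:]) {
			n.nextIndex[peer] = len(n.log) // logs now agree up to the leader's end
			return
		}
		next-- // consistency check failed: step back and try again
		n.nextIndex[peer] = next
	}
}
```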

Safety

We’ve seen how Raft elects a leader and replicates logs, but this alone is not enough to guarantee that different nodes execute exactly the same sequence of commands; some additional safety mechanisms are needed. For example, a follower may be unavailable while the current leader commits some log entries, and later be elected as the new leader; it could then overwrite the previously committed entries with new ones. As a result, different replicated state machines might execute different command sequences and end up in inconsistent states. Raft therefore adds a restriction to the election mechanism to ensure that the new leader always contains all previously committed entries.

  • The election restriction

A candidate must contact a majority of nodes in order to be elected; call that set A. An entry can only be committed once it is stored on a majority of nodes, which means that for every committed entry there is at least one node in A that holds it. A candidate only has a chance of being elected leader if its log is no older than that of any node in A, so the candidate must already hold all committed entries before it can become leader.

  • Committing entries from earlier terms

After a new leader is elected, it does not directly commit log entries left over from the previous leader; instead, those entries are committed as a side effect of committing an entry from the leader’s current term. The details follow from the Log Matching Property described above. If the newly elected leader has not yet received any request from a client, the paper mentions that it can immediately replicate and commit an empty (no-op) entry at the start of its term.

Therefore, in the figure above, the situation at time (c) does not occur: leader S1 of term 4 does not replicate and commit the term 2 entry on its own. Rather, as described in (e), the term 2 entry is committed as a side effect of replicating and committing a term 4 entry. Once the term 4 entry is committed successfully, the term 2 entry is committed as well, and in that case S5 cannot be elected even if S1 crashes.
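
A sketch of the corresponding commit rule on the leader, assuming the hypothetical node also tracks matchIndex, the highest log index known to be replicated on each follower:

```go
// advanceCommitIndex moves commitIndex forward only when an entry of the
// leader's current term is stored on a majority; entries from earlier terms
// are then committed indirectly, as described for the figure above.
func (n *node) advanceCommitIndex() {
	for idx := len(n.log) - 1; idx > n.commitIndex; idx-- {
		if n.log[idx].Term != n.currentTerm {
			break // terms only get older further back, so nothing eligible remains
		}
		replicas := 1 // the leader itself
		for _, p := range n.peers {
			if n.matchIndex[p] >= idx {
				replicas++
			}
		}
		if replicas > (len(n.peers)+1)/2 {
			n.commitIndex = idx // everything up to and including idx is committed
			return
		}
	}
}
```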

  • Dealing with a stale leader

In Raft it is possible for more than one server to believe it is the leader at the same time: a leader suddenly loses its connection to the other servers in the cluster, a new leader is elected, and then the old leader reconnects, leaving the cluster with two leaders. The old leader may well continue serving clients and try to replicate entries to other servers in the cluster, but Raft’s term mechanism defeats any attempt by the old leader to cause inconsistency. Every RPC requires the servers to exchange their current term numbers. Once the new leader is elected, a majority of the cluster holds the newer term number, and the old leader’s RPCs must reach at least one node of that majority; on seeing the higher term, the old leader changes to the follower state.

However, it may happen that the old leader commits an entry and then loses its connection. In that case the new leader must hold that committed entry, but it may not have committed it yet; it will commit the entry again before it starts serving clients. A follower that has already committed the entry simply returns success, and a follower that has not yet committed it commits it and then returns success.

Follower or candidate crashes

Handling follower and candidate crashes is much simpler than handling leader crashes, and both are handled in the same way. If a follower or candidate crashes, RequestVote and AppendEntries RPCs sent to it will fail; Raft’s strategy is simply to keep retrying until the crashed machine recovers. In addition, if a server crashes after completing an RPC but before replying, it will receive the same RPC again after it recovers and can just process it again. All RPCs in Raft are idempotent operations, so this is harmless: if a follower already has an entry that the leader asks it to replicate, it can simply ignore it.

Cluster expansion

The biggest challenge of expansion is that it is not atomic, which makes consistency hard to preserve: part of the cluster may be using the old configuration while another part is already using the new one. Suppose we add two nodes; the configuration change cannot take effect on all nodes at the same time. If an election happens to be triggered at the point in time marked by the arrow, Server1 and Server2 still hold the old 3-node configuration, so Server1 can become leader with its own vote plus Server2’s. Server5, using the new configuration, can also become leader after winning the votes of Servers 3, 4, and 5, which is more than half. The cluster then erroneously has two leaders.

Raft’s solution

The Raft protocol addresses this problem by adding a transition phase between the old and the new configuration, called the joint consensus phase. During the joint consensus phase, the cluster imposes the following constraints (a sketch of the dual-majority check follows the list):

  • Log entries are replicated to all nodes of both the old and the new configuration
  • A node from either the new or the old configuration can become the leader
  • Agreement (for elections and for commits) requires a separate majority in both the old and the new configuration
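
As mentioned in the list above, agreement during joint consensus needs a separate majority in each configuration. A small self-contained sketch of that dual-majority check (the Config type and quorum helper are illustrative names, not from the paper):

```go
// Config lists the voting members of a cluster configuration.
type Config struct {
	Old []int // node IDs in the old configuration
	New []int // node IDs in the new configuration (empty outside joint consensus)
}

// quorum reports whether the agreeing nodes form the required majority: a
// majority of Old and, during joint consensus, also a majority of New.
func (c Config) quorum(agreed map[int]bool) bool {
	return majority(c.Old, agreed) && (len(c.New) == 0 || majority(c.New, agreed))
}

func majority(members []int, agreed map[int]bool) bool {
	count := 0
	for _, id := range members {
		if agreed[id] {
			count++
		}
	}
	return count > len(members)/2
}
```

With a helper like this, both vote counting and commit counting call the same quorum check instead of a single majority test.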

If the above conditions are met, the cluster can switch configurations while still providing service. Let’s look at Raft’s solution in a concrete scenario.

Adding nodes

Suppose you currently have a 3-node cluster and want to add two more nodes. Raft delivers the configuration change command as a log entry, using the normal log replication mechanism.

Sending the joint consensus log

In the first phase, the leader sends a new log entry to the nodes in the cluster whose command is the configuration change. This entry is replicated to the follower nodes like a normal entry, but it does not change the state machine’s data because it is not a data operation. The current status is shown below:

The C(O + N) entry in the figure above is the configuration change log entry; it tells the other nodes that they need to enter the joint consensus phase, in which the old and new configurations coexist. Unlike normal log entries, a node enters the joint consensus phase as soon as it receives the C(O + N) entry, without waiting for the leader to commit it. In other words, in the figure above S1, S4, and S5 have entered the joint consensus phase, while S2 and S3 are still in the old configuration because they have not yet received the entry.

As mentioned earlier, once the cluster enters the joint consensus phase, Raft requires that any decision be agreed to by a majority of both the old and the new configuration. For S1 in the figure above, which is already in the joint consensus phase, committing the configuration change entry requires more than half of the old cluster (S1, S2, S3) and more than half of the new 5-node cluster. The entry in the figure has been replicated to 3 nodes, which satisfies the majority requirement in the new cluster but not in the old 3-node cluster, so the entry cannot be committed at this point.

Committing the joint consensus log

As shown in the figure above, in the second phase the leader commits the C(O + N) entry. Now more than half of the nodes are running in the joint consensus phase, so even if the leader crashes, the newly elected leader will still be running under joint consensus. At this point the leader replicates a log entry announcing that the new cluster takes effect (C(new) in the figure) to tell the followers that they can now switch to the new cluster mode. Like C(O + N), followers do not need to wait for the leader to commit C(new); they switch to the new configuration as soon as they receive the entry. Note that while C(O + N) is being committed and C(new) is being replicated, the cluster is still providing service normally; that is, before the C(new) entry is sent, commands sent from clients to the leader can still be committed, provided they are replicated to more than half of the nodes in both the old and the new cluster.

Committing the new cluster log

In the third phase, after the C(new) entry has been replicated, the leader is already running under the new configuration, so C(new) can be committed as soon as it has been replicated to at least three nodes. The expansion of the whole cluster is then complete.

Expansion safety

  1. In phase 1 above, if the Leader crashes, will there be multiple leaders?

    Raft specifies that a node does not need to wait for the configuration change entry to be committed; every node runs with the new configuration as soon as it receives it. In phase 1, if S1 crashes, S2 and S3 are still in the old configuration phase; suppose S2 becomes a candidate, then it is elected leader after receiving S3’s vote, a majority of the old cluster. S4 and S5 are in the joint consensus phase, so if S4 becomes a candidate, it needs a majority in both the old and the new cluster. If S1 crashes and quickly restarts and rejoins the cluster, S4 can receive more than half of the votes in the new cluster from S1 and S5. In the old cluster, however, it receives S1’s vote but not S2’s or S3’s, because in the same election round S2 has already voted for itself and S3 has voted for S2. Therefore S4 cannot win the election in the old cluster, and there will not be two leaders.

  2. In phase 1, if the Leader crashes, does the newly elected Leader continue to commit C(O + N)?

    Not necessarily. If S2 is elected leader, it does not have the C(O + N) entry, so according to the log replication rules described earlier, the C(O + N) entries on S4 and S5 will be deleted and the change will not be committed; the client must resend the configuration change command. If S4 is elected leader, the C(O + N) entry continues to be replicated and the configuration change proceeds. Both outcomes comply with Raft.

  3. If the leader crashes in phase 2, will the already committed C(O + N) be affected?

    No. Raft promises that committed entries are durable. If S1 crashes in phase 2, then among the old cluster nodes only S2 can be elected as the new leader, because S2 has the most up-to-date log. Among the nodes newly added to the cluster, whether S4 or S5 becomes the new leader, C(O + N) will continue to be replicated and committed.

Availability

Raft addresses two further issues to improve availability during cluster configuration changes:

  1. Can new log entries fail to commit while a new node is still synchronizing data, making the cluster unavailable for a short time?

    Raft allows a new node to join in a non-voting capacity during data synchronization (similar to Observer nodes in ZooKeeper). For example, if there are 5 nodes and 2 of them are still in the data synchronization phase, the leader can run votes within the 3-node group. Once data synchronization is complete, the node is treated as a normal node.

  2. A server removed from the cluster no longer receives heartbeats once the leader enters the C(new) phase. Could it trigger elections that disrupt the current leader?

    To avoid this problem, Raft specifies that a server ignores RequestVote RPCs while it believes a current leader exists: if a RequestVote RPC arrives within the minimum election timeout of the last heartbeat or log entry received from the leader, the request is rejected (see the sketch below).
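
The check itself is tiny; a sketch, assuming the hypothetical node records in lastLeaderContact when it last heard from a leader:

```go
// shouldIgnoreVote implements the rule above: while this node has heard from a
// current leader within the minimum election timeout, an incoming RequestVote
// is rejected, so a removed or partitioned server cannot disrupt the leader.
func (n *node) shouldIgnoreVote(now time.Time, minElectionTimeout time.Duration) bool {
	return now.Sub(n.lastLeaderContact) < minElectionTimeout
}
```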