I studied the Raft protocol a long time ago, but I haven’t used it in a project recently, so I had gotten a little rusty. This time I spent two days reviewing Raft and took these notes.

The consensus problem

In distributed systems, the consensus problem is this: given a set of servers and a set of operations, we need a protocol that makes the servers’ final results agree.

The CAP theorem tells us that in a distributed system, if we are not willing to sacrifice consistency, we can only give up availability. Data consistency models therefore come in several flavors, such as strong consistency and weak consistency. In this chapter we mainly discuss Raft, a strong-consistency algorithm for distributed systems.

The general principle behind implementing strong consistency: when a server receives a set of instructions from a client, it must communicate with the other servers to ensure that every server receives the same instructions in the same order. That way all servers produce consistent results, and the cluster behaves like a single machine.

Description of Raft algorithm

Before Raft was proposed, Paxos was the first consensus algorithm proven correct, but it is notoriously difficult: both applying the Paxos paper in engineering practice and teaching Paxos give people headaches. Raft was therefore designed for understandability from the start, using problem decomposition and state-space reduction, and it is now used very widely.

In Raft, the problem is broken down into: leader election, log replication, safety, and membership changes.

Basic concepts

Replicated State Machine

[Figure 1.png: replicated state machine architecture]

  • Replicated state machines are implemented by replicating logs:

    • Log: each machine keeps a log, a sequence of commands taken from client requests
    • State machine: the state machine executes these commands in order
    • Consistency model: in a distributed environment, keeping the logs on multiple machines identical means that replaying them drives every state machine into the same state
  • The consensus algorithm maintains this consistency model and generally has the following properties:

    • Safety: results are correct under all non-Byzantine conditions (network delay, network partitions, packet loss, duplication, out-of-order delivery, etc.)
    • Availability: the system is available as long as more than half of the machines work normally
    • Timing independence: log consistency does not depend on clocks; incorrect clocks and extreme message latency cause, at worst, availability problems

Note: in a real implementation, it is recommended that every state-machine command be idempotent, which makes consistency easier to guarantee.
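To make the log-replay idea concrete, here is a minimal sketch in Go of a key-value state machine driven by a replicated log. The names (LogEntry, KVStateMachine) are illustrative, not from any particular Raft implementation, and Apply is deliberately idempotent as the note above recommends:

```go
package main

import "fmt"

// LogEntry is one client command together with the term it arrived in.
type LogEntry struct {
	Term  int
	Key   string
	Value string
}

// KVStateMachine applies committed log entries in order. Apply is
// idempotent: replaying the same entry twice leaves the same state.
type KVStateMachine struct {
	store map[string]string
}

func (sm *KVStateMachine) Apply(e LogEntry) {
	sm.store[e.Key] = e.Value
}

func main() {
	sm := &KVStateMachine{store: make(map[string]string)}
	log := []LogEntry{
		{Term: 1, Key: "x", Value: "1"},
		{Term: 1, Key: "y", Value: "2"},
	}
	// Replaying an identical log on every server yields identical state.
	for _, e := range log {
		sm.Apply(e)
	}
	fmt.Println(sm.store) // map[x:1 y:2]
}
```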

Server states

Each server is always in exactly one of three states:

  1. Leader
  2. Candidate
  3. Follower

[Figure 2.png: server states and transitions]

Followers only respond to requests from other servers. If a follower receives no messages, it becomes a candidate and starts an election. The candidate that receives votes from a majority of the servers becomes the new leader. A leader remains leader until it fails.
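In code, the role is naturally an enumerated field on each server; a minimal sketch in Go:

```go
package raft

// ServerState is the role a Raft server currently plays.
type ServerState int

const (
	Follower  ServerState = iota // responds to RPCs from leaders and candidates
	Candidate                    // campaigning for votes after an election timeout
	Leader                       // handles client requests and sends heartbeats
)
```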

Terms

Raft divides time into terms of arbitrary length, numbered with consecutive integers. Each term begins with an election, in which one or more candidates try to become leader. If a candidate wins the election, it serves as leader for the rest of that term. In some cases the vote is split and no leader is elected; that term then ends, and a new term with a new election begins immediately. Raft guarantees at most one leader in any given term.

[Figure 3.png: terms along the timeline]

RPC

Communication between Raft server nodes uses remote procedure calls (RPCs), and the basic consensus algorithm requires only two types: RequestVote RPCs, issued by candidates during elections, and AppendEntries RPCs, issued by the leader to replicate log entries and to provide a heartbeat. A third RPC is added for transferring snapshots between servers. A server that does not receive a response in time retries the RPC, and servers issue RPCs in parallel for best performance. The three RPC types (message shapes are sketched below):

  1. RequestVote RPC: issued by a candidate during an election
  2. AppendEntries RPC: issued by the leader as a heartbeat; log replication is carried on the same RPC
  3. InstallSnapshot RPC: used by the leader to send snapshots to followers that have fallen too far behind
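For reference, the request payloads of the two core RPCs, with field names following Figure 2 of the Raft paper (a sketch of the message shapes, not a wire format):

```go
package raft

// LogEntry is one client command, tagged with the term it arrived in.
type LogEntry struct {
	Term    int
	Command string
}

// RequestVoteArgs is sent by candidates to gather votes.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int // candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

// AppendEntriesArgs is sent by the leader both to replicate entries
// and as a heartbeat (with an empty Entries slice).
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int        // lets followers redirect clients
	PrevLogIndex int        // index of the entry preceding the new ones
	PrevLogTerm  int        // term of that entry
	Entries      []LogEntry // entries to store (empty for heartbeat)
	LeaderCommit int        // leader's commit index
}
```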

Timing parameters:

  1. BroadcastTime: the average time it takes the leader to send RPCs in parallel to the other servers and receive their responses (the heartbeat scale)
  2. ElectionTimeout: the election timeout a follower waits before becoming a candidate
  3. MTBF: the average time between failures for a single server

BroadcastTime << ElectionTimeout << MTBF

  1. BroadcastTime should be an order of magnitude smaller than ElectionTimeout, so that the leader can keep sending heartbeats and prevent followers from starting elections.
  2. ElectionTimeout should be several orders of magnitude smaller than MTBF so that the system runs stably.

Typically, BroadcastTime ranges from 0.5 ms to 20 ms and ElectionTimeout from 10 ms to 500 ms, while most servers have an MTBF of several months or longer.
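A common way to realize these constraints in code, sketched in Go with illustrative values taken from the ranges above:

```go
package raft

import (
	"math/rand"
	"time"
)

// heartbeatInterval is on the BroadcastTime scale, well below the
// election timeout.
const heartbeatInterval = 50 * time.Millisecond

// randomElectionTimeout picks a fresh timeout in [150 ms, 300 ms);
// randomizing it per election keeps split votes from repeating.
func randomElectionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}
```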

Leader election

  • Trigger conditions:

    1. In normal operation, each heartbeat a follower receives from the leader resets its election timer, so no election is triggered.
    2. When the leader fails, followers’ election timeouts expire, they become candidates, and a leader election is triggered.
  • Candidate procedure:

The follower increments its current term, converts to candidate, votes for itself, and issues RequestVote RPCs, then waits for one of three outcomes:

  1. It receives votes from more than half of the servers, wins the election, and becomes leader;
  2. Another server wins the election and its heartbeat arrives, so the candidate reverts to follower;
  3. The election times out with no winner, so the candidate increments the current term and launches a new election.
  • Things to note:

    1. A server may vote for at most one candidate per term, on a first-come, first-served basis.
    2. While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If that leader’s term (carried in the RPC) is at least as large as the candidate’s current term, the candidate accepts the leader as legitimate and reverts to follower; if the term in the RPC is smaller than the candidate’s current term, the candidate rejects the RPC and remains a candidate.
    3. A candidate may neither win nor lose: if many followers become candidates at the same time, the vote is split and no candidate obtains a majority. When this happens, each candidate times out and starts a new election, incrementing its term and issuing another round of RequestVote RPCs. Without an additional mechanism, split votes could repeat indefinitely, so Raft uses randomized election timeouts (for example, 150-300 ms) to make this rare.
  • Why is receiving a RequestVote RPC from another candidate not discussed? Possible explanations:

    1. The candidate has already voted for itself, and a server casts only one vote per term, so it cannot vote for another candidate;
    2. The algorithm may also simply have candidates reject such requests from other servers.
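A sketch of the voter’s side of RequestVote, showing the term checks and the one-vote-per-term rule. The Raft struct is simplified for illustration, and RequestVoteArgs is trimmed to the fields used here:

```go
package raft

// Simplified persistent state; a real server also holds its log,
// commit indexes, and locks.
type Raft struct {
	currentTerm int
	votedFor    int // -1 while no vote has been cast in currentTerm
}

type RequestVoteArgs struct {
	Term        int
	CandidateID int
}

type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

// HandleRequestVote applies the voting rules: reject stale terms,
// adopt newer terms, and grant at most one vote per term.
func (rf *Raft) HandleRequestVote(args RequestVoteArgs, reply *RequestVoteReply) {
	if args.Term < rf.currentTerm {
		reply.Term, reply.VoteGranted = rf.currentTerm, false // stale candidate
		return
	}
	if args.Term > rf.currentTerm {
		rf.currentTerm, rf.votedFor = args.Term, -1 // newer term: vote is free again
	}
	// First-come, first-served, at most one vote per term. A real
	// implementation also checks that the candidate's log is at least
	// as up-to-date as ours (see the safety section below).
	if rf.votedFor == -1 || rf.votedFor == args.CandidateID {
		rf.votedFor = args.CandidateID
		reply.VoteGranted = true
	}
	reply.Term = rf.currentTerm
}
```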

Log replication

[Figure 4.png: log replication]

The process of handling a client command:

  1. The leader accepts the client request;
  2. The leader appends the command to its log;
  3. The leader sends AppendEntries RPCs to its followers;
  4. After a majority of followers acknowledge the entry, the leader commits it, applies it to its state machine, and returns the result to the client.

Propagating the commit:

  1. In the next heartbeat round, the leader’s AppendEntries RPCs carry the updated commit index, so the entry becomes committed on the followers;
  2. Once a follower learns that an entry is committed, it applies the entry to its own state machine.
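A heavily simplified leader-side sketch of steps 1-4 in Go. Here sendAppendEntries and apply are stubs standing in for real RPC and state-machine plumbing, and a real leader replicates asynchronously and retries rather than making one synchronous pass:

```go
package raft

type LogEntry struct {
	Term    int
	Command string
}

type Raft struct {
	currentTerm int
	log         []LogEntry
	commitIndex int
	peers       []int // IDs of the other servers
}

func (rf *Raft) sendAppendEntries(peer, index int) bool { return true }      // stub
func (rf *Raft) apply(e LogEntry) string                { return "applied" } // stub

// ReplicateAndCommit walks the four steps above for one client command.
func (rf *Raft) ReplicateAndCommit(cmd string) (string, bool) {
	rf.log = append(rf.log, LogEntry{Term: rf.currentTerm, Command: cmd}) // append
	index := len(rf.log) - 1

	acks := 1 // the leader counts itself
	for _, p := range rf.peers {
		if rf.sendAppendEntries(p, index) { // in parallel in a real system
			acks++
		}
	}
	if 2*acks > len(rf.peers)+1 { // acknowledged by a majority of the cluster
		rf.commitIndex = index
		return rf.apply(rf.log[index]), true // apply, then answer the client
	}
	return "", false // not committed yet; the leader keeps retrying
}
```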

Safety

The mechanisms described so far are not sufficient to guarantee that every state machine executes the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries; that follower could later be elected leader and overwrite those entries, and different state machines could then execute different command sequences. Raft rules out such cases with the following rules.

1. The leader only ever appends to its log (append-only)

A leader never overwrites or deletes existing log entries, and entries flow in only one direction: from the leader to the followers.

2. Election restriction: voting prevents a server that is missing committed log entries from winning an election

If the voter’s log is more up-to-date than the candidate’s, it rejects the vote request. This means that to win an election, the candidate’s log must be at least as up-to-date as the logs of a majority of the servers, and it must therefore contain all committed log entries.
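The paper defines "more up-to-date" precisely: the log whose last entry has the higher term wins, and if the last terms are equal, the longer log wins. As a sketch:

```go
package raft

// candidateLogUpToDate reports whether a candidate's log is at least as
// up-to-date as the voter's; the voter grants its vote only if this holds.
func candidateLogUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm // higher last term wins
	}
	return candLastIndex >= myLastIndex // equal terms: longer log wins
}
```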

3. Never commit log entries from previous terms by counting replicas (only entries from the current term are committed this way)

In Raft, once a log entry has been safely replicated on a majority of machines, that is, once the AppendEntries RPC has returned successfully on a majority of servers, the entry is committed and the leader advances its commit index.

[Figure 5.png: why entries from earlier terms cannot be committed by counting replicas]

In step C, if committing entries from earlier terms were allowed, the term-2 entry would be replicated to the other servers and end up on a majority of machines as a “committed” entry. That leads to the situation in D, where the term-3 entry on S5 overwrites the already committed entry.

Raft therefore never commits an entry from a previous term by counting replicas. Only log entries from the leader’s current term are committed by counting; once an entry from the current term is committed this way, all earlier entries are committed indirectly by the Log Matching Property.

Because Raft does not commit entries from previous terms this way, the transition from B to C never happens; if S5 goes down, there is only the direct transition from B to E, which starts a new term, and S5 never gets a chance to be elected leader.
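The corresponding guard in the leader’s commit logic can be sketched as follows (the simplified fields mirror the earlier sketches; matchIndex[i] is the highest log index known to be replicated on peer i):

```go
package raft

type LogEntry struct{ Term int }

type Raft struct {
	currentTerm int
	commitIndex int
	log         []LogEntry
}

// advanceCommitIndex moves commitIndex forward only over entries that
// are replicated on a majority of servers AND belong to the current term.
func (rf *Raft) advanceCommitIndex(matchIndex []int) {
	for n := rf.commitIndex + 1; n < len(rf.log); n++ {
		count := 1 // the leader itself has every entry
		for _, m := range matchIndex {
			if m >= n {
				count++
			}
		}
		// Counting replicas may only commit entries from the leader's
		// own term; earlier entries then commit indirectly via the
		// Log Matching Property.
		if 2*count > len(matchIndex)+1 && rf.log[n].Term == rf.currentTerm {
			rf.commitIndex = n
		}
	}
}
```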

4. Crashes of candidates and followers

Crashes of candidates and followers are much easier to handle. If a server in one of these roles crashes, subsequent RequestVote and AppendEntries RPCs sent to it simply fail, and Raft handles such failures by retrying indefinitely; once the server becomes available again, the RPCs succeed. If a server completes an RPC but crashes before responding to the leader, it will receive the same RPC again after it becomes available. The receiving server is responsible for checking for this: for example, if it receives an AppendEntries request whose entries are already in its log, it can simply ignore the request, so the retry is harmless to the system.

Cluster membership changes

Changing the cluster membership differs from restarting servers in that it changes the number of members, and therefore affects leader election and commit decisions: the notion of a majority of the cluster’s members is central in distributed systems.

The simplest approach is for operations staff to take the whole system offline, modify the configuration, and bring the system back online. But this approach has two drawbacks:

  1. The cluster is unavailable during the change
  2. There is a risk of human error

Switching directly from the old configuration to the new one is not safe

As shown below:

[Figure 6.png: switching directly between two configurations is unsafe]

This is because each machine switches configurations at its own moment in time. In this example, the cluster grows from 3 machines to 5. Unfortunately, there is a point in time at which two different leaders can be elected in the same term: one by a majority of the old configuration, the other by a majority of the new configuration.

A two-phase approach ensures safety:

To guarantee safety, configuration changes must use a two-phase approach. In Raft, the cluster first switches to a transitional configuration called joint consensus; once the joint consensus configuration has been committed, the system switches over to the new configuration. The joint consensus configuration is the combination of the old configuration and the new configuration.

Joint consensus allows individual servers to transition between configurations at different times without compromising safety. It also allows the cluster to keep serving client requests while the configuration change is in progress.

When the leader receives a request to change the configuration from C-old to C-new, it stores the joint consensus configuration (C-old,new in the figure) as a log entry and replicates it using the log replication mechanism described earlier. Once a server adds the new configuration entry to its log, it uses that configuration for all future decisions. The Leader Completeness Property ensures that only servers whose logs contain the C-old,new entry can be elected leader. Once C-old,new has been committed, the leader commits C-new using the same procedure, as shown in the figure below. At no point can C-old and C-new both make unilateral decisions, which guarantees safety.

[Figure 7.png: timeline of a configuration change via joint consensus]

A timeline for a configuration change. Dashed lines represent configuration entries that have been created but not yet committed; solid lines represent the most recently committed configuration entry. The leader first creates the C-old,new configuration entry in its own log and commits it under C-old,new (which requires a majority of C-old and a majority of C-new). It then creates the C-new entry and commits it under a majority of C-new. There is no point in time at which C-old and C-new can both make decisions independently.
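The effect of joint consensus on the majority rule can be sketched with a small helper; the types and names here are illustrative only:

```go
package raft

// Config lists voting members. While the joint configuration C-old,new
// is active both slices are non-empty; afterwards only New remains.
type Config struct {
	Old []int // C-old member IDs
	New []int // C-new member IDs
}

// majority reports whether votes contains more than half of members.
func majority(members, votes []int) bool {
	count := 0
	for _, m := range members {
		for _, v := range votes {
			if m == v {
				count++
				break
			}
		}
	}
	return 2*count > len(members)
}

// Decided reports whether a set of votes carries a decision (an election
// or a commit) under this configuration: joint consensus needs a
// majority of EACH configuration.
func (c Config) Decided(votes []int) bool {
	if len(c.Old) > 0 && !majority(c.Old, votes) {
		return false
	}
	return majority(c.New, votes)
}
```

Because a decision during the transition needs separate majorities of C-old and of C-new, neither configuration can act unilaterally, which is exactly the property the figure illustrates.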

Log compaction

Almost all distributed systems (such as Chubby and ZooKeeper) use snapshots to compact their logs. Once a snapshot has been written to stable, persistent storage, the log entries it covers, along with any older snapshots, can be discarded.

This is how Raft does it:

[Figure 8.png: snapshot-based log compaction]

Unlike Raft’s other operations, which are leader-based, snapshots are generated independently by each node. Besides compacting the log, snapshots are also used to bring state up to date on slow followers and newly added servers; Raft uses the InstallSnapshot RPC for this, which I won’t cover in detail.
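For completeness, here is the metadata a snapshot must carry so that the truncated log still works with AppendEntries consistency checks, sketched after the paper’s description:

```go
package raft

// Snapshot replaces every log entry up to and including its last index.
type Snapshot struct {
	// Index and term of the last entry the snapshot covers; retained so
	// AppendEntries consistency checks still work after truncation.
	LastIncludedIndex int
	LastIncludedTerm  int
	State             []byte // serialized state machine, e.g. a KV map
}
```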

Client interaction

  1. Clients send requests only to the leader;
  2. If a client sends a request to a follower, the follower rejects it and redirects the client to the leader;
  3. If a client request fails, the client times out and resends it.

Raft requires client requests to be linearizable, which means a request must not be executed more than once. There are two approaches:

  1. Give every request a unique identifier so that retries can be recognized and deduplicated;
  2. Keep all requests idempotent.
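One way to realize option 1, sketched in Go: the replicated state machine remembers the latest request number and result for each client, so a retried request gets the cached answer instead of executing twice (all names here are hypothetical):

```go
package raft

type clientRequest struct {
	ClientID int
	Seq      int // increases monotonically per client
	Command  string
}

type session struct {
	lastSeq    int
	lastResult string
}

// sessions is the dedup table; in a real system it lives inside the
// replicated state machine so every server deduplicates identically.
var sessions = map[int]*session{}

// execute runs a command at most once per (ClientID, Seq) pair.
func execute(req clientRequest, run func(string) string) string {
	if s, ok := sessions[req.ClientID]; ok && req.Seq <= s.lastSeq {
		return s.lastResult // duplicate: answer without re-executing
	}
	result := run(req.Command)
	sessions[req.ClientID] = &session{lastSeq: req.Seq, lastResult: result}
	return result
}
```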