Preface

While reading the source code of Alibaba Nacos, I noticed that it uses Raft, the distributed consensus algorithm. So I looked up some material and wrote it down, hoping it can serve as a reference when you learn Raft. One note up front: the Raft algorithm is quite extensive, and this article takes a very simplified approach (explained below). It will not cover the implementation of Raft's code.

Let’s start with the Byzantine generals

Byzantine Failures is a fundamental problem in point-to-point communication, proposed by Leslie Lamport. The problem originates as follows:

Byzantium was the capital of the Eastern Roman Empire, on the site of present-day Istanbul, Turkey. Because the empire's territory was vast, its armies were stationed far apart for defense, and generals had to rely on messengers to carry messages between them. In wartime, all the generals and adjutants of the Byzantine army had to agree on whether they had a chance to win before attacking the enemy's camp. However, there could be traitors and enemy spies among them, whose decisions would disrupt the whole army, so the result reached at the moment of "consensus" did not actually represent the majority opinion. This is the Byzantine generals problem: knowing that some members are rebelling, how can the remaining loyal generals reach agreement without being swayed by the traitors?

The Byzantine generals problem is the most complex and rigorous fault-tolerance model in the distributed domain. The distributed systems we use in everyday work are far less hostile: what we mostly face are machines crashing or network failures that prevent messages from getting through, not machines deliberately sending malicious messages to one another. This greatly simplifies the fault-tolerance requirements; the most important goal is to achieve consistency.

Consensus algorithms

In 1990, Leslie Lamport proposed Paxos, a consensus algorithm based on message passing. Paxos solves the problem of how a distributed system can reach agreement on a particular value (a resolution).

Consensus algorithms allow a collection of machines to work as a coherent cluster that keeps working even when some of its members fail. Because of this, consensus algorithms play a key role in building reliable large-scale software systems. For a long time, Paxos dominated the discussion of consensus algorithms: most consensus implementations were based on or influenced by Paxos, and Paxos had become the primary vehicle for teaching students about consensus.

Unfortunately, Paxos is simply too hard to understand, despite many attempts to make it more approachable. Moreover, its architecture requires complex changes to support practical systems. As a result, both system developers and students struggle with Paxos.

Raft bursts onto the scene

Because Paxos is so obscure, researchers at Stanford published a new distributed consensus protocol called Raft in 2014. Raft is about as efficient as Paxos, but much easier to understand and to build real systems on.

To make the protocol easier to understand, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Let's get into the world of Raft!

Extremely simplified thinking

As shown in the picture below, let's first consider a puzzle: A and B take turns placing black and white Go (weiqi) stones flat on a round table, one stone per turn, and the stones must not overlap. Whoever first has no place left to put a stone loses.

How should you play to win?

The diagram above answers the question: the first mover wins. It can be reasoned about in three different ways.

  1. Suppose the table is only as big as a single Go stone.

  2. If the table is infinitely large, the first mover occupies the center of the circle. Because the circle is symmetric about its center, as long as the opponent can still find a place to play, the first mover can always find the mirror-image place on the opposite side.

  3. Inside a circle you can draw a number of smaller circles of equal diameter that are tangent to one another.

The three ways of thinking become progressively harder to follow. The first is extreme simplification: easy to grasp but mathematically imprecise. The second is limit thinking; combined with the first it amounts to mathematical induction, which is mathematically rigorous. The third is figurative, geometric thinking, which is hard to follow for readers without basic geometry.

Analyzing Raft with extremely simplified thinking

A Raft cluster consists of several nodes, and each node plays one of three important roles: Leader, Candidate, or Follower. Each role has different responsibilities, and in normal operation only two roles are present in the cluster: Leader and Follower.

1. Leader: handles all client interactions and log replication; there is normally only one Leader at a time;

2. Follower: responds to the Leader's log-replication requests and to Candidates' vote requests, and forwards (redirects) client requests that arrive at a Follower to the Leader;

3. Candidate: when the cluster has just started or the Leader has gone down, a Follower becomes a Candidate and initiates an election. After winning the election (obtaining the votes of more than half of the nodes), it changes from Candidate to Leader (the roles are sketched in code below).
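As a minimal sketch of these roles (the names and fields below are illustrative assumptions, not taken from Nacos or any particular Raft implementation), a node's state might look like this in Go:

```go
// A minimal sketch of the three Raft roles and the basic per-node state.
package main

import "fmt"

type Role int

const (
	Follower  Role = iota // default role: responds to the Leader and to Candidates
	Candidate             // a Follower whose election timer fired and who is asking for votes
	Leader                // handles client requests and replicates the log
)

type Node struct {
	id          int
	role        Role
	currentTerm int // the latest term this node has seen
	votedFor    int // the candidate this node voted for in currentTerm; -1 if none
}

func main() {
	n := Node{id: 1, role: Follower, currentTerm: 0, votedFor: -1}
	fmt.Println("node", n.id, "starts as a Follower in term", n.currentTerm)
}
```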

The diagram below shows the transitions among the three roles; we will analyze them in detail below.

Raft election

As in a democratic society, the leader is chosen by popular vote. In the beginning there is no leader, and all participants in the cluster are ordinary members. When an election is held, any member can take part, and those who do become Candidates.

Each member is assigned a countdown timer set to a random value between 150 ms and 300 ms, and the one whose timer expires first gets to solicit votes first. For example, when Joe's countdown ends, Joe first votes for himself and then sends RequestVote RPCs to the others in the cluster. Once a majority of the cluster (N/2 + 1) has voted for Joe, Joe becomes the leader. Joe then sends heartbeats to everyone else in the cluster to assert his authority and tell them, in effect, "You will all listen to me."
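Here is a small sketch of that mechanism, assuming the 150–300 ms window mentioned above: the timer value is randomized, and when it fires the node votes for itself and prepares a RequestVote request. The struct fields follow the Raft paper; everything else, including the stubbed-out RPC, is purely illustrative.

```go
// Sketch of the randomized election timeout: pick a value between 150 ms and
// 300 ms, and when it fires (no heartbeat arrived), become a candidate, vote
// for yourself, and prepare RequestVote requests for the other nodes.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout returns a random duration in the 150–300 ms window.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(151)) * time.Millisecond
}

// RequestVoteArgs holds the fields a candidate sends when asking for a vote.
type RequestVoteArgs struct {
	Term         int // the candidate's term
	CandidateID  int
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

func main() {
	timer := time.NewTimer(electionTimeout())
	<-timer.C // the timeout fired without a heartbeat, so start an election

	currentTerm := 1 // increment the term and vote for ourselves (node 1)
	args := RequestVoteArgs{Term: currentTerm, CandidateID: 1}
	fmt.Printf("starting election for term %d: %+v\n", currentTerm, args)
	// a real node would now send RequestVote RPCs to every peer in parallel
}
```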

Let's look at an actual example, thinking about it in the extremely simplified way: a minimal Raft cluster has at least three nodes (A, B, C). Suppose A's countdown ends first, and A sends vote requests to B and C.

In the first case, both B and C vote for A, and A becomes the Leader.

In the second case, B votes for A and C votes for itself. A still becomes the Leader, because it has obtained a majority of the votes.

In the third case, A, B, and C all vote for themselves, and no Leader is elected.

This third case is a split vote: each party votes for itself and nobody wins a majority. Each participant then waits a new random election timeout and votes again, until one party obtains a majority. In theory, if every round ended in a tie, the election could go on forever, but the randomized timeouts make that extremely unlikely.
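The majority rule itself is simple enough to show in a few lines (a toy illustration of the three cases above, not real election code):

```go
// Toy illustration of the majority rule N/2 + 1 for the three voting cases above.
package main

import "fmt"

// hasMajority reports whether the granted votes (including the candidate's own)
// reach a majority of the cluster.
func hasMajority(votesGranted, clusterSize int) bool {
	return votesGranted >= clusterSize/2+1
}

func main() {
	fmt.Println(hasMajority(3, 3)) // case 1: B and C both vote for A -> true
	fmt.Println(hasMajority(2, 3)) // case 2: only B votes for A -> still true
	fmt.Println(hasMajority(1, 3)) // case 3: everyone votes for itself -> false, split vote
}
```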

After the election, the Leader sends heartbeat messages to all Follower nodes. If a Follower does not receive a heartbeat from the Leader for some period of time, it assumes the Leader may have died and starts a new leader election.

Term

The Raft algorithm divides time into terms, which act as a logical clock, and each term begins with a Leader election. If a Leader is successfully elected, it manages the entire cluster for the rest of that term; if the election fails, the term ends with no Leader, as shown below:
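The rule that keeps this logical clock consistent is simple: whenever a node sees a message carrying a higher term than its own, it adopts that term (and a Leader or Candidate steps back to Follower). A tiny illustrative sketch:

```go
// Minimal sketch of terms as a logical clock: adopt any higher term you see.
package main

import "fmt"

func main() {
	currentTerm := 3  // this node's current view of "time"
	incomingTerm := 5 // term carried by an RPC from another node

	if incomingTerm > currentTerm {
		currentTerm = incomingTerm // adopt the newer term; a Leader or Candidate would step down here
	}
	fmt.Println("currentTerm is now", currentTerm) // prints 5
}
```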

Log replication

Once elected, the Leader starts serving client requests. Each client request contains an instruction to be executed by the replicated state machines. The Leader appends the instruction to its log as a new entry and then issues AppendEntries RPCs in parallel to the other servers to replicate the entry. When the entry has been safely replicated, the Leader applies it to its state machine (the state machine executes the instruction) and returns the result to the client. If a follower crashes or runs slowly, or the network drops packets, the Leader keeps retrying the AppendEntries RPCs (even after it has already responded to the client) until all followers eventually store all the log entries.

Replication process

  1. Each request from the client contains instructions to be executed by the replicated state machine.

  2. The Leader appends the instruction to its log as a new entry, and then issues RPCs in parallel to the other servers to replicate it.

  3. Once the entry is safely replicated, the Leader applies it to its state machine and returns the result to the client.

  4. If followers crash, run slowly, or lose packets, the Leader retries again and again until every follower has finally stored all the log entries (the flow is sketched below).
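A leader-side sketch of this flow might look like the following; the AppendEntries RPC is faked with a channel of acknowledgements, and all names are illustrative assumptions rather than any real implementation:

```go
// Leader-side sketch of log replication: send the entry to every follower in
// parallel and treat it as committed once a majority of the cluster
// (the leader included) has stored it.
package main

import "fmt"

// replicate returns true once the command is stored on a majority of a
// cluster made up of the leader plus `followers` other nodes.
func replicate(command string, followers int) bool {
	acks := make(chan bool, followers)
	for i := 0; i < followers; i++ {
		go func() {
			// stand-in for an AppendEntries RPC; a real leader keeps retrying on failure
			acks <- true
		}()
	}

	stored := 1 // the leader already has the entry in its own log
	majority := (followers+1)/2 + 1
	for i := 0; i < followers; i++ {
		if <-acks {
			stored++
		}
		if stored >= majority {
			return true // safely replicated: apply to the state machine and reply to the client
		}
	}
	return false
}

func main() {
	fmt.Println("committed:", replicate("x=1", 2)) // a 3-node cluster: leader + 2 followers
}
```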

Each log entry has the following structure:

  1. The term number at the time the entry was created (used to detect inconsistencies between node logs)

  2. The instruction the state machine needs to execute (the actual content)

  3. The index: an integer indicating the position of the entry in the log (see the struct sketch below)
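Expressed as a Go struct (field names are illustrative; the list above only fixes what each entry must contain), a log entry and a small log could look like this:

```go
// The log entry structure described above, as an illustrative Go struct.
package main

import "fmt"

type LogEntry struct {
	Term    int    // term when the entry was created, used to detect inconsistencies
	Index   int    // position of the entry in the log
	Command string // the instruction the state machine will execute
}

func main() {
	log := []LogEntry{
		{Term: 1, Index: 1, Command: "x=1"},
		{Term: 1, Index: 2, Command: "y=2"},
		{Term: 2, Index: 3, Command: "x=3"}, // a new Leader's term started here
	}
	fmt.Println(log)
}
```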

When sending an AppendEntries RPC, the Leader includes the index and term number of the entry that immediately precedes the new entries. If the follower cannot find an entry in its log with that index and term, it rejects the new entries. This consistency check works like an induction step: the initial empty logs trivially satisfy the Log Matching Property, and the check preserves it as the logs grow. As a result, whenever an AppendEntries RPC returns success, the Leader knows that the follower's log is identical to its own, from the first entry up to the latest one.

Raft guarantees the following about logs: if two logs contain an entry with the same index and the same term, then the logs are identical from the beginning up to that index.

Based on this guarantee, when a Leader's and a follower's logs conflict, the Leader checks whether the follower's most recent entry matches its own; if not, it keeps stepping back through earlier entries until it finds a match. From that match point on, all of the follower's conflicting entries are deleted and replaced with the Leader's, which brings the follower's log into agreement with the Leader's.
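A follower-side sketch of this consistency check, under the assumption that the log is kept in a slice where position i holds the entry with index i+1 (all names here are illustrative):

```go
// Follower-side sketch of the AppendEntries consistency check: verify the
// entry at prevLogIndex has prevLogTerm, reject if not, otherwise drop any
// conflicting suffix and append the leader's entries.
package main

import "fmt"

type LogEntry struct {
	Term    int
	Index   int
	Command string
}

// appendEntries returns the updated log and false if the consistency check fails.
func appendEntries(log []LogEntry, prevLogIndex, prevLogTerm int, entries []LogEntry) ([]LogEntry, bool) {
	if prevLogIndex > 0 {
		if len(log) < prevLogIndex || log[prevLogIndex-1].Term != prevLogTerm {
			return log, false // no matching entry: reject, the leader will retry with a smaller nextIndex
		}
	}
	// delete everything after the match point, then append the leader's entries
	return append(log[:prevLogIndex], entries...), true
}

func main() {
	follower := []LogEntry{
		{Term: 1, Index: 1, Command: "x=1"},
		{Term: 2, Index: 2, Command: "y=2"}, // this entry conflicts with the leader's log
	}
	updated, ok := appendEntries(follower, 1, 1, []LogEntry{{Term: 3, Index: 2, Command: "x=3"}})
	fmt.Println(ok, updated) // true [{1 1 x=1} {3 2 x=3}]
}
```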

During normal operation, the Leader's and followers' logs stay consistent, so the AppendEntries consistency check never fails. However, a Leader crash can leave the logs inconsistent (the old Leader may not have finished replicating all the entries in its log), and a series of Leader and follower crashes can make the inconsistency worse. A follower may be missing entries that the new Leader has, may have extra entries that the new Leader does not have, or both, and the missing or extra entries can span multiple terms.

The following figure shows the Leader and Follower log conflicts:

How does Raft deal with this situation?

The Leader maintains a value for each follower, called nextIndex, which is the index of the next log entry it needs to send to that follower.

When a new Leader comes to power, it initializes every nextIndex to the index of its own last log entry plus 1. If a follower's log does not match the Leader's, the consistency check in the next AppendEntries RPC will fail.

When that happens, the Leader decrements nextIndex and retries until it finds the point where the logs match.

Once a match is found, the follower deletes all of its conflicting entries from that point on, so that its log agrees with the Leader's (see the sketch below).
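Put together, the back-off loop on the leader side could be sketched like this (the follower's response is faked with a closure; everything here is illustrative):

```go
// Sketch of the nextIndex back-off: start just past the leader's last entry
// and step back each time the follower rejects the consistency check.
package main

import "fmt"

func main() {
	leaderLastIndex := 10
	nextIndex := leaderLastIndex + 1 // optimistic initial value for this follower

	// stand-in for AppendEntries: pretend the follower's log matches the leader only up to index 7
	sendAppendEntries := func(prevLogIndex int) bool { return prevLogIndex <= 7 }

	for nextIndex > 1 && !sendAppendEntries(nextIndex-1) {
		nextIndex-- // consistency check failed: step back and try again
	}
	fmt.Println("replication resumes from index", nextIndex) // prints 8
}
```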

Impact of Leader failures on consistency

During log replication, the Leader can fail at any stage. Let's look at how the Raft protocol ensures data consistency in each case.

Let’s take a look at the normal schematic.

Let’s look at some of the anomalies.

Scenario 1: Data has not yet reached the Leader node

In this phase, a Leader failure has no impact on consistency.

Scenario 2: Data reaches the Leader node but has not been replicated to any Follower node

If the Leader crashes in this phase, the data exists only on the Leader and is uncommitted. The client receives no response, treats the request as a timeout, and can safely retry. Since no Follower has the data, the client can simply resubmit it after a new Leader is elected. When the original Leader recovers, it rejoins the cluster as a Follower and synchronizes from the new Leader, which forcibly brings its data into line with the new Leader's.

Scenario 3: Data reaches the Leader node and is successfully replicated to all Follower nodes, but the followers have not yet acknowledged it to the Leader

If the Leader crashes at this stage, the data on the Follower nodes is uncommitted but consistent, and it can be committed once a new Leader is elected. The client, however, may retry, so Raft requires the requests to be idempotent, that is, the cluster needs an internal de-duplication mechanism (sketched below).
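One way to implement such de-duplication, described in the Raft paper's discussion of client interaction, is for clients to tag each command with a serial number and for the state machine to remember the latest serial number applied per client. A toy sketch (all names are assumptions for illustration):

```go
// Toy sketch of request de-duplication: the state machine remembers the
// highest serial number it has applied for each client and ignores retries.
package main

import "fmt"

type command struct {
	ClientID string
	Serial   int
	Op       string
}

type stateMachine struct {
	applied    []string
	lastSerial map[string]int
}

func (s *stateMachine) apply(c command) {
	if c.Serial <= s.lastSerial[c.ClientID] {
		return // a retry of an already-applied command: ignore it
	}
	s.lastSerial[c.ClientID] = c.Serial
	s.applied = append(s.applied, c.Op)
}

func main() {
	sm := &stateMachine{lastSerial: map[string]int{}}
	sm.apply(command{ClientID: "client-1", Serial: 1, Op: "x=1"})
	sm.apply(command{ClientID: "client-1", Serial: 1, Op: "x=1"}) // the client's retry after a timeout
	fmt.Println(sm.applied)                                       // [x=1]: applied exactly once
}
```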

Scenario 4: Data reaches the Leader node and is successfully replicated to some Follower nodes, but those followers have not yet acknowledged it to the Leader

If the Leader crashes at this stage, the data on the Follower nodes is uncommitted and inconsistent. The Raft protocol only grants votes to candidates whose logs are at least as up to date, so a node holding the latest data will be elected Leader and will then force the other followers to synchronize with it; the data is not lost and eventually becomes consistent.

Scenario 5: Data reaches the Leader node and is successfully replicated to all or most Follower nodes; it is committed on the Leader but not yet committed on the Followers

If the Leader crashes in this phase, a new Leader is elected and the subsequent handling is the same as in Scenario 3.

Scenario 6: Data reaches the Leader node and is successfully replicated to all or most Follower nodes; it is committed on every node, but the Leader has not yet responded to the client

If the Leader crashes in this phase, the data in the cluster is in fact already consistent. The client's retries, handled by the idempotence (de-duplication) mechanism, have no effect on consistency.

Scenario 7: A network partition causes split brain, and two Leaders appear

A network partition separates the original Leader from the other nodes. The Followers on the other side, no longer receiving the Leader's heartbeats, elect a new Leader, so there are now two Leaders. The original Leader is alone in its partition: data submitted to it cannot be replicated to a majority of nodes, so it can never be committed. Data submitted to the new Leader can be committed successfully. After the network recovers, the old Leader sees that the cluster has a new Leader with a higher term, automatically steps down to Follower, and synchronizes from the new Leader, restoring data consistency across the cluster.

This exhaustive analysis of the failure scenarios for the smallest cluster (3 nodes) shows that the Raft protocol handles consistency very well and is indeed easy to understand.

In closing

Raft is a classic and easy-to-understand consensus algorithm for distributed applications, and its design ideas offer plenty of inspiration. Above I mainly analyzed elections and log replication; many things were left out, such as safety. If you want a deeper understanding of Raft, I suggest reading the Raft paper. This article is only a brief introduction, and the rest is up to you to explore.

Raft algorithm demo: thesecretlivesofdata.com/raft/

Some references in this article:

  1. Raft GitHub site: raft.github.io/

  2. Raft paper: raft.github.io/raft.pdf

  3. Why Raft is a more understandable distributed consensus algorithm: www.cnblogs.com/mindwind/p/…