Disclaimer: this series assumes you already know Raft reasonably well. If you have not read about Raft before, it may be hard to follow.

Picking up where we left off, let's dig into the implementation details of Raft's leader election, log replication, safety, and more.

Raft foundation

A Raft cluster usually consists of multiple machines, typically 2F+1 of them, where F is the number of machines that may fail. With five machines, Raft can tolerate two failures; if three machines fail, the cluster can no longer make progress. Five machines is the most common cluster size.

Each machine is in one of three states: leader, follower, or candidate. The figure below shows the transitions between the three states.

  • The leader handles all client requests; if a client contacts a follower, the follower forwards the request to the leader
  • Followers are passive: they only respond to requests from the leader and from candidates and do not initiate requests themselves. If a follower receives no communication from the leader for a while, it becomes a candidate and starts a new election
  • Candidate is the state a server is in while a new leader is being elected; a candidate becomes the leader if it receives votes from a majority of machines in response to its vote request (RequestVote RPC, hereinafter referred to as a RequestVote request)

Raft divides time into terms, and each term begins with an election. If a candidate becomes the new leader, the cluster enters the normal operation phase for the rest of the term. If no leader is elected, a new election is held, which starts a new term. Simply put, the term is Raft's unit of time. As shown below:

Each server stores the current term number, and term numbers are exchanged whenever servers communicate. If a server receives a request carrying an old term, it rejects the request.
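For illustration, here is a minimal sketch in Go of how a server might apply this term rule on every incoming request (the types and field names are ours, not from any particular implementation; it also includes the standard Raft rule that a server which sees a newer term adopts it and reverts to follower):

    // serverState is the role a machine currently plays.
    type serverState int

    const (
        stateFollower serverState = iota
        stateCandidate
        stateLeader
    )

    type server struct {
        currentTerm int
        state       serverState
    }

    // checkTerm runs on every incoming request: reject requests carrying an old
    // term, and step down to follower when a newer term is discovered.
    func (s *server) checkTerm(requestTerm int) bool {
        if requestTerm < s.currentTerm {
            return false // stale term: reject the request
        }
        if requestTerm > s.currentTerm {
            s.currentTerm = requestTerm
            s.state = stateFollower
        }
        return true
    }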

Leader election

Let's look at how Raft elects a leader. Raft uses a heartbeat mechanism to trigger leader elections. Servers start up in the follower state. The leader sends heartbeats to all followers by issuing AppendEntries RPCs that contain no log entries (hereinafter referred to as Append requests); receiving a heartbeat tells a follower that the leader is still "alive". If a follower does not receive a heartbeat from the leader for a long time, it assumes there is no leader, changes to candidate, and starts a new leader election.

To start an election, the follower increments its current term, immediately enters the candidate state, and sends RequestVote requests to all other machines, asking them to vote for it. Each server in a Raft cluster can vote for at most one candidate in a given term (a candidate votes for itself).
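As a rough sketch, here is what a vote request might carry and how the one-vote-per-term rule could look on the voter's side (written in Go; the field names are illustrative rather than the paper's exact RPC definition, and the log up-to-date check discussed in the election restrictions section below is omitted):

    // RequestVoteArgs is the information a candidate sends when asking for a vote.
    type RequestVoteArgs struct {
        Term         int // the candidate's term
        CandidateID  int
        LastLogIndex int // used by the election restriction described later
        LastLogTerm  int
    }

    type voter struct {
        currentTerm int
        votedFor    int // -1 means "has not voted in this term"
    }

    // handleRequestVote grants at most one vote per term, first come first served.
    func (v *voter) handleRequestVote(args RequestVoteArgs) bool {
        if args.Term < v.currentTerm {
            return false // stale candidate
        }
        if args.Term > v.currentTerm {
            v.currentTerm = args.Term
            v.votedFor = -1 // a new term resets the vote
        }
        if v.votedFor == -1 || v.votedFor == args.CandidateID {
            v.votedFor = args.CandidateID
            return true
        }
        return false // already voted for another candidate in this term
    }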

A candidate leaves the candidate state when one of three things happens:

1. The candidate wins the election: it receives votes from a majority of the machines and becomes the leader.

2. The candidate receives a heartbeat from another machine claiming to be leader. If the term in the heartbeat is at least as large as the candidate's current term, a new leader has been elected and the candidate reverts to follower. (If the term in the heartbeat is smaller than the candidate's current term, the candidate rejects it and stays a candidate.)

3. No candidate obtains a majority of votes within a period of time, so the election times out and a new election is held. Without any further restriction, several followers could become candidates at the same time, send RequestVote requests simultaneously, and split the votes so that no candidate wins. The election would time out, and another election with the same outcome could repeat indefinitely, leaving the cluster with no leader.

To handle the third case, Raft uses randomized timeouts to make split votes rare. Instead of being a fixed value, each election timeout is chosen at random from a range (such as 150-300ms). As a result, in most cases only one server times out first, wins the election, and sends heartbeats before anyone else times out. The same mechanism applies after a split vote: each candidate restarts its election with a new random timeout, so the candidates do not keep restarting elections at the same moment and looping on election timeouts forever.

This random timeout mechanism is a simple and effective way to solve this problem.
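A minimal sketch of drawing such a randomized timeout in Go, using the 150-300ms range mentioned above (the helper name is ours):

    import (
        "math/rand"
        "time"
    )

    // randomElectionTimeout returns a fresh timeout in [150ms, 300ms) each time it
    // is called, so two servers are unlikely to time out at the same moment.
    func randomElectionTimeout() time.Duration {
        return 150*time.Millisecond + time.Duration(rand.Int63n(int64(150*time.Millisecond)))
    }

    // Typical use: reset the election timer with a new random timeout whenever a
    // heartbeat arrives or a new election starts, e.g. timer.Reset(randomElectionTimeout()).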

Log replication

Raft is a protocol for keeping data consistent in a distributed system. Its main job is to accept data from clients and replicate it so that every node ends up with the same data.

Log Replication Process

The "log" mentioned throughout this article is the data to be replicated in a Raft system; its units are called log entries. The format of the log is shown below: each entry contains the command to be executed, the term in which it was created, and its index in the log.
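In code form, a log entry might look like this (a sketch in Go; the field names are illustrative):

    // LogEntry is the unit of replication: a client command plus the term in which
    // the leader received it and its position (index) in the log.
    type LogEntry struct {
        Term    int    // term when the entry was created by the leader
        Index   int    // position of the entry in the log (1-based, as in the paper)
        Command []byte // the instruction to apply to the state machine
    }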

After the leader is elected, it begins normal operation and accepts requests from clients. Each client request contains a command to be executed. The leader appends the command to its log as a new entry and sends Append requests to all followers, asking them to replicate the entry; a follower replies with success once it has stored the entry. When a majority of servers have stored the entry, the leader considers it committed. Committing an entry also commits any committable entries from earlier terms that precede it in the log (discussed below). The leader then executes the committed commands on its state machine and returns the result to the client.
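A rough sketch of that leader-side flow in Go. The Leader type below is a simplification of the per-leader state described in the paper (nextIndex and matchIndex are used by later sketches), and sendAppend and apply are hypothetical helpers:

    type Leader struct {
        currentTerm int
        log         []LogEntry
        peers       []int       // IDs of the other machines in the cluster
        commitIndex int         // highest log index known to be committed
        nextIndex   map[int]int // per follower: index of the next entry to send
        matchIndex  map[int]int // per follower: highest index known to be replicated
    }

    // handleClientCommand appends the command to the local log, replicates it with
    // Append requests, commits once a majority has stored it, then applies it.
    func (l *Leader) handleClientCommand(cmd []byte) (result []byte, committed bool) {
        entry := LogEntry{Term: l.currentTerm, Index: len(l.log) + 1, Command: cmd}
        l.log = append(l.log, entry)

        acks := 1 // the leader itself has stored the entry
        for _, peer := range l.peers {
            if l.sendAppend(peer, entry) { // hypothetical: issue an Append request, wait for the reply
                acks++
            }
        }
        if acks*2 <= len(l.peers)+1 {
            return nil, false // not yet on a majority; the leader keeps retrying in the background
        }
        l.commitIndex = entry.Index // the entry (and everything before it) is committed
        return l.apply(entry), true // hypothetical: execute the command on the state machine
    }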

If a follower is slow to respond (the request times out, the network is slow, or the response is lost), the leader retries the Append request until every follower has stored all log entries.

Note that the leader keeps retrying even after it has responded to the client. This makes sense: once the leader has responded, the entry is committed, and all that remains is to make sure the log reaches every follower. For a follower, a replicated entry has not yet been executed; the follower applies it to its state machine only after it learns that the entry is committed.
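On the follower side, this "copied but not yet executed" distinction is usually tracked with two indexes, commitIndex and lastApplied; here is a minimal sketch (apply is a hypothetical callback that runs one command on the local state machine):

    // applyCommitted executes entries that are known to be committed (index <=
    // commitIndex) but not yet applied locally, and returns the new lastApplied.
    func applyCommitted(log []LogEntry, commitIndex, lastApplied int, apply func([]byte)) int {
        for lastApplied < commitIndex {
            lastApplied++
            apply(log[lastApplied-1].Command) // log index i lives at slice position i-1
        }
        return lastApplied
    }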

The log replication process described above is as follows:

1. The client initiates the instruction

2. The leader receives the command and sends an Append request to all followers

3. The followers write the log entry to local storage

4. When a follower has finished, it replies success to the leader

5. Once a majority has stored the entry, the leader commits it, applies it to the state machine, and returns a response to the client

Log Matching feature

In terms of logging, Raft implements and maintains the following two features:

1. If two log entries have the same index and term, they store the same command

2. If two log entries have the same index and term, all the entries preceding them are also identical

Together, these two properties form Raft's log matching feature: if two logs contain an entry with the same index and term, the logs are identical in all entries up to that point. Raft can maintain this because the leader creates at most one entry at a given index in a given term, and entries never change their position in the log. In addition, a consistency check is performed on every Append request: the request carries the index and term of the entry immediately preceding the new entries (prevLogIndex and prevLogTerm), and a follower rejects the request if its own log does not contain an entry with that index and term. This is how Raft preserves the log matching feature and keeps log entries consistent.
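A sketch of that consistency check on the follower's side, in Go (using the LogEntry type from earlier; the function name is ours):

    // appendConsistent reports whether this follower's log contains an entry at
    // prevLogIndex whose term is prevLogTerm. If not, the Append request is
    // rejected and the leader will retry with earlier entries.
    func appendConsistent(log []LogEntry, prevLogIndex, prevLogTerm int) bool {
        if prevLogIndex == 0 {
            return true // the new entries start at the very beginning of the log
        }
        if prevLogIndex > len(log) {
            return false // the follower's log is too short to contain that entry
        }
        return log[prevLogIndex-1].Term == prevLogTerm
    }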

Log consistency

Normally, the data of the leader and followers are consistent. However, if the leader dies suddenly, data inconsistency may occur.

In Raft, the leader forces the followers to directly copy the leader’s log entries to solve the problem of data inconsistency. If the log entries of the followers and the leader are different, the followers only recognize the leader’s data and the local data will be overwritten by the leader’s data.

The leader maintains a nextIndex for each follower (the index of the next log entry the leader will send to that follower). If a follower's log is inconsistent with the leader's, the follower rejects the Append request. On receiving the rejection, the leader decrements that follower's nextIndex by one and sends another Append request. Eventually nextIndex reaches a point where the two logs agree; once an Append request succeeds, the leader's and follower's logs are identical up to that index, and they stay consistent from then on.
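A sketch of that leader-side backoff in Go, reusing the Leader type from earlier (sendAppendEntries is a hypothetical helper that carries prevLogIndex, prevLogTerm, and the entries to append):

    // syncFollower walks nextIndex backwards until the follower accepts an Append
    // request; at that point the follower's log matches the leader's and the rest
    // of the leader's entries have been appended.
    func (l *Leader) syncFollower(follower int) {
        for {
            next := l.nextIndex[follower]
            prevIndex, prevTerm := next-1, 0
            if prevIndex > 0 {
                prevTerm = l.log[prevIndex-1].Term
            }
            if l.sendAppendEntries(follower, prevIndex, prevTerm, l.log[prevIndex:]) {
                l.nextIndex[follower] = len(l.log) + 1 // logs now match up to the end
                return
            }
            if next > 1 {
                l.nextIndex[follower] = next - 1 // rejected: step back one entry and retry
            }
        }
    }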

With this mechanism, the leader does not need any special recovery procedure to restore consistency. The leader never deletes or overwrites its own log entries; it simply performs normal operation, and consistency is repaired automatically whenever an Append request fails the consistency check.

The log replication mechanism also shows that Raft can accept, replicate, and apply new log entries as long as a majority of machines are up. In the normal case a new entry can be replicated to a majority with a single round of RPCs, and an individual slow machine does not affect overall performance.

Safety

Leader election and log replication were introduced above, but these mechanisms alone are not sufficient to ensure that every state machine executes the same commands in the same order. For example, with only these two mechanisms, certain failures would let a newly elected leader overwrite logs or leave them out of order. Additional restrictions are therefore needed to guarantee consistency.

Election restrictions

The first restriction in Raft: log entries flow only from the leader to followers, and the leader never overwrites or deletes entries in its own log.

The second restriction is on voting when electing a leader: a candidate can be elected only if its log contains all committed log entries.

During an election, a candidate's RequestVote request includes information about its last log entry. Each voter compares that with its own log and refuses to vote for a candidate whose log is less up-to-date than its own; a candidate can therefore become leader only if its log is at least as up-to-date as the logs of a majority of machines.

"Up-to-date" is defined as follows: if the last entries of two logs have different terms, the log whose last entry has the larger term is more up-to-date; if the last entries have the same term, the longer log is more up-to-date. This can be described by the following logic:

Given two machines s1 and s2, each holding a log of entries, the following determines whose log is more up-to-date:

    log1 := s1.log
    log2 := s2.log

    // s1 wins if its last entry has a larger term, or the last terms are equal and its log is longer.
    last1, last2 := log1[len(log1)-1].Term, log2[len(log2)-1].Term
    if last1 > last2 || (last1 == last2 && len(log1) > len(log2)) {
        // s1 wins
    } else {
        // s2 wins
    }

Committing entries from previous terms

As mentioned earlier, once a log entry has been replicated to a majority of machines, the leader considers it committable. However, if the leader crashes before committing the entry, something unexpected can happen. Here's an example:

The scenario is shown in the figure:

(a) S1 is elected leader and starts replicating the entry at index=2, but only S2 receives it.

(b) S1 crashes. S5 is elected leader for term=3 with votes from S3, S4, and itself, and accepts a different entry at index=2.

(c) S5 crashes. S1 restarts, is elected leader for term=4, and resumes its earlier replication, so the entry with term=2, index=2 is copied to S3. The entry is now on a majority of machines and looks ready to commit.

The next two scenarios are hypothetical continuations:

(d) If S1 crashes again, S5 can be elected leader once more and replicate its entry from step (b), overwriting the entry at index=2 on the other machines.

(e) If instead S1 replicates an entry from its current term (index=3, term=4) to a majority of machines before crashing, S5 can no longer be elected leader, because under the voting restriction its log is less up-to-date than the logs of a majority of machines. Once the term=4 entry is on a majority, it can be committed, and all preceding entries are committed along with it.

Scenarios (d) and (e) above are hypothetical. To avoid the problem in (d), Raft adds one more restriction: a leader never commits entries from previous terms by counting replicas; it only commits entries from its own current term in this way. Because of the log matching feature, committing an entry from the current term indirectly commits the committable entries from earlier terms that precede it. (If two logs contain an entry with the same index and term, all preceding entries are identical; so when the leader commits its latest committable entry, the earlier committable entries must be committed with it, otherwise the leader and followers could become inconsistent.)
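A sketch of how a leader might advance its commit index under this restriction, again reusing the Leader type from earlier (matchIndex holds, for each follower, the highest entry known to be replicated there; the function name is ours):

    // advanceCommitIndex moves commitIndex forward only to an index n whose entry
    // was created in the leader's current term and is stored on a majority of
    // machines; earlier entries (including previous terms') are committed with it.
    func (l *Leader) advanceCommitIndex() {
        for n := len(l.log); n > l.commitIndex; n-- {
            if l.log[n-1].Term != l.currentTerm {
                continue // never commit a previous term's entry by counting replicas
            }
            replicas := 1 // the leader's own copy
            for _, match := range l.matchIndex {
                if match >= n {
                    replicas++
                }
            }
            if replicas*2 > len(l.matchIndex)+1 {
                l.commitIndex = n
                return
            }
        }
    }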

The leader commits the log by:

    // Commit every entry between the previously committed index and the new one.
    for _, entry := range GetEntries(lastCommitted, newCommitted) {
        entry.Commit()
    }

With the election restriction and the restriction that the leader only commits entries from its current term, the problems described above do not occur.

On top of these rules, the Leader Completeness property can be demonstrated, and from it the State Machine Safety property. The detailed proof is not translated here; if you are interested, you can read the paper.

Followers and candidates crash

A follower crash and a candidate crash are handled the same way. If a follower or candidate crashes, subsequent Append or RequestVote requests sent to it fail. Raft's implementation simply keeps retrying these requests until the crashed machine restarts and the requests succeed. (Raft's RPCs are idempotent, so receiving the same RPC more than once does not affect the system.)

Time and availability

For a Raft system to be highly available, Raft requires that safety not depend on timing: the system must never produce incorrect results just because machines respond faster or slower than expected. Availability, however, does depend on timing, and a stable leader is needed to keep the system making progress.

Raft's timing requirement involves three quantities:

  • broadcastTime: the average time it takes a server to send RPCs to every other server and receive their responses. It is typically 0.5-20ms, because these RPCs must persist data to local storage
  • electionTimeout: the timeout after which a follower starts a new leader election, typically 10-500ms
  • MTBF: the average time between failures for a single machine, typically measured in months or more

Raft can ensure the stability of the leader by using the following time expression:

broadcastTime ≪ electionTimeout ≪ MTBF

1. broadcastTime ≪ electionTimeout, because Raft relies on heartbeats to keep a leader in place: heartbeats must reach followers well within the election timeout, or followers will start unnecessary elections

2. electionTimeout ≪ MTBF, so that the system makes steady progress: when the leader crashes, the cluster is unavailable for roughly one election timeout, which should be a tiny fraction of the time between failures
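As a quick illustration with the example figures above (a sketch; these are just the quoted ranges, not measured values):

    import (
        "fmt"
        "time"
    )

    func main() {
        broadcastTime := 20 * time.Millisecond    // upper end of the 0.5-20ms range
        electionTimeout := 300 * time.Millisecond // within the 10-500ms range
        mtbf := 30 * 24 * time.Hour               // failures measured in months

        // broadcastTime ≪ electionTimeout ≪ MTBF holds by orders of magnitude.
        fmt.Println(broadcastTime < electionTimeout, electionTimeout < mtbf) // true true
    }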

Q&A

Questions from last time:

Q1: During leader election, how does a machine decide whether to vote for the candidate that sent the RequestVote RPC? Does it vote as soon as it receives the request? Are there any requirements for becoming leader?

During an election, a candidate's RequestVote request carries information about its last log entry. Each machine compares that with its own log and votes only for a candidate whose log is at least as up-to-date as its own, so a candidate can be elected leader only if its log is at least as up-to-date as the logs of a majority of machines.

"Up-to-date" is defined as before: if the last entries of two logs have different terms, the log whose last entry has the larger term is more up-to-date; if the last entries have the same term, the longer log is more up-to-date. That is:

    log1 := s1.log
    log2 := s2.log

    // s1 wins if its last entry has a larger term, or the last terms are equal and its log is longer.
    last1, last2 := log1[len(log1)-1].Term, log2[len(log2)-1].Term
    if last1 > last2 || (last1 == last2 && len(log1) > len(log2)) {
        // s1 wins
    } else {
        // s2 wins
    }

Q2: What does the log matching feature have to do with the logs that the leader commits for the previous term?

Review the log matching feature:

1. If two log entries have the same index and term, they store the same command

2. If two log entries have the same index and term, all the entries preceding them are also identical

If the leader did not also commit the preceding entries from previous terms, the logs would become inconsistent and property 2 of the log matching feature could not be maintained.

Here’s an example:

Suppose there are 5 machines. In term 2, S1 is the leader and the entry at index 2 is replicated to a majority of machines; at this point S4 and S5 have no data. S1 crashes before committing, and in term 3 S3 is elected leader and replicates the entry at index 3 to a majority of machines. If the leader (S3) committed the entry at index 3 but not the entry at index 2, then S4, which only received the entry with index=3 and term=3, would disagree with S1 about the log data before index=3, term=3.

If in term 3 the leader commits everything from index 2 through index 3, and forces its own entries onto any follower whose log turns out to be inconsistent during replication, the problem does not arise and the log matching feature is preserved.

Conclusion

This exploration of Raft now comes to an end, and it has been rewarding. Log replication and leader election are probably the most difficult parts of Raft.

Raft uses the election restrictions to ensure that the machine that becomes leader holds the most up-to-date data, so committed data is never overwritten. Committing only current-term entries (and, through them, the entries from earlier terms) preserves the log matching feature, and the consistency check keeps the leader and followers in agreement. In general, Raft achieves stability and consistency through restrictions and conventions; they are part of the protocol, and any system that uses the protocol must follow them to work correctly.

Next time we’ll explore cluster relationship changes, log compression, and more.

Reference links:

Question about Committing entries from previous terms

Raft algorithm in detail

Understand Raft algorithm

This is an original article. My writing and knowledge are limited, so if anything in it is wrong, please let me know.

If this article has been helpful to you, please feel free to like it. Thank you

For more content, please follow my personal WeChat official account.