
This year MIT finally posted the 6.824 lecture videos on YouTube. Having followed about half of the course, I plan to go through the videos and write up notes. The course is built on the basic theory of distributed systems: fault tolerance, replication, and consistency. It uses a selection of industrial-strength system papers as its main thread, supplemented by detailed reading material and excellent lab assignments. Combining academic theory with industrial practice, it is a rare and excellent distributed systems course. Course videos: YouTube, Bilibili. Course materials: 6.824 home page. This is the sixth set of lecture notes, covering the first part of the Raft paper: the kinds of fault tolerance we have seen so far, and Leader election in Raft.

Author: A Miscellany of Wood Birds, www.qtmuniao.com/2020/05/09/…

Fault tolerance

State machine replication (State Machine Replication)

Fault-tolerant model

We have studied the following fault-tolerance patterns:

  • Computation redundancy: MapReduce replicates computation, but all computation is scheduled by a single Master.
  • Data redundancy: GFS replicates data, but it also relies on a single Master to pick the primary among the replicas.
  • Service redundancy: VMware FT replicates the whole service, but it relies on a single TestAndSet service.

Notice that they all rely on a single component to make certain key decisions. The advantage is that a single component needs no consensus algorithm and cannot produce inconsistency; the downside is that this component becomes a single point of failure. Of course, these systems have already shrunk the single-point problem down to a very small piece, so now we go one step further and use a consensus algorithm, Raft, to remove this last single point.

Split Brain

The biggest problem with consensus algorithms is how to avoid Split Brain.

So how does a Split Brain arise and why is it so dangerous?

Suppose we want to replicate a test-and-set service. Test-and-set is, in a nutshell, a lock service: when multiple clients make requests at the same time, only one of them passes the test (seeing the state at 0) and acquires the lock, setting the server's state to 1; the other clients fail the test and therefore cannot acquire the lock.
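To make this concrete, here is a minimal sketch of such a single-node test-and-set service in Go (the type and method names are my own illustration, not part of the lecture):

```go
package lockservice

import "sync"

// TestAndSetServer is a toy, single-node lock service: the first client whose
// TestAndSet call sees state == 0 acquires the lock; everyone else fails.
type TestAndSetServer struct {
	mu    sync.Mutex
	state int // 0 = lock free, 1 = lock held
}

// TestAndSet atomically tests whether the state is 0 and, if so, sets it to 1.
// It returns true only for the single caller that acquired the lock.
func (s *TestAndSetServer) TestAndSet() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.state == 0 {
		s.state = 1
		return true
	}
	return false
}
```

Replicating this service means two servers each holding their own copy of `state`, and the danger discussed below is precisely that the two copies drift apart.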

Consider four roles in the system: [C1, C2, S1, S2]. S1 and S2 form a two-replica fault-tolerant service, and C1 and C2 are clients of that service. Suppose C1 can communicate with S1 but has lost contact with S2. Should the system serve C1 using S1 alone?

  1. If S2 really has gone down, the system should keep working with only S1; otherwise it would not be fault-tolerant at all.
  2. If S2 is not down but is merely unreachable from S1 and C1 (a network partition), the system must not serve C1 from S1 alone, because S2 may be serving C2 at the same time. If S1 also served C1, the two replicas' states would diverge and the service would break: C1 and C2 could both hold the lock.

In this case, we face a dilemma:

  1. Either we give up the fault-tolerance guarantee, even though we deployed two replica servers;
  2. Or we keep answering client requests, but since a split brain can occur, consistency is no longer guaranteed.

The problem, however, is that S1 cannot tell whether S2 has crashed or the network between them is broken: in both cases S1 sees the same thing, namely that its requests to S2 go unanswered. When S1 can talk to C1 and S2 can talk to C2, but the S1+C1 side cannot reach the S2+C2 side, we call this a network partition.

A network partition may last a long time, and breaking the deadlock may require outside intervention (for example, a human operator) to decide when the network or a server can be trusted again. So how can we make fault tolerance automatic? The answer: majority vote.

Majority vote

Majority voting requires the cluster to contain an odd number of servers, to avoid symmetry: with exactly two servers, there is no way to decide which side should prevail.

With an odd number of servers, any decision that gathers a majority of votes can go ahead (for example, electing a Leader in Raft or committing a log entry), so the system does not get stuck waiting for a single decisive vote. Majority voting breaks the deadlock, and the principle is simple: at most one partition can contain a majority of the servers. Note that "majority" here means a majority of all servers in the system, not a majority of the servers currently alive.

If a cluster consists of 2f + 1 servers, it can keep providing service with up to f servers down.

Another important property of majority voting is that any two majorities must intersect. For example, in Raft the sets of servers that vote in two successive Leader elections must overlap, so the new round can learn the decisions of the previous term (including the previous term number and what was committed in it) through the servers in the intersection.
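A quick numeric check of these two properties (my own arithmetic, not from the notes):

```go
package main

import "fmt"

// majority returns the quorum size for a cluster of n servers.
func majority(n int) int { return n/2 + 1 }

func main() {
	n := 5                         // 2f + 1 servers with f = 2
	fmt.Println(majority(n))       // 3: votes needed to elect a Leader or commit an entry
	fmt.Println(n - majority(n))   // 2 = f: how many servers may be down
	fmt.Println(2*majority(n) > n) // true: any two majorities must share at least one server
}
```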

Around the 1990s, two algorithms were developed that use majority voting to solve the split-brain problem: Paxos and Viewstamped Replication (VSR, from MIT). The former is more widely known today, but the latter is closer to Raft in spirit.

An overview of Raft

Raft usually takes the form of a library running on each replica server. It implements a replicated state machine, and its main job is to keep the operation log synchronized across replicas. On top of it we can build a reliable K/V storage layer, which is responsible for the actual state.

A typical interaction flow between the client and the key-value service goes as follows (a Leader-side code sketch follows the list):

  1. The client sends a Put/Get request to the K/V layer of the Leader
  2. The Leader wraps the request in a Command (operation plus parameters) and appends it to its local log
  3. The Leader sends the Command to the Followers via AppendEntries RPCs
  4. Each Follower appends the Command to its local log
  5. The Leader waits until a majority of servers, counting itself, has replied
  6. Once a majority has replied, the Leader commits the Command's log entry. Committing means the entry will never be lost: even if some servers go down, it will be inherited by any future Leader
  7. The Leader executes the Command, applies it to the state machine, and replies to the client
  8. In subsequent AppendEntries RPCs, the Leader piggybacks the commit information (the log index up to which entries are committed) to the Followers
  9. After learning the new commit index, the Followers apply the corresponding Commands to their state machines
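Below is a rough Go sketch of the Leader-side part of this flow (steps 2 to 7). The struct and field names loosely follow the 6.824 lab skeleton but are my own simplification; the RPC and apply calls are stubs, and everything is written synchronously for clarity:

```go
package raftsketch

import "sync"

// LogEntry is one record in the replicated operation log.
type LogEntry struct {
	Term    int
	Command interface{} // e.g. a Put/Get command handed down by the K/V layer
}

// Raft holds the (heavily simplified) per-server state used in these sketches.
type Raft struct {
	mu          sync.Mutex
	me          int   // this server's index in peers
	peers       []int // ids of all servers, including me
	state       int   // Follower, Candidate, or Leader
	currentTerm int
	votedFor    int // -1 if no vote has been cast in currentTerm
	log         []LogEntry
	commitIndex int // highest log index known to be committed
}

const (
	Follower = iota
	Candidate
	Leader
)

// replicateTo is a stand-in for an AppendEntries RPC to one follower; it reports
// whether the follower appended the entries (always true in this stub).
func (rf *Raft) replicateTo(peer int, entries []LogEntry) bool { return true }

// apply is a stand-in for handing a committed command to the state machine.
func (rf *Raft) apply(cmd interface{}) {}

// handleClientCommand sketches steps 2-7 above.
func (rf *Raft) handleClientCommand(cmd interface{}) {
	// Step 2: append the command to the local log.
	rf.log = append(rf.log, LogEntry{Term: rf.currentTerm, Command: cmd})
	index := len(rf.log) - 1

	// Steps 3-5: replicate to the followers and count acks; the Leader counts itself.
	acks := 1
	for _, peer := range rf.peers {
		if peer != rf.me && rf.replicateTo(peer, rf.log[index:]) {
			acks++
		}
	}

	// Step 6: the entry is committed once a majority of all servers holds it.
	if acks > len(rf.peers)/2 {
		rf.commitIndex = index
		// Step 7: apply to the state machine; the K/V layer then replies to the client.
		rf.apply(rf.log[index].Command)
	}
}
```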

The operation log

So why use operation logs to record user commands?

  1. The Leader uses the log to fix the order of commands: all replicas agree on a single order for requests, even when many requests arrive at roughly the same time, and keep their log entries in that same order. In this sense the log acts as a lock-protected queue
  2. The log holds each Command tentatively until it is known to be committed
  3. The log keeps each Command around in case the Leader has to resend it to a Follower after a network or server failure
  4. The log lets a server rebuild its state after a restart

Q&A:

  1. What if requests arrive faster than the log can absorb them?

    Raft is generally not used as high-concurrency middleware, so this should not normally happen; if it does, the Leader can throttle the rate at which it accepts client requests.

  2. Why doesn't a server execute the commands in its log right after restarting? Because it does not yet know which entries have been committed (the commit point is not persisted); it has to be told by the Leader later.

Raft interface

Raft provides two main interfaces to the K/V layer: Start(command), and ApplyMsg messages delivered over applyCh.

Start(command) -> (index, term, isLeader)

Start only has an effect when called on the Leader; its purpose is to get a majority of servers to agree on a new log entry containing the Command. It involves roughly the following steps:

  1. The Leader appends the Command to the local log
  2. Send AppendEntries RPC to each Follower
  3. Start() returns immediately (to the K/V layer) without waiting for the Followers' replies (it is asynchronous)
  4. The K/V layer needs to listen for applyCh to determine if the Command is committed.

There are three return values:

  1. index: the log index at which the Command will sit if it is committed
  2. term: the Leader's current term
  3. isLeader: if false, the caller needs to try another server until it finds the Leader (a sketch of the signature follows this list)
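In the 6.824 lab skeleton, Start has roughly the following signature; the body below is my own sketch, continuing the simplified Raft struct from the earlier fragment:

```go
// Start proposes a new command for the replicated log and returns immediately.
// The K/V layer learns whether the command actually committed by watching applyCh.
func (rf *Raft) Start(command interface{}) (index int, term int, isLeader bool) {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	if rf.state != Leader {
		return -1, rf.currentTerm, false // caller should retry against another server
	}
	// Append locally; AppendEntries RPCs to the Followers are sent asynchronously.
	rf.log = append(rf.log, LogEntry{Term: rf.currentTerm, Command: command})
	return len(rf.log) - 1, rf.currentTerm, true
}
```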

ApplyMsg->ApplyCh

ApplyMsg is a message containing the Command and its Index; applyCh is the channel on which the K/V layer listens for the ApplyMsg that Raft sends after each commit.

  1. Each server sends one ApplyMsg to applyCh for every log entry it commits
  2. On receiving an ApplyMsg, each server applies the Command to its local state machine
  3. Only the Leader replies to the client, and only after the corresponding log entry has been committed (see the sketch after this list)
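The ApplyMsg type in the lab looks roughly like this (field names have varied a bit across lab years), and the applyLoop below is a hypothetical illustration of how a K/V server might drain applyCh; KVServer and applyToStateMachine are placeholders for the K/V layer's own types, not lab API:

```go
// ApplyMsg is what Raft sends on applyCh for every committed log entry.
type ApplyMsg struct {
	CommandValid bool        // true if this message carries a newly committed command
	Command      interface{} // the committed command itself (e.g. a Put/Get)
	CommandIndex int         // the log index of the command
}

// applyLoop sketches the K/V layer's consumer side: kv.applyCh is assumed to be
// the chan ApplyMsg handed to Raft when the server was created.
func (kv *KVServer) applyLoop() {
	for msg := range kv.applyCh {
		if msg.CommandValid {
			kv.applyToStateMachine(msg.Command, msg.CommandIndex)
		}
	}
}
```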

At any given moment, the logs of the servers in the system are not necessarily identical. For example, while the Leader is replicating an entry, the Leader and some Followers may already have appended it while the other Followers have not yet received it. At that point the servers' logs have diverged.

The good news is that the logs of all servers will eventually be reconciled by the new Leader.

Leader election

The first question to ask about Leader election is: is a Leader necessary at all? Do we need one to synchronize the logs across servers? The answer is no; Paxos, for example, does without one.

So why does Raft use a Leader? There are several reasons. One is that when the system and the network are healthy, letting a single Leader make decisions makes the system more efficient: a client needs at most two round trips (one to locate the Leader, one to send the request to it).

Term

Raft numbers the succession of Leaders with terms:

  1. A new Leader means a new term
  2. A term has at most one Leader, and possibly none
  3. Terms help Followers follow the latest Leader rather than one that has already been deposed

Election

When a Follower has not received a heartbeat from the current Leader within a certain interval (Raft calls it the election timeout; in the implementation we use an election timer), it increments its term (a term allows at most one Leader, so it can only be elected as a new Leader in a new term), turns itself into a Candidate, votes for itself, and then asks the other servers for their votes.
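A minimal sketch of what happens when the election timer fires, again using the simplified Raft struct from the earlier fragment (sendRequestVote is a stub; tallying the replies and becoming Leader is omitted):

```go
// sendRequestVote is a stub for issuing one RequestVote RPC and handling its reply.
func (rf *Raft) sendRequestVote(peer, term int) {}

// onElectionTimeout runs when no heartbeat has arrived within the election timeout.
func (rf *Raft) onElectionTimeout() {
	rf.mu.Lock()
	rf.currentTerm++     // move to a new term: each term allows at most one Leader
	rf.state = Candidate // become a Candidate...
	rf.votedFor = rf.me  // ...and cast our one vote for ourselves
	term := rf.currentTerm
	rf.mu.Unlock()

	// Ask every other server for its vote in `term`, in parallel. Becoming Leader
	// requires votes from a majority of the whole cluster, counting our own.
	for _, peer := range rf.peers {
		if peer != rf.me {
			go rf.sendRequestVote(peer, term)
		}
	}
}
```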

Note that:

  1. This process can trigger unnecessary elections. For example, a server may briefly lose contact with the Leader, time out, and start an election just before the connection comes back; this drags the whole cluster into a new term and deposes the original Leader. That is sometimes inefficient, but it is safe
  2. The old Leader may still be alive and believe it is still the Leader. For example, when a network partition leaves the Leader in a minority partition and the majority partition elects a new Leader, the old Leader keeps acting as a Leader: it accepts client requests and tries to replicate log entries. But since it can never reach a majority, it can never commit anything or reply to clients

Q&A:

What if the network fails in a strange, one-directional way, so that the Leader can still send heartbeats to the Followers (keeping them from starting an election) but client requests can no longer be served? Does Raft still work in this situation?

No, but some small measures can mitigate the problem, for example a two-way heartbeat that detects such half-connected servers in time.

Leader and term

So how do we ensure that at most one Leader is elected in a given term?

  1. To become Leader, a Candidate must obtain votes from more than half of the cluster's servers
  2. Each server casts at most one vote per term (a sketch of this rule follows the list)
    • If it is a Candidate, it votes for itself
    • If it is not a Candidate, it votes for the first Candidate that asks (and meets certain conditions, covered in the next part)
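The one-vote-per-term rule, sketched as the receiving side's RequestVote handler. The argument and reply field names follow the Raft paper; the log-freshness condition from the next part is deliberately omitted, and the Raft struct is the simplified one from the earlier fragment:

```go
// RequestVoteArgs and RequestVoteReply carry only the fields needed for this sketch.
type RequestVoteArgs struct {
	Term        int
	CandidateId int
}
type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

// RequestVote decides whether this server grants its (single) vote for args.Term.
func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
	rf.mu.Lock()
	defer rf.mu.Unlock()

	// A higher term means a new election round: adopt it and forget any older vote.
	if args.Term > rf.currentTerm {
		rf.currentTerm = args.Term
		rf.votedFor = -1
		rf.state = Follower
	}
	// Grant the vote only if we are in the candidate's term and have not yet voted
	// in it (or already voted for this very candidate).
	grant := args.Term == rf.currentTerm &&
		(rf.votedFor == -1 || rf.votedFor == args.CandidateId)
	if grant {
		rf.votedFor = args.CandidateId
	}
	reply.Term = rf.currentTerm
	reply.VoteGranted = grant
}
```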

With these rules, even during a network partition there can still be at most one Leader, and a Leader can still be elected as long as only a minority of servers are down.

Leader heartbeat

A Candidate becomes Leader once it has collected a majority of votes. However, at that moment only the new Leader itself knows the outcome; the other servers have no way to know. The election result is therefore broadcast to the other servers via heartbeats: when a server receives a heartbeat whose term is at least as large as its own, it recognizes the Leader of that term, updates its own term to the Leader's term, and reverts to Follower.

After that, the Leader keeps the Followers from becoming Candidates by continually sending them heartbeats, which reset their election timers. This requires the Leader's heartbeat interval to be smaller than the election timeout (see the sketch below).
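On the Leader's side, the heartbeat can be as simple as a periodic loop. The 100ms interval below is only an example value I chose; the real constraint is that it must be well below the election timeout. isLeader() and sendHeartbeat() are placeholders, and "time" must be imported:

```go
// heartbeatLoop runs on the Leader and periodically sends (possibly empty)
// AppendEntries RPCs so the Followers keep resetting their election timers.
// isLeader() and sendHeartbeat() stand in for the real state check and RPC.
func (rf *Raft) heartbeatLoop() {
	const heartbeatInterval = 100 * time.Millisecond // must be well below the election timeout
	for rf.isLeader() {
		for _, peer := range rf.peers {
			if peer != rf.me {
				go rf.sendHeartbeat(peer) // an AppendEntries RPC carrying no new entries
			}
		}
		time.Sleep(heartbeatInterval)
	}
}
```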

Split votes

Within a given term, there are two situations in which no Leader can be elected:

  1. No set of mutually reachable servers contains a majority
  2. Several Candidates start elections at the same time and split the vote, so none of them wins a majority

To avoid the endless cycle in which several Candidates start elections at the same time, all time out into the next term at the same time, and then vote simultaneously again, Raft randomizes the election timeout: each server's timeout is not a fixed value but a random value within a range. After a split vote, the next round of elections is then naturally staggered, because the servers drew different timeouts. Being staggered is not enough by itself, of course: the gap must be large enough that one Candidate can collect its votes before any other server times out, so the vote does not split again.
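A common way to draw the randomized timeout, sketched below. The 300-600ms range is an example I chose for illustration, not a value mandated by Raft or the lab:

```go
package raftsketch

import (
	"math/rand"
	"time"
)

// randomElectionTimeout returns a fresh election timeout each time the election
// timer is (re)armed, e.g. uniformly in [300ms, 600ms), so Candidates stagger.
func randomElectionTimeout() time.Duration {
	const base = 300 * time.Millisecond // lower bound: several heartbeat intervals
	extra := time.Duration(rand.Intn(300)) * time.Millisecond
	return base + extra
}
```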

The election timeout

So how should the election timeout be chosen?

  1. Its lower bound should be several (more than two) heartbeat intervals, because the network occasionally drops packets; if one lost heartbeat could trigger an election, we would get unnecessary elections
  2. The random spread should be reasonably large, so that the server whose timer fires first has time to gather votes and become Leader before the next server times out
  3. On the other hand, the timeout must not be too long, or the system will stall for a long time whenever it loses its Leader
  4. In the lab, an overly long timeout makes the tests fail (the test harness bounds how long Leader election may take)

The old Leader

When a network partition isolates the old Leader together with a minority of servers, the remaining majority elects a new Leader. What happens if the old Leader never notices the new one?

  1. The old Leader cannot commit any log entries, because it can never get a majority of Followers to replicate them
  2. Even though nothing is committed, some servers do receive log entries from the old Leader, so the logs of the servers in the cluster diverge

Log divergence

When everything works normally, the Followers simply accept the log entries that the Leader replicates to them. But what should the system do when something goes wrong, for example when the Leader replicates an entry to only part of the cluster and then crashes?

Of course, log entry 11 in the lecture's Figure A may or may not have been committed by the old Leader. Since the new Leader has no way to know, it simply falls back on the usual rule: the entry counts as committed once a majority of servers hold it.


Welcome to follow my public account, A Miscellany of Wood Birds, for more articles on distributed systems.
