Contents

  • Paxos Algorithm Guide

  • ZAB Algorithm Guide

  • Raft Algorithm Guide


Paxos Algorithm Guide

Background to the Paxos algorithm

The Paxos algorithm is a message-passing-based consistency (consensus) algorithm proposed by Leslie Lamport in 1990. It is widely recognized as one of the most effective algorithms for solving the distributed consistency problem, that is, how a distributed system reaches agreement on a single value (resolution).

The premise of the Paxos algorithm

The Paxos algorithm assumes that there is no Byzantine generals problem, that is, the channel is reliable and messages sent over it cannot be tampered with.

Introduction to Paxos algorithm

In the Paxos algorithm, there are three roles:

  • Proposer: the proposer, who puts forward proposals (values to be agreed on).
  • Acceptor: the decision maker, who can accept (approve) a proposal.
  • Learner: the learner of the final decision; it learns the chosen value but does not take part in voting.

The security requirements of Paxos are as follows:
  • Only a value that has been proposed can be chosen.

  • Only a single value is chosen.

  • If a process learns that a value has been chosen, that value must indeed be the value that was chosen.

Process description of Paxos algorithm:

The Paxos algorithm is similar to two-phase commit; its execution is divided into two phases, as follows.

Prepare stage
  • The Proposer sends a Prepare request numbered N to more than half of the Acceptors.
  • When an Acceptor receives a Prepare request numbered N:
    • If N is not greater than the number of a Prepare request it has already responded to, it rejects the request, ignores it, or returns an error;
    • If N is greater than the number of every Prepare request it has already responded to, it replies to the Proposer with the highest-numbered proposal it has accepted so far (in phase 2), if any, records N locally as the highest number seen, and promises not to accept any proposal numbered less than N.
    • The reply has two cases:
      • If the Acceptor has already accepted a proposal, it returns the accepted proposal with the highest number.
      • If it has not accepted any proposal, it returns {pok, null, null}.
Accept stage
  • If the Proposer receives responses to its Prepare request numbered N from more than half of the Acceptors, it sends an Accept request for the proposal [N, V] to more than half of the Acceptors.

Note: V is the value of the highest-numbered proposal among the responses received.

  • If none of the responses contains an accepted proposal, then V is chosen by the Proposer itself and can be any value.

  • If an Acceptor receives an Accept request for a proposal numbered N, it accepts the proposal as long as it has not responded to a Prepare request numbered greater than N.

  • If N is less than the number of a Prepare request the Acceptor has already responded to, it rejects the request, ignores it, or returns an error. (A minimal sketch of both phases follows.)
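To make the two phases concrete, here is a minimal illustrative sketch of an Acceptor's Prepare/Accept handlers in Python. The class and method names are our own and not taken from any particular Paxos implementation; networking and persistence are ignored, and only the promise/accept bookkeeping described above is shown.

```python
class Acceptor:
    """Minimal single-instance Paxos acceptor (illustrative sketch only)."""

    def __init__(self):
        self.promised_n = -1        # highest Prepare number we responded to
        self.accepted_n = None      # number of the proposal we accepted, if any
        self.accepted_value = None  # value of the proposal we accepted, if any

    def on_prepare(self, n):
        # Reject Prepare requests numbered no higher than what we already promised.
        if n <= self.promised_n:
            return ("reject", None, None)
        # Promise not to accept any proposal numbered less than n, and report the
        # highest-numbered proposal accepted so far (if any).
        self.promised_n = n
        return ("pok", self.accepted_n, self.accepted_value)

    def on_accept(self, n, value):
        # Accept only if we have not promised a higher-numbered Prepare request.
        if n < self.promised_n:
            return "reject"
        self.promised_n = n
        self.accepted_n = n
        self.accepted_value = value
        return "accepted"


# Tiny usage example for a single acceptor:
a = Acceptor()
print(a.on_prepare(1))      # ('pok', None, None)
print(a.on_accept(1, "x"))  # 'accepted'
print(a.on_prepare(1))      # ('reject', None, None) -- 1 is not greater than the promise
```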

The details are shown in the figure below:

Once a proposal has been accepted by more than half of the Acceptors, the Learners learn (perceive) the chosen proposal. (The same process can play multiple roles simultaneously.)

Learner’s learning process

Learner learning includes two scenarios:
  • The Learner's node takes part in the proposal election, and the Learner needs to know whether the proposal value it accepted was chosen.
  • The Learner's node has already fallen behind the other nodes, so the Learner needs an appropriate strategy to catch up quickly and rejoin the proposal election.
Notification after a value is chosen
  • The node whose proposal is chosen notifies every node with a message (MsgType_PaxosLearner_ProposerSendSuccess).

  • Normally, all nodes are online and participate in the Paxos election. To avoid instance ID conflicts, it is recommended that only the Proposer on the master node initiate proposals, which ensures that the number of proposals accepted matches the number of proposals learned.

  • If an Acceptor fails to accept the request, the Learner only updates its in-memory state.

  • Finally, if there are follower nodes, the data is synchronized to the followers (follower nodes do not participate in the Paxos algorithm; they act as synchronized standbys of a Paxos node).

Proposal value catch-up
  • Once a node falls behind, it can no longer participate in the Paxos proposal election. At this point the Learner initiates active learning to catch up.

  • When Paxos starts, a Learner timer is started and is used to send learn requests to every node. Each request contains the local node's instance ID and node ID. After receiving a request, each node responds with its data, and the Learner completes the learning process on its own.

In this way, data consistency is maintained across all nodes.

Why two phases of submission are needed

  • On the one hand, after the first (Prepare) phase, a Proposer may learn that a value has already been accepted. In that case it should not push its own value; instead it should converge on and help choose the already-accepted value as quickly as possible.
  • On the other hand, the Prepare phase pre-locks the proposal number: Acceptors promise not to accept lower-numbered proposals, which keeps stale competing proposals from being accepted later.

How to ensure the unique Proposal number

  • Suppose there are K servers running Paxos, initially numbered 0 … K − 1. Each server then increases its proposal number by K every time, which guarantees globally unique, increasing numbers (see the sketch after this list).

  • After a formal proposal is accepted by more than half of the Acceptors, the accepted proposal (value) is confirmed as the chosen one.

  • There must be an intersection between two sets of more than half.
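As an illustration of the numbering scheme above, here is a tiny Python sketch (the function name is ours, not from any library): server i starts at i and always steps by K, so numbers produced by different servers never collide.

```python
def proposal_numbers(server_id, k):
    """Yield globally unique, increasing proposal numbers for one of k servers."""
    n = server_id           # initial number: 0 .. k-1 depending on the server
    while True:
        yield n
        n += k              # step by k so two servers never produce the same number

# Usage: with k = 3, server 0 yields 0, 3, 6, ...; server 1 yields 1, 4, 7, ...
gen = proposal_numbers(server_id=1, k=3)
print([next(gen) for _ in range(4)])   # [1, 4, 7, 10]
```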

The Paxos algorithm has a livelock problem.

There is an extreme case in which two Proposers keep issuing proposals with ever-increasing numbers and fall into an infinite loop, never completing the second phase, that is, never getting a proposal chosen. See the diagram below:

The majority (more-than-half) basis of the Paxos algorithm

  • The Paxos algorithm adopts the concept of “more than half” (majority), i.e., the minority is subordinate to the majority, which gives Paxos good fault tolerance. So why is a majority enough?
  • The mathematical basis is the quorum property: call a set made up of a majority (more than half) of the processes a quorum; any two quorums must have a non-empty intersection, i.e., at least one process in common. For example, for the processes A, B, C, D, and E, the quorum Q1 = {A, B, C} and the quorum Q2 = {B, C, D} must intersect; B and C are their common processes. If Paxos has one fundamental principle, it is this simple property: any two majorities intersect, so the decisions they back must be the same, and agreement is reached.
  • Paxos is a highly fault-tolerant distributed consistency algorithm based on message passing. By introducing the majority concept it avoids the over-conservatism of 2PC and 3PC and gains good fault tolerance; it also supports role rotation among the distributed nodes, avoiding a distributed single point of failure to the greatest extent. Paxos can therefore solve the infinite-wait problem as well as the split-brain problem, and it remains one of the best distributed consistency algorithms. ZooKeeper’s ZAB protocol and the Raft consensus algorithm are both based on ideas from Paxos. In the following sections I will gradually introduce ZooKeeper, an excellent distributed coordination service framework and an excellent industrial implementation of a consistency algorithm, covering both its use and its implementation.

ZAB Algorithm Guide

The ZAB protocol adopted by ZooKeeper is also based on the Paxos algorithm, but ZAB makes many modifications and optimizations, and the two have different design goals: the ZAB protocol is mainly used to build a highly available distributed primary/backup (master/slave) data system, whereas the Paxos algorithm is used to build a distributed, consistent state-machine system.

ZAB is the ZooKeeper Atomic Broadcast protocol, the algorithm ZooKeeper uses to implement consistency. ZAB is divided into the following three phases:

  • Fast leader election: the fast election phase
  • Recovery: the recovery phase
    • Discovery
    • Synchronization
  • Broadcasting: the broadcast phase

Let’s start with some basic concepts

  • ElectionEpoch: incremented with each leader election; it marks the leader-election round.
  • PeerEpoch: after each leader election a new peerEpoch is established; it marks the epoch to which transaction requests belong.
  • Zxid: the unique identifier of a transaction request, assigned by the leader server. The high 32 bits are the peerEpoch described above, and the low 32 bits are the request counter, which starts from 0 (see the sketch after this list).
  • LastProcessedZxid: the zxid of the last committed transaction request.
  • LinkedList committedLog, Long maxCommittedLog, Long minCommittedLog: ZooKeeper keeps the transaction proposals executed most recently, 500 by default. committedLog is the list of proposals; maxCommittedLog is the zxid of the largest proposal in committedLog, and minCommittedLog is the zxid of the smallest.
  • ConcurrentMap<Long, Proposal> outstandingProposals: every proposal the leader puts forward is stored in outstandingProposals; once a proposal is acknowledged by the majority, it is committed and removed from outstandingProposals.
  • ConcurrentLinkedQueue toBeApplied: a property owned by the leader; whenever a proposal is ready to be committed it is stored in this list, and once the proposal has been applied to ZooKeeper’s in-memory tree it can be removed from toBeApplied.
  • State: indicates the status of the current server
  • RecvQueue: Message receiving queue for messages received from other servers.
  • QueueSendMap: Message sending queue that holds messages to be sent, grouped by SID.
  • SenderWorkerMap: the set of SenderWorkers. Each SenderWorker message sender corresponds to one remote ZooKeeper server, is responsible for sending messages to it, and is likewise grouped by SID.
  • LastMessageSent: The most recently sent message, keeping the most recently sent message for each SID.
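The zxid layout above is easy to see in code. Here is a minimal illustrative sketch (the helper names are ours, not ZooKeeper’s):

```python
def make_zxid(peer_epoch, counter):
    """Pack a zxid: high 32 bits = epoch, low 32 bits = per-epoch request counter."""
    return (peer_epoch << 32) | (counter & 0xFFFFFFFF)

def split_zxid(zxid):
    """Unpack a zxid back into (epoch, counter)."""
    return zxid >> 32, zxid & 0xFFFFFFFF

zxid = make_zxid(peer_epoch=5, counter=42)
print(hex(zxid))          # 0x50000002a
print(split_zxid(zxid))   # (5, 42)
```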
In the ZAB protocol, a server can be in one of four states:
  • LOOKING: the server is in the leader-election state
  • FOLLOWING: the leader election has ended and the server has entered the follower state
  • LEADING: after the election, the server enters the leader state
  • OBSERVING: the server acts as an observer

Protocol algorithm description

The broadcast process is as follows (a sketch follows this list):
  1. The leader creates a proposal for the client’s transaction request and puts the proposal’s zxid and the proposal itself into the outstandingProposals map.
  2. The leader sends the proposal to all the followers. Once more than half of the followers respond OK, the leader considers the proposal ready to commit and removes it from outstandingProposals.
  3. The leader sends a commit command for the proposal to all the followers. The leader itself also starts the commit process and applies the request content to ZooKeeper’s in-memory tree.
  4. It then updates lastProcessedZxid to the zxid of the request, stores the proposal into the committedLog described above, and updates maxCommittedLog and minCommittedLog.
  5. The leader replies to the client and removes the proposal from toBeApplied.
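A minimal Python sketch of the leader-side bookkeeping described above (the class and field names are ours; the real ZooKeeper code is Java and far more involved):

```python
class Leader:
    """Illustrative sketch of ZAB leader-side proposal bookkeeping."""

    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.outstanding = {}        # zxid -> (proposal, set of acking follower ids)
        self.to_be_applied = {}      # zxid -> proposal ready to commit
        self.last_processed_zxid = 0

    def propose(self, zxid, proposal):
        # Steps 1-2: record the proposal and (conceptually) send it to all followers.
        self.outstanding[zxid] = (proposal, set())

    def on_ack(self, zxid, follower_id):
        # Collect follower ACKs; commit once more than half of the cluster agrees.
        proposal, acks = self.outstanding[zxid]
        acks.add(follower_id)
        if len(acks) + 1 > self.cluster_size // 2:   # +1 counts the leader itself
            del self.outstanding[zxid]
            self.to_be_applied[zxid] = proposal
            self.commit(zxid, proposal)

    def commit(self, zxid, proposal):
        # Steps 3-5: apply to the in-memory tree, advance lastProcessedZxid,
        # reply to the client, and drop the proposal from to_be_applied.
        self.last_processed_zxid = zxid
        del self.to_be_applied[zxid]

leader = Leader(cluster_size=3)
leader.propose(zxid=1, proposal="create /foo")
leader.on_ack(zxid=1, follower_id=2)     # leader + 1 follower = majority of 3
print(leader.last_processed_zxid)        # 1
```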
Fast Leader election process

The election process focuses on two key points: leader election at startup, and how a newly started server perceives the leader after an election has already completed. Two important data structures are used during voting:

  • HashMap<Long, Vote> recvset: used to collect votes from servers in the LOOKING, FOLLOWING, and LEADING states

  • HashMap<Long, Vote> outofElection: used to collect votes from servers in the FOLLOWING and LEADING states
The specific process is as follows:
  1. Each server first votes for itself:
  • It loads data from the snapshot and transaction logs to obtain the local in-memory tree and lastProcessedZxid. The vote contains:
    • ProposedLeader: the myid value of the proposed server, initially the local server’s own id
    • ProposedZxid: the largest transaction zxid, initially the local lastProcessedZxid
    • ProposedEpoch: the peerEpoch value, obtained from the high 32 bits of the lastProcessedZxid above
    • The vote is then sent to all servers.
  2. After receiving a vote notification, the server compares (PKs) it with its own vote:

    • If the electionEpoch in the notification received from a server (call it serverA) is larger than its own, it updates its own electionEpoch to serverA’s electionEpoch;
    • If the electionEpoch in the received notification is smaller than its own, it sends a notification carrying its own vote and electionEpoch back to serverA, and serverA updates its own electionEpoch after receiving it;
    • If the electionEpochs are equal, the PK rule is to compare proposedZxid first and then myId (a small comparison sketch follows this election description).
  3. Determine the leader based on the state of the server that sent the vote

    • If the server that sent the vote is in the LOOKING state, the receiver only needs to check whether some vote has reached more than half in recvset. If so, the leader election has succeeded: if the current server’s id equals the proposedLeader of that majority vote, it becomes the leader; otherwise it becomes a follower.

    • If the server that sent the vote is in the FOLLOWING or LEADING state, the leader election has already completed and the vote it sent carries the leader’s information. The receiver then needs to check whether the votes for that leader reach more than half in recvset or outofElection, and it also needs to check whether the leader has sent it a vote and confirm from that vote that the leader is in the LEADING state.
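The PK rule in step 2 boils down to a lexicographic comparison. A minimal illustrative sketch (the tuple layout is ours):

```python
def vote_wins(candidate, current):
    """Return True if `candidate` beats `current`.

    Each vote is a tuple (electionEpoch, proposedZxid, proposedLeader myid);
    compare the epoch first, then the zxid, then the myid, as described above.
    """
    return candidate > current          # Python compares tuples element by element

my_vote = (1, 0x100000005, 2)           # epoch 1, some zxid, myid 2
incoming = (1, 0x100000007, 1)          # same epoch, larger zxid
if vote_wins(incoming, my_vote):
    my_vote = incoming                  # adopt the better vote and rebroadcast it
print(my_vote)                          # (1, 4294967303, 1)
```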

The Recovery process

Once the leader election has completed, the recovery phase begins; in this phase the followers synchronize their data with the leader.

  1. Communication initialization

The Leader creates a ServerSocket to receive connections from followers, and serves each connection with a LearnerHandler thread.

  2. Negotiate a new peerEpoch

    • Followers send a Leader.FOLLOWERINFO message containing their own peerEpoch to the leader.

    • The leader’s LearnerHandlers collect the peerEpoch values, select the largest one, and add 1 to obtain the new peerEpoch.

    • All of the leader’s LearnerHandlers then send a Leader.LEADERINFO message to their respective followers, containing the new peerEpoch mentioned above;

    • Followers update their peerEpoch to the new value and send their lastProcessedZxid to the leader. The leader decides how to synchronize each follower based on the difference between that follower’s lastProcessedZxid and its own lastProcessedZxid.

  3. Synchronization of transaction proposals that have already been processed

    • Check whether the lastProcessedZxid in the LearnerHandler is between minCommittedLog and maxCommittedLog.

    • If the lastProcessedZxid in the LearnerHandler equals the leader’s lastProcessedZxid, the two are already in sync.

    • If lastProcessedZxid is between minCommittedLog and maxCommittedLog, the proposals from lastProcessedZxid up to maxCommittedLog are re-sent to the follower corresponding to the LearnerHandler, together with the commit command for each proposal.

A problem can arise here: even though lastProcessedZxid lies between the two bounds, no proposal with exactly that zxid may be found, i.e., the leader does not have that zxid. In this case the strategy is to follow the leader completely: delete that part of the follower’s transaction log, then re-send the proposals in that range and commit them.

  • If lastProcessedZxid is greater than maxCommittedLog, the follower’s surplus transaction log entries are deleted.

  • If lastProcessedZxid is smaller than minCommittedLog, a snapshot is used to restore the follower’s data. (The case analysis is sketched below.)
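The case analysis above can be summarized in one small decision function. This is an illustrative Python sketch; the labels DIFF, TRUNC, and SNAP only roughly mirror ZooKeeper’s sync modes, and the function itself is ours:

```python
def choose_sync(follower_zxid, leader_zxid, min_committed, max_committed):
    """Decide how the leader should bring one follower up to date."""
    if follower_zxid == leader_zxid:
        return "NONE"    # already in sync, nothing to send
    if follower_zxid > max_committed:
        return "TRUNC"   # follower is ahead: delete its surplus log entries
    if follower_zxid < min_committed:
        return "SNAP"    # follower is too far behind: restore from a snapshot
    return "DIFF"        # in range: resend proposals in (follower_zxid, max_committed]

# Example: committedLog covers zxids 100..600 and the leader is at 600.
print(choose_sync(450, 600, 100, 600))   # DIFF
print(choose_sync(700, 600, 100, 600))   # TRUNC
print(choose_sync(50, 600, 100, 600))    # SNAP
```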

  4. Synchronization of transaction proposals that are still pending

    • The LearnerHandler also sends, and commits, the proposals in the leader’s toBeApplied data whose zxid is greater than the lastProcessedZxid in the LearnerHandler.

    • The LearnerHandler also sends the proposals in the leader’s outstandingProposals whose zxid is greater than the lastProcessedZxid in the LearnerHandler; outstandingProposals have not yet been acknowledged by a majority.

  5. Add the LearnerHandler to the official follower list.

  6. The LearnerHandler sends the Leader.NEWLEADER and Leader.UPTODATE commands.

    • The leader starts the heartbeat detection process and keeps sending heartbeat commands to the followers, checking whether more than half of the machines respond. If it does not receive responses from more than half, it shuts down and re-enters the leader-election state.
    • The LearnerHandler sends the Leader.UPTODATE message to its corresponding follower. After receiving the message, the follower starts broadcast processing with the leader.

Transaction persistence and recovery procedures

  • Transaction persistence is divided into persistence during broadcasting and persistence during the leader shutdown process.
    • The leader generates a proposal for each transaction request and sends it to all followers. After receiving a proposal, a follower writes it to its transaction log. Every 100,000 proposals (by default), the transaction log is flushed and a new file is opened to record the transaction log.
    • At the same time, a new file named snapshot.[lastProcessedZxid] is created and the snapshot contents are saved into it.
    • If the leader’s heartbeat detection fails for more than half of the followers, the shutdown method is executed, in which the transaction log is flushed.
Transaction recovery can be divided into snapshot recovery and log recovery.
  • Snapshot recovery: find the latest 100 snapshot files in the snapshot directory and sort them, newest first; restore and verify them one by one, stopping as soon as verification succeeds, otherwise trying the next snapshot file. When the restore completes, update lastProcessedZxid;
  • Transaction log recovery: find the transaction logs whose zxid is greater than or equal to lastProcessedZxid in the transaction log directory and apply them to ZooKeeper’s in-memory tree, updating lastProcessedZxid as they are applied. At the same time, store these proposals into the committedLog and update maxCommittedLog and minCommittedLog.

Raft Algorithm Guide

Raft background

In distributed systems, consistency (consensus) algorithms are very important. Of all of them, Paxos is the most famous. It was proposed by Leslie Lamport in 1990 as a message-passing-based consistency algorithm and is considered the most effective of its kind.

Although the Paxos algorithm is very effective, it is very difficult to implement because its principles are complex. To date there are few implementations of the Paxos algorithm; well-known examples include Google’s Chubby (not open source) and the open-source LibPaxos.


  • The Paxos algorithm is too complex and difficult to implement, which greatly restricts its application. The field of distributed systems urgently needed a distributed consistency algorithm that is both efficient and easy to implement. Against this background, the Raft algorithm came into being.

  • Raft is a consensus algorithm. Consensus means that multiple nodes agree on something even when some nodes fail, the network is delayed, or the network partitions.

  • Consensus algorithms are generally implemented on top of replicated state machines. In simple terms: same initial state + same inputs = same end state (see the sketch below).
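A minimal illustration of the replicated-state-machine idea (purely didactic, not tied to any Raft implementation): if every replica applies the same commands in the same order, the replicas end up in the same state.

```python
class KVStateMachine:
    """Toy key-value state machine; applying the same log yields the same state."""

    def __init__(self):
        self.state = {}

    def apply(self, command):
        key, value = command
        self.state[key] = value

log = [("x", 1), ("y", 2), ("x", 3)]        # the replicated log (same on every node)

replica_a, replica_b = KVStateMachine(), KVStateMachine()
for cmd in log:
    replica_a.apply(cmd)                    # each replica applies the same commands
    replica_b.apply(cmd)                    # in the same order

print(replica_a.state == replica_b.state)   # True: same log -> same final state
```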

Raft roles

A Raft cluster consists of several nodes, each in one of three states: Leader, Follower, or Candidate. Each state has different responsibilities. Normally, the nodes in a cluster are only in the Leader or Follower state.

  • Leader: handles requests from clients, replicates (synchronizes) the log, and maintains heartbeat contact with the followers.
  • Follower: responds to the Leader’s log-replication requests and to Candidates’ vote requests, and forwards (redirects) client requests it receives to the Leader.
  • Candidate: responsible for election voting. When the cluster has just started or the Leader goes down, nodes in the Follower state become candidates and start an election. The node that wins the election (receiving votes from more than half of the nodes) changes from Candidate to Leader.

There is another key concept: the term. A term begins with an election, increases with each new election, and acts as a logical clock.

3 sub-problems for Raft

To simplify logic and implementation, Raft breaks the consistency problem into three relatively independent sub-problems.

  • Leader Election: When the Leader fails or the cluster is started, a new Leader needs to be elected.
  • Log Replication: The Leader receives requests from clients, replicates them as log entries to the other nodes in the cluster, and forces the other nodes’ logs to agree with its own.
  • Safety: If any server has applied a given log entry to its state machine, then no other server may apply a different command at the same log index.

The election process

If the followers do not receive a heartbeat from the leader within the election timeout, they will initiate an election.

Stage 1: All nodes are followers.

When a Raft cluster starts up (or the Leader fails), all nodes are in the Follower state and the initial term is 0. Each node also starts its election timer. The election timeout of each node is randomly chosen in the range of 100 to 500 milliseconds, so the timers differ from node to node.

Stage 2: followers become candidates and vote.

When there is no leader, a Follower whose election timer expires automatically changes to the Candidate state, sends vote requests to all nodes in the cluster, and resets its election timer.

The voting process includes:
  • Increment the current term on the node and switch to the candidate state;
  • Vote for itself and send RequestVote RPCs to the other nodes in parallel;
  • Wait for the other nodes’ replies; there are three possible outcomes:
    • If it receives votes from a majority, it wins the election and becomes the leader;
    • If it is told that another node has already been elected, it switches back to follower;
    • If it has not received a majority of votes within a certain period, it remains a candidate and starts a new election.
Stage 3: voting strategy
The voting constraints are:
  • A node may cast at most one vote in a given term;
  • The candidate must know at least as much as the voter, i.e., its log must be at least as up to date (more on this when we cover Log Replication and Safety below);
  • Votes are granted on a first-come, first-served basis;
  • If an even number of nodes take part in an election, tie votes are possible. Raft uses randomized election timeouts to avoid split votes as much as possible, and also requires an odd number of nodes so that a majority can emerge. (A vote-granting sketch follows this list.)
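Putting these constraints together, here is an illustrative sketch of a follower deciding whether to grant a RequestVote. The field names are ours, loosely following the Raft paper rather than any specific library.

```python
def grant_vote(state, req):
    """Decide whether this node grants its vote to the requesting candidate.

    `state` holds current_term, voted_for, last_log_term, last_log_index;
    `req` holds term, candidate_id, last_log_term, last_log_index.
    """
    if req["term"] < state["current_term"]:
        return False                       # stale term: refuse
    if req["term"] > state["current_term"]:
        state["current_term"] = req["term"]
        state["voted_for"] = None          # new term: forget the old vote
    # One vote per term, granted first come first served.
    if state["voted_for"] not in (None, req["candidate_id"]):
        return False
    # The candidate's log must be at least as up to date as ours.
    candidate_log = (req["last_log_term"], req["last_log_index"])
    our_log = (state["last_log_term"], state["last_log_index"])
    if candidate_log < our_log:
        return False
    state["voted_for"] = req["candidate_id"]
    return True

state = {"current_term": 3, "voted_for": None, "last_log_term": 2, "last_log_index": 7}
req = {"term": 4, "candidate_id": 2, "last_log_term": 2, "last_log_index": 7}
print(grant_vote(state, req))                          # True
print(grant_vote(state, {**req, "candidate_id": 5}))   # False: already voted this term
```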

The log Replication principle

  • After the leader is elected, all requests from clients are sent to the leader. The leader schedules these requests and replicates them to the followers in the same order, keeping the followers’ state consistent with its own.

  • In a cluster, any node can become the leader. To keep the cluster consistent after the leader node changes, Log Replication must solve the following two problems:

  • Followers apply each successfully committed proposal in the same order as the Leader node;

  • Each successfully committed proposal must have enough successful replicas to keep subsequent reads consistent.

Stage 1: Client requests are submitted to the Leader.

After receiving a proposal request from the client, the Leader writes it into its local log as an Entry. Note that at this point the Entry’s status is Uncommitted; the Leader has not updated its local data, so the entry is not yet readable.

Stage 2: The Leader sends the Entry to other followers
  • The Leader maintains heartbeat connections with the Followers. The Leader sends AppendEntries to the Followers in parallel and lets them replicate the log Entry; this process is called replication.

  • Why does the Leader send AppendEntries (plural) to the Followers? Because the heartbeat between the Leader and the Followers is periodic, and the Leader may receive multiple client requests within one heartbeat period, the Followers are most likely sent several entries at once.

  • When sending AppendEntries to the Followers, the Leader also includes the index (prevLogIndex) and term of the entry immediately preceding the new Entry, along with the Leader’s own term. If the Follower does not find an entry with the same index and term in its own log, it refuses the new entries, because this is a sign that the Follower’s log and the Leader’s log are not the same.

  • How are inconsistencies between the logs of the Leader and the Followers resolved? Normally the logs of the Leader and the Followers stay consistent, but a series of crashes of Leaders and Followers can leave their logs in an inconsistent state.

There are three cases:
  1. A Follower lags behind the new Leader and is missing some log entries the new Leader has;
  2. A Follower is ahead of the new Leader and has some log entries the Leader does not have;
  3. Or both. The missing or extra log entries may span multiple terms.

To bring a Follower’s log back into agreement with its own, the Leader must find the latest log entry on which the two agree (that is, backtrack to the closest point of agreement), delete every entry in the Follower’s log after that point, and send the Follower its own entries from that point on. The Leader maintains a nextIndex for each Follower, which is the index of the next log entry it will send to that Follower. When a Leader first comes to power, it initializes every nextIndex to its own last log index plus 1. If a Follower’s log does not match the Leader’s, the consistency check fails on the next AppendEntries. After being rejected by the Follower, the Leader decrements that Follower’s nextIndex and tries again. Eventually nextIndex reaches a point where the Leader’s and Follower’s logs agree; at that point AppendEntries succeeds, the Follower’s conflicting entries are removed, and the Leader’s entries are appended. Once this happens, the Follower’s log is identical to the Leader’s and remains so for the rest of the term. A sketch of this consistency check and back-off follows.
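An illustrative sketch of the consistency check and nextIndex back-off just described (the data structures are simplified and ours, not from any Raft library; real Raft only truncates the follower’s log from the first conflicting entry, which gives the same result here):

```python
def append_entries(follower_log, prev_index, prev_term, entries):
    """Follower side: accept entries only if the entry at prev_index has prev_term."""
    # Indexes are 1-based; prev_index == 0 means "append from the very beginning".
    if prev_index > 0:
        if len(follower_log) < prev_index or follower_log[prev_index - 1][0] != prev_term:
            return False                        # consistency check failed
    del follower_log[prev_index:]               # drop conflicting entries after prev_index
    follower_log.extend(entries)                # append the leader's entries
    return True

# Leader side: back off nextIndex until the follower accepts.
leader_log = [(1, "a"), (1, "b"), (2, "c"), (3, "d")]   # (term, command) pairs
follower_log = [(1, "a"), (1, "b"), (1, "x")]           # diverged at index 3

next_index = len(leader_log) + 1                        # leader's last log index + 1
while True:
    prev_index = next_index - 1
    prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
    if append_entries(follower_log, prev_index, prev_term, leader_log[prev_index:]):
        break
    next_index -= 1                                     # rejected: back off and retry

print(follower_log == leader_log)   # True: follower now matches the leader
```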

Stage 3: Leader waits for Followers to respond.

Once the Followers receive the Leader’s replication request, they have two possible responses:

  • Write to the local log, return Success;

  • The consistency check fails, write is rejected, and False is returned. The cause and resolution are described above.

  • When the Leader receives responses from the majority of the Followers, it marks the Entry as Committed and applies the log Entry to its state machine.

Stage 4: The Leader responds to the client.

After the first three phases are complete, the Leader replies OK to the client indicating that the write operation was successful.

Stage 5: The Leader notifies the Followers that the Entry has been committed.

When the Leader responds to the client, it notifies the Followers with the next heartbeat, and the Followers mark the Entry as committed. At this point more than half of the nodes in the Raft cluster have reached a consistent state, which ensures strong consistency. (A sketch of how the leader decides an entry is committed follows.)
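On the leader, “committed” simply means the entry is known to be replicated on a majority. A minimal illustrative sketch of advancing the commit index from the followers’ match indexes (the variable names are ours):

```python
def advance_commit_index(match_index, leader_last_index, cluster_size):
    """Return the highest log index replicated on a majority of the cluster.

    match_index maps follower id -> highest log index known to be on that follower;
    the leader itself holds every index up to leader_last_index.
    (Real Raft additionally requires the entry at that index to be from the
    leader's current term before committing it by counting replicas.)
    """
    commit = 0
    for n in range(1, leader_last_index + 1):
        replicas = 1 + sum(1 for idx in match_index.values() if idx >= n)  # leader + followers
        if replicas > cluster_size // 2:
            commit = n
    return commit

# 5-node cluster: leader has entries up to index 7, followers report 7, 5, 3, 2.
print(advance_commit_index({"f1": 7, "f2": 5, "f3": 3, "f4": 2},
                           leader_last_index=7, cluster_size=5))   # 5
```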

Raft safety guarantee

1) Election safety: at most one leader can be elected in a given term. The Raft algorithm guarantees this because a node can cast only one vote per term, and only a node that wins votes from a majority becomes the leader.

2) Log matching: if a log entry on two nodes has the same log index and the same term, then all log entries before that index are also the same. This holds because the leader creates at most one log entry at any given index within a term, and log entries are append-only.

Consistency check: in AppendEntries, the leader includes the term and index of the log entry immediately preceding the new entries; if the follower does not find an entry with that term and index in its log, it reports the inconsistency to the leader by rejecting the request. When a discrepancy appears between the leader and a follower, the leader forces the follower to replicate its own log.

3) Leader completeness: if a log entry is committed in a given term, it will appear in the logs of the leaders of all higher terms. This follows because a log entry can be committed only after it has been replicated to a majority of nodes, a node can become the leader only after receiving votes from a majority, and one of the preconditions for node A to vote for node B is that B’s log is not older than A’s log.

4) Stale leader: under a network partition there may temporarily be two leaders, but the terms of the two leaders differ. In some Raft implementations or Raft-like protocols, a leader that does not receive messages from the majority of nodes steps down on its own and changes to the follower state.

5) Leader crash: a new node becomes the leader. To prevent data loss, the new leader is expected to contain all entries that have already been committed. To avoid the complexity of data flowing backwards from followers to the Leader, Raft requires that the new Leader be the node whose log is most up to date, i.e., whose last log entry has the largest term (and, for equal terms, the largest index).

6) State Machine Safety

After a new leader is elected, it does not directly commit the log entries left over from the previous leader; instead, they are committed along with a log entry from the current term. Concretely, if the newly elected leader has not yet received any client request, the paper suggests that at the beginning of its term the leader immediately replicate and commit an empty (no-op) log entry.

Summary: Raft decomposes the consensus problem into two relatively independent sub-problems, leader election and log replication, tied together by the safety constraints above. The overall flow is: elect a leader, and the leader is then responsible for replicating and committing the log (the log contains the commands).

Log Replication constraints:

A log entry that has been replicated to a majority of the nodes is committed, and committed log entries are guaranteed never to be rolled back. The leader must contain the latest committed log entries, so the leader only appends to its log and never deletes or overwrites entries. Raft never commits log entries from previous terms by counting replicas.