What is the Paxos algorithm

Paxos is a message-based consensus algorithm with high fault tolerance.

What problem does it solve

It mainly solves the problem of how a distributed system reaches consensus on a decision. Notable implementations include ZAB, Google Chubby, and WeChat's PhxPaxos.

The Paxos algorithm and the Byzantine Generals problem

The author of the Paxos algorithm argues that it is impossible to achieve consistency through message passing when the channel is untrusted. Therefore, the Paxos algorithm is based on the premise that there is no Byzantine Generals problem; that is, the channel is considered trusted, and messages passed between cluster nodes cannot be tampered with. In a distributed system, nodes communicate in one of two modes: shared memory and message passing. Paxos is based on the message-passing communication model.

Three roles of Paxos

There are three roles in the Paxos algorithm, each with a different function, though a single process can often play several roles at once:

  • Proposer: proposes values to the cluster
  • Acceptor: decision maker; votes on proposals
  • Learner: learns and applies the chosen proposal

The Paxos algorithm ensures consistency through the following points:

  • When making a proposal, each proposer first needs to obtain a globally unique, increasing number N and assign N to the proposal. N can be generated in two ways: by a global generator, or maintained by each proposer (a minimal numbering sketch follows this list).
  • After receiving a proposal, each acceptor saves the proposal number locally, so each acceptor tracks a maximum number, say maxN. An acceptor only accepts proposals whose numbers are greater than its local maxN.
  • Only one proposal may be chosen.
  • Once a proposal is chosen, learners synchronize the chosen proposal to their local state.
  • Of course, if no proposal is submitted, no proposal can be chosen.
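As an illustration of the proposer-maintained option, here is a minimal Python sketch (all names hypothetical): each proposer owns a distinct ID in [0, num_proposers), and derives N = round * num_proposers + proposer_id, so numbers are globally unique and each proposer can always generate a larger one after a rejection.

```python
# Minimal sketch of proposer-maintained proposal numbering (hypothetical).
class ProposalNumber:
    def __init__(self, proposer_id: int, num_proposers: int):
        assert 0 <= proposer_id < num_proposers
        self.proposer_id = proposer_id
        self.num_proposers = num_proposers
        self.round = 0

    def next(self) -> int:
        """Return a fresh N strictly larger than any this proposer issued."""
        self.round += 1
        return self.round * self.num_proposers + self.proposer_id

    def bump_past(self, seen_n: int) -> None:
        """After a rejection, jump ahead of a competitor's number."""
        self.round = max(self.round, seen_n // self.num_proposers)

# Usage: proposer 1 of 3 issues 4, 7, 10, ... -- disjoint from its peers.
gen = ProposalNumber(proposer_id=1, num_proposers=3)
assert [gen.next() for _ in range(3)] == [4, 7, 10]
```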

Two algorithms for submitting proposals

2PC and 3PC algorithms:

  • 2PC algorithm: Two Phase Commit, i.e., Prepare -> Accept
  • 3PC algorithm: Three Phase Commit, i.e., Prepare -> Accept -> Commit

The difference is whether the Accept phase also performs the commit.

3PC algorithm description

The 3PC algorithm has three phases: Prepare, Accept, and Commit.

Prepare phase

  • The proposer prepares a proposal numbered N and sends prepare(N) to the acceptors to test whether the cluster will support the proposal.
  • Each acceptor keeps a local maxN. When it receives prepare(N), it compares N with its own maxN; there are several cases (see the handler sketch after this list):
    • If N is not greater than maxN, the proposal is outdated, and the acceptor rejects it by not responding or by replying with an error.
    • If N is greater than maxN, the acceptor can accept the Proposal and returns Proposal(myid, maxN, value), the highest-numbered proposal it has accepted before, to show that it is willing to accept proposal N. Here myid is the ID of the proposer of that earlier proposal, maxN is its number, and value is its value. If this is the first prepare received, these fields are null.
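A minimal sketch of the acceptor's side under the rules above (names hypothetical). The acceptor tracks the highest N it has promised plus the last proposal it actually accepted; on_accept, used in the next phase, accepts only numbers at or above the promise.

```python
# Hypothetical acceptor state, with handlers for both phases.
class Acceptor:
    def __init__(self):
        self.max_n = 0        # highest proposal number promised so far
        self.accepted = None  # last accepted proposal: (myid, n, value)

    def on_prepare(self, n: int):
        if n <= self.max_n:
            return None  # outdated: ignore, or reply with an error
        self.max_n = n   # promise: nothing numbered below n gets accepted
        # Reply with the highest-numbered proposal accepted before,
        # or (None, None, None) if nothing has been accepted yet.
        return self.accepted or (None, None, None)

    def on_accept(self, myid, n: int, value) -> bool:
        # Accept phase (next section): accept iff n >= promised number.
        if n >= self.max_n:
            self.max_n = n
            self.accepted = (myid, n, value)
            return True
        return False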

Accept phase

  • If the proposer receives responses from more than half of the acceptors, it sends the actual Proposal(myid, N, value) to all acceptors.
  • After receiving the proposal, an acceptor compares N with the local maxN (the N recorded in the prepare phase). If N is greater than or equal to that number, the acceptor accepts the proposal and replies to the proposer; otherwise it rejects the proposal.
  • If the proposer does not receive acceptance feedback from more than half of the acceptors, it either drops the proposal or increments the number and resubmits the request, starting again from Prepare (the proposer's full flow is sketched below).
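A sketch of the proposer's side tying the two phases together. send_prepare and send_accept are hypothetical stand-ins for the network layer; note that if any promise carries a previously accepted value, the proposer must adopt the value of the highest-numbered one, which is what keeps competing proposers from choosing different values.

```python
# Hypothetical proposer flow: prepare, then accept, with majority checks.
def propose(acceptors, n, value, send_prepare, send_accept):
    majority = len(acceptors) // 2 + 1

    # Prepare phase: collect promises from the acceptors.
    replies = [r for r in (send_prepare(a, n) for a in acceptors)
               if r is not None]
    if len(replies) < majority:
        return None  # drop, or bump N and retry from prepare

    # Adopt the value of the highest-numbered proposal already accepted.
    accepted = [(pn, pv) for (_, pn, pv) in replies if pn is not None]
    if accepted:
        value = max(accepted)[1]

    # Accept phase: the proposal succeeds once a majority ACKs it.
    acks = sum(1 for a in acceptors if send_accept(a, n, value))
    return value if acks >= majority else None
```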

Commit phase

If the proposer receives acceptance from more than half of the acceptors, it broadcasts messages (sketched after this list):

  • To the acceptors that accepted the proposal, it sends an “executable data synchronization message”, telling them to execute the proposal as soon as it is received.
  • To the acceptors that did not respond, it sends the “proposal plus an executable message”, asking them to execute it as soon as it is received.
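A small sketch of this broadcast step (helper names hypothetical): acceptors that already ACKed only need the commit signal, while acceptors that stayed silent get the full proposal together with it.

```python
# Hypothetical commit broadcast after a majority accepted proposal N.
def broadcast_commit(acked, silent, n, value, send):
    for a in acked:
        send(a, ("COMMIT", n))                  # execute on receipt
    for a in silent:
        send(a, ("PROPOSAL+COMMIT", n, value))  # catch up, then execute
```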

2PC algorithm process

2PC has no Commit phase. After a successful Prepare, the actual proposal is sent directly to the acceptors, which perform data synchronization as soon as they receive it, without waiting for a COMMIT message. 2PC has more drawbacks, but it is more efficient.

The livelock problem of the Paxos algorithm

Livelock: a livelock is a cycle of try, fail, try again, fail again; unlike a deadlock, a livelock may resolve itself. If many proposals keep being submitted concurrently, a proposal may keep failing, keep incrementing N, and keep retrying, yet it may still succeed on some attempt.
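A toy illustration of the failure mode (single acceptor, hypothetical setup): each proposer's new prepare with a higher N invalidates the other proposer's pending accept, so both keep failing and retrying indefinitely.

```python
max_n = 0  # the acceptor's highest promised number

def prepare(n):
    global max_n
    if n > max_n:
        max_n = n
        return True
    return False

def accept(n):
    return n >= max_n

n_a, n_b = 1, 2
for _ in range(3):
    prepare(n_a)
    prepare(n_b)                               # B's prepare overrides A's promise
    print("A accept(%d):" % n_a, accept(n_a))  # False: A bumps N and retries
    n_a = max_n + 1
    prepare(n_a)                               # A's prepare overrides B's promise
    print("B accept(%d):" % n_b, accept(n_b))  # False: B bumps N and retries
    n_b = max_n + 1
```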

ZAB: Zookeeper Atomic Broadcast, ZK's atomic broadcast protocol. It is a crash-recovery broadcast protocol designed specifically for ZK to achieve data consistency in a distributed system. ZAB uses a single process (the leader) to handle all write requests. When the server's data state changes, the cluster uses the ZAB broadcast protocol to broadcast a transaction Proposal to all replicas. The ZAB protocol guarantees a globally unique, increasing ZXID. When a ZK client sends a write request to a server that is not the leader, the request is forwarded to the leader; only the leader can handle write requests.

Paxos and ZAB

ZAB is an industrial implementation of Paxos that aims to build a highly available, distributed master-slave system: followers serve as the leader's slaves, and a new leader can be elected from the followers as soon as the leader dies. ZAB avoids the livelock problem by allowing only one process, the leader, to submit proposals, using a 3PC-style commit. When the leader dies, the election algorithm is 2PC-style: every follower can submit a proposal, and each initially proposes itself ("I choose me").

Three roles

  • Leader: the sole handler of write requests in ZK; it can also handle read requests.
  • Follower: handles read requests, votes on the leader's proposals, synchronizes the leader's results, and has the right both to vote and to be elected.
  • Observer: no voting rights; it can only handle read requests, and can neither elect nor be elected. learner = follower + observer

Three kinds of data

  • ZXID: a 64-bit number; the upper 32 bits are the epoch and the lower 32 bits are the XID (see the bit-layout sketch after this list)
  • Epoch: each new leader gets a new epoch, which can be thought of as its era or reign
  • XID: a serial number within the epoch
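The split is plain bit arithmetic; a minimal sketch:

```python
# ZXID layout: epoch in the upper 32 bits, xid counter in the lower 32.
def make_zxid(epoch: int, xid: int) -> int:
    return (epoch << 32) | (xid & 0xFFFFFFFF)

def epoch_of(zxid: int) -> int:
    return zxid >> 32

def xid_of(zxid: int) -> int:
    return zxid & 0xFFFFFFFF

# A new leader bumps the epoch and restarts the counter, so every zxid
# of the new epoch compares greater than all zxids of the old epoch.
assert make_zxid(2, 0) > make_zxid(1, 0xFFFFFFFF)
assert epoch_of(make_zxid(2, 7)) == 2 and xid_of(make_zxid(2, 7)) == 7
```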

Three modes

There are three modes in ZK; there is no clear dividing line between them, and they are intertwined:

  • Recovery mode: consists of two important phases, leader election and initial synchronization
  • Broadcast mode: contains two types, initial broadcast and update broadcast
  • Synchronization mode: contains two types, initial synchronization and update synchronization

Synchronization mode

Initial synchronization: recovery mode has two phases, leader election and initial synchronization. After being elected, the leader is only a quasi-leader; only after initial synchronization does it become the real leader. The synchronization process (sketched after this list):

  • The leader prepares a queue for each learner, in order to guarantee the order of messages sent to each learner.
  • The leader encapsulates the transactions not yet synchronized to the learner as Proposals and places them in that queue.
  • The leader sends each Proposal to the learner followed by a COMMIT message, indicating that the transaction has already been committed and can be executed directly.
  • After receiving the messages, the learner synchronizes them to its local state
  • and sends an ACK message back to the leader.
  • After receiving the ACK, the leader adds the learner to the available follower list or the available observer list. If no ACK is returned, or the leader does not receive it, the learner is not added to an available list.
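A minimal sketch of this process (names like learner.synced, send, and recv_ack are hypothetical stand-ins): one FIFO queue per learner preserves ordering, and a learner becomes available only after it ACKs.

```python
from collections import deque

# Hypothetical sketch of initial synchronization from the quasi-leader.
def initial_sync(leader_log, learners, send, recv_ack):
    available = []
    for learner in learners:
        queue = deque()                          # per-learner FIFO queue
        for txn in leader_log[learner.synced:]:  # txns the learner lacks
            queue.append(("PROPOSAL", txn))
            queue.append(("COMMIT", txn))        # already committed: execute
        while queue:
            send(learner, queue.popleft())
        if recv_ack(learner):                    # no ACK -> not available
            available.append(learner)
    return available
```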

Message broadcast algorithm: after the leader completes synchronization, it enters the normal working state. When a write request arrives, the leader performs the following procedure (see the leader-side sketch after this list):

  • After receiving the transaction, the leader generates a globally unique ID (ZXID) for it and wraps the transaction in a Proposal.
  • The leader obtains the list of all followers and sends the Proposal to each follower through that follower's queue.
  • After receiving the Proposal, a follower compares its ZXID with the maximum ZXID saved locally. If the Proposal's ZXID is larger, the follower saves the Proposal in its local transaction log and returns an ACK.
  • After the leader receives ACKs from more than half of the followers, it sends a COMMIT message to all followers and forwards the Proposal to the observers.
  • After a follower receives the COMMIT message, it formally applies the transaction locally. After receiving the Proposal, an observer applies it locally.
  • After followers and observers have applied the update, they send an ACK back to the leader.
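The leader's side of this procedure as a minimal sketch (next_zxid, send, and recv_ack are hypothetical): propose to the followers, wait for a majority of ACKs, then commit to the followers and forward to the observers.

```python
# Hypothetical leader-side sketch of ZAB message broadcast.
def broadcast(leader, txn, followers, observers, send, recv_ack):
    zxid = leader.next_zxid()                 # globally unique, increasing
    for f in followers:
        send(f, ("PROPOSAL", zxid, txn))      # via each follower's queue
    acks = sum(1 for f in followers if recv_ack(f, zxid))
    # The leader counts itself toward the quorum of the whole ensemble.
    if acks + 1 > (len(followers) + 1) // 2:
        for f in followers:
            send(f, ("COMMIT", zxid))         # follower applies txn now
        for o in observers:
            send(o, ("INFORM", zxid, txn))    # observer applies on receipt
```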

Four kinds of state

Each host in the cluster is in one of the following states at different times (a minimal enum sketch follows this list):

  • LOOKING: election state
  • FOLLOWING: the follower's normal working state
  • OBSERVING: the observer's normal working state
  • LEADING: the leader's normal working state
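For reference, the four states as a minimal enum; the names mirror ZooKeeper's server states as listed above.

```python
from enum import Enum, auto

class ServerState(Enum):
    LOOKING = auto()    # electing a leader
    FOLLOWING = auto()  # follower in normal operation
    OBSERVING = auto()  # observer in normal operation
    LEADING = auto()    # leader in normal operation
```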

The observer count problem

The number of observers should not be too large. While observers can improve the read throughput of the system to some extent, they need to synchronize data from the leader, and the time available for observer synchronization is bounded by follower synchronization: when the followers finish synchronizing, observer synchronization ends as well. Observers that finished synchronizing are added to the available list, but observers that did not finish cannot provide service, and with many observers this wastes resources.

Three principles of recovery mode:

  • Principle of actively giving way: if the leader receives heartbeats from fewer than half of the followers, it can no longer be sure it commands a quorum, so it actively changes its own status to LOOKING and searches for a new leader. Meanwhile, when the followers find that more than half of the hosts have lost the leader, a re-election takes place.
  • Already-processed messages must not be lost: if the leader crashes after some followers have received a COMMIT message but others have not, the transaction ends up executed on some servers and not on others. When a new leader is elected, recovery mode must ensure that all servers in the cluster execute the transactions that some servers have already executed.
  • Discarded messages must not reappear: if the leader has recorded a transaction locally but crashes before sending COMMIT to any follower, then when this machine rejoins as a follower, it has one more transaction locally than the other machines, which is inconsistent. Such transactions must be discarded; since the server never returned a success response to the client in this case, dropping them is safe (a minimal sketch follows this list).
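A minimal sketch of the third principle (all names hypothetical): during synchronization with the new leader, any local transaction the new leader never committed is dropped, and the epoch embedded in the ZXID makes such stale entries easy to identify.

```python
# Hypothetical sketch: discard local transactions the new leader never
# committed when a crashed ex-leader rejoins as a follower.
def sync_with_new_leader(local_log, leader_committed):
    committed = set(leader_committed)
    kept = [z for z in local_log if z in committed]
    dropped = [z for z in local_log if z not in committed]
    return kept, dropped  # dropped txns were never ACKed to any client

zxid = lambda e, x: (e << 32) | x
# The old leader logged (epoch=1, xid=5) but crashed before any COMMIT;
# the new leader's (epoch 2) history does not contain it, so it is dropped.
kept, dropped = sync_with_new_leader(
    [zxid(1, 3), zxid(1, 4), zxid(1, 5)],
    [zxid(1, 3), zxid(1, 4), zxid(2, 1)],
)
assert dropped == [zxid(1, 5)]
```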