There is only one consensus protocol, and that’s Paxos. All other approaches are just broken versions of Paxos.

There is only one consistency algorithm in the world, and that is Paxos. The other consensus algorithms are just broken versions of Paxos.

— Mike Burrows, author of Google Chubby

Paxos is an integral part of learning distributed systems. Paxos is elegant in theory, but not so easy to implement. “The Part-time Parliament” [1] can not understand, can only see “Paxos made??” [4] This kind of thing makes life like this.

Paxos

The basic problem

Given that a group of processes can propose values, the consensus algorithm needs to ensure that only one of the proposed values is selected. The security requirements for consensus are as follows:

  • Only proposed values are likely to be selected
  • Only one value is selected
  • The process only learns the selected value

A Paxos algorithm has three agents: proposer, acceptor, and learner. A process can take on more than one agent.

  • Any principal can execute at any speed and can be stopped or restarted
  • Messages can take an arbitrarily long time to deliver, can be lost or repeated, but cannot be destroyed

Choose a value

The simplest solution would be to set up a unique receiver and let it choose the first value received, but that would not handle a receiver crash. Therefore, it is safer to set up multiple recipients and select when most recipients receive the value.

Try designing a requirement:

P1. The recipient receives the first offer it receives.

However, if there is no value that is accepted by a majority, there is no consensus. It is therefore best to allow the recipient to receive multiple offers, each numbered with an increasing number. Now the receiver can receive multiple proposals, but it needs to ensure that all the selected proposals contain the same value. According to the incrementing nature of encoding, the requirements can be given:

P2. If a proposal containing the value v is selected, any high-numbered proposal selected must contain the value v.

A proposal can only be selected if it has been accepted by at least one recipient, so P2 is satisfied by meeting the following criteria:

P2a. If a proposal containing the value v is selected, any high-numbered proposal accepted must contain the value v.

If a newly awakened proposer gives a high-numbered proposal to a recipient who has not received any offers, it may destroy P2a and thus strengthen:

P2b. If a proposal containing the value v is selected, any high-numbered proposal made by the proposer must contain the value v.

P2b can be further satisfied by satisfying P2c:

P2c. For any v and n, if a proposal containing the value v, numbered n, is put forward, and there exists a set S composed of most recipients, then one of two situations must occur:

(a) No recipient in S accepts an offer numbered less than N

(b) V is the value of the proposal in S with the highest number accepted by the recipient and number less than n

According to THE requirements of P2c, proposers must know the maximum number of proposals numbered less than N before proposing, including those that have been accepted or will be accepted. It is impossible to predict whether an offer will be accepted, but the recipient can promise not to accept an offer numbered less than N.

This requires the proposer to send a readiness request numbered N to the recipient in advance, asking the recipient to return:

  • A promise: it will not accept any proposal numbered less than N
  • The highest number proposal has been accepted

If the proposer makes a preparation request that is responded to by a majority of the recipients, then it sends the proposal with number N containing the value V to those recipients, then V is the highest numbered proposal value in the response, and if there is no proposal, then it can be any value.

So the acceptance requirements are:

P1a The recipient accepts an offer numbered N if and only if it has not responded to a readiness request numbered greater than n.

  • Phase 1
    • The proposer selects an offer number N and sends the prepare request with the number N to the majority of recipients.
    • If a receiver receives a prepare request whose number n is greater than any prepare request that has already been responded to, the receiver returns a promise that it will not accept any offer numbered less than n, as well as the highest number of offers that have been accepted.
  • The stage 2
    • If the proposer makes a preparation request that is responded to by a majority of the recipients, then it sends the proposal with number N containing the value V to those recipients, then V is the highest numbered proposal value in the response, and if there is no proposal, then it can be any value.
    • If the receiver receives an accept request numbered N, it will accept the offer unless it has already responded to a prepare request of a higher number.

Learn to choose values

The simplest strategy would be for each recipient to send the selected proposal to all learners, but the number of messages is huge. The least message method is for each recipient to send the selected proposal to a particular learner, and then to all learners. Only one special learner responsible for forwarding has poor fault tolerance, and multiple such learners can be set. If the message is lost, the learner may not get the majority of the recipients’ choice proposals, which can be obtained through new proposals, such as asking the proposer to initiate a proposal.

progress

If two proposers constantly interrupt each other’s preparation, no proposal will ever be selected. The solution is to use the leadership election algorithm, which elects a single proposer.

implementation

In a concrete implementation, the algorithm needs to elect a leader as the sole proposer and special learner. A fault-tolerant stable store is used to record the information the recipient needs to remember.

MultiPaxos

Paxos only agrees on one value at a time. If we want to implement a state machine, we need a log. The way to do this is to run Paxos once on each log entry.

The new item

The new log entry needs to be added to the log slot where the first value is not selected from scratch.

The server can process adding entries in parallel, but they must be ordered when applied to the state machine.

To improve efficiency

  • Leadership election: multiple proposers tend to cause conflicts and reduce performance, so one leader is simply elected to be responsible for the proposal. Leadership election can adopt election algorithms such as maximum ID.
  • Single preparation: Use a number for all the selected entries in a leader’s tenure (the premise is that there is no other selected log slot after the insertion position). In this way, only one preparation is required in each leader’s tenure. Subsequent entries are directly received (until they are rejected).

Backup and Application

There are two problems with the algorithm.

Question 1: How do I back up logs to all nodes?

  • Constantly trying to send and receive requests in the background to ensure that all nodes respond;

Question 2: How do I let other nodes know that a log entry is selected?

The first convention is to set the number of the selected log slot to infinite acceptedProposal[I] = ∞, and record the position of the first unselected log slot, firstUnchosenIndex.

  • The proposer informs the receiver: contained in the request RPCfirstUnchosenIndex, the receiver will meet alli < request.firstUnchosenIndexandacceptedProposal[i] == request.proposalLog tankiMark as selected;
  • Process log entries added by the old leader: The recipient included in the responsefirstUnchosenIndexIf the proposer’sfirstUnchosenIndexLarger than the receiverfirstUnchosenIndex, the proposer sends the Success RPC.

After receiving RPC Success(index, v), set acceptedValue[index] = V and acceptedProposal[index] = ∞, and return firstUnchosenIndex.

reference

  1. Lamport, Leslie. “The part-time parliament.” ACM Transactions on Computer Systems (TOCS) 16.2 (1998): 133-169.
  2. Lamport, Leslie. “Paxos made simple.” ACM Sigact News 32.4 (2001): 18-25.
  3. Diego Ongaro, Implementing Replicated Logs with Paxos (video, courseware)