Description: EPaxos (Egalitarian Paxos), as the next generation distributed consistency algorithm, has a broad application prospect. But throughout the industry, there hasn’t been a single engineering implementation of EPaxos, or even an article that makes EPaxos more popular. Although the theory of EPaxos algorithm is good, it is difficult to understand in fact, and there are many challenges in engineering implementation, so the practical application is not mature yet. This article will introduce EPaxos algorithm in easy to understand language, step by step, so that students who only have Paxos or Raft based distributed consistency algorithm can easily understand EPaxos, truly obscure EPaxos, become approachable, into thousands of people.

The author | auspicious light source (YanXiangGuang) | ali technology to the public

The introduction

Egalitarian Paxos (EPaxos), as the next generation distributed consistency algorithm, has a broad application prospect. But throughout the industry, there hasn’t been a single engineering implementation of EPaxos, or even an article that makes EPaxos more popular. Although the theory of EPaxos algorithm is good, it is difficult to understand in fact, and there are many challenges in engineering implementation, so the practical application is not mature yet.

This article aims to introduce EPaxos algorithm in a simple and easy way, step by step, so that students who are based on distributed consistency algorithms such as Paxos or Raft can easily understand EPaxos, truly obscure EPaxos, become approachable, into thousands of people.

In an Understanding of Distributed consistency Algorithm EPaxos, EPaxos is introduced from the problems of Paxos, and the basic concepts and intuitive understanding of EPaxos are introduced. It is believed that readers have an overall impression of EPaxos.

This article introduces the EPaxos core protocol flow from the perspective of Paxos and EPaxos comparison. The last article left the question, I believe that after reading this article, can find the answer. Some background on distributed consistency algorithms such as Paxos or Raft is required to read this article.

Basic idea of EPaxos

EPaxos is a Leaderless consistency algorithm, without electing a Leader, any replica can initiate a proposal.

Leaderless can also be seen as a Leader for each replica, from the perspective of multi-Paxos or Raft, if multiple groups are used, each Leader is divided into different groups with each replica acting as the Leader of a Group. As a Follower of all other groups at the same time, it seems that you can achieve a similar effect of Leaderless.

For Leaderless with multiple groups, each Group independently agrees on a series of instances, and each Group generates a sequence of Instance. Instances generated by different groups are independent from each other, and the sequence cannot be determined. Therefore, cross-group consistency is a major problem that cannot be solved at the consistency level and often needs to be solved at the upper level using distributed transactions.

EPaxos solves this problem and implements true Leaderless. EPaxos determines the relative order of Instance generated by different groups by tracking the dependency between instances, and then combines multiple Instance sequences generated by multiple groups into a global Instance sequence by sorting, thus achieving cross-group consistency. That is, a true Leaderless has been implemented.

EPaxos first runs the consensus protocol to make each copy agree on the value of Instance and the relative order of Instance dependence. Then, EPaxos runs the sorting algorithm to sort Instance globally based on the relative order of the previously agreed Instance, and finally obtains a consistent global Instance sequence.

The above is to introduce the basic idea of EPaxos from the perspective of Multi-PaxOS or Raft using multiple groups to Leaderless. Actual Group is a concept other than consistent algorithm. Group is just introduced here for convenience. But like Paxos or Raft, you can implement multiple groups on top of EPaxos.

Two EPaxos Instance

Instance of EPaxos is different from that of Paxos. In Paxos, Instance number is assigned in advance, but in EPaxos, Instance number is not assigned in advance. Each copy can be concurrently submitted out of order, but the dependencies between instances are tracked and sorted according to the dependencies. Here’s a summary of the differences:

Differences between EPaxos Instance and Paxos

A Paxos Instance is identified by the globally continuous increment InstanceID, which is also the sequence number of an Instance. InstanceID is globally unique and continuously increasing.

The Instance space of EPaxos is two-dimensional, and each copy has one row, so it is identified by the two-dimensional R.i, where R is the identity of the copy, I is the continuously increasing integer in the copy, and each new Instance is incremented by one. Instance sequences owned by replica R are R.1, R.2, r.3,…… R.i,…

EPaxos Instance has a few additional attributes relative to Paxos:

  • State Indicates the current status of Instance. The value can be pre-accepted, accepted, or committed. Because EPaxos Instance has many states, a special state field is required to identify it.
  • Deps is a collection of dependent instances, storing the identifiers of all dependent instances, that is, the Instance to be executed earlier. Deps holds the relative order between instances, which are then sorted based on DEps.
  • Seq is the sequence number of Instance. Its value is the maximum number of seQs of all instances in the DEPS plus one. It reflects the sequence of Instance proposals and is used in subsequent sorting.

The deps and SEQ attributes of Instance of EPaxos are the same as the values of Instance, and they need to be agreed among the copies. The subsequent copies will independently sort the Instance based on DEPS and SEQ. Because the sorting algorithm of EPaxos is deterministic, each copy sorts Instance based on the same DEPS and SEQ, and finally obtains a consistent global Instance sequence.

Instance is regarded as the vertex of a graph, and DEPS is the exit edge of the vertex. After the vertices and edges of the graph are determined and the agreement is reached among all copies, all copies carry out deterministic ordering on the graph, and finally a consistent Instance sequence is obtained.

EPaxos consensus protocol

While Paxos requires two phases to submit a value, EPaxos has one more phase than Paxos in order to determine the value of these attributes because its Instance relies on the set DEPS and sequence number SEQ. A full EPaxos has three phases, but not all of them are required. The following table compares the protocol flow of Paxos with that of EPaxos:

Comparison between Paxos and EPaxos protocol flow

Compared with Paxos, the Prepare and Accept phases of EPaxos are similar, but the PreAccept phase is the most critical phase of EPaxos. Because EPaxos has more PreAccept phases, Instance states are more numerous, so a dedicated state attribute is introduced to identify the current state of an Instance (pre-accepted, Accepted, committed). The state in the Prepare phase is not introduced because there is no proposed value in the Prepare phase. You can distinguish the state from other states by checking whether there is a local proposed value. Normally, EPaxos only runs the PreAccept phase to Commit, and skips the Prepare and Accept phases.

EPaxos is similar to Paxos. If Instance is preempted in any phase, you need to roll back to the Prepare phase and start again.

1 Prepare stage

EPaxos is similar to Paxos in that the Prepare phase is used to gain the right to propose and to learn the latest value that has been proposed before. In EPaxos, because each copy has its own Instance space, the proposal in its own Instance space is equivalent to the Leader of multi-PaxOS. Therefore, it is similar to multi-PaxOS. In general, the Prepare phase can be skipped directly. Go straight to the next stage.

Phase 2 PreAccept

The PreAccept phase is specific to EPaxos, and its role is to determine Instance’s dependency set DEPS and sequence number SEQ, while trying to get the proposed value, DEPS, and SEQ to agree across replicas. If the PreAccept phase has reached an agreement, proceed to the Commit phase (Fast Path) directly. Otherwise, run the Accept phase and then proceed to the Commit phase (Slow Path).

How does the PreAccept phase determine Instance’s dependency set DEPS and sequence number SEQ? In fact, it is relatively simple. Collect the Instance that needs to be executed before from the Instance that already exists in the replica, namely the set of local DEPS. The local SEQ is the maximum seQ of all instances in the local DEPS plus one. The final dependency set DEPS takes the union of the local DEPS set of most replicas, and the final sequence number SEQ takes the maximum of their local SEQ.

The local DEPS and local SEQ of different replicas may be different when proposals are made concurrently, so how can the PreAccept phase be agreed? An agreement has been reached if enough copies (Fast Quorum) have the same local DEPS and local SEQ. Otherwise, the final dependent set DEPS takes the union of the local DEPs of most (Slow Quorum) copies, and the final sequence number SEQ takes the maximum value of their local SEQ, and then runs the Accept phase to reach agreement.

Fast Quorum in the PreAccept phase always contains the proposer, for reasons discussed later. The value of Fast Quorum cannot be smaller than that of Slow Quorum. Assuming that the total number of copies is N and F copies can be tolerated to fail simultaneously, N = 2F + 1, then Fast Quorum = 2F, optimized EPaxos can be optimized to F + [(F + 1) / 2], Slow Quorum = F + 1. The derivation of the value for Fast Quorum will not be introduced here, but will be discussed in more detail in future articles. Slow Quorum is the majority of copies, which is the same as the Accept phase of Paxos.

3 the Accept stage

In Accept phase, EPaxos is similar to Paxos, but Paxos only synchronizes proposal value to most copies. EPaxos needs to synchronize proposal value, DEPS and SEQ to most copies. Once a majority is formed, the decision is reached. If a decision has been reached in the PreAccept phase, you can skip the Accept phase and go directly to the Commit phase.

4 the Commit phase

EPaxos is similar to Paxos in that it asynchronously sends the decision to other copies so that they can learn the decision. The difference is that EPaxos’s decision includes not only the decision value, but also DEPS and SEQ.

EPaxos sorting algorithm

Unlike Paxos, the order of EPaxos instances is not determined after they are committed, so EPaxos requires an additional sorting process to sort the instances that have been committed. When an Instance is committed and all instances in its dependent collection DEPS are committed, a sorting process can begin.

The Instance of EPaxos is regarded as the vertex of the graph, and the dependent set DEps of Instance is regarded as the edge of the vertex. Once the value of Instance is agreed with the dependent set DEPS, the vertices and edges of the graph are agreed among the copies, so that each copy will see the same dependency graph.

EPaxos sorts instances in a way that is similar to deterministic topological sorting of graphs. However, it should be noted that the dependency between EPaxos instances may form loops, that is, there may be loops in the graph, so it is not completely topological ordering.

In order to deal with cyclic dependence, EPaxos’s algorithm for Instance ordering needs to first look for the strongly connected components of the graph. Loops are all contained in the strongly connected components. If a strongly connected component is regarded as a vertex of the graph as a whole, then all the strongly connected components constitute a directed acyclic graph (DAG). Then topological ordering is done for all strongly connected components of directed acyclic graphs.

The flow of EPaxos sorting algorithm is shown in Figure 1, where the part circled by background color is strongly connected component:

EPaxos sorting algorithm

Generally, Tarjan algorithm is used to search strongly connected components of graphs, which is a recursive algorithm. It is found that the recursive implementation is easy to burst stack, which also brings certain challenges to engineering applications.

Instances in different strongly connected components are sorted according to the deterministic topological order, while instances in the same strongly connected component are proposed concurrently, and can be sorted according to any deterministic rules in theory. EPaxos calculates a SEQ sequence number for each Instance. The size of SEQ reflects the order of Instance proposals. Instances in the same strongly connected component are sorted according to the size of SEQ. The actual SEQ may be repeated and does not guarantee globally unique increments, which is not taken into account in the EPaxos paper and can actually be sorted using SEQ plus copy identity.

In fact, as the new Instance is continuously running, the old Instance may depend on the new Instance, and the new Instance may depend on the updated Instance. In this way, the dependency chain may continue to extend without termination, and the sorting process cannot be carried out all the time, forming a live lock. This is one of the major engineering challenges of EPaxos.

Because the Instance sorting algorithm is deterministic, each copy will get a consistent Instance sequence after sorting the Instance based on a consistent dependency graph.

Five EPaxos case

The following uses A specific case to introduce the core protocol flow of EPaxos, as shown in the figure below. The system consists of R1, R2, R3, R4, and R5 copies. The horizontal direction represents time, and the proposal flow of A, B, and C is shown in the figure.

EPaxos consensus protocol

The attribute values of each Instance in the case are shown in the following table:

The Instance property in the EPaxos core protocol process case

1 Proposed value A

Firstly, R1 runs the PreAccept phase to initiate the proposal value A. It first obtains its own local DEps and local SEQ. At this time, it does not have any Instance locally, so the local DEps is an empty set, the local SEQ is the initial value 1, and it persists the proposal value A, local DEps and local SEQ.

Then R1 broadcasts PreAccept(A) message to R2 and R3, which carries the proposal value A, local DEPS, and local SEQ (not marked in the figure). At this time, R1, R2, and R3 constitute Fast Quorum. PreAccept messages can be broadcast only to copies in Fast Quorum, an optimization called Thrifty optimization in the EPaxos paper.

After receiving the PreAccept(A) message, R2 and R3 obtain their local DEPS and local SEQ respectively. Similar to R1, local DEPS is an empty set and local SEQ is 1. After persistence, R2 and R3 reply to R1.

The local DEPS and local SEQ that R1 receives the copy in Fast Quorum are the same, the resolution is reached, the final DEPS is empty, seQ is 1, and the Commit stage is run.

2 Suggest value B

R5 then runs the PreAccept phase to initiate the proposal value B. At this time, it also does not have any Instance locally, so the local DEPS is an empty set and the local SEQ is the initial value 1. R5 broadcasts PreAccept(B) messages to R3 and R4 after local persistence.

The local DEps returned by R4 is an empty set, and the local SEq is 1. R3 already has an Instance of A, so the local DEps returned by R3 is {1.1}, namely {A} marked on the figure, and the local SEQ is 2, namely, the SEQ of A’s Instance plus one.

The local DEPS and SEQ of the copy in Fast Quorum are not the same, so the Accept stage is required. The final DEPS takes the union of the local DEPS of most copies as {1.1}, i.e. {A} on the graph, and the final SEQ takes the maximum value of the local SEQ of most copies as 2. Through the Accept stage, the proposed value B, the final DEPS, and the final SEQ reach a majority. Finally, run the Commit phase Commit.

3 Suggest value C

Finally, R1 runs the PreAccept phase and initiates the proposal value C. At this time, R1 already has an Instance with value A locally, so the local DEPS is {1.1}, namely {A} indicated in the figure, and the local SEQ is 3. R1 broadcasts PreAccept(C) messages to R2 and R3 after local persistence.

R2 and R3 now have local instances of A and B, so the local DEPS of R2 and R3 are 1.1, 5.1}, namely {A, B} marked in the figure. The local SEQ is 3, that is, the SEQ of B’s Instance plus one.

The local DEPS and SEQ of all copies of Fast Quorum except proposer R1 are the same, so the decision has been reached, and the final DEPS is {1.1, 5.1}, i.e. {A, B}, and SEQ is 3. Run the Commit phase.

4 the sorting

The dependency relationship of Instance of proposed values A, B and C is drawn according to their dependency set DEps as shown below (left) :

Dependencies between instances of proposed values A, B, and C (left), and the order after sorting (right)

The deps of Instance A is an empty set, so there is no outedge. The deps of Instance B is {A}, so there is an outbound edge pointing to A; The deps of Instance C is {A, B}, so there are two outgoing edges pointing to A and B respectively.

The dependency graph is already a directed acyclic graph (DAG) without cyclic dependencies. Therefore, vertices A, B and C are each strongly connected components, and their order can be obtained after A deterministic topological ordering: A < — B < — C, as shown in the figure (right).

Six EPaxos discussion

1 the Instance conflict

EPaxos introduces the concept of Instance collision (similar to Parallel Raft, not Parallel collision), where if the values of two instances don’t Interfere (such as accessing different keys), their order doesn’t matter and can even be handled in Parallel. So EPaxos only handles dependencies between conflicting logs.

The Instance dependency set of EPaxos, deps, holds instances that need to be executed before the Instance. After a conflict is introduced, the Instance that conflicts with the Instance is stored in the DEPS.

Conflict is a concept related to specific applications. After the introduction of conflict, all instances are no longer in full order but in partial order, which reduces dependence, reduces the probability of Slow Path and improves efficiency.

2 Fast Quorum

The derivation of the value of Fast Quorum will be introduced in subsequent articles. Here, we will first discuss why Fast Quorum in the PreAccept stage always contains the initiator.

At each stage of EPaxos, the proposer always broadcasts messages to other replicas after local persistence is successful. That is, the proposer is always in the Quorum, so the proposer can always count as one vote in determining whether or not Quorum is reached.

During the PreAccept phase, the proposer includes its local DEPs and SEQ in the PreAccept message as a basis for other replicas to calculate their local DEPS and SEQ, so that the proposer’s local DEPS and SEQ are always included in the final DEPS and SEQ. So a Fast Quorum in the PreAccept phase always contains the proposer.

EPaxos always persistent success after the broadcast to other local copy, so that we can reduce Fast Quorum, but also lead to local persistence and network messaging cannot be parallel, reduces the efficiency of some, but also makes the proposed person cannot tolerate the situation of the local disk damage, these are all EPaxos engineering application must face the problem.

Why is Fast Quorum not smaller than Slow Quorum? There is no need to derive the value of Fast Quorum. This conclusion can be drawn intuitively. In Paxos, one copy proposes a value, and all copies have only two outcomes, accept or reject the value. In EPaxos, each copy may give different DEPS and SEQ, so more copies need to give the same result to ensure that the correct result can be restored after the failure of one copy.

EPaxos pseudocode

At this point, I believe that the reader can understand the EPaxos core protocol flow pseudo-code. The EPaxos core protocol process pseudocode is shown below, with the Proposal ID (or Ballot Number) part omitted for simplicity, which is the same as in Paxos.

The pseudo-code treats the log as a Command, each Instance agrees on a Command, and each copy uses a two-dimensional array CMDS to hold the commands received.

EPaxos core protocol flow pseudocode

Eight summary

EPaxos not only removes the dependency on the Leader by maintaining dependencies between instances, but also enables concurrent out-of-order commits, which can be optimized for Pipelining. Explicit maintenance of dependencies also enables out-of-order execution. EPaxos supports out-of-order confirmation, out-of-order submission, and out-of-order execution, providing higher throughput in theory. At the same time, we can also see some EPaxos engineering application challenges, these are EPaxos project to solve the problems.

This article introduces the core protocol flow of EPaxos from the perspective of the comparison between Paxos and EPaxos. However, EPaxos is not only about these, especially how to ensure the sequence of log sequences in Failover scenarios.

thinking

Finally, I leave a few questions for interested students to think about:

  1. Why and under what circumstances does Instance seQ repeat?
  2. How to deduce the value of Fast Quorum?
  3. If the consensus protocol process for an Instance is not complete and its proposer goes down, what should other deputies do with the Instance?

The original link

This article is the original content of Aliyun and shall not be reproduced without permission.