
With the development of computer science and the Internet, distributed scenarios are becoming more and more common. Whether an engineer can handle problems in distributed scenarios has become a yardstick for measuring their competence. In this article, we introduce the theories related to distributed systems, which form the basis for understanding and handling distributed problems.

CAP theory

CAP theory was proposed in 1998 by computer scientist Eric Brewer. Its three properties are:

  • Consistency: every node returns the same result for the same data.
  • Availability: all nodes remain available; high availability also implies no excessive latency.
  • Partition tolerance: the system can still provide service when the network between nodes is partitioned and some nodes cannot reach each other.

CAP theory states that a system can satisfy at most two of the above three properties at the same time, never all three. I have seen an easier way of explaining this:

  • C stands for consistent data
  • A stands for the same time
  • P stands for different spaces (partitions)

CP: across different spaces the data is consistent, but not at the same time. AP: data can be retrieved from any space at the same time, but its state may be inconsistent. CA: data can be retrieved consistently at any time, which only holds when P equals 1, i.e. there is just one space.

Combined with actual business scenarios, P (partition tolerance) is a requirement every distributed system must meet, so in practice we only choose between the CP and AP models. The CP model is characterized by strong consistency requirements; Zookeeper, which we are all familiar with, is a typical CP system. The AP model guarantees high availability and eventual (rather than real-time) consistency of data; a typical example is Eureka.

BASE theory

BASE stands for Basically Available, Soft state, and Eventual consistency.

  • Basically Available: the core services (or part of the services) remain available, possibly with degraded functionality or increased response latency.
  • Soft state: intermediate states are allowed, i.e. data synchronization on some nodes may not yet be complete.
  • Eventual consistency: after internal coordination, all nodes in the system reach agreement within a bounded time.

In practice, eventual consistency also has some points worth noting:

  • Session consistency: if data has been updated within a session, subsequent reads in that session will not return the old data.
  • Node consistency: if data on a node has been updated, the old data cannot be read from that node again.

BASE theory is better suited to the overall design of large distributed systems, and it differs from ACID's strong consistency model: it enhances the availability of the system by sacrificing strong consistency, allowing the data to reach eventual consistency over time. BASE and ACID can be used together in real business scenarios.

2PC

To quote from Wikipedia: “Two-phase commit (2PC) is an algorithm designed to make a transaction commit consistently across all nodes of a distributed system architecture, used in computer networks and databases.” Two-phase commit is also known as the 2PC protocol. There are two roles: the coordinator and the participants (executors).

Phase 1: Commit request (voting) phase

  1. The coordinator sends a request to all participants asking whether the transaction can be executed.
  2. Each participant executes the transaction operations and writes them to its log, but does not commit.
  3. After executing the transaction, each participant returns its execution status to the coordinator: a Ready state if the transaction executed successfully, or an Abort state if it failed.

If one or more participants return Abort, the rollback process is initiated:

  1. The coordinator sends a rollback request to all participants.
  2. After receiving the rollback request, each participant rolls back based on the log it recorded in step 2 of phase 1.
  3. When the rollback is complete, the participant returns a “rollback finished” message to the coordinator.
  4. Once the coordinator has received “rollback finished” from all participants, the transaction rollback has succeeded.

If the coordinator receives Ready from all participants, it proceeds to the second phase.

Phase 2: Commit execution phase

  1. The coordinator sends a “formal commit” request to all participants.
  2. Each participant performs the commit operation and releases the associated lock resources.
  3. Each participant sends a “commit succeeded” message to the coordinator.
  4. Once the coordinator has received “commit succeeded” messages from all participants, the transaction is complete.

In the second phase, the coordinator must see the commit through to completion, as sketched below.
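
To make the flow above concrete, here is a minimal coordinator sketch in Go. The Participant interface and the twoPhaseCommit function are illustrative names invented for this example, not part of any real library; a production coordinator would also need persistent logging, timeouts, and retries.

```go
package twopc

import "errors"

// Participant models one executor in the protocol described above.
type Participant interface {
	Prepare() error  // phase 1: execute the transaction and write the log, but do not commit
	Commit() error   // phase 2: make the changes permanent and release locks
	Rollback() error // undo the work recorded during Prepare
}

// twoPhaseCommit plays the coordinator: it commits only if every participant
// votes "ready" in phase 1; any failure triggers a rollback of all participants.
func twoPhaseCommit(participants []Participant) error {
	// Phase 1: commit-request (voting) phase.
	for _, p := range participants {
		if err := p.Prepare(); err != nil {
			for _, q := range participants {
				q.Rollback() // best effort; a real coordinator logs and retries this
			}
			return errors.New("transaction aborted in the prepare phase")
		}
	}
	// Phase 2: commit-execution phase. The coordinator must see this through.
	for _, p := range participants {
		if err := p.Commit(); err != nil {
			// 2PC cannot undo work here; this is the inconsistency risk noted below.
			return errors.New("commit failed on a participant; data may be inconsistent")
		}
	}
	return nil
}
```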

Advantages and disadvantages of two-phase commit:

  • Advantages: the principle is simple and the implementation is straightforward.
  • Disadvantages: the coordinator is a single point of failure; all participants block while waiting for each other; if a participant fails in the first phase, the coordinator cannot tell whether it succeeded and can only rely on its own timeout mechanism to trigger a rollback; and in the second phase, if a participant fails and cannot complete the commit operation, the data ends up inconsistent.

3PC

Three-phase commit splits the first phase of two-phase commit into two phases. The benefit is that risk is detected in advance, reducing the likelihood that participants block each other.

Phase 1: CanCommit

  1. The coordinator sends a CanCommit request to all participants.
  2. Each participant predicts, based on its own state, whether it can perform the commit and returns either “Can” or “No”. If any participant returns “No” or times out, the coordinator sends a rollback to all participants, and each participant acknowledges the rollback with an ACK (the rollback is lightweight at this point, since no work has been done yet).

If all participants return “Can”, the protocol enters the next phase.

Phase 2: PreCommit

  1. The coordinator sends PreCommit requests to all participants.
  2. Each participant executes the transaction and writes its log, but does not commit; after executing the transaction, it returns “Pre” or “No”.

If any participant returns “No” or times out, the coordinator sends a rollback to all participants; each participant performs the rollback and returns an ACK when it is done. If all participants return “Pre”, the protocol proceeds to the next phase.

Phase 3: DoCommit

  1. The coordinator sends a DoCommit request to all participants.
  2. Each participant commits the transaction, releases its locks, and returns “Completed” or “Failed” when the transaction is done.
  3. If all participants return “Completed”, the coordinator completes the transaction. If some participants return “Failed” or time out, the coordinator sends a rollback to all participants; the participants perform the rollback and return an ACK when the rollback is complete.

Advantages: three-phase commit relieves the blocking problem that two-phase commit imposes on participants. However, it does not fundamentally solve the coordinator single point of failure or the problems caused by network delay.
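
Continuing the hypothetical Go sketch from the 2PC section, the main difference 3PC introduces is the lightweight CanCommit check performed before any work is done; all names here are again illustrative only.

```go
package threepc

import "errors"

// Participant adds the CanCommit pre-check in front of the 2PC-style phases.
type Participant interface {
	CanCommit() bool  // phase 1: only predicts feasibility; nothing is executed yet
	PreCommit() error // phase 2: execute the transaction and write the log, but do not commit
	DoCommit() error  // phase 3: commit and release locks
	Rollback() error
}

func threePhaseCommit(participants []Participant) error {
	// Phase 1: CanCommit. Aborting here is cheap because no work has been done.
	for _, p := range participants {
		if !p.CanCommit() {
			return errors.New("aborted in CanCommit")
		}
	}
	// Phase 2: PreCommit, equivalent to the voting phase of 2PC.
	for _, p := range participants {
		if err := p.PreCommit(); err != nil {
			for _, q := range participants {
				q.Rollback()
			}
			return errors.New("aborted in PreCommit")
		}
	}
	// Phase 3: DoCommit. In real 3PC a participant that stops hearing from the
	// coordinator here times out and decides on its own, which reduces blocking.
	for _, p := range participants {
		if err := p.DoCommit(); err != nil {
			return errors.New("DoCommit failed")
		}
	}
	return nil
}
```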

Paxos algorithm:

Paxos is a message-passing-based, highly fault-tolerant consensus algorithm proposed by Leslie Lamport in 1990. The Paxos algorithm has two parts: Basic Paxos and Multi-Paxos. Basic Paxos solves how multiple nodes agree on a single value; Multi-Paxos describes how multiple rounds of Basic Paxos agree on a series of values. Multi-Paxos does not have a strict proof and is more of a guideline for solving distributed problems. Here we mainly introduce the Basic Paxos algorithm. There are three roles in Basic Paxos: Proposer, Acceptor, and Learner.

  • Proposer: proposes a value (a proposal) to be agreed upon
  • Acceptor: votes on whether to accept a proposal
  • Learner: records the approved proposal

When the algorithm is implemented, the Proposer is usually the node that first receives the request from the client, and a node typically acts as both Proposer and Acceptor. This keeps the algorithm logic concentrated inside the service and decoupled from the client. The proof can be found on Wikipedia (see the references at the end).

Final algorithm:

Agreeing on a proposal is divided into two phases:

Phase 1: Prepare phase

  1. The Proposer selects a proposal number M and sends a Prepare request numbered M to more than half of the Acceptors.
  2. If an Acceptor receives a Prepare request numbered M, and M is greater than every proposal number the Acceptor has already responded to, the Acceptor returns the highest-numbered proposal it has accepted (if any) and promises not to approve any proposal numbered less than M.

Phase 2: Accept phase

  1. If the Proposer receives Prepare responses from more than half of the Acceptors, it sends the Acceptors an Accept request numbered M with value V, where V is the value of the highest-numbered proposal among the responses (or a value of the Proposer's own choosing if none of the responses carried a proposal).
  2. If an Acceptor receives this Accept request and has not responded to a Prepare request with a number greater than M in the meantime, it accepts the proposal (a minimal Acceptor sketch follows).
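
The Acceptor's side of these two phases can be sketched in a few lines of Go. The types and field names are illustrative; a real implementation must persist the promised number and the accepted proposal to stable storage before replying.

```go
package paxos

// Proposal is an illustrative (number, value) pair.
type Proposal struct {
	Number int
	Value  string
}

// Acceptor remembers the highest Prepare number it has responded to and the
// highest-numbered proposal it has accepted so far.
type Acceptor struct {
	promisedNumber int       // highest Prepare number responded to
	accepted       *Proposal // highest-numbered proposal accepted, if any
}

// Prepare handles a phase-1 request numbered m: it promises not to approve
// proposals numbered below m, and returns the proposal it has already
// accepted (if any) so the proposer can adopt that value.
func (a *Acceptor) Prepare(m int) (ok bool, accepted *Proposal) {
	if m <= a.promisedNumber {
		return false, nil // already promised an equal or higher number
	}
	a.promisedNumber = m
	return true, a.accepted
}

// Accept handles a phase-2 request: the proposal is accepted only if the
// acceptor has not promised a number greater than the proposal's number.
func (a *Acceptor) Accept(p Proposal) bool {
	if p.Number < a.promisedNumber {
		return false
	}
	a.promisedNumber = p.Number
	a.accepted = &p
	return true
}
```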

Let’s take a practical example:

P1 and P2 represent two proposers; A1, A2, and A3 represent three acceptors.

Phase 1: Prepare phase

P1 sends proposal no. 1 to A1, A2, and A3; P2 sends proposal 2 to A1, A2, and A3.

  1. A1 and A2 receive the proposal numbered 1. Since they have not received any proposal before, A1 and A2 return a response to P1 and promise that they will no longer respond to any Prepare request numbered less than or equal to 1, and will not approve any proposal numbered less than 1.
  2. A1 and A2 then receive the proposal numbered 2 from P2. Since 2 is greater than 1, A1 and A2 return a response to P2 and promise that they will no longer respond to any Prepare request numbered less than or equal to 2, and will not approve any proposal numbered less than 2.
  3. A3 receives the Prepare request numbered 2 from P2. Since A3 has not received any proposal before, it returns a response to P2 with the same promise: it will no longer respond to any Prepare request numbered less than or equal to 2, and will not approve any proposal numbered less than 2.
  4. A3 then receives proposal 1 from P1. Because of the promise A3 made in the previous step and because 1 < 2, A3 does not respond to this request.

Phase 2: Accept phase

Since both P1 and P2 received responses from more than half of the acceptors in the Prepare phase, they both send Accept requests.

P1 sends the Accept request numbered 1 with value 1; P2 sends the Accept request numbered 2 with value 2.

  • A1, A2, and A3 receive the Accept request from P1; since the number 1 is less than the promised number 2, the proposal numbered 1 is rejected.
  • A1, A2, and A3 receive the Accept request from P2; since the number 2 is not less than the promised number 2, they all return a response and the proposal is approved.

Learning a proposal: the Acceptors send the approved proposal to a Learner leader group, which spreads the proposal to all Learner nodes. A typical industry implementation of Paxos is Chubby.

Raft algorithm

The Raft algorithm has become the consensus algorithm of choice in recent years; etcd, Consul, and TiDB all use Raft. A node in the Raft algorithm has one of three states: Follower, Candidate, and Leader.

Cluster startup: when the cluster has just started, all nodes are Followers. When a Follower receives no heartbeat from a Leader, it promotes itself to Candidate. The Candidate requests votes from the other nodes; if it receives votes from more than half of them, the Candidate becomes the Leader.

Reaching the first consistent state: all changes to the system go through the Leader node, and each change is written to the log. The Leader first sends the log entry to the Follower nodes, waits until more than half of the nodes acknowledge it, then commits the value and notifies the Followers that the value has been committed, so the Followers commit the log entry as well. The cluster is now in a consistent state after its first startup.

Leader election:

There are two timeouts in the Raft algorithm that control leader election.

  1. Election timeout: if a Follower does not receive a liveness check (heartbeat) from the Leader within a certain period of time, it promotes itself to Candidate. Followers set this timeout to a random value, for example between 150 and 300 ms (150-300 ms is just an example).
  2. After the election timeout fires, the Follower promotes itself to Candidate and starts a new round of voting.
  3. The Candidate votes for itself first.
  4. The Candidate sends vote requests to the other nodes.
  5. A node that receives the Candidate's request votes for that Candidate if it has not yet voted in this election round.
  6. The Candidate also sets its own election timeout to a random value within a range.
  7. Once a Candidate has received votes from more than half of the nodes, it is promoted to Leader.
  8. The Leader sends Append Entries messages to its Followers. These messages are sent periodically as heartbeat checks, and the Followers respond to the Leader for each Append Entries message. This heartbeat exchange continues until a Follower stops receiving heartbeats, times out, and promotes itself to Candidate again.
  9. A Candidate must obtain votes from more than half of the nodes, which ensures that at most one Leader can be elected per election round. If no Candidate obtains a majority, the election fails; each Candidate then waits a random amount of time before repeating the next round of voting, until a Leader is elected (a sketch of this election logic follows).
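
A minimal Go sketch of the randomized timeout and the majority rule described above; the peers parameter stands in for RequestVote RPCs to the other nodes, so the names and shapes are assumptions made for this example.

```go
package raft

import (
	"math/rand"
	"time"
)

// electionTimeout returns a randomized timeout; the 150-300 ms range is just
// the example used above. Randomization makes repeated split votes unlikely.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// startElection sketches steps 3-7 above: the candidate votes for itself,
// asks every peer for a vote, and becomes Leader only with more than half of
// the cluster. Each entry in peers stands for a RequestVote call returning
// whether the vote was granted.
func startElection(term int, peers []func(term int) bool) bool {
	votes := 1 // the candidate votes for itself first
	for _, requestVote := range peers {
		if requestVote(term) {
			votes++
		}
	}
	clusterSize := len(peers) + 1 // peers plus the candidate itself
	return votes*2 > clusterSize  // strictly more than half
}
```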

Log replication:

Once the cluster has a Leader, all changes need to be replicated to every node in the cluster; this is done by sending Append Entries messages that carry the log entries.

  1. The client sends a request to the Leader, and the Leader writes the request to its log.
  2. The log entries are sent to the Followers with the next heartbeat.
  3. The Followers return a response to the Leader after receiving the entries. Once more than half of the nodes have responded, the Leader commits the log entry (see the sketch below).
  4. The Leader returns a response to the client and notifies the Follower nodes to commit the log entry as well.
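
The commit rule in step 3 can be sketched as follows (Go, illustrative names): the Leader only treats a log index as committed once it is stored on more than half of the nodes.

```go
package raft

// canCommit reports whether the entry at the given log index is replicated on
// a majority of the cluster. matchIndex[i] is the highest log index known to
// be stored on node i; the Leader's own log is included in this slice.
func canCommit(index int, matchIndex []int) bool {
	acks := 0
	for _, m := range matchIndex {
		if m >= index {
			acks++
		}
	}
	return acks*2 > len(matchIndex) // strictly more than half of the nodes
}
```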

Membership change:

If network partitions occur while multiple nodes are changing membership at the same time, it is easy to end up with multiple Leaders in one cluster; if clients keep sending requests to the cluster at that point, split brain occurs. Membership changes in a Raft cluster are therefore usually handled with the single-node-change approach, in which only one node is added or removed at a time. A single-node membership change takes two steps:

  1. The Leader synchronizes data to the new node.
  2. The Leader sends the new configuration as a log entry to all nodes in the cluster.

There are a few more rules to be aware of in the Raft algorithm:

  1. The term number increases monotonically across the cluster. If a Candidate or Leader discovers a Leader whose term number is larger than its own, it automatically reverts to Follower.
  2. If a node receives a request whose term number is smaller than its current term number, it rejects the request.
  3. A node remains the Leader as long as it keeps sending heartbeat requests to the other nodes within the timeout period.
  4. Nodes with more complete logs (the last recorded log entry has a larger term number, or the same term number with a larger index) refuse to vote for Candidates whose logs are less complete than their own.
  5. In one election round each node has only one vote, granted to the first request it receives.

XA transaction

To understand XA transactions, you first need to know the Distributed Transaction Processing (DTP) model, which has three modules:

  • AP: Application Program, usually our application, which initiates the transaction.
  • RM: Resource Manager, which manages specific resources and can commit or roll back transactions; usually a storage system or a common service such as a database.
  • TM: Transaction Manager, the coordinator of the distributed transaction. The TM can communicate with all RMs.

The XA specification mainly defines the interaction between the RM and the TM, and relies on the idea of two-phase commit. XA specifies a series of interfaces.

The main interfaces are:

  • xa_open: initializes an RM
  • xa_start: starts an XA transaction
  • xa_end: ends an XA transaction
  • xa_prepare: the preparation phase; pre-commits an XA transaction
  • xa_commit: commits an XA transaction
  • xa_rollback: rolls back an XA transaction

The execution flow of an XA transaction, step by step:

  1. The AP initiates the transaction start request.
  2. The TM sends a request to the RM and calls xa_start to mark the start of the transaction.
  3. The AP accesses the RM to perform the concrete operations, such as the specific database statements.
  4. The TM calls xa_end to mark the end of the transaction.
  5. The TM issues xa_prepare to notify the RM to prepare to commit, similar to the commit-request phase of two-phase commit.
  6. The TM issues xa_commit to notify the RM to commit, similar to the commit-execution phase of two-phase commit. If preparation failed in the previous step, xa_rollback is issued instead to notify the RM to roll back the transaction.
  7. The TM calls xa_close to end the transaction.

XA transactions are implemented by databases such as MySQL and Oracle, so when we perform cross-database operations, whether or not the databases are the same product, as long as they implement the XA protocol we can call their interfaces to complete a distributed transaction.
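
As a hedged illustration, the sketch below drives one MySQL branch of an XA transaction from Go using MySQL's XA SQL statements (XA START / XA END / XA PREPARE / XA COMMIT / XA ROLLBACK). The table, xid, and function names are invented for the example, and a real program must escape the xid instead of interpolating it into the statement text.

```go
package xa

import (
	"database/sql"
	"fmt"
)

// prepareBranch performs the RM-side work of one XA branch: start the branch,
// do the application's real work, end the branch, then pre-commit it.
func prepareBranch(db *sql.DB, xid string) error {
	stmts := []string{
		fmt.Sprintf("XA START '%s'", xid),
		"UPDATE account SET balance = balance - 100 WHERE id = 1", // the AP's actual operation
		fmt.Sprintf("XA END '%s'", xid),
		fmt.Sprintf("XA PREPARE '%s'", xid),
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			return err
		}
	}
	return nil
}

// commitBranch is issued by the TM only after every branch has prepared;
// on any failure it would issue "XA ROLLBACK" instead.
func commitBranch(db *sql.DB, xid string) error {
	_, err := db.Exec(fmt.Sprintf("XA COMMIT '%s'", xid))
	return err
}
```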

TCC

TCC is a typical flexible distributed transaction pattern, which guarantees eventual consistency of data through a compensation mechanism. The three phases of TCC are:

  1. Try phase: pre-execute the operation, checking and reserving the resources the business needs.
  2. Confirm phase: confirm (actually perform) the business operation.
  3. Cancel phase: cancel the business operation and release the reserved resources.

TCC is simple to understand, but when implementing it you need to consider several issues, such as idempotent design of the request interfaces and how resources are locked during the Try phase, as sketched below.
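
A minimal Go sketch of the three phases; the interface and function names are invented for illustration. Note that Confirm and Cancel must both be idempotent, which is exactly the interface-design issue mentioned above.

```go
package tcc

// Action models one participating service with the three TCC phases.
type Action interface {
	Try() error     // check and reserve resources (e.g. freeze inventory or funds)
	Confirm() error // consume the reserved resources; must be idempotent
	Cancel() error  // release the reservation (compensation); must be idempotent
}

// execute tries every participant first; any Try failure triggers Cancel on
// all participants, otherwise every participant is confirmed.
func execute(actions []Action) error {
	for _, a := range actions {
		if err := a.Try(); err != nil {
			for _, b := range actions {
				b.Cancel() // best effort; real systems retry until Cancel succeeds
			}
			return err
		}
	}
	for _, a := range actions {
		if err := a.Confirm(); err != nil {
			return err // Confirm is retried (idempotently) rather than rolled back
		}
	}
	return nil
}
```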

Quorum NWR:

The idea of the NWR model is to hand the CAP trade-off to the user, who can flexibly adjust the balance between C and A in the CAP model through configuration.

  • N: the total number of copies (replicas)
  • W: a write is considered successful only when at least W copies have been written successfully
  • R: a read is considered successful only when at least R copies have been read (if the copies are inconsistent, the most recent version is usually chosen)

If W + R > N, then R > N - W, which means every read overlaps at least one copy that holds the latest write. If high write performance is required, set W = 1 and R = N: a write to any single node counts as success, but a read must contact every node (writes are highly available, reads are expensive). If high read performance is required, set W = N and R = 1: a read from any single node counts as success and is strongly consistent, but a write must succeed on every node (reads are highly available, write availability is low). A sketch of the read side follows.
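
Below is a small Go sketch of the read side; the names and the timestamp-based versioning are assumptions made for illustration. As long as W + R > N, the R answers collected here must include at least one copy carrying the latest write, and the newest timestamp wins.

```go
package nwr

import "errors"

// Version pairs a value with a timestamp so the reader can pick the newest copy.
type Version struct {
	Value     string
	Timestamp int64
}

// quorumRead queries replicas until r of them have answered and returns the
// most recent version among those answers.
func quorumRead(replicas []func() (Version, error), r int) (Version, error) {
	var newest Version
	acks := 0
	for _, read := range replicas {
		v, err := read()
		if err != nil {
			continue // skip unreachable replicas
		}
		acks++
		if v.Timestamp > newest.Timestamp {
			newest = v
		}
		if acks >= r {
			return newest, nil // R successful reads: the quorum condition is met
		}
	}
	return Version{}, errors.New("fewer than R replicas responded")
}
```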

References:

  • lamport.azurewebsites.net/pubs/paxos-…
  • zh.wikipedia.org/wiki/Paxos%…
  • zh.wikipedia.org/wiki/Raft
  • raft.github.io/raft.pdf
  • thesecretlivesofdata.com/raft/
  • github.com/hashicorp/r…
  • en.wikipedia.org/wiki/Paxos_…
  • pubs.opengroup.org/onlinepubs/…
  • deepnote.me/2020/02/19/…