1. Preface

With the continuous improvement of PC performance and the rapid spread of network technology, many enterprises began to abandon their original mainframes and build distributed computer systems out of minicomputers and ordinary PC servers. The most typical example is Alibaba Group's "de-IOE" campaign.

In earlier centralized applications, it was easy to implement a transaction processing system that satisfied the ACID properties and thus guaranteed strict data consistency. In distributed applications, however, where data is scattered across different machines, strict consistency is hard to guarantee. Classic distributed-systems theories such as CAP and BASE emerged as a result.

1.1 ACID

A transaction is a logical unit of program execution consisting of a series of operations that access and update data in a system. In the narrow sense, a transaction refers specifically to a database transaction.

Transactions have four key properties: Atomicity, Consistency, Isolation, and Durability (ACID).

1.1.1 Atomicity

Atomicity means that a transaction must be an indivisible unit of operations: either all of the transaction's operations succeed, or all of them fail.

1.1.2 Consistency

Consistency means that executing a transaction must not violate the integrity and consistency of the database's data: the database must be in a consistent state both before and after the transaction runs.

1.1.3 Isolation

Isolation means that concurrent transactions are isolated from each other and the execution of one transaction cannot be interfered with by other transactions.

1.1.4 Durability

Durability, sometimes called persistence, means that once a transaction is committed, its changes to the data in the database are permanent.

1.2 CAP theorem

The CAP theorem states that a distributed system cannot simultaneously satisfy Consistency (C), Availability (A), and Partition tolerance (P); at most two of the three can hold at once. Since partition tolerance is unavoidable in a distributed system, the real trade-off is between consistency and availability.

1.2.1 Consistency

In a distributed environment, consistency means the data stays consistent across multiple replicas. Under this requirement, when a system in a consistent state performs an update operation, it must ensure the data remains in a consistent state afterwards.

Consider a data item replicated across different nodes of a distributed system. If an update is successfully applied on the first node but the corresponding update never reaches the second node, then a read on the second node returns stale data (sometimes called dirty data): the classic case of distributed data inconsistency. A distributed system is considered strongly (strictly) consistent if, whenever an update to a data item succeeds, all users can read its latest value.
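The stale-read scenario can be sketched in a few lines; the node names and the in-memory `replicas` dict are stand-ins for real networked nodes:

```python
# Two replicas of the same data item; an update reaches node1 only,
# so node2 keeps serving the old (dirty) value. Names are illustrative.
replicas = {"node1": {"x": 1}, "node2": {"x": 1}}

def write(node, key, value):
    replicas[node][key] = value   # replication to the other node never happens

def read(node, key):
    return replicas[node][key]

write("node1", "x", 2)
print(read("node1", "x"))  # → 2 (new value)
print(read("node2", "x"))  # → 1 (stale value: the classic inconsistency)
```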

1.2.2 Availability

Availability means that the service the system provides must remain continuously available: every request must return a result within a bounded time.

1.2.3 Partition tolerance

Partition tolerance requires a distributed system to keep providing consistent and available service when it encounters any network partition failure, unless the entire network environment fails.

A network partition occurs when the nodes of a distributed system are spread across different subnetworks (separate machine rooms, long-distance links, and so on) and, for some reason, those subnetworks lose connectivity with one another even though each subnetwork's internal network remains healthy, splitting the system's network environment into several isolated regions. Note that every node's joining or leaving of the system can be viewed as a special case of network partition.

1.3 BASE Theory

BASE is shorthand for Basically Available, Soft state, and Eventually consistent. It is the result of trading off consistency against availability in CAP, and it evolved gradually out of the CAP theorem. Its core idea is that even when strong consistency cannot be achieved, each application can, according to its own business characteristics, adopt an appropriate approach to bring the system to eventual consistency.

1.3.1 Basically available

Basically available means that a distributed system is allowed to lose part of its availability when unpredictable failures occur; note that this is by no means equivalent to the system being unavailable.

1.3.2 Soft state

Soft state, also called weak state, is the opposite of hard state. It means the system's data is allowed to exist in an intermediate state, and that intermediate state does not affect the overall availability of the system; that is, the system may delay synchronizing data between replicas on different nodes.

1.3.3 Eventual consistency

Eventual consistency emphasizes that all replicas of the data can reach a consistent state after a period of synchronization. Its essence is that the system must guarantee the data is eventually consistent, rather than guaranteeing strong consistency in real time.
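A toy sketch of soft state and eventual consistency, assuming a simple operation log that a background anti-entropy pass replays on every replica (all names are illustrative):

```python
# Replicas converge once a background anti-entropy pass replays the
# operation log; until then the system is in a soft (intermediate) state.
replicas = [{"x": 1}, {"x": 1}, {"x": 1}]
oplog = []  # ordered log of acknowledged writes

def write(key, value):
    replicas[0][key] = value  # acknowledged after updating one replica
    oplog.append((key, value))

def anti_entropy():
    for replica in replicas:  # replay the log on every replica
        for key, value in oplog:
            replica[key] = value

write("x", 42)
print([r["x"] for r in replicas])  # soft state: [42, 1, 1]
anti_entropy()
print([r["x"] for r in replicas])  # eventually consistent: [42, 42, 42]
```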

2. Consistency protocols and algorithms

In the long process of exploring the distributed consistency problem, a number of classic consistency protocols and algorithms have emerged, the best known being the two-phase commit protocol, the three-phase commit protocol, and the Paxos algorithm.

2.1 2PC protocol

2PC, short for Two-Phase Commit, is an algorithm designed for computer networks, and database systems in particular, to keep all nodes of a distributed system atomic and consistent during transaction processing.

2.1.1 Protocol Description

The two-phase commit protocol divides the transaction commit process into two phases for processing. Before going into the process, let me introduce two concepts:

  • Coordinator: Used to uniformly schedule execution logic for all distributed nodes.

  • Participant: a distributed node that is scheduled by the coordinator.

The execution process is as follows:

  • Phase 1: Submit transaction request (voting phase)

    1. Transaction inquiry

      The coordinator sends transaction content to all participants, asks if a transaction commit can be performed, and waits for responses from each participant.

    2. Execute the transaction

      Each participant node performs a transaction and writes Undo and Redo information to the transaction log.

    3. Feedback response

      If the participant successfully executes the transaction, the coordinator is given a Yes response indicating that the transaction can be executed.

      If the participant does not successfully execute the transaction, a No response is returned to the coordinator indicating that the transaction cannot execute.

  • Phase 2: Execute the transaction commit (execution phase)

    The coordinator decides whether to commit the transaction based on the feedback of each participant. Normally, there are two possibilities.

    Commit the transaction

    If the coordinator receives a Yes response from all participants, the transaction commit is performed.

    1. Send commit request

      The coordinator issues Commit requests to all the participant nodes.

    2. Transaction commit

      After receiving the Commit request, the participant formally performs the transaction Commit and releases the transaction resources occupied during the entire transaction execution.

    3. Feedback transaction commit results

      The participant sends an Ack message to the coordinator after completing the transaction commit.

    4. Complete the transaction

      After receiving Ack messages from all participants, the coordinator completes the transaction.

    Interrupt the transaction

    If any participant returns a No response to the coordinator, or if the coordinator is unable to receive feedback from all participants after the wait times out, the transaction is interrupted.

    1. Send a rollback request

      The coordinator issues a Rollback request to all the participant nodes.

    2. Transaction rollback

      After receiving the Rollback request, the participant uses the Undo information recorded in phase 1 to perform the transaction Rollback and, upon completion of the Rollback, releases the resources occupied during the entire transaction execution.

    3. Feedback transaction rollback results

      The participant sends an Ack message to the coordinator after completing the transaction rollback.

    4. Interrupt the transaction

      The coordinator completes the transaction interrupt after receiving an Ack message from all participants.
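The two-phase flow above can be sketched as follows. This is an illustrative single-process model, not a networked implementation; the `Participant` class stands in for a remote node, and timeouts and coordinator failure are omitted:

```python
# Minimal 2PC coordinator sketch; `Participant` is a stand-in for a remote node.
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self, tx):
        # Phase 1: execute the transaction, write Undo/Redo logs, then vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit  # Yes/No response

    def commit(self):
        self.state = "committed"    # formally commit, release resources

    def rollback(self):
        self.state = "rolled_back"  # undo using the phase-1 log records

def two_phase_commit(tx, participants):
    # Phase 1: collect votes; any No vote (or, in reality, a timeout) aborts.
    votes = [p.prepare(tx) for p in participants]
    if all(votes):
        for p in participants:  # Phase 2: Commit everywhere
            p.commit()
        return "committed"
    for p in participants:      # Phase 2: Rollback everywhere
        p.rollback()
    return "rolled_back"

nodes = [Participant(), Participant(can_commit=False)]
print(two_phase_commit("tx1", nodes))  # → rolled_back (one No vote aborts all)
```

A single dissenting vote rolls back every participant, which is exactly the "too conservative" behavior discussed below.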

2.1.2 Advantages and disadvantages

Advantages:

  • Simple in principle
  • Easy to implement

Disadvantages:

  • Synchronous blocking

    The most obvious and most serious problem with the two-phase commit protocol is synchronous blocking, which severely limits the performance of a distributed system. While a two-phase commit is executing, all logic involved in the transaction is blocked: no participant can do anything else while waiting for the responses of the others.

  • Single point of failure

    In two-phase commit, if the coordinator fails, the whole process stops working; worse, if the coordinator fails during phase 2, the participants remain locked on their transaction resources and cannot complete the transaction.

  • Data inconsistency

    In phase 2 of the two-phase commit protocol, after the coordinator sends the Commit request to all participants, a local network failure, or a coordinator crash partway through sending, can result in only some participants receiving the request. The participants that received the Commit request commit the transaction, while the rest cannot, leaving the data in the distributed system inconsistent.

  • Too conservative

    The two-phase commit protocol has no well-designed fault tolerance mechanism, and the failure of any node will lead to the failure of the whole transaction.

2.2 3PC protocol

2.2.1 Protocol Description

3PC, short for Three-Phase Commit, is an improved version of 2PC: it splits the "commit transaction request" phase of the two-phase commit protocol in two, yielding a transaction protocol consisting of the CanCommit, PreCommit, and DoCommit phases. The execution process is as follows:

  • Phase one: CanCommit

    1. Transaction inquiry

      The coordinator sends a CanCommit request containing the transaction content to all participants, asks if the transaction commit operation can be performed, and waits for the response from each participant.

    2. Feedback response

      When a participant receives the CanCommit request from the coordinator, it normally responds Yes and enters a preparatory state if it believes it can execute the transaction successfully; otherwise it responds No.

  • Phase 2: PreCommit

    The coordinator decides whether or not to PreCommit a transaction based on feedback from each participant. Normally, there are two possibilities.

    Perform transaction pre-commit

    If the coordinator receives a Yes response from all participants, the transaction pre-commit is performed.

    1. Send a pre-commit request

      The coordinator issues a PreCommit request to all the participant nodes and enters the Prepared phase.

    2. Transaction precommit

      Upon receiving a PreCommit request, the participant performs a transaction and records Undo and Redo information to the transaction log.

    3. Feedback response

      If the participant executes the transaction successfully, it returns an Ack response to the coordinator and waits for the final instruction: commit or abort.

    Interrupt the transaction

    If any participant returns a No response to the coordinator, or if the coordinator is unable to receive feedback from all participants after the wait times out, the transaction is interrupted.

    1. Send interrupt request

      The coordinator issues abort requests to all participant nodes.

    2. Interrupt the transaction

      The participant interrupts the transaction when it either receives an abort request from the coordinator or times out while waiting for the coordinator's request.

  • Phase 3: DoCommit

    This phase performs the actual transaction commit; there are two possible scenarios.

    Commit the transaction

    1. Send commit request

      Entering this phase, assuming that the coordinator is in a working state and that it has received Ack responses from all participants, it transitions from the “pre-commit” state to the “commit” state and sends a doCommit request to all participants.

    2. Transaction commit

      Upon receiving the doCommit request, the participant formally performs the transaction commit and, upon completion of the commit, releases the transaction resources occupied during the entire execution of the transaction.

    3. Feedback transaction commit results

      The participant sends an Ack message to the coordinator after completing the transaction commit.

    4. Complete the transaction

      After receiving Ack messages from all participants, the coordinator completes the transaction.

    Interrupt the transaction

    Entering this phase, the transaction is interrupted if the coordinator is in a normal working state and any participant gives the coordinator a No response, or if the coordinator is unable to receive feedback from all participants after the wait times out.

    1. Send interrupt request

      The coordinator sends abort requests to all participant nodes.

    2. Transaction rollback

      Upon receiving the ABORT request, the participant uses the Undo information recorded in phase 2 to perform the transaction rollback and, upon completion of the rollback, releases the resources occupied during the entire execution of the transaction.

    3. Feedback transaction rollback results

      The participant sends an Ack message to the coordinator after completing the transaction rollback.

    4. Interrupt the transaction

      The coordinator interrupts the transaction after receiving an Ack message from all participants.

    Note: The following two faults may occur in phase 3

    • The coordinator has a problem.
    • The network between the coordinator and the participant fails.

    In either case, the participant ends up not receiving a doCommit or abort request from the coordinator in time; in such exceptional cases, the participant waits for its timeout to expire and then continues by committing the transaction.
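A minimal sketch of a 3PC participant's state machine, highlighting the timeout rule just described: once a participant has reached the pre-committed state, it commits even if the coordinator goes silent, instead of blocking as in 2PC. Class, method, and state names are illustrative:

```python
# Illustrative 3PC participant. The key difference from 2PC: a participant
# that reached the pre-committed state commits on timeout instead of blocking.
class Participant:
    def __init__(self):
        self.state = "init"

    def on_can_commit(self, tx):
        self.state = "ready"          # phase 1: vote Yes, enter preparatory state
        return True

    def on_pre_commit(self, tx):
        self.state = "pre_committed"  # phase 2: execute, log Undo/Redo
        return True                   # Ack

    def on_do_commit(self):
        self.state = "committed"      # phase 3: formal commit

    def on_timeout(self):
        # Coordinator silent in phase 3: continue and commit anyway.
        if self.state == "pre_committed":
            self.state = "committed"

p = Participant()
p.on_can_commit("tx1")
p.on_pre_commit("tx1")
p.on_timeout()    # simulate a coordinator crash after PreCommit
print(p.state)    # → committed
```

This same timeout rule is what causes the data-inconsistency disadvantage below: a pre-committed participant cut off by a network partition commits even when the rest of the cluster aborted.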

2.2.2 Advantages and disadvantages

Advantages:

  • Three-phase commit narrows the participants' blocking scope compared with two-phase commit.
  • Ability to continue to reach agreement after a single point of failure.

Disadvantages:

  • Data inconsistency

    After a participant receives the PreCommit message, if a network partition occurs and the coordinator's node can no longer communicate with the participant, the participant will still commit the transaction once its timeout expires, which inevitably leads to data inconsistency.

2.3 Paxos algorithm

The Paxos algorithm is a message-passing-based consistency algorithm with high fault tolerance, and one of the most effective algorithms for solving the distributed consistency problem. (Paxos will not be covered here; a later article will be devoted to it.)

3. Zookeeper Atomic Broadcast (ZAB) protocol

The ZAB protocol is an atomic broadcast protocol with crash recovery, designed specifically for the distributed coordination service Zookeeper. On top of it, Zookeeper implements a primary/backup architecture to keep data consistent across the replicas in the cluster.

3.1 Core processing process

All transaction requests must be coordinated by a single, globally unique server, called the Leader server; the remaining servers are Follower servers. The Leader server converts a client transaction request into a transaction Proposal and distributes that Proposal to every Follower server in the cluster. The Leader then waits for feedback from the Follower servers; once more than half of them have responded correctly, the Leader sends a Commit message to all Follower servers, asking them to commit the preceding Proposal.

3.2 Protocol introduction

The ZAB protocol comprises two basic modes: crash recovery and message broadcast. When the whole service framework starts up, or when the Leader server suffers a network interruption, crashes, exits, or restarts, the ZAB protocol enters recovery mode and elects a new Leader server. Once a new Leader has been elected and more than half of the machines in the cluster have completed state (data) synchronization with it, the ZAB protocol exits recovery mode and enters message broadcast mode. On receiving a transaction request from a client, the Leader server generates the corresponding transaction proposal and initiates a round of the broadcast protocol; if another machine in the cluster receives a write request from a client, that non-Leader server first forwards the request to the Leader. If a new server joins the cluster, it enters data recovery mode, finds the Leader server, synchronizes data with it, and then takes part in the message broadcast process.

3.2.1 Crash recovery mode

When the entire service framework starts up, or the Leader server crashes, or the Leader loses contact with half of its Followers for network reasons, the ZAB protocol enters crash recovery mode. (Put bluntly, crash recovery mode elects a new Leader and completes data synchronization.)

Two kinds of data inconsistency can arise during crash recovery:

  • The ZAB protocol needs to ensure that transactions that have been committed on the Leader server are eventually committed by all servers.

    Suppose a transaction has been committed on the Leader server: it has received Acks from more than half of the Follower servers, but the Leader crashes before it can send the Commit message to all Follower machines. The ZAB protocol must ensure that this transaction is eventually committed successfully on every server; otherwise the data becomes inconsistent.

  • The ZAB protocol needs to ensure that transactions that are only proposed on the Leader server are discarded.

    Suppose the Leader server, Server1, proposes a transaction and then crashes and exits, so that no other server in the cluster ever receives the proposal. When Server1 recovers and rejoins the cluster, the ZAB protocol must ensure that this transaction is discarded.

The problems above dictate that the ZAB protocol's Leader election algorithm must guarantee that transaction Proposals already committed by the Leader get committed everywhere, while Proposals that were only ever proposed are discarded. If the election algorithm guarantees that the newly elected Leader holds the transaction Proposal with the highest number (i.e., the largest ZXID) among all machines in the cluster, then the new Leader is guaranteed to hold every committed Proposal. Better still, making the machine with the highest ZXID the Leader also saves the new Leader the step of checking which Proposals to commit and which to discard. (For the Leader election process, see In-depth Principles of Zookeeper.)
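The max-ZXID election rule can be illustrated as follows. The (epoch, counter) tuples are a simplification: a real ZXID is a single 64-bit number with the epoch in the high 32 bits and the transaction counter in the low 32 bits, which orders the same way.

```python
# Candidate servers and their last transaction Proposal numbers.
# (epoch, counter) is a simplification of the single 64-bit ZXID.
servers = {
    "server1": (1, 5),
    "server2": (2, 3),
    "server3": (2, 7),  # largest ZXID: holds every committed proposal
}

def elect_leader(zxids):
    # Tuples compare lexicographically: newer epoch wins, then higher counter.
    return max(zxids, key=zxids.get)

print(elect_leader(servers))  # → server3
```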

After the Leader election completes and before the Leader officially starts work, it must also confirm whether every Proposal in its transaction log has been committed by more than half of the machines in the cluster, that is, whether data synchronization has completed.

The following describes the data synchronization process:

The Leader server prepares a queue for each Follower server and sends proposals to the Follower server one by one for transactions that have not been synchronized by each Follower server. Each Proposal message is followed by a Commit message to indicate that the transaction has been committed. After the Follower server synchronizes all its unsynchronized transaction proposals from the Leader server and successfully applies them to the local database, the Leader server adds the Follower server to the actual list of available followers.

3.2.2 Message broadcast mode

The message broadcast process of the ZAB protocol uses an atomic broadcast protocol similar to a two-phase commit, but with differences: ZAB's two-phase process removes the interrupt logic, so every Follower server either positively acknowledges the transaction Proposal proposed by the Leader or abandons the Leader. ZAB also applies the majority rule: the Leader may commit a transaction Proposal once it has received Acks from more than half of the Followers in the cluster, rather than waiting for every Follower to respond.

Since this simplified two-phase commit model cannot deal with the data inconsistency caused by Leader server crash and exit, crash recovery mode is added in ZAB protocol to solve this problem.

During message broadcast, the Leader server generates a corresponding Proposal for each transaction request and broadcasts it. Before broadcasting a transaction Proposal, the Leader first assigns it a globally, monotonically increasing unique ID (the ZXID). The Leader maintains a separate queue for each Follower server, places the transaction Proposals to be broadcast into these queues in turn, and sends messages according to a FIFO policy. On receiving a Proposal, each Follower server writes it to local disk as a transaction log entry and sends an Ack response to the Leader. When the Leader has received Ack responses from more than half of the Followers, it broadcasts a Commit message to all Followers telling them to commit the transaction, and commits the transaction itself at the same time. Each Follower server commits the transaction after receiving the Commit message.
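The broadcast flow above, reduced to its quorum rule, might look like the following sketch. The classes and names are illustrative; the Leader counts its own vote toward the majority, and per-Follower FIFO queues are omitted:

```python
# Quorum-commit sketch of ZAB's broadcast step (illustrative classes/names).
class Follower:
    def __init__(self, alive=True):
        self.alive = alive
        self.log = []         # proposals written to the local transaction log
        self.committed = []

    def on_proposal(self, zxid, tx):
        if self.alive:
            self.log.append((zxid, tx))
            return True       # Ack
        return False

    def on_commit(self, zxid):
        if self.alive:
            self.committed.append(zxid)

def broadcast(followers, zxid, tx):
    acks = sum(f.on_proposal(zxid, tx) for f in followers)
    # The Leader counts itself; commit needs a majority of the full ensemble.
    if acks + 1 > (len(followers) + 1) // 2:
        for f in followers:
            f.on_commit(zxid)
        return "committed"
    return "pending"

# 5-server ensemble (1 Leader + 4 Followers); two Followers are down,
# yet 3 of 5 votes still form a majority, so the proposal commits.
cluster = [Follower(), Follower(), Follower(alive=False), Follower(alive=False)]
print(broadcast(cluster, zxid=1, tx="set x=1"))  # → committed
```

Unlike 2PC, there is no rollback branch here: a Proposal that fails to reach a quorum simply stays pending, and crash recovery handles the rest.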

3.3 Protocol description

The ZAB protocol as a whole consists mainly of two processes, message broadcast and crash recovery, which can be further divided into three phases: Discovery, Synchronization, and Broadcast.

Phase 1: Discovery

This is essentially the Leader election process, used to elect a master process from among multiple distributed processes.

Phase 2: Synchronization

After the discovery process is complete, the synchronization phase is entered. That is, data is synchronized between the Leader server and Follower server.

Phase 3: Broadcast

After completing the synchronization phase, the ZAB protocol can officially start receiving new transaction requests from clients and the message broadcast process.