Zookeeper uses the ZAB protocol to guarantee the eventual consistency of distributed transactions.

Zookeeper Atomic Broadcast (ZAB)

1. ZAB is an atomic broadcast protocol with crash-recovery support, designed specifically for Zookeeper, and is the core algorithm Zookeeper uses to guarantee data consistency. ZAB borrows ideas from the Paxos algorithm, but it is not a general-purpose consensus algorithm.

2. Based on the ZAB protocol, Zookeeper implements a primary/backup architecture to keep the replicas in the cluster consistent: a single main process (the Leader server) receives and processes all client transaction requests (write requests), converts each change to the server's data state into a transaction Proposal, and uses the ZAB atomic broadcast protocol to broadcast it to all Follower processes.

Questions

  • In a Leader/Follower architecture, how is data consistency guaranteed when the Leader crashes?
  • While there is no Leader the cluster cannot process write requests. How can a new Leader be elected quickly?

ZAB process

At its core, ZAB defines how to handle transaction requests that change the data state of the Zookeeper servers.



All transactions must be coordinated by a single, globally unique server called the Leader server; the remaining servers are called Follower servers.

1. The Leader server converts a client transaction request into a transaction Proposal and distributes the Proposal to all the Follower servers in the cluster.

2. The Leader server then waits for feedback from the Follower servers. Once more than half of them have acknowledged the Proposal, the Leader sends a Commit message to all the Follower servers, asking them to commit that Proposal.

ZAB Protocol Overview

The ZAB protocol includes two basic modes: crash recovery and message broadcast.

Message broadcast

When more than half of the Follower servers in the cluster have completed state synchronization with the Leader server, the entire service framework can enter message broadcast mode.

When a server that implements the ZAB protocol starts and joins the cluster while a Leader server is already handling message broadcast, the new server automatically enters data recovery mode: it finds the Leader server, synchronizes its data with it, and then takes part in message broadcast.

ZAB's message broadcast is an atomic broadcast protocol. It resembles two-phase commit (2PC), but differs in how many acknowledgements are required:

1. In two-phase commit, the coordinator must receive an ACK from every participant before it sends the Commit request. All participants either succeed or fail together, which can cause serious blocking problems.

2. In the ZAB protocol, the Leader only needs ACKs from more than half of the Followers; it does not wait for every Follower to respond.
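The difference between the two commit rules can be written as two small predicates. This is only a sketch of the rule as worded in this article; the function names are made up for illustration, and real Zookeeper quorum checks are configurable and count voting members rather than just Followers.

```python
def two_phase_commit_ready(acks: int, participants: int) -> bool:
    """2PC rule: the coordinator needs an ACK from every participant."""
    return acks == participants


def zab_commit_ready(acks: int, followers: int) -> bool:
    """ZAB rule as stated above: ACKs from more than half of the Followers."""
    return 2 * acks > followers
```

With four Followers and one of them slow or crashed, only three ACKs arrive: the 2PC rule keeps blocking, while the ZAB rule already allows the commit.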

Message broadcast process:

1. The client initiates a write request.

2. The Leader converts the client request into a transaction Proposal and assigns each Proposal a transaction ID (Zxid).

3. The Leader keeps one FIFO queue per Follower and puts the Proposals to be broadcast into these queues in order.

4. After receiving a Proposal, the Follower writes it to local disk as a transaction log entry and sends an ACK response to the Leader.

5. After receiving ACK responses from more than half of the Followers, the Leader considers the Proposal accepted and prepares to commit it.

6. The Leader broadcasts a Commit message to all Followers and commits the transaction itself. Each Follower commits the transaction after receiving the Commit message.
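The six steps above can be sketched as a small in-memory simulation. Everything here is hypothetical (the `Leader` and `Follower` classes, the `alive` flag, zxids as plain counters) and there is no networking or disk I/O; it only illustrates the order of Proposal, ACK, and Commit.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Follower:
    name: str
    alive: bool = True
    log: list = field(default_factory=list)        # stands in for the on-disk transaction log
    committed: list = field(default_factory=list)  # zxids committed so far

    def on_proposal(self, zxid, data):
        """Step 4: log the Proposal and ACK. A dead Follower never answers."""
        if not self.alive:
            return False
        self.log.append((zxid, data))
        return True

    def on_commit(self, zxid):
        """Step 6: commit a Proposal that was logged earlier."""
        if self.alive and any(z == zxid for z, _ in self.log):
            self.committed.append(zxid)


class Leader:
    def __init__(self, followers):
        self.followers = followers
        # Step 3: one FIFO queue per Follower.
        self.queues = {f.name: deque() for f in followers}
        self.next_zxid = 1
        self.committed = []

    def write(self, data):
        """Run steps 1-6 for one client write. Returns True if committed."""
        zxid = self.next_zxid  # step 2: assign a zxid to the Proposal
        self.next_zxid += 1
        for f in self.followers:
            self.queues[f.name].append((zxid, data))
        acks = 0
        for f in self.followers:
            queued_zxid, queued_data = self.queues[f.name].popleft()
            if f.on_proposal(queued_zxid, queued_data):
                acks += 1
        if 2 * acks <= len(self.followers):  # step 5: need more than half
            return False
        self.committed.append(zxid)          # step 6: Leader commits...
        for f in self.followers:
            f.on_commit(zxid)                # ...and broadcasts Commit
        return True
```

With three Followers of which one is down, a write still commits because two ACKs are more than half; once a second Follower goes down, `write` returns False and nothing is committed.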



Crash recovery

ZAB enters crash recovery mode when the service framework is starting up, or when the Leader server suffers a network interruption, crashes, or restarts. A new Leader server is then elected.

Once a new Leader server has been elected and more than half of the machines in the cluster have completed state synchronization (data synchronization) with it, the ZAB protocol exits recovery mode.

1. To keep the system operating correctly, the recovery process must produce a new Leader server.

2. The Leader election algorithm must not only let the Leader itself know that it has been elected, but also let all other machines in the cluster quickly learn which server is the new Leader.

ZAB ensures data consistency

ZAB requires that a transaction Proposal successfully processed on one machine is eventually processed on all machines, even if machines crash. To cover the failure cases, the ZAB protocol must guarantee the following two properties:

  • Transactions that have been committed on the Leader server must eventually be committed by all servers. Suppose a transaction is committed on the Leader server after more than half of the Follower servers have acknowledged it, but the Leader crashes before it can send the Commit message to all the Follower machines. ZAB must ensure that the transaction is still committed on every server.

  • Transactions that were only proposed (never committed) on the Leader server must be discarded. Suppose the original Leader, Server1, crashes right after it creates transaction Proposal3, so that no other server in the cluster ever receives Proposal3. When Server1 recovers and rejoins the cluster, the ZAB protocol must ensure that Proposal3 is discarded.

To sum up, ZAB’s elected Leader must meet the following conditions:

It must ensure that Proposals already committed by the previous Leader are committed, and that Proposals that were skipped are discarded. That is:

1. The newly elected Leader must not hold Proposals that were never committed.

2. The newly elected Leader node has the largest ZXID in the cluster.
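"Largest ZXID" has a concrete meaning because of how a zxid is laid out: in Zookeeper it is a 64-bit number whose high 32 bits hold the epoch and whose low 32 bits hold a counter within that epoch. The helper names below are hypothetical; the sketch only shows that comparing zxids as plain integers orders them by epoch first and by counter second.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch (high 32 bits) and a counter (low 32 bits) into one zxid."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def epoch_of(zxid: int) -> int:
    return zxid >> COUNTER_BITS


def counter_of(zxid: int) -> int:
    return zxid & COUNTER_MASK


def most_up_to_date(last_zxids: dict) -> str:
    """Return the server with the largest last zxid (ties broken by name)."""
    return max(last_zxids, key=lambda name: (last_zxids[name], name))
```

A server whose last transaction is the first one of epoch 2 is therefore ahead of a server that logged hundreds of transactions in epoch 1.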

How does ZAB synchronize data

Every properly running server either becomes the Leader, or becomes a Follower and stays in sync with the Leader.

1. Once Leader election is complete (the new Leader has the highest ZXID), and before it officially starts work (accepting client requests), the Leader server first confirms that every Proposal in its transaction log has been committed by more than half of the machines in the cluster, that is, that data synchronization has completed.

2. The Leader server must ensure that every Follower server receives each transaction Proposal and correctly applies all committed Proposals to its in-memory data. Only after a Follower has fetched all the Proposals it was missing from the Leader and applied them to its local database does the Leader add it to the list of truly available Followers and continue with the rest of the process.

ZAB runtime states

In ZAB protocol design, each process may be in one of the following three states:

  • LOOKING: the Leader election state. The node is looking for a Leader.
  • FOLLOWING: the node is a Follower and keeps itself synchronized with the Leader server.
  • LEADING: the node is the Leader and acts as the main process.

ZAB state switch

1. All processes are in the LOOKING state initially, and no Leader exists.

2. Next, the processes try to elect a new Leader, and the elected process switches to the LEADING state. When the other processes find that a new Leader has been elected, they switch to the FOLLOWING state and start synchronizing with the Leader.

3. The process in the FOLLOWING state is called Follower, and the process in the LEADING state is called Leader.

4. When the Leader crashes or gives up its leadership position, other Follower processes will switch to the LOOKING state and start a new round of Leader election.
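The transitions described above fit in a small lookup table. The three state names come from the text; the event names (`won_election`, `found_leader`, `lost_leader`, `gave_up_leadership`) are invented for this sketch and are not Zookeeper identifiers.

```python
from enum import Enum


class State(Enum):
    LOOKING = "LOOKING"
    FOLLOWING = "FOLLOWING"
    LEADING = "LEADING"


# (current state, event) -> next state. Event names are illustrative only.
TRANSITIONS = {
    (State.LOOKING, "won_election"): State.LEADING,
    (State.LOOKING, "found_leader"): State.FOLLOWING,
    (State.FOLLOWING, "lost_leader"): State.LOOKING,
    (State.LEADING, "gave_up_leadership"): State.LOOKING,
}


def next_state(state: State, event: str) -> State:
    """Return the state after `event`; unknown combinations leave it unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Note that there is no direct edge between FOLLOWING and LEADING: a process always passes through LOOKING, that is, through an election.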

A Follower synchronizes with only one Leader. The Leader process and all Follower processes use heartbeats to monitor each other's status.

1. If the Leader receives a Follower's heartbeats within the timeout period, that Follower stays connected to the Leader.

2. If the Leader fails to receive heartbeats from more than half of the Follower processes within the specified period, or its TCP connections are dropped, the Leader gives up its leadership and switches to the LOOKING state. The Followers likewise abandon that Leader and switch to the LOOKING state, and a new round of Leader election begins.
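The Leader-side check in point 2 could look roughly like the function below. It is a sketch under simple assumptions: `last_heartbeat` maps each Follower to the time its last heartbeat was seen, times are plain numbers in seconds, and the function name and parameters are hypothetical.

```python
def leader_has_quorum(last_heartbeat: dict, now: float, timeout: float) -> bool:
    """True while more than half of the Followers were heard from within `timeout`."""
    live = sum(1 for seen in last_heartbeat.values() if now - seen <= timeout)
    return 2 * live > len(last_heartbeat)
```

When this returns False, the Leader in the description above would step down and move to the LOOKING state.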

The four stages of ZAB

Election phase

At the beginning, all nodes are in the election phase. A node that receives more than half of the votes becomes the prospective Leader. The prospective Leader becomes the real Leader only after the third phase (synchronization) completes.

The purpose of this phase is to select a prospective Leader and then move on to the next phase.

Discovery phase

During this phase, the Followers communicate with the prospective Leader chosen in the election phase and send it the transaction Proposals they have most recently accepted. The main purpose of this phase is to find the latest Proposal history accepted by a majority of nodes. The prospective Leader also generates a new epoch, which the Followers accept and record as their acceptedEpoch.

A Follower connects to only one Leader. If a node F believes that another Follower P is the Leader, P rejects F's connection attempt, and F then returns to the election phase.
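What the prospective Leader computes during discovery can be sketched as follows. The reply format (a dict with `accepted_epoch`, `last_zxid`, and `history`) is invented for this example, and two simplifications are assumed: the new epoch is one more than the largest acceptedEpoch reported, and the freshest history is simply the one with the largest last zxid.

```python
def discover(replies):
    """Given replies from a quorum of Followers, return (new_epoch, latest_history).

    Each reply is a dict with keys 'accepted_epoch', 'last_zxid', and 'history'
    (a hypothetical format used only in this sketch).
    """
    # The new epoch must be larger than any epoch a Follower has accepted so far.
    new_epoch = max(r["accepted_epoch"] for r in replies) + 1
    # Adopt the most recent Proposal history seen among the replies.
    freshest = max(replies, key=lambda r: r["last_zxid"])
    return new_epoch, list(freshest["history"])
```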



Synchronization phase

In the synchronization phase, all replicas in the cluster are brought up to date with the latest Proposal history that the Leader obtained in the previous phase.

The prospective Leader becomes the real Leader only once a quorum (more than half of the nodes) has synchronized. Followers accept only Proposals whose zxid is larger than their own lastZxid.



Broadcast phase

In this phase the Zookeeper cluster can officially serve transactions and the Leader can broadcast messages. If a new node joins during this phase, it must first be synchronized. Note that, unlike 2PC, committing a transaction in ZAB does not require an ACK from every Follower, only ACKs from a quorum (more than half of the nodes).



ZAB protocol implementation

The Java implementation of the ZAB protocol differs slightly from the definition above. Its election phase uses Fast Leader Election (FLE), which also takes over the work of the discovery phase: because FLE elects the node with the latest Proposal history as Leader, the separate step of discovering the latest Proposal is no longer needed.

The implementation also merges the discovery and synchronization phases into a single recovery phase, so the implemented protocol has only three phases: Fast Leader Election, recovery, and broadcast.

Fast Leader Election

As mentioned above, FLE elects the node with the latest Proposal history (the largest lastZxid) as the Leader, which removes the need to discover the latest Proposal separately. This relies on the premise that the node with the most recent Proposal also has the most recent commit record.

Requirements for becoming a Leader:

1. The node with the largest epoch is preferred.

2. If the epochs are equal, the node with the largest zxid is preferred.

3. If both the epoch and the zxid are equal, the node with the largest server_id is preferred (the server's myid, which matches its server.N entry in zoo.cfg).

When the election starts, every node votes for itself by default. When a node receives votes from other nodes, it compares them using the conditions above, changes its own vote if the received vote is better, and resends its vote to the other nodes. When a node has received more than half of the votes, it sets its state to LEADING, and the other nodes set their state to FOLLOWING.
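The comparison rule and the vote counting can be sketched in a few lines. This is a heavily simplified, single-round model in which every node sees every other node's vote; the real Fast Leader Election exchanges notifications over the network across multiple rounds. `Vote` and `elect` are hypothetical names.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Vote(NamedTuple):
    # Field order matters: tuples compare by epoch, then zxid, then server_id,
    # which is exactly the preference order listed above.
    epoch: int
    zxid: int
    server_id: int


def elect(nodes: list) -> Optional[int]:
    """Return the server_id that wins more than half of the votes, or None."""
    # Every node starts by voting for itself.
    ballots = {n.server_id: n for n in nodes}
    # Each node switches its vote whenever it sees a better candidate.
    for voter in ballots:
        for candidate in nodes:
            ballots[voter] = max(ballots[voter], candidate)
    tally = Counter(vote.server_id for vote in ballots.values())
    winner, count = tally.most_common(1)[0]
    return winner if 2 * count > len(nodes) else None
```

For example, among servers 1, 2 and 3 in the same epoch, where 2 and 3 share the largest zxid, server 3 wins on server_id; a server in a newer epoch beats any zxid from an older epoch.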



Recovery Phase

In this phase each Follower sends its lastZxid to the Leader, and the Leader decides how to synchronize that Follower based on the value. The implementation differs from the synchronization phase described earlier: a Follower that receives the TRUNC command discards the Proposals it holds beyond the Leader's last committed zxid, and a Follower that receives the DIFF command accepts the new Proposals it is missing.

The Leader's decision uses two values from its history:

  • history.lastCommittedZxid: the zxid of the most recently committed Proposal.
  • history.oldThreshold: the zxid of a committed Proposal that is considered too old.
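One plausible way the Leader could turn these values into a decision is shown below. The TRUNC and DIFF branches follow the text; the SNAP branch (send a full snapshot when the Follower is older than oldThreshold) is an assumption based on published descriptions of the implementation and is not stated in this article. The function name is hypothetical.

```python
def choose_sync(follower_last_zxid: int, last_committed_zxid: int, old_threshold: int) -> str:
    """Pick the synchronization command for one Follower."""
    if follower_last_zxid > last_committed_zxid:
        # The Follower holds Proposals the Leader never committed: cut them off.
        return "TRUNC"
    if follower_last_zxid < old_threshold:
        # Assumption: too far behind to replay individual Proposals.
        return "SNAP"
    # The Follower is behind but recent enough: send only what it is missing.
    return "DIFF"
```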



Broadcast Phase

See the "Message broadcast" section above.

Similarities and differences between ZAB and Paxos

Similarities

1. Both have a role similar to the Leader process, which coordinates the running of multiple Follower processes.

2. In both, the Leader process waits for correct feedback from more than half of the Followers before committing a proposal.

3. In ZAB, each Proposal contains an epoch value, which identifies the current Leader's term. Paxos has the same notion, called the Ballot.

Differences

1. In the Paxos algorithm, a newly elected main process works in two stages. The first is the read stage: the new main process communicates with the other processes to collect the proposals made by the previous main process and commits them. The second is the write stage: the current main process begins to make its own proposals.

2. The ZAB protocol adds a synchronization phase on top of this. In this phase, the new Leader makes sure that more than half of the Followers have committed all the Proposals from the previous Leader's term. This guarantees that all earlier transaction Proposals have been committed before the Leader issues any Proposal in the new term.

Generally speaking, the essential difference between the ZAB protocol and the Paxos algorithm lies in their design goals: ZAB is mainly used to build a highly available distributed primary/backup data system, while Paxos is used to build a distributed, consistent state machine system.

Conclusion

Answers to the questions:

1. In a Leader/Follower architecture, how is data consistency guaranteed when the Leader crashes?

After the Leader crashes, the cluster elects a new Leader and enters the recovery phase. The new Leader holds all committed Proposals, so it makes the Followers synchronize the committed Proposals and discard the uncommitted ones, using the Leader's own record as the reference. This keeps data consistent across the cluster.

2. While there is no Leader the cluster cannot process write requests. How can a new Leader be elected quickly?

This is what Fast Leader Election is for. A candidate needs votes from only more than half of the nodes, so a Leader can be elected as soon as a majority agrees, without waiting for every node to vote.

Source | urlify.cn/RRziQf