ZooKeeper uses a consistency protocol called ZAB (ZooKeeper Atomic Broadcast) rather than using the Paxos algorithm.

For more information on the Paxos protocol, check out my previous post understanding the Paxos Protocol: A Brief introduction to Distributed Consistency Protocols

ZAB protocol of ZooKeeper

ZAB protocol is a crash recovery atomic broadcast protocol specially designed for distributed coordination service ZooKeeper.

ZooKeeper uses a single master process to receive and process all transaction requests from the client, and uses ZAB atomic broadcast protocol to broadcast state changes of server data to all replica processes in the form of transaction proposals.

The core of ZAB protocol is to define how to handle requests for things that will change the data state of ZooKeeper server, namely:

Therefore, transaction requests must be coordinated by a globally unique server, called the Leader server, and the rest of the servers are called followers servers. The Leader server converts a client transaction request into a transaction Proposal (Proposal) and distributes this Proposal to all Follower servers in the cluster.

After the Leader server waits for correct feedback from all followers, the Leader server will send a Commit message to the followers again, indicating that the previous proposal will be submitted.

Protocol is introduced

The ZAB protocol includes two basic modes: crash recovery and message broadcast.

ZAB will enter the crash recovery mode and elect a new Leader server when the whole service framework is started, or the Leader server crashes, the network is interrupted, the Leader server restarts, or there is no normal communication between more than half of the servers in the cluster and the Leader server.

When a new Leader server is elected and more than half of the machines in the cluster have completed state synchronization (data synchronization) with the Leader server, the ZAB protocol exits the recovery mode.

Then it enters the message broadcast mode. If a Leader is in charge of message broadcast when a ZAB protocol Leader joins the cluster at this time, the server will consciously enter the data recovery mode and join the message broadcast process after completion.

ZooKeeper is designed to allow only a single Leader server to handle transaction requests. After receiving the request of the client, the Leader server will generate the corresponding proposal and initiate a round of broadcast. If another non-Leader machine receives a transaction request from the client, it forwards the transaction request to the Leader server.

News broadcast

ZAB’s message broadcast process is an atomic broadcast protocol, similar to a 2PC process.

But the whole process is slightly different from 2PC. In the two-stage submission process of ZAB, the interrupt logic is removed, and all Follower servers either normally feedback the Proposal put forward by the Leader or abandon the Leader server. At the same time, after receiving an ACK from more than half of the followers, the Leader server can start to submit the Proposal without waiting for feedback from all the followers.

Of course, this simplified two-phase commit model cannot deal with data inconsistency caused by Leader server crash and exit, so crash recovery mode is added in ZAB protocol to solve this problem.

In addition, the whole message broadcast protocol is based on THE TCP protocol with FIFO characteristics for network communication, so it is easy to ensure the sequence of message receiving and sending in the process of message broadcast.

During message broadcast, the Leader server will assign a globally monotonically increasing and unique ID to each transaction Proposal, called the transaction ID (i.e., ZXID). Each Proposal must be dealt with in order of ZXID. The Leader server will allocate a separate queue for each Follower server, and then put the Proposal that needs to be broadcast into the queue, and send messages according to THE FIFO. After receiving the Proposal, the Follower writes it to the local disk as a transaction log and sends an ACK to the Leader. After the Leader receives half of the acks sent by the followers, it broadcasts a COMMIT and commits the thing itself.

Crash recovery

Once the Leader server crashes or loses contact with half of the followers due to network problems, the Leader server enters the crash recovery mode.

In the whole process, an efficient Leader election algorithm is needed to quickly elect a new Leader server, and at the same time, other servers should quickly perceive the newly generated Leader.

Crash recovery consists of two parts: Leader election and data recovery

ZAB protocol crash recovery must meet the following two requirements:

  • The crash recovery process needs to ensure that the Proposal already submitted by the Leader can also be submitted by all proposals
  • The crash recovery process requires discarding things that were raised but not committed by the Leader

Therefore, the elected Leader must also meet the following requirements:

  • The Leader node has the largest ZXID
  • Cannot contain unsubmitted proposals (i.e. the new Leader is the original Follower node that has submitted all proposals)

After completing the Leader election, the Leader server will first confirm whether all proposals in the transaction log have been submitted by more than half of the machines in the cluster, that is, whether data synchronization has been completed.

The Leader server needs to ensure that all the Follower servers can receive each Proposal and correctly apply all submitted proposals to the internal database. The Leader server will pass the queue to the previous question. The uncommitted objects in each Follower queue are sent to the followers in the form of a Proposal message, which is followed by a COMMIT to indicate that the Proposal has been committed.

Let’s see how the ZAB protocol deals with things that are discarded. In the design of the event number ZXID, ZXID is a 64-bit number. The lower 32 bits can be regarded as a simple counter, while the higher 32 bits represent the number of the Leader cycle epoch. Every time a Leader is changed, 1 will be added to the original epoch, so that after the old Leader is restored, The other followers will not listen to it, because the followers obey only the Leader command of the epoch.

The relation and difference between ZAB and Paxos algorithm

The ZAB protocol is not a typical implementation of the Paxos algorithm.

  • Both have a Leader process role that coordinates the execution of multiple Follower processes
  • The Leader process waits for more than half of the followers to give correct feedback before submitting a proposal
  • Each of ZAB’s proposals contains an epoch, which in Paxos corresponds to Ballot, with the same meaning

In Paxos algorithm, a newly elected master process will work in two stages. First, the new Leader will communicate with other processes to collect and submit the proposals put forward by the previous Leader, and then start to put forward its own proposals. ZAB adds a synchronization phase on top of Paxos.

The design objectives of Paxos algorithm and ZAB protocol are also different in essence:

  • ZAB–> Highly available distributed data primary/secondary system
  • Paxos–> Distributed consistent state machine system

The resources

  • From Paxos to ZooKeeper: Distributed Consistency Principles and Practices