Understanding the ZooKeeper data synchronization process in one article

In a ZooKeeper cluster, each client connects to a randomly chosen node. A read request is served directly from that node. A write request is forwarded to the leader, which submits it as a transaction: the leader broadcasts the transaction, and once more than half of the nodes have written it successfully, the write is committed (a 2PC-like transaction). The whole process is shown in the figure below:

The whole data synchronization mechanism is implemented by the ZAB protocol. Let’s focus on how the ZAB protocol works.

Introduction to the `ZAB` protocol

ZooKeeper Atomic Broadcast (ZAB) is an atomic broadcast protocol with crash-recovery support, designed specifically for ZooKeeper. ZooKeeper relies on the ZAB protocol to achieve distributed data consistency. ZAB has two basic modes: crash recovery and message broadcast.

When the whole cluster starts up, or when the leader crashes or loses its network connection, ZAB enters recovery mode and elects a new leader. Once a leader has been elected and more than half of the machines in the cluster have finished synchronizing their data with it, ZAB exits recovery mode. Data synchronization here ensures that more than half of the machines in the cluster hold the same state as the leader.

When more than half of the followers have completed synchronization with the leader, the cluster enters broadcast mode. If a new server joins the cluster while the leader is working normally, that server goes straight into data recovery mode and synchronizes its data with the leader. After synchronization is complete, it can process non-transactional requests normally.

The leader can process both transactional and non-transactional requests. A follower can only process non-transactional requests. If a follower receives a transactional request, it forwards the request to the leader.
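The routing rule described above can be sketched in a few lines. This is only an illustration of the rule, not ZooKeeper's actual request pipeline; the function name `route` and the returned strings are hypothetical names introduced for this example.

```python
def route(role, is_transactional):
    """Decide where a request is processed.

    `role` is "leader" or "follower"; both the names and the return
    values are illustrative and do not come from ZooKeeper's code base.
    """
    if not is_transactional:
        # Reads are served by whichever node the client is connected to.
        return "local"
    if role == "leader":
        # The leader handles transactional requests itself.
        return "local"
    # A follower hands every transactional request over to the leader.
    return "forward_to_leader"


for role in ("leader", "follower"):
    for is_write in (False, True):
        print(role, "write" if is_write else "read", "->", route(role, is_write))
```

Only the follower/write combination leaves the node that received the request; everything else is handled locally.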

Message broadcast implementation principle

The message broadcast process is actually a simplified version of the two-phase commit process, which is as follows:

When the leader receives a message request, it assigns the message a globally unique, monotonically increasing 64-bit ID (`zxid`).
The leader keeps one FIFO queue per follower (implemented over TCP, which provides the global ordering property) and uses it to send the message, tagged with its `zxid`, to every follower as a proposal.
When a follower receives the proposal, it first writes the proposal to disk and then replies to the leader with an `ACK`.
Once the leader has received a legal number of `ACK`s (from more than half of the nodes), it sends a `COMMIT` command to these followers and executes the message locally.
When a follower receives the `COMMIT` command, it commits the message.
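The five steps above can be sketched as a small in-memory simulation. It is a simplified model under several assumptions: the `Leader` and `Follower` classes and their methods are hypothetical, the per-follower FIFO queues are replaced by direct method calls, writing to disk is modeled as appending to a list, and the leader's own log write is counted as one ACK toward the quorum.

```python
class Follower:
    def __init__(self):
        self.log = []        # persisted proposals (stands in for the on-disk log)
        self.committed = []  # zxids applied after a COMMIT

    def on_proposal(self, zxid, data):
        self.log.append((zxid, data))  # step 3: persist first, then ACK
        return True

    def on_commit(self, zxid):
        self.committed.append(zxid)    # step 5: commit the message


class Leader:
    def __init__(self, followers, epoch=1):
        self.followers = followers
        self.epoch = epoch
        self.counter = 0
        self.log = []
        self.committed = []

    def broadcast(self, data, reachable=None):
        """Propose `data`; return its zxid if committed, otherwise None."""
        reachable = self.followers if reachable is None else reachable
        self.counter += 1
        zxid = (self.epoch << 32) | self.counter   # step 1: assign a zxid
        self.log.append((zxid, data))
        acked = [f for f in reachable if f.on_proposal(zxid, data)]  # step 2
        acks = 1 + len(acked)  # assumption: the leader's own write counts
        cluster_size = len(self.followers) + 1
        if acks <= cluster_size // 2:              # step 4: need more than half
            return None
        for f in acked:
            f.on_commit(zxid)
        self.committed.append(zxid)
        return zxid


followers = [Follower() for _ in range(4)]
leader = Leader(followers)
print(leader.broadcast("a") is not None)                     # all followers reachable
print(leader.broadcast("b", reachable=followers[:1]) is None)  # only 2 of 5 nodes
```

In the second call only two of the five nodes hold the proposal, so it stays uncommitted, which is exactly the situation the crash recovery mode has to clean up.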

About ZXID

The zxid is ZooKeeper’s transaction ID. To guarantee a consistent order of transactions, ZooKeeper identifies each transaction with a monotonically increasing zxid. Every proposal carries its zxid from the moment it is created.

A zxid is a 64-bit number. The high 32 bits hold the epoch, which indicates whether the leadership has changed. The low 32 bits hold an incrementing counter.
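The 64-bit layout can be made concrete with a little bit arithmetic. The helper names below (`make_zxid`, `epoch_of`, `counter_of`) are made up for this sketch and are not ZooKeeper APIs.

```python
EPOCH_SHIFT = 32
COUNTER_MASK = (1 << 32) - 1


def make_zxid(epoch, counter):
    # High 32 bits: epoch. Low 32 bits: counter.
    return (epoch << EPOCH_SHIFT) | (counter & COUNTER_MASK)


def epoch_of(zxid):
    return zxid >> EPOCH_SHIFT


def counter_of(zxid):
    return zxid & COUNTER_MASK


zxid = make_zxid(3, 5)
print(hex(zxid))                          # 0x300000005
print(epoch_of(zxid), counter_of(zxid))   # 3 5
# Any zxid from a newer epoch is larger than every zxid from an older one.
print(make_zxid(2, 0) > make_zxid(1, COUNTER_MASK))  # True
```

Because the epoch occupies the high bits, comparing two zxids as plain integers compares the epoch first and the counter second.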

The epoch can be understood as the era, or reign, of the current cluster. Each leader is like an emperor with its own reign title: every change of dynasty, that is, every leader change, adds one to the previous epoch. As a result, after an old leader crashes and comes back, no one listens to it any more, because followers only obey commands from the leader of the current epoch.
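The "followers only listen to the current era" rule can be illustrated with a toy follower that remembers the latest epoch it has seen. This is a deliberately simplified sketch: the `EpochAwareFollower` class and its methods are hypothetical, and the real handshake between a ZooKeeper follower and a new leader involves more steps than shown here.

```python
class EpochAwareFollower:
    def __init__(self, current_epoch):
        self.current_epoch = current_epoch
        self.accepted = []

    def on_new_leader(self, epoch):
        # A newly elected leader announces a larger epoch.
        if epoch > self.current_epoch:
            self.current_epoch = epoch

    def on_proposal(self, zxid):
        proposal_epoch = zxid >> 32  # the epoch lives in the high 32 bits
        if proposal_epoch < self.current_epoch:
            return False             # sent by a deposed leader: ignore it
        self.accepted.append(zxid)
        return True


f = EpochAwareFollower(current_epoch=1)
print(f.on_proposal((1 << 32) | 1))  # True: epoch 1 is current
f.on_new_leader(2)
print(f.on_proposal((1 << 32) | 2))  # False: the old leader is ignored
print(f.on_proposal((2 << 32) | 1))  # True: the new leader is obeyed
```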

Crash recovery implementation principle

We have covered the message broadcast process of the ZAB protocol, which works well under normal conditions. However, if the leader crashes, or loses contact with more than half of the followers because of network problems, the ZooKeeper cluster enters crash recovery mode. In this state, ZAB has to do two things: elect a new leader and synchronize data.

If the leader loses contact with more than half of the followers, there may be a network partition between it and those followers. In that case, it is no longer a legitimate leader.

As mentioned in the message broadcast section, ZAB’s broadcast mechanism is a simplified version of 2PC: the leader commits once more than half of the nodes in the cluster have responded with an ACK. On its own, however, this mechanism cannot handle the data inconsistency that arises when the leader crashes. The crash recovery mode was added to the ZAB protocol to solve this problem.

Crash recovery in ZAB requires that if a transaction proposal has been successfully processed on one machine, it must eventually be processed successfully on all machines, even in the presence of failures. To see how this is achieved, let’s first consider the scenarios that can cause data inconsistency in ZooKeeper, and how ZAB’s crash recovery should handle them.

Messages that have been processed cannot be lost

After the leader receives a legal number of ACKs from the followers, it broadcasts a COMMIT command to each follower, executes the COMMIT locally, and returns “Success” to the connected client. But if the leader crashes before the followers receive the COMMIT command, the remaining servers never execute the message.

A typical example is message C2 in the figure. At some point during the normal operation of the cluster, Server1 is the leader and broadcasts the messages P1, P2, C1, P3, and C2 in turn. Immediately after sending C2 (the commit of Proposal2), the leader crashes and exits. In this case, ZAB must make sure that Proposal2 is eventually committed on all servers; otherwise the data becomes inconsistent.

Discarded messages cannot reappear

Suppose the leader crashes right after receiving a message request and generating a proposal, so that no follower ever receives the proposal. The message is then skipped once a new leader has been elected in recovery mode. When the crashed leader later restarts and rejoins as a follower, it still holds the proposal for the skipped message, which is inconsistent with the state of the rest of the system and must be deleted.
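This clean-up can be sketched as a comparison between the returning server's log and the new leader's log. It is a rough model rather than ZooKeeper's actual synchronization logic: the `resync_with_leader` function is hypothetical, and logs are reduced to plain lists of zxids.

```python
def zxid(epoch, counter):
    return (epoch << 32) | counter


def resync_with_leader(local_log, leader_log):
    """Return (new_local_log, dropped) for a server rejoining as a follower.

    Entries the new leader does not have were never committed, so they
    are dropped; afterwards the local log mirrors the leader's log.
    """
    leader_entries = set(leader_log)
    dropped = [z for z in local_log if z not in leader_entries]
    return list(leader_log), dropped


# The old leader (epoch 1) logged proposal 1:3 but crashed before sending it.
old_leader_log = [zxid(1, 1), zxid(1, 2), zxid(1, 3)]
# The new leader (epoch 2) never saw 1:3 and has since logged 2:1.
new_leader_log = [zxid(1, 1), zxid(1, 2), zxid(2, 1)]

new_log, dropped = resync_with_leader(old_leader_log, new_leader_log)
print([hex(z) for z in dropped])   # ['0x100000003']
print(new_log == new_leader_log)   # True
```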

The solution

To satisfy both requirements above, ZAB needs a leader election algorithm that guarantees two things: proposals already committed by the leader are eventually committed everywhere, and proposals that were skipped are discarded.

If the leader election algorithm guarantees that the newly elected leader holds the transaction proposal with the highest number (the largest ZXID) among all machines in the cluster, then the new leader is guaranteed to hold every committed proposal. This is because, before any proposal is COMMITted, more than half of the followers must have returned an ACK, which means more than half of the nodes have the proposal in their transaction logs. Therefore, as long as a legal number of nodes are working normally, at least one of them holds the proposal state of every COMMITted message.

In addition, the zxid is 64 bits long. The high 32 bits are the epoch number: each time a new leader is elected, the new leader increments the epoch by 1. The low 32 bits are a message counter, which is incremented by 1 for each message received and is reset to 0 after a new leader is elected. The advantage of this design is that when an old leader restarts after a crash, it will not be elected leader, because its zxid is certainly smaller than that of the current leader. When the old leader connects to the new leader as a follower, the new leader makes it clear all proposals from the old epoch that were never COMMITted.
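The election rule and the epoch bump can be sketched as follows. This is a minimal illustration of "pick the server with the largest zxid", not the real election code; the function names are hypothetical, and breaking ties by the larger server id is an assumption of this sketch.

```python
def elect_leader(last_zxids):
    """Pick a leader from {server_id: last logged zxid}.

    The server with the largest zxid wins; ties go to the larger
    server id (an assumption made for this example).
    """
    return max(last_zxids, key=lambda sid: (last_zxids[sid], sid))


def first_zxid_of_next_epoch(last_zxid):
    # Epoch + 1 in the high 32 bits, counter reset to 0 in the low 32 bits.
    return ((last_zxid >> 32) + 1) << 32


# Three surviving servers; servers 2 and 3 both logged proposal 1:3.
votes = {1: (1 << 32) | 2, 2: (1 << 32) | 3, 3: (1 << 32) | 3}
winner = elect_leader(votes)
print(winner)                                        # 3
print(hex(first_zxid_of_next_epoch(votes[winner])))  # 0x200000000
```

Since any committed proposal sits in the logs of more than half of the nodes, and any working quorum also contains more than half of the nodes, the two groups overlap, so the server with the largest zxid in the quorum has seen every committed proposal.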

The source code of leader election is analyzed in detail in the next section.
