In a Zookeeper cluster, each client connects to a randomly chosen node. A read request is served directly from that node; a write request is forwarded to the leader, which commits it as a transaction. The leader broadcasts the transaction, and once more than half of the nodes have written it successfully, the write request is committed (a 2PC-like transaction). The whole process is shown in the figure below:
The whole data synchronization mechanism is implemented by the ZAB protocol. Let's focus on how the ZAB protocol works.
ZAB
Introduction to the protocol
Zookeeper Atomic Broadcast (ZAB) is an Atomic Broadcast protocol specially designed for Zookeeper to support crash recovery. ZooKeeper relies on ZAB protocol to achieve distributed data consistency. The ZAB protocol consists of two basic modes, crash recovery and message broadcast.
When the whole cluster starts up, or when the leader node crashes or loses its network connection, ZAB enters recovery mode and elects a new leader. Once a leader has been elected and more than half of the machines in the cluster have completed synchronization with it (that is, data synchronization, which ensures that more than half of the machines are consistent with the leader), the ZAB protocol exits recovery mode.
When more than half of the followers in the cluster have completed synchronization with the leader, the cluster enters broadcast mode. If a new server joins the cluster while the leader is working normally, that server goes directly into data recovery mode and synchronizes its data with the leader. After synchronization is complete, it can process non-transactional requests normally.
The leader node can process both transactional and non-transactional requests. A follower node can only process non-transactional requests. If a follower receives a transactional request, it forwards the request to the leader.
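The routing rule can be written down in a few lines. This is only an illustrative sketch: the `Server` class and the `processing_server` function are hypothetical names made up for this example, not part of Zookeeper.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    role: str  # "leader" or "follower"


def processing_server(receiver: Server, leader: Server, transactional: bool) -> Server:
    """Return the server that ends up processing a client request."""
    if transactional and receiver.role != "leader":
        # A follower may not process a transactional (write) request,
        # so it hands the request over to the leader.
        return leader
    # Non-transactional (read) requests are served by whichever node
    # received them; the leader handles both kinds itself.
    return receiver


leader = Server("server1", "leader")
follower = Server("server2", "follower")

# A write that arrives at a follower is processed by the leader,
# while a read stays on the follower.
write_target = processing_server(follower, leader, transactional=True)
read_target = processing_server(follower, leader, transactional=False)
```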
Message broadcast implementation principle
The message broadcast process is actually a simplified version of the two-phase commit process, which is as follows:
- Upon receiving a message request, the leader assigns the message a globally unique, auto-incrementing 64-bit ID (zxid).
- The leader keeps one FIFO queue for each follower (implemented over the TCP protocol, which provides the global ordering property) and sends the message carrying the zxid as a proposal to all followers.
- When a follower receives the proposal, it first writes the proposal to disk and then responds to the leader with an ack.
- When the leader has received a legitimate number of ACKs (from more than half of the nodes), it sends a commit command to these followers and executes the message locally.
- When a follower receives the commit command, it commits the message.
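The steps above can be simulated in memory. This is a minimal sketch under some assumptions of my own: the `Leader` and `Follower` classes are hypothetical, a Python list stands in for the disk log, the leader's own write is counted as one ACK, and the per-follower FIFO queue over TCP is modeled simply by delivering proposals in call order. It is not Zookeeper's implementation.

```python
class Follower:
    def __init__(self, name):
        self.name = name
        self.log = []        # proposals written "to disk" (a list stands in for the disk)
        self.committed = []  # zxids that have been committed

    def on_proposal(self, zxid, payload):
        self.log.append((zxid, payload))  # persist first ...
        return True                       # ... then ACK

    def on_commit(self, zxid):
        self.committed.append(zxid)


class Leader:
    def __init__(self, followers, epoch=1):
        self.followers = followers
        self.epoch = epoch
        self.counter = 0
        self.committed = []

    def broadcast(self, payload, reachable=None):
        """Propose a message; return its zxid if committed, else None."""
        names = {f.name for f in self.followers}
        reachable = names if reachable is None else set(reachable)

        # Step 1: assign an increasing 64-bit zxid (epoch high, counter low).
        self.counter += 1
        zxid = (self.epoch << 32) | self.counter

        # Steps 2-3: send the proposal and collect ACKs.
        acks = 1  # assumption: the leader's own write counts as one ACK
        for f in self.followers:
            if f.name in reachable and f.on_proposal(zxid, payload):
                acks += 1

        # Step 4: commit only if more than half of the nodes ACKed.
        cluster_size = len(self.followers) + 1
        if acks <= cluster_size // 2:
            return None
        self.committed.append(zxid)

        # Step 5: followers commit when they receive the commit command.
        for f in self.followers:
            if f.name in reachable:
                f.on_commit(zxid)
        return zxid


f1, f2 = Follower("f1"), Follower("f2")
leader = Leader([f1, f2])
z1 = leader.broadcast("set x=1")                    # all 3 nodes ACK
z2 = leader.broadcast("set x=2", reachable=["f1"])  # 2 of 3 ACK: still a quorum
z3 = leader.broadcast("set x=3", reachable=[])      # 1 of 3 ACK: not committed
```

Note that the sketch does not catch up a follower that missed a proposal; in the protocol that is the job of data synchronization.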
About ZXID
The zxid is Zookeeper's transaction ID. To guarantee that transactions are applied in a consistent order, Zookeeper identifies each transaction with an increasing transaction ID (zxid). Every proposal has its zxid attached at the time it is made.
The zxid is a 64-bit number. The higher 32 bits are the epoch, which indicates whether the leadership has changed. The lower 32 bits are an incrementing counter.
The epoch can be understood as the era or period of the current cluster. Each leader is like an emperor with its own reign title, so every time the dynasty changes, that is, after a leader change, the epoch is incremented by one from the previous era. As a result, after an old leader crashes and comes back, nobody listens to it, because followers only obey commands from the leader of the current era.
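The 64-bit layout is easy to express with bit operations. The helper names below (`make_zxid`, `epoch_of`, `counter_of`) are made up for this illustration; only the split into a high 32-bit epoch and a low 32-bit counter comes from the text.

```python
EPOCH_SHIFT = 32
COUNTER_MASK = (1 << 32) - 1  # the lower 32 bits


def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch (high 32 bits) and a counter (low 32 bits) into one zxid."""
    return (epoch << EPOCH_SHIFT) | (counter & COUNTER_MASK)


def epoch_of(zxid: int) -> int:
    return zxid >> EPOCH_SHIFT


def counter_of(zxid: int) -> int:
    return zxid & COUNTER_MASK


zxid = make_zxid(epoch=2, counter=5)

# Because the epoch sits in the high bits, any zxid from a newer epoch
# compares greater than every zxid from an older epoch.
newer_epoch_wins = make_zxid(3, 0) > make_zxid(2, COUNTER_MASK)
```

This ordering is why a plain integer comparison of two zxids is enough to tell which one belongs to the more recent leader.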
Crash recovery implementation principle
We have already covered the message broadcast process in the ZAB protocol. Under normal conditions it works well. However, if the leader node crashes, or the leader loses contact with more than half of the followers because of network problems, the Zookeeper cluster enters crash recovery mode. In the crash recovery state, ZAB needs to do two things: elect a new leader and synchronize data.
If the leader loses contact with more than half of the followers, a network partition may be created between the leader and the followers. In this case, the leader is no longer a legitimate leader.
As we saw in the discussion of message broadcast, the broadcast mechanism of the ZAB protocol is a simplified version of the 2PC protocol: a commit requires responses from only more than half of the nodes in the cluster. On its own, however, this cannot handle the data inconsistency that may arise when the leader server crashes. Crash recovery mode was therefore added to the ZAB protocol to solve this problem.
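To make "more than half" concrete, here is the arithmetic as a small sketch. The function names are hypothetical; the formulas are just the usual majority calculation.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that is strictly more than half."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes may be down while a quorum can still be formed."""
    return cluster_size - quorum_size(cluster_size)


def min_overlap(cluster_size: int) -> int:
    """Any two quorums share at least this many nodes."""
    return 2 * quorum_size(cluster_size) - cluster_size


sizes = {n: (quorum_size(n), tolerated_failures(n)) for n in (3, 4, 5)}
```

For a 5-node cluster this gives a quorum of 3 and tolerance for 2 failures. A 4-node cluster also needs 3 and tolerates only 1 failure, which is why an even number of nodes adds no extra fault tolerance. The overlap of at least one node between any two quorums is what the recovery argument later in this section relies on.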
Crash recovery in ZAB requires that if a transaction proposal has been successfully processed on one machine, it must be successfully processed on all machines, even when failures occur. With that goal in mind, let's first look at the scenarios that can cause data inconsistency in Zookeeper and how ZAB's crash recovery should handle them.
Messages that have been processed cannot be lost
When the leader has received ACKs from a legitimate number of followers, it broadcasts the COMMIT command to every follower, executes the COMMIT locally as well, and returns "success" to the connected client. But if the leader crashes before the followers receive the COMMIT command, the remaining servers never execute the message.
A typical example is C2 in the figure. At some point during normal operation of the cluster, Server1 is the leader and has broadcast the messages P1, P2, C1, P3, and C2 in turn. Suppose the leader crashes and exits immediately after sending C2 (the commit for Proposal2). In this case, the ZAB protocol must make sure that Proposal2 is eventually committed on all servers; otherwise the data becomes inconsistent.
Discarded messages cannot reappear
Suppose the leader crashes right after it receives a message request and generates a proposal, so that no follower ever receives the proposal. After a new leader is elected through recovery mode, that message is skipped. Later the crashed leader restarts and rejoins the cluster as a follower. It still holds the proposal for the skipped message, which is inconsistent with the state of the whole system, so that proposal must be deleted.
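A rough sketch of that cleanup is shown below. It assumes each log is a list of `(zxid, payload)` tuples and uses a simple rule of my own choosing: keep the common prefix, discard whatever the new leader does not have, then copy the leader's log. The `resync` function is hypothetical and is not Zookeeper's actual synchronization code.

```python
def make_zxid(epoch, counter):
    return (epoch << 32) | counter


def resync(follower_log, leader_log):
    """Return (new_follower_log, discarded_entries) after syncing with the leader."""
    # Find the length of the common prefix of the two logs.
    common = 0
    for mine, theirs in zip(follower_log, leader_log):
        if mine != theirs:
            break
        common += 1
    # Everything past the common prefix was never accepted by the
    # current leader, so the rejoining server must drop it.
    discarded = follower_log[common:]
    return list(leader_log), discarded


# The old leader generated proposal (epoch 1, counter 3) and crashed
# before sending it to anyone.
old_leader_log = [
    (make_zxid(1, 1), "a"),
    (make_zxid(1, 2), "b"),
    (make_zxid(1, 3), "c"),  # never reached any follower
]
# Meanwhile the new leader moved on to epoch 2.
new_leader_log = [
    (make_zxid(1, 1), "a"),
    (make_zxid(1, 2), "b"),
    (make_zxid(2, 1), "d"),
]

synced_log, dropped = resync(old_leader_log, new_leader_log)
```

After the call, `dropped` holds only the skipped proposal from epoch 1, and `synced_log` matches the new leader's log.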
The solution
To satisfy the two requirements above, the ZAB protocol needs a leader election algorithm which ensures that proposals already committed by the leader are eventually committed everywhere, and that proposals that were skipped are discarded.
- If the leader election algorithm guarantees that the newly elected leader holds the transaction proposal with the highest number (the largest ZXID) among all machines in the cluster, then the newly elected leader is guaranteed to have every committed proposal. This is because before any proposal is COMMITted, more than half of the followers must have returned an ACK, which means more than half of the nodes already have the proposal in their transaction logs. Therefore, as long as a legitimate number of nodes are working normally, there must be at least one node that holds the state of all COMMITted proposals.
- In addition, the zxid is 64 bits. The higher 32 bits are the epoch number: every time a leader election produces a new leader, the new leader increments the epoch by 1. The lower 32 bits are a message counter, which is incremented by 1 for each message received and reset to 0 after a new leader is elected. The advantage of this design is that after an old leader crashes and restarts, it will not be elected leader, because its zxid is definitely smaller than that of the current leader. When the old leader connects to the new leader as a follower, the new leader makes it clear all proposals from the old epoch that were never COMMITted.
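The two points above can be illustrated with a toy election. Everything here is a simplified sketch under stated assumptions: `elect_leader` just picks the survivor with the largest last zxid, ties are broken by the larger server name (an arbitrary choice for this example), and the real election code is the subject of the next section.

```python
def make_zxid(epoch: int, counter: int) -> int:
    return (epoch << 32) | counter


def elect_leader(last_zxids: dict) -> str:
    """Pick the server whose log ends with the largest zxid."""
    # Assumption of this sketch: ties on zxid go to the larger name.
    return max(last_zxids, key=lambda name: (last_zxids[name], name))


def first_zxid_of_new_epoch(last_zxid: int) -> int:
    """New leader bumps the epoch by 1 and resets the counter to 0."""
    return make_zxid((last_zxid >> 32) + 1, 0)


# Five servers. server1 was the leader and crashed after Proposal2
# (epoch 1, counter 2) had been ACKed by a quorum: itself, server2, server3.
survivors = {
    "server2": make_zxid(1, 2),
    "server3": make_zxid(1, 2),
    "server4": make_zxid(1, 1),
    "server5": make_zxid(1, 1),
}

new_leader = elect_leader(survivors)
new_zxid = first_zxid_of_new_epoch(survivors[new_leader])
```

Here the winner is one of the servers that already logged Proposal2, so the committed proposal survives the crash. Every zxid issued in the new epoch is larger than anything the old leader could have produced in epoch 1.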
The source code of leader election is analyzed in detail in the next section.