1. What is the ZAB protocol?
The ZAB (ZooKeeper Atomic Broadcast) protocol is a crash-recovery atomic broadcast protocol designed specifically for the distributed coordination service ZooKeeper. ZooKeeper relies on ZAB to implement distributed data consistency. Based on this protocol, ZooKeeper implements a leader/follower (active/standby) architecture that keeps the replicas in the cluster consistent.
The Zab protocol consists of two modes:
- Broadcast: In Zab, all write requests are handled by the leader. In the normal working state, the leader receives each request and processes it through the broadcast protocol.
- Recovery: When the service starts for the first time, or when the leader crashes, the system enters recovery mode until a new leader supported by a quorum of followers is elected. The new leader is then responsible for synchronizing the whole system to the latest state.
1.1 Broadcast
The broadcast process is actually a simplified two-phase commit process:
- After receiving a message request, the leader assigns the message a globally unique, monotonically increasing 64-bit ID called the zxid. Causal order can be established by comparing zxids.
- The leader dispatches the message, together with its zxid, as a proposal to all followers through a first-in, first-out queue (implemented over TCP, which preserves global order).
- After receiving the proposal, a follower writes it to disk and sends an ACK to the leader once the write succeeds.
- After receiving a quorum of ACKs, the leader sends a COMMIT command to all followers and commits the message locally.
- When a follower receives the COMMIT command for a message, it applies the message locally.
(Figure: the broadcast process)
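The steps above can be sketched as a minimal, single-process simulation. The `Leader` and `Follower` classes here are illustrative, not ZooKeeper's actual API; real Zab runs the proposal/ACK/COMMIT exchange over per-follower TCP queues.

```python
# Minimal single-process sketch of the Zab broadcast steps above.
# Class and method names are hypothetical, not ZooKeeper's real API.

class Follower:
    def __init__(self):
        self.log = []        # proposals persisted "to disk"
        self.committed = []  # messages applied locally

    def on_proposal(self, zxid, msg):
        self.log.append((zxid, msg))  # persist, then ACK
        return "ACK"

    def on_commit(self, zxid):
        for z, m in self.log:
            if z == zxid:
                self.committed.append(m)

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.counter = 0

    def broadcast(self, msg, epoch=1):
        self.counter += 1
        zxid = (epoch << 32) | self.counter  # high bits: epoch, low: counter
        acks = 1  # the leader counts itself
        for f in self.followers:
            if f.on_proposal(zxid, msg) == "ACK":
                acks += 1
        # Quorum: F + 1 out of 2F + 1 servers must have persisted the proposal.
        if acks > (len(self.followers) + 1) // 2:
            for f in self.followers:
                f.on_commit(zxid)
            return True
        return False

followers = [Follower(), Follower()]
leader = Leader(followers)
assert leader.broadcast("create /node") is True
assert all(f.committed == ["create /node"] for f in followers)
```

Note how COMMIT is sent once a quorum has ACKed, without waiting for every follower.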
Compared with a complete two-phase commit, the biggest difference in the Zab protocol is that a transaction cannot be aborted: followers either ACK the leader's proposal or abandon the leader. At any given moment, the leader's state and a follower's state may therefore be inconsistent, so the broadcast protocol alone cannot handle leader failure; Zab introduces recovery mode for this. On the other hand, because Zab never aborts a transaction, it does not require an ACK from every follower before COMMIT — only from a quorum (F + 1 out of 2F + 1 servers) — which also improves overall performance.
1.2 Recovery
Since the broadcast part of the Zab protocol cannot handle leader failure, Zab introduces recovery mode. For the system to work correctly after the leader crashes, two problems must be solved:
- Messages that have been processed cannot be lost
- Discarded messages cannot reappear
Messages that have been processed cannot be lost
This occurs in the following scenario: after the leader receives a quorum of ACKs from followers, it broadcasts the COMMIT command to each follower, executes the COMMIT locally, and returns a "success" response to the connected client. However, if the leader crashes before every follower has received the COMMIT command, the remaining servers never apply the message.
As shown in Figure 1-1, Server1 (the leader) and Server2 (a follower) have executed the COMMIT command for message 1, but Server3 has not yet received it when leader Server1 crashes. The client has likely already received a reply that message 1 was executed successfully, so recovery mode must ensure that message 1 is eventually executed on all machines.
To achieve this goal, Zab’s recovery mode uses the following strategies:
- Elect the node with the largest proposal (i.e., the largest zxid) as the new leader: every proposal must be ACKed by a quorum of followers before it is committed, which means a quorum of servers have the proposal in their transaction logs. Therefore, as long as a quorum of nodes is working normally, there must be at least one node that holds every committed proposal.
- The new leader commits the proposals in its transaction log that have been proposed but not yet committed.
- The new leader establishes a first-in, first-out queue with each follower: it first sends the follower the proposals the follower is missing, then sends the COMMIT commands for those proposals. This ensures that all followers save all proposals and apply all committed messages. Together, these strategies guarantee that processed messages are not lost.
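The recovery strategy above can be sketched as follows. This is a deliberately simplified model — node names, the log representation, and the sync step are hypothetical — but it shows why electing the node with the largest zxid preserves every committed proposal.

```python
# Illustrative sketch of the recovery strategy above: among the surviving
# quorum, the node whose log contains the largest zxid becomes leader,
# and followers are synchronized to exactly the leader's log.
# All names and structures here are hypothetical simplifications.

def elect_leader(nodes):
    """nodes: dict of node_id -> sorted list of zxids in its transaction log."""
    # The node holding the largest zxid is guaranteed to have every
    # proposal that was committed (since commit required a quorum of ACKs).
    return max(nodes, key=lambda n: nodes[n][-1] if nodes[n] else -1)

def sync_followers(nodes, leader_id):
    leader_log = nodes[leader_id]
    for n in nodes:
        if n != leader_id:
            # Followers adopt exactly the leader's log: missing proposals
            # are sent over; extra uncommitted ones are discarded.
            nodes[n] = list(leader_log)

nodes = {
    "Server2": [1, 2, 3],  # holds proposal 3 (it ACKed before the crash)
    "Server3": [1, 2],     # never received proposal 3's COMMIT
}
leader = elect_leader(nodes)
sync_followers(nodes, leader)
assert leader == "Server2"
assert nodes["Server3"] == [1, 2, 3]
```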
Discarded messages cannot reappear
This situation occurs in the following scenario: the leader crashes right after receiving a message request and generating the proposal, before any follower receives it. After recovery mode elects a new leader, the message is effectively skipped. When the previously crashed leader restarts and rejoins as a follower, it still holds the proposal for the skipped message; this state is inconsistent with the rest of the system and must be deleted.
As shown in Figure 1-2, after Server1 goes down and the system enters a new normal state, message 3 is skipped, and proposal P3 on Server1 must be cleared.
Zab achieves this through a clever design of the zxid. A zxid is 64 bits: the high 32 bits are the epoch number, which is incremented every time a new leader is elected; the low 32 bits are a message counter, incremented for each message and reset to 0 when a new leader is elected. The advantage of this design is that when the old leader crashes and restarts, it cannot be elected leader again, because its zxid is necessarily smaller than the new leader's. After the old leader reconnects to the new leader as a follower, the new leader makes it clear all uncommitted proposals that carry the old epoch number.
2. Summary
- In a leader/follower architecture, how is data consistency guaranteed when the leader crashes? After the leader crashes, the cluster elects a new leader and enters the recovery phase. The new leader holds all committed proposals, so it makes the followers synchronize the committed proposals and discard the uncommitted ones. This ensures data consistency across the cluster.
- The cluster cannot process write requests during an election, so how can a leader be elected quickly? This is achieved through Fast Leader Election: electing a leader requires votes from only more than half of the nodes, so a leader can be chosen as soon as possible without waiting for votes from all nodes.
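The "more than half" rule can be sketched as a vote tally. This is a simplification of Fast Leader Election (real votes compare epoch, zxid, and server id, as described in section 1.2); the function names are hypothetical.

```python
# Sketch of the majority rule in Fast Leader Election: a candidate wins
# as soon as it has collected votes from more than half of the cluster,
# without waiting for every server. Simplified: votes are candidate ids.
from collections import Counter

def majority(cluster_size):
    return cluster_size // 2 + 1

def tally(votes, cluster_size):
    """votes: list of candidate ids received so far; None if no winner yet."""
    for candidate, n in Counter(votes).items():
        if n >= majority(cluster_size):
            return candidate  # elected without waiting for all votes
    return None

# 5-server cluster: 3 votes suffice even though 2 servers are still silent.
assert majority(5) == 3
assert tally(["s1", "s1", "s1"], 5) == "s1"
assert tally(["s1", "s2"], 5) is None
```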