ZooKeeper is introduced in Functions, Application Scenarios, and Alternatives of ZooKeeper. ZooKeeper uses the master-slave mode and ZAB protocol to solve single-point problems. ZAB protocol is the key to ensure distributed consistency. This paper will further discuss the ZAB protocol and some considerations.

ZAB agreement

ZooKeeper Atomic Broadcast (ZAB) ensures the consistency between the primary and secondary nodes. Let’s take a look at the architecture of ZooKeeper:

As can be seen from the figure, only one Server is the Leader, while the other servers play the role of Follower or Observer. For the sake of convenience, we only discuss the Leader and Follower roles in this article. The first problem is how to specify a Server as the Leader when starting the cluster. Suppose there are three servers A, B, and C, and the Leader election process is as follows:

  1. Each Server issues a vote

    When the Server starts, the state is LOOKING, and a vote is issued to elect the Leader in the format of (id, ZXID).

    • Id: the SID value of the Leader being elected (unique ID for each Server, as configured by the configuration file)
    • ZXID: transaction ID of the elected Leader

    ZXID: the ZXID of each Server is 0 because the cluster is started.

    Since each Server does not know about the other machines, the only certainty is that it is running, so the first vote goes to itself. If the SIDs of servers A, B, and C are 1, 2, and 3, the votes issued are (1,0), (2,0), and (3,0), respectively.

  2. Receives votes from each server

  3. Processing to vote

    The received vote will be compared with its own vote in turn. The comparison rule is to compare ZXID and ID size in turn. Here, since ZXID is 0, it cannot determine which vote is larger, so id will be compared. When the number of votes received is greater than the number of votes received by the Server, the voting content is updated. Otherwise, the voting content is not updated and the voting is sent to other servers in the cluster.

  4. Vote statistics

    After each vote, all the votes will be counted. If more than half of the votes elect a Server, the Leader will be considered elected and no vote will be sent.

  5. Changing server state

    After the Leader is determined, the status of the Leader is changed to LEADING; if the Leader is Follower, the status is changed to FOLLOWING.

The above process allows you to elect a Leader from the cluster. Here’s how the ZAB protocol works in master-slave mode. There are two main phases: message broadcast and crash recovery. In the message broadcast phase, the data of the primary and secondary nodes is synchronized. When the Leader fails, the Leader is re-elected and data is synchronized to avoid a single point of failure.

Let’s start with the message broadcast phase. Here we focus on two types of client requests, transactional and non-transactional. A transaction request is simply a write operation that changes data, whereas a non-transaction request is a read operation.

  • Non-transactional requests: the Leader and Follower return results based on local data.

  • Transaction requests are processed by the Leader and forwarded to the Leader by followers.

When the Leader receives a transaction request, a message is broadcast:

  1. Leade converts the transaction request into a transaction Proposal, which is broadcast to all followers.
  2. When the Follower receives the Proposal, it writes the transaction operation to the transaction log and sends an Ack response back to the Leader.
  3. When more than half of the Ack responses are received, the Leader considers the transaction committable, commits the transaction, and broadcasts a Commit message.
  4. When the Follower receives a Commit message, it commits the local transaction.

Through message broadcast, data consistency between master and slave nodes can be achieved. If the Leader breaks down or loses contact with more than half of the followers due to network problems, the crash recovery phase is initiated to ensure consistency.

In the crash recovery phase, a new Leader is elected through the Leader election, and then data synchronization is performed. The Leader election process here remains the same as described above. And that’s where ZXID comes in. The ZXID is a 64-bit number. The first 32 bits are epochID and the last 32 bits are transaction ID.

  • EpochID: The initial value is 0 when a cluster is started. When a new Leader is elected, the system considers that a new Leader cycle starts.
  • Transaction ID: the initial value is 0 when the cluster is started, and incremented by 1 whenever a Proposal Proposal is generated for a transaction request;

EpochID is designed to avoid conflicting transaction ids. For example, A proposal with transaction ID 3 on Server A, the Leader, breaks down before it can be broadcast. However, the transaction ID on the followers is still 2. When A new Leader is elected by the followers, Server C, When A new transaction request is processed with transaction ID 2 + 1 = 3 and Server A restarts and joins the cluster, A conflict occurs when two different transaction requests have the same ID. For example, the epochID of Server A is 0 when Server A is the Leader, and the epochID of Server C is 1 when Server A and Server C are compared. If it finds that the record of its proposal is <0, 1><0, 2><0,3> while that of the Leader is <0, 1><0, 2>< 1, 3>, the proposal will be rolled back to <0, 2> and the <1, 3> proposal will be synchronized again to keep the same as that of the Leader.

The comparison of transaction IDS participating in voting ensures that the elected Leader has the largest transaction ID, that is, the data is up to date. As long as other servers synchronize data with the Leader, the data consistency of the whole cluster is guaranteed. When the crash recovery phase is complete, the cluster enters the message broadcast phase again.

thinking

This part is the author in the study of ZAB protocol to think about several questions.

(1) In the broadcast message stage, why is it required to submit ACK responses received at least half of the time?

In 2PC, it is required to receive all ACK responses before submitting. This condition is too harsh. If one Follower crashes, transaction processing will be affected. In ZAB, the transaction processing is reduced due to the influence of followers, and only half of the transaction is required. What is Split Brain and how did Zookeeper solve it? This article is very clear.

(2) If the Leader broadcasts two proposals P1 and P2 successively, Follower A normally receives P1 and P2 successively, while Follower B receives P2 instead of P1, or P2 first and then P1 in the wrong order due to network problems. How to ensure the reliability and sequence of messages in ZooKeeper?

The Leader maintains a message queue for each Follower and sends messages using THE TCP protocol to ensure the reliability and sequence of messages. Therefore, even though Follower A has received P1 and P2, the Leader can resend P1 for the Follower B who has not received P1 and start sending P2 after confirming that P1 has been sent successfully.

(3) Does the Client have to read the latest data?

Because of the ZAB process, we can know that it is not necessarily up to date. For example, Follower B does not receive a Commit request for a transaction due to network problems and the data version is later than that of other servers. In this case, the data read by Follower B is not the latest in the cluster. Although ZAB does not guarantee strong consistency, it does guarantee sequential consistency.

Sequential consistency: Update commands sent by the client are executed by the server in the order they are sent

To ensure that the latest data is read, run Sync to ensure synchronization before reading the data.

reference

  1. From Paxos to ZooKeeper

  2. Apache ZooKeeper: How do writes work

If you like my article, you can scan the code and follow my public account: “Cao Niazi” to discuss the technology together