1. Implementation of ZooKeeper

1.1 Handling Single Points of Failures in ZooKeeper

We know that ZooKeeper can solve the single-point-of-failure problem of a distributed system through Master election, as shown in Figure 1.1.

Figure 1.1 ZooKeeper eliminating the single point of failure

ZooKeeper uses Master election to help a distributed system eliminate its single point of failure, ensuring that only one Master provides service at any given time. In effect, the single-point problem of the distributed system is handed over to ZooKeeper. But notice what this implies: the point of failure has simply moved into ZooKeeper. As the figure below shows, if ZooKeeper itself ran on only one machine and that machine failed, the distributed system could end up with two Masters at once, and introducing ZooKeeper would then be pointless. This means that ZooKeeper must guarantee availability and recoverability in its own implementation. Only then can we confidently use ZooKeeper as the foundation of our distributed system, saving cost and reducing bugs.
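To make the Master-election idea concrete, here is a minimal sketch built on the ZooKeeper Java client, assuming an existing /election parent znode; the path, class name, and omitted error handling are illustrative choices, not taken from the text above. Every candidate creates an EPHEMERAL_SEQUENTIAL znode, and whichever candidate owns the lowest sequence number acts as Master; if it dies, its ephemeral znode disappears and the next candidate can take over.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch of the Master-election recipe (illustrative; assumes /election exists).
public class MasterElection {
    private final ZooKeeper zk;
    private String myNode;

    public MasterElection(ZooKeeper zk) { this.zk = zk; }

    // Returns true if this process currently holds the Master role.
    public boolean volunteer() throws KeeperException, InterruptedException {
        // Each candidate creates an ephemeral, sequentially numbered znode.
        myNode = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        // The candidate with the smallest sequence number is the Master.
        return myNode.endsWith(children.get(0));
    }
}
```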

1.2 ZooKeeper Operating Mode

The ZooKeeper service has two different running modes. In standalone mode there is only one ZooKeeper server; this mode is simple and suitable for test environments, even unit tests, but it cannot guarantee high availability or recoverability. In a production environment, ZooKeeper typically runs in replicated mode on a cluster of machines called an ensemble.

Figure 1.2 ZooKeeper cluster

ZooKeeper achieves high availability through replication and can keep providing service as long as more than half of the machines in the ensemble are available. For example, in a five-node ensemble, each Follower node holds a copy of the Leader node's data, so every node has the same view of the data and any of the five nodes can serve ZooKeeper requests. Such an ensemble can tolerate the failure of any two machines, because the remaining three are still more than half.

Note that a six-node ensemble can also tolerate only two machine failures, because if three machines fail, the remaining three are no longer more than half of the ensemble. For this reason, an ensemble usually contains an odd number of machines.
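To make the majority arithmetic above concrete, here is a tiny illustrative snippet (my own example, not part of ZooKeeper): the quorum size is n/2 + 1 using integer division, so the number of tolerable failures is (n - 1) / 2, which works out to 2 for both five and six servers.

```java
// Illustration of the majority rule: an ensemble of n servers keeps working
// as long as strictly more than n/2 of them are alive.
public class QuorumMath {
    static int quorumSize(int ensembleSize)        { return ensembleSize / 2 + 1; }
    static int tolerableFailures(int ensembleSize) { return (ensembleSize - 1) / 2; }

    public static void main(String[] args) {
        for (int n = 3; n <= 7; n++) {
            System.out.printf("ensemble=%d  quorum=%d  tolerates=%d failure(s)%n",
                    n, quorumSize(n), tolerableFailures(n));
        }
    }
}
```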

Conceptually, what ZooKeeper does is ensure that every modification to the znode tree is replicated to more than half of the machines in the ensemble. As long as fewer than half of the machines fail, at least one surviving machine holds the latest state, and that machine can become the Leader; the remaining replicas will eventually catch up to that state. If the Leader dies, another machine can be elected to continue serving as the new Leader, since the other machines hold copies of the Leader's data.

1.3 Read/Write Mechanism of ZooKeeper

(1) Overview

The core idea of ZooKeeper is to provide wait-free, lock-free core services for distributed-system coordination. It offers simple interfaces for creating, reading, and writing znodes. The core itself does not provide locking or mutual exclusion for reads and writes; instead, it offers update operations conditioned on a version comparison, on top of which clients can implement their own locking logic. See Figure 1.3 below.

Figure 1.3 Using Versions to prevent inconsistencies due to Concurrent updates
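As a sketch of the version-based update shown in Figure 1.3 (the helper class is illustrative and assumes the znode stores an integer as text): the client reads the znode together with its version, then writes back while passing that expected version; if another client updated the node in between, setData fails with BadVersionException instead of silently overwriting, and the caller can retry.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Optimistic, version-checked update (illustrative sketch).
public class VersionedUpdate {
    public static boolean tryIncrement(ZooKeeper zk, String path)
            throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        byte[] raw = zk.getData(path, false, stat);           // read value + current version
        int next = Integer.parseInt(new String(raw)) + 1;
        try {
            zk.setData(path, String.valueOf(next).getBytes(), stat.getVersion());
            return true;                                       // nobody updated the znode in between
        } catch (KeeperException.BadVersionException e) {
            return false;                                      // concurrent update detected; caller may retry
        }
    }
}
```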

 

(2) ZK cluster service

ZooKeeper runs as a cluster of multiple servers. The cluster has one Leader and multiple Followers. A client can connect to any ZooKeeper server node to read and write data, as shown in Figure 1.4 below.

Figure 1.4 ZooKeeper cluster service

Each Server in the ZK cluster keeps a copy of the data. ZooKeeper uses a simple synchronization policy and ensures data consistency through the following two basic guarantees:

① Globally serialize all write operations

② Ensure that requests from the same client are executed in FIFO order (and that message notifications are also delivered in FIFO order)

All read requests are answered locally by the ZK Server that received them, while all update requests are forwarded to the Leader to be executed.

(3) ZK components

The ZK components are shown in Figure 1.5. With the exception of the Request Processor, each Server that makes up the ZK service replicates its own copy of each of these components.

Figure 1.5 ZooKeeper components

 

The ReplicatedDatabase is an in-memory database that contains the entire Data Tree. To support recovery, updates are logged to disk, and each write is serialized to disk before it is applied to the in-memory database.

Each ZK Server can serve multiple clients. Clients can connect to a Server to submit requests. Read requests are serviced by a local copy of each Server database. Write requests that change the state of the server need to be handled through a consistency protocol.

As part of the consistency protocol, all write requests from clients are forwarded to a single Server, called the Leader. The other servers in the ZK cluster, called Followers, receive the proposal messages sent by the Leader and agree upon message delivery. The messaging layer handles Leader failure and keeps the Followers synchronized with the Leader.

ZooKeeper uses a custom atomic messaging protocol. Because the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the Leader receives a write request, it calculates what the state of the system will be after the write completes and turns the request into a transaction that captures that new state.

(4) ZK performance

ZooKeeper is widely used by applications and may have thousands of clients accessing it at the same time, so it needs high throughput. The workloads we designed ZooKeeper for have read-to-write ratios of 2:1 or higher; however, we found that ZooKeeper's high write throughput also allows it to be used for some write-dominated workloads. ZooKeeper provides high read throughput by serving reads from the local state copy on each Server. As a result, fault tolerance and read throughput scale with the number of servers added to the service, whereas write throughput does not scale with the number of machines.

At Yahoo, where it was born, ZooKeeper has a baseline throughput of over 10,000 operations per second for write-dominated workloads; for the more typical read-dominated workloads, throughput is several times higher.

2. Guarantees of ZooKeeper

From the analysis above, we know that to make the ZooKeeper service highly available we must run it in a distributed mode and write multiple redundant copies of the data. That introduces consistency problems, which in turn introduce performance problems, and the three constrain one another, so there is no design that escapes the trade-off. This is where the famous CAP theorem of the distributed field comes in. I will introduce it only briefly here; detailed material on CAP is easy to find online.

2.1 CAP theory

(1) Theoretical overview

The CAP theorem holds in the distributed domain:

① Consistency (C): all changes to the data are synchronized across the system.

② Availability (A): the system responds to requests with good performance.

③ Partition tolerance (P): in practical terms, partition tolerance is a time-bound requirement on communication. If the system cannot reach data consistency within the time limit, a partition has occurred, and for the current operation the system must choose between C and A; in other words, it must keep working regardless of which messages are lost.

It has been proved that a distributed system can satisfy at most two of the three properties at once, never all three. Rather than wasting energy trying to design a perfect system that satisfies all three, the trade-off should be tailored to the application scenario.

(2) Consistent classification

Consistency means that, under certain constraints, data read from outside the system is the same everywhere; that is, changes made inside the system are synchronized across all nodes. Consistency can be classified into the following levels:

① Strong consistency. At any time, any user can read the most recently successfully updated value.

② Monotonic consistency. At any time, once a user has read a value written by some update, that user will never afterwards read a value older than it; in other words, the data a user observes only becomes newer, never older.

③ Session consistency. Within a session, once a user has read a value written by some update, that user will not read an older value for the rest of the session. Session consistency relaxes the constraints further than monotonic consistency: monotonicity is guaranteed only within a single session of a single user, not across different users or across different sessions of the same user.

④ Eventual consistency. A user is not guaranteed to read the latest value right away, but the system guarantees that the data will eventually reach a fully consistent state; how long that takes is not guaranteed.

⑤ Weak consistency. The user is not guaranteed to read the most recently updated value even within a bounded time.

2.2 ZooKeeper and CAP theory

We know that ZooKeeper is itself a distributed system. As for which consistency level it provides, some consider it strongly consistent (via the sync operation), some consider it monotonically consistent (because updates require a majority), and some consider it eventually consistent (because of its ordering guarantees); each view has its reasons, and we will not argue the point here. ZooKeeper then makes its own compromise among consistency, availability, and partition tolerance, which is in line with the CAP theorem. ZooKeeper ensures data consistency in the following ways (a small sketch of the sync operation follows this list):

① Sequential consistency

Updates from any particular client are applied in the order in which they were sent. That is, if a client first updates znode z to the value a and later updates z to the value b, then no client will ever see z with the value a after having seen it with the value b (assuming no other updates are made to z).

② Atomicity

Each update either succeeds or fails. This means that if an update fails, no client will see the result of the update.

③ Single system image

A client sees the same view of the system no matter which server it connects to. This means that if a client connects to a new server during the same session, it will never see an older state of the system than the one it saw on the previous server. When a server fails and one of its clients needs to try other servers in the ensemble, any server whose state lags behind that of the failed server will not accept the connection until it has caught up with the failed server.

④ Durability

Once an update is successful, its results persist and cannot be undone. This indicates that the update is not affected by a server failure.
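As an illustration of the sync operation mentioned above (the helper class and latch-based waiting are my own illustrative choices): a client that wants a read reflecting every write committed so far can call sync() before getData(), so the server it is connected to first catches up with the Leader; plain reads, by contrast, are served from the local replica.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Read a znode only after forcing this session's server to catch up (illustrative sketch).
public class SyncedRead {
    public static byte[] readLatest(ZooKeeper zk, String path)
            throws KeeperException, InterruptedException {
        CountDownLatch synced = new CountDownLatch(1);
        // sync() is asynchronous; wait until the connected server has caught up with the Leader.
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();
        // This read now reflects all writes committed before sync() completed.
        return zk.getData(path, false, null);
    }
}
```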

3. ZooKeeper Principle

3.1 Principle Overview

At the core of ZooKeeper is an atomic broadcast mechanism, which keeps the servers synchronized with one another. The protocol that implements it is called the Zab protocol. Zab has two modes: recovery mode and broadcast mode.

(1) Recovery mode

Zab enters recovery mode when the service starts or after the Leader crashes; recovery mode ends once a Leader has been elected and a majority of servers have completed state synchronization with the Leader. State synchronization ensures that the Leader and the other servers have the same system state.

(2) Broadcast mode

Once the Leader has synchronized state with a majority of Followers, it can begin broadcasting messages. When a new Server joins the ZooKeeper service, it starts in recovery mode, discovers the Leader, and synchronizes its state with the Leader; when synchronization finishes, it also takes part in message broadcasting. The ZooKeeper service stays in broadcast mode until the Leader crashes or the Leader loses a majority of its Followers.

Broadcast mode is very similar to two-phase commit (2PC) in distributed transactions: the Leader proposes a resolution and the Followers vote on it; the Leader tallies the votes and decides whether the resolution passes, and if it passes, the Leader commits (executes) the resolution.

Figure 3.1 Two-phase Commit

In broadcast mode, a ZooKeeper Server accepts client requests and forwards all write requests to the Leader, who then broadcasts the updates to the Followers. The Leader commits an update only after more than half of the Followers have persisted the change, and only then does the client receive a successful response to its update. The protocol used to reach agreement is designed to be atomic, so every change either succeeds completely or fails completely.

Figure 3.2 Data flow diagram of ZooKeeper

3.2 Zab Protocol details

3.2.1 Broadcast Mode

The broadcast mode resembles a simplified two-phase commit: the Leader initiates a proposal, collects votes, and finally commits. Figure 3.3 illustrates the message flow of the protocol. We can simplify the two-phase commit protocol because there are no aborts: the Followers either acknowledge the Leader's proposal or discard it. Having no aborts also means that we can commit as soon as a quorum of machines has acknowledged a proposal, instead of waiting for all machines to respond.

Figure 3.3 The flow of messages with the protocol

The broadcast protocol uses TCP FIFO channels for all communication, which makes preserving order very easy: messages are delivered in order through the FIFO channel, and as long as they are processed in the order in which they are received, that order is preserved.

The Leader broadcasts a Proposal for each message that is to be delivered. Before sending a Proposal, the Leader assigns it a monotonically increasing unique id, called the zxid. Because Zab preserves causal order, delivered messages are also ordered by zxid. To broadcast, the Leader wraps the Proposal in a message, appends it to the output queue of each Follower, and sends it through the FIFO channel. When a Follower receives a Proposal, it writes it to disk (batching writes where possible); once the Proposal is on the disk medium, the Follower sends an ACK back to the Leader. When the Leader has received ACKs from a quorum, it broadcasts a COMMIT message and delivers the message locally. The Followers deliver the message when they receive the COMMIT from the Leader.
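The following schematic sketch (my own illustration, not ZooKeeper's actual source; FollowerChannel and the message encoding are invented stand-ins) summarizes the leader-side bookkeeping just described: assign a monotonically increasing zxid, push the PROPOSAL onto every Follower's FIFO channel, count ACKs (including the Leader's own log write), and broadcast COMMIT once a quorum is reached.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Schematic leader-side broadcast (illustrative sketch only).
class BroadcastSketch {
    interface FollowerChannel {                   // stand-in for a per-follower TCP FIFO link
        void send(String type, long zxid, byte[] payload);
    }

    private final long epoch;                     // high 32 bits of every zxid issued by this leader
    private long counter = 0;                     // low 32 bits, reset to 0 when the epoch changes
    private final List<FollowerChannel> followers;
    private final Map<Long, Integer> acks = new HashMap<>();

    BroadcastSketch(long epoch, List<FollowerChannel> followers) {
        this.epoch = epoch;
        this.followers = followers;
    }

    synchronized long propose(byte[] txn) {
        long zxid = (epoch << 32) | ++counter;
        acks.put(zxid, 1);                        // the leader logs the proposal itself
        for (FollowerChannel f : followers) {
            f.send("PROPOSAL", zxid, txn);        // follower writes it to disk, then ACKs
        }
        return zxid;
    }

    synchronized void onAck(long zxid) {
        int count = acks.merge(zxid, 1, Integer::sum);
        int quorum = (followers.size() + 1) / 2 + 1;
        if (count == quorum) {                    // fires exactly once, when a quorum has logged it
            for (FollowerChannel f : followers) {
                f.send("COMMIT", zxid, null);     // followers deliver the message on COMMIT
            }
        }
    }
}
```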

Note that this simplified two-phase commit does not by itself handle Leader failure, so we add a recovery mode to deal with it.

3.2.2 Recovery Mode

(1) Overview of recovery phase

The Zab protocol remains in broadcast mode until the Leader fails or loses a quorum of Followers. To guarantee progress, a new Leader must be elected during recovery and all servers must eventually reach a correct state. Leader election requires an algorithm that succeeds with high probability. The Leader election protocol ensures not only that a Leader knows it is the Leader, but also that a quorum of servers agrees with that decision. If something goes wrong during the election phase, the servers make no progress; eventually a timeout occurs and the election is run again. In our implementation, Leader election is implemented in two different ways; with a quorum of servers up and running, a fast election can complete in a few hundred milliseconds.

(2) Guarantee at the recovery stage

The tricky part of recovery is the sheer number of Proposals that may be outstanding at a given time. The maximum number of outstanding Proposals is a configurable option, with a default of 1000. For the protocol to remain correct even when the Leader fails, we need two specific guarantees:

① We must never forget messages that have been delivered. If a message is delivered on one machine, it must be delivered on every machine.

② We must discard messages that have been skipped.

(3) Guarantee example

The first:

If a message is delivered on one machine, it must be delivered on every machine, even if that machine then fails. For example, suppose the Leader sends a COMMIT message but fails before the COMMIT reaches any other machine; in other words, only the Leader itself has received the COMMIT. This is C2 in Figure 3.4.

Figure 3.4 The flow of messages with the protocol

Figure 3.4 is an example of the first guarantee (delivered messages cannot be forgotten). In this figure, Server1 is the Leader, denoted L1, and Server2 and Server3 are Followers. First, the Leader issues two proposals, P1 and P2, and sends them to Server2 and Server3. Then the Leader commits P1 (C1), issues another proposal, P3, and then commits P2 (C2). At that moment the Leader crashes, so only the Leader has the messages P3 and C2.

Because the Leader has delivered C2, a client may already have seen the result of that transaction. The transaction must therefore be delivered on all the other servers as well, so that every client sees a consistent view of the service.

The second:

A message that has been skipped must remain skipped. For example, suppose the Leader sends a Proposal but fails before the Proposal reaches any other machine; in other words, only the Leader itself has the Proposal. This is P3 in Figure 3.4.

In Figure 3.4, none of the other servers has seen Proposal P3, so in Figure 3.5, when Server1 recovers, it must discard P3 as part of system recovery.

Figure 3.5

Figure 3.5 is an example of the second guarantee (skipped messages must be discarded). After Server1 fails, Server3 is elected as the new Leader, denoted L2. Messages P1 and P2 have not yet been delivered, so L2 delivers them first: it issues the two COMMIT messages, C1 and C2, before issuing its own new proposals, P10000001 and P10000002.

If Server1 became the Leader again after recovering, and P3 were delivered after P10000001 and P10000002, the ordering guarantee would be violated.

(4) How the guarantees are realized

If the Leader election protocol guarantees that the new Leader has the highest proposal number (that is, the highest zxid) within a quorum of servers, then the newly elected Leader already has all the messages that have been delivered. Before proposing any new message, the new Leader must first make sure that all the messages in its transaction log have been proposed to, and delivered by, a quorum of Followers. Note that, as an optimization, we can choose as the new Leader the server holding the transaction with the highest zxid; that way the newly elected Leader does not need to find out which Follower has the highest zxid and fetch the missing transactions from it.

① The first guarantee

Every Server that starts correctly will either become the Leader or follow a Leader. The Leader ensures that its Followers have seen all proposals and have delivered all messages that have already been delivered: for each newly connected Follower, it queues up every Proposal the Follower has not yet seen, followed by the COMMIT messages up to the last committed message. Only after all such messages have been queued does the Leader add the Follower to its broadcast list for future proposals and acknowledgements. This is what guarantees consistency: if a message P was delivered on the old Leader (Server1), then even if Server1 crashed right after delivering it, some client may already have seen P's transaction on that server, so to give every client a consistent view, the message must be delivered on every Server.
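A schematic sketch of that catch-up procedure may help (again my own illustration, not ZooKeeper's real code; Proposal and FollowerChannel are invented stand-ins): the Leader first queues every Proposal the new Follower has not yet seen, then the COMMITs up to the last committed message, and only afterwards adds the Follower to the normal broadcast list.

```java
import java.util.List;

// Schematic follower catch-up (illustrative sketch only).
class FollowerCatchUp {
    static class Proposal {
        final long zxid;
        final byte[] txn;
        Proposal(long zxid, byte[] txn) { this.zxid = zxid; this.txn = txn; }
    }

    interface FollowerChannel {
        long lastSeenZxid();                        // highest zxid already in the follower's log
        void send(String type, long zxid, byte[] payload);
    }

    static void catchUp(List<Proposal> log, long lastCommittedZxid, FollowerChannel f) {
        for (Proposal p : log) {
            if (p.zxid > f.lastSeenZxid()) {
                f.send("PROPOSAL", p.zxid, p.txn);  // proposals the follower has not seen yet
            }
        }
        for (Proposal p : log) {
            if (p.zxid <= lastCommittedZxid) {
                f.send("COMMIT", p.zxid, null);     // commits up to the last committed message
            }
        }
        // Only after this does the leader add the follower to its broadcast list.
    }
}
```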

② The second guarantee

Messages that have been proposed but skipped (never delivered) are also easy to handle. In our implementation, the zxid is a 64-bit number: the low 32 bits are a simple counter and the high 32 bits are an epoch. Whenever a new Leader takes over, it reads the epoch of the largest zxid in its log, sets the epoch of its own zxids to epoch+1, and resets the counter to 0. The epoch marks a change of leadership, and a quorum of servers must recognize a Leader by its epoch, which prevents multiple Leaders from issuing different proposals under the same zxid.
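A small illustrative snippet (my own example) of the zxid layout just described: the high 32 bits carry the epoch and the low 32 bits carry the counter, so any proposal from a newer epoch compares greater than every proposal from an older one.

```java
// Compose and decompose a 64-bit zxid: high 32 bits = epoch, low 32 bits = counter.
public class ZxidLayout {
    static long makeZxid(long epoch, long counter) { return (epoch << 32) | (counter & 0xFFFFFFFFL); }
    static long epochOf(long zxid)                 { return zxid >>> 32; }
    static long counterOf(long zxid)               { return zxid & 0xFFFFFFFFL; }

    public static void main(String[] args) {
        long oldLeader = makeZxid(0, 3);             // epoch 0, third proposal
        long newLeader = makeZxid(1, 1);             // epoch 1 starts with the counter reset
        System.out.println(newLeader > oldLeader);   // true: a newer epoch always compares greater
    }
}
```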

One advantage of this approach is that we can skip over the instances left behind by a failed Leader, which speeds up and simplifies recovery. If a crashed Server restarts with Proposals that were never delivered, none of those Proposals can ever be delivered, and the Server cannot become the new Leader, because any quorum will contain a server holding a Proposal from a newer epoch and therefore a higher zxid. When the recovered Server connects as a Follower, the Leader checks the last committed proposal for the epoch of the Follower's latest proposal (in Figure 3.5, the C2 proposal delivered by the new Leader, Server3) and tells the Follower to truncate its transaction log back to that last committed proposal of that epoch. In Figure 3.5, when the old Leader Server1 connects to the new Leader Server3, the Leader tells it to remove Proposal P3 from its transaction log (more precisely, every proposal after P2), because those proposals are known only to the old Leader Server1 and to no other server.

(5) Paxos and Zab

① Paxos consistency

The consistency provided by Paxos alone cannot meet ZooKeeper's requirements, which we can see from the following example. Assume the ZK cluster consists of three machines: Server1, Server2, and Server3. Server1 is the Leader and it generates three proposals, P1, P2, and P3, but Server1 crashes after P1 has been sent. See Figure 3.6 below.

Figure 3.6 Server1 is the Leader

After Server1 fails, Server3 is elected as the Leader, since Server3 holds only one Proposal, P1. Server3 therefore issues a new Proposal, P2', on top of P1; its zxid is 02. See Figure 3.7 below.

Figure 3.7 Server3 becomes the Leader

Server3 also crashes after sending P2'. At this point Server1 has restarted and becomes the Leader again, and it sends the proposals P2 and P3 that have not yet been delivered. P2 is rejected, because the zxid of P2' already held by Follower Server2 is the same as the zxid of P2 from Leader Server1, while P3 is accepted by Server2. See Figure 3.8.

Figure 3.8 Server1 becomes the Leader again

Now look at the Proposals held by Follower Server2: P2' has taken the place of P2, so Proposal P3 on Server2 cannot take effect, because the parent node it depends on does not exist.

② Zab consistency

First, let's examine why the example above fails to meet ZooKeeper's requirements. ZooKeeper stores a tree structure, and many operations must check a precondition before they can be executed. For example, Server2 in Figure 3.8 holds three Proposals: P1 creates the node “/zk”, P2' creates the node “/c”, and P3 creates the node “/a/b”; P3 cannot be applied because its parent node “/a” does not exist. From this we can see that Paxos-style consistency does not satisfy ZooKeeper's consistency requirements.

To achieve consistency, ZooKeeper uses the Zab protocol. Zab makes the following guarantees to achieve the consistency required by ZooKeeper.

(a) Zab ensures that all transactions initiated by the same leader are applied sequentially and that the new leader can initiate transactions only after all transactions of the previous leader have been applied.

(b) Messages that have been skipped must remain skipped.

The first guarantee is easy to understand: it keeps the data views of the servers consistent with one another. Let me focus on the second one and how it is achieved. To make skipped messages remain skipped, Zab embeds an epoch in the zxid, as shown below: whenever the Leader changes, the epoch is incremented by 1 and the counter is reset to 0.

Figure 3.9 Zxid

Let's continue with the example above and see how Zab's second guarantee is fulfilled. Again, the ZK cluster consists of three machines: Server1, Server2, and Server3. Server1 is the Leader and it generates three proposals, P1, P2, and P3, but Server1 crashes after P1 has been sent. See Figure 3.10 below.

Figure 3.10 Server1 is the Leader

After Server1 fails, Server3 is elected as the Leader, since Server3 holds only one Proposal, P1. Server3 therefore issues a new Proposal, P2', on top of P1. Because the Leader has changed, the epoch must be incremented by 1, so the epoch goes from 0 to 1 while the counter is reset to 0; the zxid of P2' is therefore 10. See Figure 3.11 below.

Figure 3.11 Server3 is the Leader

Server3 also crashes after sending P2'. At this point Server1 has restarted and becomes the Leader again, and it sends the proposals P2 and P3 that have not yet been delivered. Because the zxid of P2' on Server2 is 10, while the zxids of P2 and P3 from Leader Server1 are 02 and 03, P2' has a higher epoch than both P2 and P3. Both P2 and P3 from Leader Server1 are therefore rejected, which realizes Zab's second guarantee. See Figure 3.12.

Figure 3.12 Server1 becomes the Leader again