The previous article introduced Zookeeper's basic principles and application scenarios in detail, including the underlying storage model and how to implement a distributed lock with Zookeeper. But that only scratches the surface. This article takes a closer look at Zookeeper's core underlying principles. If you are not yet familiar with Zookeeper, you may want to read the previous article first.

ZNode

This is the basic, minimal unit of data storage in Zookeeper. Zookeeper abstracts a file-system-like storage structure into a tree, and each node in the tree is called a ZNode. Every ZNode maintains a data structure that records the version numbers and Access Control List (ACL) changes of its data.

With the data's version number and its update timestamp, Zookeeper can validate a client's cached data and coordinate updates.

In addition, when a Zookeeper client performs an update or delete operation, it must carry the version number of the data it intends to modify. If the carried version number does not match the node's current version, Zookeeper rejects the operation. If it matches, the update is applied and the ZNode's version number is incremented.
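This version-checked update is just optimistic concurrency control. The following Python sketch is a toy model of the server-side logic, not Zookeeper's actual implementation; the `ZNodeStore` class and its method names are invented for illustration (a real Java client would call `setData(path, data, version)`):

```python
class BadVersionError(Exception):
    """Raised when the client's expected version does not match the node's."""

class ZNodeStore:
    """Toy model of Zookeeper's version-checked (optimistic) updates."""

    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def create(self, path, data):
        self._nodes[path] = (data, 0)  # new nodes start at version 0

    def set_data(self, path, data, expected_version):
        _, version = self._nodes[path]
        # -1 means "skip the version check", as in the real client API
        if expected_version != -1 and expected_version != version:
            raise BadVersionError(f"expected {expected_version}, node is at {version}")
        new_version = version + 1
        self._nodes[path] = (data, new_version)
        return new_version

store = ZNodeStore()
store.create("/config", b"v1")
print(store.set_data("/config", b"v2", expected_version=0))  # 1: versions match
```

A second writer that still holds version 0 would now get a `BadVersionError`, which is exactly how stale updates are detected.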

This version-number logic appears in many frameworks. For example, in RocketMQ, brokers register with the NameServer carrying a version number called DataVersion.

Let’s take a closer look at the data structure that maintains this version information. It is called the Stat structure, and its fields are:

| Field | Description |
| --- | --- |
| czxid | zxid of the change that created the node |
| mzxid | zxid of the change that last modified the node |
| pzxid | zxid of the change that last modified the node's children |
| ctime | Milliseconds since the Unix epoch when the node was created |
| mtime | Milliseconds since the Unix epoch when the node was last modified |
| version | Number of changes to the node's data (the version number) |
| cversion | Number of changes to the node's children |
| aversion | Number of changes to the node's ACL |
| ephemeralOwner | Session ID of the owner if the node is ephemeral, otherwise 0 |
| dataLength | Length of the node's data |
| numChildren | Number of children of the node |

For example, with the stat command we can view the concrete values of the Stat structure for a ZNode.

The epoch and ZXID are related to the Zookeeper cluster, which will be introduced in detail later.

ACL

Access Control Lists (ACLs) are used to control ZNode permissions. The permission model is similar to Linux's. In Linux there are three permission types: read, write, and execute, abbreviated r, w, and x. Permission granularity is likewise divided into three levels: owner permissions, group permissions, and other permissions. For example:

drwxr-xr-x 3 USERNAME GROUP 1.0K Mar 15 18:19 dir_name

What is granularity? Granularity classifies the objects that permissions apply to. The three levels above are permissions for the **user (owner)**, the **group** the user belongs to, and **other** users, a classic three-part scheme that has become something of a standard for permission control.

Zookeeper's model also has three parts, but its granularity differs. In Zookeeper, the three parts are Scheme, ID, and Permissions, representing the permission mode, the users allowed access, and the specific permissions, respectively.

Scheme represents the permission mode, of which there are five types:

  • world: the ID can only be anyone, meaning everyone can access
  • auth: represents an authenticated user
  • digest: verifies with a username and password
  • ip: only certain IP addresses are allowed to access the ZNode
  • x509: authenticates with a client certificate

At the same time, there are five types of permissions:

  • CREATE: create child nodes
  • READ: get the node's data or list its children
  • WRITE: set the node's data
  • DELETE: delete child nodes
  • ADMIN: set permissions (ACLs)

As in Linux, each permission has a one-letter abbreviation, for example:

With the getAcl command, we can view a ZNode's permissions. As shown in the figure, the output has three sections:

  • Scheme: world is used
  • ID: the value is anyone, indicating that all users have permission
  • Permissions: the specific permissions are cdrwa, an abbreviation for CREATE, DELETE, READ, WRITE, and ADMIN
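The mapping between the one-letter abbreviations and the full permission names is a simple lookup, sketched below in Python (the `expand_perms` function name is invented for illustration):

```python
# Zookeeper ACL permission abbreviations, as seen in getAcl output (e.g. "cdrwa")
PERMISSIONS = {
    "c": "CREATE",
    "d": "DELETE",
    "r": "READ",
    "w": "WRITE",
    "a": "ADMIN",
}

def expand_perms(abbrev):
    """Expand an abbreviation string like 'cdrwa' into full permission names."""
    return [PERMISSIONS[ch] for ch in abbrev.lower()]

print(expand_perms("cdrwa"))  # ['CREATE', 'DELETE', 'READ', 'WRITE', 'ADMIN']
```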

The Session mechanism

Now that we understand Zookeeper's version mechanism, we can explore its Session mechanism.

As we know, there are four types of nodes in Zookeeper: persistent nodes, persistent sequential nodes, ephemeral (temporary) nodes, and ephemeral sequential nodes.

In a previous article we mentioned that if a client creates temporary nodes and then disconnects, all of those temporary nodes are removed once the client's Session expires.

So how does Zookeeper know which temporary nodes are created by the current client?

The answer is the **ephemeralOwner** (owner of the temporary node) field in the Stat structure.

ephemeralOwner stores the SessionID of the session that created the node. With the SessionID, Zookeeper can match the node to the corresponding client, and once that client's Session expires, delete all the temporary nodes it created.

When creating a connection, the client must provide a string listing all servers and ports, separated by commas, for example:

127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002

Given this string, the Zookeeper client picks a random server from the list to establish a connection. If the connection later breaks, the client tries the next server in the list, and so on until a connection succeeds.

In addition to the basic IP+port form, Zookeeper versions 3.2.0 and later also support an optional path in the connection string, for example:

127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002/app/a

In this way, /app/a is treated as the root of the current service, and all node paths created under it are prefixed with /app/a. For example, if I create a node /node_name, the full path will be /app/a/node_name. This feature is particularly useful in multi-tenant environments, where each tenant considers itself to be the top-level root directory /.
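Parsing such a connection string (the comma-separated host list plus the optional chroot suffix) can be sketched as follows. This is a simplified illustration, not the real client's parser, and the function name is made up:

```python
def parse_connect_string(connect_string):
    """Split a Zookeeper connection string into (host, port) pairs and a chroot path."""
    if "/" in connect_string:
        hosts_part, _, chroot = connect_string.partition("/")
        chroot = "/" + chroot  # everything after the first '/' is the chroot
    else:
        hosts_part, chroot = connect_string, "/"
    hosts = []
    for hp in hosts_part.split(","):
        host, _, port = hp.partition(":")
        hosts.append((host, int(port)))
    return hosts, chroot

hosts, chroot = parse_connect_string("127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002/app/a")
print(chroot)      # /app/a
print(len(hosts))  # 3
```

With the chroot extracted, the client simply prefixes it onto every path the application uses.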

When a Zookeeper client establishes a connection with a server, the client receives a 64-bit SessionID and a password. What is the password for? Zookeeper can be deployed with multiple instances, so if a client disconnects and then connects to another Zookeeper server, it carries this password. The password is a security measure that every Zookeeper node can verify, so the Session remains valid even when the client connects to a different node.

Session expiration occurs when:

  • The specified expiration time has passed
  • The client did not send a heartbeat within the specified time

In the first case, the expiration time is sent to the server when the Zookeeper client establishes the connection. The allowed range is currently only between 2x and 20x tickTime.

tickTime is a Zookeeper server configuration item that specifies the interval at which clients send heartbeats to the server. The default is tickTime=2000, in milliseconds.
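The server clamps the client's requested timeout into the [2x tickTime, 20x tickTime] window. A minimal sketch (the function name is illustrative, not a real Zookeeper API):

```python
def negotiate_session_timeout(requested_ms, tick_time_ms=2000):
    """Clamp the client-requested session timeout into Zookeeper's allowed range."""
    min_timeout = 2 * tick_time_ms   # floor: 2 x tickTime
    max_timeout = 20 * tick_time_ms  # ceiling: 20 x tickTime
    return max(min_timeout, min(requested_ms, max_timeout))

print(negotiate_session_timeout(1000))    # 4000: too short, raised to the floor
print(negotiate_session_timeout(100000))  # 40000: too long, cut to the ceiling
print(negotiate_session_timeout(10000))   # 10000: within range, kept as-is
```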

The Session expiration logic is maintained by the Zookeeper server. Once a Session expires, the server immediately deletes all temporary nodes created by the Client and notifies all clients that are listening on these nodes of the change.

In the second case, the heartbeat in Zookeeper is implemented through PING requests. Every once in a while, the client sends PING requests to the server, which is the essence of the heartbeat. The heartbeat allows the server to perceive that the client is still alive, as well as the client to perceive that the connection to the server is still valid. This interval is **tickTime**, which defaults to 2 seconds.

Watch mechanism

Now that we know about ZNode and Session, we can move on to the next key feature: Watch. The word "Watch" has been mentioned more than once in this article. Its role can be summarized in one sentence:

Register a listener for a node and receive a Watch Event whenever the node is changed (such as updated or deleted)

Just as ZNodes come in several types, so do Watches: one-time Watches and persistent Watches.

  • A one-time Watch is removed once it is triggered
  • A persistent Watch remains after it is triggered and continues to listen for changes on the ZNode; it is a new feature in Zookeeper 3.6.0

A one-time Watch can be set via a parameter when calling methods such as getData(), getChildren(), and exists(), while a persistent Watch requires calling addWatch().

In addition, a one-time Watch has a blind spot: there is a gap between the Watch event reaching the client and the client registering a new Watch. If the node changes during this gap, the client will not be aware of it.
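This blind spot is easy to demonstrate with a toy model of a one-time Watch. The `WatchableNode` class below is invented for illustration; it is not the real server code:

```python
class WatchableNode:
    """Toy ZNode with one-time watches: each watch fires once, then is discarded."""

    def __init__(self, data):
        self.data = data
        self._watches = []

    def get_data(self, watch=None):
        if watch is not None:
            self._watches.append(watch)  # register a one-time watch
        return self.data

    def set_data(self, data):
        self.data = data
        pending, self._watches = self._watches, []  # one-time: remove before firing
        for watch in pending:
            watch("NodeDataChanged")

events = []
node = WatchableNode(b"v1")
node.get_data(watch=events.append)  # read with a one-time watch attached
node.set_data(b"v2")  # fires the watch, which is then removed
node.set_data(b"v3")  # the gap: no watch registered, this change goes unnoticed
print(events)  # ['NodeDataChanged'] - only the first change was observed
```

A persistent Watch (addWatch) avoids exactly this re-registration gap.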

Zookeeper cluster architecture

The ZAB protocol

After laying the groundwork, we can further understand Zookeeper from the perspective of the overall architecture. To ensure high availability, Zookeeper adopts the master-slave read/write separation architecture.

We know that in Redis Cluster, nodes communicate with each other using the Gossip protocol. So what is the communication protocol in Zookeeper?

The answer is the **ZAB (Zookeeper Atomic Broadcast)** protocol.

ZAB is an atomic broadcast protocol that supports crash recovery. It is used to transmit messages between Zookeeper nodes and keep them in sync. ZAB offers high performance and availability, is easy to use and maintain, and supports automatic fault recovery.

ZAB protocol divides nodes in a Zookeeper cluster into three roles: Leader, Follower, and Observer, as shown in the following figure:

In general, this architecture is similar to the Redis master-slave or MySQL master-slave architectures (see previous articles if you are interested).

  • Redis master-slave
  • MySQL master-slave

The difference is that there are two roles in the typical master-slave architecture: Leader and Follower (or Master and Slave), but Zookeeper has an Observer.

So what is the difference between an Observer and a Follower?

In essence, they serve the same function: both give Zookeeper the ability to scale horizontally and handle more concurrency. The difference is that an Observer does not participate in Leader elections.

Sequential consistency

As mentioned above, only the Leader node can process write requests in the Zookeeper cluster. If the Follower node receives a write request, it will forward the request to the Leader node, but the Follower node itself will not process the write request.

The Leader node receives the messages and processes them one by one in a strict order of requests. This is a great feature of Zookeeper, which ensures that messages are sequentially consistent.

For example, if message A arrives before message B, then message A is applied before message B on every Zookeeper node; Zookeeper guarantees a global order over messages.

zxid

So how does Zookeeper keep messages in order? The answer is the zxid.

The zxid can be thought of as the unique ID of a message in Zookeeper. Nodes communicate and synchronize data by sending **Proposals (transaction proposals)**; a Proposal carries the zxid and the actual data (the message). The zxid consists of two parts:

  • Epoch: can be understood as a dynasty, i.e., the Leader's generation; each Leader has a different epoch
  • Counter: a counter that increments with every message

This is also how unique zxids are generated. Since each Leader uses a unique epoch, and within the same epoch the counter differs for each message, every Proposal has a unique zxid across the Zookeeper cluster.
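The two-part layout can be sketched as a 64-bit integer with the epoch in the high 32 bits and the counter in the low 32 bits; this split matches Zookeeper's actual layout, though the helper function names below are made up:

```python
def make_zxid(epoch, counter):
    """Pack epoch (high 32 bits) and counter (low 32 bits) into a 64-bit zxid."""
    return (epoch << 32) | counter

def epoch_of(zxid):
    return zxid >> 32

def counter_of(zxid):
    return zxid & 0xFFFFFFFF

# A proposal from a newer epoch always orders after any proposal from an older one
print(make_zxid(2, 0) > make_zxid(1, 999))  # True
print(epoch_of(make_zxid(3, 42)), counter_of(make_zxid(3, 42)))  # 3 42
```

Because zxids compare as plain integers, sorting proposals by zxid yields the global order described above.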

Recovery mode

Normally, the Zookeeper cluster runs in broadcast mode. Conversely, if more than half of the nodes go down, it enters recovery mode.

What is recovery mode?

In the Zookeeper cluster, there are two modes:

  • Recovery mode
  • Broadcasting mode

When a Zookeeper cluster fails, it enters recovery mode, also known as Leader Activation, during which, as the name implies, a Leader is elected. Nodes generate zxids and Proposals and vote for each other. The election follows two main principles:

  • The elected Leader must have the largest zxid among all followers
  • More than half of the followers must have returned an ACK, indicating that they approve the elected Leader

If an exception occurs during the election, Zookeeper simply starts a new election. If all goes well, the Leader is elected, but the cluster cannot serve requests yet, because the new Leader and the followers have not yet synchronized the critical data.

After that, the Leader waits for the remaining followers to connect, then sends each follower its missing data via Proposals.

As for how the Leader knows which data is missing: Proposals are logged, so a diff can be computed from the counter in the lower 32 bits of each Proposal's zxid.

Of course, there is an optimization here. If too much data is missing, sending Proposals one by one is inefficient, so the Leader instead takes a snapshot of the current data, packages it, and sends it to the followers directly.

The newly elected Leader increments the epoch by 1 and resets the counter to 0.

Did you think that was the end? In fact, the cluster still cannot serve requests at this point.

After data synchronization completes, the Leader sends a NEW_LEADER Proposal to the followers, and only after more than half of the followers return an ACK does the Leader commit the NEW_LEADER Proposal. Only then can the cluster work properly.

At this point, the recovery mode ends and the cluster enters broadcast mode.

Broadcasting mode

In broadcast mode, after receiving a message, the Leader sends a Proposal (transaction proposal) to all followers. Upon receiving the Proposal, each Follower returns an ACK to the Leader. After the Leader receives a quorum of ACKs, the Proposal is committed and applied to the nodes' in-memory state. So what is a quorum?

Zookeeper requires a majority of nodes to return an ACK: for a cluster of N nodes, the quorum size is computed as N/2 + 1 (using integer division).

This may not be intuitive. In plain English: as long as more than half of the followers return an ACK, the Proposal can be committed and applied to the in-memory ZNodes.
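The quorum arithmetic is simple enough to sketch directly (the helper names are invented for illustration):

```python
def quorum_size(n_nodes):
    """Majority quorum for an n-node ensemble: floor(n/2) + 1."""
    return n_nodes // 2 + 1

def can_commit(n_nodes, acks_received):
    """A proposal commits once a majority of nodes have ACKed it."""
    return acks_received >= quorum_size(n_nodes)

print(quorum_size(3), quorum_size(5))  # 2 3
print(can_commit(5, 3))  # True: 3 of 5 is a majority
print(can_commit(5, 2))  # False: 2 of 5 is not
```

This is also why Zookeeper ensembles are usually deployed with an odd number of nodes: a 4-node cluster needs 3 ACKs, tolerating no more failures than a 3-node cluster.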

Zookeeper uses a 2PC-style protocol to ensure data consistency between nodes (as shown in the figure above). But if the Leader had to wait for every Follower, the communication cost would be high and performance would suffer; this is why Zookeeper only requires more than half of the followers to return an ACK rather than all of them.

That's the end of this post. You are welcome to follow [SH full stack notes] on WeChat, where you can reply [queue] for MQ learning materials, including concept analysis and a detailed RocketMQ source-code walkthrough.

If you found this article helpful, a like, comment, or share is appreciated.