One, Foreword

Hello everyone, I'm your brother Boom: not afraid of heaven, not afraid of earth, only afraid that you won't come in.

This article records my process of learning ZK. ZK really is a powerful piece of middleware: it is written in Java, its multi-level queue design is absolutely brilliant, and the ideas in its source code are well worth studying. But I still have to poke fun a little: the naming inside the ZK source code is really…. I'll say no more. Compared with the Spring source code, it is genuinely much harder to read.

Two, Introduction to ZK

Before digging into Zookeeper, you need some background on distributed systems. So what is a distributed system?

Typically, a single physical node quickly hits performance, computation, or capacity bottlenecks, at which point multiple physical nodes are needed to complete a task together. In essence, a distributed system is one whose components are spread across different networked computers and cooperate by passing messages to one another. Zookeeper is a coordination framework for distributed applications, and it has a wide range of application scenarios in distributed system architectures.

What is Zookeeper?

Zookeeper is a distributed coordination framework and a sub-project of the Apache Hadoop ecosystem. It is mainly used to solve data management problems commonly encountered in distributed applications, such as unified naming, status synchronization, cluster management, and the management of distributed application configuration.

Core concepts of Zookeeper

A file-system-like data structure + a listener (watch) notification mechanism.

1. The file-system-like data structure

Zookeeper maintains a data structure similar to a file system:

Each directory entry in the tree is called a Znode (directory node). As in a file system, we can freely add and remove Znodes, and create or delete child Znodes under a given Znode.

ZNodes come in four classic types (plus two later additions):

1. PERSISTENT persistent node: After the client disconnects from ZooKeeper, the node still exists. As long as you do not delete it manually, it exists forever.

2. PERSISTENT_SEQUENTIAL persistent sequential node: The node still exists after the client disconnects from ZooKeeper, and Zookeeper numbers the node name sequentially.

3. EPHEMERAL ephemeral node: The node is deleted after the client disconnects from ZooKeeper. Ephemeral nodes cannot have child nodes.

4. EPHEMERAL_SEQUENTIAL ephemeral sequential node: The node is deleted after the client disconnects from ZooKeeper, and Zookeeper numbers the node name sequentially.

5. Container node (added in 3.5.3): once a Container node has no child nodes, Zookeeper will automatically delete it at some point in the future.

6. TTL node (disabled by default; enabled only via the system property zookeeper.extendedTypesEnabled=true; considered unstable).
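To make the node types concrete, here is a minimal sketch using the official ZooKeeper Java client (the localhost:2181 address and the paths are placeholders I chose for the demo):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeTypesDemo {
    public static void main(String[] args) throws Exception {
        // Assumes a ZK server reachable at localhost:2181
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});

        // PERSISTENT: survives client disconnects until deleted manually
        zk.create("/app", "cfg".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // PERSISTENT_SEQUENTIAL: the server appends an increasing suffix
        String seq = zk.create("/app/task-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("created " + seq); // e.g. /app/task-0000000000

        // EPHEMERAL: removed automatically when this session ends
        zk.create("/app/alive", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        zk.close();
    }
}
```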

2. The listener (watch) notification mechanism

A client can register a listener on any node it cares about, including directory nodes and, recursively, their subdirectory nodes:

  1. If the listener is registered on a node, the client is notified when the node is deleted or modified.

  2. If the listener is registered for a directory, the client is notified when a child node of the directory is created or deleted

  3. If a client registers a recursive listener on a directory, it is notified of changes anywhere in the subtree: child nodes created or deleted at any level, or changes to the root node itself.

Note: All notifications are one-time. Whether the listener is on a node or a directory, it is removed once triggered. A recursive listener covers all child nodes, so an event under each child node also fires only once before the watch must be re-registered.
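Because watches are one-shot, client code typically re-registers the watch inside the callback; this loop-listening pattern is exactly what the configuration-center scenario below relies on. A minimal sketch with the Java client (the class and path are my own invention):

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class NodeDataWatcher {
    private final ZooKeeper zk;
    private final String path;

    public NodeDataWatcher(ZooKeeper zk, String path) {
        this.zk = zk;
        this.path = path;
    }

    public void watch() throws Exception {
        // getData registers a one-shot watch; we re-register in the callback
        byte[] data = zk.getData(path, this::onEvent, new Stat());
        System.out.println("current data: " + new String(data));
    }

    private void onEvent(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
            try {
                watch(); // re-read, which also re-registers the watch
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```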

Three, Application scenarios of Zookeeper

Let me give you a few ideas:

  • Distributed queue: can be implemented with ZK persistent sequential nodes.

  • Cluster election: can be built on ZK's Leader election capability.

  • Publish/subscribe: implemented via ZK's listener registration.

  • Distributed configuration center and registry

    1. Create a PERSISTENT node holding the project's configuration file in JSON format: create /config/<project-name>
    2. The client watches this node: get -w /config/<project-name>. When the node's data changes, the configuration JSON has changed, and the client learns of it immediately. Because a watch is one-time, the client simply re-registers it in a loop.
  • Distributed locking (the focus here)

    ZK distributed locks come in two flavors: fair locks and unfair locks.

    Unfair lock:

    1. A request comes in and tries to create an ephemeral node /lock: create -e /lock. If the node already exists, the ZK server reports that it cannot be created again, so the client watches it instead: get -w /lock.
    2. When the request holding the lock finishes, it releases the lock, i.e. delete /lock. The deletion notifies every connection watching that node, which is a disaster under high concurrency.

    Under serious concurrency, the performance of this implementation degrades badly. The root cause is that all connections watch the same node: when the server handles the delete event, it has to notify every one of them at once, the classic herd effect (thundering herd). So how do we avoid that? Like this:

    Fair lock:

    1. When a request comes in, create an ephemeral sequential node directly under /lock.
    2. Check whether your node is the smallest under /lock. If it is, you hold the lock. If not, watch only the node immediately before yours: get -w /lock/<previous-node>
    3. When the request holding the lock finishes, it releases the lock, i.e. delete /lock/<own-node>. Only the next node in line is notified, and that client repeats step 2. (A sketch follows this list.)
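To make the fair-lock recipe concrete, here is a minimal sketch against the raw ZooKeeper Java API. It assumes /lock already exists as a persistent node and omits error handling and session-expiry recovery; in practice, Apache Curator's InterProcessMutex implements this same pattern:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class FairLock {
    private final ZooKeeper zk;
    private String ourNode; // e.g. /lock/seq-0000000007

    public FairLock(ZooKeeper zk) { this.zk = zk; }

    public void lock() throws Exception {
        // 1. Create an ephemeral sequential node under /lock
        ourNode = zk.create("/lock/seq-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        while (true) {
            List<String> children = zk.getChildren("/lock", false);
            Collections.sort(children);
            String ourName = ourNode.substring("/lock/".length());
            int idx = children.indexOf(ourName);
            if (idx == 0) {
                return; // 2. Ours is the smallest node: lock acquired
            }
            // Watch only the node just before ours (no herd effect)
            String prev = "/lock/" + children.get(idx - 1);
            CountDownLatch latch = new CountDownLatch(1);
            if (zk.exists(prev, event -> {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                    latch.countDown();
                }
            }) != null) {
                latch.await(); // block until the previous node disappears
            }
            // Loop: re-check, in case the previous holder already vanished
        }
    }

    public void unlock() throws Exception {
        // 3. Release the lock; only the next waiter gets notified
        zk.delete(ourNode, -1);
    }
}
```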

Four, Comparison between ZK and Redis distributed locks

Redis distributed locks

1. Setnx + Lua script

Advantages: Redis is memory-based and has high read/write performance, so Redis-based distributed locking is very efficient.

Disadvantages: In a distributed environment, data replication between nodes is asynchronous, which affects reliability. For example, take a Redis cluster with 3 masters and 3 slaves. A client's lock command is written to the master on machine 1; just as the data is about to be replicated to the slave, the master dies, so the slave never receives the latest data. The slave is then elected master, and the previously acquired distributed lock becomes invalid.
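For reference, the SETNX-plus-Lua pattern looks roughly like this with the Jedis client (the key name, token, and 30-second expiry are illustrative choices of mine):

```java
import java.util.Collections;
import java.util.UUID;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RedisLockSketch {
    // Delete the key only if it still holds our token (atomic via Lua).
    private static final String UNLOCK_LUA =
            "if redis.call('get', KEYS[1]) == ARGV[1] then " +
            "  return redis.call('del', KEYS[1]) " +
            "else return 0 end";

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String token = UUID.randomUUID().toString();
            // SET key value NX EX 30 == SETNX plus expiry in one atomic step
            String ok = jedis.set("lock:order", token,
                    SetParams.setParams().nx().ex(30));
            if ("OK".equals(ok)) {
                try {
                    // ... critical section ...
                } finally {
                    jedis.eval(UNLOCK_LUA,
                            Collections.singletonList("lock:order"),
                            Collections.singletonList(token));
                }
            }
        }
    }
}
```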

2. Redisson

Advantages: addresses the synchronization/availability problem of Redis clusters.

Disadvantages: the common claim online is that it was released relatively recently, so its stability and reliability are yet to be proven. Personally, I think it has already proven stable in the market and is a fairly mature and complete distributed lock.

ZK distributed lock

Advantages: the Redis problems above do not occur. Data synchronization is safe (Zookeeper returns success only after the data is synchronized), and so is primary/secondary switchover (the Zookeeper service is simply unavailable during the switchover instead of serving stale data). Reliability is therefore high.

Cons: reliability is bought at the expense of efficiency (which is still decent), so performance is not as good as Redis.

Five, ZK cluster Leader election

There are three roles in Zookeeper's cluster mode:

  • Leader: Processes all transaction requests (write requests) and can process read requests. There can be only one Leader in the cluster.
  • Follower: The Follower node can only handle read requests and serves as a candidate node for the Leader. That is, if the Leader breaks down, the Follower node must participate in the Leader election and may become the new Leader node.
  • Observer: Processes only read requests. Can’t participate in elections.

For each update request from a client, ZooKeeper assigns a globally unique, monotonically increasing number. This number reflects the order of all transactions and is called the Zxid (ZooKeeper Transaction Id).
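One well-known detail: a Zxid is a 64-bit number, with the high 32 bits holding the Leader epoch and the low 32 bits a counter within that epoch. A tiny sketch:

```java
public class ZxidDemo {
    // High 32 bits: Leader epoch; low 32 bits: counter within the epoch.
    static long epochOf(long zxid)   { return zxid >>> 32; }
    static long counterOf(long zxid) { return zxid & 0xFFFFFFFFL; }

    public static void main(String[] args) {
        long zxid = (5L << 32) | 42L; // epoch 5, 42nd transaction in it
        System.out.println(epochOf(zxid));   // 5
        System.out.println(counterOf(zxid)); // 42
    }
}
```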

Leader Election Process

  1. ZK internally uses a fast leader election algorithm based on two factors: (myid, Zxid). myid is the server's serial number within the cluster, maintained in the configuration file, and Zxid is ZK's transaction id.
  2. Suppose two servers of a ZK cluster are started, with myids 1 and 2. A Leader election takes place during startup, and the process goes as follows.
  3. Since the cluster has just started, there are no transactions yet, so the machine with myid 1 casts the vote (1,0) and receives (2,0). Votes are compared as (myid, Zxid): the machine with the larger Zxid wins, because a larger Zxid means newer data. If the Zxids are equal, the machine with the larger myid wins by default. Here both Zxids are 0, so (2,0) is recommended as the Leader.
  4. The machine with myid 2 votes for itself, (2,0); after receiving (1,0), it still recommends (2,0) as the Leader.
  5. By ZK's majority rule, once (cluster size / 2 + 1) machines recommend the same machine as Leader, that machine switches from the LOOKING state to LEADING, while the other machines switch to FOLLOWING. (The comparison rule is sketched below.)
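The comparison rule from step 3 fits in a few lines. This is a simplified sketch; the real FastLeaderElection also compares an election epoch before the Zxid:

```java
public class VoteComparison {
    record Vote(long myid, long zxid) {}

    // Returns true if the received vote should replace our current choice.
    static boolean challengerWins(Vote current, Vote received) {
        if (received.zxid() != current.zxid()) {
            return received.zxid() > current.zxid(); // newer data wins
        }
        return received.myid() > current.myid(); // tie-break on myid
    }

    public static void main(String[] args) {
        // (1,0) receives (2,0): equal Zxid, so the larger myid wins
        System.out.println(challengerWins(new Vote(1, 0), new Vote(2, 0))); // true
    }
}
```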

Why is an odd number of ZK servers recommended?

  1. Assume a cluster of four machines. The majority rule counts against the configured ensemble size, so a quorum is (4/2 + 1) = 3 votes. If the Leader fails, the three remaining Followers can still elect a Leader, but the cluster tolerates only that single failure.
  2. Assume a cluster of three machines: a quorum is (3/2 + 1) = 2 votes. If the Leader fails, the two remaining Followers can still elect a Leader; again, exactly one failure is tolerated.
  3. Since clusters of three and four machines both survive exactly one failure, why not save the cost of one machine?

So, in general, it’s cost-saving.

The multi-layer queue architecture of Leader election

The whole ZK election stack can be divided into an election application layer and a message transport layer.

ZK maintains queues and threads at both the application layer and the transport layer; below we distinguish application-layer queues and threads from transport-layer queues and threads.

  1. The application layer has a single unified queue for outgoing votes (the application-layer SendQueue) and one for incoming votes (the application-layer ReceiveQueue). A WorkerSender thread scans the application-layer SendQueue, and a WorkerReceiver thread scans the transport layer's ReceiveQueue.
  2. The transport layer maintains one send queue for every machine in the ZK cluster except the current one; each machine has its own send queue and a dedicated sending thread that continuously scans that queue.
  3. Every pair of machines in the ZK cluster keeps a long-lived Socket connection. When a sending thread finds a new vote message in its queue, it sends it over the Socket to the corresponding machine. On the receiving side, a transport-layer receive thread puts the message into the transport layer's unified ReceiveQueue; when the application-layer WorkerReceiver thread finds a message there, it forwards it into the application-layer ReceiveQueue.
  4. Finally, the votes are counted and the Leader is elected.

Now, dear readers, do you wonder why ZK did it this way?

  1. Asynchrony improves performance.
  2. Messages are queued per target machine, so sends to different machines don't interfere with one another: if sending to one machine fails, messages to the healthy machines are unaffected. (See the sketch below.)
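Here is a toy sketch of that two-layer queue idea. All the names are mine; in the real source, the application layer lives in FastLeaderElection and the transport layer in QuorumCnxManager:

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

public class ElectionQueuesSketch {
    record Vote(long fromMyid, long toMyid, long zxid) {}

    // Application layer: one unified send queue for all outgoing votes.
    final BlockingQueue<Vote> appSendQueue = new ArrayBlockingQueue<>(1024);
    // Transport layer: one queue (and one sender thread) per peer machine.
    final Map<Long, BlockingQueue<Vote>> perPeerQueue = new ConcurrentHashMap<>();

    void start(long... peerIds) {
        for (long peer : peerIds) {
            BlockingQueue<Vote> q = new ArrayBlockingQueue<>(1024);
            perPeerQueue.put(peer, q);
            // A slow or dead peer only blocks its own queue, never the others.
            Thread sender = new Thread(() -> {
                try {
                    while (true) {
                        sendOverSocket(peer, q.take()); // long-lived socket per peer
                    }
                } catch (InterruptedException ignored) { }
            }, "sender-" + peer);
            sender.setDaemon(true);
            sender.start();
        }
        // WorkerSender-like thread: drains the unified application queue
        // and routes each vote to the right per-peer transport queue.
        Thread workerSender = new Thread(() -> {
            try {
                while (true) {
                    Vote v = appSendQueue.take();
                    perPeerQueue.get(v.toMyid()).put(v);
                }
            } catch (InterruptedException ignored) { }
        }, "worker-sender");
        workerSender.setDaemon(true);
        workerSender.start();
    }

    void sendOverSocket(long peer, Vote v) { /* socket I/O elided */ }
}
```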

Six, ZK's split-brain problem

Split brain usually shows up in clusters like Elasticsearch and Zookeeper, which have a single "brain": Elasticsearch has a Master node, and Zookeeper has a Leader node.

What is a split brain?

To put it simply, a healthy ZK cluster has only one Leader, and this Leader is the brain of the whole cluster. Split brain, as the name implies, means multiple Leaders appear.

A split-brain scenario in ZK

To improve availability, a cluster is commonly deployed across multiple machine rooms. For example, suppose a cluster of six ZK servers is deployed across two rooms:

Normally there is only one Leader in this cluster. If the network between the two rooms goes down, the ZK servers within each room can still talk to each other, so, ignoring the majority mechanism for a moment, each room would elect its own Leader.

That is equivalent to one cluster splitting into two, producing two "brains", which is the phenomenon known as "split brain". The two halves were supposed to present one unified service; now two clusters are serving clients at the same time. And if the network suddenly recovers after a while, there is a real problem: both clusters have been accepting writes, so how do we merge the data, and how do we resolve the conflicts?

A prerequisite for this split-brain scenario was ignoring the majority mechanism. In practice, a Zookeeper cluster does not easily suffer split brain, precisely because of the majority mechanism.

Why does ZK's majority rule use "greater than half" rather than "greater than or equal to half"?

This is directly related to split brain. Go back to the scenario above: when the network between the rooms is cut, the three servers in room 1 hold a Leader election, but the majority condition is "number of nodes > 3" (more than half of six), i.e. at least four ZK servers are needed to elect a Leader. So room 1 cannot elect one, and by the same reasoning neither can room 2. While the link is down, the whole cluster simply has no Leader. If the condition were "number of nodes >= 3", then room 1 and room 2 would each elect a Leader. That is exactly why the rule is "greater than" rather than "greater than or equal to": it prevents split brain.

Now suppose we have only five machines, again deployed across two rooms (say, three in room 1, including the Leader, and two in room 2). The majority condition becomes "number of nodes > 2", i.e. at least three servers are needed to elect a Leader. When the inter-room network is cut, room 1 is unaffected: its Leader is still the Leader. Room 2 cannot elect one. The cluster still has exactly one Leader. So we can conclude: with the majority rule, a ZK cluster either has no Leader or exactly one, and that is how ZK avoids split brain. (The arithmetic is sketched below.)
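The whole argument boils down to one line of integer arithmetic:

```java
public class QuorumCheck {
    // A partition can elect a Leader only if it holds a strict majority
    // of the configured ensemble (note the ">" rather than ">=").
    static boolean canElectLeader(int ensembleSize, int reachableNodes) {
        return reachableNodes > ensembleSize / 2;
    }

    public static void main(String[] args) {
        System.out.println(canElectLeader(6, 3)); // false: neither 3-node room can elect
        System.out.println(canElectLeader(5, 3)); // true: room 1 keeps its Leader
        System.out.println(canElectLeader(5, 2)); // false: room 2 stays Leader-less
    }
}
```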

Seven, The famous ZAB protocol

What is the ZAB protocol?

The ZAB (Zookeeper Atomic Broadcast) protocol is a consistency protocol designed specifically for ZK, with support for crash recovery. On top of it, ZK implements a master-slave architecture that keeps the data consistent across all replicas in the cluster.

Distributed systems commonly use a master-slave architecture: one Leader server is responsible for write requests from external clients, while all the other servers are Followers, and the Leader synchronizes the written data to every Follower node.

In this setup, all client write requests go to the Leader, and after the Leader writes the data, it forwards it to the Followers. Two problems must be solved:

  1. How does the Leader server update the data to all the Followers?
  2. What if the Leader server suddenly fails?

In order to solve the above two problems, ZAB protocol designs two modes:

  1. Message broadcast mode: Updates data to all followers

  2. Crash recovery mode: How to recover from a Leader crash

Message broadcast mode

The ZAB message broadcast process uses an atomic broadcast protocol, similar to a two-phase commit. All client write requests are received by the Leader, which wraps each request in a transaction Proposal and sends it to all Followers. Then, based on the Followers' feedback, if more than half respond successfully, the Leader performs the commit.

  1. The Leader converts the client Request into a Proposal.
  2. The Leader prepares a FIFO queue for each Follower and Observer and puts the Proposal into every queue.
  3. Each Follower and Observer takes the Proposal from the head of its queue and returns an ACK to the Leader.
  4. Once the Leader has received ACKs from more than half, it performs the COMMIT and sends the COMMIT to all Followers and Observers.
  5. The Followers and Observers perform the commit. (A toy model follows below.)
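Here is a toy model of that broadcast phase: one FIFO queue per Follower and a commit decision after a strict majority of ACKs. The names are mine, not ZK's:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class ZabBroadcastSketch {
    record Proposal(long zxid, String op) {}

    private final Map<String, BlockingQueue<Proposal>> followerQueues = new ConcurrentHashMap<>();
    private final int ensembleSize;
    private int ackCount;

    public ZabBroadcastSketch(List<String> followerIds, int ensembleSize) {
        this.ensembleSize = ensembleSize;
        followerIds.forEach(id -> followerQueues.put(id, new LinkedBlockingQueue<>()));
    }

    // Steps 1-2: wrap the request in a Proposal, enqueue it per Follower (FIFO).
    public synchronized void propose(Proposal p) throws InterruptedException {
        ackCount = 1; // the Leader counts itself
        for (BlockingQueue<Proposal> q : followerQueues.values()) {
            q.put(p);
        }
    }

    // Steps 3-4: each Follower ACK increments the count; commit on a majority.
    public synchronized boolean onAck() {
        ackCount++;
        return ackCount > ensembleSize / 2; // if true, send COMMIT to everyone
    }
}
```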

That is the whole message broadcast mode. But what happens if the Leader crashes during a broadcast? Can data consistency still be preserved? What if the Leader committed locally but the COMMIT request never went out? That is where the second mode comes in: crash recovery mode.

Crash recovery mode

Crash: when the Leader loses contact with more than half of the Followers, a new round of election is held to choose a new Leader.

Since not every state can be carried forward safely, ZAB's crash recovery must meet the following two requirements:

  1. Proposals that were issued only on the Leader and never acknowledged by the Followers must be discarded.
  2. Transactions that were already committed on the Leader must eventually be committed on all Followers.

All right, let’s start the recovery.

  1. The new Leader is the candidate with the largest Zxid, which means it holds the most recent transactions.
  2. The new Leader synchronizes its Proposals to the other Follower nodes.
  3. Each Follower rolls back or catches up based on the Leader's messages; the end goal is that every node in the cluster holds a consistent copy of the data.

That is the whole recovery process: essentially, every operation is logged, and after a crash the most recent state before the failure is restored from the log and then synchronized across the cluster.