Data synchronization in the ZooKeeper cluster

Author: HelloGitHub - Lao Xun

Hello, this is HelloGitHub’s HelloZooKeeper series: a free, open-source, fun, entry-level ZooKeeper tutorial for beginners with basic programming skills.

Project address: github.com/HelloGitHub…

The previous article covered how ZK persists its data. In this chapter we look at how Followers and Observers synchronize data with the Leader after an election.

1. Completion of the election

After the election, our Ma Guoguo was elected Leader of the current office cluster, so assume the relationship between the offices now looks like this:

Now let’s talk about how Ma Xiaoyun and Ma Xiaoteng synchronize data with Ma Guoguo.

After an exhausting election, Ma Xiaoyun and Ma Xiaoteng narrowly lost and had to settle for being Followers. Once they had collected themselves, the first thing they did was report their own information to Ma Guoguo through the operator, using the code word FOLLOWERINFO. The report mainly carries their own epoch and myid:

Ma Guoguo keeps count of the FOLLOWERINFO reports he receives. Once more than half of the cluster has reported, he calculates a new epoch from the information supplied by the Followers and sends it back to them with the code word LEADERINFO.
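As a rough illustration of this step, the sketch below collects reported epochs and proposes a new one once more than half of the cluster has reported. The class and method names are hypothetical, and it assumes the new epoch is simply the largest reported epoch plus one, with the Leader recording its own epoch as one of the reports.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical helper, not ZooKeeper's real class: tracks the epochs reported
// by voting members and proposes a new epoch once a majority has reported.
final class EpochProposalSketch {
    private final int clusterSize;                             // number of voting members
    private final Map<Long, Long> reported = new HashMap<>();  // myid -> reported epoch

    EpochProposalSketch(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    // Record one FOLLOWERINFO-style report.
    // Returns the new epoch once more than half of the members have reported.
    OptionalLong report(long myId, long acceptedEpoch) {
        reported.put(myId, acceptedEpoch);
        if (reported.size() <= clusterSize / 2) {
            return OptionalLong.empty();                       // no majority yet
        }
        long max = reported.values().stream()
                .mapToLong(Long::longValue).max().getAsLong();
        return OptionalLong.of(max + 1);                       // assumption: largest epoch + 1
    }
}
```

With three offices, the first report returns an empty result and the second one yields the new epoch that would travel back in LEADERINFO.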

Back at Ma Xiaoyun and Ma Xiaoteng: after receiving LEADERINFO they record the new epoch, then reply to Ma Guoguo with the code word ACKEPOCH, carrying their own largest ZXID, to confirm that LEADERINFO has been received.

Ma Guoguo again waits until more than half have sent ACKEPOCH, then chooses a synchronization strategy for each Follower based on its information. Let me introduce the strategies first:

DIFF: if the Follower’s records differ only slightly from the Leader’s, incremental synchronization is used and the missing write requests are sent to the Follower one by one
TRUNC: the Follower’s zxid is ahead of the current Leader’s (it may have been the previous Leader), so the Follower must cut off the excess records and fall back to the same level as the Leader
SNAP: if the Follower’s records differ too much from the current Leader’s, the Leader sends its entire in-memory data to the Follower

Which strategy is used, and how that is decided, is explained one by one below.

1.1 DIFF

Each ZK node maintains a write request queue in memory (default size 500, configured by zookeeper.commitLogCount) and records every write request it receives in that queue. The zxid of the oldest write request in the queue is minZxid (min for short) and the zxid of the newest is maxZxid (max for short). When the queue reaches its limit, the oldest write request is removed. With these two values in hand, let’s see how DIFF is decided.
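To make min and max concrete, here is a minimal sketch of such a bounded queue. The class name is hypothetical and only zxids are stored, whereas a real queue would hold the full write requests. The size check runs before the append, an ordering chosen here so that the queue settles at capacity + 1 entries, which lines up with the 277 to 777 example in section 1.2.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for the in-memory write request queue.
final class CommittedLogSketch {
    private final int capacity;                      // 500 by default in the article
    private final Deque<Long> zxids = new ArrayDeque<>();

    CommittedLogSketch(int capacity) {
        this.capacity = capacity;
    }

    // Evict the oldest entry if the queue has grown past its limit, then append.
    void add(long zxid) {
        if (zxids.size() > capacity) {
            zxids.removeFirst();
        }
        zxids.addLast(zxid);
    }

    long minZxid() { return zxids.getFirst(); }      // oldest write still queued
    long maxZxid() { return zxids.getLast(); }       // newest write queued

    // True when a follower's zxid can be served from this queue.
    boolean covers(long followerZxid) {
        return !zxids.isEmpty()
                && followerZxid >= minZxid()
                && followerZxid <= maxZxid();
    }
}
```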

1.1.1 Recovering from the Write Request Queue in Memory

In the first case, if the zxid a Follower reports via ACKEPOCH lies between min and max, the DIFF strategy is used for data synchronization.

In our example the Leader’s ZXID is 99, which means the 500-entry queue is nowhere near full, so min is 1 and max is 99. Clearly 77 and 88 both fall in this range. Ma Guoguo finds each Follower’s position in the queue, sends a DIFF first, and then sends a PROPOSAL and a COMMIT packet for each missing write request.

1.1.2 Recovering from the Disk File Log

In the other case, the Follower’s zxid is not between min and max, but if zookeeper.snapshotSizeFactor is greater than 0 (the default is 0.33), the Leader tries to DIFF from the on-disk logs. The condition is that the total size of the log files to be synchronized must not exceed the size of the latest snapshot file multiplied by that factor (roughly one third by default). If it holds, DIFF synchronization is performed by reading the write request records from the log files. The procedure is the same as above: send a DIFF to the Follower, locate the Follower’s range in the log files, then send PROPOSAL and COMMIT one by one.
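The size rule above boils down to a single comparison. The sketch below is only an illustration of that rule; the class, method, and parameter names are made up, and sizes are taken as plain byte counts.

```java
// Hypothetical helper illustrating the snapshotSizeFactor rule described above.
final class LogDiffCheckSketch {
    // Returns true when DIFF from the on-disk logs is allowed.
    static boolean canDiffFromLog(long logBytesToSend,
                                  long latestSnapshotBytes,
                                  double snapshotSizeFactor) {
        if (snapshotSizeFactor <= 0) {
            return false;                           // log-based DIFF is switched off
        }
        long limit = (long) (latestSnapshotBytes * snapshotSizeFactor);
        return logBytesToSend <= limit;             // logs must stay within the limit
    }
}
```

With a 100-byte snapshot and the default factor, 30 bytes of log pass the check and 40 bytes do not.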

After receiving the PROPOSAL messages, the Followers process them one by one just as they would client requests, gradually bringing their data in line with the Leader’s.

1.2 SNAP

Suppose the three offices look like this

With the default configuration, Ma Guoguo’s write request queue holds write requests 277 to 777. Assuming the scenario does not satisfy 1.1.2 above, Ma Guoguo knows that synchronization has to go through SNAP.

Ma Guoguo first sends a SNAP request to Ma Xiaoyun and Ma Xiaoteng so they can get ready.

He then serializes the data currently in memory (the same way a snapshot file is written) and sends it to Ma Xiaoyun and Ma Xiaoteng.

After Ma Xiaoyun and Ma Xiaoteng receive the whole snapshot from Ma Guoguo, they first clear everything in their current database and then deserialize the received snapshot directly, which completes the recovery of the entire in-memory data.

1.3 TRUNC

The scenario for the last strategy assumes this:

Suppose Ma Xiaoteng was the previous Leader, lost power, recovered, and rejoined the cluster as a Follower, but his ZXID is larger than max. In that case Ma Guoguo sends TRUNC to Ma Xiaoteng. (Why is Ma Xiaoyun not used as the TRUNC example in the figure? Because if Ma Xiaoyun’s ZXID were larger than Ma Guoguo’s, Ma Guoguo could not have been elected Leader in this scenario.)

Ma Guoguo will send TRUNC to Ma Xiaoteng (Ma Xiaoyun is ignored here)

Suppose Ma Xiaoteng’s local log file directory looks like this:

/tmp
└── zookeeper
    └── log
        └── version-2
            └── log.0
            └── log.500
            └── log.800

After receiving TRUNC, Ma Xiaoteng finds every local log file that starts after 777 and deletes it, which here is log.800. He then locates the record with zxid 777 in log.500 and moves that file’s read/write pointer to it. From then on, reads and writes on the file start from that position, overwriting the records after it.
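The file selection can be sketched as follows. This is a simplified illustration with hypothetical names: it assumes file names of the form log.&lt;first zxid&gt; with a decimal suffix, as in the example directory, and it only decides which files to delete and which one to truncate, without touching the disk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner for the TRUNC step: no file is actually modified.
final class TruncPlanSketch {
    final List<String> toDelete = new ArrayList<>();  // files that start after the target
    String toTruncate;                                // file that contains the target zxid

    // Assumes names look like "log.<first zxid>" with a decimal suffix.
    static TruncPlanSketch plan(List<String> logFiles, long targetZxid) {
        TruncPlanSketch plan = new TruncPlanSketch();
        long best = -1;
        for (String name : logFiles) {
            long firstZxid = Long.parseLong(name.substring("log.".length()));
            if (firstZxid > targetZxid) {
                plan.toDelete.add(name);              // every record here is newer than the target
            } else if (firstZxid > best) {
                best = firstZxid;                     // latest file starting at or before the target
                plan.toTruncate = name;
            }
        }
        return plan;
    }
}
```

For the directory above and a target of 777, the plan deletes log.800 and truncates log.500.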

On Ma Guoguo’s side, after determining the synchronization strategy and sending it to the other two, he sends each of them a NEWLEADER message.

After Ma Xiaoyun and Ma Xiaoteng receive NEWLEADER, those who synchronized through SNAP force themselves to write a new snapshot file. They then reply with an ACK to tell Ma Guoguo that their data synchronization is complete.

Ma Guoguo again waits until more than half have sent this ACK, then sends UPTODATE to the other two, telling them that the offices’ data is now consistent and they can start serving requests.

After receiving UPTODATE, Ma Xiaoyun and Ma Xiaoteng reply to Ma Guoguo with one more ACK, but this time Ma Guoguo does nothing with it. So once UPTODATE has been sent, every office can officially start serving.

So far Ma Xiaoyun and Ma Xiaoteng have been Followers. What if they were Observers? How would they synchronize?

The only difference is the first step: Followers send FOLLOWERINFO, while Observers send OBSERVERINFO. Apart from that, Observers follow the same steps as Followers to synchronize data.

2. Digging deeper

Now let’s explain some of the details in programmer terms. The three synchronization strategies differ in how the Leader actually sends the data to the Followers.

2.1 How the three strategies are sent

With DIFF or TRUNC, the Leader does not send each piece of data the moment it finds it. It puts the packets into a queue in order and finally starts a thread that sends them one by one.

DIFF:

TRUNC:

SNAP synchronization, however, does not go through the queue. Both the SNAP message and the entire serialized in-memory snapshot are written directly to the socket between the servers.
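The queue-then-send pattern used for DIFF and TRUNC can be sketched with a BlockingQueue and one sender thread. Everything here is a hypothetical stand-in: packets are plain strings, the "socket" is an in-memory list, and the end marker is an invention of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: packets are queued first, then one thread sends them in order.
final class QueuedSenderSketch {
    private static final String END = "__END__";            // made-up end-of-stream marker
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    final List<String> wire = new ArrayList<>();             // stands in for the socket

    void enqueue(String packet) {
        queue.add(packet);
    }

    // Start the sender thread, let it drain the queue in order, and wait for it.
    void flush() {
        queue.add(END);
        Thread sender = new Thread(() -> {
            try {
                for (String p = queue.take(); !p.equals(END); p = queue.take()) {
                    wire.add(p);                             // "send" one packet
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sender.start();
        try {
            sender.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sending", e);
        }
    }
}
```

Queueing DIFF, PROPOSAL and COMMIT packets and then calling flush() delivers them to wire in exactly the order they were queued.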

2.2 God’s-eye view

Let’s look at the complete message exchange for each of the three strategies once more, taking Ma Xiaoyun as the example.

2.2.1 DIFF

2.2.2 TRUNC

2.2.3 SNAP

As you can see, the beginning and the end are identical; only the requests in the middle differ by strategy. That completes the overall logic of how a Follower or Observer synchronizes data with the Leader.
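The message exchange described in section 1 can be written down as a small sketch that lists the messages in order for each strategy. The class and the arrow notation are hypothetical: "->" means learner to Leader and "<-" means Leader to learner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical summary of the message order described in this article.
final class SyncFlowSketch {
    enum Strategy { DIFF, TRUNC, SNAP }

    static List<String> messages(Strategy strategy, boolean observer) {
        List<String> flow = new ArrayList<>();
        flow.add("-> " + (observer ? "OBSERVERINFO" : "FOLLOWERINFO"));
        flow.add("<- LEADERINFO");
        flow.add("-> ACKEPOCH");
        switch (strategy) {                       // only this middle part differs
            case DIFF:
                flow.add("<- DIFF");
                flow.add("<- PROPOSAL + COMMIT (one pair per missing write)");
                break;
            case TRUNC:
                flow.add("<- TRUNC");
                break;
            case SNAP:
                flow.add("<- SNAP");
                flow.add("<- serialized in-memory data");
                break;
        }
        flow.add("<- NEWLEADER");
        flow.add("-> ACK");
        flow.add("<- UPTODATE");
        flow.add("-> ACK");
        return flow;
    }
}
```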

2.3 Summary

Followers and Observers synchronize data in one of three modes: DIFF, SNAP, and TRUNC
DIFF requires that the Follower’s or Observer’s zxid lie between the Leader’s min and max, or that the configuration allow recovery from the log files
TRUNC: when the zxid of a Follower or Observer is larger than the Leader’s, the node deletes its excess records so that it is consistent with the Leader
SNAP is the last resort: the Leader serializes its entire in-memory data and sends it to the Follower or Observer to restore from
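Put together, the decision can be sketched as a single function. This is a simplification under the rules stated in this article, with hypothetical names; the real server handles more corner cases than are shown here.

```java
// Hypothetical decision helper following the three rules summarized above.
final class SyncStrategySketch {
    enum Strategy { DIFF, TRUNC, SNAP }

    static Strategy choose(long followerZxid, long minZxid, long maxZxid,
                           boolean logDiffAllowed) {
        if (followerZxid > maxZxid) {
            return Strategy.TRUNC;               // learner is ahead of the Leader
        }
        if (followerZxid >= minZxid) {
            return Strategy.DIFF;                // covered by the in-memory queue
        }
        // Too far behind for the queue: DIFF from disk logs if allowed, else SNAP.
        return logDiffAllowed ? Strategy.DIFF : Strategy.SNAP;
    }
}
```

For example, choose(77, 1, 99, false) yields DIFF, while choose(800, 277, 777, false) yields TRUNC.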

Looking at the word count so far, I decided to add something extra: a short piece on ACLs, which I have been putting off for a long time.

3. No rules, no order

In the following call, ZooDefs.Ids.OPEN_ACL_UNSAFE is the ACL parameter:

client.create("/Update video/Dance/20201101", "This is data. You can record some business data here, or write whatever you want.".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

If the server is configured with zookeeper.skipACL set to yes (case sensitive), that node skips ACL verification altogether. The default value is no.

So what rules does an ACL define, which permissions exist, and how is all this represented on the server side? First, an ACL has two main parts: permission and scheme. The permission says which operations are allowed, and the scheme says which authentication mode to use. Let’s look at each in turn.

3.1 Permissions

Firstly, ZK divides permissions into five types:

READ (R for short): get node data or the list of child nodes
WRITE (W for short): set node data
CREATE (C for short): create a child node
DELETE (D for short): delete a child node
ADMIN (A for short): set the ACL of a node

At the code level these five permissions are plain int values. To decide whether a permission is held, the server applies the & operator to the held permissions and the target permission and checks that the result is not 0. Details are as follows:

     int   binary
R    1     00001
W    2     00010
C    4     00100
D    8     01000
A    16    10000

Suppose the current client permissions are RWC, the corresponding value is the sum of the permissions 1 + 2 + 4 = 7

     int   binary
RWC  7     00111

For any operation that requires R, W, or C, the & result is not 0, so the server concludes that the client holds that permission.

However, if the client tries to delete the target node, the check yields 0, meaning the client has no permission to delete it, and a permission error is returned to the client.

        int   binary
RWC     7     00111
D       8   & 01000
---------------------
result  0     00000
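The same check takes only a few lines of Java. This is a stand-alone sketch with a hypothetical class name; the constants simply repeat the values from the table above.

```java
// Stand-alone sketch of the bitmask check; values copied from the table above.
final class PermsSketch {
    static final int READ = 1, WRITE = 2, CREATE = 4, DELETE = 8, ADMIN = 16;

    // A permission is held when the bitwise AND with the required bit is non-zero.
    static boolean allows(int granted, int required) {
        return (granted & required) != 0;
    }

    public static void main(String[] args) {
        int rwc = READ | WRITE | CREATE;           // 7, binary 00111
        System.out.println(allows(rwc, WRITE));    // true
        System.out.println(allows(rwc, DELETE));   // false
    }
}
```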

3.2 Schemes

There are four schemes: ip, world, digest and super. They fall into two groups: ip, which works with IP addresses, and world, digest and super, which work with a “username:password” style id. An ACL is made up of three parts, scheme:id:perms. The value of id depends on the scheme; for the ip scheme, for example, it is an IP address. perms is the RWCDA set described in the previous section.

The first two parts, scheme:id, tell the server “Who am I?”, and the last part, perms, answers “What can I do?”. If either answer is wrong, the server throws a NoAuthException to tell the client that its permissions are insufficient.

3.2.1 IP

Let’s start by looking directly at a piece of code in which I wrote IP 10.11.12.13 arbitrarily

ZooKeeper client = new ZooKeeper("127.0.0.1:2181", 3000, null);
List<ACL> aclList = new ArrayList<>();
// Only clients connecting from 10.11.12.13 get all permissions on the node
aclList.add(new ACL(ZooDefs.Perms.ALL, new Id("ip", "10.11.12.13")));
String path = client.create("/abc", "test".getBytes(), aclList, CreateMode.PERSISTENT);
System.out.println(path); // prints /abc
client.close();

You can see that /abc is printed correctly, and the /abc node shows up when listing the children of /

ZooKeeper client = new ZooKeeper("127.0.0.1:2181", 3000, null);
List<String> children = client.getChildren("/", false);
System.out.println(children); // prints [abc, zookeeper]
client.close();

But now you get an error if you try to access the node’s data

ZooKeeper client = new ZooKeeper("127.0.0.1:2181", 3000, null);
byte[] data = client.getData("/abc", false, null);
System.out.println(new String(data));
client.close();

Exception in thread "main" org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /abc

Try changing the IP address above to 127.0.0.1 and re-creating the node; after that the node can be accessed normally. In production the IP mode is not used much (or maybe it is just me who rarely uses it). Personally I don’t think this is something ZK needs to handle.

3.2.2 World

This mode is probably the most commonly used one (just kidding)

So let’s look at a piece of code

ZooKeeper client = new ZooKeeper("127.0.0.1:2181", 3000, null);
List<ACL> aclList = new ArrayList<>();
aclList.add(new ACL(ZooDefs.Perms.READ, new Id("world", "anyone"))); // The difference is this line
String path = client.create("/abc", "test".getBytes(), aclList, CreateMode.PERSISTENT);
System.out.println(path); // prints /abc
client.close();

I changed the scheme to world. In world mode the id is fixed to anyone and no other value is allowed. I also set perms to R, so this node’s data can be read but nothing else can be done with it. If you try to modify its data with setData, you get a permission error as well

ZooKeeper client = new ZooKeeper("127.0.0.1:2181", 3000, null);
Stat stat = client.setData("/abc", "newData".getBytes(), -1); // NoAuth for /abc

Looking back at ZooDefs.Ids.OPEN_ACL_UNSAFE, it should now make sense. ZooDefs.Ids.OPEN_ACL_UNSAFE is the most commonly used static constant that ships with ZK, and it means permissions are not verified

Id ANYONE_ID_UNSAFE = new Id("world", "anyone");
ArrayList<ACL> OPEN_ACL_UNSAFE = new ArrayList<ACL>(Collections.singletonList(new ACL(Perms.ALL, ANYONE_ID_UNSAFE)));

3.2.3 Digest

This is the username-and-password mode we are all familiar with. Code first again

ZooKeeper client = new ZooKeeper("127.0.0.1:2181", 3000, null);

List<ACL> aclList = new ArrayList<>();
aclList.add(new ACL(ZooDefs.Perms.ALL,
  new Id("digest", DigestAuthenticationProvider.generateDigest("laoxun:kaixin")))); // 1

String path = client.create("/abc", "test".getBytes(), aclList, CreateMode.PERSISTENT);
System.out.println(path);
client.close();

Note that at 1 the “username:password” string must be passed through the DigestAuthenticationProvider.generateDigest method, which encodes the string passed in.

After wrapping, laoxun:kaixin actually becomes laoxun:/xQjqfEf7WHKtjj2csJh1/aEee8=. The process is as follows:

The entire string laoxun:kaixin is first hashed with SHA-1
The hash result is then Base64 encoded
The username is concatenated with the encoded result
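The three steps above can be reproduced with the JDK alone. This is a sketch of the described procedure, not the library’s own source; the class name is hypothetical, and it assumes the input contains a colon and is encoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical re-implementation of the three steps described above.
final class DigestSketch {
    // Takes "username:password", returns "username:" + base64(sha1("username:password")).
    static String generateDigest(String idPassword) {
        try {
            String username = idPassword.substring(0, idPassword.indexOf(':'));
            byte[] sha1 = MessageDigest.getInstance("SHA-1")
                    .digest(idPassword.getBytes(StandardCharsets.UTF_8)); // step 1: hash
            String encoded = Base64.getEncoder().encodeToString(sha1);   // step 2: Base64
            return username + ":" + encoded;                             // step 3: concatenate
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }
}
```

Because a SHA-1 hash is 20 bytes, the Base64 part is always 28 characters long and ends with a single = padding character, which matches the shape of the value shown above.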

Another way to write the code above is to add the permission information to the client context with addAuthInfo

ZooKeeper client = new ZooKeeper("127.0.0.1:2181", 3000, null);
client.addAuthInfo("digest", "laoxun:kaixin".getBytes()); // 1
List<ACL> aclList = new ArrayList<>();
aclList.add(new ACL(ZooDefs.Perms.ALL, new Id("auth", ""))); // 2
String path = client.create("/abc", "test".getBytes(), aclList, CreateMode.PERSISTENT);
System.out.println(path);
client.close();

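At 1, addAuthInfo adds auth information to the current client’s session. The id for digest is username:password, and both the username and the password are whatever you choose. At 2, the scheme is auth with an empty id, which tells the server to use the auth information already attached to the session.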

Then there is the query code

ZooKeeper client = new ZooKeeper("127.0.0.1:2181", 3000, null);
client.addAuthInfo("digest", "laoxun:kaixin".getBytes()); // Commenting out this line causes an error
byte[] data = client.getData("/abc", false, null);
System.out.println(new String(data)); // test

No matter how it is created, addAuthInfo must be used to add permission information to the session before the node can be queried

3.2.4 Super

As the name suggests, this is the administrator mode. If nodes were created with a username and password, other clients cannot access them, and once the creating client is gone nobody can operate on those nodes. An administrator role is therefore needed that can override such restrictions.

First the super mode has to be enabled. Suppose the administrator’s username is HelloZooKeeper and the password is niubi; after encoding this becomes HelloZooKeeper:PT8Sb6Exg9YyPCS7fYraLCsqzR8=. You then specify the zookeeper.DigestAuthenticationProvider.superDigest property in the server’s startup environment, with the value HelloZooKeeper:PT8Sb6Exg9YyPCS7fYraLCsqzR8=.

Assuming the node was created in laoxun:kaixin mode, it can now be accessed with the administrator’s credentials

ZooKeeper client = new ZooKeeper("127.0.0.1:2181", 3000, null);
client.addAuthInfo("digest", "HelloZooKeeper:niubi".getBytes()); // 1
byte[] data = client.getData("/abc", false, null);
System.out.println(new String(data)); // test
client.close();

At 1 you can see that super mode is essentially digest: the scheme specified is digest, and the id that follows is in plain text, not encoded. Remember that!

3.3 Permission Summary table

Here I list the permissions for most of the operations provided by the server:

Operation	Required permission	Description
create	CREATE on the parent node	Create a node
create2	CREATE on the parent node	Create a node and return the node data
createContainer	CREATE on the parent node	Create a container node
createTTL	CREATE on the parent node	Create a node with a timeout period
delete	DELETE on the parent node	Delete a node
setData	WRITE on the current node	Set node data
setACL	ADMIN on the current node	Set the permission information of a node
reconfig	WRITE on the current node	Change some of the configuration (more on that later)
getData	READ on the current node	Query node data
getChildren	READ on the current node	Get the list of child nodes
getChildren2	READ on the current node	Get the list of child nodes
getAllChildrenNumber	READ on the current node	Get the number of all child nodes (including grandchildren)
getACL	ADMIN or READ on the current node	Get the permission information of a node

As you can see, creating and deleting a node are checked against the parent node’s permissions, while reads and writes are checked against the node’s own. Operations that do not appear in the table can be regarded as not requiring ACL verification; they only need a valid session, or are special functions such as createSession and closeSession. More about sessions in the next article.

3.4 Principles behind ACLs

We have just spent some time on what an ACL is and how to use it. To save space, this time we go straight into programmer terms.

First, let me remind you of this picture from an earlier article

The permission part (in blue) in the figure was skipped without explanation in the earlier article. Today we look at this permission field in detail.

As the figure shows, the permission field is stored in the server-side node directly as a number (a long, a 64-bit integer). -1 is a special value meaning no permission check, and it corresponds to the OPEN_ACL_UNSAFE constant.

Whether ACL information is supplied when a node is created (the ACL argument is a List) or through the addAuthInfo method (which can be called multiple times), the design implies that a client can hold several identities at once, such as multiple usernames and passwords, multiple IP addresses, and so on.

As mentioned before, an ACL consists of three parts, scheme:id:perms. For simplicity I will use this form to represent an ACL from here on.

The server uses two hash tables to store the two-way mapping between each ACL list it has received and the corresponding number, something like this (the ACL values in the figure are made up):

The ZK server maintains a counter that starts from 1, and whenever a new ACL list arrives, it inserts an entry into both hash tables. Besides these two hash tables, the ZK server also keeps permission information for each client session. This information is added through addAuthInfo, but only the scheme:id part is stored. The permission for an operation can therefore be verified from the following three pieces of information:

The node’s permission information, resolved through the two hash tables: scheme:id:perms, of which there can be more than one
The permission information in the client session context: scheme:id only, of which there can be more than one
The permission required by the operation, as listed in the table in 3.3
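The two hash tables and the counter can be sketched like this. The class is a hypothetical model: each ACL is kept as a plain scheme:id:perms string, and an ACL list maps to one number.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the two-way mapping between ACL lists and numbers.
final class AclCacheSketch {
    private final Map<Long, List<String>> longToAcl = new HashMap<>();
    private final Map<List<String>, Long> aclToLong = new HashMap<>();
    private long next = 0;                       // the first assigned number will be 1

    // Returns the number for an ACL list, assigning a new one on first sight.
    long toLong(List<String> acls) {
        Long known = aclToLong.get(acls);
        if (known != null) {
            return known;                        // same list, same number
        }
        long id = ++next;
        List<String> copy = List.copyOf(acls);   // defensive, immutable copy
        aclToLong.put(copy, id);                 // both tables are updated together
        longToAcl.put(id, copy);
        return id;
    }

    // Resolves the number stored in a node back to its ACL list.
    List<String> toAcls(long id) {
        return longToAcl.get(id);
    }
}
```

A node then only needs to store the number, and identical ACL lists on different nodes share a single entry.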

The verification process is as follows:
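As a rough sketch of such a check, the code below combines the three pieces of information. It is a simplified model with hypothetical names, under several assumptions: identities are compared by exact string match (the real ip and digest schemes use their own matching logic), an administrator shows up in the session as an identity starting with super:, and a node without any ACL entries is treated as open.

```java
import java.util.List;

// Hypothetical, simplified permission check; not the server's actual code.
final class AclCheckSketch {
    // One node ACL entry: who (scheme + id) may do what (perms bitmask).
    static final class Acl {
        final String scheme;
        final String id;
        final int perms;

        Acl(String scheme, String id, int perms) {
            this.scheme = scheme;
            this.id = id;
            this.perms = perms;
        }
    }

    // sessionIds holds the "scheme:id" strings collected for the client session.
    static boolean allowed(List<Acl> nodeAcls, List<String> sessionIds, int requiredPerm) {
        if (nodeAcls == null || nodeAcls.isEmpty()) {
            return true;                              // assumption: nothing to check against
        }
        for (String sessionId : sessionIds) {
            if (sessionId.startsWith("super:")) {
                return true;                          // administrator bypasses the check
            }
        }
        for (Acl acl : nodeAcls) {
            if ((acl.perms & requiredPerm) == 0) {
                continue;                             // entry does not grant this operation
            }
            if (acl.scheme.equals("world") && acl.id.equals("anyone")) {
                return true;                          // open to every client
            }
            if (sessionIds.contains(acl.scheme + ":" + acl.id)) {
                return true;                          // session identity matches the entry
            }
        }
        return false;                                 // the server would answer NoAuth
    }
}
```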

The validators can also be customized: users can define their own scheme and verification logic. This is configured in the server’s environment through a property whose name starts with zookeeper.authProvider. and whose value is the fully qualified name of a class. The class must implement the org.apache.zookeeper.server.auth.AuthenticationProvider interface and must be loadable by the ZK server. The server can then parse the custom scheme, and the class controls the entire verification logic. This is a fairly advanced feature that I have not used myself; treat it as background knowledge~

Today we saw how Followers and Observers synchronize data with the Leader and how ZK provides access control through ACLs. In the next article we will talk about ZK session management: how clients and servers maintain sessions, and how the heartbeat is kept up between the different server nodes.

Finally, give this article a thumbs up. What? You say you don’t want to?

As always, if you have any questions or suggestions about this article, or questions about how ZK works, please open an issue in the repository or join the discussion on Yuque.

Address: www.yuque.com/kaixin1002/…