1. First-time election

Suppose there are five servers that start up in sequence:

  • Node-1: myid = 1

  • Node-2: myid = 2

  • Node-3: myid = 3

  • Node-4: myid = 4

  • Node-5: myid = 5

1.1. First server

The first server goes online and initiates an election. Server 1 votes for itself. At this point server 1 holds one vote, which is less than a majority (3 votes), so the election cannot complete and server 1 remains in the LOOKING state.

1.2. Second server

The second server comes online and likewise initiates an election, voting for itself. Servers 1 and 2 then exchange votes. Server 1 finds that its myid is smaller than server 2's, so it changes its vote to server 2. Server 1 now has 0 votes and server 2 has 2 votes, but since that is still not a majority, both servers remain in the LOOKING state.

1.3. Third server

After the third server starts, it does the same as the previous two: it votes for itself and exchanges votes. Servers 1 and 2 find that their myids are smaller than 3, so they change their votes to server 3. Servers 1 and 2 now have 0 votes each and server 3 has 3 votes. Since 3 votes is a majority, server 3 is elected leader: servers 1 and 2 change their state to FOLLOWING, and server 3 changes its state to LEADING.

1.4. Fourth server

After the fourth server comes online, it also initiates an election. However, servers 1, 2, and 3 are no longer in the LOOKING state and will not change their votes. After the exchange, server 3 has 3 votes and server 4 has 1 vote. The minority submits to the majority: server 4 changes its vote to server 3 and sets its state to FOLLOWING.

1.5. Fifth server

Server 5 starts up and goes through the same process as server 4.
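The startup walkthrough above can be sketched as a toy simulation. This is a simplification, not ZooKeeper's actual FastLeaderElection (which also compares epoch and ZXID); it assumes all servers start fresh with empty logs, so only myid matters, and it models only servers still in the LOOKING phase.

```python
# Toy model of ZooKeeper's first-time election: every LOOKING server
# ends up voting for the largest myid it has seen, and a leader is
# elected once that candidate holds a majority of the full ensemble.

def startup_election(total_servers, started_ids):
    """started_ids: myids of the servers that are online and LOOKING.
    Returns (leader_id, states) once a candidate holds a majority,
    otherwise (None, all-LOOKING)."""
    majority = total_servers // 2 + 1
    candidate = max(started_ids)      # after vote exchange, all back the largest myid
    votes = len(started_ids)          # every started server votes for the candidate
    if votes >= majority:
        states = {sid: ("LEADING" if sid == candidate else "FOLLOWING")
                  for sid in started_ids}
        return candidate, states
    return None, {sid: "LOOKING" for sid in started_ids}

# Servers 1 and 2 online: 2 votes for server 2, no majority yet.
print(startup_election(5, [1, 2]))     # (None, {1: 'LOOKING', 2: 'LOOKING'})
# Server 3 joins: 3 votes for server 3, majority reached, server 3 leads.
print(startup_election(5, [1, 2, 3]))
```

Once a leader exists, later arrivals (servers 4 and 5) skip this phase entirely and simply follow the current leader, as described above.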

2. Post-failure election

2.1. When elections occur

  • Initial startup of the server (first startup)

  • The server loses its connection to the leader while running (either the leader node or a follower node may have failed).

When a server enters the election process, there are also two possible situations:

  • The cluster already has a leader node: in this case, when the server attempts to elect a leader, it is informed of the existing leader's information. It only needs to establish a connection with the leader machine and synchronize its state.

  • The cluster's leader has actually failed: the triggered election must then follow certain rules.

2.2. Data consistency: the ZAB protocol

When the second situation occurs, data consistency becomes an issue. To understand it, we need to know how ZooKeeper's ZAB protocol maintains data consistency, which involves three processes:

  • Message broadcast

  • Crash recovery

  • Data synchronization

2.2.1. Message broadcasting

When a transaction request (a write request) arrives, the leader wraps it into a Proposal and assigns it a globally unique, 64-bit, monotonically increasing transaction ID (ZXID). Because the ZXID increases globally, it can be used to compare the order of messages.
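In ZooKeeper, the 64-bit ZXID is split into a high 32-bit leader epoch and a low 32-bit per-epoch counter, so comparing two ZXIDs as plain integers orders them first by epoch, then by counter. A sketch (the helper names are mine):

```python
# The 64-bit ZXID packs the leader epoch into the high 32 bits and a
# per-epoch transaction counter into the low 32 bits.

def make_zxid(epoch, counter):
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def epoch_of(zxid):
    return zxid >> 32

def counter_of(zxid):
    return zxid & 0xFFFFFFFF

z1 = make_zxid(1, 100)   # epoch 1, 100th transaction
z2 = make_zxid(2, 1)     # epoch 2, first transaction
assert z2 > z1           # a transaction from a newer epoch always orders later
assert epoch_of(z2) == 2 and counter_of(z2) == 1
```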

The leader then broadcasts the Proposal to the follower nodes. Communication goes through message queues: the leader allocates a separate queue for each follower and sends the Proposal into that queue.

After a follower receives the Proposal, it persists it to disk and returns an ACK to the leader.

Finally, when the leader receives ACKs from more than half of the nodes, it commits the transaction locally, broadcasts a COMMIT message, and each follower commits the transaction after receiving the COMMIT.
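The leader's ACK-counting step can be sketched as follows. This is a minimal illustration of the quorum rule, not ZooKeeper's actual implementation, which tracks outstanding Proposals with more machinery:

```python
# Minimal sketch of the leader's two-phase broadcast: a Proposal may be
# COMMITted once ACKs from more than half of the ensemble (the leader's
# own implicit ACK included) have been received.

class Proposal:
    def __init__(self, zxid, ensemble_size):
        self.zxid = zxid
        self.quorum = ensemble_size // 2 + 1
        self.acks = set()

    def ack(self, server_id):
        """Record an ACK; return True once the quorum has been reached."""
        self.acks.add(server_id)
        return len(self.acks) >= self.quorum

p = Proposal(zxid=(1 << 32) | 1, ensemble_size=5)
p.ack(1)               # leader's own ACK
p.ack(2)               # still only 2 of 5
committed = p.ack(3)   # third ACK reaches the quorum of 3
assert committed       # leader may now broadcast COMMIT
```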

2.2.2. Crash Recovery

The process above has two phases. Phase one runs from broadcasting the Proposal to receiving ACKs from the followers; phase two broadcasts the COMMIT message after more than half of the ACKs arrive. A leader failure during these phases can cause two kinds of data inconsistency:

  • The leader crashes just after initiating a transaction, so the followers never receive the Proposal.

  • The leader crashes after receiving more than half of the ACKs but before sending the COMMIT message.

To solve these problems, ZAB uses crash recovery mode, which mainly addresses the data inconsistencies caused by leader failure in the two phases above:

  • After the old leader recovers and rejoins the cluster, messages that were discarded must not reappear.

  • When a new leader is elected, messages that have already been committed must not be lost.

SID, ZXID, and Epoch:

  • SID: the server ID, identical to myid; it uniquely identifies a machine in the cluster.

  • ZXID: a globally unique, 64-bit, monotonically increasing transaction ID. At any given moment, the ZXID values on different machines in the cluster may differ, depending on how far each ZooKeeper server has progressed in processing client update requests.

  • Epoch: the number of each leader's term. Within a single round of voting, servers without a leader share the same logical clock value, and the value increases with each new round of voting.

Rules for electing a leader:

  • The larger epoch wins.

  • If the epochs are equal, the larger transaction ID (ZXID) wins.

  • If the transaction IDs are also equal, the larger server ID (SID) wins.
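These three rules compose into a single lexicographic comparison, which tuple ordering expresses directly. This is a sketch of the rule, not ZooKeeper's actual Java comparison code:

```python
# A vote wins if its (epoch, zxid, sid) tuple is larger: epoch is
# compared first, then the transaction ID, then the server ID as the
# final tiebreaker.

def better_vote(a, b):
    """Each vote is a tuple (epoch, zxid, sid); return the winner."""
    return a if a > b else b   # Python tuple comparison is lexicographic

assert better_vote((2, 5, 1), (1, 9, 3)) == (2, 5, 1)   # bigger epoch wins
assert better_vote((2, 5, 1), (2, 7, 1)) == (2, 7, 1)   # same epoch, bigger zxid wins
assert better_vote((2, 5, 1), (2, 5, 4)) == (2, 5, 4)   # tiebreak on sid
```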

2.2.3. Data synchronization

The next step is data synchronization. The election process guarantees, through voting, that the new leader is the node with the largest ZXID. In the synchronization phase, all replicas in the cluster synchronize to the latest Proposal history held by the leader elected in the previous phase.

3. Brief introduction to CAP theory

CAP theory is a body of theory formed around data management in distributed systems, and it is an architectural trade-off that must be considered when designing any distributed system.

  • Consistency: data updates are applied consistently, so changes are synchronized across replicas. When there are multiple copies of the data and a failure occurs, the copies can become inconsistent.

  • Availability: good response performance. When replicas fail and the data is being migrated, it cannot be accessed during the migration.

  • Partition tolerance: reliability; data is persisted and never lost under any circumstances, with multiple copies stored on different physical devices.

For most large-scale Internet applications, there are many hosts deployed across scattered locations, and cluster sizes keep growing, so node failures and network failures are the norm. In addition, service availability must reach N nines. The usual choice is therefore to guarantee P and A and give up C (falling back, as the next best thing, on eventual consistency). In some places the customer experience is affected, but not to the point of causing user churn. What is guaranteed and what is given up depends on the actual situation.

There is also BASE theory: Basically Available, Soft state, and Eventual consistency. It is an extension of CAP theory; for the C in CAP, most systems adopt a strategy of guaranteeing eventual consistency.