Election Process Overview
After the previous ZooKeeper-related articles, we already have some understanding of ZooKeeper. We know that there are three server roles in ZooKeeper: Leader, Follower, and Observer, where the Observer only observes the cluster state and does not take part in the Leader election or in voting on proposals. We also know that ZooKeeper elections fall into two categories: the election that takes place when the whole ZooKeeper cluster starts up, and the election that is carried out while the cluster is running, when the Leader crashes. Let's start with the election triggered at startup.
Election process when the server is started
When we set up a ZooKeeper cluster, each server has a configuration and a myid file, which stores a number identifying that machine within the cluster. Once at least two machines with a myid have started, ZooKeeper's cluster election process begins.
1. Each server casts an initial vote
Since the cluster has just started, each service instance votes for itself as the Leader by default. Every vote contains the basic elements required for the election, namely the myid and the ZXID; we write a vote as (myid, ZXID), so a server with myid 1 and ZXID 0 produces the vote (1, 0). Because every server initially votes for itself, server1's default vote is (1, 0), server2's default vote is (2, 0), and so on.
2. Receive votes from the other server instances
Each server instance then receives the votes sent by the other instances. On receiving a vote, it starts verifying and processing it.
3. Verify and process ballots
Votes sent by other service instances go through a series of checks, such as whether they belong to the current election round and whether they came from an instance in the LOOKING state. Once validated, each external vote is compared against the current instance's own vote. The comparison rules are roughly as follows:
① Compare the ZXIDs first. If they differ, the vote carrying the larger ZXID wins, and that server becomes the Leader candidate
② If the two votes have the same ZXID, the myids are compared; the server with the larger myid becomes the Leader candidate
In our example, server1's myid is 1 and server2's myid is 2. Both ZXIDs are 0, so the myids are compared; since server2's myid is larger, server2 should be the Leader. Server1 therefore updates its vote to (2,0) and sends out the new vote in the next round.
4. Count every vote
After each vote is cast, every instance counts all the votes it has received to determine whether more than half of the instances have cast the same vote. For server1 and server2, the election can only complete once both instances cast identical votes. More generally, when the number of instances is odd, at least (number of instances + 1) / 2 instances must cast the same vote. After the process above, as soon as server1 has compared the votes and sent out (2,0), the election can complete.
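To make the startup round concrete, here is a minimal, self-contained sketch (illustrative names only, not ZooKeeper's actual classes) that models a vote as a (myid, ZXID) pair, applies the comparison rules above, and checks for a majority:

```java
import java.util.HashMap;
import java.util.Map;

// A minimal model of the startup election round described above; class and
// method names are illustrative and are not part of ZooKeeper itself.
public class StartupElectionSketch {

    // A vote is just (myid of the proposed Leader, its ZXID).
    record Vote(long myid, long zxid) {}

    // The comparison rule from step 3: a larger ZXID wins, and on a ZXID tie
    // the larger myid wins.
    static boolean supersedes(Vote proposed, Vote current) {
        return proposed.zxid() > current.zxid()
                || (proposed.zxid() == current.zxid() && proposed.myid() > current.myid());
    }

    public static void main(String[] args) {
        // Each server initially votes for itself.
        Map<Long, Vote> votes = new HashMap<>();
        votes.put(1L, new Vote(1, 0));    // server1 -> (1, 0)
        votes.put(2L, new Vote(2, 0));    // server2 -> (2, 0)

        // server1 receives server2's vote and, per the rules, adopts it.
        if (supersedes(votes.get(2L), votes.get(1L))) {
            votes.put(1L, votes.get(2L)); // server1 now votes (2, 0)
        }

        // Step 4: tally the ballots and check for "more than half agree".
        Map<Vote, Long> tally = new HashMap<>();
        votes.values().forEach(v -> tally.merge(v, 1L, Long::sum));
        long quorum = votes.size() / 2 + 1;
        tally.forEach((vote, count) -> {
            if (count >= quorum) {
                System.out.println("Leader elected: server " + vote.myid());
            }
        });
    }
}
```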
5. Synchronize the server instance status
Once the election is complete and the Leader instance has been chosen, each service instance updates its own state: an instance that ended up as a Follower changes its state to FOLLOWING, and the instance elected as Leader changes its state to LEADING.
Elections that take place while the server is running
Apart from the startup phase, the elected Leader keeps acting as the Leader of the cluster: even if a Follower dies or a new machine joins the cluster, the Leader is not affected. However, once the Leader stops responding or breaks down, the ZooKeeper cluster can no longer serve external requests; instead, a new round of Leader election is started. This election is roughly similar to the one performed when the cluster is initialized, the difference being that each machine now has to switch from its running state into the election state.
1. Update the status
When the Leader instance dies, all remaining Follower instances change their state to LOOKING and then go through the Leader election process described above.
2. Same election process
The overall Leader election process is the same as before, so it will not be repeated here. After the election, each server instance changes its state to the one matching its role; at that point the election is complete and the ZooKeeper cluster resumes serving requests.
Zookeeper election algorithm
We now know the general flow of a ZooKeeper election, but the election is driven by an algorithm, so which algorithm does ZooKeeper use? ZooKeeper has provided three Leader election algorithms: LeaderElection, the UDP version of FastLeaderElection, and the TCP version of FastLeaderElection. The algorithm is selected via the electionAlg property in the zoo.cfg configuration file, with values 0 to 3: 0 is the LeaderElection algorithm, implemented over UDP; 1 is the UDP version of FastLeaderElection in non-authenticated mode; 2 is the UDP version of FastLeaderElection in authenticated mode; and 3 is the FastLeaderElection algorithm implemented over TCP.
Note, however, that since ZooKeeper 3.4.x the three UDP-based election algorithms (values 0-2) have been deprecated and dropped from official support, and only the TCP-based FastLeaderElection (value 3) is retained. This is why the general process described above was not broken down per algorithm.
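For reference, the algorithm is selected in zoo.cfg, although in practice only value 3 (the default) still works on current versions. A minimal illustrative snippet (host names are placeholders; the ports shown are the conventional defaults):

```properties
# zoo.cfg (illustrative host names; ports shown are the conventional defaults)
# 3 = TCP FastLeaderElection, which is also the default value.
electionAlg=3
# The second port of each server entry (3888 here) is the Leader-election port.
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
```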
Details of the Leader election
After learning the general process of the election, we can see that the overall flow and algorithm design are not difficult, but how are the common corner cases handled? For that we need to go into the details. First, ZooKeeper defines several server states to deal with different situations. They are declared in the org.apache.zookeeper.server.quorum.QuorumPeer.ServerState enum as follows:
① LOOKING: the server is looking for a Leader; while in this state the instance takes part in the Leader election process
② FOLLOWING: the current server is running as a Follower
③ LEADING: the current server is running as the Leader
④ OBSERVING: the current server is running as an Observer
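For reference, the enum in QuorumPeer looks essentially like this (simplified excerpt):

```java
// Simplified excerpt of org.apache.zookeeper.server.quorum.QuorumPeer.ServerState
public enum ServerState {
    LOOKING,    // searching for a Leader; the instance takes part in an election
    FOLLOWING,  // the instance is running as a Follower
    LEADING,    // the instance is running as the Leader
    OBSERVING   // the instance is running as an Observer and does not vote
}
```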
As mentioned above, every vote contains a few basic elements, namely the ZXID and the myid. A vote is defined in the org.apache.zookeeper.server.quorum.Vote class, whose fields are as follows:
```java
final private int version;
final private long id;
final private long zxid;
final private long electionEpoch;
final private long peerEpoch;
```
The common attributes are described below:

Attribute | Description |
---|---|
id | The SID of the server being proposed as Leader |
zxid | The transaction ID (ZXID) of the proposed Leader |
electionEpoch | The logical clock, indicating which round of voting this is; it is incremented by one at the start of each new round |
peerEpoch | The epoch of the proposed Leader |
state | The current state of the server that cast the vote |
With that out of the way, let's look at election communication. We talked earlier about ClientCnxn, the manager that handles network I/O on the ZooKeeper client side. The ZooKeeper server has a corresponding class, QuorumCnxManager, which accepts and processes the network communication used during the Leader election. The whole process can be divided into several parts, roughly as follows:
Message queues process messages
The QuorumCnxManager class maintains a number of queues for received messages and for messages waiting to be sent, and also defines the message senders. Except for the receive queue, these are all collections grouped by SID. The common queues and attributes are listed below (a structural sketch in code follows the list):
- RecvQueue: the message receiving queue, which stores all received messages
- QueueSendMap: the queues of messages to be sent. It is defined as a Map keyed by SID, with one queue maintained per SID so that sending to one server does not affect sending to another
- SenderWorkerMap: the collection of senders. Each SendWorker corresponds to one remote ZooKeeper server and is responsible for sending messages to it; senderWorkerMap is likewise grouped by SID
- LastMessageSent: the most recently sent message; one such message is kept per SID in this collection
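The structures above can be sketched like this (a structural sketch only; Message and SendWorker stand in for the real inner classes, and the exact generic types vary between ZooKeeper versions):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Structural sketch of the queues maintained by QuorumCnxManager.
class CnxQueuesSketch {
    static class Message { ByteBuffer buffer; long sid; } // placeholder for the real inner class
    static class SendWorker { }                           // placeholder for the real sender thread

    // All received election messages land in a single shared receive queue.
    final BlockingQueue<Message> recvQueue = new ArrayBlockingQueue<>(100);

    // Outgoing messages are grouped per remote server: SID -> its own send queue,
    // so traffic to one peer never delays traffic to another.
    final ConcurrentHashMap<Long, BlockingQueue<ByteBuffer>> queueSendMap = new ConcurrentHashMap<>();

    // One SendWorker per remote SID, responsible for draining that SID's send queue.
    final ConcurrentHashMap<Long, SendWorker> senderWorkerMap = new ConcurrentHashMap<>();

    // The most recently sent message per SID, kept so it can be re-sent when needed.
    final ConcurrentHashMap<Long, ByteBuffer> lastMessageSent = new ConcurrentHashMap<>();
}
```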
Establish a connection
To communicate with each other, the instances in the ZooKeeper cluster need to be connected pairwise. When the QuorumCnxManager class starts up, it creates a ServerSocket to listen on the port used for Leader-election communication. When a connection request arrives, the receiveConnection method is called to handle it. To avoid creating duplicate TCP connections between the same pair of servers, ZooKeeper enforces a rule that only the machine with the larger SID may establish a connection to the machine with the smaller SID. Once the connection is established, the corresponding SendWorker and the message receiver RecvWorker are created for the remote instance's SID.
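The pairing rule can be sketched as follows (illustrative code, not the actual receiveConnection implementation):

```java
// Illustrative sketch of the connection rule: between any two servers only the
// one with the larger SID keeps the connection, so exactly one TCP connection
// exists per pair of servers.
class ConnectionRuleSketch {
    final long mySid;

    ConnectionRuleSketch(long mySid) { this.mySid = mySid; }

    void handleIncomingConnection(long remoteSid) {
        if (remoteSid < mySid) {
            // The initiator has the smaller SID: drop its attempt and connect
            // back to it from this (larger-SID) side instead.
            System.out.println("drop connection from " + remoteSid + " and reconnect to it");
        } else {
            // The initiator has the larger SID: accept the connection and start
            // the SendWorker / RecvWorker pair for that peer.
            System.out.println("accept connection from " + remoteSid);
        }
    }
}
```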
Messages are received and sent
Whenever the message receiver receives a message, it stores it in the recvQueue. Sending is relatively simple: since each SID has its own SendWorker, the worker only needs to take the data to be sent from queueSendMap, and the message that was just sent is also stored in lastMessageSent. Note, however, that when a SendWorker finds its send queue empty, it takes the most recent message from lastMessageSent and sends it again. This guards against the recipient never having received that message; ZooKeeper has a mechanism for handling duplicate messages, so re-sending them is safe and ensures the messages are processed correctly.
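The resend-on-empty behaviour could be sketched like this (a simplified, hypothetical send loop; the real SendWorker does this inside its run() method):

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.BlockingQueue;

// Simplified sketch of a per-SID send loop: if the send queue is empty when the
// worker starts, the last message sent to that SID is sent again, because the
// peer may never have received it; duplicate votes are harmless to the election.
class SendLoopSketch {
    static void sendLoop(long sid,
                         BlockingQueue<ByteBuffer> sendQueue,
                         Map<Long, ByteBuffer> lastMessageSent) throws InterruptedException {
        if (sendQueue.isEmpty()) {
            ByteBuffer last = lastMessageSent.get(sid);
            if (last != null) {
                send(sid, last);                // re-send the most recent message
            }
        }
        while (true) {
            ByteBuffer msg = sendQueue.take();  // block until there is something to send
            lastMessageSent.put(sid, msg);      // remember it for possible re-sending
            send(sid, msg);
        }
    }

    static void send(long sid, ByteBuffer msg) {
        // placeholder: the real worker writes the buffer to the peer's socket
    }
}
```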
FastLeaderElection algorithm
Now let's look at the core implementation of the FastLeaderElection algorithm. The overall flow is as follows:
1. Increment the election round
The FastLeaderElection implementation has a logicalClock property that identifies the current election round. ZooKeeper requires that all votes in an election belong to the same election round, so before each election the logicalClock is incremented to move to the current round.
2. Initialize the ballot
We have already seen that a vote is defined by the org.apache.zookeeper.server.quorum.Vote class. In the initialization phase, each server proposes itself as the Leader, so it initializes a vote with itself as the proposed Leader.
3. Send the initial ballot
After initializing its vote, the server puts the vote information into the sendqueue, and the WorkerSender sends it out to every SID.
4. Receive external votes
During the initialization phase, besides sending its own vote, each instance also receives the votes of the other service instances and stores them in the recvqueue. If it finds that it cannot obtain votes from some other instance, it checks whether the connection to that instance is still alive; if the connection is broken or was never established, it re-establishes it. Of course, the connection is still initiated by the side with the larger SID, to prevent duplicate connections from being created.
5. Check the election round of external votes
Once the initial vote has been sent, the instance starts processing the vote information received from other service instances. It first compares the election round of each external vote with its own:
- If the external vote's election round is greater than the current service's, the current service updates its logicalClock, clears all the votes it has already received, compares its own vote with the external one again to decide whether it really needs to change its vote, and then re-sends its vote information.
- If the external vote's election round is smaller than the current service instance's, the vote is simply ignored and the instance continues to send its own vote
- If the external vote and the local service instance are in the same election round, the two votes are compared as described next
6. Comparison of votes
Vote comparison is the core logic of the whole election algorithm and is implemented in the FastLeaderElection#totalOrderPredicate method. It is used to decide whether the current service instance needs to change its vote and send the new vote out again. The comparison looks at the election round, the ZXID, and the SID, in that order (a simplified version is sketched after this list):
- If the external vote's election round is greater than that of the current service instance, the vote must be changed
- If the election rounds are the same, the ZXIDs are compared; if the external vote's ZXID is larger, the instance needs to change its vote
- If the ZXIDs are also the same, the SIDs are compared; if the external vote's SID is larger, the instance needs to change its vote
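A simplified version of that comparison (the real totalOrderPredicate additionally checks the proposed Leader's voting weight via the quorum verifier) looks like this:

```java
// Simplified version of FastLeaderElection#totalOrderPredicate: returns true when
// the external vote (new*) should replace the current one (cur*).
static boolean totalOrderPredicate(long newId, long newZxid, long newEpoch,
                                   long curId, long curZxid, long curEpoch) {
    // Higher election epoch wins; on a tie, the higher ZXID wins;
    // if the ZXIDs also tie, the larger server id (SID) wins.
    return (newEpoch > curEpoch)
            || (newEpoch == curEpoch
                && (newZxid > curZxid
                    || (newZxid == curZxid && newId > curId)));
}
```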
7. Change the ballot information and send it again
After comparing the votes, if an instance finds that it needs to change its own vote, it first updates its vote information and then sends the new vote out again.
8. Archive and count the votes
During the sending and receiving process, regardless of whether some votes are cleared or ignored, each service instance archives every vote it receives in recvset, keyed by SID, and then tallies them. As soon as it finds that more than half of the service instances have voted for the same proposal, it can stop sending votes and the election is complete; otherwise it repeats the send-and-receive process above.
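The archiving and majority check can be sketched as follows (illustrative; the real code keeps the received votes in a map called recvset, keyed by SID, and asks the cluster's quorum verifier whether the agreeing servers form a quorum):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative tally of archived votes: recvset maps each SID to the vote most
// recently received from that server; the election can finish once more than
// half of the voting servers agree on the same proposed Leader.
class VoteTallySketch {
    record Vote(long leaderSid, long zxid, long electionEpoch) {}

    final Map<Long, Vote> recvset = new HashMap<>(); // SID -> last vote from that SID

    boolean hasQuorum(Vote proposal, int totalVotingMembers) {
        long agreeing = recvset.values().stream().filter(proposal::equals).count();
        return agreeing > totalVotingMembers / 2;    // strictly more than half
    }
}
```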
After the Leader is elected, each service instance changes its state. Once voting has stopped and the final Leader instance has been determined, every service instance updates its own state: if the instance itself is the elected Leader, it switches to LEADING; otherwise it switches to FOLLOWING or OBSERVING depending on its role.
At this point the core flow of the FastLeaderElection algorithm is complete. Note, however, that steps 4-8 may run many times: in each election, even after more than half of the votes have been received, the Leader may not yet be decided and several rounds may be needed, and within each round votes may be sent several times, because each service instance sends its votes on its own schedule. Also note that even when more than half of the votes agree and a Leader instance has been chosen, the election does not finish immediately: the instance waits about 200 ms to make sure it has not missed a better vote from another server.
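That final wait corresponds to FastLeaderElection's finalizeWait (200 ms): after a quorum is reached, the server keeps draining the receive queue a little longer and only finishes if no better vote arrives. A rough sketch (helper names are hypothetical):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Rough sketch of the final wait: once a quorum agrees on a vote, keep polling
// the receive queue for up to ~200 ms; if a vote that would supersede the current
// proposal arrives, go back to the main election loop instead of finishing.
class FinalizeWaitSketch {
    static final long FINALIZE_WAIT_MS = 200; // mirrors FastLeaderElection.finalizeWait

    static <V> boolean confirmLeader(BlockingQueue<V> recvQueue,
                                     Predicate<V> supersedesCurrentProposal)
            throws InterruptedException {
        V candidate;
        while ((candidate = recvQueue.poll(FINALIZE_WAIT_MS, TimeUnit.MILLISECONDS)) != null) {
            if (supersedesCurrentProposal.test(candidate)) {
                recvQueue.put(candidate); // put it back and restart the election loop
                return false;             // a better vote appeared: not finished yet
            }
        }
        return true;                      // nothing better arrived: finalize the election
    }
}
```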