From my personal WeChat public account "code nong notes"; follow it for more articles.
Preface
In a typical distributed system, machines go down and networks misbehave (messages are delayed, lost, duplicated, or reordered, and the network may partition). Consistency and consensus algorithms were developed for these scenarios to solve one problem: in a distributed system where all of the above anomalies may occur, how can the cluster quickly and correctly agree on the value of a piece of data, and guarantee that the consistency of the whole system is not broken no matter which of these anomalies happens.
Because this topic is heavy on theory, this article draws on many other bloggers' articles, which I read carefully and repeatedly; special thanks to them.
CAP
In July 2000, Professor Eric Brewer of the University of California, Berkeley proposed the CAP conjecture at the ACM PODC conference. Two years later, Seth Gilbert and Nancy Lynch of the Massachusetts Institute of Technology proved it theoretically. Since then, the CAP theorem has been an accepted result in distributed computing.
CAP theorem: a distributed system can satisfy at most two of Consistency, Availability, and Partition tolerance at the same time.
Concepts
Consistency
Consistency: all nodes see the same data at the same time. That is, once an update operation completes and returns to the client, the data on all nodes is consistent.
Availability
Availability means "reads and writes always succeed": the service is always available and responds within a normal time.
Partition tolerance
Partition tolerance means "the system continues to operate despite arbitrary message loss or failure of part of the system". That is, when a node fails or the network partitions, the distributed system can still provide services that meet the consistency and availability requirements.
CAP trade-offs
Choice | Description
---|---
CA | Give up partition tolerance and strengthen consistency and availability; in practice this is the choice of a traditional single-node database
AP | Give up consistency (here, strong consistency) in favor of partition tolerance and availability; this is the design choice of many distributed systems, such as many NoSQL systems
CP | Give up availability in favor of consistency and partition tolerance; this is rarely chosen, because a network problem would make the whole system directly unavailable
According to the CAP theorem, consistency, availability, and partition tolerance cannot all be satisfied at the same time. So which one should be given up?
For most large-scale Internet applications there are many hosts deployed across sites, and clusters keep growing, so node failures and network failures are the norm. Moreover, service availability must reach N nines, so P and A are kept and C is given up (settling for eventual consistency instead). The user experience is affected in some places, but not badly enough to cause user churn.
In scenarios involving money, where there is no room for compromise, C must be guaranteed. When the network fails, the service would rather stop: this guarantees CA and gives up P. In recent years there seem to have been no fewer than ten such incidents in the domestic banking industry, but the impact was small, coverage was limited, and the general public knows little about them. Another approach is to guarantee CP and give up A, for example by making the system read-only during a network fault.
There is no verdict on which choice is better; it can only be decided by the scenario, and whatever fits best is best.
BASE theory
Dan Pritchett, an architect at eBay, summarized his practice with large-scale distributed systems and put forward BASE theory in an ACM article. BASE theory is an extension of CAP theory; its core idea is that even if Strong Consistency cannot be achieved (the consistency in CAP is strong consistency), an application can reach Eventual Consistency in a way appropriate to it.
BASE refers to Basically Available, Soft State, and Eventual Consistency.
Concepts
Basically Available
Basic availability means that when a distributed system fails, it is allowed to lose part of its availability, that is, the availability of the core is still guaranteed.
During a big e-commerce promotion, some users may be redirected to a degraded page to cope with the traffic surge, and the service layer may only provide degraded service. This is a partial loss of availability.
Soft State
A soft state allows the system to have intermediate states that do not affect its overall availability. In distributed storage, a piece of data typically has at least three replicas, and the soft state shows up as the replication delay between replicas on different nodes. MySQL Replication's asynchronous replication is another example.
Eventual Consistency
Eventual consistency means that all data replicas in the system reach a consistent state after some period of time. Weak consistency is the opposite of strong consistency, and eventual consistency is a special case of weak consistency.
BASE vs. ACID
In general, BASE theory is aimed at large, highly available, scalable distributed systems and stands in contrast to the traditional ACID model. It is completely different from ACID's strong consistency model: it sacrifices strong consistency to gain availability and allows data to be inconsistent for a period of time, as long as it eventually reaches a consistent state. In real distributed scenarios, however, different business units and components have different consistency requirements, so ACID properties and BASE theory are often combined when designing a distributed architecture.
2PC
The Two-Phase Commit protocol (2PC) is a classic strongly consistent, centralized atomic commit protocol. "Centralized" means the protocol has two classes of nodes: a central coordinator node (Coordinator) and N participant nodes (Participant). The transaction commit process is divided into two phases.
2PC succeeds: all participants agree.
2PC fails: if any participant disagrees, execution fails.
Phase 1: Submit the transaction request (Prepare)
The transaction manager (coordinator) sends a Prepare message to each participant; each database participant executes the transaction locally and writes local Undo/Redo logs, but does not commit. (The Undo log records the data before the modification and is used for rollback; the Redo log records the data after the modification and is used to write the data files after the transaction commits.)
Phase 2: Commit (or roll back) the transaction
If the transaction manager receives a failure or timeout from any participant, it sends a Rollback message to every participant; otherwise it sends a Commit message. Following the transaction manager's instruction, each participant commits or rolls back and releases the lock resources held during transaction processing.
Note: Lock resources must be released at the last stage.
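To make the two phases concrete, here is a minimal sketch of the coordinator-side flow. The Participant interface (prepare/commit/rollback) is a hypothetical stand-in for real resource managers such as database connections; this illustrates the message flow, not the protocol as any particular database implements it.

```python
# Minimal 2PC coordinator sketch. The Participant interface below is a
# hypothetical stand-in for real resource managers (e.g. databases).

class Participant:
    def prepare(self, txn) -> bool:
        """Phase 1: execute the transaction locally, write Undo/Redo logs,
        keep locks held, and vote yes (True) or no (False). No commit yet."""
        raise NotImplementedError

    def commit(self, txn):
        """Phase 2: make the prepared changes durable, then release locks."""
        raise NotImplementedError

    def rollback(self, txn):
        """Phase 2: undo the prepared changes via the Undo log, release locks."""
        raise NotImplementedError


def two_phase_commit(participants, txn) -> bool:
    # Phase 1: send Prepare to every participant and collect votes.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(txn))
        except Exception:           # a crash or timeout counts as a "no" vote
            votes.append(False)

    # Phase 2: Commit only if every participant voted yes, otherwise Rollback.
    if all(votes):
        for p in participants:
            p.commit(txn)           # locks are only released here, in the last phase
        return True
    for p in participants:
        p.rollback(txn)
    return False
```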
Advantages and disadvantages
Advantages
The principle is simple and the implementation is straightforward.
Disadvantages
- Single point of failure: if the coordinator suddenly crashes, the transaction process cannot continue or ends in an inconsistent state.
- Consistency is not guaranteed: if the coordinator crashes while sending Commit requests in the second phase, some participants may have received the Commit request and committed the transaction, while others, never having received it, abandon it, causing inconsistency.
- Blocking: to ensure the transaction can be committed, participants must lock the related resources after finishing the phase-one transaction execution and hold them until the formal commit, which hurts system throughput.
A variant lets a participant, after finishing phase-one execution, wait for the coordinator's next request and abandon the transaction if the coordinator times out. This scheme still cannot guarantee consistency, but it avoids the situation described in some sources where resources stay locked and nothing can proceed.
3PC
3PC, short for Three-Phase Commit, is an improved version of 2PC. It splits 2PC's "prepare to commit the transaction" phase into two parts (CanCommit and PreCommit), yielding a transaction protocol made up of CanCommit, PreCommit, and DoCommit.
Basic flow
Steps
CanCommit
- The coordinator sends a CanCommit request containing the transaction content to all participants, asks whether the transaction commit operation can be performed, and waits for each participant's response.
- After receiving the CanCommit request, a participant responds Yes and enters the prepared state if it believes the transaction can be executed successfully; otherwise it responds No.
PreCommit
If every participant responded Yes in the CanCommit phase, the coordinator performs the transaction pre-commit:
- 1. The coordinator sends a PreCommit request to all participants and enters the prepared state.
- 2. Transaction pre-commit: after receiving the PreCommit request, a participant executes the transaction and records the Undo and Redo information in its transaction log.
- 3. Each participant reports the result of the transaction execution to the coordinator (if the participant executed the transaction successfully, it replies Ack).
If any participant responded No in the CanCommit phase (or the coordinator times out waiting for responses), the coordinator interrupts the transaction:
- 1. The coordinator sends an interrupt request: it issues an abort request to all participants.
- 2. Interrupt the transaction: a participant interrupts the transaction either after receiving the abort request from the coordinator or after timing out while waiting for the coordinator's request.
DoCommit
The DoCommit phase completes the actual transaction commit or completes the transaction rollback.
If the coordinator received Ack responses from all participants during the PreCommit phase, it completes the commit:
- 1. The coordinator sends a DoCommit request: it moves from the pre-commit state to the commit state and sends a DoCommit request to all participants.
- 2. Participants commit the transaction: after receiving the DoCommit request, a participant formally commits the transaction and releases the resources held during the whole execution.
- 3. Each participant reports the result of the commit to the coordinator (if the commit succeeded, it replies Ack).
- 4. Transaction completed: the coordinator completes the transaction after receiving Ack messages from all participants.
If the coordinator did not receive Ack responses from all participants during the PreCommit phase, it interrupts the transaction:
- 1. The coordinator sends abort requests to all participant nodes.
- 2. Participants roll back the transaction using the recorded Undo information and release the resources held during the whole execution once the rollback is done.
- 3. Participants report the rollback result to the coordinator by sending an Ack message after completing the rollback.
- 4. Interrupt the transaction: the coordinator interrupts the transaction after receiving Ack messages from all participants.
Note: during the DoCommit phase the coordinator may crash, or the network between the coordinator and a participant may fail. If a participant does not receive the coordinator's DoCommit or abort request, it will go ahead and commit the transaction after the request times out.
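The key difference from 2PC is the participants' timeout behavior described in the note above. Below is a minimal sketch of those timeout rules, assuming a hypothetical `wait_for(phase)` helper that returns the coordinator's message or `None` on timeout, and a hypothetical local resource manager `local`; it illustrates the rules rather than a full 3PC implementation.

```python
import enum

class Phase(enum.Enum):
    PRE_COMMIT = 1
    DO_COMMIT = 2

def participant_run(txn, wait_for, local):
    """Sketch of a 3PC participant's timeout rules. `wait_for(phase)` returns
    "precommit"/"docommit"/"abort" or None on timeout; `local` is a
    hypothetical resource manager with can_execute/execute/commit/rollback."""
    # CanCommit: answer Yes/No based on a local feasibility check only.
    if not local.can_execute(txn):
        return "aborted"                     # replied No

    # Wait for PreCommit. On timeout it is safe to abort, because at this
    # point nobody can have committed anything yet.
    msg = wait_for(Phase.PRE_COMMIT)
    if msg is None or msg == "abort":
        return "aborted"
    local.execute(txn)                       # do the work, record Undo/Redo, reply Ack

    # Wait for DoCommit. On timeout the participant commits anyway, because
    # receiving PreCommit means every participant already voted Yes.
    msg = wait_for(Phase.DO_COMMIT)
    if msg == "abort":
        local.rollback(txn)                  # roll back using the Undo log
        return "aborted"
    local.commit(txn)                        # "docommit" received, or timeout
    return "committed"
```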
Advantages and disadvantages
Reduced blocking
- After responding to the CanCommit request, a participant waits for the phase-two instruction; if the wait times out, it automatically aborts, which reduces blocking.
- After responding to the PreCommit request, a participant waits for the phase-three instruction; if the wait times out, it automatically commits the transaction, which also reduces blocking.
Mitigates the single point of failure
- After responding to the CanCommit request, a participant waits for the phase-two instruction; if the coordinator goes down, the participant automatically aborts when the wait times out.
- After responding to the PreCommit request, a participant waits for the phase-three instruction; if the coordinator goes down, the participant automatically commits the transaction after the timeout.
Data inconsistency is still possible
For example, if the coordinator issues an abort request in phase three and some participants do not receive it, those participants automatically commit after the timeout, causing data inconsistency.
Paxos
Paxos is a highly fault-tolerant consensus algorithm based on message passing, first proposed by Leslie Lamport (the "La" in LaTeX) in 1990. Note that Paxos is often mislabeled a "consistency algorithm"; however, consistency and consensus are not the same concept. Paxos is a consensus algorithm.
Algorithm process
The problem Paxos solves is distributed consensus: how the processes in a distributed system reach agreement on a certain value (resolution).
The Paxos algorithm runs in asynchronous systems that allow crashes; it does not require reliable message delivery and tolerates message loss, delay, reordering, and duplication. It uses a majority (quorum) mechanism to guarantee 2F+1 fault tolerance: a system of 2F+1 nodes can tolerate at most F nodes failing at the same time.
One or more Proposers can each put forward proposals; the Paxos algorithm guarantees that exactly one of the proposed values is eventually chosen. A value is chosen once a majority of the system accepts the corresponding proposal, and at most one value is ever definitively agreed upon.
Paxos divides the roles in the system into Proposer, Acceptor, and Learner:
- Proposer: puts forward proposals (Proposal). A Proposal consists of a Proposal ID and a Value.
- Acceptor: participates in the decision and responds to Proposers' proposals. A received proposal may be accepted; if a majority of Acceptors accept a proposal, the proposal is said to be approved.
- Learner: does not participate in the decision; it learns the latest agreed-upon proposal (Value) from Proposers/Acceptors.
In a concrete implementation, a process may perform multiple roles simultaneously. For example, a process may be both Proposer and Acceptor and Learner. A Proposer makes a proposal, an Acceptor decides whether to accept the proposal, and a learner learns the result.
Another important concept is the Proposal: the value to be agreed upon is carried inside a proposal. If a Proposer's proposal is accepted by more than half of the Acceptors, the Proposer considers the value in that proposal chosen. Acceptors tell Learners which value has been chosen, and a Learner then regards that value as chosen; once an Acceptor accepts a proposal, it regards the value in that proposal as accepted.
To avoid a single point of failure there is a set of Acceptors. A Proposer sends its proposal to the set of Acceptors; each member of the set may accept it, and the proposal is considered chosen once a majority of the set has accepted it.
The Byzantine problem
Byzantine generals problem: generals of the Byzantine army must unanimously decide whether to attack an enemy. The problem is that the generals are geographically separated and rely on messengers to carry orders, but there may be traitors among the messengers who can tamper with messages: they can trick some generals into attacking, force a decision that not all generals agree with (for example an attack when the generals do not want one), or confuse some generals so that they cannot make a decision at all.
The Paxos algorithm is based on the assumption that there is no Byzantine generals problem, i.e., the channel is secure (reliable) and the messages delivered cannot be tampered with, because Paxos is built entirely on message passing.
In theory, it is impossible for a distributed computation to reach a consistent state over asynchronous systems and unreliable channels, so consistency research usually assumes the channel is reliable. In practice most systems are deployed within a single LAN, so message tampering is very rare, and incomplete messages caused by hardware or network problems can be caught with a simple checksum. Therefore, in real engineering it is reasonable to assume that all messages are complete, i.e., not tampered with.
The protocol flow
The description of Paxos in Lamport's own Paxos Made Simple is quite concise:
Phase 1
(a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.
(b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.
Phase 2
(a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
(b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.
(a) A proposer selects a proposal number n, puts it into a prepare request, and sends the request to a majority of acceptors.
In the example below, the proposer sends prepare requests to all acceptors, but for performance it only needs a majority of them; it could choose to send prepare requests to just one majority of acceptors, and which acceptors to send to is up to the proposer.
(b) If an acceptor receives a prepare request whose number n is greater than that of any prepare request it has already responded to, it responds to the request, promises not to accept any proposal numbered less than n, and returns the highest-numbered proposal it has accepted, if any.
In the figure referenced above (TODO: which figure?), E only returns OK because it has not accepted any proposal before, while A, C, and D, besides returning OK, also return the highest-numbered proposal they accepted earlier, namely Alice.
(a) If the proposer receives responses to its prepare request (numbered n) from a majority of acceptors, it sends an accept request to each of those acceptors containing the proposal number n and a value v, where v is the value of the highest-numbered proposal among the responses, or any value of its own choosing if the responses contained no proposal.
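The two phases quoted above can be condensed into an acceptor's state plus two handlers. Below is a minimal sketch, assuming only that proposal numbers are comparable; it mirrors rules (1b) and (2b) and is not tied to any particular Paxos library.

```python
# Minimal sketch of a single Paxos acceptor, following phases (1b) and (2b).

class Acceptor:
    def __init__(self):
        self.promised_n = None   # highest prepare number this acceptor responded to
        self.accepted_n = None   # number of the highest-numbered proposal accepted
        self.accepted_v = None   # value of that proposal

    def on_prepare(self, n):
        """Phase 1b: if n is greater than any prepare already answered, promise
        not to accept proposals numbered below n and report the
        highest-numbered proposal accepted so far (if any)."""
        if self.promised_n is None or n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject", self.promised_n, None)

    def on_accept(self, n, v):
        """Phase 2b: accept the proposal unless a prepare with a greater
        number has already been answered."""
        if self.promised_n is None or n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return ("accepted", n)
        return ("reject", self.promised_n)
```

On the proposer side, phase (2a) then means: among the promises received from a majority, pick the value of the highest-numbered accepted proposal, or use your own value if none was reported.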
Example
This example comes from ocavue.com/paxos.html#… (Note: Juejin has a bug displaying images and does not support SVG vector images; please open the public account link mp.weixin.qq.com/s/D_nYSERTf… to see the figures.)
Imagine a distributed ticket-selling system like 12306, consisting of five machines in different locations. For simplicity the system sells only one ticket. Each of the five machines has its own database storing the name of the buyer. If machine A thinks it sold the ticket to me while machine B thinks it sold it to you, that is bad news. So we want consistency in the distributed system; in this case, that means the buyer's name is the same across machines.
At this point machine D receives a ticket request from Alice. So D first enters the Prepare stage:
D sends a proposal to the other four machines (blue arrows in the figure represent proposals):
Where P-1D means:
- This is a Proposal (P)
- The proposal ID is 1D
Every proposal needs an increasing, globally unique ID; the simplest way to build one is the current time plus the name of the current machine (see the sketch below). This ID is used throughout the Paxos process. Note that D does not tell the other machines the buyer's name, Alice, during the proposal phase.
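As a small sketch of the ID scheme just described (the helper names are illustrative, not part of Paxos itself): the example's IDs like "1D" are a local counter plus the machine name, and a timestamp-based variant is also shown.

```python
import itertools
import time

class ProposalIdGen:
    """Per-machine proposal IDs in the style of the example ("1D", "2B", ...):
    an increasing local counter plus the machine name."""

    def __init__(self, machine_name: str):
        self.machine = machine_name
        self.counter = itertools.count(1)

    def next_id(self) -> str:
        return f"{next(self.counter)}{self.machine}"   # "1D", "2D", ...

def timestamp_id(machine_name: str) -> str:
    # "current time plus the name of the current machine", as described above
    return f"{time.time_ns()}-{machine_name}"

gen = ProposalIdGen("D")
print(gen.next_id())   # "1D"
print(gen.next_id())   # "2D"
```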
When the other machines receive the proposal, they find that they have not received any proposal before, so they agree to it. Concretely, they do the following:
- Record P-1D
- Promise not to accept any proposal with ID < "1D" from now on
- Reply OK to D (red arrows in the figure represent responses to a proposal)
D receives the replies from the other machines and finds that, counting its own consent, more than half of the machines have agreed to the proposal (in fact all five did). Since a majority has agreed, D considers the proposal approved and enters the Commit phase.
During the Commit phase, D issues a resolution to all machines (green arrows in the figure represent resolutions):
Where A-1D-Alice means:
- This is an Accept (A)
- The resolution ID is 1D
- The content of the resolution: sell the ticket to Alice
When the other four machines receive the resolution, they record its content and reply to D. D finally sends Alice the message that her ticket purchase succeeded (purple arrows in the figure represent replies to a resolution):
At this point, all five machines think the ticket has been sold to Alice, and consistency is guaranteed.
The above is the ideal case where every machine and the network work normally, but reality is rarely ideal, and one of the biggest values of a distributed system is its ability to cope with the failure of some nodes. So let's simulate a node-failure scenario.
Example of node failure
Let’s say two machines B and E fail. Repeat the above steps to see what happens.
In the first step, D accepts Alice’s request and sends a proposal to the other machines:
In the second step, all machines except B and E reply to the proposal:
Third, D decides that the proposal has been accepted by the majority (A, C, D) and sends the resolution to everyone
In the fourth step, all machines except the failed ones respond to the resolution. Finally, D sends Alice the message that her ticket purchase succeeded.
So far, so good. Since only a minority of machines failed, a majority remained (3 machines > 5 / 2), so the system was not affected. However, if we now bring B and E back, we find a problem: B and E are out of date and still think the ticket has not been sold! How does the Paxos algorithm handle this situation? Let's continue the example:
↑ B and E are back in service, but at this point they have no record of P-1D or A-1D-Alice.
Suppose Bob now comes to buy a ticket. He sends a ticket request to machine B, and machine B then sends a proposal to the other machines:
For this proposal, E replies OK, but A, C, and D reply Fail because "1B" < "1D": A, C, and D previously promised not to accept any proposal request with ID < "1D".
So B has to give up proposal 1B and instead issue a new proposal with ID 2B, sending it to the other machines:
The other machines then agree, but A, C, and D attach a message to their agreement: "I have already sold the ticket to Alice; you cannot ask me to sell it to someone else":
B has now received approval for proposal 2B from all five machines, including itself, so it can proceed to the next step and issue a resolution. However, having learned that the ticket has already been taken by Alice, B is forced to change the content of its resolution: it issues a resolution to sell the ticket to Alice after all:
After receiving the resolution, the other machines also write it into their own databases and return the result to B. B then informs Bob that the ticket has already been sold:
In this example, our distributed system handled the node failure very well and achieved the following:
- The same ticket was not sold to two people. Although machine B, which Bob queried, did not know at first that the ticket had been taken by Alice, Bob eventually found out. This is eventual consistency.
- Eventually every machine stores the same information: P-2B and A-2B-Alice.
A more realistic situation
Handling multiple Paxos instances at the same time:
In reality, a ticketing system cannot sell only one ticket as in the example above. By treating each ticket as a separate Paxos process, for example by adding a unique ticket identifier to every request and response, we can logically handle the sale of many tickets at the same time.
*TODO: add figure*
The lifecycle of a ticket
Another assumption we made was that a ticket can be sold only once, but in reality a ticket can be refunded and sold to another customer. In Paxos we need to introduce the concept of a state machine: simply put, one state becomes another state after an operation.
When a ticket is sold, its status changes from available to unavailable, and when it is refunded, its status becomes available again.
Train ticket sales and bank account balances can be expressed as follows:
Once you know the initial state and all the operations, you can compute the current state following the logic of the state machine. We can run one instance of the Paxos algorithm for each operation and use Paxos to synchronize operations 1, 2, 3, … across all machines; this keeps the same record on every machine.
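A minimal sketch of the ticket state machine described above (the state and operation names are assumptions made for illustration): every machine that agrees, via Paxos, on the same log of operations and replays it ends up in the same state.

```python
# Ticket state machine: each agreed-upon operation moves the ticket from one
# state to the next, so replaying the same operation log on every machine
# yields the same final state.

TRANSITIONS = {
    ("available", "sell"): "unavailable",
    ("unavailable", "refund"): "available",
}

def apply_op(state: str, op: tuple) -> str:
    """Apply one operation, e.g. ("sell", "Alice") or ("refund", "Alice")."""
    kind, _who = op
    next_state = TRANSITIONS.get((state, kind))
    if next_state is None:
        raise ValueError(f"operation {kind!r} is not allowed in state {state!r}")
    return next_state

def replay(initial: str, ops: list) -> str:
    """Knowing the initial state and all operations, compute the current state."""
    state = initial
    for op in ops:
        state = apply_op(state, op)
    return state

log = [("sell", "Alice"), ("refund", "Alice"), ("sell", "Bob")]
print(replay("available", log))   # -> "unavailable"
```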
The livelock problem in Paxos
If a proposer learns from an acceptor's reply that a proposal with a higher number has already been submitted, it stays silent for a while instead of immediately issuing an even higher-numbered proposal, and retries with a higher-numbered proposal only after the quiet period, which is roughly the estimated time it takes for a proposal to be accepted. The reason for having a master (distinguished) proposer in a system is performance: running a full Paxos round for every data change would be too slow, while routing requests through the master node is faster because it avoids unnecessary Paxos rounds. The master proposer itself is chosen using the Paxos algorithm, because electing the master happens far less often than changing data. To detect the loss of the master, a lease is usually used: if the master fails, its lease expires and the other proposers can elect a new master; as long as the master is alive, it keeps renewing its lease.
Raft
Raft is a consensus algorithm proposed at Stanford that is easier to understand than the widely used but notoriously hard to understand Paxos. It now has open-source implementations in all major languages.
Principles
In Raft, each node is in one of three states:
- Follower: all nodes start in the follower state. If a follower does not receive messages from a leader, it becomes a candidate.
- Candidate: a candidate "solicits votes" from the other nodes and becomes the leader if it wins a majority of the votes. This process is called Leader Election.
- Leader: all changes to the system go through the leader first, and each change is written as a log entry. After the leader receives a change request, the process is as follows (see the sketch after this list):
- Replicate the log entry to all follower nodes
- Do not commit the entry until a majority of nodes have responded
- Notify all followers that the entry has been committed
- All followers commit the entry as well
- The whole system is now in a consistent state
This process is called Log Replication.
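Here is a minimal sketch of the leader-side flow in the list above, assuming a hypothetical follower interface with `append_entries(entry)` and `commit(index)` methods; it mirrors the listed steps rather than a complete Raft implementation (no terms, retries, or consistency checks).

```python
# Leader-side Log Replication sketch: append locally, replicate, commit only
# after a majority of nodes (leader included) has acknowledged, then notify.

def leader_handle_change(entry, log, followers):
    log.append(entry)                       # write the change as a log entry
    index = len(log) - 1

    acks = 1                                # the leader counts itself
    for f in followers:
        if f.append_entries(entry):         # replicate the entry to each follower
            acks += 1

    majority = (len(followers) + 1) // 2 + 1
    if acks < majority:                     # not committed yet; a real leader retries
        return False

    apply_entry(entry)                      # majority reached: the leader commits
    for f in followers:
        f.commit(index)                     # notify followers; they commit too
    return True                             # the system is now consistent

def apply_entry(entry):
    # Placeholder for applying a committed entry to the replicated state machine.
    print("committed:", entry)
```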
Leader Election
If a follower does not receive AppendEntries messages from the leader within the election timeout, it becomes a candidate. To avoid election conflicts, the timeout is a random value between 150 and 300 ms.
A candidate starts a new election term to "solicit votes":
- Reset its own election timer
- Vote for itself
- Send RequestVote messages to the other nodes
If a receiving node has not yet voted in this term, it votes for the candidate and resets its own election timeout. A candidate that wins a majority of the votes becomes the leader and periodically sends heartbeat (AppendEntries) messages to reset each follower's timer. The current term continues until some follower stops receiving heartbeats and becomes a candidate.
What happens if two followers become candidates at the same time?
A split vote occurs: the two candidates may each draw the same number of votes, so neither wins. The election fails and this term has no leader. Afterwards, whichever follower's timer expires first becomes a candidate, increments the term, and starts a new round of voting.
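A minimal sketch of the election logic above, with randomized timeouts to make split votes unlikely; the `node` and `peer` interfaces (`request_vote`, `reset_timer`, `become_leader`) are hypothetical names used only for illustration.

```python
import random

def new_election_timeout_ms() -> float:
    # A random timeout between 150 and 300 ms, so nodes rarely time out together.
    return random.uniform(150, 300)

def start_election(node, peers):
    """Follower -> candidate transition: start a new term and solicit votes."""
    node.term += 1                          # start a new election term
    node.voted_for = node.id                # vote for itself
    node.reset_timer(new_election_timeout_ms())

    votes = 1                               # its own vote
    for peer in peers:
        if peer.request_vote(node.term, node.id):
            votes += 1                      # a peer votes if it hasn't voted this term

    if votes > (len(peers) + 1) // 2:
        node.become_leader()                # majority won: send heartbeats from now on
    # Otherwise the election failed (e.g. a split vote); whoever's timer
    # expires first starts a new election with term + 1.
```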
Log Replication
When a change occurs, the leader replicates the log entry to the follower nodes, also via AppendEntries (the same message used for heartbeats). The Log Replication process was listed above and is not repeated here.
Split-brain problem: in a high-availability (HA) system, when two connected nodes lose their connection, the system splits into two independent parts that start competing for shared resources, leaving the system in chaos and the data corrupted.
Raft handles network partitions ("split brain") correctly. Suppose there are five nodes A to E and B is the leader. If a split occurs such that A and B form one partition and C, D, and E form another, then C, D, and E hold an election and elect C as the leader of a new term. We now have two leaders with different terms in the two partitions. In this situation, if a client writes a log entry through B, B cannot replicate it to a majority of followers, so the entry stays uncommitted. Meanwhile, another client can complete a write through C correctly, because C is the new leader and C, D, and E together form a majority.
When network communication is restored, B sends heartbeats to C, D, and E and discovers that a "dynasty change" has happened: because C's term is larger, B automatically steps down to follower. A and B then roll back their uncommitted log entries and copy the latest log from the new leader.
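The "dynasty change" above comes down to a simple term-comparison rule, sketched below under the assumption of a hypothetical `node` object; real Raft additionally reconciles logs through AppendEntries consistency checks, which are omitted here.

```python
def on_higher_term(node, msg_term) -> bool:
    """Return True if the sender should be treated as the legitimate leader."""
    if msg_term > node.term:
        node.term = msg_term        # adopt the newer term
        node.become_follower()      # the stale leader (old B) steps down
        return True
    return msg_term == node.term    # equal term: accept; smaller term: reject
```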
Illustrations
A visual, animated Raft tutorial is highly recommended; it is not covered further here.
The ZAB protocol
ZAB (Zookeeper Atomic Broadcast) is an atomic broadcast protocol that supports crash recovery, designed specifically for the distributed coordination service Zookeeper. Based on this protocol, Zookeeper implements a master-slave architecture to keep the replicas in the cluster consistent.
The ZAB protocol allows only one main process (the Leader) to receive and process client transaction requests. After receiving a transaction request, the Leader converts it into a transaction Proposal. Because the Leader maintains a queue for each Follower, it places the Proposal into the corresponding queues to guarantee the order of transactions. After receiving the Proposal, a Follower writes it to its local transaction log and sends an Ack message back to the Leader. Once more than half of the Followers have returned Ack messages, the Leader commits the Proposal and sends a Commit message to the other nodes.
Reference: the article at segmentfault.com/a/119000003…
Concepts
Three roles
Leader: the core of the whole Zookeeper cluster's working mechanism. Its main duties are:
- A unique scheduler and handler of transaction requests to ensure sequential processing of cluster transactions
- A scheduler for each server within a cluster
Follower: a follower of the Leader, whose main duties are:
- Handles non-transactional requests from clients and forwards transaction requests to the Leader server
- Participate in the voting of the transaction request Proposal
- Participate in Leader election voting
Observer: a role introduced in Zookeeper 3.3.0. An Observer does not take part in voting on transaction Proposals or in Leader elections; it only serves non-transactional requests (queries). It is generally used to improve the cluster's non-transactional (read) throughput without affecting its transaction-processing capacity.
Three states
The ZAB protocol defines that a process can tell its role from its own state. While running, a process may be in one of the following states:
- LOOKING: the process is in the Leader election state
- FOLLOWING: the state of a Follower server synchronizing with the Leader server
- LEADING: the state of the Leader server acting as the main process
When the processes running the ZAB protocol start up, their initial state is LOOKING; at this point there is no Leader in the process group. Once a Leader has been elected, it enters message broadcast mode (introduced below), and the roles in the Zookeeper cluster are no longer in the LOOKING state.
ZXID
Zookeeper messages have strict causal ordering requirements, so every transaction request must be sorted and processed in order. How does Zookeeper keep requests in order? One of the keys is the ZXID.
So how does ZXID work?
After receiving a transaction request, the Leader server generates a corresponding Proposal for it to broadcast. Before broadcasting the Proposal, the Leader first assigns it a globally monotonically increasing unique ID, which we call the transaction ID (ZXID).
The ZXID is a globally ordered 64-bit number split into two parts (see the sketch below):
- The high 32 bits are the epoch, representing the Leader's period. Whenever a new Leader server is elected, it takes the epoch out of the largest ZXID in its local log, adds 1 to it as the new epoch, and resets the low 32 bits to zero.
- The low 32 bits are a counter, a simple monotonically increasing counter that is incremented by one for every transaction request from a client.
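A small sketch of the 64-bit layout just described (the function names are illustrative, not Zookeeper's API):

```python
# The 64-bit ZXID: high 32 bits = epoch, low 32 bits = per-epoch counter.

def make_zxid(epoch: int, counter: int) -> int:
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def epoch_of(zxid: int) -> int:
    return zxid >> 32

def counter_of(zxid: int) -> int:
    return zxid & 0xFFFFFFFF

def first_zxid_of_new_leader(largest_local_zxid: int) -> int:
    """A newly elected Leader takes the epoch of the largest ZXID in its local
    log, adds 1, and resets the low 32 bits to zero."""
    return make_zxid(epoch_of(largest_local_zxid) + 1, 0)

zxid = make_zxid(epoch=5, counter=42)
print(hex(zxid), epoch_of(zxid), counter_of(zxid))   # 0x50000002a 5 42
print(hex(first_zxid_of_new_leader(zxid)))           # 0x600000000
```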
The low 32 bits are a simple counter, and the epoch in the high 32 bits is incremented by 1 at every election. So what role does the epoch play in the ZAB protocol?
As we know, a new epoch value is generated whenever a new Leader server is elected, and as noted above, a Leader election is triggered during operation when the Leader server suffers a network interruption, crashes, or restarts, or when half of the servers in the cluster can no longer communicate with it.
This means the whole Zookeeper cluster is in an abnormal state at that moment, and we do not know at which stage of message broadcast the failure occurred. After the other Follower nodes recover from the crash and re-elect a Leader, the old Leader may reconnect and rejoin the cluster. At that point the old Leader's epoch is necessarily smaller than the new Leader's, so the old Leader is turned into a Follower and synchronizes the new Leader's data. Even if the old Leader still sends requests to the other Follower nodes, the Followers compare the ZXIDs: because their epoch (the high 32 bits) is larger than the old Leader's, they ignore the request.
Two modes
ZAB protocol includes two modes: crash recovery and message broadcast.
The Zookeeper cluster enters crash recovery mode and elects a Leader in two cases:
- A Leader election is held during server startup.
- When, during operation, the Leader server suffers a network interruption, crashes, or restarts, or when half of the servers in the cluster can no longer communicate with the Leader, the cluster enters crash recovery mode and starts electing a new Leader.
After the Leader server is elected, it enters message broadcast mode and starts to receive and process client requests.
Message broadcast mode
- After receiving a request, the Leader server assigns the transaction a ZXID before broadcasting the Proposal, and then broadcasts it.
- The Leader server assigns a separate queue to each Follower server, places the Proposals to be broadcast into these queues, and sends messages in FIFO order.
- Each Follower server, after receiving a Proposal, writes it to local disk as a transaction log, and on a successful write sends an Ack response back to the Leader.
- When more than half of the servers have responded, the Leader broadcasts a Commit message to all Follower servers, and each Follower commits the transaction upon receiving it.
The whole process resembles two-phase commit, with one difference: ZAB simplifies the two-phase commit model by committing the transaction Proposal once more than half of the Follower servers have replied with Ack, without waiting for all servers to respond (as sketched below).
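A minimal sketch of the broadcast flow above, assuming a hypothetical Follower interface with `send(msg)` and `acked(zxid)`; per-follower FIFO queues preserve the proposal order, and the Commit is broadcast once more than half of the servers (the Leader included) have acknowledged.

```python
from collections import deque

class ZabLeaderSketch:
    def __init__(self, followers, epoch=1):
        self.followers = followers
        self.queues = {f: deque() for f in followers}   # one FIFO queue per follower
        self.epoch = epoch
        self.counter = 0

    def broadcast(self, request) -> bool:
        # Assign a ZXID to the transaction before broadcasting the Proposal.
        zxid = (self.epoch << 32) | self.counter
        self.counter += 1
        proposal = ("PROPOSAL", zxid, request)

        acks = 1                                        # the Leader counts itself
        for f in self.followers:
            self.queues[f].append(proposal)             # enqueue per follower...
            while self.queues[f]:
                f.send(self.queues[f].popleft())        # ...and send in FIFO order
            if f.acked(zxid):                           # follower acks after logging to disk
                acks += 1

        if acks > (len(self.followers) + 1) // 2:       # more than half responded
            for f in self.followers:
                f.send(("COMMIT", zxid))                # followers commit on COMMIT
            return True
        return False                                    # real ZAB would keep retrying
```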
Crash recovery mode
As mentioned above, crash recovery is triggered by abnormal situations such as network interruption, crash, or restart. The process is as follows:
Summary
In high-concurrency distributed scenarios, many theories have been born to guarantee data consistency, each with its own application scenarios, and they developed gradually. The CAP and BASE theories reflect the trade-offs present in today's distributed systems, while 2PC, 3PC, Paxos, Raft, and ZAB are all common consistency and consensus protocols. I hope this article gives you a preliminary understanding of them.
This article is a theoretical summary; stay tuned for follow-up articles with examples of which components use these theories.