The consistency problem can be regarded as a temple problem in the distributed domain, and its research can be traced back decades.
Question of the Byzantine general
Leslie Lamport’s paper “The Byzantine Generals Problem” (cf. [1]), published over thirty years ago.
Byzantium was the capital of the Eastern Roman Empire in what is now Istanbul, Turkey. Because of the vast territory of the Byzantine Empire, for defensive purposes, each army was separated far apart, and generals had to rely on messengers to carry messages from one another. In times of war, all the generals in the Byzantine army had to reach a consensus to decide if they had a chance to win before attacking the enemy’s camp. However, there may be traitors and enemy spies in the army, and the decision of the generals may disturb the order of the whole army. When consensus is reached, the result does not represent the majority opinion. At this point, the Byzantine problem was formed as to how the remaining loyal generals could reach an agreement without being influenced by traitors or spies, given that some of their members were known to be unreliable. The Byzantine hypothesis is a modeling of the real world, where computers and networks can behave unpredictably due to hardware errors, network congestion or disconnection, and malicious attacks.
Lamport has been researching such issues and has published a series of papers. But a summary is to answer the following three questions:
-
Is there a solution to a distributed consistency problem like the Byzantine Generals?
-
What conditions, if any, do I have to satisfy?
-
On the basis of the specific precondition, a solution is proposed.
Lamport has already answered The first two questions in The paper “The Byzantine General Problem”, and proposed an algorithm named Paxos for The third question in The later paper “The Part-time Parliament”. This paper uses a lot of mathematical proofs, but I can hardly understand them (I can’t recognize all the mathematical symbols). -; ), later Lamport wrote another paper “Paxos Made Simple” which completely abandoned all the proof of mathematical symbols and used pure English logical derivation. I barely read the word for word, and then feel if there is understanding, but you ask me to understand, my standard should still do not understand. For me, there is a clear criterion for understanding an algorithm, that is, if I really understand it, I can map the algorithm into code in my mind, but after reading the following paper, I can not achieve the clarity of mapping into code only if I understand it.
Lamport thinks Paxos is simple, but maybe it’s just for his head. The truth is that it’s still hard to understand, so Raft was built in the hope of a more comprehensible alternative to the Paxos algorithm. Understandable as one of the main objectives of the Algorithm, In Search of an Understandable Consensus Algorithm.
Before entering into the main topic, I remembered an old story which can intuitively feel the difference in comprehensibility of different thinking perspectives on a problem.
Understandability from different perspectives
About twenty years ago, when I was in junior high school, I came across an interesting question in a book probably called divergent Thinking in Mathematics (I can’t remember the title clearly).
A and b two people take turns to put black and white weiqi on a round table, each time put a son, chess pieces are not allowed to overlap, who has no place to put first lose. How can I put it to win?
The question is twofold. First, is there a surefire way to win? Second, if so, how? Pause here and think about it for ten seconds.
The diagram above answers this question, first mover wins, using three different ways of thinking.
-
Suppose the table is only as big as a go child.
-
If the table is infinitely large, the first mover occupies the center of the circle. Since the circle is symmetrical, you can always find a place on the other side of the circle as long as your opponent can still find a place.
-
A circle can have a number of small circles of equal diameter and tangent to each other.
Three different ways of thinking gradually deepen in comprehensibility difficulty. The first is extremely simplistic thinking, but mathematically imprecise. The second is limit thinking, combined with the first is mathematical induction, mathematically rigorous. The third type of thinking is figurative thinking, which uses geometric concepts but is difficult to understand for people without basic knowledge of geometry.
Comprehensible description of Raft protocol
Raft’s paper is a lot easier to read than Paxos’s simple paper, but it still diverges a lot and is quite lengthy. After reading the book, I pondered that I would be more secure and become truly my own. Here I use the first minimalist approach from black and white to describe and proof-of-concept how Raft protocol works.
There are three types of roles in a cluster organized by the Raft protocol:
-
A Leader
-
From the followers
-
Candidate
Like a democratic society, leaders are chosen by popular vote. No leader at first, all of the participants in the cluster is the masses, so first open round, all people can participate in during the election campaign, at this time all the role becomes a candidate of the masses, began after the democratic vote for leaders that the leader’s tenure, then the election, all except the leader of the candidates and get back to the masses obey leader leadership role. The concept of “tenure” is mentioned here, expressed in the Term Term. That’s all the core concepts and terminology for Raft protocol and it’s a good fit for a real democracy, so it’s easy to understand. The transition of the three roles is shown below, which is easy to understand in combination with the following election process.
Leader Election Process
In A minimalist way of thinking, A minimal Raft democratic cluster needs three participants (A, B, C) to be able to cast A majority vote. The initial state ABC are all followers, and then there are three possible scenarios in which an election is initiated. In the figure below, the first two can both elect the Leader, while the third one indicates that Split Votes are invalid, in which each party Votes for itself and no party wins a majority of Votes. Each participant then takes a random Election Timeout to re-vote until one party has a majority. The key here is the random timeout. The first party to recover from timeout asks the other two parties in timeout to vote, and then they can only vote for each other, quickly reaching an agreement.
After the Leader is elected, he maintains his rule by periodically sending heartbeat messages to all followers. If the Follower does not receive the heartbeat from the Leader for a period of time, the Follower considers that the Leader may have hung up and initiates the main selection process again.
Impact of the Leader node on consistency
Raft protocol strongly relies on the availability of the Leader node to ensure consistency of cluster data. The flow of data can only be transferred from the Leader node to the Follower node. After the Client submits data to the cluster Leader node, the Leader node receives the data in Uncommitted state. Then the Leader node copies the data to all the Follower nodes concurrently and waits for the response to be received. Ensure that at least half of the nodes in the cluster have received data before confirming data receipt to the Client. Once an Ack response is sent to the Client, the data is Committed. The Leader node sends a notification to the Follower node to inform the Follower node that the data is Committed.
During this process, the master node can fail at any stage. Take a look at how Raft protocol ensures data consistency for different stages.
1. Data arrives at the Leader node
The Leader failure in this phase does not affect consistency.
2. Data reaches the Leader node but is not copied to the Follower node
In this phase, the Leader hangs up and the data is in the uncommitted state. The Client will not receive an Ack and will consider the timeout failure to initiate a safe retry. If the Follower node does not have this data, the Client can resubmit the data after the master is selected again. After the original Leader node is recovered, it joins the cluster as a Follower and synchronizes data from the new Leader to forcibly keep the data consistent with that of the new Leader.
3. Data reaches the Leader node and is successfully copied to all nodes of Follower, but no response is received from the Leader
At this stage, the Leader hangs up. Although the Follower node remains Uncommitted, the data can be submitted after the Leader is elected again. At this point, the Client can retry the submission because it does not know whether the submission is successful. In this case Raft requires RPC requests to be idempotent, that is, to implement an internal de-duplication mechanism.
4. Data reaches the Leader node and is successfully copied to some nodes of Follower, but is not received by the Leader
At this stage, the Leader hangs up, the data on the Follower node is Uncommitted and inconsistent, and Raft protocol requires that only the node with the latest data be voted. Therefore, the node with the latest data will be selected as the Leader and forced to synchronize data to followers, so that data will not be lost and ultimately consistent.
5. The data reaches the Leader node and is successfully copied to all or most of the followers nodes. The data is in the submitted state at the Leader node but not in the unsubmitted state at the Follower node
In this phase, the Leader hangs up and a new Leader is elected. The process is the same as that in phase 3.
6. The data reaches the Leader node and is successfully copied to all or most of the Follower nodes. The data is submitted on all nodes but has not responded to the Client
At this stage, the Leader fails, and the data in the Cluster is actually consistent. The repeated retries of the Client based on the idempotent policy have no impact on consistency.
7. In the case of split brain caused by network partition, two leaders appear
Network partition separates the original Leader node from the followers node. If the followers fail to receive the heartbeat message from the Leader, they will elect a new Leader. At this time, two leaders are generated. The original Leader is alone in a region, and the data submitted to it cannot be copied to most nodes, so the submission will never succeed. Data submitted to the new Leader can be submitted successfully. After the network is restored, the old Leader automatically demotes to Follower if the new Leader has a new Term in the cluster and synchronizes data from the new Leader to achieve data consistency in the cluster.
This exhaustive analysis of all the scenarios for the smallest cluster (3 nodes) shows that Raft protocol deals with consistency very well and is easy to understand.
Conclude this article with a summary from the last section of Raft’s paper.
The algorithm takes correctness, efficiency and simplicity as its main design goal. While these are worthy goals, none of these goals will be achieved until the developer writes a usable implementation. So we believe that comprehensibility is equally important.
Think about it. Paxos was published on Leslie Lamport’s website in 1990. When did we first hear about it? When will an implementation be available? Raft algorithm was published in 2013 and you can see in reference [5] how many open source implementation libraries there are in different languages, which is why comprehensibility is important.
[1]. LESLIE LAMPORT, ROBERT SHOSTAK, MARSHALL PEASE. [The Byzantine General Problem](http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf). 1982
[2]. Leslie Lamport. [The Part-Time Parliament](http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf). 1998
[3]. Leslie Lamport. [Paxos Made Simple](http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf). 2001
[4]. Diego Ongaro and John Ousterhout. [Raft Paper](https://ramcloud.stanford.edu/raft.pdf). 2013
[5]. Raft Website. [The Raft Consensus Algorithm](https://raft.github.io/#implementations)
[6]. Raft Demo. [Raft Animate Demo](http://thesecretlivesofdata.com/raft/)
Write a little, draw a little,”Instant between“Everything has changed. If you feel good, you can press or scan the QR code to follow.