Overview of Byzantium

First of all, what is Byzantium?

Byzantium was the Byzantine Empire, also known as the Eastern Roman Empire, with its capital in what is now Istanbul, Turkey. As you can see in the picture above, the Byzantine Empire stretched across Asia, Europe, and Africa, and encircled the Mediterranean.

The Byzantine Generals Problem

Before the age of firearms, armies led by hundreds of generals were often stationed in different parts of such a vast territory for conquest and defense, as shown above.

The practical problem facing the armies of the Byzantine Empire was how to coordinate and exchange messages among all the armies across this vast territory. In that era, messages and orders could only be carried by hand by communication soldiers (messengers).
Precisely because messages were carried this way, things could go wrong. A messenger might fail to deliver a message because of illness, weather, or terrain; he might defect and deliberately withhold it; or he might tamper with an order. In any of these cases the generals would receive very different orders and could not be guaranteed to act in unison. In other words, the channel through which messages are transmitted is unreliable, and this leads to a consistency problem when the action is carried out.

This is the Byzantine Generals Problem.

In essence, the Byzantine Generals Problem is a consistency problem: different individuals (generals) in a distributed environment (a vast territory) receive messages (orders) through unreliable channels (communication soldiers) and must still take consistent actions.

The Byzantine Generals Problem in the distributed domain

In fact, the Byzantine Generals Problem is a general problem that occurs in the real world. Whenever the same scenario and conditions hold in some field, the situation can be regarded as the Byzantine Generals Problem in that field, a concrete instance of a broad class of problems.

Let us carry this problem over to computers and the Internet. Distributed deployment is common for Internet services, and because of hardware errors, network congestion or disconnection, and malicious attacks, computers and networks may behave unpredictably. This scenario and its conditions closely resemble the Byzantine Generals Problem.

On this basis, Leslie Lamport et al. introduced the well-known notion of Byzantine failures to discuss how hard it is to reach consistency by message passing over unreliable channels, where messages may be lost, in a distributed environment.

Leslie Lamport is also the author of the well-known Paxos consensus algorithm. The Raft consensus algorithm, covered in this introductory series on distributed systems, was designed as a more understandable alternative to Paxos. That is the quick thread connecting these concepts; if you're interested, read on.

Here’s a comparison between the Byzantine Empire and distributed services in real life:

Scenario	Source and destination	Channel	Message	Causes of unreliability
The Byzantine Empire	Central command <-> General, General <-> General	Communication soldiers	Command	Weather, geography, war, mutiny
Distributed service	Service node <-> Service node	Network	Message	Corruption, timeout, interruption, malicious attack

Core analysis of the problem

Core problem

The core problem of the Byzantine Generals Problem in a distributed environment or service is how nodes spread across a network can reach consensus when there is no trusted central node and no trusted channel. This is the first problem a distributed service must solve, and it is the foundation and basic guarantee for everything else.

Problem analysis

To get at the essence of the problem, we need to work from the phenomenon down to its root. In my view, the starting point is the objective reality of a distributed environment: service clusters with scattered nodes. While we make full use of the advantages that distribution brings, we also have to face and solve the consistency problem of coordinating and commanding a massive number of nodes. Whether a unified control center issues instructions or the nodes negotiate with each other, communication must happen through message interaction, so that each party learns the latest state of the other party and of the external environment. The smallest unit that carries this interaction is information, so that is where we start the analysis.

In the information theory proposed by Claude Elwood Shannon, communication is commonly described in terms of three parts: source, destination, and channel. The message body, as the carrier, travels between source and destination through the channel.

Extending these concepts to a distributed environment: each service node plays the source role when it sends a message and the destination role when it receives one, and the channel is a two-way channel established between nodes.

Component	Role	Function	Possible problems
Source	Communication initiator	Generates the information	Untraceable origin, faults
Destination	Communication receiver	Receives the information	Forgery, faults
Channel	Communication channel	Conveys the information	Hijacked, eavesdropped, interrupted
Message body	Communication carrier	Carries the information	Tampered with, lost

From this analysis, the problem is characterized by unreliable nodes, unreliable channels, and unreliable information. Targeted solutions must therefore address each of these characteristics in order to achieve consistency.

Problem solving

Solution: Oral agreement

The implementation process

The oral agreement is also known as the Oral Message approach. A method that satisfies the following three conditions is called an oral agreement:

Every message that is sent is delivered correctly (the channel is fully trusted)
The receiver of a message knows who sent it (the immediate sender is known, but earlier senders are not, i.e., the message is not traceable)
The absence of a message can be detected

The implementation of the scheme is as follows:

[step-1] Each node receives the command command-1 from the Commander.
[step-2] After receiving the Commander's message, each node forwards it to the other nodes. Eventually, each node holds a set of messages {the Commander's message, the Commander's message as relayed by the other nodes} and decides how to proceed based on that set.

Node	Set of received messages
A	{ Command_{sent by Commander}, Command_{relayed by B}, Command_{relayed by C} }
B	{ Command_{sent by Commander}, Command_{relayed by A}, Command_{relayed by C} }
C	{ Command_{sent by Commander}, Command_{relayed by A}, Command_{relayed by B} }
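The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch, assuming every participant is reliable and the channel delivers every message; the function name `exchange` and the dictionary layout are hypothetical and not part of any library.

```python
def exchange(commander_order, nodes):
    """Return the message set each node holds after step 1 and step 2."""
    # Step 1: every node receives the order directly from the Commander.
    direct = {node: commander_order for node in nodes}

    # Step 2: every node relays what it received to all the other nodes.
    received = {}
    for node in nodes:
        messages = {"Commander": direct[node]}
        for peer in nodes:
            if peer != node:
                messages[peer] = direct[peer]  # order as relayed by peer
        received[node] = messages
    return received


message_sets = exchange("command-1", ["A", "B", "C"])
for node, messages in message_sets.items():
    print(node, messages)
```

Each node ends up with one message from the Commander and one relayed copy from every other node, which mirrors the table above.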

Unreliable scenario

The Commander is unreliable

When the Commander is unreliable, the set of messages received by each node is:

Node	Set of received messages	Judgment	Node execution result	Distributed consistency
A	{ command-0_{sent by Commander}, command-1_{relayed by B}, command-1_{relayed by C} }	Inconsistent instructions	Does not execute	Consistent
B	{ command-1_{sent by Commander}, command-0_{relayed by A}, command-1_{relayed by C} }	Inconsistent instructions	Does not execute	Consistent
C	{ command-1_{sent by Commander}, command-0_{relayed by A}, command-1_{relayed by B} }	Inconsistent instructions	Does not execute	Consistent
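The unreliable-Commander case in the table can be reproduced with a small sketch. It assumes the decision rule implied by the table: a node executes only when every message in its set agrees, and otherwise does nothing. The helper names `collect` and `decide` are hypothetical.

```python
def collect(direct_orders):
    """direct_orders maps each node to the order the Commander sent it."""
    received = {}
    for node in direct_orders:
        # Each reliable node honestly relays the order it received.
        relayed = {peer: order for peer, order in direct_orders.items() if peer != node}
        received[node] = {"Commander": direct_orders[node], **relayed}
    return received


def decide(messages):
    """Execute only if all received orders agree; otherwise return None."""
    orders = set(messages.values())
    if len(orders) == 1:
        return orders.pop()
    return None  # inconsistent instructions: the node does not execute


# An unreliable Commander sends command-0 to A and command-1 to B and C.
sets = collect({"A": "command-0", "B": "command-1", "C": "command-1"})
decisions = {node: decide(messages) for node, messages in sets.items()}
print(decisions)  # {'A': None, 'B': None, 'C': None}
```

All three nodes detect the inconsistency and refrain from acting, so they remain consistent with each other even though no command is carried out.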

A Node is unreliable

A minority is unreliable

When a minority of nodes is unreliable, for example A, the set of messages received by each node is:

Node	Set of received messages	Judgment	Node execution result	Distributed consistency
A	{ command-1_{sent by Commander}, command-1_{relayed by B}, command-1_{relayed by C} }	Unreliable node	Executes the forged command command-0	Majority consistent
B	{ command-1_{sent by Commander}, command-0_{relayed by A}, command-1_{relayed by C} }	Inconsistent instructions	Does not execute	Majority consistent
C	{ command-1_{sent by Commander}, command-0_{relayed by A}, command-1_{relayed by B} }	Inconsistent instructions	Does not execute	Majority consistent

A majority is unreliable

When a majority of nodes is unreliable, for example A and C, the set of messages received by each node is:

Node	Set of received messages	Judgment	Node execution result	Distributed consistency
A	{ command-1_{sent by Commander}, command-1_{relayed by B}, command-0_{relayed by C} }	Unreliable node	Executes the forged command command-0	Majority inconsistent
B	{ command-1_{sent by Commander}, command-0_{relayed by A}, command-0_{relayed by C} }	Inconsistent instructions	Does not execute	Majority inconsistent
C	{ command-1_{sent by Commander}, command-0_{relayed by A}, command-1_{relayed by B} }	Unreliable node	Executes the forged command command-0	Majority inconsistent
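Both unreliable-Node scenarios can be simulated together. The sketch below rests on assumptions read off the tables: the Commander is reliable, an unreliable node relays and executes a forged command, and a reliable node executes only when all the messages it holds agree. The function `run_round` and its parameters are hypothetical names.

```python
def run_round(order, nodes, traitors, forged):
    """Return what each node executes (None means it does not execute)."""
    results = {}
    for node in nodes:
        messages = {"Commander": order}
        for peer in nodes:
            if peer != node:
                # An unreliable peer relays the forged command instead.
                messages[peer] = forged if peer in traitors else order
        if node in traitors:
            results[node] = forged  # unreliable node runs the forged command
        elif len(set(messages.values())) == 1:
            results[node] = order   # all messages agree: execute
        else:
            results[node] = None    # inconsistent instructions: do nothing
    return results


nodes = ["A", "B", "C"]
minority = run_round("command-1", nodes, {"A"}, "command-0")
majority = run_round("command-1", nodes, {"A", "C"}, "command-0")
print(minority)  # {'A': 'command-0', 'B': None, 'C': None}
print(majority)  # {'A': 'command-0', 'B': None, 'C': 'command-0'}
```

With one unreliable node, B and C still behave identically. With two, the only reliable node is outnumbered by nodes running the forged command, which is the inconsistency shown in the last table.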

To sum up, whether it is the Commander issuing commands or the Nodes executing them, once too many nodes are unreliable, consistency in the distributed system is broken and can no longer be guaranteed. If n denotes the total number of nodes (the Commander plus the Nodes) and m denotes the number of unreliable nodes, then n ≥ 3m + 1 must hold to ensure the consistency of the distributed system. If you are interested, you can derive the relationship yourself. The following are examples:

Total nodes (n)	Commanders (1)	Nodes (n - 1)	Maximum number of unreliable nodes (m)
4	1	3	1
7	1	6	2
10	1	9	3
...	...	...	...
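Rearranging n ≥ 3m + 1 gives m ≤ (n - 1) / 3, so the largest tolerable m is the integer part of (n - 1) / 3. The sketch below, with the hypothetical helper names `max_unreliable` and `is_tolerable`, reproduces the rows of the table from that relationship.

```python
def max_unreliable(n):
    """Largest m that satisfies n >= 3m + 1 for a total of n nodes."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


def is_tolerable(n, m):
    """True if n total nodes can tolerate m unreliable nodes."""
    return n >= 3 * m + 1


# Columns: total nodes, Commanders, Nodes, maximum unreliable nodes.
for n in (4, 7, 10):
    print(n, 1, n - 1, max_unreliable(n))
```

For example, `is_tolerable(4, 1)` is true while `is_tolerable(3, 1)` is false, which shows that three participants cannot tolerate even one unreliable node under this bound.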

Shortcomings

Messages cannot be traced, so the correctness of a message relayed by another node cannot be verified
When a majority of the nodes is unreliable, no agreement can be reached