Blockchain is a hot topic these days. Some want to use it to change the world, some want to use it to cheat.
Here, though, we will look only at the technology. From a technical point of view, blockchain is a distributed-systems technology. How does it relate to the familiar concepts of distributed systems? This article takes the opportunity to walk through the core problems and concepts of distributed systems with you, and then discusses blockchain technology in a way that follows logically from them.
Distributed systems and consistency issues
A system powered by blockchain technology, such as the Bitcoin Network or Ethereum, is technically a very large distributed system. So let’s start with distributed systems.
Technical people, especially server-side developers, deal with distributed systems all the time. As a service grows, it inevitably evolves into a complex distributed system. A typical example is a distributed database, which stores data across multiple nodes in a way that keeps the data consistent while remaining highly available.
The consensus problem is in fact the core problem that distributed systems must solve. A distributed system is usually composed of multiple nodes of equal status, and the interaction between nodes is like several people getting together to discuss a problem. Consider a concrete scenario: three people discussing where to go for lunch. The first says a hot pot restaurant has just opened nearby and is supposed to be very good. The second says no, hot pot takes too long; let's just have porridge. The third says, "I went to that porridge place yesterday and it was terrible. We might as well go to McDonald's." The three are deadlocked and cannot agree.
Someone might say this is easy to solve: just vote. So the three cast a round of votes, but each still insisted on his own proposal, and nothing was settled. Then someone has another idea: why not elect a leader and do whatever the leader says, so there is no more arguing? So everyone votes for a leader. The result is tragic: each thinks he himself should be the leader. The three finally discover that "electing a leader" is, in essence, the same kind of problem as "where to eat", and just as hard to solve.
At this point some readers are probably thinking: what is wrong with these three people? Why fight over something as trivial as a meal? In fact, if the nodes of a distributed system interact without strictly defined rules and protocols, they can end up in exactly the same kind of deadlock. And a system that cannot reach agreement cannot work at all.
As a result, smart people designed consensus protocols, such as the well-known Paxos, Raft, and Zab. In Paxos terms, much like the discussion above, each node may put forward its own proposal (a proposal carries a concrete value), and the goal of the protocol is for every node to settle on the same proposal according to certain rules. But whose proposal should win? An easy rule to think of might be: whichever node proposes first wins, and later proposals are invalid. However, the situation in a distributed system is much more complicated than a few people sitting around a table. Network latency, in particular, makes it difficult to impose a global order on all the events that are happening. Take a simple example: suppose nodes A and B each send their proposal to nodes X and Y at almost the same time, but because the messages experience different delays in the network, X receives A's proposal first and then B's, while Y receives B's proposal first and then A's. X and Y therefore cannot easily agree on whose proposal came first.
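To make the ordering problem concrete, here is a tiny sketch. The node names and delay values are invented for illustration and are not part of any real protocol; it only shows how two receivers can observe the same two proposals in opposite orders purely because of link delays.

```python
# Two proposals sent at the same instant can arrive in different orders at
# different receivers simply because per-link delays differ.
# All names and numbers below are made up.
sends = {"A": 0.0, "B": 0.0}                      # send time of each proposal (ms)
delay = {("A", "X"): 10, ("B", "X"): 25,          # per-link delays (ms)
         ("A", "Y"): 30, ("B", "Y"): 5}

for receiver in ("X", "Y"):
    order = sorted(sends, key=lambda p: sends[p] + delay[(p, receiver)])
    print(receiver, "sees proposals in order:", order)

# X sees proposals in order: ['A', 'B']
# Y sees proposals in order: ['B', 'A']
```

Any consensus protocol has to produce a single agreed value despite this kind of divergence.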
Furthermore, the situation gets even more complicated once we consider node crashes and message loss. A node crash can be seen as a special case of message loss in which every message sent to that node is lost. In the CAP framework, this is equivalent to a network partition, the P in CAP. Why can node crashes and message loss be treated as network partitions? Because from the outside the cases are virtually indistinguishable. If some nodes become unreachable, meaning messages sent to them receive no response, the real reason might be that the network is cut off, that the destination nodes are down, or simply that messages are being delayed indefinitely. In any case, some nodes in the system cannot be contacted and cannot take part in decision-making, but that does not mean they will stay unreachable forever.
For concreteness, suppose some nodes are down. Can the remaining nodes still agree on a proposal without them? Even if they do, when the crashed nodes recover (note that they may know nothing about the proposal agreed upon in their absence), will they object to it and throw the system back into confusion? These are exactly the problems a distributed consensus protocol has to solve.
There is not enough space here to discuss the implementation of these protocols in detail; interested readers can refer to the relevant papers. In fact, understanding the question is more important than understanding the answer. In summary, we need to know that distributed consistency algorithms exist that solve the problems discussed above and ensure that, in a decentralized network, all nodes eventually agree on a single proposal. And, in general, as long as a majority of the nodes (a quorum) in the network are still alive, meaning they can send and receive messages to each other, agreement can be reached.
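One intuition for why a majority (quorum) suffices: any two majorities of the same node set overlap in at least one node, so two conflicting decisions can never each be backed by a majority. Below is a minimal sketch of this counting argument; it is illustrative only, not code from any consensus implementation.

```python
from itertools import combinations

nodes = {1, 2, 3, 4, 5}
majority = len(nodes) // 2 + 1          # 3 out of 5

# Every pair of majority-sized quorums shares at least one node,
# so two conflicting decisions can never both gather a majority.
quorums = [set(q) for q in combinations(sorted(nodes), majority)]
assert all(q1 & q2 for q1 in quorums for q2 in quorums)
print(f"checked {len(quorums)} quorums of size {majority}: every pair intersects")
```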
The Byzantine Generals Problem
An important prerequisite of the consistency protocols we discussed above is that every node can be trusted and strictly follows the same set of rules. This condition is basically satisfiable inside a company's internal network. But what if it is not met? Suppose some nodes in the network are malicious: they not only ignore the protocol but deliberately cause trouble, for example by sending arbitrary messages. How can the remaining honest nodes still function?
In distributed systems theory, this is abstracted into a famous problem: the Byzantine Generals Problem, posed by Leslie Lamport, the author of Paxos and the winner of the 2013 Turing Award.
The problem starts with a story (made up by Lamport, of course). Several armies of the Byzantine Empire besieged an enemy city, each camped separately and each led by its own general. To agree on a unified battle plan, the generals had to communicate with one another by messenger. But some of the generals might be traitors, whose purpose was to prevent the loyal generals from agreeing on a plan. To that end they might do anything: collude, deliberately spread false information, or send no information at all.
Let's look at a simple example. Suppose there are five generals voting on whether to attack or retreat. Two of them think they should attack and two think they should retreat, so the vote is tied two to two. The fifth general happens to be a traitor: he tells the first two that he favours attack, and the other two that he favours retreat. As a result, the first two generals decide to attack while the other two decide to retreat. There is no unified battle plan.
This problem is obviously harder than consistency in a trusted environment, which we discussed in the previous section. To solve it, we want to find an algorithm that achieves the following goals even in the presence of traitors:
- (Goal A) All loyal generals arrive at the same battle plan. For example, they all decide to attack, or they all decide to retreat, rather than some attacking while others retreat.
- (Goal B) The loyal generals should not only arrive at the same battle plan, but at a reasonable one. If, for example, attacking would clearly have been the better plan, but the traitors manage to steer everyone into agreeing to retreat, then our algorithm has failed.
As you can see, goal A is fairly straightforward: given an algorithm, it is at least easy to check whether it achieves this goal. Goal B, however, is vague. Whether a battle plan is "reasonable" is inherently difficult to define; even with no traitors at all, loyal generals do not always come up with a sensible plan. This touches on something very important in scientific research: if a thing cannot be clearly defined in a formal way, it cannot be studied, and it never rises to the level of science. One of Lamport's outstanding contributions to the Byzantine Generals Problem is reducing this seemingly ill-defined problem to one that can be described precisely in mathematical terms. Let's look at how this works.
First consider how the generals work out a battle plan, assuming for now that there are no traitors. Each general, based on his own observation of the battlefield, forms his proposed plan: attack or retreat. Each general then sends his proposal to every other general by messenger. Now every general knows the proposals of all the others as well as his own, and needs to derive a final battle plan from all this information. For clarity, number the generals 1, 2, ..., n and write the proposal of general i as v(i). There are n values v(1), v(2), ..., v(n); some mean attack, some mean retreat. Once the messengers have delivered all the messages, every general sees the same set of proposals v(1), v(2), ..., v(n), one of which is of course his own. Then, as long as every general combines v(1), v(2), ..., v(n) in the same way, they all obtain the same final battle plan. An obvious way to combine them is majority vote: count the votes for each plan among v(1), v(2), ..., v(n) and choose the plan with the most votes.
Of course, the resulting battle plan is not guaranteed to be the best one, but it is about the best we can do. Still assuming there are no traitors, notice that goals A and B can be "watered down": we no longer ask directly whether the generals agree on a final battle plan, or whether that plan is "reasonable"; we only ask whether every general receives exactly the same set of proposals v(1), v(2), ..., v(n). As long as every general receives exactly the same proposals and combines them in the same way, a consistent battle plan follows automatically. Whether that plan is the best possible one depends on all sorts of "human" factors, which we no longer care about.
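Here is a minimal sketch of that combining step, assuming every general already holds the identical list of proposals. The tie-breaking rule is an arbitrary choice made for this illustration; the only real requirement is that all generals use the same one.

```python
from collections import Counter

def combine(proposals):
    """Majority vote over v(1), ..., v(n). Ties are broken by a fixed rule
    (alphabetical order) so that every general computes the same answer."""
    counts = Counter(proposals)
    best = max(counts.values())
    return min(p for p, c in counts.items() if c == best)

v = ["attack", "retreat", "attack", "attack", "retreat"]
print(combine(v))   # -> attack
```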
Now consider a traitor among the generals. Following the previous thread, we still expect every general to end up with exactly the same proposals v(1), v(2), ..., v(n). Look closely at one of these values, v(i), which above meant "the battle proposal from the ith general". If the ith general is loyal, there is nothing wrong with this definition. But if the ith general is a traitor, there is a problem. Why? Because a traitor can do as he pleases: to disrupt the whole plan, he may well give different proposals to different generals. In that case, different loyal generals may receive different values of v(i) from the ith general, so the definition of v(i) no longer works and needs to be changed.
In any case, even with traitors present, we still expect every general to end up summarizing from exactly the same set of proposals, which we continue to write as v(1), v(2), ..., v(n). However, v(i) no longer means "the proposal sent by the ith general"; it means "the ith proposal that every general finally sees after running some agreement algorithm we have yet to design". Two cases need to be distinguished. First, if the ith general is loyal, we naturally want v(i) to be exactly the proposal he sent. In other words, we want the algorithm to ensure that a loyal general's proposal is faithfully conveyed to all the other generals, undisturbed by whatever the traitors do. This is a prerequisite for ending up with a "reasonable" battle plan. Second, if the ith general is a traitor, he may send different proposals to different generals. In that case we cannot simply take his word for it; instead, we want the algorithm to let the generals fully exchange what they heard, so that each of them, judging from the information relayed by the others, derives some value for v(i). It does not really matter whether this v(i) means attack or retreat; what matters is that every general ends up with the same v(i). Only then can the generals, after combining v(1), v(2), ..., v(n), obtain a final, fully consistent battle plan.
From the above analysis we find that, in both cases, we only need to focus on how the proposal of a single general (the ith general) is communicated to the others. This is the key point! We can now reduce the original problem to a subproblem, and this subproblem is what Leslie Lamport's paper actually calls the "Byzantine Generals Problem". In this subproblem, the role of issuing orders is concentrated in a single general, called the commanding general (the commander); the remaining generals are called lieutenants. The precise statement of the Byzantine Generals Problem is as follows.
A commanding general must send an order to his n-1 lieutenants such that the following two conditions hold:
- (IC1) All loyal lieutenants ultimately obey the same order.
- (IC2) If the commanding general is loyal, then every loyal lieutenant obeys the order he sends.
These two conditions correspond exactly to the two cases discussed above. If the commander is loyal, IC2 ensures that his order is delivered faithfully, and IC1 is then satisfied automatically. If the commander is a traitor, IC2 says nothing, but IC1 still guarantees that even if he sends different orders to different lieutenants, every loyal lieutenant ends up obeying the same order.
There are two possible areas of confusion here.
First, some may ask: can the commanding general himself be a traitor? If the one giving the orders is a traitor, what is the point? Remember that the Byzantine Generals Problem is a subproblem of the original problem. When n generals decide on a battle plan by exchanging messages, the process can be decomposed into n instances of the Byzantine Generals Problem, with each general in turn acting as the commander and the other n-1 generals as his lieutenants. If we have an algorithm that solves the Byzantine Generals Problem, then running n instances of it in parallel gives every general exactly the same sequence of proposals v(1), v(2), ..., v(n). Finally, each general combines v(1), v(2), ..., v(n) by the same method (for example, majority vote) to obtain the final battle plan.
Second, if the commander is a traitor and sends different orders to different lieutenants, how can the lieutenants still end up with the same order? That is precisely what the algorithm has to solve. The idea, which we touched on earlier, is easy to state: since the commander may send different orders to different lieutenants, a lieutenant cannot simply take the commander's order at face value; he also has to look at the orders relayed by the other lieutenants. By combining the orders relayed by all the lieutenants with the one received directly from the commander, each lieutenant gets a more complete picture and can reach a consistent decision (in practice this is a recursive process, sketched below).
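The sketch below is a toy simulation in the spirit of Lamport's "oral messages" algorithm OM(m), written only to illustrate the relay-and-majority idea. The function names, the traitor model (a traitor simply tells even-numbered receivers "attack" and odd-numbered receivers "retreat"), and the tie-breaking rule are all assumptions of this sketch, not the paper's pseudocode.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"

def heard(value, sender, receiver, traitors):
    """What `receiver` hears when `sender` passes on `value`.
    A traitorous sender lies in a simple, receiver-dependent way."""
    if sender in traitors:
        return ATTACK if receiver % 2 == 0 else RETREAT
    return value

def om(m, commander, value, lieutenants, traitors):
    """Return the order each lieutenant finally accepts."""
    # Step 1: the commander sends its value to every lieutenant.
    received = {l: heard(value, commander, l, traitors) for l in lieutenants}
    if m == 0:
        return received
    # Step 2: each lieutenant acts as commander in OM(m - 1),
    # relaying what it heard to the remaining lieutenants.
    relayed = {l: {} for l in lieutenants}       # relayed[l][j]: what l got via j
    for j in lieutenants:
        others = [l for l in lieutenants if l != j]
        sub = om(m - 1, j, received[j], others, traitors)
        for l in others:
            relayed[l][j] = sub[l]
    # Step 3: each lieutenant takes a majority over everything it has heard.
    decision = {}
    for l in lieutenants:
        values = [received[l]] + [relayed[l][j] for j in lieutenants if j != l]
        counts = Counter(values)
        best = max(counts.values())
        decision[l] = min(v for v, c in counts.items() if c == best)
    return decision

# Four generals, at most one traitor: n = 3m + 1 with m = 1.
print(om(1, commander=0, value=ATTACK, lieutenants=[1, 2, 3], traitors={3}))
print(om(1, commander=0, value=ATTACK, lieutenants=[1, 2, 3], traitors={0}))
```

With four generals and a single traitor, the smallest case allowed by the bound discussed below (n >= 3m + 1), the loyal lieutenants end up obeying the same order whether the traitor is a lieutenant (first call) or the commander himself (second call).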
It has taken quite a bit of space, but we have now described the Byzantine Generals Problem itself, and that is actually the hardest part. As we said in the previous section, understanding the question matters more than understanding the answer. Once the problem is pinned down, designing an algorithm to solve it is a matter of detail. We will not go through the algorithms here; interested readers can consult the following papers:
- "The Byzantine Generals Problem", available at: lamport.azurewebsites.net/pubs/byz.pd…
- "Reaching Agreement in the Presence of Faults", available at: lamport.azurewebsites.net/pubs/reachi…
Here we only mention the conclusions of the algorithms presented in these papers.
Under different message models, the Byzantine Generals Problem has different solutions.
- If the generals use oral messages, that is, messages that can be tampered with when relayed, then to tolerate m traitors you need at least 3m+1 generals (at least 2m+1 of them loyal).
- If the generals use signed messages, that is, messages that cannot be forged and whose tampering is always detected, then to tolerate m traitors you only need at least m+2 generals, that is, at least two loyal generals (with only one loyal general, the question is obviously meaningless). In effect, this places no restriction on the proportion of loyal generals.
Fault tolerance
As we mentioned earlier, distributed consistency protocols such as Paxos assume a trusted environment, whereas the Byzantine Generals Problem allows malicious nodes in the network. So it is tempting to ask: is the problem Paxos solves simply the special case of the Byzantine Generals Problem with zero traitors?
Looking at it that way is problematic. In the Byzantine Generals Problem, every general who is not a traitor is loyal, and the word "loyal" implies that he functions properly, i.e., that you can exchange messages with him. Why? A traitor can do anything, including sending wrong messages or sending no messages at all, and "sending no messages at all" is equivalent to not working properly, to having suffered some kind of fault. So plain malfunctions, not just deliberate malice, must also be counted as traitors; to the other generals there is no difference.
Under this interpretation of "loyal", zero traitors means that every node in the network is working properly. But Paxos itself is designed to be fault-tolerant: as we discussed earlier, Paxos keeps working even when a minority of nodes fail (for example, by crashing). So Paxos cannot be regarded as a solution to the special case of the Byzantine Generals Problem with zero traitors.
So what is the relationship between the Byzantine Generals Problem and distributed consistency algorithms like Paxos? We can analyze it in terms of the degree of fault tolerance.
Generally speaking, any computer system, from a chip to a distributed network, has to be designed with some degree of fault tolerance. Faults can be divided into two broad categories by their nature:
- Byzantine faults: faults that may appear differently to different observers.
- Non-Byzantine faults: faults that do not fall into the first category.
The meaning of these two categories of faults is not as obvious as it might sound.
Let's start with Byzantine faults. In the Byzantine Generals Problem, the malicious behaviour of traitors falls into this category: different generals may receive completely different proposals from the same traitor. In a computer system, a failing node or component can also behave inconsistently without any malice, and that too counts as a Byzantine fault. For example, an unstable channel may randomly corrupt the messages a node sends to others. Or, in a database system, data that has supposedly been committed and synchronized to disk (via the operating system's fsync) may in fact never have reached the disk, or may even have been written in a corrupted form, because of a sudden power failure.
Now look at non-Byzantine faults. Lamport himself uses the term non-Byzantine in his Paxos paper (see "Paxos Made Simple"), though the naming is a little confusing. In a distributed system, when a node crashes or the network goes down, some nodes stop responding. The other nodes cannot really tell these cases apart: from their point of view, a node has simply become temporarily unreachable (messages to it time out), and it is impossible to know whether the node itself has failed, the network is cut, or messages are just severely delayed. Moreover, after a while the node may recover, either on its own or through manual intervention. In other words, from a node suffering such a fault we simply receive no messages, never wrong messages; whenever we do receive a message from it, the message itself can be trusted.
Byzantine faults are therefore the stronger kind of fault. In the Byzantine Generals Problem, a traitor who sends inconsistent proposals is committing a Byzantine fault, while one who sends no messages at all is committing a non-Byzantine fault. An algorithm that solves the Byzantine Generals Problem therefore handles both Byzantine and non-Byzantine faults. That may sound a little strange, but it is just a quirk of naming.
In short, a solution to the Byzantine Generals Problem is the strongest kind of distributed consistency algorithm, able in theory to handle any fault, whereas Paxos handles only non-Byzantine faults. The ability to tolerate Byzantine faults is commonly called Byzantine fault tolerance, or BFT.
In this sense, a BFT algorithm can solve the distributed consistency problem under any kind of fault, including every case that Paxos handles. So why not simply use BFT algorithms everywhere, and why bother designing algorithms like Paxos at all? We have not examined the Byzantine algorithms in detail, so we cannot analyze this carefully here, but it is easy to imagine that BFT's level of fault tolerance must come at a high cost, for example a great deal of message passing. Paxos does not have to offer that much fault tolerance, and can therefore run more efficiently. In addition, the algorithms in Lamport's paper make stronger timing assumptions about the system. That is also easy to understand: the higher the fault-tolerance requirements, the stronger the assumptions the algorithm tends to place on its operating environment. Since timing assumptions are a critical issue in distributed systems, it is worth spelling this out separately. In Lamport's paper, the algorithms assume the following about the system:
- The absence of a message can be detected.
This assumption requires that if a traitorous general sends no message (or the message is lost), the absence can be detected. Obviously this can only rely on some kind of timeout mechanism, which in turn requires the clocks of different nodes to be synchronized to within some bound, i.e., the clock offset must not exceed a maximum value. This is, in effect, a synchronous model. Paxos makes a weaker assumption here: it is based on the asynchronous model and places no particular requirements on system clocks. This issue also came up in my earlier article "Is Redis-based Distributed Locking Secure?", and it is sometimes a source of controversy around distributed algorithms.
Specifically, according to the Paxos paper, Paxos is designed for an asynchronous, non-Byzantine system model:
- Nodes may run at arbitrary speeds and may fail and restart. However, certain state that the algorithm records during execution must be recoverable after a restart.
- Messages may be duplicated or lost, but they are not corrupted.
The first requirement means that this state must be persisted to disk, and that no Byzantine fault of the kind mentioned earlier occurs while it is being written. In practice, however, such faults can occur, albeit very rarely, because of sudden power failures, disk caches, and other real-world issues, so some special engineering treatment is needed.
In the second requirement, a corrupted message would be a Byzantine fault, so Paxos requires that message corruption never occur. When messages are transmitted over TCP, this requirement can be considered satisfied.
In summary, the algorithms that solve the Byzantine Generals Problem provide the strongest fault tolerance, namely BFT, whereas Paxos tolerates only non-Byzantine faults. On the other hand, Paxos is based on the asynchronous model, a weaker system assumption than the synchronous model, which makes it the more robust algorithm as long as only non-Byzantine faults occur. It is also more efficient.
Blockchain
In reality, few systems genuinely need BFT-level fault tolerance, apart from systems with extremely high reliability requirements, such as the control systems of a Boeing aircraft or the Dragon spacecraft (see www.weusecoins.com/bitcoin-byz…).
A typical example that does need BFT, and one everyone is familiar with by now, is blockchain. A blockchain network is a completely open network: miner nodes are free to join and free to leave, and of course some of them may be malicious. Blockchain networks must therefore be designed with this in mind, and this is, in effect, the classic Byzantine Generals Problem.
To make the connection between blockchain and the Byzantine Generals Problem clearer, let's start with a very rough introduction to how blockchain works.
Take the Bitcoin network as an example. Its core operation is the bitcoin transaction, in which the owner of some bitcoin transfers a certain amount of it to someone else. First, a bitcoin owner initiates a transaction by signing it with his private key and broadcasting the transaction request to miner nodes. A miner packages all the transactions it receives into a block and runs a series of highly expensive computations to find a nonce value such that the nonce, together with the other information in the block, hashes to a result with the required property. This step is critical to the entire blockchain network and is known as Proof of Work. The miner then publishes the block to the network, and other miners verify it: they check the transaction signatures (using the owners' public keys) and check that the proof of work is valid. If the block passes verification, it is appended to the current longest blockchain.
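To illustrate what "finding a nonce" means, here is a toy proof-of-work search. It uses a simplified difficulty rule (a single SHA-256 whose value must fall below a threshold) rather than Bitcoin's actual double SHA-256 and compact target encoding, and the header fields are placeholders.

```python
import hashlib

def proof_of_work(block_header: bytes, difficulty_bits: int = 16) -> int:
    """Search for a nonce such that SHA-256(header + nonce) falls below a
    threshold set by `difficulty_bits` (i.e. starts with that many zero bits).
    A toy stand-in for Bitcoin's real rule."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(block_header + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

# Placeholder header fields; a real header contains the previous block hash,
# the Merkle root of the transactions, a timestamp, the encoded target, etc.
print("found nonce:", proof_of_work(b"prev-hash|merkle-root|timestamp"))
```

Raising `difficulty_bits` makes the search exponentially more expensive, which is what makes publishing blocks costly for an attacker.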
If two miners finish packaging a block and its proof of work at almost the same time, both blocks may be published and the blockchain forks. But miners keep creating new blocks and attaching them to the longest chain they see, so eventually whichever fork grows longer is the one recognized by the majority of miners. Viewed this way, the blockchain is not really a chain but a tree. In a tree there is exactly one path from the root to any given leaf, so the currently active blockchain is simply the longest root-to-leaf path in the tree. A block that lies on the longest chain is considered valid, and the transactions it contains are regarded as settled (recognized by most nodes).
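Below is a toy sketch of the fork-choice rule just described, with invented block identifiers. Note that real Bitcoin clients select the chain with the greatest accumulated proof-of-work rather than simply the greatest length, a refinement this sketch ignores.

```python
# Blocks form a tree via `parent` pointers; the active chain is the longest
# root-to-leaf path. Block ids and the shape of the tree are made up.
blocks = {                       # block id -> parent id (None = genesis)
    "g": None,
    "a1": "g", "a2": "a1", "a3": "a2",   # one fork, three blocks after genesis
    "b1": "g", "b2": "b1",               # a shorter competing fork
}

def chain_to(block_id):
    """Walk parent pointers back to the genesis block."""
    path = []
    while block_id is not None:
        path.append(block_id)
        block_id = blocks[block_id]
    return list(reversed(path))

leaves = set(blocks) - set(p for p in blocks.values() if p is not None)
active = max((chain_to(leaf) for leaf in leaves), key=len)
print("active chain:", active)   # -> ['g', 'a1', 'a2', 'a3']
```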
We have only given a very cursory overview of how blockchain works. For more details, check out the official Ethereum wiki at:
Github.com/ethereum/wi…
Now let's discuss how blockchain technology relates to the Byzantine Generals Problem.
Earlier, when we discussed the Byzantine Generals Problem, we reached the following conclusions:
- If oral messages are used, more than two-thirds of the generals must be loyal.
- If signed messages are used, there is no requirement on the proportion of loyal generals.
As described above, the messages used in a blockchain are signed messages: the transactions in each block are signed, which ensures they cannot be tampered with and can only have been issued by their original senders. So we are in the second case above. Does that mean a blockchain network places no requirement on the number of loyal nodes? Clearly not. In the Bitcoin network, for example, malicious nodes must control no more than 50 percent of the computing power. Why do the two statements seem inconsistent? Because the Byzantine Generals Problem as Lamport defines it is only a subproblem, concerned with how one general (the commander) sends an order to all the other generals (the lieutenants). Combining all the orders into a final plan still depends on the loyal generals: if there are too few of them, then whatever plan comes out will fail, because the traitors simply will not carry it out. The situation in the Bitcoin network is similar: the selection of the longest chain plays the role of the generals combining all the orders (by majority vote).
In the Byzantine Generals Problem, a traitor may send inconsistent orders to different generals, and if the algorithm is not well designed this can prevent agreement from ever being reached. In a blockchain network, similar behaviour would be very costly, because a miner must complete a proof of work for every block it publishes. Publishing inconsistent blocks means doing a proof of work for each of them, which consumes a great deal of computing power. And there is no incentive to do so anyway: it only creates more forks and does nothing to extend the longest chain.
How should we view proof of work in terms of the Byzantine Generals Problem? In effect, it raises the cost of being a traitor, which greatly reduces the likelihood that traitors come to control more than half of the network. By way of comparison, if a real Byzantine Generals Problem had existed in history, the cost for enemy spies to infiltrate the ranks of Byzantine generals would presumably have been high, so one could assume there would not be too many traitors among the generals. In a computer network, however, without something like proof of work the cost of being a traitorous miner is very low, which makes it entirely possible for the traitors to outnumber the loyal miners.
Of course, from an economic point of view, being a traitorous miner is also unwise once proof of work is required: if you command that much computing power, you are better off using it to mine honestly and earn rewards. But that is a matter beyond the technology itself.
Besides Proof of Work there is also Proof of Stake. Although some have pointed out drawbacks of this mechanism (the "nothing at stake" problem, for example), viewed through the lens of the Byzantine Generals Problem it likewise raises the cost of being a traitor. It is like a spy trying to infiltrate a board of directors: the cost is necessarily high, because he first has to own a large amount of stock.
What exactly is a blockchain? Some say it is a tamper-proof super ledger, others say it is a decentralized transaction system, and still others say it is the underlying tool for building digital currencies. From a technical point of view, however, it is first and foremost a distributed network that solves the Byzantine Generals Problem, achieving data consistency and security in a completely open environment. All its other attributes rest on top of this technical essence.
In today's post we started from the consistency problem in distributed systems, moved on to the Byzantine Generals Problem, and finally arrived at blockchain. The parts build on one another step by step; although this has been a "ramble", its logic runs through from beginning to end.
Finally, is blockchain a great innovation? It is. The Bitcoin system, in particular, is a successful innovation that combines distributed systems technology with the business of finance.
Will blockchain revolutionize the future? All we can say now is: maybe.
(The End)
Other selected articles:
- Is Redis-based Distributed Locking Secure?
- Why does Redis use skip lists instead of balanced trees?
- Why is the future augmented reality?
- Three bytes of adventure
- Five to one rule for technology
- Three levels of knowledge
- Authentic technology with wild way
- Technology: from zero to master