This is the first in a series of articles designed to detail the Tendermint Consensus agreement

  1. Basic knowledge of consensus protocol, discuss the security model of consensus protocol and PBFT protocol
  2. Detailed Tendermint consensus protocol, introduces the two-stage voting protocol and the locking and unlocking mechanism
  3. Weighted proposer rotation selection algorithm in the Tendermint project

The result of any consensual agreement is a GeneralAgreement by the Majority. The consensus protocols on which blockchain systems operate are no exception. As a distributed system, the most basic goal of blockchain system is to maintain the correctness of the system.

Intuitively, the correctness of a blockchain system involves two aspects: no ambiguity and the ability to handle requests to update its state. Among them, the former corresponds to the Safety requirements of distributed system, while the latter corresponds to the availability requirements of distributed system. The correctness of distributed system is mainly maintained by consensus protocol. Considering that distributed system involves multiple nodes and network communication, the instability of both nodes and network communication brings great challenges to the design of consensus protocol.

Semi-synchronous network model and Byzantine fault tolerance

In order to rule all possible problems, researchers of distributed systems describe various possible problems in communication between nodes and networks through node failure model and network model. In the node failure model, fail-stopfailure refers to the failure of a node to participate in the consensus protocol due to configuration errors.

Such failures have no adverse effects on other parts of the distributed system, except that the nodes themselves stop running. However, in the design of common chain distributed system, in addition to the Failure of nodes, we also need to consider the situation of active evil of nodes, which are included in the fault model of Byzantine Failure.

The Byzantine fault model includes all possible unexpected situations of nodes, including passive outage failures and voluntary deviations from the consensus protocol.

For the sake of narration, we use downtime to refer to the situation when a passively occurring node stops operating, and Byzantine failure to refer to the situation when an actively occurring node arbitrarily deviates from the consensus protocol.

Compared with the node failure model which can be roughly divided into passive and active models, network communication modeling is more difficult. The network itself has the problem of instability and communication delay, and because all network communication is finally completed by nodes. But the nodes themselves may suffer from downtime or Byzantine failures. Therefore, when a node does not receive network packets from another node, it is usually difficult to determine whether the fault is caused by the node or the network itself. Although network communication may be affected by multiple aspects, researchers found that network models can be standardized and classified from the perspective of communication delay. For example, when a node is down, packets cannot be sent from the node, so the corresponding communication delay is unknown and can be any time. Based on the concept of communication delay, network communication models can be divided into the following three categories:

  • Synchronous network model: network communication has fixed and known time upper bound. Under this model, the network communication delay between two nodes in the network is the maximum, even if the malicious node exists, the communication delay caused by the malicious node does not exceed.

  • Asynchronous network model: The network communication has unknown delay, and the upper bound of delay is not known, but the message can still be successfully delivered. In this model, the network communication delay between two nodes in the network can be any possible value, that is, if there is a malicious node, the malicious node can arbitrarily extend the communication delay.

  • Semi-synchronous network model: it is assumed that there is GlobalStabilizationTime(GST), which is asynchronous network model before GST and synchronous network model after GST. That is, there is a fixed and known upper bound of time for network communication.

Malicious nodes in the network can arbitrarily extend the GST backward, and there will be no notification when no GST occurs. In this model, at the point in timeIs the delay of the message being delivered.

Synchronous network model is the ideal network environment, each message sent through the network can be received in a predictable time, but this model can not reflect the real network communication, so in the real network, there are always network faults, leading to the failure of the hypothesis of synchronous network model. The asynchronous network model goes to the other extreme and cannot reflect the real network situation, and the FLP(Fischer-Lynch-Paterson) theorem points out that under this model, as long as one node breaks down, there is no consensus protocol that can reach consensus in a limited time. In contrast, the semi-synchronous network model can better describe the network communication in the real world: network communication is usually synchronous, but it may fail for a short time and then recover.

This is believed that everyone has had this experience in their own Internet experience, a period of time the web page opened slowly, but usually the speed of the web page is very fast, and after the network is restored to normal, there is usually no notice, only after you try to know that the network really restored to normal. Peer-to-peer (P2P) network communication, which is often used in blockchain projects, also enables a node to send and receive information from multiple network channels. It is not practical to block the network information transmission of a node for a long time. Therefore, the rest of this article will default to a semi-synchronous network model.

Public networks that allow nodes to join and leave dynamically need to take Byzantine failures into account when designing and selecting consensus protocols. Therefore, the common protocol design goal of the public chain network is to ensure the security and availability of the network under the premise of tolerating Byzantine failures under the semi-synchronous network model. Researchers of distributed systems point out that in order to ensure the security and availability of the system, the consensus protocol itself needs to meet the following three points:

  • Validity: The value finally agreed by the honesty node must be the value proposed by the honesty node
  • Agreement: All honesty nodes must agree on the same value
  • Termination: Honest nodes must eventually agree on a value

You can see that correctness and consistency guarantee the security of distributed systems, i.e. honest nodes never agree on a random value, and once consensus is reached all honest nodes agree on the value. Finality guarantees the availability of distributed systems, and a distributed system that can never be agreed upon is useless.

The Byzantine general problem and CAP theorem

Is it really possible to design consensus protocols that are Byzantine fault-tolerant and satisfy correctness, consistency, and finalizability under semi-synchronous networks? How many Byzantine nodes can the protocol tolerate in the system? CAP theorem and Byzantine general problem provide answers to these two questions and become the basic principles guiding the design of Byzantine fault-tolerant consensus protocol.

Lamport, Shostak and Pease abstracted the consensus mechanism design problem in distributed systems as the Byzantine general problem in 1982. The problem can be expressed as a number of generals leading their armies in war and stationed in different places. In order to win the war, the generals had to develop a unified plan of action. However, since the generals were stationed far away from each other, they could only communicate with each other through the signal corps, meaning that they could not be present at the same time to reach a consensus in person. Unfortunately, there were traitors among the generals, and the renegade generals hoped to disrupt the unity of the loyal generals by sending false information, and the signalmen themselves could not deliver the message to their destination.

It is assumed that each signalman can prove that he is bringing a message from a general. In the real BFT consensus protocol, each node has its own public and private key to establish an encrypted communication channel with each other to ensure that its message will not be tampered with in network communication, and the message receiver can also verify the message sender based on this. As already mentioned, any consensus agreement is ultimately a consensus of the majority, and in the process of generals communicating with each other to decide whether to attack or retreat, a general also makes his own decision based on the majority of the information he has gathered.

Lamport et al. ‘s study of the problem showed that the generals could not reach a unified decision with one-third or more of the defectors in the node. For example, suppose there are three generals and only one traitor. In the figure on the left, assume that General C is the judge, while A and B are loyal. If A wants to launch an attack and sends A message to B and C that it wants to supply only, and the traitor C sends A message to B that the message sent by A to him is retreat, then B cannot make his own decision after receiving the message from A and B.

B doesn’t know who the traitor is, and he can’t make a decision based on the information he’s received. However, if A is A traitor, he can send different messages to B and C. At this point, C truthfully reports the information he has received to B, and B also sees the contradictory information and cannot make any decision. In both cases,B received the same information and could not tell which of A and C was the traitor. So it can be seen that honest General B has no choice between the two scenarios shown below.

According to this conclusion, by extension, we know that when there areTwo generals and at most one of themA traitor, ifWhen the generals could not reach a consensus, can reach a consensus. According to this conclusion, we can know the number of nodes when Byzantium failsThe value exceeds the total number of nodes in the systemA third of the, there is no consensus agreement that can be reached among all honest nodes. only, consensus can be reached. Therefore, without losing generality, the subsequent discussion of consensus agreement is default.

Lamport et al. ‘s conclusion on the Byzantine general problem draws the line between the possible and the impossible in terms of tolerance of Byzantine failures for the design of Byzantine fault-tolerant protocols. In the realm of possibilities, what is the ultimate effect of consensus protocol design? Can it guarantee the security and usability of distributed systems completely? The CAP theorem proposed by Brewer in 2000 provides the answer. The CAP theorem states that a distributed system requires the following three basic properties, but any distributed system can only satisfy at most two of the three properties simultaneously.

  1. Consistency: Any node either provides the latest status information or does not provide any status information upon corresponding request
  2. Availability: All nodes in the system must be able to continue read and write operations
  3. Partition Tolerance: The system can tolerate the loss of as many messages as possible between two nodes and still operate properly

The ultimate goal of distributed system is to provide similar consistent services externally, so the consistency attribute requires that two nodes in the system cannot provide contradictory state information, nor can they provide expired information, which can ensure the security goal of distributed system. The availability attribute ensures that the system can continuously update its own state and ensure the availability goal of the distributed system. The tolerance attribute of network partition is related to network communication delay. In the semi-synchronous network model, it can be considered as the state before GST time. In this case, the network is in asynchronous state, the network communication delay has unknown delay, and the communicating nodes may not receive information from each other, so the network is considered to be in the partitioned state. The tolerance attribute of network partition requires that the distributed system can still function normally in the face of network partition.

The proof of CAP theorem can be achieved by the following diagram, where the curve in the figure represents the network partition, and each network has four nodes, which are distinguished by the numbers 1, 2, 3 and 4. The distributed system stores color information, and the state information stored by all nodes is blue at the beginning.

  1. Network partition tolerance and availability mean loss of consistency: in the leftmost figure, node 1 changes state in red when it receives a new request, node 1’s state transition information is passed to node 3, and node 3 updates state information in red. However, the status information of nodes 3 and 4 is still blue because the network partition did not receive the corresponding information. If node 2 is used to query the status information, the blue value returned by node 2 is not the latest system status, resulting in loss of consistency.
  2. Network partition tolerance and consistency mean loss of availability: in the middle figure, the initial state information of all nodes is blue. When the updated status information of nodes 1 and 3 is red, nodes 2 and 4 remain outdated status information in blue due to network partitions. Similarly, when querying status information through node 2, node 2 needs to follow the consistency, so it needs to inquire other nodes to confirm that its status is the latest before returning status information. However, due to the existence of network partition, node 2 cannot receive any information from node 1 and node 3. At this point, node 2 cannot be sure that its state is the latest, so it chooses not to return any information, and the system loses its availability.
  3. Consistency and availability mean a loss of network partition tolerance: in the figure on the far right, there is no initial network partition, and status updates and queries can proceed smoothly. However, once the network partition occurs, it degenerates to one of the first two cases. Therefore, it is proved that any distributed system cannot guarantee the three attributes at the same time.

The discovery of the CAP theorem seems to declare that the objective of the aforementioned consensus agreement is unattainable. However, the careful reader of the above proof will notice that the extreme cases mentioned in the proof, such as the situation where the information cannot be transmitted completely due to network partition, are rarely encountered in the real world, especially when using P2P network communication. In the second case, the real system rarely does not return any information like node 2. The usual practice is to query other nodes and wait for an appropriate time, and return the latest status they think, no matter whether they really get the request information from other nodes. Therefore, although the CAP theorem points out that no distributed system can satisfy the three attributes at the same time, it is not a binary choice of black and white. As the designer of the consensus protocol, he can make trade-offs among the three attributes according to the requirements of the distributed system. However, since communication latency is always involved in distributed systems, the reality of any distributed system is to make choices in terms of availability and consistency while ensuring some degree of network partition tolerance. Specifically, in the second case, whether node 2 returns some value that may be outdated or no value at all. Returning a value that may be out of date may violate the consistency property, but guaranteed availability, while not returning any value forfeits the availability of the system, but guaranteed consistency. The Subsequent Tendermint Consensus protocol can be considered a trade-off between consistency and, in some cases, loss of usability.

The genius of Satoshi Nakamoto is to achieve a reliable Byzantine consensus in a large-scale distributed network by combining PoW mechanism, Satoshi consensus protocol, economic incentive measures and appropriate parameter configuration in an unreliable network environment under the constraint of CAP theorem. Whether Bitcoin’s mechanism design really solved the Byzantine general problem has been controversial in academic circles. In The paper The Bitcoin Backbone Protocol, Garay, Kiayias, and Leonardos In Analysis and Applications, the connection between mechanism design in Bitcoin and Byzantine consensus is analyzed in detail. To put it simply, Nakamoto Satoshi consensus is a probabilistic Byzantine fault-tolerant consensus protocol, which depends on the network communication environment, the proportion of computing power of malicious nodes and other conditions. When the network communication environment is good and the computing power proportion of malicious nodes is less than 1/2, Satoshi consensus can reliably solve the Byzantine consensus problem in distributed environment. However, when the network communication environment deteriorates, even if the proportion of computing power of malicious nodes is less than 1/2, nakamoto Satoshi consensus cannot reach a reliable conclusion on the Byzantine consensus. It is worth noting that the quality of the network environment is relative to bitcoiin’s block out interval, and the 10-minute block out interval chosen by Bitcoiin can ensure that the network communication environment of the system is good in most cases, because according to the actual operation experience of Bitcoiin network, The broadcast time of a block in a distributed network is usually less than 1 second. On the other hand, the setting of economic incentives can motivate most nodes to actively comply with the agreed behaviors. Therefore, it can be considered that under the current Bitcoin network parameter configuration and mechanism design, the mechanism design in Bitcoin has reliably solved the Byzantine consensus problem under the specific scenario of the current world network environment.

Introduction to PBFT protocol

It is not easy to design consensus protocols for Byzantine Fault Tolerance in semi-synchronous networks. The first Practical Consensus protocol for Byzantine Fault Tolerance was the Practical Byzantine Fault Tolerance protocol designed by Castro and Liskov in 1999. PBFT). PBFT is the first polynomial-complexity Byzantine fault-tolerant consensus protocol for inclusionThe communication complexity of a distributed system with 5 nodes isCastro and Liskov report in the paper that the overall performance loss is only 3% after the transformation from a centralized file system to a distributed file system using PBFT protocol. This section gives a brief introduction to the PBFT protocol, paving the way for a detailed understanding of the Tendermint protocol and the improvement of the Tendermint protocol.

containsThe PBFT protocol of the node can tolerate the mostTwo Byzantine nodes, as required in the original PBFT paperThere are full connections between nodes, i.eAny two of the nodes establish network connections with each other. All nodes work together to maintain system status through network communication. In Bitcoin network, any node is allowed to participate in or exit the consensus process through computing mining at any time, while PFBT protocol needs to determine all participating nodes before the start of the protocol, and the joining and exiting of nodes are managed by the administrator. All nodes in the PBFT protocol are classified into two types, master and slave. There is only one master node at any time, and all nodes take turns to be the master node. All nodes run in a rotation process called a View, and the primary node is re-selected in each View. The primary node selection algorithm in PBFT is very simple. All nodes become the primary node in turn according to the number. In each view, all nodes attempt to reach a consensus on the state of the system. Is worth to mention is that PBFT agreement in each node has its own digital signature key pair, all messages sent (including the client request message) need to sign the operation, to ensure the integrity of information in the network transmission and traceability of the message itself (can according to the signature value judgment, a message is sent by who).

The following figure shows the basic process of PBFT consensus protocol. Assume that the current primary node is node 0. Client C sends a request to primary node 0. After receiving the request, the primary node broadcasts the request to all secondary nodes, all secondary nodes process the request and return the result to the client. The client receives it from a different node (based on the signature value)The result can be taken as the final result of the operation. Because at most there areA Byzantine node, which means the client receivedAt least one of the results comes from the honest node, and the security of the consensus protocol ensures that all honest nodes agree on the same state, so feedback from one honest node is sufficient to confirm that the corresponding request has been processed by the system.

In order to ensure the state synchronization of all honest nodes, PBFT protocol has two constraints on each node. One is that all nodes must start from the same state, and the other is that the state transition of all nodes must be deterministic, that is, given the same state and request, the result of operation execution must be the same. Under the constraints of these two conditions, as long as the whole system agrees on the processing order of all transactions, the state of all honest nodes can be guaranteed to be consistent. This is the main purpose of PBFT protocol: to reach a consensus among all nodes on the order of transactions, thus ensuring the security of the entire distributed system. In terms of availability, PBFT consensus protocol relies on the timeout mechanism to detect anomalies in the process of consensus and timely start the View Change protocol to try to reach consensus again.

The figure above shows a simplified PBFT protocol workflow. Where C is the client, 0, 1, 2, and 3 respectively represent four nodes. 0 is the primary node of the current view, 1, 2, and 3 are secondary nodes, and node 3 is a faulty node. In normal cases, PBFT consensus protocol uses a three-stage protocol to reach a consensus on the transaction order between nodes. The three stages are as follows: Pre-prepare, Prepare, and Commit.

  • Preparation Node The primary node allocates serial numbers to received client requests and broadcasts them to secondary nodes<PRE-PREPARE, v, n, d, sig>Message containing the hash value requested by the clientd, the current view sequence numberv, the serial number assigned to the request by the master noden, and the signature information of the primary nodesigPBFT protocol scheme design separates request transmission and request sorting process, request transmission is not discussed here. The slave node that receives the message checks that the message is valid and then accepts the message and enters the preparation phase. The message check in this step checks for basic signatures, hash values, and the current view. The most important check is to determine whether the master node has assigned the same serial number to different client request messages in the current view.
  • In the preparation phase, messages are broadcast from the node to all nodes, including itself<PREPARE, v, n, d, sig>Indicates that you agree with the current viewvThe hash value is zerodThe client requests to assign an ordinal numbern“And has his own signaturesigThe node receiving the message checks whether the signature is correct and the view serial number matches, and accepts the valid message. When a node receives a request about a clientPRE-PREPAREMessages (from the master node) and from2fTwo sub nodesPREPAREWhen all match, it means that in the current view the system has agreed on the serial number of the client request. Because that means there are in the current view2f+1Five nodes agree to the assignment of the request sequence number, since it contains at mostfA message from a malicious node, which means at leastf+1The assignment of the request sequence number is approved by the honesty node. When there isfWhen a malicious node, the city node has2f+1A,f+1It’s the majority of the honest nodes, which is the consensus of the majority.
  • When a node (both master and slave) receives a client requestPRE-PREPAREThe message and2faPREPAREAfter the news, the whole network broadcast the news<COMMIT, v, n, d, sig>And enter the submission phase. This message is used to indicate that the node has observed that the network has reached a consensus on the serial number assignment of the client request message. When the node receives2f+1aCOMMITAfter the message, it means at leastf+1In other words, most of the honesty nodes have observed that the whole network has reached a consensus on the assignment of the serial number of the client request message. The node can then process the client request and return the execution result to the client.

Roughly speaking, the preparation phase is where the master node assigns the number of all new client requests, the preparation phase is where all nodes agree on the number of client requests within the view, and the commit phase is used to ensure consistency of the number of client requests across views. In addition, the design of PBFT protocol itself does not rely on the request message to be submitted in accordance with the assigned sequence number, but allows the request message to be submitted out of order, which can improve the execution efficiency of the consensus protocol. However, when the final execution, the message is still executed in accordance with the sequence number assigned by the consensus protocol to ensure the consistency of the distributed system.

During the PBFT three-stage protocol execution, the node itself not only maintains the status information of the distributed system, but also logs the received consensus information. Adding a log that changes CPU resources using Checkpoints This parameter is a powerful feature of the PBFT protocol. Adding a log that changes CPU resources using Checkpoints A checkpoint can be set every 100 or 1000 serial numbers according to the request number. After the client request at the checkpoint is executed, the node broadcasts the message

across the network, indicating that the system status hash value of the node is D after the client request with serial number N is executed. And by his signature sig for the guarantee. After receiving 2F +1 matching CHECKPOINT messages (one of which can come from itself), it means that the majority of honest nodes in the network have reached a consensus on the system status after executing the client request with serial number N. Then, the logs related to all client requests with serial number less than n can be cleared. A node needs to save the 2F +1 CHECKPOINT message as a proof that the status is valid. The corresponding CHECKPOINT is called a Stable CHECKPOINT.
,>

PBFT three-stage protocol can ensure the consistency of client request processing sequence, and the setting of checkpoint mechanism not only helps nodes to recycle garbage, but also further ensures the consistency of distributed system state, which can ensure the security of the distributed system mentioned above. How is availability another goal of distributed systems guaranteed? In the semi-synchronous network model, timeout mechanism is usually introduced. The timeout time window is set to delay the network environmentIn the semi-synchronous network model, network delay is assumed after GSTIn a specific system implementation, an initial value is usually set according to the network situation of system deployment. When a timeout event occurs, in addition to triggering the corresponding processing process, additional mechanisms will also be used to readjust the waiting time. For example, the wait time after a timeout event can be adjusted by an algorithm like TCP’s exponential backout.

In PBFT protocol, timeout mechanism is also introduced to ensure system availability. The PBFT protocol is also required to ensure system security and availability in the event of Byzantine failures on the master node itself. When a Byzantine fault occurs on the master node, for example, the slave node does not receive the pre-prepare message from the master node within the set waiting time or the pre-prepare message sent by the master node is judged to be illegal, the slave node can broadcast < view-change, v+1, n, C, P, Sig > indicates that the node requests to switch to a new view numbered V +1, n indicates the request number corresponding to the latest CHECKPOINT on the node, and C indicates the 2F +1 valid CHECKPOINT message used to prove the CHECKPOINT. After the latest stability checkpoint and before the VIEWCHANGE message is sent, the system may have agreed on the number of some request messages in the previous view. To ensure consistency of the number of request messages in the view switch, the VIEWCHANGE message needs to pass this information to the new view. This is also the meaning of the P field in the message, which contains all the client request messages whose request number is greater than N collected from the node and the proof that the request number has reached consensus on the node: valid pre-prepare messages and 2F matching PREPARE messages about the request. When the master node in VIEW V +1 collects 2F +1 VIEWCHANGE messages, it can broadcast a new-view message to bring the entire system into the NEW VIEW. In order to combine the three-stage protocol of PBFT to ensure the security of the system, the information construction rules of New-View are very complicated, which will not be described in detail here. Interested readers can refer to the original paper of PBFT.

You can seeVIEWCHANGEContains a great deal of information, such asCContained in the2f+1Signature information,PContains several signature collections, each of which has2f+1And because at least there is2f+1Three nodes need to be sentVIEWCHANGEThe message forces the system into the next new view, where you can see the constructionVIEWCHANGEandNEW-VIEWIn addition to the complex logic of information, the communication complexity of view transformation protocol is, this complexity also limits the PBFT protocol to support only a small number of nodes, and when there are 100 nodes, PBFT is often too complex to actually deploy. It is worth noting that some information will be PBFT protocolThe complexity of communication boils down toA full connection between two nodes is not appropriate. By changing the fully connected network topology to the distributed hash table-based P2P network topology commonly used in blockchain projects today, the high communication complexity introduced by fully connected networks can be significantly improved. However, it is difficult to improve communication complexity during view transformation. In recent years, some researchers have proposed to reduce the traffic in this step by using aggregation signature technology. By using aggregation signature technology, we can transform2f+1The signature information is compressed into one signature information to reduce the traffic during view transformation.

* this article was written by longcpp, a member of the CoinEx Chain development team. CoinEx Chain is the world’s first DEX dedicated public Chain developed based on Tendermint Consensus protocol and Cosmos SDK. With the help of IBC, the paper realizes the integration of DEX public chain, smart contract chain and privacy chain to solve the impossible triangle of Scalability, Decentralization and security blockchain. High performance support for digital asset trading and Defi applications based on smart contracts.