Preface
Distributed consistency is an important and widely studied problem in computer science. Let's start with three business scenarios.
1. Train station tickets
If our end user is a traveler who often takes the train, he usually goes to the ticket office at the station to buy a ticket, takes the ticket through the ticket barrier, boards the train, and starts a pleasant journey: everything seems perfectly harmonious.
Now imagine that his destination is Hangzhou and there is only one ticket left on a train to Hangzhou. Another passenger at a different ticket window may try to buy that same ticket at the same moment. If the ticketing system provides no consistency safeguard, both purchases may succeed, and at the gate one of the passengers will be told that his ticket is invalid…
Of course, the modern Chinese railway ticketing system rarely has such problems. But from this example we can see that the end user's requirement on the system is very simple:
“Please sell me the ticket; if there are no tickets left, tell me so at the moment I try to buy one.”
This places a strict consistency requirement on the ticketing system: the system data (here, the number of remaining tickets on that train to Hangzhou) must be accurate at every moment, no matter which ticket window it is viewed from!
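To make the requirement concrete, here is a minimal sketch in Java, assuming a single shared counter of remaining tickets (the TicketCounter class below is purely illustrative): an atomic compare-and-set guarantees that when only one ticket is left, at most one of two concurrent buyers succeeds, and the other is told at purchase time that no tickets remain. A real ticketing system needs the same guarantee across many nodes, which is exactly the distributed consistency problem discussed below.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: a single shared counter of remaining tickets.
class TicketCounter {
    private final AtomicInteger remaining;

    TicketCounter(int remaining) {
        this.remaining = new AtomicInteger(remaining);
    }

    /** Tries to sell one ticket; returns false if none are left. */
    boolean trySell() {
        while (true) {
            int current = remaining.get();
            if (current <= 0) {
                return false;                // refuse at purchase time, not at the gate
            }
            if (remaining.compareAndSet(current, current - 1)) {
                return true;                 // exactly one buyer wins the last ticket
            }
            // another window sold a ticket concurrently; retry with the fresh value
        }
    }
}
```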
2. Bank transfer
Suppose our end user is a college student who has just graduated. When he receives his first month's salary, he will usually remit some money home. He comes to the bank counter and completes the transfer, and the clerk at the counter kindly reminds him: “Your transfer will arrive in N working days!”
At this point the graduate, a little frustrated, replies: “OK, it doesn't matter how long it takes, as long as no money goes missing!” …
This has become the most basic requirement that almost all users place on a modern banking system.
3. Online shopping
Suppose our end user is an experienced online shopper. When he sees a product he likes with a stock of 5, he quickly confirms the purchase, fills in the delivery address, and places the order…
However, at the moment the order is placed, the system may tell the user: “Insufficient stock!”. Most consumers will then blame themselves for being too slow and letting someone else snatch the item they wanted.
In fact, engineers with experience in building online shopping systems know that the stock shown on the product details page is usually not the real inventory of the product; the system checks the real inventory only when the order is actually placed. But who really cares?
Interpretation of the problem
From the three examples above, you can see that end users have different requirements for data consistency when using different computer products:
1. Some systems must not only respond to users quickly but also guarantee that the data is true and reliable for every client, like the train station ticketing system.
2. Some systems must guarantee absolutely reliable data for users; even though data consistency is reached with a delay, strict consistency must be guaranteed in the end, like a bank's transfer system.
3. Some systems may show the user data that is, strictly speaking, “wrong”, but at some point in the overall workflow the data is checked against the accurate value, so the user suffers no unnecessary loss, like the online shopping system.
The distributed consistency problem
An important problem to solve in distributed systems is data replication.
In day-to-day development, many developers have run into this problem: suppose client C1 updates a value K in the system from V1 to V2, but client C2 cannot read the latest value of K immediately; only after some time does the new value become readable.
This is normal, because there is a delay in replication between databases.
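As a minimal sketch of why this happens (the class below is hypothetical and not any real database API), suppose writes go to a primary copy and are propagated to a replica with a delay; a read served by the replica during that window still returns the old value V1.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of asynchronous replication: the replica lags behind the primary.
class LaggingReplicaStore {
    private final Map<String, String> primary = new ConcurrentHashMap<>();
    private final Map<String, String> replica = new ConcurrentHashMap<>();
    private final ScheduledExecutorService replicator = Executors.newSingleThreadScheduledExecutor();

    /** Client C1 writes to the primary; replication happens later. */
    void write(String key, String value) {
        primary.put(key, value);
        // simulate replication delay: the replica only sees the new value 100 ms later
        replicator.schedule(() -> replica.put(key, value), 100, TimeUnit.MILLISECONDS);
    }

    /** Client C2 reads from the replica and may see a stale value during the lag window. */
    String readFromReplica(String key) {
        return replica.get(key);
    }
}
```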
The need for data replication in distributed systems generally comes from the following two reasons:
1. To increase the availability of the system and prevent the whole system from becoming unavailable because of a single point of failure.
2. To improve the overall performance of the system: through load balancing, data copies located in different places can all serve user requests.
While the benefits that data replication brings to a distributed system in terms of availability and performance are self-evident, the consistency challenges it introduces are something every system developer has to face.
The so-called distributed consistency problem refers to the data inconsistency that may arise between different data nodes once a replication mechanism is introduced in a distributed environment, and that applications cannot resolve on their own. Put simply, data consistency means that when one copy of the data is updated, the other copies must be updated as well; otherwise the data in the different copies will disagree.
So how do we solve this problem? One line of thinking is: “since the problem is caused by replication delay, I can simply block the write operation until the data has been copied to the other replicas.”
Yes, this does seem to solve the problem, and some system architects do adopt this idea directly. But while it solves the consistency problem, it introduces a new one: write performance.
If your application scenario has a large number of write requests, later write requests will be blocked by the write operations of earlier requests, and overall system performance will drop sharply.
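For illustration, here is a minimal sketch of the “block the write until the data is copied” idea, with hypothetical in-memory maps standing in for real replica nodes: a write returns only after every replica has applied it, and concurrent writes are serialized, which is exactly where the write-performance cost comes from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of synchronous replication: a write returns only after all replicas apply it.
class SyncReplicatedStore {
    private final List<Map<String, String>> replicas = new ArrayList<>();

    SyncReplicatedStore(int replicaCount) {
        for (int i = 0; i < replicaCount; i++) {
            replicas.add(new ConcurrentHashMap<>());
        }
    }

    /** Blocks the caller until every replica has accepted the update; concurrent writes are serialized. */
    synchronized void write(String key, String value) {
        for (Map<String, String> replica : replicas) {
            replica.put(key, value);   // in a real system each of these is a network round trip
        }
    }

    /** Any replica can serve reads and always returns the latest acknowledged value. */
    String read(int replicaIndex, String key) {
        return replicas.get(replicaIndex).get(key);
    }
}
```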
In general, we cannot find a consistency solution that satisfies every property we want from a distributed system. Therefore, how to guarantee data consistency without hurting system performance is a key consideration and trade-off for every distributed system. Thus, consistency levels were born:
1. Strong consistency
This level of consistency is the most intuitive for users: it requires that whatever is written can be read back immediately. The user experience is the best, but implementing it often has a large impact on system performance.
2. Weak consistency
This level does not promise that a written value can be read back immediately, nor does it promise how long it will take for the data to become consistent; it only tries to ensure that consistency is reached within some time level (for example, within seconds).
3. Eventual consistency
Eventual consistency is a special case of weak consistency: the system guarantees that the data will reach a consistent state within a certain period of time. The reason it is singled out here is that it is the most widely adopted of the weak consistency models, and it is also the consistency model most favored in industry for data in large distributed systems; a small sketch of exposing this choice to callers follows.
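As a small illustration (the names are hypothetical and not taken from any particular database), a store can expose the consistency level as a per-read option: strong reads are served by the primary, while eventual reads may be served by a stale replica.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the consistency level chosen by the caller decides how a read is served.
enum ConsistencyLevel {
    STRONG,   // a read must reflect every acknowledged write
    EVENTUAL  // a read may be stale, but replicas converge over time
}

class TunableStore {
    private final Map<String, String> primary = new ConcurrentHashMap<>();
    private final Map<String, String> replica = new ConcurrentHashMap<>();

    void write(String key, String value) {
        primary.put(key, value);
        // the replica is updated asynchronously by a background process (omitted here)
    }

    String read(String key, ConsistencyLevel level) {
        // STRONG reads go to the primary, paying the latency and availability cost;
        // EVENTUAL reads go to a possibly stale replica for better performance.
        return level == ConsistencyLevel.STRONG ? primary.get(key) : replica.get(key);
    }
}
```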
Problems with distributed environments
Distributed system architecture has been accompanied by many difficulties and challenges since its inception:
1. Abnormal communication
The evolution from centralized to distributed architectures inevitably introduces the network as a factor, and with it the extra problems caused by the unreliability of the network itself.
A distributed system needs network communication between its nodes, and every network communication carries the risk of the network being unavailable. If hardware or systems such as fiber links, routers, or DNS fail, the distributed system cannot complete that network communication successfully.
In addition, even when network communication between the nodes of a distributed system works normally, its latency is far greater than that of an operation on a single machine.
It is generally accepted that in modern computer architecture the latency of a single memory access is on the order of nanoseconds (typically about 10 ns), while the latency of a normal network round trip is about 0.1 to 1 ms, roughly 10^5 times the memory access latency. Such a huge difference in latency also affects the process of sending and receiving messages, so message loss and message delay become very common.
2. Network partition
When the network becomes abnormal, the network latency between some of the nodes of a distributed system keeps growing, until eventually only some of the nodes can communicate with each other normally while the others cannot. We call this phenomenon a network partition.
When a network partition appears, local small clusters emerge in the distributed system. In extreme cases, these local small clusters independently carry out functions that originally required the whole distributed system, including data processing, which poses a very big challenge to distributed consistency.
3. Three states
From the two points above, we can see that in a distributed environment the network can fail in various ways, so every request and response in a distributed system has three possible states: success, failure, and timeout.
On a traditional stand-alone system, an application calling a function gets a very clear response: success or failure. In a distributed system, because the network is unreliable, a communication usually still receives a success or failure response, but when the network is abnormal a timeout can occur, typically in one of the following two situations:
(1) The request is never successfully delivered to the receiver because the message is lost on the way.
(2) The request is successfully received and processed by the receiver, but the response message is lost on its way back to the sender.
When such a timeout occurs, the initiator of the communication cannot determine whether the request was processed successfully.
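A minimal sketch of the three states (the names are hypothetical): a remote call bounded by a deadline ends in success, failure, or timeout, and on timeout the caller has no way to know whether the receiver actually processed the request.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: the three possible outcomes of a remote request.
enum CallState { SUCCESS, FAILURE, TIMEOUT }

class RemoteCaller {
    CallState call(CompletableFuture<String> request, long timeoutMillis) {
        try {
            request.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return CallState.SUCCESS;      // a clear positive response arrived
        } catch (TimeoutException e) {
            // The request or its response may have been lost; the caller cannot
            // know whether the receiver actually processed the request.
            return CallState.TIMEOUT;
        } catch (ExecutionException e) {
            return CallState.FAILURE;      // a clear negative response arrived
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return CallState.FAILURE;
        }
    }
}
```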
4. Node failures
Node failure is another common problem in distributed environments. It refers to a server node of the distributed system crashing or becoming a “zombie”. As a rule of thumb, every node can fail, and it happens every day.
Distributed transactions
With the development of distributed computing, transactions have come to be widely used in the field of distributed computing.
In a stand-alone database it is easy to build a transaction processing system that satisfies the ACID properties, but in a distributed database the data is scattered across different machines, and implementing distributed transaction processing over that data is very challenging.
A distributed transaction means that the participants of the transaction, the server supporting the transaction, the resource servers, and the transaction manager are located on different nodes of the distributed system. A distributed transaction usually involves operations on multiple data sources or business systems.
Consider one of the most typical distributed scenarios: a cross-bank transfer. The operation involves two different banking services: the withdrawal service offered by the local bank and the deposit service offered by the target bank. The two services are independent of each other, and together they make up a complete distributed transaction.
If the withdrawal at the local bank succeeds but the deposit service fails for some reason, the system must roll back to the state before the withdrawal; otherwise the user will find that his money has gone missing.
As this example shows, a distributed transaction can be regarded as being composed of several distributed operations, such as the withdrawal service and the deposit service above. This series of distributed operations is usually called the sub-transactions of the distributed transaction.
Therefore, a distributed transaction can also be defined as a kind of nested transaction, and it accordingly has the ACID transaction properties. However, in a distributed transaction the execution of each sub-transaction is itself distributed, so implementing a distributed transaction processing system that can guarantee the ACID properties is extremely complicated.
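As a rough illustration of the transfer example, here is a minimal sketch assuming hypothetical WithdrawService and DepositService interfaces: if the deposit sub-transaction fails, a compensating action undoes the withdrawal. Real systems rely on protocols such as two-phase commit or TCC; this sketch only shows the compensation idea.

```java
// Hypothetical service interfaces for the cross-bank transfer example.
interface WithdrawService {
    void withdraw(String account, long amount);         // sub-transaction 1
    void cancelWithdraw(String account, long amount);   // compensating action
}

interface DepositService {
    void deposit(String account, long amount);           // sub-transaction 2
}

class TransferCoordinator {
    private final WithdrawService withdrawService;
    private final DepositService depositService;

    TransferCoordinator(WithdrawService w, DepositService d) {
        this.withdrawService = w;
        this.depositService = d;
    }

    /** Transfers money across banks; rolls the withdrawal back if the deposit fails. */
    void transfer(String from, String to, long amount) {
        withdrawService.withdraw(from, amount);
        try {
            depositService.deposit(to, amount);
        } catch (RuntimeException e) {
            // The deposit sub-transaction failed: compensate so the user's money is not lost.
            withdrawService.cancelWithdraw(from, amount);
            throw e;
        }
    }
}
```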
The CAP theorem
CAP is a classic theorem about distributed systems. It tells us that a distributed system cannot simultaneously satisfy the three basic requirements of C (Consistency), A (Availability), and P (Partition tolerance); at most two of them can be satisfied at the same time.
1. Consistency
In a distributed environment, consistency means that the data is consistent across its multiple replicas. Under the consistency requirement, when a system performs an update on data that is in a consistent state, it must ensure that the data remains consistent afterwards.
For a system whose data replicas are distributed across different nodes, if the data on the first node is updated successfully but the corresponding data on the second node is not, then a read on the second node returns the old data (also called dirty data). This is a classic case of distributed data inconsistency.
A distributed system is said to have strong consistency if, after an update operation succeeds, all users can read the latest value of that data item.
2. Availability
Availability means that the service provided by the system must always be available, and that every operation request can return a result within a finite time. The emphasis here is on “finite time” and “returning a result”.
“Finite time” means the system must be able to respond to a user's request within a specified period of time; beyond that period, the system is considered unavailable.
In addition, “finite time” refers to a performance target set at design time, and it often differs greatly from system to system. In any case, the system must respond to user requests within a reasonable time, otherwise users will lose confidence in the system.
“Returning a result” is another very important aspect of availability. It requires the system to return a normal response after processing a user request. A normal response clearly reflects the outcome of processing the request, success or failure, rather than returning something that confuses the user.
3. Partition tolerance
Partition tolerance requires the following of a distributed system: the system must still be able to provide consistent and available services when encountering any network partition, unless the entire network environment has failed.
A network partition means that, in a distributed system, the nodes are distributed across different subnets (machine rooms or long-distance networks); for some special reason these subnets cannot reach one another even though the network inside each subnet is normal, so the network environment of the whole system is split into several isolated regions.
Note that the joining and leaving of any node of a distributed system can be regarded as a special kind of network partition.
Since a distributed system cannot simultaneously satisfy consistency, availability, and partition tolerance, one of the three has to be given up, leaving three possible combinations: CA (give up partition tolerance), CP (give up availability), and AP (give up consistency).
To be clear, partition tolerance is a basic requirement for a distributed system. Since it is a distributed system, its components are necessarily deployed on different nodes and therefore necessarily form subnetworks; otherwise there would be no distributed system at all.
For a distributed system, network problems are an unavoidable kind of exception, so partition tolerance is a problem the system must face and solve. As a result, system architects usually focus on finding a balance between C (consistency) and A (availability) according to the characteristics of the business.
BASE theory
BASE is an acronym for Basically Available, Soft state, and Eventually consistent. BASE theory is the result of trading off consistency and availability in CAP. It comes from the distilled experience of building large-scale Internet systems in a distributed way, and it evolved gradually on the basis of the CAP theorem.
The core idea of BASE theory is that even if strong consistency cannot be achieved, each application can, according to its own business characteristics, adopt an appropriate way to make the system eventually consistent. Let's look at the three elements of BASE:
1. Basically available
Basic availability means that a distributed system is allowed to lose part of its availability when unpredictable failures occur; note that this is by no means equivalent to the system being unavailable. For example:
(1) Loss in response time: normally an online search engine returns results to the user within 0.5 seconds, but because of a failure the response time grows to 1 or 2 seconds.
(2) Loss in system functions: normally, when shopping on an e-commerce site, consumers can complete almost every order. However, during a big promotion, because of a surge in purchases, some consumers may be directed to a downgraded page in order to protect the stability of the shopping system, as in the sketch below.
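As a small sketch of such degradation (the names are hypothetical, and Java 9+ is assumed for CompletableFuture.orTimeout): the page render is given a deadline, and if the backend is too slow or fails, a downgraded page is returned instead of an error, so the system stays basically available.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of graceful degradation under load or failure.
class ProductPageService {
    private static final String DEGRADED_PAGE = "The service is busy, please try again later";

    String renderPage(CompletableFuture<String> backendCall) {
        return backendCall
                .orTimeout(500, TimeUnit.MILLISECONDS)   // bounded response time
                .exceptionally(e -> DEGRADED_PAGE)       // downgraded page instead of an error
                .join();
    }
}
```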
2. Soft state
Soft state means that intermediate states of the data are allowed to exist in the system, and that the existence of such intermediate states does not affect the overall availability of the system; in other words, delay is allowed while data is synchronized between the replicas on different nodes.
3. Eventual consistency
Eventual consistency emphasizes that all copies of the data, after a period of synchronization, finally reach a consistent state. The essence of eventual consistency is therefore that the system only needs to guarantee that the data becomes consistent in the end, rather than guaranteeing strong consistency of the data in real time.
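As a minimal sketch of this behavior (the names are hypothetical): writes are applied to one replica and copied to the other by periodic background synchronization, so the two copies may disagree for a while but converge once the synchronization catches up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: two replicas converge through periodic background synchronization.
class EventuallyConsistentPair {
    private final Map<String, String> replicaA = new ConcurrentHashMap<>();
    private final Map<String, String> replicaB = new ConcurrentHashMap<>();
    private final ScheduledExecutorService syncer = Executors.newSingleThreadScheduledExecutor();

    EventuallyConsistentPair() {
        // Soft state: replicaB may lag behind replicaA between synchronization rounds.
        syncer.scheduleAtFixedRate(() -> replicaB.putAll(replicaA), 0, 200, TimeUnit.MILLISECONDS);
    }

    void write(String key, String value) {
        replicaA.put(key, value);
    }

    /** May return a stale value, but becomes consistent after the next synchronization round. */
    String readFromB(String key) {
        return replicaB.get(key);
    }
}
```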
In general, BASE theory is oriented toward large, highly available and scalable distributed systems, which is just the opposite of traditional ACID transactions. It is completely different from ACID's strong consistency model: it sacrifices strong consistency to gain availability, allowing the data to be inconsistent for a period of time as long as it eventually reaches a consistent state.
However, in real distributed scenarios, different business units and components have different requirements for data consistency, so the ACID properties and BASE theory are often combined when designing a distributed system architecture.
Finally
Welcome to follow my WeChat official account [programmer chasing wind]; new articles will be posted there, and the reference materials I collect will be shared there as well.