About a year ago I wrote a first version of this article on consistency models, but I was never entirely happy with it: it was written in haste, and the topic is important enough to deserve more careful treatment. ACM Queue asked me to revise it for publication in its magazine, and I took the opportunity to improve the article. This is the updated version.
Eventually Consistent – Building reliable distributed systems at a worldwide scale demands trade-offs between consistency and availability.
Amazon's cloud computing is built on infrastructure services such as S3 (Simple Storage Service), SimpleDB, and EC2 (Elastic Compute Cloud), which provide the resources for constructing Internet-scale computing platforms and a great variety of applications. The requirements placed on these infrastructure services are strict: they need to score highly on security, scalability, availability, performance, and cost-effectiveness, and they need to meet these requirements continuously while serving millions of customers around the globe.
Behind these services are massive distributed systems that operate on a worldwide scale. This scale brings additional challenges: when a system handles trillions of requests, events that normally have a low probability of occurring are guaranteed to happen and must be accounted for up front in the system's design and architecture. Given the worldwide scope of these systems, we use replication techniques ubiquitously to guarantee consistent performance and high availability. Although replication brings us closer to these goals, it cannot achieve them in a perfectly transparent manner; under a number of conditions, the customers of these services will be confronted with the consequences of using replication techniques inside the services.
One of the ways this manifests itself is in the type of data consistency that is provided, particularly when the underlying distributed system provides an eventual consistency model for data replication. When designing these large-scale systems at Amazon, we used a set of guiding principles and abstractions related to large-scale data replication and focused on the trade-offs between high availability and data consistency. In this article I present some of the background that has informed our approach to delivering reliable distributed systems that need to operate on a global scale. An earlier version of this article appeared on the All Things Distributed blog in December 2007 and was greatly improved with the help of its readers.
Historical perspective
In an ideal world there would be only one consistency model: when an update is made, all observers would see that update. The first time this proved hard to achieve was in the database systems of the late 1970s. The best "period piece" on the topic is "Notes on Distributed Databases" by Bruce Lindsay et al. It lays out the fundamental principles of database replication and discusses a number of techniques for achieving consistency. Many of these techniques try to achieve distribution transparency; that is, to the user of the system it appears as if there is only one system instead of a number of collaborating systems. Many systems during this period took the approach that it was better to fail the complete system than to break this transparency.
In the mid-1990s, these practices were revisited with the rise of large Internet systems. At that time people began to consider the idea that availability was perhaps the most important property of these systems, but they were struggling with what it should be traded off against. Eric Brewer, a systems professor at the University of California, Berkeley, and at that time head of Inktomi, brought the different trade-offs together in a keynote address at the 2000 PODC (Principles of Distributed Computing) conference. He presented the CAP theorem, which states that of the three properties of shared-data systems (data consistency, system availability, and tolerance to network partitions) only two can be achieved at any given time. A more formal confirmation can be found in a 2002 paper by Seth Gilbert and Nancy Lynch.
A system that is not tolerant of network partitions can achieve data consistency and availability, and it often does so by using transaction protocols. To make this work, the client and the storage system must be part of the same environment; under certain scenarios they fail as a whole, and therefore clients cannot observe partitions. An important observation is that in large distributed systems, network partitions are a given; therefore, consistency and availability cannot be achieved at the same time. This means there are two choices on what to drop: relaxing consistency will allow the system to remain highly available under partitionable conditions, whereas making consistency a priority means that under certain conditions the system will not be available.
Both options require the client developer to be aware of what the system is offering. If the system emphasizes consistency, the developer has to deal with the fact that the system may not be available to take, for example, a write. If a write fails because the system is unavailable, the developer has to decide what to do with the data that was to be written. If the system emphasizes availability, it may always accept writes, but under certain conditions a read will not reflect the result of a recently completed write. The developer then has to decide whether the client requires access to the absolutely latest update at all times. A large class of applications can handle slightly stale data and is served well under this model.
In principle, the consistency property of transaction systems as defined in the ACID properties (atomicity, consistency, isolation, durability) is a different kind of consistency guarantee. In ACID, consistency is the guarantee that the database is in a consistent state when a transaction is finished; for example, when transferring money from one account to another, the total amount held in both accounts should not change. In ACID-based systems, this kind of consistency is usually the responsibility of the developer writing the transaction, but it can be assisted by the database managing integrity constraints.
Consistency – client and server
There are two ways of looking at consistency. One is from the developer/client point of view: how they observe data updates. The second is from the server side: how updates flow through the system and what guarantees the system can give with respect to updates.
Client-side consistency
On the client side there are these components:
- A storage system. For the moment we treat it as a black box, but one should assume that under the covers it is something massive and highly distributed, and that it is built to guarantee durability and availability.
- Process A. This is a process that writes to and reads from the storage system.
- Processes B and C. These two processes read from and write to the storage system independently of process A. It is irrelevant whether they are really processes or threads within the same process; what is important is that they are independent and need to communicate to share information.
Client-side consistency has to do with how and when an observer (in this case process A, B, or C) sees updates made to a data object in the storage system. In the following examples illustrating the different types of consistency, process A has made an update to a data object:
- Strong consistency. After the update is complete, any subsequent accesses (via A, B, or C) will return the updated value.
- Weak consistency. The system does not guarantee that subsequent accesses will return the updated value. A number of conditions need to be met before the updated value will be returned. The period between the update and the moment when it is guaranteed that any observer will always see the updated value is called the inconsistency window.
- Eventual consistency. This is a specific form of weak consistency; the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme. The most popular system that implements eventual consistency is DNS (Domain Name System). Updates to a name are distributed according to a configured pattern and in combination with time-controlled caches; eventually, all clients will see the update. (A toy sketch of this behavior follows the list.)
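To make the inconsistency window concrete, here is a small, self-contained Python sketch. It is a toy model, not taken from any of the systems described here: a write is acknowledged by one replica immediately and reaches a second replica only after an artificial delay, so a read against the lagging replica inside that window returns the old value.

```python
import threading
import time

class EventuallyConsistentRegister:
    """Toy model: two replicas; a write is applied to replica 0 right away
    and reaches replica 1 only after an artificial propagation delay."""

    def __init__(self, propagation_delay=0.5):
        self.replicas = [None, None]
        self.delay = propagation_delay

    def write(self, value):
        self.replicas[0] = value                 # acknowledged immediately
        def propagate():
            time.sleep(self.delay)               # the inconsistency window
            self.replicas[1] = value             # eventually all replicas agree
        threading.Thread(target=propagate, daemon=True).start()

    def read(self, replica=1):
        return self.replicas[replica]

reg = EventuallyConsistentRegister()
reg.write("v1")
print(reg.read())    # likely None: the update has not reached this replica yet
time.sleep(1.0)
print(reg.read())    # "v1": once the window has passed, the update is visible
```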
The eventual consistency model has a number of important variations to consider:
- Causal consistency. If process A has communicated to process B that it has updated a data item, subsequent accesses by process B return the updated value, and a write is guaranteed to supersede the earlier write. Accesses by process C, which has no causal relationship to process A, follow the normal eventual consistency rules.
- Read-your-writes consistency. This is an important model in which process A, after having updated a data item, always accesses the updated value and never sees an older value. This is a special case of the causal consistency model.
- Session consistency. This is a practical version of the previous model, in which a process accesses the storage system in the context of a session. As long as the session exists, the system guarantees read-your-writes consistency. If the session terminates because of some failure scenario, a new session needs to be created, and the guarantees do not extend across sessions.
- Monotonic read consistency. If the process sees a particular value for an object, any subsequent accesses will never return any previous values.
- Monotonic write consistency. In this case, the system guarantees to serialize the writes by the same process. Systems that do not guarantee this level of consistency are notoriously difficult to program.
A number of these properties can be combined. For example, one can get monotonic reads combined with session-level consistency. From a practical point of view, these two properties (monotonic reads and read-your-writes) are most desirable in an eventually consistent system, but not always required. These two properties make it simpler for developers to build applications, while allowing the storage system to relax consistency and provide high availability.
As you can see from these variants, there can be many different situations. Whether the consequences can be handled depends on the particular application.
Eventual consistency is not some esoteric property of extreme distributed systems. Many modern RDBMSs (relational database management systems) that provide primary-backup reliability implement their replication in both synchronous and asynchronous modes. In synchronous mode, the replica update is part of the transaction. In asynchronous mode, the updates arrive at the backup in a delayed manner, often through log shipping. In the latter mode, if the primary fails before the logs are shipped, reading from the promoted backup will produce old, inconsistent values. Also, to support better scalable read performance, RDBMSs have begun to provide the ability to read from the backup, which is a classic case of providing eventual consistency guarantees, where the inconsistency window depends on the periodicity of the log shipping.
Server-side consistency
On the server side, we need to take a deeper look at how updates flow through the system to understand what drives the different modes that the developer who uses the system can experience. Before getting started, let's establish a few definitions:
- N = the number of nodes that store replicas of the data
- W = the number of replicas that need to acknowledge the receipt of the update before the update completes
- R = the number of replicas that are contacted when a data object is accessed through a read operation
If W + R > N, then the write set and the read set always overlap and strong consistency can be guaranteed. In the primary-backup RDBMS scenario that implements synchronous replication, N=2, W=2, and R=1: no matter which replica the client reads from, it will always get a consistent answer. In asynchronous replication with reading from the backup enabled, N=2, W=1, and R=1. In this case R + W = N, and consistency cannot be guaranteed.
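The overlap rule can be captured in a couple of lines; the following sketch simply restates the arithmetic above and checks the two RDBMS examples from the text.

```python
def read_write_quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum must intersect every write quorum,
    i.e. when W + R > N, which is what gives strong consistency."""
    return w + r > n

print(read_write_quorums_overlap(n=2, w=2, r=1))  # True: synchronous primary/backup
print(read_write_quorums_overlap(n=2, w=1, r=1))  # False: asynchronous, reads from backup
```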
The problem with these configurations, which are basic quorum protocols, is that when the system cannot write to W nodes because of failures, the write operation has to fail, marking the system as unavailable. With N=3 and W=3 and only two nodes available, the system has to fail the write.
In distributed storage systems that need to provide high performance and high availability, the number of replicas is in general greater than two. Systems that focus solely on fault tolerance often use an N=3 (with W=2 and R=2) configuration. Systems that need to serve very high read loads often replicate their data beyond what is required for fault tolerance; N can be tens or even hundreds of nodes, with R configured to 1 so that a single read returns a result. Systems that are concerned with consistency are set to W=N for updates, which may decrease the probability of the write succeeding. A common configuration for systems that are concerned with fault tolerance but not consistency is to run with W=1 to get minimal durability of the update and then rely on a lazy (epidemic) technique to update the other replicas.
How to configure N, W, and R depends on the common case and on which performance path needs to be optimized. With R=1 and N=W we optimize for the read case, and with W=1 and R=N we optimize for a very fast write. Of course, in the latter case, durability is not guaranteed in the presence of failures, and if W < (N + 1) / 2, there is the possibility of conflicting writes because the write sets do not overlap.
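The trade-offs described above can be laid out side by side. In the sketch below, the specific N values are arbitrary choices for illustration (the text only fixes the relationships between N, W, and R); the labels are mine, not configuration names from any real system.

```python
# Illustrative (N, W, R) configurations reflecting the trade-offs described above.
configs = [
    ("fault tolerance (strong)",          3, 2, 2),
    ("read-optimized (R=1, W=N)",         3, 3, 1),
    ("write-optimized (W=1, R=N)",        3, 1, 3),
    ("read scaling, eventual (W=1, R=1)", 10, 1, 1),
]

for name, n, w, r in configs:
    overlap = w + r > n                 # read and write quorums must intersect
    conflict_risk = w < (n + 1) / 2     # write quorums may fail to intersect each other
    print(f"{name:36s} N={n} W={w} R={r} "
          f"strong={overlap} conflicting-writes-possible={conflict_risk}")
```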
Weak/eventual consistency arises when W + R <= N, meaning there is a possibility that the read and write sets will not overlap. If this is a deliberate configuration and not based on a failure case, it hardly makes sense to set R to anything other than 1. This happens in two very common cases: the first is the massive replication for read scaling mentioned earlier; the second has to do with more complicated data access. In a simple key-value model it is easy to compare versions to determine the latest value written to the system, but in systems that return sets of objects it is more difficult to determine what the correct latest set should be. In most systems where the write set is smaller than the replica set, there is a mechanism that applies the updates in a lazy manner to the remaining nodes of the replica set. The period until all replicas have been updated is the inconsistency window discussed earlier. If W + R <= N, the system is vulnerable to reading from nodes that have not yet received the updates.
Whether read-your-writes, session, and monotonic consistency can be achieved depends in general on the client's "stickiness" to the server that executes the distributed protocol for it. If this is the same server every time, it is relatively easy to guarantee read-your-writes and monotonic reads. Sticking to the same server makes load balancing and fault tolerance slightly harder to manage, but it is a simple solution. Using sticky sessions is easy to understand and provides a level of exposure that clients can reason about.
Sometimes the client implements read-your-writes and monotonic reads itself. By adding versions on writes, the client discards reads of values whose versions precede the last version it has seen.
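One possible shape of such a client is sketched below. The store interface (a write that returns a version, a read that returns a value/version pair) is a hypothetical API used only for illustration, not an actual interface of any system mentioned here.

```python
class StaleReadError(Exception):
    """Raised when a replica returns a value older than one this client has seen."""

class VersionAwareClient:
    """Minimal sketch of a client enforcing read-your-writes and monotonic reads
    by remembering the highest version it has observed per key.
    The underlying `store` is assumed to expose write(key, value) -> version
    and read(key) -> (value, version); both are hypothetical."""

    def __init__(self, store):
        self.store = store
        self.last_seen = {}   # key -> highest version written or read so far

    def write(self, key, value):
        version = self.store.write(key, value)
        self.last_seen[key] = max(self.last_seen.get(key, -1), version)
        return version

    def read(self, key):
        value, version = self.store.read(key)
        if version < self.last_seen.get(key, -1):
            # The contacted replica lags behind what this client has already seen;
            # a real client would retry against another replica instead.
            raise StaleReadError(key)
        self.last_seen[key] = version
        return value
```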
Partitions happen when some nodes in the system cannot reach other nodes, but both sets of nodes remain reachable by groups of clients. If you use a classic majority quorum approach, the partition that has W nodes of the replica set can continue to take updates while the other partition becomes unavailable. The same is true for the read set. Given that these two sets overlap, by definition the minority set becomes unavailable. Partitions do not happen frequently, but they do occur, both between data centers and inside data centers.
In some applications the unavailability of any of the partitions is unacceptable, and it is important that clients that can reach a partition continue to make progress. In that case, both sides assign a new set of storage nodes to receive the data, and a merge operation is executed when the partition heals. For example, within Amazon the shopping cart uses such a write-always system; in the case of a partition, a customer can continue to put items in the cart even if the original cart lives on the other partition. The cart application assists the storage system with merging the carts once the partition has healed.
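A drastically simplified sketch of what such a merge could look like, assuming a cart is just a mapping from item to quantity; the real cart merging logic is more involved and is not described in this article.

```python
def merge_carts(cart_a: dict, cart_b: dict) -> dict:
    """Merge two cart replicas that diverged during a partition by taking the
    union of items and, for items present on both sides, the larger quantity,
    so that no addition made on either side is lost."""
    merged = dict(cart_a)
    for item, qty in cart_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

# Example: each side of the partition saw different additions.
print(merge_carts({"book": 1, "pen": 2}, {"pen": 3, "mug": 1}))
# -> {'book': 1, 'pen': 3, 'mug': 1}
```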
Amazon's Dynamo
One system that puts all of these properties under the explicit control of the application architect is Amazon's Dynamo, a key-value storage system that is used extensively within Amazon's e-commerce platform services as well as Amazon's Web Services. One of the design goals of Dynamo is to let the application owner who creates an instance of the Dynamo storage system, which commonly spans multiple data centers, make the trade-offs between consistency, durability, availability, and performance.
Conclusion
In large, reliable distributed systems, data inconsistencies must be tolerated for two reasons: improving read and write performance under highly concurrent conditions, and handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running.
Whether inconsistencies are acceptable depends on the client application. In all cases, developers need to be aware of which consistency guarantees the storage system provides and take them into account when developing applications. There are a number of practical improvements to the eventual consistency model, such as session consistency and monotonic reads, that give developers better tools. Many times the application can handle the eventual consistency guarantees of the storage system without any problem. A particularly popular case is a website where we can have the notion of user-perceived consistency. In this scenario, the inconsistency window needs to be smaller than the time expected for the customer to return for the next page load. This allows updates to propagate through the system before the next read is expected.
The goal of this article is to raise awareness of the complexity of engineering systems that need to operate at a global scale and that require careful tuning to ensure they can deliver the durability, availability, and performance their applications require. One of the tools the system designer has is the length of the inconsistency window, during which the clients of the system are possibly exposed to the realities of large-scale systems engineering.
References
- Brewer, E. A. 2000. Towards robust distributed systems (abstract). In Proceedings of the 19th Annual ACM Symposium on Principles of Distributed Computing (July 16-19, Portland, Oregon): 7
- A Conversation with Bruce Lindsay. 2004. ACM Queue 2(8): 22-33.
- DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W. 2007. Dynamo: Amazon’s highly available key-value store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (Stevenson, Washington, October).
- Gilbert, S., Lynch, N. 2002. Brewer’s Conjecture and the feasibility of consistent, available, partition-tolerant Web services. ACM SIGACT News 33(2).
- Lindsay, B. G., Selinger, P. G., et al. 1980. Notes on distributed databases. In Distributed Data Bases, ed. I. W. Draffan and F. Poole, 247-284. Cambridge: Cambridge University Press. Also available as IBM Research Report RJ2517, San Jose, California (July 1979).