When designing a distributed system, we usually plan and implement it around the goal of high availability and data consistency in the face of possible network delay and unpredictable request traffic. It is not easy to achieve this goal completely. Therefore, a basic theorem, CAP theorem, was born in the field of distributed systems, which is used to guide the design of distributed systems. The characteristics of distributed systems are divided into a fault tolerance consistency model from three perspectives: system high availability, data consistency and network fault tolerance. In this way, system designers only need to weigh and design a fault tolerant consistency model of partition suitable for business scenarios according to the characteristics of business scenarios, which greatly simplifies the difficulty of distributed system design.

Therefore, CAP theorem is a must for architects to master, and it affects the technical selection and decision of architects for distributed systems. And since that’s so important, let’s learn the CAP theorem together.

What is the CAP

CAP theorem was originally a conjecture put forward by Eric Brewer, a computer scientist at the University of California, Berkeley, at the ACM PODC in 2000, hence the name Brewer’s theorem. Then, in 2002, Seth Gilbert and Nancy Lynch of THE Massachusetts Institute of Technology published a proof of the CAP theorem, making it an accepted theorem in distributed systems.

The CAP theorem points out that in a distributed system connected across LAN and sharing data, the three constraint attributes of Consistency, Availability and Partition Tolerance can only satisfy two at the same time.

Here is a brief description of the three attributes:

  • Consistency: The data obtained by the client’s read operation is always the data written last time, which requires strong consistency of data read and write.
  • Availability: Client requests can always get a normal response from non-faulty system nodes within a limited time, with no timeouts and no errors such as 502.
  • Zone fault tolerance: The system can still provide services when network partitions occur, that is, nodes cannot communicate properly or data synchronization is delayed.

It is important to note that CAP describes a normal distributed system scenario where there is a network connection and data is shared across nodes. If there is only one copy of data in the whole system, and other nodes have no corresponding copies, there is no need to share data across nodes, then the distributed system is not the object that CAP cares about, nor can it be designed and implemented according to CAP theorem.

In-depth understanding of CAP

After understanding the basic concept of CAP, we will further study the attributes of C, A and P to deepen our understanding of CAP.

C: Consistency

Consistency here is described in different ways from different perspectives. In a distributed system, the data of each node is the same. On the client side, the representation is that the result of a read operation is always the latest written. Which need to be clear is that for a distributed system node, is likely to be some point with different data: if the atomic operation on a node, the node in the process of execution data to other nodes is not completely consistent, only atomic operation after completion of execution, the node data will continue to maintain the synchronization. For common transaction operations, only after the transaction is committed, the client can read the data written by the transaction. If the transaction fails, the data is rolled back to the old data, and no data written in the middle of the transaction is read.

When a client initiates a write request, the node receiving the write request responds in a timely manner and synchronizes updated data to another node to ensure data consistency. The specific working process is as follows:

  1. The client sends a write operation to node 1 to update data X to 1.
  2. The update succeeds. The system synchronizes the updated data from node 1 to node 2 and updates the old data X on node 2 to 1.
  3. When the client sends a read operation to node 2 to obtain data X, it will get the latest value of X: 1.

Consistency emphasizes strong consistency of data, which is important for some systems. For example, in the case of inventory deduction in the e-commerce system and transfer deduction in the financial system, any consistency problem may cause serious consequences.

A: Availability

After consistency, let’s move on to usability, which is a relatively simple concept but just as important as consistency. In order for the system to be available, it is necessary to ensure that the system can return a valid response, except in the case of failure of all nodes, allowing the response to be old data to the client, but not failing to respond and timeout.

Availability emphasizes the availability of services, but does not guarantee the correctness of data. With a simple example to describe the availability of the distributed system are as follows: allows the client to node 1 or 2 a read operation, when a node failure, regardless of whether or not consistent data between the nodes, as long as have a node service to receive requests, and then respond to the value of X, this means that the two nodes service is to meet usability.

In the description of usability, it is also worth mentioning what constitutes an effective response. To return a valid response, no timeouts or errors are allowed. The result may not be correct, for example, old data is returned, but the client can conduct normal business processing after receiving it.

P: fault tolerance of partitions

After C and A, let’s finally talk about P: partition fault tolerance. Due to the distributed system multiple nodes are often deployed in multiple network environment to communicate with each other, is hard to avoid some network problems, such as network packet loss, delay, network news network interruption, etc., can lead to problems of communication between nodes, data synchronization operation cannot be completed, partition fault tolerance requires the system even in the case of a network partition appear, Can continue to provide services to the client.

Because the distributed system is different from the single machine, it involves the communication and data interaction between multiple nodes, which can not avoid network problems. If there is no fault tolerance of partition, it means that the system does not allow any errors in the communication between nodes, which means that the system is not available, which is unacceptable in the vast number of systems. Therefore, fault tolerance of partition faults between nodes must be considered, which is also the reason why the fault tolerance of partition is usually guaranteed first in CAP theorem.

How to apply the CAP theorem

Now that we know the consistency (C), availability (A), and fault tolerance (P) of the CAP theorem, let’s look at how to use it. The CAP theorem states that C, A and P attributes cannot be met at the same time, and in the case of network interaction and data synchronization, there will be delay and data loss, which we must accept and ensure that the system does not hang up. So partition fault tolerance must be guaranteed, and the only choice left is between consistency (C) and availability (A). Consistency is selected to ensure data correctness, but it also means that the system may be unavailable; Availability ensures high availability of services, but it also means that data may be inconsistent. Next, we will discuss the characteristics of CP architecture and AP architecture, and how to choose suitable architecture strategy according to different distributed scenarios.

CP

For a distributed system based on CP architecture, in order to ensure consistency, if data X on node 1 has been updated to 2 after a network partition occurs, data X on node 1 cannot be synchronized to node 2 because the data synchronization channel between nodes has been interrupted, and data X on node 2 remains 1. If a client accesses data X on node 2, node 2 will return an error indicating that an error occurred until data is synchronized between nodes. Of course, such a treatment obviously violates the requirements of availability, so the CAP theorem can only satisfy CP.

If a distributed scenario requires strong consistency, or if the system can tolerate long periods of unresponsiveness but the data needs to be consistent, the CP architecture is suitable for designing the corresponding distributed system. In such a system, the availability of the system will be sacrificed once the data cannot be synchronized due to network partitioning, and the response will be made after the node data is consistent. In the open source community, there are many applications that use CP architecture, such as Redis, HBase, MongoDB, ZooKeeper, Etcd, Consul, and so on, all choose CP attribute instead of certain availability.

AP

In a distributed system designed with the AP architecture, to ensure availability, data X on node 1 is updated to 2 after the network partition occurs. However, data on node 1 cannot be synchronized to node 2 because the data synchronization channel between nodes has been interrupted, and data X on node 2 remains 1. When the client visits node 2 to obtain data X, it receives a normal response. The old data X = 1, but in fact the latest data X is already 2, which does not meet the requirement of consistency. Therefore, CAP theorem can only satisfy AP.

There are also many scenarios suitable for AP, such as some query systems, commodity query of e-commerce systems, etc. Most of them sacrifice certain data consistency to ensure the availability of the system, so as to ensure user experience. In the open source world, typical applications using AP model include Eurka and Cassandra.

Do I have to choose two out of three

Referring to the CAP theorem, most people assume that distributed systems can only choose between C and A in any case. However, the premise here is that network partitioning occurs in the system. If there is no network partitioning in the system, that is to say, when P does not exist, there is no need to give up C or A. Therefore, how to ensure CA without partitioning should also be considered in architectural design. In addition, a distributed system may not only choose between AP and CP, and different internal modules have different scenarios. It is completely possible that one module adopts AP architecture and another adopts CP architecture. As a good architect, you should not be limited by what most people think of the CAP theorem, but design a distributed system that fits your business scenario.

conclusion

This paper mainly understands and recognizes CAP theorem, and the meaning of each C, A, P, as well as the application of CAP theorem. It is important for architects to know the CAP theorem. Because for distributed system, network failure is inevitable, how to maintain the system in accordance with the normal behavior logic when the network failure is very important. A qualified architect needs to be able to make tradeoffs and design usable and stable distributed systems based on the CAP theorem in combination with actual business scenarios and specific requirements.

The resources

  • CAP, unseen – Wikipedia en.wikipedia.org/wiki/CAP_th…

  • Want to be an architect, you must know the CAP time.geekbang.org/column/arti…

  • CAP theorem: three choose two, the architect must learn to choose time.geekbang.org/column/arti…

This article is published by OpenWrite!