Understanding CAP is a must for architects and engineers who develop or design distributed systems.

(The focus of this article is not the theory and details of CAP, but how CAP plays a guiding role in microservices development. This is illustrated through several examples from microservices practice, staying close to real development.)

The CAP theorem, also known as Brewer’s theorem, began as a conjecture by Eric Brewer, a computer scientist at the University of California, and was later proved and accepted as a theorem in distributed computing. However, Brewer did not define Consistency, Availability, and Partition tolerance (CAP) in detail when he first proposed it, so there are many voices on the Internet with different interpretations of CAP.

The CAP theorem

The CAP theorem went through two versions during its development; the second version is the one that should prevail:

In a distributed system (a collection of nodes connected to each other and sharing data), when it comes to read and write operations, only two of Consistency, Availability, and Partition tolerance can be guaranteed; the other must be sacrificed.

This version of the CAP theorem focuses on distributed systems that are interconnected and share data, which fixes a drawback of the first version: not all distributed systems interconnect and share data. A memcached cluster, whose nodes do not communicate with each other, is therefore not covered by the CAP theorem, whereas a MySQL cluster, whose nodes are interconnected and share data through replication, is.

Consistency

Consistency means that a read must return the most recently written value, regardless of which node the read or write is performed on.

Availability

Non-failing nodes return a reasonable response in a reasonable amount of time

Partition Tolerance

When a network partition occurs, the system can continue to perform its duties.

In a distributed environment, the network can never be 100% reliable, so partitions are inevitable and P must be chosen. Suppose we gave up P and chose CA: when a partition occurs, guaranteeing C requires refusing writes, which conflicts with A; guaranteeing A means the healthy partition keeps accepting writes while the faulty partition cannot, which conflicts with C. Therefore a distributed system cannot, in theory, choose a CA architecture; it must choose CP or AP.

Distributed transaction BASE theory

BASE theory is an extension and supplement to CAP, specifically to the AP option within CAP: even when the AP option is chosen, how can C still be achieved as well as possible in the end?

BASE is an abbreviation of three phrases: Basically Available, Soft state, and Eventual consistency. The core idea is that even if strong consistency cannot be achieved, applications can adopt appropriate means to achieve eventual consistency.

A practical example of CAP in service

For the CAP background of this topic, refer to Li Yunhua’s book *Learning Architecture from Scratch*, whose chapters 21 and 22 describe the theoretical details of CAP and the evolution of its versions in detail.

Here the emphasis is on how to apply the god-like CAP to guide our microservices. I will cite some common examples.

Service registry, AP or CP?

Problems solved by the service registry

Before we discuss CAP, let’s clarify what the service registry is primarily about: service registration and service discovery.

  • Service registration: The instance registers its service information with the registry, including the host IP address of the service, the Port of the service, and the status and access protocol information of the service.

  • Service discovery: Instances request information about services that the registry relies on, and service instances obtain information about registered service instances from the registry to request services they provide.
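The register/discover contract above can be sketched as a minimal in-memory registry. This is a hypothetical illustration only: the class, method and field names are invented for this sketch, and real registries (Eureka, ZooKeeper, Consul) add heartbeats, replication and failure detection on top of this basic contract.

```python
import time

class Registry:
    """A minimal in-memory service registry (illustrative sketch only)."""
    def __init__(self):
        self._services = {}  # service name -> {instance_id: metadata}

    def register(self, name, instance_id, host, port):
        # Service registration: the instance publishes its host, port and status.
        self._services.setdefault(name, {})[instance_id] = {
            "host": host, "port": port, "status": "UP",
            "registered_at": time.time(),
        }

    def discover(self, name):
        # Service discovery: consumers fetch the current list of live instances.
        return [
            {"instance_id": iid, **meta}
            for iid, meta in self._services.get(name, {}).items()
            if meta["status"] == "UP"
        ]

registry = Registry()
registry.register("order-service", "order-1", "10.0.0.1", 8080)
registry.register("order-service", "order-2", "10.0.0.2", 8080)
instances = registry.discover("order-service")  # consumers pick one to call
```

Everything CAP-related in the rest of this section is about what happens when this table is replicated across several registry nodes and the network between them fails.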

Some of the components currently serving as registries are: ZooKeeper (used by Dubbo), Eureka (Spring Cloud), Consul, RocketMQ’s NameServer, and HDFS’s NameNode. At present, Dubbo and Spring Cloud are the mainstream microservice frameworks, and ZooKeeper and Eureka are the most widely used registries. Let’s look at how to choose a registry according to the CAP theorem. (ZooKeeper can also be used with Spring Cloud, though it is not the mainstream option.)

ZooKeeper chooses CP

ZooKeeper guarantees CP: any read of ZooKeeper obtains a consistent result, and the system tolerates network partitions, but it cannot guarantee the availability of every request. In practice, when using ZooKeeper to obtain a service list, if ZK is in the middle of a leader election, or if more than half of the machines in the ZK cluster are unavailable, no data can be obtained. Therefore, ZK cannot guarantee service availability.

Eureka chooses AP

Eureka guarantees AP: Eureka prioritizes availability in its design. Every node is equal; the failure of some nodes does not affect the work of the normal nodes, and there is no leader-election process like ZK’s. As long as one Eureka node survives, the registration service remains available, although the information it returns may not be up to date.

Why ZooKeeper and Eureka differ on data consistency

To be clear, Eureka was created to be a registry, while ZK is first and foremost a distributed coordination service; it acts as a registry only because Dubbo assigned that role to it. ZK’s responsibility is to ensure that data (configuration data, state data) is consistent across all services under its jurisdiction, so it is not hard to understand why ZK was designed as CP rather than AP: ZK’s core algorithm, ZAB, exists precisely to solve consistent synchronization of data among multiple nodes in a distributed system.

The deeper reason is that ZooKeeper is built on the CP principle: the data of every node must be kept consistent. If ZooKeeper nodes become disconnected or the cluster is partitioned (for example, subnets behind switches cannot reach each other), ZK removes those nodes from its management scope. The outside world can no longer reach them, even if the nodes themselves are healthy and could serve normally, so requests to those nodes are lost.

Eureka has no such concerns: its nodes are relatively independent, and it does not need to worry about data consistency, because Eureka was designed from the start to be a registry. This makes Eureka easier to maintain and more robust in operation.

Let’s look at what problems data inconsistency can cause Eureka as a registry. It amounts to some nodes listing more instances of a service and other nodes listing fewer, so at a given moment some instances may be called more often and others less often. There may also be dirty data: instances that should have been removed but were not.

Summary: should service registration choose AP or CP?

For service registration, even if different registry nodes hold different registration information for the same service, there are no disastrous consequences. For service consumers, being able to consume is what matters most: even if the data obtained is not the latest, the consumer can simply retry after a failure. That is far better than the whole service becoming unavailable, with no instance information at all, in pursuit of data consistency.

Therefore, for service registration, availability is more important than data consistency, choose AP.

Distributed lock: AP or CP?

There are three ways to realize distributed lock:

  • Implement distributed lock based on database
  • Distributed lock based on Redis
  • Distributed lock based on ZooKeeper

Implement distributed lock based on database

Build table structure

The table’s UNIQUE KEY idx_lock (method_lock) serves as a unique constraint. To lock, perform an INSERT into the table; if the row enters the database successfully, the lock is acquired.
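A minimal sketch of the insert-as-lock idea, using SQLite’s UNIQUE constraint in place of the MySQL table described above (the table and column names follow the article; everything else is illustrative):

```python
import sqlite3

# Simulate the MySQL lock table with in-memory SQLite; method_lock is UNIQUE,
# matching UNIQUE KEY idx_lock (method_lock) from the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE method_lock (method_lock TEXT UNIQUE, holder TEXT)")

def try_lock(method, holder):
    # Locking == a successful INSERT; the UNIQUE constraint rejects a second holder.
    try:
        conn.execute("INSERT INTO method_lock VALUES (?, ?)", (method, holder))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        return False

def unlock(method, holder):
    # Unlocking == deleting our own row.
    conn.execute("DELETE FROM method_lock WHERE method_lock=? AND holder=?",
                 (method, holder))
    conn.commit()

first = try_lock("pay_order", "thread-1")   # row inserted: lock held
second = try_lock("pay_order", "thread-2")  # UNIQUE violation: lock refused
unlock("pay_order", "thread-1")
third = try_lock("pay_order", "thread-2")   # succeeds after release
```

The single database here is exactly the single point of failure the next paragraph complains about.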

However, this approach relies on a single master database, and MySQL cannot automatically switch master and slave, so it basically cannot achieve partition tolerance (P); there is currently no perfect solution for automatic MySQL master/slave failover. In other words, this approach depends strongly on the availability of the database: the write is a single point, and once the database fails, the lock becomes unavailable. This approach largely falls outside the scope of the CAP discussion.

Distributed lock based on Redis

Redis processes commands serially on a single thread, which naturally serializes competing operations and makes it a good fit for distributed locks.

Implementation method:

```
SETNX key value        # returns 1 if the lock is acquired, 0 if it fails
EXPIRE key expire_time # add a timeout so a crashed holder cannot block forever
```
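The SETNX-plus-expiry idea can be sketched with an in-memory stand-in. FakeRedis below is purely illustrative (not a real client API); with real Redis you would use the atomic `SET key value NX EX seconds` form, and a Lua script for the compare-and-delete release, so that you never free a lock held by someone else.

```python
import time, uuid

class FakeRedis:
    """In-memory stand-in for Redis, just enough to demonstrate the lock."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ttl):
        # Mimics SET key value NX EX ttl: succeeds only if absent or expired.
        now = time.monotonic()
        cur = self._store.get(key)
        if cur and cur[1] > now:
            return False            # key exists and has not expired: lock fails
        self._store[key] = (value, now + ttl)
        return True                 # lock acquired

    def delete_if_match(self, key, value):
        # Release only our own lock: compare the unique token before deleting.
        cur = self._store.get(key)
        if cur and cur[0] == value:
            del self._store[key]
            return True
        return False

r = FakeRedis()
token = str(uuid.uuid4())           # unique value identifying this lock holder
got = r.set_nx_ex("lock:order", token, ttl=30)       # acquired
again = r.set_nx_ex("lock:order", "other", ttl=30)   # refused: already held
released = r.delete_if_match("lock:order", token)    # freed by its owner only
```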

To solve the database lock’s lack of master/slave failover, you can choose Redis Cluster or Sentinel mode for master/slave switchover: when the master node fails, Sentinel selects one of the slaves and promotes it to be the new master.

In Sentinel mode, failover is monitored and decided by the Sentinel cluster. When the master is abnormal, replication is stopped and a new master is elected from the slaves. During the re-election, Sentinel does not care whether master/slave replication was complete and consistent.

Therefore, Redis replication belongs to the AP model: it guarantees availability. During master/slave replication, the master has the data but a slave may not yet. Once the master goes down, or a network jitter or similar event triggers a failover, a slave is promoted to master, and two clients may end up holding the same lock at the same time.

The process is as follows:

  1. Business thread-1 requests the lock from the master node
  2. Business thread-1 acquires the lock
  3. Business thread-1 holds the lock and starts executing its business logic
  4. At this point, the lock Redis just created has not yet been replicated from master to slave
  5. The Redis master node goes down
  6. A Redis slave node is promoted to master
  7. Business thread-2 requests the lock from the new master node
  8. Business thread-2 acquires the lock returned by the new master node
  9. Business thread-2 holds the lock and starts executing its business logic
  10. Business thread-1 and business thread-2 are now executing their tasks at the same time
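The steps above can be condensed into a toy simulation showing why asynchronous replication lets two threads hold the same lock (pure illustration; the dictionaries stand in for the key spaces of real Redis nodes):

```python
# Master holds the keys; the slave's copy lags behind because replication
# is asynchronous.
master = {}
slave = {}

def acquire(store, key, owner):
    # Simplified SETNX: succeed only if the key is absent.
    if key in store:
        return False
    store[key] = owner
    return True

t1 = acquire(master, "lock:stock", "thread-1")  # steps 1-3: thread-1 locks master
# Steps 4-5: the master dies before the lock replicates; the slave never saw it.
master = None
# Step 6: the slave is promoted to master.
new_master = slave
t2 = acquire(new_master, "lock:stock", "thread-2")  # steps 7-9: thread-2 locks too
both_hold = t1 and t2                               # step 10: two lock holders
```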

The above problem is not a defect of Redis; Redis uses the AP model, which by itself does not guarantee our consistency requirements. Redis officially recommends the RedLock algorithm for this, but RedLock requires at least three independent Redis master instances, so the maintenance cost is relatively high: RedLock effectively builds its own consistency algorithm on top of multiple Redis deployments. It is cumbersome and rarely used in industry.

Can I use Redis as a distributed lock?

Whether we can use Redis as a distributed lock is not a question about Redis itself; it depends on the business scenario. We first need to confirm whether our scenario suits AP or CP. For scenarios without strong transactional consistency requirements, such as posting in a social app, the high-performance AP model Redis provides is very suitable. But for transactional scenarios that are very sensitive to data consistency, we should look for a more suitable CP model instead.

Distributed lock based on ZooKeeper

First of all, ZK’s model is CP. That is to say, whenever ZK serves a lock request, the ZK cluster can guarantee that the lock’s data exists on every ZK node.

(This is guaranteed by the ZK leader performing writes via a two-phase-commit-style protocol, which is also the bottleneck that limits how large a ZK cluster can grow.)

Zk lock implementation principle

Before we talk about the ZK lock itself, let’s look at a few ZooKeeper features from which a distributed lock can be built.

Features:

  • Sequential nodes

When sequential nodes are created under a parent directory such as /lock, the nodes are named in strict creation order: lock000001, lock000002, lock000003, and so on. Sorting all sequential nodes by name strictly reflects their creation order.

  • Ephemeral nodes

The client creates an ephemeral node; when the session ends or times out, ZooKeeper automatically deletes the node.

  • Event listeners

When reading data, we can set a watch on a node. When the node changes (it is created, it is deleted, its data changes, or its children change), ZooKeeper notifies the client.

Combining these features, let’s see how ZK assembles a distributed lock.

  1. Business thread-1 and business thread-2 each ask to create sequential ephemeral nodes under the ZK directory /lock
  2. Business thread-1 gets /lock0001, the lowest-numbered node in the directory, so thread-1 acquires the lock
  3. Business thread-2 only gets /lock0002, which is not the lowest-numbered node, so thread-2 fails to acquire the lock
  4. Business thread-1 keeps its connection to ZK alive with a heartbeat, which acts as the lease on lock0001
  5. When business thread-1 finishes, it closes its connection to ZK, the ephemeral node is deleted, and the lock is released

ZK distributed lock code implementation

The official ZK client does not support distributed locks directly; we need to write our own code that uses these ZK features to implement one.
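Under that constraint, the sequential-ephemeral-node protocol above can be sketched with a toy in-memory stand-in. FakeZk and its methods are purely illustrative; real code would use a ZooKeeper client, or a recipe library such as Apache Curator whose InterProcessMutex implements this pattern.

```python
import itertools

class FakeZk:
    """Toy stand-in for ZooKeeper sequential ephemeral nodes (illustration only)."""
    def __init__(self):
        self._seq = itertools.count(1)
        self.children = {}   # node name -> owning session

    def create_sequential(self, prefix, session):
        # Zero-padded counter mimics ZK's strictly ordered sequential names.
        name = f"{prefix}{next(self._seq):010d}"
        self.children[name] = session
        return name

    def holds_lock(self, name):
        # The session owning the lowest-numbered child holds the lock.
        return name == min(self.children)

    def session_closed(self, session):
        # Ephemeral nodes: ZK deletes them when the session ends or times out.
        self.children = {n: s for n, s in self.children.items() if s != session}

zk = FakeZk()
n1 = zk.create_sequential("/lock/lock", "thread-1")  # e.g. /lock/lock0000000001
n2 = zk.create_sequential("/lock/lock", "thread-2")  # e.g. /lock/lock0000000002
first = zk.holds_lock(n1)       # thread-1 has the lowest node: lock acquired
second = zk.holds_lock(n2)      # thread-2 waits (it would set a watch on n1)
zk.session_closed("thread-1")   # thread-1 finishes; its ephemeral node goes away
second_now = zk.holds_lock(n2)  # the watch fires; thread-2 now holds the lock
```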

Summary: CP or AP for distributed locks?

First, we must understand the scenario: why are we using a distributed lock, and what problem do we want it to solve? Only once the scenario is clear should we talk about technology selection.

Take Redis and ZK as examples. Redis’s AP model limits many application scenarios, but it has the highest performance of the options here. ZooKeeper’s distributed lock is much more reliable than Redis’s, but its heavier mechanism makes its performance worse than Redis’s, and ZK’s performance degrades further as the cluster grows.

Simply put, understand the business scenario first, and then choose the technology.

How distributed transactions move from ACID to CAP/BASE

ACID is the traditional database design concept, pursuing a strong-consistency model. The ACID model of relational databases is high consistency + availability, so it is hard to partition; ACID therefore can no longer support microservices, and we return to CAP to seek solutions. But according to the discussion above, the CAP theorem forces either CP or AP. If we pursue consistency of data and ignore availability, that does not work in microservices. If we pursue availability and ignore consistency, then some important data (such as payments and amounts) will be full of loopholes, which is also unacceptable. So we want both consistency and availability.

However, can we compromise somewhat on consistency: pursue eventual consistency instead of strong consistency? This is why BASE theory is introduced. For distributed transactions, BASE proposes an eventual-consistency solution to the CAP dilemma: data is allowed to be inconsistent for a period of time, as long as eventual consistency is guaranteed.

Achieve final consistency

Weak consistency: The system cannot guarantee that subsequent accesses will return updated values. The updated value can only be returned after a number of conditions are met. The period from the start of the update operation until the system guarantees that any observer will always see the updated value is called the inconsistency window.

Final consistency: This is a special form of weak consistency; The storage system guarantees that if there are no new updates to an object, eventually all accesses will return the last updated value of the object.

The BASE model

The BASE model is the opposite of the traditional ACID model. Unlike ACID, BASE emphasizes the sacrifice of high consistency to achieve availability, allowing data to be inconsistent over a period of time as long as the final consistency is guaranteed.

The BASE model is the anti-ACID model, completely different from ACID; it sacrifices high consistency to obtain availability and reliability:

  • Basically Available: the system tolerates partition failures (e.g. a sharded database continues to serve from the healthy shards)
  • Soft state: state may be out of sync for a period of time while updates propagate asynchronously
  • Eventually consistent: the system becomes consistent eventually, rather than being consistent at all times

Distributed transaction

In distributed systems, there are only a handful of solutions for implementing distributed transactions. The schemes differ, but they all follow BASE theory, the eventual-consistency model:

  • Two-phase commit (2PC)
  • Compensating transaction (TCC)
  • Local message table
  • MQ transaction messages

Two-phase commit (2PC)

In fact, databases support XA transactions, whose underlying principle is two-phase commit; but XA is rarely used in real Internet systems.

There are two stages in the XA protocol:

  1. The transaction manager asks each database involved in the transaction to precommit the operation and report whether it can commit.
  2. The transaction manager then asks each database to commit or roll back the data.
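The two phases can be sketched as a toy coordinator. This is an illustration of the protocol shape only (class and function names are invented); real resource managers write redo/undo logs, hold locks while prepared, and handle coordinator crashes.

```python
class Participant:
    """One database / resource manager in a toy two-phase commit."""
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "idle"

    def prepare(self):
        # Phase 1: precommit and report whether we are able to commit.
        self.state = "prepared" if self.can_commit else "abort"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"

def two_phase_commit(participants):
    # Phase 1: the transaction manager collects votes from every participant.
    if all(p.prepare() for p in participants):
        # Phase 2a: unanimous yes -> tell everyone to commit.
        for p in participants:
            p.commit()
        return "committed"
    # Phase 2b: any no -> tell everyone to roll back.
    for p in participants:
        p.rollback()
    return "rolled_back"

ok = two_phase_commit([Participant("orders"), Participant("stock")])
bad = two_phase_commit([Participant("orders"),
                        Participant("stock", can_commit=False)])
```

Notice that between `prepare()` and `commit()` every participant just waits with its resources held; that waiting is exactly the synchronous-blocking drawback discussed next.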

Why is unmodified two-phase commit so rarely used in Internet systems? Its biggest disadvantage is synchronous blocking: after the resources are prepared, they stay locked in the resource manager until the commit completes and the resources are released. In today’s high-concurrency, big-data Internet, two-phase commit cannot keep up.

In addition, although the two-phase commit protocol is designed for strong consistency of distributed data, there is still the possibility of data inconsistency. For example:

In the second phase, if the coordinator sends the transaction Commit notification but, due to network problems, only some participants receive it and perform the Commit, the remaining participants stay blocked because they never received the notification, and the data becomes inconsistent.

Compensation Transaction (TCC)

TCC is a service-oriented two-phase model. Each business service must implement Try, Confirm and Cancel methods; these three methods roughly correspond to Lock, Commit and Rollback in a SQL transaction.

Compared to two-phase commit, TCC addresses several issues:

Synchronous blocking: a timeout mechanism is introduced, with compensation after timeout, so the entire resource is not locked as in two-phase commit; resources are expressed in business-logic form, with smaller granularity. And because of the compensation mechanism, the business activity manager can control the process to ensure data consistency.

1) Try stage

Try is a preliminary operation that performs initial verification. It is responsible for checking all services and reserving business resources.

2) Confirm stage

Confirm is the confirmation operation carried out after the Try phase completes. It must be idempotent; if Confirm fails, the transaction coordinator keeps retrying it until it succeeds.

3) Cancel stage

Cancel frees the resources reserved in the Try phase when Try fails. Like Confirm, it must be idempotent and may likewise be retried continuously.

An example of placing an order and generating order withholding inventory:

Let’s see how TCC is added to our process of placing an order and deducting inventory.

During Try, the inventory service reserves N units of inventory for this order, and the order service creates an “unconfirmed” order; the two reserved resources are created together. During Confirm, the resources reserved in Try are consumed. By the TCC transaction mechanism, if the Try phase managed to reserve its resources normally, then Confirm must be able to complete the commit.

If the Try phase fails, the Cancel operation is invoked to release the resources reserved during Try.
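A hypothetical sketch of Try/Confirm/Cancel for the order + inventory example. All class and method names here are invented for illustration; real TCC frameworks add idempotency keys, persistence, and a transaction coordinator that drives the retries.

```python
class InventoryService:
    def __init__(self, stock):
        self.stock, self.reserved = stock, 0

    def try_reserve(self, n):
        # Try: check and reserve the resource, without consuming it yet.
        if self.stock - self.reserved < n:
            return False
        self.reserved += n
        return True

    def confirm(self, n):
        # Confirm: consume what Try reserved (must be idempotent in real code).
        self.stock -= n
        self.reserved -= n

    def cancel(self, n):
        # Cancel: release the Try-phase reservation (also idempotent).
        self.reserved -= n

def place_order(inventory, n):
    # Try phase: reserve stock and create an "unconfirmed" order together.
    if not inventory.try_reserve(n):
        return "failed"          # nothing reserved, nothing to Cancel
    order = {"qty": n, "status": "unconfirmed"}
    # Confirm phase: since Try succeeded, Confirm must be able to complete.
    inventory.confirm(n)
    order["status"] = "confirmed"
    return order["status"]

inv = InventoryService(stock=10)
ok = place_order(inv, 3)         # confirmed; stock drops from 10 to 7
too_many = place_order(inv, 99)  # Try cannot reserve, so the order fails
```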

This section is not about how TCC transactions are implemented, but about how distributed transactions apply CAP + BASE theory. For an implementation, refer to: github.com/changmingxi…

Local message table

The local message table solution was originally proposed by eBay; eBay’s complete write-up is at queue.acm.org/detail.cfm?…

Local message table is the most widely used implementation in the industry. Its core idea is to split distributed transactions into local transactions for processing.

As with local message queues in general, the core is to turn one large transaction into small transactions, illustrated again with the example of placing an order and deducting inventory:

  1. When we create the order, we also write a record into a local message table; the order creation and the “deduct inventory” message are written in the same local transaction (relying on the database’s local transaction for consistency).
  2. A scheduled task polls the local message table, scans for unsent messages, and sends them to the inventory service. When the inventory service receives a message, it deducts the inventory, writes to its own transaction table on its side, and updates the message status.
  3. The inventory service notifies the order service, via a scheduled task or directly, and the order service updates the status in the local message table.

It is important to note that some scan-and-send tasks will fail and the messages will be resent, so the receiving interface must be idempotent.
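The three steps above can be sketched as a toy outbox. This is an illustration only: the lists stand in for database tables, and in practice the message table lives in the same database as the order so that step 1 really is one local transaction.

```python
# "Tables": orders, the local message (outbox) table, and remote inventory.
orders, message_table, inventory = [], [], {"sku-1": 100}
handled = set()   # inventory side's dedup set, making redelivery safe

def create_order(sku, qty):
    # Step 1: the order and the outbox message are written in ONE local
    # transaction (simulated here by two appends that succeed together).
    orders.append({"sku": sku, "qty": qty})
    message_table.append({"id": len(message_table), "sku": sku,
                          "qty": qty, "status": "pending"})

def handle_stock_message(msg):
    # Inventory service: deduplicate by message id (idempotent receiver).
    if msg["id"] in handled:
        return
    handled.add(msg["id"])
    inventory[msg["sku"]] -= msg["qty"]

def poll_and_send():
    # Step 2: the scheduled task scans the table and delivers unsent messages;
    # step 3: on acknowledgement, the message status is updated.
    for msg in message_table:
        if msg["status"] == "pending":
            handle_stock_message(msg)   # may be retried, hence idempotency
            msg["status"] = "done"

create_order("sku-1", 5)
poll_and_send()
poll_and_send()   # a second scan finds nothing pending: no double deduction
```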

The local message table follows BASE theory and the eventual-consistency model; it suits situations where consistency requirements are not high.

The MQ transaction

RocketMQ officially announced support for distributed transactions in version 4.3. Before choosing RocketMQ for distributed transactions, make sure you are on version 4.3 or later.

Distributed transactions are implemented in RocketMQ and are essentially an encapsulation of the local message table, moving the local message table inside MQ.

As an asynchronous, assured transaction, the two transaction branches are asynchronously decoupled via MQ. The design of RocketMQ transaction messages also draws on two-phase commit theory; the overall interaction process is shown in the figure below:

MQ transactions are a layer of encapsulation over the local message table, moving the table inside MQ, so they are likewise based on BASE theory and the eventual-consistency model, suitable for transactions whose strong-consistency requirements are not that high. At the same time, MQ transactions make the whole process asynchronous, which also suits high concurrency very well.

The CP and AP thinking behind RocketMQ’s asynchronous/synchronous flush and asynchronous/synchronous replication

Although synchronous/asynchronous flush and synchronous/asynchronous replication are not direct applications of CAP, availability and consistency are still involved when configuring them.

Synchronous or asynchronous disk flushing

RocketMQ messages can be persisted: data is persisted to disk, and to improve performance RocketMQ keeps disk writes as sequential as possible. When a Producer writes messages to RocketMQ, there are two ways the messages reach the disk:

  1. Asynchronous flush: the message is quickly written to the in-memory pagecache and the write-success status is returned immediately; when the messages accumulated in memory reach a threshold, a batched disk write is triggered. This ensures high throughput, but messages that have not yet been saved to disk can be lost.
  2. Synchronous flush: the message is quickly written to the pagecache, the flush thread is notified immediately, and only after the flush completes is the waiting thread woken up and the write-success status returned.

Synchronous replication/asynchronous replication

A broker group has Master and Slave. Messages need to be copied from Master to Slave, so there are both synchronous and asynchronous replication modes.

  1. Synchronous replication: the write-success status is returned to the client only after both Master and Slave have written successfully.
  2. Asynchronous replication: the Master returns the write-success status to the client as soon as its own write succeeds.

The advantage of asynchronous replication is that it can improve the response speed, but at the expense of consistency, generally the algorithms implementing this protocol need to add additional compensation mechanism. Synchronous replication has the advantage of ensuring consistency (typically via a two-phase commit protocol), but it is expensive, has poor availability (see CAP theorem), and introduces more problems such as conflicts and deadlocks. It is worth noting that the Lazy+Primary/Copy replication protocol is very useful in a real production environment.

The RocketMQ configuration must be chosen according to the business scenario, for both the flush mode and the master/slave replication mode. Be especially careful with SYNC_FLUSH: its frequent disk writes reduce performance significantly. In general, set both Master and Slave to ASYNC_FLUSH for flushing and set the Master to SYNC_MASTER for replication, so that no data is lost even if one machine fails.
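That recommended combination can be expressed in the broker configuration. The property names `brokerRole` and `flushDiskType` follow RocketMQ’s broker configuration; the surrounding layout is an illustrative sketch, not a complete production config.

```properties
# Master broker: asynchronous flush for throughput, synchronous replication
# so a slave holds the data before the client sees write success.
brokerRole=SYNC_MASTER
flushDiskType=ASYNC_FLUSH

# Slave broker (its own broker.conf):
# brokerRole=SLAVE
# flushDiskType=ASYNC_FLUSH
```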

Conclusion

In microservice construction, you can never escape the CAP theorem: the network is not stable, hardware always ages, and software will have bugs, so partition tolerance is an unavoidable proposition in microservices. As long as a system is distributed, as long as it is a cluster, it faces the choice between AP and CP. But if you are greedy and want both consistency and availability, there is only one way: compromise on consistency by introducing BASE theory and achieving eventual consistency to the extent the business allows.

The choice between AP and CP really depends on understanding of the business. For example, for money and inventory the CP model is preferred, while for community posting the AP model is preferred. Frankly speaking, it is a process of choice and compromise grounded in understanding of the business.