Every year on November 11, Singles Day traffic forces Ant Financial to break through the limits of its existing technology. The payment peak on Singles Day 2010 was 20,000 transactions per minute; by Singles Day 2017 that figure had become 256,000 transactions per second.


The payment peak on Singles Day 2018 reached 480,000 transactions per second, and 2019 reached 544,000 transactions per second, setting a new record at 1,360 times the peak of the first event in 2009.


Behind such a high payment TPS, beyond icing-on-the-cake application-level optimizations such as peak shaving, the most satisfying and most substantial approach is unitized deployment built on top of database and table sharding, which Ant's engineers call LDC (Logical Data Center).


Instead of code-level analysis, this article tries to explain the most interesting principles behind it in the simplest possible terms.


I think anyone who cares about distributed system design has struggled with the following questions:

  • What is the most satisfying design behind Alipay's massive payment volume? In other words, what is the key design that makes Alipay's high TPS possible?

  • What is LDC? How does LDC enable multi-site active-active deployment and remote disaster recovery?

  • What exactly is the CAP Spell? How do I understand P?

  • What is split brain? What does CAP have to do with it?

  • What is PAXOS and what problems does it solve?

  • What is the relationship between PAXOS and CAP? Can PAXOS escape the CAP curse?

  • Can OceanBase escape the CAP curse?


If you are interested in these questions, read on: the argument here is bare-bones, shuns arcane vocabulary, and follows a straightforward line of logic.


LDC and unitization


The Logical Data Center (LDC) was proposed in contrast to the traditional Internet Data Center (IDC). The central idea of a logical data center is that, no matter how the physical structure is distributed, the entire data center is logically coordinated and unified.


Implicit in this is a strong architectural demand: the challenge for a distributed system is to act as a cooperating whole (availability, partition tolerance) while remaining unified (consistency).


Unitization is the inevitable direction for large Internet systems. Let's use the most familiar example to illustrate it.


We always say that TPS is hard to improve. Indeed, for any single Internet company (Taobao, Ctrip, Sina, and so on), transaction TPS tops out at around one hundred thousand (on average), and it is hard to push higher.


That is because the bottleneck at the database storage layer cannot be bypassed no matter how many servers you add horizontally; yet from the perspective of the Internet as a whole, worldwide e-commerce transactions easily reach TPS in the hundreds of millions.


This contrast raises a question: why can the combined TPS of many Internet companies be so enormous, serving a staggering number of users, while it is so hard to raise the TPS of any single company?



At its core, each Internet company is a large, independent unit that serves its own users independently.



This is the basic characteristic of unitization: any Internet company that wants to expand its service capacity exponentially will inevitably take the road of unitization.


The essence of it is divide-and-conquer, where we divide a large number of users into several parts and simultaneously make multiple copies of the system, each deployed independently, each serving a specific group of users.



Take Taobao as an example: after such a split there would be many Taobao systems, each serving a different group of users. If each system does one hundred thousand TPS, then N such systems can easily do N × 100,000 TPS.
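As a minimal, illustrative sketch of this divide-and-conquer idea (the unit count, the rule of using the last two digits of the user ID, and the per-unit TPS figure are assumptions for illustration, not Ant's real configuration):

    # Each unit is a full copy of the system that serves only its own slice of users.
    UNIT_COUNT = 4                       # assumed number of unit copies
    PER_UNIT_TPS = 100_000               # assumed capacity of a single unit

    def unit_for_user(user_id: str) -> int:
        """Pick a unit from the last two digits of the user ID."""
        return int(user_id[-2:]) % UNIT_COUNT

    # Aggregate capacity grows linearly with the number of units.
    total_tps = UNIT_COUNT * PER_UNIT_TPS    # 4 units -> roughly 400,000 TPS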



The key to realizing LDC lies in the architecture of the unitized system. Within Ant, therefore, LDC and unitization are inseparable, which puzzles many people: the two seem unrelated, yet it is precisely the unitized design that makes LDC possible.


Summary: the biggest pain point solved by splitting databases and tables (sharding) is the single-point bottleneck of the database, which is determined by the nature of modern binary data storage systems (that is, I/O speed).


Unitized deployment is simply one way of deploying the system once databases and tables have been split, and this deployment mode has great advantages for disaster recovery.



Evolution of system architecture


Almost every Internet company of any size iterates on its system architecture, and the evolutionary path is much the same.


At the earliest stage, in order to launch the business quickly, all functions are put into a single application, with a system architecture like the figure below:


Such an architecture has obvious problems: a single machine is a single point of failure, its capacity and performance are very limited, and relying on minicomputers or midrange machines brings a lot of waste.



As the business grew, this became a major conflict, so the engineers adopted the following architecture:

This was the company's first taste of distribution: horizontally scaling a single application combined the computing power of many commodity machines and could beat similarly priced minicomputers and midrange machines.


Over time the application servers' CPUs were no longer the problem, but many requests were still slow because of the performance bottleneck at the single-point database.


So the programmers decided to adopt a master-slave database cluster, as shown below:

Most reads can go directly to the slave libraries, relieving the pressure on the master. However, this still does not solve the write bottleneck: writes must be handled by the master, and as business volume grows again, writes become the next bottleneck that urgently needs to be dealt with.


At this point, the database-and-table sharding scheme appeared:


Sharding can split not only different libraries but also the same table: spreading the rows of one table across multiple tables or libraries is called horizontal splitting.



Placing tables with different functions into different libraries generally corresponds to vertical splitting (splitting by business function), and in this case also to micro-servitization.
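A minimal sketch of the two kinds of split (the database names, shard counts, and the rule of using the last two digits of the user ID are assumptions for illustration):

    # Vertical split: tables with different business functions live in
    # different libraries (an order library, an account library, ...).
    # Horizontal split: the rows of one logical table are spread across
    # many physical databases/tables by user ID.
    DB_COUNT = 4           # assumed number of order databases
    TABLES_PER_DB = 25     # assumed number of order tables per database

    def locate_order_table(user_id: str) -> tuple[str, str]:
        shard = int(user_id[-2:])                  # 00..99
        db = f"order_db_{shard % DB_COUNT}"        # which database holds the row
        table = f"order_{shard // DB_COUNT:02d}"   # which table inside that database
        return db, table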



This approach is powerful enough to support visits at the level of tens of thousands of TPS or more. However, as the application itself scales out, the number of connections to each database grows dramatically, and the database's own resources become the bottleneck.


The essence of the problem is that the full data set is shared indiscriminately by all application resources. For example, user A's request may be dispatched by the load balancer to any application server, so every application server must connect to user A's shard, and the total number of database connections becomes a Cartesian product.


In essence, the resource isolation of this model is incomplete. To solve the problem, the logic that identifies which shard a user lives in must be moved up the stack, from the database layer to the routing/gateway layer.


That way, every request from a user in group A that enters application server group A goes to db-A, so that group no longer needs to connect to any other database instance, and a prototype of the unit is born.
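To see why lifting the "which shard owns this user" decision into the routing layer shrinks the connection count, here is a back-of-the-envelope sketch (all numbers are made up for illustration):

    APP_SERVERS = 100      # assumed application instances
    DB_SHARDS = 100        # assumed database shards
    POOL_SIZE = 10         # assumed connections per pool

    # Before: any app server may serve any user, so every app server keeps
    # a pool to every shard, which is the Cartesian product mentioned above.
    connections_before = APP_SERVERS * DB_SHARDS * POOL_SIZE             # 100,000

    # After: app servers are grouped into units and the gateway sends a user
    # only to the unit owning his shard, so each server connects only to the
    # shards of its own unit.
    UNITS = 10
    connections_after = APP_SERVERS * (DB_SHARDS // UNITS) * POOL_SIZE   # 10,000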


If you think about it, applications do interact with each other (for example, A transfers money to B), which means an application no longer needs to connect to other shards' databases, but it does need to call other applications.


Common RPC frameworks such as Dubbo run over TCP/IP, so this effectively replaces the former links to remote databases with links between applications.


Why does that relieve the bottleneck? First, with a reasonable design, the volume of data exchanged between applications is not large. Second, calls between applications can share TCP connections: for example, a socket connection from A to B can be reused by multiple threads in A.


Ordinary databases such as MySQL cannot multiplex a connection this way, which is why MySQL needs connection pools.


As shown in the figure above, once we package the whole system into units, each kind of data is destined to be digested inside its own unit from the start. Because of this complete isolation, an entire unit can be deployed to any machine room while still preserving logical unity.


The following figure shows the "three sites, five machine rooms" deployment mode:


Practice of Ant's unitized architecture


Alipay is probably the largest payment tool in China. Its payment TPS during Singles Day and similar events reaches the hundreds of thousands, and the number may grow even larger, which means that for capacity reasons alone Ant's unitized architecture must move from a single machine room to multiple machine rooms.


On the other hand, remote disaster recovery (DR) also requires the IDC machine rooms to be deployed in different regions. Overall, Alipay likewise uses "three sites, five centers" (IDC machine rooms) to ensure system availability.


Unlike the generic description above, Alipay divides its units into three categories (together known as the CRG architecture):


  • RZone (Region Zone): a literal translation might be confusing. An RZone is the smallest deployment unit of all the business systems whose databases and tables can be sharded. Each RZone, together with the databases it connects to, can stand on its own and keep the business running.

  • GZone (Global Zone): the global unit, of which there is only one. It hosts data and services that cannot be sharded, such as system configuration.

    In practice, GZone can also be deployed at other sites, but only for disaster recovery: only one GZone provides global service at any time. GZone is generally depended upon by RZone, mostly for reads.

  • CZone (City Zone): as the name implies, this unit is deployed per city. It also hosts data and services that cannot be sharded, such as user accounts and customer information. In theory, CZone is accessed by RZone at a much higher rate than GZone is.

    CZone is an optimization of the GZone scenario: data and services in GZone that exhibit a "write/read time lag" are replicated into each city, so that RZone only needs to access the local CZone rather than the remote GZone.


The "write/read time lag" is a statistical observation from practice by Ant's architects: in most cases, enough time passes between the moment a piece of data is written and the moment it is first read.


Examples of this are common in daily life. We may make our first deposit long after opening a bank card; after registering a Weibo account, it may be a long time before we post our first message; after installing Taobao and creating an account, we may browse for several minutes before placing an order.


Of course, the time differences in these examples far exceed the system's synchronization time. Generally, cross-region replication latency is under 100 ms, so as long as a piece of data is not read within roughly 100 ms of being written, that data and its services are suitable for CZone.
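Stated as a simple predicate (the 100 ms figure is the replication delay cited above; everything else is an illustrative assumption):

    REPLICATION_DELAY_MS = 100    # typical cross-region synchronization delay

    def czone_eligible(typical_write_to_first_read_ms: float) -> bool:
        # Data whose first read comes comfortably after replication has
        # caught up can safely be served from the local CZone copy.
        return typical_write_to_first_read_ms > REPLICATION_DELAY_MS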


At this point you may ask: why exactly these three kinds of units? Behind them lie data of different natures, and a service is just a set of operations on data.


Let's explain Alipay's CRG architecture in terms of those different natures of data. Today, almost every Internet company shards its databases and tables by user ID.


The data of the whole system around users can be divided into the following two categories:


User flow data: typical user orders, user comments, user behavior records, etc.


These are "flowing" data generated by user behavior, and they have natural user isolation: user B's order list, for example, can never be seen in user A's app. So this kind of data is very well suited to independent deployment of services after databases and tables have been sharded.


Data shared between users: this type further splits into two sub-categories. One is per-user data that may nonetheless be read by all users, such as account numbers and personal blogs.


For example, when A transfers money to B or sends B a message, the system must confirm that account B exists; or A may simply want to read B's personal blog.


The other is user-independent data, such as commodities, system configuration (exchange rates, promotion policies), and financial statistics: data that is hard to tie to any particular group of users and may involve all of them.


For commodity data, for instance, even if it were sharded by the commodity's location (which would require two-dimensional sharding), a user in Shanghai would still need to access commodities located in Hangzhou.


That creates cross-site, cross-zone access, which still falls short of the ideal of unitization, and two-dimensional sharding adds complexity to the operation and maintenance of the whole LDC.


Note: there are other classifications both online and inside Alipay, such as flow type versus state type, and sometimes three categories: flow, state, and configuration.


Personally, I find that although these schemes try to abstract data at a higher level, their boundaries are actually very blurry and can be counterproductive.


By direct analogy, we can easily split the services for the two data categories above between RZone and GZone: RZone hosts the sharded services responsible for a fixed slice of users, while GZone hosts the services for common data shared among users.


So far everything looks perfect, and this is the mainstream story of unitization. Comparing it with Alipay's CRG architecture, however, we immediately notice that the C (City Zone) is missing; CZone is indeed Ant's own innovation in unitization practice.



Now consider GZone. The reason GZone can only serve from a single location is that its data must be shared by all users and cannot be sharded, while multi-site deployment would introduce inconsistencies caused by cross-region latency.


Take a real-time risk-control system as an example: if it were deployed in several locations and an RZone read directly from its local copy, it could easily read stale risk-control state, which is dangerous.


At this point Ant's architects asked themselves a question: is it really true that no data can tolerate latency? That question opened the door to a new world. Analyzing RZone's existing business, they found that in 80% or more of cases, updated data does not need to be read immediately.


For this kind of data, we can let the RZone services in each region read it locally. To give RZones that local access, the Ant architects designed the CZone.


In the CZone scenario, write requests generally go through GZone into the repository that holds the common data, are then synchronized to the whole OB cluster, and reads are served by the local CZone. Alipay's membership service works this way.
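A minimal sketch of this "write globally, read locally" pattern (the class and method names are hypothetical; in reality the replication is done by the OB cluster itself):

    class MemberProfileService:
        """Hypothetical CZone access pattern for shared, read-mostly data."""

        def __init__(self, gzone_db, local_czone_db):
            self.gzone_db = gzone_db          # the single global writable copy
            self.czone_db = local_czone_db    # read replica in this city

        def update_profile(self, user_id, profile):
            # Writes funnel through GZone; replication to every CZone is
            # asynchronous, which the write/read time lag makes acceptable.
            self.gzone_db.write(user_id, profile)

        def read_profile(self, user_id):
            # Reads stay inside the city: no cross-region round trip.
            return self.czone_db.read(user_id)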


Even though the architects designed a clean CRG model, in Ant's actual practice there are still systems with unreasonable CRG classification; in particular, failing to distinguish C from G is common.


Alipay's unitized multi-site active-active and disaster recovery


A brief introduction to traffic-steering technology


Once the system is unitized, multi-site deployment becomes natural. For example, the two units in Shanghai serve the users whose IDs fall in [00-19] and [40-59].


The two units in Hangzhou serve the users whose IDs fall in [20-39] and [60-79], so Shanghai and Hangzhou are active-active across regions.


A basic requirement of Alipay's unitization is that every unit must be capable of serving all users; that is, which users a given unit serves can be configured dynamically. So these remote active-active units also act as backups for one another.


A side note: the terms hot standby and cold standby are used very loosely. Originally, cold backup meant backing up a database while it is shut down (also called offline backup); backing it up while it continues to accept modifications is hot backup (also called online backup).


At some point, in primary/standby systems, "cold standby" came to mean that the standby machine is powered off and is only started after the primary goes down.


Correspondingly, "hot standby" came to mean that the standby is running but carries no traffic, and once the primary goes down, traffic switches to the standby automatically. This article does not use this second interpretation, which feels a bit loose.



To make "which users each unit serves" configurable, Alipay requires the unit management system to support configurable mappings from traffic to unit and from unit to DB.


As shown below:

Spanner is Ant's reverse-proxy gateway built on Nginx. It is easy to understand that some requests should be forwarded at the reverse-proxy layer to the Spanner of another IDC without ever entering the back-end services, as shown by arrow 2 in the figure.



Requests that should be processed in this IDC are mapped directly to the corresponding RZ, as shown by arrow 1.


Once inside the back-end service, if the request only touches the user's own flow data, it normally needs no further re-routing.


However, in some scenarios a request from user A involves access to user B's data. For example, when user A transfers money to user B, after deducting from A the system calls the accounting service to increase user B's balance.


At that point re-routing is involved, with two possible outcomes: forwarding to another IDC (arrow 3) or forwarding to another RZone within the same IDC (arrow 4).


Which DB partitions an RZone may access is pre-configured. In the figure above, the relationship between RZs and DB partitions is:


RZ0* –> a

RZ1* –> b

RZ2* –> c

RZ3* –> d


Let's walk through the whole traffic-steering process with an example. Assume user C belongs to data partition c, and user C visits cashier.alipay.com (an arbitrarily made-up domain) from Hangzhou.


① Alipay currently routes traffic by region by default; the concrete carrier is the self-developed Global Server Load Balancing (GSLB):


https://developer.alipay.com/article/1889


Based on the requester's IP address, GSLB automatically resolves cashier.alipay.com to the IP address of the Hangzhou IDC (or jumps to that IDC's domain name).


Note that most DNS providers' records are configured by hand, whereas GSLB is a dynamically configured domain name system. Popular products of the same kind exist online, such as Peanut Shell; anyone who has set up a private server will be familiar with them.


② The request reaches the Spanner cluster of IDC-1. Spanner reads the routing configuration from memory, sees that user C's RZ3* is not in this IDC, and forwards the request directly to IDC-2 for processing.


③ After entering IDC-2, the request is allocated to RZ3B according to the traffic-matching rules.


④ RZ3B receives the request and accesses data partition c.


⑤ After processing, the response returns along the original route.
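Putting steps ① to ⑤ together, the routing decision can be sketched as follows (the tables mirror the made-up mappings of this example; the placement of RZ0/RZ1 in IDC-1 and RZ2/RZ3 in IDC-2 is an assumption consistent with the disaster-recovery example below, and the function and variable names are hypothetical):

    RZ_LIST = ["RZ0", "RZ1", "RZ2", "RZ3"]           # uid [00-24]->RZ0, [25-49]->RZ1, ...
    RZ_TO_IDC = {"RZ0": "IDC-1", "RZ1": "IDC-1",
                 "RZ2": "IDC-2", "RZ3": "IDC-2"}     # assumed placement
    RZ_TO_PARTITION = {"RZ0": "a", "RZ1": "b", "RZ2": "c", "RZ3": "d"}

    def route(user_id: str, local_idc: str):
        prefix = int(user_id[-2:])                   # last two digits of the user ID
        rz = RZ_LIST[prefix // 25]                   # which RZone owns this user
        if RZ_TO_IDC[rz] != local_idc:
            return ("forward", RZ_TO_IDC[rz])        # arrows 2/3: hand off to the other IDC
        return ("handle", rz, RZ_TO_PARTITION[rz])   # arrows 1/4: process in this IDC

    # A request landing on the Hangzhou Spanner (assumed to be IDC-1) for a
    # user whose ID ends in 88 is forwarded to IDC-2, where RZ3 lives.
    print(route("2088000000000088", "IDC-1"))        # ('forward', 'IDC-2')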



You can see the problem: if user C keeps sending requests like this, won't every call and every response have to cross regions?


Indeed there is such a problem, and for it Alipay's architects decided to push the decision logic all the way out to the user's client.


For example, each IDC machine room has its own domain name (not necessarily the real ones):


  • IDC-1 corresponds to cashieridc-1.alipay.com

  • IDC-2 corresponds to cashieridc-2.alipay.com


So when the response returns from IDC-1, the front end is redirected to cashieridc-2.alipay.com (for an app, the domain name of the REST interface is simply replaced), and all subsequent user actions happen under that domain name, avoiding the latency of detouring through IDC-1.


Alipay disaster recovery mechanism


Traffic switching is the basis and prerequisite of a disaster-recovery switchover. After a disaster, the usual approach is to steer traffic away from the affected unit to a healthy one; this traffic-switching process is commonly called cutting traffic.


There are three levels of disaster recovery under Alipay's LDC architecture:


  • DR between units in the same machine room

  • DR between machine rooms in the same city

  • DR between machine rooms in different regions


DR between units in the same machine room: of the three, this kind of failure is the most likely (though still rare in absolute terms). The smallest disaster for an LDC is a single unit going down for some reason (a partly unplugged network cable, aging wiring, human error).


As the figure in the previous section shows, each RZ group has two units, A and B, which provide same-room DR for each other and work in active-active mode.


Normally unit A and unit B share all the requests; once unit A dies, unit B automatically takes over unit A's share of the traffic. This DR scheme is the default.


DR between machine rooms in the same city: this kind of disaster is less likely. It is generally caused by the machine room's network cables being cut, or by an operational mistake by the machine-room maintenance staff.



In this case, a traffic-steering (traffic-cutting) plan must be drawn up manually. Here is an example of the process, using the two IDC machine rooms in Shanghai shown in the picture below.

The whole traffic-steering configuration consists of two steps. First, modify the data-partition access configuration of the RZones in the failed machine room.


Suppose our plan is that RZ2 and RZ3 in machine room IDC-2 take over RZ0 and RZ1 of IDC-1 respectively.


So the first step is to take access to partition a and partition b away from RZ0 and RZ1 and grant it to RZ2 and RZ3.


That is, change (the initial mapping shown above):


RZ0* –> a

RZ1* –> b

RZ2* –> c

RZ3* –> d


To:


RZ0* –> /

RZ1* –> /

RZ2* –> a

RZ2* –> c

RZ3* –> b

RZ3* –> d


Then modify the mapping between user IDs and RZs. Assume it was previously:


[00-24] –> RZ0A(50%),RZ0B(50%)

[25-49] –> RZ1A(50%),RZ1B(50%)

[50-74] –> RZ2A(50%),RZ2B(50%)

[75-99] –> RZ3A(50%),RZ3B(50%)


According to the DR plan, the mapping is changed to:


[00-24] –> RZ2A(50%),RZ2B(50%)

[25-49] –> RZ3A(50%),RZ3B(50%)

[50-74] –> RZ2A(50%),RZ2B(50%)

[75-99] –> RZ3A(50%),RZ3B(50%)


After this change, all traffic goes to IDC-2. During the switch, some users whose requests had already been sent to IDC-1 will see a failure message and have to retry.


In practice, none of this is done only after a disaster happens. The whole procedure is prepared in advance as a contingency-plan configuration and pushed to every traffic-steering component (integrated into all services and into Spanner).


Here you may wonder: why cut the database mapping first and the traffic second? Because if traffic were cut first, a flood of doomed requests would hit the newly designated healthy units while their database mappings were not yet ready, hurting the stability of the system.
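A minimal sketch of applying the two-step plan in the required order (the router object and its methods are hypothetical; the dictionaries mirror the made-up mappings of this example):

    def apply_dr_plan(router):
        # Step 1: cut the database mapping first, so that RZ2/RZ3 can reach
        # partitions a and b in addition to their own c and d.
        router.set_rz_partitions({"RZ0": [], "RZ1": [],
                                  "RZ2": ["a", "c"], "RZ3": ["b", "d"]})
        # Step 2: only after the new DB mapping is live, cut the traffic, so
        # no request reaches a unit whose database mapping is not ready yet.
        router.set_uid_ranges({(0, 24): ["RZ2A", "RZ2B"],
                               (25, 49): ["RZ3A", "RZ3B"],
                               (50, 74): ["RZ2A", "RZ2B"],
                               (75, 99): ["RZ3A", "RZ3B"]})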



DR between machine rooms in different regions: this is essentially the same as same-city inter-room DR (which is itself an advantage of unitization), so it is not described again.


CAP analysis of Ant's unitized architecture


Reviewing CAP


① The definition of CAP

The CAP principle says that any distributed system can satisfy at most two of the three properties at the same time, never all three.


A distributed system, in plain language, is one where a job that used to be done by one person is now divided among several people.



Let’s briefly review the meanings of each dimension of CAP:


Consistency is simple to understand: the same data is consistent on every node at every moment.


This requires any update to be atomic, that is, to succeed everywhere or fail everywhere. Imagine how inefficient it would be to use distributed transactions to guarantee such atomicity across all systems.


Availability seems easy to understand, yet few explanations make it really clear. I prefer to define availability as: the system can be read and written at any time.


For example, if we use a transaction to lock all nodes for a write, then as soon as one node becomes unavailable, the whole system becomes unavailable.


For a sharded NoSQL middleware cluster (Redis, Memcached), once one shard goes down the system's data is incomplete, and reads hitting the broken shard get no response; in other words, the system is not fully available.


Note that choosing CP for a distributed system does not mean availability disappears entirely; it means availability is no longer guaranteed.


To strengthen the availability guarantee, this kind of middleware often offers a "sharded cluster + replica set" scheme.


Partition tolerance is what many articles fail to explain. P is not an independent property like C and A; it has to be discussed relative to them.


A common phrasing is: "unless the entire network is down, the system can work at any time", meaning that a small network outage or a node failure does not affect the system's C and A.


I find that explanation somewhat confusing, so I prefer to put it this way: availability and consistency can still be guaranteed when the network between nodes breaks (that is, when a network partition occurs).


From a personal perspective, partition tolerance can be split into "availability-oriented partition tolerance" and "consistency-oriented partition tolerance".


Whether a partition affects availability hinges on whether completing a transaction requires all nodes to communicate with each other; if not, availability is unaffected.


Fortunately, it is rare to design a distributed system in which a transaction requires every node to cooperate; MySQL with fully synchronous replication is a typical example of one that does.


Whether a partition affects consistency hinges on whether there is a scheme to preserve consistency when split brain occurs, and this is fatal for master-slave replicated databases (MySQL, SQL Server).


Once the network partitions and split brain occurs, the system ends up with two values for the same piece of data, and neither side thinks it is the wrong one.


Note that within a single LAN the probability of a network partition is very low, which is why the databases we know best (MySQL, SQL Server, etc.) do not bother with P.


The figure below shows the classic relationship among C, A, and P:


One more point: a distributed system finds it hard to satisfy all of CAP only when it must handle both reads and writes. If only reads are considered, CAP is easy to satisfy in full.


For example, a calculator service that accepts expression requests, returns results, and is scaled horizontally obviously has no consistency problem, does not fear network partitions, and has rock-solid availability, so it satisfies all of CAP.


② CAP analysis method

Let's first talk about the relationship between CA and P. If you do not care about P, a system can easily achieve CA.

P is not a standalone property; it expresses whether the target distributed system is fault-tolerant to network partitions.


If it is, then the system has P, and the next question is whether it chooses A or C when a partition occurs. So when analyzing CAP, it is best to first determine whether the system tolerates partitions at all.


Here is my personal summary of a general procedure for judging which of CAP a distributed system satisfies:


if (a partition is impossible
    or a partition does not affect availability or consistency
    or a partition would matter but is explicitly tolerated)        // the system has P
{
    if (availability is preserved under partition)      return "AP";
    else if (consistency is preserved under partition)  return "CP";
}
else                                                                // the system has no P
{
    if (it has availability -A and consistency -C)      return "AC";
}



Note that once partition tolerance is considered, we no longer analyze availability and consistency in the partition-free case (they are mostly satisfied anyway).


CAP analysis of horizontally scaled applications + a single database instance


Let's go back to the origin of distributed application systems. In the early days each application was a monolith running on one server; once that server went down, the service was unavailable.


On top of that, as the monolith's business logic grew more complex, its hardware requirements kept rising, and ordinary commodity machines could no longer meet the performance and capacity demands.


So break it apart! Back when IBM was still selling minicomputers to enterprises, Alibaba already proposed replacing the minicomputer with clusters of distributed commodity machines.


So we find that the biggest pain point distributed systems solve is the poor availability of a single stand-alone system.


To be highly available, you must go distributed. On an Internet company's growth path, the first encounter with distribution is usually the horizontal scaling of a single application.


That is, multiple instances of the same application are started and connected to the same database (for simplicity, ignore whether the database itself is a single point), as shown below:


Such a system naturally has AP (availability and partition tolerance):


  • On the one hand, it solves the problem of low availability caused by a single point.

  • On the other hand, whether or not the network between these horizontally scaled machines partitions, each server can keep providing service independently, because they do not need to communicate with one another.


However, such a system has no consistency to speak of: just imagine every instance inserting into and updating the database at will (note that transactions are not being discussed here).


So we turn to the database for help, using "database transactions". Since most companies use MySQL-class databases with transactions, the database once again becomes a single point and a bottleneck.


A single point is much like a single machine (master-slave mode is not considered in this example); strictly speaking it is not even distributed. If we insist on a CAP analysis, following the steps above it goes like this:


  • Partition tolerance: first ask whether partitions are tolerated, or whether a partition has any impact. A single MySQL instance cannot form a partition: the whole system is either alive or dead.

  • Availability under partition: if the node goes down, the system becomes unavailable, so availability-oriented partition tolerance is not satisfied.

  • Consistency under partition: the biggest benefit of a single point is that whenever it is available at all, consistency is guaranteed.


So in my view such a system only satisfies CP; it does have A, just not a very good A. This also shows that CAP is not black and white.


The same goes for the often-mentioned BASE (eventual consistency) approach: its C is weak but consistency is eventually reached; BASE chooses to compromise on consistency.


As for whether "distributed application + single-point database" counts as a genuine distributed system, your view may differ from mine. Whether it does is less important than the analysis process itself.


In fact, when we talk about distribution we are hoping for availability through redundancy: multiple live systems where another can take over when one dies, and a single system clearly lacks that kind of high availability.


So in my view, broadly speaking, CAP also applies to single-point, single-machine systems, and a single-machine system is CP.


At this point we also seem to see that for "horizontally scaled service applications + database" systems, the CAP curse lands mainly on the database layer.


Most of these service applications merely perform computation (like calculators) and do not need to collaborate with each other, so the data-consistency problem of all write requests falls to the database layer.


Imagine there were no database layer and the applications themselves had to guarantee data consistency; then the applications would have to handle state synchronization and interaction among themselves. ZooKeeper is a good example of that.


CAP analysis of horizontally scaled applications + a master/slave database cluster


In the previous section we discussed the "multiple applications + single database instance" pattern, which, distributed system or not, leans toward CP.


In reality, engineers quickly find this architecture unreasonable: its availability is simply too low.


As a result, the pattern shown in the figure below has become the architecture most small and medium-sized companies use today:


From the figure we can see that among the three database instances only one is the master library; the others are slave libraries.


To some extent this greatly relieves the "read availability" problem, and such architectures usually add read/write separation to push read availability further. Fortunately, reads account for more than 80% of traffic in most Internet scenarios, so this architecture has remained widely usable for a long time.
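A minimal sketch of read/write separation at the data-access layer (connection handling and failover are omitted; the class is illustrative, not a real driver):

    import random

    class ReadWriteSplitDataSource:
        def __init__(self, master, replicas):
            self.master = master         # the single writable master instance
            self.replicas = replicas     # read-only slave instances

        def execute(self, sql, params=()):
            # All writes go to the master so there is one source of truth.
            return self.master.execute(sql, params)

        def query(self, sql, params=()):
            # Reads (over 80% of traffic in most Internet scenarios) are
            # spread across the slaves, accepting slight replication lag.
            return random.choice(self.replicas).execute(sql, params)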



Write availability is handled by keeping the master alive with HA frameworks such as Keepalived, but if you think about it, this improves availability without adding any write capacity. At least, fortunately, the system no longer becomes unavailable just because one instance dies.


Now that availability is barely acceptable, the CAP analysis goes as follows:

  • Partition tolerance: start with partition tolerance again. The nodes of a master-slave database do communicate with each other; they need heartbeats to ensure there is only one Master.

    However, once a partition occurs, each side may elect its own new Master, causing split brain. Common master-slave databases (MySQL, Oracle, etc.) ship no built-in solution to split brain, so partition tolerance is not addressed.

  • Consistency: leaving partitions aside, consistency is satisfied, because there is only one master library at any time.

  • Availability: leaving partitions aside, availability is guaranteed by the HA mechanism, so availability is clearly satisfied.


So we regard such a system as AC. Now look closer: if there were a way to arbitrate the inconsistent data produced after split brain, could P be satisfied after all?


There are indeed multi-master systems that try to resolve consistency through preset rules. CouchDB, for example, supports multi-master writes through write versioning; the DBA configures arbitration rules (that is, merge rules, such as "the latest timestamp wins") so that conflicts are arbitrated automatically wherever possible, guaranteeing eventual consistency (BASE); only where the automatic rules cannot merge does a human have to decide.


CAP analysis of Ant's unitized LDC architecture


① Conquering partition tolerance

Before analyzing the CAP of Ant's LDC architecture, let's think about partition tolerance first: why do the popular BASE (eventual consistency) architectures choose to give up real-time consistency rather than give up partition tolerance?



Partitions arise in two typical situations:


When a machine goes down and then reboots, it looks as if it had been disconnected for a while, which is indistinguishable from the network being unreachable.


With multi-region deployment, multi-site active-active means data may be written in every location, and occasional cross-region latency spikes (sharp jumps in the latency curve) or network failures between regions create small network partitions.


As mentioned above, if a distributed system is deployed within one LAN (one physical machine room), the probability of a partition is very low; even with a complex topology, partitions rarely occur inside a single room.


Across regions the probability rises sharply, so Ant's "three sites, five centers" must face this problem squarely: partition tolerance cannot be given up.


The same thing can happen between machine rooms of different ISPs (imagine playing DOTA in a team with a friend who is on China Telecom while you are on China Unicom).


To cope with the intermittent disconnects caused by latency spikes in some machine room at some moment, a good distributed system must be able to handle consistency in exactly this situation.


So how does Ant solve it? As discussed above, each unit in an LDC machine room actually consists of two parts: the application servers responsible for business-logic computation, and the databases responsible for data persistence.


Most application servers are like calculators: they are not themselves responsible for write consistency; that task is delegated to the database. So the key to how Ant solves distributed consistency lies in the database.



Readers familiar with Ant can probably guess where this is heading: OceanBase (OB), the first independently developed distributed database in China, which has indeed been gaining a lot of attention.


Before we discuss OB, let's first ask: why not MySQL?


First, as the CAP triangle shows, a MySQL master-slave cluster satisfies AC but not P.


Imagine a MySQL master-slave cluster: when a partition occurs, the slaves inside the affected partition assume the Master has died and elect a Master of their own partition.


When the partition heals, there are two master libraries' worth of data and no way to decide which is correct; the partition has broken consistency. The consequences are severe, and this is one of the reasons Ant preferred to develop OceanBase itself.


So how do you make a distributed system partition-tolerant? As before, let's discuss availability-oriented and consistency-oriented partition tolerance separately:



Guaranteeing availability under partition: the key is not to require all nodes to complete a transaction together. This is simple: just never demand that every node participate in the same transaction at the same time.


Guaranteeing consistency under partition: let's be honest, real-time consistency under a partition is impossible.


But even eventual consistency is not easy: once a partition exists, how do you ensure that only one proposal can be accepted at any time?


In other words, how do you ensure the system still has only one "brain"? Let's look at how the PAXOS algorithm solves the split-brain problem.


Here, a system "with a brain" means one that can accept writes, and a "brainless" system is read-only, like the slave libraries in a MySQL cluster.


Here’s a PAXOS definition from Wikipedia:

Paxos is a family of protocols for solving consensus in a network of unreliable processors (that is, processors that may fail).


In plain terms, PAXOS is a mechanism for reaching consensus within a cluster of not-particularly-reliable nodes.



Paxos requires that a proposal be considered accepted only when at least (N/2)+1 of the system's nodes approve it. The basic idea behind this is majority rule.


Imagine the whole system going down right after a majority of nodes approved a value: after a restart, a vote can still determine which value is valid (the one retained by the majority).


This setting also neatly solves consensus under partition, because once a partition occurs, at most one side can contain (N/2)+1 or more nodes.
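The "at most one side can still hold a majority" argument can be checked in a few lines (a toy illustration of the quorum rule, not a Paxos implementation):

    N = 5
    QUORUM = N // 2 + 1        # (N/2)+1 = 3 for a 5-node cluster

    def can_accept_proposal(reachable_nodes: int) -> bool:
        # A proposal is chosen only if a majority acknowledges it.
        return reachable_nodes >= QUORUM

    # A partition splits the nodes into disjoint groups whose sizes sum to N,
    # so at most one group can reach the quorum: two "brains" can never both
    # get their proposals accepted.
    for split in [(3, 2), (4, 1), (2, 2, 1)]:
        assert sum(1 for group in split if can_accept_proposal(group)) <= 1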



A MySQL cluster can also resolve split brain by other means, for example by having both sides ping a common reference IP, with the winner keeping the "brain"; but that obviously just introduces another single point.


If you know a little about Bitcoin or blockchain, you know that blockchain also rests on a consensus mechanism, and it relies on that consensus-driven eventual consistency to resist malicious tampering.


The distributed application systems discussed in this article use PAXOS to achieve partition tolerance. In essence, the former resists some nodes going rogue, while the latter resists some nodes being cut off.


You may have heard the saying that PAXOS is the only solution to the distributed consistency problem.


The more you ponder that sentence, the stranger it feels, because it makes PAXOS sound as if it had escaped the CAP constraint. I prefer to phrase it as: PAXOS is the only consensus algorithm for guaranteeing eventual consistency in a distributed system (a consensus algorithm meaning that if everyone follows it, they will all end up with the same result).


PAXOS does not escape the CAP curse: after all, the consensus is reached among only (N/2)+1 nodes, and the data on the remaining (N/2)-1 nodes is still old and, for a while, inconsistent.


So PAXOS's contribution to consistency is that after a transaction, some nodes of the cluster hold the correct (consensus) result of that transaction, which is then synchronized asynchronously to the other nodes to guarantee eventual consistency.



Here’s an excerpt from Wikipedia:

Paxos is a family of protocols for solving consensus in a network of unreliable processors (that is, processors that may fail). Quorums express the safety (or consistency) properties of Paxos by ensuring at least some surviving processor retains knowledge of the results.


In addition, PAXOS does not require all nodes to be synchronized in real time. In essence it takes availability under partition into account and keeps the system available by reducing the number of participants required to complete a transaction.


② CAP analysis of OceanBase

As mentioned above, the thousands of applications in the unitized architecture are like calculators: they carry no CAP constraint of their own; the constraint sinks to their database layer, namely OceanBase (OB for short in this section), the distributed database developed by Ant.



In an OB system, every database instance has both read and write capability, and which instance does what can be configured dynamically (see the second part of this article).


In practice, most of the time, for a given class of data (a fixed user-ID segment) only one unit's node is responsible for writes at any moment; the other nodes receive either real-time (synchronous) or asynchronous replication.


OB uses the PAXOS consensus protocol: the number of nodes replicated to in real time (including the writer itself) must be at least (N/2)+1, which is what solves the partition-tolerance problem.



Here is an example, Teacher Ma changing his English name, to illustrate the subtlety of OB's design:

Assume the database is sharded by user ID, the data segment for Teacher Ma's user ID lies in [0-9], and unit A is initially responsible for writes to that segment.


Suppose Teacher Ma (user ID assumed to be 000) is changing his English name in the Alipay app. He mistypes it at first as "Jason Ma", and unit A receives the request.


At that moment a partition occurs (say unit A's network is cut off), and we transfer write permission for data segment [0,9] from unit A to unit B (by changing the mapping). Teacher Ma then types the name correctly: "Jack Ma".


The first request entered A before the network broke; after the write permission is transferred to unit B and takes effect, A and B are each trying to write Teacher Ma's English name into data segment [0,9].



If both writes were allowed, there would be an inconsistency: unit A says "I saw Teacher Ma set Jason Ma", while unit B says "I saw Teacher Ma set Jack Ma".


But this cannot happen. When A proposes "set Teacher Ma's English name to Jason Ma", nobody responds.


Because of the partition, the other nodes are unreachable from A, so the proposal is automatically discarded; A also realizes that it is the one partitioned off, and that the majority partition will complete the write on its behalf.


Likewise, when B proposes "set Teacher Ma's English name to Jack Ma", the majority of nodes respond, so B successfully writes Jack Ma into Teacher Ma's account record.


If A suddenly recovers right after the write permission is transferred to unit B, it still does not matter: both write requests must acquire a transaction lock on (N/2)+1 nodes at the same time, and with the no-wait design, once B holds the lock, every other transaction that claims it fails and is rolled back.
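A toy sketch of why the two competing writes cannot both succeed (this is only the majority-acknowledgement idea applied to the example's names, not OceanBase's actual protocol; a group of 5 replicas is assumed):

    QUORUM = 3    # (N/2)+1 for an assumed group of 5 replicas

    def try_write(proposer: str, value: str, reachable_replicas: list) -> str:
        """A write commits only if a majority of replicas acknowledges it."""
        acks = len(reachable_replicas)
        if acks < QUORUM:
            return f"{proposer}: proposal '{value}' dropped ({acks} acks, quorum is {QUORUM})"
        return f"{proposer}: '{value}' committed"

    # Unit A is cut off by the partition and can reach only itself.
    print(try_write("Unit A", "Jason Ma", ["A"]))
    # Unit B, now the owner of segment [0,9], can reach the majority side.
    print(try_write("Unit B", "Jack Ma", ["B", "C", "D"]))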


Let’s analyze OB’s CAP below:

  • Partition tolerance: OB nodes communicate with each other (they must replicate data), so partitions are a real concern. OB keeps working by requiring replication to only some of the nodes, which shows it has addressed partition tolerance.

  • Availability under partition: an OB transaction only needs to be replicated to (N/2)+1 nodes, so the remaining minority of nodes may be partitioned off (down, disconnected, and so on) and the system stays available as long as (N/2)+1 nodes are alive.

    In extreme cases, say 5 nodes split 2:2:1, the system is indeed unavailable, but that is unlikely.

  • Consistency under partition: during a partition some nodes are cut off, so strict consistency is clearly not satisfied. However, the consensus algorithm guarantees that only one value is currently valid, and synchronization between nodes then achieves eventual consistency.


So OB still has not escaped the CAP curse: when a partition occurs it becomes AP plus eventual consistency (C). Overall it is AP, that is, highly available and partition tolerant.


Conclusion


This article touches on a lot of ground, and each point could be discussed at length on its own. Coming back to the main theme: what is the technical design behind Singles Day's massive payment volume?


I think it’s the following:

  • The RZone design based on sharding databases and tables by user ID. Giving each user group exclusive use of one unit makes the capacity of the whole system explode.

  • RZone's anti-split-brain design (PAXOS in OB) for network partitions and DR switchovers. As we know, an RZone is single-brained (its reads and writes all go to the library of one unit), while a network partition or a hot DR switchover may temporarily produce multiple brains; OB solves the consensus problem under split brain (with the PAXOS algorithm).

  • The CZone-based local-read design. This ensures that the large share of common data exhibiting the "write/read time lag" can be read locally at high speed.

  • The small remainder that cannot be read locally, mainly public configuration data, is read from GZone in real time and does not cause much of a stir.

    The TPS users generate against it, for example, is not that high. For real-time inventory data, the number of calls to GZone can be reduced by serving page views from an application-layer cache and verifying only when an order is actually placed.



This is the CRG architecture of Ant's LDC. I believe 544,000 transactions per second is far from LDC's upper limit; the number can go higher.


Of course, the success of Double 11's massive payments is determined not only by this design but also by operational and technical measures such as pre-warming and peak shaving, plus the joint efforts of hundreds of colleagues. A salute to everyone who stayed on duty for Double 11.



Thank you for reading. Comments on any deficiencies or omissions in this article are welcome.



END

