Brief introduction: When people talk about cloud native, microservices and containers usually come up first, so what do these technologies have to do with enterprise disaster recovery? In fact, every industry needs disaster recovery; the financial industry, for example, has especially strong requirements. But how to build disaster recovery and multi-active capability is something every enterprise needs to think through. I hope this post gives you some ideas.


Functional evolution of the DR system

What we are talking about today is actually one part of the disaster recovery (DR) system. Let's first look at how the DR architecture has evolved:

DR 1.0: In the early days, business systems were deployed in a single data center on a traditional architecture. What emergency measures or troubleshooting methods were available? In this period only data backup was considered, mostly as cold standby. Besides the production data center, an additional data center might be built for disaster scenarios: data was synchronized from the production site to the standby site so that services could, in theory, be switched over when a problem occurred. In practice, however, the switchover is rarely performed. Even financial institutions, which run annual DR drills, seldom dare to switch when a real failure happens in production.

DR 2.0: More application oriented. For example, with cloud native, or with applications built on a traditional IOE stack, switching over is no longer just loading previously cold data; the goal is to quickly bring the application up in the other data center during the cutover. To replicate at the data layer without too much latency, active-active is usually required, and active-active comes with constraints, such as a distance limit within which same-city active-active is feasible. In practice it is more often run in an AQ mode: the production side carries the main workload, and the other data center handles the remaining services.

DR 3.0: Remote multi-active. What does multi-active mean? It means no longer being limited to two data centers, but running three or more. Alibaba's business, for example, is distributed across multiple data centers, and serving traffic from all of them at the same time requires corresponding technical support. "Remote" also means no longer being limited by distance, such as 200 kilometers or the same city, because today data centers are deployed across the whole country.

Business continuity and DR overview

In fact, there is a systematic methodology for business continuity, which distills the norms and guidance accumulated over years of DR system construction. It has several dimensions:

1. Multi-active services are not the same as traditional disaster recovery, where services in another data center are simply pulled up on demand. Making every service multi-active is very difficult, both in cost and in technical implementation, when building a DR system.

2. To keep the business running in real time, the core services must not stop for reasons such as a power failure in a data center.

3. The assurance system. Today every industry has its own methods and management practices, but what Alibaba offers is to turn this part into technology, tools, and products, so that enterprises can quickly build multi-active capability on top of this methodology and product when they set out to do so.

The BCM architecture and IT DR capability form a practical guidance framework. In terms of completeness, business continuity at the top is the goal, and below it are the various means of achieving it. At the bottom you can see things like IT contingency plans and business continuity plans for special situations, which traditional DR also considers, but we consider them within the product system from a multi-active perspective.

The DR approaches mentioned here are the common ones: from cold standby to same-city active-active, or same-city active-active plus remote cold standby (two sites, three centers), all relatively standard in the industry. Remote multi-active means the three data centers of a "two sites, three centers" layout all provide service at the same time; it builds on the previous approaches but differs from traditional DR. Construction costs also differ: building remote multi-active capability requires more investment than traditional setups such as same-city dual-center.

When building multi-active capability, the realities of the business must also be taken into account. Different industries differ; for example, if only dual-read needs to be implemented on the multi-active side, then the construction cost and the time to switch services will differ accordingly. On the timeline, remote multi-active can achieve minute-level switchover, whereas a cold-standby approach may need days.

Why Alibaba builds multi-active

The reasons in Alibaba's business model are similar. As mentioned above, without multi-active you have to build another data center, which is very expensive, because that data center is normally used only for data synchronization and is not serving traffic. Meanwhile, the production system and the DR system have to be kept on matching versions through continuous updates. And in reality, when the cold standby or "two sites, three centers" setup is needed during a failure, no one dares to switch, because it is quite likely the switch cannot be reversed afterwards.

There are three main motivations for multi-active:

1. Resources. With today's rapid business growth, the resource capacity of a single site is limited. Cloud native and cloud computing provide high availability and disaster recovery capabilities, but deploying across data centers requires support from the underlying infrastructure. We want the business to expand beyond the limits of a single physical data center and be served by multiple data centers at the same time.

2. Diversified business needs, with requirements for both local and remote deployment.

3. Disaster recovery events. For example, a fiber cable is cut, or the power supply or cooling of a data center fails due to weather, causing a single-site outage. Today's requirement is not limited to one data center, but to multiple data centers deployed across the country in different configurations that can be adjusted flexibly according to the business model.

Since the need for multi-active capability is so pressing, Alibaba has built multi-active solutions and products based on its own business needs and technical capabilities.

Breaking down the multi-active architecture


1. Remote mutual backup: however good cloud native and cloud computing may be, without multi-active capability these technologies sit idle in the standby site. Cold standby does not really work, and the decision to cut over to cold standby is mostly a human one. Because layer-by-layer reporting has a big impact on the business, mature customers do have contingency plans defining which kinds of impact or failure require a switch, but in practice, under the cold-standby model, people generally dare not switch.

2. Same-city active-active: there is a certain distance limitation. The common active-active mode applies to the upper layers, such as the cloud native PaaS layer. At the data layer, because the two sites are in the same city, storage-level replication can be used, so when the primary data center has a problem, the database is failed over to the backup data center. The advantage is that the machines and resources in both data centers are in an active state, so there is no need to worry about version drift between the production site and the backup site.

3. Two sites, three centers: besides same-city active-active, a remote cold-standby data center is added, so the ability to handle failures is stronger. Building a remote cold-standby data center is similar to the first scheme of cold standby.

4. Remote multi-active: multiple data centers provide services to the outside at the same time. Because of the distance, replication at the data layer may be limited by the network, which introduces latency. Many technical problems have to be solved, such as how to switch quickly from the Beijing data center to Shanghai, and how to cut over when the underlying data is not yet fully synchronized due to physical constraints. Unlike switching in the traditional DR mode, a lot of preparation work and a subsequent data compensation process are required. We have integrated this into the product system: the physical limits cannot be broken, so we use architectural patterns to work around them.

Progressive multi-active DR architecture

For key core businesses, when building a multi-active system or project, the business first needs to be sorted out; what we discuss here is unit-based (unitized) partitioning.

Progressive multi-active DR architecture

Dual-read and "two sites, three centers": in the normal case the two data centers split the traffic roughly half and half, which is the simplest model. Based on this model, rules for segmenting the business can be found, such as the user number, just as a bank might split the business half and half by card number or by the user's location. Within multi-active we want flexible matching: depending on the processing capacity of each data center and the kind of failure encountered, the traffic ratio can be adjusted to 50%, 60%, or any other proportion. In the same way, incoming traffic can be distributed evenly across multiple data centers.
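To make the idea concrete, here is a minimal sketch (my own illustration, not Alibaba's MSHA implementation) of routing a request to a unit by hashing a shard key such as the user number, with an adjustable traffic ratio per data center; the unit names and ratios are hypothetical:

```java
// Minimal illustration of shard-key routing with adjustable ratios.
import java.util.LinkedHashMap;
import java.util.Map;

public class TrafficRouter {
    // unit name -> share of traffic in percent (must sum to 100)
    private final Map<String, Integer> ratios = new LinkedHashMap<>();

    public TrafficRouter(Map<String, Integer> ratios) {
        this.ratios.putAll(ratios);
    }

    // Map a shard key (e.g. user number) onto a bucket 0..99, then walk the ratio table.
    public String unitFor(long userId) {
        int bucket = (int) (Math.abs(userId) % 100);
        int upper = 0;
        for (Map.Entry<String, Integer> e : ratios.entrySet()) {
            upper += e.getValue();
            if (bucket < upper) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("ratios must sum to 100");
    }

    public static void main(String[] args) {
        Map<String, Integer> ratios = new LinkedHashMap<>();
        ratios.put("unit-A", 60);   // e.g. adjusted from 50/50 to 60/40 after a capacity change
        ratios.put("unit-B", 40);
        TrafficRouter router = new TrafficRouter(ratios);
        System.out.println(router.unitFor(12345L)); // bucket 45 -> unit-A
        System.out.println(router.unitFor(12399L)); // bucket 99 -> unit-B
    }
}
```

Changing the ratio table (for example during a cut-flow operation) re-routes the corresponding buckets of users to the other data center without touching the application logic.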

Technically, remote backup is one-way data replication, while remote active-active is bidirectional: if either of the two data centers fails, the other can take over. A very important point in the implementation is avoiding circular replication at the data layer, that is, data that has just been synchronized must not be mistaken by the other data center for new data and replicated back. Also, with multiple data centers, the traditional approach of generating a sequence number inside a single database no longer works. In a multi-active setup, sequence numbers must be globally unique across the whole cluster, not just within one data center, so the product needs rules to ensure that numbers generated in different data centers never collide.
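One common way to satisfy this requirement is a sketch under my own assumptions, not necessarily how Alibaba's product does it: embed a unit/data-center identifier in every generated number, Snowflake-style, so that sequences produced in different data centers can never overlap.

```java
// Sketch of a globally unique sequence generator: timestamp + unit id + counter.
// Unit ids are assigned per data center, so two data centers can never produce
// the same value even when generating ids in the same millisecond.
public class UnitAwareIdGenerator {
    private final long unitId;          // 0..15, one per data center / unit
    private long lastMillis = -1L;
    private long counter = 0L;          // 12-bit counter within one millisecond

    public UnitAwareIdGenerator(long unitId) {
        if (unitId < 0 || unitId > 15) {
            throw new IllegalArgumentException("unitId must fit in 4 bits");
        }
        this.unitId = unitId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastMillis) {
            counter = (counter + 1) & 0xFFF;
            if (counter == 0) {                        // counter exhausted, wait for next millisecond
                while ((now = System.currentTimeMillis()) <= lastMillis) { }
            }
        } else {
            counter = 0;
        }
        lastMillis = now;
        // 41 bits of time | 4 bits of unit id | 12 bits of counter
        return (now << 16) | (unitId << 12) | counter;
    }

    public static void main(String[] args) {
        UnitAwareIdGenerator beijing = new UnitAwareIdGenerator(0);
        UnitAwareIdGenerator shanghai = new UnitAwareIdGenerator(1);
        System.out.println(beijing.nextId());
        System.out.println(shanghai.nextId());
    }
}
```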

Multi-active DR solution

Architecture diagram of the multi-active product solution

1. Access layer: the first and most important thing to solve in multi-active is the traffic access layer. The access layer finely controls the access rules: according to the service sharding rules, traffic must be mapped precisely to the data centers below. When a request comes in, the access layer decides which data center should serve that user. How does this work in practice?

The traditional way is domain name switching. For example, the front-end domain name points at two data centers; at switchover the address behind the domain name is changed, so services originally connected to data center A are moved to data center B. The problem with this approach is that it disrupts in-flight business. If a fault occurs in one data center and services need to move to the other, a domain-name switch affects the services underneath. Moreover, a switch at this level cannot be coordinated with the cloud native PaaS layer: the upper layers do not sense the switch below, so while front-end traffic has moved to the other data center, calls in the middle may still go to the original unit, which severely affects business continuity. In extreme cases this mode can still solve some problems: if one data center cannot serve anything at all and a standby data center exists, cutting the domain name is a workable option.

The other way is to use cloud native microservices: traffic can be tagged in the microservice layer, and once tagged, the cloud native microservice stack propagates the tag downstream, keeping the request within one unit or one data center as far as possible instead of jumping to other data centers.
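A minimal sketch of this tagging idea follows (my own illustration; a real stack would carry the tag in RPC or HTTP headers through the microservice framework or a service mesh, and the unit names and sharding rule here are hypothetical):

```java
// Sketch: tag a request with its target unit at the entry point and carry the
// tag with every downstream call so the whole chain stays in one data center.
public class RoutingTagDemo {
    // Thread-bound routing tag, set by the access layer for each request.
    private static final ThreadLocal<String> ROUTING_UNIT = new ThreadLocal<>();

    // Hypothetical sharding rule: even user ids -> unit-A, odd -> unit-B.
    static String resolveUnit(long userId) {
        return (userId % 2 == 0) ? "unit-A" : "unit-B";
    }

    static void handleInbound(long userId) {
        ROUTING_UNIT.set(resolveUnit(userId));   // tag on entry
        try {
            callDownstream("order-service");      // tag travels with every hop
        } finally {
            ROUTING_UNIT.remove();
        }
    }

    static void callDownstream(String service) {
        // A real RPC/HTTP client would write the tag into a header (e.g. x-unit)
        // so the next hop is routed inside the same unit instead of across rooms.
        System.out.printf("call %s with routing header x-unit=%s%n",
                service, ROUTING_UNIT.get());
    }

    public static void main(String[] args) {
        handleInbound(1001L);
        handleInbound(1002L);
    }
}
```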

2. Application layer: the routing specification in the middle layer includes service routing components, which can also be provided separately in our product system. For example, some customers do not want the entire solution, because the middleware they use may all be open source, but they still want multi-active capability. In that case the upper layer can use our multi-active control plane to define exactly how many logical units there are and to cut flow, and APIs are provided for intermediate calls. Globally unique sequence numbers, routing rules, and sharding rules are provided by this layer. Tagging and identifying traffic looks simple, but the multi-active scenario raises harder questions: for example, where middleware is used for distributed decoupling in the architecture, if messages in one data center have not been fully consumed when the switch happens, how are they synchronized to the other data center? Problems like this need to be solved together with cloud native.

3. Data layer: this involves the logic of replication and writing. In our scenario, write-ban control has logic at the database layer that takes effect automatically once a switch happens at the front end: a time-stamped write-ban rule is generated, and writes are released again only when the data in the target data center has caught up. We prohibit writes to protect the database, based on a judgment of database replication lag. If the underlying data synchronization capability is weak, the switch can still be made and most services can run, but many write services may not, because the database is restricted by the write-ban rules. In addition, the data synchronization rules control the replication requirements across multiple data centers at the overall-rule level.
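The gist of the write-ban idea can be sketched as follows (a simplified illustration under my own assumptions, not the product's actual rule engine): after a cutover, writes are rejected until replication in the target data center has caught up past the cutover timestamp.

```java
// Sketch of write-ban control during a cutover: reject writes until the target
// data center's replication position has passed the cutover timestamp.
import java.util.concurrent.atomic.AtomicLong;

public class WriteGuard {
    // Timestamp (ms) at which the cutover started; 0 means no cutover in progress.
    private final AtomicLong cutoverAtMillis = new AtomicLong(0);
    // Latest source timestamp that has been replicated into the target database.
    private final AtomicLong replicatedUpToMillis = new AtomicLong(0);

    public void startCutover(long nowMillis) {
        cutoverAtMillis.set(nowMillis);            // write ban takes effect
    }

    public void reportReplicationProgress(long sourceTimestampMillis) {
        replicatedUpToMillis.set(sourceTimestampMillis);
    }

    // Writes are allowed only when no cutover is pending, or the target has
    // replayed all changes made before the cutover point.
    public boolean writeAllowed() {
        long cutover = cutoverAtMillis.get();
        return cutover == 0 || replicatedUpToMillis.get() >= cutover;
    }

    public static void main(String[] args) {
        WriteGuard guard = new WriteGuard();
        guard.startCutover(1_000L);
        guard.reportReplicationProgress(900L);
        System.out.println(guard.writeAllowed()); // false: target still behind
        guard.reportReplicationProgress(1_200L);
        System.out.println(guard.writeAllowed()); // true: data has caught up, writes released
    }
}
```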

Based on this whole solution system, we came up with a concept (shown above): the four-letter acronym MSHA (Multi-Site High Availability), the multi-active product capability we provide for cloud native today. We want it to make a difference around four numbers: 0, 1, 5, and 10 minutes.

The first is 0-minute prevention. As mentioned earlier with traffic cutting, one approach is to deploy a blue-green release environment across two data centers. Logical units can even be defined within a single data center from the control console, so a blue-green release can be carried out very quickly inside one data center. Doing blue-green releases in a single data center is limited by the supporting technology and products; with this component you can clearly delineate which resources belong to one unit and which to another, and quickly perform a blue-green release per unit.

Second, 5-minute location. With the traditional same-city cold-standby approach, for example, decisions tend to be very difficult, and nobody wants to bear the consequences of the switch. We hope that, based on this platform, you can directly see the impact of the current fault, which problems are involved, and what actions or operations the corresponding stakeholders need to take to restore the application. When a fault occurs, the system can quickly identify the problem; for example, after locating it within 5 minutes, you can decide whether to cut the traffic.

Third, 10-minute recovery. Finally, through this mode, we hope the whole process of bringing the business back up can be kept within 10 minutes.

Best practices for multi-active DR

The following are a few cases of Alibaba applying multi-active DR capability for external enterprises. This capability is not only usable on the public cloud. Deploying an application on the cloud does not automatically give it full high availability: when you use cloud resources you will find there are different regions, and each region contains different availability zones, so using the public cloud has to be combined with the actual situation. For example, most of a customer's users may be in the south, so the first node is opened in a southern data center; if that Alibaba data center has a problem, the customer's business is affected, even though the business is deployed on the cloud and the cloud products themselves provide high availability. A data center failure still affects services. So the solution is that the multi-active capability can be deployed not only on the cloud, but also, like commercial software, in the customer's own data centers.

Case 1: Same-city active-active

A logistics customer uses multi-active within the scope of a single city. Traditional technology would not have been a big problem here, but the benefit of multi-active shows up, for example, in the corresponding SDK: requests can be identified automatically and the tag passed down without much business transformation. After completion, the disaster recovery capability is much faster in terms of RTO.

Case 2: Remote dual-read

The difficulty in the remote dual-read case is that the distance exceeds a thousand kilometers. Under this constraint, both reads and writes are hard, since data replication itself has latency. The logic of the product is to know clearly which read services are involved, which services are routed to the read data center, and what the replication status is. The minute-level RTO is much better than before, and services can be switched dynamically online.

Case 3: Remote active-active

The enterprise customer using remote active-active currently has two data centers and may expand in the future. In this project we did a lot of product adaptation and development, because a great deal of work was needed in the middle layer on top of the product's basic capabilities; the whole process went from multi-active product R&D, to adaptation to the application scenarios, and then to full transformation together with the business. The core point is business continuity, so it does not mean every service will be multi-active across data centers in the future, but only the critical ones. For example, on Singles' Day every year, our core goal is to ensure that orders are not affected; logistics, through decoupling or other means, does not have continuity requirements as high as orders and transactions. The key is to ensure that the services and products on the core trading link have no problems during multi-dimensional switching.

We suggest you try the multi-active control platform yourself. After two or more units are defined in the console, when one data center fails we want multi-active to quickly switch its applications to the other data center. The prerequisite for switching is that the units are defined in the console: whether logical units within a single data center or units across multiple physical data centers, they must be mapped into the multi-active management console. In the console we set rules, such as how unified access traffic is segmented, by which dimensions, or marked in the form of IDs. When cutting flow, the console dynamically shows which dimensions of traffic move to the other data center, so a failure can be matched quickly; this part is relatively simple.

When we help customers deploy this capability today, we also regularly do traffic cutting and drills in the console to see whether the data centers are affected, because the system also supports other schemes, such as fault drills, where an injected fault is combined with switching to the other data center, and so on.

Conclusion

Multi-active DR capability has been exercised in Alibaba's internal business for many years, and it took a long time to evolve it into products. The purpose is to help enterprises build their multi-active capability within 30 days. In particular, many of the product deployments on the public cloud are already out of the box and take much less time to set up. We hope the multi-active capability of this product and solution can help enterprises quickly achieve minute-level fault switching and multi-active construction.


This article is the original content of Aliyun and shall not be reproduced without permission.