Introduction: When people explain the concept of cloud native, we often hear about microservices and containers. So what is the relationship between these technologies and enterprise disaster recovery? In fact, every industry needs disaster recovery, and the financial industry in particular has strong requirements for it. But how to build disaster tolerance and multi-active capability is something every enterprise needs to think through carefully. I hope this post gives you some useful ideas.




Functional evolution of disaster recovery systems



Today’s topic, multi-active, is actually one part of the disaster recovery system. Let us first look at how the overall disaster recovery architecture has evolved:

Disaster Recovery 1.0: When application systems were first built, business systems were deployed in a data center on a traditional architecture. What emergency measures or fault handling methods were available then? In this period only data backup was considered, mainly in the form of cold backup. Besides the data center that served the business, an additional data center might be considered for disaster scenarios. From a system construction perspective, an enterprise might use a separate data center, synchronize data to it as a cold standby, and switch over when problems occurred. In practice, however, enterprises usually chose not to switch data centers. Even in the financial industry, which runs routine disaster recovery drills every year, teams are reluctant to switch when the system has problems in production.

Disaster Recovery 2.0: More consideration is given to the application. For example, for cloud native applications, or for applications on a traditional IOE stack, a switchover is no longer just cutting over and loading the cold standby data. The expectation is that after the cutover the application can be brought up quickly in the other data center. To replicate at the data layer without too much delay, there is usually an active-active requirement. Active-active generally comes with constraints, though: for example, same-city active-active is only possible when the distance between the data centers is within a certain range. Active-active is more often applied in AQ mode, in which the production side handles the full business and the other data center handles other business.

Disaster Recovery 3.0: The goal is remote multi-active. What does “multi” mean? It means no longer being limited to two data centers, but having three or more. For example, Alibaba’s business is distributed across multiple data centers, and serving external traffic from all of them at the same time requires corresponding technical support. “Remote” means not being limited by distance, such as 200 kilometers or the same city, because today’s data centers are deployed all over the country.

Business continuity and disaster recovery overview



For business continuity there is a systematic methodology, made up of the norms and guidance accumulated over years of building disaster recovery systems. It has several dimensions:

1. Multi-active is not the same as traditional disaster tolerance, where the same business is simply brought up as-is in another data center. Instead, the valuable business is selected for multi-active. In building a disaster recovery system, making every business multi-active is very difficult in terms of both cost and technology.

2. Real-time operation must be guaranteed: the core business must not stop serving for reasons such as a power outage in the data center.

3. M stands for the assurance system. Every industry may have its own methods and management practices today, but what Alibaba offers is to turn these into technologies, tools, and products, so that people can quickly build on this set of methods and products when they build multi-active capabilities in the future.

The BCM system and IT disaster resilience together form a practical guiding framework. Taken as a whole, business continuity sits at the top as the goal, and below it are the various ways to achieve it. At the bottom are the IT plans, the contingency plans for specific failures that threaten business continuity, and so on. These were already considered in traditional disaster recovery, but we have built them into the product system from a multi-active perspective.



The disaster recovery methods mentioned here are relatively common: cold standby, same-city active-active, and same-city active-active plus a remote cold standby (two sites, three centers) are all fairly standard in the industry. Remote multi-active goes one step further: it is as if all three data centers in a two-site, three-center setup served traffic at the same time, which sets it apart from traditional disaster tolerance. The construction cost also differs. Building remote multi-active capability requires more investment than the traditional approaches (for example, same-city active-active or two sites, three centers).

When building multi-active capabilities, you also need to consider the actual situation of the business. For example, the construction cost and the acceptable switchover time differ from industry to industry, and in some cases multi-active only needs to support reads on both sides. Along the time axis, remote multi-active can switch within minutes, whereas a switchover based on cold standby may take days.

Why Alibaba builds multi-active

Under Alibaba’s business model, the reasons for building multi-active are similar to those mentioned earlier. Without multi-active, a same-city cold standby requires building another data center. The cost is very high, because that data center is used only for data synchronization and does not serve traffic. Meanwhile, the versions of the production system and of the disaster recovery system have to be kept in step continuously. Yet when a failure actually happens, teams using cold standby or two sites, three centers do not dare to switch, because it may well be impossible to switch back afterwards.

There are three main drivers for multi-active:

1. Resources. With business growing rapidly, the capacity of a single site is limited. Cloud native and cloud computing do provide high availability and disaster tolerance, but cloud computing itself is deployed across different data centers, and cross-region multi-active needs support from the underlying infrastructure. We want the business to be able to expand beyond a single physical data center, so that multiple data centers can serve traffic at the same time.

2. Diversified business needs, which call for local or remote deployment.

3. Disaster events. For example, a cable is cut, or weather causes power supply or cooling problems in the data center, and a single data center fails. Today’s requirement is not limited to one data center: several data centers are deployed in different forms across the country, and traffic can be adjusted flexibly according to the business model.

Because these demands make multi-active capability urgent, Alibaba built its multi-active solutions and products according to its own business needs and technical capabilities.

Breakdown of multi-active architectures




1. Remote mutual backup: however good cloud native and cloud computing are said to be, without multi-active capability these technologies actually sit idle. A cold standby does no work, and the decision about when to cut over to it mostly depends on people. Because reporting up through multiple layers has a large impact on the business, more mature customers have contingency plans that spell out what kind of impact or fault warrants a switch. In practice, though, with cold standby they generally do not dare to switch.

2. Same-city active-active: there is a distance limit. The usual active-active mode applies at the upper layers; for example, the cloud native PaaS layer can be distributed across both data centers. At the data layer, because primary and standby storage can both be in the same city, the database is switched to the standby data center when the primary one has problems. The advantage is that the machines and resources in both data centers are active. In addition, because both data centers are active, there is no need to worry about the production version drifting from the standby version, so teams are not afraid to switch.

3. Two sites, three centers: in addition to serving traffic within the same city, this setup is better able to cope with failures. A cold standby data center is built in another location, similar to the cold standby in the first option.

4. Remote multi-active: multiple data centers serve external traffic at the same time. Because of the distance, replication at the data layer is constrained by the network, and some delay is inevitable. Many technical problems need to be solved, such as how to switch quickly from the Beijing data center to Shanghai, and how to cut over when physical limits mean the underlying data is not yet fully synchronized. We do not simply switch as in traditional disaster recovery; a lot of preparation is needed beforehand and a data compensation process afterwards. We have built all of this into the product system, and where the physical limits cannot be overcome, we use architectural patterns to optimize.

A progressive multi-active disaster recovery framework

For the key core business, the business is always sorted out when a multi-active system or project is built. Today we will talk about sorting it into units (unitization).




Remote dual-read and two sites, three centers: under normal circumstances two data centers split the traffic at most half and half, which is the simplest case. Here a rule can be found for dividing the business, for example by user number, just as a bank might split its business 50-50 by card number or by the user’s location. In multi-active we want this to be flexible: depending on, for example, the processing capacity of each data center or the kind of failure encountered, the share of traffic can be adjusted to 50%, 60%, or another proportion. The same applies with multiple data centers, where incoming traffic can be distributed evenly.
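
As a rough illustration of dividing traffic by user number with an adjustable ratio, here is a minimal Python sketch. It is not the product's actual routing rule: the function names `bucket_of` and `route`, the use of a SHA-256 hash, and the 100-bucket scheme are all assumptions made for this example.

```python
import hashlib
from typing import Dict


def bucket_of(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets) (hypothetical scheme)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(user_id: str, weights: Dict[str, int]) -> str:
    """Pick a unit for the user; weights maps unit name to a percentage share."""
    if sum(weights.values()) != 100:
        raise ValueError("unit weights must add up to 100")
    bucket = bucket_of(user_id)
    upper = 0
    # Each unit owns a contiguous range of buckets, in declaration order.
    for unit, share in weights.items():
        upper += share
        if bucket < upper:
            return unit
    raise AssertionError("unreachable when weights add up to 100")


# Shifting the split from 50/50 to 60/40 only moves users in buckets 50..59.
print(route("user-42", {"A": 50, "B": 50}))
print(route("user-42", {"A": 60, "B": 40}))
```

Because each user always lands in the same bucket, changing the ratio moves only the users whose buckets cross the new boundary, which keeps most requests in the unit that already holds their data.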

On the technical side, remote standby uses one-way data replication, while remote active-active uses two-way replication: either of the two data centers may have problems, so each must be able to take over from the other. The most important part of the implementation is at the data layer, where circular replication must be avoided: data that has just been synchronized to the other data center must not be treated there as new data and copied back. With multiple data centers there is a further issue. Traditionally a database assigns sequence numbers, but in multi-active the sequence numbers need rules that make them globally unique across the whole cluster rather than within a single data center. Sequence numbers generated in different data centers must never repeat, so the product needs rules to solve this problem.
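
The two data-layer concerns above can be sketched in a few lines of Python. This is only one possible approach under stated assumptions, not the product's mechanism: `UnitSequence` gives each unit its own residue class (a start offset plus a common stride), and `changes_to_ship` avoids circular replication by tagging each change with the unit where it originated. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List


class UnitSequence:
    """Sequence generator whose values never collide with other units.

    Unit k of n hands out k, k + n, k + 2n, ... (assumed scheme).
    """

    def __init__(self, unit_index: int, unit_count: int) -> None:
        if not 0 <= unit_index < unit_count:
            raise ValueError("unit_index must be in [0, unit_count)")
        self._next = unit_index
        self._stride = unit_count

    def next_id(self) -> int:
        value = self._next
        self._next += self._stride
        return value


@dataclass(frozen=True)
class Change:
    row_id: int
    payload: str
    origin: str  # unit where the write was first accepted


def changes_to_ship(local_unit: str, log: List[Change]) -> List[Change]:
    """Ship only locally originated changes, so replicated rows are not echoed back."""
    return [change for change in log if change.origin == local_unit]


beijing = UnitSequence(unit_index=0, unit_count=3)
shanghai = UnitSequence(unit_index=1, unit_count=3)
shanghai_log = [
    Change(beijing.next_id(), "order created", origin="beijing"),  # arrived by replication
    Change(shanghai.next_id(), "order paid", origin="shanghai"),   # written locally
]
print(changes_to_ship("shanghai", shanghai_log))
```

The stride must be at least the number of units that will ever exist, so in practice it would be chosen with room for future data centers.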

Multi-active disaster recovery solution



Architecture diagram of a multi-active product solution

1. Access layer: the first thing to solve in multi-active is the traffic access layer, which is very important. The access layer can control access rules at a very fine granularity. Following the business sharding rules, it must be precise enough to map traffic to each data center below it. When traffic comes in, it has to decide which data center should serve that user. How does this work in practice?

The traditional way is to switch domain names. For example, a front-end domain name is backed by two data centers, and switching means changing the address the domain name resolves to. The business that was served from data center A is then moved to data center B by switching the domain name. The problem with this approach is that it affects business in flight. For example, when one data center has a problem and the business must be cut over quickly to another through a domain name switch, the transactions in progress underneath are affected. In addition, this kind of low-level switch cannot be coordinated with the cloud native PaaS layer. The upper layer is switched and the lower layer is unaware of it: it does not know that the front-end traffic has moved to another data center, and calls in the middle of the chain may still go to units in the original data center, which seriously affects business continuity. In extreme cases this mode can still solve some problems. For example, if one data center cannot serve any business at all and a spare data center exists, switching the domain name is one way to cut over.

Another way is to use cloud native microservices. Traffic can be marked when it enters the microservices, and the cloud native microservice stack passes the mark down the call chain, so that as far as possible a request is handled entirely within one unit or one data center and does not jump to another.

2. Application layer: the access routing specification in the middle layer includes service routing components, which our product system can provide separately. For example, some customers say they do not want the full solution, because they may already use open source components throughout the middle layer, but they still want multi-active capability. In that case the upper layer can use our multi-active management and traffic switching, which defines exactly how many logical units there are and provides APIs for the middle layer to call. Globally unique sequence numbers, routing rules, and sharding rules are supplied to the customer by the upper layer. Marking and identifying traffic looks simple, but in multi-active scenarios it is not. For example, some architectures use distributed components for decoupling; if items in one data center have not yet been consumed when a switch happens, in what way should they be synchronized to the other data center? Problems like this need to be solved together with cloud native technology.

3. Data layer: the logic related to replication and writes. Our solution includes write-ban control in the database: once a switch happens at the front end, a write-ban code is generated automatically. For example, when traffic is switched to the target data center, a timestamped code is generated automatically, and writes are released again only after the data has been restored. We protect the database in this way, using the write ban together with the measured replication delay. If the underlying data synchronization is not strong enough, the switch can still be made and most of the business can continue, but many write operations may be blocked, because the database is restricted by the write-ban rules. In addition, with multiple data centers, data synchronization requires tighter overall control of the replication rules.
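
To make two of the ideas from the layers above concrete, the following Python sketch shows a unit mark carried along the call chain and a write ban that is lifted only once replication has caught up. It is a simplified illustration, not the product's implementation: the header name `x-unit-tag`, the functions `mark` and `downstream_headers`, and the `WriteFence` class are hypothetical, and the fence compares replication log positions where the article describes a timestamped code.

```python
from typing import Dict, Optional

UNIT_HEADER = "x-unit-tag"  # hypothetical header name used for the unit mark


def mark(headers: Dict[str, str], unit: str) -> Dict[str, str]:
    """Access layer: stamp the chosen unit on a request unless it is already marked."""
    tagged = dict(headers)
    tagged.setdefault(UNIT_HEADER, unit)
    return tagged


def downstream_headers(inbound: Dict[str, str]) -> Dict[str, str]:
    """Application layer: copy the mark onto outgoing calls so they stay in the unit."""
    if UNIT_HEADER in inbound:
        return {UNIT_HEADER: inbound[UNIT_HEADER]}
    return {}


class WriteFence:
    """Data layer: block writes in the target unit until replication catches up."""

    def __init__(self) -> None:
        self._fence_position: Optional[int] = None

    def start_switch(self, source_log_position: int) -> None:
        # Record how far the source had written when the switch began.
        self._fence_position = source_log_position

    def on_replicated(self, applied_position: int) -> None:
        # Lift the ban once everything up to the fence has been applied locally.
        if self._fence_position is not None and applied_position >= self._fence_position:
            self._fence_position = None

    def writes_allowed(self) -> bool:
        return self._fence_position is None


request = mark({"path": "/order"}, "unit-a")
fence = WriteFence()
fence.start_switch(source_log_position=120)
fence.on_replicated(applied_position=100)
print(downstream_headers(request), fence.writes_allowed())
```

Reads can continue while the fence is up, which matches the observation that most business keeps working after a switch while write-heavy operations wait for synchronization.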



Based on the whole package, we came up with a concept (as shown in the figure above): MSHA is how this capability is delivered as a cloud native product today, and we want it to make a difference on four numbers: 0, 1, 5, and 10 minutes, starting with 0-minute prevention.

The first is 0-minute prevention. As mentioned earlier, with traffic switching you can deploy a blue-green release environment across two data centers. That is one way. Even within a single data center, you can define two logical units in the console and run a blue-green release very quickly inside that data center. Blue-green release within one data center is otherwise limited by what the technical products support. With this component we can clearly delineate which resources belong to one unit and which to the other, and quickly carry out a blue-green release for the unit.

The second is 5-minute location. With the original same-city cold standby technology, for example, decisions are often very hard to make, or nobody knows who will bear the consequences of a switch. We hope that on this platform people can directly see the impact of a fault, which stakeholders are involved, and what actions or operations they need to take to restore the application. When a fault occurs, it can be found quickly through this system. For example, once the problem has been located within 5 minutes, a decision on whether to switch traffic can be initiated.

The third is 10-minute recovery. Ultimately we hope this model keeps the recovery of the whole business process within 10 minutes.

Best practices for multi-active disaster recovery

Here are a few cases in which Alibaba provided this capability to external enterprises. The multi-active disaster recovery capability is not only for use on the public cloud. Moving to the cloud does not mean that, once an application is deployed there, all high availability is provided by the cloud automatically. When you use cloud resources you will find that the cloud has different regions, and each region contains different availability zones. Use on the public cloud has to take the actual situation into account. For example, if most customers are in the south, the first node may be opened in a southern data center. If that Alibaba data center then has a problem, the customer’s business is affected: even though the business is deployed on the cloud and the cloud products provide high availability, once a failure involves the whole data center the business still suffers. So the solution is that multi-active capability can be deployed on the cloud, and also in the customer’s own data center just like commercial software.

Case 1: Same-city active-active

A logistics customer is using multi-active within the same city. Traditional technology would not be a big problem here, but the benefit of multi-active is that there is a corresponding SDK that identifies the marks automatically, so the business does not need much modification and the marked requests are passed along automatically. Once this is in place, the RTO for disaster recovery is much shorter.

Case 2: Remote dual-read

The difficulty in this case is that the distance is more than a thousand kilometers. At this distance both reads and writes are difficult, and data replication itself has a delay. With this product, it is clear from the control and traffic level which businesses are read-only, which businesses are directed to the read data center, and what the replication status is. The minute-level RTO is a big improvement over the original one, and business can be switched flexibly and dynamically while online.

Case 3: Remote active-active

This enterprise customer uses remote active-active and currently has two data centers, with possible further expansion in the future. For this solution we did a lot of development work on product compatibility, because a great deal of work in the middle layer was needed on top of the basic capabilities of the original product. The whole process ran from developing the multi-active product, to adapting it to the application scenarios, to a comprehensive transformation together with the business. The core is business continuity, so this does not mean that every business will be multi-active across data centers in the future, only the critical ones. For example, on the annual Singles’ Day, our core business goal is to ensure that orders are not affected. Logistics, through decoupling or other means, does not have as high a priority as order trading from a business continuity perspective. The key point is how the services and products on the core transaction path can switch without problems in the multi-active dimension.

We recommend trying this multi-active control platform for yourself. After two or more units are defined in the console, when one data center fails we want to switch its applications quickly to another data center through the platform. The prerequisite for switching is that the sites are defined in the control platform: whether they are logical sites within a single data center or multiple physical data centers, they must be mapped into the platform. In the console we set rules, such as the entry point of each service, the dimension by which incoming traffic is divided, or marking by ID. During a traffic switch, the console dynamically shows which dimensions of traffic are going to which data center, so that when a failure happens the rules can be matched quickly, which is relatively simple.

When we help customers deploy this capability today, we also often use the console to run traffic switches and drills in the system, to see whether a data center is affected. The system also works with other solutions, such as fault drills, in which an injected fault is combined with switching the application to another data center.

Conclusion

Disaster resilience has been practiced in Alibaba’s internal business for many years, and it took a long time to develop it into a product. We hope that today’s product and solution can help enterprises build their disaster resilience capability in less than 30 days. In particular, for enterprises that already have many products deployed on the public cloud, it actually takes less time. We hope the multi-active capability of this product and solution can help enterprises quickly achieve minute-level failover and build multi-active capability.

This article is original content of Aliyun and shall not be reproduced without permission.