Best practices for active-active disaster recovery of hybrid cloud applications

Author: Far Zhi

Preface

During digital transformation and cloud migration, more and more enterprises choose a hybrid cloud model (cloud plus self-built IDC, or cloud plus another vendor's cloud) for disaster recovery (DR). This avoids over-reliance on a single cloud vendor and makes full use of existing on-premises IDC resources.

The MSHA cloud-native multi-active DR solution [1] has also released hybrid cloud multi-active DR capabilities. This article uses a demo business case to explain the difficulties of building hybrid cloud DR, and to show how to quickly build an application active-active architecture on MSHA that can recover services within minutes.

Hybrid cloud DR in practice

Business background

Enterprise A is an e-commerce trading platform in the retail industry. Its business systems are deployed in a self-built IDC, which has the following pain points:

Services are deployed only in the IDC and have no DR capability.
IDC capacity is insufficient, and physical machines take a long time to upgrade or replace, so the IDC cannot keep up with rapid business growth.

As the business grew rapidly, repeated capacity shortages and failures drew the attention of senior management, who decided to build DR capability. The self-built IDC is an existing company asset that has run stably for many years, and the company does not want to depend too heavily on the cloud. It therefore wants a hybrid cloud DR architecture that combines the IDC and the cloud.

Current application deployment architecture

The e-commerce trading platform consists of the following applications:

Frontend: a web application that interacts with users.
Cartservice: the cart application, which provides services for adding to, storing, and querying the cart.
Productservice: the product application, which provides product and inventory services.

Technology stack:

SpringBoot.
RPC frameworks: SpringCloud and Dubbo, with self-built Nacos and ZooKeeper registries.
Databases: Redis and MySQL.

Hybrid cloud DR goals

The business DR requirements are summarized as follows:

Minute-level RTO for DR between the cloud and the IDC. The company wants the cloud and the IDC to back each other up, so that the IDC keeps delivering value and the business does not depend 100% on the cloud. When either the IDC or the cloud fails, a switchover must be possible at the critical moment, with an RTO of less than 10 minutes.
No data consistency risk. The business requires strong data consistency between the two data centers on and off the cloud, so risks such as dirty writes must be avoided both in normal operation and during a DR switchover.
One-stop control. The technology stacks and cloud products involved in DR need unified management, unified O&M, and unified switchover. Operations should converge on a single control platform so that they can be performed quickly through a GUI and executed automatically in fault scenarios.
Short implementation cycle and low retrofit cost. The business has multiple product lines, complex dependencies, and long call chains, and it is developing and iterating rapidly. DR construction should not burden the business R&D teams with code changes.

Construction challenges

Traffic management is difficult
If DNS is used to split traffic between the cloud and the IDC by weight, changes to DNS records take a long time to take effect (usually from 10 minutes up to hours; see the DNS resolution effective time FAQ [2]), which cannot meet the requirement of a DR switchover in less than 10 minutes.
Business applications depend on Redis and MySQL. The IDC runs self-built open source versions while the cloud uses managed cloud products, so it is difficult to build a DR switchover capability that spans both.
Data quality during a DR switchover is difficult to guarantee
During a DR switchover, stale data may be read because of data synchronization delay. In addition, switchover rules reach distributed application nodes at different times, so the databases on and off the cloud may be read and written at the same time, which produces dirty writes. Guaranteeing data quality is the key difficulty of the switchover.
Avoiding intrusion into business code is difficult
Giving Redis and MySQL a DR switchover capability usually requires business applications to be modified in cooperation, which is highly intrusive to the business code.
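
To make the DNS limitation concrete, here is a rough back-of-the-envelope sketch that compares the worst-case time for a DNS weight change to take effect with the 10-minute RTO budget. The function name, the TTL values, and the operator delay are illustrative assumptions, not measurements, and real resolvers may cache records for longer than the TTL.

```python
def dns_switch_meets_rto(ttl_seconds: int,
                         operator_delay_seconds: int,
                         rto_budget_seconds: int = 600) -> bool:
    """Return True if a DNS-based switchover can finish within the RTO budget.

    Worst case: a client cached the old record just before the change, so it
    keeps resolving to the failed site for one full TTL after the operator
    has updated the record.
    """
    worst_case = operator_delay_seconds + ttl_seconds
    return worst_case < rto_budget_seconds


# A 10-minute TTL alone already uses up the whole 10-minute budget.
print(dns_switch_meets_rto(ttl_seconds=600, operator_delay_seconds=60))  # False
# A short TTL leaves headroom, but only if every resolver honors it.
print(dns_switch_meets_rto(ttl_seconds=60, operator_delay_seconds=60))   # True
```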

The solution

Given the business DR requirements and the characteristics of a hybrid cloud (IDC plus cloud), an application active-active architecture can meet the requirements.

Application active-active architecture

Schematic diagram:

Architecture specification:

Select a cloud region that is no more than 200 km from the IDC, so that the network latency is about 5 to 7 ms.
Applications and middleware are deployed symmetrically and redundantly on and off the cloud, and both sides serve external traffic at the same time (application active-active).
Databases are deployed in cross-site primary/standby mode with asynchronous replication. All applications read and write the databases in the same data center, so data consistency is not a concern in normal operation.

Detailed design

Application traffic active-active

Business applications are deployed symmetrically on and off the cloud. An MSHA-based access layer cluster (MSFE) receives HTTP/HTTPS traffic and distributes it between the cloud and the IDC by proportion or by precise routing rules. The multi-active console provides routine GUI-based O&M for MSFE clusters, such as deployment, scaling, and monitoring, as well as minute-level traffic switching in fault scenarios.
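
As an illustration of how "by proportion or by precise routing rules" can work, the following minimal sketch picks a unit for each request. It is not MSFE's actual algorithm: the unit names, the `pinned` rule table, and hashing on a user ID are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Optional


def pick_unit(user_id: str,
              weights: Dict[str, int],
              pinned: Optional[Dict[str, str]] = None) -> str:
    """Choose the unit that should serve a request (illustrative only).

    1. Precise routing: a user pinned to a unit always goes there,
       unless that unit currently has zero weight.
    2. Proportional routing: hash the user ID into a bucket in [0, total)
       and walk the cumulative weights, so the same user lands on the
       same unit for as long as the weights are unchanged.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one unit must have a positive weight")

    if pinned and weights.get(pinned.get(user_id, ""), 0) > 0:
        return pinned[user_id]

    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    cumulative = 0
    for unit, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return unit
    raise AssertionError("unreachable: bucket is always below the total weight")


# Hypothetical rule set: split traffic evenly, but pin one test user to the IDC.
print(pick_unit("tester-1", {"idc": 50, "cloud": 50}, pinned={"tester-1": "idc"}))
```

Hashing a stable key instead of drawing a random number keeps a given user on one unit between requests, which is one common way to make proportional routing predictable.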

Service interworking and same-unit priority calls

Business applications are migrated to the cloud in batches by product line. During this process, some downstream applications are still deployed only in the IDC. The MSHA registry synchronization function lets services on and off the cloud call each other, which makes the phased migration easier. In addition, through the aspect (AOP) capability of MSHA-Agent, when a Dubbo or SpringCloud service is invoked, the consumer calls a provider in the same unit first, which avoids the network latency of cross-data-center calls and reduces request RT.
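
The same-unit priority rule can be sketched as a filter that runs before load balancing. This is a simplified illustration rather than MSHA-Agent's implementation; the `Provider` type and the unit labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Provider:
    address: str
    unit: str            # hypothetical unit label, e.g. "idc" or "cloud"
    healthy: bool = True


def candidates(providers: List[Provider], consumer_unit: str) -> List[Provider]:
    """Return the providers a consumer should load-balance over.

    Healthy providers in the consumer's own unit are preferred. Providers in
    other units are used only when the local unit has no healthy provider,
    for example while a downstream service is still deployed only in the IDC.
    """
    healthy = [p for p in providers if p.healthy]
    local = [p for p in healthy if p.unit == consumer_unit]
    return local if local else healthy


registry = [
    Provider("10.0.0.1:20880", "idc"),
    Provider("172.16.0.1:20880", "cloud"),
]
# A consumer on the cloud stays on the cloud provider.
print([p.address for p in candidates(registry, "cloud")])
```

The fallback branch is what keeps calls working during batch migration: a consumer on the cloud can still reach a provider that exists only in the IDC, at the cost of one cross-site hop.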

Data synchronization and database connection switchover

The databases are deployed in cross-site primary/standby mode. In normal operation, applications both on and off the cloud read and write the Redis and RDS databases on the cloud, so data consistency is not a concern. The MSHA console integrates the DTS synchronization component to support data synchronization (asynchronous replication) between the cloud and the IDC. In addition, through the aspect capability of MSHA-Agent, it can switch the database connections that applications use. If Redis or RDS on the cloud fails, read and write connections can be switched to Redis or MySQL in the IDC, and vice versa. During the switchover, it also provides write protection to prevent data quality problems such as stale reads and dirty writes.
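
The order of operations matters for write protection. The sketch below shows one plausible sequence (forbid writes, wait for replication to catch up, switch the connection, allow writes again). It is a toy model under stated assumptions: the `DatabaseRouter` class, its method names, and the lag callback are hypothetical and do not represent the MSHA-Agent API.

```python
from typing import Callable


class DatabaseRouter:
    """Tracks which unit's database the application currently uses."""

    def __init__(self, active_unit: str) -> None:
        self.active_unit = active_unit
        self.writes_allowed = True

    def write(self, statement: str) -> str:
        if not self.writes_allowed:
            raise RuntimeError("writes are temporarily forbidden during switchover")
        return f"{self.active_unit}: {statement}"

    def switch_to(self, target_unit: str,
                  replication_lag: Callable[[], int],
                  max_polls: int = 10) -> bool:
        """Switch the active database; return True on success."""
        self.writes_allowed = False              # 1. stop new writes (no dirty writes)
        try:
            # A real implementation would sleep between polls.
            for _ in range(max_polls):
                if replication_lag() == 0:       # 2. standby has caught up (no stale reads)
                    self.active_unit = target_unit   # 3. switch the connection
                    return True
            return False                         # lag never drained: stay on the old unit
        finally:
            self.writes_allowed = True           # 4. resume writes on whichever unit is active


router = DatabaseRouter("cloud")
lags = iter([3, 1, 0])                           # simulated lag readings
print(router.switch_to("idc", lambda: next(lags)), router.active_unit)
```

In this sketch a switchover that cannot confirm zero lag is abandoned. Whether to force the switch and accept some data loss when the primary is unreachable is an operational decision that the sketch deliberately leaves out.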

One-stop control and no business code intrusion

The MSHA console supports unified management, control, and switchover of HTTP traffic and database access traffic. Operations converge on a single control platform, which enables fast GUI-based operations in fault scenarios. In addition, business applications connect to MSHA through an agent, so the DR switchover capabilities are available without changes to business code.

Required changes

Cloud deployment
Select an Alibaba Cloud region close to the self-built IDC, and deploy a fully redundant set of applications, middleware, and databases on the cloud to build an active-active DR architecture across the cloud and the IDC. In this demo, the Hangzhou region is selected as the DR unit.
Network access:
Connect through Cloud Enterprise Network (CEN) to establish network connectivity between the cloud and the IDC (see the document on building an enterprise-level hybrid cloud with multiple access methods [3]).
Access layer cluster deployment and configuration:
Deploy the MSHA access layer cluster (MSFE) both on and off the cloud, and attach an SLB instance on the cloud for public network access and load balancing across the MSFE clusters (see the usage documentation [4]).
Enter the domain names, URIs, and back-end application addresses so that traffic can be split between the cloud and the IDC and switched within minutes (see the usage documentation [5]).
Applications:
Deploy business applications on the cloud in batches.
Install MSHA-Agent in the Java applications, with Nacos as the channel for delivering control commands, so that they can preferentially call microservices in the same unit and switch database connections (see the usage documentation [6]).
Middleware and databases:
On the cloud, deploy MSE-hosted ZooKeeper/Nacos registries, the cloud database Redis, and RDS. We recommend the high-availability editions deployed across availability zones, which also provide same-city active-active DR capability.
If an application is deployed only in the IDC, configure service synchronization for the registry (see the usage documentation [7]).
Configure data synchronization between the cloud Redis/RDS databases and the self-built Redis/MySQL databases (see the usage documentation [8]).

Application deployment architecture after the changes

Daily scenario: both the IDC and the cloud carry service traffic (application active-active)

Visit the home page of the e-commerce demo and check the actual call chain: requests reach either the Beijing unit or the Hangzhou unit with some probability, and all reads and writes go to the database in the Beijing unit.

DR capability

RPO: <= 1 min (depends on DTS synchronization performance)
RTO: <= 1 min (the MSHA components switch within seconds; the total depends on the DTS synchronization delay, so the overall RTO is <= 1 min)
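
One way to read the RTO figure is as a simple sum: the time spent waiting for data synchronization to catch up plus the time the switch itself takes. The sketch below works through that arithmetic. Treating the RTO as this sum is our interpretation of the statement above, and the sample durations are made-up inputs rather than measured values.

```python
def overall_rto_seconds(sync_catch_up_seconds: float,
                        switch_seconds: float) -> float:
    """Overall recovery time: wait for replication to catch up, then switch."""
    return sync_catch_up_seconds + switch_seconds


def meets_target(rto_seconds: float, target_seconds: float = 60.0) -> bool:
    """Check a recovery time against the 1-minute target."""
    return rto_seconds <= target_seconds


# Hypothetical drill: 40 s of synchronization delay plus a 5 s switch.
rto = overall_rto_seconds(sync_catch_up_seconds=40, switch_seconds=5)
print(rto, meets_target(rto))  # 45 True
```

Under this reading, the synchronization delay dominates: a second-level switch cannot keep the RTO under one minute if replication is already more than a minute behind.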

Verifying the DR capability

After the application active-active architecture is built on MSHA, the business DR capability still needs to be verified against expectations. The following sections inject real faults to verify it.

7.1 Preparing for the drill

Go to the MSHA console and select Monitor in the left menu bar. In the drop-down list at the top of the page, switch to the namespace that is actually in use.
View the monitoring metrics on the page.

Note: before a drill, define the steady-state metrics (for example, RT <= 200 ms and error rate < 1%) based on MSHA traffic monitoring or another monitoring product. They are used to judge the impact when a fault occurs and to confirm that the service has actually recovered after the fault is handled.
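
A steady-state check like the one in the note can be written down before the drill so that "impacted" and "recovered" are judged the same way each time. The sketch below is a minimal example: the thresholds come from the note, while the function name and the choice of average RT (rather than a percentile) are assumptions.

```python
from typing import List


def within_baseline(latencies_ms: List[float],
                    errors: int,
                    max_rt_ms: float = 200.0,
                    max_error_rate: float = 0.01) -> bool:
    """Return True if a monitoring window meets the steady-state baseline.

    latencies_ms holds one response time per request in the window and
    errors is the number of failed requests among them.
    """
    total = len(latencies_ms)
    if total == 0:
        return False                 # no traffic means nothing was verified
    average_rt = sum(latencies_ms) / total
    error_rate = errors / total
    return average_rt <= max_rt_ms and error_rate < max_error_rate


# Before the fault: fast responses and no errors.
print(within_baseline([120.0, 150.0, 90.0, 110.0], errors=0))   # True
# During the fault: half of the requests fail.
print(within_baseline([120.0, 900.0, 95.0, 1500.0], errors=2))  # False
```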

7.2 Application fault injection

Here we use the Alibaba Cloud Chaos fault drill product to inject a fault into the product application in the Alibaba Cloud Beijing unit.

Open the Chaos fault drill console [9], switch to the corresponding region at the top of the page, and select My Space in the left navigation bar.
In My Space, select the preconfigured drill (50% network packet loss) and click Execute Drill.

After the fault is injected, errors may occur when you open the e-commerce home page or place an order, as expected.

7.3 Recovery by traffic switching

When the product application in the Beijing unit fails, MSHA can cut the traffic on the cloud to 0 to restore service quickly.

Expected result

After 100% of the traffic is switched to the Hangzhou unit, the service fully recovers and is no longer affected by the failure in the Beijing unit.

Procedure

Log in to the MSHA console and choose Traffic Switching > Remote Active-Active Traffic Switching.
On the traffic switching page, click the one-click switch-to-zero button for the Beijing unit.

Click Perform pre-check. In the traffic switching check area, click OK to start the switch.
When the task page shows that the switch is complete, the traffic switch has succeeded.
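
Conceptually, switching a unit to zero means rewriting the traffic weights so that the failed unit receives nothing and the remaining units absorb its share. The sketch below shows that computation with a trivial pre-check. The function and its pre-check are hypothetical; the checks that the MSHA console actually performs are not described in this article.

```python
from typing import Dict


def cut_to_zero(weights: Dict[str, int], failed_unit: str) -> Dict[str, int]:
    """Return new percentage weights with failed_unit set to 0.

    The freed share is redistributed among the other units in proportion
    to their current weights, and the result always sums to 100.
    """
    if failed_unit not in weights:
        raise KeyError(failed_unit)
    survivors = {u: w for u, w in weights.items() if u != failed_unit and w > 0}
    if not survivors:
        # Hypothetical pre-check: refuse to cut if nothing else can serve.
        raise ValueError("pre-check failed: no other unit is serving traffic")

    total = sum(survivors.values())
    new_weights = {unit: 0 for unit in weights}
    names = list(survivors)
    assigned = 0
    for unit in names[:-1]:
        new_weights[unit] = survivors[unit] * 100 // total
        assigned += new_weights[unit]
    new_weights[names[-1]] = 100 - assigned   # the last unit takes the remainder
    return new_weights


print(cut_to_zero({"beijing": 50, "hangzhou": 50}, "beijing"))
# {'beijing': 0, 'hangzhou': 100}
```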

Refresh the home page of the e-commerce demo. It is displayed normally on repeated visits, as expected.

Check the actual call chain: traffic now always reaches the applications in the Hangzhou unit, which still read and write the database in the Beijing unit.

7.4 Database fault injection

As the call chain above shows, the applications in the Hangzhou unit still access the Redis and MySQL databases in the Beijing unit. We continue to use the Chaos fault drill product [10] to inject faults into the Redis and MySQL databases of the Beijing unit and create a database failure scenario.

After the fault is injected, errors occur when you open the e-commerce home page or place an order, as expected.

7.5 Recovery by database switchover

When the databases in the Beijing unit fail, the MSHA database switchover function can switch the Redis/MySQL connections used by the applications to the databases in the Hangzhou unit (during the switchover, MSHA waits for data synchronization to complete and temporarily forbids writes).

Expected result

After the application database connections are switched to Hangzhou, the service fully recovers and is no longer affected by the failure in the Beijing unit.

Procedure

On the MSHA console, choose Remote Application Active-Active > Data Layer Configuration in the navigation tree.

In the data protection rule list, locate the product, order, and shopping cart databases one by one, and click Primary/Standby Switchover for each.

After you click Primary/Standby Switchover, the pre-check page is displayed. Confirm that every check item is normal, then click Confirm to open the switchover details page, where the switchover process runs automatically.

On the switchover details page, you can view the progress and result. The switchover is complete when the task progress reaches 100%.

After the product, order, and shopping cart databases have been switched, the demo home page opens and orders can be placed normally on repeated attempts. The business functions are restored after the primary/standby switchover, as expected.

Conclusion

This article used a demo business case to show how MSHA helps enterprises build active-active disaster recovery for hybrid cloud applications, and provided a practical method for building the DR architecture. We also used the Chaos fault drill product to inject real faults and verify that the DR capability meets expectations in fault scenarios.

Finally, you are welcome to search for group number 31623894 to join the MSHA communication group for consultation and discussion.

References

[1] MSHA cloud-native multi-active DR solution

www.aliyun.com/product/ali…

[2] DNS resolution effective time FAQ

Help.aliyun.com/document_de…

[3] Building an enterprise-level hybrid cloud with multiple access methods

Help.aliyun.com/document_de…

[4] Usage documentation

Help.aliyun.com/document_de…

[5] Usage documentation

Help.aliyun.com/document_de…

[6] Usage documentation

Help.aliyun.com/document_de…

[7] Usage documentation

Help.aliyun.com/document_de…

[8] Usage documentation

Help.aliyun.com/document_de…

[9] Chaos fault drill console

Common-buy.aliyun.com/?commodityC…

[10] Chaos fault drill

Common-buy.aliyun.com/?commodityC…

Further reading

• MSHA multi-active DR solution home page:

https://www.aliyun.com/product/aliware/ahas/msha

• Introduction to the four DR architectures supported by MSHA:

https://help.aliyun.com/document_detail/338374.html

• Chaos fault drill:

https://www.aliyun.com/product/aliware/ahas/chaos
