Abstract: At the 7th Global Software Conference, Huawei software engineer Du Zhigang presented the high-availability scheme behind the Huawei Cloud official website, and gave an in-depth analysis of the design and engineering practice used to recover the website quickly under various extreme disaster scenarios.
This article is shared from the Huawei Cloud Community post “What Happened Behind the Website Access Failure? Huawei Engineers Teach You to Respond Quickly [Global Software Technology Conference Technology Sharing]”, original author: technology torchbearer.
Recently, a CDN service failure left a large number of well-known overseas news websites inaccessible or unable to load properly, and caused quite a stir. As more and more businesses move to the cloud, whether a website or service can stay continuously online is a real test of the high-availability and high-reliability design behind it.
When a website is unreliable, the loss is immeasurable
From the website owner’s point of view, unavailability has a direct economic impact. This is especially true for e-commerce sites, where transactions happen every second: once access is interrupted, the economic loss is obvious. From the customer’s point of view, the most immediate impression is simply that the website cannot be reached, which does irreparable damage to the reputation of, and trust in, the website and the brand behind it.
Looking back at the major Internet outages of the past ten years, the wide-reaching impact of DNS and CDN failures stands out, and regional and global failures of other IT infrastructure have also had a great impact.
The availability metrics widely used in the industry are the length of time a website is unavailable (downtime) and its yearly availability. Different types of websites and applications have different availability requirements.
Here, website downtime = point of recovery - point of failure, and yearly website availability = (1 - total downtime in the year / total time in the year) × 100%.
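As a small worked example of the two formulas above, the following Python sketch computes downtime and yearly availability. The outage timestamps are made up for illustration and are not data from the Huawei Cloud website.

```python
from datetime import datetime, timedelta


def downtime(failure_at: datetime, recovered_at: datetime) -> timedelta:
    """Downtime = point of recovery - point of failure."""
    return recovered_at - failure_at


def yearly_availability(outages, days_in_year: int = 365) -> float:
    """Yearly availability in percent: (1 - total downtime / total time) * 100."""
    total_time = timedelta(days=days_in_year)
    total_down = sum((downtime(f, r) for f, r in outages), timedelta())
    return (1 - total_down / total_time) * 100


# Hypothetical outages: one of 15 minutes and one of 30 minutes.
outages = [
    (datetime(2021, 3, 1, 10, 0), datetime(2021, 3, 1, 10, 15)),
    (datetime(2021, 8, 9, 2, 30), datetime(2021, 8, 9, 3, 0)),
]
print(f"{yearly_availability(outages):.4f}%")  # 45 minutes of downtime in a year
```

With these numbers, 45 minutes of downtime in a 365-day year still leaves availability above 99.99%, which shows how little downtime a "four nines" target allows.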
As the Internet entry point of a cloud infrastructure provider, the Huawei Cloud official website has extremely high availability requirements. The core pages facing end users must stay online 7×24. If a major failure occurs, such as a cloud service region-level failure or a single-cloud global failure caused by infrastructure, a warning must reach the responsible persons within 5 minutes and the failover must be completed within 15 minutes.
What’s going on behind the scenes when website access is down?
The figure walks through the overall process of a page request and marks the key failure points:
- ① DNS failure: generally makes the site inaccessible as a whole.
- ② CDN failure: makes the site inaccessible to users in some geographic areas.
- ③ Single-cloud global failure: makes the site inaccessible as a whole.
- ④ Cloud service region-level failure: users routed to that region cannot access the site.
- ⑤ Availability zone (AZ) level failure: users routed to the failed AZ cannot access the site.
- ⑥ Container cluster failure: users routed to the affected container service cannot access the site.
- ⑦ Service node failure: users routed to the failed service node cannot access the site.
To sum up, in a cloud scenario page access faces several key technical challenges:
- How to deal with the overall failure of a single DNS service provider?
- How to deal with the overall failure, or multiple regional failures, of a single CDN vendor?
- How to keep pages accessible when an infrastructure failure takes down a single cloud as a whole?
- How to minimize the time user access is affected by a region-level failure within a single cloud?
- Since page access relies on numerous back-end services, how to minimize failure points, reduce the overall complexity and cost of the scheme, and keep the scheme general and feasible?
Four solutions for handling a variety of website failures
In response to these key challenges, and based on the practice of the Huawei Cloud official website in recent years, I have summarized four solutions to share with you. We will go through them one by one and show their actual effect.
1. Overall failure of a single DNS service provider: dual DNS service provider resolution
DNS is an important weak link that has not received the attention it deserves. Many commercial portals with high availability requirements depend on a single DNS service provider. As long as nothing goes wrong, everything runs smoothly, but once that provider fails as a whole, the impact can be disastrous.
Our current strategy is dual-vendor DNS resolution. If one provider fails partially or completely, resolution fails over automatically within a short time and is handled by the other provider. We also built a unified operation and maintenance platform that configures domain name resolution consistently across vendors, monitors DNS availability, and quickly removes failed services.
Dual vendor DNS configuration is shown in the figure:
This configuration requires that both the domain name registrar and the DNS resolution providers support configuring name servers from multiple vendors. In practice, hosting of the domain registration is first migrated to a registrar that supports multi-vendor NS configuration, and the resolution records configured at the existing DNS vendor are then synchronized to the new vendor. Finally, the NS records at both the registration service and the resolution services are set to point to the name servers of both vendors at the same time (the change takes effect within 0 to 72 hours).
With this configuration, when a single vendor’s name servers fail, the ISP’s local DNS automatically lowers the selection priority of the failed name servers (the BIND SRTT algorithm penalizes failures) and uses the preferred name servers to resolve A records or CNAME records.
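The effect of a failure penalty on name server selection can be illustrated with a toy simulation. The sketch below is a deliberately simplified model and not BIND’s actual implementation: the smoothing factor, the penalty, the decay factor and the server names are all assumptions chosen for illustration.

```python
from collections import Counter


class SrttSelector:
    """Toy resolver-side selector: always query the server with the lowest smoothed RTT."""

    def __init__(self, servers, penalty_ms=800.0, decay=0.98):
        self.srtt = {server: 0.0 for server in servers}
        self.penalty_ms = penalty_ms  # assumed penalty added on a timeout
        self.decay = decay            # assumed decay so unused servers get retried eventually

    def pick(self):
        return min(self.srtt, key=self.srtt.get)

    def report(self, server, rtt_ms=None):
        """Record the outcome of a query; rtt_ms=None means the query timed out."""
        if rtt_ms is None:
            self.srtt[server] += self.penalty_ms
        else:
            self.srtt[server] = 0.7 * self.srtt[server] + 0.3 * rtt_ms
        for other in self.srtt:
            if other != server:
                self.srtt[other] *= self.decay


def simulate(selector, latency_ms, down=frozenset(), queries=20):
    """Run a number of queries and count how often each server was chosen."""
    picks = Counter()
    for _ in range(queries):
        server = selector.pick()
        picks[server] += 1
        selector.report(server, None if server in down else latency_ms[server])
    return picks


# Hypothetical name servers from two vendors; vendor A is down.
NS_A, NS_B = "ns1.vendor-a.example", "ns1.vendor-b.example"
picks = simulate(SrttSelector([NS_A, NS_B]), {NS_A: 20.0, NS_B: 30.0}, down={NS_A})
print(picks)
```

In this model the failed name server is tried once, penalized, and then avoided for the rest of the run, while the slow decay means it will eventually be retried after it recovers.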
The drill steps can be broken down into:
Step 1: Dual vendor NS record configuration.
Step 2: Check with the browser that the service is accessible.
Step 3: Test the availability of the name servers, and verify whether ISPs in different regions use name servers from different vendors for domain name resolution.
Step 4: Shut down BIND to simulate a single-vendor DNS failure.
Finally, probe the service over HTTP from multiple geographic locations to see whether it can still be accessed properly.
2. Regional failure of single CDN vendor: multi-CDN service provider scheme
The following is an introduction to the configuration and switching of multiple CDN vendors, as shown in the figure:
Three constraints shape this scheme: the DNS protocol does not support configuring CNAME resolution to CDNs from multiple vendors for the same name; intelligent DNS resolution does, however, support configuring different CNAME records for different regions or network lines; and a CDN rarely fails as a whole, regional failures being far more common.
To configure multiple CDN vendors, first set up primary and standby CDN acceleration separately for domestic and overseas access. Then set the TTL of the CDN CNAME records to 60s, which shortens the time a failover takes to become effective when a single CDN vendor’s service is unavailable. Finally, build a CDN management platform that connects to the DNS management APIs of multiple vendors, with switch and switch-back strategies configured in advance, so that a failure can be handled with a one-click switch.
The resulting effect is clear. When an alert reports a large-area failure at CDN vendor A, the CNAME resolution for the affected area can be failed over to vendor B through the CDN operation and maintenance platform, and the change takes effect within about 1 minute.
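A pre-configured switch and switch-back strategy of this kind can be sketched as a small planning function. This is only an illustration under stated assumptions: the domain, the vendor names `cdn-a` and `cdn-b`, the CNAME targets and the `desired_records` helper are all hypothetical, and a real platform would push the resulting records through each DNS vendor’s management API.

```python
from dataclasses import dataclass

CNAME_TTL_SECONDS = 60  # short TTL so that a switch takes effect in about a minute


@dataclass(frozen=True)
class CnameRecord:
    domain: str
    line: str    # resolution line, e.g. "domestic" or "overseas"
    target: str  # CNAME target handed out on this line
    ttl: int = CNAME_TTL_SECONDS


# Hypothetical plan: for each (domain, line), CDN vendors in priority order.
CDN_PLAN = {
    ("www.example.com", "domestic"): [
        ("cdn-a", "www.example.com.cdn-a.example"),
        ("cdn-b", "www.example.com.cdn-b.example"),
    ],
    ("www.example.com", "overseas"): [
        ("cdn-a", "www.example.com.cdn-a.example"),
        ("cdn-b", "www.example.com.cdn-b.example"),
    ],
}


def desired_records(plan, failed=frozenset()):
    """Per (domain, line), choose the first vendor not marked as failed on that line.

    `failed` holds (vendor, line) pairs reported by monitoring.
    """
    records = []
    for (domain, line), vendors in plan.items():
        for vendor, target in vendors:
            if (vendor, line) not in failed:
                records.append(CnameRecord(domain, line, target))
                break
        else:
            raise RuntimeError(f"no healthy CDN vendor for {domain} on line {line}")
    return records


# Vendor A fails for domestic users only: only the domestic line switches to vendor B.
for record in desired_records(CDN_PLAN, failed={("cdn-a", "domestic")}):
    print(record)
```

Switching back is the same call with the failure removed from `failed`, which is why keeping the plan declarative makes a one-click switch and switch-back straightforward.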
The following figure is an example of the switching interface of our operation and maintenance platform, which can be switched in different access scenarios for domestic and overseas users according to different second-level domain names.
In both 2020 and 2021 we encountered real failures on the live network, and the CDN failover function was applied effectively, allowing page access to recover quickly.
3. Regional geographic disaster scenario: multi-active solution for page access
Here we introduce the remote multi-active networking strategy for our two sites, China and international, as shown in the figure:
For regional geographic disasters, we deploy the site multi-active across multiple regions. The solution keeps the page content published by the content management service synchronized across multiple cloud service regions, and keeps the LB and gateway routing configuration consistent across the live regions.
In terms of configuration, the CDN back-to-source traffic of domestic and overseas users is distributed across cloud service regions in proportion. A health check policy raises an alert when a region-level failure occurs, so that back-to-source traffic can be switched, automatically or manually, to a healthy region. Where overseas and domestic services differ, cross-region routing is done at the LB or gateway over the cloud vendor’s internal dedicated line.
As a result, in normal (non-disaster) operation multiple regions serve page requests at the same time, which reduces the back-to-source pressure on any single region. When a region-level failure does occur, the CDN Admin API can be used for a one-click failover, so that CDN back-to-source traffic quickly returns to an available state.
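The proportional distribution of back-to-source traffic, and the removal of a failed region from it, can be sketched as follows. The region names and the 60/40 split are hypothetical, and `effective_weights` is an illustrative helper rather than part of any CDN API.

```python
def effective_weights(weights, healthy):
    """Renormalize back-to-source weights over the regions that pass health checks."""
    alive = {region: w for region, w in weights.items() if region in healthy and w > 0}
    total = sum(alive.values())
    if total == 0:
        raise RuntimeError("no healthy region available for back-to-source traffic")
    return {region: w / total for region, w in alive.items()}


# Hypothetical split of back-to-source traffic between two cloud service regions.
WEIGHTS = {"region-a": 60, "region-b": 40}

# Normal operation: both regions share the traffic in proportion.
print(effective_weights(WEIGHTS, healthy={"region-a", "region-b"}))

# Region A fails its health check: all traffic goes to region B.
print(effective_weights(WEIGHTS, healthy={"region-b"}))
```

Renormalizing over healthy regions means the same function covers both the normal multi-active case and the failover case.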
As shown in the figure, when a single cloud service region fails, our operation and maintenance platform can quickly remove the failed region. This is mainly done by batch-updating the region-level back-to-source DNS A records of the second-level domain names.
4. Single cloud global failure scenario: website backup and switching scheme
Finally, let’s introduce the bottom layer of the whole high-availability scheme, the last line of defense: website backup and failover. First, let’s look at the website backup process, as shown in the figure:
The operation and maintenance staff first configure the site metadata and the backup policy. Site management then issues backup tasks to the scheduling service according to the backup policy, and the scheduling service calls the backup service to run the backup tasks on a regular basis.
Collection starts when the backup service launches a headless browser to load the entry page. The browser loads the static page resources, executes the page’s own scripts to load dynamic page resources, and then executes preset scripts to load further dynamic page resources. Finally, it identifies the page’s jump URLs, including those in HTML tags and the dynamic jump points triggered by scripts, and starts new headless browser instances to crawl them in cascade.
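The cascade crawl can be illustrated with a minimal sketch that uses only the Python standard library. It is a simplification: the `render` callable stands in for the headless browser (it is assumed to return the HTML after scripts have run), only links in `<a href>` tags are followed, and script-triggered jump points are not handled. The site content is made up.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def cascade_crawl(entry_url, render, max_pages=100):
    """Breadth-first crawl of same-site pages, starting from the entry page.

    `render(url)` is a stand-in for the headless browser and returns the page HTML.
    """
    site = urlparse(entry_url).netloc
    seen = {entry_url}
    queue = deque([entry_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = render(url)
        pages[url] = html  # a real backup would store this in object storage
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            link = link.split("#", 1)[0]  # ignore fragments
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page site used instead of a real browser.
FAKE_SITE = {
    "https://www.example.com/": '<a href="/a.html">A</a><a href="https://other.example/x">X</a>',
    "https://www.example.com/a.html": '<a href="/">home</a><a href="b.html#top">B</a>',
    "https://www.example.com/b.html": "<p>leaf</p>",
}
pages = cascade_crawl("https://www.example.com/", FAKE_SITE.__getitem__)
print(sorted(pages))
```

The `seen` set prevents pages from being crawled twice, and the same-site check keeps the backup from following links to external domains.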
After collection, the main document and its related page resources are stored: once a page has finished loading, its main document is written to the object storage service through the OBS interface. Remote disaster recovery of the page content then relies on the cross-region synchronization capability of object storage provided by the cloud vendor. For cross-cloud disaster recovery, a cross-cloud synchronization tool copies the page content of the backup site to the object storage services of other cloud vendors.
With the backup in place, let’s look at the failover process. When the web service becomes unavailable as a whole because of a single-cloud, multi-region failure caused by infrastructure problems or other reasons, the first stage is fault detection: the page availability probing service detects that cloud service regions A and B are unavailable and sends an alarm within 5 minutes.
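One common way to turn probe results into an alarm is to require several consecutive failed probes. The sketch below assumes a 60-second probe interval and a threshold of 3 failures; these values are illustrative and are not taken from the talk.

```python
ALARM_DEADLINE_SECONDS = 5 * 60  # the alarm must be sent within 5 minutes


def alarm_probe_index(probe_results, threshold=3):
    """Return the index of the probe at which the alarm fires, or None.

    `probe_results` is a sequence of booleans, True meaning the page was reachable.
    The alarm fires once `threshold` consecutive probes have failed.
    """
    consecutive = 0
    for index, ok in enumerate(probe_results):
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= threshold:
            return index
    return None


def worst_case_detection_seconds(interval_seconds=60, threshold=3):
    """Upper bound on detection delay when the failure starts right after a successful probe."""
    return interval_seconds * threshold


print(alarm_probe_index([True, True, False, False, False, False]))
print(worst_case_detection_seconds() <= ALARM_DEADLINE_SECONDS)
```

Requiring consecutive failures trades a little detection time for fewer false alarms; the interval and threshold simply have to be chosen so that the worst case stays inside the 5-minute target.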
The next stage is failover. An emergency response team for major incidents is set up, and the operation, maintenance and disaster recovery management platform is opened to check the probing results for the unavailable regions and for the backup sites. If the backup site on the same cloud is available, traffic is switched to it first; if it is not available and the backup site at a third-party cloud vendor is available, traffic is switched to that backup site. The switch itself is done by updating the A record of the back-to-source domain so that it points to the public network access address of OBS.
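The order of preference described above can be written down as a small decision function. The sketch is illustrative only: the target names, the record format and the addresses (taken from documentation-only IP ranges) are hypothetical.

```python
from typing import Optional

# Hypothetical public access addresses of the backup sites.
BACKUP_ADDRESSES = {
    "same-cloud-backup": "192.0.2.10",
    "third-party-backup": "198.51.100.10",
}


def choose_failover_target(same_cloud_backup_ok: bool, third_party_backup_ok: bool) -> Optional[str]:
    """Prefer the backup site on the same cloud, then the third-party cloud backup."""
    if same_cloud_backup_ok:
        return "same-cloud-backup"
    if third_party_backup_ok:
        return "third-party-backup"
    return None


def plan_switch(origin_domain: str, same_cloud_backup_ok: bool, third_party_backup_ok: bool):
    """Return the A record update for the back-to-source domain, or None if no backup is usable."""
    target = choose_failover_target(same_cloud_backup_ok, third_party_backup_ok)
    if target is None:
        return None
    return {"name": origin_domain, "type": "A", "value": BACKUP_ADDRESSES[target]}


# Same-cloud backup is down, third-party backup is up.
print(plan_switch("origin.example.com", same_cloud_backup_ok=False, third_party_backup_ok=True))
```

Keeping the decision separate from the record update makes the preference order easy to rehearse in routine drills.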
The last stage is fault repair. First the problem is located and fixed, and probing confirms that the web servers are available again; then the switch-back is performed manually, and users return to normal access.
Conclusion
The above summarizes some practical experience in keeping a website online under a variety of extreme scenarios. These schemes have proven effective in real incidents and are rehearsed continuously in routine drills.
For websites of different types or sizes there is no specific quantitative standard for high availability, but a few rough levels can serve as a reference. The most basic level only guarantees that the functions are available, without considering single points of failure among network elements. One level up, application services are deployed as clusters, and the database, cache and other middleware are deployed for high availability, so that there are essentially no single points of failure. The next level is deployment across multiple data centers, which addresses the unavailability of a single data center. The final level is remote multi-active deployment or disaster recovery, which handles a disaster affecting an entire geographic area.
Beyond these traditional practices, as more and more enterprises move to the cloud, it is also necessary to consider how to quickly replace, and escape from, the infrastructure of a single cloud vendor when it fails as a whole. CDN and DNS, for example, are key failure points that must be considered in the basic scenario of website access.
Bonus
At this conference, two other Huawei experts also shared “Five Key Measures in the Intelligent Practice of the Huawei Cloud Official Website” and “Front-End Technology Evolution and Low-Code Practice of the Huawei Cloud Official Website”. They also answered developers’ questions on topics such as practical experience with intelligent web recommendation and low-code platform selection. You are welcome to scan the code to watch the videos.