CDNs have become a key piece of Internet infrastructure. More and more network services depend on them, and their stability directly affects business availability. At Meituan, CDN disaster recovery has always been handled by the SRE team, and there have been few schemes or practices on the client side. Drawing on the practice of the Meituan Takeout business, this article introduces a scheme that senses CDN availability on the client and performs disaster recovery (DR) switchover automatically. The scheme effectively reduces the sensitivity of services to CDN anomalies, improves service availability, and relieves the operation and maintenance pressure around CDN. We hope it can help or inspire engineers who are troubled by CDN problems.

1. Introduction

As a business developer, have you ever run into images that fail to load, pages that open slowly, broken page layouts, or blank pages because of CDN problems? Have you seen a CDN domain name fail in certain regions, halting the business and triggering a stream of customer complaints, and found yourself at a loss? As a CDN operations engineer, are you often overwhelmed by CDN problems reported by business teams, looking for a fix under constant pressure while complaining about the reliability of the service provider? This article introduces the client-side CDN disaster recovery scheme of the Meituan Takeout technical team. In practice, we have found that this scheme effectively reduces the anxiety of both operations engineers and business developers. We hope our experience can help other technical teams as well.

2. Background

CDN effectively reduces the network access latency caused by geographic distribution, bandwidth, and server performance. It has become an indispensable part of the Internet and one of the services on which front-end businesses rely most heavily. In production, we usually host large numbers of static resources, such as JS scripts, CSS files, images, video, and audio, on CDN services to benefit from edge-node caching. However, while CDN brings a better experience, the business is also frequently affected by CDN failures. For example, abnormal CDN edge nodes or blocked CDN domain names can lead to blank pages, broken layouts, and images that fail to load.

Whenever a CDN failure occurs, the business side is usually helpless and can only rely on the CDN team. Monitoring and troubleshooting CDN is also a major challenge for SRE. On the one hand, CDN nodes are so widely distributed that monitoring edge nodes is extremely difficult. On the other hand, CDN monitoring aggregated across many businesses hides a great deal of detail, so CDN exceptions in low-traffic services or in specific regions are often drowned out. The SRE team has designed a variety of solutions to reduce the impact of CDN anomalies on the business and has achieved some results. However, several problems remain that cannot be solved satisfactorily:

  • Timeliness: When a CDN has a problem, SRE switches the CDN manually. Because the operation is manual, the response time is hard to guarantee, and the recovery time after failover cannot be guaranteed precisely either.
  • Validity: After switching to the backup CDN, the availability of the backup CDN cannot be verified. In addition, because of Local DNS caching, domain name hijacking and cross-network access cannot be solved.
  • Accuracy: CDN switching is a large-scale change that cannot be applied to a single region or a single project.
  • Risk: Switching to the backup CDN may trigger back-to-origin requests, and the surge of traffic may bring down the origin server, causing even greater risk.

At present, Meituan Takeout serves over 100 million people every day. At this traffic volume, even a small problem is magnified into a big one. In the dynamic architecture of Takeout, 70% of business resources depend on CDN, so CDN availability seriously affects the Takeout business. How to perform CDN disaster recovery more effectively and reduce the impact of CDN anomalies on services is a problem we think about constantly.

Since these problems cannot be solved perfectly on the SRE side, can the client side make some attempts, for example by moving CDN disaster recovery forward to the client? Phoenix is a client-side CDN disaster recovery scheme built on this idea and continuously practiced and improved through front-end capability building. The scheme not only effectively reduces the impact of CDN anomalies on services, but also improves the success rate of CDN resource loading. It now serves multiple businesses and Apps within Meituan.

3. Objectives and scenarios

3.1 Core Objectives

To reduce the impact of CDN anomalies on the business, improve business availability, and relieve the pressure on SRE colleagues responsible for CDN operations, we set the following objectives at the start of the scheme design:

  • Automatic domain name switchover on the client: When a CDN is abnormal, the client senses it and automatically switches the CDN domain name to retry the load, reducing the dependence on manual operations.
  • CDN domain name isolation: Achieve service isolation and service equivalence across CDN domain names and service providers at the regional level, so that a retry after switching domains is actually effective.
  • More accurate and effective CDN monitoring: Build finer-grained CDN monitoring that tracks CDN availability in real time at the project level, addressing the coarse granularity and alarm lag of SRE-side CDN monitoring. DR policies are adjusted dynamically based on DR monitoring to reduce the frequency of CDN switchover.
  • Continuous hot backup of domain names: Keep every CDN domain name continuously warmed up to avoid back-to-origin requests during traffic switchover.

3.2 Application Scenarios

The scheme applies to all client-side scenarios that rely on CDN and want to reduce the impact of CDN anomalies on services, including Web, SSR Web, and Native technologies.

4. The Phoenix Scheme

CDN stability has always been guaranteed by SRE, and disaster recovery measures have always been carried out on the SRE side. However, guarantees at the link level alone make it hard to handle localized problems and to stop losses quickly. As the final carrier of the business, the user's device is naturally independent and naturally sensitive to resource loading. If CDN disaster recovery is moved forward to the client, it offers timeliness and accuracy that the SRE side cannot match. Client-side DR requires two capabilities: sensing CDN availability and switching CDNs on the client. We surveyed the front-end field and found no existing practice or published work on client-side CDN disaster recovery in the industry, so the whole scheme was built from scratch.

4.1 Overall Design

The Phoenix client-side CDN DR scheme consists of five parts:

  • Client-side DR SDK: Responsible for sensing resource loading on the client, CDN switchover retry, and monitoring reporting.
  • Dynamic computing service: Based on the data reported by the client-side SDK, it periodically polls and calculates the availability of multiple groups of equivalent domain names by city, project, time period, and other dimensions, and dynamically shifts traffic to the optimal CDN. It also serves as a routine inspection of CDN availability.
  • DR monitoring platform: Provides CDN availability monitoring and alarms at the project level and in an overall dashboard, with detailed information for troubleshooting.
  • CDN service: Provides complete CDN link services, implements domain name isolation in the architecture, and provides equivalent domain names for business teams to ensure that client-side DR is effective. An equivalent domain name is one that serves the same resource at the same path: if two URLs that differ only in their domain name return the same content, the two domain names are equivalent.
  • DR configuration platform: Manages each project's DR domain names and its monitoring and reporting policies, and provides manual intervention in CDN traffic.

4.2 DR Process Design

To keep DR behavior and monitoring indicators consistent across all client platforms, a unified DR process is designed as follows:

4.3 Implementation Principle

4.3.1 Client-Side DR SDK

Web client implementation

CDN resources on the Web side are mainly JS, CSS, and images, so our disaster recovery work focuses on these. For Web DR, we implemented disaster recovery for static resources, asynchronous resources, and image resources.

Implementation approach

The key problem in resource disaster recovery is sensing the result of loading a resource. Usually this is done by adding error callbacks to resource tags. Image disaster recovery can be done this way, but the approach does not suit JS, which has a strict execution order. To solve this, we replaced the traditional tag-based way of loading resources with XHR. Webpack extracts the synchronous resources during the build phase of the project, and PhoenixLoader then loads them. In this way, the result of a resource load can be sensed through the status code returned by the network request.
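The idea of sensing load results through status codes while preserving JS execution order can be sketched as follows. This is only a minimal illustration under stated assumptions, not the PhoenixLoader implementation: the names `loadScriptsInOrder`, `Fetcher`, and `FetchResult` are hypothetical, and the request function is injected so that the sketch does not depend on any particular XHR wrapper.

```typescript
// Outcome of one network request as the loader sees it.
interface FetchResult {
  status: number; // HTTP status code; 0 models a network-level failure
  body: string;   // response text, i.e. the script source
}

// Hypothetical request function; in a browser it could wrap XMLHttpRequest.
type Fetcher = (url: string) => Promise<FetchResult>;

interface OrderedLoadResult {
  ok: boolean;
  failed: string[]; // URLs whose status code was not 2xx
}

function isSuccess(status: number): boolean {
  return status >= 200 && status < 300;
}

// Download all scripts concurrently, then execute them strictly in the
// declared order. Nothing is executed if any download failed, so a script
// never runs before the scripts it depends on.
async function loadScriptsInOrder(
  urls: string[],
  fetchText: Fetcher,
  execute: (source: string, url: string) => void,
): Promise<OrderedLoadResult> {
  const results = await Promise.all(
    urls.map(async (url) => {
      try {
        const res = await fetchText(url);
        return { url, status: res.status, body: res.body };
      } catch {
        // A rejected request is treated like a failed status code.
        return { url, status: 0, body: "" };
      }
    }),
  );
  const failed = results.filter((r) => !isSuccess(r.status)).map((r) => r.url);
  if (failed.length > 0) {
    return { ok: false, failed };
  }
  for (const r of results) {
    execute(r.body, r.url);
  }
  return { ok: true, failed: [] };
}
```

The `failed` list is what a DR layer would need in order to retry those URLs on an equivalent domain before executing anything.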

We designed the SDK as a Webpack plugin, mainly based on the following four considerations:

  1. Versatility: Meituan's front-end technology stack is fairly diverse, so the DR SDK must cover most technical frameworks.
  2. Ease of use: If the integration cost is too high, it increases developers' workload, the scheme cannot achieve effective business coverage, and its value is lost.
  3. Stability: The scheme must be stable and reliable, and must not itself be affected by CDN availability.
  4. Non-intrusiveness: The SDK must not intrude on normal business logic. It must be plug and play to protect the stability of services.

According to our survey, 70% of front-end projects are built with Webpack, and a Webpack plugin, being plug and play and independently configurable, is the best vehicle for the implementation. The overall design is as follows:

Of course, many teams use code splitting and on-demand loading for performance optimization. These resources cannot be captured when the synchronous resources are generated, yet their loading results also affect service availability. For asynchronous resources, we rewrite the way Webpack handles asynchronous chunks and let PhoenixLoader take over their loading, which brings them under disaster recovery as well. The overall analysis process is shown in the figure below:

CSS resources are handled differently from JS, but the principle is similar: only the asynchronous loading implementation of mini-css-extract-plugin needs to be rewritten.

Resource loading flow of the Web solution:

Disaster recovery results

Native client disaster recovery

CDN resources on the Native client are mainly images, audio and video, and the bundle resources of various dynamic rendering schemes. The DR work on the Native side focuses on these resources.

Implementation approach

Re-requesting is the basic principle of the Native CDN DR solution. Based on mutually backed-up CDN domain names, the Native DR infrastructure re-requests a resource using a DR domain name. The whole process takes place only after the original request fails. The Native DR infrastructure does nothing while the original request is in flight, so it cannot affect the original request. After the original request fails, the Native DR infrastructure intercepts the failure on the business side's behalf instead of returning it, so the business side is still waiting for a result. Once the re-request completes, the final result is returned to the business side. From the business side's perspective, only one request is issued and only one result is received, so the business is unaware of the process. To make re-requests as efficient as possible, the number of re-requests should be kept to a minimum.

We investigated what the businesses care about and which network frameworks they use, and combined this with the basic flow of the Phoenix DR scheme. The scheme design mainly considers the following points:

  • Convenience: Ease of integration is the first consideration in designing the SDK. A business should be able to integrate the SDK and gain resource disaster recovery in the simplest way possible, and should be able to remove the SDK without leaving anything behind.
  • Compatibility: What is unique to Android is the diversity of network frameworks, including Retrofit, OkHttp, OkHttp3, and the now rarely used URLConnection. The SDK should be compatible with all of these, so that a business keeps its DR capability at minimal cost even if it changes network frameworks. On iOS, we reuse a single NSURLProtocol to intercept requests, which reduces code redundancy and unifies adaptation at initialization.
  • Scalability: Beyond the basic functions, advanced configuration must be available to meet special requirements, such as custom monitoring and reporting of monitoring data.

Based on these design points, Phoenix is structured as shown in the following diagram. The DR SDK as a whole is divided into two parts: Phoenix-Adaptor and Phoenix-Base.


Phoenix-Base is the core of Phoenix DR. It includes the DR data cache, the domain name change component, the DR request executor (distinct from the original request executor), and the monitor. It also contains an access module that exposes these capabilities externally.

  1. DR data cache: Periodically fetches and updates DR data. The resulting data is used only by the domain name change component.
  2. Domain name change component: The central node that connects the DR data cache, the DR request executor, and the monitor. It matches the host of the failed original request, filters error codes, provides the DR domain name to the DR request executor, and passes a detailed record of the whole DR process to the monitor.
  3. DR request executor: The component that actually issues DR requests. Currently an internal OkHttp3Client is used, and the business side can substitute its own executor.
  4. Monitor: Distributes the detailed data of the DR process and reports it to the built-in data dashboard. If a custom external monitor is registered, the data is distributed to it as well.
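The role of the domain name change component can be illustrated with a small sketch. It is shown in TypeScript for brevity, although the real component lives in the native SDKs, and everything in it is an assumption made for illustration: the `DrData` shape, the function names, and in particular the rule for which error codes are retried (the original does not say which codes are filtered).

```typescript
// Hypothetical shape of the cached DR data: groups of mutually equivalent hosts.
type DrData = string[][];

// Assumed filter: retry on network failure (0) and 5xx only. A 4xx such as 404
// would most likely be identical on an equivalent domain, so it is not retried.
function isRetryable(status: number): boolean {
  return status === 0 || status >= 500;
}

// Split an http(s) URL into scheme, host, and everything after the host.
function splitUrl(url: string): { scheme: string; host: string; rest: string } | null {
  const m = /^(https?:\/\/)([^\/?#]+)(.*)$/.exec(url);
  return m ? { scheme: m[1], host: m[2], rest: m[3] } : null;
}

// Given a failed request, return the URL to re-request on a DR domain,
// or null when no retry should happen.
function pickDrUrl(
  failedUrl: string,
  status: number,
  triedHosts: string[],
  drData: DrData,
): string | null {
  if (!isRetryable(status)) return null;
  const parts = splitUrl(failedUrl);
  if (parts === null) return null;
  // Match the failed host against the groups of equivalent hosts.
  const groups = drData.filter((g) => g.indexOf(parts.host) !== -1);
  if (groups.length === 0) return null;
  // Choose the first equivalent host that has not been tried yet.
  const candidates = groups[0].filter(
    (h) => h !== parts.host && triedHosts.indexOf(h) === -1,
  );
  if (candidates.length === 0) return null;
  // Same scheme, same path and query, different host.
  return parts.scheme + candidates[0] + parts.rest;
}
```

Keeping a list of already tried hosts is one way to keep the number of re-requests small, since no host is requested twice for the same resource.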


Phoenix-Adaptor is the extension part of Phoenix DR that provides compatibility with the various network frameworks.

  • Binder: Generates an interceptor suited to each network framework and binds it to the original request executor.
  • Parser: Converts the network framework's Request into a Request for the Phoenix internal executor, and converts the Phoenix internal executor's Response back into a Response of the external network framework.

Disaster recovery results

① Service success rate

Taking images as an example, the Android service success rates compare as follows (version 7512: Phoenix DR not enabled on 2021.01.17, enabled on the evening of 2021.01.19).

Comparison of iOS service success rates (version 7511: Phoenix DR not enabled on 2021.01.17, enabled on 2021.01.19).

② Risk response

Taking images in the Takeout App and the Meituan App as an example, the following compares image success rates during a CDN anomaly between the Takeout App, which had integrated Phoenix, and the Meituan App, which had not.

4.3.2 Dynamic Computing Services

If a resource fails to load from one domain name, the client retries the domain names in the DR list one by one until the load succeeds or finally fails, as shown below:

If domain name A is failing widely, the client will still request domain name A first, which incurs unnecessary retry costs. How to make the first load of a resource more stable and effective, and how to dynamically provide the optimal CDN domain name list for different businesses and regions, are the problems the dynamic computing service sets out to solve.
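The retry behavior described at the start of this section can be sketched as a simple loop over the DR list. This is a hedged illustration only: `loadWithFallback`, `ResourceLoader`, and the HTTPS URL construction are hypothetical names and assumptions, and the request function is injected rather than tied to a specific network stack.

```typescript
// Hypothetical request function: resolves with the HTTP status code and
// rejects on a network-level error.
type ResourceLoader = (url: string) => Promise<number>;

interface FallbackResult {
  ok: boolean;
  host: string | null; // host that finally served the resource, if any
  attempts: string[];  // hosts tried, in order
}

// Try each host in the DR list in order until one succeeds or the list is exhausted.
async function loadWithFallback(
  path: string,
  drList: string[],
  load: ResourceLoader,
): Promise<FallbackResult> {
  const attempts: string[] = [];
  for (const host of drList) {
    attempts.push(host);
    try {
      const status = await load("https://" + host + path);
      if (status >= 200 && status < 300) {
        return { ok: true, host, attempts };
      }
    } catch {
      // Network-level failure: fall through to the next host in the list.
    }
  }
  return { ok: false, host: null, attempts };
}
```

The sketch makes the cost visible: when the first host in the list is the one that is failing, every load pays for at least one wasted attempt, which is why the order of the list matters.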

Calculation principle

The dynamic computing service is associated with project appkeys through a domain name pool, and manages policies by province, prefecture-level city, project, and resource. It takes the resource loading results reported within the last 5 minutes and, through periodic polling, calculates the availability of each domain name in the pool by region (city and province). It then dynamically adjusts the order of the domain names according to their availability and outputs the result. Here is a complete calculation:

Suppose there are three domain names A, B, and C, with success rates of 99%, 98%, and 97.8% and traffic shares of 90%, 6%, and 4%. The success rate gap between A and B is 1 percentage point, so B transfers half of its traffic (3%) to A. The gap between A and C is greater than 1 point, so C also transfers half of its traffic (2%) to A. The gap between B and C is 0.2 points, so C transfers a further quarter of its remaining traffic (0.5%) to B. The resulting traffic shares are 95% for A, 3.5% for B, and 1.5% for C. The final result is output after sorting and random calculation.
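The arithmetic of this example can be reproduced with a short sketch. The transfer rule in `transferFraction` is an assumption inferred only from the numbers in the example (a gap of at least 1 percentage point moves half of the weaker domain's current traffic, a smaller positive gap moves a quarter); the actual rule used by the dynamic computing service is not specified in this article, and all names are hypothetical.

```typescript
interface DomainStat {
  name: string;
  successRate: number; // percent
  traffic: number;     // percent of total traffic
}

// Assumed rule, inferred from the worked example only.
function transferFraction(gap: number): number {
  if (gap >= 1) return 0.5;
  if (gap > 0) return 0.25;
  return 0;
}

// Compare every pair of domains, best first, and move part of the weaker
// domain's current traffic to the stronger one.
function rebalance(stats: DomainStat[]): DomainStat[] {
  const sorted = stats
    .map((s) => ({ name: s.name, successRate: s.successRate, traffic: s.traffic }))
    .sort((x, y) => y.successRate - x.successRate);
  for (let i = 0; i < sorted.length; i++) {
    for (let j = i + 1; j < sorted.length; j++) {
      const better = sorted[i];
      const worse = sorted[j];
      const moved = worse.traffic * transferFraction(better.successRate - worse.successRate);
      worse.traffic -= moved;
      better.traffic += moved;
    }
  }
  return sorted;
}
```

Called with A (99%, 90%), B (98%, 6%), and C (97.8%, 4%), `rebalance` yields traffic shares of 95, 3.5, and 1.5, matching the example. Because each step moves only a fraction of a domain's traffic, repeated polling shifts traffic gradually rather than all at once.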

Since A has the largest share, A is preferred. Thanks to the random step, B and C still receive a certain amount of traffic, and because transfers follow a fixed benchmark, traffic is switched smoothly.
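One plausible reading of the "sorting and random calculation" step is a weighted random ordering of the domain list, in which a domain's chance of being placed first equals its traffic share. The sketch below illustrates that reading only; it is an assumption, not the documented algorithm, and `weightedOrder` is a hypothetical name. The random source is injectable so that the behavior can be checked deterministically.

```typescript
interface DomainWeight {
  name: string;
  traffic: number; // relative weight, e.g. percent of traffic
}

// Draw domains one at a time without replacement; each draw picks a domain
// with probability proportional to its weight among those remaining.
function weightedOrder(
  weights: DomainWeight[],
  random: () => number = Math.random,
): string[] {
  const pool = weights
    .filter((w) => w.traffic > 0)
    .map((w) => ({ name: w.name, traffic: w.traffic }));
  const order: string[] = [];
  while (pool.length > 0) {
    const total = pool.reduce((sum, w) => sum + w.traffic, 0);
    let ticket = random() * total;
    let index = pool.length - 1; // fallback guards against rounding at the upper edge
    for (let i = 0; i < pool.length; i++) {
      ticket -= pool[i].traffic;
      if (ticket < 0) {
        index = i;
        break;
      }
    }
    order.push(pool.splice(index, 1)[0].name);
  }
  return order;
}
```

With shares of 95, 3.5, and 1.5, a client would receive a list starting with A about 95% of the time, while B and C still lead the list for a small fraction of clients and therefore stay warm.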

Recovery from anomalies

When CDN A cannot be accessed, its traffic is switched to the equivalent CDN B. If SRE finds that the switchover is too slow, traffic can be allocated manually. As the success rate of the small amount of traffic remaining on A recovers, repeated calculation gradually increases A's traffic until the initial state is restored.

Service results

The dynamic computing service raised the first-load success rate of resources from 99.7% to 99.9%. The following figure compares the resource loading success rate with and without dynamic computing.

4.3.3 DR Monitoring

At the monitoring level, the SRE team usually focuses only on indicators in aggregate dimensions such as domain name, large region, and operator. The monitored traffic is huge, so CDN fluctuations in low-traffic services or small regions may not be identified by monitoring analysis, and abnormal CDN edge nodes go undetected. DR monitoring is built mainly to address the alarm lag and coarse granularity of the SRE team's CDN monitoring. The overall monitoring design is as follows:

Process design

Client-side DR data is reported, monitoring indicators are established by project, App, resource, domain name, and other dimensions, and CDN availability becomes part of project availability. By analyzing and aggregating data on the computing platform, we build a CDN availability dashboard that can be viewed by domain name, region, project, time, and other dimensions. Integration with Skynet monitoring and a minute-level monitoring and alarm mechanism greatly improve the sensitivity with which CDN anomalies are sensed. At the same time, Skynet monitoring on the SRE side also feeds into the results of the dynamic computing service. The overall monitoring process is as follows:

Monitoring results

DR monitoring not only tracks CDN availability at the finer granularity of individual projects, but also provides richer information for CDN anomaly detection, such as region, operator, network status, and return code. For alarms, minute-level anomaly alerts are implemented, with higher sensitivity than Meituan's internal monitoring system.
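Minute-level, project-level availability can be derived from the reported loading results by simple bucketing. The sketch below is a minimal illustration under assumptions: the report fields, the function names, and the alarm thresholds are all hypothetical and are not taken from the Phoenix monitoring platform.

```typescript
// Hypothetical shape of one loading result reported by the client SDK.
interface LoadReport {
  timestampMs: number;
  project: string;
  domain: string;
  ok: boolean;
}

interface Bucket {
  minute: number; // minutes since the Unix epoch
  project: string;
  domain: string;
  total: number;
  success: number;
}

// Aggregate reports into one bucket per (minute, project, domain).
function aggregateByMinute(reports: LoadReport[]): Bucket[] {
  const byKey: { [key: string]: Bucket } = {};
  for (const r of reports) {
    const minute = Math.floor(r.timestampMs / 60000);
    const key = minute + "|" + r.project + "|" + r.domain;
    const bucket =
      byKey[key] ||
      (byKey[key] = { minute, project: r.project, domain: r.domain, total: 0, success: 0 });
    bucket.total += 1;
    if (r.ok) bucket.success += 1;
  }
  return Object.keys(byKey).map((k) => byKey[k]);
}

// Flag buckets whose success rate is below minRate. minSamples avoids
// alarming on a handful of requests.
function findAlarms(buckets: Bucket[], minRate: number, minSamples: number): Bucket[] {
  return buckets.filter((b) => b.total >= minSamples && b.success / b.total < minRate);
}
```

Adding further key fields such as city or operator to the bucket key would give the finer dimensions mentioned above, at the cost of fewer samples per bucket.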

4.3.4 CDN Service

The effectiveness of client-side domain name switching depends on support from the CDN service. Building on the original SRE-side disaster recovery, the whole CDN service was upgraded to achieve domain name isolation, eliminating the drawbacks of a single domain name mapping to multiple CDNs and of multiple domain names mapping to a single CDN.

5. Summary and outlook

After a year of construction and development, the Phoenix CDN disaster recovery scheme has matured steadily. It is now Meituan's only public service for CDN disaster recovery and has played a major role during many CDN anomalies. On the client side, the scheme recovers more than 30 million resources per day for more than 350,000 users, covering business departments such as takeout, hotel and travel, catering, Select, and shopping, and serving more than 200 projects. The Takeout App, the Meituan App, and the Dianping App have all integrated it.

On the SRE side, accurate minute-level alarms at the project level have been implemented and anomaly information has been enriched, which greatly improves the efficiency of SRE troubleshooting. Since the scheme was rolled out at scale, manual switching has rarely been needed when a CDN is abnormal, which greatly reduces the operations pressure on SRE colleagues.

Because front-end technology is so diverse and complex, our SDK cannot cover every technical solution. Going forward, we will actively share our disaster recovery principles and open up the dynamic computing service, in the hope that more frameworks and services will, based on our ideas and together with the businesses, implement client-side CDN disaster recovery themselves. For the scheme itself, we will continue to optimize resource loading performance and improve capabilities such as resource verification and intelligent switching. Engineers who are interested in the Phoenix CDN disaster recovery scheme are welcome to discuss it with us. You are also welcome to join us: recruitment information is attached at the end of this article, and we look forward to your email.

6. Author introduction

Wei Lei, Chen Tong, Zhang Qun, and Yue Jun are from the Big Front End team of the Meituan Takeout platform. Ding Lei and Xin Peng are from Meituan's catering SaaS team.

7. Job postings

The Big Front End team of the Meituan Takeout platform is an open, innovative, and boundary-free team that encourages every member to pursue their own technical dreams. The team is recruiting senior engineers and technical experts for Android, iOS, and FE. If you are interested, please send your resume to [email protected] (subject note: Meituan Takeout front end).

This article is produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use its content for non-commercial purposes such as sharing and communication; please credit “Content reprinted from Meituan Technical Team”. This article may not be reproduced or used commercially without permission. For any commercial use, please email [email protected] to request authorization.