A stability monitoring scheme for ten-billion-scale traffic

Author: Mo Hui

No. 1 Introduction

Monitoring is the first line of defense for production safety. Its core capabilities are effective alarm coverage, the ability to detect online problems, and the ability to locate them quickly.

The overall safety goal is 1-5-10: detect a problem within 1 minute, locate it within 5 minutes, and repair it within 10 minutes.

JSTracker is an end-to-end front-end monitoring and data analysis platform. Its core capabilities are real-time monitoring, multi-terminal coverage, data analysis, and intelligent analysis, and the stability guarantee for Taobao's Double 11 was built on it.

This article describes how we built the overall 1-5-10 solution on the JSTracker platform and how we guaranteed the stability of several core Taobao businesses on Double 11, including live streaming, venues, shops, interaction, and transactions.

No. 2

Front-end cross-end solutions keep evolving, and front-end frameworks now cover Web, Weex, and mini programs. The engineering infrastructure is mature, but the fault detection rate is still low. We analyzed the front-end faults of Taobao businesses from FY18 to FY20: on average, monitoring detected fewer than 30% of faults, and the average repair time was longer than 1 hour. The causes are analyzed below from two angles, problem discovery and fast recovery:

Problem discovery

Most faults were not found in time. They were rarely caught by an alarm; most surfaced through passive channels such as online feedback, public opinion, or customer complaints. The analysis points to the following causes:

Services are not monitored: stability awareness is lacking and the infrastructure is incomplete. The monitoring SDK has to be added to each page manually, so many services run online with no monitoring at all
Core indicators are not subscribed: most pages are monitored, but their alarms are not subscribed or only partly subscribed, so online problems are not noticed right away
Monitoring indicators are incomplete: the traditional front-end view covers only anomalies while the page is running, such as JS errors and API errors. Across the whole page lifecycle, many indicators are missing, such as CDN node anomalies, blank screens during page load, and page crashes

Fast recovery

The measured average fault recovery time is far from the 10-minute target. A complete development process consists of development -> release -> online verification and regression. The process is shown in the figure:

If a problem has already been released online, it is hard to recover within 10 minutes by going through the release process again. The fix therefore has to focus on the development and release stages, before the page goes live, and mainly involves two points:

Pre-release: a complete automated testing process that detects and intercepts problems such as resource exceptions and JS errors before release
Release process: page releases follow a complete process that supports grayscale rollout, monitoring, and rollback

No. 3 Overall plan

Following the overall 1-5-10 production safety goal, the solutions for problem detection and fast recovery are as follows:

Monitoring coverage

From the perspective of fault discovery, the core task is monitoring coverage, which breaks down into access coverage, subscription coverage, and indicator coverage. The aim is 100% service access and subscription, and a complete set of page monitoring indicators. The solution is as follows:

Access coverage

We address access coverage along two dimensions: improving infrastructure and business governance

Improve the infrastructure: front-end pages fall into two types, source-code pages and build pages (Zebra, ark). For build pages, monitoring is integrated by default in the solution. For source-code pages, the group's monitoring collection standards and data standards are not yet unified, and standardization is in progress
Business governance: coverage distribution and indicators are aggregated by team and used to compute a page safety score, which guides and pushes businesses to complete access quickly

At the business governance layer, indicator statistics and a business measurement method need to be established. The scheme is as follows:

Indicator statistics

During measurement, indicator data has to be aggregated along several dimensions, such as team and time. The core problem of the statistical analysis is how to roll data up efficiently along the reporting relationship. The overall idea is as follows:

Build a full-path field from employee ids (the employee id path), so that the hierarchy can be read from the nodes of the path
To list everyone under a supervisor, use a LIKE query. For example, to get all employees under A3: id_path like '%A3%'
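
The id-path lookup can be sketched as follows. This is a minimal Python sketch using an in-memory SQLite table. The table name `employee`, the column `id_path`, the `/` delimiter, and the sample ids are all assumptions for illustration, not the platform's actual schema. Wrapping every id in delimiters is a design choice of this sketch: it keeps a pattern for A3 from also matching an id such as A31.

```python
import sqlite3

# Hypothetical schema: each row stores the full reporting chain of one employee,
# from the top-level supervisor down to the employee, with "/" around every id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id TEXT PRIMARY KEY, id_path TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [
        ("A1", "/A1/"),
        ("A3", "/A1/A3/"),
        ("B7", "/A1/A3/B7/"),
        ("C9", "/A1/A3/B7/C9/"),
        ("A31", "/A1/A31/"),
    ],
)


def reports_of(supervisor_id: str) -> list[str]:
    """Return ids of everyone whose reporting chain passes through the supervisor."""
    pattern = f"%/{supervisor_id}/%"
    rows = conn.execute(
        "SELECT id FROM employee WHERE id_path LIKE ? AND id != ? ORDER BY id",
        (pattern, supervisor_id),
    )
    return [row[0] for row in rows]


print(reports_of("A3"))
```

With the sample rows, the lookup for A3 returns B7 and C9 and skips A31, whose path never contains the delimited node `/A3/`.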

Measurement and red/black lists

The goal of measurement is to support business and team decisions. Starting from raw page data, an indicator measurement model is built along the chain raw data -> data analysis -> indicator measurement -> business decisions, which helps businesses find problems quickly

Subscription coverage

The statistics show that many pages have incomplete indicator subscriptions: for example, they subscribe only to JS errors and ignore indicators such as blank screen and crash. In the current situation, the missing subscriptions have to be added for existing pages, and core indicator subscriptions have to be guaranteed for pages released later. The overall process is as follows:

Subscription completion: the governance process counts the pages with missing subscriptions, and the missing indicators are subscribed with one click
Release process improvement: the platform subscribes to release messages and, after a page is published, incrementally subscribes it to the core indicators
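
The two steps above can be sketched as set operations. This is a minimal Python sketch: the indicator keys, the in-memory `subscribed` mapping, and both function names are hypothetical, and a real platform would persist subscriptions rather than hold them in a dict.

```python
# Core indicators named in the text; the string keys are hypothetical.
CORE_INDICATORS = {"js_error", "blank_screen", "crash"}


def missing_subscriptions(subscribed: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each page to the core indicators it has not subscribed to."""
    gaps = {}
    for page, indicators in subscribed.items():
        missing = CORE_INDICATORS - indicators
        if missing:
            gaps[page] = missing
    return gaps


def on_page_published(page: str, subscribed: dict[str, set[str]]) -> set[str]:
    """Handle a release message: subscribe the page to any core indicator it lacks.

    Returns the indicators that were newly subscribed.
    """
    current = subscribed.setdefault(page, set())
    added = CORE_INDICATORS - current
    current |= added
    return added


subscribed = {"page/a": {"js_error"}, "page/b": set(CORE_INDICATORS)}
gaps = missing_subscriptions(subscribed)      # governance report for existing pages
added = on_page_published("page/c", subscribed)  # incremental step for a new page
```

The first function corresponds to the governance report that drives one-click completion; the second corresponds to the incremental subscription triggered by a release message.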

Indicator coverage

A cross-end page goes through three stages in its lifecycle: container start -> air render -> page load and execution

Container layer: containers such as Weex and WebView can detect whether the page goes blank or crashes during loading
Resource layer: an exception on the CDN cannot be sensed from the front end
Page layer: relies on the SDK itself, which globally captures runtime anomalies as monitoring indicators

From a safety perspective, the stability of the entire link must be monitored. To guarantee the whole link, the data link layer provides a unified entry for indicator data and aligns each indicator with the page address
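
Aligning indicators from different layers by page address can be sketched as follows. This is a minimal Python sketch under stated assumptions: the log fields `url` and `indicator` are hypothetical, and reducing a URL to host plus path (dropping query string and fragment) is one plausible normalization rule, not necessarily the one the platform uses.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def page_address(url: str) -> str:
    """Reduce a URL to host + path so logs from different layers share one key."""
    parts = urlsplit(url)
    return f"{parts.netloc}{parts.path}".rstrip("/")


def merge_by_page(logs: list[dict]) -> dict[str, dict[str, int]]:
    """Count logs per page address and per indicator, whichever layer sent them."""
    merged: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for log in logs:
        merged[page_address(log["url"])][log["indicator"]] += 1
    return {page: dict(counts) for page, counts in merged.items()}


logs = [
    {"url": "https://example.com/act/main?spm=1", "indicator": "js_error"},   # page SDK
    {"url": "https://example.com/act/main/#top", "indicator": "crash"},       # container
    {"url": "https://example.com/act/main", "indicator": "js_error"},         # page SDK
]
merged = merge_by_page(logs)
```

Once every log carries the same page key, container-level indicators such as crash and page-level indicators such as JS errors can be shown and alarmed on per page.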

Grayscale monitoring

The key to fast recovery is finding problems faster and rolling back changes faster. Historical failure data shows that about 80% of online problems are caused by changes. Many of these failures are not missed for lack of monitoring: the errors introduced by a new version are small relative to the overall error volume and are not distinguished in the overall logs, so everything looks normal and the problem is overlooked.

Monitoring the change process (grayscale monitoring) requires identifying which abnormal logs come from the new version, so that they can be compared with the online version to reveal a higher error proportion or new problems introduced by the new version. The solution is as follows:

Indicator collection: the collection script reads the grayscale mark from a global variable in the template, and the container layer reads it from the response header
Monitoring indicator: the collection script and the container layer must use one standardized grayscale field, which is carried in every reported log. Mini programs are distinguished by version number
Grayscale application: mainly real-time display of grayscale indicator logs, and alarms

Collection standard

The standard has two parts: a field specification and an integration specification. The field specification covers the different log sources and is unified in the data link. The integration specification defines how the grayscale state is identified in the different cross-end scenarios

Field specification

Integration specification

<meta name="page-tag" content="env=spe,grey=true,version=0.0.1" />
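
The content attribute of the meta tag above is a comma-separated list of key=value pairs. Parsing it can be sketched as follows. This is a minimal Python sketch that shows only the parsing logic; a browser SDK would implement the same steps in JavaScript, and the function name `parse_page_tag` is hypothetical.

```python
def parse_page_tag(content: str) -> dict[str, str]:
    """Split a content string such as "env=spe,grey=true" into key/value pairs."""
    fields = {}
    for item in content.split(","):
        key, sep, value = item.partition("=")
        if sep:  # skip fragments that contain no "="
            fields[key.strip()] = value.strip()
    return fields


tag = parse_page_tag("env=spe,grey=true,version=0.0.1")
is_grey = tag.get("grey") == "true"  # the grayscale flag carried in every log
```

Stripping whitespace around keys and values makes the parser tolerant of content written with spaces after the commas.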

Acquisition methods

The page monitoring SDK and the cross-end containers (Weex, Windvane, etc.) both need to obtain release version information during collection, and each has a limitation:

Monitoring SDK: restricted by the browser, it can only read page-global data, such as meta tags or global variables
Cross-end container: it does not read the template content and can only take the version information from the response header

Given these limitations and the front-end release convention, there are two standard ways to tell the client the current release state:

Response headers: write the version number and grayscale state into the response headers
Page template: inject the information directly at the render layer, writing it to global parameters when the template is rendered

Specific approach

Web SDK: because global variables pollute the global scope, the current standard integration uses meta tags
Container side: the response can be read directly, but this also has drawbacks, such as depending on client releases and iterations
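
The container-side path, reading release information from response headers, can be sketched as follows. This is a minimal Python sketch: the header names `x-page-version` and `x-page-grey` are invented for illustration, since the article does not state the real ones, and a native container would do this in its own client code.

```python
# Hypothetical header names; the article does not state the real ones.
VERSION_HEADER = "x-page-version"
GREY_HEADER = "x-page-grey"


def release_info_from_headers(headers: dict[str, str]) -> dict[str, object]:
    """Extract version and grayscale state from response headers.

    HTTP header names are case-insensitive, so names are lowercased before lookup.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "version": lowered.get(VERSION_HEADER),
        "grey": lowered.get(GREY_HEADER, "").strip().lower() == "true",
    }


info = release_info_from_headers({"X-Page-Version": "0.0.1", "X-Page-Grey": "true"})
```

A response without these headers yields no version and a false grayscale flag, so its logs are treated as coming from the online version.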

Grayscale application

From a monitoring perspective, a new iteration raises two main concerns: whether it introduces new problems and whether it increases the error rate. On the application side, users need to see the indicator changes clearly, as follows:

Grayscale monitoring alarm

The grayscale alarm link is as follows:

1. Subscribe to page release messages and store or delete the page's release information, such as the page address, grayscale state, and publisher

2. The grayscale alarm currently polls every 5 minutes. Each poll pulls the grayscale logs of the latest 30 minutes and the online logs of the latest 12 hours. The longer online window reduces the chance of misjudging an existing error as newly introduced.
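
One poll of the grayscale alarm can be sketched as follows. This is a minimal Python sketch: the window lengths come from the text, while the log fields (`time`, `grey`, `message`) and the rule of comparing error messages as exact strings are assumptions. A scheduler, not shown, would call the function every 5 minutes.

```python
from datetime import datetime, timedelta

GREY_WINDOW = timedelta(minutes=30)  # grayscale logs: latest 30 minutes
ONLINE_WINDOW = timedelta(hours=12)  # online baseline: latest 12 hours


def new_grey_errors(logs: list[dict], now: datetime) -> set[str]:
    """Return error messages seen in recent grayscale logs but absent from
    the online baseline. Each log is a dict with "time", "grey", "message"."""
    grey_recent = {
        log["message"]
        for log in logs
        if log["grey"] and now - log["time"] <= GREY_WINDOW
    }
    online_baseline = {
        log["message"]
        for log in logs
        if not log["grey"] and now - log["time"] <= ONLINE_WINDOW
    }
    return grey_recent - online_baseline


now = datetime(2020, 11, 11, 0, 0)
logs = [
    {"time": now - timedelta(minutes=10), "grey": True, "message": "TypeError: x is undefined"},
    {"time": now - timedelta(minutes=20), "grey": True, "message": "Script error"},
    {"time": now - timedelta(hours=6), "grey": False, "message": "Script error"},
    {"time": now - timedelta(hours=2), "grey": True, "message": "stale grey error"},
]
fresh = new_grey_errors(logs, now)
```

In the sample, "Script error" also appears in the 12-hour online baseline, so only the TypeError counts as new. With a short online window, the same error could have been misreported as introduced by the grayscale version.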

Grayscale real-time monitoring

After indicator collection is in place, the grey field is added to indicator logs to distinguish the grayscale version from the online version. The grayscale version can be assessed by comparing the following:

Ratio of the grayscale version's error rate to the online version's error rate
Trend comparison and status of error logs
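
The first comparison, the ratio of error rates, can be sketched as follows. This is a minimal Python sketch: defining the error rate as errors divided by page views and using an alarm threshold of 2.0 are both assumptions for illustration, not values given by the platform.

```python
import math


def error_rate(errors: int, page_views: int) -> float:
    """Errors per page view; 0.0 when there is no traffic."""
    return errors / page_views if page_views else 0.0


def grey_to_online_ratio(grey_errors: int, grey_pv: int,
                         online_errors: int, online_pv: int) -> float:
    """Ratio of the grayscale error rate to the online error rate."""
    grey_rate = error_rate(grey_errors, grey_pv)
    online_rate = error_rate(online_errors, online_pv)
    if online_rate == 0.0:
        # No online errors: any grayscale error is an unbounded increase.
        return math.inf if grey_rate > 0.0 else 1.0
    return grey_rate / online_rate


def should_alarm(ratio: float, threshold: float = 2.0) -> bool:
    """Hypothetical rule: alarm when the grayscale rate reaches threshold x online."""
    return ratio >= threshold


ratio = grey_to_online_ratio(grey_errors=40, grey_pv=1000,
                             online_errors=200, online_pv=10000)
```

Comparing rates rather than raw counts matters because grayscale traffic is a small share of the total: 40 grayscale errors look negligible next to 200 online errors until each is divided by its own page views.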

Results

At present, monitoring coverage of Taobao consumer-facing (C-end) pages is 98%, including source-code pages, build pages, and mini programs. Taking the main venue as an example, monitoring directly located more than 10 module development problems during the pre-sale, warm-up, and official stages.

Monitoring dashboard

With page monitoring and indicator coverage in place, a monitoring dashboard for Taobao businesses was built with DataV to observe anomalies on core pages globally.

Case: on the night of Double 11, Weex error logs increased. Checking the logs showed that a JS page was executing incorrect logic

Indicator coverage

Overall Crash logs increased: after a configuration push on the xx client, Crash logs rose sharply. The business received the alarm immediately and stopped the loss in time

Grayscale monitoring

Interactive business: after a new feature iteration was released, the proportion of grayscale errors increased, and a timely rollback kept the online problem from spreading

No. 4

By building capabilities such as monitoring coverage and grayscale monitoring, we are better able to avoid and detect problems, so that the business can go further and faster. There are still many problems in alarm subscription, alarm accuracy, and indicator analysis, and we will keep improving our monitoring capability. We welcome discussion and exchange.
