Introduction: With the rapid, iterative development of services, monitoring and optimization are no longer limited to behavior and performance metrics. Front-end exception monitoring better reflects the real experience on the client side; fine-grained monitoring can surface problems proactively and in time to reduce losses, and targeted analysis and governance can even bring business gains. Drawing on the advertising hosting team's experience with exception monitoring and governance, this article summarizes our practice in exception collection, alarming, troubleshooting, and governance optimization.

The full text is 8455 words, and the expected reading time is 19 minutes.

1. Foreword

Behavior, performance, and exception instrumentation are old topics in the front-end world, and in practice many teams prioritize them the same way: behavior > performance > exception. This is not hard to understand: behavioral statistics deliver more visible short-term value to the team, and some instrumentation is itself a business requirement, such as PV and UV statistics after a feature launches.

Generally, for online services, back-end exception monitoring is a must, and most proactively discovered service problems come from the back end. So what role can front-end exception monitoring play? From a manager's point of view, is the extra investment cost-effective? How should anomalies be monitored so that problems are detected, and losses stopped, faster? Facing these questions, front-end exception monitoring for many services ends before it even starts.

Our team has accumulated some thinking and experience in practice, which we hope will be of some help to readers.

1.1 Service Background

We are Baidu's advertising hosting business, building landing pages and sites for customers in many industries. The carriers include mobile/desktop web pages, mini programs, HN (the React Native solution of Baidu App), and others, with a large amount of access traffic every day.

  • For netizens, we need to ensure smooth reading and interactive experience.

  • For advertisers, we need to provide high-quality service guarantees.

Through front-end exception monitoring and governance, the business team has gained many benefits, such as identifying problems in advance, stopping losses in time, and improving advertising effectiveness.

1.1.1 What problems need to be solved

As mentioned at the beginning of this article, for quite some time the team's focus was on improving back-end monitoring and alarms. However, once service stability reached a certain level, we found that some online problems were still hard to catch, for example:

  • The whole page or parts of it render abnormally, affecting the experience and even advertising conversion and cost. Possible causes include:

  • Static resource loading is abnormal, including script resources and image materials

  • API Access exception

  • JS execution exception

Compared with back-end exception monitoring, resource loading and JS execution exceptions are the incremental scenarios that front-end exception monitoring brings. End-to-end interface stability is also closer to what users actually perceive and better reflects the impact of the network on stability.

  • Detect problems in low-traffic scenarios

Product launches often go through small-traffic validation and A/B testing, and some problems are triggered only in specific scenarios. Because of the limited traffic, they are hard to detect through fluctuations in service metrics and are discovered only after a wider rollout causes greater negative impact, or after customer complaints.

Front-end exception monitoring can cover these scenarios very well. The following sections introduce our main work and experience in four stages: exception collection, monitoring and alarm improvement, exception troubleshooting, and exception governance.

Note: This article mainly discusses the problem from the perspective of a business application and does not go deep into the general log-collection service or the data processing and display platform. Fortunately, the team has such specialists and platforms; combined with our business scenario requirements, we designed business exception monitoring on top of the platform's general monitoring, which will be introduced later.


2. Exception collection

The first step is to send exceptions to the collection service as log points. This includes the generic scenarios mentioned above, such as errors caught through window listeners, as well as more insidious business exceptions that have a significant impact on the business.

2.1 General exception collection

General exception collection is a non-intrusive collection method that does not require business developers to do anything explicitly. When an exception occurs in the system, it can be collected through event bubbling, event capturing, or hook functions provided by some frameworks.

When collecting exceptions on a page, two scenarios are involved:

  • Resource loading exceptions caused by network requests, such as image loading failure, script link loading failure

  • Runtime exceptions, most of which come from code-compatibility issues or unconsidered edge cases

For resource loading exceptions, the service monitors them in the following two ways:

  1. Use the resource's own onerror event to report the error when loading fails. This generally requires a build-time plugin to add the onerror logic to the relevant resources during packaging, for example script-ext-html-webpack-plugin adding onerror attributes to all script tags.

  2. Use a capture-phase listener on window (a sketch distinguishing resource and runtime errors follows below):

    window.addEventListener('error', fn, true)
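In the capture-phase listener the two kinds of events can be told apart by their target: a resource-load failure fires an error event on the failing element, while a runtime error fires an ErrorEvent on window. A minimal sketch of that split, where report() is a placeholder for the actual reporting call:

    window.addEventListener('error', (event) => {
      const target = event.target || event.srcElement;
      // Resource-load failure: the event target is the failing <img>/<script>/<link> element
      if (target instanceof HTMLImageElement ||
          target instanceof HTMLScriptElement ||
          target instanceof HTMLLinkElement) {
        report({ type: 'resource', tag: target.tagName, url: target.src || target.href });
        return;
      }
      // Runtime error: an ErrorEvent carrying message / filename / lineno / colno / error
      report({ type: 'runtime', message: event.message, stack: event.error && event.error.stack });
    }, true); // capture phase, because resource error events do not bubble

    function report(payload) {
      // Placeholder: send to the collection service ('/log' is illustrative)
      navigator.sendBeacon('/log', JSON.stringify(payload));
    }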

Exceptions generated at runtime are usually monitored in the following ways:

Register one of the following at the top of the page:

    window.onerror = fn
    // or
    window.addEventListener('error', fn)

However, this approach has its limitations: exceptions from unhandled Promise rejections cannot be caught this way, so in practice an additional event listener is generally added to catch them.

window.addEventListener('unhandledrejection', fn)

Some front-end frameworks also provide configuration methods for run-time exceptions to simplify our daily development.

The React framework:

Since React 16, the framework has supported componentDidCatch for catching render exceptions. Note its limitations when using it; see Error Boundaries (reactjs.org/docs/error-…). A minimal sketch is given after the list below.

Error boundaries do not catch errors for:

  • Event handlers

  • Asynchronous code (e.g. setTimeout or requestAnimationFrame callbacks)

  • Server side rendering

  • Errors thrown in the error boundary itself (rather than its children)
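A minimal error-boundary sketch for reporting render exceptions; report() is a placeholder for the actual logging SDK call:

    import React from 'react';

    class ErrorBoundary extends React.Component {
      state = { hasError: false };

      static getDerivedStateFromError() {
        // Switch to the fallback UI on the next render
        return { hasError: true };
      }

      componentDidCatch(error, errorInfo) {
        // Report the render exception together with the component stack
        report({ error, componentStack: errorInfo.componentStack });
      }

      render() {
        return this.state.hasError ? this.props.fallback : this.props.children;
      }
    }

    // Usage: <ErrorBoundary fallback={<div>Something went wrong</div>}><App /></ErrorBoundary>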

Vue framework:

The Vue framework provides a similar global error configuration. The following method assigns a global handler for uncaught errors thrown during component render functions and watchers:

Vue.config.errorHandler = (err, vm, info) => {}

As of 2.2.0, this hook also captures errors in component lifecycle hooks. Similarly, when the hook is undefined, captured errors are logged with console.error instead of crashing the application.

As of 2.4.0, this hook will also catch errors within Vue custom event handlers.

As of 2.6.0, this hook also captures errors thrown inside v-on DOM listeners. In addition, if any covered hook or handler returns a Promise chain (e.g. an async function), errors from that Promise chain are also handled.

Note:

Among the captured exceptions, it is common to see ones whose error message is just "Script error". This type of exception occurs when a page requests and executes a cross-origin script: if that script throws, the global exception listener only receives "Script error". Due to browser security restrictions, no specific error message is exposed, which is very unfriendly for troubleshooting. Today most projects deploy their bundled resource files to a separate CDN service, so the resource domain differs from the page's domain and the problem is common.

A common solution is to use the packaging tool to add the crossorigin attribute (developer.mozilla.org/zh-CN/docs/…) to script tags, and have the CDN respond with the header Access-Control-Allow-Origin: yourorigin.com. In this way, when a script loaded from the CDN domain throws at runtime, the global error listener can obtain the complete error information.
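As a rough sketch of the build-side change, assuming a webpack build (note that output.crossOriginLoading only affects scripts injected by the webpack runtime, such as async chunks; script tags written into the HTML template usually get the attribute from the HTML plugin or the template itself):

    // webpack.config.js (sketch)
    module.exports = {
      output: {
        // Emit crossorigin="anonymous" on script tags injected by the runtime
        crossOriginLoading: 'anonymous'
      }
    };
    // The CDN must also respond with: Access-Control-Allow-Origin: <page origin>
    // Resulting tag: <script src="https://cdn.example.com/app.js" crossorigin="anonymous"></script>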


2.2 Collecting Service Exceptions

2.2.1 How to Define Service Exceptions

On top of general exception collection, the system adds custom business exception logging. This is "manual instrumentation" by developers (as opposed to the automatic instrumentation that upper-layer developers do not perceive): the developer explicitly reports data points in the program, usually together with some runtime data at that moment.

Why add this approach? The data collected by the general method is limited, mostly exception stacks, and some scenarios are still not well covered:

  • Problems that the bottom layer cannot capture directly

Although no red error flashes in the console, there are still issues that matter from a business perspective.

For example, in the APP download business, customers bind channel download packages on the page and then use the page for advertising. Sometimes a customer makes a mistake and binds an Android download package on an iOS page. Nothing goes wrong during page rendering, but it is obviously bad for advertising conversion, so it needs to be found and solved from the business side, as sketched below.
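For rules like this, the page itself can detect the mismatch and report a custom business exception. A rough sketch, where the OS check, the data shape, and the errorKey value are illustrative, and logSdk.addCustomErrorLog is the reporting API shown in section 2.3:

    // pageData.downloadPackage is an illustrative shape, e.g. { platform: 'android', url: '...' }
    const isIOS = /iPhone|iPad|iPod/i.test(navigator.userAgent);
    const pkg = pageData.downloadPackage;

    if (isIOS && pkg && pkg.platform === 'android') {
      logSdk.addCustomErrorLog({
        errorKey: 'download_package_os_mismatch', // illustrative business error type
        error: new Error('Android download package bound on an iOS landing page'),
        userExtra: { packageUrl: pkg.url }
      });
    }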

  • Some exception stacks have insufficient information

We need to capture some runtime data to assist in locating and analyzing problems later, for example the ID of the account currently being visited, or key business data in the current application state.

For example, when a large number of "onAndroidBack is not defined" exceptions appeared in one business, the product line where they occurred was quickly located based on the product line ID carried in the exception information; after communicating with the developer, the offending code was located and a compatibility fix was made.

  • Analysis is costly and alarms are not timely enough

If you have tried it, you will find that analyzing and aggregating exception data is not easy, especially finding the precise problem among very common exception stacks and then alarming, troubleshooting, and stopping losses in time. Business exceptions let us locate the root cause more directly when an exception occurs. This is not laziness at the analysis stage; it makes the data service's computation simpler and more direct, so big-data processing is faster and the manual analysis and fixing after an alarm is more efficient.

2.3 Exception Collection Protocol

In order to support common exceptions and customized exceptions, a unified data transmission and storage protocol must be designed.

Design of transport and storage protocols:

The transport protocol follows two principles: the top-level schema is stable, and business information is extensible. In addition, quick integration with downstream data processing modules is a key consideration, so the overall format of the transport protocol is defined as follows:

The first-level keys continue the data structure supported by the general data processing module; the fields below are the more important ones. The meta field holds common data fields of the advertising hosting business, and the extra field inside meta is a second extension point: through the API exposed by the logging SDK, developers can upload other business data together with the exception to assist later troubleshooting.

    {
      request: { ... },   // information related to the current page
      meta: {
        xxx: ...,         // common business fields of the hosting page
        extra: { ... }    // fields developers can extend
      }
    }

The above design supports the flexible expansion of business on the premise of stability and completeness.

Why can't the fields in extra simply be placed directly in meta?

This design is closely tied to the underlying table indexes and the complexity of downstream data processing. The business-related fields in meta are common fields of hosted pages; to make subsequent queries easy, they are stored as columns in the database and must be enumerated in advance. To define these first-level keys we sorted out the overall business, and the resulting fields have the following characteristics:

  • Id field used for attribution analysis in a business

  • Fields used to assist information filtering, represented by extra

Data stored as first-level keys of meta is expensive to change: upstream and downstream must understand the business meaning and upgrade together. To let the business side extend information flexibly, the extra field is added to meta and stored in the database as a string. The storage uses BaikalDB, developed by Baidu (github.com/baidu/Baika…), which supports real-time writing and reading of this kind of structured information well.

To report both common exceptions and custom exceptions, exception capture is handled in two categories.

Common exceptions are reported in the following ways:

    window.addEventListener('error', error => {
      logSdk.addWindowErrorLog(error)
    }, true)

    Vue.config.errorHandler = (err, vm, info) => {
      logSdk.addCustomErrorLog({
        errorKey: xxx, // exceptions collected by the framework carry a fixed errorKey by default
        error: err,
        userExtra: { message: info }
      });
    };

Custom service exceptions are reported in the following ways:

    try {
      // xxx business logic
    } catch (e) {
      logSdk.addCustomErrorLog({
        errorKey: 'xxx', // specific business type
        error: e,
        userExtra: {
          // custom business extension fields, corresponding to extra in the transport protocol
        }
      })
    }

When the page opens, the logging SDK is instantiated with the business meta information. When an exception occurs, if the business code catches it, it constructs the error parameter and calls the API to report it; exceptions not caught by the business are reported through the framework's unified exception handling. At the same time, the global error event handler is registered to catch resource loading exceptions and anything else left over.

With this nested model, the business side can express custom exceptions while exceptions it does not catch are still collected and reported. The wiring is sketched below.
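A rough sketch of the nesting, where the LogSdk constructor, its option names, and the business function are assumptions for illustration; the addCustomErrorLog / addWindowErrorLog calls match the APIs shown above:

    // 0. Instantiate the SDK with the business meta carried by every log
    const logSdk = new LogSdk({ meta: { productLineId: 'xxx', pageId: 'xxx' } });

    // 1. Business code reports the exceptions it catches itself
    try {
      renderDownloadButton(); // illustrative business logic
    } catch (e) {
      logSdk.addCustomErrorLog({ errorKey: 'render_download_button', error: e });
    }

    // 2. The framework's global handler reports what the business missed
    Vue.config.errorHandler = (err, vm, info) => {
      logSdk.addCustomErrorLog({ errorKey: 'vue_error', error: err, userExtra: { message: info } });
    };

    // 3. Window-level listeners catch resource-load failures and anything left over
    window.addEventListener('error', e => logSdk.addWindowErrorLog(e), true);
    window.addEventListener('unhandledrejection', e =>
      logSdk.addCustomErrorLog({ errorKey: 'unhandled_rejection', error: e.reason }));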

Exception logs need to be preprocessed before being written to the database. With the help of the company's streaming computing platform, each log record goes through real-time ETL; finally the meta data, together with some data obtained at the Nginx layer, is written to the database in real time.

After weighing the generality of the transport protocol against the efficiency of storage and query, we ended up with a single table for online exception logs. It has a great many columns, a large "wide table", which provides the data support for the subsequent aggregation and alarming.

3. Improve monitoring and alarm

Turning a large amount of data into accurate and efficient alarms requires the following process: create monitoring items based on the log metadata, then set alarm policies based on the monitoring-item statistics.


3.1 Define monitoring items

As mentioned above, the monitoring platform collects exceptions into a large wide table. Conditional analysis over multiple aggregated fields on that wide table can satisfy most monitoring needs.

Monitoring item: a filter on one column of data, for example "the URL contains a certain query" or "the business type belongs to a certain range". The platform supports a variety of filter conditions, including regular expressions.

Monitoring aggregation: The intersection of multiple monitoring items. For example, (Line of service === XXX) && (Request status code === 500).

3.2 Develop alarm strategies

An alarm policy has three key factors: aggregation period, trigger mechanism, and alarm receiving group.

  • Aggregation period

A monitoring item is a statistical rule, and the aggregation period is the time window for that rule. Set a reasonable aggregation period according to data volume, importance, and so on. For example, for exceptions most closely related to advertising conversion and most sensitive to fluctuation, we use a quasi-real-time 30-second window; conversely, for monitoring items that fluctuate a lot, the window can be enlarged appropriately to avoid frequent false alarms.

  • Trigger mechanism

Triggering can work in two ways: threshold and fluctuation. Thresholds suit exception counts that are relatively stable, for example business indicators that are basically flat from day to day. Fluctuation rules compare against yesterday, last week, or two weeks ago, and suit cases where the number of anomalies moves with regular traffic fluctuation within a day. (A small sketch of both trigger types follows this list.)

Anomalies that fluctuate widely with no obvious time pattern are suited to threshold alarms; anomalies whose volume follows a clear time pattern are suited to fluctuation alarms.

In practice we usually use both, and balancing alarm precision is not easy; we keep observing and adjusting the parameters. More challenges and solutions are discussed later.

  • Alarm receiving group

Alarms reach people through email, instant messaging, SMS, and other channels. The most important lesson: never rely on a single receiver for alarms!
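Returning to the trigger mechanisms above, a simplified check over one aggregation window might look like this; the window counts come from the aggregation service, and the rule fields and values are illustrative:

    // Evaluate one monitoring item for the latest aggregation window
    function shouldAlert(rule, current, baseline) {
      // current: exception count in the latest window
      // baseline: count in the same window yesterday / last week (fluctuation rules only)
      if (rule.type === 'threshold') {
        return current > rule.maxCount;                // e.g. maxCount: 50 per 30s window
      }
      if (rule.type === 'fluctuation') {
        if (!baseline) return current > rule.minCount; // guard against an empty baseline
        return current / baseline > rule.maxRatio;     // e.g. maxRatio: 1.5 = +50% vs yesterday
      }
      return false;
    }

    // shouldAlert({ type: 'fluctuation', maxRatio: 1.5, minCount: 20 }, 120, 60) === true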

3.3 Challenges

As mentioned above, exception monitoring and alarming is both important and challenging. Server-side services generally sit behind gateways in a stable operating environment, while front-end code runs in far less controllable environments, which makes improving monitoring much harder.

Challenge 1: How to set up comprehensive monitoring?

Once all exceptions are reported, every type of exception must be perceivable. Normally, classifying by exception type (resource loading exception, API exception, JS execution exception) and creating a monitoring item for each type is enough for completeness.

In practice, however, this setup does not fit hosted pages. As mentioned at the beginning of the article, hosted pages cover different ends whose traffic differs greatly. If the same exception monitoring item is reused across ends, errors from a low-traffic end are easily drowned out in the overall numbers. Therefore, from the hosting business's perspective, exceptions are split along two dimensions: the exception type and the end where the exception occurs.

Monitoring items built from the combination of the two dimensions satisfy completeness and also surface problems on individual ends in time.

Challenge 2: How to improve alarm accuracy?

Alarms for hosted pages are aggregated in real time under various conditions and compared with preset thresholds to decide whether to trigger. In theory, accuracy is up to the business side: as long as the aggregation conditions are precise enough, the alarm is precise enough. But this is a matter of cost and trial and error; you often do not know which condition would have excluded an invalid alarm until it fires. Through continuous practice, we have accumulated some general aggregation conditions that improve alarm accuracy:

  • Exclude crawler traffic (via UA)

  • Only look at commercial traffic (judged by commercial placement metrics)

  • Progressively improved exception blacklist (known exceptions that cannot be resolved, such as “Script error” caused by external injection)

For example: a line of business starts by setting a JS exception alarm with the following aggregation conditions:

    line of business = XXX && error type = JS exception

The optimized aggregation conditions are as follows:

    line of business = XXX
      && error type = JS exception
      && business traffic flag is set
      && error_message not like 'Script error'

To avoid setting the same aggregation conditions repeatedly for every alarm item, some general filters are applied at the top level, which improves accuracy and reduces the configuration work for each business side.

Challenge 3: How to set year-over-year / period-over-period comparison rules for obviously periodic exceptions?

For obviously periodic exceptions, the initial settings should be conservative: set the threshold too small and many invalid alarms fire; set it too large and real problems are not alarmed in time.

In practice we found:

  1. Comparison rules should not be configured right away; observe for a period first. For example, for a rule comparing against yesterday's data, accumulate at least two days of data and then set a reasonable percentage threshold based on the actual day-to-day fluctuation.

  2. Comparison rules should not stay fixed and should be revisited from time to time. As the business develops, the online exception profile keeps changing; if a rule fires frequently over a period and most alarms turn out to be invalid on investigation, reconsider whether the settings are still reasonable.

Challenge 4: How to ensure problems are followed up after an alarm?

Exception governance of hosted pages is more than building basic capabilities; we also want a closed engineering loop of problem discovery, follow-up, and resolution. Follow-up goes through the company's internal task management platform: a task card is created for every alarm and assigned to a specific owner. When the problem is solved, the details are recorded on the card, so every exception has a dedicated person following it up. To improve the follow-up rate, we also run routine statistics on the task cards, analyzing how long cards stay open and how many there are, and push the statistics through the internal instant messaging tool.


4. Exception troubleshooting

After receiving an exception alarm, quickly locating its cause is a very common scenario.

From practice, we summarized several ways to improve troubleshooting efficiency:

  • Make good use of aggregation

Most of the time, online exception volume fluctuates within a certain range. When a sudden spike appears, aggregation can quickly narrow down the problem.

Aggregating on different fields quickly reveals what the sudden anomalies have in common. Common aggregation dimensions include IP address, UA, device ID, and URL.

For example, in one online alarm for resource-load failures, the page URLs, failing resource URLs, and request parameters in the exception logs were all different. We ruled out a traffic surge on an individual advertising page and ruled out machine scripts crawling the page. Finally, after aggregating by IP, all the anomalies turned out to come from one region; CDN colleagues confirmed a network fault there, and timely escalation avoided a wider loss.

  • Improve basic capability support

JS served online is minified, so when an exception occurs the stack stored with it is the compressed one, which is inconvenient for troubleshooting. We therefore work with the downstream error-analysis platform to upload the sourcemap resources associated with the hosted page. The error positions in a JS execution exception can then be mapped back to the original source files through the sourcemap, so developers can quickly locate the offending code and troubleshoot faster; a rough sketch of the mapping follows.
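The mapping itself can be done with the open-source source-map library. A rough sketch of what the error-analysis platform does for a single stack frame, assuming the sourcemap file is already available locally (paths are illustrative):

    const fs = require('fs');
    const { SourceMapConsumer } = require('source-map');

    // Map one (line, column) position in the minified bundle back to the original source
    async function resolveFrame(mapPath, line, column) {
      const rawMap = JSON.parse(fs.readFileSync(mapPath, 'utf8'));
      return SourceMapConsumer.with(rawMap, null, consumer =>
        consumer.originalPositionFor({ line, column })  // -> { source, line, column, name }
      );
    }

    // resolveFrame('./dist/app.js.map', 1, 10234).then(console.log);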

5. Exception governance

The exception collection, alarming, and troubleshooting described above are largely reactive: we get involved only after an online alarm fires. Beyond following up the occasional spikes, we also proactively explored general solutions for the exceptions already existing online, to optimize those scenarios and improve the overall stability of hosted pages.

First, to unify the governance goals and coordinate all parties around online exceptions, we set the governance targets through the following steps:

1. Identify the types of exception errors

Starting from JS execution exceptions, API exceptions, and resource loading exceptions, we subdivided further into four final categories:

  • JS execution exception

  • API exception

  • Image resource loading exception

  • Script resource loading exception

2. Clean data

  • Because hosted landing pages run in many different scenarios, data screening only considers errors from commercial traffic, which excludes the influence of test data and web-crawler traffic.

  • Maintain a blacklist of known online errors caused by on-device injection that do not affect front-end stability, and filter out their interference by their specific error messages.

3. Establish appropriate data standards

To smooth out the traffic differences between product lines, we introduced the notion of exceptions per ad click: the absolute exception count is turned into a value relative to advertising traffic, so product lines with different traffic can be compared on the same scale. After this normalization, the influence of advertising traffic on exception volume is removed.

For indicators related to network conditions, namely image/script load failures and API request failures per ad click:

The process for establishing data standards is:

  1. Choose a baseline time range

  2. For every day within the baseline range, calculate each product line's image/script load failures and API request failures per ad click, and take the 80th percentile

  3. Take the minimum 80th-percentile value within the baseline range as the optimization target (adjustable according to the business)

Core idea: these exceptions are caused by network conditions, so the values should be consistent across product lines. The 80th percentile is therefore taken as the reference line; product lines that fall short of it must have problems beyond network factors and can be pushed to align with this unified standard. (To avoid extremes on a single day, the final target can also be an average, or a minimum after removing spikes.) A small sketch of the computation follows.
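In code, the normalization and the 80th-percentile baseline are straightforward. A sketch over per-day, per-product-line aggregates, where the input shape is an assumption:

    // rows: [{ productLine, date, loadFailures, adClicks }, ...]  (shape assumed)
    function failuresPerClick(rows) {
      return rows.map(r => ({
        productLine: r.productLine,
        date: r.date,
        perClick: r.loadFailures / r.adClicks   // relative value: traffic is factored out
      }));
    }

    // Nearest-rank 80th percentile of one day's per-click values
    function percentile80(values) {
      const sorted = [...values].sort((a, b) => a - b);
      return sorted[Math.max(Math.ceil(sorted.length * 0.8) - 1, 0)];
    }

    // Optimization target: the minimum of the daily 80th percentiles over the baseline period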

For the indicator closely tied to the runtime, JS execution exceptions per ad click:

The process for establishing data standards is:

  1. Choose a baseline time range

  2. Aggregate errorMessages by errorKey and sort from largest to smallest. Identify in the result the exceptions caused by the hosted page's own JS execution; these are expected to be optimized to zero. After excluding them, divide the remaining count by the landing-page traffic to get the day's JS execution exceptions per ad click.

  3. Take the minimum value within the baseline range as the optimization target (adjustable according to the business)

After establishing optimization goals, targeted optimization can be carried out.

For resource loading exceptions caused by network conditions, the core ideas are as follows.

  1. Changing to CDN link or reducing resource size can reduce the first load failure rate

  2. Images are linked using CDN

  3. Compress images reasonably or use image formats with higher compression ratios (e.g. Webp, etc.)

  4. Retry reduces the final resource load failure rate

At the bottom layer we built retry mechanisms for API request exceptions, script loading exceptions, and image loading exceptions. Except for script-loading retries, which are applied uniformly, the business side can opt in to API and image retries by passing related parameters, to suit different business scenarios. A minimal image-retry sketch follows.
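A minimal image-retry sketch; the retry count, the cache-busting query parameter, and the final reporting call are illustrative, with logSdk being the logging SDK instance from section 2.3:

    const MAX_RETRIES = 1;

    function loadImageWithRetry(img) {
      let retries = 0;
      img.addEventListener('error', () => {
        if (retries < MAX_RETRIES) {
          retries += 1;
          // Re-request the image (a backup CDN domain could be substituted here)
          const sep = img.src.includes('?') ? '&' : '?';
          img.src = `${img.src}${sep}retry=${retries}`;
        } else {
          // Only the final failure is reported as an exception
          logSdk.addCustomErrorLog({
            errorKey: 'image_load_failed', // illustrative business error type
            error: new Error(`image failed after retry: ${img.src}`)
          });
        }
      });
    }

    document.querySelectorAll('img').forEach(loadImageWithRetry);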

For JS execution exceptions, we established a complete handling flow:

  1. Discover business-optimizable exceptions from common exception monitoring;

  2. Specify concrete monitoring conditions and set up independent monitoring for that business exception;

  3. Launch an optimization to deal with the exception;

  4. Observe whether the monitoring data drops as expected.

Optimize each type of specific business exception through the above four steps.

With appropriate targets set and targeted optimization carried out, every exception indicator declined to varying degrees after governance. At the same time, an online experiment was run during governance to measure the impact of reducing online exceptions on advertising conversion; the results showed improvements in both app-download and lead-form conversions.

6. Afterword

Exception governance is a difficult but correct path. We ran into many problems and challenges while landing it in the business, completed the journey from 0 to 1, and explored a sustainable way to monitor and govern front-end exceptions. Much work remains, though, to keep reducing the number of front-end exceptions on hosted pages and to keep improving their online stability.

———- END ———-
