

This article is a lightly edited version of the lecture notes from the fifth session, the front-end monitoring system track, shared by Yaser of Ant Financial (for the complete version, including the feature demos, please see the video and slides):

Preface

Hello everyone, and welcome to today's session on monitoring. My name is Yaser, from the Experience Technology Department of Ant Financial. Our team is currently focused on building front-end infrastructure, and monitoring, as part of that infrastructure, is naturally one of our main areas of work.

Compared with the server side, front-end monitoring started late, and many companies are still in the exploratory stage. So today I would like to share some of our team's practices and experience in front-end monitoring, in the hope that it helps you improve your own monitoring products.

Let's take a look at the main content of this talk. First, I will introduce the front-end monitoring system we developed last year, Swift front-end monitoring, so you can get a sense of the features and product shape of the most widely used front-end monitoring system at Ant.

Then comes the main topic of this talk: how to quickly locate problems. In this part, I will walk you through some of the product design decisions and practices in our product that have worked well.

Finally, I will share some other practices and look at what else a front-end monitoring system can do.

So without further ado, let's introduce our monitoring platform.

Product introduction

Business background

First of all, let's talk about why we built such a monitoring platform. It mainly comes down to the following four points:

  1. Ant lacked a systematic front-end monitoring platform

Business teams that wanted front-end monitoring had to either build it from scratch or repurpose a server-side monitoring platform.

  2. Rapid business growth, high onboarding cost

Whether a team builds its own system or adopts an existing one, a fairly cumbersome manual onboarding process is unavoidable. In the face of rapidly growing front-end businesses, a product that can be onboarded at zero cost and provides monitoring by default was urgently needed.

  3. Rich custom monitoring requirements

The front end carries more business semantics than the server side, so there are a lot of custom monitoring requirements, and existing products struggle to support such flexible custom monitoring.

  4. Poor integration with other internal platforms

Ant has a lot of internal front-end infrastructure and data assets that could open up many possibilities for front-end monitoring, but existing products do not take advantage of these internal resources.

Swift front-end monitoring

Given the business context above, we started building a front-end monitoring platform for Ant last year: Swift front-end monitoring.

Features

Swift front-end monitoring mainly has the following characteristics:

  • Focus on the front end: the product is designed around the characteristics of the front end and helps developers monitor their applications.
  • Integrated with the R&D platform, zero-cost onboarding: front-end applications get monitoring capability by default, and many businesses discover they already have monitoring data without ever setting anything up.
  • Exception monitoring & custom monitoring: in addition to the default exception monitoring, we provide custom monitoring capabilities so that businesses can design monitoring around their own scenarios. Some teams have even built higher-level services for common scenarios on top of this.
  • Data asset output, enabling the business: the collected data can be output directly through our data service, and business teams can consume it to build their own reports or visualization products.
  • Integration of internal resources: including R&D data and user behavior data, which are used to help users locate problems.

Architecture diagram

Looking at our architecture diagram: the main part in the middle is the monitoring system itself; the development platform below allows us to inject monitoring scripts into applications, giving them default monitoring capabilities; and the right side supports business consumers through interfaces and data services.

  • At the bottom of the monitoring system is the collection script, which provides exception capture, performance collection, custom reporting, and environment information collection.

  • Above that are the data tasks, covering offline, real-time, and detailed data.

  • The DataService in the middle not only gives the platform the ability to query heterogeneous data sources, but can also output data to other systems.

  • The monitoring server mainly contains the following functions: meta-information management, mainly the configuration of monitoring items and alarm items; capability output, including the meta-information management interface, alarm capability output, and message push; and alarm message processing, which generates system alarms (such as the new-exception alarm) as well as intelligent alarms and rule alarms.

  • The top layer is the front end of the monitoring platform itself, including the functions highlighted in red.

Product screenshots

Here is a quick demonstration of the Swift monitoring platform in action, to give you an intuitive impression.

The first is the monitoring dashboard, which gives users a global overview. Each tile on the dashboard represents a monitoring item, and each monitoring item intuitively shows the day's cumulative count, the number of affected users, and a trend line.

Clicking on a monitoring item opens its details and shows its trend data. The lighter color in the trend chart is the previous day's data, so users can use the previous day as a baseline.

The detailed reported messages form a dimension under each monitoring item. For example, the various error messages for JS exceptions are displayed here, and for each error message you can see the specific count and the number of affected users. Exceptions can also be added to an ignore list or a watch list; ignored exceptions are excluded from the cumulative data, making it easy to filter out unimportant exceptions.

In addition, there are some convenience features, such as selecting a region of the trend chart to filter the data, or querying details directly from a message.

Clicking the “Analysis” button next to each reported message opens the exception details page, where you can see the message trend, stack mapping, multidimensional analysis, and other tools.

In addition to the monitoring data above, there are also alarm items to view.

As shown in the architecture diagram above, we have system alarms, intelligent alarms, and rule alarms:

  1. System alarms

System alarms are the alarm items provided by the system by default; the most commonly used one is the new-exception alarm, which fires when a previously unseen exception appears.

There are usually some exceptions that keep happening on the front end, and usually they are so unimportant that they keep throwing errors online without anyone fixing them. When a new exception appears, however, it is worth being vigilant: users receive the new-exception alarm and can confirm its cause and severity.

  2. Intelligent alarms

Intelligent alarms are driven by a set of algorithms that analyze trend data to automatically detect sudden spikes in exceptions.

  3. Rule alarms

Rule alarms are alarms configured by users themselves. A user selects an object, such as a monitoring item or a page, and then filters the selected object by certain conditions.

For example, filter only the exceptions whose Message equals a specific value. The conditions are very flexible: you can compare the latest N minutes with the previous N minutes, or with the same period yesterday, and configure growth-rate or decline-rate thresholds (a sketch of this comparison logic follows at the end of this section).

When you open an alarm item, you can see the specific alarm records, the error count at the time the alarm fired, and the exception trend leading up to it.
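To make the comparison logic above concrete, here is a minimal sketch (not the platform's actual implementation; the rule fields are illustrative) of evaluating such a rule: compare the exception count in the latest N minutes against the previous N minutes, or against the same window yesterday, and fire when a configured growth rate is exceeded.

```ts
// Minimal sketch of rule-alarm evaluation; field names are illustrative,
// not the platform's real configuration schema.
interface AlarmRule {
  windowMinutes: number;                 // N: size of the comparison window
  compareWith: 'previousWindow' | 'sameTimeYesterday';
  growthRateThreshold: number;           // e.g. 0.5 means +50% triggers the alarm
}

// counts: per-minute exception counts, keyed by "minutes since some epoch"
function shouldAlarm(counts: Map<number, number>, nowMinute: number, rule: AlarmRule): boolean {
  const sum = (start: number, end: number) => {
    let total = 0;
    for (let m = start; m < end; m++) total += counts.get(m) ?? 0;
    return total;
  };

  const current = sum(nowMinute - rule.windowMinutes, nowMinute);
  const baselineStart =
    rule.compareWith === 'previousWindow'
      ? nowMinute - 2 * rule.windowMinutes
      : nowMinute - 24 * 60 - rule.windowMinutes; // same window, 24 hours earlier
  const baseline = sum(baselineStart, baselineStart + rule.windowMinutes);

  if (baseline === 0) return current > 0;          // any errors after a silent baseline
  return (current - baseline) / baseline >= rule.growthRateThreshold;
}
```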

The monitoring process

The product introduction above should give you a preliminary understanding of our monitoring product. From the platform's functions we can see that a complete monitoring workflow roughly consists of the following steps: collection, detection, locating, and handling of problems.

Among these, locating the problem is often the hardest part, and it is also where different products differ the most. So what makes front-end problems hard to locate?

Locating problems

The difficulties

Difficulty one: lack of information

The first difficulty is lack of information. On the server side, if an application fails, the worst case is that you log on to the machine: the error messages and stacks are all printed out, and a mature server framework will also print some context information.

On the front end, however, exception logs live on the client, so you cannot view them directly; you can only rely on what is reported. The reports cannot capture everything, and it is hard to reconstruct the context in which an exception occurred. On top of that, front-end code is minified and obfuscated, so the reported stack is not directly readable.

Without good information, talking about locating problems is pointless.

Difficulty two: too many factors can cause an exception

  1. Environmental factors

Many factors can cause front-end exceptions, and the first is the environment. Front-end applications run in a far more complex environment than the server: device models, browsers, and operating systems can all trigger front-end exceptions.

  2. Business factors

The front end carries more business semantics and the code contains more business logic, so exceptions caused by the business itself are common, for example a request failure or JS exception triggered when a non-member user performs an action reserved for members.

  3. Interaction factors

In addition, the front end involves much more interaction, and it is easy for boundary conditions in complex interaction code to be missed in testing, resulting in problems in production.

With so many possible factors, troubleshooting often starts without a clue. Going through the logs one by one, it is hard to find the cause in time, unless you have the deductive skills of a detective.

Requirements for locating problems

The difficulty of troubleshooting has increased, but the time available for it has not. On the contrary, because the front end is directly perceived by users, problems have to be investigated even faster. So the requirements for locating problems come down to two words, "fast" and "accurate", that is, the efficiency and precision of troubleshooting.

Good practices

Let me share some of the product designs and practices for locating problems that have worked well for us.

Automatic stack mapping

Exceptions are often caused by problems in the code itself, and the best way to locate the problem is to look at the stack.

As mentioned earlier, the front-end stack itself is unreadable because front-end code is minified and obfuscated, so once you get the stack, you need to map it back to the source code via the SourceMap before you can investigate further.

The stack-mapping technique itself is quite mature: you simply combine the stack with the SourceMap to see where the problem lies in the source code. Tools for generating SourceMaps and performing the mapping are readily available, so what is there to talk about in this section?

As I said just now, troubleshooting must be not only accurate but also fast. Stack mapping itself provides the "accurate"; the "fast" comes from automating the whole process.

Manual mapping is inefficient

Here’s how inefficient manual mapping really is.

  1. The generated results are unstable

A locally generated SourceMap may not match the code running online, most often because dependency versions have changed.

  2. SourceMap generation is slow

Generating a SourceMap is itself slow; it is not unusual for an application to take more than five minutes, and if you only start generating it after discovering a problem, a lot of valuable time is lost.

  3. Retrieving and managing SourceMaps

Even with the SourceMap in hand, the mapping process is tedious to do by hand; imagine hunting down the specific SourceMap artifact for every frame on the stack, it will not go much faster.

  4. Results are hard to share

In collaborative projects, the person who finds a problem and the person who fixes it are often different people. If the mapping is done manually with local tools, the result cannot be shared directly; it can only be passed on through screenshots, pasted text, and other inefficient forms, and others may even have to repeat the manual mapping themselves.

  5. Cannot be combined with other auxiliary information

Stack information, while useful, is not a silver bullet. Even with the source code available, you still need data and other auxiliary information to locate a problem, especially one that only reproduces in a specific scenario. If you map manually and then go to the monitoring platform to look up that information, precious time has already passed.

Therefore, the automation of the whole process is the key to quick troubleshooting.

Automation practices

Is it just a matter of turning on SourceMap at build time and reporting the stack when an exception occurs? It's not that simple.

SourceMap generation

  1. Problems with simply turning on SourceMap

The easiest way to get a SourceMap is to turn it on at build time; mainstream build tools support this with a single configuration change (a minimal webpack sketch is shown at the end of this subsection). However, enabling SourceMap by default has two problems:

  • Long build times hurt build efficiency: an application with SourceMap enabled can take several times longer to build. That is tolerable when the business is small, but once the business is large it can seriously affect the build efficiency of the whole company.
  • Prone to build failures: SourceMap generation is very memory-hungry and often runs into OOM; if a business release is blocked because a SourceMap build failed, the cost outweighs the benefit.
  2. Generating SourceMap asynchronously

To solve these problems, we generate SourceMaps with asynchronous builds.

As shown below, on our build platform, when an application starts building, an asynchronous build task is started on another machine to generate the SourceMap, effectively running two builds. This way the SourceMap can be generated without blocking the normal build process.

Also, because the SourceMap is built on a different machine, SourceMap errors cannot break the normal build.

Splitting into two builds, however, can lead to inconsistent build artifacts, which makes mapping impossible even with a SourceMap. This is usually caused by inconsistent dependency versions and is typically solved by locking the dependency tree: the normal build produces the lock file, and the SourceMap build task then installs dependencies according to the locked versions.

Alternatively, simply reusing the same installed dependencies also works. Once generated, the SourceMap is saved to a CDN accessible only from the intranet, where the monitoring platform can fetch it easily.
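For reference, the build-time switch mentioned at the start of this subsection is usually a one-line option in mainstream bundlers. A minimal webpack sketch, assuming webpack as the build tool (which these notes do not specify): `hidden-source-map` emits the `.map` files without referencing them from the bundles, which suits a setup where the maps live on an intranet-only CDN.

```ts
// webpack.config.ts: minimal sketch, not Ant's actual build configuration.
import type { Configuration } from 'webpack';

const config: Configuration = {
  mode: 'production',
  // Emit .map files but omit the sourceMappingURL comment from the bundles,
  // so the maps can be uploaded to an intranet-only CDN instead of being public.
  devtool: 'hidden-source-map',
};

export default config;
```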

Stack reporting optimization

With the SourceMap in place, you still need the exception stack to finally locate the problem, and the simplest, crudest way is to report the stack as-is.

  1. Problems with reporting the full stack

Reporting the full stack does not hurt troubleshooting, but it can cause other problems.

First, the stack has the following characteristics:

  • Stack text is large: a stack typically has hundreds to thousands of characters, far more than any other data that needs to be reported.

  • Duplicate reporting is meaningless: when the same exception happens 1,000 times, you only need one stack; reporting it over and over adds nothing.

If the full stack were reported every time, the following problems would arise:

  • On the front end, heavy reporting hurts page performance and wastes traffic (affecting users).
  • On the data side, it wastes storage and computing resources (costing the company money). The monitoring platform currently receives about 20 million exception stacks per day, and a stack is roughly 2 KB after encoding, so reporting in full would mean on the order of 40 GB of stack storage per day.
  2. Optimization scheme

To solve the above problems, stack reporting can be optimized in the following two ways:

  • Stack compression: reduce the size of a single stack

  • Deduplication: prevent the same stack from being reported repeatedly

2.1 Stack compression

The principle of stack compression is to extract as much duplicated content as possible, replace it with short identifiers, and report the replaced content together with the identifier mapping.

For most front-end scenarios, the content with the highest repetition rate is the file URL. As you can see below, URLs are replaced with #-and-number identifiers, which effectively reduces the amount of text in the stack. (The URLs in the original stack in the image have already been shortened by Chrome for display; the actual URLs are much longer.)
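A minimal sketch of the idea (illustrative only; the collection script's real format is not shown in these notes): repeated file URLs in the stack text are replaced with `#0`, `#1`, ... identifiers, and the identifier-to-URL table is reported alongside the compressed stack.

```ts
// Replace repeated file URLs in a stack string with short identifiers.
// Illustrative sketch; the real reporting format may differ.
function compressStack(stack: string): { stack: string; urls: string[] } {
  const urls: string[] = [];
  const urlPattern = /https?:\/\/[^\s):]+/g;   // crude URL matcher, good enough for a sketch

  const compressed = stack.replace(urlPattern, (url) => {
    let index = urls.indexOf(url);
    if (index === -1) {
      index = urls.length;
      urls.push(url);
    }
    return `#${index}`;                        // e.g. "#0", "#1"
  });

  return { stack: compressed, urls };
}

// Every frame from the same bundle shares one table entry,
// so the bundle URL is transmitted only once.
```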

2.2 Preventing repeated reporting

The key question in preventing duplicate reporting is: how do you tell whether a stack is a duplicate? Our solution is to compute an ID for each stack; when the IDs are the same, the stacks are considered the same. The ID is generated as follows:

  • Stack normalization: first normalize the stack to erase the formatting differences between platforms

  • Fingerprint extraction: extract several lines of the stack according to a stable algorithm and concatenate them as the fingerprint. The purpose of this step is to keep ID generation fast.

  • ID generation: compute a fixed-length MD5 hash of the fingerprint from the previous step as the ID.
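A minimal sketch of these three steps (the normalization rules and fingerprint selection here are illustrative assumptions, not the platform's actual algorithm; Node's `crypto` provides the MD5, while a browser collection script would bundle a small MD5 implementation instead):

```ts
import { createHash } from 'crypto';

// 1. Normalization: strip "at " prefixes and line/column suffixes and trim whitespace,
//    so the same exception produces the same text on every platform. (Illustrative rules only.)
function normalizeStack(stack: string): string[] {
  return stack
    .split('\n')
    .map((line) => line.trim().replace(/^at\s+/, '').replace(/:\d+:\d+\)?$/, ''))
    .filter((line) => line.length > 0);
}

// 2. Fingerprint: take a few stable frames (here: the first three) and join them,
//    so hashing stays cheap even for very deep stacks.
function fingerprint(stack: string): string {
  return normalizeStack(stack).slice(0, 3).join('|');
}

// 3. ID: a fixed-length MD5 hash of the fingerprint.
export function stackId(stack: string): string {
  return createHash('md5').update(fingerprint(stack)).digest('hex');
}
```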

Once we have an ID, we still need to check whether the stack with that ID has already been reported. The simplest approach would be to query a server interface, but few server interfaces can handle requests at this volume. So we took a clever approach: stack-reporting probes.

When the server receives a stack, it publishes an empty file named after the stack ID to the CDN. When the collection script catches an exception and computes the ID, it requests the corresponding CDN resource. If the request succeeds, the stack has already been reported and is not reported again; if it returns 404, the stack is new and needs to be reported. CDNs are designed to handle massive traffic, so this solves the detection problem.
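A minimal sketch of the probe on the collection-script side (the CDN host and path layout here are hypothetical):

```ts
// Ask the CDN whether this stack ID has been seen before; 404 means it is new.
// The host and path layout are hypothetical.
async function shouldReportStack(stackId: string): Promise<boolean> {
  try {
    const res = await fetch(`https://monitor-cdn.internal.example.com/stacks/${stackId}`, {
      method: 'HEAD',                  // we only care about the status code
    });
    return res.status === 404;         // 404: never reported, so report it now
  } catch {
    return true;                       // on network failure, err on the side of reporting
  }
}

// Usage (report() is a hypothetical reporting function):
// if (await shouldReportStack(id)) report({ id, stack: compressedStack });
```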

The process of preventing repeated reports is as follows:

Mapping on the platform

With the SourceMap and the stack, all that's left is to map them on the platform, which the source-map package on npm can do. We also go one step further: because we are integrated with the development platform, we can fetch the source code and display the mapped location directly in the source.

The end result is that when you open the page, you can see several lines of code above and below where the exception occurred, giving you a fairly direct idea of what went wrong.
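A minimal sketch of this mapping step using the `source-map` package from npm (0.7.x API); showing the surrounding source lines assumes the SourceMap embeds its sources:

```ts
import { SourceMapConsumer } from 'source-map';

// rawSourceMap: the JSON of the .map file fetched from the intranet CDN
// line / column: the (1-based line, 0-based column) position from a stack frame
async function mapFrame(rawSourceMap: string, line: number, column: number) {
  const consumer = await new SourceMapConsumer(JSON.parse(rawSourceMap));
  try {
    const original = consumer.originalPositionFor({ line, column });
    // Pull the original file's content so a few lines of context can be shown.
    const content = original.source
      ? consumer.sourceContentFor(original.source, /* returnNullOnMissing */ true)
      : null;
    const context = content
      ? content.split('\n').slice(Math.max(0, (original.line ?? 1) - 3), (original.line ?? 1) + 2)
      : [];
    return { ...original, context };
  } finally {
    consumer.destroy();
  }
}
```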

The stack is not a panacea

While automated stack mapping is very efficient, there are problems it doesn’t solve:

  • Some exceptions, especially custom business exceptions, have no stack at all

  • Even with a stack, some problems cannot be reproduced

In this case, the power of data is needed.

Data analysis

Investigating problems through data is also a common approach.

As mentioned earlier, many factors can lead to front-end exceptions, including the environment, the business, and complex interactions, and data analysis is useful for narrowing down those factors.

Therefore, data-analysis features must be designed to actually narrow the scope of investigation; do not pile on useless data!

Multidimensional analysis

Many factors can cause an exception, and when one occurs it is often hard to determine the cause just by looking at the logs. If the scope can be narrowed down to one or two dimensions, the difficulty of troubleshooting drops greatly and the problem becomes much easier to reproduce.

This is the interface of the platform's "multidimensional analysis" feature. On the left are the major dimensions, including common front-end dimensions such as page, browser, and operating system, as well as some dimensions customized by individual businesses.

On the left, dimensions whose distribution is highly concentrated are labeled, so users can focus on them.

Clicking into each major dimension shows the concrete distribution, and some dimensions can be analyzed further in combination with one another. For example, you may find that an exception is highly concentrated on one phone brand, and then that this brand's data is concentrated on one version of Alipay; layer by layer, the scope of the problem can be narrowed.

Are there exceptions that are not concentrated in any dimension? Of course, but they are either very rare or harmless; a widespread and serious exception is usually easy to catch in the testing phase and rarely makes it online. This kind of multidimensional analysis is often more useful than expected.

Custom dimensions: to deal with exceptions caused by business factors, our monitoring system supports custom dimensions. Users can define dimensions according to their business needs and report the corresponding business values, which are then aggregated just like the general dimensions. This greatly improves the ability to locate exceptions caused by business factors.

Taking the earlier example, you can use membership as a dimension, so that problems like "a non-member user performed a members-only action" are much easier to find (a hypothetical reporting sketch follows). How to choose specific business dimensions depends on each business model, so I will not go into detail here.
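As an illustration of the membership example (the `monitor.report` API below is hypothetical, not the platform's real SDK):

```ts
// Hypothetical reporting call: "isMember" is a custom dimension that the platform
// would aggregate just like the built-in page / browser / OS dimensions.
declare const monitor: {
  report(item: string, payload: { dimensions?: Record<string, string>; message?: string }): void;
};

function reportBizError(error: Error, user: { isMember: boolean }) {
  monitor.report('biz_error', {
    message: error.message,
    dimensions: {
      isMember: String(user.isMember),   // "true" / "false"
    },
  });
}
```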

Pay attention to user actions

As mentioned earlier, another characteristic of the front end is heavy interaction. A page may have very complex interactions, and some problems are only triggered by specific interactions. For this kind of problem, the step you need most is reproducing it.

Many monitoring tools have experimented here, for example with session recording. This kind of "black magic" can indeed help with troubleshooting, but it involves many technical difficulties and the cost-effectiveness is not great.

The simplest way is to record the user's actions before the exception occurs. This gives you the user's operation steps, and by following them you can try to reproduce the problem, which makes locating it much easier.

If your organization already has event-tracking data, directly associating that tracking data with the exception data is the most economical approach.
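A minimal sketch of such action recording (a simple bounded "breadcrumb" buffer; the event types, target description, and report shape are illustrative):

```ts
// Keep the last N user actions so they can be attached to an exception report.
const MAX_ACTIONS = 20;
const recentActions: { type: string; target: string; time: number }[] = [];

function describeTarget(el: EventTarget | null): string {
  if (!(el instanceof Element)) return 'unknown';
  const id = el.id ? `#${el.id}` : '';
  return `${el.tagName.toLowerCase()}${id}`;
}

['click', 'input', 'submit'].forEach((type) => {
  window.addEventListener(
    type,
    (event) => {
      recentActions.push({ type, target: describeTarget(event.target), time: Date.now() });
      if (recentActions.length > MAX_ACTIONS) recentActions.shift();
    },
    { capture: true, passive: true },
  );
});

// When an exception is reported, attach the breadcrumbs (report() is hypothetical):
window.addEventListener('error', (event) => {
  // report({ message: event.message, actions: [...recentActions] });
});
```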

Pay attention to changes

When is a new exception most likely to appear? When something changes: a new code version, or a change in the environment it depends on. For example, if your page is embedded in an app, a new app version may cause new exceptions even if your own page has not changed.

Therefore, if you can relate the occurrence of an exception to a specific change point in time, you can narrow the investigation to that change.

A change is a point in time, so marking such points on a trend chart is the most intuitive way to relate exceptions to changes. When a significant rise in the exception trend coincides with one of your releases, that release is almost certainly the cause.

The screenshot here is interesting; it is actually a counterexample: after you fix an exception and release the fix, you can see the exception count drop.

The specific annotation information can be adjusted to your own business needs, such as the page version or the APP version, and you can attach other change information as well.

For example, which dependencies of your page changed and whether a dependency upgrade caused the exceptions; all of this helps with troubleshooting.

Other good practices

Automatic script injection

One of the selling points of Swift front-end monitoring is that it provides exception monitoring by default. To achieve this, the key is to automatically inject the monitoring script into business applications.

The mechanism itself is not that complicated, but it dramatically reduces the cost of onboarding a business onto monitoring. The whole process mainly consists of the development platform injecting the monitored project's ID into the build container, and the build tool then reading that parameter and injecting the script and parameters into the front-end build artifacts (a hypothetical sketch follows).
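A minimal sketch of that injection step (the environment-variable name, script URL, and attributes are assumptions for illustration): the build tool reads the project ID handed over by the development platform and prepends the collection script to the HTML artifact.

```ts
// Hypothetical build-time step: inject the monitoring script and project ID
// into the built HTML. MONITOR_PROJECT_ID, data-pid, and the script URL are illustrative.
function injectMonitorScript(html: string): string {
  const projectId = process.env.MONITOR_PROJECT_ID;
  if (!projectId) return html;                     // platform did not enable monitoring

  const tag = `<script src="https://monitor-cdn.example.com/collector.js" data-pid="${projectId}" crossorigin="anonymous"></script>`;
  return html.replace(/<head>/i, `<head>\n  ${tag}`);
}
```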

Achieving this requires a sound R&D infrastructure and a unified technology stack. Enterprises still in the business-growth stage can invest in these two directions in a planned way, to prepare for carrying a large number of businesses in the future.

Custom monitoring system

As mentioned several times already, the front end has very rich custom monitoring demands, so we provide a complete custom monitoring system; the default exception monitoring is itself built on this system.

As the figure above shows, our monitoring platform provides customization at the level of monitoring items, and of the indicators, dimensions, and fields under each monitoring item:

  • Custom monitoring items: this is the coarsest granularity. Before our platform shipped performance monitoring, some businesses used this feature to monitor performance themselves, which is why monitoring items such as "page load time" existed.

  • Custom indicators: these user-defined values are shown in the trend data. Taking page load time as an example, you could define the maximum, minimum, and average as three indicators to analyze performance from multiple angles.

  • Custom dimensions: the "member or not" dimension used in the multidimensional analysis earlier is an example of a custom dimension, so it will not be repeated here.

  • Custom fields: these fields are not processed in any way and are only displayed in the detail data. The business side can put purely auxiliary information in custom fields.

This may sound abstract, but it becomes clearer in combination with the actual configuration interface, and with the illustrative example below:
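As a purely illustrative example (again using a hypothetical `monitor.report` SDK call, not the platform's real schema), a "page load time" custom monitoring item with custom indicators, dimensions, and fields might be reported like this:

```ts
// Hypothetical custom report: the monitoring item, indicator, dimension, and field
// names are illustrative, not the platform's real schema.
declare const monitor: {
  report(item: string, payload: {
    indicators?: Record<string, number>;   // shown in trend data (max / min / avg, etc.)
    dimensions?: Record<string, string>;   // aggregated like page / browser / OS
    fields?: Record<string, string>;       // not processed, only shown in detail data
  }): void;
};

monitor.report('page_load_time', {
  indicators: { loadTime: performance.now() },
  dimensions: { isMember: 'true', pageVersion: '1.8.0' },
  fields: { traceId: 'abc-123' },          // purely auxiliary detail
});
```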



Other possibilities

This talk ends here, but the exploration of front-end monitoring has only just begun, and there are many questions worth exploring together:

  • Automatically determining problem severity: many front-end exceptions do not need to be handled, but many others are fatal; automatically judging how serious a problem is would help us make decisions as early as possible.
  • Smarter alarms with a lower false-positive rate: most monitoring products today manage not to miss alarms, but they are not good at avoiding false ones; genuinely threatening alarms can get lost in a flood of unnecessary ones, and the whole thing turns into a "boy who cried wolf" story.
  • Exception self-healing: when an exception occurs, can a solution be generated automatically based on the exception, reducing the cost of fixing the problem?

How to Quickly Locate Problems Based on Data and Stack Mapping.mp4 (121.63 MB)