Preface
This article is a transcript of “Dingding Front-end: How to Design a Front-end Real-time Analysis and Alarm System”, a talk given by SlashHuang, head of monitoring on the Dingding front-end team, as a guest of the fifth “Front-end Monitoring” special session on April 25, 2020.
Main text
Hello, everyone. I am “candle elephant” from Dingding, and my topic today is “How the Dingding front-end team built a monitoring system for traffic in the hundreds of millions”.
Personal introduction
First, let me introduce myself. I graduated in 2013 and joined Dingding in 2017 at level P6. I later earned several promotions through my work on front-end monitoring, modular code packages, and efficiency tools.
About the Dingding front end
Dingding has grown rapidly since it was founded at the end of 2014, and its front-end monitoring has evolved along with it. We have hundreds of millions of users and tens of millions of enterprise users. Our front-end products include Android, iOS, desktop, mini programs, H5, and more. Front-end releases cover both full releases and gray releases.
The challenge of traffic in the hundreds of millions
For a platform at the hundred-million level, a front-end monitoring system alone is not enough, as I believe many of you have experienced firsthand. Keeping the Dingding front end stable as a whole also takes technical operations practices and attention to staffing. We now have more than 100 front-end developers, and our technical modules cover IM, the address book, live streaming, education, documents, hardware, and other strongly B-side businesses.
Results
Let me start with our results. Today front-end monitoring covers 100% of our H5 pages and mini programs and supports the monitoring needs of more than 100 front-end developers. The log volume of front-end monitoring has reached 10 billion, and there are more than 100 monitoring dashboards. We can detect online problems within one minute and roughly locate them within one minute. In terms of staffing, we have always kept the work to no more than two owners, and most of the time I was the main person responsible for monitoring overall, so the human cost has been relatively low. The two charts in the figure above show the main structure of our monitoring product: one is the monitoring trend chart, and the other is the folder of business dashboards, one per business. We also have a unified production dashboard for mini program and H5 monitoring.
The road of evolution
Next, I will talk about how Dingding's front-end monitoring system evolved to reach these results.
Since many of you do not work on front-end monitoring, I will first cover some basics and then expand on how to design a front-end monitoring system.
Look at the code above: it uses const to create an object foo and then executes foo.a.b = c. This is a classic NPE, or null pointer exception, which is very common in front-end code. The error reported is: Uncaught TypeError: Cannot set property 'b' of undefined.
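The slide itself is not reproduced in this transcript. A minimal sketch of the kind of code being described, with the names foo and c chosen purely for illustration, might look like this:

```javascript
// Minimal reproduction of the null-pointer style error described above.
// `foo` and `c` are illustrative names, not taken from the original slide.
function reproduceNpe() {
  const foo = {};
  const c = 1;
  try {
    // foo.a is undefined, so assigning to foo.a.b throws a TypeError.
    foo.a.b = c;
  } catch (err) {
    return err;
  }
  return null;
}

const npe = reproduceNpe();
console.log(npe.name); // TypeError
console.log(npe.message); // exact wording varies by JavaScript engine
```

Note that newer engines phrase the message differently from the one quoted in the talk, which matters later when error messages are matched by pattern.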
When such an error occurs on the user's side, how does our front-end monitoring system catch it and surface it in less than a minute? Let's look at a traditional way of doing this:
- First, write a front-end monitoring SDK for data collection
- Choose a reporting scheme to send front-end logs to the server
What I am demonstrating here uses an image tag: create an image tag and set its src to point to the log server, which sends the corresponding log. We use window.onerror to catch global errors. The captured errors are then sent, by creating an image tag, to the front-end monitoring server shown on the right side of the figure.
The code above is only a pseudocode demo, so I kept it simple.
A traditional monitoring system based on log analysis first needs to know where each log comes from. So every log we report from the front end carries an application ID, which I will call spmId. The spmId identifies the source of the log, and the log is then stored on the corresponding monitoring server. This completes a very simple link from front end to back end.
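The collection and reporting steps described above can be sketched roughly as follows. This is not the code from the talk: the endpoint URL, the application ID value, and the query parameter names other than spmId are all hypothetical.

```javascript
// Hypothetical log endpoint and application id, for illustration only.
const LOG_ENDPOINT = "https://log.example.com/collect";
const SPM_ID = "demo-app";

// Serialize one error into a GET URL that an image request can carry.
// Parameter names other than spmId are assumptions.
function buildLogUrl(endpoint, spmId, info) {
  const params = new URLSearchParams({
    spmId,
    type: "js_error",
    msg: String(info.message),
    file: String(info.source || ""),
    line: String(info.lineno || 0),
    col: String(info.colno || 0),
    t: String(Date.now()),
  });
  return `${endpoint}?${params.toString()}`;
}

// Send the log by pointing an image's src at the log server.
function report(info) {
  const img = new Image();
  img.src = buildLogUrl(LOG_ENDPOINT, SPM_ID, info);
}

// Install the global handler only when running in a browser.
if (typeof window !== "undefined") {
  window.onerror = function (message, source, lineno, colno) {
    report({ message, source, lineno, colno });
    return false; // keep the browser's default error logging
  };
}
```

An image request is a common choice here because it is a simple cross-origin GET that needs no response handling. The trade-off is that the payload must fit in a URL.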
This closed loop of log generation, collection, and storage is very simple. Starting from such an implementation, if you enrich the log types and make collection and storage more robust, you can basically build a fairly simple front-end monitoring system.
Generally speaking, a simple front-end monitoring and analysis system needs to contain the following three dimensions:
- The first is JS errors, which relate to stability
- The second is performance metrics
- The third is the API success rate
On the monitoring platform, we store the logs and expose them to the visualization platform's server through API services, which is enough to draw the charts above. For example, chart number 1 is the API success rate. In terms of technology selection, I think most front-end engineers with a little Node or server-side background could build a simple demo like this. But is such a seemingly complete system really free of problems for front-end monitoring? Can it meet the monitoring needs of a platform with traffic in the hundreds of millions, like Dingding?
The left side of the figure shows how our developers integrate with front-end monitoring across the development, testing, and release stages. When we rolled out front-end monitoring, we required every developer to actively watch the monitoring dashboard for at least 30 minutes after an application iteration went live and to observe three indicators:
- JS errors
- Performance
- The API success rate
With a team of more than 100 front-end developers, the human cost is 100 people times 30 minutes. Dingding is an enterprise product, so our requirements for online stability are very high and our tolerance for online faults is extremely low. That also means online applications have to be inspected every day, so the human cost is very high.
Now consider the developer experience. When developers check monitoring, the first thing they do is open the visual analysis platform to see whether there are error logs. There is a very important question here: are the logs we see on the monitoring and analysis platform really “front-end page” logs?
Not necessarily. Why is that? Because what the user opens is not just the front-end page: behind the page are the container's WebView, the application container, the carrier network, and so on.
For example, one of our pages can be opened inside WeChat's container, inside Toutiao's container, or inside Dingding's container. So the logs you collect come not only from the front-end page but also from the container's WebView. There are also many carriers: we often see an advertisement inserted into a front-end page, for example. And some phone manufacturers, such as Vivo and Huawei, inject their own scripts into our pages as well. Therefore, the logs collected by the monitoring and analysis platform are not just front-end logs; they are logs from the user's terminal in which the front-end page runs.
Generally we encounter three types of interference logs:
- The first is scripts injected by third parties
- The second is scripts injected by containers
- The third is scripts injected by phone manufacturers
For example, the figure above shows one of our online applications. Its JS error rate is about 0.08%. At Dingding's volume, that error rate affects a very large number of users.
So what are the errors behind it? WeixinJSBridge is not defined, toutiaoJSBridge is not defined, vivoNewsDetailPage, and so on. Judging from the error messages, these have basically nothing to do with business errors.
So we can draw the first conclusion: some of the errors reported by front-end monitoring actually have nothing to do with the business, which may run counter to what many people assume.
Let's look at another issue. The chart on the left shows our desktop release curve. Dingding is one of the few platforms in China, and even in the world, that is very desktop-heavy. The desktop client basically ships an iteration every week or two. Because its front-end code is delivered as offline packages, the code is hard to update and fix after release, which demands very high front-end stability.
Our desktop client has had more than 100 online release versions so far. With so many versions reporting logs under the same application ID, how do we monitor them by tier? And when online traffic is unevenly distributed, how do we keep the monitoring of a small-traffic release from being drowned out?
We run into these problems often in Dingding's business scenarios. The granularity of monitoring has to match how the front end is developed, and the monitored logs have to support more dimensions, for example monitoring by application and by release.
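To make the "drowned out" problem concrete, here is a small sketch that computes the JS error rate per application and release instead of per application only. The log shape (spmId, release, type) and the version numbers are assumptions made for this example.

```javascript
// Compute the JS error rate per (spmId, release) pair.
// Assumed log shape: { spmId, release, type }, where type is "pv" or "js_error".
function errorRateByRelease(logs) {
  const buckets = new Map();
  for (const log of logs) {
    const key = `${log.spmId}@${log.release}`;
    const bucket = buckets.get(key) || { pv: 0, errors: 0 };
    if (log.type === "pv") bucket.pv += 1;
    if (log.type === "js_error") bucket.errors += 1;
    buckets.set(key, bucket);
  }
  const rates = {};
  for (const [key, { pv, errors }] of buckets) {
    // Without any PV logs the rate is undefined, so report null.
    rates[key] = pv === 0 ? null : errors / pv;
  }
  return rates;
}

// Hypothetical traffic: a large stable release and a small gray release.
const makeLogs = (n, release, type) =>
  Array.from({ length: n }, () => ({ spmId: "desktop", release, type }));

const sampleLogs = [
  ...makeLogs(1000, "5.0.0", "pv"),
  ...makeLogs(1, "5.0.0", "js_error"),
  ...makeLogs(10, "5.1.0", "pv"),
  ...makeLogs(5, "5.1.0", "js_error"),
];
console.log(errorRateByRelease(sampleLogs));
// { 'desktop@5.0.0': 0.001, 'desktop@5.1.0': 0.5 }
```

Aggregated under a single application ID, these logs give 6 errors over 1010 page views, or about 0.6%, which hides the fact that half of the page views on the small release fail.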
Let's look at one more case. Dingding has hundreds of front-end applications, and if every application raises its own alarms, the total gets out of hand: the alarm group basically receives more than 500 messages a day. Many of these are alarms that fire but do not require any code change. Long-tail errors are one reason: even after I fix a problem, not all users have switched to the latest version.
Therefore, conclusion 3 is that the human cost of operating our monitoring is very high. Front-end monitoring must do more than raise alarms: the alarms must be intuitive and real-time, and there must be ways to silence an alarm for a short time and to filter errors.
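The talk does not describe how short-term silencing is implemented. One possible shape, shown only as an assumption-laden sketch, is a per-application mute with an expiry time; the class and method names are invented for this example.

```javascript
// Temporary alarm silencing keyed by application id (an assumed design).
class AlarmSilencer {
  constructor() {
    this.mutedUntil = new Map(); // spmId -> expiry timestamp in ms
  }

  // Mute alarms for one application for `minutes`, starting at `now`.
  mute(spmId, minutes, now = Date.now()) {
    this.mutedUntil.set(spmId, now + minutes * 60 * 1000);
  }

  // An alarm is delivered only when no active mute covers it.
  shouldNotify(spmId, now = Date.now()) {
    const until = this.mutedUntil.get(spmId);
    if (until !== undefined && now < until) return false;
    this.mutedUntil.delete(spmId); // drop expired or absent entries
    return true;
  }
}

const silencer = new AlarmSilencer();
silencer.mute("A", 30); // silence application A for 30 minutes
console.log(silencer.shouldNotify("A")); // false
console.log(silencer.shouldNotify("B")); // true
```

Because the mute expires on its own, a known long-tail error can be quieted without the risk of the alarm staying off indefinitely.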
After these three cases, let's look at how to design a monitoring system that can serve a volume of 300 million.
First, we define the design goals. For an enterprise product like Dingding, front-end monitoring needs to achieve: detection in 1 minute, localization in 5 minutes, and recovery in 10 minutes. Let's call this monitoring system 2.0.
For front-end monitoring 2.0, we defined the following capability levels based on 1.0.
The first is to stay close to the actual business and reduce the human cost of operations, so that business teams can integrate at low cost. The alarm system must also be fast and accurate and must support custom alarms. Internally we set a baseline: the accuracy of front-end monitoring must exceed 90%, the labor cost must be reduced by 20 minutes per person, and both alarms and dashboards must support custom configuration.
The diagram above shows the overall monitoring component layout. On the left is a legend. The blue part represents the 1.0 monitoring component and the dark green part represents the 2.0 new monitoring component.
- Custom collection: in addition to routine business data and monitoring data, the log collection side supports custom collection.
- Intelligent analysis: the analysis part adds the ability to define custom analyses.
- Real-time alarm: the real-time alarm part adds the requirements of alarming within 1 minute and locating within 5 minutes.
The most critical technology implementation
Again, the blue part is the original 1.0 system and the dark green part is our new system. Notice that between log collection and log consumption we have added a module called log double write.
One log is consumed by two systems, one for real-time alerts and one for analysis:
- In the first path, after the server receives the log, it is stored and analyzed to serve monitoring reports;
- In the second path, a minute-level log computing system is introduced to do real-time alarming.
Many of you may feel that log double write is a big waste of system resources, since one log is consumed by two systems. In fact, Dingding front-end monitoring builds on Alibaba's very mature log consumption systems and infrastructure. The log is distributed along two paths so that it can be consumed quickly, and the minute-level computing system is moved to the front of the monitoring pipeline to meet the 1-minute alarm requirement. This is the core of our technical approach here.
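The real system relies on Alibaba's log infrastructure, which is not shown in the talk. As an in-process stand-in, the double-write idea can be sketched as a fan-out in which every log is handed to two independent consumers; the function and sink names below are hypothetical.

```javascript
// Fan each incoming log out to every registered consumer.
// A failing consumer must not block the others, so errors are collected.
function createDoubleWriter(consumers) {
  return function write(log) {
    const failures = [];
    for (const consumer of consumers) {
      try {
        consumer.consume(log);
      } catch (err) {
        failures.push({ name: consumer.name, error: err });
      }
    }
    return failures;
  };
}

// In-memory stand-in for a downstream system.
function makeSink(name) {
  const received = [];
  return { name, received, consume(log) { received.push(log); } };
}

const analysisSink = makeSink("storage-and-analysis");
const alarmSink = makeSink("minute-level-alarm");
const writeLog = createDoubleWriter([analysisSink, alarmSink]);

writeLog({ spmId: "A", type: "js_error", timestamp: Date.now() });
console.log(analysisSink.received.length, alarmSink.received.length); // 1 1
```

Keeping the two paths independent is what lets the alarm path stay fast: it does not wait for storage or report generation to finish.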
Below the purple dotted line in the figure above is the user's view. On the user side there are two parts: the first is the front-end monitoring SDK, which includes the SDKs for H5 and mini programs; the second is the platform, which includes the analysis platform and the alarm platform.
A real case
Let’s look at a real case. The user encountered two JS errors. Both of these JS errors are classic front-end NPE errors.
The first occurred on an iPad in Baidu browser, and the second on Android in the Toutiao WebView. As a result, our client reports two kinds of errors:
- The business error itself: TypeError: Cannot set property 'b' of undefined.
- Errors injected by the host: for example, Baidu browser produces MyAppHrefLink is not defined.
Many of you may not have noticed this. We did a thorough investigation: Baidu browser injects code that reports MyAppHrefLink is not defined, and Toutiao also injects its own Toutiao jsBridge.
After the logs arrive at the server, we first clean them, filtering out all the interference logs from the host, so that the alarm system consumes only errors that come from the real business. This is the first module in the yellow area: log cleaning.
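A cleaning step of this kind can be sketched as a pattern filter over error messages. The identifiers below are the ones mentioned in this talk; the log shape and the decision to match on the message text are assumptions, and a production list would be longer and maintained over time.

```javascript
// Patterns for errors injected by hosts rather than by business code.
// The identifiers come from the examples mentioned in this talk.
const HOST_NOISE_PATTERNS = [
  /WeixinJSBridge/,
  /toutiaoJSBridge/i,
  /vivoNewsDetailPage/,
  /MyAppHrefLink/,
];

// True when the log's message matches a known host-injected error.
function isHostNoise(log) {
  return HOST_NOISE_PATTERNS.some((pattern) => pattern.test(log.message || ""));
}

// Keep only logs that are not recognized as host interference.
function cleanLogs(logs) {
  return logs.filter((log) => !isHostNoise(log));
}

const cleaned = cleanLogs([
  { spmId: "A", message: "Uncaught TypeError: Cannot set property 'b' of undefined" },
  { spmId: "A", message: "Uncaught ReferenceError: MyAppHrefLink is not defined" },
]);
console.log(cleaned.length); // 1
```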
Next, we group the filtered logs by application ID, for example application A (spmId=A) and application B, and the grouped logs go on to real-time computation.
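As a rough illustration of this grouping step, the sketch below buckets cleaned logs by application ID and one-minute window and counts PV and JS error logs in each bucket. The field names timestamp and type, and the in-memory Map, are assumptions; the real system performs this on distributed infrastructure.

```javascript
// Group cleaned logs by application id and one-minute window,
// counting PV and JS error logs in each group.
// Assumed log shape: { spmId, type, timestamp } with timestamp in ms.
function groupByAppAndMinute(logs) {
  const groups = new Map();
  for (const log of logs) {
    const minute = Math.floor(log.timestamp / 60000) * 60000;
    const key = `${log.spmId}|${minute}`;
    if (!groups.has(key)) {
      groups.set(key, { spmId: log.spmId, minute, pv: 0, jsError: 0 });
    }
    const group = groups.get(key);
    if (log.type === "pv") group.pv += 1;
    else if (log.type === "js_error") group.jsError += 1;
  }
  return [...groups.values()];
}

console.log(
  groupByAppAndMinute([
    { spmId: "A", type: "pv", timestamp: 1000 },
    { spmId: "A", type: "js_error", timestamp: 2000 },
    { spmId: "B", type: "pv", timestamp: 61000 },
  ])
);
```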
After this step, the logs are passed to real-time computation of the alarm indicators, and the alarm rule engine issues the relevant instructions to the corresponding MapReduce machines for processing.
For example, the JS error rate equals the number of JS error logs divided by the number of PV logs. When the computed rate is greater than 6%, an alarm is sent to the Dingding group; when it is greater than 15%, an SMS alarm is sent.
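The rule just described can be written down as a small evaluation function. The thresholds are the ones quoted in the talk; the rule format, the channel names, and the choice to return every channel whose threshold is exceeded are assumptions of this sketch, not a description of the actual rule engine.

```javascript
// Thresholds quoted in the talk: above 6% -> group alarm, above 15% -> SMS alarm.
// Channel names and the rule format are illustrative assumptions.
const JS_ERROR_RULES = [
  { threshold: 0.06, channel: "dingding_group" },
  { threshold: 0.15, channel: "sms" },
];

// JS error rate = number of JS error logs / number of PV logs.
function evaluateJsErrorRate({ jsError, pv }) {
  if (pv === 0) return { rate: null, channels: [] };
  const rate = jsError / pv;
  const channels = JS_ERROR_RULES
    .filter((rule) => rate > rule.threshold)
    .map((rule) => rule.channel);
  return { rate, channels };
}

console.log(evaluateJsErrorRate({ jsError: 7, pv: 100 }));
// { rate: 0.07, channels: [ 'dingding_group' ] }
console.log(evaluateJsErrorRate({ jsError: 20, pv: 100 }));
// { rate: 0.2, channels: [ 'dingding_group', 'sms' ] }
```

Expressing the thresholds as data rather than code is what makes graded, per-application custom alarms possible: a rule engine only needs to swap the rule list.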
Dingding front-end monitoring 2.0
Monitor log
By applying the same process to other indicators, such as API success rate, JS error rate, and PV data, the minute-level computing system gives us a monitoring system that meets the 1-minute detection goal.
Alarm System architecture
As for the alarm system, the figure above shows a classic monitoring system from Alibaba's R&D organization. If you are interested, you can search for Sunfire on InfoQ for a more detailed architecture introduction; I will not expand on it here.
Summary of the overall logging architecture
That is basically what I wanted to share today: how we thought about, and put into practice, the move from 1.0 to 2.0. Here is a brief summary:
- The key technical idea is to move the log alarm component to the front of the pipeline. We implemented this with log double write, which feeds both the analysis system and the alarm system.
- The alarm platform supports an alarm rule engine, so alarms are truly customizable and can be graded.
- For front-end engineers, what we face is not just the front-end page but, more broadly, the user's terminal.
Conclusion
OK, that is what I wanted to share today: how front-end monitoring safeguards online stability for Dingding, with its volume in the hundreds of millions, more than 100 front-end developers, and more than 600 front-end pages.
Below are the Dingding front-end team's technical columns on Zhihu and Juejin. We are always hiring, so you are welcome to contact me.