Preface

Good afternoon to all the developers watching this video!

The topic I am sharing today is “How to implement a multi-terminal error monitoring platform”. Let me introduce myself briefly. My name is Allan, from the Big Front-end Architecture group at Beibei. I am currently responsible for the group’s infrastructure work, such as maintaining the error monitoring system and engineering standardization. I am also the author of React+Redux Front-end Development. Without further ado, today I will share how Beibei built this error monitoring platform. I hope that everyone can “understand it, use it, and take it away”.

I am going to talk for about an hour, so I hope you are not in a hurry. Please be prepared.

Contents

The value of technology lies in solving business problems. Therefore, in the first two chapters I will draw on Beibei’s actual situation to explain why we built this error monitoring platform and the value it brings. The third chapter covers the technical implementation, and the final chapter concludes.

Chapter 1: Current situation of the group

Let’s get started. First, I would like to briefly introduce the business status of Beibei Group. In 2014, the group founded Beibei, a shopping platform for mothers and babies. In 2017, it founded Bei Store, a membership discount mall. In 2018, it founded Beidai, a fintech platform. Bei Cang was founded in the first half of last year, and Bei Province in the second half. At the same time, Beibei also runs a number of start-up businesses, such as celebrity live streaming and Baycang new retail. From this timeline we can see that Beibei Group went from one new project every three years, to one project a year, to several projects a year. Beibei’s business has diversified and grown rapidly.

This brings us to the first fact about the group: many businesses! Having covered the business situation within the group, let’s look at the current state of the industry as a whole.

I have been doing front-end development since around 2014 or 2015. The mobile Internet took off across the board in 2014, so what has changed in the seven or so years since?

  • Before the mobile Internet took off, most applications lived on the PC;
  • Once the mobile Internet arrived, applications began to migrate from the PC to mobile, that is, to the iOS and Android platforms;
  • Then, to avoid developing every project twice, H5 running in a WebView appeared;
  • But the performance and experience of H5 never matched native apps, which led to Hybrid, Weex, RN and, more recently, Flutter;
  • Then came all kinds of mini programs (WeChat, Alipay, DingTalk, and Douyin mini programs);
  • At the same time, the front-end technology stack has evolved from jQuery to Angular, Vue, React, and so on.

So as front-end developers, we have to learn new technologies every year. This is how our industry as a whole has evolved: our ends, platforms, and technology stacks keep splitting and becoming more and more fragmented. I believe we can all relate to this! This leads to the second fact about our group: many ends!

The diversification of the group’s business leads to fact one: many businesses. The fragmentation of ends in the industry leads to fact two: many ends. Many businesses multiplied by many ends means the number of projects in our group keeps growing! According to my latest statistics, Beibei currently has more than 80 online businesses. So, from an enterprise perspective, what would it cost to use a third-party solution for these 80-plus businesses?

Let’s start with a few common third-party products:

  • First, Fundebug: the paid version costs 159 per month, but with this version the data is not in our own hands, so from a data security point of view we would definitely not adopt it. The on-premises version, which stores the data on our own servers, costs 300,000. That is a lot of money to pay every year;
  • Next, FrontJS: the premium version costs 899 per month and the pro version 2,999 per month;
  • Finally, Sentry, which costs $80 a month.

If we take FrontJS pricing as the reference and do a simple calculation for these 80 projects, we get the result shown in the figure below:


Chapter 2: Why develop our own

Of course, price alone cannot be the yardstick for deciding to develop our own error monitoring platform. After all, Beibei raised 860 million in financing last year 😁. So let’s take a quick look at the competing products, starting with Sentry. Sentry does not support Weex or mini programs. Its stability is also poor: it hangs almost every time a large number of errors occur.

  • [Small Story 1] Before building its own error monitoring platform, Beibei actually used Sentry. When Sentry was handling a large number of errors, its web page could not be opened: once the errors reached a certain volume, it returned 504 almost 100% of the time. It did not return to normal until Sentry had finished processing them, which lengthened our response time to errors.
  • Fundebug does not support Weex, and its statistics are not detailed enough for us.
  • Bugly is probably better known to client developers than to front-end developers. It only supports the Android and iOS platforms, plus some game scenarios (Cocos2d, Unity3D), and does not support extension. Bugly provides basic information such as the number of errors and the device models affected, but the device model, app version, and system version are shown separately. The error information reads like “Weex instance creation error” or “Weex file content format error”, and that alone is not enough to locate and solve the problem!
  • Finally, FrontJS supports only the Web and mini programs, limits the number of monitored events per minute (premium version), has limited alarm notifications, does not support extension, and offers only a limited range of historical events for filtering problems.

So we compared them with our own error monitoring platform (our statistics on the industry solutions may be inaccurate and are for reference only). Each industry solution has its strengths, but none of them meets Beibei’s scenarios. We need something with a longer history, something more stable, something that works on all ends (front end, client, Node), and so on.

So after weighing the industry solutions against the dimensions of stability, consistency, scalability, security, and cost, we finally decided to develop our own.

Chapter 3: Skynet technical analysis

To give you an idea of what I am about to cover, let’s take a look at Skynet’s overall architecture:

Errors are captured at the application access layer and reported through the SDK. The reported data flows through Kafka into ES (Elasticsearch) for temporary storage. After cleaning, the data is persisted in MySQL. Finally, Node.js provides the interfaces for visualization, where the error data is displayed. This process involves quite a few unfamiliar terms, but don’t be afraid and don’t panic. Let’s abstract the picture.

The overall architecture above consists of six parts: error collection, error reporting, data cleaning, data persistence, data visualization, and monitoring. It still looks a little difficult, doesn’t it? It doesn’t matter. Let’s start by implementing a minimal closed loop. Since we are front-end developers, we will implement front-end error monitoring first. Let’s begin by simulating an online error:

I. SDK: error collection and reporting

1. How to design the SDK?

The data shown in the visualization is collected by the SDK at the point where the error occurs, so let’s start with the SDK.

2. Error capture mechanism

Let’s take a look at several common error trapping mechanisms (highlighted in the figure below).

[1] Listen for window.onerror

When a JavaScript runtime error occurs (including syntax errors and exceptions thrown inside handlers), an error event using the ErrorEvent interface is fired at window, and window.onerror() is invoked.

[2] Listen for the unhandledrejection event

When a Promise is rejected and the rejection is not handled, the unhandledrejection event is fired. You can therefore listen for this event to capture and report the error information.
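The two listeners described in [1] and [2] can be sketched roughly as follows. This is a minimal TypeScript sketch under stated assumptions, not Skynet's actual SDK code: `installGlobalCapture`, `CapturedError`, and the `onCapture` callback are hypothetical names, and the event target is passed in as a parameter (in a browser it would be `window`) so the snippet stays self-contained.

```typescript
// Minimal shape of `window` needed here, so the sketch does not depend on DOM types.
interface GlobalTargetLike {
  addEventListener(type: string, listener: (event: unknown) => void): void;
}

// Hypothetical report shape; the field names are assumptions.
interface CapturedError {
  kind: "runtime" | "promise";
  message: string;
  stack: string | undefined;
}

// Safely read a property from a value of unknown type.
function readProp(source: unknown, key: string): unknown {
  return typeof source === "object" && source !== null
    ? (source as Record<string, unknown>)[key]
    : undefined;
}

function asString(value: unknown): string | undefined {
  return typeof value === "string" ? value : undefined;
}

// `onCapture` stands in for whatever the SDK does next (for example, reporting).
function installGlobalCapture(
  target: GlobalTargetLike,
  onCapture: (captured: CapturedError) => void
): void {
  // [1] Uncaught runtime errors: an ErrorEvent carries `message` and `error`.
  target.addEventListener("error", (event) => {
    const error = readProp(event, "error");
    onCapture({
      kind: "runtime",
      message: asString(readProp(event, "message")) ?? "Unknown error",
      stack: asString(readProp(error, "stack")),
    });
  });

  // [2] Unhandled Promise rejections: the event carries the rejection `reason`.
  target.addEventListener("unhandledrejection", (event) => {
    const reason = readProp(event, "reason");
    onCapture({
      kind: "promise",
      message: asString(readProp(reason, "message")) ?? String(reason),
      stack: asString(readProp(reason, "stack")),
    });
  });
}
```

Passing the target in, rather than touching `window` directly, is only a choice made to keep the sketch easy to exercise outside a browser.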

[3] Cross-origin script errors: Script error.

Since static resources are generally hosted on third-party domains such as a CDN, window.onerror on the business domain only sees such errors as Script error. There are two common solutions to this problem:

  • Configure the Access-Control-Allow-Origin response header on the back end, and add the crossorigin attribute to the script tag on the front end
  • Hijack the native method, wrap the call in a try/catch, and re-throw the error

Here we look at the second option: hijack the native method and wrap the listener in a try/catch, since the browser does not mask exceptions that are caught by a try/catch in this way. Example code is as follows:
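The original example code is not reproduced in this transcript. A minimal TypeScript sketch of the hijacking idea might look like the following; `hijackAddEventListener`, `ListenerHost`, and `onError` are hypothetical names, and the host object is a parameter (in a browser it would typically be `EventTarget.prototype`), so this is an illustration under assumptions rather than the SDK's real code.

```typescript
type Listener = (event: unknown) => void;

// Minimal shape of an object that owns addEventListener.
interface ListenerHost {
  addEventListener(type: string, listener: Listener): void;
}

// Patches `host.addEventListener` in place so every listener runs inside try/catch.
function hijackAddEventListener(
  host: ListenerHost,
  onError: (error: unknown) => void
): void {
  // Keep a reference to the native method so it can still do the real work.
  const nativeAdd = host.addEventListener;

  host.addEventListener = function (
    this: ListenerHost,
    type: string,
    listener: Listener
  ): void {
    // Same behaviour as the business listener, plus error capture.
    const wrapped: Listener = function (this: unknown, event: unknown): void {
      try {
        listener.call(this, event);
      } catch (error) {
        onError(error); // the caught error object keeps its message and stack
        throw error; // re-throw so existing global handlers still fire
      }
    };
    // Register the wrapped listener through the native method.
    nativeAdd.call(this, type, wrapped);
  };
}
```

A real implementation would also have to patch `removeEventListener` so that callers can still remove the listener they originally passed in; that part is left out of this sketch.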

This is the AOP (aspect-oriented programming) design pattern. The listener is wrapped so that when an error occurs we catch it and re-throw it in the catch block, and the wrapped listener is then registered through the native addEventListener. To make this easier to understand, let’s look at the flow of the solution:

As shown in the figure above, suppose the business runs on a.com and the static resources are on b.com. Without any handling, JS errors from the b.com domain are displayed as Script error. But we want the entire error stack, so we hijack the native method, wrap the listener in a try/catch, and then throw the error again. The re-thrown exception comes from same-origin code, so we get the full error stack and can capture and report the error.

[4] Other technology stack: Vue.js

Vue.js has its own error capture mechanism, Vue.config.errorHandler. The SDK hijacks Vue.config.errorHandler to report errors that occur inside Vue. Example code is as follows:
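The original example is not included in this transcript. As a rough TypeScript sketch, assuming the Vue 2 style handler signature `(err, vm, info)`, the hijack could look like this; `hijackVueErrorHandler`, `VueConfigLike`, and `report` are hypothetical names, and the config object is passed in (in an application it would be `Vue.config`).

```typescript
// Handler signature as used by Vue 2's Vue.config.errorHandler.
type VueErrorHandler = (err: unknown, vm: unknown, info: string) => void;

// Only the part of Vue.config that this sketch touches.
interface VueConfigLike {
  errorHandler?: VueErrorHandler;
}

function hijackVueErrorHandler(
  config: VueConfigLike,
  report: (err: unknown, info: string) => void
): void {
  // Save the handler the business code may already have installed.
  const originalHandler = config.errorHandler;

  config.errorHandler = (err, vm, info) => {
    // Report first, then let the original handler run as before.
    report(err, info);
    if (typeof originalHandler === "function") {
      originalHandler(err, vm, info);
    }
  };
}
```

Calling the saved handler after reporting keeps any existing business behaviour intact, which is the point of the wrapper.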

The AOP pattern is used here again (if you are not familiar with it, it is worth looking up). The original Vue.config.errorHandler is saved in a temporary variable; when an error occurs, we report it and then continue with the original errorHandler.

[5] Other technology stack: React.js

React.js also has its own error capture mechanism. We declare an error boundary component in the SDK, and the business code imports that component from the SDK and wraps its React tree with it. When an error occurs in a child component, it propagates to the SDK’s error boundary component, where it can be caught and reported directly in the componentDidCatch lifecycle method. The error boundary component is a higher-order component (what is a higher-order component? A React component that wraps another React component).

3. Environment collection

3.1 Principles of collecting environment information

To obtain environment information, we first check whether it was actively reported by the business; if so, we use the actively reported information. If not, we check whether environment information is available from the hybrid interface (which uses the client’s capability to obtain it); if so, we use the information collected by the client. If not, we fall back to the environment information parsed from the UA (UserAgent). These are the three ways we collect environment information.
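The priority order described above (active report first, then the hybrid interface, then the UA) can be sketched as a simple fallback. The field names and the `resolveEnv` function below are assumptions made for illustration, and the sketch resolves each field independently, which is one possible reading of the rule rather than a confirmed detail of the SDK.

```typescript
// Hypothetical environment fields; the real SDK's fields are not shown in the talk.
type EnvField = "business" | "device" | "network";
type EnvInfo = { [K in EnvField]?: string };

const ENV_FIELDS: EnvField[] = ["business", "device", "network"];

// For each field, take the first source that provides a value,
// checking sources in priority order: active report, client, UA.
function resolveEnv(
  activeReport: EnvInfo,
  clientInfo: EnvInfo,
  userAgentInfo: EnvInfo
): EnvInfo {
  const ordered = [activeReport, clientInfo, userAgentInfo];
  const resolved: EnvInfo = {};
  for (const field of ENV_FIELDS) {
    for (const source of ordered) {
      const value = source[field];
      if (value !== undefined) {
        resolved[field] = value;
        break; // a higher-priority source wins
      }
    }
  }
  return resolved;
}
```

For example, a device name supplied by the client would take precedence over the coarser one parsed from the UA string.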

Business information is reported through active reporting, client capability, and UA. Device information is reported through client capability and UA. Network information is reported through client capability and UA. Finally, the SDK version can be obtained directly inside the SDK with require('./package.json').version.

4. Why behavior collection?

We can already locate errors using the error information and environment information above, so why do we need behavior collection? Take a look at the screenshot below (from our error monitoring platform):

4.1 Behavior Collection and classification

We simply classify the behaviors as user behavior, browser behavior, and console print behavior.

4.2 Explanation of behavior collection mechanism

Next, we explain each of these behaviors case by case.

Case 1: Click behavior (user behavior)

Use addEventListener to listen for click events globally, collecting user actions (click, input) and DOM element names. When an error occurs, the collected actions are reported together with the error.

Case 2: Send request behavior (browser behavior)

Listen to the onreadystatechange callback of the XMLHttpRequest object and collect data when the callback executes. In the browser, axios is built on XMLHttpRequest, so its requests are captured this way. Note that the native fetch API is not built on XMLHttpRequest: unless an XMLHttpRequest-based polyfill is in use, fetch has to be hijacked separately to capture its requests.

Case 3: Page jump (browser behavior)

Listen for window.onpopstate. This handler is triggered when the page navigates through the session history, and the navigation information is collected there.

Again, the AOP pattern is used here. You will notice that AOP, which we rarely use in day-to-day business code, is used many times in the SDK! If you are interested, go and read up on it 🙂

Case 4: Console printing (console behavior)

Here we override the info, warn, and error methods of the console object to collect the information whenever the console is called.
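Overriding the console methods follows the same wrap-and-delegate idea. A minimal TypeScript sketch, assuming a hypothetical `hijackConsole` helper and a `record` callback that stores the behavior, could look like this; the console object is passed in (normally the global `console`) to keep the snippet self-contained.

```typescript
type ConsoleMethod = "info" | "warn" | "error";

// Only the three methods the talk mentions.
type ConsoleLike = { [K in ConsoleMethod]: (...args: unknown[]) => void };

// Hypothetical behavior record; the field names are assumptions.
interface ConsoleBehavior {
  level: ConsoleMethod;
  args: unknown[];
  time: number;
}

function hijackConsole(
  target: ConsoleLike,
  record: (behavior: ConsoleBehavior) => void
): void {
  const levels: ConsoleMethod[] = ["info", "warn", "error"];
  levels.forEach((level) => {
    // Keep the original method so output still reaches the console.
    const originalMethod = target[level];
    target[level] = function (this: unknown, ...args: unknown[]): void {
      record({ level, args, time: Date.now() });
      originalMethod.apply(this, args);
    };
  });
}
```

In practice the recorded behaviors would probably be kept in a bounded buffer so that only the most recent ones are attached to an error report; that detail is an assumption and is not shown here.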

5. Data reporting

At this point, we have the error, environment, and behavior information, and we need to report it. Here we report it with a GET request for a GIF image.

As shown in the screenshot above, we report the error message by requesting an image named n4.gif.
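Reporting through an image request boils down to encoding the payload into the query string of the GIF URL and assigning that URL to an image's `src`. The sketch below is an illustration under assumptions: `buildBeaconUrl` and `sendByImage` are hypothetical helpers, the payload fields are made up, and the endpoint host in the usage line is a placeholder (only the file name n4.gif comes from the talk).

```typescript
// Flat payload of primitive values; real reports are likely richer.
type BeaconPayload = Record<string, string | number | boolean>;

// Encode the payload as a query string appended to the GIF endpoint.
function buildBeaconUrl(endpoint: string, payload: BeaconPayload): string {
  const query = Object.keys(payload)
    .map((key) => encodeURIComponent(key) + "=" + encodeURIComponent(String(payload[key])))
    .join("&");
  const separator = endpoint.indexOf("?") === -1 ? "?" : "&";
  return endpoint + separator + query;
}

// Anything with a writable `src`; in a browser this would be `new Image()`.
interface ImageLike {
  src: string;
}

function sendByImage(url: string, createImage: () => ImageLike): void {
  const image = createImage();
  image.src = url; // in a browser, setting src triggers the GET request
}
```

In a browser the call might look like `sendByImage(buildBeaconUrl("https://example.com/n4.gif", { msg: "boom" }), () => new Image())`. Because everything travels in the URL, the payload has to stay small enough for the URL length limits of browsers and servers.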

5.1 Why use a 1x1 GIF?

The reasons are:

1. No cross-domain problems

2. There is no need to fetch and process data after the GET request, and the server does not need to send data

3. It does not carry the current domain’s cookies

4. It does not block page loading or affect the user experience; only a new Image object is needed

5. Compared with the smallest BMP and PNG images, the smallest GIF saves 41% and 35% of network resources respectively

6. SDK Summary

Let’s summarize the SDK in one sentence:

Listen to or hijack the original methods to obtain the data that needs to be reported, and when an error occurs, trigger the function that reports it with a GIF request.

To make it easier to remember, here are three keywords: hijack, original method, GIF! (If you can’t remember them, don’t hit me.)

II. Data cleaning

Let’s first think about why we can’t use the data reported by the SDK directly, but need to clean the data.

The raw data reported by the SDK has the following characteristics:

  • Large data volume: often a few megabytes or ten megabytes, and we have even seen dozens of megabytes
  • No categorization or aggregation: errors of the same type differ only in when they occurred, so there is no need to store every one of them
  • No filtering of invalid data: too much useless information hinders aggregation and also increases the load on the server

1. Comparison of storage media

So how do we clean the data reported by the SDK? First we have to find somewhere to store it, so how do we store it? Let’s look at some common data storage solutions on the market:

The first is MySQL, which you should all be familiar with since it is used in everyday business. MySQL supports secondary indexes and transactions. It is not good at full-text search, but its typical usage scenarios rarely need full-text search, so it is well suited to online core business.

The second is HBase, which has been around for about 10 years and is a relatively mature project. HBase is distributed by nature. Although it does not support full-text search, secondary indexes, or transactions, it supports online scaling, which makes it suitable for applications with unpredictable growth and heavy write volumes.

The third is ES, an open source distributed search and analytics engine that has become popular in recent years. With a simple deployment it can analyze logs, run full-text search, and analyze structured data. ES also has statistical capabilities, supports secondary indexes, and is distributed by nature. Compared with other high-profile database products, ES has humble origins. Its founder was an unemployed programmer who created it in his spare time so that his wife could search for recipes. Elastic has since raised hundreds of millions of dollars in funding, and that programmer has risen to the top of his career as CEO.

So, having compared these storage solutions, we chose MySQL as the persistent storage that provides data for visualization, and ES as the temporary storage for the data. It is at this stage that we clean the data.

2. Cleaning process

2.1 Cleaning Process

The cleaning process is divided into the following three steps: data acquisition, data preprocessing and data aggregation.

2.2 Obtaining Data

Getting data from ES is very simple. ES is a search server built on Lucene that provides a distributed, multi-user full-text search engine behind a RESTful web interface. So we just call it the way we call any interface in everyday front-end development.

Fetch the last minute of error messages from ES with a GET request (below):

Setting thresholds (peak clipping). We use the following two methods to clip peaks:

  • At most 10,000 records are fetched per minute
  • If the number of errors of the same type exceeds 200, only the count is recorded for the rest
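The two peak clipping thresholds can be sketched as a small function over one minute of records. This is a hedged TypeScript illustration: `clipPeak`, `RawRecord`, and its fields are hypothetical, and in the real pipeline the first limit is presumably applied in the ES query itself rather than in memory.

```typescript
// Hypothetical raw record; `type` identifies errors of the same kind.
interface RawRecord {
  type: string;
  detail: string;
}

interface ClippedBatch {
  kept: RawRecord[]; // records whose details are kept for cleaning
  countsByType: Record<string, number>; // every fetched record is still counted
}

function clipPeak(
  records: RawRecord[],
  maxPerMinute: number = 10000,
  maxDetailPerType: number = 200
): ClippedBatch {
  // Threshold 1: never take more than `maxPerMinute` records for one minute.
  const batch = records.slice(0, maxPerMinute);

  // A prototype-free object avoids clashes with keys such as "constructor".
  const countsByType: Record<string, number> = Object.create(null);
  const kept: RawRecord[] = [];

  for (const record of batch) {
    const seen = (countsByType[record.type] || 0) + 1;
    countsByType[record.type] = seen;
    // Threshold 2: beyond `maxDetailPerType`, only the count grows.
    if (seen <= maxDetailPerType) {
      kept.push(record);
    }
  }
  return { kept, countsByType };
}
```

Counting every record while keeping only the first few per type preserves the error frequency without storing thousands of identical payloads.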

2.3 Data Preprocessing

As shown on the right of the PPT screenshot above, this is the content field of an error fetched from ES. Since the data in ES is stored as an escaped string, we need to parse it with JSON.parse(). Sometimes an object is not completely wrapped in the string, so we need to extract the fields we need from it. In addition, we remove useless information from the raw data to reduce the storage size.

The screenshot above shows the data we finally store in MySQL. You can see that it is relatively clean, with no redundant or incomprehensible content.

2.4 Data Aggregation

Let’s think about why we need data aggregation. It serves two purposes: 1. storage performance: less to store; 2. query performance: faster queries. So how do we aggregate errors of the same kind?

We do it along three dimensions: 1. business name; 2. error type; 3. error message.

For example, take an error whose type is SyntaxError and whose message is “The string did not match the expected pattern.” We concatenate the business name, error type, and error message, and then compute the MD5 of the result to get a string like this: ecf9f6d430bea229473782dc63407673.

Every subsequent error is processed in the same way, and we check whether its string is the same. If it is, we treat the errors as the same type and store them in MySQL as one record:
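The aggregation step can be sketched as grouping by a digest of the three dimensions. In this TypeScript sketch, `aggregateErrors`, `CleanedError`, and the "|" separator are assumptions; the digest function is passed in as a parameter so the snippet stays self-contained, and in the pipeline described here it would be an MD5 function. The `message_id` and `event_count` names mirror the columns mentioned in the talk.

```typescript
// Hypothetical cleaned record carrying the three aggregation dimensions.
interface CleanedError {
  business: string;
  errorType: string;
  message: string;
}

interface AggregatedRow {
  message_id: string; // digest of business + error type + error message
  event_count: number; // how many errors shared that digest
  sample: CleanedError; // first occurrence, kept for display
}

function aggregateErrors(
  errors: CleanedError[],
  digest: (input: string) => string
): AggregatedRow[] {
  const rowsById: Record<string, AggregatedRow> = Object.create(null);
  const rows: AggregatedRow[] = [];

  for (const error of errors) {
    // Join the three dimensions, then hash the joined string.
    const id = digest([error.business, error.errorType, error.message].join("|"));
    const existing = rowsById[id];
    if (existing) {
      existing.event_count += 1;
    } else {
      const row: AggregatedRow = { message_id: id, event_count: 1, sample: error };
      rowsById[id] = row;
      rows.push(row);
    }
  }
  return rows;
}
```

On Node, one way to supply the digest is `(s) => createHash("md5").update(s).digest("hex")` using `createHash` from the built-in `crypto` module.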

As shown above, message_id is the aggregated string, and event_count is the number of errors of the same type. Finally, we visualize the data as follows:

2.5 Cleaning process monitoring

Online errors are monitored by the SDK, and the cleaning process itself also needs monitoring. We check whether the data volume and time consumed per unit of time are normal, whether the amount of ignored data is stable over time, and whether the amount of data pulled per minute reaches 10K (under the peak clipping mechanism described earlier, at most 10K records are processed per minute).

III. Monitoring

In this final part of Web error monitoring, we look at monitoring. When errors occur in large numbers within a short period, we need to notify developers immediately so that they can deal with the online errors first, minimizing their impact. The error alarm model is simple:

When errors meet certain criteria, the developers subscribed to the project are notified by DingTalk, email, phone, SMS, or Webhook. An alarm is triggered when the number of errors reported per minute is greater than or equal to 100 for two consecutive minutes, and the error information is then delivered to the developers!
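The alarm condition (at least 100 errors per minute for two consecutive minutes) is easy to express as a check over per-minute counts. The function name `shouldAlarm` and its parameters are hypothetical; this is a sketch of the rule as stated, not the platform's actual alarm task.

```typescript
// Returns true when `consecutiveMinutes` adjacent per-minute counts
// are all greater than or equal to `threshold`.
function shouldAlarm(
  countsPerMinute: number[],
  threshold: number = 100,
  consecutiveMinutes: number = 2
): boolean {
  let streak = 0;
  for (const count of countsPerMinute) {
    // A minute below the threshold resets the streak.
    streak = count >= threshold ? streak + 1 : 0;
    if (streak >= consecutiveMinutes) {
      return true;
    }
  }
  return false;
}
```

Requiring consecutive minutes rather than a single spike is what keeps a brief burst from paging every subscriber.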

As shown in the preceding figure, alarms are divided into ordinary alarms and escalated alarms. Escalated alarms additionally send text messages and notify work groups. I believe the developers watching this video carry their phones everywhere, even to the bathroom. Each alarm has a title and content. The title contains the source of the error, the error level, and the business name; the content contains a description of the error and the number of users affected.

So far we have completed error monitoring on the Web and achieved the minimal closed loop of multi-terminal error monitoring! What about error monitoring on the other ends? Let’s first review how error monitoring on the Web is implemented, based on Skynet’s overall architecture diagram:

Let’s think about which processes in the diagram above can be shared across ends, and which differ from one end to another.

Yes, you guessed it. Apart from error collection and reporting, which differ from end to end, all the other processes can be shared! With this as a starting point, we next analyze how errors are collected and reported on the other ends and platforms.

Note that the core idea here is: differentiated collection, unified-format reporting!

Remember this sentence for later in the presentation!

IV. SDK implementation: Node

1. Initialization

During startup, the Node application obtains the corresponding service ID from ZooKeeper to initialize Skynet.

Q: Why get the ID from ZK?
A: Skynet is initialized in an underlying Node package rather than being called from the service, so it obtains the service ID dynamically through ZK to identify the service.

2. Error capture mechanism

On the Node side, we use the process object to listen for the uncaughtException and unhandledRejection events to catch unhandled JS exceptions and Promise rejections.
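Registering the two process-level listeners can be sketched as follows. `installNodeCapture`, `ProcessLike`, and `onCapture` are hypothetical names; the emitter is passed in (in a Node application it would be the global `process` object) so that the snippet does not depend on Node's type definitions.

```typescript
// Minimal shape of Node's `process` needed for this sketch.
interface ProcessLike {
  on(event: string, listener: (...args: unknown[]) => void): unknown;
}

type NodeErrorKind = "exception" | "rejection";

function installNodeCapture(
  proc: ProcessLike,
  onCapture: (kind: NodeErrorKind, error: unknown) => void
): void {
  // Synchronous errors that nothing caught.
  proc.on("uncaughtException", (error) => {
    onCapture("exception", error);
  });
  // Promise rejections that nothing handled; the first argument is the reason.
  proc.on("unhandledRejection", (reason) => {
    onCapture("rejection", reason);
  });
}
```

Note that adding an uncaughtException listener replaces Node's default behaviour of printing the stack and exiting, so a production setup would normally report the error and then shut the process down deliberately.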

3. Collect error information

Once the error object is caught, it is parsed with the third-party library stack-trace, which parses the error stack into an array, and we read the source code of the in-app files on the call chain in the stack.

V. SDK implementation: Weex

1. Multi-terminal compatibility

1) The Weex project is built with two sets of Webpack configuration, but business code always references @weex/skynet

2) For the Web build, replace-loader replaces the package with @fe-base/skynet

For Weex: @weex/skynet; for Web: @fe-base/skynet.

Core idea: achieve multi-terminal compatibility with the help of Webpack’s engineering capability

2. Error capture mechanism

1) The front end provides the business module name for the client

2) The module name is passed to the client through the hybrid interface

3) The client reports Weex errors when it captures them

Core idea: the front end marks the source, and the client’s capability reports the error

VI. SDK implementation: mini program

1. Error capture mechanism

WeChat mini programs have a global onError function, which is used to catch errors.

2. Collection of environment information

Q: WeChat mini programs do not provide a general API for obtaining environment information, do they?

A: Environment information can be mounted on globalData during the startup lifecycle phase of the mini program. The SDK obtains the globally unique App instance through the getApp method to get the globalData object.

VII. Client error reporting (without SDK)

1. Android error reporting mechanism

Using the mechanism provided by the system, implement the Thread.UncaughtExceptionHandler interface, obtain the crash information in uncaughtException, and replace the default crash callback during application initialization.

To make this easier for front-end developers to understand:

uncaughtException: analogous to window.onerror on the front end; Thread.UncaughtExceptionHandler: the callback function.

2. iOS error reporting mechanism

Using the error capture mechanism provided by the system, register hooks for Objective-C exceptions and POSIX signals. When a crash occurs, the crash information can be recorded through the hook function.

At this point, the Skynet error monitoring system covers the Web, Node, Weex, mini programs, and the client!

Core idea: differentiated capture, unified-format reporting.

Conclusion

We started from the visual presentation of error monitoring on the Web and traced the data back to its source. We then built an SDK to report the data. Next, two scripts process the reported data: the cleaning script (which pulls data from ES for cleaning) and the alarm task. Finally, the data is persisted in MySQL, and Node.js provides the interface services for visualization.

Finally, the SDK is continuously extended to achieve cross-platform, cross-terminal monitoring.

That is all.