



This article is the brief lecture version of the fifth talk, "Building a Front-End Error Monitoring System," shared by the lecturer Allan (see the video for the full version):

Preface

Good afternoon, everyone watching this video!

My topic today is "How to Implement a Multi-End Error Monitoring Platform". Let me briefly introduce myself: I am Allan, from Beibei's Big Front-End Architecture Group, where I'm responsible for maintaining the group's error monitoring system and for engineering standardization. I'm also the author of React+Redux Front-end Development in Action.

Without further ado, today I'll share how Beibei built this error monitoring platform in-house. I hope this talk leaves you with something you can understand, use, and take away. It will run about an hour, so please don't rush, and get comfortable.

Agenda

The value of technology lies in solving business problems. So I'll use the first two chapters to explain, based on Beibei's actual situation, why we built this error monitoring platform and what value it brings. Chapter 3 covers the technical implementation, and the final chapter concludes.

Chapter 1: The Current State of the Group

Now, let's get started. First, a brief introduction to the current state of Beibei Group's business. Beibei was founded in 2014 as a mother-and-baby shopping platform. In 2017 came Beidian, a members-only discount mall; in 2018, Beidai, a fintech platform; then Beicang in the first half of last year and Beisheng in the second half. Beibei also runs a number of startup ventures: influencer livestreaming, Beicang new retail, and more. From this timeline we can see how quickly Beibei Group has diversified: from one project every three years at the start, to one a year, to several a year. This leads to the group's first reality: a lot of businesses!

Having covered the group's internal situation, let's look at the industry as a whole. I first encountered front-end development in 2014 and have worked in it since 2015. The mobile internet took off in earnest around 2014, so what has changed on the front end in the seven years or so since?

  • Before the mobile internet took off, most applications lived on the PC;
  • As the mobile internet rose, applications moved from PCs to mobile, i.e. our iOS and Android platforms;
  • Then, to avoid building the same project twice, H5 pages running inside WebViews appeared;
  • But H5 performance and experience lagged behind native apps, so Hybrid, Weex, RN, and more recently Flutter emerged;
  • Along the way came all kinds of mini programs (WeChat mini programs, Alipay mini programs, DingTalk mini programs, Douyin mini programs);
  • Meanwhile, the front-end technology stack itself moved from the early jQuery to Angular, Vue, React, and so on.

So as front-end developers, we have to learn new technologies every year. This is how our whole industry has evolved: ends, platforms, and tech stacks keep splitting apart and growing more chaotic. I'm sure we can all relate! And this leads to the group's second reality: many ends!

The group's diversified business produced reality one: many businesses. The industry's fragmentation across ends produced reality two: many ends. Many businesses multiplied by many ends ultimately means an ever-growing number of projects in our group! By my recent count, Beibei now has more than 80 businesses online! So, from the company's point of view, how much would it cost to adopt a third-party solution for these 80+ businesses?

Let’s start with a few common third-party products:

  • First, Fundebug: the paid version is ¥159 per month, but our data wouldn't stay with us in that version, so for data-security reasons we would never use it; the on-premises version, which stores data on your own servers, costs ¥300,000. Paying that kind of fee every year is a substantial cost.
  • Next, FrontJS: the advanced edition is ¥899 per month, the professional edition ¥2,999 per month.
  • Finally, Sentry, at $80 per month. Taking FrontJS as the billing reference, let's do some simple arithmetic for our 80 projects: 80 projects × 12 months × ¥299 per project per month comes to over ¥287,000 a year. By contrast, we estimated that building such a system ourselves would take 180 person-days, roughly ¥150,000, and once built there is no recurring fee.

Chapter 2: Why We Chose to Build Our Own

Of course, price alone can't be the measure of whether to develop our own error monitoring platform. After all, Beibei did raise ¥860 million in financing last year 😁.

So let's take a brief look at the competing products, starting with Sentry. Sentry doesn't support Weex or mini programs, and its stability isn't great: when errors spike, it can hang completely.

  • [Side story 1] Beibei actually used Sentry before building its own error monitoring platform. When Sentry was busy processing a flood of errors, its web pages became unusable; once the error volume approached its limit, it returned 504 and didn't recover until it had finished processing, which stretched out our response time to errors.
  • Fundebug doesn't support Weex, and its statistics aren't detailed enough for us.
  • Next, Bugly. You may know Bugly well from the client side, but not from a front-end perspective. Bugly only supports the Android and iOS platforms plus some game scenarios (Cocos2d, Unity3D), and it can't be extended. Bugly tells us the basics, such as error counts and error types, but the error type / app version / OS version views are scattered, and the error information amounts to little more than "Weex instance creation error" or "Weex file content format error". You can't locate and fix a problem with only that much information!
  • Finally, FrontJS. FrontJS only supports the web and mini programs, caps the number of monitored events per minute (advanced edition), limits alarm notifications, can't be extended, and restricts how far back you can filter historical events.

So let's compare them against our own error monitoring platform (the statistics on industry solutions may contain errors; they're for reference only). Each industry solution has its strengths, but none fits Beibei's scenario. We need longer look-back, better stability, a product that works across all our ends (front end, client, Node), and so on.

So, after weighing the commercial options along the dimensions of stability, consistency, scalability, security, and cost, we decided to build our own.

Chapter 3: Skynet Technical Analysis

To give you a sense of what I'm going to cover, let's first look at the overall architecture of Skynet:

Errors are captured at the application access layer and reported via the SDK. The reported data flows through Kafka into ES for temporary storage. A cleaning script then runs in the scheduling center; once cleaned, the data is persisted into MySQL. Finally, Node.js exposes an interface for the visualization layer, and the error data is displayed at last. This whole flow may contain a lot of unfamiliar words, but don't be afraid, don't panic: let's abstract the picture.

The overall architecture diagram above breaks down into six parts: error collection, error reporting, data cleaning, data persistence, and finally data visualization and monitoring.

Still looks a little daunting, doesn't it? OK, let's start by implementing a minimal closed loop. Since everyone here does front-end development, we'll implement front-end error monitoring first!

Let’s simulate an online error:

As shown in the screenshot above, this is a Vue project that calls an undeclared method, this.foo(), inside created. That's bound to throw an error, right? A minute or two later, we open the visualization platform, and the error appears on the monitoring platform's front page. At the top is the error trend, with error time on the horizontal axis and error count on the vertical axis; below, in the red box, is the error list. Clicking a list entry takes us to the error-details page. In the middle screenshot we can see the error stack, which here is already enough to locate the error at a glance. But to make it even easier for developers to pin down a problem, we also show the context in which the error occurred on the right: device information and environment information!
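For concreteness, here's a minimal sketch of the kind of component that triggers this scenario (the component name is hypothetical; the original screenshot isn't reproduced here):

```js
// A Vue component that calls an undeclared method in created().
// this.foo is never defined, so creation throws
// "TypeError: this.foo is not a function" for the SDK to catch.
export default {
  name: 'DemoPage', // hypothetical name
  created() {
    this.foo(); // undeclared method -> runtime error
  },
};
```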

OK, so what happens between the moment the error occurs online and its final visualization?

With that question in mind, let's dig into the architecture flowchart we abstracted earlier, starting from the end our front-end developers know best: the web!

I. SDK: Error Collection/Reporting

1. How to design the SDK?

There's no doubt about it: the visualized data originates from SDK collection at the error's source, so let's start with the SDK.



We split the SDK's reporting into automatic and manual. Manual reporting is typically used in business try/catch blocks and comes in three levels: ERROR, WARN, and INFO. Wherever manual reporting doesn't catch something, automatic reporting is our safety net!
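As a sketch of what manual reporting might look like at a call site (the facade below is an assumption for illustration, not Skynet's actual API):

```js
// Hypothetical manual-reporting facade with the three levels described above.
const monitor = {
  report(level, err, extra) {
    // In the real SDK this would enqueue a GIF report (see "Data reporting").
    console.log('[report]', level, err.message, extra);
  },
  error(err, extra) { this.report('ERROR', err, extra); },
  warn(err, extra) { this.report('WARN', err, extra); },
  info(err, extra) { this.report('INFO', err, extra); },
};

// Manual reporting lives in business try/catch blocks:
try {
  JSON.parse('{not valid json');
} catch (err) {
  monitor.error(err, { page: 'order-detail' });
}
// Anything not caught manually falls through to the SDK's automatic
// listeners (window.onerror etc.) as the safety net.
```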

2. Error catching mechanism

Let’s take a look at some of the common error-catching mechanisms (highlighted below) as examples.

[1] Listen to window.onerror

When a JavaScript runtime error occurs (including syntax errors and exceptions thrown inside handlers), an error event using the ErrorEvent interface is fired at window and window.onerror() is invoked.
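A minimal sketch of this listener (reportError is a hypothetical helper standing in for the SDK's reporter):

```js
window.onerror = function (message, source, lineno, colno, error) {
  reportError({
    type: 'js-error',
    message,
    source, // URL of the script that raised the error
    position: `${lineno}:${colno}`,
    stack: error && error.stack, // `error` can be absent for cross-origin scripts
  });
  return false; // keep the browser's default console logging
};
```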

[2] Listen for unhandledrejection event

When a Promise is rejected and the rejection goes unhandled, the unhandledrejection event fires. We can therefore listen for this event to capture and report the error information.
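A sketch, using the same hypothetical reportError helper:

```js
window.addEventListener('unhandledrejection', (event) => {
  const reason = event.reason; // may be an Error, a string, or anything else
  reportError({
    type: 'unhandled-rejection',
    message: reason instanceof Error ? reason.message : String(reason),
    stack: reason instanceof Error ? reason.stack : undefined,
  });
  // event.preventDefault() would also silence the browser's console warning
});
```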

[3] Cross-domain Script error: Script error.

Because we generally host static resources on third-party domains such as a CDN, window.onerror under the current service domain shows such errors only as "Script error." There are two solutions:

  • Configure Access-Control-Allow-Origin on the back end and add the crossorigin attribute to the script tag on the front end
  • Hijack the native method, wrap it in try/catch to bypass the restriction, and re-throw the error

Here we take the second approach. Since the browser's cross-origin masking doesn't apply to exceptions we catch ourselves in a try/catch, we hijack the native method and wrap the callback in a try/catch. The example code is as follows:
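The original slide isn't reproduced here, so below is a reconstruction of the technique as described:

```js
// Hijack the native addEventListener so every listener runs inside try/catch.
const originalAddEventListener = EventTarget.prototype.addEventListener;

EventTarget.prototype.addEventListener = function (type, listener, options) {
  const wrapped = function (...args) {
    try {
      return listener.apply(this, args);
    } catch (err) {
      // Re-throwing from our same-origin wrapper lets window.onerror see
      // the full stack instead of the opaque "Script error."
      throw err;
    }
  };
  // Always delegate actual registration to the native method.
  return originalAddEventListener.call(this, type, wrapped, options);
};
```

(A production version would also map wrapped listeners back to their originals so removeEventListener keeps working; that's omitted here.)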





This is the AOP (Aspect-Oriented Programming) design pattern: when an error occurs, we re-throw it inside the catch, while always delegating actual execution to the native addEventListener. For ease of understanding, here's how the solution plays out:





As the figure above shows, if the service lives on a.com and the static resources on b.com, JS errors from the b.com domain are shown only as "Script error." But we want the full error stack, so we hijack the native event method, wrap it in try/catch, and re-throw the error. Because the re-throw happens in same-origin code, we get the full error stack, achieving error capture and reporting.

[4] Other technology stack — Vue.js

A Vue project ships its own error-capturing mechanisms: Vue.config.errorHandler (and the errorCaptured hook). Here we hijack Vue.config.errorHandler to capture and report errors that occur in Vue projects. The example code is as follows:
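A sketch of that hijack (reportError remains the hypothetical helper from earlier):

```js
import Vue from 'vue';

// Save whatever handler was registered before us (AOP style).
const previousHandler = Vue.config.errorHandler;

Vue.config.errorHandler = function (err, vm, info) {
  reportError({
    type: 'vue-error',
    message: err.message,
    stack: err.stack,
    info, // e.g. which lifecycle hook threw
  });
  // Then let the original handler continue as before.
  if (typeof previousHandler === 'function') {
    previousHandler.call(this, err, vm, info);
  }
};
```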

Once again the AOP pattern is used (look it up if you're not sure). The native Vue.config.errorHandler is saved to a temporary variable; when an error occurs we report it first, then hand control back to the original errorHandler.

[5] Other technology stack — React.js

React.js also has its own error-trapping mechanism. We declare an error-boundary component in the SDK; the business then imports that component and wraps its React tree with it. When an error occurs in a child component, it bubbles up to the SDK's error-boundary component, where it can be captured and reported in the componentDidCatch lifecycle method. The error-boundary component is really a higher-order component. (What's a higher-order component? A React component that wraps another React component.)
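A minimal sketch of such an error-boundary component:

```js
import React from 'react';

class ErrorBoundary extends React.Component {
  state = { hasError: false };

  static getDerivedStateFromError() {
    return { hasError: true }; // switch to fallback rendering
  }

  componentDidCatch(error, errorInfo) {
    // reportError is the hypothetical helper used in earlier sketches.
    reportError({
      type: 'react-error',
      message: error.message,
      stack: error.stack,
      componentStack: errorInfo.componentStack,
    });
  }

  render() {
    return this.state.hasError ? null : this.props.children;
  }
}

// Business usage: <ErrorBoundary><App /></ErrorBoundary>
```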

3. Environment collection

The five screenshots above come from the visual error-details page, where we show an error's summary, proportion, features, location, and SDK version. This helps us better understand the circumstances under which an error occurred. We categorize this environment information as: business information, device information, network information, and SDK information.

3.1 Principles of Environment Information Collection

To obtain environment information, we first check whether actively reported environment info is present; if so, we use it. If not, we check whether environment info reported through the hybrid bridge (captured using client capabilities) matches; if so, we use what the client collected. If not, we fall back to environment info parsed from the UA (UserAgent). Those are our three ways of collecting environment information.

Mapping this back to the earlier classification: business information comes from active reporting, client-capability reporting, and the UA; device information from client capability and the UA; network information from client capability and the UA. Finally, the SDK version is read directly inside the SDK via require('./package.json').version.
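As a sketch, the priority chain might look like this (function and field names are assumptions):

```js
// Three-tier fallback for environment info, highest priority first.
// parseUserAgent is a hypothetical UA parser.
function resolveEnvironment(report) {
  if (report.manualEnv) return report.manualEnv; // 1. actively reported
  if (report.hybridEnv) return report.hybridEnv; // 2. captured via the client bridge
  return parseUserAgent(report.ua);              // 3. parsed from the UserAgent
}
```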

4. Why behavior collection?

The error message and environment information already let us locate errors, so why collect behavior too? Take a look at the screenshot below (from our error monitoring platform):





From the screenshot above we can clearly trace the chain that led to the "'Nick' of undefined" error: from the browser sending a request, to the user clicking, to console output, to the error itself being displayed. We can fully reproduce how an error came about. That's why we collect behavior!

4.1 Behavior collection and classification

We simply categorize behavior as user behavior, browser behavior, and console print behavior.

Among them, user behavior includes the familiar click, scroll, focus/blur, long-press, and so on; browser behavior includes sending requests, navigating, going forward/back, closing, opening new windows, etc.; console behavior covers console.log/error/warn.

4.2 Explanation of behavior collection mechanism

Next, let's walk through these behaviors case by case.

Case 1: Click behavior (user behavior)

Use addEventListener to listen for click events globally, collecting the user action (click, input) and the DOM element name.

When an error occurs, the error and the behavior trail are reported together.
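A sketch of the click collector (the queue shape and its 20-entry cap are assumptions):

```js
const behaviorQueue = []; // shared trail attached to error reports

window.addEventListener('click', (event) => {
  const target = event.target;
  behaviorQueue.push({
    type: 'user-click',
    tag: target.tagName.toLowerCase(),
    // enough identity to find the element later, without keeping the DOM node
    selector: target.id ? `#${target.id}` : String(target.className),
    time: Date.now(),
  });
  if (behaviorQueue.length > 20) behaviorQueue.shift(); // keep the trail bounded
}, true); // capture phase, so stopPropagation can't hide clicks from us
```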

Case 2: Request behavior (browser behavior)

Listen for the XMLHttpRequest object's onreadystatechange callback and collect data when the callback executes.

If you're using Axios, the underlying XMLHttpRequest is still doing the work in the browser, so those requests are captured automatically; note, though, that the native fetch API does not go through XMLHttpRequest and needs a similar wrapper of its own.
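A sketch of the XHR hijack (reusing the behaviorQueue from the click sketch):

```js
const originalOpen = XMLHttpRequest.prototype.open;

XMLHttpRequest.prototype.open = function (method, url, ...rest) {
  // readystatechange fires several times; readyState 4 means "done".
  this.addEventListener('readystatechange', () => {
    if (this.readyState === 4) {
      behaviorQueue.push({
        type: 'xhr',
        method,
        url,
        status: this.status,
        time: Date.now(),
      });
    }
  });
  return originalOpen.call(this, method, url, ...rest);
};
```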

Case 3: Page navigation (browser behavior)

Listen for window.onpopstate, which fires when the page navigates, and collect information when it triggers.
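A sketch; the history.pushState hijack is an assumption added here because SPA routers navigate via pushState, which does not emit popstate:

```js
function recordNavigation() {
  behaviorQueue.push({ type: 'navigation', url: location.href, time: Date.now() });
}

// Back/forward buttons fire popstate.
window.addEventListener('popstate', recordNavigation);

// SPA route changes (assumed hijack, AOP style).
const originalPushState = history.pushState;
history.pushState = function (...args) {
  const result = originalPushState.apply(this, args);
  recordNavigation();
  return result;
};
```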

Again, the AOP pattern is used. You'll find that AOP, which we seldom use in day-to-day business code, shows up again and again in the SDK! If you're interested, go read up on it ~🙂

Case 4: Console output (console behavior)

Here we rewrite the console object's info, warn, and error methods, collecting information whenever the console executes.
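A sketch:

```js
['info', 'warn', 'error'].forEach((level) => {
  const original = console[level];
  console[level] = function (...args) {
    behaviorQueue.push({
      type: `console-${level}`,
      message: args.map((a) => String(a)).join(' '),
      time: Date.now(),
    });
    original.apply(console, args); // keep normal console output
  };
});
```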



5. Data reporting

At this point we have the error, environment, and behavior information, and we need to report it. We report it with a GET request for a 1×1 GIF image.

As shown in the screenshot above, we reported the error message by requesting an image named n4.gif.

Why use a 1×1 GIF?

The reasons: 1) no cross-domain problems; 2) after the GET request is sent there's no data to fetch or process, and the server doesn't need to send any back; 3) it doesn't carry cookies for the current domain; 4) compared with BMP/PNG it's the smallest image, saving roughly 41%/35% in network resources.
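A sketch of the GIF reporter (the endpoint host and parameter name are assumptions; n4.gif matches the screenshot above):

```js
function gifReport(payload) {
  const query = encodeURIComponent(JSON.stringify(payload));
  const img = new Image(1, 1);
  // Hold a reference until the request settles so GC can't cancel it.
  gifReport._pending = img;
  img.onload = img.onerror = () => { gifReport._pending = null; };
  img.src = `https://log.example.com/n4.gif?data=${query}`; // hypothetical host
}
```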

6. SDK summary

Let’s summarize the SDK in one sentence:

Listen to or hijack native methods to obtain the data that needs reporting, and when an error occurs, trigger the reporting function, which reports via GIF.

To make it easy to remember, three keywords: hijack, native methods, GIF! (If you still can't remember them, don't hit me.)

II. Data Cleaning

Let's start by asking why we can't simply use the data the SDK reports as-is, without cleaning it.



The raw data reported by the SDK has the following characteristics:

  • Huge data volume: payloads are routinely several MB or a dozen MB, and we've seen tens of MB
  • Unsorted and unaggregated: errors of the same type differ only along the time dimension; there's no need to store every one of them
  • Invalid data not filtered out: too much useless information, which is bad for aggregation and a burden on the server

1. Comparison of storage media

So how do we clean the data the SDK reports? First we need somewhere to put it, so how do we store it? Let's look at a few common storage solutions on the market.

The first is MySQL, which everyone knows and uses in daily business. MySQL supports secondary indexes and transactions, but it isn't good at full-text search, so it fits core online business best. The second is HBase, now ten years old and a mature project. HBase is distributed by nature; although it doesn't support full-text search, secondary indexes, or transactions, it does support online scaling, making it ideal for applications with unpredictable growth and heavy write volumes. The third is ES (Elasticsearch), a very popular open-source distributed search and analytics engine in recent years: with simple deployment you get log analysis, full-text search, structured data analysis, and more. ES also has statistical capability, supports secondary indexing, and is distributed by nature. Compared with other high-profile database products, ES had humble beginnings: its creator was an out-of-work programmer who built it so his wife could search her recipes. Elastic has since raised hundreds of millions of dollars in funding, and that once-struggling programmer became its CEO.

Comparing these storage options, we settled on MySQL as the persistent store that feeds the visualization, and ES as the temporary store, within which this cleaning step operates.

2. Cleaning process

2.1 Cleaning Process

We divided the cleaning process into the following three steps: data acquisition, data preprocessing, and data aggregation.



2.2 Obtaining Data

Fetching data from ES is very simple. Under the hood, ES is a search server based on Lucene that provides a distributed, multi-tenant full-text search engine over a RESTful web interface, so our front-end developers just call it like any ordinary business interface.

  1. Fetch the past minute's errors from ES with a GET request (a sketch follows this list)
  2. Set thresholds (a peak-shaving mechanism). So that the server "doesn't bear pressure it shouldn't have to bear" when errors flood in, we apply two peak-shaving measures:
  • Data acquisition is capped at 10,000 records per minute; beyond 10,000, the data is sampled before being stored
  • If more than 200 errors of the same type occur, only their count is recorded
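Here's a sketch of step 1 using the official Elasticsearch Node client (v7-style API; the index name and timestamp field are assumptions):

```js
const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function fetchRecentErrors() {
  const { body } = await client.search({
    index: 'sdk-raw-errors', // hypothetical index name
    size: 10000,             // the 10k-per-minute acquisition cap
    body: {
      query: {
        range: { timestamp: { gte: 'now-1m' } }, // errors from the past minute
      },
    },
  });
  return body.hits.hits.map((hit) => hit._source);
}
```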

2.3 Data Preprocessing

As shown on the right of the PPT screenshot above, this is an error-content field fetched from ES. Since the data in ES is stored as strings (and the content is escaped), we need to JSON.parse() it. Sometimes objects aren't fully stringified, in which case we extract just the fields we need. We also strip the useless parts of the raw data to reduce storage volume.

The screenshot above shows the data as finally stored in MySQL. You can see it's fairly clear, with no redundant or unintelligible content.
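A minimal sketch of that preprocessing step (field names are assumptions):

```js
function preprocess(raw) {
  let parsed;
  try {
    parsed = typeof raw === 'string' ? JSON.parse(raw) : raw;
  } catch (e) {
    // Not fully stringified / oddly escaped: keep what we can.
    parsed = { message: String(raw) };
  }
  // Keep only the fields we need; drop the rest to shrink storage.
  const { app, type, message, stack, time } = parsed;
  return { app, type, message, stack, time };
}
```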

2.4 Data Aggregation

Let's think about why we aggregate data at all. Two reasons: 1) storage performance: store less; 2) query performance: query faster.

So how do we aggregate certain types of errors?

We do it along three dimensions: 1) business name; 2) error type; 3) error message.

Take "SyntaxError: The string did not match the expected pattern…" as an error message. We concatenate the three dimensions and run them through MD5 to get a digest like ecf9f6d430bea229473782dc63407673.
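A sketch of the fingerprint, using Node's built-in crypto module (field names are assumptions):

```js
const crypto = require('crypto');

// Concatenate the three dimensions and hash them into one digest.
function errorFingerprint({ appName, errorType, errorMessage }) {
  return crypto
    .createHash('md5')
    .update(`${appName}|${errorType}|${errorMessage}`)
    .digest('hex'); // e.g. "ecf9f6d430bea229473782dc63407673"
}
```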

When two errors produce the same digest, we identify them as the same type of error and store them as a single row in MySQL:

As the figure above shows, message_id is the aggregated digest, and event_count is the number of times this type of error has occurred. Finally, the visualization presents the data like this:

2.5 Monitoring the Cleaning Process

The cleaning process itself also needs watching. We check whether the data volume and time consumed per unit of time are normal, whether the amount of ignored data stays stable along the timeline, and whether the volume stays within the 10k-records-per-minute cap (the peak-shaving mechanism described earlier).

III. Monitoring

For the final part of web-side error monitoring, let's look at alerting. When errors occur, we need to inform developers within a short window so they can deal with online errors first, minimizing their impact. The alarm model for errors is simple:

When an error meets certain conditions, developers subscribed to the project are notified via DingTalk, email, phone call, SMS, or webhook. A condition can be read like this: an alarm fires when the error count is at least 100 per minute for two consecutive minutes. The error message then reaches the developer!
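A sketch of evaluating that rule (the rule shape is an assumption; the thresholds come from the example above):

```js
const rule = { threshold: 100, consecutiveMinutes: 2 };

// perMinuteCounts: error counts per minute for one project, newest last.
function shouldAlarm(perMinuteCounts) {
  const recent = perMinuteCounts.slice(-rule.consecutiveMinutes);
  return (
    recent.length === rule.consecutiveMinutes &&
    recent.every((count) => count >= rule.threshold)
  );
}

// shouldAlarm([80, 120, 150]) -> true: notify subscribers via DingTalk/SMS/...
// shouldAlarm([120, 90])      -> false
```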





As the figure above shows, alarms are divided into ordinary alarms and escalated alarms; escalated alarms add SMS and work-group notifications on top of the ordinary channels. I'm sure the developers watching carry their phones everywhere, even to the toilet. Am I right? [laughs] An alarm message has a title and a body: the title carries the error source, error level, and business name; the body describes the error, how many users it affected, and so on.



OK, at this point we've completed error monitoring for the web, realizing the minimal closed loop of multi-end error monitoring!



Now let's think about how to do error monitoring on the other ends. First, let's review how web error monitoring maps onto Skynet's overall architecture diagram:

Looking at the diagram above, which processes can be shared across ends, and which differ from end to end?

Yes, you guessed it: apart from error collection and reporting, which differ between ends, every other process can be shared! Taking that as our starting point, we can analyze how errors on other ends and platforms are collected and reported. Note the core idea here: differentiated collection, formatted (unified) reporting! Keep that sentence in mind through the explanations that follow!

IV. Implementing the SDK on Node

The rest will be covered in the next article. To be continued…
