Li Yijun is a front-end development engineer at JD.com.

Preface

Badjs is a popular umbrella term for front-end exceptions. It refers to exceptions thrown on the front end, such as “object not found”, “undefined”, “syntax error”, and so on.

The author is responsible for one of Jingxi's front-end businesses, which was plagued by a large number of exceptions for a long time, often with no identifiable cause. Sometimes the exception count would spike and then fall back again before we could locate the problem, which troubled us deeply. After a long period of accumulated experience, we summarized a set of conclusions and methods.

The first half of the article covers why to do this and how to collect and analyze badjs. It is suitable for readers who have had no systematic exposure to badjs.

The second half analyzes Script error, Hybrid, and the big data system in more depth, for your reference and discussion. Readers with some experience can skip straight to that part.

Hopefully it will inspire those of you reading this.

PS: This series of methods does not apply to Node.js.

Jingxi badjs

Take a look at this picture:

This is the badjs trend of an online business I am responsible for. I don't know how you feel when you see it, but it makes me very nervous: I have to rush to the computer to check the problem, analyze the logs, and report the cause to my boss…

Why we need such a system

As the saying goes, technology serves business. Our badjs logging system was born of necessity.

Take the product detail page of the WeChat mini program as an example: its PV is on the order of ten million.

Suppose something goes wrong on the front end: something stops responding and fewer users can get through. The end result is fewer orders and lost users, and it also threatens the livelihood of everyone in the department. Nobody can afford to take the blame for that.

Faced with this, ask yourself: if it were a page you maintain, would you be afraid? If it were a page you are about to publish, would your hands shake?

For programmers to live a happy life, such a system is necessary: it detects problems early and nips them in the bud.

Badjs principle and collection

There is no way to predict which pieces of code will go wrong, and the least costly solution is to process them in a centralized place and then collect them.

The origin of badjs

The essence of badjs is that the JS engine runs into logic it cannot execute and throws an exception.

Examples include operating on undefined, using the wrong data type, parsing exceptions, and syntax errors:
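A minimal sketch of the kinds of exceptions listed above. The helper name `errorNameOf` is made up for illustration; each function passed to it deliberately triggers one kind of error, and the helper returns the name of the error that was thrown.

```javascript
// Hypothetical helper: run a function and return the name of the error it throws.
function errorNameOf(fn) {
  try {
    fn();
    return null; // nothing was thrown
  } catch (e) {
    return e.name;
  }
}

// Operating on undefined
const onUndefined = errorNameOf(() => { const obj = undefined; return obj.style; });
// Using an identifier that was never defined
const notDefined = errorNameOf(() => missingVariable);
// Using the wrong data type (calling a number as if it were a function)
const wrongType = errorNameOf(() => { const n = 1; return n(); });
// Parsing exception
const badJson = errorNameOf(() => JSON.parse('{oops'));
// Syntax error in dynamically evaluated code
const badSyntax = errorNameOf(() => new Function('var = ;'));

console.log(onUndefined, notDefined, wrongType, badJson, badSyntax);
```

In a page, any of these left uncaught would reach the global error handling described in the next section.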

Badjs collection

When an exception occurs during JavaScript execution and is not caught, an error event implementing the ErrorEvent interface is fired at window, and window.onerror is executed.

We can handle exceptions globally in two ways:

window.addEventListener('error', function (errorEvent) {
    const { message, filename, lineno, colno, error } = errorEvent
    // ...
})

window.onerror = function (message, source, lineno, colno, error) {
    // ...
}

window.onerror can hold only one handler, while multiple listeners can be added for the error event.

Unfortunately, IE8 and earlier do not support ErrorEvent, so only window.onerror can be used there. Choose according to the needs of your business.

try...catch can also collect error information in specific scenarios, but this approach has a number of drawbacks, which we'll discuss later.

Through the error event on window, we can normally collect five pieces of information:

attribute | meaning | notes
message | The error message | Error description
filename (source) | The URL of the script where the error occurred | Named filename in ErrorEvent and source in onerror
lineno | The line number where the error occurred |
colno | The column number where the error occurred |
error | The Error object | error.stack is important information

From the console it looks like this:

So that gives us a pretty good amount of information to locate the problem.

The error message here is interesting: message carries an ‘Uncaught’ prefix, while error.stack does not:

Cross-domain scenarios:

The above is the ideal scenario. A real environment may involve cross-domain scripts, and in that case only information of little value can be collected.

This question will be discussed in depth in the second half of the article.

Tips:

Not every exception can be caught by the window error event.

Exceptions we raise directly from the console, requests intercepted by the browser, resource 404s, and so on do not trigger it.

Mini program badjs:

The above covers front-end error collection for HTML pages. Mini programs differ in some ways. Take the WeChat mini program as an example.

Global exception capture for a mini program can be subscribed to in App, the method that registers the mini program. See the official WeChat mini program documentation for details.

App({
  onLaunch (options) {
    // Do something initial when launch.
  },
  // ...
  onError (msg) {
    console.log(msg)
  }
})

What's different here is that the onError handler receives only one msg parameter, whose content is similar to error.stack.

The mini program scenario is more focused and its environment less complex, with no cross-domain script problems, so it is much simpler than H5. The rest of this article largely applies to both, so mini programs are not discussed separately again.

Badjs report

If something goes wrong, we have to rely on the information we reported to locate the problem, so decide in advance what information is needed. In addition, given the volume of data, reporting too much can easily overwhelm the server, so report only what is necessary.

Necessary data to be reported by the front end:

category | source | example
content | message + error.stack | ReferenceError: test is not defined at HTMLLIElement. (Wq.360buyimg.com/wecteam/bad…)
helpful business information | / | balabalabala

In addition to the stack information, some business information (such as the order number, user ID, etc.) may help with locating the problem. Choose what to include based on your business background.

In addition, some data is already contained in the request itself and can be collected and displayed on the server. The front end does not need to report it separately.

Data contained in the request:

category | source | example
ip | / | 58.20.191.9
time | / | May 20th 2020, 14:56:00.062
referer | Headers.Referer | Wqsou.jd.com/wecTeam?key…
UA | Headers.User-Agent | Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X Like Gecko) Mobile/15E148 MicroMessenger/7.0.12(0x17000C2D) NetType/4G Language/zh_CN
Cookie | Headers.Cookie | pin=wecTeam;

Data splicing:

The front end assembles the necessary data and gets ready to send it to the server. It looks something like this:

Wq.jd.com/wecteam/bad…

The data can be sent with a GET request. A simple way is to send it through an img:

var _img = new Image();
_img.src = url;

Note that URLs sent with GET are subject to a length limit (2048 characters is the figure used here), so clip the content accordingly. If you need to go beyond the limit, report with POST instead.
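The splicing and clipping steps can be sketched as follows. This is only an illustration under stated assumptions: the endpoint `https://example.com/badjs`, the parameter name `content`, and the helper names `buildReportUrl` and `report` are all hypothetical, not Jingxi's actual interface.

```javascript
// Hypothetical endpoint and parameter names, for illustration only.
const REPORT_URL = 'https://example.com/badjs';
const MAX_URL_LENGTH = 2048; // the length limit mentioned in the text

function buildReportUrl(content, extra) {
  const pairs = Object.keys(extra || {})
    .map((key) => encodeURIComponent(key) + '=' + encodeURIComponent(extra[key]));
  const prefix = REPORT_URL + '?' + pairs.concat('content=').join('&');
  // Split by code point so that clipping never cuts a surrogate pair in half.
  const chars = Array.from(String(content));
  let encoded = encodeURIComponent(chars.join(''));
  // Clip the raw content until the encoded URL fits into the limit.
  while (chars.length > 0 && prefix.length + encoded.length > MAX_URL_LENGTH) {
    chars.length = Math.max(0, chars.length - 16);
    encoded = encodeURIComponent(chars.join(''));
  }
  return prefix + encoded;
}

function report(content, extra) {
  const url = buildReportUrl(content, extra);
  if (typeof Image !== 'undefined') {
    new Image().src = url; // browser only: fire-and-forget GET request
  }
  return url;
}
```

Clipping the raw text before encoding keeps the URL decodable on the server; cutting the encoded string directly could leave a broken escape sequence at the end.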

Example: bj-report.js

You may ask: I understand the principle, but I don't want to do all this work myself. Is there something ready-made?

There is.

See the project's Git home page.

It can help you with front-end log reporting and JS exception monitoring. You can skip all the steps above by simply initializing it and passing in some parameters:

BJ_REPORT.init({
  id: 1,                                // Report ID. If no ID is specified, nothing is reported
  uin: 123,                             // User ID (by default read from the QQ uin)
  delay: 1000,                          // Milliseconds to wait while merging reports in the buffer (default)
  url: "//badjs2.qq.com/badjs",         // Report address
  ignore: [/Script error/i],            // Errors to ignore
  random: 1,                            // Sampling rate, between 0 and 1. 1 means 100% (default: 1)
  repeat: 5,                            // Maximum number of reports for the same error (further repeats are not reported)
                                        // Avoids excessive reporting of the same error by a single user
  onReport: function(id, errObj){},     // Callback on report. id: the report ID. errObj: the error object
  submit: null,                         // Overrides the default reporting method, e.g. to report via POST
  ext: {},                              // Extended attributes, processed by the backend. For example, if an msid exists, it is forwarded to monitor
  offlineLog : false,                   // Enable offline logging (default: false)
  offlineLogExp : 5,                    // Offline log validity period. Default: the latest 5 days
});

Data analysis

Having covered data reporting, let's talk about how to analyze the data once we have it.

Data extraction

The log system currently used by Jingxi is customized on top of Kibana, a professional third-party log analysis system:

Some of its functions are briefly annotated in the figure. The important thing is that it lets us find the information we want quickly and accurately.

If you want something simpler, you can build a basic report system yourself:

If you want to make things simpler still, you can just connect to the data server and export a document.

How to obtain the reported data is outside the scope of this article and will not be covered here. In short, at this point we have extracted the badjs logs.

Exception analysis

With the data in hand, the bittersweet work of analysis can begin.

Badjs can be divided into two types. One is caused by defects in code written by developers, commonly known as bugs. The other is triggered from outside: security scans, bots, crawlers, browser plug-ins, and embedded third-party scripts can all trigger exceptions that would otherwise not exist.

We need to pay special attention to the former, identify the latter, and minimize the latter's impact on the data.

Content information:

The content information consists of message and error.stack, which describe the error and its stack.

message complements error.stack. A complete error.stack contains the call stack of the failing code, the file, and the line and column numbers. From this information we can basically determine where the error occurred and what triggered it.

For example, here's some stack information (in Chrome):

Let's look at it line by line:

  1. Uncaught TypeError: Cannot read property 'style' of undefined

The style property cannot be read from undefined. That is, some property of an object is undefined, and the code then tries to access style on it, so an error is reported.

  2. at removeAD (badjs.js:4)

The failing code is on line 4 of the badjs.js file and belongs to a method named removeAD.

  3. at window.fireBadjs (badjs.js:8)

The code that calls removeAD is on line 8 of badjs.js and belongs to a method named window.fireBadjs.

  4. at badjs.html:16

The code that calls window.fireBadjs is on line 16 of the badjs.html file.

With information this detailed, a look at the source code gives you a clear idea of the cause of the badjs.
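The line-by-line reading above can be sketched in code. This is a minimal parser for Chrome-style frames only; the regular expression and the helper name `parseStack` are assumptions made for illustration, and other engines format their stacks differently.

```javascript
// Matches Chrome-style frames: "at fn (file:line[:col])" or "at file:line[:col]".
const FRAME_RE = /^\s*at (?:(.+?) \()?(.+?):(\d+)(?::(\d+))?\)?\s*$/;

// Hypothetical helper: turn a stack string into a list of frames.
function parseStack(stack) {
  return String(stack).split('\n')
    .map((line) => FRAME_RE.exec(line))
    .filter(Boolean) // the first line (the error description) is skipped
    .map((m) => ({
      fn: m[1] || '<anonymous>',
      file: m[2],
      line: Number(m[3]),
      col: m[4] ? Number(m[4]) : null,
    }));
}

// The stack discussed above
const frames = parseStack([
  "Uncaught TypeError: Cannot read property 'style' of undefined",
  '    at removeAD (badjs.js:4)',
  '    at window.fireBadjs (badjs.js:8)',
  '    at badjs.html:16',
].join('\n'));

console.log(frames);
```

The first frame is where the exception was thrown; each following frame is its caller.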

There is also a shorter kind of stack information:

The exception occurs in a strange place: line 1, column 1, fired in anonymous. Going back to the code, I found that this method does not exist at line 1, column 1 at all.

This is code executed in an anonymous context in the browser, similar to code typed directly into the console or run through functions such as eval.

In fact, the background of the badjs in the image above is that the App's native code directly executes a callback in the window scope (i.e. GetNetWrokCallback) in the Webview. But the function doesn't exist, so an exception occurs.

With that in mind, let's look at another error.

Judging by the error content, this is similar to the example above, only without the anonymous information. But there is no SOHUZ attribute in our code, so our first guess is that the exception is caused by JS code actively executed by some App.

Further analysis needs to be combined with the UA information. The UA contains the field kuaizhanAndroidWrapper. A web search shows that the UA of the Kuaizhan (“Express Station”) App contains this field.

So the conclusion is: the Kuaizhan App visited this page and accessed SOHUZ directly without a null check, which led to the badjs reported in our business. It's not our bug. Case closed.

UA:

The end of the previous section already previewed what UA information is good for.

We can infer the environment from the UA: for example, a certain WeChat version, a certain App, a certain browser, etc. We can also infer who is visiting our page: for example the Baidu crawler, XX crawler, Ali Baichuan, etc.

The UA can serve as a crude user fingerprint. Jingxi's pages are often hit by automated traffic from unknown sources, every day and right on schedule. Because the source is uncertain, we can only jokingly call them “brushes” (bots). Sometimes we get a lot of weird errors that all share the same UA; in that case we can basically assume that a bot is hammering our page.

One thing to emphasize is that the UA can be faked. Some experienced bots use a different UA for each request, which is very confusing. Therefore, whether UA information is genuine needs to be judged case by case.
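A rough sketch of how UA strings might be labeled and grouped during analysis. The rule table and the helper names `labelUA` and `countByUA` are hypothetical; only the UA markers themselves (MicroMessenger, kuaizhanAndroidWrapper, Baiduspider) come from real UA strings, and since a UA can be faked the labels are hints rather than proof.

```javascript
// Hypothetical rule table; extend it with the markers seen in your own logs.
const UA_RULES = [
  { label: 'WeChat', pattern: /MicroMessenger/i },
  { label: 'Kuaizhan App', pattern: /kuaizhanAndroidWrapper/i },
  { label: 'Baidu crawler', pattern: /Baiduspider/i },
];

// Infer the operating system and the known environments from a UA string.
function labelUA(ua) {
  const labels = UA_RULES.filter((r) => r.pattern.test(ua)).map((r) => r.label);
  const os = /iPhone|iPad|iPod/.test(ua) ? 'iOS' : /Android/i.test(ua) ? 'Android' : 'other';
  return { os, labels };
}

// Count log records per UA to spot many errors sharing one "fingerprint".
function countByUA(records) {
  const counts = {};
  for (const record of records) {
    counts[record.ua] = (counts[record.ua] || 0) + 1;
  }
  return counts;
}

const sampleUA = 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) ' +
  'Mobile/15E148 MicroMessenger/7.0.12(0x17000C2D) NetType/4G Language/zh_CN';
console.log(labelUA(sampleUA));
```

A UA whose count dwarfs all the others in `countByUA` is a candidate for the bot traffic described above.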

Script error:

In fact, most of the badjs log entries on Jingxi's H5 business line show nothing but ‘Script error’.

A Script error occurs because cross-domain scripts are introduced. For example, if a badjs occurs in a JS file from a different domain, the message in the onerror event will only be ‘Script error’, and the error object will be null.

For Script errors that have already been reported, the available data is very limited, and most of the time you have to give up on them.

However, the problem is not unsolvable: we can “open up” the Script error information and report it, for example by setting a trust policy for cross-domain scripts.

The detailed strategy for dealing with Script errors is discussed in the following section.

Script error

This section goes a bit further and discusses how to dig valid information out of a Script error.

Source

Script error is essentially a cross-domain security policy of the browser, which protects the content of code that is not from the same domain.

For every cross-domain script introduced through a script tag, if an exception occurs, the error event on window will only get ‘Script error’.

Solution

There are solutions to the ‘Script error’ caused by cross-domain scripts introduced through the script tag.

The idea is this: it is a security issue, so as long as both sides declare that they trust each other, the browser will let the information through.

The specific implementation is as follows:

1. The response header adds Access-Control-Allow-Origin to indicate the trusted domain names

2. The script tag that makes the request adds the crossorigin attribute

crossorigin has two values

  • anonymous
  • use-credentials

When crossorigin="" or any other invalid value is used, the effect is the same as setting anonymous. anonymous relies on the Access-Control-Allow-Origin response header. Note that with anonymous, the request does not carry user information such as cookies. In contrast, use-credentials can carry user information.

use-credentials must be used together with the Access-Control-Allow-Credentials response header. When Access-Control-Allow-Credentials is true, the credentialed request passes the check, and the browser allows the script to run as a trusted source. In this case, Access-Control-Allow-Origin cannot be set to * and must be a specific origin.

With these two steps in place, the error information in the window error event will no longer be masked by the browser as ‘Script error’.
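The server side of the two steps can be sketched as a small helper that picks the response headers for a script request. The helper name `corsHeaders`, its parameters, and the example domains are assumptions for illustration; how the headers are actually attached depends on your server or CDN.

```javascript
// Hypothetical helper: choose CORS response headers for a script request.
function corsHeaders(requestOrigin, trustedOrigins, withCredentials) {
  if (!trustedOrigins.includes(requestOrigin)) {
    return {}; // untrusted origin: no CORS headers, errors stay masked
  }
  const headers = {
    // Echo the specific origin; with credentials, "*" is not allowed.
    'Access-Control-Allow-Origin': requestOrigin,
    // The response differs per origin, so caches must key on it.
    'Vary': 'Origin',
  };
  if (withCredentials) {
    headers['Access-Control-Allow-Credentials'] = 'true';
  }
  return headers;
}

// Matching tags on the page side:
//   <script src="https://cdn.example.com/app.js" crossorigin="anonymous"></script>
//   <script src="https://api.example.com/data.js" crossorigin="use-credentials"></script>
const trusted = ['https://wq.example.com'];
console.log(corsHeaders('https://wq.example.com', trusted, true));
```

Both sides have to opt in: the headers alone, or the crossorigin attribute alone, will not unmask the error.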

A variant scenario: JSONP

Here's a variant of the script tag: JSONP.

JSONP exists to solve the problem of cross-domain interface requests, so most of its usage scenarios are inherently cross-domain. In addition, it runs through a script tag, just like a JS script.

Therefore, when an exception occurs, it shows up as a Script error.

A common scenario for JSONP exceptions is that the callback is undefined.

Although the specific error message is visible on the console, the error that gets caught is only Script error, because the URL is cross-domain and no additional trust has been set.

The solution is to unmask the Script error by following the two steps listed in the normal solution above.

The catch here is that most interfaces rely on user information, so the front end needs to use crossorigin=’use-credentials’ to send cookies with the request. The backend must therefore return the specific origin in Access-Control-Allow-Origin and set Access-Control-Allow-Credentials to true for the browser check to pass.

One more note: the vast majority of scripts generated by JSONP are synchronous code. Asynchronous code in cross-domain scripts has a few pitfalls, which are covered later.

Special solution

Using crossorigin is the normal solution. A special one is to use try...catch.

A simple example:

// https://xxx.badjs.js
window.fireBadjs = function () {
    ops // undefined identifier: throws a ReferenceError
}

<script src="xxx.badjs.js" ></script>
<script>
    window.addEventListener('error', function (errorEvent) {
        const { message, filename, lineno, colno, error } = errorEvent
        debugger
    })
    fireBadjs()
    try {
        fireBadjs()
    } catch (error) {
        debugger
    }
</script>

badjs.js attaches a function called fireBadjs to window; the function runs a line of code that causes an exception.

After the first fireBadjs() runs, an error event is fired with the following content:

As expected, the error event can only catch Script error.

Now comment out the first fireBadjs() so that the fireBadjs() inside try...catch runs and the badjs is caught by catch. On the console, it reads:

Miraculously, the content that was originally a Script error has been dug up.

But this approach is ultimately inelegant and intrusive to the code. It also has a small flaw: it does not fire an error event on window.

Now let's talk about how to improve it.

Enhancing try...catch

try...catch is a tool that people both love and hate. We often wrap our code in it to make it more robust.

But if an error occurs inside try, the browser does not print the error to the console and does not fire an error event.

Let's look at a simple piece of code:

try {
    JSON.stringify(apiData)
} catch (e) {}

If JSON.stringify(apiData) throws, the code may keep running, but it doesn't behave as expected and we don't know there's a problem. We may take a long detour before tracing it back here.

The problem is that the exception was caught without anyone being told. The solution is simple: add the notification manually.

In this case, we can print the error with console.error in the catch block, and manually wrap it in an ErrorEvent to fire the error event on window.

try {
    JSON.stringify(apiData)
} catch (error) {
    console.error(error)
    if (typeof ErrorEvent === 'function') {
        // Dispatching the event also triggers window.onerror
        window.dispatchEvent(new ErrorEvent('error', { error, message: error.message }))
    } else {
        window.onerror && window.onerror(null, null, null, null, error)
    }
}

It is important to note that a disadvantage of try...catch is that it cannot catch exceptions in asynchronous code, such as setTimeout, promise, and event callbacks. Because of this, we can't simply wrap the code in one try...catch from start to finish and expect it to solve every problem.
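One way around the asynchronous limitation is to wrap the callback itself rather than the code that schedules it. The sketch below is an illustration, not the article's implementation; `guard` and its `onError` parameter are hypothetical names, and in a page `onError` could dispatch the error event as shown above.

```javascript
// Hypothetical helper: wrap a callback so that errors thrown inside it
// are handed to `onError` instead of escaping unreported.
function guard(fn, onError) {
  return function guarded(...args) {
    try {
      return fn.apply(this, args);
    } catch (error) {
      onError(error);
    }
  };
}

const caught = [];
const risky = guard(
  () => { throw new TypeError('boom'); },
  (error) => caught.push(error.message)
);

// A try...catch around setTimeout itself would not see the error, because
// the callback runs later; guarding the callback does.
setTimeout(risky, 0);
```

Wrapping by hand is tedious, so the same idea is usually applied automatically, for example by a build step.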

Hybrid

Developers often make changes and upgrades in their ivory towers, but real production environments, such as Hybrid, are often more complex than expected.

This is Hybrid in the broad sense, so in addition to native Apps, browsers count too. In fact, H5 pages really live in Hybrid environments. Here we study how H5 exceptions behave inside mobile Apps.

Webview

Many Apps have built-in Webviews that run H5 pages. If a badjs occurs in H5 inside a Webview, how will the Webview behave?

There are two environments, iOS and Android.

1. iOS (system versions tested: 9.0.2 / 11.0.3 / 13.4)

In an iOS Webview, if a badjs occurs in the asynchronous code of a cross-domain script, the error event on window will only catch ‘Script error’, regardless of whether the cross-domain header and crossorigin are set in the usual way.

You read that right: this applies to any Webview on iOS, whether it is your own business App, WeChat, or even the browser.

For example: an H5 page is opened in WeChat on an iPhone, and the page references a (cross-domain) JS file on a CDN. When an event handler inside that JS file produces a badjs and it is reported, all we see in the log system is ‘Script error’.

2. Android

The Webview on Android behaves the same as the Web. With the cross-domain header and crossorigin set, if a badjs occurs in a cross-domain script, the error details can be caught in the window error event for both asynchronous and synchronous code.

Take some online data from Jingxi H5 as a simple verification:

For a business whose traffic comes overwhelmingly from WeChat and mobile QQ, there were 8,043 Script errors in the Android environment

and 25,685 Script errors on iOS.

By the way, there were only 393 non-Script-error entries on iOS.

The solutions below are mainly for iOS, because Android behaves the same as the Web.

1. Change the cross-domain script to a same-domain script.

When a same-domain script fails, Script error does not appear.

2. Wrap with try...catch

Wrap the asynchronous methods in cross-domain scripts with try...catch and fire the event manually in catch. Wrapping by hand is tedious, so consider having a build tool do it automatically at packaging time.

3. Use Android badjs data as a reference for iOS

Because the probability of exceptions in the two environments is roughly equal, once the cross-domain setup is done, focusing on fixing the badjs seen on Android will cover most of the exceptions on iOS.

Native code injection

Unlike the previous section, where H5 simply ran on its own inside the App, here the App actively intervenes.

App native code can inject code into a Webview in two ways. The first is to turn native-side logic into JS code and hand it to the Webview to execute through the JS engine. The second is C++ code written directly into the lower layer of the engine, which becomes native code.

The first kind actually executes in the Webview environment, just like ordinary JS code. The callbacks commonly found in a JSSDK are an example.

If the callback does not exist and an exception is thrown, Android and the Web behave the same.

But iOS only shows Script error.

If you see an error in the log at line 1, column 1 of anonymous, and the UA is an App environment, it is most likely a badjs caused by the App actively calling into the page:

There is not much to say about Android; on iOS the content shows Script error. Since the code is generated by the App and sent to the Webview, there is nothing the front end can do to unmask the Script error. The front end can only work with the App developers to add a layer of protection, such as checking whether the function is defined before calling it.
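The protection layer mentioned above can be sketched as the JS string that native code hands to the Webview. Everything here is hypothetical: the helper name `buildSafeCall`, the callback name, and the payload are illustrations, and a real App would build the string in its own native language.

```javascript
// Hypothetical: build the JS snippet that native code asks the Webview to run.
function buildSafeCall(callbackName, payload) {
  const name = JSON.stringify(callbackName);
  const args = JSON.stringify(payload);
  // Call the callback only if the page actually defined it as a function.
  return 'typeof window[' + name + '] === "function" && window[' + name + '](' + args + ');';
}

console.log(buildSafeCall('onNetworkChange', { type: '4g' }));
```

With the typeof check in place, a page that never defined the callback simply ignores the call instead of producing a badjs.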

The second way, C++ code written directly into the lower layer of the engine, is generally used to expose interfaces that let H5 use native App capabilities, such as a JSSDK. Code produced this way becomes native code.

Example of native code: console.log

Since it runs inside the engine, an error there may bring down the whole Webview. Front-end badjs events do not apply to this case, so it can be ignored.

iOS stack differences

The error stack on iOS actually differs in some details, which are described here.

Test devices:

  • iPhone XS Max, iOS 13.4
  • Honor V20 magicUI_3.0.0 android_10.0.0.194

Test code:

<script>
    window.addEventListener('error', function (errorEvent) {
        confirm(errorEvent.error && errorEvent.error.stack)
    })
    nextLineBadjs // Triggers a same-domain badjs
</script>
<!-- Script served with cross-domain headers set -->
<script src="//wq.360buyimg.com/badjs.js" crossorigin="anonymous"></script>

// badjs.js
setTimeout(() => {
    asyncBadjs // Asynchronous code
}, 1000);

syncBadjs // Synchronous code

WeChat Webview:

iOS:

Android:

iOS does not show the specific error code in the stack; it refers to it as ‘global code’. The asynchronous code of the cross-domain script shows Script error.

Everything is normal on Android.

‘global code’ on iOS makes locating the error slightly harder, but since the location is still given, it is not a big problem. Note that this is the information in error.stack; the details behind ‘global code’ are actually contained in message.

We print it with confirm(errorEvent.message):

Therefore, you can report errorEvent.message together with the stack.
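A small sketch of combining the two fields when building the reported content. The helper name `buildContent` is hypothetical, and the deduplication rule (skip message when the stack already starts with the same description, as Chrome-style stacks do) is an assumption rather than the article's method.

```javascript
// Hypothetical helper: build the reported content from an ErrorEvent-like object.
function buildContent(errorEvent) {
  const message = errorEvent.message || '';
  const stack = (errorEvent.error && errorEvent.error.stack) || '';
  if (!stack) {
    return message; // e.g. Script error: there is no error object at all
  }
  // message carries an "Uncaught" prefix that error.stack does not have.
  const bare = message.replace(/^Uncaught\s+/, '');
  // If the stack already contains the description, do not send it twice.
  return stack.indexOf(bare) !== -1 ? stack : message + '\n' + stack;
}

// iOS-like event: the stack only says "global code", the details are in message
console.log(buildContent({
  message: "ReferenceError: Can't find variable: syncBadjs",
  error: { stack: 'global code@https://wq.360buyimg.com/badjs.js:6:10' },
}));
```

For a Chrome-style event the stack is returned unchanged, so the report does not grow for no reason.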

Other Webviews:

I also tested other mobile Apps and browsers, including the JD.com App, Baidu browser, Chrome, QQ browser, and Firefox.

The results are similar, with no essential differences, so I won't show them here.

Backend and big data

For most peers, the biggest obstacle to adopting a badjs logging system is building the backend. In an industry-standard, Logstash-style log pipeline, the front end can only handle the first step: collection. Storage and display are out of its hands and need the involvement of backend colleagues.

Here is some background on Jingxi's setup for your reference.

Data volume

In addition to badjs, our log data also includes business log information proactively reported by the front end. Because the department has many businesses and the reporting process is fairly standardized, the reported data keeps growing as the number of businesses grows. There are now about 2.4 billion records (2.5 terabytes) of data per day. With business reports filtered out, pure badjs accounts for about 0.67%.

Since we have big data machine resources, this data runs on our own machines at no extra cost.

Performance optimization

Although machine resources are fairly sufficient, they cannot support unrestrained use. We have some performance optimizations to keep the logging service stable.

  • Data is cleared every five days
    • Data is cleaned regularly to keep storage space in a healthy cycle
    • Expired logs cannot be queried
  • Reporting degradation plan
    • The front end reads a configuration; when data surges, the sampling rate is adjusted automatically and a certain proportion of reports is dropped
    • The backend has a similar configuration; when the data volume is too high, some data is dropped automatically
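The front-end half of the degradation plan can be sketched as a sampling check before each report. The function name `shouldReport` and the idea of passing the random number in (to make the check testable) are assumptions for illustration; the rate itself would come from a remotely controlled configuration.

```javascript
// Hypothetical front-end sampling switch.
// `rate` is between 0 and 1 and would come from a remote configuration,
// so it can be lowered when the volume of reports surges.
function shouldReport(rate, random) {
  const r = typeof random === 'number' ? random : Math.random();
  return r < rate; // rate 1 reports everything, rate 0 reports nothing
}

// Usage: skip the report when the sample check fails.
const currentRate = 0.2; // assumed value delivered by configuration
if (shouldReport(currentRate)) {
  // send the report here
}
```

When sampling is active, remember to scale the counts back up by the rate when reading the trend, otherwise a degradation looks like a drop in errors.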

First version of the log system: WebMonitor

The interface of our initial logging system was self-built and used URLs as keywords to query badjs.

Log data is stored in HDFS. When a query is initiated, Impala does a brute-force scan of HDFS. This is relatively inefficient: one query takes about half a minute.

There were no further filters to choose from, and the presentation was crude.

Second version of the log system: ElasticSearch+Kibana

In the second version, our big data colleagues brought in ElasticSearch, indexed the log files, and connected them seamlessly through Kibana. Thanks to the index, queries are fast; with Kibana, the barrier to querying is lower and the data can be analyzed along more dimensions.

With faster queries, the efficiency of locating problems quickly and responding to superiors has taken a qualitative leap.

We can search by keyword, filter out what is useless, or select the information we are interested in.

We can also conveniently select an interval in a data set to quickly analyze problems and propose solutions.

I've never felt so happy as a developer.

Summary

How to collect badjs

  • Through the error event (ErrorEvent) or window.onerror
  • In special circumstances, try...catch can be used

Badjs reporting

  • Concatenate message and error.stack into the URL
  • Use img to send the data, and watch the length

Badjs analysis

  • Trace back through the code manually using the stack information
  • Use the UA for a rough judgment of the environment and the visitor's identity
  • Script error entries carry little value and can be skipped during analysis, but try to unmask them at reporting time.

Common scenarios that result in a Script error

  • An exception occurs in a cross-domain script for which crossorigin is not set
  • With JSONP, crossorigin is not set and an exception occurs in the JS generated for the script tag
  • On iOS, an exception occurs in the asynchronous code of a cross-domain script
  • On iOS, an exception occurs when native code actively executes JS code

Notes on asynchronous code

  • On iOS, Script error in cross-domain asynchronous code cannot be resolved
  • try...catch cannot catch errors in asynchronous code
  • JSONP is mostly synchronous code

Necessity

Finally, let me return to the necessity of such a system.

For Jingxi's business, such a system is a must. Stability matters so much that we simply cannot afford to leave an online problem unhandled for a long time.

Here's a look at the pros and cons.

Advantages

Not knowing whether your code is running correctly is very unsettling. If customers and users then come complaining, it becomes a real headache.

Most front-end business code is tested. If the online code has passed sufficient testing, its business logic is unlikely to fail, so the focus shifts to whether the code is throwing errors.

Such a system greatly increases the confidence of front-end developers. After a release, if there is a problem, you can immediately roll back the code and locate the problem. Most importantly, it gives the boss a sense of security.

Disadvantages

However, such a system is not a panacea, and there are many other dimensions it cannot measure. If you depend on it too much, don't test your business thoroughly, and just glance at the badjs log when you go live, there may be plenty of big problems waiting for you to discover.

The resource cost is also an issue that has to be raised. If the company does not have enough resources of its own, it will need to spend extra money on rented servers and storage.

How to choose

Some readers may say: “Our business doesn't have anything like this, and it depends on so many resources. I can't make the call.”

Many growing businesses may not have the energy to build this infrastructure yet, but I believe that at some point in the future you will need this data.

Ask yourself how many online incidents it could prevent.

Come back to this article when you need it.

If the cost of an incident is greater than the cost of building and maintaining the system, I'm sure you can convince your team and your boss.
