About the author: Yang Ting, a member of the Meituan ordering terminal team.

Monitoring platform

The importance of monitoring hardly needs stating: it improves troubleshooting capability and safeguards service quality. So what exactly does monitoring do? In short: report errors promptly, collect useful information, and provide a basis for troubleshooting.

  • Timely error reporting: Most programmers have had this experience: when an online problem occurs, it reaches the developer through operations or product staff. That relay can take several minutes or even tens of minutes, which may directly cause economic loss for the company. With a monitoring system, an alarm is raised and the developer is notified as soon as the problem appears online, so the fix can start immediately and the loss is kept to a minimum.
  • Collecting useful information: Especially in the mobile era, locating a problem requires a lot of user information (such as the user’s phone version, network status, and operation flow). Without monitoring data we can only guess, or go back and forth with product and operations staff, or even the affected users, to pin the problem down, which inevitably takes a lot of time. If the monitoring system records device information, the scene at the time of the error, and the user’s operation flow, the problem can be located directly from that information, fixed in the shortest time, and its impact reduced.
  • Providing a basis for troubleshooting: The error information and other records reported by the front-end SDK provide a solid basis for troubleshooting and for safeguarding the service.

Monitoring categories

Our monitoring platform therefore needs to cover two kinds of monitoring, recording and capturing:

  • Recording monitoring
    • Page access records: Which pages the user visited
    • Resource loading record: What resources are loaded on the page
    • User behavior record: What the user does on the page. Currently we only record click behavior
    • Interface invocation records: Which interfaces are invoked on the page
  • Capture monitoring
    • DNS hijacking: Whether the page is hijacked
    • Resource loading errors: To catch cross-domain JS errors, add the crossorigin attribute to the corresponding resource tag
    • Page errors: Errors that occur during page rendering
    • Internal logic errors: Errors that occur during specific user operations and are located through the recorded user behavior
    • Interface errors: Failed interface calls

Scene restoration

A very useful feature of the monitoring panel is scene restoration, which displays all the records and errors processed by the support system in chronological order. From the scene restoration list we can reconstruct everything that happened while a given user was browsing the page, and in what order, to determine when and in what context the problem occurred.

Assume the following scenario:

PM: BD reports that some users can’t get the shopping cart to load!

RD: What? Let me try!

PM: In the same store, some users can and some can’t.

RD: Don’t worry. Give me the shopId and the accounts of the users who can’t open it, and I’ll take a look on the monitoring platform.

The monitoring platform showed that the affected users had entered the shopping cart from the menu details page, while normal users had not used this entrance. The jump from the menu details page to the shopping cart turned out to be broken, and it was fixed immediately.

In a scenario like this, where the user may have performed many actions, scene restoration can replay the entire action path of a specific user and everything that happened on the page, which helps reproduce the problem. In addition, some problems are caused by particular device models or environments, and restoring the scene also helps identify the environment in which the problem occurs.


The monitoring platform consists of three parts:

  • Monitoring front-end SDK: Collects client errors and related information and reports them
  • Monitoring Web layer support system: Processes the reported monitoring information
  • Monitoring panel: Shows the reported information in real time, making the monitoring data easy to use

This article focuses on our practical experience with the monitoring front-end SDK. There is still plenty of room for improvement, and criticism and suggestions are welcome.

Overall design


The front-end SDK can be divided into three parts: the data module, the data processing module, and the report module. The data module consists of the individual monitoring modules and the environment module:

  • Data module
    • Monitoring modules: Obtain specific content information to be reported (EventData or ErrorData)
      • DNS hijacking detection
      • Resource integrity check
      • Resource loading error
      • API monitoring
      • Global error
      • User interaction
      • Custom report
    • Environment module: Obtain environment data
  • Data processing module: Converts the environment data and content data into the format the interface expects and returns standard-format data
  • Report module: Takes environment data from the environment module and, according to the monitoring type, passes it together with the content data to the corresponding data processing module, then sends the resulting standard-format data to the Node layer. Before sending, the report module checks the local cache and reports cached data together with the new data. If the report fails, the data is saved to localStorage

Detailed design

The SDK is a singleton that contains the monitoring modules, the environment module, and the report module. Each monitoring module obtains the report module instance and reports through it, and the report module ensures that only one report request is in flight at any time. Events are listened for in the capture phase, so that information is not lost when an event does not bubble up.

Environment module

The environment module collects project configuration information, Web environment data, and JSBridge environment data. Other information that the Web layer can obtain itself, such as UA and ISP, is collected by the Web layer. The module exposes the init and getEnv methods:

  • init receives the environment parameters configured by the user
  • getEnv updates the page URL and returns a frozen copy of the current env object
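
The two methods can be sketched roughly as follows. This is only a minimal illustration: the field names (project, url) and the injected getUrl function are assumptions made for the example, not the SDK’s real schema.

```javascript
// Hypothetical environment module; field names are illustrative only.
function createEnvModule(getUrl) {
    let env = {};

    return {
        // init receives the environment parameters configured by the user.
        init(config) {
            env = Object.assign({}, env, config);
        },
        // getEnv refreshes the page URL and returns a frozen copy, so callers
        // cannot mutate the module's internal state (the freeze is shallow).
        getEnv() {
            env.url = getUrl();
            return Object.freeze(Object.assign({}, env));
        },
    };
}

// In a browser, getUrl would typically be () => window.location.href.
const envModule = createEnvModule(() => 'https://example.com/menu');
envModule.init({ project: 'demo-project' });
const envSnapshot = envModule.getEnv();
```

Returning a copy rather than the live object means a record that has already been assembled keeps the URL it was created with, even if the page navigates afterwards.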

Report module

Reporting uses a single-request mode: each user has at most one request in flight reporting the currently recorded monitoring information, and the next report can start only after that request succeeds.

New records that arrive before the current report completes are stored in localStorage. After a success response is received, the reported data is deleted and reporting starts again. Records that fail to report stay in localStorage. An upper limit on localStorage usage is enforced at this step.

If no report is in progress, all data currently in localStorage is reported together with the new data. If there are too many records, they are sent in batches. Reporting ends when all batches have been sent or a report fails.
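
The flow above might look like the sketch below. It is not the SDK’s actual code: send (assumed to return a Promise that rejects on failure), storage (any localStorage-like object), and the option names and defaults are all assumptions. The sketch also assumes that the size cap is enforced by dropping new records once the cache is full, which the text does not specify.

```javascript
// Hypothetical single-request reporter; every name here is an assumption.
function createReporter({ send, storage, key = 'monitor_queue', maxRecords = 100, batchSize = 20 }) {
    let reporting = false;

    const load = () => JSON.parse(storage.getItem(key) || '[]');
    const save = (records) => storage.setItem(key, JSON.stringify(records));

    async function flush() {
        if (reporting) return; // only one report request at a time
        reporting = true;
        try {
            let queue = load();
            while (queue.length > 0) {
                const batch = queue.slice(0, batchSize);
                await send(batch); // assumed to reject when the report fails
                // Delete only what was just reported; records that arrived
                // during the request sit behind the batch and go out next.
                queue = load().slice(batch.length);
                save(queue);
            }
        } catch (err) {
            // Unsuccessful records stay in storage for the next attempt.
        } finally {
            reporting = false;
        }
    }

    return {
        report(record) {
            const queue = load();
            // Cap the cache size; this sketch drops new records once it is full.
            if (queue.length < maxRecords) save(queue.concat([record]));
            return flush();
        },
    };
}
```

In a browser, storage could be window.localStorage and send could wrap an XMLHttpRequest or fetch call to the report endpoint.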

Monitoring modules

DNS hijacking

Once pages are served over HTTPS, a hijacker can no longer get at the page resources, and with no profit to be made the motivation for hijacking drops. If a page is hijacked anyway, the front-end resources never reach the device and nothing can be reported, so this case can only be monitored at the network layer. Since our company has fully switched to HTTPS, this module is not part of the current monitoring system. Our team did, however, implement hijacking detection for HTTP domains in the past. The idea is to request sample HTML or JS resources under a designated domain name on the Node layer and compare the returned results with the expected content.

Resource integrity check

The resource integrity check module records which resources are loaded on the page and reports them. When troubleshooting, we can check which resources were successfully loaded on the current page and in what order, to rule out errors caused by a resource not being loaded or by resources loading in the wrong order.
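
The comparison idea can be illustrated with a short sketch. The function name, the injected fetchText helper, and the shape of the result are assumptions for the example and do not describe our earlier implementation.

```javascript
// Hypothetical hijack check: request a known sample and compare it with
// the expected content. `fetchText` is injected and returns a Promise<string>.
async function detectHijack(fetchText, sampleUrl, expectedBody) {
    try {
        const body = await fetchText(sampleUrl);
        // Injected scripts or rewritten markup change the returned body.
        const hijacked = body !== expectedBody;
        return { hijacked, reason: hijacked ? 'content_mismatch' : null };
    } catch (err) {
        // A failed request is inconclusive rather than proof of hijacking.
        return { hijacked: false, reason: 'request_failed' };
    }
}
```

In a browser, fetchText could be (url) => fetch(url).then((res) => res.text()).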


Reports are triggered at the following points:

  1. onload: triggered on window.onload
  2. onload_timeout: triggered when onload times out (5 seconds)
  3. async: triggered a certain delay (5 seconds) after window.onload; listening stops after the report
  4. hash_change: listening starts on onhashchange, and the report is triggered after a certain delay (5 seconds); listening stops after the report

An array of loaded resources is kept in memory, and reported records are removed from it after each report.
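
A minimal sketch of that in-memory array, assuming a report callback is injected. The function and field names are made up for the example, and how resources are detected is left outside the sketch.

```javascript
// Hypothetical recorder for loaded resources.
function createResourceRecorder(report) {
    let loaded = [];

    return {
        // Called whenever a resource has finished loading.
        record(url) {
            loaded.push({ url, time: Date.now() });
        },
        // Called by a trigger such as 'onload' or 'hash_change'.
        flush(trigger) {
            if (loaded.length === 0) return;
            const batch = loaded;
            loaded = []; // reported records are removed from memory
            report({ trigger, resources: batch });
        },
    };
}
```

In a browser, one possible source of records is the Resource Timing API (performance.getEntriesByType('resource')), with flush wired to the triggers listed above, for example window.addEventListener('load', () => recorder.flush('onload')).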

Resource loading error monitoring

To catch cross-domain JS errors, the crossorigin attribute needs to be added to the corresponding resource tags.

API error monitoring

This is also implemented by hooking XMLHttpRequest. The interface URL is recorded when open is called, and when the status shows that the call has failed, the URL is reported.
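
As a rough illustration, a capture-phase error listener on window can pick up failed resource loads. The function name and record shape below are assumptions for the example. Error events from resource elements do not bubble, which is why the listener has to be registered for the capture phase.

```javascript
// Returns a record for a failed resource load, or null for other error events.
function toResourceError(event) {
    const target = event.target;
    // Runtime errors are dispatched on window itself, which has no tagName.
    if (!target || typeof target.tagName !== 'string') return null;
    return {
        tag: target.tagName.toLowerCase(),
        url: target.src || target.href || '',
    };
}

// Browser wiring; skipped when no window object exists.
if (typeof window !== 'undefined') {
    window.addEventListener('error', (event) => {
        const record = toResourceError(event);
        if (record) console.log('resource failed to load:', record);
    }, true); // true = capture phase
}
```

Without the crossorigin attribute (and matching CORS headers from the server), errors thrown inside a cross-domain script reach the page only as a generic "Script error." with no details.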

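// monitor.originXHR is assumed to hold the native open/send methods,
// saved before the hooks are installed.
XMLHttpRequest.prototype.open = function open(method, url) {
    // Remember the request URL so the send hook can report it on failure.
    this.ajaxUrl = url;
    // Forward the original arguments untouched: passing an explicit
    // undefined as the third argument would make the request synchronous.
    return monitor.originXHR.open.apply(this, arguments);
};

XMLHttpRequest.prototype.send = function send(_data) {
    const self = this;

    this.addEventListener('readystatechange', () => {
        if (self.readyState === 4) {
            // Skip successful responses and the SDK's own report requests.
            if (self.status !== 200 && self.status !== 304 && self.ajaxUrl !== REPORT_URL) {
                // report error info
                // ...
                // monitor.reporter.report(dataTypes.API_ERROR, error);
            }
        }
    }, false);

    monitor.originXHR.send.apply(this, [_data]);
};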

Filter out the SDK’s own report address and any other interface addresses that should be ignored.

Note: the URL passed to an interface call may be a relative path, so it is advisable to complete it with the protocol and domain.
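
One way to complete a relative path is the standard URL constructor, which resolves it against a base URL. The helper name is made up for the example; in a browser the base would typically be window.location.href.

```javascript
// Resolves a possibly relative request URL against the page URL.
// Absolute URLs are returned unchanged.
function toAbsoluteUrl(url, base) {
    return new URL(url, base).href;
}

const pageUrl = 'https://example.com/menu/detail';
console.log(toAbsoluteUrl('/api/cart', pageUrl));            // https://example.com/api/cart
console.log(toAbsoluteUrl('list?id=1', pageUrl));            // https://example.com/menu/list?id=1
console.log(toAbsoluteUrl('//cdn.example.com/a.js', pageUrl)); // https://cdn.example.com/a.js
```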

Global error monitoring

Listen for error events on the window, and filter out the error events that arrive through event delegation.
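
A possible shape for this filter is sketched below. It assumes that the events to be filtered are the ones whose target is an element rather than window, such as failed resource loads. That reading, the function name, and the record fields are assumptions for the example.

```javascript
// Turns a window error event into a record, or returns null when the event
// came from an element and should be handled elsewhere.
function toGlobalError(event, win) {
    if (event.target !== win) return null;
    return {
        message: event.message,
        source: event.filename,
        line: event.lineno,
        column: event.colno,
        stack: event.error && event.error.stack ? String(event.error.stack) : '',
    };
}

// Browser wiring; skipped when no window object exists.
if (typeof window !== 'undefined') {
    window.addEventListener('error', (event) => {
        const record = toGlobalError(event, window);
        if (record) console.log('global error:', record);
    }, true);
}
```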

User interaction monitoring

Listen for click events on the window in the capture phase and record click-related data.

Business code can add a data attribute to elements of interest. Each click reports the specified attribute of the clicked element along with additional information, and a domPath helps locate the element.

Recording user interactions makes it possible to identify the user’s operation path when a problem occurs. Combined with environment data, resource loading records, and error data, the whole problem scenario becomes clear at a glance.
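
The click record could be assembled along these lines. The data-monitor attribute name, the domPath format, and the function names are assumptions for the example, not the SDK’s real conventions.

```javascript
// Builds a short selector-like path from the clicked element up to the root.
function getDomPath(element) {
    const parts = [];
    for (let node = element; node && node.tagName; node = node.parentElement) {
        let part = node.tagName.toLowerCase();
        if (node.id) part += '#' + node.id;
        parts.unshift(part);
    }
    return parts.join(' > ');
}

function toClickRecord(event) {
    const target = event.target;
    return {
        domPath: getDomPath(target),
        // `data-monitor` is a hypothetical attribute name for this sketch.
        extra: target.getAttribute ? target.getAttribute('data-monitor') : null,
    };
}

// Browser wiring in the capture phase; skipped when no window object exists.
if (typeof window !== 'undefined') {
    window.addEventListener('click', (event) => {
        console.log('click:', toClickRecord(event));
    }, true);
}
```

Because the listener runs in the capture phase, the click is recorded even if business code later stops the event from propagating.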

Integration

The SDK can be integrated in either of two ways:

  1. Load the SDK up front
    • Advantages: It can record what happens before the page finishes loading, including the resources loaded and the errors that occurred
    • Disadvantages: It affects page loading speed, and the code is copied directly into the head, which is unfriendly for business teams integrating it
  2. Load the SDK afterwards
    • Advantages: Does not affect page performance
    • Disadvantages: Only pages that load successfully can be monitored, yet pages that fail to load are exactly the ones we need to care about

To meet the functional requirements, V1 of the monitoring platform is integrated by placing the compressed SDK code directly in the head of the monitored page, with the business code initializing the project name. This step can be done with a webpack plugin to make integration simpler for business teams.

The planned direction for improvement is a core base library plus loaders/plugins: only the SDK code that must load first goes into the head, and the rest is loaded asynchronously after the page has loaded.

The above is what our terminal team has learned in practice from the front-end SDK part of the monitoring platform. Criticism, comments, and suggestions that help us improve are all welcome. We will keep optimizing and keep the discussion going. Thank you for your patience.