This article took part in the "New Creator Ceremony" activity, starting the road of creation on Juejin together.
I have sorted out some of my project experience with front-end monitoring and, combined with my own thinking, written this article to share with you.
Building a front-end monitoring system
Usually, the front end builds a monitoring system mainly to solve two problems: how to discover problems in time, and how to quickly locate and fix them.
Generally speaking, from the perspective of development and product, a front-end monitoring system needs to do the following:
- Overall page traffic, including common PV, UV, and user behavior reporting.
- Page performance statistics, including load time and interface (API) time.
- Grayscale release and effective monitoring capabilities, so problems can be discovered in time.
- Sufficient logs to locate problems reported by users.
These problems can be solved from two perspectives: data collection and data reporting.
Data collection
To monitor effectively, we first need to collect and report monitoring data. In traditional page development, the quality of a system is usually evaluated from three aspects, and page monitoring and data collection are carried out from the same three aspects:
- Page access speed
- Page stability/exception
- External service invocation
Exception collection
First, we need to collect errors that occur while the project is running, because script execution exceptions are likely to make features directly unavailable. When the HTML document executes abnormally, we can intercept errors through methods such as `window.onerror`, `document.addEventListener('error', handler, true)`, and XMLHttpRequest status codes. For example, by listening to the `window.onerror` event, we can retrieve the error message and stack in the project and automatically report the error information to the backend service.
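As a sketch of this idea (the `report` function here is a hypothetical uploader printed to the console, not a real API), global error capture might look like:

```javascript
// A sketch of global error capture; report() stands in for the real uploader.
function report(entry) {
  // In production this would POST to a log service; here we just print it.
  console.log('[monitor]', JSON.stringify(entry));
}

function installErrorHandlers() {
  if (typeof window === 'undefined') return; // browser-only
  // Runtime script errors, with message and stack when available
  window.onerror = (message, source, lineno, colno, error) => {
    report({ type: 'js', message, source, lineno, colno, stack: error && error.stack });
  };
  // Resource loading errors do not reach window.onerror; use the capture phase
  window.addEventListener('error', (e) => {
    if (e.target !== window) report({ type: 'resource', tag: e.target.tagName });
  }, true);
  // Rejected promises need their own listener
  window.addEventListener('unhandledrejection', (e) => {
    report({ type: 'promise', reason: String(e.reason) });
  });
}

installErrorHandlers();
```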
Common front-end exceptions include:
- Logic error: when implementing a feature, the logic does not behave as expected
- Business logic judgment conditions are wrong
- Events are bound in the wrong order
- Call stack timing errors
- Incorrectly manipulating a JS object
- Code robustness: boundary cases are poorly considered, and exception-handling logic executes incorrectly
- Reading a property of null as if it were an object
- Traversing undefined as if it were an array
- Numeric strings used directly in addition operations
- Function parameters not passed
- Network error: the user's network is abnormal or the backend service is abnormal
- The server returns 200 but no data, and the front end traverses the data as if it were normal
- The network is interrupted while data is being submitted
- No front-end error handling when the server returns a 500 error
- System error: an error caused by runtime-environment compatibility issues or insufficient memory
- Page content exception: missing content, binding event exception, style exception
Life cycle data
The life cycle includes critical points in time when a page is loaded, and often includes data on how long it takes to open, update, and close a page.
In general, we can get lifecycle-related data through the PerformanceTiming interface, including:
- `PerformanceTiming.navigationStart`: timestamp when the previous page in the current browser window is closed and its unload event fires
- `PerformanceTiming.domLoading`: when the DOM structure of the current page begins parsing, i.e. `Document.readyState` changes to "loading" and the corresponding `readystatechange` event fires
- `PerformanceTiming.domInteractive`: when the DOM structure of the current page has finished parsing and embedded resources begin loading, i.e. `Document.readyState` changes to "interactive" and the corresponding `readystatechange` event fires
- `PerformanceTiming.domComplete`: when the current document has been fully parsed, i.e. `Document.readyState` becomes "complete" and the corresponding `readystatechange` event fires
- `PerformanceTiming.loadEventStart`: timestamp when the load event is sent for the document
- `PerformanceTiming.loadEventEnd`: timestamp when the load event ends, that is, when the load event completes
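Putting these milestones together, a small helper (a sketch, not a fixed API) can turn a PerformanceTiming-like object into durations; in a browser you would pass the real `performance.timing` after the load event:

```javascript
// Derive page lifecycle durations from a PerformanceTiming-like object.
function loadDurations(t) {
  return {
    domParse: t.domInteractive - t.domLoading,   // DOM parsing time
    domReady: t.domComplete - t.navigationStart, // until document fully parsed
    fullLoad: t.loadEventEnd - t.navigationStart // until the load event finishes
  };
}

// In a browser, run after load so loadEventEnd is populated:
// window.addEventListener('load', () =>
//   setTimeout(() => console.log(loadDurations(performance.timing)), 0));
```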
In addition, the DOMContentLoaded event fires after the initial HTML document has been completely loaded and parsed, without waiting for stylesheets, images, and subframes to finish loading. With the rise of front-end frameworks, most page rendering is handed over to the framework, so DOMContentLoaded has lost much of its original role; most of the time we collect data in the lifecycle hooks provided by the framework itself.
We can also use the MutationObserver interface, which provides the ability to listen for changes in the DOM tree of the page and obtain specific times in conjunction with Performance:
```javascript
// Register the listener function
const observer = new MutationObserver((mutations) => {
  console.log(`Time: ${performance.now()}, the DOM tree has changed. The change types are:`);
  for (let i = 0; i < mutations.length; i++) {
    console.log(mutations[i].type);
  }
});

// Start listening for changes on the document node
observer.observe(document, {
  childList: true,
  subtree: true,
});
```
HTTP timing data
We can also obtain PerformanceTiming data for requests:
- `PerformanceTiming.redirectStart`: timestamp when the first HTTP redirect starts
- `PerformanceTiming.redirectEnd`: timestamp when the last HTTP redirect ends (when the last byte of the redirect response is received)
- `PerformanceTiming.fetchStart`: timestamp when the browser is ready to fetch the document with an HTTP request, before the page checks the local cache
- `PerformanceTiming.domainLookupStart` / `PerformanceTiming.domainLookupEnd`: timestamps when the domain name lookup starts and ends
- `PerformanceTiming.connectStart`: timestamp when the browser starts establishing the connection to the server
- `PerformanceTiming.connectEnd`: timestamp when the connection between the browser and the server is established, with all handshake and authentication complete
- `PerformanceTiming.secureConnectionStart`: timestamp when the browser and the server begin the secure connection handshake
- `PerformanceTiming.requestStart`: timestamp when the browser sends the HTTP request to the server (or starts reading the local cache)
- `PerformanceTiming.responseStart`: timestamp when the browser receives the first byte from the server (or reads it from the local cache)
- `PerformanceTiming.responseEnd`: timestamp when the browser receives the last byte from the server (or reads it from the local cache, or when the HTTP connection is closed, if that comes first)
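As a rough sketch, the same idea applied to request timing splits a PerformanceTiming-like object into network phases (the field names mirror the properties above; the helper itself is an illustration, not a library API):

```javascript
// Split an HTTP request into phases from PerformanceTiming-style fields.
function requestPhases(t) {
  return {
    redirect: t.redirectEnd - t.redirectStart,    // HTTP redirects
    dns: t.domainLookupEnd - t.domainLookupStart, // domain name lookup
    tcp: t.connectEnd - t.connectStart,           // connection setup
    ttfb: t.responseStart - t.requestStart,       // time to first byte
    download: t.responseEnd - t.responseStart     // response body transfer
  };
}
```

A consistently large `ttfb`, for example, points at the backend service rather than the page itself.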
Using this data, we can see whether the backend service is stable and whether there is room for optimization.
User behavior data
In addition to the common page-load and request timing data, we can also pay attention to user behavior data, including page views and clicks, how long the user stays on each page, which entry point the user came from, and which actions the user triggers on each page. User behavior data can be captured through the events of the relevant DOM elements.
This data is often used for statistical analysis of user behavior so the page's functionality can be adjusted and improved. At the same time, interaction data also lets us observe whether system functions are working normally.
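One common way to capture such behavior data is event delegation on the document, with a `data-` attribute marking trackable elements. Everything below (the `data-track` attribute name, the record fields, the console reporter) is an assumed convention for illustration, not a fixed API:

```javascript
// Build a standardized behavior record for reporting.
function buildBehaviorRecord(name, page, time) {
  return { event: 'click', name, page, time };
}

// Delegate clicks at the document level; elements opt in via data-track.
if (typeof document !== 'undefined') {
  document.addEventListener('click', (event) => {
    const el = event.target.closest('[data-track]');
    if (el) {
      const record = buildBehaviorRecord(el.dataset.track, location.pathname, Date.now());
      console.log('[behavior]', record); // replace with the real reporter
    }
  });
}
```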
User logs
Logs are used to locate system anomalies. There are two ways to store logs:
- Report them to the server. If all logs are reported, storage costs are high, frequent reporting increases the maintenance cost of the interface, and some or all logs may be lost due to network problems.
- Store them locally. With this approach, users need to manually submit their local logs before a specific exception can be located; if the user cannot be reached, the problem may never be fixed because the exception cannot be reproduced.
Logs are usually consulted when we try to locate a user's problem, but we need to print them in the code ahead of time. Otherwise, when we need to locate a problem, we may find that the relevant log was never written; some problems are hard to reproduce, and may not be reproducible at all after release, which leaves us in a passive position.
Automatic log printing can be implemented by globally hijacking key modules and functions. For example: when each function module runs, print its input parameters, execution information, and output in a specified format; the logs can then be parsed to reconstruct the complete call relationships and execution information of each module.
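A minimal sketch of this kind of hijacking wraps every function on a module object so its inputs and outputs are logged automatically (the log format here is an example, not a standard):

```javascript
// Wrap every function on a module so calls are logged automatically.
function withLogging(moduleName, mod) {
  const wrapped = {};
  for (const key of Object.keys(mod)) {
    const value = mod[key];
    if (typeof value !== 'function') {
      wrapped[key] = value;
      continue;
    }
    wrapped[key] = function (...args) {
      console.log(`[${moduleName}.${key}] input:`, JSON.stringify(args));
      const result = value.apply(this, args);
      console.log(`[${moduleName}.${key}] output:`, JSON.stringify(result));
      return result;
    };
  }
  return wrapped;
}

// Usage: wrap once; every later call prints its input and output.
const mathModule = withLogging('math', { add: (a, b) => a + b });
```

`mathModule.add(1, 2)` then behaves exactly like the original `add`, but leaves a parseable trace of the call.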
Buried point solutions
There are three common buried point (tracking) schemes on the front end:

| | Code buried point | Visual buried point | Traceless buried point |
|---|---|---|---|
| How it is used | Manual coding | Visual circle selection | Embedded SDK |
| Custom data | Fully customizable | Hard to customize | Hard to customize |
| Mature products in the industry | Youmeng, Baidu Statistics, and other third-party analytics providers | Mixpanel | GrowingIO |
| Update cost | Requires a version update | Requires configuration delivery | None |
| Usage cost | High | Medium | Low |
Traceless buried points generally collect data through the APIs described above, but because their customization ability is very weak, we usually combine them with code buried points.
Standardize buried point data
Whichever approach we use, the data needs to be standardized. Generally, specific parameters are agreed with the backend, and during collection the front end automatically converts events into the data formats the interface requires and stores them locally.
With this behavior information, each user's sequence of operations can be computed on a timeline in real time, along with the time and content of each step, and the user's path can be displayed intuitively through a visual system: the entry into the system, pages opened and closed, each function point clicked, operation timing, abnormal functions, and so on.
By obtaining user click streams and page usage in a standardized way, reporting page and function-level behavior to the server, and analyzing operation time, operation name, and other information in real time, we can derive the user's operation path, the time consumed and conversion rate between pages and function steps, and monitor them effectively. This lets us observe how the product is used efficiently and intuitively, analyze user behavior, and then determine the product's direction and improve its features.
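For illustration, a normalization step might look like the following; every field name here is an assumed convention to be agreed with the backend, not a fixed schema:

```javascript
// Normalize a raw tracking event into the agreed report format.
function normalizeEvent(raw, context) {
  return {
    name: raw.name,                // buried point name
    type: raw.type || 'click',     // operation type, defaulting to click
    time: raw.time || Date.now(),  // operation timestamp
    session: context.session,      // session marker
    version: context.version,      // release version
    page: context.page             // page identifier
  };
}
```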
Data reporting
After data collection is complete, we need to report the data to the backend service:
When lifecycle events such as page open, update, and close, user actions within the page, or system exceptions are triggered, the bottom layer of the system monitors these events through buried points, obtains the relevant data, standardizes it, stores it locally, and then reports it to the real-time data analysis system.
The reported data includes the time, name, session marker, version number, and other information. From it, the number of uses of each buried point, the execution time between buried points, and the conversion rate between buried points can be calculated in real time, and complete page usage can be displayed intuitively in a visual system: the opening, updating, and closing of each page, the clicks and loading of each function point, abnormal functions, and so on.
Reporting methods
Generally speaking, the buried-point data and running logs need to be reported to the backend service for conversion, storage, and monitoring.
Batch reporting
For the front end, overly frequent requests may interfere with the user's other normal requests, so the collected data is often stored locally first. When a certain amount has accumulated, it is packaged and reported in one batch, or uploaded at a fixed frequency (time interval). Merging many records into a single request reduces pressure on the server.
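A minimal batch reporter along these lines might look like this sketch; the threshold, interval, and `send` callback are all assumptions to be tuned per project:

```javascript
// Buffer events locally and flush as one batch when the buffer is full
// or when the timer fires, merging many records into a single request.
class BatchReporter {
  constructor(send, { maxSize = 20, interval = 10000 } = {}) {
    this.send = send; // e.g. (batch) => fetch('/report', { method: 'POST', body: JSON.stringify(batch) })
    this.buffer = [];
    this.maxSize = maxSize;
    this.timer = setInterval(() => this.flush(), interval);
  }
  push(event) {
    this.buffer.push(event);
    if (this.buffer.length >= this.maxSize) this.flush();
  }
  flush() {
    if (this.buffer.length === 0) return;
    this.send(this.buffer.splice(0)); // hand off and clear in one step
  }
}
```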
Critical lifecycle reporting
Because users may encounter exceptions or exit during use, we also need to upload data when an exception is triggered and before the user exits the program, so that problems can be discovered and located in time.
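Browsers can terminate normal requests during unload, so a common approach is `navigator.sendBeacon`, which queues the report even as the page closes; the `/report` endpoint and the pending-log getter below are placeholders:

```javascript
// Flush buffered logs when the page is hidden or about to close.
function flushOnHide(getPending, endpoint = '/report') {
  if (typeof window === 'undefined') return false; // browser-only
  window.addEventListener('visibilitychange', () => {
    if (document.visibilityState === 'hidden') {
      // sendBeacon keeps working while the page unloads, unlike fetch/XHR
      navigator.sendBeacon(endpoint, JSON.stringify(getPending()));
    }
  });
  return true;
}
```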
User-initiated submission
For some anomalies and user-experience problems, we provide users with the option to upload actively. When the user is guided to upload, we can submit the local data and logs together.
Data monitoring
After the data is reported, we need to build a management console to monitor the data effectively, which mainly includes three parts:
- Performance monitoring
- Page loading performance
- Network request performance
- Exception monitoring
- JS errors
- Data monitoring
- Page PV/UV
- Page sources
In daily monitoring, alarm thresholds can be configured for the monitored data, with alerts sent to the relevant people by email or chat bot, so that problems are discovered and solved in time.
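The threshold check itself can be sketched as follows; the metric names and threshold values are examples, not a prescribed configuration:

```javascript
// Compare reported metrics against configured alarm thresholds.
function checkThresholds(metrics, thresholds) {
  const alarms = [];
  for (const [name, value] of Object.entries(metrics)) {
    if (name in thresholds && value > thresholds[name]) {
      alarms.push({ name, value, threshold: thresholds[name] });
    }
  }
  return alarms; // each entry would then be mailed or sent to a chat bot
}
```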
Release process monitoring
In a project with multiple cooperating teams, each release combines features developed by several partners, so verifying correctness manually is inefficient: manual tests may not cover all features, and automated tests are often impractical for cost-effectiveness reasons. Therefore, in addition to automated testing and self-testing of the changes, we attach a version number to every report, so each version's curves can be observed separately. During the grayscale release, we also need to watch:
- Check whether new errors appear in the mini program's error alarms; errors can be located and fixed according to their content
- Full-version monitoring: whether the overall function-point coverage curve is normal, and whether there are abnormal spikes or drops
- Per-version monitoring: whether function coverage is complete, whether the grayscale ratio is normal, and whether the new version's conversion rate is consistent
During the grayscale release, we can confirm whether and where anomalies exist by checking whether the reported function curves are normal, whether exceptions stay within the expected range, and whether curve mutations line up with the grayscale time points. When the data is abnormal, the corresponding alarm channel can notify the responsible person so the faulty function is repaired in time.
Conclusion
In many cases, front-end projects already monitor exceptions and timings and report some user behavior data. In fact, we can automate more of these processes, and the reported data can be filtered, aggregated, and converted to calculate product usage across many dimensions; we can even build full-link monitoring, or derive practical guidance for the product's direction.