Build your own front-end performance monitoring system

Background

Why monitor page performance?

Poor page performance has wide-reaching effects. At the company level, it affects revenue: if users wait too long for a page to open, they may close it right away or never come back. This is especially true on mobile, where users have a low tolerance for delays in page response.

Page loading speed also directly affects SEO. When a page loads too slowly, users leave, which raises the bounce rate. When a search engine sees a high bounce rate, it concludes that the site is of little value to users and lowers its ranking. In July 2018, Google introduced a new rule that lowers the search ranking of pages that take longer to load.

Although performance is important, it is often overlooked during development iterations, and it tends to degrade with each release. We therefore need a performance monitoring system that continuously monitors and evaluates page performance and raises alerts, so that we can identify bottlenecks and guide optimization work.

There are many excellent tools for evaluating and monitoring page performance, such as GTmetrix, which shows the results of multiple analysis tools side by side and offers many suggestions.

However, such simulated tests deviate from real-world conditions. They cannot show the overall speed in a given region or how many users are slow, and they cannot reflect fluctuations in performance. Besides white screen time, we also measure function-level timings, such as the time until the page becomes clickable and the time until an ad is displayed, and simulated monitoring cannot capture these.

To continuously monitor real user access and the availability of page functions across different network environments, we chose to embed JavaScript in the page and collect data from real users online. A piece of code reports each user's data to our server, and a back-end system aggregates and processes the data and finally presents it as charts, so that we can easily view the performance of each page.

Design of the speed measurement system

The speed measurement system is divided into three parts:

  • Front-end reporting
    • How to record the time points for speed measurement
    • How to report
    • Data sampling
  • Data processing and storage
  • Data display

### Front-end reporting

We embed a piece of JavaScript in the page and use it to report page performance data. Which metrics best reflect the user experience?

What users notice most is why the page takes so long to open, why images load so slowly, and why the page stays unclickable long after it starts loading. For developers, these impressions map to important page performance metrics. Abstracting from these pain points gives three metrics: white screen time, first screen time, and time to interactive. So how do we calculate them?

Determine the statistical starting point

The starting point should be the moment the user enters the URL and presses Enter, because that is when the user really starts waiting. In modern browsers, we can get this starting point directly from the Navigation Timing interface.

The Navigation Timing interface is a JavaScript API that accurately measures performance on the web and provides a detailed set of timestamps.

Open the console in Chrome, type performance, expand the returned object, and look at its timing property. It lists the timestamps recorded for the current page.

Each performance.timing attribute records the time of a page event (such as sending the page request) or a page load stage (such as when the DOM starts loading), measured in milliseconds since midnight on January 1, 1970 (UTC). A value of 0 means the event did not occur (for example, redirectStart or redirectEnd when there was no redirect).
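As a rough illustration of how these attributes can be turned into durations, the sketch below subtracts pairs of timestamps and treats a value of 0 as "did not occur". The helper name computeDurations and the selection of metrics are assumptions made for illustration and are not part of the system described here; the attribute names are the ones defined by Navigation Timing.

```javascript
// Hypothetical helper: derive durations (in ms) from a performance.timing-like object.
function computeDurations(timing) {
  // Returns null when either timestamp is missing or 0, i.e. the event did not occur.
  function span(startKey, endKey) {
    var start = timing[startKey];
    var end = timing[endKey];
    return start > 0 && end > 0 ? end - start : null;
  }
  return {
    redirect: span('redirectStart', 'redirectEnd'),
    dns: span('domainLookupStart', 'domainLookupEnd'),
    tcp: span('connectStart', 'connectEnd'),
    firstByte: span('navigationStart', 'responseStart'),
    domReady: span('navigationStart', 'domContentLoadedEventEnd'),
    load: span('navigationStart', 'loadEventEnd')
  };
}

// In a browser that supports the interface. The setTimeout defers the read
// until after the load handler returns, because loadEventEnd is still 0 inside it.
// window.addEventListener('load', function () {
//   setTimeout(function () {
//     console.log(computeDurations(performance.timing));
//   }, 0);
// });
```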

The Navigation Timing draft includes a diagram of the sequence of performance.timing events.

navigationStart is the time at which the browser initiates the request, which is generally when you press Enter in the address bar or refresh the page by pressing F5.

For a detailed explanation, see https://www.w3.org/TR/navigation-timing/. The other time points are covered in many articles online, so they are not repeated here.

Most browsers already support this interface; the exception is desktop Internet Explorer versions earlier than 9.

White screen time

White screen time is the time until the user sees the first element appear on the page. Many people equate it with the time at which the first byte of the page arrives, but this is not accurate, because the page is still blank while the resources in the head are loading.

The point at which the white screen actually ends depends on how the page is rendered.

The first case is an ordinary page that is not rendered by JavaScript. Here the white screen ends once the resources in the head have loaded, because the browser only starts rendering the page after they finish loading. It is therefore best to record the white screen time point at the end of the head (this is not exact, but it is a close approximation), as shown in the code.

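    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <!-- head resources -->
        <link rel="stylesheet" href="style.css">
        <title>Document</title>
        <script>
            // Time from navigation start to the end of the head, in ms
            // (the variable name whiteScreenTime is illustrative)
            window.whiteScreenTime = Date.now() - performance.timing.navigationStart;
        </script>
    </head>
    <body>
    </body>
    </html>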

The second case is a page built with a front-end framework such as Vue or React. These need to execute JavaScript before rendering content to the page, or they fetch data asynchronously and display the page once the data arrives. In this case the page usually shows a loading state, and the white screen ends when that loading state finishes.

First screen time

First screen time is the time at which all resources in the first screen have been displayed. It is measured differently from page to page. For example, if the first screen of a page contains four images, the first screen time should be recorded after all four images have loaded. If the page fetches its data asynchronously, the first screen time is the moment the data is inserted into the page. In short, the first screen time is the time at which the first-screen resources have been identified and have finished loading.
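For the image case, one possible way to record this point is to wait until every first-screen image has either loaded or failed. The sketch below is only an illustration under that assumption; trackFirstScreen, its parameters, and the first-screen class name in the usage comment are hypothetical names, not part of the original system.

```javascript
// Hypothetical helper: calls onDone(elapsedMs) once every first-screen image
// has loaded or failed. `now` is injectable so the clock can be stubbed.
function trackFirstScreen(images, startTime, onDone, now) {
  now = now || Date.now;
  var pending = images.length;
  if (pending === 0) {
    onDone(now() - startTime);
    return;
  }
  function settle() {
    pending -= 1;
    if (pending === 0) {
      onDone(now() - startTime);
    }
  }
  images.forEach(function (img) {
    if (img.complete) {
      settle(); // already finished, for example served from cache
      return;
    }
    img.onload = img.onerror = settle;
  });
}

// Browser usage (the selector is an assumption):
// var imgs = Array.prototype.slice.call(document.querySelectorAll('img.first-screen'));
// trackFirstScreen(imgs, performance.timing.navigationStart, function (ms) {
//   console.log('first screen time', ms);
// });
```

A failed image is counted as finished here so that one broken resource does not prevent the point from being reported; whether that is the right policy depends on the page.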

Reporting method

Once the times are measured, the data needs to be sent to the server. Speed measurement data can tolerate a small amount of loss, and reporting should interfere as little as possible with the main flow of the page and with its performance. We report the data with a GET request issued through an img element, for the following reasons:

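  • There are no Ajax cross-domain issues, so requests can be sent to a different origin.
  • The img element is very old, so there are no browser compatibility issues.

    // url is the reporting address with the measured data in its query string
    var i = new Image();
    // Release the handlers and the image once the request settles
    i.onload = i.onerror = i.onabort = function () {
        i = i.onload = i.onerror = i.onabort = null;
    };
    i.src = url;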

Some modern browsers also support the navigator.sendBeacon method. It sends a small amount of data asynchronously and can deliver it even while the page is being closed, which makes it especially suitable for reporting statistics.

    // $.param is jQuery's serializer; data is the object to report
    navigator.sendBeacon(url, data ? $.param(data) : null);

Final solution: if the browser supports sendBeacon, use it in preference; otherwise, fall back to reporting through img.
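The fallback described above could look roughly like the following sketch. The names report, toQuery, and the env parameter are hypothetical. Note one assumption that differs from the sendBeacon snippet above: the sketch puts the data in the query string on both paths, so that a server which only logs the request line sees the same data either way. sendBeacon issues a POST, while the img path issues a GET.

```javascript
// Hypothetical helper: serialize a flat object into a query string.
function toQuery(data) {
  return Object.keys(data).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(data[key]);
  }).join('&');
}

// Hypothetical helper: prefer sendBeacon, fall back to an img request.
// `env` defaults to the global object and exists so it can be stubbed.
function report(url, data, env) {
  env = env || globalThis;
  var query = toQuery(data || {});
  var target = query ? url + (url.indexOf('?') === -1 ? '?' : '&') + query : url;
  var nav = env.navigator;
  // sendBeacon returns false when the browser could not queue the request
  if (nav && typeof nav.sendBeacon === 'function' && nav.sendBeacon(target)) {
    return 'beacon';
  }
  var img = new env.Image();
  img.onload = img.onerror = img.onabort = function () {
    img = img.onload = img.onerror = img.onabort = null;
  };
  img.src = target;
  return 'image';
}

// Browser usage (URL and field names are placeholders):
// report('https://example.com/sp', { pageid: 1, '1': 350 });
```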

Sampling

Speed measurement produces a massive volume of reports. Large volumes increase storage and processing time, and server capacity is limited. To avoid wasting resources, the data is sampled at reporting time. The sampling granularity is controlled by the client: if the sampling is 1/10, rate=10 should be added to the reported data, where rate is the sampling rate.
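A minimal sketch of client-side sampling, assuming the rate=10 convention from the paragraph above (roughly 1 page view in 10 reports). The function names shouldSample and withRate are hypothetical, and the random source is a parameter only so that it can be stubbed.

```javascript
// Hypothetical helper: decide whether this page view should report.
// rate = 10 means roughly 1 page view in 10 reports.
function shouldSample(rate, random) {
  random = random || Math.random;
  return random() < 1 / rate;
}

// Hypothetical helper: attach the rate so the receiving side knows
// how the data was sampled.
function withRate(data, rate) {
  var out = {};
  Object.keys(data).forEach(function (key) {
    out[key] = data[key];
  });
  out.rate = rate;
  return out;
}

// Usage:
// if (shouldSample(10)) {
//   var payload = withRate({ pageid: 1 }, 10);
//   // send payload with the reporting method of your choice
// }
```

Making the decision once per page view, rather than once per speed point, keeps all points of a view together; that is a design choice of this sketch rather than something the text prescribes.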

Data collection and storage

We set up an Nginx server on one machine. Nginx can record each user request in its access log, including everything in the request such as the request parameters, request headers, and the client IP. The log format can be customized.

When a page sends a speed measurement request, Nginx records the request in the log.

We did not use logrotate (scheduled log rotation) for the Nginx logs. Logrotate is typically scheduled at a granularity of one day, whereas we want each log file to cover 5 minutes. This allows files to be processed in batches instead of processing one large file at a time, and it matches the 5-minute granularity of the speed points in our single-day trend queries.

The Nginx configuration is as follows:

    if ($time_iso8601 ~ "^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{1})[0-4]") {
        set $logname $1-$2-$3-$4-$50;
    }
    if ($time_iso8601 ~ "^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{1})[5-9]") {
        set $logname $1-$2-$3-$4-$55;
    }

To generate the log:

    access_log logs/stat.y.qq.com.sp.access.$logname.log spdata;

The logs are thus partitioned at the moment the log files are written. The drawback is that logrotate can no longer be used to delete old log files periodically, so an extra script is needed to delete them on a schedule and keep stale files from piling up and wasting disk space.
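To make the 5-minute bucketing concrete, the sketch below mirrors the two if blocks of the Nginx configuration in JavaScript: it keeps the tens digit of the minute and appends 0 or 5 depending on the last digit. The function name logNameFor is hypothetical; a processing or cleanup script could use something similar to work out which file a given timestamp belongs to.

```javascript
// Hypothetical helper mirroring the nginx regexes: minutes x0-x4 map to x0,
// minutes x5-x9 map to x5.
function logNameFor(timeIso8601) {
  var m = /^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d)(\d)/.exec(timeIso8601);
  if (!m) {
    return null; // not an ISO 8601 timestamp
  }
  var half = Number(m[6]) <= 4 ? '0' : '5';
  return [m[1], m[2], m[3], m[4], m[5] + half].join('-');
}

// Usage: build the file name used by the access_log directive.
// var file = 'logs/stat.y.qq.com.sp.access.' +
//   logNameFor('2018-07-12T10:37:05+08:00') + '.log';
```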

The log stores data in the following format:

    log_format spdata '$time_local ~|^ $http_x_forwarded_for ~|^ $request ~|^ $http_referer ~|^ $status ~|^ $http_user_agent  ~|^ $cookie_ptisp ~|^ $cookie_uin';

Here ~|^ is the field separator.

The statistical fields are:

  • Time: bucketed into 5-minute intervals
  • IP: the IP address of the user
  • Data: the data reported by the page
    • Product ID (such as QQ Music, National Karaoke)
    • pid: project ID (such as PC client, YQQ, QQ Music mobile client, other H5)
    • pageid: page ID (a specific page under the project)
    • points: 1=xx&2=xx&3=xx…
    • r: sampling rate, 0~1
  • Referer: the referer of the page
  • UA: parsed into platform, system, version, and app network type
  • ISP
  • uin: the ID of the user
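As an illustration of how a line in the spdata format might be split back into these fields, here is a small parsing sketch. The function and field names (parseLogLine, FIELDS, params) are hypothetical, the field order follows the log_format line above, and the sample line in the usage comment is made up.

```javascript
var SEPARATOR = '~|^';
// Order follows the log_format definition above.
var FIELDS = ['time', 'ip', 'request', 'referer', 'status', 'userAgent', 'isp', 'uin'];

// Decode a URL component, falling back to the raw text if it is malformed.
function safeDecode(text) {
  try {
    return decodeURIComponent(text);
  } catch (e) {
    return text;
  }
}

// Hypothetical parser: returns null for lines that do not have all fields.
function parseLogLine(line) {
  var parts = line.split(SEPARATOR).map(function (part) {
    return part.trim();
  });
  if (parts.length !== FIELDS.length) {
    return null;
  }
  var record = {};
  FIELDS.forEach(function (name, index) {
    record[name] = parts[index];
  });
  // $request has the form "METHOD /path?query PROTOCOL"
  var target = record.request.split(' ')[1] || '';
  var queryStart = target.indexOf('?');
  record.params = {};
  if (queryStart !== -1) {
    target.slice(queryStart + 1).split('&').forEach(function (pair) {
      if (!pair) {
        return;
      }
      var eq = pair.indexOf('=');
      var key = eq === -1 ? pair : pair.slice(0, eq);
      var value = eq === -1 ? '' : pair.slice(eq + 1);
      record.params[safeDecode(key)] = safeDecode(value);
    });
  }
  return record;
}

// Usage (made-up line):
// parseLogLine('12/Jul/2018:10:37:05 +0800 ~|^ 203.0.113.7 ~|^ GET /sp?pid=1&pageid=2 HTTP/1.1 ~|^ - ~|^ 200 ~|^ Mozilla/5.0  ~|^ - ~|^ -');
```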

To reduce the load on the server, the machine that receives reports is separate from the server that imports data into the database. The import server periodically pulls logs from the reporting machine and loads the data into the database.

Data storage

Data processing is a major challenge for the system, since the PV of the whole platform exceeds 100 million per day. To keep any single table from growing too large, we create new tables by date for the collected data.

Even with one table per date, a query still has to cover tens of millions of rows, and querying the table directly is very time-consuming. To solve this, we set up three tables: a data statistics table, a raw data table, and a raw data index table.

Statistics table

The statistics table holds the average time of every speed point on a page over each 5-minute period. When analyzing the data, the program divides a day into 5-minute intervals, calculates the 5-minute average for each speed point, and writes it to the statistics table. To query the single-day trend of a speed point, we can read the statistics table directly without re-scanning all the raw points. The statistics table greatly reduces the amount of data to query and therefore speeds up queries; the MySQL queries complete in milliseconds.
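The 5-minute averaging can be sketched as follows. This is an in-memory illustration only, with hypothetical names (aggregate, and the time, point, and cost fields); the real system writes the result to a database table, and the sketch does not weight samples by sampling rate.

```javascript
var BUCKET_MS = 5 * 60 * 1000; // 5 minutes

// samples: [{ time: epoch ms, point: speed point id, cost: ms }]
// Returns one row per (5-minute bucket, speed point) with count and average.
function aggregate(samples) {
  var groups = {};
  samples.forEach(function (sample) {
    var bucket = Math.floor(sample.time / BUCKET_MS) * BUCKET_MS;
    var key = bucket + ':' + sample.point;
    if (!groups[key]) {
      groups[key] = { bucket: bucket, point: sample.point, sum: 0, count: 0 };
    }
    groups[key].sum += sample.cost;
    groups[key].count += 1;
  });
  return Object.keys(groups).map(function (key) {
    var group = groups[key];
    return {
      bucket: group.bucket,
      point: group.point,
      count: group.count,
      average: group.sum / group.count
    };
  });
}
```

A single-day trend query then reads at most 288 rows per speed point (24 hours times 12 buckets per hour) instead of every raw report.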

Raw data table & index table

The statistics table covers most query needs. However, queries with compound conditions (such as country, province, operator, network type, and operating platform) obviously cannot be served by it. Creating a table for every combination of conditions would produce a large number of extra tables, and since compound queries are used infrequently, this is not feasible.

We store the raw data in separate tables and look it up through the indexes in the index table. Once the data in a single table exceeds a certain order of magnitude, queries slow down. To preserve MySQL performance, it is recommended that a single table hold no more than 10 million records. The data in each sub-table is then queried by index.

Threshold alarm

A slow data interface can slow down the page load, and in that case we need an alert to notify the developer. During data import, if the 5-minute average of a speed point exceeds its preset threshold, or the default threshold of 10 seconds when none is set, the system notifies the developers through some channel. Alarms help detect problems and resolve them promptly.
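A minimal sketch of the threshold check, assuming the 5-minute averages are available as rows with a point identifier and an average in milliseconds. The names findAlerts and thresholds are hypothetical, and how the alert is delivered is left out because the text does not specify it.

```javascript
var DEFAULT_THRESHOLD_MS = 10 * 1000; // the 10-second default mentioned above

// rows: [{ point: id, average: ms }] for one 5-minute period
// thresholds: optional map from point id to a preset threshold in ms
function findAlerts(rows, thresholds) {
  thresholds = thresholds || {};
  return rows.filter(function (row) {
    var hasPreset = Object.prototype.hasOwnProperty.call(thresholds, row.point);
    var limit = hasPreset ? thresholds[row.point] : DEFAULT_THRESHOLD_MS;
    return row.average > limit;
  });
}

// Usage:
// findAlerts(rowsForThisPeriod, { '2': 2000 }).forEach(function (row) {
//   // notify the developers responsible for row.point
// });
```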

Data display

The system provides a bar chart showing the time consumption of each speed point on a page, a single-day trend chart for a single speed point, a trend chart over a period of time, and a multidimensional analysis list.

Page overview

The page overview displays the time consumption of all speed points on the page.

The reason for using bar charts is to make it easy for page developers to see where the most time is being spent and where the bottlenecks are.

In addition to viewing the overall time spent on the page, you can also view the details of individual speed points.

Details of speed points

Each speed point displays the following information:

  • Average time
  • Number of requests
  • Percentage of slow users
  • Normal distribution of speed

To uncover potential performance bottlenecks, the data needs to be analyzed along multiple dimensions. For example, on mobile the network type matters more, so the data needs to be broken down by network type.

Other common dimensions include:

  • Country
  • Province
  • Operator
  • Network type
  • Operating system
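Breaking the data down by one of these dimensions amounts to grouping records by that field and averaging each group. The sketch below assumes each raw record carries the dimension as a property alongside a cost in milliseconds; averageBy and the field names in the usage comment are hypothetical.

```javascript
// Hypothetical helper: group records by one dimension and average the cost.
// records: [{ cost: ms, <dimension>: value, ... }]
function averageBy(records, dimension) {
  var groups = {};
  records.forEach(function (record) {
    var key = String(record[dimension]);
    if (!groups[key]) {
      groups[key] = { sum: 0, count: 0 };
    }
    groups[key].sum += record.cost;
    groups[key].count += 1;
  });
  var result = {};
  Object.keys(groups).forEach(function (key) {
    result[key] = {
      count: groups[key].count,
      average: groups[key].sum / groups[key].count
    };
  });
  return result;
}

// Usage:
// averageBy(rawRecords, 'networkType');
```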

Handling abnormal data

In the speed point chart, a long spike sometimes appears. A spike occurs when the delay at one point is much higher than at the surrounding points. To find the cause, I checked the raw data table and found that most reports were normal, but a single report had a duration of more than 30 minutes. I do not yet know why such a long rendering time was reported; it may be related to the user's machine or to the network conditions at the time. Such points strongly distort the averages shown in the chart. To keep the data normal overall and prevent any abnormal point from skewing it, points longer than 10 minutes are filtered out directly.
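The effect of the 10-minute cutoff can be seen in a small sketch. The function name averageWithoutOutliers is hypothetical, and the input is assumed to be a list of durations in milliseconds for one speed point.

```javascript
var MAX_COST_MS = 10 * 60 * 1000; // points longer than 10 minutes are dropped

// Average the durations after removing those above the cutoff.
// Returns null when nothing is left to average.
function averageWithoutOutliers(costs) {
  var kept = costs.filter(function (cost) {
    return cost <= MAX_COST_MS;
  });
  if (kept.length === 0) {
    return null;
  }
  var sum = kept.reduce(function (total, cost) {
    return total + cost;
  }, 0);
  return sum / kept.length;
}
```

With two normal reports of 1 and 2 seconds plus one 30-minute report, the plain average would be over 10 minutes, while the filtered average is 1.5 seconds.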

Conclusion

We have described how to build a speed measurement system from three aspects: front-end reporting, data collection and storage, and data display. Performance optimization is an ongoing concern for us, and a speed measurement system is an essential tool for creating a smooth experience.
