Preface

For the front end, the most important thing is the user experience, and the core of that experience is performance. Metrics such as the second-open rate (the share of page views that open within one second) and smoothness directly affect how users perceive a page.

An accurate, timely, and effective front-end performance monitoring system can therefore quantify the current performance of a page and provide data to verify whether an optimization actually worked. It can also raise alerts when page performance declines, prompting developers to fix the problem.

Selecting monitoring indicators

After studying existing practice, we evaluated the computational cost, applicability, and practical value of a range of performance monitoring indicators, and concluded that the following indicators and information are the most practical and cost-effective:

The first is FCP (First Contentful Paint, shown below), the most widely used metric for measuring whether a page opens within a second. Although it reflects the perceived load less faithfully than FMP (First Meaningful Paint), LCP (Largest Contentful Paint), Speed Index, and similar metrics, it has the advantage that it can be read from the Performance API on Android and estimated with rAF (requestAnimationFrame) on iOS, so it is simple to implement.

The second is TTS (Time to Server), which does not appear in the earlier articles we consulted.

It is computed as requestStart minus fetchStart, both provided by the Performance API. Front-end techniques cannot optimize this metric, but it shows, given the network conditions of the current user base, what the upper limit of the page's second-open rate is.

For example, if the performance monitoring data shows that 15% of users need at least 1 second just to connect to our server (whether an SSR server or a CDN), those users cannot get a sub-second open no matter what we do, so the upper limit of the page's second-open rate is 85%.

If the page's second-open rate has already reached 75% or higher in this situation, the marginal return on further optimization is very low, and the current level should be considered good enough.
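The ceiling argument above can be made concrete with a small calculation. The sketch below is only an illustration: the function name `second_open_ceiling`, the 1000 ms budget, the sample data, and the 75% current rate are assumptions chosen to mirror the example in the text, not part of the monitoring system itself.

```python
def second_open_ceiling(tts_ms, budget_ms=1000):
    """Fraction of visits whose TTS alone still leaves room for a sub-second open.

    tts_ms: list of Time to Server samples in milliseconds (hypothetical input).
    A visit whose TTS is at least budget_ms can never open within the budget.
    """
    if not tts_ms:
        raise ValueError("need at least one TTS sample")
    reachable = sum(1 for t in tts_ms if t < budget_ms)
    return reachable / len(tts_ms)


# Hypothetical data: 15 of 100 visits need at least 1 s to reach the server.
samples = [200] * 85 + [1200] * 15
ceiling = second_open_ceiling(samples)
current_rate = 0.75  # assumed measured second-open rate
print(f"ceiling: {ceiling:.0%}, headroom: {ceiling - current_rate:.0%}")
```

The smaller the headroom between the measured rate and the ceiling, the less further front-end optimization can gain.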

The third is TSP (Time for Server Processing), which also does not appear in the earlier references.

It measures the server's internal processing time for a page request in SSR scenarios, and is obtained by subtracting requestStart from responseStart, both provided by the Performance API.
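Both TTS and TSP are simple differences of navigation timing fields, so they can be derived wherever the raw timings are available. A minimal sketch, assuming the raw timings arrive as a dictionary keyed by the Performance API field names; the function name `server_timings` and the sample values are hypothetical.

```python
def server_timings(nav):
    """Derive TTS and TSP (in ms) from raw navigation timing fields.

    nav: dict holding fetchStart, requestStart and responseStart in milliseconds.
    """
    fetch_start = nav["fetchStart"]
    request_start = nav["requestStart"]
    response_start = nav["responseStart"]
    if not fetch_start <= request_start <= response_start:
        raise ValueError("timing fields are out of order")
    return {
        "tts": request_start - fetch_start,     # Time to Server
        "tsp": response_start - request_start,  # Time for Server Processing
    }


# Hypothetical sample entry.
entry = {"fetchStart": 12.0, "requestStart": 180.0, "responseStart": 420.0}
print(server_timings(entry))  # {'tts': 168.0, 'tsp': 240.0}
```

Note that responseStart minus requestStart also includes one network round trip, so the value is best read as an upper bound on pure server processing time.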

If this stage performs poorly, it too becomes a bottleneck that drags down the second-open rate, so it must be monitored. A TSP that is too long squeezes the performance budget of the other stages, while pushing TSP too low drives up server operation and maintenance costs.

The fourth is the size of CSS files, images, and other resources, and the duration of XHR requests. If the first two are not kept under control, the page cannot quickly become usable even if it opens within a second; feed pages are a common example.

XHR needs to be considered case by case. On an SSR page it has little impact: as long as TSP stays low, it will generally not drag down the second-open rate. It still needs to be monitored, so that the back end can be notified to optimize when the numbers degrade.
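Resource and request records can be numerous, so they usually need condensing before they are reported. The sketch below is one possible way to do that, using the top-10 cut-off described later under the key technology decisions. The field names mirror PerformanceResourceTiming (`name`, `initiatorType`, `transferSize`, `duration`); the function name `summarize_resources`, the output keys, and the rule that `initiatorType == "img"` identifies an image are simplifying assumptions.

```python
from heapq import nlargest


def summarize_resources(entries, top_n=10):
    """Condense resource timing entries into counts plus the top-N offenders.

    entries: list of dicts with name, initiatorType, transferSize (bytes)
    and duration (ms).
    """
    images = [e for e in entries if e["initiatorType"] == "img"]
    requests = [e for e in entries
                if e["initiatorType"] in ("xmlhttprequest", "fetch")]
    return {
        "imageCount": len(images),
        "requestCount": len(requests),
        # Largest images by transferred bytes.
        "largestImages": nlargest(top_n, images, key=lambda e: e["transferSize"]),
        # Requests with the longest load time.
        "slowestRequests": nlargest(top_n, requests, key=lambda e: e["duration"]),
    }
```

Reporting counts alongside the top-N details keeps the payload small while still showing the overall scale of what the page loads.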

Finally, there is some environment information, such as the brand and model of the user's phone, and whether the H5 page was opened in WeChat, in a browser, or in a particular version of our Dewu App.

When a page performance problem occurs, this auxiliary information helps us reproduce the user's actual scenario as closely as possible and resolve the problem efficiently and accurately.

System architecture

The whole system is composed of the following modules:

SDK: Collects users' page performance data and basic information, and sends the performance data to SLS according to a sending policy. Once embedded in a page, it collects performance data on its own, with no interaction with the page code required.

SLS: Alibaba Cloud Log Service, which receives the data sent by the SDK and attaches extra information such as the receive time and IP address to the performance data.

Backend: The performance data back end. This module has two functions. The first is to periodically pull raw performance data from SLS, deduplicate and process it to obtain performance indicators and user information, and store the results in dedicated tables for querying. The second is to serve the API data used for data visualization.

DB: The database that persists the performance log data and the performance indicator data.

Report: The performance data report. By operating the report, you can view the performance indicators of a specific page in a specific project or version.

The relationship between the modules is shown in the figure below:

Key technology decisions

Several key points had to be thought through and decided before development could begin. They are roughly as follows:

  1. On mobile, the unload event is not always triggered, so the SDK has to send data to SLS periodically. To control the sending frequency and reduce duplicate data, we adopted a strategy of gradually lengthening the sending interval: the longer the current page has been open, the less often data is sent. Once the page has been open for a certain length of time, the SDK stops working entirely.

  2. Because the data contains duplicates, a device fingerprint has to be computed on the user's side (browser, WeChat, app WebView). We chose Fingerprintjs2 for this. In the calculation we exclude the browser features that make the fingerprint unstable, so that a user always gets the same fingerprint when opening the page. Fingerprints alone are not enough, however, because devices of the same phone model are likely to produce the same fingerprint. Deduplication therefore combines the user's fingerprint, the client timestamp of the log, and the user's device IP. Judging by our pages' current traffic and its distribution over time, users on the same Wi-Fi, with the same device model, opening the page in the same millisecond are rare enough to ignore, and in practice this scheme works very well.

  3. The SDK contains some synchronous code that needs to run as early as possible after the page loads. This means that if the SDK throws, the page may stop working properly, which is very dangerous. We therefore wrap the SDK's synchronous code in try...catch to ensure that SDK exceptions do not bring down the page.

  4. Some pages load a great many images and send a great many requests, so reporting every record is unrealistic. For this part of the monitoring we therefore report details only for the 10 largest images by byte size and the 10 requests with the longest load times, plus the total number of images loaded and requests sent. This tells us both the overall scale of the page's resource loading and which resources take the longest to load.

  5. The main reason the raw logs are sent to SLS is that the volume of concurrent log traffic is very large, and running our own log server would cost too much. SLS is therefore the more cost-effective choice.

  6. Python + Django is used for the back-end service. Although Python is a scripting language with relatively poor performance, it is easy for front-end developers to pick up, and for a small-scale back-end service it also keeps development efficient. Running the code on PyPy, a Python implementation with a JIT compiler, makes it faster. In addition, combining multiple processes with multiple threads further speeds up data processing.

  7. Since our purpose is statistical analysis, it is not necessary to process all of the performance data. We therefore use stepped sampling: taking one day's data, for every 5% of the log data we process only the first 1,000 entries. Because users report data at effectively random times, this scheme yields an approximately random sample. It greatly reduces the computation needed, lets us process the data on a modest machine, and, if a failure occurs, leaves us time to recover the unprocessed data.

  8. For the database, we chose MySQL. We do not have very strict I/O requirements, so a conventional relational database is sufficient.
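Decisions 1, 2, and 7 above lend themselves to a short sketch. The code below is an illustration under stated assumptions, not the system's actual implementation: the function names, the 1 s initial delay, the doubling factor, the 60 s cut-off, the log field names (`fingerprint`, `clientTime`, `ip`), and the choice to keep the latest of several duplicate reports are all hypothetical. The SDK itself runs in the browser; the schedule is written in Python here only to match the back-end language.

```python
def send_times(first_delay_ms=1000, factor=2, stop_after_ms=60000):
    """Decision 1: page ages (ms) at which the SDK reports, with growing gaps."""
    if first_delay_ms <= 0 or factor <= 1:
        raise ValueError("need a positive delay and a factor greater than 1")
    times, delay, elapsed = [], first_delay_ms, 0
    while elapsed + delay <= stop_after_ms:
        elapsed += delay
        times.append(elapsed)
        delay *= factor  # each gap is longer than the previous one
    return times  # nothing is scheduled past stop_after_ms: the SDK stops


def deduplicate(logs):
    """Decision 2: keep one log per (fingerprint, client timestamp, IP) key."""
    latest = {}
    for log in logs:
        key = (log["fingerprint"], log["clientTime"], log["ip"])
        latest[key] = log  # assumption: a later report supersedes an earlier one
    return list(latest.values())


def stepped_sample(logs, slices=20, per_slice=1000):
    """Decision 7: split the day's logs into 5% slices, keep the first N of each."""
    if not logs:
        return []
    slice_len = -(-len(logs) // slices)  # ceiling division; 20 slices = 5% each
    sample = []
    for start in range(0, len(logs), slice_len):
        sample.extend(logs[start:start + min(slice_len, per_slice)])
    return sample


print(send_times())  # [1000, 3000, 7000, 15000, 31000]
```

With these defaults a page reports five times in its first minute and then goes quiet, and a day's logs are reduced to at most 20,000 sampled entries.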

Future development plan

The current front-end performance monitoring system meets day-to-day monitoring needs, but there is room to go further:

  1. Frame rate statistics: The SDK can already collect frame rate data. However, because the raw frame rate data is very large, the way it is reported needs to change later; for example, reporting only the time intervals in which jank occurs and the frame rate distribution within those intervals.

  2. Replace FCP with more representative indicators such as LCP and FMP.

  3. Additionally include pageName in the deduplication key to further improve deduplication.

Article | Lao lang
