Background

As a national-scale app with 600 million daily users, Douyin offers a large number of user-facing features, and new services are incubated and launched almost every day. Native development and release cycles cannot keep up with this pace of business iteration. As a result, Douyin has a large number of Hybrid business scenarios, built on technology stacks that include traditional WebView, React Native, and Lynx, which was developed in-house at ByteDance.

Lynx, ByteDance’s in-house cross-platform client engine framework, uses JavaScript as its development language, so front-end developers can use a familiar DSL for cross-platform development. During this year’s Spring Festival Gala, Lynx performed very well: it safeguarded the page experience while reducing costs on the client side.

When the Douyin Music business built its in-app Hybrid scenarios, it chose the Lynx framework. Because Lynx is developed in-house, the business has a great deal of autonomy over monitoring. Based on our actual business scenarios, we customized a set of monitoring metrics from the user’s perspective.

Introducing the container

Unlike the traditional H5 scenario, whose container is WebView, Lynx has its own cross-platform container, Bullet. The container is essential for monitoring the full link of a cross-platform service, so let’s briefly introduce Douyin’s cross-platform container, Bullet.

What is Bullet?

Bullet is a general-purpose cross-platform container developed in-house at ByteDance. It can host Lynx, WebView, and React Native at the same time. It bundles the basic capabilities that cross-platform scenarios need and smooths over the differences among the three technology stacks in JSB, resource loading, and other areas, so business teams can integrate once and cover all of their cross-platform scenarios. Bullet containers were also used in the Spring Festival in 2011.

The diagram below shows the structure of Bullet and, broadly, the capabilities it provides. 👇

Lynx page loading process

Before introducing cross-platform monitoring, let’s briefly look at how Lynx pages are loaded in Bullet.

The Bullet architecture breaks Lynx’s complete loading link down into separate sub-tasks, similar to a chain of Promises in front-end code. Each task is followed by the next, and the result of the previous task is used as the input to the next task.

The whole process is roughly divided into four sub-tasks: route resolution, offline resource loading, Lynx Client initialization, and Lynx first-screen rendering.
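To make the Promise-chain analogy concrete, here is a minimal sketch of four sub-tasks chained so that each receives the previous result. The names `LoadContext`, `makeTask`, `runPipeline`, and the task labels are hypothetical and chosen for illustration; they are not Bullet’s actual API, and the stub tasks only record that they ran.

```typescript
// Hypothetical context object handed from one sub-task to the next.
interface LoadContext {
  schema: string;
  steps: string[]; // names of the sub-tasks that have run, in order
}

type SubTask = (ctx: LoadContext) => Promise<LoadContext>;

// Build a stub sub-task that only records its name; real work would go here.
const makeTask = (name: string): SubTask => async (ctx) => ({
  ...ctx,
  steps: [...ctx.steps, name],
});

// The four sub-tasks named in the text, in execution order.
const subTasks: SubTask[] = [
  makeTask("routeResolution"),
  makeTask("offlineResourceLoading"),
  makeTask("lynxClientInit"),
  makeTask("lynxFirstScreenRender"),
];

// Chain the sub-tasks: each one receives the result of the previous one.
function runPipeline(schema: string, tasks: SubTask[]): Promise<LoadContext> {
  const initial: LoadContext = { schema, steps: [] };
  return tasks.reduce<Promise<LoadContext>>(
    (chain, task) => chain.then(task),
    Promise.resolve(initial),
  );
}
```

Because the chain is built with `reduce`, a rejection in any sub-task skips the remaining ones, which is the natural hook for the error handling described later.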

The execution sequence is shown in the following figure 👇

Bullet defines a dedicated set of schema rules for Lynx scenarios. It supports not only parsing and rendering Lynx resources but also degrading to H5.

When a user action opens a Lynx page, Bullet intercepts the schema and performs route resolution. Based on the routing rules and Gecko, Bullet loads the Lynx static resources. Once resource loading completes, Lynx is initialized and the resources are handed to Lynx for parsing and rendering.

Gecko is ByteDance’s resource distribution center. It focuses on client-side file distribution and is a key piece of infrastructure for making clients more dynamic. As a resource distribution channel, it supports flexible, customizable distribution rules that enable rapid service iteration and greatly improve the user experience.

That is the path when every stage works properly. When an exception occurs in any of these stages, Bullet triggers the corresponding error-handling mechanism.

When route resolution fails, Bullet loads a fallback BulletView and presents an error page to the user. When an exception occurs in any other stage and the schema contains a Fallback link, Bullet enters the fallback pipeline, and route resolution and resource loading start again. Unlike the Lynx scenario, a WebView is initialized this time, and the web resources are rendered in that WebView.
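The error-handling rules above can be summarized as a small decision function. This is only a sketch: `Schema`, `fallbackUrl`, `Stage`, and `resolveOutcome` are hypothetical names, and the behavior when no Fallback link exists is an assumption, since the text does not describe that case.

```typescript
type Outcome = "lynx" | "webview-fallback" | "error-page";

interface Schema {
  url: string;
  fallbackUrl?: string; // hypothetical field standing in for the Fallback link
}

// Stage names follow the four sub-tasks of the loading link.
type Stage = "route" | "resource" | "init" | "render";

function resolveOutcome(schema: Schema, failedStage?: Stage): Outcome {
  // Every stage succeeded: the page is rendered by Lynx.
  if (failedStage === undefined) return "lynx";
  // Route resolution failed: show the error page.
  if (failedStage === "route") return "error-page";
  // Any other stage failed: enter the fallback pipeline if a fallback link exists.
  // Assumption: without a fallback link, the error page is shown.
  return schema.fallbackUrl ? "webview-fallback" : "error-page";
}
```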

Monitoring plan

Metric definitions

Given the Bullet and Lynx execution flow described above, and the high experience bar of music scenarios, we defined a set of monitoring metrics that fall into two categories: performance metrics and error metrics.

Performance monitoring

Bullet container first-screen time 🌟🌟🌟🌟🌟

Unlike the FMP metric commonly used in Web scenarios, we monitor the time from the user’s click until the first screen of the Lynx page has loaded. In Douyin, the schema of a Hybrid page is managed by the Router service in Bullet, so the moment the user clicks ≈ the moment the router’s open method in Bullet is called.

The Lynx SDK is responsible for rendering and invokes a callback when rendering completes, so the moment Lynx first-screen rendering completes = the moment the Lynx first-screen callback executes.

This metric measures how long it takes from the user’s click until the first screen of the page is visible. It is one of the core metrics we care about most in our current business scenarios.

Calculation formula: Container first-screen time = time when Lynx first-screen rendering completes – time when the Bullet router intercepts the schema

Lynx first-screen time 🌟🌟🌟🌟

The Bullet container is responsible for the entire Lynx page loading link, while the Lynx first-screen time reflects only the rendering cost on the Lynx side. When the overall link is slow, comparing the Bullet container first-screen time with the Lynx first-screen time shows which stage the time is being spent in.

Calculation formula: Lynx first-screen time = time when Lynx first-screen rendering completes – time when Lynx initialization starts
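As a minimal sketch, both first-screen formulas can be computed from three timestamps. The `LoadTimestamps` shape and its field names are hypothetical; the `preLynx` value is simply the difference between the two metrics, which is the comparison the text suggests for locating slow stages.

```typescript
// Hypothetical timestamps (milliseconds) collected along the loading link.
interface LoadTimestamps {
  routerOpen: number;      // Bullet router intercepts the schema (~ user click)
  lynxInitStart: number;   // Lynx initialization starts
  lynxFirstScreen: number; // Lynx first-screen callback executes
}

function firstScreenTimes(t: LoadTimestamps) {
  const containerFirstScreen = t.lynxFirstScreen - t.routerOpen;
  const lynxFirstScreen = t.lynxFirstScreen - t.lynxInitStart;
  return {
    containerFirstScreen,
    lynxFirstScreen,
    // Time spent before Lynx starts: route resolution and resource loading.
    preLynx: containerFirstScreen - lynxFirstScreen,
  };
}
```

A large `preLynx` value points at route resolution or resource loading, while a large `lynxFirstScreen` points at rendering inside Lynx.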

Local resource interception success rate 🌟🌟🌟

In production, Lynx build artifacts are delivered in advance through the resource distribution platform Gecko. The delivery strategy is to cache resources in the app ahead of time, at an appropriate moment, so that resources can be loaded offline. This is one of the reasons for Lynx’s strong performance.

When loading a resource, the Bullet container first tries the local copy. In some abnormal scenarios, such as network exceptions or Gecko exceptions, the local resource fails to load. In this case, Bullet tries to pull the resource from the remote CDN.

The local resource interception success rate monitors the stability of Gecko and Bullet during resource loading. Whether local resources are intercepted successfully has a significant impact on the link time. Therefore, this indicator can be used to analyze the link time.

Calculation formula: Local resource interception success rate = number of Lynx resources successfully loaded locally / total number of route resolutions

Page degradation rate 🌟🌟🌟

As mentioned in the previous section, Bullet applies fallback handling to exceptions on the Lynx page link and degrades the page to the web scenario. Typically, after degrading to the web, the overall page experience and the first-screen load time are worse than in the Lynx scenario. We monitor this metric and keep it extremely low to ensure the best possible user experience.

Calculation formula: Page degradation rate = total number of fallback pipeline entries / total number of route resolutions
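Both rate metrics share the same denominator, the total number of route resolutions, so they can be derived from one set of counters. The sketch below assumes a hypothetical `RouteCounters` shape; returning 0 when there is no traffic is also an assumption made to avoid dividing by zero.

```typescript
// Hypothetical aggregated counters for one reporting window.
interface RouteCounters {
  routeResolutions: number;  // total number of route resolutions
  localResourceHits: number; // Lynx resources successfully loaded locally
  fallbackEntries: number;   // entries into the fallback pipeline
}

// Assumption: a window with no route resolutions reports a rate of 0.
function rate(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

function computeRates(c: RouteCounters) {
  return {
    localInterceptionSuccessRate: rate(c.localResourceHits, c.routeResolutions),
    pageDegradationRate: rate(c.fallbackEntries, c.routeResolutions),
  };
}
```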

[Ultimate indicator] Custom TTI 🌟🌟🌟🌟

In real business scenarios, we normally show a skeleton screen as the first screen. Apart from a few purely static display pages, a first screen that users can actually interact with depends on data from at least one API.

Because each business scenario defines page interactivity differently, the end point of TTI cannot be measured uniformly. After discussion with the Bullet team, the Bullet container injects a containerInitTime value into the Hybrid page to represent the time of the user’s click.

In the Douyin Music scenario, what we care about most is the time from the user’s click until the first frame of audio plays on the landing page.

Therefore, in the Douyin music chart scenario, TTI = music playback start time – containerInitTime.

Calculation formula: Custom TTI = custom time point – containerInitTime
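A minimal sketch of this formula follows. How a page reads containerInitTime from Bullet is not described here, so it is passed in as a plain parameter; `createPlaybackTTIRecorder` and its `report` callback are hypothetical helpers that report only the first playback event.

```typescript
// Assumption: both values are millisecond timestamps on the same clock.
function customTTI(containerInitTime: number, customTimePoint: number): number {
  return customTimePoint - containerInitTime;
}

// Hypothetical helper for the music scenario: report TTI once, when playback
// first starts, and ignore any later playback events.
function createPlaybackTTIRecorder(
  containerInitTime: number,
  report: (tti: number) => void,
): (now: number) => void {
  let reported = false;
  return (now: number) => {
    if (reported) return;
    reported = true;
    report(customTTI(containerInitTime, now));
  };
}
```

In a page, the returned function would be called with the current timestamp from whatever event marks the start of audio playback.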

Error monitoring

In terms of performance monitoring, we are currently focusing on the user’s time from first screen to interactivity, as well as the time and stability of resource loading in some of the core phases of Lynx loading links.

On error metrics, we focus on some exceptions in the page runtime to ensure the stability of the page.

Page load failure rate 🌟🌟🌟🌟🌟

Although Bullet has many fallback safeguards throughout the Lynx loading link to keep pages loading, we still need to monitor the Bullet page load failure rate in case of a black swan event. This metric is also one of the core indicators of Lynx link stability.

Calculation formula: Bullet page load failure rate = number of page load failures / total number of route resolutions

JS error / JSB error / API request error / engine layer error 🌟🌟🌟🌟

While the page is running, we monitor all kinds of error conditions. This includes JS errors, JSB errors, errors in sending API requests, and errors in the container (engine) layer.

Data analysis & monitoring report

Daily data report

Based on the monitoring metrics above, we built a Feishu (Lark) bot that posts the metrics for each business scenario to the relevant business group chat every day. The following is the metric card of one Douyin Music scenario on a given day.

Threshold setting

The Douyin Spring Festival project in 2015 used the Lynx technology stack. Based on the metrics collected during that project, we defined a set of threshold standards.

Performance threshold

| Platform | Click to first screen | Lynx first screen | H5 degradation rate | Local resource interception rate |
| --- | --- | --- | --- | --- |
| iOS | 500ms | 200ms | 0.0001 | 99% |
| Android | 800ms | 680ms | 0.0001 | 99% |

Error threshold

| Platform | JSB error rate | JS error rate | Engine layer (NA) error rate | API error rate |
| --- | --- | --- | --- | --- |
| iOS | 0.0001 | 0.001 | 0.001 | 0.015 |
| Android | 0.0001 | 0.001 | 0.001 | 0.01 |

Real-time alarm

At present, Hybrid-related tracking data in Douyin is reported to the monitoring platform in real time, and the platform cleans the data in real time.

Using the scheduled detection capability of the monitoring platform, we check for abnormal errors every half hour. An alarm is triggered when an error metric exceeds its configured threshold. (The alarm capability of the monitoring platform is integrated with Feishu, which sends an urgent message when an alarm fires.)
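The periodic check can be sketched as a comparison of observed error rates against the error thresholds listed earlier. The threshold values below are taken from that table; the names `errorThresholds` and `findAlarms` are hypothetical, and treating a value equal to the threshold as healthy is an assumption. The monitoring platform’s real configuration is not shown in this article.

```typescript
type Platform = "iOS" | "Android";
type ErrorMetric = "jsb" | "js" | "engine" | "api";

// Error-rate thresholds per platform, copied from the error threshold table.
const errorThresholds: Record<Platform, Record<ErrorMetric, number>> = {
  iOS: { jsb: 0.0001, js: 0.001, engine: 0.001, api: 0.015 },
  Android: { jsb: 0.0001, js: 0.001, engine: 0.001, api: 0.01 },
};

// Return the metrics whose observed rate exceeds the configured threshold.
// Assumption: a rate exactly equal to the threshold does not raise an alarm.
function findAlarms(
  platform: Platform,
  observed: Record<ErrorMetric, number>,
): ErrorMetric[] {
  const limits = errorThresholds[platform];
  return (Object.keys(limits) as ErrorMetric[]).filter(
    (metric) => observed[metric] > limits[metric],
  );
}
```

A scheduler would run `findAlarms` once per detection window and send a notification for each returned metric.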

Error management

So far, this monitoring scheme has helped us find and fix many online problems. The problems we have addressed mainly fall into the following types:

Scenario 1: Driving first-screen performance optimization of key pages & slimming key APIs

Through daily performance monitoring, we found that during a campaign for the musician program, the performance of the campaign page was below expectations. On Android, the Bullet container first-screen time of the campaign page was over 1s, the Lynx first-screen time was over 800ms, and the first-screen TTI of the page was over 3.5s.

Further analysis of Lynx rendering links shows that the first screen time is mainly spent on resource pack loading.

Lynx rendering link time diagram

We therefore optimized the size of the resource bundle. By trimming the first-screen content of the page and compressing static images, we reduced the resource bundle from 1134KB to 416KB. The bundle size dropped by 63.32%, and the first-screen time of the page dropped by 38.75%.
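As a quick check of the bundle-size figure, the reduction from 1134KB to 416KB can be recomputed directly. The helper name `percentReduction` is ours, introduced only for this check.

```typescript
// Percentage by which a value shrinks going from `before` to `after`.
function percentReduction(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

// Bundle size before and after optimization, in KB, as reported above.
const bundleReduction = percentReduction(1134, 416).toFixed(2); // "63.32"
```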

Change in resource bundle size

To address the high first-screen TTI of the campaign page, we trimmed the API response data and shortened the data-fetching path, which cut the API response time by nearly 1 second.

First-screen API response time

Scenario 2: Driving code fixes in the underlying Lynx layer

Through error monitoring of the Lynx engine layer, we found code problems in the underlying Lynx layer, such as various missing null checks. The engineers responsible for Lynx fixed them promptly, which improved the fault tolerance and stability of the online business.

Null-check compatibility issues in the underlying Lynx layer

Summary & Future planning

This is Douyin Music’s current cross-platform monitoring practice within Douyin. At present, we monitor only some core metrics of the page. Compared with the mature monitoring metrics available for the Web scenario, monitoring for the Lynx scenario still has a long way to go: it lacks metrics such as page smoothness, LCP, and FID.

Our data analysis is also not yet fine-grained enough. For example, different threshold standards should be set for device models in different performance tiers. We also plan to introduce an error rating system for client-side error scenarios. Reported errors would be classified by their actual impact on users into Fatal errors (the page fails to load), Serious errors (users cannot interact normally), and Warning errors (the user experience is affected), to help business developers better assess and handle page errors.
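One possible shape for such an error rating is sketched below. Since this system is still a plan, everything here is hypothetical: the `ReportedError` fields and the `classify` function are illustrative, and a real implementation would need richer signals to decide user impact.

```typescript
type Severity = "fatal" | "serious" | "warning";

// Hypothetical description of a reported error's impact on the user.
interface ReportedError {
  message: string;
  pageLoadFailed: boolean;    // the page could not be loaded at all
  blocksInteraction: boolean; // the user cannot interact normally
}

// Classify by the most severe impact first.
function classify(error: ReportedError): Severity {
  if (error.pageLoadFailed) return "fatal";
  if (error.blocksInteraction) return "serious";
  return "warning"; // the experience is affected, but the page still works
}
```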

In the future, we also hope to build a comprehensive Hybrid performance and error evaluation system that brings together data from the three scenarios of Web, Lynx, and Native, so that developers choosing a technology can compare the strengths and weaknesses of the three stacks clearly and thoroughly on the basis of data.

The road is long and hard, but those who keep walking will arrive; keep going without stopping, and the future is promising.

[Recruitment Information]

We are the Douyin Music R&D team. We support Douyin, a product with hundreds of millions of daily users. Our business vision is to build “the most influential music platform in the world, where everyone can better discover and communicate, and musicians can better create and grow”.

Our existing business:

1. Music Center: provides support for submission, consumption, distribution, and other music scenarios for Douyin and other ByteDance services.

2. Chinese Music: an exploration of music consumption in Douyin and a validation of the value of music.

3. Musician: the music creator platform.

We are hiring for back-end, front-end, client, and QA roles, including experienced hires, campus hires, and interns. Positions are based in Shanghai and Shenzhen.

Welcome to join us! Please send your resume to [email protected].

