What do we mean by fluency? People define fluency differently and have different thresholds for what counts as jank, so before starting this series it is worth clarifying what is covered, to avoid misunderstandings and so that you can read these articles with the right questions in mind. Some basic notes:
- For phone users, jank covers many scenarios: dropped frames while scrolling a list, an overly long white screen at app startup, the screen turning on slowly after pressing the power button, the UI becoming unresponsive and then crashing, no response when tapping an icon, choppy window animations, content not following the finger while swiping, jank on the home screen after a reboot, and so on. These scenarios differ somewhat from what developers think of as jank, because developers classify these problems more finely. This cognitive gap between developers and users matters especially when handling problem reports from users (or testers)
- For developers, the scenarios above fall into three broad categories: fluency (dropped frames while scrolling a list, choppy window animations, jank on the home screen after a reboot), response speed (long white screen at app startup, slow screen-on after pressing the power button, content not following the finger while swiping), and stability (UI unresponsive and then crashing, no response when tapping an icon). They are classified this way because each category has its own analysis methods and steps, so quickly identifying which category a problem belongs to is important
- Technically, users perceive fluency, response speed, and stability (ANR) problems all as jank because the underlying principle is the same: a Message on the main thread takes too long to execute, and the three categories are distinguished by different timeout thresholds. To understand these problems you need to understand some basic operating mechanisms of the system, which this series introduces
- This series focuses on fluency; response speed and stability will be covered in separate articles. Once you understand the fluency material, analyzing response speed and stability becomes much easier
- This fluency series is mainly about analysis with the Systrace (Perfetto) tool. Systrace is the starting point because many factors affect fluency, some in the app itself and some in the system, and Systrace (Perfetto) shows how a problem unfolds from the perspective of the whole device, which makes initial localization convenient
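The idea in the list above, that the three categories are the same main-thread timeout seen at different thresholds, can be sketched roughly as follows. This is a minimal illustration and not how Android itself classifies anything: the 5-second value is the usual default input-dispatch ANR timeout, while `SLOW_RESPONSE_MS` and the function names are hypothetical, and comparing a single Message against the frame budget is a simplification.

```python
# Illustrative thresholds. Only the 5 s input-dispatch ANR timeout is a
# common Android default; SLOW_RESPONSE_MS is an assumption for this sketch.
ANR_INPUT_TIMEOUT_MS = 5000
SLOW_RESPONSE_MS = 500  # hypothetical "feels slow" cutoff


def frame_budget_ms(refresh_hz=60):
    """Length of one Vsync period in milliseconds."""
    return 1000.0 / refresh_hz


def classify_message(duration_ms, refresh_hz=60):
    """Map the duration of one main-thread Message to a problem category."""
    if duration_ms >= ANR_INPUT_TIMEOUT_MS:
        return "stability (ANR)"
    if duration_ms >= SLOW_RESPONSE_MS:
        return "response speed"
    if duration_ms > frame_budget_ms(refresh_hz):
        return "fluency (possible dropped frame)"
    return "ok"


for ms in (8, 30, 800, 6000):
    print(ms, "->", classify_message(ms))
```

Note that the same 12 ms Message is fine on a 60Hz screen but over budget on a 90Hz one, which is why the refresh rate is a parameter.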
The Systrace fluency practice series currently consists of the following three articles:
- Systrace Fluency Practice 1: Understanding the Principles of Jank
- Systrace Fluency Practice 2: Case Study: MIUI Launcher Swipe Jank Analysis
- Systrace Fluency Practice 3: Some Questions That Come Up During Jank Analysis
If you are not familiar with the basic use of the Systrace (Perfetto) tool, you should first complete the Systrace basics series
Systrace is the first tool to reach for when analyzing jank. It lets developers look at a problem from a global perspective, and with it we can roughly determine whether the jank is caused by the system or by the app itself
Of course, as a tool Systrace runs out of steam once the analysis goes deeper. At that point you need TraceView plus the source code to locate and fix the problem, and then Systrace again to verify the fix
This article is therefore mostly about how to find and analyze jank. How to fix it is a matter of finding a suitable solution afterwards, for example by comparing against the Systrace of competing products, optimizing code logic, optimizing system scheduling, or optimizing layouts
Case description
On the Mi 10 Pro, swiping the home screen, the most frequently used scene, always feels slightly janky. The Mi 10 Pro has a 90Hz screen and the launcher also runs at 90 FPS, so any jank is quite noticeable (and I am sensitive to it). I did not pay much attention before, but the problem was still there after upgrading to MIUI 12.5, so I wanted to see what was going on
After analyzing the Systrace, I found this jank scenario to be a very good example, so I am sharing it as a practical fluency case
I recommend downloading the Systrace from the attachment and following along with the article
- Because many factors affect jank, the hardware and software versions used in this analysis are stated up front. If this scenario is optimized in a later release, this article will not be updated, and the Systrace in the attachment is authoritative
- Hardware: Mi 10 Pro
- Software: MIUI 12.5.3 Stable version
- Xiaomi Desktop version: release-4.21.11.2922-03151646
Systrace analysis
To analyze the problem, our general process is as follows
- Capture a Systrace, either from the shell or with an on-device tool
- Open the Systrace file (it ends in .html) in Chrome. If it does not open directly, enter chrome://tracing in the address bar and drag the Systrace file into the page
- Locate the App process in the Systrace
- Locate where the problem occurs, usually with the help of input events
- Analyze the main thread and render thread of the App process
- Analyze the main thread and Binder threads of the SurfaceFlinger process
- Analyze the Binder threads of the SystemServer process
After working through this process in order, go back through it in reverse, connect the clues from each process, and infer the most likely cause
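For the first step, capturing from the shell, here is a minimal sketch that only builds the command line. It assumes the legacy `systrace.py` script from the Android SDK platform-tools is being used. The `-o`, `-t`, and `-a` options and the category names are ones that script accepts, but the helper name, the defaults, and the package name are hypothetical, and the right category set depends on the device.

```python
import shlex


def build_systrace_command(output="trace.html", seconds=5, app=None,
                           categories=("gfx", "input", "view", "sched", "freq")):
    """Return the argv list for a legacy systrace.py capture.

    output:     HTML file to write (-o)
    seconds:    trace duration in seconds (-t)
    app:        optional package name whose app-level trace points to enable (-a)
    categories: trace categories to record
    """
    cmd = ["python", "systrace.py", "-o", output, "-t", str(seconds)]
    if app:
        cmd += ["-a", app]
    cmd += list(categories)
    return cmd


# "com.example.launcher" is a placeholder package name, not the real launcher's.
print(shlex.join(build_systrace_command(app="com.example.launcher")))
```

Swipe the screen during the capture window, then open the resulting HTML file as described in the steps above.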
Start with the Input event
In this Systrace I swiped only once, so the event is easy to locate. A swipe consists of one Input Down event, several Input Move events, and one Input Up event
In the Systrace these show up on the InputDispatcher and InputReader threads in SystemServer, but here we mainly look at the App's main thread
The deliverInputEvent slices on the App's main thread mark the processing of each input event. After Input Up, the App enters the Fling phase
- www.androidperformance.com/2019/11/04/…
- www.androidperformance.com/2020/08/20/…
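As a small illustration of the swipe structure described above (one Down, several Move, one Up, then Fling), the sketch below checks an event list and finds where the Fling phase begins. The event tuples and function names are hypothetical and are not a Systrace data format.

```python
def is_single_swipe(actions):
    """True if actions is DOWN, one or more MOVE, then UP."""
    if len(actions) < 3:
        return False
    return (actions[0] == "DOWN"
            and actions[-1] == "UP"
            and all(a == "MOVE" for a in actions[1:-1]))


def fling_start_ms(events):
    """Return the timestamp of the UP event, after which the Fling phase begins.

    events: list of (timestamp_ms, action) tuples in chronological order.
    Returns None if the events do not form a single swipe.
    """
    if not is_single_swipe([action for _, action in events]):
        return None
    return events[-1][0]


# Made-up timestamps for illustration only.
swipe = [(0, "DOWN"), (11, "MOVE"), (22, "MOVE"), (33, "MOVE"), (44, "UP")]
print(fling_start_ms(swipe))
```

In this case the interesting part of the trace is everything after that timestamp, since the jank occurs after the finger is lifted.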
Analyzing the main thread
This time the jank occurred mainly after the finger was lifted, so we mainly look at the section after Input Up
The frames above the main thread are marked with colors, usually green, yellow, or red. In the Systrace above there are no red frames, only green and yellow. So does yellow always mean jank? Does red? Not necessarily: from the main thread alone we cannot tell whether jank actually occurred, as discussed later
From the main thread we cannot determine whether jank occurred, but we did find three suspicious points. Next we look at the RenderThread
Analyzing the render thread
Zooming in on the first suspicious point, the frame took 19ms in total, of which RenderThread took 16ms, and RenderThread's CPU state was Running (green). The long frame time therefore probably has one of two causes:
- RenderThread's own workload was heavy and time-consuming
- RenderThread's task was limited by the CPU (running at a low frequency or on a small core)
Since this is only a suspicious point, we set the CPU question aside for now and first look at the SurfaceFlinger process to confirm that jank really occurred here
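To put the 19ms and 16ms figures in context, the arithmetic below compares them with the Vsync period of a 90Hz screen. It is only a sketch with hypothetical function names. A frame that spans more than one period is a suspect, not a confirmed dropped frame, which is why the SurfaceFlinger check is still needed.

```python
import math


def vsync_period_ms(refresh_hz):
    """Length of one Vsync period in milliseconds."""
    return 1000.0 / refresh_hz


def vsyncs_spanned(frame_ms, refresh_hz):
    """Number of Vsync periods a frame of the given duration occupies."""
    return math.ceil(frame_ms / vsync_period_ms(refresh_hz))


period = vsync_period_ms(90)
print(f"90Hz Vsync period: {period:.2f} ms")
print("19 ms frame spans", vsyncs_spanned(19, 90), "periods")
print("16 ms RenderThread work spans", vsyncs_spanned(16, 90), "periods")
```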
Analyzing SurfaceFlinger
If you are not familiar with reading the SurfaceFlinger section of a Systrace, read this article first: www.androidperformance.com/2020/02/14/…
We focus on two things here:
- The Buffer status of the BufferQueue belonging to the App. This tells us whether, on the SurfaceFlinger side, the App has a Buffer available for SurfaceFlinger to composite
- Composition on the SurfaceFlinger main thread. Whether SurfaceFlinger composites when vsync-sf arrives tells us whether a frame was dropped
The criteria for judging whether there is a lag are as follows
- If the SurfaceFlinger main thread has no composition task in this Vsync cycle while the App is working normally (vsync-app), and the App's BufferQueue has no Buffer available, the frame is dropped: jank. The figure below shows this case (it is where the first suspicious point is)
- If SurfaceFlinger does composite in this Vsync cycle while the App is working normally (vsync-app), but the App's BufferQueue has no Buffer available, the frame is still dropped: jank. SurfaceFlinger composites only because other apps supplied Buffers. The figure below shows this case (it also appears in the attached Systrace)
- If SurfaceFlinger composites in this Vsync cycle while the App is working normally (vsync-app), and the App's BufferQueue has a Buffer available, the frame is composited normally: no jank. This normal case is shown below for comparison
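The three criteria above can be written as a small decision function. This is a sketch of the article's rules only. The function and parameter names are hypothetical, and the inputs are things you read off the trace by eye, not fields Systrace exports.

```python
def classify_vsync(app_drawing, sf_composited, app_buffer_available):
    """Classify one Vsync cycle using the criteria listed above.

    app_drawing:          the App is actively producing frames (vsync-app is firing)
    sf_composited:        SurfaceFlinger's main thread ran a composition this cycle
    app_buffer_available: the App's BufferQueue had a Buffer ready for SurfaceFlinger
    """
    if not app_drawing:
        # The criteria only apply while the App is working normally.
        return "not applicable"
    if not app_buffer_available:
        if sf_composited:
            return "jank (SurfaceFlinger composited other apps' Buffers)"
        return "jank (SurfaceFlinger had nothing to composite)"
    if sf_composited:
        return "normal"
    # Buffer available but no composition: not covered by the three criteria.
    return "undetermined"


print(classify_vsync(True, False, False))
print(classify_vsync(True, True, False))
print(classify_vsync(True, True, True))
```

The key point the function makes explicit is that the App's Buffer availability decides jank, not whether SurfaceFlinger happened to run.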
Returning to the first suspicious point: the SurfaceFlinger side confirms that a frame really was dropped, because the App had not prepared a Buffer for SurfaceFlinger to composite. Next we need to find out why the App had no Buffer ready for this frame
Back to the render thread
As analyzed above, MainThread + RenderThread took 19ms for this frame, of which RenderThread took 16ms, so let's look at RenderThread
There are two main possible causes:
- RenderThread's own workload was heavy and time-consuming
- RenderThread's task was limited by the CPU (running at a low frequency or on a small core)
However, the load in a home-screen swipe is not high, and nothing extra (such as View updates) happens after the finger is lifted. This frame nevertheless took nearly three times as long as the previous one, so we can rule out increased load as the cause
So we need to look at RenderThread's CPU state:
Since the thread is in the Running state, we go to the CPU Info area to see how the task was scheduled during this period
Analyzing the CPU area
Check out the following two articles: CPU Info and Android Systrace Info
Back to this case: the App's RenderThread ran mostly on CPU 2 and CPU 0, which are small cores (this model uses the Qualcomm Snapdragon 865, which has four small cores, three big cores, and one prime core).
The small cores were already running at their maximum frequency (1.8GHz).
No CPU boost was involved
Our guess, then, is that RenderThread took so long because a small core, even running flat out, could not finish the task in so short a time
To verify this guess, we need to do the following two checks
- Look at the other, normal frames: did any of them run on a small core? If so, and no frame was dropped, our guess is wrong
- Look at the other abnormal frames: were those frame drops also caused by RenderThread running on a small core? If not, we need a different hypothesis
After analyzing several more frames with the same process, we found that
- None of the normal frames ran on a small core. This includes the frame right after a dropped one: the scheduler immediately moved RenderThread to a big core, so there were no consecutive dropped frames
- In the other abnormal frames, RenderThread had also run on a small core, where the weaker performance lengthened RenderThread's execution time and ultimately caused the jank
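The two verification checks can be expressed as a small test over per-frame observations. This is a sketch under stated assumptions: CPU ids 0 to 3 are taken to be the small cluster on this SoC, the frame data is made up for illustration, and treating any time spent on a small core as "ran on a small core" is a simplification.

```python
SMALL_CORES = {0, 1, 2, 3}  # assumed little-cluster CPU ids on this SoC


def ran_on_small_core(cpus):
    """True if RenderThread touched any small core during the frame."""
    return any(cpu in SMALL_CORES for cpu in cpus)


def hypothesis_holds(frames):
    """Check the 'small core causes the drop' guess against observed frames.

    frames: list of (dropped, cpus) where cpus is the set of CPU ids that
    RenderThread ran on during that frame.
    """
    # Check 1: no normal frame ran on a small core.
    normal_ok = all(not ran_on_small_core(cpus)
                    for dropped, cpus in frames if not dropped)
    # Check 2: every dropped frame ran on a small core.
    dropped_ok = all(ran_on_small_core(cpus)
                     for dropped, cpus in frames if dropped)
    return normal_ok and dropped_ok


# Hypothetical observations, not taken from the attached trace.
observed = [(False, {4, 5}), (True, {0, 2}), (False, {6}), (True, {2})]
print(hypothesis_holds(observed))
```

If either check fails, the guess has to be revised, exactly as the two bullet points in the verification step say.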
At this point the analysis has found the cause of the jank: RenderThread was scheduled onto a small core
As for why RenderThread ended up on a small core, that is a scheduler matter. Big/small core placement is directly related to task load, and the scheduler has parameters and algorithms that control when a task drops from a big core to a small one or migrates from a small core to a big one. Subsequent optimization may therefore need to start here
- Adjust the threshold parameters for core migration, or modify the scheduler algorithm
- Use competing products as a reference: look at their performance metrics and scheduling in this scene, and analyze the strategies they may be using
What role does Triple Buffer play in this scenario?
The article on Triple Buffer discusses several of its effects
- Mitigating dropped frames
- Reducing the wait time of the main thread and render thread
- Easing GPU and SurfaceFlinger bottlenecks
So what role does Triple Buffer play in this home-screen swipe jank case? The conclusion first: in some scenarios it did not help and even had a side effect, making the jank more noticeable. The analysis follows
Recall the principle by which Triple Buffer mitigates dropped frames, as explained in that article:
While analyzing the Mi launcher swipe jank, I found a problem: sometimes the number of available Buffers in the launcher's BufferQueue drops from 2 straight to 0, which amounts to discarding a Buffer outright, as shown in the figure below
When that happens, if RenderThread runs long again later in the Fling, the frame is dropped, and Triple Buffer fails to mitigate the dropped frame
As the figure below shows, because a Buffer was discarded, there is again no Buffer available when RenderThread next runs long, and a frame is dropped
Looking at the Buffer-discarding logic, the likely reasoning is that a frame had already been dropped, so the Buffer belonging to the time-consuming frame (or the second Buffer) was discarded. Either way, the content of every frame during the swipe is computed in sequence (see List Fling), so if one frame is discarded while SurfaceFlinger also stalls, the user will notice it
Take a swipe as an example, where offset is the distance moved toward the left of the screen
- Normally, the offsets during the swipe are:
2→4→6→8→10→12
- When one frame is dropped, the offsets are:
2→4→6→6→8→10→12
(If the frame that computes 8 times out, you see 6 twice, which is the ordinary dropped-frame case.)
- As in the figure above, if the time-consuming frame is also discarded, the offsets are:
2→4→6→6→10→12
The offset jumps straight from 6 to 10, which amounts to one stall plus one jump
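The three offset sequences above can be compared frame by frame by looking at the change in offset between consecutive displayed frames. This is a minimal sketch with a hypothetical function name. The step of 2 is simply the per-frame increment used in the example.

```python
def describe_motion(offsets, step=2):
    """Label each frame-to-frame transition as 'ok', 'stall', or 'jump'."""
    labels = []
    for prev, cur in zip(offsets, offsets[1:]):
        delta = cur - prev
        if delta == 0:
            labels.append("stall")   # same content shown twice
        elif delta > step:
            labels.append("jump")    # a computed frame was never shown
        else:
            labels.append("ok")
    return labels


normal = [2, 4, 6, 8, 10, 12]
dropped = [2, 4, 6, 6, 8, 10, 12]
discarded = [2, 4, 6, 6, 10, 12]

for name, seq in (("normal", normal), ("dropped", dropped), ("discarded", discarded)):
    print(name, describe_motion(seq))
```

An ordinary dropped frame produces only a stall, while the discarded-Buffer case produces a stall immediately followed by a jump, which is why it feels worse.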
Series
- Systrace Fluency Practice 1: Understanding the Principles of Jank
- Systrace Fluency Practice 2: Case Study: MIUI Launcher Swipe Jank Analysis
- Systrace Fluency Practice 3: Some Questions That Come Up During Jank Analysis
Attachments
The attachments have been uploaded to GitHub and can be downloaded from: github.com/Gracker/Sys…
- Xiaomi_launcher.zip: Systrace of the home-screen swipe jank, which is the main file analyzed in this case
- Xiaomi_launcher_scroll_all_the_time.zip: Systrace of continuous swiping on the home screen
- Oppo_launcher_scroll.zip: comparison file
About me && blog
- About me, I really hope to communicate with you and make progress together.
- Blog Content navigation
- Excellent Blog Post record – Android performance optimization is a must
Alone you can go faster; together we can go farther