Preface:

In the previous article, we introduced ANR's design principles and influencing factors, and classified the scenarios that lead to ANR. However, the existing system logs are not enough to attribute problems in complex scenarios, and some information simply cannot be obtained from the application side, which makes many online problems hard to analyze. We therefore explored new monitoring capabilities on the application side to make up for the missing information, and we also summarized the log information and analysis ideas we use in daily troubleshooting to help you master these analysis skills. Let's take a look at the implementation.

Raster monitoring tool

As the saying goes, "to do a good job, you must first sharpen your tools". The same is true for daily ANR analysis. A good monitoring tool not only adds the finishing touch when solving routine problems, but also broadens our view and provides more clues and ideas when we face complex and hidden problems online.

Tool introduction:

The tool monitors messages dispatched on the main thread and aggregates them according to certain policies, to minimize the impact of monitoring on application performance and memory jitter. It also monitors the messages that drive the four application components, so that the scheduling and time consumption of such messages can be tracked and recorded. In addition, it records the message currently being dispatched and the messages still pending in the message queue, so that the overall scheduling state of the main thread can be played back when a problem occurs. Finally, we ported the CheckTime mechanism of system services to the application side as a thread CheckTime mechanism, so that when system information is insufficient, the timeliness of thread scheduling can be used to infer the system load and scheduling situation over the recent period.

In one sentence: the tool goes from point to surface, and plays back the past, the present, and the future.

Because of its implementation principle and the message-aggregation effect, the tool is named Raster: it visually displays time-consuming fragments of different lengths along the main thread's scheduling timeline, much like a raster.

Origin of Monitoring Tools

Here’s an example:

The following figure shows an ANR problem encountered offline. From the Trace log obtained from the phone, the main thread stack contains little useful information.

Continuing with the rest of the Trace, including the VM and thread status of each process, as well as the processes with high CPU usage and the system load (CPU, IO) before and after the ANR, we get the following:

Even so, it is difficult to go any further with this information alone, because it only enumerates the current state of each process or thread and does not record the process that drove those indicators to where they are. Scenarios like this occur online in large numbers every day. So how can we handle them better? Here is what we did.

Message scheduling monitoring

In ANR design principle and influencing factors of Android system, we explained that many ANRs are caused by a long-running message or by the continuous accumulation of historical messages, yet we do not know which messages were dispatched before the ANR occurred. If the elapsed time of every message dispatch could be monitored and recorded, the records could be retrieved when an ANR occurs, and we would know exactly what the main thread was doing beforehand. Following this idea, we sketched the schematic diagram below.

However, when we applied this schematic to real business scenarios, we found that most business messages take very little time individually. Recording each one separately would require thousands of records to cover even the 10 seconds (or more) before an ANR, which is clearly unreasonable, and so many records would also be inconvenient to review later.
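As a starting point, per-message timing on the main thread can be hooked through Looper's Printer callback, which Looper.loop() invokes before and after every dispatch. The sketch below is a minimal, hedged illustration of this idea; the class and method names are ours, not the tool's real API.

```java
import android.os.Looper;
import android.os.SystemClock;
import android.util.Printer;

/**
 * A minimal sketch of per-message timing on the main thread, assuming the
 * Printer-based hook (Looper#setMessageLogging) as the monitoring point.
 * Class and method names here are illustrative, not the tool's real API.
 */
public class MainLooperMonitor {

    private long dispatchStartMs;

    public void install() {
        Looper.getMainLooper().setMessageLogging(new Printer() {
            @Override
            public void println(String x) {
                // Looper.loop() prints ">>>>> Dispatching to ..." before a message
                // is handled and "<<<<< Finished to ..." after it returns.
                if (x.startsWith(">>>>> Dispatching to ")) {
                    dispatchStartMs = SystemClock.uptimeMillis();
                } else if (x.startsWith("<<<<< Finished to ")) {
                    long wall = SystemClock.uptimeMillis() - dispatchStartMs;
                    onMessageDispatched(x, wall);
                }
            }
        });
    }

    /** Hand the wall time of one message to the aggregation layer described below. */
    protected void onMessageDispatched(String msgInfo, long wallMs) {
        // e.g. aggregator.append(msgInfo, wallMs);
    }
}
```

The aggregation step described in the next section is what keeps the record volume produced by this hook manageable.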

Message aggregation classification

In practice most messages take little time, and for this class of problem such messages can basically be ignored on their own. We can therefore aggregate them under certain conditions: accumulate the elapsed time of messages over a period, and once the total exceeds a specified threshold, combine the message count and the total elapsed time into a single record. For example, if 16 consecutive messages together exceed 300 ms, one record is generated. This greatly reduces the memory jitter caused by a large number of records and frequent updates: keeping 100 such records is enough to cover the main thread's scheduling history for roughly 30 seconds before the ANR (in practice it may be more than 15 s; we will explain why the range varies later).

Following this idea, we further optimized the aggregation of monitored records and added filtering for key messages. The strategies can be divided into the following types:

Multi-message aggregation:

This strategy applies when the main thread dispatches many consecutive messages, each taking only a small amount of time. Their elapsed time is accumulated until it exceeds the specified threshold, and then a single record is generated, marking how many messages were dispatched in this aggregation. Under this idea, the schematic diagram of main-thread message scheduling when a problem occurs is shown below. Count indicates how many messages the record contains; Wall represents the cumulative time taken to execute this round of messages.

Message aggregation split:

The aggregation strategy above has some special cases. For instance, during the accumulation across multiple messages, the cumulative time may still be below the threshold after the first N messages, but after dispatching the next message the total not only exceeds the threshold but exceeds it significantly. Say the threshold is 300 ms, the first N messages accumulate 200 ms, and adding the next message brings the total to 900 ms. In this case we clearly know the last message is the seriously time-consuming one, so it needs to be recorded separately, marked with its elapsed time and related information, while the elapsed time and count of the previous N messages are combined into their own record. This scenario therefore produces two records at once.

To limit the performance impact of the monitoring tool, in each aggregation round we only save and update the thread CPU time once. When a round is split as described above, the CPU time of the preceding record is set to a default value, because that record is not our focus; all of the CPU time measured in the round is attributed to the later, split-out message.

In some extreme situations, the first monitored message may take only 1 ms, while the following message pushes the cumulative time past 600 ms. The total of the two messages then far exceeds the 300 ms threshold, so the seriously time-consuming message must be recorded separately, and the preceding 1 ms message also ends up as a record of its own. If this keeps happening, the 100 saved records fill up faster and the overall traceable time range of the monitoring will fluctuate. A schematic of this scenario is shown below. Count indicates how many messages this record contains; Wall represents the cumulative time taken to execute this round of messages.
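To make the aggregation and split strategy above more concrete, here is a simplified sketch, assuming a 300 ms threshold and a 100-record ring buffer as stated in the text; the class, field, and tag names are illustrative, not the tool's real implementation.

```java
import java.util.ArrayDeque;

/** A simplified sketch of the aggregation and split strategy; names are illustrative. */
public class RecordAggregator {

    static class Record {
        int count;    // how many messages this record aggregates
        long wallMs;  // cumulative wall time of this round of messages
        String tag;   // e.g. "AGGREGATED", or info about a single heavy message
    }

    private static final long WALL_THRESHOLD_MS = 300;
    private static final int MAX_RECORDS = 100;

    private final ArrayDeque<Record> records = new ArrayDeque<>();
    private int pendingCount;
    private long pendingWallMs;

    public void append(String msgInfo, long wallMs) {
        // Split case: this single message is itself far beyond the threshold, so the
        // messages accumulated before it are flushed as one record and the heavy
        // message is stored on its own (two records produced at once).
        if (wallMs > WALL_THRESHOLD_MS) {
            if (pendingCount > 0) {
                flush("AGGREGATED");
            }
            pendingCount = 1;
            pendingWallMs = wallMs;
            flush(msgInfo);
            return;
        }
        // Normal case: keep accumulating until the round crosses the threshold.
        pendingCount++;
        pendingWallMs += wallMs;
        if (pendingWallMs > WALL_THRESHOLD_MS) {
            flush("AGGREGATED");
        }
    }

    private void flush(String tag) {
        Record r = new Record();
        r.count = pendingCount;
        r.wallMs = pendingWallMs;
        r.tag = tag;
        if (records.size() >= MAX_RECORDS) {
            records.pollFirst();  // drop the oldest record, keep roughly 100 entries
        }
        records.addLast(r);
        pendingCount = 0;
        pendingWallMs = 0;
    }
}
```

Because heavy messages consume records faster than aggregated rounds do, the time range covered by the 100 records fluctuates, which is exactly the effect described above.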

Key message aggregation:

Besides the single messages that need to be split out because of their elapsed time, there is another kind of message we want to mark separately for easier identification: the messages that drive the application components that can cause ANR, such as Activity, Service, Receiver, and Provider. To monitor the execution of these components, we watch the message scheduling of ActivityThread.H, record the execution of these component-related messages separately regardless of how long they take, and flush the one or more previously aggregated messages into their own record. A schematic of this scenario is shown below. Count indicates how many messages this record contains; Wall represents the cumulative time taken to execute this round of messages.
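One rough way to recognize such key messages from the same Printer hook is to check whether the dispatch line targets android.app.ActivityThread$H, whose toString appears in the log line; the string check below is an assumption about the log format, not a guaranteed contract, and the class name here is illustrative.

```java
/**
 * A rough sketch of marking "key" component messages, assuming they can be
 * recognized from the dispatch log line: Activity / Service / Receiver / Provider
 * callbacks on the main thread are dispatched to android.app.ActivityThread$H.
 */
public final class KeyMessageFilter {

    private static final String ACTIVITY_THREAD_HANDLER = "android.app.ActivityThread$H";

    /** Returns true if this dispatch line belongs to a component lifecycle message. */
    public static boolean isComponentMessage(String dispatchLogLine) {
        return dispatchLogLine != null
                && dispatchLogLine.contains(ACTIVITY_THREAD_HANDLER);
    }
}
```

When this check matches, the aggregator would flush the pending round and store the component message as its own record, regardless of its elapsed time.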

IDLE scenario aggregation:

Anyone familiar with MessageQueue knows that the main thread is driven by message-queue scheduling: after each message finishes, the next message is read from the queue and dispatched. If the queue has no message, or the next message's scheduled time has not arrived and no IDLE handler is waiting to run, the main thread enters the IDLE state, as shown in the diagram below.

Because of this scheduling logic, the main thread enters IDLE many times during normal scheduling, and this also involves context switches (for example from the Java environment into the Native environment) where pending requests are checked. For this reason, frequently called interfaces are often the ones caught in the stack when an ANR occurs; in theory, the more frequently an interface is called, the higher the probability it gets hit, as in the stack below.

However, the IDLE time in this scenario can be long or short. If we simply applied the fully continuous aggregation strategy above, messages on either side of a long IDLE gap could be merged into the same record, which would distort the picture to some extent, and that is clearly not what we want. So a further optimization is needed: at the end of each message dispatch we take the current time, and before the next dispatch starts we take the current time again and compute the interval since the last dispatch ended. If the interval is long, it is recorded on its own; if it is short, we consider it negligible and fold it into the statistics of the preceding messages. With this, the monitoring and classification of the various scenarios is complete. A schematic of this scenario is shown below. Count indicates how many messages this record contains; Wall represents the cumulative time taken to execute this round of messages.
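A small sketch of this IDLE classification, assuming the timestamps come from the same Printer hook (markDispatchEnd() at "<<<<< Finished", checkIdleGap() at the next ">>>>> Dispatching"); the names and the threshold are illustrative.

```java
import android.os.SystemClock;

/** A sketch of IDLE-gap classification between two main-thread dispatches. */
public class IdleGapTracker {

    private static final long IDLE_THRESHOLD_MS = 300;  // illustrative, tune per product
    private long lastDispatchEndMs = -1;

    /** Call when a dispatch finishes. */
    public void markDispatchEnd() {
        lastDispatchEndMs = SystemClock.uptimeMillis();
    }

    /** Returns the idle gap in ms if it is long enough to record separately, else -1. */
    public long checkIdleGap() {
        if (lastDispatchEndMs < 0) {
            return -1;
        }
        long gap = SystemClock.uptimeMillis() - lastDispatchEndMs;
        // Long gaps become their own record; short gaps are folded into the
        // aggregation round of the surrounding messages.
        return gap > IDLE_THRESHOLD_MS ? gap : -1;
    }
}
```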

Time-consuming message stack sampling:

The above mainly describes the monitoring and aggregation strategies for main-thread message scheduling, to support playback when an ANR occurs or offline. However, for seriously time-consuming messages, knowing only the elapsed time and the message tag is far from enough, because the business logic inside each message is a black box to us, and the time of each interface carries many uncertainties, such as lock waits, Binder communication, IO, and other system calls.

So we need to know where the time goes inside these time-consuming messages. We compared two schemes:

The first scheme is to instrument every function and count the time between its entry and exit, then aggregate and summarize. Its advantage is that it accurately measures the real time of every function; its disadvantage is the impact on package size and performance, and it is not easy to reuse efficiently across other products.

The second scheme triggers a timeout watch on a child thread when each message starts executing. If the message has not finished when the timeout fires, the main thread stack is captured and the next timeout watch is set for the same message, until the message finishes and the current round of monitoring is cancelled. If the message finishes before the timeout fires, the watch is cancelled and a new one is set when the next message starts. But since most messages take very little time, setting and cancelling a watch for every message would itself hurt performance, so we optimized it using the same time-alignment scheme as the system's ANR timeout monitoring. Specifically:

Take the message start time plus the timeout as the target time. When each timeout fires, check whether the current time is greater than or equal to the target time. If it is, the target time has not been updated, which means the message has not finished, so we capture the stack. If the current time is smaller than the target time, the previous message has finished and a newer message has updated the target time; in that case the asynchronous watcher aligns itself to the new target time, sets the next timeout accordingly, and repeats.

Following this idea, the flow chart is shown below. Note that the sampling timeout should not be too short, otherwise frequent stack capturing would noticeably hurt main-thread performance; nor should it be too long, otherwise the sampling rate becomes too low and the data less trustworthy. The specific duration can be tuned to each product's complexity.
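A simplified sketch of this time-aligned sampling scheme is shown below: the main thread pushes the target deadline forward from the dispatch hook, and a sampler thread grabs the main-thread stack only when a message has overrun it. The 300 ms timeout and all names are illustrative assumptions.

```java
import android.os.Handler;
import android.os.HandlerThread;
import android.os.Looper;
import android.os.SystemClock;

/** A sketch of time-aligned slow-message stack sampling; names and timeout are illustrative. */
public class SlowMessageSampler {

    private static final long TIMEOUT_MS = 300;

    private final HandlerThread samplerThread = new HandlerThread("anr-sampler");
    private Handler samplerHandler;
    private volatile long targetTimeMs = Long.MAX_VALUE;

    public void start() {
        samplerThread.start();
        samplerHandler = new Handler(samplerThread.getLooper());
        samplerHandler.postDelayed(this::check, TIMEOUT_MS);
    }

    /** Called from the dispatch hook when a new main-thread message starts. */
    public void onMessageStart() {
        targetTimeMs = SystemClock.uptimeMillis() + TIMEOUT_MS;
    }

    /** Called from the dispatch hook when the current main-thread message ends. */
    public void onMessageEnd() {
        targetTimeMs = Long.MAX_VALUE;
    }

    private void check() {
        long now = SystemClock.uptimeMillis();
        long target = targetTimeMs;
        long nextDelay;
        if (target == Long.MAX_VALUE) {
            // Main thread is idle between messages: just poll again later.
            nextDelay = TIMEOUT_MS;
        } else if (now >= target) {
            // Target was not pushed forward: the same message is still running,
            // so sample the main-thread stack and keep watching it.
            onStackSampled(Looper.getMainLooper().getThread().getStackTrace());
            nextDelay = TIMEOUT_MS;
        } else {
            // A newer message updated the target: align the next check to it
            // instead of cancelling and re-posting for every short message.
            nextDelay = target - now;
        }
        samplerHandler.postDelayed(this::check, nextDelay);
    }

    protected void onStackSampled(StackTraceElement[] stack) {
        // Aggregate or persist the sampled stack for later playback.
    }
}
```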

Monitor messages being scheduled and time spent:

Besides monitoring the main thread's historical message scheduling before an ANR, we also need to know the message being dispatched, and how long it has been running, at the moment the ANR occurs. Then when we look at the ANR Trace stack, we can tell exactly how long the logic in the current Trace has been executing, which helps us eliminate interference and locate problems quickly. With this monitoring we can answer whether the current Trace stack is time-consuming at all, and by how much, and avoid being misled by the Trace stack.

Get Pending messages:

Besides monitoring the main thread's historical message scheduling and elapsed time, we also need to capture the messages still pending in the message queue when the ANR occurs, which provides more clues for analysis (a dump sketch follows the list below), for example:

  • Whether the messages pending in the queue are blocked, and for how long; the Block duration lets us estimate how busy the main thread is.
  • Whether the application-component message involved in the ANR (for example a Service message) is in the pending queue, its position in the queue, and its Block duration.
  • What kinds of messages are pending and whether they follow a pattern, such as a large number of repeated messages. If so, the business logic behind that message is probably abnormal and interacting with the main thread too frequently (we will come back to this in the case studies).
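As a minimal sketch of capturing the pending messages, the public Looper#dump API prints the Looper's MessageQueue, including the pending messages and their time offsets. The exact output format varies across Android versions, so treat any parsing of it as an assumption; the class and tag here are illustrative.

```java
import android.os.Looper;
import android.util.Log;
import android.util.Printer;

/** A sketch of dumping the main thread's pending messages when an ANR is detected. */
public final class PendingMessageDumper {

    private static final String TAG = "PendingMessages";

    public static void dumpMainQueue() {
        Looper.getMainLooper().dump(new Printer() {
            @Override
            public void println(String x) {
                // Each pending message is printed with its target, callback/what and
                // a time offset relative to "now" (a negative offset means it is blocked).
                Log.i(TAG, x);
            }
        }, "");
    }
}
```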

As mentioned earlier, the elapsed time of a message can be measured along two dimensions: Wall Duration and Cpu Duration. Together they help us judge whether a serious time cost comes from executing a lot of business logic, or from waiting and being preempted; in the latter case, the Wall-to-Cpu ratio of the message is noticeably large. How to distinguish more comprehensively between a message that waits too long and a thread whose scheduling keeps being preempted will be covered later, together with other reference information.
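A small sketch of measuring the two dimensions for one message, assuming the measurement runs on the thread that executes the message (here, the main thread inside the dispatch hook); the class name is illustrative.

```java
import android.os.SystemClock;

/** Measures Wall Duration vs Cpu Duration for one message on the current thread. */
public final class MessageDurations {

    private long wallStartMs;
    private long cpuStartMs;

    public void onStart() {
        wallStartMs = SystemClock.uptimeMillis();            // wall-clock time
        cpuStartMs = SystemClock.currentThreadTimeMillis();  // CPU time of this thread
    }

    /** Returns {wallMs, cpuMs}; a large wall/cpu ratio hints at waiting or preemption. */
    public long[] onEnd() {
        long wall = SystemClock.uptimeMillis() - wallStartMs;
        long cpu = SystemClock.currentThreadTimeMillis() - cpuStartMs;
        return new long[] {wall, cpu};
    }
}
```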

Complete schematic diagram

With Cpu Duration added, the main-thread monitoring is functionally complete, as shown in the schematic below. With this message scheduling monitor, when an ANR occurs we can clearly see the main thread's message scheduling history, the elapsed time of the message being dispatched, the messages still pending in the queue, and the related details. Moreover, the real elapsed time of the main-thread Trace at the moment of ANR is visible at a glance, which settles the common doubt about whether, and for how long, the current stack has been running.

As the introduction shows, we use several aggregation strategies in order to mark single time-consuming messages and key messages, so a record may represent different types of messages. To make them easy to tell apart, we add a Type marker in the visual display.

Example:

For example, the Trace log in the figure below shows that when the ANR occurred, the main thread was blocked in Binder communication. Many people's first reaction would be that the WMS service did not respond to the Binder request in time. However, cross-checking with the message scheduling monitor below, we found that the Wall Duration of the current message was only 44 ms, while two historical messages before it took 2166 ms and 3277 ms respectively. In other words, the Binder call was not the serious cost; the real problem is that the two earlier messages took too long and delayed subsequent message scheduling, and the ANR can only be resolved by fixing both of them. Without a message scheduling monitor, one would blindly analyze the IPC of the current call, very likely head in the wrong direction, and fall into the trap of the Trace stack.

In the next figure, the main thread is dispatching a message that has taken more than 1 s, but an earlier historical message took as long as 9828 ms. Looking further at the pending message queue (in gray), the first pending message has already been blocked for 14 s. From this we can see that the historical message before the ANR message is the real culprit. Of course, the message currently being executed also deserves performance optimization, because as we said earlier, "no message is innocent when an ANR occurs". With these monitoring capabilities, the daily confusion over whether, and for how long, the business logic in the Trace log has been running becomes immediately clear.

Checktime

System Checktime:

CheckTime is the Android system's monitoring of the execution time of frequently accessed interfaces in some system services (AMS, InputService, etc.). When the actual time of such an interface exceeds the expected value, a warning message is printed. It is designed so that this feedback reflects how well a process can be scheduled and respond in the real environment.

The implementation is to take the current system time before and after a function executes and compute the difference; if the elapsed time exceeds the configured threshold, a "Slow operation" message is printed. The CheckTime logic is simple: subtract the reference time from the current system time, and if the difference exceeds 50 ms, print a Warning log. We often see this type of message in Logcat when analyzing offline or system-level problems, but online third-party applications cannot read system logs due to permission restrictions and have to implement something similar themselves.
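As a rough, application-side paraphrase of this pattern (not the framework's actual source), the check looks roughly like the sketch below: the 50 ms threshold comes from the text above, while the tag and message wording are illustrative.

```java
import android.os.SystemClock;
import android.util.Log;

/** A paraphrase of the CheckTime pattern; threshold per the text, naming illustrative. */
public final class CheckTime {

    private static final String TAG = "CheckTime";
    private static final long SLOW_OPERATION_THRESHOLD_MS = 50;

    public static void checkTime(long startTime, String where) {
        long now = SystemClock.uptimeMillis();
        if (now - startTime > SLOW_OPERATION_THRESHOLD_MS) {
            Log.w(TAG, "Slow operation: " + (now - startTime) + "ms so far, during " + where);
        }
    }
}
```

Usage would be to record the start time before a critical section and call checkTime at a few points inside and after it, so the log pinpoints where the slowness accumulated.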

Thread Checktime:

Having understood the design and implementation of the system CheckTime, we can implement something similar at the application layer. Using the periodic scheduling of a dedicated child thread, we take the current system time at each run and subtract the scheduled delay to get the real interval since the thread was supposed to run. For example, if the thread is scheduled every 300 ms but the actual response interval is sometimes much larger, the thread is not being scheduled in time, which in turn indicates that the system's responsiveness has deteriorated.

In this way, even when the online environment cannot obtain system logs, the impact of system load on thread scheduling can still be observed indirectly. As shown in the figure below, when several serious delays appear in a row, thread scheduling is clearly being affected.
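A minimal sketch of this thread CheckTime idea, assuming a 300 ms scheduling interval as in the example above; all names and thresholds are illustrative, and a consistently large delta would suggest the process is not being scheduled in time.

```java
import android.os.Handler;
import android.os.HandlerThread;
import android.os.SystemClock;
import android.util.Log;

/** A sketch of an application-side thread CheckTime; names and thresholds are illustrative. */
public class ThreadCheckTime {

    private static final String TAG = "ThreadCheckTime";
    private static final long INTERVAL_MS = 300;

    private final HandlerThread thread = new HandlerThread("check-time");
    private Handler handler;
    private long lastPostTimeMs;

    public void start() {
        thread.start();
        handler = new Handler(thread.getLooper());
        schedule();
    }

    private void schedule() {
        lastPostTimeMs = SystemClock.uptimeMillis();
        handler.postDelayed(this::onTick, INTERVAL_MS);
    }

    private void onTick() {
        // Actual gap minus the requested delay = how late this thread was scheduled.
        long delayMs = SystemClock.uptimeMillis() - lastPostTimeMs - INTERVAL_MS;
        if (delayMs > INTERVAL_MS) {  // e.g. scheduled more than 300 ms late
            Log.w(TAG, "checkTime delayed " + delayMs + "ms, system may be under heavy load");
        }
        schedule();
    }
}
```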

Summary:

With the monitoring capabilities above, when an ANR occurs we can clearly see the main thread's message scheduling history and the sampled stacks of time-consuming messages, the elapsed time of the message being executed, and the state of the messages pending in the queue. The thread CheckTime mechanism additionally reflects, from the side, how responsive thread scheduling is, so the application-side monitoring now covers the problem from point to surface. Still, this monitoring alone is far from enough for ANR problems; it must be combined with other information for an overall analysis of a more complex system environment. Below we introduce the ANR analysis approach in combination with the monitoring tool.

ANR analysis ideas:

Before introducing the analysis approach, let us first look at which logs are used to analyze such problems. The information that can be obtained differs greatly between environments, such as offline versus online, and application side versus system side. Here we introduce the information we commonly use in daily troubleshooting, mainly the following:

  • Trace log
  • ANR Info
  • Kernel log
  • Logcat log
  • Meminfo log
  • Raster monitoring tool

On the application side, the online environment may only give access to the thread stacks within the current process (this depends on the implementation principle; see ANR design principle and influencing factors of Android system) and ANR Info.

On the system side, almost all of the information above is available. The more information we have for such problems, the higher the success rate of analysis and location. For example, complete Trace logs can be used to analyze cross-process blocking or deadlocks, tight system memory or IO, and even the hardware state, such as low battery and the hardware frequencies (CPU, IO, GPU).

Key information interpretation:

Here we extract and interpret the logs listed above, so that you can relate them to the information available in daily development and online problems.

The Trace information

As described in ANR design principle and influencing factors of Android system, after an ANR occurs the system dumps the thread stack status of the current process and of key processes (key information is highlighted in the red boxes; more on this later), as shown in the following example. The logs contain a lot of information; the commonly used key fields are described below:

  • Thread stack:
    • This is easy to understand: it is the logic the thread is executing when the ANR occurs. However, in many scenarios the captured stack has not actually been running for long; the reason is detailed in ANR design principle and influencing factors of Android system.
  • Thread status:
    • In the figure above, "state=XXX" indicates the thread's current state. Running means the thread is currently being scheduled on a CPU; Runnable means it is ready and waiting for CPU scheduling, like the SignalCatcher thread in the figure; Native means the thread has entered the Native environment from the Java environment and may be executing native logic or waiting there; Waiting indicates an idle state; there are also Sleeping, Blocked, and so on.
  • Thread time:

See "utm=XXX, stm=XXX" in the figure above, which represent the actual CPU running time of the thread from its creation until now, excluding time spent waiting or sleeping. The thread CPU time is further divided into user-space time (utm) and system-space time (stm). The unit is jiffies; when HZ=100, 1 utm equals 10 ms.

  • utm: time spent executing Java-layer and Native-layer logic outside the kernel is counted as user-space time.

  • stm: system-space time. When a Kernel-layer API is called, execution switches from user space to kernel space, and the logic consumed in the kernel is counted as stm, for example file operations such as open, write, and read.

  • core: the serial number of the CPU core that last executed this thread.

  • Thread priority:
    • nice: the lower the value, the higher the thread's priority and the stronger its claim to CPU scheduling. For ordinary application processes (threads), the corresponding kernel priority ranges from 100 to 139. The system adjusts process priority according to the application scenario, and vendors may additionally enable CPU quotas to limit scheduling.
  • Scheduling mode:
    • schedstat: see "schedstat=(1813617862 14167238 546)", which gives the thread's cumulative CPU execution time (ns), its time waiting to be scheduled (ns), and its switch count.

AnrInfo information

In addition to the Trace, the system also captures some system state at the time the ANR occurs, such as the system load before and after the problem and the CPU usage of the Top and key processes. Locally this information is available from the Logcat log, and on the application side it can also be obtained from the API provided by the system (see ANR design principle and influencing factors of Android system). For the ANR Info shown in the figure above, the key fields are introduced below:

  • ANR Type (longMsg):
    • Indicates which type of message or application component caused the ANR, such as Input, Receiver, Service, and so on.
  • System Load:

Indicates the overall system load over different time windows. For example, "Load: 45.53 / 27.94 / 19.57" gives the system CPU load over the 1 minute, 5 minutes, and 15 minutes before the ANR; the value is the average number of tasks (roughly, threads) waiting for the system to schedule them per unit of time. Values that are too high indicate CPU or I/O contention in the system, in which case the scheduling of ordinary processes and threads is affected. If the phone is overheating or low on battery, the system may also limit frequencies or even shut down cores, which degrades its scheduling capability.

You can also correlate the load in these windows with the application's process start time. For example, if the process has been up for only 1 minute but the Load shows heavy load over the previous 5 or even 15 minutes, it indicates that the ANR is largely affected by the surrounding system environment.

  • Process CPU usage:

As shown in the figure above, ANR Info lists the processes with high CPU usage before ("CPU usage from XXX to XXX ago") or after ("CPU usage from XXX to XXX later") the problem, along with each process's user and kernel CPU shares. In many scenarios the system_server process has a high CPU share; whether that matters depends on the situation (see ANR design principle and influencing factors of Android system for why system_server is often high). minor refers to minor page faults, which occur when the accessed file or memory is already in memory but not yet mapped into the current process and has to be accessed via the kernel; major faults are triggered when the content has not been loaded into memory at all, so the cost of a major fault is much higher than that of a minor one.

  • Key processes:
    • kswapd: a Linux kernel thread responsible for page reclaim. It maintains the balance between available memory and the file cache to maximize performance. If its CPU usage is high, the system is short of available memory or memory is badly fragmented, forcing file-cache write-back or memory swapping. A busy kswapd significantly degrades overall system performance and affects the scheduling of all applications.

    • mmcqd: a kernel thread that collects I/O requests from the upper layers and forwards them to the driver layer. If its CPU usage is high, the system is doing a large amount of file reading and writing; of course, insufficient memory also triggers file write-back and swapping to disk.

  • System CPU distribution:

The figure below shows the system's overall CPU usage over a period, broken down into user, kernel, and iowait. If there is heavy file I/O or memory pressure, the iowait share will be high. In that case, look further at the kernel-space CPU usage of the processes above, and then compare the CPU usage of their threads to find the thread, or class of threads, with the largest share.

Logcat logs:

Besides business information, the log also contains keywords that help us judge whether the system performance was in trouble at the time, such as "Slow operation" and "Slow delivery", as shown in the figure below.

Slow operation

For some frequently invoked interfaces, the Android system uses CheckTime-style detection before and after the method to determine whether the execution time exceeds a set threshold. These thresholds are generally set loosely, so if the actual time still exceeds them, it means the scheduling of the system process was affected. Since system processes generally have high priority, delays in their scheduling suggest that the whole system's performance was probably degraded during that period.

The Kernel log:

For the application side, this kind of log is basically unavailable, but in offline tests, or if you work on system development, you can view it with the dmesg command. In kernel logs we mainly look at the lowmemkiller information, as shown below:

Lowmemkiller:

Anyone working on performance (memory) optimization is familiar with this module. It monitors and manages the system's available memory; when memory is tight, it kills lower-priority applications from the kernel layer to free memory. Process priority (oom_score_adj) is computed by the AMS service based on the application's current state, such as foreground or background, and the types of components the process hosts. For example, foreground applications that users clearly perceive have the highest priority, and some system services and background services (for example playback scenarios) also get higher priority. Of course, vendors also apply a lot of customization (optimization) to prevent third-party applications from exploiting loopholes in the design and setting their own processes' priority too high in order to stay alive.

Message scheduling sequence diagram:

As shown in the figure above, after analyzing the system logs we can lock down or narrow the scope further, but eventually we come back to the main thread to analyze the business logic and elapsed time of the Trace stack, so as to know exactly how long the message being dispatched has been running. However, in many cases the current Trace stack is not the answer we expect. We then need to check the main thread's scheduling before the ANR and evaluate how historical messages affected subsequent scheduling, in order to find the real culprit.

Sometimes we also need to refer to the messages pending in the queue. From them we can not only see how long the application-component message involved in the ANR has been blocked, but also see what other messages are there; their characteristics can sometimes provide strong evidence and direction for the analysis.

Analysis methods

Above we explained the key information in the various logs. Now let us introduce how we analyze ANR problems in daily work, summarized as follows:

  • Analyze the stack to see whether there is an obvious business problem (such as a deadlock or a seriously time-consuming business call). If not, use ANR Info to check whether the system load is too high and the overall performance poor, in terms of CPU, Mem, and IO; then determine whether the load comes from this process or from other processes; finally, compare the CPU usage of individual threads to find the suspicious ones.

  • Combine the above with the information collected by the monitoring tool: check which main-thread messages took a long time in the period before the ANR, look at the stacks sampled while those messages were executing, aggregate them by stack, and then compare them to identify the interface or business logic that is seriously time-consuming.

The above analysis ideas can be further divided into the following steps:

  • Step 1, look at the Trace:

    • Deadlock stack: check whether the Trace stack shows an obvious problem, such as the main thread being deadlocked with another thread. If the deadlock is inside the process, congratulations, this kind of problem is much clearer: just find the thread holding the lock the current thread is waiting on, and fix it.

    • Business stack: the main thread stack is executing business logic. You find the owner of that business code, and they admit the logic really has a performance problem; congratulations, you have probably solved the problem. Why only "probably"? Because some problems depend on the technology stack or framework design and cannot be fixed in the short term. If instead the business owner reports that the logic is very simple and hardly time-consuming, which also happens frequently, then we need the monitoring tool to trace the cost of the historical messages.

    • IPC Block stack: the Trace shows the main thread in Binder communication. This alone does not prove the Binder call caused the block; the request may have been sent only a moment ago, so further analysis is needed, again with our own monitoring tool.

    • System stack: the current stack is just an ordinary system stack, such as the common NativePollOnce scenario. To determine whether serious time consumption actually occurred, and to localize it further, we again rely on the self-developed monitoring tool.

  • Step 2, look at the keywords: Load, CPU, Slow operation, Kswapd, Mmcqd, Kworker, Lowmemkiller, etc.

As introduced above, these keywords reflect the system's CPU, Mem, and IO load. After analyzing the main-thread stack, we search for these keywords in ANR Info, Logcat, or the kernel log, and from their values determine whether system resources (CPU, Mem, IO) are tight.

  • Step 3, look at the system load distribution: the overall user, sys, and iowait shares

The system load tells us whether CPU or I/O resources are tight. If the system is heavily loaded, there must be one or more processes responsible; an overloaded system in turn affects the scheduling performance of every process. By looking at the user and sys CPU shares, we can further tell whether the load sits in application space or in system space (for example heavy kernel activity such as file I/O, or memory shortage causing the system to reclaim memory), which narrows the scope of the investigation.

  • Step 4, look at the CPU usage of Top processes

Once we know whether the excessive load sits in user space or kernel space, we use the CPU list in ANR Info to lock down which process is driving it, and then look at each candidate process's total CPU share and its user/sys split.

  • When analyzing process CPU shares, one key point is to determine whether the high CPU of these processes occurred before or after the ANR. The figure below shows process CPU usage in the window from 4339 ms to 22895 ms before the ANR.

  • Step 5, look at thread CPU to lock the thread: compare the CPU usage of each thread, and its user/kernel split

Having locked the direction via the system load (user, sys, iowait) and the target process via the process list, we then analyze the utm and stm of each thread inside the target process to determine which thread is the problem.

In the thread information of the Trace log, the utm and stm of each thread are clearly visible. At this point we have walked the load analysis from the system to the process and then to the threads inside the process. Of course, sometimes it is not the current process that loads the system but some other process, which can still affect other processes badly enough to cause an ANR.

  • Step 6, look at the message scheduling details to lock the message:

    • After clarifying whether the system load is normal and which process is responsible for any excess, the next step is to use the monitoring tool to check whether the problem is the current message taking too long, historical messages taking too long, or messages arriving too frequently. At the same time, the thread CheckTime data tells us whether the process's CPU scheduling was timely and how badly it was affected. Once the scenario is locked down, the sampled stacks of the time-consuming messages are the key to finally solving the problem. Of course, a time-consuming message may contain one or several slow interfaces, or there may be several messages each containing slow interfaces, which is why we said earlier that "no message is innocent when an ANR occurs".

More information:

Besides the information above, we can also use the Logcat log to check whether the business side or the system side produced any abnormal output, for example by searching for keywords such as "Slow operation" and "Slow delivery". We can also observe whether the current process or system processes are GC-ing frequently, and so on, to get a more complete picture of the system state.

Conclusion:

Above, we focused on the monitoring tool built around main-thread message scheduling, which provides point-to-surface monitoring so that when an ANR occurs, the past, present, and future of the main thread can be evaluated clearly and intuitively. Based on daily practice, we also described the log information and analysis approach commonly used for ANR problems on the application side.

The Raster monitoring tool has become a powerful aid for ANR analysis because it improves the efficiency and success rate of problem location. It has been integrated into the company's performance and stability monitoring platform and is widely used by many of the company's products. Next, we will use this tool, together with the analysis approach above, to talk about how to quickly analyze and locate different types of ANR problems in practice.

Android Platform architecture team

As ByteDance's Android platform architecture team, we mainly serve Toutiao, GIP, and other products of the company. We continuously optimize and explore user experience, R&D processes, and architecture in terms of product performance and stability, to support rapid product iteration while maintaining a high-quality user experience. If you are passionate about technology and want bigger challenges and a bigger stage, welcome to join us. Positions are available in Beijing and Shenzhen; if interested, send an email to [email protected] with the title: name - GIP - Android platform architecture.