Preface
ANR problems are familiar to anyone doing Android development. In day-to-day work we run into all kinds of problems caused by the application or even by the system. Because we often do not understand the underlying mechanism, we can be at a loss when facing them. At the same time, inadequate monitoring or limited access to information makes these problems look like a reflection in a mirror: visible but out of reach, so the real cause is hard to track down. See the diagram below:
At work, while helping colleagues analyze such problems, I found that many of them asked where they could learn the topic more systematically. In the spirit of "better to teach someone to fish than to give them a fish", and drawing on my own understanding and practice, this series gives a comprehensive summary of ANR across several chapters: design principles, influencing factors, tool building, analysis approach, case studies, and optimization. I hope it helps you better understand and answer the following questions in your future work:
- What is ANR?
- How is ANR designed in the system?
- When an ANR occurs, what is the system's workflow and what information does it collect?
- What are the causes of ANR?
- How to analyze this kind of problem?
- How to locate problems more quickly and accurately?
- What can we do proactively about this kind of problem?
Main text
Before we formally analyze the ANR issue, let’s look at the following questions:
- How is the system designed for ANR, and which services or components are subject to ANR?
- When ANR occurs, how does the system work and what information does it get?
- Which scenarios affect ANR, and how do we classify them?
Knowing these things helps us stay composed when facing all kinds of problems. We introduce and answer these questions one by one below.
ANR design principles
ANR stands for Application Not Responding. The Android ANR mechanism monitors timeouts on component interactions (Activity, Service, Receiver, Provider) and on user interaction (InputEvent) to determine whether the application process (its main thread) is stuck or responding too slowly. Put simply, it is the watchdog design idea found in many systems.
Component timeout classification
When the system sends a component message or an input event to the application process through Binder, the AMS and Input services each set up asynchronous timeout monitoring. The timeout period naturally varies with the type of event. The following are the timeout thresholds Android sets for the different types:
(The picture is for reference only. Domestic vendors may adjust these values, and each vendor's standard differs.)
Example of Broadcast timeout
After looking at the timeout thresholds for different types of messages, let’s look at how timeout monitoring is designed.
Take the BroadcastReceiver timeout as an example. Broadcasts are divided into ordered and unordered broadcasts, and also into foreground and background broadcasts. Timeout monitoring is set only for ordered broadcasts, and the timeout duration depends on whether the broadcast is a foreground or a background one: 60 s for a background broadcast and 10 s for a foreground broadcast. Let's look at how broadcast messages are sent, together with the code.
- Unordered broadcast:
For an unordered broadcast, the system collects all receivers and then sends to them in one pass, as shown in the following figure:
As the figure shows, no timeout monitoring is set for unordered broadcasts. The broadcast is sent to all receivers at once, and the system does not care when the application side receives it or responds (comparable to UDP transmission).
- Ordered broadcast:
Now look at the send and receive logic of an ordered broadcast. Also inside the AMS service, BroadcastQueue takes the broadcast currently being sent, fetches its next receiver, updates the send timestamp, and computes and sets the timeout from that moment. (The system applies an optimization here: to avoid cancelling and re-arming the timeout monitor every time a broadcast is received normally, it reuses the pending timeout and re-aligns it.) Finally the broadcast is sent to the receiver; once the client's completion notification arrives, the next receiver is handled, and the process repeats.
In the client process, a Binder thread receives the broadcast from the AMS service, wraps it in a Message, and posts it to the main thread's message queue (inserted at the position corresponding to the current time). This design is the source of many message-scheduling latency problems, which we will discuss in detail later. The message receiving logic is as follows:
Normally the client responds to a broadcast request promptly and then notifies the AMS service, which cancels the timeout monitoring. However, if the business logic or the system is in an abnormal state and the broadcast is not scheduled in time, or the system service is not notified in time, the timeout fires on the system side and the application process is judged to have timed out. The timeout handling logic in AMS is as follows:
final void broadcastTimeoutLocked(boolean fromMsg) {
    ...
    long now = SystemClock.uptimeMillis();
    BroadcastRecord r = mOrderedBroadcasts.get(0);
    if (fromMsg) {
        // Avoid frequently cancelling and re-arming the timeout message, as mentioned earlier
        long timeoutTime = r.receiverTime + mTimeoutPeriod;
        if (timeoutTime > now) {
            setBroadcastTimeoutLocked(timeoutTime);
            return;
        }
    }
    ...
    Object curReceiver;
    if (r.nextReceiver > 0) {
        // Get the broadcast receiver that timed out
        curReceiver = r.receivers.get(r.nextReceiver - 1);
        r.delivery[r.nextReceiver - 1] = BroadcastRecord.DELIVERY_TIMEOUT;
    } else {
        curReceiver = r.curReceiver;
    }
    Slog.w(TAG, "Receiver during timeout of " + r + ":" + curReceiver);
    ...
    if (app != null) {
        anrMessage = "Broadcast of " + r.intent.toString();
    }
    ...
    if (!debugging && anrMessage != null) {
        // Notify the AMS service to handle the current timeout
        mHandler.post(new AppNotResponding(app, anrMessage));
    }
}
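The timeout reuse handled by the fromMsg branch above can be sketched in plain Java with a simulated clock. This is a minimal illustration, not the AOSP implementation: the class AlignedTimeoutSketch, its fields, and the explicit time arguments are hypothetical, and only the 10 s foreground timeout comes from the text. The pending timeout is left in place when a receiver finishes in time; when it fires, the deadline is recomputed from the current receiver's dispatch time, and the timeout is re-armed if that deadline has not yet passed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of "re-arm instead of cancel" timeout handling. */
class AlignedTimeoutSketch {
    static final long TIMEOUT_MS = 10_000;  // foreground broadcast timeout from the text

    long receiverTime;           // when the current receiver was dispatched
    long pendingTimeoutAt = -1;  // when the single pending timeout fires (-1 = none)
    boolean receiverPending;
    final List<String> log = new ArrayList<>();

    /** Dispatch to the next receiver; arm a timeout only if none is pending. */
    void dispatch(long now) {
        receiverTime = now;
        receiverPending = true;
        if (pendingTimeoutAt < 0) {
            pendingTimeoutAt = now + TIMEOUT_MS;
            log.add("arm@" + pendingTimeoutAt);
        }
    }

    /** Receiver finished in time; the pending timeout is deliberately NOT cancelled. */
    void finish() {
        receiverPending = false;
    }

    /** Called when the pending timeout fires. Returns true if a timeout is declared. */
    boolean onTimeoutFired(long now) {
        pendingTimeoutAt = -1;
        if (!receiverPending) {
            return false;                    // nothing in flight
        }
        long deadline = receiverTime + TIMEOUT_MS;
        if (deadline > now) {                // current receiver still has time left
            pendingTimeoutAt = deadline;     // re-arm for its real deadline
            log.add("rearm@" + deadline);
            return false;
        }
        log.add("anr@" + now);
        return true;
    }

    public static void main(String[] args) {
        AlignedTimeoutSketch queue = new AlignedTimeoutSketch();
        queue.dispatch(0);        // receiver A dispatched, timeout armed for t=10000
        queue.finish();           // A finishes; the timeout stays pending
        queue.dispatch(3_000);    // receiver B dispatched at t=3000, no new timeout armed
        System.out.println(queue.onTimeoutFired(10_000));  // B's deadline is 13000
        System.out.println(queue.onTimeoutFired(13_000));  // B is still pending
        System.out.println(queue.log);
    }
}
```

The trade-off is one extra wake-up per stale timeout in exchange for not touching the message queue on every successful delivery.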
That concludes the analysis of broadcast sending and timeout monitoring. We now have a basic picture of how the broadcast timeout mechanism is designed and how it works. The overall flow diagram is as follows:
ANR Trace Dump process
Above we used broadcast reception to introduce how the system monitors timeouts. Next we look at the system's workflow once an ANR occurs.
ANR Information Acquisition:
Continuing with the broadcast example: as described above, once a timeout is determined, the AMS interface of the system service is called to collect ANR information and archive it (data/anr/trace, data/system/dropbox). The entry point is as follows:
Once inside AMS, AppError first checks the scenario: whether the current process is already in an ANR and being dumped, whether it has crashed, and whether it has been killed by the system. It also considers cases such as the system shutting down. If none of these conditions holds, the current process is considered to have hit an ANR.
Next, the system determines whether this ANR is perceptible to the user, for example whether the process is a low-priority background process (one with no important service or Activity UI).
It then starts collecting the processes related to this one, as well as the system's core service processes. For example, if system processes that frequently interact with applications, such as SurfaceFlinger and SystemServer, are blocked while responding, the application's IPC calls are blocked as a result.
SystemServer (the system service process) is added as follows:
It then collects the other core system processes separately, because these service processes are created directly by the init process and are not managed by SystemServer or Zygote.
With this first step done, the system starts collecting detailed information from each of these processes, such as virtual machine information, Java thread states, and stacks, to learn what these processes and the system are doing at that moment. That is the ideal; reality falls short of it, and we will explain why later.
Why does the system collect information about other processes? From a performance perspective, any process with high CPU or IO usage competes for system resources and delays the scheduling of other processes. Here is the system's dump flow from a code perspective:
private static void dumpStackTraces(String tracesFile, ArrayList<Integer> firstPids,
        ArrayList<Integer> nativePids, ArrayList<Integer> extraPids,
        boolean useTombstonedForJavaTraces) {
    ...
    // To limit the performance impact, the whole dump is capped at 20 seconds;
    // once the budget is used up, the remaining processes are skipped
    remainingTime = 20 * 1000;
    try {
        ...
        // Dump Java traces of the processes in firstPids, in priority order
        int num = firstPids.size();
        for (int i = 0; i < num; i++) {
            final long timeTaken;
            if (useTombstonedForJavaTraces) {
                timeTaken = dumpJavaTracesTombstoned(firstPids.get(i), tracesFile, remainingTime);
            } else {
                timeTaken = observer.dumpWithTimeout(firstPids.get(i), remainingTime);
            }
            remainingTime -= timeTaken;
            if (remainingTime <= 0) {
                // Out of time: the remaining processes are not dumped
                return;
            }
        }
        // Then dump native backtraces of the processes in nativePids
        for (int pid : nativePids) {
            final long nativeDumpTimeoutMs = Math.min(NATIVE_DUMP_TIMEOUT_MS, remainingTime);
            final long start = SystemClock.elapsedRealtime();
            Debug.dumpNativeBacktraceToFileTimeout(
                    pid, tracesFile, (int) (nativeDumpTimeoutMs / 1000));
            final long timeTaken = SystemClock.elapsedRealtime() - start;
            remainingTime -= timeTaken;
            if (remainingTime <= 0) {
                // Out of time: the remaining processes are not dumped
                return;
            }
        }
        // Finally dump Java traces of the processes in extraPids
        for (int pid : extraPids) {
            final long timeTaken;
            if (useTombstonedForJavaTraces) {
                timeTaken = dumpJavaTracesTombstoned(pid, tracesFile, remainingTime);
            } else {
                timeTaken = observer.dumpWithTimeout(pid, remainingTime);
            }
            remainingTime -= timeTaken;
            if (remainingTime <= 0) {
                // Out of time: the remaining processes are not dumped
                return;
            }
        }
    }
    ......
}
Dump Trace process
For security reasons, processes are isolated from each other, and even a system process cannot directly read another process's information. The request therefore has to be delivered to the target process through IPC. On receiving the signal, the target process dumps its own information and sends it to the system process. Taking Android P as an example, the general flow is as follows:
The application process's ability to receive and respond to the signal is implemented in the virtual machine. During virtual machine startup, it registers for and listens to the signal (SIGQUIT). The initialization logic is as follows:
After receiving SIGQUIT, the SignalCatcher thread first dumps information about the current virtual machine, such as memory state, objects, loaded classes, and GC, and then sets a checkpoint flag on each thread to request that it suspend. When the other threads go through a context switch, they check this flag and voluntarily suspend themselves if a suspend request is set. Once all threads are suspended, the SignalCatcher thread walks through them, dumps each thread's stack and thread data, and then wakes the threads up. If some threads fail to suspend before the timeout, the dump fails and a timeout error is raised.
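The cooperative suspension described above can be illustrated with a small plain-Java sketch. It is only an analogy under stated assumptions, not how the virtual machine is implemented: the class CheckpointSuspendSketch, its latches, and the demo workers are all hypothetical. Workers poll a flag at their checkpoints and park themselves when a suspend is requested; the dumper waits for all of them up to a timeout, and the dump counts as failed if any worker never reaches a checkpoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: cooperative thread suspension through a checkpoint flag. */
class CheckpointSuspendSketch {
    private final CountDownLatch parked;                          // counts workers that have suspended
    private final CountDownLatch resume = new CountDownLatch(1);  // opened when the dump is over
    private volatile boolean suspendRequested;

    CheckpointSuspendSketch(int workers) {
        parked = new CountDownLatch(workers);
    }

    /** Workers call this between units of work; they park only if a suspend was requested. */
    void checkpoint() throws InterruptedException {
        if (suspendRequested) {
            parked.countDown();
            resume.await();
        }
    }

    /** Requests suspension, waits up to timeoutMs, then always resumes the workers. */
    boolean suspendAllAndDump(long timeoutMs) throws InterruptedException {
        suspendRequested = true;
        try {
            // A real dumper would walk every suspended thread's stack once this is true.
            return parked.await(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            suspendRequested = false;
            resume.countDown();
        }
    }

    /** Runs the demo; stuckWorker adds a thread that never reaches a checkpoint. */
    static boolean demo(int cooperativeWorkers, boolean stuckWorker, long timeoutMs) {
        int total = cooperativeWorkers + (stuckWorker ? 1 : 0);
        CheckpointSuspendSketch vm = new CheckpointSuspendSketch(total);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < cooperativeWorkers; i++) {
            threads.add(new Thread(() -> {
                try {
                    while (true) { vm.checkpoint(); Thread.sleep(1); }
                } catch (InterruptedException e) { /* demo over */ }
            }));
        }
        if (stuckWorker) {
            threads.add(new Thread(() -> {
                try { Thread.sleep(60_000); } catch (InterruptedException e) { /* demo over */ }
            }));
        }
        for (Thread t : threads) { t.setDaemon(true); t.start(); }
        try {
            return vm.suspendAllAndDump(timeoutMs);
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            for (Thread t : threads) t.interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("all cooperative: " + demo(3, false, 2_000));
        System.out.println("one stuck:       " + demo(3, true, 200));
    }
}
```

The second call shows why the dump itself needs a timeout: one thread that never reaches a checkpoint would otherwise stall the whole collection.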
According to the process sorted out above, the working process of SignalCatcher acquiring information of each thread is as follows:
This completes the introduction to the system's basic design. Using broadcast sending as the example, we explained how the system determines an ANR, how it collects system and process information afterwards, and how other processes help the system process complete log collection.
Overall, the chain is long and involves many processes. To limit the impact on applications and on the system itself, the system sets timeout checks at many points along it. The flow above also shows that when an ANR occurs, the system process not only sends signals to other processes but also dumps its own trace, gathers overall system and per-process CPU usage, and writes the dump data sent back by other processes to files. This overhead puts a heavy load on the system process during an ANR, which is the main reason the SystemServer process often shows a high CPU share in ANR traces.
How does the application layer determine ANR
Since Android M (6.0), the application side can no longer detect an ANR directly by watching the data/anr/trace file. What other means are there to determine that an ANR has occurred? Here is a brief introduction.
From the application's point of view, the system offers no friendly mechanism for notifying an app that it has hit an ANR, and much of the relevant information is off limits to apps. Third-party apps still need to care about basic user experience, so many companies have explored this extensively and come up with different solutions. There are currently two main approaches:
- Main thread watchdog mechanism
The core idea is to post probe messages to the main thread periodically from the application layer and to set up timeout monitoring asynchronously. If the status of a probe message is not updated within the specified time, an ANR is judged to be possible. Why only possible? Because further data must be obtained from system services (how to obtain it is described below) to confirm whether an ANR has really occurred.
- Listen for the SIGQUIT signal
This approach is used at many companies and is well documented online, so only the main idea is given here. As mentioned above, the virtual machine handles the dump request by registering for and listening to SIGQUIT. If you are familiar with the signal mechanism, you can register for the same signal in the application layer. Note, however, that registering replaces the handler the virtual machine registered earlier, so it must be restored at the appropriate time; otherwise, expect the system (vendor) to come knocking.
When the signal is received, the scenario is filtered to confirm that a user-perceptible ANR has occurred. Then either each thread's stack is obtained from the Java layer, or the virtual machine's internal thread-stack dump interface is obtained through reflection and, at its mapped function address in memory, called forcibly, with the data redirected to local output.
This approach follows the same path the system uses to collect information and yields more complete thread and VM information. Its drawback is the performance cost: for a complex app, one dump can take more than 10 seconds.
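The watchdog idea from the first approach, combined with capturing a stack from the Java layer, can be sketched as follows. This is a minimal illustration under stated assumptions, not a production monitor: the class MainThreadWatchdogSketch and its methods are hypothetical, and a single-thread executor stands in for the main thread's message queue so the example runs on a plain JVM (on Android the probe would be posted to the main Looper instead). A non-null result only means an ANR is suspected; as the text notes, confirmation still has to come from the system.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical watchdog; a single-thread executor stands in for the main thread's queue. */
class MainThreadWatchdogSketch {
    private final ExecutorService mainQueue;
    private final Thread mainThread;

    MainThreadWatchdogSketch(ExecutorService mainQueue) throws Exception {
        this.mainQueue = mainQueue;
        Callable<Thread> whoAmI = Thread::currentThread;
        this.mainThread = mainQueue.submit(whoAmI).get();  // remember the thread to inspect later
    }

    /** Posts a probe; returns null if it ran in time, else the main thread's current stack. */
    StackTraceElement[] probe(long timeoutMs) throws InterruptedException {
        CountDownLatch ran = new CountDownLatch(1);
        mainQueue.execute(ran::countDown);
        if (ran.await(timeoutMs, TimeUnit.MILLISECONDS)) {
            return null;                      // the probe was scheduled in time
        }
        return mainThread.getStackTrace();    // suspected ANR: capture what is running now
    }

    static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Blocks the stand-in main thread for blockMs, then probes with the given timeout. */
    static boolean suspectsAnr(long blockMs, long timeoutMs) {
        ExecutorService queue = Executors.newSingleThreadExecutor();
        try {
            MainThreadWatchdogSketch dog = new MainThreadWatchdogSketch(queue);
            if (blockMs > 0) {
                queue.execute(() -> sleepQuietly(blockMs));  // a slow "message"
            }
            StackTraceElement[] stack = dog.probe(timeoutMs);
            if (stack != null) {
                for (StackTraceElement frame : stack) System.out.println("  at " + frame);
            }
            return stack != null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            queue.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("idle queue suspected:    " + suspectsAnr(0, 1_000));
        System.out.println("blocked queue suspected: " + suspectsAnr(800, 100));
    }
}
```

Capturing only the main thread's stack keeps the cost low; dumping every thread gives more context but moves toward the heavier cost described above.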
How does the application layer get ANR Info
As mentioned above, whether the watchdog or the signal approach is used, the result needs further filtering to ensure that only the ANR scenarios we care about are collected. The interface provided by the system should therefore be used to confirm whether the current application really has a problem (ANR, crash).
Besides the state of each thread in the process, we also need to know the state of the system and of other processes, such as system CPU, memory, and IO load and the CPU usage of key processes, in order to judge whether the system environment was normal when the problem occurred.
Interfaces for obtaining information are as follows:
The information obtained through this interface is shown below. The keywords marked with red boxes in the figure are explained in detail in the later chapter on ANR analysis approach:
Influencing factors
The above describes how the system sets timeout monitoring for various types of messages and how the system and application sides obtain various types of information after the timeout is detected. With that in mind, let’s look at how the ANR problem arises and how we categorize the factors that affect ANR.
Here’s an example:
At work, a colleague once asked: "My Service has very simple logic. Why does it cause an ANR?" In fact, the stack and our monitoring tools showed that the Service he was talking about had not even been scheduled yet. He had seen on our internal monitoring platform that the Service caused the ANR, as shown in the figure below:
Below we explain why this happens.
Q&A
As described above, system services (AMS, InputService) send messages that carry a timeout, such as creating a Service, delivering to a Receiver, or dispatching an input event, to the target process through Binder or another IPC mechanism, and start asynchronous timeout monitoring at the same time. This monitoring is black-box: it does not check whether the message has actually started executing on the application side, or how long its execution really takes. As long as the server side has not received a completion notice before the deadline, the message is judged to have timed out.
As mentioned above, when the system side sends the message to the target process, a Binder thread there receives it and inserts it into the message queue in chronological order. While it then waits to be executed, the following situations can occur:
- During process startup, a large number of services or base libraries need to be initialized, so many messages are already queued ahead of it and waiting to be scheduled;
- In some scenarios there are only a few messages, but one or more of them take a long time;
- In some scenarios other processes or the system are so heavily loaded that the whole system becomes sluggish.
In all of these scenarios, the system may judge the message to have timed out before it even starts executing. Yet when the signal arrives, what the main thread dumps is the stack of whatever message happens to be executing at that moment.
To sum up, when an ANR occurs, the Trace stack is in many cases not the root cause. And when ANR Info says a Service or Receiver caused the ANR, the component itself is, more often than not, not the problem.
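A small model makes this concrete. The sketch below is hypothetical throughout (the class AnrBlameSketch, the message names, and the costs are invented for illustration; only the 10 s foreground broadcast timeout comes from the text). It treats the main thread as a queue of messages that run back to back, arms a black-box timeout when the monitored message is enqueued, and reports which message is executing when the timeout fires.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model: which message is running when a black-box timeout fires? */
class AnrBlameSketch {
    /** A main-thread message with a name and an execution cost in milliseconds. */
    static final class Msg {
        final String name;
        final long costMs;

        Msg(String name, long costMs) {
            this.name = name;
            this.costMs = costMs;
        }
    }

    /**
     * Messages in queue run back to back starting at t=0. The monitored message is
     * enqueued behind them at t=0, and its timeout is armed at the same moment.
     * Returns the name of the message executing when the timeout fires, or null
     * if the monitored message completed in time.
     */
    static String runningAtTimeout(List<Msg> queue, Msg monitored, long timeoutMs) {
        long t = 0;
        for (Msg m : queue) {
            if (t + m.costMs > timeoutMs) {
                return m.name;             // the timeout fires while m is executing
            }
            t += m.costMs;
        }
        return (t + monitored.costMs > timeoutMs) ? monitored.name : null;
    }

    public static void main(String[] args) {
        List<Msg> queue = Arrays.asList(
                new Msg("loadConfig", 9_800),   // the genuinely slow message
                new Msg("updateBadge", 300));   // cheap, but running at the deadline
        Msg monitored = new Msg("onReceive", 5);
        System.out.println("trace shows: " + runningAtTimeout(queue, monitored, 10_000));
    }
}
```

In this run the trace would show updateBadge and the ANR would be attributed to the broadcast, although loadConfig consumed almost the whole budget and onReceive never started.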
What are the specific categories of scenarios that affect ANR? Let’s talk about them below.
Classification of influencing factors
Based on our experience on both the system side and the application side, and on our understanding of this kind of problem, we divide the factors that may lead to an ANR as follows. By environment, they fall into the application's internal environment and the system environment. In the first, the system load is normal, but the application's main thread has too many messages or messages that take too long. In the second, the system, or other threads of the application, load resources so heavily that main-thread scheduling is seriously preempted. In the list below, the first four items are main-thread scheduling problems under normal system load, and the last two are resource preemption:
- The business logic currently shown in the Trace stack is itself seriously time-consuming;
- The current Trace stack business logic is not time-consuming, but a single previously scheduled message was seriously time-consuming;
- The current Trace stack business logic is not time-consuming, but multiple previously scheduled messages were time-consuming;
- The current Trace stack business logic is not time-consuming, but a large number of repeated messages were scheduled earlier (the business frequently posts messages);
- The current Trace stack business logic is not time-consuming, but other threads seriously preempt resources such as IO, memory, and CPU;
- The current Trace stack business logic is not time-consuming, but other processes seriously preempt resources such as IO, memory, and CPU.
Let’s take a look at each of these scenarios and their performance:
The message currently being scheduled on the main thread takes too long
In theory, the longer a message takes, the more likely it is to cause jank or an ANR. This scenario occurs often in production and is relatively easy to troubleshoot; it is also the usual way business developers approach such problems.
The schematic diagram of main thread message scheduling when ANR occurs is as follows:
A single previously scheduled message took too long
Suppose an earlier message took a long time, but by the time it finished, the system service had not yet reached the timeout threshold. The main thread then moves on to other messages, and the system declares the response timeout while one of them is running. The business logic that happens to be executing is the one that gets blamed, even though it may be very simple.
This scenario is very common in production and, being relatively hidden, confuses many developers. It will be covered in the later ANR case studies. The schematic diagram of main thread message scheduling when ANR occurs is as follows:
Multiple consecutive messages take too long
Besides the two scenarios above, there is the case where several messages in a row are time-consuming. The system declares the response timeout while the main thread is dispatching some other message, and the business logic executing at that moment is unluckily blamed. This is also common in practice. Such problems are even more hidden, and it is hard to draw a clear line when analyzing and attributing them; fixing them requires optimizing several business scenarios. (This will be covered in detail in the later ANR case studies.)
The schematic diagram of main thread message scheduling when ANR occurs is as follows:
The same message is executed frequently (abnormal business logic)
The cases above involve one or more messages that take a long time. In another case, the business logic is faulty or a business thread interacts with the main thread too frequently, so a large number of messages pile up in the message queue. For messages appended to the queue afterwards, no single message is seriously time-consuming, but the messages are so dense that they cannot be scheduled promptly for a period of time. Scheduling falls behind, which leads to a response timeout. (This will be covered in the later ANR case studies.)
The schematic diagram of main thread message scheduling when ANR occurs is as follows:
The application process or system (including other processes) is overloaded
Besides the main-thread cases above, where messages that take too long or are too numerous delay message scheduling, there is another scenario we often meet in production: the process or the system itself is heavily loaded, for example high CPU usage, high IO, or low memory (memory thrashing with frequent GC in the application, memory reclaim in the system). This degrades the performance of the application or of the whole system and ends in a series of timeouts.
In this case, from the perspective of main-thread message scheduling, many messages take a long time, and the Wall Duration (absolute time, including normal scheduling, waiting, and sleep time) and CPU Duration (CPU execution time only) recorded for each message differ greatly. When we see this, we suspect that the system load is abnormal and use system information for further comparison and analysis. The impact is not limited to the current application; other applications and the system itself are affected as well.
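The difference between the two durations can be measured on a plain JVM as sketched below. This is an illustration under stated assumptions, not the article's monitoring tool: the class WallVsCpuSketch and its helpers are hypothetical, and it uses the JDK's ThreadMXBean for per-thread CPU time (on Android, SystemClock.uptimeMillis() and SystemClock.currentThreadTimeMillis() would be the likely counterparts). A sleeping task stands in for a message that spends its time waiting rather than executing.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper comparing wall-clock time and CPU time for one "message". */
class WallVsCpuSketch {
    /** Returns {wallMs, cpuMs}; cpuMs is -1 when the JVM cannot measure thread CPU time. */
    static long[] measure(Runnable message) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        boolean cpuOk = bean.isCurrentThreadCpuTimeSupported() && bean.isThreadCpuTimeEnabled();
        long cpuStart = cpuOk ? bean.getCurrentThreadCpuTime() : 0L;
        long wallStart = System.nanoTime();
        message.run();
        long wallMs = (System.nanoTime() - wallStart) / 1_000_000;
        long cpuMs = cpuOk ? (bean.getCurrentThreadCpuTime() - cpuStart) / 1_000_000 : -1;
        return new long[] { wallMs, cpuMs };
    }

    /** A task that waits: wall time grows, CPU time barely moves. */
    static Runnable sleeping(long ms) {
        return () -> {
            try {
                Thread.sleep(ms);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    /** A task that computes: wall time and CPU time grow together. */
    static Runnable spinning(long ms) {
        return () -> {
            long end = System.nanoTime() + ms * 1_000_000;
            while (System.nanoTime() < end) { /* burn CPU */ }
        };
    }

    public static void main(String[] args) {
        long[] waiting = measure(sleeping(200));
        long[] busy = measure(spinning(200));
        System.out.println("waiting: wall=" + waiting[0] + "ms cpu=" + waiting[1] + "ms");
        System.out.println("busy:    wall=" + busy[0] + "ms cpu=" + busy[1] + "ms");
    }
}
```

A large gap between the two values for an otherwise simple message points at waiting (locks, IO, or not being scheduled) rather than at the message's own logic.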
The schematic diagram of main thread message scheduling when ANR occurs is as follows:
Conclusion
This article introduced the design principles and working process of ANR and gave a closer look at the factors that affect ANR and how to classify them. The classification shows that many kinds of scenarios affect ANR, and in many cases the causes build up layer upon layer. To borrow a phrase: "When an ANR happens, no message is innocent."
Follow-up
The system's existing monitoring capability cannot directly reveal the many scenarios listed above, let alone show us how the main thread was being scheduled before the ANR occurred. Relying only on the information collected at ANR time, plus some system and top-process logs, usually gets us no further than first-stage localization, with conclusions such as "the system is overloaded" or "the main thread is too busy". It does not let us analyze and solve the problem any further, especially for problems that are hard to reproduce offline.
For each of us, the goal is not only to point in a direction but to solve the problem. So how do we overcome the gaps in system monitoring and the application side's information blind spots? That is the topic we turn to next: monitoring tools. A good tool not only gets us to the final answer for routine problems; when we face more complex, hidden problems, it also widens our field of view and suggests more directions. In next week's article we will look at how such a tool is designed and applied.
Android Platform architecture team
We are ByteDance's Android platform architecture team. We mainly serve Toutiao, and also GIP and other products of the company. We keep optimizing and exploring product performance, stability, and other aspects of user experience, as well as R&D process and architecture, to support rapid product iteration while maintaining a high-quality user experience.
If you are passionate about technology and want bigger challenges and a bigger stage, you are welcome to join us. Positions are available in Beijing and Shenzhen. If you are interested, send an email to [email protected] with the subject: name – GIP-Android platform architecture.
Welcome to Bytedance Technical Team
Resume mailing address: [email protected]