Introduce ANR

The full name of ANR is Application No Responding, that is, the Application does not respond. Specifically, some specific messages (Key dispatches, Broadcast dispatches, and services) are not processed within the specified period of time by the Application UI thread (main thread). An ANR exception is triggered.

ANR is guaranteed by the message processing mechanism. Android has implemented a set of sophisticated mechanisms to discover ANR in the system layer, and the core principles are message scheduling and timeout processing. At the system layer, all messages related to ANR are scheduled by the system process System_server, which is ActivityManagerService, and then distributed to the application process to complete the actual processing of messages. At the same time, System processes design different timeout limits to track message processing. The timeout limit comes into play when the application mishandled the Message, collecting system state such as CPU/IO usage, the process function CallStack, and (some platforms, such as MTK, print the corresponding Message for debugging analysis). Finally, a user process is reported to be unresponsive (ANR dialog).

ANR type

There are generally three types of ANR:

KeyDispatchTimeout

This is mainly because the button or touch event does not respond within a specific period of time. Generally, the default timeout time of Android platform is 5s and anR is reported. However, some platforms will change this time, such as MTK, some platforms are 8s.

This timeout can be in ActivityManagerService. Java view:

// M: ANR debug mechanism (Ori: 5*1000)
static final int KEY_DISPATCHING_TIMEOUT = 8*1000;
Copy the code

BroadcastTimeout

An ANR dialog box is displayed for this type of timeout, and the user can choose forceStop or continue to wait.

The BroadcastRecevier can’t handle it at the specified time. The foreground broadcast timeout is 10s, and the background broadcast timeout is 60s. No dialog box is displayed for this timeout.

==> AMS.java

// How long we allow a receiver to run before giving up on it.
static final int BROADCAST_FG_TIMEOUT = 10*1000;
static final int BROADCAST_BG_TIMEOUT = 60*1000;
Copy the code

ServiceTimeout&nbsp

If the Service fails to complete the operation within a specified period of time, the ANR will report a Service timeout. This type of ANR also does not display a prompt box. The timeout period is 20 seconds for the foreground Service and 200 seconds for the background Service.

==> ActivityServices.java

// How long we wait for a service to finish executing.
static final int SERVICE_TIMEOUT = 20*1000;
// How long we wait for a service to finish executing.
static final int SERVICE_BACKGROUND_TIMEOUT = SERVICE_TIMEOUT * 10;
Copy the code

Among the three ANR types, only the first one will display the system prompt dialog box, because the user is doing interface interaction. If there is no response for a long time, the user will suspect that the device has died. Most people will start to press randomly, or even pull out the battery and restart, which will give the user a very bad experience.

All three types of ANR will output error messages in the log. You will find that the function stack information of each application and system process is exported to a file called /data/ ANR/testamp.txt, which is the key file for analyzing the cause of ANR. You also see the CPU usage at the time in the log, which is also important information, and how to use it to analyze ANR problems will be described in more detail in a later section.

These three ANRs are not isolated and may interact with each other. For example, an application process has both a BroadcastReceiver and an Activity that is displaying messages, both running in the main thread of the process. If BR’s onReceive function does not return and the user clicks on the screen, and the onReceive does not return for more than 5 seconds, the main thread is unable to process the user input event, the first ANR is caused. If no return continues for more than 10 seconds, a second ANR is generated.

How ANR occurs

The basic principle of ANR occurrence is as follows:

  • Call hander.sendMessageatTime () to send an ANR message after the relevant operation (for example, the activity is 5s after the current time).
  • Perform related operations
  • Remove the message after the operation is complete. ANR occurs when the message is fetched and executed by handler if the operation is not completed by the specified time.

Android for message processing is completed by Handler, do not understand the Handler students can refer to the article Android Handler source code analysis (details).

ANR analysis example

ANR is essentially a Performance problem. When ANR occurs, if the problem may be caused by APK itself, the main direction of investigation is APK itself, and the analysis is to see if there is a time-consuming operation in the UI thread.

Personally, there are some ANR problems that are difficult to analyze. For example, when the APP is running, message processing fails due to some influence at the bottom of the current system. Such problems are often irregular and difficult to reproduce. This type of ANR problem usually takes a lot of time to understand some other behavior of the system, and beyond the scope of the ANR itself, this kind of problem is a warning signal that tells you something is wrong with the system.

View the log file

Search keyword ANR:

04-17 17:15:23. 817 E/ActivityManager (187) : ANR in com.da.android (com.da.android/com.text.my.app.ui.activities.ContentActivity)Copy the code

The above text means that ANR occurred at 17:15:23ms 817ms on April 17th. In fact, it usually includes the reason for the ANR, the CPU usage,

Here we write a demo, for example, in the broadcast receiver onReceive method (the broadcast receiver onReceive method is executed on the main thread by default) sleep for a period of time, if the main thread has no operation at this time, then the broadcast receiver will execute normally. It doesn’t cause ANR. If I send a broadcast and hit the back button multiple times, the main thread is asleep and can’t respond, ANR will appear.

The complete log information is as follows:

ANR in com.brotherd.broadcastdemo (com.brotherd.broadcastdemo/.MainActivity) PID: 6317 Reason: Input dispatching timed out (Waiting to send key event because the focused window has not finished processing all of the Input events that were previously delivered to it. Outbound queue length: 0. CPU Usage from 45678ms to 0ms ago (2020-10-27 08:39:06.256 to 2020-10-27 / /system_server: / /system_server: / /system_server: / /system_server: / /system_server: / / 6531 minor 16 major 13% 23112 / com. Android. The systemui: 9.7% user + 3.5% kernel/faults: 6033 minor // minor/faults: 469 minor 2 Major 4.673 / hwPGED: 0% user + 4.3% kernel/faults: 29 minor 549/android.hardware.graphics.com [email protected]: 2.3% and 1.1% user + 1.1% kernel/faults: 149 0.1%, 2246 minor/com. Huawei. Android. Pushagent: 0% user + 0% kernel/faults: 1.5%, 461/2583 minor 22 major logd: 0.6% user + 0.8% kernel/faults: 52 1.5% minor 2077 / com. Huawei. Powergenie: 0.7% user + 0.8% kernel/faults: 843 minor 2 major 0.5% 17384 / com. Android. Settings: 0.4% user + 0.1% kernel/faults: 367 /kworker/u16:0 / faults: 1 minor 0.9% 18949 / android. Process. Media: 0.5% user + 0.3% kernel/faults: 1408 minor 8 major 0.8% 148 / kswapd0: 367 + 367 / self/self/self 139 minor 0.5% 2105 / com. Huawei. Android. The launcher: 0.4% user + 0.1% kernel/faults: 159 minor 1 major 0.5% 2366 / com. Huawei, systemmanager: service: 0.3% user + 0.1% kernel/faults: 341 Minor 0.352 /hisi_frw/0: 0% user + 0.5% kernel 0.351 /kworker/ U16:8: User + 0.4% 0% 0.4% kernel 27040 / com. Huawei. Appmarket: 0.3% user + 0.1% kernel/faults: 350 Minor 2 Major 0.4% 238/kworker/ 0:1h: 0% user + 0.4% kernel 0.4% 2028/com.huawei.hiview: The kernel/faults system was faulted with exceptions. User + 0.3% 0% 0.3% kernel/[email protected]: 677 0.3% user + 0% kernel/faults: 19 Minor 0.367 /kworker/ U16:1:0% + 0.367 /com.huawei. Hidisk: 0.367 /kworker/ u16:1:0% + 0.367 /com.huawei. 404 Minor 1 /kworker/ U16:10: Faults: 25 Minor 352 /kworker/ U16:9: 367 /com.huawei.hwid: 367 /com.huawei.hwid: 367 / faults: 754 Minor /kworker/ u16:0 User + 0.2% 0% 0.2% kernel/[email protected]: 548 0.2% user + 0% kernel/faults: 47 0.2% minor 2042 / com. Huawei. Systemserver: 0.1% user + 0.1% kernel/faults: 95 minor 0.2% 426 / oeminfo_nvm_server: 367 /kworker/ U173:1: Faults: 20 minor /kworker/u16:2: User + 0.2% 0% 0.2% kernel 5290 / com. Huawei. Health: DaemonService: 0.1% user + 0% kernel/faults: 77 Minor 0.1 / Rcu_preempt: 0% user + 0.1% kernel 0.93 / SYS_heap: 0% user + 0.1% kernel 0.949 /hisi_hcc: 637/ hisECD: 0% kernel/faults: 66 minor 697 /com.android.phone: Faults: 97 minor /kworker/u16:5 / faults: 99 minor /kworker/u16:5 / faults: 99 minor /kworker/u16:5 / faults: 0.1% user + 0% kernel/faults: 142 minor 2 major 0.1% 2063 / com. Huawei. HwOPServer: 0% user + 0% kernel/faults: 69 minor 1// faults: 0 // faults: 0 // 263 minor 5 major 0% 465 / [email protected]: 0% user + 0% kernel/faults: 111 Minor 5 Major 0.344 /kworker/u16:4: 0% user + 0.1% kernel 0.355 / displayEngineserver: Faults: 21 minor /oal_gpio_rx_dat User + 0.1% 0% 0.1% kernel 3955 / com. Eg. Android. AlipayGphone: push: 0% user + 0% kernel/faults: Huawe // CPU usage after ANR occurred 2020-10-27 08:39:53. CPU Usage from 51ms to 493ms later (2020-10-27 08:39:51.984 to 2020-10-27 08:39:52.426) with 99% awake: 48% 1265/ System_server: 10% User + 37% kernel 45% 1285/ActivityManager: 8% user + 37% kernel 4.9% Faults: 6 minor faults: 7 minor faults: 7 minor faults: 7 minor faults: User + 3.1% 0% 3.7% kernel 23112 / com. Android. The systemui: 9.7% 3.7% user + 0% kernel TOTAL: 5.2% user + 4.4% kernel // You can see the entire CPU in totalCopy the code

Look at the first few lines:

ANR in com.brotherd.broadcastdemo (com.brotherd.broadcastdemo/.MainActivity) PID: 6317 Reason: Input dispatching timed out (Waiting to send key event because the focused window has not finished processing all of the  input events that were previously delivered to it. Outbound queue length: 0. Wait queue length: 2.)Copy the code

These lines indicate the activity in which the ANR occurred, the process ID, and the reason for the ANR (input event distribution timed out).

ANRManager will print the CPU usage before and after ANR, which can reflect the Performance status of the system at that time:

  • If the CPU usage is close to 100%, the current device is busy and CPU starvation may be causing the ANR.
  • If CPU usage is low, the main thread is blocked
  • If the IOwait is high, the ANR may be due to I/O operations on the main thread

So at this point, we’re going to look at what the main thread is doing when anR happens.

Tractes.txt file (data/anr/ testamp.txt)

The log file only tells you when ANR occurred, but it does not give you detailed information. In this case, you should check the trace file. Threads from the process are dumped into this trace file, so the trace file contains the runtime status of each Thread.

The stack information for the current example is as follows:

In the message above, there is this line

at com.brotherd.broadcastdemo.MyReceiver.onReceive(MyReceiver.java:23)
Copy the code

This is the root cause of ANR. The main Thread is blocked by calling Thread’s sleep method in the MyReceiver onReceive method, resulting in ANR.

Then we look at some field information.

"main" prio=5 tid=1 Sleeping
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x7682ab30 self=0x7bd3815c00
  | sysTid=6317 nice=-10 cgrp=default sched=0/0 handle=0x7c59fc8548
  | state=S schedstat=( 1009468742 32888019 224 ) utm=91 stm=9 core=4 HZ=100
  | stack=0x7ff27e1000-0x7ff27e3000 stackSize=8MB
  | held mutexes=
Copy the code

The meanings of these fields are as follows:

Thread information

  • Thread priority: prio=5
  • Thread ID: TID =1
  • Thread status: Sleeping
  • Thread group name: group=”main”
  • The number of times a thread is suspended: sCount=1
  • The number of times a thread is suspended by the debugger: dsCount=0
  • The thread’s Java object address: obj= 0x7682AB30
  • Self =0x7bd3815c00
  • In Linux, the kernel thread ID: sysTid=6317 Is the same as the process ID of the main thread
  • Thread scheduling priority: nice=-10
  • Thread scheduling group: CGRP =default
  • Thread scheduling policy and priority: sched=0/0
  • Address of thread handler function: handle= 0x7C59FC8548

Thread context information:

  • Thread scheduling status: state=S
  • Schedstat =(1009468742 32888019 224) schedstat=(1009468742 32888019 224)
  • Thread scheduling time in user mode: UTM =91
  • Thread scheduling time value in kernel state: STM =9
  • The last CPU core to execute this thread is core=4

Thread stack information:

  • Stack address and size: stack= 0x7ff27E1000-0x7FF27e3000 stackSize=8MB

Thread state

In addition, thread states are important, and understanding the meaning of these states is helpful to analyze the cause of ANR, as summarized below:

There is a mapping between the states in Thread. Java and thread. CPP. You can see that the former is more general and easier to understand for Java users; The latter is more detailed, oriented towards the virtual machine’s internal environment. Traces.txt displays Thread states that are defined in thread.cpp. In addition, all threads are local POSIX compliant threads.

Example thread state:

  • The main thread is running or native: while the corresponding stack corresponds to the function in the App, it is likely that the timeout occurs when the function is executed.
  • The main thread is blocked: it is obvious that the thread is locked, so you can see which thread is locked and consider optimizing your code. If it is a deadlock problem, it is even more important to resolve it in a timely manner.

Deadlock demo 

Let’s look at an example of a deadlock:

"PowerManagerService" prio=5 tid=24 MONITOR
| group="main" sCount=1 dsCount=0 obj=0x41dd0eb0 self=0x5241b218
| sysTid=567 nice=0 sched=0/0 cgrp=apps handle=1380038664
| state=S schedstat=( 6682116007 11324451214 33313 ) utm=450 stm=219 core=1
at com.android.server.am.ActivityManagerService.broadcastIntent(ActivityManagerService.java:~13045)
- waiting to lock <0x41a874a0> (a com.android.server.am.ActivityManagerService) held by tid=12 (android.server.ServerThread)
at android.app.ContextImpl.sendBroadcast(ContextImpl.java:1144)
at com.android.server.power.PowerManagerService$DisplayBlankerImpl.unblankAllDisplays(PowerManagerService.java:3442)
at com.android.server.power.DisplayPowerState$PhotonicModulator$1.run(DisplayPowerState.java:456)
at android.os.Handler.handleCallback(Handler.java:800)
at android.os.Handler.dispatchMessage(Handler.java:100)
at android.os.Looper.loop(Looper.java:194)
at android.os.HandlerThread.run(HandlerThread.java:60)
Copy the code

The thread tid=24 is waiting for a lock < 0x41A874A0 > and the lock is occupied by the thread tid=12.

"android.server.ServerThread" prio=5 tid=12 MONITOR
| group="main" sCount=1 dsCount=0 obj=0x41a76178 self=0x507837a8
| sysTid=545 nice=-2 sched=0/0 cgrp=apps handle=1349936616
| state=S schedstat=( 15368096286 21707846934 69485 ) utm=1226 stm=310 core=0
at com.android.server.power.PowerManagerService.isScreenOnInternal(PowerManagerService.java:~2529)
- waiting to lock <0x41a7e2e8> (a java.lang.Object) held by tid=85 (Binder_B)
at com.android.server.power.PowerManagerService.isScreenOn(PowerManagerService.java:2522)
at com.android.server.wm.WindowManagerService.sendScreenStatusToClientsLocked(WindowManagerService.java:7749)
at com.android.server.wm.WindowManagerService.setEventDispatching(WindowManagerService.java:7628)
at com.android.server.am.ActivityManagerService.updateEventDispatchingLocked(ActivityManagerService.java:8083)
at com.android.server.am.ActivityManagerService.wakingUp(ActivityManagerService.java:8077)
Copy the code

The thread tid=12 is waiting for the lock <0x41a7e2e8> and the lock is occupied by the thread tid=85.

"Binder_B" prio=5 tid=85 MONITOR | group="main" sCount=1 dsCount=0 obj=0x42744770 self=0x58329e88 | sysTid=3700 nice=-20  sched=0/0 cgrp=apps handle=1471424616 | state=S schedstat=( 1663727513 2044643318 6806 ) utm=132 stm=34 core=1 at com.android.server.power.PowerManagerService$DisplayBlankerImpl.toString(PowerManagerService.java:~3449) - waiting to lock <0x41a7e420> (a com.android.server.power.PowerManagerService$DisplayBlankerImpl) held by tid=24 (PowerManagerService)Copy the code

It is worth noting that trace generally contains time, so try to analyze trace whose time is similar to ANR to avoid other interference. The tid=85 thread was waiting for the lock <0x41a7e420> to be released, and the lock was occupied by tid=24, so the deadlock was sent. In this case, we need to find the source code where the deadlock occurred, analyze it and modify it.

Other analysis ideas

Sometimes it is difficult to determine the cause of ANR even after log analysis or traces analysis, so you may need to try other methods:

  • Find 8s before the occurrence time of ANR from main log and see what operation the main thread of app is doing. If there are exceptions, the PID of process will be equal to the main thread ID.

  • Search log on a large scale to see if there was a crash before anR occurred. Is it possible to cause ANR.

  • Grab some auxiliary information of the platform, including but not limited to:

    • memory info
    • process status
    • cpu info
    • Binder info
    • ftrace

Reduce the incidence of ANR

  • Put all time-consuming operations such as network access, Socket communication, querying a large number of SQL statements, complex logic calculations, etc. into child threads, and then update the UI via handler.sendMessage, runonUIThread, AsyncTask, etc. Make sure the user interface is smooth at all costs. If a time-consuming operation requires the user to wait, a degree bar can be displayed on the interface.
  • Use AsyncTask to process time-consuming I/O operations. In some synchronous operations, the main thread may be locked and need to wait for other threads to release the corresponding lock to continue execution, which has a certain ANR risk. In this case, asynchronous threads can also be used to execute the corresponding logic. Also, avoid deadlocks.
  • Using Thread or HandlerThread, call Process. SetThreadPriority (Process. THREADPRIORITYBACKGROUND) set priority, otherwise, still can reduce response program, Because the default Thread has the same priority as the main Thread.
  • Instead of blocking the main Thread with thread.wait () or thread.sleep (), use handlers to process worker Thread results.
  • Try to avoid time-consuming code in the Activity’s onCreate and onResume callbacks
  • Avoid time-consuming code in View onOnTouchevent and onClick as well
  • IntentService is recommended for onReceive code in BroadcastReceiver to minimize time consumption.
  • The lifecycle functions of each component should not have time-consuming operations, even for a background Service or ContentProvider. When the application is running in the background, its onCreate() will not have user input causing an event to go unanswered ANR. However, long execution time can also cause ANR of Service and ANR of ContentProvider

 

Refer to the article

Android System Stability – ANR (1)

ANR problem analysis guide