Introduction

Whether you work on Android application development or Android system development, you will run into Application Not Responding (ANR) problems. When an application fails to respond within a certain period, the system pops up an ANR dialog that lets the user choose to keep waiting or force the application to close.

Most people’s knowledge of ANR is limited to the notion that it is caused by time-consuming work on the main thread or by a busy CPU. Of the countless candidates I have interviewed, few could really explain the ins and outs of ANR at the system level. For example: what paths lead to an ANR? Can an ANR occur even when the main thread is not doing anything time-consuming? How can ANRs be debugged more effectively?

Without studying the Android Framework source code in depth, it is hard to form a complete and correct understanding of ANR. This article distills what I learned from studying the system source code and from work practice, and explains it with illustrations. I believe it will deepen your understanding of ANR.

ANR trigger mechanism

When learning anything, we need to know not only what it is but also why, so that we can handle it as effortlessly as the butcher Pao Ding carving an ox. To understand ANR, we need to go to the root of the question: how is an ANR triggered?

ANR is a mechanism that monitors whether Android applications respond in a timely manner. It can be likened to setting off a time bomb, and the whole process consists of three parts:

  1. Planting the time bomb: the central control system (system_server process) starts a countdown. If the target (application process) does not finish all of its work within the specified time, the central control system blows up the target (kills the process).
  2. Defusing the bomb: the target finishes all of its work within the specified time and promptly reports completion to the central control system, asking it to remove the time bomb. The target survives.
  3. Detonating the bomb: the central control system immediately seals off the scene, captures snapshots, and collects traces of the target’s slow execution for the follow-up investigation (debugging and analysis), and finally blows up the target.
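The three parts above can be sketched as a small bookkeeping model. This is only an illustrative sketch, not the framework's implementation: the class AnrWatchdog, its method names, and the explicit nowMs clock parameter are all hypothetical and exist only to make the plant / defuse / detonate cycle concrete.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the plant / defuse / detonate cycle (hypothetical, not framework code). */
class AnrWatchdog {
    // token -> deadline in milliseconds; LinkedHashMap keeps planting order
    private final Map<String, Long> deadlines = new LinkedHashMap<>();

    /** Plant the bomb: record the deadline for one unit of work. */
    void plant(String token, long nowMs, long timeoutMs) {
        deadlines.put(token, nowMs + timeoutMs);
    }

    /** Defuse: the target reported completion. Returns true if the bomb was still armed. */
    boolean defuse(String token) {
        return deadlines.remove(token) != null;
    }

    /** Detonate: return (and disarm) every token whose deadline has passed at nowMs. */
    List<String> detonateExpired(long nowMs) {
        List<String> exploded = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = deadlines.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs >= entry.getValue()) {
                exploded.add(entry.getKey());
                it.remove();
            }
        }
        return exploded;
    }
}
```

In this model a token that is defused before its deadline never shows up in detonateExpired, and a token that has already exploded can no longer be defused, which is the essence of the race between the target and the countdown.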

Common ANR types include service, broadcast, provider, and input timeouts. For more details, see gityuan.com/2016/07/02/… The following sections explain each of them with diagrams and text.

Service timeout mechanism

The following steps show how the bomb is planted and defused during the startService process.

Diagram 1:

  1. The client (app process) sends a request to the central control system (system_server process) to start a service.
  2. The central control system assigns an idle communicator (binder_1 thread) to receive the request, which then sends a message to the component manager (ActivityManager thread) to plant the time bomb.
  3. Communicator no. 1 (binder_1) notifies the communicator at the work site (the process hosting the service) to get ready to work.
  4. The site’s communicator no. 3 (binder_3) forwards the task to the contractor (main thread), adding it to the contractor’s task queue (MessageQueue).
  5. The contractor works through the service startup lifecycle and then waits for the SharedPreferences (SP) data to be persisted.
  6. As soon as the SP work has finished, the contractor reports completion to the central control system.
  7. Communicator no. 2 (binder_2) of the central control system receives the contractor’s completion report and immediately defuses the bomb. If the bomb is defused before the countdown ends, all is well; otherwise it explodes (an ANR is triggered).

For more details, see the analysis of the startService startup process: gityuan.com/2016/03/06/…

Broadcast timeout mechanism

The broadcast timeout mechanism is similar to the service one. For statically registered receivers, however, timeout detection also has to take pending SharedPreferences (SP) writes into account, as shown in the following figure.

Diagram 2:

  1. The client (app process) sends a broadcast request to the central control system (system_server process).
  2. The central control system assigns an idle communicator (binder_1) to receive the request and forward it to the component manager (ActivityManager thread).
  3. While executing the task (the processNextBroadcast method), the component manager plants the time bomb.
  4. The component manager notifies the communicator at the work site (the process hosting the receiver) to get ready to work.
  5. The site’s communicator no. 3 (binder_3) forwards the task to the contractor (main thread), adding it to the contractor’s task queue (MessageQueue).
  6. After finishing the work (the receiver’s startup lifecycle), the contractor finds that SP is still writing files in the current process, and hands the completion report over to the SP worker (queued-work-looper thread).
  7. After much hard work, the SP worker finally finishes persisting the SP data and can then report completion to the central control system.
  8. Communicator no. 2 (binder_2) of the central control system receives the completion report and immediately defuses the bomb. If the bomb is defused before the countdown ends, all is well; otherwise it explodes (an ANR is triggered).

(Note: since Android 8.0, SP uses a handler thread named “queued-work-looper”; older versions used a single-thread pool created with newSingleThreadExecutor.)

For a dynamically registered receiver, or a statically registered receiver with no SP persistence task in progress, completion is reported to the central control system directly rather than through the queued-work-looper thread. The process is simpler, as shown in the following figure:

As can be seen, only the timeout detection for broadcasts whose receivers are statically registered in XML takes into account whether any SP writes are still pending; dynamically registered receivers are not affected. SP’s apply() updates the modified items in memory and then writes the data to the disk file asynchronously, which is why many sources recommend calling apply() on the main thread to avoid blocking it. However, timeout detection for static receivers requires that all SP data be persisted to disk first, so overusing apply() increases the probability of an application ANR. See http://gityuan.com/2017/06/18/SharedPreferences for more details.

Google’s original intention was to ensure that SP data can be persisted before the process is killed in the static broadcast scenario. Until the receiver reports completion to the central control system, the process has foreground priority. A process at this higher priority is not only protected from being killed but is also allocated more CPU time slices, which speeds up SP persistence.
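The effect of pending apply() writes on a statically registered receiver can be illustrated with a small cost model. This is a sketch under stated assumptions, not how the framework is written: the class ReceiverFinishModel, its methods, and the per-write costs in milliseconds are hypothetical, and the only figure taken from this article is the 10-second foreground broadcast timeout.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model: a static receiver's completion report waits for queued SP writes (hypothetical). */
class ReceiverFinishModel {
    // Simulated disk-write cost, in milliseconds, of each pending apply()
    private final Deque<Long> pendingWriteCostsMs = new ArrayDeque<>();

    /** Like apply(): memory is updated at once, the disk write is only queued. */
    void apply(long diskWriteCostMs) {
        pendingWriteCostsMs.add(diskWriteCostMs);
    }

    /**
     * Simulated time until completion is reported to system_server.
     * A statically registered receiver first drains every queued write;
     * a dynamically registered one reports right after onReceive returns.
     */
    long timeToReportMs(long onReceiveCostMs, boolean staticallyRegistered) {
        long total = onReceiveCostMs;
        if (staticallyRegistered) {
            while (!pendingWriteCostsMs.isEmpty()) {
                total += pendingWriteCostsMs.poll();
            }
        }
        return total;
    }
}
```

With three queued writes of 4000 ms each and an onReceive that takes 1000 ms, the static receiver in this model reports after 13000 ms, past the 10-second foreground broadcast timeout, while a dynamic receiver with the same backlog reports after 1000 ms.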

For more details, see the analysis of the Android broadcast mechanism: gityuan.com/2016/06/04/…

Provider timeout mechanism

The provider timeout is checked only when the provider process is started for the first time. If the provider process is already running, requesting the provider again does not trigger the provider timeout check.

Diagram 3:

  1. The client (app process) asks the central control system (system_server process) for a content provider.
  2. The central control system assigns an idle communicator (binder_1) to receive the request. It detects that the content provider has not been started and forks a new process through Zygote.
  3. The newly forked provider process registers itself with the central control system.
  4. Communicator no. 2 of the central control system receives this registration and sends a message to the component manager (ActivityManager thread) to plant the bomb.
  5. Communicator no. 2 notifies the communicator at the work site (provider process) to get ready to work.
  6. The site’s communicator no. 4 (binder_4) forwards the task to the contractor (main thread), adding it to the contractor’s task queue (MessageQueue).
  7. After some effort the contractor finishes the work (installing the provider) and reports completion to the central control system.
  8. Communicator no. 3 (binder_3) of the central control system receives the completion report and immediately defuses the bomb. If the bomb is defused before the countdown ends, all is well; otherwise it explodes (an ANR is triggered).

For more details, see “Understanding the Principles of ContentProvider” at gityuan.com/2016/07/30/…

Input timeout mechanism

The input timeout detection mechanism is quite different from that of service, broadcast, and provider. To make the input process easier to follow, here is what the two key threads do:

  • The InputReader thread reads input events through EventHub (which listens on the /dev/input directory), puts them into the InputDispatcher’s mInBoundQueue, and notifies the InputDispatcher to handle them.
  • The InputDispatcher thread dispatches input events to the target application window, using three event queues:
    • mInBoundQueue records the input events sent by the InputReader.
    • outBoundQueue records the input events that are about to be dispatched to the target application window.
    • waitQueue records the input events that have been dispatched to the target application but have not yet been fully handled by it.

The input timeout mechanism does not mean that an explosion happens as soon as time is up. Instead, a timeout is detected only while subsequent input events are being handled. It is therefore better thought of as mine clearance, as shown in the following figure.

Diagram 4:

  1. The InputReader thread uses EventHub to listen for input events reported by the lower layers. When it receives one, it puts the event into the mInBoundQueue and wakes up the InputDispatcher thread.
  2. InputDispatcher starts dispatching the input event and records the time at which the mine is laid. It first checks whether an mPendingEvent is already being processed. If not, it takes the event at the head of the mInBoundQueue, assigns it to mPendingEvent, and resets the ANR timeout; otherwise it takes nothing from the mInBoundQueue and does not reset the timeout. It then checks whether the window is ready (checkWindowReadyForMoreInputLocked). If either of the following conditions holds, it enters the mine-clearance state (checking whether the previous event being processed has timed out) and stops dispatching the event; otherwise it continues to step 3.
    • For key events: the outBoundQueue or the waitQueue is not empty.
    • For non-key events: the waitQueue is not empty and the event at its head has been waiting for more than 500 ms.
  3. When the application window is ready, the mPendingEvent is moved to the outBoundQueue.
  4. When the outBoundQueue is not empty and the application’s end of the channel is properly connected, the event is taken out of the outBoundQueue and put into the waitQueue.
  5. InputDispatcher notifies the target application process through a socket to get ready to work.
  6. A socketpair for two-way communication with the central control system is created by default when the app is initialized. Once the app’s contractor (main thread) receives the input event, it passes the event down layer by layer to the target window for handling.
  7. When the contractor has finished, it reports completion to the central control system through the socket, and the central control system removes the event from the waitQueue.

Why is the input timeout a matter of mine clearance rather than a timed explosion? Because for input, even if handling an event takes longer than the timeout period, no ANR is triggered as long as the user does not generate another input event afterwards. Mine clearance here means that, while the input system is still processing a time-consuming event, every subsequent input event checks whether the previous event being processed has timed out (entering the mine-clearance state), that is, whether the time elapsed since the previous event was dispatched exceeds the timeout period. If the previous input event has already been handled, the ANR timeout is reset and nothing explodes.
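The mine-clearance idea can be made concrete with a deliberately simplified model. It is only a sketch: the class InputMineSweeper, its methods, and the 5000 ms timeout used in the usage note are assumptions made for illustration, and the real InputDispatcher logic (three queues, per-window state, the 500 ms rule for non-key events) is reduced here to a single in-flight event.

```java
/** Toy model of the mine-clearance check in input dispatch (hypothetical, not framework code). */
class InputMineSweeper {
    enum Result { DISPATCHED, WAITING, ANR }

    private final long timeoutMs;
    private boolean eventInFlight = false;
    private long inFlightSinceMs = 0;

    InputMineSweeper(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** A new input event arrives at time nowMs. */
    Result onInputEvent(long nowMs) {
        if (!eventInFlight) {
            // Nothing is being processed: dispatch and reset the ANR timeout.
            eventInFlight = true;
            inFlightSinceMs = nowMs;
            return Result.DISPATCHED;
        }
        // Mine clearance: only now do we check how long the previous event has been outstanding.
        // Simplification: the new event itself is not queued in this model.
        return (nowMs - inFlightSinceMs >= timeoutMs) ? Result.ANR : Result.WAITING;
    }

    /** The application reports that it finished handling the in-flight event. */
    void onFinished() {
        eventInFlight = false;
    }
}
```

With an assumed timeout of 5000 ms, an event dispatched at t=0 followed by new events at t=3000 and t=6000 yields WAITING and then ANR. If instead the application finishes the first event, however late, and no other event arrived in the meantime, the next event is simply DISPATCHED and no ANR is ever raised.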

For more details, see the analysis of the ANR principle in the input system: gityuan.com/2017/01/01/…

ANR timeout thresholds

Different components have different timeout thresholds. The following table lists the timeout thresholds for service, broadcast, ContentProvider, and input:

Foreground vs. background services

The timeout for starting a foreground service is 20 s, and for a background service 200 s. So how does the system tell foreground and background services apart? Take a look at the core logic of ActiveServices:

ComponentName startServiceLocked(...) {
    final boolean callerFg;
    if (caller != null) {
        final ProcessRecord callerApp = mAm.getRecordForAppLocked(caller);
        // Foreground unless the caller is in the background scheduling group
        callerFg = callerApp.setSchedGroup != ProcessList.SCHED_GROUP_BACKGROUND;
    } else {
        callerFg = true;
    }
    ...
    ComponentName cmp = startServiceInnerLocked(smap, service, r, callerFg, addToStarting);
    return cmp;
}

In startService, the process scheduling group of the calling process (callerApp) determines whether the service being started counts as a foreground or a background service. If the caller’s scheduling group is not ProcessList.SCHED_GROUP_BACKGROUND, the service is treated as a foreground service; otherwise it is a background service. The result is recorded in the createdFromFg member of ServiceRecord.

Which processes belong to the SCHED_GROUP_BACKGROUND scheduling group? Process scheduling groups can be roughly divided into top, foreground, and background. The relationship between process priority (adj) and scheduling group (SCHED_GROUP) is fairly complex, but it can be roughly understood as follows: a process with adj 0 belongs to the top group, a process with adj 100 or 200 belongs to the foreground group, and a process with adj greater than 200 belongs to the background group. The meaning of each adj value is shown in the following table. Put simply, processes with adj > 200 are not perceptible to the user and mainly do background work. Services started from them therefore belong to the background process scheduling group and have a longer timeout threshold.

For details, see the analysis of the Android process priority (adj) algorithm: gityuan.com/2018/05/19/…

To be precise, a foreground service here is a service started by a process in the foreground process scheduling group. This is different from an fg-service, which is a service with a foreground notification.
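The rough relationship between the caller's adj, its scheduling group, and the resulting service timeout can be written down as a short sketch. The class ServiceTimeoutModel and its methods are hypothetical, the mapping is the coarse approximation given in this section rather than the real priority algorithm, and the handling of adj values this article does not mention is an assumption flagged in the code.

```java
/** Rough model of how the caller's adj decides the service timeout (hypothetical, not framework code). */
class ServiceTimeoutModel {
    enum SchedGroup { TOP, FOREGROUND, BACKGROUND }

    static final long FG_SERVICE_TIMEOUT_MS = 20_000;   // foreground figure quoted in this article
    static final long BG_SERVICE_TIMEOUT_MS = 200_000;  // background figure quoted in this article

    /** Coarse adj -> scheduling group mapping as described in the text. */
    static SchedGroup groupForAdj(int adj) {
        if (adj > 200) {
            return SchedGroup.BACKGROUND;
        }
        if (adj == 0) {
            return SchedGroup.TOP;
        }
        // Assumption: every other value (100, 200, and values the article does not
        // cover, such as negative adj) is treated as foreground in this sketch.
        return SchedGroup.FOREGROUND;
    }

    /** Mirrors callerFg: any caller not in the background group starts a foreground service. */
    static long serviceTimeoutMs(int callerAdj) {
        boolean callerFg = groupForAdj(callerAdj) != SchedGroup.BACKGROUND;
        return callerFg ? FG_SERVICE_TIMEOUT_MS : BG_SERVICE_TIMEOUT_MS;
    }
}
```

Under this model a caller with adj 0 or 200 gets the 20 s threshold, while a caller with an adj above 200 gets the 200 s threshold.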

Foreground vs. background broadcast timeouts

The foreground broadcast timeout is 10 seconds, and the background broadcast timeout is 60 seconds. How are foreground and background broadcasts told apart? Take a look at the core logic of AMS:

BroadcastQueue broadcastQueueForIntent(Intent intent) {
    final boolean isFg = (intent.getFlags() & Intent.FLAG_RECEIVER_FOREGROUND) != 0;
    return (isFg) ? mFgBroadcastQueue : mBgBroadcastQueue;
}

mFgBroadcastQueue = new BroadcastQueue(this, mHandler,
        "foreground", BROADCAST_FG_TIMEOUT, false);
mBgBroadcastQueue = new BroadcastQueue(this, mHandler,
        "background", BROADCAST_BG_TIMEOUT, true);

Whether the Intent passed to sendBroadcast(Intent intent) carries FLAG_RECEIVER_FOREGROUND in its flags determines whether the broadcast is placed in the foreground or the background broadcast queue. The foreground queue has a timeout of 10 seconds and the background queue a timeout of 60 seconds. By default, broadcasts go to the background queue unless FLAG_RECEIVER_FOREGROUND is specified.

Background broadcasts have a longer timeout threshold than foreground broadcasts. In addition, if broadcast dispatch runs into a background service that is starting (mDelayBehindServices), dispatch is delayed until the service finishes, and any broadcast ANR caused by waiting for the service is ignored. Background broadcasts run in the background process scheduling group, while foreground broadcasts run in the foreground process scheduling group. In short, background broadcasts are less prone to ANR but slower to execute.

In addition, only serially processed broadcasts have a timeout mechanism: because receivers are handled one after another, a slow receiver delays those behind it. A parallel broadcast is delivered to all of its receivers at once in a loop, so receivers do not affect one another and there is no broadcast timeout.

To be precise, a foreground broadcast is a broadcast placed in the foreground broadcast queue.
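The rules in this section (queue chosen by flag, timeout per queue, no timeout for parallel delivery) can be summarized in a few lines. This is a hedged sketch rather than AMS code: the class BroadcastTimeoutModel is hypothetical, and its FLAG_RECEIVER_FOREGROUND is a local stand-in constant for Intent.FLAG_RECEIVER_FOREGROUND so that the example runs without the Android SDK.

```java
import java.util.OptionalLong;

/** Toy summary of the broadcast timeout rules described above (hypothetical, not AMS code). */
class BroadcastTimeoutModel {
    // Local stand-in for Intent.FLAG_RECEIVER_FOREGROUND; on Android, use the SDK constant.
    static final int FLAG_RECEIVER_FOREGROUND = 0x10000000;

    static final long BROADCAST_FG_TIMEOUT_MS = 10_000; // foreground queue, per this article
    static final long BROADCAST_BG_TIMEOUT_MS = 60_000; // background queue, per this article

    /**
     * Timeout that applies to a broadcast, or empty when no timeout applies.
     * Only serially processed broadcasts are subject to a timeout.
     */
    static OptionalLong timeoutMs(int intentFlags, boolean serial) {
        if (!serial) {
            return OptionalLong.empty();
        }
        boolean isFg = (intentFlags & FLAG_RECEIVER_FOREGROUND) != 0;
        return OptionalLong.of(isFg ? BROADCAST_FG_TIMEOUT_MS : BROADCAST_BG_TIMEOUT_MS);
    }
}
```

A serial broadcast sent with default flags falls under the 60 s threshold, the same broadcast with the foreground flag set falls under the 10 s threshold, and a parallel broadcast has no timeout regardless of its flags.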

Foreground vs. background ANR

Besides foreground services and foreground broadcasts, the notion of a foreground ANR may also be confusing. Let’s take a look at the core logic:

final void appNotResponding(...) {
    ...
    synchronized (mService) {
        isSilentANR = !showBackground && !isInterestingForBackgroundTraces(app);
        ...
    }
    ...
    File tracesFile = ActivityManagerService.dumpStackTraces(true, firstPids,
            (isSilentANR) ? null : processCpuTracker,
            (isSilentANR) ? null : lastPids,
            nativePids);

    synchronized (mService) {
        if (isSilentANR) {
            // Background ANR: kill the process silently, no dialog
            app.kill("bg anr", true);
            return;
        }
        ...
        // Show the ANR dialog so the user can choose what to do
        Message msg = Message.obtain();
        msg.what = ActivityManagerService.SHOW_NOT_RESPONDING_UI_MSG;
        msg.obj = new AppNotRespondingDialog.Data(app, activity, aboveSystem);
        mService.mUiHandler.sendMessage(msg);
    }
}

Whether an ANR counts as foreground or background depends on whether the user can perceive it when it occurs in the application, for example in a process with an activity currently visible in the foreground, or in an fg-service process that has a foreground notification. ANRs in such user-perceivable scenarios greatly affect the user experience, so a dialog is shown to let the user decide whether to quit or wait. If the application were simply killed, it would vanish from the screen for no apparent reason.

By contrast, for a background ANR the system captures only the trace of the unresponsive process, does not collect CPU information, and kills the unresponsive process directly in the background without showing a dialog.

To be precise, a foreground ANR is an ANR that occurs in a process the user can perceive.

ANR explosion scene

After an ANR occurs for a service, broadcast, provider, or input event, the central control system immediately captures on-site information for debugging and analysis. The collected information includes:

  • Write the am_anr information to the EventLog. The time at which the ANR was triggered is therefore the timestamp of the am_anr entry in the EventLog.
  • Collect thread call-stack traces for the following important processes and save them in data/anr/traces.txt:
    • The process in which the ANR occurred, the system_server process, and all persistent processes
    • audioserver, cameraserver, mediaserver, surfaceflinger, and other important native processes
    • The top 5 processes by CPU usage
  • Write the reason for the ANR and the CPU usage information to the main log.
  • Save the traces file and the CPU usage information to the dropbox directory, data/system/dropbox.
  • For processes the user can perceive, show the ANR dialog. For processes the user cannot perceive, kill the process directly.

The whole ANR information collection process is time-consuming. After the trace of each process is captured, the system waits 200 ms, so the more persistent processes there are, the longer the total wait. For a Java process, running kill -3 [pid] in an adb shell dumps the call stacks of that process; for a native process, running debuggerd -b [pid] in an adb shell does the same. For ANR problems, the traces are saved in traces.txt and in dropbox. For more details, see “Understanding the Android ANR information collection process” at gityuan.com/2016/12/02/…

With the on-site information in hand, you can debug and analyze: first locate the time of the ANR, then examine the trace information, and then, together with the context of the specific scenario, check for time-consuming message handling, binder calls, lock contention, and CPU contention. Deeper debugging requires extracting more information from the system’s point of view about resources such as messages, binder, and locks. That is not expanded on here and will be covered in later ANR case studies.

As an application developer, keep the main thread to UI-related work as far as possible and avoid time-consuming operations on it, such as overly complex UI drawing, network operations, and file I/O. Avoid lock contention between the main thread and worker threads, and reduce time-consuming calls into the system. Use SharedPreferences carefully, and watch out for provider query operations performed on the main thread. In short, lighten the main thread’s load as much as possible and keep it idle so that it can respond to user operations at any time.

Answers

Finally, let us answer the questions raised at the beginning of this article. What paths lead to an ANR? Slow execution on any one or more of the steps between planting the bomb and defusing it can cause an ANR. Taking service as an example: a service lifecycle callback (such as onStartCommand) may itself run slowly; the service callback may be delayed by other time-consuming messages in the main thread’s message queue; SP operations may be slow; or the binder threads of the system_server process may be so busy that the defuse request is not received in time. In addition, the ActivityManager thread itself can be blocked, in which case a foreground service may run past its timeout without an ANR being triggered.

Why might the trace show the main thread idle, or sitting in code that is not time-consuming, when an ANR occurs? Capturing the trace may have taken so long that the scene was missed; a large number of messages may have piled up in the main thread’s message queue, so that the snapshot captures only a momentary state; or, for a broadcast, the “queued-work-looper” thread may have been busy processing SP operations the whole time.

The material in this article comes from studying the Android system source code and from work practice. It is shared with you as one of the exclusive martial-arts manuals of the “Android Dharma Hall”, and I hope it improves your understanding of ANR.