Where do I start reading the code?

The code entry for the system ANR process is AMS#appNotResponding

ANR DUMP

After ANR occurs, the system dumps the thread stack and state of the current process and key processes

The Skip ANR scenario

  1. During shutdown skipping ANR
  2. Skipping duplicate ANR
  3. Crashing app skipping ANR
  4. App already killed by AM skipping ANR
  5. Skipping died app ANR

Only ANR processes are dumped

If the dump time is longer than 1 minute, only the ANr process is dumped to avoid making the system more busy

    /** * If the system is extremely slow somehow that the ANR has been pending too long for more than * this time, the information might be outdated. So we only the dump the unresponsive process * instead of including other processes to avoid making the system more busy. */
    private static final long EXPIRED_REPORT_TIME_MS = TimeUnit.MINUTES.toMillis(1);
Copy the code

If the background anR process is used, only the ANR process is dumped

    /** * Unless configured otherwise, swallow ANRs in background processes & kill the process. * Non-private access is for tests only. */
    @VisibleForTesting
    boolean isSilentAnr(a) {
        return! getShowBackground() && ! isInterestingForBackgroundTraces(); }Copy the code

ANR Output summary

Key information will be output when ANR is triggered :(this is time-consuming and may take 10s)

  1. Output am_ANR information to EventLog.
  2. /data/anr/ testamp.txt;
  3. ANR Reason and CPU usage information are displayed in the main log.
  4. Save the CPU usage and process trace files to /data/system/dropbox.

Key information interpretation

ANR Indicates the keyword of the occurrence log

10-25 15:37:55.057 1000 1787 32274 I am_anr: [55, 0274 com. Unicom woshipin, 953712196, Input dispatching timed out (com.unicom.woshipin/com.unicom.wotvvertical.busiz.main.PortMainActivity, 49ec192 com.unicom.woshipin/com.unicom.wotvvertical.busiz.main.PortMainActivity (server) is not responding. Waited 8012ms for MotionEvent(Action =DOWN)] 10-25 15:38:04.818 1000 1787 32274 E ActivityManager: ANR in com.unicom.woshipin (com) unicom. Woshipin/com. Unicom. Wotvvertical. Busiz. Pugc. Videopost. Videoselect. VideoSelectActivity) 10 to 25 15:38:04. 823 1000 1787 32274 D ActivityManager: Completed ANR of com.unicom.woshipin in 9834ms, latency 1ms

Trace

  1. “SignalCatcher” thread. When the Zygote forks, it calls StartSignalCatcher at InitNonZygoteOrPostFork to create a new SignalCatcher thread. This thread is used to catch Linux signals.
  2. “Runnable”, “Waiting”, “state= XXX “: indicates the working status of the current thread
  3. “Utm = XXX” and “STM = XXX “: indicates the actual running time of the thread scheduled by the CPU from creation to present, excluding the thread wait time or Sleep time. The thread CPU time can be further divided into utM and STM. The units here are jiffies. When HZ=100, 1UTm equals 10ms.
  4. “Core = XXX” : the serial number of the CPU core that executed the thread last
  5. “Nice = XXX “: indicates the thread priority. The lower the value, the higher the priority
  6. “Schedstat = XXX “: schedstat=(38065842 5059682 54) for thread CPU execution times (ns), wait times, and Switch times.

NativePollOnce scenario

Conditions for entering the NativePollOnce scenario:

When MessageQueue has no message that needs to be dispatched immediately, it enters the wait (nativePollOnce) state conditionally, and wakes up (nativeWake) the thread and executes after timeout or new message is added.

If the main thread in the ANR log is in such a state, there are two possible reasons:

  1. This ANR is caused by other factors such as CPU preemption or memory tightness
  2. When this ANR log was captured, the main thread had returned to normal

ANR info

  1. “Reason”: Indicates ANR types, such as Input, Receiver, and Service
  2. “Load: 0.44/0.54/0.8 “: indicates the system Load, which respectively represents the system CPU Load in each time period of 1 minute, 5 minutes, and 15 minutes before ANR occurs. The specific value represents the number of tasks waiting for system scheduling per unit of time (which can be understood as threads).
  3. “CPU Usage from XXX to XXX ago”: indicates the CPU usage of the process before ANR occurred
  4. “CPU Usage from XXX to XXX later”: indicates the PROCESS CPU usage after ANR occurs
  5. CPU usage consists of user, kernel, and iowait

Key processes:

  1. kswapd:It is a kernel thread used for page reclamation in Linux. It is used to maintain the balance between available memory and file cache to maximize performance. If the CPU usage of this thread is too high, noteThe available system memory is insufficientOr the memory fragmentation is serious and file cache write back or memory swap (memory swap to disk) is required. If the thread CPU is too high, the overall system performance deteriorates significantly, affecting all application scheduling.
  2. mmcqd:Kernel threads manage and forward I/O requests from the upper layer to the Driver layer in a unified manner. If the CPU usage of this thread is too high, a large number of file reads and writes exist in the system. Of course, if the memory is insufficient, file write back and memory swap will be triggeredkswapdmmcqdThey are often synchronized
  3. kworker:Represents the Linux kernel process that performs “work” (handling system calls). You can have several in your process list:kworker/0:1Is your first CPU core,kworker/1:1It’s the second class.

Logcat log

In some frequently invoked interfaces, the Android system uses checkTime and checkSLow detection before and after the method to determine whether the function execution time exceeds the set threshold. When the actual time exceeds the threshold, a Slow XXX message is displayed indicating that the system process scheduling is affected.

Generally speaking, these system processes have a high priority. If the system process scheduling is affected, it reflects that the system performance is likely to have problems during this period.

“Slow operation”

Calculation of Slow operation of the system:

    private void checkSlow(long startTime, String where) {
        long now = SystemClock.uptimeMillis();
        if ((now - startTime) > 50) {
            // If we are taking more than 50ms, log about it.
            Slog.w(TAG, "Slow operation: " + (now - startTime) + "ms so far, now at "+ where); }}Copy the code

Principle of ANR monitoring

When the system sends the component message or Input event to the application process through Binder communication, an asynchronous timeout monitoring is set on both AMS and Input servers.

ANR classification

This article briefly analyzes a few of these scenarios:

InputDispatcher Timeout principle

When an application receives an input event, it registers the input channel and input window corresponding to the input event with the system. InputDispatcher listens for these registered input events.

When the input event execution is complete, a processing finish message is sent to the system.

If no input event finish message is received within 5s, InputDispatcher sends ANR notification.

Input EVENT1: After sending mPendingEvnet to the APP layer through InputChannel, the mPendingEvnet is saved to the waitQueue.

Input event2: The next input event comes, waitQueue isn’t empty, cause checkWindowReadyForMoreInputLocked return value is not empty, trigger handleTargetsNotReadyLocked, Then the current time + 5 s as mInputTargetWaitTimeoutTime, and set the mInputTargetWaitTimeoutTime mLooper next time of awakening

BroadcastReceiver Timout principle

Statically registered and ordered broadcasts are ANR, dynamically registered and unordered broadcasts are not ANR

  1. Out-of-order broadcasts are sent to all recipients at once and no timeout monitoring is set.
final void processNextBroadcastLocked(boolean fromMsg, boolean skipOomAdj, boolean parallelOnce) {...// First, deliver any non-serialized broadcasts right away.
    while (mParallelBroadcasts.size() > 0) {
        r = mParallelBroadcasts.remove(0); r.dispatchTime = SystemClock.uptimeMillis(); r.dispatchClockTime = System.currentTimeMillis(); .final int N = r.receivers.size();
        if (DEBUG_BROADCAST_LIGHT) Slog.v(TAG_BROADCAST, "Processing parallel broadcast ["
                + mQueueName + "]" + r);
        for (int i=0; i<N; i++) {
            Object target = r.receivers.get(i);
            if (DEBUG_BROADCAST)  Slog.v(TAG_BROADCAST,
                    "Delivering non-ordered on [" + mQueueName + "] to registered "
                    + target + ":" + r);
            deliverToRegisteredReceiverLocked(r, (BroadcastFilter)target, false, i);
        }
        addBroadcastToHistoryLocked(r);
        if (DEBUG_BROADCAST_LIGHT) Slog.v(TAG_BROADCAST, "Done with parallel broadcast ["
                + mQueueName + "]"+ r); }... }Copy the code
  1. Orderly broadcast
    final void processNextBroadcastLocked(boolean fromMsg, boolean skipOomAdj, boolean parallelOnce) {...// Get the next receiver...
        int recIdx = r.nextReceiver++;
          
        // Keep track of when this receiver started, and make sure there
        // is a timeout message pending to kill it if need be.
        r.receiverTime = SystemClock.uptimeMillis();
        if (recIdx == 0) { r.dispatchTime = r.receiverTime; r.dispatchClockTime = System.currentTimeMillis(); . }if (! mPendingBroadcastTimeoutMessage) {
            long timeoutTime = r.receiverTime + mConstants.TIMEOUT;
            if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST,
                    "Submitting BROADCAST_TIMEOUT_MSG ["
                    + mQueueName + "] for " + r + " at " + timeoutTime);
            // Set timeout monitoring
            setBroadcastTimeoutLocked(timeoutTime);
        }
    
        final BroadcastOptions brOptions = r.options;
        final Object nextReceiver = r.receivers.get(recIdx);
    
        if (nextReceiver instanceof BroadcastFilter) {
            // Simple case: this is a registered receiver who gets
            // a direct call.
            BroadcastFilter filter = (BroadcastFilter)nextReceiver;
            if (DEBUG_BROADCAST)  Slog.v(TAG_BROADCAST,
                    "Delivering ordered ["
                    + mQueueName + "] to registered "
                    + filter + ":" + r);
            // Send to the receiver
            deliverToRegisteredReceiverLocked(r, filter, r.ordered, recIdx);
            if (r.receiver == null| |! r.ordered) {// The receiver has already finished, so schedule to
                // process the next one.
                if (DEBUG_BROADCAST) Slog.v(TAG_BROADCAST, "Quick finishing ["
                        + mQueueName + "]: ordered="
                        + r.ordered + " receiver="+ r.receiver); r.state = BroadcastRecord.IDLE; scheduleBroadcastsLocked(); . }Copy the code

Orderly broadcast execution process

processNextBroadcastLocked

-> deliverToRegisteredReceiverLocked

-> ActivityThread#scheduleReceiver inserts an H. lever message to the ActivityThread main thread message queue

-> H.riceiver message executes handleReceiver

-> receiver.onReceive(context.getReceiverRestrictedContext(), data.intent);

-> handleReceiver calls data.Finish () to notify AMS

– > AMS# finishReceiver method called processNextBroadcastLocked execution of a broadcast

– > after processNextBroadcastLocked distribute disorderly radio will call cancelBroadcastTimeoutLocked cancelled a timeout on the monitor

When a broadcast is sent, the system checks whether the process exists. If the process does not exist, it is created. The process creation time is counted in the timeout period

Principle of ContentProvider Timout

ContentProvider timeout is detected when the Provider process is started for the first time. Requesting the Provider again does not trigger the Provider timeout.

ContentProvider timeout listening occurs during AMS#attachApplicationLocked

private boolean attachApplicationLocked(@NonNull IApplicationThread thread, int pid, int callingUid, long startSeq) { ... if (providers ! = null && checkAppInLaunchingProvidersLocked(app)) { Message msg = mHandler.obtainMessage(CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG); msg.obj = app; mHandler.sendMessageDelayed(msg, ContentResolver.CONTENT_PROVIDER_PUBLISH_TIMEOUT_MILLIS); } checkTime(startTime, "attachApplicationLocked: before bindApplication"); . }Copy the code

The removal of the listener message occurs in the AMS#publishContentProviders procedure, which removes the TIMEOUT message after installing the provider

public final void publishContentProviders(IApplicationThread caller, List<ContentProviderHolder> providers) { if (providers == null) { return; } enforceNotIsolatedCaller("publishContentProviders"); synchronized (this) { final ProcessRecord r = getRecordForAppLocked(caller); if (DEBUG_MU) Slog.v(TAG_MU, "ProcessRecord uid = " + r.uid); if (r == null) { throw new SecurityException( "Unable to find app for caller " + caller + " (pid=" + Binder.getCallingPid() + ") when publishing content providers"); } final long origId = Binder.clearCallingIdentity(); final int N = providers.size(); for (int i = 0; i < N; i++) { ContentProviderHolder src = providers.get(i); if (src == null || src.info == null || src.provider == null) { continue; } ContentProviderRecord dst = r.pubProviders.get(src.info.name); if (DEBUG_MU) Slog.v(TAG_MU, "ContentProviderRecord uid = " + dst.uid); if (dst ! = null) { ComponentName comp = new ComponentName(dst.info.packageName, dst.info.name); mProviderMap.putProviderByClass(comp, dst); String names[] = dst.info.authority.split(";" ); for (int j = 0; j < names.length; j++) { mProviderMap.putProviderByName(names[j], dst); } int launchingCount = mLaunchingProviders.size(); int j; boolean wasInLaunchingProviders = false; for (j = 0; j < launchingCount; j++) { if (mLaunchingProviders.get(j) == dst) { mLaunchingProviders.remove(j); wasInLaunchingProviders = true; j--; launchingCount--; } } if (wasInLaunchingProviders) { mHandler.removeMessages(CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG, r); }... } } Binder.restoreCallingIdentity(origId); }}Copy the code

The ContentProvider triggering the timeout does not generate an ANR, but simply removes the process

com/android/server/am/ActivityManagerService.java
...
case CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG: {
    ProcessRecord app = (ProcessRecord)msg.obj;
    synchronized (ActivityManagerService.this) {
        processContentProviderPublishTimedOutLocked(app);
    }
                
private final void processContentProviderPublishTimedOutLocked(ProcessRecord app) {
  cleanupAppInLaunchingProvidersLocked(app, true);
  mProcessList.removeProcessLocked(app, false.true,
          ApplicationExitInfo.REASON_INITIALIZATION_FAILURE,
          ApplicationExitInfo.SUBREASON_UNKNOWN,
          "timeout publishing content providers");
}
Copy the code

The Service principle of Timeout

  1. ANR is triggered by the following callbacks to the Service: onCreate(), onStartCommand(), onStart(), onBind(), onRebind(), onTaskRemoved(), onUnbind(), onDestroy().
  2. The foreground Service timeout duration is 20 seconds, and the background Service timeout duration is 200 seconds
  3. How to distinguish foreground from Background execution ———— The APP is in user mode, and the Service executed in this case is foreground execution.
  4. User state: The process has foreground activity, foreground broadcast running, and foreground Service running

Limitations of ANR mechanism

From the previous understanding of how each type of ANR works, we can see that the system initiates asynchronous monitoring by sending timeout messages.

While waiting for an execution timeout message:

  1. In the process startup scenario, there is a lot of business and library initialization, and many messages are scheduled before the timeout message is inserted
  2. In some scenarios, a small number of messages take time for one or more messages
  3. In some cases, other threads or the system is overloaded, causing the entire system to stall

In these scenarios, if AMS is not notified in time to remove the timeout message, the ANR mechanism will be triggered,

The ANR process then dumps the stack for a message execution.

So when ANR problems occur, a lot of times the Trace stack is not a RootCause

ANR info indicates that ANR caused by a component is most likely not the fault of the component itself.


Reference article:

Toutiao ANR Optimization Practice series – Design principles and influencing factors

ANR perspective InputDispatcher