Preface

Multithreading is very common in projects, but code that handles it carelessly can deadlock. Thread deadlocks are hard to deal with at every step, from discovering the problem to locating its cause, and a deadlock that occurs on a user's device in production is harder still. It is therefore best for the project to have its own thread deadlock detection mechanism that detects deadlocks and reports them automatically, so that we can analyze the reported logs.

The WatchDog principle

Before discussing how to design a complete thread deadlock detection scheme, let us first understand how WatchDog is implemented in the Android system. WatchDog is essentially a deadlock detection thread that runs in the SystemServer process. Its job is to continuously check whether key services such as AMS and WMS are deadlocked. If a deadlock is found, it kills the SystemServer process; the Zygote process then kills itself and is restarted, which means the phone system reboots.

Next, we analyze how WatchDog is implemented by walking through the source code.
  1. When Zygote starts the SystemServer process, it calls the static main method of the SystemServer class:

     /**
      * The main entry point from zygote.
      */
     public static void main(String[] args) {
         new SystemServer().run();
     }
  2. Look directly at the run method:

     private void run() {
         ......
         // Start services.
         try {
             traceBeginAndSlog("StartServices");
             startBootstrapServices();
             startCoreServices();
             startOtherServices();
             SystemServerInitThreadPool.shutdown();
         } catch (Throwable ex) {
             Slog.e("System", "******************************************");
             Slog.e("System", "************ Failure starting system services", ex);
             throw ex;
         } finally {
             traceEnd();
         }
         ...
     }
  3. The startOtherServices method starts a number of key services, including AMS, WMS, and WatchDog:

     private void startOtherServices() {
         ......
         traceBeginAndSlog("StartWatchdog");
         Watchdog.getInstance().start();
         traceEnd();
     }
  4. As you can see, WatchDog is designed as a singleton, and it extends the Thread class. The start method called here therefore launches the detection thread, which executes the run method:

     @Override
     public void run() {
         boolean waitedHalf = false;
         while (true) {
             final List<HandlerChecker> blockedCheckers;
             final String subject;
             final boolean allowRestart;
             int debuggerWasConnected = 0;
             synchronized (this) {
                 long timeout = CHECK_INTERVAL;
                 // Make sure we (re)spin the checkers that have become idle within
                 // this wait-and-check interval
                 for (int i=0; i<mHandlerCheckers.size(); i++) {
                     HandlerChecker hc = mHandlerCheckers.get(i);
                     hc.scheduleCheckLocked();
                 }
                 if (debuggerWasConnected > 0) {
                     debuggerWasConnected--;
                 }
                 // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                 // wait while asleep. If the device is asleep then the thing that we are waiting
                 // to timeout on is asleep as well and won't have a chance to run, causing a false
                 // positive on when to kill things.
                 long start = SystemClock.uptimeMillis();
                 while (timeout > 0) {
                     if (Debug.isDebuggerConnected()) {
                         debuggerWasConnected = 2;
                     }
                     try {
                         wait(timeout);
                     } catch (InterruptedException e) {
                         Log.wtf(TAG, e);
                     }
                     if (Debug.isDebuggerConnected()) {
                         debuggerWasConnected = 2;
                     }
                     timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                 }
                 boolean fdLimitTriggered = false;
                 if (mOpenFdMonitor != null) {
                     fdLimitTriggered = mOpenFdMonitor.monitor();
                 }
                 if (!fdLimitTriggered) {
                     final int waitState = evaluateCheckerCompletionLocked();
                     if (waitState == COMPLETED) {
                         // The monitors have returned; reset
                         waitedHalf = false;
                         continue;
                     } else if (waitState == WAITING) {
                         // still waiting but within their configured intervals; back off and recheck
                         continue;
                     } else if (waitState == WAITED_HALF) {
                         if (!waitedHalf) {
                             // We've waited half the deadlock-detection interval. Pull a stack
                             // trace and wait another half.
                             ArrayList<Integer> pids = new ArrayList<Integer>();
                             pids.add(Process.myPid());
                             ActivityManagerService.dumpStackTraces(true, pids, null, null,
                                     getInterestingNativePids());
                             waitedHalf = true;
                         }
                         continue;
                     }
                     // something is overdue!
                     blockedCheckers = getBlockedCheckersLocked();
                     subject = describeCheckersLocked(blockedCheckers);
                 } else {
                     blockedCheckers = Collections.emptyList();
                     subject = "Open FD high water mark reached";
                 }
                 allowRestart = mAllowRestart;
             }
             // If we got here, that means that the system is most likely hung.
             // First collect stack traces from all threads of the system process.
             // Then kill this process so that the system will restart.
             EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
             ArrayList<Integer> pids = new ArrayList<>();
             pids.add(Process.myPid());
             if (mPhonePid > 0) pids.add(mPhonePid);
             // Pass !waitedHalf so that just in case we somehow wind up here without having
             // dumped the halfway stacks, we properly re-initialize the trace file.
             final File stack = ActivityManagerService.dumpStackTraces(
                     !waitedHalf, pids, null, null, getInterestingNativePids());
             // Give some extra time to make sure the stack traces get written.
             // The system's been hanging for a minute, another second or two won't hurt much.
             SystemClock.sleep(2000);
             // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
             doSysRq('w');
             doSysRq('l');
             // Try to add the error to the dropbox, but assuming that the ActivityManager
             // itself may be deadlocked. (which has happened, causing this statement to
             // deadlock and the watchdog as a whole to be ineffective)
             Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                 public void run() {
                     mActivity.addErrorToDropBox(
                             "watchdog", null, "system_server", null, null,
                             subject, null, stack, null);
                 }
             };
             dropboxThread.start();
             try {
                 dropboxThread.join(2000); // wait up to 2 seconds for it to return.
             } catch (InterruptedException ignored) {}
             IActivityController controller;
             synchronized (this) {
                 controller = mController;
             }
             if (controller != null) {
                 Slog.i(TAG, "Reporting stuck state to activity controller");
                 try {
                     Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                     // 1 = keep waiting, -1 = kill system
                     int res = controller.systemNotResponding(subject);
                     if (res >= 0) {
                         Slog.i(TAG, "Activity controller requested to coninue to wait");
                         waitedHalf = false;
                         continue;
                     }
                 } catch (RemoteException e) {
                 }
             }
             // Only kill the process if the debugger is not attached.
             if (Debug.isDebuggerConnected()) {
                 debuggerWasConnected = 2;
             }
             if (debuggerWasConnected >= 2) {
                 Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
             } else if (debuggerWasConnected > 0) {
                 Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
             } else if (!allowRestart) {
                 Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
             } else {
                 Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                 WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
                 Slog.w(TAG, "*** GOODBYE!");
                 Process.killProcess(Process.myPid());
                 System.exit(10);
             }
             waitedHalf = false;
         }
     }

    Unsurprisingly, the run method is an infinite loop that keeps executing the detection logic. That logic can be divided into three parts:

    (1) Detect whether deadlock occurs.

    (2) Collect information on all SystemServer threads if a deadlock occurs.

    (3) Kill the SystemServer process.

    The following sections analyze the implementation details of these three parts. First, let us look at how WatchDog detects deadlocks. Several fields are involved, so we introduce them first:

     public class Watchdog extends Thread {
         ......
         final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
         final HandlerChecker mMonitorChecker;
         ......
     }

    A HandlerChecker is a checker: each HandlerChecker object checks one thread, and because several threads have to be checked, the checkers are kept in an ArrayList. Each HandlerChecker holds a reference to a Handler bound to the Looper of the thread being checked, and uses this Handler to post messages to that thread's Looper. A HandlerChecker also holds a list of Monitor objects, which are the objects to be checked on that thread. For example, AMS and WMS, both of which implement the Monitor interface, are added to the Monitor list of mMonitorChecker (in the AOSP source this checker runs on the foreground thread, FgThread, rather than on the UI thread). The relationships are as follows:

    1 WatchDog --> N HandlerCheckers --> N checked threads

    1 HandlerChecker --> N Monitors --> N checked objects

    Each HandlerChecker uses the Handler of its thread to post a task to the front of that thread's message queue. When the task runs, it iterates over the Monitor list and calls the monitor method of each Monitor object. Each concrete class provides its own implementation of monitor; for example, the AMS monitor method:

     /** In this method we try to acquire our lock to make sure that we have not deadlocked */
     public void monitor() {
         synchronized (this) { }
     }

    WMS monitor method:

     // Called by the heartbeat to ensure locks are not held indefnitely (for deadlock detection).
     @Override
     public void monitor() {
         synchronized (mWindowMap) { }
     }

    As you can see, both methods simply check whether the lock can be acquired; if another thread holds the lock, the call waits until the lock becomes available. Note that AMS locks on this, whereas WMS locks on mWindowMap. If the lock can be acquired, the service is in a normal state. If it cannot, some thread has been holding the lock for a long time for an unknown reason; other methods of the service that need the lock cannot acquire it, so calls into the service never return. In other words, the service is in an abnormal state.

    Besides waiting for the lock, the monitor method may never be called at all. Although the task that calls monitor is posted to the front of the Looper's message queue, it cannot run while the message currently being processed is blocked; this case is also treated as a deadlock.
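The two failure modes described above (a lock that cannot be acquired, and a queue that never reaches the check) can be reproduced in plain Java. The following is only a minimal sketch, not the Android implementation: a single-thread executor stands in for a Looper thread, and the names Monitor, LockedService, MonitorProbe and LockHolder are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the Monitor interface used by WatchDog. */
interface Monitor {
    void monitor();
}

/** Hypothetical service that guards its state with its own intrinsic lock. */
class LockedService implements Monitor {
    @Override
    public void monitor() {
        synchronized (this) { } // returns only once the lock can be acquired
    }
}

/** Runs monitor checks on one worker thread (a stand-in for a Looper thread). */
class MonitorProbe {
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "probe-worker");
        t.setDaemon(true); // do not keep the JVM alive if the worker gets stuck
        return t;
    });

    /** Returns true if every monitor() call returned within timeoutMs. */
    boolean check(List<Monitor> monitors, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        worker.execute(() -> {
            for (Monitor m : monitors) {
                m.monitor();
            }
            done.countDown();
        });
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

/** Test helper: simulates a thread that takes a lock and never releases it. */
class LockHolder {
    /** Returns once a daemon thread holds target's intrinsic lock. */
    static void holdForever(Object target) {
        CountDownLatch held = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            synchronized (target) {
                held.countDown();
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException ignored) { }
            }
        }, "lock-holder");
        t.setDaemon(true);
        t.start();
        try {
            held.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With a free lock, check returns true. After LockHolder.holdForever(service), check times out and returns false. Any later check on the same MonitorProbe also returns false, even for a healthy object, because the stuck task blocks the worker's queue, which mirrors the "monitor is never called" case.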

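    Moving on: after posting the monitor tasks to all checked threads, the detection thread starts waiting, here for 30 s. When the wait ends, it checks the execution status of every monitor task. A task that has finished means its thread is normal; a task that has not started, or is still running past the allowed time, means that thread is deadlocked. As the run method above shows, the first time a task has been pending for half of the deadlock-detection interval, WatchDog only dumps stack traces and waits another half. Once a task is overdue, it starts collecting information on all threads: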
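The state names COMPLETED, WAITING and WAITED_HALF appear in the run method quoted earlier; how each checker arrives at its state is not shown in the excerpt. The sketch below is one plausible reading, not the AOSP code: the class CheckerState, the constant OVERDUE, the numeric values and the "take the worst state" rule are all assumptions made for illustration.

```java
/** Simplified, hypothetical model of how a checker's wait state could be derived. */
final class CheckerState {
    // Ordered so that a larger value means a worse state (assumed ordering).
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    private CheckerState() { }

    /**
     * @param completed whether the posted monitor task has finished
     * @param elapsedMs time since the task was posted
     * @param maxWaitMs full deadlock-detection interval for this checker
     */
    static int evaluate(boolean completed, long elapsedMs, long maxWaitMs) {
        if (completed) {
            return COMPLETED;
        }
        if (elapsedMs < maxWaitMs / 2) {
            return WAITING;     // still within the first half of the interval
        }
        if (elapsedMs < maxWaitMs) {
            return WAITED_HALF; // time to dump stacks and wait another half
        }
        return OVERDUE;         // treated as a deadlock
    }

    /** The overall state is the worst state among all checkers. */
    static int worst(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }
}
```

Assuming a 60 s interval (two 30 s waits), a task still pending after 30 s evaluates to WAITED_HALF and one still pending after 60 s evaluates to OVERDUE.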

     // If we got here, that means that the system is most likely hung.
     // First collect stack traces from all threads of the system process.
     // Then kill this process so that the system will restart.
     EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
     ArrayList<Integer> pids = new ArrayList<>();
     pids.add(Process.myPid());
     if (mPhonePid > 0) pids.add(mPhonePid);
     // Pass !waitedHalf so that just in case we somehow wind up here without having
     // dumped the halfway stacks, we properly re-initialize the trace file.
     final File stack = ActivityManagerService.dumpStackTraces(
             !waitedHalf, pids, null, null, getInterestingNativePids());
     // Give some extra time to make sure the stack traces get written.
     // The system's been hanging for a minute, another second or two won't hurt much.
     SystemClock.sleep(2000);

    The collected thread information is written to the /data/anr/traces.txt file.

    Finally, the WatchDog kills SystemServer, which is relatively easy to implement by calling the killProcess method of Process.

     Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
     WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
     Slog.w(TAG, "*** GOODBYE!");
     Process.killProcess(Process.myPid());
     System.exit(10);
Summary: the core of WatchDog is its deadlock detection logic. It starts a thread that loops forever, checking whether the monitored threads are blocked; if one is, a deadlock is considered to have occurred.

The project design

The design of a thread deadlock detection scheme has to consider two aspects: the detection logic and the capture of thread information.

  • Detection logic

    • Scheme 1

      This is similar to WatchDog: start an infinite-loop thread that checks whether the Loopers of other threads are blocked. There are two kinds of blocking: a blocked message queue, and failure to acquire the lock of a key object.

    • Scheme 2

      An infinite-loop thread periodically (say every 5 s) checks the state of all threads. If at least two threads have been BLOCKED for too long (say 3 minutes), the threads are considered deadlocked. This approach does not work for ReentrantLock, because a thread blocked on a ReentrantLock reports the state WAITING rather than BLOCKED.
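Scheme 2 can be sketched with the standard Thread.getAllStackTraces() and Thread.getState() calls. The classes BlockedThreadTracker and DeadlockFixture below are hypothetical, and the caller supplies the timestamp so the logic can be exercised without waiting three minutes; on Android a value such as SystemClock.uptimeMillis() could be passed in. This is a sketch under those assumptions, not a production detector.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

/** Tracks how long each thread has been seen BLOCKED across consecutive samples. */
class BlockedThreadTracker {
    private final long thresholdMs;
    // thread id -> timestamp of the first sample in which it was seen BLOCKED
    private final Map<Long, Long> blockedSince = new HashMap<>();

    BlockedThreadTracker(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    /** Takes one sample at nowMs; returns threads BLOCKED for at least thresholdMs. */
    List<Thread> sample(long nowMs) {
        Map<Long, Long> stillBlocked = new HashMap<>();
        List<Thread> suspects = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getState() != Thread.State.BLOCKED) {
                continue;
            }
            long since = blockedSince.getOrDefault(t.getId(), nowMs);
            stillBlocked.put(t.getId(), since);
            if (nowMs - since >= thresholdMs) {
                suspects.add(t);
            }
        }
        // Threads that are no longer BLOCKED drop out of the bookkeeping.
        blockedSince.clear();
        blockedSince.putAll(stillBlocked);
        return suspects;
    }

    /** Scheme 2 rule: at least two threads BLOCKED for too long. */
    boolean looksDeadlocked(long nowMs) {
        return sample(nowMs).size() >= 2;
    }
}

/** Demo helper: creates a real two-thread deadlock on intrinsic locks. */
class DeadlockFixture {
    /** Returns once both daemon threads are BLOCKED on each other's lock. */
    static void start() {
        final Object a = new Object();
        final Object b = new Object();
        CountDownLatch bothHold = new CountDownLatch(2);
        Thread t1 = new Thread(() -> lockInOrder(a, b, bothHold), "deadlock-1");
        Thread t2 = new Thread(() -> lockInOrder(b, a, bothHold), "deadlock-2");
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        while (t1.getState() != Thread.State.BLOCKED
                || t2.getState() != Thread.State.BLOCKED) {
            Thread.yield();
        }
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch bothHold) {
        synchronized (first) {
            bothHold.countDown();
            try {
                bothHold.await(); // wait until the other thread holds its first lock
            } catch (InterruptedException e) {
                return;
            }
            synchronized (second) { } // never acquired: the other thread holds it
        }
    }
}
```

One limitation of sampling: a thread that unblocks and blocks again between two samples is indistinguishable from one that stayed BLOCKED, so a short sampling period and a long threshold reduce false positives but do not eliminate them.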

      Java Thread States and Life Cycle
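The difference in thread state between synchronized and ReentrantLock can be observed directly. The sketch below uses hypothetical names (LockStateDemo, OwnerAwareLock). It starts a contender thread against a lock held by the caller and reports the contender's state. It also shows one possible way to find out who holds a ReentrantLock: the class has a protected getOwner() method that a subclass can expose.

```java
import java.util.concurrent.locks.ReentrantLock;

/** ReentrantLock subclass that exposes the protected getOwner() for diagnostics. */
class OwnerAwareLock extends ReentrantLock {
    /** Returns the thread currently holding the lock, or null if it is free. */
    Thread owner() {
        return getOwner();
    }
}

class LockStateDemo {
    /** State of a thread stuck entering a synchronized block held by the caller. */
    static Thread.State stateWhenBlockedOnMonitor() {
        Object monitor = new Object();
        synchronized (monitor) {
            Thread t = startDaemon(() -> {
                synchronized (monitor) { }
            });
            return waitUntilStuck(t);
        }
    }

    /** State of a thread stuck in lock() on a ReentrantLock held by the caller. */
    static Thread.State stateWhenBlockedOnReentrantLock(ReentrantLock lock) {
        lock.lock();
        try {
            Thread t = startDaemon(() -> {
                lock.lock();
                lock.unlock();
            });
            return waitUntilStuck(t);
        } finally {
            lock.unlock();
        }
    }

    private static Thread startDaemon(Runnable body) {
        Thread t = new Thread(body, "contender");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Spins until the thread is neither NEW nor RUNNABLE, then returns its state. */
    private static Thread.State waitUntilStuck(Thread t) {
        Thread.State s;
        do {
            Thread.yield();
            s = t.getState();
        } while (s == Thread.State.NEW || s == Thread.State.RUNNABLE);
        return s;
    }
}
```

On a standard JVM the first method reports BLOCKED and the second reports WAITING, which is why a detector that only looks for BLOCKED threads misses ReentrantLock deadlocks.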

  • Thread information capture

    Because the information is captured inside the app rather than in a system service such as AMS, there may be permission problems. We can capture the stack traces of the threads directly, but they do not say which thread holds which lock, which makes it harder to analyze and locate the problem. The article "Mobile QQ Android Thread Deadlock Monitoring and Automated Analysis in Practice" divides lock usage into three types: synchronized, wait/notify, and ReentrantLock. For the first two, the system's /data/anr/traces.txt file records which thread holds which lock and which thread is waiting for which lock, so the app only needs to send a SIGQUIT signal to its own process to trigger generation of the traces.txt file. For ReentrantLock, the only option is to record in code which threads use the lock. However, there are two problems:

    1. On some phones the app may not have permission to read /data/anr/traces.txt, or the file may have a different name such as traces-xxx.

      There is no workaround for this problem; in that case only the thread stack information can be read.

    2. How does the app send a SIGQUIT signal to its own process?

      android.os.Process provides a ready-made API:

       Process.sendSignal(Process.myPid(), Process.SIGNAL_QUIT);

    Of course, in most cases the stack traces alone are enough to work out the cause of a deadlock, so it is not always necessary to know who holds the lock.
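Capturing the stack traces of all threads from inside the process needs no special permission. A minimal sketch using the standard Thread.getAllStackTraces() call follows; the class name ThreadDump and the output layout are arbitrary choices, not a format that any tool expects.

```java
import java.util.Map;

/** Formats the name, state and stack of every live thread in this process. */
class ThreadDump {
    static String capture() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            sb.append('"').append(t.getName()).append('"')
              .append(" id=").append(t.getId())
              .append(" state=").append(t.getState())
              .append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

The detection thread could call ThreadDump.capture() when it decides a deadlock has occurred and attach the resulting string to the report. Unlike the system trace file, this output does not show which thread holds which lock.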