One: Why do you need a watchdog?
I first came across the word "watchdog" in my college microcontroller textbook, in the section on the watchdog timer. Long ago, when microcontrollers were first developed, they were easily disturbed by external interference, which could make their programs run away (lose control of their execution flow). The watchdog was introduced as a protection mechanism: the program has to "feed the dog" regularly, and if it fails to do so, the watchdog triggers a restart. The general principle is that once the system is running, the watchdog counter starts counting automatically. If the counter is not cleared within a certain period, it overflows, which raises a watchdog interrupt and resets the system.
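The counter-and-feed principle can be simulated in a few lines. This is a minimal sketch, not real hardware: a tick stands in for a clock pulse, the limit of 5 ticks is an arbitrary assumed value, and a "reset" is only counted rather than actually restarting anything.

```java
// A minimal simulation of a watchdog timer: a counter that grows on every
// clock tick and must be cleared ("fed") before it reaches its limit.
public class WatchdogTimerDemo {

    static final class WatchdogTimer {
        private final int limit;  // ticks allowed between two feeds (assumed value)
        private int counter = 0;  // counts up automatically while the system runs
        private int resets = 0;   // how many times the "system" was reset

        WatchdogTimer(int limit) {
            this.limit = limit;
        }

        // Called by healthy program code: clearing the counter is "feeding the dog".
        void feed() {
            counter = 0;
        }

        // Called on every clock tick. On overflow the watchdog resets the system.
        void tick() {
            counter++;
            if (counter >= limit) {
                resets++;     // stands in for the watchdog interrupt and system reset
                counter = 0;  // the counter starts over after the reset
            }
        }

        int resets() {
            return resets;
        }
    }

    public static void main(String[] args) {
        WatchdogTimer wdt = new WatchdogTimer(5);
        // Healthy program: feeds every 3 ticks, so the counter never reaches 5.
        for (int t = 1; t <= 20; t++) {
            wdt.tick();
            if (t % 3 == 0) {
                wdt.feed();
            }
        }
        System.out.println("resets while healthy: " + wdt.resets());  // prints 0

        // Runaway program: stops feeding, so the counter overflows repeatedly.
        for (int t = 1; t <= 20; t++) {
            wdt.tick();
        }
        System.out.println("resets in total: " + wdt.resets());  // prints 4
    }
}
```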
A mobile phone is, in effect, a super-powered microcontroller: it runs many times faster and has many times more storage than a single-chip microcomputer, it runs many threads, and all kinds of hardware and software have to work together. As the saying goes, it is not the ten thousand ordinary cases you fear, but the one exception: what if the system deadlocks, or what if interference makes a program run away? Both can happen, so a phone needs a watchdog mechanism too.
Two: Android system layer watchdog
Watchdogs come in two kinds: hardware and software. A hardware watchdog is a timer circuit in the microcontroller; a software watchdog is one we implement ourselves using a similar mechanism. To keep the system stable, Android also has such a watchdog. It monitors many system services to make sure they keep running normally, and when a core service misbehaves it saves the scene (logs and stack traces) and restarts it.
Let’s take a look at how the Android system Watchdog is designed.
Note: this article is based on the Android 6.0 source code.
The Android Watchdog source is at: frameworks/base/services/core/java/com/android/server/Watchdog.java
The Watchdog is initialized in SystemServer: frameworks/base/services/java/com/android/server/SystemServer.java
492 Slog.i(TAG, "Init Watchdog");
493 final Watchdog watchdog = Watchdog.getInstance();
494 watchdog.init(context, mActivityManagerService);
Initialization goes through the constructor first and then the init() method:
216 private Watchdog() {
217 super("watchdog");
218 // Initialize handler checkers for each common thread we want to check. Note
219 // that we are not currently checking the background thread, since it can
220 // potentially hold longer running operations with no guarantees about the timeliness
221 // of operations there.
222
223 // The shared foreground thread is the main checker. It is where we
224 // will also dispatch monitor checks and do other work.
225 mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
226 "foreground thread", DEFAULT_TIMEOUT);
227 mHandlerCheckers.add(mMonitorChecker);
228 // Add checker for main thread. We only do a quick check since there
229 // can be UI running on the thread.
230 mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
231 "main thread", DEFAULT_TIMEOUT));
232 // Add checker for shared UI thread.
233 mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
234 "ui thread", DEFAULT_TIMEOUT));
235 // And also check IO thread.
236 mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
237 "i/o thread", DEFAULT_TIMEOUT));
238 // And the display thread.
239 mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
240 "display thread", DEFAULT_TIMEOUT));
241
242 // Initialize monitor for Binder threads.
243 addMonitor(new BinderThreadMonitor());
244 }
246 public void init(Context context, ActivityManagerService activity) {
247 mResolver = context.getContentResolver();
248 mActivity = activity;
249 // Register a receiver for the reboot broadcast
250 context.registerReceiver(new RebootRequestReceiver(),
251 new IntentFilter(Intent.ACTION_REBOOT),
252 android.Manifest.permission.REBOOT, null);
253 }
Watchdog extends Thread, so it also has to be started somewhere. That happens in the systemReady() method of ActivityManagerService:
Watchdog.getInstance().start();
TAG: HandlerChecker
HandlerChecker is the class the Watchdog uses to check the foreground, main, UI, I/O, and display threads. The idea is to post to the MessageQueue of each Handler's Looper and see whether the message gets processed in time, which tells us whether the thread is stuck. All of these threads run in the system_server process.
public final class HandlerChecker implements Runnable {
88 private final Handler mHandler;
89 private final String mName;
90 private final long mWaitMax;
91 private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
92 private boolean mCompleted;
93 private Monitor mCurrentMonitor;
94 private long mStartTime;
95
96 HandlerChecker(Handler handler, String name, long waitMaxMillis) {
97 mHandler = handler;
98 mName = name;
99 mWaitMax = waitMaxMillis;
100 mCompleted = true;
101 }
102
103 public void addMonitor(Monitor monitor) {
104 mMonitors.add(monitor);
105 }
106 // Schedule a check and record its start time
107 public void scheduleCheckLocked() {
108 if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
109 // If the target looper has recently been polling, then
110 // there is no reason to enqueue our checker on it since that
111 // is as good as it not being deadlocked. This avoid having
112 // to do a context switch to check the thread. Note that we
113 // only do this if mCheckReboot is false and we have no
114 // monitors, since those would need to be executed at this point.
115 mCompleted = true;
116 return;
117 }
118
119 if (!mCompleted) {
120 // we already have a check in flight, so no need
121 return;
122 }
123
124 mCompleted = false;
125 mCurrentMonitor = null;
126 mStartTime = SystemClock.uptimeMillis();
127 mHandler.postAtFrontOfQueue(this);
128 }
129
130 public boolean isOverdueLocked() {
131 return (!mCompleted) && (SystemClock.uptimeMillis() > mStartTime + mWaitMax);
132 }
133 // Get the completion state of the current check
134 public int getCompletionStateLocked() {
135 if (mCompleted) {
136 return COMPLETED;
137 } else {
138 long latency = SystemClock.uptimeMillis() - mStartTime;
139 if (latency < mWaitMax/2) {
140 return WAITING;
141 } else if (latency < mWaitMax) {
142 return WAITED_HALF;
143 }
144 }
145 return OVERDUE;
146 }
147
148 public Thread getThread() {
149 return mHandler.getLooper().getThread();
150 }
151
152 public String getName() {
153 return mName;
154 }
155
156 public String describeBlockedStateLocked() {
157 if (mCurrentMonitor == null) {
158 return "Blocked in handler on " + mName + "(" + getThread().getName() + ")";
159 } else {
160 return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
161 + " on " + mName + "(" + getThread().getName() + ")";
162 }
163 }
164
165 @Override
166 public void run() {
167 final int size = mMonitors.size();
168 for (int i = 0 ; i < size ; i++) {
169 synchronized (Watchdog.this) {
170 mCurrentMonitor = mMonitors.get(i);
171 }
172 mCurrentMonitor.monitor();
173 }
174
175 synchronized (Watchdog.this) {
176 mCompleted = true;
177 mCurrentMonitor = null;
178 }
179 }
180 }
The key call in the code above is:
mHandler.getLooper().getQueue().isPolling()
This method is implemented in MessageQueue; its code is below. As the Javadoc says, it returns whether the looper's thread is currently polling for more work, which is a good sign that the loop is still alive. As the HandlerChecker source shows, if the checker has no monitors and isPolling() returns true, scheduleCheckLocked() marks the check as completed and returns immediately, without posting anything.
139 /**
140 * Returns whether this looper's thread is currently polling for more work to do.
141 * This is a good signal that the loop is still alive rather than being stuck
142 * handling a callback. Note that this method is intrinsically racy, since the
143 * state of the loop can change before you get the result back.
144 *
145 * <p>This method is safe to call from any thread.
146 *
147 * @return True if the looper is currently polling for events.
148 * @hide
149 */
150 public boolean isPolling() {
151 synchronized (this) {
152 return isPollingLocked();
153 }
154 }
155
Otherwise (the looper is busy, or there are monitors to run), and provided no check is already in flight, the checker sets mCompleted to false, records the start time, and posts itself to the front of the queue with postAtFrontOfQueue(). mCompleted == false means a check message has been sent and is waiting to be processed. If the looper is not blocked, it will soon execute the checker's run() method.
What does run() do? As line 166 of the HandlerChecker source shows, it iterates over the checker's monitors, calls monitor() on each one, and then sets mCompleted back to true. If the looper is blocked, or any monitor blocks, mCompleted stays false.
So when the Watchdog later calls getCompletionStateLocked(), it takes the else branch, computes how long the check has been pending, and returns the corresponding state:
133 // Get the completion state of the current check
134 public int getCompletionStateLocked() {
135 if (mCompleted) {
136 return COMPLETED;
137 } else {
138 long latency = SystemClock.uptimeMillis() - mStartTime;
139 if (latency < mWaitMax/2) {
140 return WAITING;
141 } else if (latency < mWaitMax) {
142 return WAITED_HALF;
143 }
144 }
145 return OVERDUE;
146 }
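To see the post-and-wait idea in isolation, here is a plain-Java sketch that runs outside Android. It is an illustration under stated assumptions, not the AOSP class: a single-thread ExecutorService stands in for a Looper thread, executor.execute() stands in for postAtFrontOfQueue() (so the check queues behind earlier tasks), System.nanoTime() stands in for SystemClock.uptimeMillis(), and the isPolling() shortcut and monitors are left out. The names CheckerSketch, Checker, and sleepQuietly are made up for this example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Plain-Java sketch of the HandlerChecker idea, not the AOSP class.
public class CheckerSketch {
    static final int COMPLETED = 0, WAITING = 1, WAITED_HALF = 2, OVERDUE = 3;

    static long nowMillis() {
        return System.nanoTime() / 1_000_000;  // monotonic clock in milliseconds
    }

    static final class Checker implements Runnable {
        private final ExecutorService looper;  // the thread being checked
        private final long waitMaxMillis;
        private boolean completed = true;      // guarded by this
        private long startTime;                // guarded by this

        Checker(ExecutorService looper, long waitMaxMillis) {
            this.looper = looper;
            this.waitMaxMillis = waitMaxMillis;
        }

        // Like scheduleCheckLocked(): post a check unless one is already in flight.
        synchronized void scheduleCheck() {
            if (!completed) {
                return;
            }
            completed = false;
            startTime = nowMillis();
            looper.execute(this);  // AOSP uses postAtFrontOfQueue() here
        }

        // Executes on the checked thread, so it only runs if that thread is not stuck.
        @Override
        public synchronized void run() {
            completed = true;
        }

        // Like getCompletionStateLocked(): classify how long the check has been pending.
        synchronized int getCompletionState() {
            if (completed) {
                return COMPLETED;
            }
            long latency = nowMillis() - startTime;
            if (latency < waitMaxMillis / 2) {
                return WAITING;
            } else if (latency < waitMaxMillis) {
                return WAITED_HALF;
            }
            return OVERDUE;
        }
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ExecutorService looper = Executors.newSingleThreadExecutor();
        Checker checker = new Checker(looper, 200);

        // Healthy thread: the posted check runs almost immediately.
        checker.scheduleCheck();
        sleepQuietly(50);
        System.out.println("idle thread state: " + checker.getCompletionState());

        // Stuck thread: a long task occupies it, so the check cannot run in time.
        looper.execute(() -> sleepQuietly(500));
        checker.scheduleCheck();
        sleepQuietly(300);
        System.out.println("blocked thread state: " + checker.getCompletionState());

        looper.shutdown();  // lets the queued tasks finish, then the thread exits
    }
}
```

The design point carried over from HandlerChecker is that the check result is written by the checked thread itself: if that thread is stuck, nobody ever sets `completed` back to true, and the elapsed time alone decides the state.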
So now we know the two ways the Watchdog tells whether a thread is stuck:
- MessageQueue.isPolling
- Monitor.monitor
TAG: Monitor
204 public interface Monitor {
205 void monitor();
206 }
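Monitor is an interface implemented by several classes. Monitors are registered through Watchdog.addMonitor(), which adds them to mMonitorChecker, the HandlerChecker for the foreground thread. Monitors can only be added before the Watchdog thread starts: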
225 mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
226 "foreground thread", DEFAULT_TIMEOUT);
227 mHandlerCheckers.add(mMonitorChecker);
275 public void addMonitor(Monitor monitor) {
276 synchronized (this) {
277 if (isAlive()) {
278 throw new RuntimeException("Monitors can't be added once the Watchdog is running");
279 }
280 mMonitorChecker.addMonitor(monitor);
281 }
282 }
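Two details from the code so far fit together: Watchdog is a singleton that extends Thread and is started later in systemReady(), and addMonitor() refuses new monitors once that thread is alive. The sketch below reproduces just that registration rule in plain Java. MiniWatchdog and RegistrationSketch are hypothetical names, and the run() loop is a placeholder rather than the real checking logic.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the Watchdog registration rule; MiniWatchdog is a hypothetical stand-in.
public class RegistrationSketch {
    interface Monitor {
        void monitor();
    }

    static final class MiniWatchdog extends Thread {
        private static MiniWatchdog sInstance;
        private final List<Monitor> monitors = new ArrayList<>();

        private MiniWatchdog() {
            super("watchdog");
            setDaemon(true);  // only so this demo's JVM can exit
        }

        static synchronized MiniWatchdog getInstance() {
            if (sInstance == null) {
                sInstance = new MiniWatchdog();
            }
            return sInstance;
        }

        void addMonitor(Monitor monitor) {
            synchronized (this) {
                // Same guard as Watchdog.addMonitor(): no changes once the thread runs.
                if (isAlive()) {
                    throw new RuntimeException("Monitors can't be added once the Watchdog is running");
                }
                monitors.add(monitor);
            }
        }

        synchronized int monitorCount() {
            return monitors.size();
        }

        @Override
        public void run() {
            // Placeholder: the real run() loop checks the HandlerCheckers here.
            while (!isInterrupted()) {
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }

    public static void main(String[] args) {
        MiniWatchdog watchdog = MiniWatchdog.getInstance();
        watchdog.addMonitor(() -> { });      // allowed: the thread has not started yet
        watchdog.start();                    // compare Watchdog.getInstance().start()
        try {
            watchdog.addMonitor(() -> { });  // rejected: the thread is alive
        } catch (RuntimeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println("registered monitors: " + watchdog.monitorCount());
    }
}
```

A plausible reason for the rule is that the monitor list is read on the foreground thread without further locking once checks begin, so freezing it at start time avoids concurrent modification.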
So any class that wants the Watchdog to watch it just implements this interface and registers itself. Let's look at how ActivityManagerService does it. The source is at /frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java
2381 Watchdog.getInstance().addMonitor(this);
19655 /** In this method we try to acquire our lock to make sure that we have not deadlocked */
19656 public void monitor() {
19657 synchronized (this) {}
19658 }
As you can see, AMS implements this interface and registers itself with the Watchdog at line 2381. Its monitor() method just acquires and immediately releases its own lock. If AMS is deadlocked, or some thread holds the AMS lock for too long, that call blocks and the foreground HandlerChecker never completes. It does very little, but that is enough for the outside world to tell whether AMS is stuck.
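The lock trick is easy to reproduce. In this sketch, FakeService is a hypothetical stand-in for AMS with the same empty `synchronized (this) {}` monitor, and monitorReturnsWithin() is a made-up helper that runs monitor() on a fresh thread and waits with a timeout. In the real Watchdog, monitor() runs on the foreground thread inside HandlerChecker.run(), and the timeout is enforced by the Watchdog loop instead.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of lock-based deadlock detection in the style of AMS.monitor().
public class MonitorSketch {
    interface Monitor {
        void monitor();
    }

    // Hypothetical service standing in for ActivityManagerService.
    static final class FakeService implements Monitor {
        @Override
        public void monitor() {
            synchronized (this) { }  // returns only once the service lock is free
        }
    }

    // Runs monitor() on a helper thread and reports whether it returned in time.
    static boolean monitorReturnsWithin(Monitor m, long millis) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            m.monitor();
            done.countDown();
        });
        t.setDaemon(true);  // a stuck checker must not keep the JVM alive
        t.start();
        return done.await(millis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        FakeService service = new FakeService();
        System.out.println("free lock: " + monitorReturnsWithin(service, 200));  // true

        // Simulate a stuck service: another thread grabs its lock and never releases it.
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (service) {
                lockHeld.countDown();
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException e) {
                    // exit and release the lock
                }
            }
        });
        holder.setDaemon(true);
        holder.start();
        lockHeld.await();
        System.out.println("held lock: " + monitorReturnsWithin(service, 200));  // false
    }
}
```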
Now that we know how the Watchdog tells whether other services are deadlocked, let's look at how Watchdog's run() method ties the whole mechanism together.
TAG: Watchdog.run
The run() method is an infinite loop. Each iteration calls scheduleCheckLocked() on every HandlerChecker, waits 30 seconds, and then evaluates their state. See the comments in the code for details:
341 @Override
342 public void run() {
343 boolean waitedHalf = false;
344 while (true) {
345 final ArrayList<HandlerChecker> blockedCheckers;
346 final String subject;
347 final boolean allowRestart;
348 int debuggerWasConnected = 0;
349 synchronized (this) {
350 long timeout = CHECK_INTERVAL;
351 // Make sure we (re)spin the checkers that have become idle within
352 // this wait-and-check interval
// Call scheduleCheckLocked() on every HandlerChecker: this records the start time and posts the check
353 for (int i=0; i<mHandlerCheckers.size(); i++) {
354 HandlerChecker hc = mHandlerCheckers.get(i);
355 hc.scheduleCheckLocked();
356 }
357
358 if (debuggerWasConnected > 0) {
359 debuggerWasConnected--;
360 }
361
362 // NOTE: We use uptimeMillis() here because we do not want to increment the time we
363 // wait while asleep. If the device is asleep then the thing that we are waiting
364 // to timeout on is asleep as well and won't have a chance to run, causing a false
365 // positive on when to kill things.
366 long start = SystemClock.uptimeMillis();
// Wait 30 seconds (CHECK_INTERVAL). uptimeMillis() does not count time the phone is asleep, because the system services are asleep too
367 while (timeout > 0) {
368 if (Debug.isDebuggerConnected()) {
369 debuggerWasConnected = 2;
370 }
371 try {
372 wait(timeout);
373 } catch (InterruptedException e) {
374 Log.wtf(TAG, e);
375 }
376 if (Debug.isDebuggerConnected()) {
377 debuggerWasConnected = 2;
378 }
379 timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
380 }
381 // Evaluate the checkers: iterate over all HandlerCheckers and return the worst (largest) state
382 final int waitState = evaluateCheckerCompletionLocked();
// COMPLETED: every check message was processed, so no thread is blocked
383 if (waitState == COMPLETED) {
384 // The monitors have returned; reset
385 waitedHalf = false;
386 continue;
// WAITING: the check has been pending for less than 30 seconds
387 } else if (waitState == WAITING) {
388 // still waiting but within their configured intervals; back off and recheck
389 continue;
// WAITED_HALF: the check has been pending for 30 to 60 seconds. The thread may be blocked, so dump the current stack traces
390 } else if (waitState == WAITED_HALF) {
391 if (!waitedHalf) {
392 // We've waited half the deadlock-detection interval. Pull a stack
393 // trace and wait another half.
394 ArrayList<Integer> pids = new ArrayList<Integer>();
395 pids.add(Process.myPid());
396 ActivityManagerService.dumpStackTraces(true, pids, null, null,
397 NATIVE_STACKS_OF_INTEREST);
398 waitedHalf = true;
399 }
400 continue;
401 }
402 // OVERDUE: the check has been pending for more than 60 seconds; everything below handles the timeout
403 // something is overdue!
404 blockedCheckers = getBlockedCheckersLocked();
405 subject = describeCheckersLocked(blockedCheckers);
406 allowRestart = mAllowRestart;
407 }
408
409 // If we got here, that means that the system is most likely hung.
410 // First collect stack traces from all threads of the system process.
411 // Then kill this process so that the system will restart.
412 EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
413-468 // ... saving of various records omitted ...
469 // Only kill the process if the debugger is not attached.
470 if (Debug.isDebuggerConnected()) {
471 debuggerWasConnected = 2;
472 }
473 if (debuggerWasConnected >= 2) {
474 Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
475 } else if (debuggerWasConnected > 0) {
476 Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
477 } else if (!allowRestart) {
478 Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
479 } else {
480 Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
481 for (int i=0; i<blockedCheckers.size(); i++) {
482 Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
483 StackTraceElement[] stackTrace
484 = blockedCheckers.get(i).getThread().getStackTrace();
485 for (StackTraceElement element: stackTrace) {
486 Slog.w(TAG, " at " + element);
487 }
488 }
489 Slog.w(TAG, "*** GOODBYE!");
490 Process.killProcess(Process.myPid());
491 System.exit(10);
492 }
493
494 waitedHalf = false;
495 }
496 }
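The branching in run() can be replayed without any real waiting. This sketch keeps only the decision logic: evaluate() takes the largest state, like evaluateCheckerCompletionLocked(); each array in `rounds` is one evaluation after a CHECK_INTERVAL; and the action strings are made up for illustration. The debugger counter and mAllowRestart are collapsed into one assumed boolean, and dumping and killing are only recorded, not performed.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the decision logic in Watchdog.run(), without waiting, dumping or killing.
public class RunLoopSketch {
    static final int COMPLETED = 0, WAITING = 1, WAITED_HALF = 2, OVERDUE = 3;

    // Like evaluateCheckerCompletionLocked(): the worst (largest) state wins.
    static int evaluate(int[] checkerStates) {
        int state = COMPLETED;
        for (int s : checkerStates) {
            state = Math.max(state, s);
        }
        return state;
    }

    // Replays one evaluation per round and records what the loop would do.
    static List<String> replay(int[][] rounds, boolean restartBlocked) {
        List<String> actions = new ArrayList<>();
        boolean waitedHalf = false;
        for (int[] checkerStates : rounds) {
            int waitState = evaluate(checkerStates);
            if (waitState == COMPLETED) {
                waitedHalf = false;              // all checks returned; reset
                actions.add("ok");
            } else if (waitState == WAITING) {
                actions.add("wait");             // still within the interval
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    actions.add("dump stacks");  // first time past half the timeout
                    waitedHalf = true;
                } else {
                    actions.add("wait");         // already dumped; wait for the rest
                }
            } else {
                // OVERDUE: kill system_server unless a debugger or policy blocks it
                actions.add(restartBlocked ? "log only" : "kill system_server");
                waitedHalf = false;
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        int[][] rounds = {
            {COMPLETED, COMPLETED},
            {COMPLETED, WAITED_HALF},
            {COMPLETED, OVERDUE},
        };
        System.out.println(replay(rounds, false));  // [ok, dump stacks, kill system_server]
    }
}
```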
Once execution reaches line 412, the Watchdog is preparing to restart the system. It does the following:
- Write an EventLog entry
- Dump the stack traces of system_server and three native processes, appending them to the existing traces
- Dump the kernel stack traces
- Have the kernel dump all blocked threads
- Save the report to DropBox
- Log "*** WATCHDOG KILLING SYSTEM PROCESS:" and kill system_server
Three: Summary
That is how the Android system-level Watchdog works. It is a well-thought-out design: if I had been designing it, I would never have thought of using a lock inside Monitor to detect deadlocks.
To summarize:
- The Watchdog is a thread that monitors whether system services are running normally and free of deadlocks
- HandlerChecker checks handlers and monitors
- Monitor detects deadlocks by acquiring a lock
- If a check has been pending for more than 30 seconds, stack traces are dumped; if it has been pending for more than 60 seconds, the system process is killed and restarted (unless a debugger is attached)
Anderson/Jerey_Jobs
Blog: jerey.cn/ Jianshu: Anderson Big code cinder GitHub: github.com/Jerey-Jobs