A bug was reported during testing today, and after a round of investigation it was solved. Some of the ideas, techniques, and knowledge points that came up along the way were interesting, so I am recording them here.
Problem location and analysis
This is an intermittent problem found while stress testing the machine's reset function. A service I am responsible for starts automatically at boot. Testing found that after one particular reset and reboot the feature did not work, so I was called over immediately to take a look.
- The failure scene was still intact when I arrived. Since this is a Service, nothing abnormal was visible in the UI. I connected to the machine with adb and checked the processes with the ps command; the service process was still running.
- Next, I checked the log and found no abnormal output and no trace of a crash or restart.
- After its child thread finishes some initialization, the service switches back to the main thread:
Log.d(TAG, "child thread finish");
mHandler.sendEmptyMessage(MSG_START_FUNCTION);
The log line from the child thread was there, and the very next line sends a Message to the Handler, but the log line from the main thread was missing.
This part of the code is simple enough that it is unlikely to contain a bug, unless something is wrong with the Handler mechanism itself. Our machine is still under development, so it is possible that a system engineer accidentally introduced some odd problem while debugging, but we should not start from that assumption; otherwise the system engineers would be just as puzzled and have no idea where to begin.
Handler messages are executed one at a time, so if one Message blocks, the messages after it cannot be processed. Because this is the main thread's Handler, the same symptom appears if the main thread is stuck.
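The effect described above can be reproduced off-device. The following is a minimal sketch in plain Java, not Android code: a single-threaded executor stands in for the main Looper, and the class name `BlockedQueueDemo` and method `runDemo` are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BlockedQueueDemo {

    /**
     * Queues a blocking task followed by a second task on a single-threaded
     * queue, observes for observeMillis, and returns what ran in that window.
     */
    static List<String> runDemo(long observeMillis) {
        // One thread draining a FIFO queue: an assumed stand-in for Looper + Handler.
        ExecutorService fakeMainLooper = Executors.newSingleThreadExecutor();
        List<String> handled = new CopyOnWriteArrayList<>();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch neverReleased = new CountDownLatch(1);

        // First "message": blocks forever, like a call stuck in Object.wait().
        fakeMainLooper.execute(() -> {
            handled.add("blocking message started");
            started.countDown();
            try {
                neverReleased.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Second "message": plays the role of MSG_START_FUNCTION.
        fakeMainLooper.execute(() -> handled.add("MSG_START_FUNCTION handled"));

        try {
            started.await();             // make sure the first task is running
            Thread.sleep(observeMillis); // give the second task a chance to run
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        List<String> snapshot = new ArrayList<>(handled);
        fakeMainLooper.shutdownNow();    // interrupt the blocked task so the JVM can exit
        return snapshot;
    }

    public static void main(String[] args) {
        System.out.println(runDemo(300));
    }
}
```

However long the observation window is, the second entry never shows up in the returned list, which matches the missing main-thread log line.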
However, if the main thread was stuck, more than ten minutes had passed with no ANR, and /data/anr/ was empty. We can still use the kill -3 <pid> command to force a trace file to be written, which shows the call stacks of all the process's current threads. Then we can look at what the main thread is doing:
"main" prio=5 tid=1 Waiting
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x7137cc28 self=0xe3f82a10
  | sysTid=1208 nice=0 cgrp=default sched=0/0 handle=0xf09e6470
  | state=S schedstat=( 456814323 745320630 635 ) utm=40 stm=5 core=0 HZ=100
  | stack=0xff1c8000-0xff1ca000 stackSize=8192KB
  | held mutexes=
  at java.lang.Object.wait(Native method)
  - waiting on <0x017f64da> (a java.lang.Object)
  at java.lang.Object.wait(Object.java:442)
  at java.lang.Object.wait(Object.java:568)
  at h.a.a.a.a.l.q.f(:4)
  - locked <0x017f64da> (a java.lang.Object)
  at h.a.a.a.a.l.q.e(:2)
  at d.d.a.d.f.d.b(:3)
  at d.d.a.d.f.a.run(lambda:-1)
  at android.os.Handler.handleCallback(Handler.java:938)
  at android.os.Handler.dispatchMessage(Handler.java:99)
  at android.os.Looper.loop(Looper.java:223)
  at android.app.ActivityThread.main(ActivityThread.java:7666)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:592)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:952)
The main thread is stuck in Object.wait.
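For readers who want to see what such a dump looks like without a device, here is a rough in-process analogue in plain Java. It is not what kill -3 does internally; it only uses the standard `Thread.getAllStackTraces()` call to print one thread's state and frames. The class `ThreadDumpSketch` and its two helper methods are hypothetical names for this sketch.

```java
import java.util.Map;

public class ThreadDumpSketch {

    /** Starts a daemon thread that parks in Object.wait() and returns once it is waiting. */
    static Thread startStuckThread(String name) {
        Object lock = new Object();
        Thread t = new Thread(() -> {
            synchronized (lock) {
                try {
                    lock.wait(); // nobody ever calls notify(), so this never returns
                } catch (InterruptedException ignored) {
                    // interrupted: just let the thread exit
                }
            }
        }, name);
        t.setDaemon(true);
        t.start();
        while (t.getState() != Thread.State.WAITING) {
            Thread.yield();
        }
        return t;
    }

    /** Returns a textual dump of the named thread's stack, or "" if no such thread exists. */
    static String dumpThread(String threadName) {
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            if (!t.getName().equals(threadName)) {
                continue;
            }
            StringBuilder sb = new StringBuilder();
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("  at ").append(frame).append('\n');
            }
            return sb.toString();
        }
        return "";
    }

    public static void main(String[] args) {
        startStuckThread("stuck-worker");
        System.out.print(dumpThread("stuck-worker"));
    }
}
```

The top frames of the printed stack are the Object.wait calls, the same shape as the "main" entry in the trace above.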
A search of our code showed no direct use of wait, but there is a third-party library that might use it in an operation like this one. Since this problem had never been reported before, it is presumably a low-probability event, so we need hard evidence before fixing it; otherwise regression testing the fix would be difficult.
The code is obfuscated. It could be restored with the mapping.txt file, but because this project's supporting infrastructure is immature and has no version-number mechanism yet, finding the matching build and its mapping.txt is difficult.
So I pulled the APK off the machine with adb pull and decompiled it with jadx to find h.a.a.a.a.l.q.f.
Some strings and the general code logic are visible there. Comparing them with the third-party library we suspected confirmed the guess, and from there I worked back through the whole stack step by step. It turned out that one of the library's wait methods was blocking the main thread. This is probably a bug in the third-party library, but luckily the method has an overload that accepts a timeout, so we passed a 3-second timeout and retested, which resolved the problem. Waiting on the main thread is also a bad habit, so the call can be moved to a child thread.
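The library's own API is not shown here, so the following is only a sketch of the general idea of a bounded wait, using the standard `Object.wait(long)` overload. The class `BoundedWaitDemo` and the methods `awaitReady` and `markReady` are hypothetical; the demo uses 100 ms rather than the 3 seconds we chose so that it finishes quickly.

```java
public class BoundedWaitDemo {
    private final Object lock = new Object();
    private boolean ready = false;

    /**
     * Waits until markReady() is called or timeoutMillis elapses.
     * Returns true if the condition became ready, false on timeout or interrupt.
     */
    boolean awaitReady(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (lock) {
            while (!ready) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    return false; // timed out; never call wait(0), which waits forever
                }
                try {
                    lock.wait(remainingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }

    /** Signals that the awaited condition now holds. */
    void markReady() {
        synchronized (lock) {
            ready = true;
            lock.notifyAll();
        }
    }

    public static void main(String[] args) {
        BoundedWaitDemo neverSignalled = new BoundedWaitDemo();
        System.out.println("ready without signal: " + neverSignalled.awaitReady(100));

        BoundedWaitDemo signalled = new BoundedWaitDemo();
        signalled.markReady();
        System.out.println("ready after signal: " + signalled.awaitReady(100));
    }
}
```

The loop around wait guards against spurious wakeups, and the caller gets an explicit result to act on instead of hanging indefinitely.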
ANR principle
Although the problem is solved, some interesting points are worth exploring. We all know that time-consuming work must not be done on the main thread, or an ANR will occur. Yet here the main thread was blocked for more than ten minutes. Even for a Service, an ANR should have appeared after 200s at most. Why was there no ANR?
After recovering the stack, I found that the wait was blocking inside Application.onCreate, which means that a delay in Application.onCreate does not cause an ANR.
Let’s review the four types of ANR:
1. KeyDispatchTimeout: an ANR occurs if an input event is not processed within 5s
2. ServiceTimeout: covers bind, create, start, unbind, and so on. An ANR occurs if a foreground Service does not finish within 20s or a background Service does not finish within 200s
3. BroadcastTimeout: an ANR occurs if a BroadcastReceiver does not finish handling a foreground broadcast within 10s or a background broadcast within 60s
4. ProcessContentProviderPublishTimedOutLocked: an ANR occurs if a ContentProvider is not published within 10s
The ANR timers for Service, Broadcast, and ContentProvider do not count time spent in Application.onCreate, because that callback is part of process initialization and does not belong to any of the four components.
Another point is that the Activity life cycle has no ANR of its own, which means blocking in an Activity's onCreate or onStart does not by itself cause an ANR.
An Activity's ANR comes from input events, such as key presses and touches, taking too long to process.
Time bomb mechanism
ServiceTimeout, BroadcastTimeout, and ProcessContentProviderPublishTimedOutLocked all work on a similar principle:
- Before processing starts, Handler.sendMessageDelayed is used to send a delayed ANR message
- After processing completes, Handler.removeMessages is used to remove the ANR message
This can be likened to a time bomb. The bomb is planted before processing starts, and unless processing finishes and the bomb is removed within the specified time, it explodes.
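The plant-then-defuse pattern can be sketched in plain Java. This is an analogy, not the AMS implementation: a `ScheduledExecutorService` stands in for the Handler, `schedule` for sendMessageDelayed, and `cancel` for removeMessages. The class `TimeBombDemo` and the method `runWithTimeBomb` are hypothetical names.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class TimeBombDemo {

    /**
     * Simulates work of workMillis under a timeout of timeoutMillis.
     * Returns true if the timeout fired before the work finished.
     */
    static boolean runWithTimeBomb(long timeoutMillis, long workMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicBoolean exploded = new AtomicBoolean(false);
        try {
            // Plant the bomb (plays the role of sendMessageDelayed).
            ScheduledFuture<?> bomb = timer.schedule(
                    () -> exploded.set(true), timeoutMillis, TimeUnit.MILLISECONDS);

            // Do the work (plays the role of the component's callback).
            try {
                Thread.sleep(workMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }

            // Defuse the bomb (plays the role of removeMessages); no effect if it already fired.
            bomb.cancel(false);
            return exploded.get();
        } finally {
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("fast work timed out: " + runWithTimeBomb(500, 10));
        System.out.println("slow work timed out: " + runWithTimeBomb(50, 400));
    }
}
```

One simplification to note: here the result is only checked after the work returns, whereas the real timeout message is handled on another thread while the app is still stuck.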
Let's take just the Service as an example. Before Service.onCreate is called, AMS uses sendMessageDelayed to send a SERVICE_TIMEOUT_MSG Message:
// AMS starts the service
private final void realStartServiceLocked(ServiceRecord r, ProcessRecord app, boolean execInFg) throws RemoteException {
    ...
    // SERVICE_TIMEOUT_MSG is sent inside bumpServiceExecutingLocked
    bumpServiceExecutingLocked(r, execInFg, "create");
    ...
    // Call Service.onCreate asynchronously
    app.thread.scheduleCreateService(r, r.serviceInfo,
            mAm.compatibilityInfoForPackage(r.serviceInfo.applicationInfo),
            app.getReportedProcState());
    ...
}

// The code below traces how bumpServiceExecutingLocked sends SERVICE_TIMEOUT_MSG
private final void bumpServiceExecutingLocked(ServiceRecord r, boolean fg, String why) {
    ...
    scheduleServiceTimeoutLocked(r.app);
    ...
}

void scheduleServiceTimeoutLocked(ProcessRecord proc) {
    if (proc.executingServices.size() == 0 || proc.thread == null) {
        return;
    }
    Message msg = mAm.mHandler.obtainMessage(
            ActivityManagerService.SERVICE_TIMEOUT_MSG);
    msg.obj = proc;
    mAm.mHandler.sendMessageDelayed(msg,
            proc.execServicesFg ? SERVICE_TIMEOUT : SERVICE_BACKGROUND_TIMEOUT);
}
ActivityThread notifies AMS when the Service.onCreate call completes:
private void handleCreateService(CreateServiceData data) {
    ...
    // Create the service
    service = packageInfo.getAppFactory().instantiateService(cl, data.info.name, data.intent);
    // Call the onCreate lifecycle callback
    service.onCreate();
    ...
    // Tell AMS that Service.onCreate has been called
    ActivityManager.getService().serviceDoneExecuting(
            data.token, SERVICE_DONE_EXECUTING_ANON, 0, 0);
    ...
}
AMS then defuses the bomb inside serviceDoneExecutingLocked:
private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying, boolean finishing) {
    ...
    // Remove SERVICE_TIMEOUT_MSG
    mAm.mHandler.removeMessages(ActivityManagerService.SERVICE_TIMEOUT_MSG, r.app);
    ...
}
To put it another way: AMS plants a time bomb in your house and orders you to finish a task within a time limit and then tell it to stop the timer, or it will blow up your house (ANR).
KeyDispatchTimeout principle
The ANR of an Activity is not implemented by planting a time bomb as described above; it has a different set of logic.
As mentioned earlier, the Activity's life cycle does not trigger an ANR; its ANR is actually generated during input event processing, for example when handling a key or touch message takes too long.
I have covered the underlying dispatch logic for input events in an earlier blog post, for those who are interested. This article adds the ANR detection principle for input event dispatch.
In InputDispatcher.cpp in the native layer, each input event wakes up the dispatcher thread to dispatch it. Take the key message as an example:
void InputDispatcher::dispatchOnce() {
    ...
    dispatchOnceInnerLocked(&nextWakeupTime);
    ...
}

void InputDispatcher::dispatchOnceInnerLocked(nsecs_t* nextWakeupTime) {
    ...
    // Ready to start a new event.
    // If we don't already have a pending event, go grab one.
    if (!mPendingEvent) {
        ...
        resetANRTimeoutsLocked();
    }
    ...
    switch (mPendingEvent->type) {
    ...
    case EventEntry::TYPE_KEY: {
        ...
        done = dispatchKeyLocked(currentTime, typedEntry, &dropReason, nextWakeupTime);
        ...
    }
    ...
    }
    ...
}
Suppose our application receives its first input event, KEY_DOWN. In dispatchOnceInnerLocked, if this is a new event, resetANRTimeoutsLocked is called to clear the ANR flag, and then dispatchKeyLocked is called.
The most important step in resetANRTimeoutsLocked is to set mInputTargetWaitCause to INPUT_TARGET_WAIT_CAUSE_NONE:
void InputDispatcher::resetANRTimeoutsLocked() {
    ...
    mInputTargetWaitCause = INPUT_TARGET_WAIT_CAUSE_NONE;
    ...
}
Back in dispatchKeyLocked, the currently focused window is looked up and the key message is dispatched to it:
bool InputDispatcher::dispatchKeyLocked(nsecs_t currentTime, KeyEntry* entry,
        DropReason* dropReason, nsecs_t* nextWakeupTime) {
    ...
    int32_t injectionResult = findFocusedWindowTargetsLocked(currentTime,
            entry, inputTargets, nextWakeupTime);
    ...
}

int32_t InputDispatcher::findFocusedWindowTargetsLocked(nsecs_t currentTime,
        const EventEntry* entry, Vector<InputTarget>& inputTargets, nsecs_t* nextWakeupTime) {
    ...
    reason = checkWindowReadyForMoreInputLocked(currentTime,
            mFocusedWindowHandle, entry, "focused");
    if (!reason.isEmpty()) {
        injectionResult = handleTargetsNotReadyLocked(currentTime, entry,
                mFocusedApplicationHandle, mFocusedWindowHandle, nextWakeupTime, reason.string());
        ...
    }
    ...
}
Because this is a new event, the window has no message in progress. The reason returned by checkWindowReadyForMoreInputLocked is empty, so handleTargetsNotReadyLocked is not entered and the event is dispatched to the window normally.
If the application hangs while handling KEY_DOWN, then when the user lifts the finger and triggers the KEY_UP event, mPendingEvent is not NULL, so the ANR flag is not cleared. The reason returned by checkWindowReadyForMoreInputLocked is also not empty, so handleTargetsNotReadyLocked is entered:
int32_t InputDispatcher::handleTargetsNotReadyLocked(nsecs_t currentTime,
        const EventEntry* entry,
        const sp<InputApplicationHandle>& applicationHandle,
        const sp<InputWindowHandle>& windowHandle,
        nsecs_t* nextWakeupTime, const char* reason) {
    ...
    if (mInputTargetWaitCause != INPUT_TARGET_WAIT_CAUSE_APPLICATION_NOT_READY) {
        ...
        mInputTargetWaitCause = INPUT_TARGET_WAIT_CAUSE_APPLICATION_NOT_READY;
        ...
        mInputTargetWaitTimeoutTime = currentTime + timeout;
        ...
    }
    ...
    if (currentTime >= mInputTargetWaitTimeoutTime) {
        onANRLocked(currentTime, applicationHandle, windowHandle,
                entry->eventTime, mInputTargetWaitStartTime, reason);
        ...
    } else {
        *nextWakeupTime = mInputTargetWaitTimeoutTime;
        ...
    }
}
In this method, mInputTargetWaitCause is not INPUT_TARGET_WAIT_CAUSE_APPLICATION_NOT_READY (because resetANRTimeoutsLocked set it to INPUT_TARGET_WAIT_CAUSE_NONE when KEY_DOWN was dispatched), so the if branch is entered and sets mInputTargetWaitTimeoutTime and mInputTargetWaitCause.
The later check currentTime >= mInputTargetWaitTimeoutTime fails at this point, because mInputTargetWaitTimeoutTime has only just been set. Instead the else branch sets nextWakeupTime, and the thread goes to sleep. This means that handling of KEY_UP is deferred until the timeout.
By the time the thread wakes up, mInputTargetWaitCause is already INPUT_TARGET_WAIT_CAUSE_APPLICATION_NOT_READY, so the if branch is skipped and the timeout is left unchanged. This time the check currentTime >= mInputTargetWaitTimeoutTime succeeds, and onANRLocked is called to trigger the application's ANR.
Put simply, when the KEY_UP event arrives and the previous event has not finished processing, the dispatcher waits 5s and checks again. If the previous event is still unfinished by then, an ANR is triggered.
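The two-pass check walked through above can be condensed into a small state machine. This is a simplified Java model of the logic, not the native implementation: the class `InputAnrSketch`, its method names, and the injected millisecond timestamps are all assumptions made for illustration, with the 5s timeout taken from the text.

```java
public class InputAnrSketch {
    enum WaitCause { NONE, APPLICATION_NOT_READY }

    static final long TIMEOUT_MS = 5_000;

    private WaitCause waitCause = WaitCause.NONE;
    private long waitTimeoutTimeMs;

    /** Models resetANRTimeoutsLocked: called when the dispatcher picks up a fresh event. */
    void resetAnrTimeouts() {
        waitCause = WaitCause.NONE;
    }

    /**
     * Models handleTargetsNotReadyLocked: called when the window is still busy
     * with an earlier event. Returns true if an ANR should be raised now.
     */
    boolean onTargetNotReady(long currentTimeMs) {
        if (waitCause != WaitCause.APPLICATION_NOT_READY) {
            // First pass: arm the deadline, do not raise an ANR yet.
            waitCause = WaitCause.APPLICATION_NOT_READY;
            waitTimeoutTimeMs = currentTimeMs + TIMEOUT_MS;
        }
        // Later passes: the deadline is left alone and compared with the clock.
        return currentTimeMs >= waitTimeoutTimeMs;
    }

    public static void main(String[] args) {
        InputAnrSketch dispatcher = new InputAnrSketch();
        dispatcher.resetAnrTimeouts();                              // KEY_DOWN picked up
        System.out.println(dispatcher.onTargetNotReady(1_000));    // KEY_UP arrives, window busy
        System.out.println(dispatcher.onTargetNotReady(6_000));    // dispatcher wakes at the deadline
    }
}
```

Passing the clock in as a parameter keeps the model deterministic: the first call arms a deadline of 6000 and reports no ANR, and the second call at 6000 reports one.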
One consequence of this mechanism is that if the main thread is stuck while handling KEY_UP and no further input events arrive, the interface simply stops updating. Even though the main thread is stuck, no ANR is reported however long it lasts. If an input event (such as a touch or key press) is triggered at this point, an ANR appears 5 seconds later.
Similarly, this mechanism is like a short-tempered terrorist going to the FocusWindow to make a confession. If the priest is busy with someone else, he comes back later, and if the priest is still unavailable then, he detonates the bomb (ANR).
Reflections
As we get older, the brain becomes like a hard drive that has been running for years, filled with all kinds of useful and useless things. Loading speed and retrieval hit rate keep getting worse. I had read about how ANR works before, yet when I saw this problem my first reaction was that the main thread could not be stuck, because otherwise there would have been an ANR. So beyond rote memorization of stock interview knowledge, I think we should put more weight on debugging skills and problem-solving ability, which are the core competitive strengths of an older programmer.