This analysis is based on Android S (12).
When an ANR occurs in an app or the system watchdog is triggered, the system generates a trace file that records the call stack of each thread, along with status information for some processes and threads. This file is usually stored in the /data/anr directory and is not accessible to app developers. Since Android R (11), however, an app can read the contents of the file through the getHistoricalProcessExitReasons interface of AMS. Here’s what a typical trace file looks like.
----- pid 8331 at 2021-11-26 09:10:03 -----
Cmd line: com.hangl.test
Build fingerprint: xxx
ABI: 'arm64'
Build type: optimized
Zygote loaded classes=9118 post zygote classes=475
Dumping registered class loaders
#0 dalvik.system.PathClassLoader: [], parent #1
#1 java.lang.BootClassLoader: [], no parent
...
Suspend all histogram: Sum: 161us 99% C.I. 2us-60us Avg: 16.100us Max: 60us
DALVIK THREADS (14):
suspend all histogram: Sum: 161us 99% C.I. 2us-60us Avg: 16.100us
"Signal Catcher" daemon prio=5 tid=7 Runnable
  | group="system" sCount=0 dsCount=0 flags=0 obj=0x14dc0298 self=0x7c4c962c00
  ...
"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x7263ee78 self=0x7c4c7dcc00
  | sysTid=8331 nice=-10 cgrp=default sched=0/0 handle=0x7c4dd45ed0
  | state=S schedstat=( 387029514 32429484 166 ) utm=28 stm=10 core=6 HZ=100
  | stack=0x7feacb5000-0x7feacb7000 stackSize=8192KB
  | held mutexes=
  native: #00 pc 00000000000d0f48 /apex/com.android.runtime/lib64/bionic/libc.so (__epoll_pwait+8)
  native: #01 pc 00000000000180bc /system/lib64/libutils.so (android::Looper::pollInner(int)+144)
  native: #02 pc 0000000000017f8c /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+56)
  native: #03 pc 000000000013b920 /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44)
  at android.os.MessageQueue.nativePollOnce(Native method)
  at android.os.MessageQueue.next(MessageQueue.java:336)
  at android.os.Looper.loop(Looper.java:174)
  at android.app.ActivityThread.main(ActivityThread.java:7397)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:492)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:935)
"Jit thread pool worker thread 0" daemon prio=5 tid=2 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x14dc0220 self=0x7bb9a05000
  ...
This article does not discuss the types of ANR triggers, nor does it walk through the order in which each part of the file is generated; many articles already cover these topics, and some of them are excellent. Instead, this article focuses on how the call stacks are generated, which helps us better understand the information in a trace.
Preface
For both ANR and Watchdog, the trace is generated inside the target process. Taking ANR as an example, the ANR is detected in system_server (AMS), while the trace is generated in the app. So how does system_server get the app to start this work? The answer is to send it SIGQUIT (signal 3). This approach is used because cross-process information collection is typically done with ptrace, which requires either special privileges or a parent-child relationship between the processes, and is therefore less convenient than in-process collection.
Therefore, the first step in the analysis is to see how signal 3 is processed in the process.
1. Signal Catcher thread
The “Signal Catcher” thread exists in every Java process. Under normal conditions it blocks, waiting for signal 3 (and signal 10). When the process receives signal 3, the signal is handed to the “Signal Catcher” thread, which handles it in HandleSigQuit.
void SignalCatcher::HandleSigQuit() {
  Runtime* runtime = Runtime::Current();
  std::ostringstream os;
  os << "\n"
     << "----- pid " << getpid() << " at " << GetIsoDate() << " -----\n";

  DumpCmdLine(os);

  // Note: The strings "Build fingerprint:" and "ABI:" are chosen to match the format used by
  // debuggerd. This allows, for example, the stack tool to work.
  std::string fingerprint = runtime->GetFingerprint();
  os << "Build fingerprint: '" << (fingerprint.empty() ? "unknown" : fingerprint) << "'\n";
  os << "ABI: '" << GetInstructionSetString(runtime->GetInstructionSet()) << "'\n";

  os << "Build type: " << (kIsDebugBuild ? "debug" : "optimized") << "\n";

  runtime->DumpForSigQuit(os);

  if ((false)) {
    std::string maps;
    if (android::base::ReadFileToString("/proc/self/maps", &maps)) {
      os << "/proc/self/maps:\n" << maps;
    }
  }
  os << "----- end " << getpid() << " -----\n";
  Output(os.str());
}
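How a thread can sit waiting for signal 3 and then handle it synchronously can be sketched with plain POSIX calls. This is not ART's code: the helper names BlockSigQuit and WaitForSigQuit are hypothetical, and the sketch only shows the general pattern of blocking the signal and then consuming it with sigwait.

```cpp
#include <cassert>
#include <signal.h>
#include <unistd.h>

// Block SIGQUIT for the calling thread. Threads created afterwards inherit
// this mask, so the signal stays pending until some thread calls sigwait().
inline void BlockSigQuit() {
  sigset_t mask;
  sigemptyset(&mask);
  sigaddset(&mask, SIGQUIT);
  int rc = pthread_sigmask(SIG_BLOCK, &mask, nullptr);
  assert(rc == 0);
  (void)rc;
}

// Hypothetical stand-in for one iteration of a catcher loop: sleep until
// SIGQUIT is pending, consume it, and return the signal number.
inline int WaitForSigQuit() {
  sigset_t mask;
  sigemptyset(&mask);
  sigaddset(&mask, SIGQUIT);
  int signal_number = 0;
  int rc = sigwait(&mask, &signal_number);
  assert(rc == 0);
  (void)rc;
  // A real catcher thread would now dispatch to its dump routine.
  return signal_number;
}
```

Because the signal is blocked rather than handled asynchronously, the dump logic runs as ordinary thread code and is not limited to async-signal-safe functions.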
The intermediate calls are omitted here; we go straight to the part we care about, call stack collection. The ThreadList::Dump function collects call stack information for all threads.
void ThreadList::Dump(std::ostream& os, bool dump_native_stack) {
  Thread* self = Thread::Current();
  {
    MutexLock mu(self, *Locks::thread_list_lock_);
    os << "DALVIK THREADS (" << list_.size() << "):\n";
  }
  if (self != nullptr) {
    DumpCheckpoint checkpoint(&os, dump_native_stack);
    size_t threads_running_checkpoint;
    {
      // Use SOA to prevent deadlocks if multiple threads are calling Dump() at the same time.
      ScopedObjectAccess soa(self);
      threads_running_checkpoint = RunCheckpoint(&checkpoint);
    }
    if (threads_running_checkpoint != 0) {
      checkpoint.WaitForThreadsToRunThroughCheckpoint(threads_running_checkpoint);
    }
  } else {
    DumpUnattachedThreads(os, dump_native_stack);
  }
}
The key step is the call to RunCheckpoint. It splits information collection into one task per thread: **if the thread is in the Runnable state (running Java code), the task is handed to the thread itself; if the thread is in any other state, the “Signal Catcher” thread performs the task on its behalf.** Keep this in mind, as sections 2 and 3 below analyze these two cases.
2. The Checkpoint mechanism
Handing a task to a thread in the Runnable state relies on the checkpoint mechanism, which has two parts:
- The “Signal Catcher” thread calls RequestCheckpoint to modify the internal data of the target thread’s art::Thread object, specifically the following two fields.
tls32_.state_and_flags.as_struct.flags |= kCheckpointRequest;
tlsPtr_.checkpoint_function = function;
(tls32_ and tlsPtr_ are both internal data of the art::Thread object.)
- In the ART virtual machine, the target thread checks the state_and_flags field at the entry of each method and at each loop back-edge, and runs the checkpoint function if the checkpoint flag is set. This placement ensures that the thread handles checkpoint tasks “in time”: code that only moves forward (straight-line code, including forward conditional branches) always finishes in a bounded amount of time, so the only code that can run for a long time is a loop or a method call. Inserting checks at those two places is enough to guarantee timeliness. (See R’s answer.)
Here is a concrete example of how the target thread checks for a checkpoint.
Bytecode can be interpreted by the ART virtual machine or compiled into machine code and executed directly. When a method is compiled to machine code (as shown below), there is a check of state_and_flags at the function entry. If any flag bit is set, pTestSuspend is executed.
CODE: (code_offset=0x003f9ae0 size=788)...
  0x003f9ae0: d1400bf0 sub x16, sp, #0x2000 (8192)
  0x003f9ae4: b940021f ldr wzr, [x16]
    StackMap[0] (native_pc=0x3f9ae8, dex_pc=0x0, register_mask=0x0, stack_mask=0b)
  0x003f9ae8: f8180fe0 str x0, [sp, #-128]!
  0x003f9aec: a9035bf5 stp x21, x22, [sp, #48]
  0x003f9af0: a90463f7 stp x23, x24, [sp, #64]
  0x003f9af4: a9056bf9 stp x25, x26, [sp, #80]
  0x003f9af8: a90673fb stp x27, x28, [sp, #96]
  0x003f9afc: a9077bfd stp x29, lr, [sp, #112]
  0x003f9b00: b9008fe2 str w2, [sp, #140]
  0x003f9b04: 79400270 ldrh w16, [tr] ; state_and_flags
  0x003f9b08: 350016f0 cbnz w16, #+0x2dc (addr 0x3f9de4)   // If state_and_flags is not 0, jump to 0x3f9de4
  ...
  0x003f9de4: 940e62c3 bl #+0x398b0c (addr 0x7928f0) ; pTestSuspend   // Jump to pTestSuspend
After several jumps, pTestSuspend ends up calling the Thread::CheckSuspend function. When the checkpoint is set, the corresponding checkpoint function (RunCheckpointFunction) is executed.
inline void Thread::CheckSuspend() {
  DCHECK_EQ(Thread::Current(), this);
  for (;;) {
    if (ReadFlag(kCheckpointRequest)) {
      RunCheckpointFunction();
    } else if (ReadFlag(kSuspendRequest)) {
      FullSuspendCheck();
    } else if (ReadFlag(kEmptyCheckpointRequest)) {
      RunEmptyCheckpoint();
    } else {
      break;
    }
  }
}
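The two halves of the checkpoint mechanism, a requester that sets a flag plus a function and a target that polls the flag at safe points, can be sketched in a few lines. This is a simplified model and not ART's implementation: ToyThread, RunLoop and the flag value are assumptions made for illustration, and only the names RequestCheckpoint, CheckSuspend and kCheckpointRequest are borrowed from the text above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Flag bit named after ART's kCheckpointRequest; the value is arbitrary here.
constexpr uint32_t kCheckpointRequest = 1u << 0;

// Hypothetical, heavily reduced stand-in for art::Thread.
struct ToyThread {
  std::atomic<uint32_t> flags{0};
  std::function<void(ToyThread&)> checkpoint_function;
  int iterations_done = 0;
};

// Requester side: store the function first, then publish the flag.
inline void RequestCheckpoint(ToyThread& target,
                              std::function<void(ToyThread&)> function) {
  target.checkpoint_function = std::move(function);
  target.flags.fetch_or(kCheckpointRequest, std::memory_order_release);
}

// Target side: called at method entry and on every loop back-edge.
inline void CheckSuspend(ToyThread& self) {
  if (self.flags.load(std::memory_order_acquire) & kCheckpointRequest) {
    // Clear the request, then run the function on the target's own thread.
    self.flags.fetch_and(~kCheckpointRequest, std::memory_order_acq_rel);
    self.checkpoint_function(self);
  }
}

// A stand-in for compiled managed code: a loop that polls once per iteration.
inline void RunLoop(ToyThread& self, int iterations) {
  for (int i = 0; i < iterations; ++i) {
    CheckSuspend(self);
    ++self.iterations_done;
  }
}
```

The target never blocks waiting for a request; it only pays for a flag load at each safe point, which is why the check can be placed at every method entry and loop back-edge.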
Below is an example of a Runnable thread collecting its own call stack. Line 2292 is the first line of the writeNoException method, which is consistent with the description above of checkpoints being placed at the start of each method.
"Binder:2278_C" prio=5 tid=97 Runnable
  | group="main" sCount=0 ucsCount=0 flags=0 obj=0x16104b20 self=0xb400007117c7afb0
  | sysTid=2890 nice=0 cgrp=foreground sched=0/0 handle=0x6eafe24cb0
  | state=R schedstat=( 47445156223 266433061959 175792 ) utm=1623 stm=3121 core=4 HZ=100
  | stack=0x6eafd2d000-0x6eafd2f000 stackSize=991KB
  | held mutexes= "mutator lock"(shared held)
  at android.os.Parcel.writeNoException(Parcel.java:2292)
  at android.os.IPowerManager$Stub.onTransact(IPowerManager.java:474)
  at android.os.Binder.execTransactInternal(Binder.java:1184)
  at android.os.Binder.execTransact(Binder.java:1143)
2291 public final void writeNoException() {
2292     AppOpsManager.prefixParcelWithAppOpsIfNeeded(this);
3. Suspend flag
For threads in a non-Runnable state, collection is performed by “Signal Catcher”. The work it does on behalf of a single thread can be summarized in four steps.
thread->ModifySuspendCount(self, +1, nullptr, SuspendReason::kInternal);
checkpoint_function->Run(thread);
thread->ModifySuspendCount(self, -1, nullptr, SuspendReason::kInternal);
Thread::resume_cond_->Broadcast(self);
- Increment the target thread’s suspend count (+1) and set the suspend flag on it.
- Run the checkpoint function to collect information on behalf of the target thread.
- Decrement the target thread’s suspend count (-1), and clear the suspend flag if the count drops to 0.
- Call Broadcast on the resume_cond_ condition variable, which wakes up all threads waiting on it.
Listing the steps is the easy part. The harder part is understanding why the process is designed this way. Here’s a breakdown.
- Why is it necessary to set the suspend flag on the target thread before collecting information?
Before answering, we need some background. Each Java thread is essentially a pthread, which in turn corresponds to a task_struct object in the kernel, the basic unit of CPU scheduling. From the kernel’s point of view, a thread can be in the R, S, or D state, among others; their meanings are listed first below. The virtual machine records a separate set of states for each Java thread, reflecting the state from the virtual machine’s perspective; these are listed in the enum that follows.
R    running or runnable (on run queue)
D    uninterruptible sleep (usually IO)
S    interruptible sleep (waiting for an event to complete)
enum ThreadState {
  //                                  Java
  //                                  Thread.State   JDWP state
  kTerminated = 66,                // TERMINATED     TS_ZOMBIE    Thread.run has returned, but Thread* still around
  kRunnable,                       // RUNNABLE       TS_RUNNING   runnable
  kTimedWaiting,                   // TIMED_WAITING  TS_WAIT      in Object.wait() with a timeout
  kSleeping,                       // TIMED_WAITING  TS_SLEEPING  in Thread.sleep()
  kBlocked,                        // BLOCKED        TS_MONITOR   blocked on a monitor
  kWaiting,                        // WAITING        TS_WAIT      in Object.wait()
  kWaitingForLockInflation,        // WAITING        TS_WAIT      blocked inflating a thin-lock
  kWaitingForTaskProcessor,        // WAITING        TS_WAIT      blocked waiting for taskProcessor
  kWaitingForGcToComplete,         // WAITING        TS_WAIT      blocked waiting for GC
  kWaitingForCheckPointsToRun,     // WAITING        TS_WAIT      GC waiting for checkpoints to run
  kWaitingPerformingGc,            // WAITING        TS_WAIT      performing GC
  kWaitingForDebuggerSend,         // WAITING        TS_WAIT      blocked waiting for events to be sent
  kWaitingForDebuggerToAttach,     // WAITING        TS_WAIT      blocked waiting for debugger to attach
  kWaitingInMainDebuggerLoop,      // WAITING        TS_WAIT      blocking/reading/processing debugger events
  kWaitingForDebuggerSuspension,   // WAITING        TS_WAIT      waiting for debugger suspend all
  kWaitingForJniOnLoad,            // WAITING        TS_WAIT      waiting for execution of dlopen and JNI on load code
  kWaitingForSignalCatcherOutput,  // WAITING        TS_WAIT      waiting for signal catcher IO to complete
  kWaitingInMainSignalCatcherLoop, // WAITING        TS_WAIT      blocking/reading/processing signals
  kWaitingForDeoptimization,       // WAITING        TS_WAIT      waiting for deoptimization suspend all
  kWaitingForMethodTracingStart,   // WAITING        TS_WAIT      waiting for method tracing to start
  kWaitingForVisitObjects,         // WAITING        TS_WAIT      waiting for visiting objects
  kWaitingForGetObjectsAllocated,  // WAITING        TS_WAIT      waiting for getting the number of allocated objects
  kWaitingWeakGcRootRead,          // WAITING        TS_WAIT      waiting on the GC to read a weak root
  kWaitingForGcThreadFlip,         // WAITING        TS_WAIT      waiting on the GC thread flip (CC collector) to finish
  kNativeForAbort,                 // WAITING        TS_WAIT      checking other threads are not run on abort.
  kStarting,                       // NEW            TS_WAIT      native thread started, not yet ready to run managed code
  kNative,                         // RUNNABLE       TS_RUNNING   running in a JNI native method
  kSuspended,                      // RUNNABLE       TS_RUNNING   suspended by GC or debugger
};
A thread in the R state is logically running (it may not be on a CPU right now because of scheduling, but it will get CPU time), and it may be executing kernel, native, or Java code. The state recorded in the virtual machine is Runnable only when the thread is running in the Java layer.
If the target thread is in a non-Runnable state, it is not in the Java layer. But not being in the Java layer does not mean it is not running. While “Signal Catcher” is collecting on its behalf, the target thread may return to the Java layer at any time, either because it has finished its work in the native layer or because it calls a Java method. Once it returns to the Java layer, the shape of the Java call stack changes. This creates a race between “Signal Catcher” and the target thread over the shape of the call stack.
So we need a way to resolve this race.
Every return to the Java layer requires a thread state switch, that is, a call to TransitionFromSuspendedToRunnable. This function checks the suspend flag, and if it is set, the target thread waits on the resume_cond_ condition variable. Setting the suspend flag therefore guarantees that the target thread cannot return to the Java layer and cannot change the shape of the Java call stack. (It is worth noting that some comments on the web claim that the suspend flag is set in order to suspend the thread, which is imprecise. For a thread that does not try to return to the Java layer, setting the suspend flag has no effect at all.)
- Why call Broadcast on the resume_cond_ condition variable after information collection?
Because some threads that were about to return to the Java layer are now waiting on the resume_cond_ condition variable (in the S state), they must be woken up once collection is finished.
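The interaction between the suspend count and resume_cond_ described in this section can be sketched with standard C++ primitives. This is a simplified model and not ART's code: ToyThread, TransitionToRunnable, DumpOnBehalfOf and the events log are assumptions made for illustration, and a std::condition_variable stands in for resume_cond_.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical, heavily reduced stand-in for art::Thread.
struct ToyThread {
  std::mutex lock;
  std::condition_variable resume_cond;  // plays the role of resume_cond_
  int suspend_count = 0;                // guarded by lock
  std::vector<std::string> events;      // guarded by lock
};

inline void ModifySuspendCount(ToyThread& target, int delta) {
  std::lock_guard<std::mutex> guard(target.lock);
  target.suspend_count += delta;
  assert(target.suspend_count >= 0);
}

// Target side: called when the thread wants to re-enter managed code.
// It only blocks if a suspend request is pending at that moment.
inline void TransitionToRunnable(ToyThread& self) {
  std::unique_lock<std::mutex> guard(self.lock);
  self.resume_cond.wait(guard, [&self] { return self.suspend_count == 0; });
  self.events.push_back("runnable");
}

// Requester side: the four steps listed in this section.
inline void DumpOnBehalfOf(ToyThread& target,
                           const std::function<void()>& collect) {
  ModifySuspendCount(target, +1);   // 1. raise the suspend count
  collect();                        // 2. collect on the target's behalf
  {
    std::lock_guard<std::mutex> guard(target.lock);
    target.events.push_back("dumped");
  }
  ModifySuspendCount(target, -1);   // 3. drop the suspend count
  target.resume_cond.notify_all();  // 4. wake any waiting thread
}
```

A target thread that calls TransitionToRunnable while collect() is running is held back until step 4, so "dumped" is always recorded before "runnable". A target that never calls TransitionToRunnable is not affected by the suspend count at all.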
With all this analysis in hand, let’s look at a real case. Native frame #02 (GoToRunnable) shows that the main thread has finished its work in the native layer and wants to return to the Java layer. TransitionFromSuspendedToRunnable does not appear in the stack because it has been inlined into GoToRunnable. Frame #01 (WaitHoldingLocks) is the wait on the resume_cond_ condition variable.
"main" prio=5 tid=1 Native
  | group="main" sCount=1 ucsCount=0 flags=1 obj=0x71a33c18 self=0xb400006f417a1380
  | sysTid=14756 nice=-10 cgrp=top-app sched=0/0 handle=0x71027344f8
  | state=S schedstat=( 603683604122 79803215759 1916541 ) utm=43513 stm=16854 core=6 HZ=100
  | stack=0x7fe8361000-0x7fe8363000 stackSize=8188KB
  | held mutexes=
  native: #00 pc 000000000004dff0 /apex/com.android.runtime/lib64/bionic/libc.so (syscall+32)
  native: #01 pc 000000000028dc74 /apex/com.android.art/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+152)
  native: #02 pc 000000000074c4ec /apex/com.android.art/lib64/libart.so (art::GoToRunnable(art::Thread*)+412)
  native: #03 pc 000000000074c318 /apex/com.android.art/lib64/libart.so (art::JniMethodEnd(unsigned int, art::Thread*)+28)
  at android.os.BinderProxy.transactNative(Native method)
  at android.os.BinderProxy.transact(BinderProxy.java:571)
  at com.android.internal.telephony.ISub$Stub$Proxy.getAvailableSubscriptionInfoList(ISub.java:1543)
  at android.telephony.SubscriptionManager.getAvailableSubscriptionInfoList(SubscriptionManager.java:1640)
Note, however, that this trace only appears under specific timing: the native function must finish while collection is in progress. If the function running in the native layer has not finished, the thread has no need to return to the Java layer and GoToRunnable is never called.
Therefore, when the main thread shows the call stack above at the time of an ANR, do not assume that GoToRunnable is responsible for the ANR. It only indicates that the thread tried to return to the Java layer while the dump was in progress; what actually causes the ANR may be the overall time spent handling a message.
4. Java call stack collection
(This section is sketchy and can be skipped if you are not interested.)
The Java call stack is collected by calling StackDumpVisitor::WalkStack. The internals of this function are quite complicated, and a full understanding requires background on ArtMethod, DexFile and related topics. This article does not attempt a complete introduction, only a general summary.
Each machine code instruction has an address, which appears as the PC value at run time. Similarly, each Dex bytecode instruction has an offset, called dex_pc (the dex_pc values of each method start from 0). For example, 0x0003 and 0x0008 in the listing below are dex_pc values.
DEX CODE:
0x0000: 7010 5350 0100 | invoke-direct {v1}, void android.media.IPlayer$Stub.<init>() // method@20563
0x0003: 2200 791f | new-instance v0, java.lang.ref.WeakReference // type@TypeIndex[8057]
0x0005: 7020 84fa 2000 | invoke-direct {v0, v2}, void java.lang.ref.WeakReference.<init>(java.lang.Object) // method@64132
0x0008: 5b10 582e | iput-object v0, v1, Ljava/lang/ref/WeakReference; android.media.PlayerBase$IPlayerWrapper.mWeakPB // field@11864
0x000a: 0e00 | return-void
At run time, bytecode may be interpreted or compiled into machine code (AOT or JIT), and the call stack is walked differently in the two cases. The reason is that when machine code is executed, Java methods look like pure native functions in terms of stack frame structure (the new interpreter nterp, introduced in S, uses a stack frame structure consistent with machine code execution, so it performs better than the earlier mterp). Interpreted execution (here, the mterp interpreter) has a dedicated data structure that records the dex_pc value.
When we want to trace back a frame of Java call stack information, we want three pieces of information: the method name, the file name, and the line number. (As for lock information, it is not present in every frame, so it is a separate topic and not covered here.)
at android.os.Looper.loop(Looper.java:174)
Obtaining these three pieces of information depends on three pieces of data: the ArtMethod object, the DexFile information, and the dex_pc value. Since the DexFile information can be reached through the ArtMethod, the main goal of the stack walk is to find the ArtMethod object and the dex_pc value for each frame.
This lookup is easy for interpreted execution, because interpreted execution records both in a dedicated data structure called a ShadowFrame.
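How a method, its source file and a dex_pc combine into one output line can be sketched as follows. This is a toy model, not ART's data structures: ToyMethod, PositionEntry, LineForDexPc and FormatFrame are hypothetical names, and the sketch assumes the line table is a list of (dex_pc, line) entries sorted by dex_pc, where the line for a given dex_pc is taken from the last entry at or before it.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One row of a simplified line table: code starting at dex_pc belongs to line.
struct PositionEntry {
  uint32_t dex_pc;
  int line;
};

// Hypothetical stand-in for what ArtMethod plus DexFile make available.
struct ToyMethod {
  std::string pretty_name;               // e.g. "android.os.Looper.loop"
  std::string source_file;               // e.g. "Looper.java"
  std::vector<PositionEntry> positions;  // sorted by dex_pc, ascending
};

// Returns the line of the last entry whose dex_pc is <= the queried dex_pc,
// or -1 if the table has no such entry.
inline int LineForDexPc(const ToyMethod& method, uint32_t dex_pc) {
  int line = -1;
  for (const PositionEntry& entry : method.positions) {
    if (entry.dex_pc > dex_pc) {
      break;
    }
    line = entry.line;
  }
  return line;
}

// Formats one frame the way it appears in the trace file.
inline std::string FormatFrame(const ToyMethod& method, uint32_t dex_pc) {
  return "at " + method.pretty_name + "(" + method.source_file + ":" +
         std::to_string(LineForDexPc(method, dex_pc)) + ")";
}
```

With a made-up table such as {0, 160}, {4, 174}, {9, 180} for Looper.loop, a dex_pc of 6 falls in the second range and formats as the Looper.java:174 frame shown earlier.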
For machine code execution, however, the problem is much more complicated. Fortunately, every machine code frame follows a pattern: the top of the frame holds a pointer to the ArtMethod of the currently executing method. So when a chain of method calls has occurred, we can recover all the information starting from the sp value of the last frame, as follows:
- Starting from the sp value, dereference twice to reach the ArtMethod object of the currently running method.
- Obtain the FrameInfo through the ArtMethod, which gives the frame size.
- sp + frame size is the sp value of the previous frame.
- The return address can be found relative to the sp of the previous frame: it is normally held in the x30 register and is saved to the stack at a fixed offset when the method is called.
In this way we can recover the ArtMethod object and the PC value for each frame (the top frame is either a native method or a runtime method and does not need a line number). The dex_pc value can then be obtained with the following method, so that the details of each frame can be resolved.
uint32_t StackVisitor::GetDexPc(bool abort_on_failure) const {
  if (cur_shadow_frame_ != nullptr) {
    return cur_shadow_frame_->GetDexPC();
  } else if (cur_quick_frame_ != nullptr) {
    if (IsInInlinedFrame()) {
      return current_inline_frames_.back().GetDexPc();
    } else if (cur_oat_quick_method_header_ == nullptr) {
      return dex::kDexNoIndex;
    } else if ((*GetCurrentQuickFrame())->IsNative()) {
      return cur_oat_quick_method_header_->ToDexPc(
          GetCurrentQuickFrame(), cur_quick_frame_pc_, abort_on_failure);
    } else if (cur_oat_quick_method_header_->IsOptimized()) {
      StackMap* stack_map = GetCurrentStackMap();
      DCHECK(stack_map->IsValid());
      return stack_map->GetDexPc();
    } else {
      DCHECK(cur_oat_quick_method_header_->IsNterpMethodHeader());
      return NterpGetDexPC(cur_quick_frame_);
    }
  } else {
    return 0;
  }
}
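The frame-walking steps for machine code frames listed before GetDexPc can be sketched over a simulated stack. The layout here is an assumption made for illustration and is not ART's real frame layout: the stack is a vector of words, the first word of each frame holds a pointer to a hypothetical ToyMethod, the last word of each frame holds the return PC, and a null method pointer ends the walk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for ArtMethod plus its frame info.
struct ToyMethod {
  const char* name;
  size_t frame_size_words;  // size of this method's frame, in stack words
};

struct FrameInfo {
  const ToyMethod* method;
  uintptr_t return_pc;  // address execution resumes at in the caller
};

// Walks frames starting at word index sp and returns them innermost first.
inline std::vector<FrameInfo> WalkQuickFrames(
    const std::vector<uintptr_t>& stack, size_t sp) {
  std::vector<FrameInfo> frames;
  while (sp < stack.size()) {
    // First dereference: the word at sp is a method pointer.
    const ToyMethod* method = reinterpret_cast<const ToyMethod*>(stack[sp]);
    if (method == nullptr) {
      break;  // sentinel marking the end of the managed frames
    }
    // Second dereference: read the frame size from the method itself.
    assert(method->frame_size_words >= 2);
    size_t caller_sp = sp + method->frame_size_words;
    assert(caller_sp <= stack.size());
    // The return PC sits at a fixed offset: the last word of the frame.
    frames.push_back({method, stack[caller_sp - 1]});
    sp = caller_sp;
  }
  return frames;
}
```

Each iteration needs only the current sp: the method pointer gives the frame size, and the frame size gives both the caller's sp and the slot holding the return address.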
One case was omitted from the description above: inlined Java methods, which are also time-consuming to handle during the unwind.
5. Native call stack collection
During normal trace generation, whether a thread’s native call stack is collected is decided by the function below. The list that follows it describes the checks in priority order.
static bool ShouldShowNativeStack(const Thread* thread)
    REQUIRES_SHARED(Locks::mutator_lock_) {
  ThreadState state = thread->GetState();

  // In native code somewhere in the VM (one of the kWaitingFor* states)? That's interesting.
  if (state > kWaiting && state < kStarting) {
    return true;
  }

  // In an Object.wait variant or Thread.sleep? That's not interesting.
  if (state == kTimedWaiting || state == kSleeping || state == kWaiting) {
    return false;
  }

  // Threads with no managed stack frames should be shown.
  if (!thread->HasManagedStack()) {
    return true;
  }

  // In some other native method? That's interesting.
  // We don't just check kNative because native methods will be in state kSuspended if they're
  // calling back into the VM, or kBlocked if they're blocked on a monitor, or one of the
  // thread-startup states if it's early enough in their life cycle (http://b/7432159).
  ArtMethod* current_method = thread->GetCurrentMethod(nullptr);
  return current_method != nullptr && current_method->IsNative();
}
- When the state is a virtual-machine-specific state, the native call stack is collected. What is a virtual-machine-specific state? One example is kWaitingForGcToComplete, which indicates that the current thread is waiting for GC to finish. These are states in which the virtual machine itself is affecting how the thread runs.
- If the state is one of the Waiting or Sleeping states, collection of the native call stack is skipped. The native call stack of a thread in such a state always ends in a futex system call, so printing it adds nothing useful for debugging.
- If the thread has no Java call stack, the native call stack is collected; otherwise there would be nothing to output.
- If the last frame of the Java call stack is a native method, the native call stack is collected in order to see what the native layer is doing.
The next topic is how the native call stack is collected, a process known as unwinding, which on Android is done by the libunwindstack library.
Collecting the native call stack essentially means finding the PC value of each frame. Once we have the register values of the last frame, we can recover the PC value of each earlier frame by repeatedly looking up return addresses.
Therefore, the following questions can be reduced to two:
- How to find the last frame register (SP/PC) value?
- How to find the return address of each frame?
5.1 How do I find the register value of the last frame
Register information is thread-dependent in nature, so two cases are discussed here.
- This thread collects the call stack for this thread.
- The “Signal Catcher” thread collects call stacks from other threads.
For a thread collecting its own call stack, obtaining the register values is relatively simple and needs only a few basic assembly instructions. For example, the following code stores the general-purpose registers x0 through x30, together with sp and pc, in a user-space data structure.
inline __attribute__((__always_inline__)) void AsmGetRegs(void* reg_data) {
asm volatile(
"1:\n"
"stp x0, x1, [%[base], #0]\n"
"stp x2, x3, [%[base], #16]\n"
"stp x4, x5, [%[base], #32]\n"
"stp x6, x7, [%[base], #48]\n"
"stp x8, x9, [%[base], #64]\n"
"stp x10, x11, [%[base], #80]\n"
"stp x12, x13, [%[base], #96]\n"
"stp x14, x15, [%[base], #112]\n"
"stp x16, x17, [%[base], #128]\n"
"stp x18, x19, [%[base], #144]\n"
"stp x20, x21, [%[base], #160]\n"
"stp x22, x23, [%[base], #176]\n"
"stp x24, x25, [%[base], #192]\n"
"stp x26, x27, [%[base], #208]\n"
"stp x28, x29, [%[base], #224]\n"
"str x30, [%[base], #240]\n"
"mov x12, sp\n"
"adr x13, 1b\n"
"stp x12, x13, [%[base], #248]\n"
      : [base] "+r"(reg_data)
      :
      : "x12", "x13", "memory");
}
But what if it is retrieved across threads (not processes)?
The answer is signals. Register values are not saved anywhere while the target thread is running in user space; they are saved only when the thread switches between user mode and kernel mode. That switch is also where pending signals are detected and the signal handler is invoked, so the register values that were just saved can be passed on to the handler. This is how register values are actually obtained across threads.
Android uses THREAD_SIGNAL (33) for this, and its handler is relatively simple: it copies the register information from the sigcontext into global data so that other threads can read it.
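The handler-side idea, copying the saved context into a global so that another thread can read it, can be sketched with standard POSIX calls. This is not the libunwindstack handler: SIGUSR2 stands in for THREAD_SIGNAL, the names g_saved_context, InstallCaptureHandler and CaptureRegistersOfCurrentThread are hypothetical, and for simplicity the signal is raised on the calling thread instead of being sent to another thread.

```cpp
#include <cassert>
#include <signal.h>
#include <ucontext.h>

// Global slot that a collecting thread would read the registers from.
static ucontext_t g_saved_context;
static volatile sig_atomic_t g_context_ready = 0;

// Runs on the signalled thread. The third argument points at the register
// state the kernel saved when it interrupted that thread.
static void CaptureContextHandler(int /*signal_number*/, siginfo_t* /*info*/,
                                  void* raw_context) {
  // Shallow copy: any pointers inside the context still refer to the
  // signal frame and are only valid while the handler is running.
  g_saved_context = *static_cast<ucontext_t*>(raw_context);
  g_context_ready = 1;
}

// Installs the handler for SIGUSR2 (a stand-in for THREAD_SIGNAL).
inline bool InstallCaptureHandler() {
  struct sigaction action = {};
  action.sa_sigaction = CaptureContextHandler;
  action.sa_flags = SA_SIGINFO | SA_RESTART;
  sigemptyset(&action.sa_mask);
  return sigaction(SIGUSR2, &action, nullptr) == 0;
}

// Demo trigger: signals the calling thread. raise() does not return until
// the handler has finished, so the flag can be checked right away.
inline bool CaptureRegistersOfCurrentThread() {
  g_context_ready = 0;
  raise(SIGUSR2);
  return g_context_ready == 1;
}
```

In a real cross-thread capture the requester would direct the signal at the target thread (for example with pthread_kill) and then wait until the handler reports that the copy is complete.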
5.2 How do I find the return address of each frame
When a function call occurs, the return address is placed in the x30 register (AArch64). If the callee needs that register itself, for example because it makes further calls, its prologue must save the value of x30 on the stack, or the return address will be lost. But where exactly on the stack is the value of x30 saved?
When code is compiled with frame pointers enabled (-fno-omit-frame-pointer), x30 is stored next to x29 (the FP register), so it is easy to locate. Without frame pointers, however, locating the saved x30 requires more information. In 64-bit libraries, this information is called “Call Frame Information” and is stored in the .eh_frame section of the ELF file. An article by the WeChat technology team describes this clearly; it is quoted below:
When your code reaches a certain “line”, depending on the PC at that time, we can look up the “Call Frame Information” to see how each register should be restored when exiting the current function stack. For example, it may describe where in the current stack the register value should be read back.
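For the simpler frame-pointer case, where x29 and x30 are saved side by side, the walk can be sketched over a simulated chain. Under the AArch64 procedure call standard each frame record is a pair: the caller's frame pointer followed by the return address. The FrameRecord struct and WalkFramePointers function below are hypothetical names, and the sketch walks hand-built records instead of a live stack.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One AArch64-style frame record: saved x29 followed by saved x30.
struct FrameRecord {
  const FrameRecord* caller_fp;  // saved x29: the caller's frame record
  uintptr_t return_address;      // saved x30: where the caller resumes
};

// Follows the chain of frame records and collects the return addresses,
// innermost first. max_frames guards against corrupted or cyclic chains.
inline std::vector<uintptr_t> WalkFramePointers(const FrameRecord* fp,
                                                size_t max_frames) {
  std::vector<uintptr_t> pcs;
  while (fp != nullptr && pcs.size() < max_frames) {
    pcs.push_back(fp->return_address);
    fp = fp->caller_fp;
  }
  return pcs;
}
```

When frame pointers are not available, there is no such chain to follow, which is why the unwinder has to consult the Call Frame Information for each PC instead.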
In addition to unwinding pure native frames, the libunwindstack library also supports AOT/JIT frames and interpreted frames. This means that a call stack collected by libunwindstack can reflect not only the calls in the native layer but also those in the Java layer, as the following example shows.
#00 pc 000aa0f8 /system/lib/libart.so (void std::__1::__tree_balance_after_insert<std::__1::__tree_node_base<void*>*>(std::__1::__tree_node_base<void*>*, std::__1::__tree_balance_after_insert<std::__1::__tree_node_base<void*>*>)+32) #01 pc 001a0a35 /system/lib/libart.so (art::gc::space::LargeObjectMapSpace::Alloc(art::Thread*, unsigned int, unsigned int*, unsigned int*, unsigned int*)+180) #02 pc 003cd4f5 /system/lib/libart.so (art::mirror::Object* art::gc::Heap::AllocLargeObject<false, art::mirror::SetLengthVisitor>(art::Thread*, art::ObjPtr<art::mirror::Class>*, unsigned int, art::mirror::SetLengthVisitor const&)+108) #03 pc 003cb659 /system/lib/libart.so (artAllocArrayFromCodeResolvedRegionTLAB+484) #04 pc 00411613 /system/lib/libart.so (art_quick_alloc_array_resolved16_region_tlab+82) #05 pc 0020cfe3 /system/framework/arm/boot-core-oj.oat (offset 0x10d000) (java.lang.AbstractStringBuilder.append+242) #06 pc 002b809b /system/framework/arm/boot-core-oj.oat (offset 0x10d000) (java.lang.StringBuilder.append+50) #07 pc 001199b7 /system/framework/arm/boot-core-libart.oat (offset 0x76000) (org.json.JSONTokener.nextString+214) #08 pc 00119b73 /system/framework/arm/boot-core-libart.oat (offset 0x76000) (org.json.JSONTokener.nextValue+162) #09 pc 001195db /system/framework/arm/boot-core-libart.oat (offset 0x76000) (org.json.JSONTokener.readObject+314) #10 pc 00119b47 /system/framework/arm/boot-core-libart.oat (offset 0x76000) (org.json.JSONTokener.nextValue+118) #11 pc 0040d775 /system/lib/libart.so (art_quick_invoke_stub_internal+68) #12 pc 003e72c9 /system/lib/libart.so (art_quick_invoke_stub+224) #13 pc 000a103d /system/lib/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+136) #14 pc 001e60f1 /system/lib/libart.so (art::interpreter::ArtInterpreterToCompiledCodeBridge(art::Thread*, art::ArtMethod*, art::ShadowFrame*, unsigned short, art::JValue*)+236) #15 pc 001e0bdf /system/lib/libart.so (bool 
art::interpreter::DoCall<false, false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, unsigned short, art::JValue*)+814) #16 pc 003e1f23 /system/lib/libart.so (MterpInvokeVirtual+442) #17 pc 00400514 /system/lib/libart.so (ExecuteMterpImpl+14228) #18 pc 002613ec /system/priv-app/ReusLauncherDev/ReusLauncherDev.apk (offset 0x9c9000) (com.reus.launcher.AsusAnimationIconReceiver.a+80) #19 pc 001c535b /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadERKNS_20CodeItemDataAccessorERNS_11ShadowFrameENS_6JValueEb.llvm.866626450+378 ) #20 pc 001c9a41 /system/lib/libart.so (art::interpreter::ArtInterpreterToInterpreterBridge(art::Thread*, art::CodeItemDataAccessor const&, art::ShadowFrame*, art::JValue*)+152) #21 pc 001e0bc7 /system/lib/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, unsigned short, art::JValue*)+790) #22 pc 003e2eff /system/lib/libart.so (MterpInvokeStatic+130) #23 pc 00400694 /system/lib/libart.so (ExecuteMterpImpl+14612) #24 pc 0028ae7a /system/priv-app/ReusLauncherDev/ReusLauncherDev.apk (offset 0x9c9000) (com.reus.launcher.d.run+1274) #25 pc 001c535b /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadERKNS_20CodeItemDataAccessorERNS_11ShadowFrameENS_6JValueEb.llvm.866626450+378 ) #26 pc 001c9987 /system/lib/libart.so (art::interpreter::EnterInterpreterFromEntryPoint(art::Thread*, art::CodeItemDataAccessor const&, art::ShadowFrame*)+82) #32 pc 0040d775 /system/lib/libart.so (art_quick_invoke_stub_internal+68) #33 pc 003e72c9 /system/lib/libart.so (art_quick_invoke_stub+224) #34 pc 000a103d /system/lib/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+136) #36 pc 00348f6d /system/lib/libart.so (art::InvokeVirtualOrInterfaceWithJValues(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, _jmethodID*, jvalue*)+320) #37 pc 00369ee7 /system/lib/libart.so 
(art::Thread::CreateCallback(void*)+866) #38 pc 00072131 /system/lib/libc.so (__pthread_start(void*)+22) #39 pc 0001e005 /system/lib/libc.so (__start_thread+24)
#24 is a dummy frame: it does not exist on the physical stack but is added to aid debugging, and it identifies the Java method that the interpreter is executing in frames #23 to #25. Frames #05 to #10 are Java methods executed as machine code (AOT-compiled). Frames #00 to #04 are pure native-layer (.so library) function calls.
Here's a question: libunwindstack can collect Java-layer frames, so why does the native call stack in the trace file show only native-layer frames?
The reason is that the call stack is truncated and filtered when it is collected for the trace file. The specific strategies are as follows:
-
During backtracking, if a frame belongs to a file with the suffix oat or odex, the backtrace stops. The reason is that JNI trampoline (springboard) functions and AOT-compiled Java methods usually live in oat/odex files, so stopping there omits the Java methods that would otherwise follow.
backtrace_map_->SetSuffixesToIgnore(std::vector<std::string> { "oat", "odex" });
-
Frames that fall in "libunwindstack.so" or "libbacktrace.so" are not displayed. The reason is that these frames reflect the call stack collection process itself, not the original call logic of the thread.
std::vector<std::string> skip_names{"libunwindstack.so", "libbacktrace.so"};
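The two strategies above can be sketched as a small filter over an already-unwound list of frames. This is only an illustration, not the ART implementation: the Frame struct and the function names are hypothetical, and the suffix test (extension after the last dot) is an assumed simplification of how the ignore list is matched.

```cpp
#include <string>
#include <vector>

// Hypothetical frame record; only the mapped file name matters for filtering.
struct Frame {
  std::string map_name;  // e.g. "/system/lib64/libutils.so"
  std::string function;  // symbol name, kept only for display
};

// Assumed simplification: compare the extension after the last '.'.
static bool HasIgnoredSuffix(const std::string& path) {
  static const std::vector<std::string> kSuffixes = {"oat", "odex"};
  const size_t dot = path.find_last_of('.');
  if (dot == std::string::npos) return false;
  const std::string ext = path.substr(dot + 1);
  for (const std::string& s : kSuffixes) {
    if (ext == s) return true;
  }
  return false;
}

// True when the file name (without directories) is one of the unwinder libraries.
static bool IsSkippedLibrary(const std::string& path) {
  static const std::vector<std::string> kSkipNames = {"libunwindstack.so",
                                                      "libbacktrace.so"};
  const size_t slash = path.find_last_of('/');
  const std::string base =
      (slash == std::string::npos) ? path : path.substr(slash + 1);
  for (const std::string& s : kSkipNames) {
    if (base == s) return true;
  }
  return false;
}

// Frames are ordered innermost first, as in a trace file.
std::vector<Frame> FilterNativeFrames(const std::vector<Frame>& unwound) {
  std::vector<Frame> shown;
  for (const Frame& f : unwound) {
    if (HasIgnoredSuffix(f.map_name)) break;     // strategy 1: stop at oat/odex
    if (IsSkippedLibrary(f.map_name)) continue;  // strategy 2: hide unwinder frames
    shown.push_back(f);
  }
  return shown;
}
```

Note that a frame in libart.so has no oat/odex extension, so under this sketch it never stops the walk; that detail matters for the defects discussed next.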
5.3 Defects of current call stack backtracking
On closer inspection, the first strategy turns out to be flawed. The defect has two main aspects:
-
Is the JNI trampoline function necessarily in an oat/odex file?
No. During dex2oat, the system generates one shared JNI trampoline function in the oat/odex file for native methods with compatible parameters (same number, similar types). An example makes this concrete.
#05 pc 00000000000eeb24 /system/lib64/libandroid_runtime.so (android::nativeCreate(_JNIEnv*, _jclass*, _jstring*, int)+132) #06 pc 00000000003dff04 /system/framework/arm64/boot-framework.oat (offset 0x3d6000) (android.graphics.FontFamily.nInitBuilder [DEDUPED]+180) #07 pc 000000000091414c /system/framework/arm64/boot-framework.oat (offset 0x3d6000) (android.database.CursorWindow.<init>+172)
The call stack above is typical output from a pre-Android S tombstone file. The CursorWindow constructor in #07 actually calls nativeCreate, yet frame #06 shows nInitBuilder. The reason is that one JNI trampoline function can be shared by multiple native methods, while backtracking shows only one of their names. The [DEDUPED] marker therefore tells us that the name in this frame cannot be trusted. The detailed explanation is as follows:
## DEDUPED frames
If the name of a Java method includes `[DEDUPED]`, this means that multiple methods share the same code. ART only stores the name of a single one in its metadata, which is displayed here. This is not necessarily the one that was called.
Looking at the method definitions of nativeCreate and nInitBuilder, we find that they have the same number and types of parameters, so they can share one JNI trampoline function after dex2oat.
private static native long nativeCreate(String name, int cursorWindowSize);
private static native long nInitBuilder(String langs, int variant);
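To see why these two declarations look identical to a trampoline, it helps to reduce each signature to its dex "shorty" form, where the return type comes first and every reference type collapses to L. The sketch below is a hedged illustration: NativeMethod and TrampolineKey are hypothetical names, and keying on staticness plus shorty is an assumed simplification (the real deduplication compares generated code, which other method properties can also influence).

```cpp
#include <string>
#include <vector>

// Map a JVM type descriptor to its one-character shorty code.
// Classes ("L...;") and arrays ("[...") both become 'L'.
char ShortyChar(const std::string& descriptor) {
  if (descriptor.empty()) return 'V';
  const char c = descriptor[0];
  return (c == 'L' || c == '[') ? 'L' : c;
}

// Shorty = return type followed by parameter types.
std::string Shorty(const std::string& ret,
                   const std::vector<std::string>& params) {
  std::string s(1, ShortyChar(ret));
  for (const std::string& p : params) s.push_back(ShortyChar(p));
  return s;
}

// Hypothetical description of a native method declaration.
struct NativeMethod {
  std::string name;
  bool is_static;
  std::string ret;                  // return type descriptor
  std::vector<std::string> params;  // parameter type descriptors
};

// Assumed simplification: methods with the same key can share one trampoline.
std::string TrampolineKey(const NativeMethod& m) {
  return std::string(m.is_static ? "static:" : "instance:") +
         Shorty(m.ret, m.params);
}
```

Under these assumptions, nativeCreate and nInitBuilder (both static, returning long, taking a String and an int) produce the same key, "static:JLI", which matches the sharing described above.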
Fortunately, starting with Android S, the trampoline frame no longer shows a specific method name but the uniform art_jni_trampoline, which is less confusing to developers. The following is an example.
#05 pc 00000000004a600c /apex/com.android.art/lib64/libart.so (art::VMDebug_countInstancesOfClass(_JNIEnv*, _jclass*, _jclass*, unsigned char)+876) (BuildId: 2ede688a1cdde049a8439e413c1c41f8) #06 pc 0000000000010fb4 /apex/com.android.art/javalib/arm64/boot-core-libart.oat (art_jni_trampoline+180) (BuildId: a58ab7e35be2dda5ad3453c56bfefea6edf331bf) #07 pc 000000000064037c /system/framework/arm64/boot-framework.oat (android.os.Debug.countInstancesOfClass+44) (BuildId: e47113da18d4f822af52023fa19893d55035facd) #08 pc 0000000000812930 /system/framework/arm64/boot-framework.oat (android.view.ViewDebug.getViewRootImplCount+48) (BuildId: e47113da18d4f822af52023fa19893d55035facd)
The JNI trampoline functions generated by dex2oat are indeed located in oat/odex files. In other cases, however, dex2oat does not generate a JNI trampoline for a native method; instead, the uniform art_quick_generic_jni_trampoline performs parameter passing and state switching dynamically at runtime. art_quick_generic_jni_trampoline lives in libart.so, which does not match the oat/odex suffixes, so backtracking continues past this frame. If the subsequent Java methods are all interpreted, the interpreter frames are all backtracked as well, as the following example shows.
"Binder:1083_11" prio=5 tid=127 Native | group="main" sCount=1 dsCount=0 flags=1 obj=0x16002138 self=0xb40000715181e940 | sysTid=6990 nice=0 cgrp=default sched=0/0 handle=0x6f5580fcc0 | state=S schedstat=( 4739949803 13009985270 12510 ) utm=234 stm=239 core=3 HZ=100 | stack=0x6f55718000-0x6f5571a000 stackSize=995KB | held mutexes= native: #00 pc 000000000009aa34 /apex/com.android.runtime/lib64/bionic/libc.so (__ioctl+4) native: #01 pc 0000000000057564 /apex/com.android.runtime/lib64/bionic/libc.so (ioctl+156) native: #02 pc 00000000000999d4 /system/lib64/libhidlbase.so (android::hardware::IPCThreadState::transact(int, unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int)+564) native: #03 pc 0000000000094e84 /system/lib64/libhidlbase.so (android::hardware::BpHwBinder::transact(unsigned int, android::hardware::Parcel const&, android::hardware::Parcel*, unsigned int, std::__1::function<void (android::hardware::Parcel&)>)+76) native: #04 pc 000000000000e538 /system/lib64/[email protected] (android::system::suspend::V1_0::BpHwSystemSuspend::_hidl_acquireWakeLock(android::hardware::IInterface*, android::hardware::details::HidlInstrumentor*, android::system::suspend::V1_0::WakeLockType, android::hardware::hidl_string const&)+324) native: #05 pc 0000000000003178 /system/lib64/libhardware_legacy.so (acquire_wake_lock+356) native: #06 pc 0000000000086648 /system/lib64/libandroid_servers.so (android::nativeAcquireSuspendBlocker(_JNIEnv*, _jclass*, _jstring*)+64) native: #07 pc 000000000013ced4 /apex/com.android.art/lib64/libart.so (art_quick_generic_jni_trampoline+148) native: #08 pc 00000000001337e8 /apex/com.android.art/lib64/libart.so (art_quick_invoke_static_stub+568) native: #09 pc 00000000001a8a94 /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+228) native: #10 pc 0000000000318240 /apex/com.android.art/lib64/libart.so 
(art::interpreter::ArtInterpreterToCompiledCodeBridge(art::Thread*, art::ArtMethod*, art::ShadowFrame*, unsigned short, art::JValue*)+376) native: #11 pc 000000000030e56c /apex/com.android.art/lib64/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, unsigned short, art::JValue*)+996) native: #12 pc 000000000067e098 /apex/com.android.art/lib64/libart.so (MterpInvokeStatic+548) native: #13 pc 000000000012d994 /apex/com.android.art/lib64/libart.so (mterp_op_invoke_static+20) native: #14 pc 0000000000617e00 /system/framework/services.jar (com.android.server.power.PowerManagerService.access$600) native: #15 pc 000000000067e33c /apex/com.android.art/lib64/libart.so (MterpInvokeStatic+1224) native: #16 pc 000000000012d994 /apex/com.android.art/lib64/libart.so (mterp_op_invoke_static+20) native: #17 pc 0000000000614fec /system/framework/services.jar (com.android.server.power.PowerManagerService$NativeWrapper.nativeAcquireSuspendBlocker) native: #18 pc 000000000067b3e0 /apex/com.android.art/lib64/libart.so (MterpInvokeVirtual+1520) native: #19 pc 000000000012d814 /apex/com.android.art/lib64/libart.so (mterp_op_invoke_virtual+20) native: #20 pc 00000000006152b0 /system/framework/services.jar (com.android.server.power.PowerManagerService$SuspendBlockerImpl.acquire+52) native: #21 pc 0000000000305b68 /apex/com.android.art/lib64/libart.so (art::interpreter::Execute(art::Thread*, art::CodeItemDataAccessor const&, art::ShadowFrame&, art::JValue, bool, bool) (.llvm.10833873914857160001)+268) native: #22 pc 0000000000669e48 /apex/com.android.art/lib64/libart.so (artQuickToInterpreterBridge+780) native: #23 pc 000000000013cff8 /apex/com.android.art/lib64/libart.so (art_quick_to_interpreter_bridge+88) native: #24 pc 00000000021f4bc4 /memfd:jit-cache (deleted) (offset 2000000) (com.android.server.power.PowerManagerService.updateSuspendBlockerLocked+228) native: #25 pc 000000000201cf6c /memfd:jit-cache 
(deleted) (offset 2000000) (com.android.server.power.PowerManagerService.updatePowerStateLocked+988) native: #26 pc 00000000021a3800 /memfd:jit-cache (deleted) (offset 2000000) (com.android.server.power.PowerManagerService.acquireWakeLockInternal+1712) native: #27 pc 000000000205640c /memfd:jit-cache (deleted) (offset 2000000) (com.android.server.power.PowerManagerService$BinderService.acquireWakeLock+524) native: #28 pc 0000000002040b64 /memfd:jit-cache (deleted) (offset 2000000) (android.os.IPowerManager$Stub.onTransact+8340) native: #29 pc 00000000020c95a4 /memfd:jit-cache (deleted) (offset 2000000) (android.os.Binder.execTransactInternal+996) native: #30 pc 00000000020b9a0c /memfd:jit-cache (deleted) (offset 2000000) (android.os.Binder.execTransact+284) native: #31 pc 0000000000133564 /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+548) native: #32 pc 00000000001a8a78 /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+200) native: #33 pc 0000000000553c70 /apex/com.android.art/lib64/libart.so (art::JValue art::InvokeVirtualOrInterfaceWithVarArgs<art::ArtMethod*>(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, art::ArtMethod*, std::__va_list)+468) native: #34 pc 0000000000553e10 /apex/com.android.art/lib64/libart.so (art::JValue art::InvokeVirtualOrInterfaceWithVarArgs<_jmethodID*>(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, _jmethodID*, std::__va_list)+92) native: #35 pc 00000000003a0920 /apex/com.android.art/lib64/libart.so (art::JNI<false>::CallBooleanMethodV(_JNIEnv*, _jobject*, _jmethodID*, std::__va_list)+660) native: #36 pc 000000000009c698 /system/lib64/libandroid_runtime.so (_JNIEnv::CallBooleanMethod(_jobject*, _jmethodID*, ...)+124) native: #37 pc 0000000000124064 /system/lib64/libandroid_runtime.so (JavaBBinder::onTransact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+156) native: #38 pc 000000000004882c 
/system/lib64/libbinder.so (android::BBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+232) native: #39 pc 0000000000051110 /system/lib64/libbinder.so (android::IPCThreadState::executeCommand(int)+1032) native: #40 pc 0000000000050c58 /system/lib64/libbinder.so (android::IPCThreadState::getAndExecuteCommand()+156) native: #41 pc 0000000000051490 /system/lib64/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+60) native: #42 pc 00000000000773e0 /system/lib64/libbinder.so (android::PoolThread::threadLoop()+24) native: #43 pc 000000000001549c /system/lib64/libutils.so (android::Thread::_threadLoop(void*)+260) native: #44 pc 00000000000a2590 /system/lib64/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+144) native: #45 pc 0000000000014d60 /system/lib64/libutils.so (thread_data_t::trampoline(thread_data_t const*)+412) native: #46 pc 00000000000af808 /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+64) native: #47 pc 000000000004fc88 /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64) at com.android.server.power.PowerManagerService.nativeAcquireSuspendBlocker(Native method) at com.android.server.power.PowerManagerService.access$600(PowerManagerService.java:125) at com.android.server.power.PowerManagerService$NativeWrapper.nativeAcquireSuspendBlocker(PowerManagerService.java:713) at com.android.server.power.PowerManagerService$SuspendBlockerImpl.acquire(PowerManagerService.java:4643) - locked <0x073deae2> (a com.android.server.power.PowerManagerService$SuspendBlockerImpl) at com.android.server.power.PowerManagerService.updateSuspendBlockerLocked(PowerManagerService.java:3067) at com.android.server.power.PowerManagerService.updatePowerStateLocked(PowerManagerService.java:1956) at com.android.server.power.PowerManagerService.acquireWakeLockInternal(PowerManagerService.java:1320) - locked <0x03f99c8c> (a java.lang.Object) at 
com.android.server.power.PowerManagerService.access$4600(PowerManagerService.java:125) at com.android.server.power.PowerManagerService$BinderService.acquireWakeLock(PowerManagerService.java:4780) at android.os.IPowerManager$Stub.onTransact(IPowerManager.java:421) at android.os.Binder.execTransactInternal(Binder.java:1154) at android.os.Binder.execTransact(Binder.java:1123)
Here the call stack tagged "native:" actually contains Java-layer information, so the Java-layer information is output twice (redundantly). Without knowing how stack backtracking works, many readers will wonder: reading top to bottom, why does #47 __start_thread appear to be called by the Java method nativeAcquireSuspendBlocker? This is not a real call path; it is simply a flaw in the current call stack collection scheme for trace files.
-
Backtracking stops when it encounters a JNI trampoline function located in an oat/odex file. This works for most scenarios. However, if Java and native calls are interleaved as follows, the current scheme loses part of the call stack.
Java method A
↓(call)
Native method B
↓(call)
Java method C
↓(call)
Native method D
In the final combined call stack, the native frames of method B disappear, because native-layer backtracking ends as soon as it reaches C. Here is a practical example.
"Binder:1540_2" prio=5 tid=9 Blocked | group="main" sCount=1 dsCount=0 flags=1 obj=0x13700580 self=0x7e0c139800 | sysTid=1560 nice=-2 cgrp=default sched=0/0 handle=0x7df07474f0 | state=S schedstat=( 126689305075 80266662086 342299 ) utm=8978 stm=3690 core=0 HZ=100 | stack=0x7df064c000-0x7df064e000 stackSize=1009KB | held mutexes= at com.android.server.LocationManagerService.isProviderEnabledForUser(LocationManagerService.java:2813) - waiting to lock <0x07cdf9c8> (a java.lang.Object) held by thread 11 at android.location.ILocationManager$Stub.onTransact(ILocationManager.java:488) at android.os.Binder.execTransact(Binder.java:726) (- native call stack lost in between -) at android.os.BinderProxy.transactNative(Native method) at android.os.BinderProxy.transact(BinderProxy.java:473) at android.location.IGeocodeProvider$Stub$Proxy.getFromLocation(IGeocodeProvider.java:143) at com.android.server.location.GeocoderProxy$1.run(GeocoderProxy.java:79) at com.android.server.ServiceWatcher.runOnBinder(ServiceWatcher.java:425) - locked <0x0d7e7a61> (a java.lang.Object) at com.android.server.location.GeocoderProxy.getFromLocation(GeocoderProxy.java:74) at com.android.server.LocationManagerService.getFromLocation(LocationManagerService.java:3341) at android.location.ILocationManager$Stub.onTransact(ILocationManager.java:217) at android.os.Binder.execTransact(Binder.java:726)
The thread initiates a binder call to a peer process, which, while handling it, initiates a new binder call back to this process. By the design of the binder transaction stack, this new call must be handled by the original thread, so the upper execTransact shows the thread processing that nested call. Between transactNative and execTransact, the native-layer frames are missing.
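The loss in the interleaved case can be reproduced with a toy model of how the dump is composed: native frames are printed until the first oat/odex frame, then the Java frames are printed from the managed stack walk. The types and the DumpedStack function below are hypothetical, and the model deliberately omits the "(Native method)" entries that the Java walk prints for native method declarations.

```cpp
#include <string>
#include <vector>

// Where a frame's code lives: a .so library, or compiled Java in oat/odex.
enum class Kind { kNativeCode, kCompiledJava };

struct Frame {
  std::string name;
  Kind kind;
};

// real_stack is ordered innermost first (callee before caller).
std::vector<std::string> DumpedStack(const std::vector<Frame>& real_stack) {
  std::vector<std::string> out;
  // Native part: stop at the first frame that belongs to an oat/odex file.
  for (const Frame& f : real_stack) {
    if (f.kind == Kind::kCompiledJava) break;
    out.push_back("native: " + f.name);
  }
  // Java part: the managed stack walk sees every Java method.
  for (const Frame& f : real_stack) {
    if (f.kind == Kind::kCompiledJava) out.push_back("at " + f.name);
  }
  return out;
}
```

For the stack D (native), C (Java), B (native), A (Java), this model yields "native: D", "at C", "at A": the native frames of B sit below the first oat/odex frame and are never reached.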
Both of these flaws are minor and harmless. After I reported them to Google engineers, they said that these problems will most likely be fixed in Android T.
Conclusion
When solving most app problems, the call stack is the most important source for analysis. If it always reflected the thread's execution logic perfectly, knowing the details would not matter. But that is not the case. In some ANR scenarios, threads may be stuck in GoToRunnable; with interleaved calls, the intermediate native frames may be lost; and so on. In these situations the call stack presents confusing information that can only be resolved by understanding the details of backtracking.