Caikelun. IO/POST /2019-0…
Does flawless code logic always produce a flawless program? The answer is no. At the software level, perhaps only the binary never lies to you.
The symptom
Recently, the business side reported a strange crash and concluded that the crash report did not contain enough information to fix it.
Signal: 11 (SIGSEGV), Code: 1 (SEGV_MAPERR), fault addr 0x1
r0 993ff520 r1 dc3170c4 r2 00000000 r3 dabe3e08
r4 993ff520 r5 00000005 r6 00000290 r7 000007ac
r8 e83253a0 r9 00006aba r10 bf921e39 r11 e83253a0
ip bfa3a9e0 sp 993ff494 lr bf88a71d pc bf96c31c
#00 pc 001a731c /data/data/com.package.name/files/download/libmcto_media_player.so
#01 pc 0020b7e5 /data/data/com.package.name/files/download/libmcto_media_player.so
#00 993ff494 0000022c
993ff498 adcfd000 [anon:libc_malloc]
993ff49c bf88a71d /data/data/com.package.name/files/download/libmcto_media_player.so
993ff4a0 ffffffff
993ff4a4 ffffffff
993ff4a8 bf9d07e7 /data/data/com.package.name/files/download/libmcto_media_player.so
#01 993ff4ac 00000000
993ff4b0 00000000
993ff4b4 00000000
993ff4b8 00000000
993ff4bc 00000000
993ff4c0 adcfd234 [anon:libc_malloc]
993ff4c4 00000000
993ff4c8 0000006e
993ff4cc 00000000
993ff4d0 adcfdf6c [anon:libc_malloc]
993ff4d4 00000000
993ff4d8 00000000
993ff4dc 00000000
993ff4e0 00000000
993ff4e4 00000000
993ff4e8 00000000
At first glance, this looks like a bug in the dynamic library's business logic that ends in a segmentation fault. The backtrace is indeed incomplete, though, and it is hard to see why. Some help with the analysis really was needed.
Analysis
Common reasons for an incomplete backtrace
Backtraces come out incomplete from time to time. The common reasons are:
- The stack memory was badly overwritten in the run-up to the crash. This gets worse when the logic near the crash point processes highly random external input, and you then tend to see a large number of scattered, incomplete backtraces. For example:
#00 pc 00000ffb
#01 pc 0009a885 /data/app/com.package.name-1/lib/arm/libjsc.so
#02 pc 0003ff93 /data/app/com.package.name-1/lib/arm/libjsc.so
#03 pc 0011f60f /data/app/com.package.name-1/lib/arm/libjsc.so
#04 pc fffffffb

#00 pc 000092fe
#01 pc 00099ec3 /data/app/com.package.name-1/lib/arm/libjsc.so
#02 pc 00003ffe

#00 pc 00000ffb
- The unwind tables of some ELF files on the call path are incomplete. ODEX/OAT files on some systems and the WebView Chromium library on some systems fall into this category. For example:
#00 pc 00d12bcc /system/lib/libwebviewchromium.so

#00 pc 01a0cf72 /system/app/WebViewGoogle/WebViewGoogle.apk!libwebviewchromium.so (offset 0x46da000)

#00 pc 00006fde /data/app/com.package.name-1/lib/arm/libcros.so
#01 pc 00007007 /data/app/com.package.name-1/lib/arm/libcros.so
#02 pc 00007023 /data/app/com.package.name-1/lib/arm/libcros.so
#03 pc 00007037 /data/app/com.package.name-1/lib/arm/libcros.so
#04 pc 000070d1 /data/app/com.package.name-1/lib/arm/libcros.so
#05 pc 000049bf /data/app/com.package.name-1/lib/arm/libcros.so
#06 pc 000092e3 /data/app/com.package.name-1/oat/arm/base.odex

#00 pc 00013792 /system/lib/libc.so (__futex_wait_ex+49)
#01 pc 00013b21 /system/lib/libc.so (pthread_mutex_lock+310)
#02 pc 00028351 /system/lib/libc.so (dlfree+48)
#03 pc 0000ef33 /system/lib/libc.so (free+10)
#04 pc 0000a367 /system/lib/libjavacrypto.so
#05 pc 0000bc4d /system/lib/libjavacrypto.so
#06 pc 022fd081 /system/framework/arm/boot.oat
- Some ELF files on the call path are themselves corrupted or have been deleted. In addition, if the crash point itself lies in a corrupted ELF, the signal received is sometimes SIGBUS. For example:
#00 pc 5d9840f2
#01 pc 4008ab6c

#00 pc 00392fd0 /system/lib/egl/libGLES_mali.so
#01 pc 0002ab7b /system/lib/libgui.so (_ZN7android10GLConsumer22bindTextureImageLockedEv+182)
#02 pc 0002b3a9 /system/lib/libgui.so (_ZN7android10GLConsumer14updateTexImageEv+208)
#03 pc b3317c6c
- The instruction being executed lives in shared memory. The ELF content read at that point can be unreliable, so the unwinder generally chooses to stop there rather than risk producing misleading frames. For example:
#00 pc 0007a010 /dev/ashmem/dalvik-jit-code-cache (deleted)

#00 pc 00019e64 /system/lib/libssl.so (SSL_clear+19)
#01 pc 000103b5 /system/lib/libjavacrypto.so (_ZL25NativeCrypto_SSL_shutdownP7_JNIEnvP7_jclassxP8_jobjectS4_+156)
#02 pc 00027a7d /system/framework/arm/boot-conscrypt.oat (com.android.org.conscrypt.NativeCrypto.SSL_shutdown+156)
#03 pc 00032a03 /system/framework/arm/boot-conscrypt.oat (com.android.org.conscrypt.OpenSSLSocketImpl.shutdownAndFreeSslNative+138)
#04 pc 0003330b /system/framework/arm/boot-conscrypt.oat (com.android.org.conscrypt.OpenSSLSocketImpl.close+434)
#05 pc 003e0931 /system/lib/libart.so (art_quick_invoke_stub_internal+64)
#06 pc 003e4ea3 /system/lib/libart.so (art_quick_invoke_stub+226)
#07 pc 000ac2d9 /system/lib/libart.so (_ZN3art9ArtMethod6InvokeEPNS_6ThreadEPjjPNS_6JValueEPKc+140)
#08 pc 001f27fb /system/lib/libart.so (_ZN3art11interpreter34ArtInterpreterToCompiledCodeBridgeEPNS_6ThreadEPNS_9ArtMethodEPKNS_7DexFile8CodeItemEPNS_11ShadowFrameEPNS_6JValueE+238)
#09 pc 001edd71 /system/lib/libart.so (_ZN3art11interpreter6DoCallILb0ELb0EEEbPNS_9ArtMethodEPNS_6ThreadERNS_11ShadowFrameEPKNS_11InstructionEtPNS_6JValueE+576)
#10 pc 003cce3d /system/lib/libart.so (MterpInvokeVirtualQuick+504)
#11 pc 003d6994 /system/lib/libart.so (ExecuteMterpImpl+29972)
#12 pc 001d5351 /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadEPKNS_7DexFile8CodeItemERNS_11ShadowFrameENS_6JValueEb+340)
#13 pc 001da6a3 /system/lib/libart.so (_ZN3art11interpreter33ArtInterpreterToInterpreterBridgeEPNS_6ThreadEPKNS_7DexFile8CodeItemEPNS_11ShadowFrameEPNS_6JValueE+142)
#14 pc 001edd5b /system/lib/libart.so (_ZN3art11interpreter6DoCallILb0ELb0EEEbPNS_9ArtMethodEPNS_6ThreadERNS_11ShadowFrameEPKNS_11InstructionEtPNS_6JValueE+554)
#15 pc 003cb927 /system/lib/libart.so (MterpInvokeStatic+322)
#16 pc 003d2d94 /system/lib/libart.so (ExecuteMterpImpl+14612)
#17 pc 001d5351 /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadEPKNS_7DexFile8CodeItemERNS_11ShadowFrameENS_6JValueEb+340)
#18 pc 001da6a3 /system/lib/libart.so (_ZN3art11interpreter33ArtInterpreterToInterpreterBridgeEPNS_6ThreadEPKNS_7DexFile8CodeItemEPNS_11ShadowFrameEPNS_6JValueE+142)
#19 pc 001ee931 /system/lib/libart.so (_ZN3art11interpreter6DoCallILb1ELb0EEEbPNS_9ArtMethodEPNS_6ThreadERNS_11ShadowFrameEPKNS_11InstructionEtPNS_6JValueE+420)
#20 pc 003cc9eb /system/lib/libart.so (MterpInvokeDirectRange+294)
#21 pc 003d3014 /system/lib/libart.so (ExecuteMterpImpl+15252)
#22 pc 001d5351 /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadEPKNS_7DexFile8CodeItemERNS_11ShadowFrameENS_6JValueEb+340)
#23 pc 001da6a3 /system/lib/libart.so (_ZN3art11interpreter33ArtInterpreterToInterpreterBridgeEPNS_6ThreadEPKNS_7DexFile8CodeItemEPNS_11ShadowFrameEPNS_6JValueE+142)
#24 pc 001ee931 /system/lib/libart.so (_ZN3art11interpreter6DoCallILb1ELb0EEEbPNS_9ArtMethodEPNS_6ThreadERNS_11ShadowFrameEPKNS_11InstructionEtPNS_6JValueE+420)
#25 pc 003cc9eb /system/lib/libart.so (MterpInvokeDirectRange+294)
#26 pc 003d3014 /system/lib/libart.so (ExecuteMterpImpl+15252)
#27 pc 001d5351 /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadEPKNS_7DexFile8CodeItemERNS_11ShadowFrameENS_6JValueEb+340)
#28 pc 001da6a3 /system/lib/libart.so (_ZN3art11interpreter33ArtInterpreterToInterpreterBridgeEPNS_6ThreadEPKNS_7DexFile8CodeItemEPNS_11ShadowFrameEPNS_6JValueE+142)
#29 pc 001edd5b /system/lib/libart.so (_ZN3art11interpreter6DoCallILb0ELb0EEEbPNS_9ArtMethodEPNS_6ThreadERNS_11ShadowFrameEPKNS_11InstructionEtPNS_6JValueE+554)
#30 pc 003cce3d /system/lib/libart.so (MterpInvokeVirtualQuick+504)
#31 pc 003d6994 /system/lib/libart.so (ExecuteMterpImpl+29972)
#32 pc 001d5351 /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadEPKNS_7DexFile8CodeItemERNS_11ShadowFrameENS_6JValueEb+340)
#33 pc 001da5f1 /system/lib/libart.so (_ZN3art11interpreter30EnterInterpreterFromEntryPointEPNS_6ThreadEPKNS_7DexFile8CodeItemEPNS_11ShadowFrameE+92)
#34 pc 003c0fbd /system/lib/libart.so (artQuickToInterpreterBridge+944)
#35 pc 003e46f1 /system/lib/libart.so (art_quick_to_interpreter_bridge+32)
#36 pc 000a5511 /dev/ashmem/dalvik-jit-code-cache (deleted)
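As a rough illustration of the four categories above, here is a minimal Python sketch that parses frame lines in the format shown and guesses a category from the last frame only. The function names and the category labels are hypothetical, and the rules are a simplistic heuristic, not what any real unwinder or crash tool does.

```python
import re

# Matches frame lines such as
#   "#03 pc 0011f60f /data/app/.../libjsc.so (symbol+12)"
# The mapped file path is optional: it is absent when the pc is in no mapping.
FRAME_RE = re.compile(r"#(\d+)\s+pc\s+([0-9a-fA-F]+)(?:\s+(\S+))?")


def parse_frames(text):
    """Return a list of (index, pc, path_or_None) tuples."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((int(match.group(1)), int(match.group(2), 16), match.group(3)))
    return frames


def guess_truncation_cause(frames):
    """Hypothetical heuristic: classify a backtrace by its last frame only."""
    if not frames:
        return "no-frames"
    path = frames[-1][2]
    if path is None:
        # Overwritten stack, or a corrupted/deleted ELF: the pc maps to nothing.
        return "no-mapping"
    if "/dev/ashmem/" in path:
        return "shared-memory"
    if path.endswith((".oat", ".odex")) or "libwebviewchromium.so" in path:
        return "weak-unwind-table"
    return "unknown"


sample = """#00 pc 000092fe
#01 pc 00099ec3 /data/app/com.package.name-1/lib/arm/libjsc.so
#02 pc 00003ffe"""
print(guess_truncation_cause(parse_frames(sample)))  # no-mapping
```

A backtrace that ends inside an ordinary library, like the two-frame one reported by the business side, comes out as "unknown" here, which is exactly the case that needs manual analysis.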
Preliminary analysis of the crash location
Back to the problem at hand. Let's look at the crash location, 001a731c:
.text:001A7310 STMFD SP!, {R4,R5,LR}
.text:001A7314 LDR R5, [R1]
.text:001A7318 MOV R4, R0
.text:001A731C LDR R3, [R5,#-4] ; This is where the crash happened
.text:001A7320 SUB SP, SP, #0xC
.text:001A7324 CMP R3, #0
.text:001A7328 SUB R0, R5, #0xC
.text:001A732C BLT loc_1A7350
.text:001A7330 LDR R3, =(dword_2759D4 - 0x1A733C)
.text:001A7334 ADD R3, PC, R3 ; dword_2759D4
.text:001A7338 CMP R0, R3
.text:001A733C BNE loc_1A7364
.text:001A7340 loc_1A7340
.text:001A7340 STR R5, [R4]
.text:001A7344 MOV R0, R4
.text:001A7348 ADD SP, SP, #0xC
.text:001A734C LDMFD SP!, {R4,R5,PC}
.text:001A7350 ADD R1, SP, #0x18+var_14
.text:001A7354 MOV R2, #0
.text:001A7358 BL sub_1A6EA8
.text:001A735C MOV R5, R0
.text:001A7360 B loc_1A7340
.text:001A7364 MOV R1, #1
.text:001A7368 ADD R0, R0, #8
.text:001A736C BL sub_1C2CAC
.text:001A7370 B loc_1A7340
This is a fairly short, complete function. It first pushes R4, R5, and LR onto the stack, then loads R5 from [R1] and executes LDR R3, [R5,#-4]. In the register dump R5 is 0x5, so the load reads address 0x5 - 4 = 0x1, which is not mapped. The signal code SEGV_MAPERR and fault addr 0x1 are exactly what you would expect.
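The fault address can be double-checked with a couple of lines of Python. This is only a sketch of the arithmetic; the helper name ldr_address is made up for illustration.

```python
def ldr_address(base, offset):
    """Effective address of `LDR Rt, [Rn, #offset]` on 32-bit ARM (wraps at 2**32)."""
    return (base + offset) & 0xFFFFFFFF


r5 = 0x00000005                 # r5 as shown in the register dump
fault_addr = ldr_address(r5, -4)
print(hex(fault_addr))          # 0x1
```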
Since there are only two lines of backtrace, let’s move on to the next line, at 0020b7e5:
............
.rodata:0020B795 DCB "try_count_=%d",0
.rodata:0020B7E3 asc_20B7E3 DCB "://",0
.rodata:0020B7E7 aCdn DCB "CDN",0
............
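Surprisingly, 0020b7e5 is in .rodata, in the middle of string constants rather than code. An address in .rodata cannot be a return address, which explains why the unwind broke off there (and why the backtrace is incomplete).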
Something suspicious
Looking again at the instructions around the crash location, something suspicious does stand out:
.text:001A7310 STMFD SP!, {R4,R5,LR}
............
.text:001A731C LDR R3, [R5,#-4] ; This is where the crash happened
.text:001A7320 SUB SP, SP, #0xC
............
.text:001A7348 ADD SP, SP, #0xC
.text:001A734C LDMFD SP!, {R4,R5,PC}
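This short function uses only 24 bytes of stack (12 for the three pushed registers plus 12 reserved by SUB SP, SP, #0xC), yet SP is not moved into place in one go at function entry: the SUB comes only after the crashing load. That is very unusual.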
The unwind table
Now look at the unwind table:
$ arm-linux-androideabi-readelf -u ./libmcto_media_player.so
............
0x1a7268: 0x80b108ab
Compact model index: 0
0xb1 0x08 pop {r3}
0xab pop {r4, r5, r6, r7, r14}
0x1a7310: 0x8002a9b0
Compact model index: 0
0x02 vsp = vsp + 12
0xa9 pop {r4, r5, r14}
0xb0 finish
0x1a7424: 0x8001a8b0
Compact model index: 0
0x01 vsp = vsp + 8
0xa8 pop {r4, r14}
0xb0 finish
............
The crash location 001a731c falls under the unwind entry 0x8002a9b0, which covers the function starting at 0x1a7310. According to this entry, unwinding must add a total of 24 bytes to SP: 12 for vsp = vsp + 12, and another 12 for popping r4, r5, and r14. But as the assembly above shows, at the time of the crash (executing 001a731c) SP had been lowered by only 12 bytes (by STMFD SP!, {R4,R5,LR}). That is the problem.
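The 24-byte figure can be reproduced by decoding the entry by hand. The following is a minimal Python sketch, assuming the ARM EHABI compact model 0 encoding and handling only the opcodes that appear in the readelf output above; the function name decode_compact_model0 is made up for illustration and this is not how readelf itself is implemented.

```python
def decode_compact_model0(word):
    """Decode a 32-bit compact-model-0 unwind word.

    Returns (ops, sp_delta): the operations as text and the total number of
    bytes added to the virtual SP. Only a small subset of opcodes is handled.
    """
    if (word >> 24) != 0x80:
        raise ValueError("not a compact model 0 entry")
    code = [(word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF]
    ops, sp_delta, i = [], 0, 0
    while i < len(code):
        b = code[i]
        i += 1
        if b <= 0x3F:                      # 00xxxxxx: vsp = vsp + (x << 2) + 4
            n = (b << 2) + 4
            sp_delta += n
            ops.append(f"vsp = vsp + {n}")
        elif 0xA0 <= b <= 0xAF:            # 1010Lnnn: pop r4..r(4+nnn), plus r14 if L
            regs = [f"r{r}" for r in range(4, 4 + (b & 7) + 1)]
            if b & 0x08:
                regs.append("r14")
            sp_delta += 4 * len(regs)
            ops.append("pop {" + ", ".join(regs) + "}")
        elif b == 0xB0:                    # finish
            ops.append("finish")
            break
        elif b == 0xB1:                    # 10110001 0000iiii: pop r0..r3 under mask
            if i >= len(code):
                raise ValueError("truncated 0xb1 opcode")
            mask = code[i]
            i += 1
            regs = [f"r{r}" for r in range(4) if mask & (1 << r)]
            sp_delta += 4 * len(regs)
            ops.append("pop {" + ", ".join(regs) + "}")
        else:
            raise NotImplementedError(f"opcode 0x{b:02x} not handled in this sketch")
    return ops, sp_delta


ops, sp_delta = decode_compact_model0(0x8002A9B0)
print(ops)       # ['vsp = vsp + 12', 'pop {r4, r5, r14}', 'finish']
print(sp_delta)  # 24
```

The entry describes the frame as it looks once both the STMFD and the SUB SP, SP, #0xC have run, so it is only valid for a pc after 001a7320.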
Looking at the stack
Now check this against the data in the stack:
#00 993ff494 0000022c
993ff498 adcfd000 [anon:libc_malloc]
993ff49c bf88a71d /data/data/com.package.name/files/download/libmcto_media_player.so
993ff4a0 ffffffff
993ff4a4 ffffffff
993ff4a8 bf9d07e7 /data/data/com.package.name/files/download/libmcto_media_player.so
#01 993ff4ac 00000000
993ff4b0 00000000
993ff4b4 00000000
............
So the unwinder did exactly what the unwind table told it to do, and was misled: it moved SP 12 bytes further than it should have and took the word at 993ff4a8 (bf9d07e7) as the return address. The real LR is stored at 993ff49c and is bf88a71d, the same value as the lr register in the dump. Using the maps, we converted this absolute address into an offset within the ELF. Unfortunately, the business side's library has complex logic and deep call chains, and with the unwind cut short, the registers, stack, and memory we had were not enough to help them locate the problem.
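The two readings of the stack can be replayed in a few lines of Python. This is a sketch under stated assumptions: the stack words are copied from the crash report, the load base is inferred as the absolute pc minus the relative pc of frame #00 (the real analysis used the process maps), and the helper name unwind is hypothetical.

```python
# Stack words copied from the crash report (address -> 32-bit value).
STACK = {
    0x993FF494: 0x0000022C,
    0x993FF498: 0xADCFD000,
    0x993FF49C: 0xBF88A71D,
    0x993FF4A0: 0xFFFFFFFF,
    0x993FF4A4: 0xFFFFFFFF,
    0x993FF4A8: 0xBF9D07E7,
}
SP_AT_CRASH = 0x993FF494

# Assumption: load base = absolute pc - relative pc of frame #00.
LOAD_BASE = 0xBF96C31C - 0x001A731C


def unwind(sp, extra_sp_adjust):
    """Apply 'vsp = vsp + extra_sp_adjust; pop {r4, r5, r14}'; return (lr, new_sp)."""
    vsp = sp + extra_sp_adjust
    r4, r5, lr = (STACK[vsp + 4 * i] for i in range(3))
    return lr, vsp + 12


# What the unwind table prescribes (correct only after SUB SP, SP, #0xC has run).
lr_table, sp_table = unwind(SP_AT_CRASH, 12)
# What is actually true at 001a731c, where only the STMFD has run.
lr_real, sp_real = unwind(SP_AT_CRASH, 0)

print(hex(lr_table), hex(lr_table - LOAD_BASE))  # 0xbf9d07e7 0x20b7e7
print(hex(lr_real), hex(lr_real - LOAD_BASE))    # 0xbf88a71d 0xc571d
```

The table-driven result lands at offset 0x20b7e7, two bytes past the 0020b7e5 reported for frame #01; the difference is presumably the unwinder stepping back from the return address, which is an assumption here. The real return address has its low bit set, as you would expect for a return into Thumb code.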
What exactly is at 001a731c?
It is unusual for an unwind table entry to contradict the corresponding instruction sequence. What exactly is the function at 001a731c, and why does it have such an instruction sequence?
We obtained a build of the dynamic library with debug symbols from the business side:
$ arm-linux-androideabi-addr2line -f -e ./libmcto_media_player.so 001a731c
_ZNSsC1ERKSs
libgcc2.c:?
$ arm-linux-androideabi-c++filt -n _ZNSsC1ERKSs
std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
It is the copy constructor of std::basic_string.
This problem is most likely caused by a bug in the NDK. According to the business side, the NDK version they use is r9d.
Know the NDK you use
Developing and maintaining a cross-platform cross-compilation toolchain is no easy task. Compared with C, the compiler has to do a lot of extra work to make C++'s language features behave as expected at runtime, and the C++ standard library has long existed in several coexisting versions. On top of that, the NDK must support the low-level changes in each new Android release while staying backward compatible. The NDK is not as stable and reliable as we might expect; the official issue list of the NDK on GitHub is worth a look.
Since r11, the NDK changelog has explicitly listed important Known Issues.
In the r11 changelog, we can see:
Exception handling will often fail when using c++_shared on ARM32. The root cause is incompatibility between the LLVM unwinder used by libc++abi for ARM32 and libgcc. This is not a regression from r10e.
The Known Issues in the r12 changelog say:
Exception unwinding with c++_shared still does not work for ARM on Gingerbread or Ice Cream Sandwich.
We know that C++ exception handling also relies on unwinding at runtime. That is probably where the problem lies.
Conclusion
The business side recompiled the dynamic library with a newer version of the NDK. We checked the assembly of the std::basic_string constructor again and found that this time SP is moved into place in one step at the start of the function, so the unwind should no longer go wrong. Once the business side ships the recompiled library and gets complete backtraces, it can locate and fix the segmentation fault.
So the cause of this incomplete backtrace was a bug in an older NDK that, in some cases, makes the generated dynamic library impossible to unwind correctly.
Judging from the Known Issues quoted above, it is not only backtrace collection after a crash that can be affected. Wherever the business logic itself uses C++ exceptions, it may be affected too: after an exception is thrown, the handlers may not run frame by frame the way the code expects. If there are hidden problems of this kind, hopefully the NDK upgrade fixes them as well.
About the crash capture tool
Finally, it’s time for our commercial break.
All of the online crash information above was captured with xCrash, an Android app crash capture tool that we develop.