Caikelun. IO/POST /2019-0…

Does flawless code logic always produce a flawless program? The answer is no. At the software level, perhaps only the binary never lies to you.

The symptom

Recently, a business team reported a strange crash and said the captured information was not enough to fix it.

Signal: 11 (SIGSEGV), Code: 1 (SEGV_MAPERR), fault addr 0x1

r0  993ff520  r1  dc3170c4  r2  00000000  r3  dabe3e08
r4  993ff520  r5  00000005  r6  00000290  r7  000007ac
r8  e83253a0  r9  00006aba  r10 bf921e39  r11 e83253a0
ip  bfa3a9e0  sp  993ff494  lr  bf88a71d  pc  bf96c31c

#00 pc 001a731c /data/data/com.package.name/files/download/libmcto_media_player.so
#01 pc 0020b7e5 /data/data/com.package.name/files/download/libmcto_media_player.so

#00  993ff494  0000022c
     993ff498  adcfd000  [anon:libc_malloc]
     993ff49c  bf88a71d  /data/data/com.package.name/files/download/libmcto_media_player.so
     993ff4a0  ffffffff
     993ff4a4  ffffffff
     993ff4a8  bf9d07e7  /data/data/com.package.name/files/download/libmcto_media_player.so
#01  993ff4ac  00000000
     993ff4b0  00000000
     993ff4b4  00000000
     993ff4b8  00000000
     993ff4bc  00000000
     993ff4c0  adcfd234  [anon:libc_malloc]
     993ff4c4  00000000
     993ff4c8  0000006e
     993ff4cc  00000000
     993ff4d0  adcfdf6c  [anon:libc_malloc]
     993ff4d4  00000000
     993ff4d8  00000000
     993ff4dc  00000000
     993ff4e0  00000000
     993ff4e4  00000000
     993ff4e8  00000000

At first glance, this must be a bug in the dynamic library’s business logic that leads to a segmentation fault. The backtrace is indeed incomplete, though, and it is not obvious why. It does call for a closer analysis.

Analysis

Common reasons for an incomplete backtrace

Backtraces turn out incomplete from time to time. The common reasons are:

  • Stack memory was extensively overwritten by the time of the crash. It gets worse when the logic near the crash point handles highly variable external input: you then tend to see a large number of scattered, incomplete backtraces. For example:
#00 pc 00000ffb
#01 pc 0009a885 /data/app/com.package.name-1/lib/arm/libjsc.so
#02 pc 0003ff93 /data/app/com.package.name-1/lib/arm/libjsc.so
#03 pc 0011f60f /data/app/com.package.name-1/lib/arm/libjsc.so
#04 pc fffffffb

#00 pc 000092fe
#01 pc 00099ec3 /data/app/com.package.name-1/lib/arm/libjsc.so
#02 pc 00003ffe

#00 pc 00000ffb
  • The unwind tables of some ELF files on the call path are incomplete. ODEX/OAT files and the WebView Chromium library on some systems fall into this category. For example:
#00 pc 00d12bcc /system/lib/libwebviewchromium.so

#00 pc 01a0cf72  /system/app/WebViewGoogle/WebViewGoogle.apk!libwebviewchromium.so (offset 0x46da000)

#00 pc 00006fde /data/app/com.package.name-1/lib/arm/libcros.so
#01 pc 00007007 /data/app/com.package.name-1/lib/arm/libcros.so
#02 pc 00007023 /data/app/com.package.name-1/lib/arm/libcros.so
#03 pc 00007037 /data/app/com.package.name-1/lib/arm/libcros.so
#04 pc 000070d1 /data/app/com.package.name-1/lib/arm/libcros.so
#05 pc 000049bf /data/app/com.package.name-1/lib/arm/libcros.so
#06 pc 000092e3 /data/app/com.package.name-1/oat/arm/base.odex

#00 pc 00013792 /system/lib/libc.so (__futex_wait_ex+49)
#01 pc 00013b21 /system/lib/libc.so (pthread_mutex_lock+310)
#02 pc 00028351 /system/lib/libc.so (dlfree+48)
#03 pc 0000ef33 /system/lib/libc.so (free+10)
#04 pc 0000a367 /system/lib/libjavacrypto.so
#05 pc 0000bc4d /system/lib/libjavacrypto.so
#06 pc 022fd081 /system/framework/arm/boot.oat
  • Some ELF files on the call path are themselves corrupted or have been deleted. If the crash point itself lies inside a corrupted ELF, the signal received is sometimes SIGBUS. For example:
#00 pc 5d9840f2
#01 pc 4008ab6c

#00 pc 00392fd0 /system/lib/egl/libGLES_mali.so
#01 pc 0002ab7b /system/lib/libgui.so (_ZN7android10GLConsumer22bindTextureImageLockedEv+182)
#02 pc 0002b3a9 /system/lib/libgui.so (_ZN7android10GLConsumer14updateTexImageEv+208)
#03 pc b3317c6c
  • The instruction being executed lives in shared memory. ELF content read from such a region can be unreliable, so the unwinder usually stops on purpose to avoid producing misleading frames. For example:
#00 pc 0007a010 /dev/ashmem/dalvik-jit-code-cache (deleted)

#00 pc 00019e64 /system/lib/libssl.so (SSL_clear+19)
#01 pc 000103b5 /system/lib/libjavacrypto.so (_ZL25NativeCrypto_SSL_shutdownP7_JNIEnvP7_jclassxP8_jobjectS4_+156)
#02 pc 00027a7d /system/framework/arm/boot-conscrypt.oat (com.android.org.conscrypt.NativeCrypto.SSL_shutdown+156)
#03 pc 00032a03 /system/framework/arm/boot-conscrypt.oat (com.android.org.conscrypt.OpenSSLSocketImpl.shutdownAndFreeSslNative+138)
#04 pc 0003330b /system/framework/arm/boot-conscrypt.oat (com.android.org.conscrypt.OpenSSLSocketImpl.close+434)
#05 pc 003e0931 /system/lib/libart.so (art_quick_invoke_stub_internal+64)
#06 pc 003e4ea3 /system/lib/libart.so (art_quick_invoke_stub+226)
#07 pc 000ac2d9 /system/lib/libart.so (_ZN3art9ArtMethod6InvokeEPNS_6ThreadEPjjPNS_6JValueEPKc+140)
#08 pc 001f27fb  /system/lib/libart.so (_ZN3art11interpreter34ArtInterpreterToCompiledCodeBridgeEPNS_6ThreadEPNS_9ArtMethodEPKNS_7DexFile8CodeItemEPNS_11ShadowFrameEPNS_6JValueE+238)
#09 pc 001edd71  /system/lib/libart.so (_ZN3art11interpreter6DoCallILb0ELb0EEEbPNS_9ArtMethodEPNS_6ThreadERNS_11ShadowFrameEPKNS_11InstructionEtPNS_6JValueE+576)
#10 pc 003cce3d /system/lib/libart.so (MterpInvokeVirtualQuick+504)
#11 pc 003d6994 /system/lib/libart.so (ExecuteMterpImpl+29972)
#12 pc 001d5351 /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadEPKNS_7DexFile8CodeItemERNS_11ShadowFrameENS_6JValueEb+340)
#13 pc 001da6a3  /system/lib/libart.so (_ZN3art11interpreter33ArtInterpreterToInterpreterBridgeEPNS_6ThreadEPKNS_7DexFile8CodeItemEPNS_11ShadowFrameEPNS_6JValueE+142)
#14 pc 001edd5b  /system/lib/libart.so (_ZN3art11interpreter6DoCallILb0ELb0EEEbPNS_9ArtMethodEPNS_6ThreadERNS_11ShadowFrameEPKNS_11InstructionEtPNS_6JValueE+554)
#15 pc 003cb927 /system/lib/libart.so (MterpInvokeStatic+322)
#16 pc 003d2d94 /system/lib/libart.so (ExecuteMterpImpl+14612)
#17 pc 001d5351 /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadEPKNS_7DexFile8CodeItemERNS_11ShadowFrameENS_6JValueEb+340)
#18 pc 001da6a3  /system/lib/libart.so (_ZN3art11interpreter33ArtInterpreterToInterpreterBridgeEPNS_6ThreadEPKNS_7DexFile8CodeItemEPNS_11ShadowFrameEPNS_6JValueE+142)
#19 pc 001ee931  /system/lib/libart.so (_ZN3art11interpreter6DoCallILb1ELb0EEEbPNS_9ArtMethodEPNS_6ThreadERNS_11ShadowFrameEPKNS_11InstructionEtPNS_6JValueE+420)
#20 pc 003cc9eb /system/lib/libart.so (MterpInvokeDirectRange+294)
#21 pc 003d3014 /system/lib/libart.so (ExecuteMterpImpl+15252)
#22 pc 001d5351 /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadEPKNS_7DexFile8CodeItemERNS_11ShadowFrameENS_6JValueEb+340)
#23 pc 001da6a3  /system/lib/libart.so (_ZN3art11interpreter33ArtInterpreterToInterpreterBridgeEPNS_6ThreadEPKNS_7DexFile8CodeItemEPNS_11ShadowFrameEPNS_6JValueE+142)
#24 pc 001ee931  /system/lib/libart.so (_ZN3art11interpreter6DoCallILb1ELb0EEEbPNS_9ArtMethodEPNS_6ThreadERNS_11ShadowFrameEPKNS_11InstructionEtPNS_6JValueE+420)
#25 pc 003cc9eb /system/lib/libart.so (MterpInvokeDirectRange+294)
#26 pc 003d3014 /system/lib/libart.so (ExecuteMterpImpl+15252)
#27 pc 001d5351 /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadEPKNS_7DexFile8CodeItemERNS_11ShadowFrameENS_6JValueEb+340)
#28 pc 001da6a3  /system/lib/libart.so (_ZN3art11interpreter33ArtInterpreterToInterpreterBridgeEPNS_6ThreadEPKNS_7DexFile8CodeItemEPNS_11ShadowFrameEPNS_6JValueE+142)
#29 pc 001edd5b  /system/lib/libart.so (_ZN3art11interpreter6DoCallILb0ELb0EEEbPNS_9ArtMethodEPNS_6ThreadERNS_11ShadowFrameEPKNS_11InstructionEtPNS_6JValueE+554)
#30 pc 003cce3d /system/lib/libart.so (MterpInvokeVirtualQuick+504)
#31 pc 003d6994 /system/lib/libart.so (ExecuteMterpImpl+29972)
#32 pc 001d5351 /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadEPKNS_7DexFile8CodeItemERNS_11ShadowFrameENS_6JValueEb+340)
#33 pc 001da5f1 /system/lib/libart.so (_ZN3art11interpreter30EnterInterpreterFromEntryPointEPNS_6ThreadEPKNS_7DexFile8CodeItemEPNS_11ShadowFrameE+92)
#34 pc 003c0fbd /system/lib/libart.so (artQuickToInterpreterBridge+944)
#35 pc 003e46f1 /system/lib/libart.so (art_quick_to_interpreter_bridge+32)
#36 pc 000a5511 /dev/ashmem/dalvik-jit-code-cache (deleted)

Preliminary analysis of the crash location

Back to the problem at hand. Let’s look at the crash location, 001a731c:

.text:001A7310  STMFD   SP!, {R4,R5,LR}
.text:001A7314  LDR     R5, [R1]
.text:001A7318  MOV     R4, R0
.text:001A731C  LDR     R3, [R5,#-4]    ; this is where the crash happened
.text:001A7320  SUB     SP, SP, #0xC
.text:001A7324  CMP     R3, #0
.text:001A7328  SUB     R0, R5, #0xC
.text:001A732C  BLT     loc_1A7350
.text:001A7330  LDR     R3, =(dword_2759D4 - 0x1A733C)
.text:001A7334  ADD     R3, PC, R3 ; dword_2759D4
.text:001A7338  CMP     R0, R3
.text:001A733C  BNE     loc_1A7364
.text:001A7340 loc_1A7340
.text:001A7340  STR     R5, [R4]
.text:001A7344  MOV     R0, R4
.text:001A7348  ADD     SP, SP, #0xC
.text:001A734C  LDMFD   SP!, {R4,R5,PC}
.text:001A7350  ADD     R1, SP, #0x18+var_14
.text:001A7354  MOV     R2, #0
.text:001A7358  BL      sub_1A6EA8
.text:001A735C  MOV     R5, R0
.text:001A7360  B       loc_1A7340
.text:001A7364  MOV     R1, #1
.text:001A7368  ADD     R0, R0, #8
.text:001A736C  BL      sub_1C2CAC
.text:001A7370  B       loc_1A7340

This is a short, complete function. It first pushes R4, R5, and LR, then loads R5 from [R1]. When execution reaches LDR R3, [R5,#-4], R5 holds 0x5, so the load targets address 0x5 - 4 = 0x1, which is not mapped. The signal code SEGV_MAPERR and fault addr 0x1 are exactly what we would expect.

Since the backtrace has only two frames, let’s move on to the next one, at 0020b7e5:
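The fault address can be cross-checked against the register dump with a few lines of arithmetic. This is only a sketch: the register value and the immediate offset are copied from the dump and the disassembly above, and the helper name `ldr_effective_address` is made up for illustration.

```python
# Values copied from the crash dump and the disassembly above.
R5 = 0x00000005   # register r5 at the time of the crash
OFFSET = -4       # immediate offset in: LDR R3, [R5,#-4]


def ldr_effective_address(base: int, offset: int) -> int:
    """Address read by LDR Rt, [Rn, #offset] on 32-bit ARM (wraps at 2**32)."""
    return (base + offset) & 0xFFFFFFFF


fault_addr = ldr_effective_address(R5, OFFSET)
print(hex(fault_addr))  # compare with "fault addr" in the signal line
```

The result agrees with the fault addr reported in the signal line, which confirms that the crashing instruction really is the load at 001a731c and that the bad value came in through [R1].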

............
.rodata:0020B795              DCB "try_count_=%d",0
.rodata:0020B7E3 asc_20B7E3   DCB "://",0
.rodata:0020B7E7 aCdn         DCB "CDN",0
............

Surprisingly, 0020b7e5 lies in .rodata, not in code. That at least explains why the unwind stopped there (and why the backtrace is incomplete).

A suspicious point

Looking back at the instructions near the crash location again, something suspicious did appear:

.text:001A7310  STMFD   SP!, {R4,R5,LR}
............
.text:001A731C  LDR     R3, [R5,#-4]    ; this is where the crash happened
.text:001A7320  SUB     SP, SP, #0xC
............
.text:001A7348  ADD     SP, SP, #0xC
.text:001A734C  LDMFD   SP!, {R4,R5,PC}

This short function uses only 24 bytes of stack, yet SP is not moved into place in one go at the start of the function: the push moves it by 12 bytes, and a separate SUB, located after the crashing instruction, moves it by another 12. That is quite unusual.

The unwind table

Look at the unwind table:

$ arm-linux-androideabi-readelf -u ./libmcto_media_player.so

............

0x1a7268: 0x80b108ab
  Compact model index: 0
  0xb1 0x08 pop {r3}
  0xab      pop {r4, r5, r6, r7, r14}
  
0x1a7310: 0x8002a9b0
  Compact model index: 0
  0x02      vsp = vsp + 12
  0xa9      pop {r4, r5, r14}
  0xb0      finish
  
0x1a7424: 0x8001a8b0
  Compact model index: 0
  0x01      vsp = vsp + 8
  0xa8      pop {r4, r14}
  0xb0      finish
............

The crash location 001a731c falls under the unwind entry 0x8002a9b0, whose range starts at offset 1a7310. According to this entry, unwinding must add a total of 24 bytes to SP: 12 for vsp = vsp + 12, plus 12 for popping r4, r5, and r14. But as the assembly above shows, by the time of the crash (execution had reached 001a731c) SP had been decreased by only 12 bytes, by STMFD SP!, {R4,R5,LR}. That is the problem.

Look at the stack

Now look at the data on the stack:
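To make the 24-byte figure concrete, here is a minimal sketch of a decoder for the compact model 0 entries shown by readelf. It handles only the handful of ARM EHABI unwind opcodes that appear in this excerpt (small vsp increments, pop r4..rN with or without r14, the r0-r3 mask pop, and finish); the function name `decode_compact_model0` is made up for illustration and this is not how readelf itself is implemented.

```python
def decode_compact_model0(word: int):
    """Decode one compact model 0 entry.

    Returns (list of step descriptions, total bytes vsp advances).
    Only a small subset of the EHABI opcodes is supported.
    """
    if (word >> 24) != 0x80:
        raise ValueError("not a compact model 0 entry")
    ops = [(word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF]
    steps, advance, i = [], 0, 0
    while i < len(ops):
        op = ops[i]
        i += 1
        if op & 0xC0 == 0x00:        # 00xxxxxx: vsp = vsp + (xxxxxx << 2) + 4
            n = ((op & 0x3F) << 2) + 4
            steps.append(f"vsp = vsp + {n}")
            advance += n
            continue
        if op == 0xB0:               # 10110000: finish
            steps.append("finish")
            break
        if op == 0xB1:               # 10110001 0000iiii: pop r0-r3 under mask
            if i >= len(ops) or ops[i] == 0 or ops[i] & 0xF0:
                raise ValueError("bad mask after 0xb1")
            regs = [f"r{r}" for r in range(4) if ops[i] & (1 << r)]
            i += 1
        elif op & 0xF8 == 0xA0:      # 10100nnn: pop r4..r[4+nnn]
            regs = [f"r{r}" for r in range(4, 5 + (op & 0x07))]
        elif op & 0xF8 == 0xA8:      # 10101nnn: pop r4..r[4+nnn], r14
            regs = [f"r{r}" for r in range(4, 5 + (op & 0x07))] + ["r14"]
        else:
            raise NotImplementedError(f"opcode 0x{op:02x} not handled in this sketch")
        steps.append("pop {" + ", ".join(regs) + "}")
        advance += 4 * len(regs)     # each popped register is one 32-bit word
    return steps, advance


for entry in (0x80B108AB, 0x8002A9B0, 0x8001A8B0):
    steps, advance = decode_compact_model0(entry)
    print(f"0x{entry:08x}: {steps} -> vsp advances {advance} bytes")
```

For 0x8002a9b0 the decoded steps match the readelf listing, and the total advance is the 24 bytes the unwinder applies, regardless of how far the function prologue had actually progressed when the crash happened.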

#00  993ff494  0000022c
     993ff498  adcfd000  [anon:libc_malloc]
     993ff49c  bf88a71d  /data/data/com.package.name/files/download/libmcto_media_player.so
     993ff4a0  ffffffff
     993ff4a4  ffffffff
     993ff4a8  bf9d07e7  /data/data/com.package.name/files/download/libmcto_media_player.so
#01  993ff4ac  00000000
     993ff4b0  00000000
     993ff4b4  00000000
............

So the unwinder did exactly what the unwind table told it to do, and was misled by it: it moved SP 12 bytes further than it should have, and therefore took the return address from 993ff4a8 (bf9d07e7) instead of from 993ff49c. The real LR is stored at 993ff49c, and its value is bf88a71d. Using maps, we converted this absolute address into an offset relative to the ELF file. Unfortunately, the business team’s dynamic library has complex logic and deep call chains, and the unwind terminated prematurely. The registers, stack, and memory contents we already have are not enough to help the business team locate the problem.
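The two readings of the stack can be replayed side by side. The sketch below copies the stack words and registers from the dump above. The maps file is not reproduced in this article, so the load base is instead inferred from frame #00 (absolute pc minus the reported offset), which assumes the library is mapped contiguously with a single load base. All names in the code are illustrative.

```python
# Stack words copied from the dump above (address -> value).
STACK = {
    0x993FF494: 0x0000022C,
    0x993FF498: 0xADCFD000,
    0x993FF49C: 0xBF88A71D,
    0x993FF4A0: 0xFFFFFFFF,
    0x993FF4A4: 0xFFFFFFFF,
    0x993FF4A8: 0xBF9D07E7,
}
SP = 0x993FF494       # sp register at the crash
PC_ABS = 0xBF96C31C   # pc register at the crash
PC_REL = 0x001A731C   # frame #00 offset inside the .so

# Assumption: one load base is valid for every address inside the library.
LOAD_BASE = PC_ABS - PC_REL


def pop_r4_r5_lr(vsp: int):
    """Pop r4, r5, r14 from three consecutive words; return (lr, new vsp)."""
    return STACK[vsp + 8], vsp + 12


def to_elf_offset(addr: int) -> int:
    """Convert an absolute address inside the library to an ELF-relative offset."""
    return addr - LOAD_BASE


# What the unwind table prescribes: vsp = vsp + 12, then pop {r4, r5, r14}.
table_lr, table_sp = pop_r4_r5_lr(SP + 12)
# What had really happened at 001a731c: only the 12-byte push had run.
real_lr, real_sp = pop_r4_r5_lr(SP)

print(f"load base      : 0x{LOAD_BASE:08x}")
print(f"table-driven lr: 0x{table_lr:08x} -> offset 0x{to_elf_offset(table_lr):06x}, next sp 0x{table_sp:08x}")
print(f"actual lr      : 0x{real_lr:08x} -> offset 0x{to_elf_offset(real_lr):06x}, next sp 0x{real_sp:08x}")
```

The table-driven path ends with the next frame's sp at 993ff4ac, which is where the #01 marker sits in the stack dump, and its lr lands within two bytes of the 0020b7e5 reported for frame #01. That small difference is consistent with an unwinder stepping a return address back into the calling instruction, though that reading is an assumption. The actual lr equals the value still held in the lr register, and its low bit is set, which on ARM indicates a return into Thumb code.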

What exactly is 001A731C?

It is unusual for an unwind table entry to contradict the instruction sequence it describes. What function is at 001a731c, and why does it contain such an instruction sequence?

We obtained a build of the dynamic library with debug symbols from the business team:

$ arm-linux-androideabi-addr2line -f -e ./libmcto_media_player.so 001a731c
_ZNSsC1ERKSs
libgcc2.c:?

$ arm-linux-androideabi-c++filt -n _ZNSsC1ERKSs
std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)

It is the copy constructor of std::basic_string.

This problem is most likely caused by a bug in the NDK. According to the business team, they are using NDK r9d.

Know the NDK you use

Developing and maintaining a cross-platform cross-compilation toolchain is no easy task. Compared with C, the compiler has to do a lot of extra work to make the various language features of C++ behave as expected at runtime, and the C++ standard library has long existed in several coexisting versions. On top of that, the NDK has to support the low-level changes in each new Android release while staying backward compatible. As a result, the NDK is not as stable and reliable as we might expect. The official Issues list of the NDK on GitHub gives a sense of this.

Since r11, the NDK changelog has explicitly listed important Known Issues.

The r11 changelog says:

Exception handling will often fail when using c++_shared on ARM32. The root cause is incompatibility between the LLVM unwinder used by libc++abi for ARM32 and libgcc. This is not a regression from r10e.

The Known Issues in the r12 changelog say:

Exception unwinding with c++_shared still does not work for ARM on Gingerbread or Ice Cream Sandwich.

C++ exception handling also relies on unwinding at runtime, so this is most likely where the problem lies.

Conclusion

The business team recompiled the dynamic library with a newer NDK. We checked the assembly of the std::basic_string constructor again and found that this time SP is moved into place in one step at the beginning of the function, so the unwind should no longer go wrong. Once the recompiled library is released and a complete backtrace is captured, the business team can locate and fix the segmentation fault.

So the incomplete backtrace was caused by a bug in the older NDK: in some cases, the dynamic library it generates cannot be unwound correctly.

According to the Known Issues quoted above, the bug may affect more than backtrace capture after a crash. Wherever the business logic itself uses C++ exceptions, it may be affected as well: after an exception is thrown, the catch logic may not run frame by frame the way the code expects. If hidden problems of this kind exist, we hope the NDK upgrade fixes them too.

About the crash capture tool

Finally, it’s time for our commercial break.

All of the production crash information above was captured with xCrash, an Android app crash capture tool that we develop.