Caikelun. IO/POST /2019-0…
Whether it’s Linux’s Coredump or Android’s Tombstone, we’ll eventually need to perform what’s known as a “mortem.” Sometimes we come across the perfect crime scene.
The phenomenon of
The phenomenon of 1
Signal: 7 (SIGBUS), Code: 2 (BUS_ADRERR), fault addr 0xcd596372
r0 0000006e r1 00000000 r2 00000000 r3 00000000
r4 f35bb740 r5 f35bb748 r6 cd3ff8a8 r7 00000064
r8 f7451e34 r9 f35bb708 r10 cd301000 r11 cd5a3ac5
ip 00000000 sp cd3ff890 lr 00000000 pc cd596372
#00 pc 0009a372 /data/data/com.package.name/files/download/libCube.so
#01 pc 000a4f69 /data/data/com.package.name/files/download/libCube.so
#02 pc 000a513f /data/data/com.package.name/files/download/libCube.so
#03 pc 000a7ab9 /data/data/com.package.name/files/download/libCube.so
#04 pc 000a7abf /data/data/com.package.name/files/download/libCube.so
#05 pc 000a7add /data/data/com.package.name/files/download/libCube.so
#06 pc 0004078b /system/lib/libc.so (_ZL15__pthread_startPv+30)
#07 pc 0001a031 /system/lib/libc.so (__start_thread+6)
#00 cd3ff890 cd3ff8dc
cd3ff894 5d419f43
cd3ff898 cd3ff8a0
cd3ff89c f35bb744 [anon:libc_malloc]
cd3ff8a0 5d419f43
cd3ff8a4 000a5952
cd3ff8a8 5d419f43
cd3ff8ac 2e62c950 /dev/ashmem/dalvik-main space (deleted)
cd3ff8b0 00000064
cd3ff8b4 f35bb700 [anon:libc_malloc]
cd3ff8b8 f35bb750 [anon:libc_malloc]
cd3ff8bc 00000064
cd3ff8c0 cd3ff8dc
cd3ff8c4 cd5a0f6d /data/data/com.package.name/files/download/libCube.so
#01 cd3ff8c8 cd5a3ac5 /data/data/com.package.name/files/download/libCube.so
cd3ff8cc 00000000
cd3ff8d0 00000001
cd3ff8d4 262237a1 /dev/ashmem/dalvik-main space (deleted)
cd3ff8d8 00000000 ...... . memory near pc:cd596350 43591889 60714b11 dd054299 33019b01 .. YC.Kq`.B..... 3cd596360 4b0f9306 607118c9 1c299803 f0251c32 ... K.. q`..) . 2. %.cd596370 286ef8d5 7b23d003 d0f52b00 2601e002 .. n(... # {+... &
cd596380 e0004276 78232600 d1002b00 1c287323 vB... &#x.+.. #s(.
cd596390 fd84f024 b0091c30 46c0bdf0 3b9ac9ff $...0......F...;
cd5963a0 c4653600 1c0cb510 f8c0f025 d1022800 .6e..... %... (..cd5963b0 20016020 f024e004 6800fcd1 20006020 `. .. $... h `.cd5963c0 b570bd10 26006803 3b0cb09a 1c0c681b .. p.. h.&... ; .h..cd5963d0 d10342b3 600a2202 e0161c18 6800600e .B...". `... `.h cd5963e0 f0244669 2800fef3 f024d005 6800fcb7 iF$.... (.. $... h cd5963f0 1c306020 9b04e009 021222f0 2380401a `0......". @.#
cd596400 429a021b 6020d101 b01a2001 b538bd70 ... B.. `... p.8.cd596410 1c0d6800 3b0c1c03 2c00681c 2302d102 .h..... ; .h.,...#
cd596420 e00d600b feeaf024 d0051e04 602b2300 .`.. $...# + `
cd596430 fef4f024 e0042001 fc90f024 60286800 $.... ..$....h(`
cd596440 bd381c20 6803b570 1c0d1c06 681c3b0c .8.p.. h..... ; .h ...... .cd4fc000-cd5e9000 r-x 0 ed000 /data/data/com.package.name/files/download/libCube.so
cd5e9000-cd5ee000 r-- ec000 5000 /data/data/com.package.name/files/download/libCube.so
cd5ee000-cd5ef000 rw- f1000 1000 /data/data/com.package.name/files/download/libCube.so
......
Copy the code
.text:0009A368 LDR R0, [SP,#0x38+var_2C]
.text:0009A36A MOVS R1, R5
.text:0009A36C MOVS R2, R6
.text:0009A36E BL j_pthread_cond_timedwait
.text:0009A372 CMP R0, #0x6E ; SIGBUS occurs at this location
.text:0009A374 BEQ loc_9A37E
Copy the code
The crash is at libCube at offset 0009a372. According to the arm32 fastCall convention, here, after pthread_cond_timedwait() returns, check whether the return value in r0 is 6E (decimal: Pthread_cond_timedwait () returns 110 (ETIMEDOUT) indicating timeout.
This is a normal call, no problem. So why SIGBUS? Is it an occasional NAND Flash hardware failure? Or a driver bug?
As you can see from the crash information captured by xCrash, the binary instruction code of the current PC location (0009a372) is 286E, which corresponds to CMP R0, #0x6E, where the instruction data is obtained by ptrace(). XCrash gets it but the CPU doesn’t get it when it executes this instruction, right?
The phenomenon of 2
Signal: 11 (SIGSEGV), Code: 1 (SEGV_MAPERR), fault addr 0x11d868
r0 bf206648 r1 00000000 r2 00000000 r3 00000000
r4 bf206640 r5 bf206648 r6 ffffffff r7 000001b9
r8 d44bf980 r9 bf206608 r10 b1654000 r11 b21bd89d
ip b240ed4c sp b1752890 lr b21a66b9 pc 0011d868
#00 pc 0011d868
#01 pc 005586b5 /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN3psl5Event4WaitEi+160)
#02 pc 0056fe55 /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN8BaseHcdn6Thread9ThreadRunEv+122)
#03 pc 0057002b /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN8BaseHcdn10CThreadObj14ThreadWorkFuncEv+6)
#04 pc 0056f891 /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN3psl13CThreadObject14ThreadBaseFuncEv+26)
#05 pc 0056f897 /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN3psl13CThreadObject3RunEv+2)
#06 pc 0056f8b5 /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN3psl10ThreadBaseINS_13CThreadObjectEE10ThreadProcEPv+24)
#07 pc 00040aab /system/lib/libc.so (_ZL15__pthread_startPv+30)
#08 pc 0001a011 /system/lib/libc.so (__start_thread+6)
#00 b1752890 b17528dc. .#01 b1752890 b17528dc
b1752894 5d419f83
b1752898 b17528a0
b175289c bf206644 [anon:libc_malloc]
b17528a0 5d419f83
b17528a4 000cf94e
b17528a8 5d419f84
b17528ac 115c2ef0
b17528b0 000001b9
b17528b4 bf206600 [anon:libc_malloc]
b17528b8 bf206650 [anon:libc_malloc]
b17528bc 000001b9
b17528c0 b17528dc
b17528c4 b21bde59 /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN8BaseHcdn6Thread9ThreadRunEv+126)
#02 b17528c8 b21bd89d /data/data/com.package.name/files/download/libHCDNClientNet.so (_ZN3psl10ThreadBaseINS_13CThreadObjectEE10ThreadProcEPv)b17528cc 00000000 b17528d0 00000001 b17528d4 0730c6a9 b17528d8 00000000 b17528dc bf20663c [anon:libc_malloc] b17528e0 00000000 b17528e4 bf206600 [anon:libc_malloc] b17528e8 b1752970 b17528ec b1752930 ...... . f73d0000-f7444000 r-x 0 74000 /system/lib/libc.so
f7444000-f7448000 r-- 73000 4000 /system/lib/libc.so
f7448000-f744b000 rw- 77000 3000 /system/lib/libc.so ...... b1c4e000-b23ef000 r-x 0 7a1000 /data/data/com.package.name/files/download/libHCDNClientNet.so b23ef000-b23f0000 --- 0 1000 b23f0000-b240f000 r-- 7a1000 1f000 /data/data/com.package.name/files/download/libHCDNClientNet.so b240f000-b241a000 rw- 7c0000 b000 /data/data/com.package.name/files/download/libHCDNClientNet.so ......Copy the code
This is a segment error, PC value is 0011d868, this address is obviously abnormal, we call PC “run away”. In the second line of backtrace, offset 005586b5 for libhcdnClientnet. so corresponds to the following instruction:
.text:005586B2 MOVS R0, R5 .text:005586B4 BL j_pthread_mutex_unlock ; Here. Text :005586B8 MOVS R0, R6Copy the code
This is a call to the pthread_mutex_unlock() function in libc.so, and there should be no problem here. Is it a Linker bug? The absolute address of pthread_mutex_unlock() should be written by Linker to libhCDnClientnet. so’s GOT.
There are more of these bizarre crashes, which are not listed here.
What went wrong?
Analysis of the
The problem characteristics
After summary, it is found that the above problems have the following characteristics:
- The root cause of the problem cannot be seen from Tombstone.
- The breakdown point itself is beyond the control of the business logic.
- Most occur at the next instruction that causes a call to a thread block.
- All take place in a dynamically downloaded so library. (Note that the so library is present
files
Directory) - Device model and OS version distribution have no obvious characteristics.
- More than 99% occur on non-root devices. (There are fewer root phone users now)
Crash the capture mechanism
For android, debuggerd, xCrash, or Breakpad, catching native crashes will go through the following stages:
- The signal handler of the crashed process is aroused to execute.
clone()
The child process.- In the child process
ptrace()
Attach individual threads to the crashed process. - Read the register, memory, ELF and other information of each thread in the crashed process.
- Write information directly to dump file. Or further analyze and extract backtrace information and write it into the dump file.
The crashed process is still executing until the child attaches to all the threads of the crashed process. If you consider the multiple processes of an Android App, other processes of the app may also be executing during the entire crash capture.
The mechanism for crash capture is also implemented by code, which needs to be executed step by step. We can imagine that when a crime (crash) occurs, it takes a certain amount of time for the police (crash capture mechanism) to arrive at the scene of the crime, which is the “blind spot”.
For the crash problem we described earlier, after the process crashed, but before xCrash started executing and recording the crash information, something else must have happened that we didn’t know about. Some details of the crime scene were tampered with!
Suspicious of
Recall the above “phenomenon 1”, why xCrash can get instruction code 286E, but CPU does not get it, so SIGBUS occurs?
Perhaps the reason is simple: when the CPU pipeline loads the instruction from memory, the offset position of the so library file is indeed unreadable, and thus SIGBUS occurs. After that, xCrash began to capture crash information. When xCrash used ptrace() to read data from the same memory location, the offset position in the so library became readable again.
Most likely, the so library was overwritten while linker had already loaded and started executing. Maybe there is a bug in some aspects of dynamic download, decompression and verification of SO library, which leads to the abnormal internal state.
Look for circumstantial evidence
After careful analysis, we did find some circumstantial evidence:
The last modification time of the so file
We have posted a lot of cases where the “last modification time of the crashed so library file” is later than the “crash time”. Something like the following:
2019-08-01 01:54:51.519 +0800 Process Crash time: 2019-08-01 01:54:53.743 +0800 so last modify time: 2019-08-01 01:54:53.938 +0800Copy the code
So file size and MD5 hash
Since the dynamically downloaded SO library can be reliably obtained from the server, we compared the file size and MD5 hash of the crashed SO library recorded by xCrash. It turns out that one part of the case is actually wrong, and that part of the case should be: “Until xCrash started recording crash information, the so library was not completely covered by errors.”
open files
By analyzing the open Files list, we can see that in some cases, the crashed process still has a valid FD that can be used to access the zip package that the dynamically downloaded so library belongs to:
Open files:
......
fd 57: /data/data/com.package.name/files/download/fullso.zip
......
Copy the code
Some FDS also point to CRC files for zip packages:
Open files:
......
fd 58: /data/data/com.package.name/files/download/fullso.zip.crc
......
Copy the code
conclusion
The conclusion is obvious: there are some bugs in the dynamic download, decompression and verification of the so library, causing the so library file to be overwritten once it has been loaded and executed by Linker.
Follow-up: Our so library dynamic delivery process has been improved with a more defensive design. We can see from the statistics of online crash data of APM that the crashes similar to “Phenomenon 1” and “Phenomenon 2” mentioned above almost disappear, and the overall crash rate of APP native is further reduced.