1. Background

A crash is a situation in which an application on a mobile device (such as an iOS or Android device) suddenly exits or is interrupted while it is being opened or used. If the production version of an App crashes frequently, the user experience suffers badly, which can even lead to user loss and lower revenue. Crashes are therefore a top priority for the client stability team.


However, across all crash scenarios, only about 25% of crashes can be caught through signal handling. The other 75% are hard to identify and can have a huge impact on the App's user experience.

![](https://static001.geekbang.org/infoq/2f/2fcab36dab2ca7e98eb40fd04251b3a2.png)

Facebook engineers broke down App exits into six categories:

1. The App actively calls exit() or abort();

2. The user process is killed during an App upgrade;

3. The user process is killed during a system upgrade;

4. The App is killed in the background;

5. The App is killed in the foreground and the stack can be retrieved;

6. The App is killed in the foreground and the stack cannot be retrieved.

Exits in categories 1 to 4 are normal App exits; they have no significant impact on user experience and need no special handling. For category 5, the cause of the crash can be located at the code level from the stack, and the industry already has fairly mature solutions for it. Category 6 exits have many possible causes, including but not limited to: the App keeps requesting memory when the system is short of memory, the main thread is stuck for more than 20 seconds, CPU usage is too high, or the stack overflows. On the iOS client we call these the "Abort problem".
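The six categories lend themselves to classification by elimination at the next launch. The following C sketch is one hypothetical way to express that; the `last_session_t` fields are assumptions about what the previous session managed to persist, not something the article specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Facts assumed to be persisted by the previous session (hypothetical fields). */
typedef struct {
    bool called_exit_or_abort; /* set just before the App calls exit()/abort() */
    bool app_upgraded;         /* App version differs from the previous launch */
    bool os_upgraded;          /* OS version differs from the previous launch */
    bool was_in_background;    /* last recorded App state was "background" */
    bool has_stack_report;     /* a crash report with a stack was saved */
} last_session_t;

/* Values follow the numbering of the six categories in the list above. */
typedef enum {
    EXIT_ACTIVE = 1,
    EXIT_APP_UPGRADE = 2,
    EXIT_OS_UPGRADE = 3,
    EXIT_BACKGROUND_KILL = 4,
    EXIT_FOREGROUND_WITH_STACK = 5,
    EXIT_FOREGROUND_NO_STACK = 6 /* the "Abort problem" */
} exit_kind_t;

/* Rule out each explainable exit in turn; whatever remains is category 6. */
exit_kind_t classify_exit(const last_session_t *s) {
    assert(s != NULL);
    if (s->called_exit_or_abort) return EXIT_ACTIVE;
    if (s->app_upgraded)         return EXIT_APP_UPGRADE;
    if (s->os_upgraded)          return EXIT_OS_UPGRADE;
    if (s->was_in_background)    return EXIT_BACKGROUND_KILL;
    if (s->has_stack_report)     return EXIT_FOREGROUND_WITH_STACK;
    return EXIT_FOREGROUND_NO_STACK;
}
```

Under these assumptions, a session that left no flags behind falls through every check and is counted as a category 6 exit, which is why category 6 can only be identified indirectly.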

Abort problems cannot be captured with a stack, and they occur much more often than crashes that can be caught ("stack crashes"). Historically, Mobile Taobao (a representative e-commerce super App) has had about three times as many Abort problems as stack crashes, and Youku Pad (a representative video super App) typically has about five times as many. Abort clearly has a significant impact on the user experience.

This article analyzes the root causes of the Abort problem on the iOS client and proposes a systematic solution.

2. Abort causes

There are four main causes of Abort.

2.1 Memory Jetsam

The physical memory of a mobile device is limited. If the App keeps requesting memory anyway, the system kills the process with signal 9, causing an abnormal exit.

    {
      "memoryPages" : {
        "active" : 24493,
        "throttled" : 0,
        "fileBacked" : 24113,
        "wired" : 13007,
        "anonymous" : 12915,
        "purgeable" : 127,
        "inactive" : 10955,
        "free" : 2290,
        "speculative" : 1580
      },
      "uncompressed" : 125795,
      "decompressions" : 143684
    },
    "largestProcess" : "Taobao4iPhone",
    "processes" : [
      {
      ...
      {
        "rpages" : 2050,
        "states" : [ "frontmost", "resume" ],
        "name" : "Taobao4iPhone",
        "pid" : 1518,
        "reason" : "vm-thrashing",
        "fds" : 50,
        "UUID" : "5103A88A-917F-319E-8553-c0189DD1ABAC",
        "purgeable" : 127,
        "cpuTime" : 4.619693,
        "lifetimeMax" : 3557
      },
      ...

2.2 Main thread deadlock

Two threads, A and B, each wait for the other to complete an operation, so neither can continue. When one of them is the main thread, the resulting deadlock leads to an abnormal exit.

    Exception Type: 00000020
    Exception Codes: 0x000000008badf00d
    Highlighted Thread: 0

    Application Specific Information:
    com.myapp.myapp failed to scene-create in time

    Elapsed total CPU time (seconds): ...
    Elapsed CPU time (seconds): ...

    Dispatch queue: com.apple.main-thread
    Thread 0:
    0 libsystem_kernel.dylib 0x36360540 semaphore_wait_trap + 8
    1 libdispatch.dylib 0x36297eee _dispatch_semaphore_wait_slow + 186
    2 libxpc.dylib 0x364077b8 xpc_connection_send_message_with_reply_sync + 152
    3 Security 0x2b8dd310 securityd_message_with_reply_sync + 64
    4 Security 0x2b8dd48c securityd_send_sync_and_do + 44
    5 Security 0x2b8ea452 __SecItemCopyMatching_block_invoke + 166
    6 Security 0x2b8e96f6 SecOSStatusWith + 14
    7 Security 0x2b8ea36e SecItemCopyMatching + 174

2.3 Startup/Restart timeout

The App's startup or restart took longer than the upper limit allowed by the system.

    scene-create watchdog transgression: App exhausted real (wall clock) time allowance of ...
    Elapsed total CPU time (seconds): 21.050 (user 21.050, system 0.000)

2.4 Excessive CPU usage

A main thread deadlock or a startup/restart timeout can indirectly drive CPU usage too high, which also ends in an abnormal exit.

3. Locating the root cause of Abort problems

Abort problems often leave no obvious clues, which makes them hard to resolve. Mobile Taobao has seen the number of Abort problems spike many times without knowing where to start the investigation; one or two of these spikes even happened shortly before Double 11. They often ended with "a group of people helplessly and painfully running mass tests to reproduce the problem, and even when something was reproduced, nobody could tell whether it was really the same problem".

We therefore urgently need a complete solution, built on existing experience, that locates and resolves such problems quickly and accurately. This requires considering the following aspects:

1. The scenario in which the Abort occurs: for example, which page and which operation.

2. The cause of the Abort: for example, memory Jetsam, main thread deadlock, startup/restart timeout, or excessive CPU usage.

3. For memory Jetsam, further determine whether a memory leak occurred and find the retain cycle behind the leaked references.

4. For a main thread deadlock, locate the stack on which the thread is stuck.

5. For a startup/restart timeout or excessive CPU usage, locate the stack.
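Locating a stuck stack first requires noticing that the main thread is stuck, before the system kills the process. A common technique for this is a heartbeat check; the C sketch below assumes that the main thread calls `heartbeat_beat` from its run loop and that a monitor thread polls `heartbeat_stalled`. All names and the threshold are hypothetical, and timestamps are passed in so that the logic stays independent of any particular clock API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    _Atomic uint64_t last_beat_ms; /* written by the main thread, read by the monitor */
    uint64_t threshold_ms;         /* stall threshold, fixed after init */
} heartbeat_t;

void heartbeat_init(heartbeat_t *hb, uint64_t now_ms, uint64_t threshold_ms) {
    assert(hb != NULL);
    atomic_init(&hb->last_beat_ms, now_ms);
    hb->threshold_ms = threshold_ms;
}

/* Called periodically on the main thread (for example from a run loop timer). */
void heartbeat_beat(heartbeat_t *hb, uint64_t now_ms) {
    atomic_store(&hb->last_beat_ms, now_ms);
}

/* Called on the monitor thread: true if the main thread has been silent
 * for longer than the threshold. */
bool heartbeat_stalled(heartbeat_t *hb, uint64_t now_ms) {
    uint64_t last = atomic_load(&hb->last_beat_ms);
    return now_ms > last && now_ms - last > hb->threshold_ms;
}
```

When `heartbeat_stalled` returns true, the monitor thread would sample the main thread's stack and persist it. The threshold has to be well below the system's own limit so that the record is written before the process is killed.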

Next, we take a main thread deadlock in Mobile Taobao as an example of root cause analysis. First, let's look at the overall Abort data for one version:

![](https://static001.geekbang.org/infoq/ec/ecba89e29d3dc2f315b12472c15190e3.png)

Memory and CPU usage were both normal before the Abort occurred, which rules out Jetsam and excessive CPU usage and points to a main thread deadlock.

![](https://static001.geekbang.org/infoq/cf/cff913059e028d3ff991b361c3478ef9.png)

Checking the related log files confirms that the timing and the clues match, so the abnormal exit can finally be attributed to a main thread deadlock.

![](https://static001.geekbang.org/infoq/a4/a47ffdd2464a987416f7e30a1ddab561.png)
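The elimination used in this example can be written down as a small decision function. The C sketch below is only an illustration: the `last_samples_t` fields and both thresholds are hypothetical, and a real system would tune them against its own data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Last values recorded before the process disappeared (hypothetical fields). */
typedef struct {
    double memory_used_ratio;  /* memory footprint divided by the device limit */
    double cpu_usage;          /* CPU usage, 0.0 to 1.0 */
    bool   main_thread_stalled;/* a main-thread stall was recorded */
    bool   launch_finished;    /* startup/restart had completed */
} last_samples_t;

typedef enum {
    CAUSE_JETSAM,
    CAUSE_LAUNCH_TIMEOUT,
    CAUSE_MAIN_THREAD_DEADLOCK,
    CAUSE_CPU,
    CAUSE_UNKNOWN
} abort_cause_t;

/* Illustrative thresholds, not measured values. */
#define MEMORY_HIGH_RATIO 0.90
#define CPU_HIGH_USAGE    0.80

abort_cause_t infer_abort_cause(const last_samples_t *s) {
    assert(s != NULL);
    if (s->memory_used_ratio >= MEMORY_HIGH_RATIO) return CAUSE_JETSAM;
    /* Deadlocks and launch timeouts can themselves push CPU usage up,
     * so they are checked before the CPU threshold. */
    if (!s->launch_finished)     return CAUSE_LAUNCH_TIMEOUT;
    if (s->main_thread_stalled)  return CAUSE_MAIN_THREAD_DEADLOCK;
    if (s->cpu_usage >= CPU_HIGH_USAGE) return CAUSE_CPU;
    return CAUSE_UNKNOWN;
}
```

With normal memory and CPU readings and a recorded stall, this returns `CAUSE_MAIN_THREAD_DEADLOCK`, matching the conclusion reached from the charts and logs above.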

4. A systematic solution to the Abort problem

4.1 The difficulty of a systematic Abort solution: capturing the crash scene

To implement a systematic solution to Abort, consider the following:

1. Abort problems in which the process is killed with signal 9 usually cannot be captured, stack included, through signal handling. How, then, can the key information at the crash scene be captured as completely as possible, and what should that information contain?

2. When the App crashes, the system is in an extremely unstable state. How can the data be reliably written to disk during the crash?

3. Collecting information and capturing data means writing a large amount of data. How can log writes be kept fast?

4. Storing or uploading a large amount of data puts heavy pressure on the system. How can a high data compression rate be ensured?

![](https://static001.geekbang.org/infoq/11/11c05fff45278e8442ce49cafbd5b877.png)

Based on these considerations, we designed a trace file protocol built on mmap that offers high performance, a high compression rate, strong consistency, and a self-describing format. It serves as the data carrier of the iOS high availability system.

4.1.1 The mmap data storage layer ensures fast and consistent data writes

1. mmap maps a file or other object into the address space of the process. Writes to that memory are written back to the corresponding disk file by the kernel, so data writes perform about as well as ordinary memory operations.

2. After the user process crashes, the mapped region is still managed by the kernel, which ensures data consistency.
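The two points above can be sketched with the POSIX `mmap` API. This is a minimal illustration rather than the article's actual storage layer: the `trace_map_t` type and its functions are hypothetical names, and error handling is reduced to return codes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct {
    char  *base;     /* start of the mapped region */
    size_t capacity; /* size of the mapping in bytes */
    size_t used;     /* bytes appended so far */
} trace_map_t;

/* Open (or create) the file, size it to `capacity`, and map it shared.
 * Returns 0 on success, -1 on failure. */
int trace_map_open(trace_map_t *t, const char *path, size_t capacity) {
    assert(t != NULL && path != NULL && capacity > 0);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)capacity) != 0) { close(fd); return -1; }
    void *p = mmap(NULL, capacity, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED) return -1;
    t->base = (char *)p;
    t->capacity = capacity;
    t->used = 0;
    return 0;
}

/* Append bytes with a plain memcpy; no write() call is involved.
 * The kernel writes the dirty pages back to the file. */
int trace_map_append(trace_map_t *t, const void *data, size_t len) {
    if (len > t->capacity - t->used) return -1;
    memcpy(t->base + t->used, data, len);
    t->used += len;
    return 0;
}

void trace_map_close(trace_map_t *t) {
    munmap(t->base, t->capacity);
    t->base = NULL;
}
```

Note that `used` lives in ordinary process memory in this sketch and would be lost in a crash; a real trace format would keep the write offset inside the mapped file so that a reader can tell where valid data ends.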

4.1.2 The binary encoding protocol maximizes the data compression rate

1. A purpose-built encoding protocol

2. In our measurements the encoding achieves a compression rate of more than 80%. Put intuitively, 50 KB of memory can hold a user's detailed usage record for 20 minutes, including page visits, system events, and per-second memory and CPU data.
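The article does not spell out the encoding, but the budget it implies is tight: 50 KB spread over 20 minutes (1,200 seconds) leaves roughly 42 bytes per second for everything. As an illustration of how a binary encoding can fit per-second samples into such a budget, the C sketch below combines delta encoding with variable-length integers (unsigned LEB128 plus zigzag for signed deltas). This is a stand-in technique chosen for the example, not the protocol used in the article, and `sample_t` is a hypothetical record layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned LEB128: 7 payload bits per byte, high bit set on every byte
 * except the last. Writes at most 5 bytes; returns the number written. */
size_t varint_encode(uint32_t value, uint8_t *out) {
    size_t n = 0;
    while (value >= 0x80) {
        out[n++] = (uint8_t)(value | 0x80);
        value >>= 7;
    }
    out[n++] = (uint8_t)value;
    return n;
}

/* Returns the number of bytes consumed, or 0 if the input is malformed. */
size_t varint_decode(const uint8_t *in, uint32_t *value) {
    uint32_t result = 0;
    for (size_t n = 0; n < 5; n++) {
        result |= (uint32_t)(in[n] & 0x7F) << (7 * n);
        if ((in[n] & 0x80) == 0) { *value = result; return n + 1; }
    }
    return 0;
}

/* Zigzag maps small negative and positive deltas to small unsigned values. */
uint32_t zigzag(int32_t v) {
    return ((uint32_t)v << 1) ^ (v < 0 ? 0xFFFFFFFFu : 0u);
}

/* Hypothetical per-second sample. */
typedef struct {
    uint32_t time_s;
    uint32_t memory_mb;
    uint32_t cpu_percent;
} sample_t;

/* Encode `cur` relative to `prev`: time delta, zigzag memory delta, raw CPU.
 * `out` needs room for up to 15 bytes. Returns the number of bytes written. */
size_t sample_encode(const sample_t *prev, const sample_t *cur, uint8_t *out) {
    assert(prev != NULL && cur != NULL && out != NULL);
    size_t n = 0;
    int32_t memory_delta = (int32_t)(cur->memory_mb - prev->memory_mb);
    n += varint_encode(cur->time_s - prev->time_s, out + n);
    n += varint_encode(zigzag(memory_delta), out + n);
    n += varint_encode(cur->cpu_percent, out + n);
    return n;
}
```

For two consecutive samples such as `{1000, 200, 10}` and `{1001, 198, 35}`, the second one encodes to 3 bytes instead of the 12 bytes of the raw struct, because consecutive samples differ only slightly.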

4.1.3 Record as many multi-dimensional system metrics and abnormal events as possible

These include:

1. Performance data, including CPU and memory data, used to determine whether the application is in an overloaded state

2. Requests for large amounts of memory

3. Retain cycles, used to locate Jetsam events

4. Traps, used to locate watchdog kills

5. The number of live VC (view controller) instances

![](https://static001.geekbang.org/infoq/d1/d1bbb87ada43058c298dfcea5fc1311b.png)
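One way to keep such heterogeneous records within a fixed memory budget is to tag each record with its type and store the records in a ring buffer, so that the newest events overwrite the oldest. The C sketch below assumes this design; the article does not say how its records are laid out, and the type names, the `event_t` fields, and the tiny capacity are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One tag per kind of record listed above (hypothetical values). */
typedef enum {
    EVT_PERF_SAMPLE = 1, /* CPU and memory data */
    EVT_LARGE_ALLOC,     /* request for a large amount of memory */
    EVT_RETAIN_CYCLE,
    EVT_TRAP,
    EVT_VC_COUNT         /* number of live view controllers */
} event_type_t;

typedef struct {
    uint32_t time_s;
    uint8_t  type;  /* an event_type_t value */
    uint32_t value; /* meaning depends on the type */
} event_t;

#define RING_CAPACITY 4 /* kept small for illustration */

typedef struct {
    event_t slots[RING_CAPACITY];
    size_t  next;  /* index of the slot the next push will use */
    size_t  count; /* number of valid events, at most RING_CAPACITY */
} event_ring_t;

/* Store an event, overwriting the oldest one once the ring is full. */
void ring_push(event_ring_t *r, event_t e) {
    assert(r != NULL);
    r->slots[r->next] = e;
    r->next = (r->next + 1) % RING_CAPACITY;
    if (r->count < RING_CAPACITY) r->count++;
}

/* Index 0 is the oldest retained event. Returns NULL if out of range. */
const event_t *ring_get(const event_ring_t *r, size_t i) {
    if (i >= r->count) return NULL;
    size_t start = (r->next + RING_CAPACITY - r->count) % RING_CAPACITY;
    return &r->slots[(start + i) % RING_CAPACITY];
}
```

After an abnormal exit, reading the ring from index 0 upward replays the last recorded events in order, which is the window of history needed to reconstruct the scene.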

5. Summary

In the world of Apps, it is increasingly hard to stand out on features alone. In this situation, a good user experience is often the key to an App's success. Abort is the biggest challenge to user experience for every App and deserves serious attention from App developers.