ByteDance Technical Team
Unlike the ANR (Application Not Responding) problem on Android, iOS has no mature solution for monitoring App freezes (long hangs) and the crashes they cause. The main reasons are as follows:
- Generally, when an App freezes for more than about 20 seconds, the operating system's protection mechanism kicks in and the process is killed. The user can then find the freeze crash log generated by the operating system on the device, but because of the closed iOS ecosystem, an App has no permission to read these logs itself.
- Users rarely have the patience to wait that long when they hit a freeze. They may give up after about 5 seconds and manually close the App or send it to the background, and in both of these scenarios the system generates no freeze crash log at all.
For these two reasons, iOS production monitoring of this class of problems has mainly been based on lag monitoring: when a page response during use exceeds a certain threshold (typically a few hundred ms), it is judged to be a lag, and the call stack at that moment is captured and reported to the backend for analysis. The main defects of this scheme are as follows:
- It cannot distinguish the more severe freezes from minor lags, so a flood of reported issues is hard to prioritize, even though freezes actually hurt the user experience far more than lags do.
- Users on low-end devices are more likely to hit frequent lags in a short period, while stack capture, log writing, and reporting are themselves costly. This is why production lag monitoring is generally enabled for only a small fraction of traffic rather than for all users.
- Imagine a lag lasting 100ms where the first 99ms are spent executing method A, but in the last 1ms the monitor captures the call stack of method B. Method B is not the main cause of the lag, so the report is misleading.
Based on the pain points above, the ByteDance APM team developed a solution specifically for locating freeze crashes in production environments. This article introduces the thinking behind the scheme and its concrete implementation, and summarizes typical problems and best practices accumulated after launch; we hope it inspires you.
Freeze crash background
What is the watchdog
If our App gets stuck for about 20 seconds at startup and then crashes, the system crash log exported from the device will likely look like this:
```
Exception Type:  EXC_CRASH (SIGKILL)
Exception Codes: 0x0000000000000000, 0x0000000000000000
Exception Note:  EXC_CORPSE_NOTIFY
Termination Reason: Namespace ASSERTIOND, Code 0x8badf00d
Triggered by Thread: 0
```
The four most important fields here are:
- Exception Type
EXC_CRASH: the Mach-layer exception type, indicating that the process exited abnormally.
SIGKILL: the BSD-layer signal, indicating that the process was terminated by the system; this signal cannot be blocked, handled, or ignored. Check the Termination Reason field to learn why the process was terminated.
- Exception Codes
This field is usually not used. When a crash report contains an unnamed exception type, the exception type is represented by this field as hexadecimal values.
- Exception Note
EXC_CORPSE_NOTIFY is defined in the same file as EXC_CRASH and indicates that the process abnormally entered the CORPSE state.
- Termination Reason
The key part is Code 0x8badf00d, which is covered in Apple's official documentation: 0x8badf00d reads as "ate bad food" and indicates that the process was terminated by the operating system because the watchdog timed out. Based on the above information, we can define a watchdog crash:
On the iOS platform, if an App takes too long to start, exit, or respond to system events, the system protection mechanism triggers and forcibly terminates the process; such an exception is defined as a watchdog crash.
The watchdog crash is referred to as a freeze crash in the rest of this article.
Why we need to monitor freeze crashes
As we all know, in client development a crash is the most serious kind of bug, because it blocks normal use and directly affects retention, revenue, and other core business metrics. Most attention has historically gone to crashes such as unrecognized selector and EXC_BAD_ACCESS that can be caught inside the App process (called ordinary crashes below). Exceptions such as SIGKILL, where the process is forcibly terminated from outside, cannot be covered by those monitoring principles, so such crashes were ignored in production environments for a long time. There are also the following reasons:
- Freeze crashes most commonly occur during App startup: the user may stare at a frozen screen for 20 seconds and then watch the App exit. This experience hurts users even more than an ordinary crash.
- When freeze monitoring first launched, the daily number of freeze crashes in the Toutiao App was about three times that of ordinary crashes. Without any governance, the magnitude of such problems is clearly very large.
- OOM crashes are also triggered by the SIGKILL signal, and the mainstream monitoring principle for OOM crashes is still the method of elimination. A traditional elimination-based solution misses another type of crash of very large magnitude, namely the freeze crash discussed here, so monitoring freeze crashes also greatly improves the accuracy of OOM crash monitoring. See: iOS performance optimization practice: How to reduce the OOM crash rate by 50%+.
Based on all of the above, monitoring and fixing freeze crashes is clearly necessary. After nearly two years of monitoring and governance, the daily magnitude of Toutiao App freeze crashes is now roughly the same as that of ordinary crashes.
Principles of freeze crash monitoring
Lag monitoring principle
From the user experience perspective, a freeze is a lag that lasts a long time and never recovers, so let's first review how lag monitoring works. On iOS, most computation and drawing on the main thread is executed in units of runloop iterations; if a single iteration lasts longer than 16ms, the UI visibly stalls. So how do we measure the duration of a single runloop iteration?
As the figure above shows, if we register an observer for RunLoop lifecycle events, then all three phases (afterWaiting => beforeTimers, beforeTimers => beforeSources, and beforeSources => beforeWaiting) may contain time-consuming operations. Lag monitoring can therefore be broken into the following steps:
- Register an observer for runloop lifecycle events.
- Measure the time between runloop lifecycle callbacks. If any phase other than sleep exceeds the preset lag threshold, trigger a lag judgment and record the call stack at that moment.
- Report to the backend platform for analysis at an appropriate time.
The overall process is shown in the figure below:
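To make these steps concrete, below is a minimal sketch of a runloop observer paired with a monitor thread. It is an illustration of the approach, not the actual SDK implementation: the 400ms threshold and all names are assumptions.

```objc
#import <Foundation/Foundation.h>

static dispatch_semaphore_t gSemaphore;       // signalled on every runloop phase change
static volatile CFRunLoopActivity gActivity;  // last observed main-runloop activity

static void RunLoopObserverCallback(CFRunLoopObserverRef observer,
                                    CFRunLoopActivity activity, void *info) {
    gActivity = activity;
    dispatch_semaphore_signal(gSemaphore); // wake the monitor thread
}

void StartLagMonitor(void) {
    gSemaphore = dispatch_semaphore_create(0);
    CFRunLoopObserverRef observer = CFRunLoopObserverCreate(
        kCFAllocatorDefault, kCFRunLoopAllActivities, true /*repeats*/, 0,
        RunLoopObserverCallback, NULL);
    CFRunLoopAddObserver(CFRunLoopGetMain(), observer, kCFRunLoopCommonModes);

    dispatch_async(dispatch_get_global_queue(QOS_CLASS_UTILITY, 0), ^{
        while (true) {
            // If no phase change arrives within the lag threshold (400ms here)
            // and the runloop is not sleeping, the main thread is lagging.
            long timedOut = dispatch_semaphore_wait(
                gSemaphore, dispatch_time(DISPATCH_TIME_NOW, 400 * NSEC_PER_MSEC));
            if (timedOut && (gActivity == kCFRunLoopBeforeSources ||
                             gActivity == kCFRunLoopAfterWaiting)) {
                // Lag detected: capture and persist the call stack here.
            }
        }
    });
}
```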
How to judge whether a lag is a freeze
From the conclusions above it is not hard to see that, whether a long lag finally triggers a system watchdog crash, or the user gives up and kills the process or sends it to the background, the common feature is that a long lag occurred and never recovered, blocking the user's normal use.
Based on this, we can use the following procedure to determine whether a lag is a freeze:
- When a long lag is detected, record the call stacks of all threads at that moment and persist them to the database as a suspected freeze crash.
- If the current runloop subsequently enters the next activity, the lag did not end in a freeze, so delete the log from the database. Within the same usage session, the next long lag writes a new suspect log, and so on.
- At the next launch, check whether a freeze log from the previous launch still exists (at most one freeze can occur per usage session). If it does, the user hit a long lag at the end of the last session and the runloop never entered the next activity, so the log is reported as a freeze crash.
With this process we can detect not only system watchdog crashes, but also the cases where the user could not stand a long lag and killed the App, or where the App was killed by the system after being sent to the background. Although those scenarios never actually trigger a system watchdog crash, their severity is equivalent. In other words, the freeze crash monitoring described here is a superset of system watchdog crashes.
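A simplified, self-contained sketch of this judgment flow is below. The real scheme persists full all-thread stacks to a database; this illustration uses a marker file instead, and all function names are hypothetical.

```objc
#import <Foundation/Foundation.h>

static NSString *SuspectLogPath(void) {
    NSString *dir = NSSearchPathForDirectoriesInDomains(
        NSCachesDirectory, NSUserDomainMask, YES).firstObject;
    return [dir stringByAppendingPathComponent:@"freeze_suspect.log"];
}

// Step 1: a long lag was detected; persist the evidence as a suspect.
void OnLongLagDetected(NSString *allThreadStacks) {
    [allThreadStacks writeToFile:SuspectLogPath() atomically:YES
                        encoding:NSUTF8StringEncoding error:nil];
}

// Step 2: the runloop entered the next activity, so the lag recovered; clear the suspect.
void OnRunLoopRecovered(void) {
    [[NSFileManager defaultManager] removeItemAtPath:SuspectLogPath() error:nil];
}

// Step 3: at the next launch, a leftover suspect means the last session ended frozen.
void CheckFreezeCrashOnLaunch(void) {
    NSString *path = SuspectLogPath();
    NSString *log = [NSString stringWithContentsOfFile:path
                                              encoding:NSUTF8StringEncoding error:nil];
    if (log) {
        // Report `log` as a freeze crash, then clean up.
        [[NSFileManager defaultManager] removeItemAtPath:path error:nil];
    }
}
```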
How to determine the freeze time threshold
A truncated system watchdog crash log looks as follows:
```
Exception Type:  EXC_CRASH (SIGKILL)
Exception Codes: 0x0000000000000000, 0x0000000000000000
Exception Note:  EXC_CORPSE_NOTIFY
Termination Reason: Namespace ASSERTIOND, Code 0x8badf00d
Triggered by Thread: 0
Termination Description: SPRINGBOARD, scene-create watchdog transgression: Application<com.ss.iphone.article.News>:2135 exhausted real (wall clock) time allowance of 19.83 seconds
```
As you can see, the iOS protection mechanism is triggered only when the App's freeze time exceeds a certain threshold, which makes the freeze time threshold a very critical parameter.
Unfortunately, no official documentation or API lets you obtain this threshold directly. The line "exhausted real (wall clock) time allowance of 19.83 seconds" shows that 19.83 is not a fixed number: it varies across system versions and scenarios, and 10s cases have also appeared in crash logs from some systems.
Based on this, in order to cover problems most users can perceive while masking differences between system versions, we assume the system triggers the watchdog crash at a 10s threshold. In practice quite a few users facing a long App freeze will kill and restart the App rather than keep waiting, so the threshold should not be too long. To reserve enough time for stack capture, log writing, and other work after the freeze judgment fires, the final freeze threshold in this scheme is set to 8s. The probability of an 8s freeze is far lower than that of a few-hundred-ms lag, so this monitoring scheme has no significant performance cost and can be enabled for all users in production.
How to measure how long the user was frozen
After a freeze occurs, we also care how long it ultimately lasted: the longer the freeze, the greater the damage to the user experience, and the more urgent the fix.
After the freeze threshold is triggered, we use a timer with a short interval (1s by default, adjustable remotely) to check every second whether the current runloop has entered the next activity. If not, 1s is added to the current freeze duration. In this way, even if the App eventually exits, the actual freeze duration can be approximated with an error below 1s, and the final freeze duration is reported in the log.
However, after this scheme went live, it produced some cases with extremely long freeze durations, mostly occurring when the App was switched to the background. In the background the App process is suspended, possibly for a long time, while we compute the freeze duration from wall-clock time differences. So if the App is suspended in the background for 10s and then resumes, we conclude that it froze for 10s, easily exceeding our threshold, even though the App did not really freeze: this is just operating system scheduling behavior. Such false positives are not what we want. The false-alarm scenario is shown in the following figure:
How to solve false positives in freeze duration and in the main thread call stack
To solve this problem, we adopt multi-interval waiting to reduce the mismatch between program running time and wall-clock time caused by thread scheduling and process suspension, as shown in the figure below. Before the 8s threshold is reached, we wait in 1s slices; each time a wait times out, 1s is added to the current freeze duration. If the App is suspended anywhere in this process, then no matter how long the suspension lasts, at most 1s of error is introduced when the App resumes, which greatly improves stability and accuracy compared with the previous scheme.
In addition, when the freeze duration exceeds the preset threshold, the stacks of all threads are captured. But the call stacks from that single point in time cannot by themselves guarantee an accurate diagnosis: at that moment the main thread may be running a cheap task while the truly expensive one has already finished, or an even more expensive task, the real key to the freeze, may come later. Therefore, to increase confidence in the freeze call stack, after the threshold is exceeded we also grab the current main thread stack during each 1s wait. To avoid unbounded growth during a very long freeze, at most the last 10 main thread call stacks before the App's abnormal exit are retained. After these interval waits we obtain a series of main thread call stacks evolving over time before the abnormal exit. With this series we can locate the real cause of the main thread freeze, and we can go further by combining it with the all-thread stacks captured when the freeze duration first exceeded the threshold. A combined sketch of both mechanisms follows.
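This sketch assumes the runloop observer signals `runloopSignal` at every phase change; `CaptureMainThreadStack` is a hypothetical stand-in for a real out-of-thread stack unwinder.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in: real code suspends and unwinds the main thread;
// -callStackSymbols only captures the calling thread and is used here as a placeholder.
static NSString *CaptureMainThreadStack(void) {
    return [[NSThread callStackSymbols] componentsJoinedByString:@"\n"];
}

// Runs on the monitor thread once the 8s freeze threshold has been exceeded.
static void TrackFreeze(dispatch_semaphore_t runloopSignal) {
    NSMutableArray<NSString *> *recentStacks = [NSMutableArray array];
    NSTimeInterval frozenSeconds = 8; // threshold already reached when we get here
    while (true) {
        // Waiting in 1s slices bounds the error from background suspension to <= 1s:
        // if the process is suspended mid-wait, only the current slice is lost.
        long timedOut = dispatch_semaphore_wait(
            runloopSignal, dispatch_time(DISPATCH_TIME_NOW, 1 * NSEC_PER_SEC));
        if (!timedOut) break;  // the main runloop recovered: this was not a freeze
        frozenSeconds += 1;
        [recentStacks addObject:CaptureMainThreadStack()];
        if (recentStacks.count > 10) {
            [recentStacks removeObjectAtIndex:0]; // keep only the last 10 samples
        }
        // Persist frozenSeconds and recentStacks so they survive an abnormal exit.
    }
}
```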
The final monitoring effect is as follows:
Due to image size limitations, only the last main thread call stack before the freeze crash is shown here. In actual use you can view the call stack for each second in the period before the crash; if the main thread call stack never changes across samples, you can confirm the freeze is not a false positive. The example here is a freeze caused by abnormal cross-process communication.
Common freeze crash problems and best practices
Multithreaded deadlock
Problem description
A common case is a child thread and the main thread entering the same dispatch_once concurrently, resulting in deadlock. As shown in the figure above, the steps that reproduce this deadlock are:
- A child thread enters the dispatch_once block and takes its lock.
- The main thread then reaches the same dispatch_once and waits for the child thread to finish.
- Inside the block, the child thread initializes a CTTelephonyNetworkInfo object, which posts a notification that requires the main thread to respond synchronously. The main thread and the child thread now wait on each other; the deadlock eventually triggers a freeze crash.
This is really a hidden pitfall of CTTelephonyNetworkInfo. Replacing its notification handling with a dispatch_sync to dispatch_get_main_queue() inside the dispatch_once block has exactly the same effect and carries the same freeze crash risk.
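Here is a minimal reproduction of the pattern, with dispatch_sync to the main queue standing in for the synchronous main-thread work that CTTelephonyNetworkInfo's lazy initialization can require. Running it hangs by design.

```objc
#import <Foundation/Foundation.h>

static void SharedInit(void) {
    static dispatch_once_t onceToken;
    dispatch_once(&onceToken, ^{
        // Stands in for lazy initialization that synchronously needs the main thread.
        dispatch_sync(dispatch_get_main_queue(), ^{ /* main-thread work */ });
    });
}

int main(void) {
    dispatch_async(dispatch_get_global_queue(QOS_CLASS_DEFAULT, 0), ^{
        SharedInit(); // 1. child thread enters the block and blocks on the main queue
    });
    [NSThread sleepForTimeInterval:0.1]; // let the child thread win the race
    SharedInit(); // 2. main thread waits on the same dispatch_once: deadlock
    return 0;
}
```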
Best practices
- Never call anything inside dispatch_once that synchronously waits on the main thread.
- Initialize a shared CTTelephonyNetworkInfo instance in +load or at some other point before main, so that its lazy initialization never needs a synchronous main thread response later (see the sketch below).
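A sketch of the second point: creating the shared instance in +load, on the main thread and before child threads can race on its lazy initialization. The holder class name is illustrative.

```objc
#import <Foundation/Foundation.h>
#import <CoreTelephony/CTTelephonyNetworkInfo.h>

@interface TelephonyInfoHolder : NSObject
+ (CTTelephonyNetworkInfo *)sharedInfo;
@end

@implementation TelephonyInfoHolder
static CTTelephonyNetworkInfo *gNetworkInfo;

+ (void)load {
    // Runs early on the main thread, well before child threads can trigger
    // the lazy, main-thread-dependent initialization.
    gNetworkInfo = [[CTTelephonyNetworkInfo alloc] init];
}

+ (CTTelephonyNetworkInfo *)sharedInfo {
    return gNetworkInfo;
}
@end
```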
Lock contention between main thread code and time-consuming child thread operations
Problem description
A typical problem is getting stuck in -[YYDiskCache containsObjectForKey:]. Internally, YYDiskCache uses a semaphore lock to make multithreaded disk reads and writes mutually exclusive. Analysis of the freeze stacks shows that a child thread holds the lock while performing a time-consuming write or cleanup, which blocks the main thread. When the problem occurs, the following child thread call stack can be found:
Best practices
- Code that may contend for a lock should not be executed synchronously on the main thread.
- If contention between the main thread and child threads is unavoidable, keep the locking granularity as small, and the locked operation as light, as possible (a sketch of the contention pattern follows).
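A minimal sketch of the contention pattern, illustrative rather than YYDiskCache's actual code: a child thread holds a semaphore lock around slow disk work, so a synchronous main thread query on the same lock freezes for the duration.

```objc
#import <Foundation/Foundation.h>

static dispatch_semaphore_t DiskLock(void) {
    static dispatch_semaphore_t lock;
    static dispatch_once_t once;
    dispatch_once(&once, ^{ lock = dispatch_semaphore_create(1); });
    return lock;
}

// Child thread: holds the lock around slow disk work (a write or cleanup pass).
static void ChildThreadCleanup(void) {
    dispatch_semaphore_wait(DiskLock(), DISPATCH_TIME_FOREVER);
    [NSThread sleepForTimeInterval:5.0]; // simulated time-consuming disk operation
    dispatch_semaphore_signal(DiskLock());
}

// Main thread: a synchronous query on the same lock freezes until the cleanup finishes.
static BOOL MainThreadContainsObject(void) {
    dispatch_semaphore_wait(DiskLock(), DISPATCH_TIME_FOREVER); // may block for seconds
    BOOL result = NO; // stands in for the actual disk lookup
    dispatch_semaphore_signal(DiskLock());
    return result;
}
```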
Disk I/O is too dense
Problem description
This problem manifests in many forms, but the essence is always the same: disk I/O is too dense and main thread disk I/O takes too long. Typical cases:
- Compressing or decompressing files on the main thread.
- The main thread writing to the database synchronously, or sharing a serial queue with potentially time-consuming child thread operations (such as SQLite vacuum or checkpoint).
- The main thread disk I/O being relatively light while child thread I/O is too dense; this mostly shows up on low-end devices.
Best practices
- Do not perform database reads and writes, file compression/decompression, or other disk I/O on the main thread.
- If the main thread must dispatch tasks synchronously to a serial queue, make sure that queue is never shared with potentially time-consuming child thread operations (see the sketch after this list).
- For SDKs that have dense disk I/O behavior and do not need to load synchronously at startup, such as various payment and sharing SDKs, delay and stagger their loading.
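A sketch of the second point, with assumed queue names: bounded work the main thread genuinely needs stays on its own serial queue, while heavy maintenance I/O lives on a separate queue so it can never wedge a dispatch_sync from the main thread.

```objc
#import <Foundation/Foundation.h>

static dispatch_queue_t FastDBQueue(void) {
    static dispatch_queue_t q;
    static dispatch_once_t once;
    dispatch_once(&once, ^{
        q = dispatch_queue_create("com.example.db.fast", DISPATCH_QUEUE_SERIAL);
    });
    return q;
}

static dispatch_queue_t MaintenanceQueue(void) {
    static dispatch_queue_t q;
    static dispatch_once_t once;
    dispatch_once(&once, ^{
        q = dispatch_queue_create("com.example.db.maintenance", DISPATCH_QUEUE_SERIAL);
    });
    return q;
}

void ScheduleWork(void) {
    // Light read the main thread genuinely needs right now: small, bounded work only.
    dispatch_sync(FastDBQueue(), ^{ /* quick indexed lookup */ });
    // Heavy, unbounded work (e.g. SQLite vacuum/checkpoint) goes to a different queue,
    // so it can never block a dispatch_sync issued from the main thread.
    dispatch_async(MaintenanceQueue(), ^{ /* vacuum / checkpoint */ });
}
```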
Cross-process communication in the underlying implementation of system APIs
Problem description
Because cross-process communication has to synchronize with other processes, the current App is likely to freeze whenever another process is abnormal or suspended. Typical cases:
- UIPasteboard, especially via OpenUDID. To share the same UDID across Apps, the OpenUDID library implements cross-App communication by creating and reading pasteboards. Every external call to fetch the UDID makes OpenUDID loop up to 100 times internally, reading a UDID from a pasteboard each time and taking the most frequent value, which can ultimately freeze on pasteboard access.
- The underlying implementation of NSUserDefaults contains direct or indirect cross-process communication and can easily freeze when called synchronously on the main thread.
- The [[UIApplication sharedApplication] openURL:] interface also performs synchronous cross-process communication in its internal implementation.
Best practices
- Deprecate the third-party OpenUDID library. For third-party SDKs that depend on UIPasteboard, push the maintainers to remove the dependency and release updated versions; or move the initialization of these SDKs off the main thread. Our experience, however, is that initializing on a child thread may turn roughly 5% of these freezes into crashes, so it is best to roll the change out behind a switch and observe gradually.
- For key-value storage needs, consider MMKV for heavy use; for light use, refer to Firebase's implementation and write a lighter replacement for NSUserDefaults yourself.
- On iOS 10 and later, use the [[UIApplication sharedApplication] openURL:options:completionHandler:] interface, which calls back asynchronously and does not cause freezes (see the sketch below).
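A sketch of the last point; openURL:options:completionHandler: is the asynchronous UIApplication API available since iOS 10.

```objc
#import <UIKit/UIKit.h>

// The asynchronous variant returns immediately and reports the result in a callback,
// so any cross-process work never blocks the main runloop.
void OpenURLSafely(NSURL *url) {
    [[UIApplication sharedApplication] openURL:url
                                       options:@{}
                             completionHandler:^(BOOL success) {
        NSLog(@"openURL finished: %d", success);
    }];
}
```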
Objective-C runtime lock deadlock
Problem description
Although such problems occur with low probability, they do arise in some complex scenarios. The main thread call stack usually appears stuck in a seemingly ordinary OC method call, which is very obscure; this is also why the freeze monitoring module itself cannot be implemented in OC and has to be written in C/C++. The typical shape is two opposing lock orders: a callback registered via _dyld_register_func_for_add_image synchronously calls OC methods (holding the dyld lock, then waiting for the OC runtime lock), while an OC method synchronously calls objc_copyClassNamesForImage (holding the OC runtime lock, then waiting for the dyld lock). Typical case:
- The dyld lock, selector lock, and OC runtime lock wait on each other. A scenario where the three locks wait on each other is as follows:
- In one iteration, the APM SDK's internal jailbreak detection depended on whether the fork method could be called successfully. But fork calls _objc_atfork_prepare, which acquires the objc-related locks and then reaches dyld initializers that need the dyld lock. If another of our threads already holds the dyld lock and is waiting for the OC runtime lock at that moment, a deadlock arises (a sketch of the opposing dyld/runtime lock orders follows).
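As an illustration of the opposing lock orders described above (both APIs are real, but this is exactly the pattern to avoid): one thread enters OC code from a dyld image callback while another calls a runtime API that consults dyld.

```objc
#import <mach-o/dyld.h>
#import <objc/runtime.h>
#import <Foundation/Foundation.h>

// Thread A: executed with the dyld lock held; the OC message below must
// additionally wait for the OC runtime lock.
static void OnImageAdded(const struct mach_header *mh, intptr_t vmaddr_slide) {
    NSLog(@"image added at slide %ld", (long)vmaddr_slide);
}

// Thread B: an OC context may hold the runtime lock while this runtime API
// consults dyld, i.e. the opposite acquisition order to thread A.
static void DumpClassesOfImage(const char *imagePath) {
    unsigned int count = 0;
    const char **names = objc_copyClassNamesForImage(imagePath, &count);
    free((void *)names);
}

__attribute__((constructor)) static void RegisterRiskyCallback(void) {
    _dyld_register_func_for_add_image(&OnImageAdded);
}
```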
Best practices
- Use _dyld_register_func_for_add_image and objc_copyClassNamesForImage with caution, especially in call chains that are synchronized with OC methods.
- Do not make jailbreak detection depend on calling the fork method.