
Bytedance Terminal Technology — Kunlun White

Preface

Lag and freeze are important performance indicators for today's iOS Apps: they not only hurt the user experience but also affect key product metrics such as retention and DAU. This article introduces the principles behind Heimdallr's lag and freeze monitoring and how, drawing on problems uncovered through long-term business use, it has been continuously iterated and optimized into a comprehensive, stable, and reliable solution.

1. What is lag/freeze?

Lag, as the name implies, means the app is blocked for a period of time during use: the user cannot interact, and the content on the screen does not change. Heimdallr divides this class of problems into three grades according to the length of the blocking time.

1. Fluency and frame drops: animations and scrolling lists are not smooth; the blocking is usually on the order of ten to tens of milliseconds

2. Lag: the app briefly stops responding but recovers and can be used again; the blocking ranges from hundreds of milliseconds to a few seconds

3. Freeze: the app stops responding for so long that it is killed by the system; online, data is collected starting from at least 5s of blocking

As can be seen, these problems can be ordered by severity, from low to high, as frame drops, lag, and freeze. A freeze is comparable to, or even worse than, a crash: not only does it end with the process being killed, the user is also forced to wait for quite a long time first, which damages the experience even further. Because the monitoring schemes differ, this article focuses on the latter two, lag and freeze.

2. The causes of lag/freeze

In iOS development, UIKit is not thread-safe, so all UI-related operations must be executed on the main thread, and the system redraws and renders UI changes to the screen every 16ms (60 frames per second). As long as each UI refresh completes within 16ms, the user will not perceive any lag; but if a time-consuming task on the main thread prevents the UI from refreshing, the app lags, or even freezes. The main thread processes tasks based on the Runloop mechanism, as shown below. The Runloop allows external code to register notification callbacks for the following events:

1. RunloopEntry

2. RunloopBeforeTimers

3. RunloopBeforeSources

4. RunloopBeforeWaiting

5. RunloopAfterWaiting

6. RunloopExit

The flow of these six event callbacks is shown in the figure below. When there are no more tasks to process, the Runloop goes to sleep until it is woken up by a signal to handle new tasks.
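For reference, a minimal sketch of registering an observer for these activities on the main Runloop might look like the following (illustrative code only, not Heimdallr's implementation):

```swift
import Foundation

// Observe the six Runloop activities listed above on the main thread's Runloop.
let activities: CFRunLoopActivity = [.entry, .beforeTimers, .beforeSources,
                                     .beforeWaiting, .afterWaiting, .exit]

let observer = CFRunLoopObserverCreateWithHandler(
    kCFAllocatorDefault,
    activities.rawValue, // which activities to observe
    true,                // repeat for every loop iteration
    0                    // observer priority
) { _, activity in
    // Each callback reports the phase the main Runloop has just reached.
    switch activity {
    case .entry:         print("RunloopEntry")
    case .beforeTimers:  print("RunloopBeforeTimers")
    case .beforeSources: print("RunloopBeforeSources")
    case .beforeWaiting: print("RunloopBeforeWaiting")
    case .afterWaiting:  print("RunloopAfterWaiting")
    case .exit:          print("RunloopExit")
    default:             break
    }
}

CFRunLoopAddObserver(CFRunLoopGetMain(), observer, .commonModes)
```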

In daily coding, UIEvent handling, Timer events, and tasks dispatched to the main queue are all driven by this Runloop cycle. If we perform a time-consuming operation anywhere on the main thread, or misuse locks and deadlock with another thread, the main thread cannot execute the Core Animation callback and therefore cannot refresh the interface. User interaction likewise depends on the delivery and handling of UIEvents, which must also happen on the main thread. Blocking the main thread therefore blocks both UI updates and interaction, and this is the root cause of lag and freezes.
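As a hedged illustration of the two cases just mentioned (a time-consuming operation and a lock misuse), consider the sketch below; the file path and timings are made up:

```swift
import Foundation

// 1. A time-consuming operation on the main thread: until this returns, the
//    interface cannot refresh and UIEvents cannot be handled.
func loadOnMainThread() {
    let blob = try? Data(contentsOf: URL(fileURLWithPath: "/var/tmp/huge.bin"))
    _ = blob
}

// 2. A lock-ordering deadlock between the main thread and a worker thread:
//    each side holds one lock and waits forever for the other.
let lockA = NSLock()
let lockB = NSLock()

DispatchQueue.global().async {
    lockB.lock()
    Thread.sleep(forTimeInterval: 0.1)
    lockA.lock()   // waits for the main thread, which never releases lockA
    lockA.unlock()
    lockB.unlock()
}

// On the main thread:
lockA.lock()
Thread.sleep(forTimeInterval: 0.1)
lockB.lock()       // waits for the worker, which never releases lockB: a freeze
lockB.unlock()
lockA.unlock()
```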

3. Monitoring scheme

Since the root of the problem is blocking of the main Runloop, we need technical means to monitor its running state. To obtain the state of the main thread's Runloop in real time, we first register the event callbacks mentioned above on the main thread. When a callback fires, it uses a signaling mechanism to pass the current running state to a separate listening child thread. The listening thread can handle these signals in various ways; in particular, it can wait for the signal with a timeout, and if the wait times out, the main thread may be blocked. Through the listening thread we get a complete picture of each main-thread Runloop cycle: which phase it is in, how long that phase is taking, and so on. Based on this information, corresponding strategies can be adopted to catch and handle exceptions. The lag and freeze cases are explained separately below.
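A minimal sketch of this listening scheme, assuming a DispatchSemaphore as the signal and an illustrative 400ms timeout (names, thresholds, and synchronization details are simplified and do not reflect Heimdallr's real implementation):

```swift
import Foundation

final class MainRunloopMonitor {
    private let semaphore = DispatchSemaphore(value: 0)
    private var currentActivity: CFRunLoopActivity = .entry // synchronization omitted for brevity
    private var observer: CFRunLoopObserver?

    func start() {
        // 1. Register for the main Runloop's state changes and forward each
        //    change to the listener thread via a semaphore signal.
        observer = CFRunLoopObserverCreateWithHandler(
            kCFAllocatorDefault, CFRunLoopActivity.allActivities.rawValue, true, 0
        ) { [weak self] _, activity in
            self?.currentActivity = activity
            self?.semaphore.signal()
        }
        CFRunLoopAddObserver(CFRunLoopGetMain(), observer, .commonModes)

        // 2. A dedicated listener thread waits for those signals with a timeout.
        //    No signal within the timeout means the main thread has stayed in
        //    one phase too long, i.e. a possible lag, so capture stacks here.
        Thread.detachNewThread { [weak self] in
            while true {
                guard let self = self else { return }
                if self.semaphore.wait(timeout: .now() + .milliseconds(400)) == .timedOut {
                    print("Main thread blocked in activity: \(self.currentActivity)")
                }
            }
        }
    }
}
```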

At present, most APM tools use Runloop listening to catch lag, as it offers the best balance of performance and accuracy. Because RunloopBeforeTimers and RunloopBeforeSources are adjacent callbacks, Heimdallr drops the RunloopBeforeTimers listener to reduce the performance cost of frequent Runloop callbacks.

  1. Lag (ANR)

The characteristic of lag monitoring is that the blocking of the main thread is temporary and recoverable, so we need the blocking duration to evaluate how severe the lag is. We set a lag threshold T in advance; when the main thread blocks for longer than T, a full-thread stack capture is triggered to record the lag scene. The listening thread then continues to wait until the main thread recovers, computes the total blocked time, combines it with the previously captured stacks, and reports a lag exception. Note that if the main thread never recovers after the stack capture, the exception is not a lag and should be handled by the freeze module instead.
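In simplified form, the detection loop described above might look like the sketch below; waitForRunloopSignal, captureAllThreadStacks, and report are hypothetical stubs standing in for the observer/semaphore plumbing and the stack-capture backend:

```swift
import Foundation

// Hypothetical stubs for the real machinery.
func waitForRunloopSignal(timeout: TimeInterval) -> Bool { return true } // true = main thread moved on
func captureAllThreadStacks() -> [String] { return [] }
func report(stacks: [String], duration: TimeInterval) { print("lag lasted \(duration)s") }

func lagDetectionLoop(threshold: TimeInterval) {
    while true {
        let phaseStart = Date()

        // Wait for the next Runloop state change, but no longer than the lag threshold T.
        if waitForRunloopSignal(timeout: threshold) {
            continue // the main thread responded in time, no lag
        }

        // Threshold exceeded: capture all thread stacks while the lag is happening.
        let stacks = captureAllThreadStacks()

        // Keep waiting until the main thread recovers, then report the total
        // blocked time together with the stacks captured earlier. If it never
        // recovers, the freeze (WatchDog) module takes over instead.
        while !waitForRunloopSignal(timeout: 1.0) {}
        report(stacks: stacks, duration: Date().timeIntervalSince(phaseStart))
    }
}
```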

  2. Freeze (WatchDog)

Unlike a lag, a freeze lasts longer and is irreversible. iOS monitors the App's main thread in a similar way: once it detects blocking that exceeds the threshold allowed by the current system (which varies across iOS versions and device models), it kills the App process without notice. What we need to do, therefore, is capture the stack and estimate the freeze duration as accurately as we can before the system notices and kills the App.

We set a preset freeze threshold T (8s by default). This threshold can be relatively conservative; exceeding it does not by itself mean the App will be judged frozen. When T is exceeded, a full-thread stack capture is performed and saved to a local file. After that, sampling is performed at a fixed interval (1s by default). The purpose of sampling is not to obtain a new stack but to update the freeze duration, which is also written to the local file, so the smaller the sampling interval, the closer the recorded value approximates the real freeze time. Eventually the system kills the App. On the next launch, the module restores the stacks and duration from the file retained by the previous run and reports the freeze exception.
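A sketch of this persist-and-restore flow, with an illustrative record format and file location (not Heimdallr's actual storage):

```swift
import Foundation

// Where the in-progress freeze record lives between launches (illustrative path).
let freezeRecordURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("freeze_record.json")

struct FreezeRecord: Codable {
    var stacks: [String]        // full-thread stacks captured when the 8s threshold fired
    var duration: TimeInterval  // best-known blocked time, refreshed at every 1s sampling tick
}

// Written once when the freeze threshold is exceeded, then rewritten at each
// sampling tick with an updated duration until the system kills the process.
// (If the main thread recovers instead, the record is deleted and the event
// is handled by the lag module.)
func persist(_ record: FreezeRecord) {
    if let data = try? JSONEncoder().encode(record) {
        try? data.write(to: freezeRecordURL, options: .atomic)
    }
}

// Called on the next launch: a surviving record means the previous run was
// killed while blocked, so restore it and report a freeze.
func reportFreezeFromLastLaunchIfNeeded() {
    guard let data = try? Data(contentsOf: freezeRecordURL),
          let record = try? JSONDecoder().decode(FreezeRecord.self, from: data) else { return }
    print("Freeze in previous launch, blocked for at least \(record.duration)s")
    try? FileManager.default.removeItem(at: freezeRecordURL)
}
```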

To be clear, many people assume that freezes must be caused by scenarios that never finish, such as deadlocks and infinite loops. In practice, one or more time-consuming operations that together exceed the system threshold are enough to trigger a freeze, and the system also kills applications that fail to finish initialization within the allotted time at startup. Some freezes are therefore the result of an unfortunate combination of several scenes, which raises the bar for locating them.

4. Problems and optimization

The ideal is plump, but reality is skinny: a seemingly airtight monitoring scheme revealed problems of varying degrees once it ran online.

  1. Lag monitoring optimization

In lag monitoring, we assume that the stack captured when the lag threshold is exceeded must belong to the lag scene, but that is not always the case; sometimes the captured stack belongs to something else entirely. Consider the case below: it is time-consuming operation 4 that stalls the main thread, but when the threshold timeout fires we capture stack 5, which has no performance problem at all. Monitoring lag this way is therefore bound to produce false positives. Statistically, although 5 may be reported as a false positive, in terms of online report volume the count for 4 will be greater than that for 5, so stacks reported in large numbers are genuinely time-consuming operations that deserve attention, while stacks with small report counts may be false positives.

Is it possible to capture the lag scene more accurately by technical means while keeping the performance overhead under control? A good answer is a sampling strategy. As shown below, we added a "sampling mode" on top of the original "normal mode"; it requires two additional parameters, a sampling interval and a sampling threshold. The wait up to the lag threshold is divided into finer-grained time nodes, one per sampling interval, and at each node only the main thread's stack is sampled, which takes far less time than a full-thread capture.

Once the main-thread stack is retrieved, stacks are aggregated by the first frame at the top of the stack that belongs to the App's own code. If an identical stack persists longer than the configured sampling threshold, such as stack 4 in the figure, which is sampled three times in a row, that scene is judged to be the lag scene; the full-thread capture is then performed at that moment rather than when the lag threshold fires.
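A sketch of this sampling aggregation, using the topmost frame as a simplified aggregation key; sampleMainThreadStack and captureAllThreadStacks are hypothetical helpers, and the intervals are illustrative:

```swift
import Foundation

// Hypothetical stand-ins for the real stack-capture backend.
func sampleMainThreadStack() -> [String] { return [] }   // top-down frames of the main thread only
func captureAllThreadStacks() -> [String] { return [] }

func samplingLoop(samplingInterval: TimeInterval, samplingThreshold: TimeInterval) {
    var lastTopFrame: String?
    var repeatedTime: TimeInterval = 0

    while true {
        Thread.sleep(forTimeInterval: samplingInterval)

        // Only the main thread is sampled, which is much cheaper than a
        // full-thread capture at every tick.
        let topFrame = sampleMainThreadStack().first

        if topFrame != nil && topFrame == lastTopFrame {
            repeatedTime += samplingInterval
        } else {
            lastTopFrame = topFrame
            repeatedTime = 0
        }

        // The same frame has stayed on top longer than the sampling threshold:
        // treat it as the real lag scene and capture all threads right now,
        // instead of waiting for the lag threshold to fire.
        if repeatedTime >= samplingThreshold {
            _ = captureAllThreadStacks()
            repeatedTime = 0
        }
    }
}
```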

Combined with main-thread sampling, we can pinpoint the lag scene at the function level, at the cost of the extra sampling overhead. To keep that overhead to a minimum and avoid a second round of performance degradation on low-end devices online, lag monitoring supports an annealing strategy for sampling: once a given lag scene has been captured several times, to avoid wasting performance capturing it again, the sampling interval is gradually increased until "sampling mode" degrades back into "normal mode".
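One possible shape of such an annealing policy, with purely illustrative constants and keys:

```swift
import Foundation

// Widen the sampling interval step by step once the same lag scene has been
// captured often enough, until sampling is effectively back to "normal mode".
struct SamplingAnnealer {
    private(set) var interval: TimeInterval = 0.05   // initial sampling interval
    private let maxInterval: TimeInterval = 1.0      // at this point, degrade to "normal mode"
    private var captures: [String: Int] = [:]        // lag scene key -> times captured

    mutating func didCapture(sceneKey: String) {
        captures[sceneKey, default: 0] += 1
        if captures[sceneKey]! >= 3 {                 // same scene seen several times already
            interval = min(interval * 2, maxInterval) // anneal: sample less frequently
        }
    }

    var samplingDegraded: Bool { interval >= maxInterval }
}
```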

After the sampling feature is configured and enabled on the Slardar platform, the sample_flag field can be used to filter out the lag exceptions captured via the sampling timeout. Stacks obtained this way are very likely to be the real lag scene and can be analyzed and fixed in a more targeted way.

  2. Freeze monitoring optimization

Compared with lag, most freeze false positives occur in the background (Heimdallr currently provides background-freeze filtering, which business teams that do not care about background freezes can enable). In the background, the App's threads run at a lower priority and may be suspended by the system at any time, which creates many problems for determining the freeze duration.

The case above describes a freeze false positive. Because background threads run at a lower priority, tasks 1, 2, and 3 take longer to execute than they would in the foreground and more easily exceed our freeze threshold. Then, due to iOS policy, the application is suspended in the background until, at some point, the whole App is killed because of memory pressure. Note, however, that this is part of the App's normal life cycle and is not a WatchDog kill; yet under our previous strategy it would have been judged a freeze. Because we cannot listen for the suspend event, false positives in this scenario cannot be completely ruled out.

In another case, the suspension occurs before the 8s threshold; after a long suspension the application resumes and the 8s timeout fires. In reality, though, the App was running for only a small fraction of those 8s and was suspended most of the time, so it should not be judged frozen. The issue comes down to the accuracy of the timing.

To address these problems, the timing strategy was improved. Instead of waiting 8s in one shot, we subdivide the wait into eight 1s slices. If the App is suspended in between, resuming no longer immediately exceeds the 8s threshold; it introduces an error of at most 1s.
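A sketch of the segmented wait, where waitForRunloopSignal is again a hypothetical stand-in for the listener thread's semaphore wait:

```swift
import Foundation

// Hypothetical stub: returns true if the main thread signalled within `timeout`.
func waitForRunloopSignal(timeout: TimeInterval) -> Bool { return false }

// Wait for the freeze threshold in 1s slices instead of one 8s wait. If the
// process is suspended in the background, the suspension is absorbed by a
// single slice, so it adds at most about 1s of error rather than the whole
// suspended period being counted toward the threshold.
func blockedForFreezeThreshold(threshold: TimeInterval = 8, slice: TimeInterval = 1) -> Bool {
    var waited: TimeInterval = 0
    while waited < threshold {
        if waitForRunloopSignal(timeout: slice) {
            return false // the main thread recovered within this slice
        }
        waited += slice  // each timed-out slice contributes exactly one slice of blocked time
    }
    return true
}
```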

In addition, as mentioned above, a freeze can sometimes result from several time-consuming scenes combined. To track how the main thread changes, the main thread's stack is also sampled during the sampling phase after the full-thread capture and reported together with it. Combining the sampled main-thread stacks gives us a timeline of how the main-thread stack evolves, which helps locate the problem more precisely. (The timeline feature is supported since Heimdallr 0.7.15.)

Finally, we found that some freezes were caused by the OC runtime lock (most likely deadlocks involving dyld and the OC runtime lock). Once this kind of freeze occurs, OC code on all other threads blocks as well, including the listening thread, so the freeze monitor cannot catch the exception. To cover these scenarios, we rewrote all the logic of the lag and freeze modules in C/C++ to remove the dependence on OC calls; compared with the OC implementation, performance also improved further.

Conclusion

After a period of iteration and optimization, Heimdallr's ANR and WatchDog modules have reached a comprehensive, stable, and reliable state. Some of the optimization ideas reference open-source APM frameworks and have been continuously refined based on the actual needs of users. Thanks to all users for their feedback, which helps us improve our features and experience. Going forward, we will continue to add an anti-freeze capability for WatchDog scenarios, helping integrating teams mitigate freezes in common scenarios without code intrusion.


🔥 Volcano Engine APMPlus Application Performance Monitoring is the performance monitoring product of MARS, the Volcano Engine application development suite. Through advanced data collection and monitoring technologies, it provides enterprises with full-link application performance monitoring services, helping them troubleshoot and resolve abnormal problems more efficiently. We are currently running the "APMPlus Application Performance Monitoring Enterprise Support Action", which provides a free application performance monitoring resource pack for small and medium-sized enterprises. Apply now for a chance to receive 60 days of free performance monitoring covering up to 60 million events.
