This is the full text of “How to Systematically Manage iOS Stability Issues”, a talk delivered by Feng Yadong at the ArchSummit Global Architect Summit 2021.
First, a brief self-introduction: I am Feng Yadong. I joined ByteDance in April 2016 and have been responsible for the engineering architecture, foundational libraries, and experience optimization of the Toutiao App. Since December 2017 I have focused on APM, helping build the ByteDance APM platform from scratch; it now serves the company's full range of products. Currently I am mainly responsible for performance and stability monitoring and optimization on the iOS side.

This talk is divided into four chapters: 1. Classification of stability problems; 2. Stability governance methodology; 3. Attribution of difficult problems; 4. Summary and review. Chapter 3, “Attribution of difficult problems”, is the focus of this talk and accounts for about 60% of its length.
I. Classification of stability problems
Before we get into the categories, let's look at the background. We all know that a crash is the worst bug a user can encounter in a mobile app: once the app crashes, the user can no longer use the product, and the product's subsequent retention and business value are out of the question. Here are some numbers worth sharing: for 20% of mobile users, crashes are the most annoying problem, second only to inappropriate ads. Among users lost because of experience problems, one third will switch to a competing product, which shows how damaging and serious crash problems are.

ByteDance, as a company with super-scale apps such as Douyin (TikTok) and Toutiao, takes stability very seriously. Over the past few years we have invested heavily in people and resources in this area and achieved good governance results: in the past two years, the crash rate of Douyin, Toutiao, Feishu, and other apps has improved by more than 30%, and some metrics of some products have improved by more than 80%.

As the pie chart on the right of the figure shows, taking the iOS platform as an example, we divide known stability problems into five categories according to their causes, ranked here from the highest share to the lowest. The first category is OOM, crashes caused by excessive memory usage, which accounts for more than 50%. The second is Watchdog, i.e. freezes, similar to ANR on Android. The third is ordinary Crash. Finally come disk I/O exceptions and CPU exceptions. One question may already be on your mind: what did ByteDance do to achieve these results? Next, I will share the methodology we have developed for stability governance.
II. Methodology for governance of stability issues
First of all, from the perspective of the monitoring platform, the most important thing in stability governance is complete capability coverage: for every type of stability problem listed in the previous chapter, the monitoring platform should be able to detect it promptly and accurately. In addition, from the perspective of business R&D engineers, stability governance needs to run through the complete software development life cycle, including requirement development, testing, integration, gray release, and production. At each of these stages, engineers should pay attention to discovering and handling stability problems.
On the right side of the figure are two important governance principles we have summarized. The first is to control new (incremental) problems and govern existing (stock) problems. Generally speaking, newly introduced stability problems tend to break out suddenly and have a more severe impact, while stock problems are relatively hard to solve and have long repair cycles. The second is easier to understand: urgent before slow, easy before hard. We should prioritize fixing the problems that are breaking out and the ones that are relatively easy to fix.

If we project the software development cycle onto stability governance, the following links can be abstracted. The first link is problem discovery: when users hit any type of crash online, the monitoring platform should be able to detect and report it in time; meanwhile, alerts and automatic problem assignment notify developers immediately so the problems can be fixed promptly. The second link is attribution: when faced with a stability problem, the first thing a developer should do is identify its cause. Depending on the scenario, attribution can be divided into single-point attribution, commonality attribution, and outbreak attribution. Once the cause is found, the next step is to fix it, which is the treatment link. Here we also have some means of problem mitigation: in the online stage we can first do some crash prevention, for example Baymax, the online crash auto-protection scheme based on the OC Runtime described in a NetEase article a few years ago, which can prevent crashes directly in production. Also, when an outbreak of stability problems is caused by a back-end service going live, we can stop the loss dynamically by rolling back the service. Beyond these two approaches, most scenarios still require developers to fix the native code offline and deliver a thorough fix through a release. The last link, a hot topic in recent years, is preventing deterioration. This refers to the phase between requirement development and launch, where stability problems are discovered and solved ahead of time through offline automated unit tests and UI automation, and through system tools such as Xcode and Instruments as well as third-party tools, for example WeChat's open-source MLeaksFinder.
If we want to govern stability problems well, all R&D engineers need to pay attention to every one of the links above to reach the final goal. But which link matters most? From ByteDance's experience in problem governance, we believe the most important one is the second: attribution of online problems. Internal statistics show that the main reason some online problems remain inconclusive for a long time and cannot be fixed is that R&D fails to locate their root cause. So the next chapter is also the focus of this talk: attribution of difficult problems.
3. Attribution of difficult problems
We rank these problems in order of developer familiarity: Crash, Watchdog, OOM, and CPU & disk I/O exceptions. For each of them I will share the background and our solutions, and demonstrate with real-world cases how the various attribution tools solve these problems.
3.1 The first type of difficult problem — Crash
The pie chart on the left breaks Crash down into four categories by cause: Mach exceptions, Unix signal exceptions, OC exceptions, and C++ exceptions. Mach exceptions account for the largest share, followed by signal exceptions, while OC and C++ exceptions are relatively rare. Why this ratio? In the upper right there are two pieces of data. The first is an article from Microsoft stating that more than 70% of its security patches address memory-related errors, which on iOS correspond to illegal address accesses among Mach exceptions, namely EXC_BAD_ACCESS. The second is ByteDance's internal statistics: as many as 80% of crashes go unresolved for a long time, and more than 90% of these long-standing crashes are Mach or signal exceptions. With so many crashes going unsolved, where exactly is the difficulty? We summarize several difficulties in attributing these problems:
- First, unlike OC and C++ exceptions, the crash call stack the developer gets may consist purely of system frames, which obviously makes the problem very hard to fix;
- Second, some crashes are occasional rather than deterministic, and it is very hard for developers to reproduce them offline; because they cannot be reproduced, it is difficult to troubleshoot and locate them through IDE debugging;
- Third, for problems such as illegal address access, the crashing call stack may not be the first scene of the problem. A simple example: a memory allocation in business A overflows and tramples the memory of business B. We would consider business A the real cause of the problem, but it may be business B that uses this memory at a later time and crashes. The problem is clearly caused by business A, yet the crash lands in business B's call stack, which badly misleads the developer trying to troubleshoot and solve it.
Seeing this, you may have a question in mind: if such problems are so difficult, is there really no way to solve them? Not quite. Next I will share two attribution tools used inside ByteDance that are very effective for these problems.
3.1.1 Zombie detection
The first is Zombie detection. If you have used Xcode's Zombie Objects feature, you are probably familiar with it: if we turn Zombie Objects on before debugging and then hit a crash caused by a wild pointer to an OC object, the Xcode console prints a log that tells the developer which object crashed while receiving which message.
A zombie is an OC object that has already been released. What are the attribution advantages of Zombie monitoring? First, it directly locates the class where the problem occurred, rather than an essentially random crash call stack. Second, it raises the reproduction rate of occasional problems, since most occasional problems are likely related to the multi-threaded runtime environment; if we can turn an occasional problem into a deterministic one, developers can easily troubleshoot it with the IDE and debugger. However, this solution has a limited scope: because its underlying principle is based on the OC Runtime, it only applies to memory problems caused by wild pointers to OC objects.

Here is a quick refresher on how Zombie monitoring works. First, we hook the dealloc method of the base class NSObject. When any OC object is released, the dealloc method does not actually free the memory; instead it points the object's isa pointer to a special zombie class. Since this special zombie class implements no methods, the zombie object crashes as soon as it receives any message, and at the crash site we report the zombie object's class name and the method name that was called to the backend for analysis.

Here is a real ByteDance case: a Top 1 crash in a Feishu release line that had gone unresolved for two months. First, you can see that the crash call stack consists purely of system frames. The crash type is an illegal address access, occurring during a transition animation of the navigation controller: the MainTabbarController object crashes when the retain method is called on it.
The MainTabbarController is generally the root view controller of the home page and in theory should never be released during its entire life cycle. Why did it become a wild pointer? As you can see, this brief error message alone is sometimes not enough for developers to locate the root cause. So we went one step further and extended the feature to also report the call stack at the moment the zombie object was released. Look at the penultimate line: it is actually Feishu business code, a delegate method of the navigation controller's gesture recognizer, and the MainTabbarController is released when it is called. Because the call stack reveals the business call site, we only need to analyze the source code to see why the TabbarController is released there to locate the cause.

On the right is the simplified source code (replaced with comments for code-privacy reasons). Historically, to resolve a conflict with the swipe-back gesture, a trick was written into the gesture-recognizer delegate method of Feishu's navigation controller, and it was this trick that led to the accidental release of the MainTabbarController. Having found the root cause, the fix was very simple: drop the trick and rely on the navigation controller's native implementation to decide whether the gesture should be triggered.
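To make the mechanism described above more concrete, here is a minimal sketch of a zombie hook. It assumes a non-ARC source file, and the names (MYZombie_ prefix, MYInstallZombieHook) are illustrative; it is not ByteDance's implementation, and a production version would restrict which classes are zombified and recycle zombie memory under memory pressure.

```objc
// my_zombie.m: compile with -fno-objc-arc so the replacement dealloc can
// manage the object's lifetime manually. Illustrative sketch only.
#import <Foundation/Foundation.h>
#import <objc/runtime.h>
#include <stdio.h>

// Replacement for -[NSObject dealloc]: destroy the instance's ivars and
// associated objects, but keep the allocation alive and re-point its isa to a
// per-class zombie class such as "MYZombie_MainTabbarController". The zombie
// class is a root class with no methods, so any later message (retain,
// release, or anything else) crashes immediately, and the crash report
// carries the original class name and the selector that was sent.
static void MYZombieDealloc(id self, SEL _cmd) {
    const char *originalName = class_getName(object_getClass(self));
    char zombieName[256];
    snprintf(zombieName, sizeof(zombieName), "MYZombie_%s", originalName);

    Class zombieClass = objc_getClass(zombieName);
    if (zombieClass == Nil) {
        // Nil superclass => a root class that implements nothing at all.
        zombieClass = objc_allocateClassPair(Nil, zombieName, 0);
        objc_registerClassPair(zombieClass);
    }

    objc_destructInstance(self);        // run ivar cleanup, remove associations
    object_setClass(self, zombieClass); // turn the object into a zombie
    // NOTE: free() is intentionally NOT called; the memory stays resident so a
    // later wild-pointer access hits the zombie instead of reused memory.
}

__attribute__((constructor))
static void MYInstallZombieHook(void) {
    Method dealloc = class_getInstanceMethod([NSObject class],
                                             NSSelectorFromString(@"dealloc"));
    method_setImplementation(dealloc, (IMP)MYZombieDealloc);
}
```

With a hook of this style, a crash such as the Feishu case above naturally exposes the original class name and the selector that was sent, which is exactly the kind of information described earlier as being reported to the backend.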
3.1.2 Coredump
As mentioned earlier, the Zombie monitoring scheme has limitations: it only applies to wild-pointer problems with OC objects. You might then wonder: C and C++ code can also have wild-pointer problems, and Mach and signal exceptions include many other exception types besides memory problems, such as EXC_BAD_INSTRUCTION and SIGABRT. How do we solve those other difficult problems? Here we introduce another solution: Coredump.

A Coredump is a special file format defined by LLDB. A Coredump file can restore the full running state of the App at a certain point in time (the running state here mainly refers to memory state). In effect, a Coredump sets a breakpoint at the crash site and captures the register information, stack memory, and complete heap memory of every thread at that moment.
What are the attribution advantages of the Coredump scheme? First, because it is a file format defined by LLDB, it naturally supports LLDB command-line debugging, which means developers can debug difficult online problems after the fact without having to reproduce them. Second, because it contains all the memory information at the time of the crash, it gives developers an enormous amount of material for problem analysis.
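As a rough illustration of the idea (not the actual Coredump file format or ByteDance's implementation), the sketch below shows what such a capture has to record for a single thread: its register state plus a verbatim copy of its stack memory. The names (my_thread_snapshot_t, my_snapshot_thread) are assumptions, it is arm64-only, and the target thread would need to be suspended before capture.

```objc
// Simplified illustration of capturing one thread's registers and stack
// memory, the raw material a Coredump-style file preserves for later offline
// (LLDB-style) inspection. Error handling and heap capture are omitted.
#import <Foundation/Foundation.h>
#include <mach/mach.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    arm_thread_state64_t registers; // pc, lr, sp, x0-x28 at capture time
    void   *stack_copy;             // verbatim copy of the thread's stack
    size_t  stack_size;
} my_thread_snapshot_t;

static BOOL my_snapshot_thread(thread_t thread, my_thread_snapshot_t *out) {
    // 1. Register state (the caller is assumed to have suspended the thread).
    mach_msg_type_number_t count = ARM_THREAD_STATE64_COUNT;
    if (thread_get_state(thread, ARM_THREAD_STATE64,
                         (thread_state_t)&out->registers, &count) != KERN_SUCCESS) {
        return NO;
    }
    // 2. Stack memory: locate the stack region via the pthread API and copy it.
    pthread_t p = pthread_from_mach_thread_np(thread);
    if (p == NULL) {
        return NO;
    }
    void  *stack_top = pthread_get_stackaddr_np(p); // highest address of the stack
    size_t size      = pthread_get_stacksize_np(p);
    out->stack_size  = size;
    out->stack_copy  = malloc(size);
    memcpy(out->stack_copy, (char *)stack_top - size, size);
    return YES;
}
```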
This scheme can be used to analyze any Mach or signal exception. Here is an analysis of a real online case. At the time, this problem appeared in all ByteDance products, and in many of them its magnitude was very large, ranking Top 1 or Top 2; it had gone unsolved for the previous two years.
As you can see, the crash call stack is again full of system library frames; it eventually crashes in a method of the libdispatch library, and the exception type indicates that a system library assertion was hit. After reporting the Coredump file for this crash, we analyzed it with the LLDB debugging commands mentioned above: because we have the complete memory state at the time of the crash, we can inspect the registers and stack memory of every thread.
We eventually worked out that register x0 in frame 0 (the first line of the call stack) of the crashing thread actually holds the queue structure defined in libdispatch, and that at offset 0x48 bytes from its start lies the queue's label attribute (which can simply be understood as the queue name). The queue name is critical to us, because to fix the problem we need to know which queue is involved. Reading that memory directly with LLDB's memory read command, we found a C string: the queue name is com.apple.CFFileDescriptor. This was the key clue. Searching the source code globally for this keyword, we found that the queue is created in ByteDance's underlying network library, which explains why all ByteDance products have this crash.

Finally, working with the network library engineers and reading the libdispatch source, we found the cause: the GCD queue's external reference count had dropped below zero, i.e. the queue was over-released, which eventually hit the system library assertion and crashed. Once the cause was identified, the fix was simple: we just take an extra external reference on the queue passed to dispatch_source_create at the time the queue is created. After discussing with the network library maintainers, it was confirmed that this queue should never be released during the entire App life cycle. Resolving this one problem directly reduced the crash rate of all ByteDance products by 8%.
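For illustration only, here is a hedged sketch of this style of fix under manual reference counting (a non-ARC file). The names and the queue label are made up; in the real case the queue lived inside the underlying network library, not in code like this.

```objc
// gcd_queue_fix_sketch.m: compile with -fno-objc-arc so dispatch objects are
// manually reference-counted. Keeping an extra external reference on the queue
// for the whole App life cycle means a stray, unbalanced dispatch_release()
// elsewhere can no longer drive its external reference count below zero and
// trip the libdispatch assertion.
#include <dispatch/dispatch.h>
#include <stdint.h>

static dispatch_queue_t g_fd_queue; // intentionally never released

dispatch_queue_t my_fd_queue(void) {
    static dispatch_once_t once;
    dispatch_once(&once, ^{
        g_fd_queue = dispatch_queue_create("com.example.fd-queue", DISPATCH_QUEUE_SERIAL);
        dispatch_retain(g_fd_queue); // the extra +1 guarding against over-release
    });
    return g_fd_queue;
}

// The queue is then handed to dispatch_source_create() as usual; the source's
// own retain/release of the queue stays balanced, and the guard reference
// above keeps the queue alive for the entire process lifetime. (Releasing the
// source when the fd is no longer watched is omitted from this sketch.)
void my_watch_fd(int fd, dispatch_block_t handler) {
    dispatch_source_t src = dispatch_source_create(DISPATCH_SOURCE_TYPE_READ,
                                                   (uintptr_t)fd, 0, my_fd_queue());
    dispatch_source_set_event_handler(src, handler);
    dispatch_resume(src);
}
```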
3.2 The second type of difficult problem — Watchdog
Let's move on to the second category of difficult problems: Watchdog, i.e. freezes. On the left of the figure above are two screenshots I took from Weibo, showing users complaining after running into freezes; you can see the damage to user experience is considerable. So what exactly are the dangers of freeze problems?
First, freezes usually occur during the cold-start stage when users open the App; a user may wait ten seconds, be unable to do anything, and then see the App killed, which hurts the user experience badly. Second, our online monitoring found that, if freezes are left untreated, their magnitude can be 2-3 times that of ordinary crashes. Third, the industry currently monitors OOM crashes mostly by a process of elimination; if freeze kills are not excluded, the probability of misclassifying them as OOM crashes rises accordingly.
What are the attribution difficulties of freeze problems? First, the traditional scheme is based on lag monitoring: the main thread is considered frozen once it has been unresponsive for more than 3-5 seconds. This traditional scheme is very prone to false positives; we will explain why on the next slide. Second, the causes of a freeze can be very complex and are not necessarily a single problem: main-thread deadlock, lock waits, main-thread I/O, and other causes can all lead to a freeze. Third, deadlock is a common cause of freezes, and the bar for deadlock analysis in traditional schemes is relatively high, because it depends heavily on developer experience: developers must manually work out which thread or threads the main thread is mutually waiting on and why the deadlock occurred.

As shown here, monitoring freezes with the traditional lag-based scheme is prone to false positives. Why? The green and red blocks in the figure are different time-consuming phases on the main thread. If the main thread's unresponsive time happens to cross the threshold during the fifth time-consuming phase in the figure, we capture the main thread's call stack at that moment; but that phase is clearly not the main contributor to the freeze. The real problem lies in the fourth time-consuming phase, which has already finished by then, so a false positive is produced and developers can miss the real problem.
In view of the pain points above, we provide two solutions. First, when a freeze is detected, the main thread's call stack is captured multiple times, and the main thread's state at different moments is recorded (the thread state information is explained on the next slide). Second, we automatically identify freezes caused by deadlocks and, for such problems, help developers automatically reconstruct the lock-wait relationships between threads. This figure shows the main thread's call stacks at different points in time. After each thread name there are three tags, corresponding to three pieces of thread state: the thread's CPU usage at that moment, its run state, and its thread flags.
On the right is an explanation of the thread run states and thread flags. When we look at thread state, there are two main lines of analysis: if the main thread's CPU usage is 0, it is in the waiting state, and it has been swapped out, we have reason to suspect the current freeze is caused by a deadlock; conversely, if the main thread's CPU usage stays high and it remains in the running state, we should suspect CPU-intensive work such as an infinite loop on the main thread.

The second attribution tool is deadlocked-thread analysis. This feature is relatively new, so let's look at how it works. Based on the thread state from the previous slide, we obtain the state of all threads at the moment of the freeze, filter out the threads in the waiting state, and then obtain each of these threads' current PC address, i.e. the method being executed, and determine through symbolication whether it is a lock-wait function.
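As a rough sketch of how this thread state can be read in user space (illustrative code, not the monitoring SDK itself; assumes arm64), the Mach APIs below expose exactly the pieces of information mentioned: per-thread CPU usage, run state, flags, plus the current PC that can then be symbolicated.

```objc
// thread_state_sketch.m: enumerate all threads in the current task and print
// CPU usage, run state, flags, and the symbol at each thread's current PC.
#import <Foundation/Foundation.h>
#include <dlfcn.h>
#include <mach/mach.h>

void my_dump_thread_states(void) {
    thread_act_array_t threads = NULL;
    mach_msg_type_number_t threadCount = 0;
    if (task_threads(mach_task_self(), &threads, &threadCount) != KERN_SUCCESS) {
        return;
    }
    for (mach_msg_type_number_t i = 0; i < threadCount; i++) {
        // Basic info: CPU usage, run state (running/waiting/...), flags (swapped out, idle).
        thread_basic_info_data_t basic;
        mach_msg_type_number_t basicCount = THREAD_BASIC_INFO_COUNT;
        if (thread_info(threads[i], THREAD_BASIC_INFO,
                        (thread_info_t)&basic, &basicCount) != KERN_SUCCESS) {
            continue;
        }
        double cpuPercent = basic.cpu_usage * 100.0 / TH_USAGE_SCALE;
        BOOL waiting    = (basic.run_state == TH_STATE_WAITING);
        BOOL swappedOut = (basic.flags & TH_FLAGS_SWAPPED) != 0;

        // Current PC (arm64), symbolicated to see which function the thread is
        // parked in, e.g. a lock-wait function such as __psynch_mutexwait.
        arm_thread_state64_t state;
        mach_msg_type_number_t stateCount = ARM_THREAD_STATE64_COUNT;
        const char *symbol = "?";
        if (thread_get_state(threads[i], ARM_THREAD_STATE64,
                             (thread_state_t)&state, &stateCount) == KERN_SUCCESS) {
            Dl_info info;
            if (dladdr((void *)arm_thread_state64_get_pc(state), &info) && info.dli_sname) {
                symbol = info.dli_sname;
            }
        }
        NSLog(@"thread %u: cpu=%.1f%% waiting=%d swapped_out=%d pc=%s",
              threads[i], cpuPercent, waiting, swappedOut, symbol);
        mach_port_deallocate(mach_task_self(), threads[i]);
    }
    vm_deallocate(mach_task_self(), (vm_address_t)threads,
                  threadCount * sizeof(thread_t));
}
```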
The figure above lists some of the lock-wait functions we currently cover, including mutex, read-write lock, spin lock, GCD, and so on. Each lock-wait function takes an argument carrying information about the current lock wait. We can read that lock-wait information from a register and cast it to the corresponding structure, and each structure defines a thread ID field indicating which thread the caller is waiting on to release the lock. After performing this sequence of operations for every thread in the waiting state, we obtain the lock-wait relationships of all threads and can build the lock-wait graph. With this scheme we can automatically identify deadlocked threads: if we determine that thread 0 is waiting for thread 3 to release a lock while thread 3 is waiting for thread 0 to release a lock, then the two threads are clearly waiting for each other, which ends in a deadlock.
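Once each waiting thread has been mapped to the thread it is waiting on, the deadlock check itself is just cycle detection on that graph. Below is a minimal sketch of that step only, operating on a plain waiter-to-owner table; the helper name is hypothetical.

```objc
// deadlock_cycle_sketch.m: given the "waiter -> lock owner" edges recovered
// from the lock-wait structures, walk the chain starting from each thread.
// Revisiting a thread already on the current chain means a wait cycle, i.e.
// a deadlock (e.g. thread 0 waits on thread 3 while thread 3 waits on thread 0).
#import <Foundation/Foundation.h>

// waitGraph maps a waiting thread's ID to the ID of the thread holding the lock.
static NSArray<NSNumber *> *MYFindDeadlockCycle(NSDictionary<NSNumber *, NSNumber *> *waitGraph) {
    for (NSNumber *start in waitGraph) {
        NSMutableArray<NSNumber *> *chain = [NSMutableArray arrayWithObject:start];
        NSNumber *current = waitGraph[start];
        while (current != nil) {
            NSUInteger idx = [chain indexOfObject:current];
            if (idx != NSNotFound) {
                // The remainder of the chain from `current` onward is the cycle.
                return [chain subarrayWithRange:NSMakeRange(idx, chain.count - idx)];
            }
            [chain addObject:current];
            current = waitGraph[current];
        }
    }
    return nil; // no mutual waiting detected
}

// Example: thread 0 waits on thread 3 and thread 3 waits on thread 0, so
// MYFindDeadlockCycle(@{ @0 : @3, @3 : @0 }) returns @[ @0, @3 ].
```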
You can see that the main thread here has been labeled as deadlocked: its CPU usage is zero, it is in the waiting state, and it has been swapped out, which is consistent with the thread-state analysis approach described earlier. With this analysis we can build a complete lock-wait graph, and deadlocks caused by two or more threads waiting on each other can be identified and analyzed automatically.

This is the source code of the deadlock shown above: the main thread holds a mutex, a child thread holds a GCD lock, and the two threads wait on each other, causing a deadlock. The takeaways are: if a child thread may perform time-consuming operations, try not to share locks between it and the main thread; also, be careful when synchronously executing blocks on serial queues. According to ByteDance's online monitoring and attribution tools, the most common causes of freezes are deadlock, lock contention, main-thread I/O, and cross-process communication.
3.3 The third type of difficult problem — OOM
OOM (out of memory) means the App is forcibly killed by the system because it uses too much memory. What are the dangers of OOM crashes? First, we believe that the longer a user keeps the App running, the more likely an OOM crash becomes, so OOM crashes hurt heavy users the most. Second, statistics show that if OOM problems are not governed systematically, their magnitude is usually 3-5 times that of ordinary crashes. Finally, unlike crashes and freezes, memory problems are relatively hidden and deteriorate easily during rapid iteration.
So what are the attribution difficulties of OOM problems? First, memory composition is very complex and there is no clear exception call stack. Second, although we have offline tools for troubleshooting memory problems, such as Xcode MemoryGraph and Instruments Allocations, these offline tools do not work for online scenarios. For the same reason, it is very difficult for developers to simulate and reproduce online OOM problems offline.

The attribution tool we use to solve the online OOM puzzle is MemoryGraph, by which I mean a memory graph that can be collected in the online (production) environment. It has some similarities to Xcode MemoryGraph, but also significant differences. The biggest difference, of course, is that it can be used online. Second, it supports statistics and aggregation over scattered memory nodes, making it easy for developers to locate the top memory consumers.
Here is a quick overview of how online MemoryGraph works. We periodically check the App's physical memory usage; when it exceeds a danger threshold, a memory dump is triggered. At that point the SDK records the symbol information of every memory node and the reference relationships between them (including whether each reference is strong or weak), and finally reports the whole picture to the backend, where it can help developers analyze problems such as excessive memory usage and memory leaks.
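As a rough sketch of the trigger condition only (illustrative code and threshold, not the SDK's actual logic), the physical footprint the system uses for its memory decisions can be read with task_info, and a dump can be kicked off once it crosses a danger threshold; the dump function here is a placeholder.

```objc
// memory_watermark_sketch.m: read the App's physical memory footprint and
// trigger a (hypothetical) memory dump once it crosses a danger threshold.
#import <Foundation/Foundation.h>
#include <mach/mach.h>

// Returns the current physical footprint in bytes, or 0 on failure.
static uint64_t my_physical_footprint(void) {
    task_vm_info_data_t vmInfo;
    mach_msg_type_number_t count = TASK_VM_INFO_COUNT;
    kern_return_t kr = task_info(mach_task_self(), TASK_VM_INFO,
                                 (task_info_t)&vmInfo, &count);
    return (kr == KERN_SUCCESS) ? vmInfo.phys_footprint : 0;
}

static void my_check_memory_watermark(void) {
    // Illustrative threshold; a real SDK would derive it from the device's
    // per-process memory limit rather than hard-coding a number.
    const uint64_t kDangerThreshold = 1500ull * 1024 * 1024; // 1.5 GB
    if (my_physical_footprint() > kDangerThreshold) {
        // my_trigger_memory_graph_dump() is a placeholder for walking live
        // memory nodes, recording their types and reference relationships,
        // and uploading the result for backend analysis.
        // my_trigger_memory_graph_dump();
    }
}
```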
Here is a practical example of how MemoryGraph solves an OOM problem. The general approach to analyzing a MemoryGraph file is to peel back the layers until you reach the root cause.
An example of MemoryGraph file analysis is shown above, where the red boxes mark the different areas: in the upper left is the class list, which aggregates objects of the same type and shows their count and total memory footprint; on the right is the address list of all instances of the selected class; in the lower right is an area where developers can manually trace an object's references (which objects currently reference it, and which objects it references); and in the middle is the reference graph.
Looking at the class list, we can see 47 objects of the ImageIO type, yet these 47 objects occupy over 500 MB of memory, which is obviously not a reasonable footprint. We click into the ImageIO class, take the first object as an example, and trace its references. We discover that this object has only one referencer, VM Stack: Rust Client Callback, which is actually the Rust network library thread at the bottom of Feishu. At this point you may wonder: do all 47 objects have the same reference relationship? Using the Add Tag feature in the lower-right path-tracing area, we can automatically check whether all 47 objects share this reference relationship; as shown in the upper right corner of the image, after filtering we confirm that 100% of the 47 objects do.
Let's then look at the VM Stack: Rust Client Callback object itself. Two of the objects it references stand out: one is ImageRequest and the other is ImageDecoder, and from the names we can easily infer that they are image-request and image-decoding objects. Searching the class list with these two keywords, we find 48 ImageRequest objects and 47 ImageDecoder objects. If you remember, the largest memory hog on the previous page, ImageIO, also numbered 47; this is obviously no coincidence. Examining the reference relationships of these two types of objects, we find that they too are 100% referenced by the VM Stack: Rust Client Callback object.
Finally, together with the Feishu image library engineers, we located the cause: requesting 47 images simultaneously and decoding them all at once is not a reasonable design. The root cause is that the downloader of Feishu's image library relies on NSOperationQueue for task management and scheduling but does not configure a maximum number of concurrent operations, which leads to excessive memory usage in extreme scenarios. The corresponding fix is to set a maximum concurrency for the image downloader and to adjust task priority according to whether the image to be loaded is in the visible area; a sketch of this style of fix follows below.

The most common causes of OOM problems are: memory leaks, which are the most common; memory pile-up, for example when an autorelease pool is not drained in time; resource exceptions, such as loading a very large image or PDF file; and memory misuse, such as a memory cache with no eviction mechanism.
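Picking up the forward reference above: a hedged sketch of capping downloader concurrency with NSOperationQueue and deprioritizing off-screen images. The class and method names (MYImageDownloader, loadImageAtURL:visible:completion:) and the chosen cap are illustrative assumptions, not Feishu's actual image library code.

```objc
// image_downloader_sketch.m (ARC)
#import <Foundation/Foundation.h>

@interface MYImageDownloader : NSObject
@property (nonatomic, strong) NSOperationQueue *downloadQueue;
@end

@implementation MYImageDownloader

- (instancetype)init {
    if ((self = [super init])) {
        _downloadQueue = [[NSOperationQueue alloc] init];
        // Without this line the queue will happily run dozens of downloads and
        // decodes at once, which is the kind of behavior that blew memory up
        // in the case described above.
        _downloadQueue.maxConcurrentOperationCount = 6; // illustrative cap
    }
    return self;
}

- (void)loadImageAtURL:(NSURL *)url visible:(BOOL)visible
            completion:(void (^)(NSData *data))completion {
    NSBlockOperation *op = [NSBlockOperation blockOperationWithBlock:^{
        // Placeholder for the real fetch + decode pipeline.
        NSData *data = [NSData dataWithContentsOfURL:url];
        if (completion) completion(data);
    }];
    // Images outside the visible area can wait their turn.
    op.queuePriority = visible ? NSOperationQueuePriorityHigh
                               : NSOperationQueuePriorityLow;
    [self.downloadQueue addOperation:op];
}

@end
```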
3.4 The fourth type of difficult problem — CPU and disk I/O exceptions
The reason these two types of problems are grouped together is that they are highly similar: both are cases of abnormal resource usage, and unlike an ordinary crash, the trigger is not a single moment but abnormal resource usage sustained over a period of time. What are the dangers of abnormal CPU and disk I/O usage? First, even when they do not ultimately crash the App, these problems are particularly likely to cause performance issues such as lag or device heating. Second, their magnitude should not be ignored. Third, compared with the stability problems discussed earlier, developers are less familiar with this category and pay it less attention, so it deteriorates very easily.
What are the attribution difficulties of this kind of problem? First, the problem spans a long duration, so there is probably no single cause. Second, because users' environments and operation paths are complex, it is hard for developers to reproduce such problems offline. Third, if the App wants to monitor and attribute such problems in user space, it may need to sample call stacks at high frequency over a period of time, and such monitoring obviously carries a very high performance cost. On the left of the image above is a crash log for a CPU exception that we exported from an iOS device, with only the key part kept: the crash was triggered because, within a window of less than 3 minutes, the App's CPU usage had exceeded 80% for 144 seconds.
The screenshot on the right of the figure is one I took from a WWDC 2020 session. For such problems Apple officially suggests some attribution approaches: first, Xcode Organizer, Apple's official problem-monitoring backend; second, developers can also integrate MetricKit, which carries diagnostic information for CPU exceptions. On the left of the image above is the crash log for excessive disk writes, likewise exported from an iOS device and trimmed to the key part: within 24 hours, the App's disk writes had exceeded 1073 MB, which triggered the crash.
Apple's official documentation, shown on the right, also gives attribution advice for this type of issue, with the same two suggestions: rely on Xcode Organizer, or rely on MetricKit. In our selection we ultimately adopted the MetricKit scheme, mainly because we want to keep the data source in our own hands. Xcode Organizer is, after all, a black-box backend run by Apple; we cannot connect it to our internal platforms, and it is inconvenient for building follow-up processes such as alerting, automatic problem assignment, and issue status management.

MetricKit is Apple's official framework for performance analysis and stability diagnostics, and because it is a system library its performance overhead is very low. On iOS 14 and above, MetricKit makes it easy to obtain diagnostic information for CPU and disk I/O exceptions. Integration is also very simple: we just import the system header, register a subscriber, and in the corresponding callback report the CPU and disk-write exception diagnostics to the backend for analysis.

In fact, the diagnostic payloads of the two exception types have a very similar format, recording all method calls over a period of time and the time consumed by each. After they are reported to the backend, we can visualize the data as a very intuitive flame graph, which helps developers locate problems easily. For the flame graph on the right, put simply, the wider a block is, the more CPU time it consumes, so we just need to find the widest App frames in the call stack to locate the problem. In the highlighted red box, one of the methods contains the keyword animateForNext; as the name suggests, it is doing animation scheduling.
Finally, together with the Feishu engineers, we located the cause of this problem: an animation in Feishu's mini-program business did not pause when it was hidden, resulting in abnormally high CPU usage. The solution was as simple as pausing the animation while it is hidden.
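For reference, here is a minimal sketch of the MetricKit integration described above (iOS 14+). The class name is illustrative and the reporting call is a placeholder, not a real API.

```objc
// MYMetricKitSubscriber.m: register for MetricKit diagnostics and forward
// CPU-exception and disk-write-exception payloads to the backend.
#import <Foundation/Foundation.h>
#import <MetricKit/MetricKit.h>

API_AVAILABLE(ios(14.0))
@interface MYMetricKitSubscriber : NSObject <MXMetricManagerSubscriber>
@end

@implementation MYMetricKitSubscriber

+ (void)start {
    static MYMetricKitSubscriber *subscriber;
    static dispatch_once_t once;
    dispatch_once(&once, ^{
        subscriber = [[MYMetricKitSubscriber alloc] init];
        [[MXMetricManager sharedManager] addSubscriber:subscriber];
    });
}

// Called by the system with diagnostics collected since the last delivery.
- (void)didReceiveDiagnosticPayloads:(NSArray<MXDiagnosticPayload *> *)payloads {
    for (MXDiagnosticPayload *payload in payloads) {
        for (MXCPUExceptionDiagnostic *diag in payload.cpuExceptionDiagnostics) {
            // The call stack tree records the sampled methods and their CPU time;
            // its JSON form is what gets rendered as a flame graph on the backend.
            NSData *callTree = [diag.callStackTree JSONRepresentation];
            // my_upload_diagnostic(@"cpu_exception", callTree);        // placeholder
            (void)callTree;
        }
        for (MXDiskWriteExceptionDiagnostic *diag in payload.diskWriteExceptionDiagnostics) {
            NSData *callTree = [diag.callStackTree JSONRepresentation];
            // my_upload_diagnostic(@"disk_write_exception", callTree); // placeholder
            (void)callTree;
        }
    }
}

@end
```

A subscriber of this kind would typically be registered once at App launch, guarded by an iOS 14 availability check.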
IV. Summary and review
In Chapter 2 on governance methodology, I mentioned that good governance of stability problems must run through every link of the software development cycle, including problem discovery, attribution, treatment, and prevention of deterioration; at the same time, we believe the attribution of problems, especially online problems, is the most important link in the whole chain. For each type of problem, this talk has introduced some practical attribution tools: for Crash, Zombie detection and Coredump; for Watchdog, thread state and deadlocked-thread analysis; for OOM, MemoryGraph; for CPU and disk I/O exceptions, MetricKit.

Except for MetricKit, all the attribution schemes mentioned in this talk were developed in-house at ByteDance, and the open-source community has not yet produced complete equivalents. These tools and platforms will subsequently be offered externally through APM Plus in MARS, Volcano Engine's application development suite, as a one-stop enterprise solution. All the capabilities mentioned in this talk have been verified and polished over years across ByteDance's products, and both their own stability and the business benefits they bring are plain to see. We welcome your continued attention.