The causes of iOS OOM crashes in production have long been a difficult problem for the industry, and ByteDance products such as Toutiao and Douyin face it as well. In its day-to-day work, the ByteDance performance and stability team developed an OOM attribution solution, the Online Memory Graph, which is based on memory snapshot technology and can be used in production. With this solution, the OOM crash rate of Toutiao and Douyin dropped by more than 50% within 3 months. This article covers the technical background, the principles, and the usage of the solution, in the hope of offering a new approach to this difficult problem.

OOM Crash background

OOM

OOM is short for Out Of Memory. On iOS it means that the operating system forcibly terminates the current application because its memory usage is too high. To the user it looks like the App suddenly quitting, no different from an ordinary crash. However, when we hit this kind of crash during debugging, we cannot find an ordinary crash log under Settings > Privacy > Analytics & Improvements on the device; we only find logs whose names begin with Jetsam. These logs are what the system generates after an OOM crash. So the next question is: what is Jetsam?

Jetsam

Jetsam is the resource management mechanism that iOS uses to keep memory from being overused. Unlike desktop operating systems such as macOS, Linux, and Windows, iOS has no memory swap mechanism, for performance reasons. As a result, when the device as a whole runs short of memory, the system can only terminate processes that have a low priority or use too much memory.

Jetsam log interpretation

The image above shows the most critical part of a Jetsam log. The key fields are:

  • pageSize: the size of a physical memory page on the current device. The device here is an iPhone XS Max, with the 16384-byte (16 KB) pages used in the calculation below; devices before the Apple A7 chip used 4 KB pages.
  • states: the running state of the application. Here Heimdallr-Example is running in the foreground, and this kind of crash is called FOOM (Foreground Out Of Memory). Correspondingly, an OOM crash in the background is called BOOM (Background Out Of Memory).
  • rpages: resident pages, the number of memory pages the process currently occupies. From pageSize and rpages we can calculate how much memory Heimdallr-Example occupied when it crashed: 16384 * 92800 / 1024 / 1024 = 1450 MB, about 1.4 GB.
  • reason: why the process was terminated. Heimdallr-Example was terminated because it exceeded the maximum physical memory footprint that the operating system allows for a single process.
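The footprint calculation above can be written out as a short sketch. Python is used here purely for illustration, and the function name is hypothetical; only the two input values come from the log discussed above.

```python
def jetsam_footprint_mb(page_size: int, rpages: int) -> float:
    """Return the resident memory of a process in MB from two Jetsam log fields."""
    return page_size * rpages / 1024 / 1024


# Values taken from the Heimdallr-Example entry discussed above.
footprint_mb = jetsam_footprint_mb(page_size=16384, rpages=92800)
print(f"{footprint_mb:.0f} MB, about {footprint_mb / 1024:.1f} GB")
```

The same function applies to any process entry in a Jetsam log, as long as pageSize is read from that same log.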

The cleanup strategy of the Jetsam mechanism can be summarized as follows:

  1. A single App is terminated when its physical memory usage exceeds the upper limit.
  2. When the physical memory of the entire device comes under pressure, processes are cleaned up in the following priority order:
    1. Background applications > foreground applications
    2. Applications with high memory usage > applications with low memory usage
    3. User applications > system applications

The code for Jetsam can be found in the open source XNU code. For detailed source code analysis, please refer to the second and third references at the end of this article.

Why monitor OOM crash

As described above, OOM is divided into FOOM and BOOM. The former is far more noticeable to the user and therefore more harmful to the user experience, so the OOM crashes discussed below refer only to FOOM. Is it necessary to build online monitoring for OOM crashes?

The answer is yes, and it is very necessary. Here is why:

  1. Heavy users, meaning users who use the App for longer periods, are more likely to hit FOOM. If the damaged experience drives these users away, the loss to the business is even greater.
  2. Online data from Toutiao, Douyin and other products shows that FOOM crashes outnumber ordinary crashes, because the lack of effective monitoring and governance in the past left the problem unnoticed for too long.
  3. Even when high memory usage does not cause a FOOM in our own App, it can cause a BOOM in other applications. If users switch from WeChat to our App and then back, and find that WeChat has restarted instead of staying on the chat page they left, the experience is very bad.

OOM Online Monitoring

Jetsam force kill code screenshot

Looking through the XNU source code, we can see that the Jetsam mechanism terminates a process by sending it the SIGKILL signal.

#define SIGKILL 9 /* kill (cannot be caught or ignored) */

SIGKILL cannot be ignored or caught by the current process, so the usual crash capture approach of listening for abnormal signals does not work. How, then, do we monitor OOM crashes?
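The fact that SIGKILL cannot be caught is easy to observe on any POSIX system. The sketch below is an illustration in Python rather than iOS code, and the helper name is hypothetical: it tries to register a handler and reports whether the operating system accepted it.

```python
import signal


def can_install_handler(signum: int) -> bool:
    """Try to register a no-op handler and report whether the OS allowed it."""
    try:
        previous = signal.signal(signum, lambda s, frame: None)
    except (OSError, ValueError):
        # OSError: the kernel refuses handlers for this signal (as for SIGKILL).
        # ValueError: Python only allows signal.signal() on the main thread.
        return False
    if previous is not None:
        signal.signal(signum, previous)  # restore the original disposition
    return True


print("SIGKILL catchable:", can_install_handler(signal.SIGKILL))
```

A signal-based crash reporter runs into the same refusal, which is why OOM terminations never reach its handlers.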

Since direct monitoring does not work, in 2015 Facebook proposed another approach: the process of elimination. The flow chart below shows the procedure:

The elimination method determines the process of OOM crash

At every App launch we determine why the previous process terminated. The known reasons are:

  • The App was updated
  • The App crashed
  • The user manually quit the App
  • The operating system was updated
  • The App was terminated after being switched to the background

If the previous process terminated for none of these known reasons, we conclude that a FOOM crash occurred during the last run.
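The elimination logic above amounts to a single check at launch. The following is a minimal sketch, assuming the App persisted a small record about the previous session; the record type and all field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LastSession:
    """Facts recorded about the previous run (all field names are hypothetical)."""
    app_updated: bool = False
    crashed: bool = False
    user_quit: bool = False
    os_updated: bool = False
    terminated_in_background: bool = False


def was_foom(last: LastSession) -> bool:
    """Return True when none of the known termination reasons applies."""
    known_reasons = (
        last.app_updated,
        last.crashed,
        last.user_quit,
        last.os_updated,
        last.terminated_in_background,
    )
    return not any(known_reasons)


print(was_foom(LastSession()))              # no known reason was recorded
print(was_foom(LastSession(crashed=True)))  # an ordinary crash explains the exit
```

Each additional scenario that would otherwise be misjudged can be modelled as one more field in the record and one more entry in known_reasons.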

Fabric (originally a Twitter product, later acquired by Google) used the same approach. However, our testing and verification showed that this method misjudges at least the following scenarios:

  • Watchdog crashes
  • Background launches
  • Runs driven by XCTest/UITest and other automated test frameworks
  • The App quitting by calling exit

Before ByteDance's OOM crash monitoring went live, we ruled out all of the known misjudgment scenarios above. Because the process of elimination can never be as accurate as direct monitoring, some bad cases inevitably remain, but we make the result as accurate as we can.

Online Memory Graph: OOM crash rate decreased by 50%+

OOM Production environment attribution

Currently, the main tools for troubleshooting memory problems on iOS are the Memory Graph provided by Xcode and the Instruments toolset. They provide fairly complete memory information, but they only work in the development environment and cannot be used in production. Memory problems tend to occur in extreme usage scenarios that offline development and testing rarely cover, so the Xcode tools cannot help with most of these occasional problems.

In response, various companies have built their own online solutions and open-sourced excellent tools such as MLeaksFinder, OOMDetector, and FBRetainCycleDetector.

In using them inside ByteDance, we found that the existing tools each have a different focus and cannot fully meet our needs. The main problems are these two:

  • Solutions that look for retain cycles through Objective-C object reference relationships have a narrow scope and only handle some retain-cycle problems. Memory problems are usually more complex, and cases such as memory accumulation, root leaks, and problems in the C/C++ layer cannot be solved this way.
  • Solutions that cluster allocation stack information need to run permanently and consume a lot of memory and CPU. They cannot target only the users who have memory problems and must cast a wide net, which hurts the user experience. In addition, memory allocated through very common stacks cannot be traced to the actual usage scenario, and common leaks such as retain cycles cannot be analyzed.

To solve the increasingly serious memory problems of Toutiao, Douyin and other products, we developed our own online solution based on memory snapshot technology, which we call the Online Memory Graph. After launch it was integrated into almost all of the group's products and helped each of them fix problems that had accumulated over many years. The OOM rate dropped by an order of magnitude; within 3 months the OOM rate of the latest version of Douyin decreased by 50% and that of Toutiao by 60%. Sudden online OOM problems can now be located far more efficiently, and attributing OOM problems no longer means groping in the dark.

The core principle of the Online Memory Graph is to scan all dirty memory in the process and, from the addresses of other memory nodes stored inside each memory node, build a directed graph of the reference relationships between nodes, which is then used to analyze and locate memory problems. The whole process uses no private API. The solution has the following capabilities:

  1. Fully restores the user's memory state at the time of collection.
  2. Quantifies large memory consumers and memory leaks for online users, accurately answering the question of where the App's memory goes.
  3. Uses memory node symbols and the reference graph to answer why a memory node is still alive.
  4. Strictly controls performance overhead: analysis is triggered only when memory usage exceeds an abnormal threshold. There is no run-time overhead, only collection-time overhead, so 99.9% of normal users are almost unaffected.
  5. Supports the major programming languages, including Objective-C, C/C++, Swift, and Rust.

Online Memory Graph collection and reporting process diagram

Memory Snapshot Collection

The Online Memory Graph collects memory snapshots in order to obtain all memory objects in the current running state and the reference relationships between them for later analysis. The following information is needed:

  • All memory nodes and their symbolic information (for example, the instance class name for OC/Swift/C++ objects, or the tag of a VM node that serves a particular purpose).
  • The reference relationships between nodes and their symbolic information (offsets or instance variable names). For OC/Swift member variables, the reference type must also be recorded.

Because collection happens while the program is running normally, it must be carried out in a relatively static environment so that taking the snapshot does not make the program misbehave. The snapshot collection process is therefore divided into the following steps:

  1. Suspend all non-collection threads.
  2. Obtain all memory nodes, the reference relationships between memory objects, and the corresponding auxiliary information.
  3. Write the data to a file.
  4. Resume the threads.
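The ordering of the four steps, and in particular the guarantee that threads are resumed even if collection fails, can be sketched as follows. This is a toy model: FakeThread and the collect/write callables are hypothetical stand-ins, and a real implementation would suspend native threads through the operating system.

```python
from contextlib import contextmanager


class FakeThread:
    """Stand-in for a real thread handle; it only tracks a suspended flag."""

    def __init__(self, name):
        self.name = name
        self.suspended = False


@contextmanager
def all_suspended(threads):
    """Suspend every thread on entry and always resume them on exit."""
    for t in threads:
        t.suspended = True
    try:
        yield
    finally:
        for t in threads:
            t.suspended = False


def take_snapshot(other_threads, collect, write):
    """Run the four steps: suspend, collect, write, resume."""
    with all_suspended(other_threads):  # step 1, and step 4 on exit
        data = collect()                # step 2
        write(data)                     # step 3
    return data


threads = [FakeThread("main"), FakeThread("network")]
written = []
take_snapshot(threads, lambda: {"nodes": 3}, written.append)
print(written, [t.suspended for t in threads])
```

Putting the resume step in a finally block is a design choice of this sketch: whatever goes wrong during collection, the App must not be left frozen.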

Below are some implementation details and trade-offs in the collection process.

Obtaining memory nodes

Using the Mach kernel function vm_region_recurse_64, we can traverse all VM regions in the process, and from the vm_region_submap_info_64 structure we obtain the following information:

  • The address and size of the region in the virtual address space.
  • The number of dirty and swapped memory pages, which represents the real physical memory usage of the VM region.
  • Whether the region is swappable. Text segments, shared mmap regions, and other memory that is read-only or can be swapped out at any time are of no interest.
  • user_tag, a user tag that gives more precise information about what the VM region is used for.

Most VM regions are treated as a single memory node that records only the start address, the dirty and swapped memory as its size, and its reference relationships with other nodes. The VM regions that hold the heap memory managed by libmalloc are different: they contain most of the Objective-C objects, C/C++ objects, and buffers used by business logic, for which more detailed reference information can be obtained. Their internal nodes and reference relationships therefore need to be processed separately.
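How a list of regions turns into single memory nodes can be sketched as below. The Region type and its field names are hypothetical simplifications of what the region info structure reports, and the page size is an assumed constant.

```python
from dataclasses import dataclass

PAGE_SIZE = 16384  # assumed page size in bytes


@dataclass
class Region:
    """Simplified view of one VM region (field names are hypothetical)."""
    address: int
    dirty_pages: int
    swapped_pages: int
    ignorable: bool       # read-only or swappable at any time (text, shared mmap)
    is_malloc_heap: bool  # managed by libmalloc and expanded into heap nodes elsewhere


def region_nodes(regions):
    """Yield (address, size_in_bytes) for every region kept as a single node."""
    for r in regions:
        if r.ignorable or r.is_malloc_heap:
            continue
        yield (r.address, (r.dirty_pages + r.swapped_pages) * PAGE_SIZE)


regions = [
    Region(0x1000_0000, 2, 1, ignorable=False, is_malloc_heap=False),
    Region(0x2000_0000, 8, 0, ignorable=True, is_malloc_heap=False),
    Region(0x3000_0000, 5, 0, ignorable=False, is_malloc_heap=True),
]
print(list(region_nodes(regions)))
```

Only the first region survives as a node here; the second is ignorable, and the third would be broken down into individual heap blocks instead.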

To avoid the performance cost of making a system call for every memory allocation, iOS libraries request a large block of memory at once and then allocate and manage it themselves, handing out pieces to the small objects that need dynamic allocation. This is heap memory, and most of the dynamic memory a program uses is managed by a heap. On iOS, libmalloc manages the memory allocated by most business logic. Some system libraries maintain their own separate heaps for performance reasons; CFNetwork is one example. Here we only look at the internal memory management state of libmalloc and ignore other possible heaps: their memory is recorded at VM region granularity, and the node references inside them are not analyzed.

malloc_get_all_zones returns all zones inside libmalloc. By iterating over the memory nodes managed by each zone, we obtain the pointers and sizes of all live memory nodes managed by libmalloc.

Symbolization

After obtaining all memory nodes, we need to find a more detailed type name for each node for later analysis. For VM region nodes, user_tag gives meaningful symbolic information. Heap memory objects include raw buffers and Objective-C/Swift and C++ objects. For the Objective-C/Swift and C++ objects, we try to recover more detailed information from the runtime data in memory.

Symbolizing Objective-C/Swift objects is relatively simple, and many third-party libraries have similar implementations. Swift is compatible with Objective-C in memory layout and also has an isa pointer, so the objc-related methods apply to objects of both languages. As long as the isa pointer is valid and the instance size of the object meets the conditions, the symbolization is considered correct.
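The two checks described above can be sketched as a lookup. This is a simplified model under stated assumptions: the class table is a plain dictionary built beforehand, the isa value is taken as the first pointer-sized word of the block, and "meets the conditions" is interpreted here as the block being at least as large as the class's instance size.

```python
def symbolize(isa, block_size, classes):
    """Return a class name for a heap block, or None if it does not look like an object.

    classes maps a class address to (name, instance_size); it is a
    hypothetical table cached from the runtime before collection starts.
    """
    entry = classes.get(isa)
    if entry is None:
        return None  # the isa value does not point at a known class
    name, instance_size = entry
    if block_size < instance_size:
        return None  # the block is too small to hold an instance (assumed rule)
    return name


# Hypothetical class table: address -> (class name, instance size in bytes).
classes = {0x1_0000_1000: ("UIImage", 48), 0x1_0000_2000: ("NSData", 32)}
print(symbolize(0x1_0000_1000, 64, classes))
print(symbolize(0xDEAD_BEEF, 64, classes))
```

Blocks that fail either check simply stay raw buffers, so a false negative costs some detail but never produces a wrong class name.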

C++ objects fall into two categories depending on whether they contain a virtual table. Objects without a virtual table cannot be handled because there is no runtime data for them.

For objects with a virtual table, our study of the Mach-O format and the C++ ABI documentation showed that the type information can be obtained from std::type_info and the following sections:

  • type_name string: the constant string for the class name, stored in the __const section of the __TEXT/__RODATA segment.
  • type_info: stored in the __const section of the __DATA/__DATA_CONST segment.
  • vtable: stored in the __const section of the __DATA/__DATA_CONST segment.

C++ instance and vtable reference diagram

iOS also has a special class of objects: CoreFoundation objects. Besides the familiar CFString and CFDictionary, many system libraries use CF objects as well, such as CGImage and CVObject. The Objective-C type obtained from their isa pointers is always __NSCFType. Because CoreFoundation supports registering and unregistering types at run time, we refine these types by locating, through reverse engineering, the slot array that CoreFoundation maintains and reading its data, which lets us safely obtain the exact type.

CoreFoundation type fetch

Construction of reference relationships

The core of the memory snapshot is rebuilding the reference relationships between memory nodes. In virtual memory, if one memory node references another, the referencing node stores a pointer whose value is an address inside the referenced node. Based on this fact, we designed the following scheme:

  1. For a memory node, traverse every position where a pointer could be stored and read the value A stored there.
  2. Search all collected nodes to determine whether A is the address of any byte inside some memory node. If it is, treat it as a reference.
  3. Repeat for all memory nodes.
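The three steps above can be sketched on simulated memory. The sketch assumes 64-bit little-endian pointers stored at 8-byte-aligned offsets, represents each node's content as a bytes object, and looks up targets with a binary search over sorted start addresses; all names are hypothetical.

```python
import bisect
import struct


def build_edges(nodes, memory):
    """Find references between nodes.

    nodes:  list of (start, size) address ranges.
    memory: dict mapping a node's start address to its bytes content.
    Returns a list of (from_start, to_start, offset_in_from_node).
    """
    nodes = sorted(nodes)
    starts = [start for start, _ in nodes]
    edges = []
    for start, _size in nodes:  # step 3: repeat for every node
        data = memory[start]
        # Step 1: read every pointer-sized, pointer-aligned slot.
        for offset in range(0, len(data) - 7, 8):
            (value,) = struct.unpack_from("<Q", data, offset)
            # Step 2: find the last node starting at or before value.
            i = bisect.bisect_right(starts, value) - 1
            if i >= 0:
                target_start, target_size = nodes[i]
                if value < target_start + target_size:
                    edges.append((start, target_start, offset))
    return edges


# Node A stores one pointer into the middle of node B and one null word.
nodes = [(0x1000, 16), (0x2000, 32)]
memory = {0x1000: struct.pack("<QQ", 0x2008, 0), 0x2000: bytes(32)}
print(build_edges(nodes, memory))
```

Because any value that happens to fall inside a node counts as a reference, this kind of scan is conservative: it can report an edge for an integer that merely looks like a pointer.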

For certain memory areas, namely stack memory and Objective-C/Swift heap memory, we do some extra processing to obtain more detailed information for troubleshooting.

Stack memory also exists as VM regions and stores temporary variables and TLS data. Its reference information helps troubleshoot memory problems caused by autoreleasepool. Because a thread does not use all of its stack memory, we determine the range currently in use from the registers and the stack memory, and exclude the invalid references that would come from the unused part.
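Narrowing a stack region to its in-use part is a small calculation. The sketch below assumes a downward-growing stack, so live data lies between the current stack pointer and the high end of the region; the function and parameter names are hypothetical.

```python
def used_stack_range(region_start, region_end, stack_pointer):
    """Return the (low, high) address range of the stack that is in use.

    Assumes the stack grows downward: everything below the stack pointer
    is unused and should not be scanned for references.
    """
    if not (region_start <= stack_pointer <= region_end):
        raise ValueError("stack pointer is outside the stack region")
    return (stack_pointer, region_end)


low, high = used_stack_range(0x16F000000, 0x16F080000, 0x16F07F000)
print(hex(low), hex(high), high - low, "bytes in use")
```

Scanning only this range avoids stale pointers left behind by frames that have already returned.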

Stack usage range

For Objective-C/Swift objects, the runtime carries additional information, so we can obtain whether each ivar is a strong or weak reference and the ivar name, which helps in analyzing problems. If the offset of a discovered reference equals the offset of an ivar, the reference is considered to be that ivar, and the ivar-related information is attached to it.
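Matching a reference to an ivar by offset can be sketched as a simple lookup. The ivar table below is hypothetical; in practice it would be read from the runtime for the object's class.

```python
def annotate_reference(edge_offset, ivars):
    """Return (ivar_name, reference_kind) if an ivar sits at edge_offset, else None.

    ivars is a list of (offset, name, kind) tuples, where kind is
    "strong" or "weak".
    """
    for offset, name, kind in ivars:
        if offset == edge_offset:
            return (name, kind)
    return None


# Hypothetical ivar layout of an image picker class.
ivars = [(8, "_images", "strong"), (16, "_delegate", "weak")]
print(annotate_reference(8, ivars))
print(annotate_reference(24, ivars))
```

A reference whose offset matches no ivar is still kept as an edge; it just carries the raw offset instead of a name.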

Data reporting policy

When the App's memory reaches the configured threshold, we collect the App's memory nodes and reference relationships at that moment and upload them to the server for analysis. This accurately reflects the App's memory state at the time and helps locate the problem. The general process is as follows:

The overall online Memory Graph workflow

The figure above shows the complete workflow of the Online Memory Graph module, which mainly includes:

  1. A background thread periodically checks memory usage and triggers memory analysis when usage exceeds the threshold.
  2. After analysis, the data is persisted and waits for the next report.
  3. The raw file is compressed and packaged.
  4. The backend is asked for permission to report. Because each file is large, the backend may apply rate limiting.
  5. After a successful report, the file is deleted. After a failed report, the upload is retried; the file is deleted after at most three attempts to avoid occupying too much disk space.
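The retry-then-delete rule in the last step can be sketched as follows. The upload and delete callables are hypothetical placeholders for the real network and file operations, and whether the retries happen within one launch or across launches is left open here.

```python
def report_snapshot(upload, delete, max_attempts=3):
    """Upload a snapshot, retrying on failure, and delete the file either way.

    upload() returns True on success; delete() removes the local file.
    Returns True if some attempt succeeded.
    """
    for _ in range(max_attempts):
        if upload():
            delete()  # reported successfully, the local copy is no longer needed
            return True
    delete()  # give up after max_attempts so the file does not fill the disk
    return False


results = iter([False, True])  # first attempt fails, second succeeds
deleted = []
print(report_snapshot(lambda: next(results), lambda: deleted.append("snapshot.zip")))
print(deleted)
```

Either way the file is removed exactly once, which keeps the disk footprint bounded no matter how unreliable the network is.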

Backend analysis

Here is an example of a Memory Graph detail page for a single sample on the ByteDance monitoring platform:

An overview of the Online Memory Graph details page

We can see that this user's memory usage is close to 900 MB. Our analysis usually proceeds as follows:

  1. In the class list, look for the most suspicious class by object count and by the memory its objects occupy.
  2. Pick an instance at random from the object list and trace its references back through its parent nodes to find the reference path you consider most suspicious.
  3. Click Add Tag in the upper right corner of the reference path module to count how many objects of the same class share the currently selected reference path.
  4. Once the problematic reference path is confirmed, determine which business module is responsible.
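The counting in step 3 can be illustrated with a small sketch. This is not the platform's implementation, only one plausible way to compute such a statistic: each instance is mapped to its reference path, and the share of instances with the selected path is returned. Apart from TTImagePickController, which appears in the case discussed here, the class names are made up.

```python
from collections import Counter


def path_share(paths_by_instance, selected_path):
    """Return the fraction of instances whose reference path equals selected_path."""
    if not paths_by_instance:
        return 0.0
    counts = Counter(tuple(path) for path in paths_by_instance.values())
    return counts[tuple(selected_path)] / len(paths_by_instance)


# Instance address -> reference path from a root down to the instance's holder.
paths = {
    0xA0: ["TTImagePickController", "NSArray"],
    0xA1: ["TTImagePickController", "NSArray"],
    0xA2: ["TTImagePickController", "NSArray"],
    0xA3: ["SomeCache", "NSDictionary"],
}
print(path_share(paths, ["TTImagePickController", "NSArray"]))
```

A share close to 1 suggests that a single holder keeps nearly all instances of the class alive, which is a strong hint about where to look.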

Statistics on the frequency of occurrences of the current reference path in objects of the same type

Analyzing the reference path in the figure above, we find that all the images are ultimately held by the class TTImagePickController. Further investigation showed that the image picker module loads all the images in the user's album into memory at once, so the problem occurs in extreme cases.

Overall performance and stability

Optimization policy on the collection side

The memory space of a process generally contains hundreds of thousands to tens of millions of memory nodes, and the running state of the program changes rapidly, so the collection process is under heavy pressure in both performance and stability.

We made the following performance optimizations:

  • Collected data is written through an mmap mapping, in a custom binary format that guarantees sequential reads and writes.
  • Memory nodes are sorted in advance, and binary search is used when building reference edges. Some invalid memory addresses can be pruned quickly up front with bit operations.
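The first optimization can be illustrated with a small writer that stores fixed-size records through an mmap mapping. The record layout (two little-endian 64-bit integers for address and size) is a hypothetical format chosen for this sketch, not the format the module actually uses.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<QQ")  # hypothetical record: node address, node size


def write_nodes(path, nodes):
    """Write (address, size) records sequentially through an mmap mapping."""
    total = RECORD.size * len(nodes)
    with open(path, "w+b") as f:
        if total == 0:
            return  # an empty file cannot be mapped
        f.truncate(total)  # size the file before mapping it
        with mmap.mmap(f.fileno(), total) as m:
            for i, (address, size) in enumerate(nodes):
                RECORD.pack_into(m, i * RECORD.size, address, size)
            m.flush()


def read_nodes(path):
    """Read the records back in the order they were written."""
    with open(path, "rb") as f:
        data = f.read()
    return [RECORD.unpack_from(data, i) for i in range(0, len(data), RECORD.size)]


with tempfile.TemporaryDirectory() as tmp:
    snapshot = os.path.join(tmp, "nodes.bin")
    write_nodes(snapshot, [(0x1000, 16), (0x2000, 32)])
    print(read_nodes(snapshot))
```

Fixed-size records keep the writes strictly sequential, and the file can later be read back or mapped without any parsing step.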

For stability, we mainly consider the following points:

  • Deadlock

Because the state of the Objective-C runtime locks cannot be guaranteed, we cache the information that must be obtained through runtime APIs before suspending the threads. To make sure the libmalloc lock is in a safe state, we check it after suspending the threads; if it is held, we resume the threads and try suspending again, which avoids a deadlock on the heap.
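The suspend, check, resume, and retry loop described above can be sketched as follows. The three callables are hypothetical stand-ins for the real thread and lock operations, and the retry limit is an assumption of this sketch; the text does not say whether the number of retries is bounded.

```python
def suspend_safely(suspend_all, resume_all, malloc_lock_held, max_tries=10):
    """Suspend threads, retrying until the allocator lock is free.

    Returns True when threads are left suspended with the lock free,
    or False (threads resumed) if no attempt found the lock free.
    """
    for _ in range(max_tries):
        suspend_all()
        if not malloc_lock_held():
            return True
        resume_all()  # a suspended thread holds the lock; let it run and release it
    return False


lock_states = iter([True, True, False])  # lock is held on the first two attempts
log = []
ok = suspend_safely(
    lambda: log.append("suspend"),
    lambda: log.append("resume"),
    lambda: next(lock_states),
)
print(ok, log)
```

If the collector allocated memory while a suspended thread held the allocator lock, it would wait forever, which is why the check comes before any collection work.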

  • Illegal memory access

After all other threads have been suspended, we use a separate malloc_zone to manage the memory used by the collection module, which reduces the impact of the memory that collection itself allocates.

Performance loss

All threads must be suspended during collection, which the user perceives as a stall, so the module still has some performance cost. In our tests on an iPhone 8 Plus, when the App occupies 1 GB of memory, collection takes 1.5-2 seconds, uses an extra 10-20 MB of memory, and produces a zip file of 5-20 MB.

To strictly control this cost, the Online Memory Graph module applies the following policies, which keep it from triggering so often that it disturbs normal use and from consuming too much memory and disk space itself:

Performance loss control strategy

Stability

The solution has been running stably in production across ByteDance's products for more than six months, which has validated its stability and success rate. A single collection currently succeeds 99.5% of the time; the remaining failures are mostly cases where memory was so tight that the OOM occurred before collection finished. Considering that in most applications fewer than one user in a thousand triggers collection, such failures are very low-probability events.

How to try it

The Online Memory Graph is now available to external developers on the APMInsight platform of ByteDance's Volcano Engine.

APMInsight's technology has been refined in many applications, such as Toutiao, Douyin and Xigua Video, into a complete solution that can locate a wide range of problems on mobile, in browsers, and in mini programs. Besides supporting the analysis of basic problems such as crashes, errors, stutters and network issues, it also offers many features related to application launch, page browsing, and memory optimization. Most capabilities are currently open in the demo, and you are welcome to register an account and try it: www.volcengine.cn/product/apm…

Finally, the editor has also secured a special benefit for readers: 5 lucky users will be drawn for one year of free use. If you are interested, scan the code to register, and be sure to fill in complete and truthful application information, otherwise the later allocation of free places may be affected.

Join us

This technical solution was built through close cooperation between the ByteDance APM team and the Douyin basic technology team. If you are interested in either team, you are welcome to join us:

APM China

The ByteDance APM team is committed to improving the performance and stability of all products across the group. The technology stack covers iOS/Android/Flutter/Web/Hybrid/PC/games/mini programs, and the work includes, but is not limited to, online monitoring, online operations, in-depth optimization, and offline regression prevention. In the long run we hope to give the industry more constructive methods for problem discovery and in-depth optimization.

You are welcome to join us and work together toward the ultimate goal of being "faster, more stable, more economical, higher quality". We are hiring in both Beijing and Shenzhen. Please send your resume to [email protected]. Email title: Name - Years of work - APM Medium - Technology stack direction (e.g. iOS/Android/Web/backend).

Douyin basic technology

We are the team responsible for research and development of the Douyin client's basic capabilities and for exploring new technology. We focus on engineering/business architecture, R&D tools, build systems and related areas, supporting rapid business iteration while ensuring the R&D efficiency and engineering quality of a very large team. We keep exploring performance and stability and strive to give hundreds of millions of users around the world the best possible basic experience.

If you are passionate about technology, you are welcome to join the Douyin basic technology team and help build a global App with hundreds of millions of users. We are currently hiring in Shanghai, Beijing, Hangzhou and Shenzhen. For internal referral, please contact [email protected]; email title: Name - Working years - Douyin - Basic Technology - iOS/Android.

References

[1] zhuanlan.zhihu.com/p/49829766

[2] satanwoo.github.io/2017/10/18/…

[3] jinxuebin.cn/2019/07/OOM…

[4] engineering.fb.com/ios/reducin…
