Abstract

Raphael is a native memory leak detection tool developed by the Watermelon Video basic technology team. It is widely used to track down native memory leaks in ByteDance's major apps, with significant benefits. The tool is now open source, and this article analyzes Raphael in detail through its principles, design, and practice.

Background

Memory problems on the Android platform have always been both a focus and a pain point of performance optimization and stability governance. Java heap memory is relatively easy to locate and manage thanks to mature tools and methodologies, supplemented by hprof snapshots. Native memory, however, has always lacked stable and efficient tools. Malloc Debug, the only built-in option, not only fails to meet performance and stability needs, but also has compatibility problems across Android versions.

The status quo

In fact, there is no shortage of excellent tools for native memory leak governance. The main tools known for investigating native memory leaks are LeakTracer, MTrace, MemWatch, Valgrind-Memcheck, TCMalloc, LeakSanitizer, and more. However, due to the unique characteristics of the Android platform, these tools are either incompatible or expensive to integrate, which makes them difficult to adopt on Android. They all rest on the same principle: first proxy the memory allocation/release functions (such as malloc/calloc/realloc/memalign/free), then unwind the call stack at each allocation, and finally use a record cache to filter out the allocations that were never released. The main differences between these tools therefore lie in proxy implementation, stack backtracking, and cache management. By proxy implementation, they can be roughly divided into hook-based and LD_PRELOAD-based tools, represented by malloc debug and LeakTracer respectively.

| Tool | Principle |
|---|---|
| malloc debug | It works by adding a shim layer that replaces the normal allocation calls |
| LeakTracer | Implements the proxy through the LD_PRELOAD mechanism, or by integrating its source code into the target library and overriding the allocation functions |
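
The shared principle described above (proxy the allocation functions, capture the stack, drop the record on free, report what remains) can be modeled in a few lines. This is only a Python sketch of the bookkeeping, not how any of these native tools is implemented: the class and function names are hypothetical, and Python frames stand in for an unwound native call stack.

```python
import traceback


class LeakTracker:
    """Toy model of the shared principle: record on allocate, erase on free."""

    def __init__(self):
        self.records = {}  # address -> (size, call stack captured at allocation)

    def on_alloc(self, address, size):
        # A real tool unwinds native frames here; this model uses Python frames.
        stack = tuple(f.name for f in traceback.extract_stack()[:-1])
        self.records[address] = (size, stack)

    def on_free(self, address):
        # Freed memory is no longer a leak candidate.
        self.records.pop(address, None)

    def unreleased(self):
        # Whatever is still recorded was allocated but never released.
        return dict(self.records)


def load_texture(tracker):
    tracker.on_alloc(0x1000, 4096)   # never freed: stays in the report


def parse_header(tracker):
    tracker.on_alloc(0x2000, 64)
    tracker.on_free(0x2000)          # freed: disappears from the report


tracker = LeakTracker()
load_texture(tracker)
parse_header(tracker)
leaks = tracker.unreleased()
```

After the two calls, `leaks` holds only the 4096-byte allocation, and the innermost frame in its recorded stack is `load_texture`.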

malloc debug

Malloc Debug is the native memory debugging tool built into the Android operating system (see the official introduction to native memory debugging). Although it requires no additional integration code, both the way it is enabled and its core functions depend on the Android version.

| API level | Official description |
|---|---|
| API <= 18 | Note: malloc debug was full of bugs and was not fully functional until API level 19. |
| API 19-23 | This isn't very useful since it tends to display a lot of false positives because many programs do not free everything before terminating. |
| API 24-26 | You must be able to set system properties using the setprop command from the shell. This requires the ability to run as root on the device. |
| API 27-28 | App developers should check the NDK documentation about wrap.sh for the best way to use malloc debug in Android O or later on non-rooted devices. |

When we tried using Malloc Debug (configured through wrap.sh) to monitor Watermelon Video App offline, cold startup on a device that normally starts in under 1 s (Pixel 2, Android 10) stretched to more than 11 s. In normal use, scrolling stuttered noticeably, page switches took unacceptably long, and the overall experience while monitoring was very poor. In addition, with Malloc Debug enabled, Watermelon Video crashed reproducibly during stack backtracing (the stack is as follows; "Chronicles of Libunwind LLVM" in the references analyzes the cause).

LeakTracer

LeakTracer is another well-known memory leak monitoring tool. It works as follows: through the LD_PRELOAD mechanism, a library that defines proxy functions with the same names as malloc/calloc/realloc/memalign/free is loaded first, so that it globally proxies the application layer's memory allocation and release; the call stack is then traced by unwinding, and suspected memory leaks are filtered out. On the Android platform, LD_PRELOAD is severely restricted. LeakTracer also has no standalone unwind implementation and relies on the system's unwind capability, so it runs into the same stack frame compatibility issues as Malloc Debug. If we instead integrate LeakTracer into the target SO and implement the proxy by overriding, we can only intercept explicit memory allocation/release inside that SO, not allocations made in other SOs or across SOs. The same is true of native instrumentation: it can only monitor simple, local memory leaks and cannot monitor memory usage globally.

Based on the above analysis and our integration experience, these memory leak monitoring tools show three typical problems when actually used on the Android platform:

  • Cumbersome setup: wrap.sh, root permission, or setprop must be configured, depending on the Android version
  • Compatibility issues: the existing unwind libraries have serious compatibility issues, and libunwind_llvm cannot correctly backtrace GNU-compiled stack frames
  • Performance problems: the officially reported performance loss of Malloc Debug is more than 10x, and in our measurements Watermelon Video was unusable on mid-range and high-end devices once it was enabled

Our needs

Watermelon Video App is a medium-to-large application with a large amount of native code, covering video playback, special-effects shooting, video editing, P2P acceleration, and more. Behind each native module is a professional team iterating at high speed, and users spend more than 100 minutes in the app per day on average, so solving the native memory problems of Watermelon Video App is very difficult. In practice, simple memory leaks are relatively rare. Most problems are memory usage problems caused by unreasonable business logic, which can only be found by a tool that monitors the app while it is actually running. This in turn raises the requirements on the tool's performance and stability.

Online native memory problems almost always surface as virtual memory hitting its ceiling. In Watermelon Video App, besides the modules above, there are other major consumers of virtual memory, such as threads, WebView, Flutter, hardware acceleration, and video memory. In fact, memory allocated through malloc/calloc/realloc/memalign usually accounts for a smaller share of the virtual address space than memory allocated directly through mmap/mmap64. Since memory problems usually show up as virtual memory exhaustion, the cause can only be identified accurately by capturing as many kinds of memory consumption as possible, so that the monitored total approaches the virtual memory upper limit as closely as possible. A tool like malloc debug, which monitors only malloc/calloc/realloc/memalign/free, therefore cannot meet the needs of memory governance; a monitoring tool has to cover as many forms of memory allocation as possible, including mmap/mmap64/munmap.

Based on the above analysis, Watermelon Video App and other ByteDance apps have the following requirements for a universal native memory leak monitoring tool:

  • Integration: does not depend on the Android version, does not need root, and intrudes on business code as little as possible
  • Stability: causes no stability problems that affect the business, and is fit for online use
  • Performance: causes no obvious performance problems, and meets the standard for online use
  • Monitoring scope: not limited to malloc/calloc/realloc/memalign/free, and covers at least mmap/mmap64/munmap

Raphael core design

As the previous analysis shows, a complete native memory leak monitoring tool consists of three main parts: proxy implementation, stack backtracking, and cache management. Proxy implementation is the key to solving the integration problem on Android, stack backtracking is the core of performance and stability, and cache logic also directly affects performance and stability to some extent. The following sections introduce Raphael's core design from four aspects: proxy implementation, stack backtracking, cache management, and monitoring scope.

Proxy implementation

Since wrap.sh and LD_PRELOAD are not generally usable on Android, they are excluded first. Malloc hook is also abandoned, because it only proxies malloc/calloc/realloc/free and cannot cover mmap/mmap64/munmap. However, inspired by the malloc hook implementation, we can achieve the same effect with inline hook or PLT hook tools, of which android-inline-hook and xHook are the main representatives.

| Approach | Principle | Advantages | Disadvantages |
|---|---|---|---|
| malloc hook | Natively supported, controlled by a switch | No performance/stability issues | Interfaces are not exposed; mmap/mmap64 is not supported |
| LD_PRELOAD | A function with the same name takes precedence | No performance/stability issues | Difficult to enable on Android; not universal |
| PLT hook | Modifies entries in the global offset table (GOT) | Single-point hook, mature and reliable | Low efficiency; the incremental SO hook problem must be solved |
| inline hook | Inserts a jump instruction into the target function | Global hook, highest efficiency | Compatibility problems and instruction fix-up problems |

xHook is an excellent representative of PLT hook tools, and its stability reaches the online standard. However, because its implementation relies on regular-expression matching, hooking takes noticeably longer when many SOs or functions are hooked. In addition, out of the box it only hooks SOs that are already loaded and does nothing for SOs loaded later, so long-running process-level monitoring has to solve the incremental SO hook problem. This hook method is nevertheless very well suited to targeted monitoring of specific SOs.

In contrast to a PLT hook, an inline hook inserts a jump instruction directly at the head of the target function, so what it hooks is the function's final implementation. An inline hook, however, has to suspend threads that might be executing the function and then fix up the overwritten instructions. Instruction fix-up is the pain point of inline hooks: many existing implementations either do not fix up the function at all or have problems of varying severity in their fix-up.
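
To make the incremental SO hook problem concrete, here is a small Python model of a PLT hook. It is a conceptual sketch only: real PLT hooking patches GOT slots in process memory, and the library names and helper functions below are hypothetical. Each modeled library owns its own table of imported symbols, so patching the libraries loaded at hook time leaves a later-loaded library untouched until it is hooked as well.

```python
def real_malloc(size):
    return ("real", size)


def proxy_malloc(size):
    return ("proxy", size)


class Library:
    """Stand-in for a loaded SO: its GOT maps imported symbols to targets."""

    def __init__(self, name):
        self.name = name
        self.got = {"malloc": real_malloc}

    def call_malloc(self, size):
        # Calls made through the PLT resolve via this library's own GOT entry.
        return self.got["malloc"](size)


def plt_hook(libraries, symbol, proxy):
    # Rewrite the GOT slot in every library passed in (those loaded right now).
    for lib in libraries:
        lib.got[symbol] = proxy


loaded = [Library("libplayer.so"), Library("libeffect.so")]
plt_hook(loaded, "malloc", proxy_malloc)

late = Library("libp2p.so")                # loaded after the hook was installed
loaded.append(late)
before_rehook = late.call_malloc(16)       # still reaches the real malloc

plt_hook([late], "malloc", proxy_malloc)   # incremental hook for the new SO
after_rehook = late.call_malloc(16)        # now reaches the proxy
```

An inline hook avoids this bookkeeping because it patches the one shared function body instead of each caller's table, at the cost of the instruction fix-up described above.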

Raphael used xHook as its proxy in early validation versions. Later, to support long-running process-level monitoring and cover more business scenarios, Raphael used android-inline-hook to solve the incremental SO hook problem and kept xHook for targeted monitoring. To further improve performance and stability, the latest version of Raphael has switched to ByteHook (ByteDance's own PLT hook tool, which handles incremental SO hooks automatically).

Stack traceback

An object or a block of memory can be located either through its references/dependencies or through the stack at the time it was created/allocated. Because Java heap memory is clearly organized and has explicit dependencies, memory leaks there can be analyzed statically through those dependencies. Native heap memory, in contrast, has obscure dependencies/references and lacks the clear organization of the Java heap, so it cannot be analyzed statically in the same way and has to rely on the stack at allocation time. Stack unwinding is the common way to obtain the call stack in the native layer. It is an indispensable core of native memory leak monitoring tools, and also the bottleneck of their performance and stability. The following sections describe Raphael's work on stack backtracking from three aspects: choosing a stack backtracking tool, limiting the backtracking frequency, and reducing useless backtracking.

Stack backtracking tool selection

The common 32-bit stack backtracking libraries on the Android platform include libunwind_llvm, libunwind (nongnu), libgcc_s, libudf, libbacktrace, and libunwindstack. The following is a simple comparison of three of them (platform: Pixel 2, Android 10; performance: total time measured in a demo for backtracking 16 stack frames; compatibility: ByteDance's long-term optimization and governance practice across multiple apps).

| Unwind library | Performance | Compatibility |
|---|---|---|
| libunwind (nongnu) | Total backtracking time above 9.0 ms; worst performance | No stability problems found; best compatibility and highest backtracking success rate |
| libunwind_llvm | Total backtracking time below 0.8 ms; good performance | Compatibility issues with GNU-compiled stack frames; lowest backtracking success rate |
| libudf | Total backtracking time below 0.6 ms; best performance | No stability problems found; good compatibility and high backtracking success rate |

Stack backtracking touches on many things, and it is very hard to build, in a short time, a 32-bit stack backtracking tool that excels in stability, backtracking performance, and success rate all at once. To verify the approach quickly and solve real-device problems, early versions of Raphael used libunwind_llvm. Later versions combined libunwind_llvm with libunwind (nongnu): libunwind_llvm provides the backtracking performance, and Raphael falls back to libunwind (nongnu) when the backtrace depth is below 2 frames, to protect the backtracking success rate. The latest version uses libudf, which delivers both performance and backtracking success rate. By comparison, FP-based stack backtracking on 64-bit basically meets the performance and stability requirements and is not discussed here. Raphael has also been designed with enough extensibility to switch easily to other, better stack backtracking implementations.
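
The fallback strategy described above (fast unwinder first, robust unwinder when the result is shallower than 2 frames) can be sketched as follows. This is a Python model under stated assumptions: the unwinders are hypothetical stand-in functions that return lists of frame names, not bindings to libunwind_llvm or libunwind (nongnu).

```python
def unwind_with_fallback(fast_unwind, robust_unwind, max_depth=16, min_depth=2):
    """Try the fast unwinder; retry with the robust one if the result is too shallow."""
    frames = fast_unwind(max_depth)
    if len(frames) < min_depth:
        frames = robust_unwind(max_depth)
    return frames[:max_depth]


# Hypothetical unwinders standing in for real libraries.
def fast_ok(depth):
    return ["malloc_proxy", "decode_frame", "player_loop"][:depth]


def fast_broken(depth):
    return ["malloc_proxy"][:depth]       # gives up after one frame


def robust(depth):
    return ["malloc_proxy", "gnu_built_fn", "player_loop"][:depth]


good = unwind_with_fallback(fast_ok, robust)
recovered = unwind_with_fallback(fast_broken, robust)
```

The design keeps the slow path off the common case: the robust unwinder only runs when the fast one has visibly failed.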

Limit stack backtracking frequency

Even with libudf, backtracking 16 stack frames in the demo takes about 0.6 ms on average, so the monitoring tool's impact on app performance is noticeable in real use. Monitoring performance can be improved not only by optimizing stack backtracking itself, but also by backtracking less often. In the optimization and governance practice of Watermelon Video App, we found that allocations smaller than 1024 bytes account for more than 70% of all allocations in most scenarios, yet the native memory ceiling problems encountered online are rarely caused by leaks of small allocations. Monitoring small allocations therefore contributes little to solving online native memory ceiling problems. Even when small allocations are the cause, it takes a high-frequency, reliably reproducible scenario to reach the ceiling, and this kind of problem is usually covered in online single-test (targeted monitoring) scenarios. Based on this, Raphael controls the backtracking frequency with a memory size threshold, which greatly reduces the performance cost of stack backtracking.
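
A rough illustration of the threshold idea, as a Python sketch: the workload mix below is hypothetical (70% small allocations, 30% large ones), and the 0.6 ms cost per backtrace is taken from the demo figure quoted above as a simplifying assumption.

```python
def should_backtrace(size, threshold=1024):
    """Only allocations at or above the threshold pay for a stack backtrace."""
    return size >= threshold


def backtrace_cost_ms(sizes, threshold, cost_per_backtrace_ms=0.6):
    traced = sum(1 for s in sizes if should_backtrace(s, threshold))
    return traced, traced * cost_per_backtrace_ms


# Hypothetical workload: 700 small allocations and 300 large ones.
sizes = [64] * 700 + [4096] * 300
traced_all, cost_all = backtrace_cost_ms(sizes, threshold=0)
traced_big, cost_big = backtrace_cost_ms(sizes, threshold=1024)
```

With the threshold at 1024 bytes, only 300 of the 1000 allocations are backtraced, so the modeled backtracking cost drops from about 600 ms to about 180 ms.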

Reduce useless stack backtracking

Because of how the proxy and the stack backtracking mechanism work, the path from the proxy function entry to the point where backtracking starts contains several layers of function calls that have nothing to do with the allocation stack. These layers end up in the captured call stack (the red part in the figure below), and backtracking this useless call chain on every memory allocation is a pure performance loss. The intuitive solution is to reduce or even eliminate the irrelevant frames, at the code level by reducing the call depth between the proxy entry and the function that starts the backtrace. Inlining is a straightforward way to do this; alternatively, the context data that the backtrace starts from can be constructed directly at the proxy entry.

Cache management

As an important part of native memory monitoring, cache management plays a crucial role in the performance of the whole tool. Malloc Debug and LeakTracer, for example, both use the allocated memory address as the key, hash it, and store the record, and both synchronize cache updates with a global lock. They differ in that Malloc Debug aggregates allocation records with identical call chains by stack and allocates its cache units dynamically through malloc, whereas LeakTracer does not aggregate by stack, pre-allocates part of its storage units, and allocates dynamically only when the cache runs low. Based on this analysis and on test results, Malloc Debug performs much worse than LeakTracer because of stack aggregation and dynamic cache allocation.

Comparing the source code of Malloc Debug and LeakTracer also shows that stack aggregation at runtime is completely unnecessary. If the monitoring threshold is limited, the cache space and the number of cache units can be bounded, so no dynamic allocation is needed and its performance cost is avoided. In addition, because native memory is allocated and released at high frequency, a global lock hurts overall performance. Since records are stored by hashing the key, a global lock is not required.

Raphael pre-allocates a fixed-size cache. Besides a crash caused by memory hitting the ceiling, running out of cache units early is also treated as a sign of a memory leak. The reason is that a 32-bit process usually has a 4 GB virtual memory limit, which is relatively easy to reach in normal operation, whereas a 64-bit process has a very large virtual address space and rarely hits the virtual memory ceiling but is much more likely to run out of physical memory, which is basically the opposite of the 32-bit case. By controlling the VmPeak threshold and the headroom of cache units, memory leak data can be captured effectively, which makes a stable and reliable pipeline for automatically monitoring and consuming memory leak data possible.
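
The cache design discussed above (fixed capacity reserved up front, records placed by hashing the address, no single global lock around the table) might look roughly like the following Python model. The per-bucket locks and the small slot counter are assumptions made for illustration; the article does not say how Raphael synchronizes its buckets, and all names are hypothetical.

```python
import threading


class RecordCache:
    """Fixed-capacity allocation cache: bounded slots, one lock per bucket."""

    def __init__(self, capacity=4, buckets=2):
        self.free_slots = capacity                # slots reserved up front
        self.slot_lock = threading.Lock()         # guards only the slot counter
        self.buckets = [dict() for _ in range(buckets)]
        self.locks = [threading.Lock() for _ in range(buckets)]

    def _index(self, address):
        return (address >> 4) % len(self.buckets)  # drop alignment bits, then hash

    def insert(self, address, size):
        with self.slot_lock:
            if self.free_slots == 0:
                return False                      # exhausted: treat as a leak signal
            self.free_slots -= 1
        i = self._index(address)
        with self.locks[i]:
            self.buckets[i][address] = size
        return True

    def remove(self, address):
        i = self._index(address)
        with self.locks[i]:
            found = self.buckets[i].pop(address, None) is not None
        if found:
            with self.slot_lock:
                self.free_slots += 1              # return the slot to the pool
        return found


cache = RecordCache(capacity=2)
first = cache.insert(0x1000, 2048)
second = cache.insert(0x1010, 4096)
third = cache.insert(0x1020, 8192)    # no slot left
freed = cache.remove(0x1000)
retry = cache.insert(0x1020, 8192)    # a slot was returned, so this succeeds
```

A failed `insert` corresponds to the "cache units exhausted early" condition, which the text treats as evidence of a leak rather than as an error to be hidden by growing the cache.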

Monitoring scope

As the earlier analysis shows, monitoring only malloc/calloc/realloc/memalign/free cannot meet governance needs. The memory these functions allocate usually makes up a small share of the virtual address space, and the common large consumers (threads, WebView, Flutter, hardware acceleration, video memory, and so on) do not allocate through them. To attribute native memory ceiling problems on Android accurately, the monitored total needs to approach the virtual memory upper limit as closely as possible, which requires monitoring as many forms of memory allocation as possible.

Memory operations on Android are mainly malloc/calloc/realloc/memalign/free and mmap/mmap64/munmap. Compared with monitoring malloc/calloc/realloc/memalign/free, monitoring mmap/mmap64/munmap differs in two ways. The first is thread stack release: a thread's stack is allocated through mmap/mmap64 when the thread is created, but it is not necessarily released through an explicit munmap call. The second is reentrancy: large allocations made through malloc/calloc/realloc/memalign are usually served by mmap/mmap64 underneath, so monitoring both kinds of interface at the same time causes reentrancy.

Stack memory release

Thread stack memory is released in one of two ways, through void pthread_exit(void* return_value) or through int pthread_join(pthread_t t, void** return_value):

  • In the body of void pthread_exit(void* return_value), when the thread state is THREAD_DETACHED, the stack is released directly through void _exit_with_stack_teardown(void* stack, size_t sz)

  • int pthread_join(pthread_t t, void** return_value) goes through __pthread_internal_remove_and_free, and the stack is finally released through munmap in __pthread_internal_free

As a result, a stack released through munmap can be monitored, while a stack released through _exit_with_stack_teardown cannot be intercepted. Raphael therefore proxies void pthread_exit(void*) and checks whether the thread state is THREAD_DETACHED. If it is, the corresponding record is removed from the monitor.
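
The two release paths and the extra pthread_exit check can be modeled as simple bookkeeping. This Python sketch only mirrors the logic described above; the class, the method names, and the addresses are hypothetical, and the `detached` flag stands in for the THREAD_DETACHED state check.

```python
class StackMonitor:
    """Tracks thread stacks that were allocated through mmap."""

    def __init__(self):
        self.stacks = {}      # thread id -> (stack address, size)

    def on_thread_stack_mmap(self, tid, address, size):
        self.stacks[tid] = (address, size)

    def on_munmap(self, address):
        # Joined threads: the stack goes through munmap, which the proxy sees.
        for tid, (addr, _) in list(self.stacks.items()):
            if addr == address:
                del self.stacks[tid]

    def on_pthread_exit(self, tid, detached):
        # Detached threads free their stack without a visible munmap call,
        # so the record has to be dropped here instead.
        if detached:
            self.stacks.pop(tid, None)


monitor = StackMonitor()
monitor.on_thread_stack_mmap(1, 0x7000_0000, 1 << 20)
monitor.on_thread_stack_mmap(2, 0x7100_0000, 1 << 20)
monitor.on_thread_stack_mmap(3, 0x7200_0000, 1 << 20)
monitor.on_pthread_exit(1, detached=True)     # record removed at exit
monitor.on_pthread_exit(2, detached=False)    # joinable: wait for munmap
monitor.on_munmap(0x7100_0000)                # pthread_join path
```

Only thread 3, which is still running, keeps a record. Without the `on_pthread_exit` check, the detached thread's stack would be reported as a leak forever.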

Reentrancy problem

The following figure shows a typical reentrancy scenario, in which the upper-level malloc call ends up calling mmap. This happens whenever both kinds of memory interface are monitored at the same time. One challenge it poses is cache management: the same memory should have only one record in the cache, and maintaining two records is too costly in both logic and performance. In addition, the stack from malloc down to mmap is fixed and is useless for analyzing memory leaks, because what matters is the stack above malloc.

The solution to the reentrancy problem is straightforward: if the mmap/mmap64 proxy detects that it was reached from malloc/calloc/realloc, it ignores the allocation. This not only solves the reentrancy problem but also avoids an unnecessary stack backtrace. Since the Android platform does not support thread-local storage (TLS), this per-thread check can only be implemented with pthread_setspecific and pthread_getspecific.
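
A minimal model of the reentrancy guard, in Python. Here `threading.local` stands in for a per-thread slot managed with pthread_setspecific and pthread_getspecific, and the 128 KB cutoff at which the modeled malloc falls through to mmap is an assumption chosen for illustration, not a value from the article.

```python
import threading

_state = threading.local()   # stand-in for a pthread_setspecific/getspecific slot
records = []


def proxy_mmap(size):
    if getattr(_state, "in_malloc", False):
        return "mmap"                 # reached from malloc: skip, malloc records it
    records.append(("mmap", size))
    return "mmap"


def proxy_malloc(size):
    _state.in_malloc = True           # mark this thread as inside the malloc proxy
    try:
        if size >= 128 * 1024:        # assumed cutoff: large requests use mmap
            proxy_mmap(size)
        records.append(("malloc", size))
    finally:
        _state.in_malloc = False
    return "ptr"


proxy_malloc(256 * 1024)   # one record, although mmap ran underneath
proxy_mmap(4096)           # direct mmap call: recorded
```

The large malloc produces a single record attributed to malloc, and only the direct mmap call is recorded as mmap, so each block of memory has exactly one entry.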

Comprehensive evaluation

Functionality

Compared with malloc debug and LeakTracer, Raphael supports not only malloc/calloc/realloc/memalign/free but also mmap/mmap64/munmap. The monitoring scope extends to threads, WebView, Flutter, video memory, and more, basically covering the native memory usage scenarios on the Android platform.

Performance

Native memory leak detection on Android usually runs while the program is running, so stack backtracking and cache management consume some CPU and memory and cause a certain performance loss. Raphael's monitoring is highly configurable, so the performance impact can be kept within an acceptable range. The following data comes from evaluating Watermelon Video App in 32-bit mode (mid-range and high-end devices and 64-bit mode perform better):

  • CPU: in 32-bit mode with a monitoring threshold of 1024 bytes or more, CPU consumption on low-end devices is below 3%
  • Memory: about 16 MB of virtual memory is consumed by default in 32-bit mode
  • Frame rate: in 32-bit mode with a monitoring threshold of 1024 bytes or more, the frame rate on low-end devices does not change noticeably

Stability

The open source version is built on an open source inline hook implementation. Freezes have been observed on some Android 6 devices, and no other stability issues have been found. In addition, ByteDance's early governance practice focused on offline scenarios, and the offline prevention and control system was built and refined on top of Raphael. A more stable version that can meet online monitoring needs will be open sourced in subsequent iterations.

Governance practices

Raphael is widely used inside ByteDance and is the native memory leak detection tool designated by the ByteDance Native Society. In governance practice, Raphael covers almost all native memory usage scenarios and has helped solve a large number of native memory leaks and cases of improper memory usage. The following four typical cases (the application itself, the Java layer, WebView, and the system layer) briefly introduce Raphael's monitoring capability and the data analysis methods built on it.

Case 1

The following figure shows two typical native memory problems in Watermelon Video. They include both memory leaks in the strict sense (memory that is not released after use) and the broader problem of improper memory usage (transient leaks, problems in specific scenarios, upper-level business logic problems, and so on). A memory leak can be located easily and quickly once the life cycle of the memory in question is clear. For improper memory usage, as much unreleased memory as possible has to be collected so that the impact can be evaluated as a whole.

In the early stage of data analysis, we also cross-checked Raphael's data against maps. Analyzing maps usually gives a rough idea of why memory hit the ceiling. Below is a typical case in which too much memory allocated at runtime through malloc/calloc/realloc/memalign and mmap/mmap64 led to OOM.
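
As an illustration of this kind of cross-check, the sketch below sums mapped sizes per mapping name, assuming the maps data is text in the /proc/<pid>/maps format. It is a Python example written for this article's context, not a script shipped with Raphael, and the sample lines are a hypothetical excerpt.

```python
from collections import defaultdict


def summarize_maps(maps_text):
    """Sum mapped bytes per mapping name from /proc/<pid>/maps-style text."""
    totals = defaultdict(int)
    for line in maps_text.strip().splitlines():
        fields = line.split(None, 5)
        start, end = (int(x, 16) for x in fields[0].split("-"))
        name = fields[5].strip() if len(fields) > 5 else "[anonymous]"
        totals[name] += end - start
    # Largest consumers first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical excerpt; real files have thousands of lines.
SAMPLE = """
70000000-70800000 rw-p 00000000 00:00 0          [anon:libc_malloc]
70800000-70900000 rw-p 00000000 00:00 0          [anon:libc_malloc]
80000000-82000000 rw-s 00000000 00:05 1234       /dev/kgsl-3d0
90000000-90100000 rw-p 00000000 00:00 0
"""
summary = summarize_maps(SAMPLE)
```

Sorting by size makes it easy to see which kind of mapping dominates the address space, which can then be compared with the totals that the monitoring tool attributes to malloc-family and mmap-family allocations.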

Case 2

The following figure shows a native memory problem encountered by a service inside ByteDance. Before Raphael was added, the native memory growth could be reproduced easily, but its cause could not be located. After Raphael was added, the problem became obvious even though not much memory was intercepted. The top-ranked stack was the Java layer calling into the native layer while creating Bitmap objects (since Android 8, bitmap data is stored in the native layer), so the investigation moved to the Java layer.

Based on this analysis, we could conclude that the Java heap must contain a large number of Bitmap objects. Because the problem was reproduced offline, we could easily verify and locate its cause with a Java heap snapshot (as shown in the following figure). For an online problem, a snapshot of the abnormal scene would be needed to locate it, which is the general abnormal-data collection capability described in "Watermelon Video stability management system construction I: Tailor principle and practice".
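
The "top-ranked stack" in this case comes from grouping unreleased allocations by call stack after the fact, which is also why aggregation is not needed at runtime. A minimal Python sketch of such offline ranking follows; the record layout and frame names are hypothetical and do not reflect Raphael's report format.

```python
from collections import defaultdict


def rank_stacks(records, top=3):
    """Group unreleased allocations by call stack and rank by total bytes."""
    totals = defaultdict(lambda: [0, 0])        # stack -> [bytes, count]
    for size, stack in records:
        entry = totals[tuple(stack)]
        entry[0] += size
        entry[1] += 1
    ranked = sorted(totals.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(stack, total, count) for stack, (total, count) in ranked[:top]]


# Hypothetical dump; frame names are illustrative only.
records = [
    (4_000_000, ["malloc", "Bitmap_create", "jni_entry"]),
    (4_000_000, ["malloc", "Bitmap_create", "jni_entry"]),
    (65_536, ["mmap", "thread_stack"]),
]
report = rank_stacks(records)
```

Here the bitmap-creation stack ranks first with 8,000,000 bytes across two allocations, which is the kind of signal that pointed the investigation in this case toward the Java layer.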

Case 3

The memory consumed by WebView on Android devices has long been neglected. As front-end business scenarios increase, memory problems caused by WebView are becoming more obvious and more frequent. The picture below shows a memory problem, monitored by Raphael in Watermelon Video App, that was caused by a front-end campaign page. Because of how WebView itself works, the tool cannot backtrace the complete call stack and point directly at the cause. In the end, targeted analysis of the memory data showed that this memory consisted mostly of image resources cached by the front-end page. After the page's image caching strategy was optimized, the related memory ceiling problems dropped sharply.

Case 4

The picture below shows a long-standing Camera memory leak on Android. According to source code analysis, the Camera continuously constructs CameraMetadata instances in the native layer during shooting, and each CameraMetadata object points to a large chunk of native memory. This native memory is released only when the Java-layer CameraMetadataNative object runs its finalize function, so reclaiming it depends indirectly on Java-layer GC. If the Java layer does not run a GC for a while, this native memory accumulates because it is not released in time, and once it hits the ceiling it causes all kinds of exceptions due to insufficient native memory. "Android Camera memory analysis" (see the references) gives the detailed analysis process, and "ART perspective | How to let the GC synchronously reclaim native memory" gives a solution for this class of problem. After we communicated with the Android team, they said the problem will be fixed thoroughly in a future version.

Subsequent planning

The principle of native memory leak monitoring is relatively simple, but making it truly universal is difficult. The main challenges are performance and stability, such as the performance and stability of 32-bit stack backtracking and the performance of cache management. In the early stage of investigating and developing Raphael, we reused a lot of third-party code and simplified a lot of logic in order to ship quickly and solve urgent problems. After long-term governance practice, the tool itself has exposed some problems and some directions for future optimization.

In terms of proxy logic, android-inline-hook and And64InlineHook are both good inline hook tools, but in practice they still have compatibility problems and freezes. Although xHook meets the online standard in compatibility and performance, it is not universal, so it is difficult to extend native memory leak monitoring to other resources with an upper limit (such as the JNI Reference Table). We are also investigating how to optimize inline hooks, in search of more stable and efficient hook solutions.

Stack backtracking and cache management are the bottlenecks of native memory leak monitoring in both performance and stability. By comparison, the FP-based 64-bit stack backtracking scheme is already close to optimal, but there is still no ideal scheme for 32 bits. In 32-bit mode, Raphael avoids the performance impact of frequent stack backtracking by limiting the backtracking depth and controlling the monitoring range. This improves performance greatly, but it also leads to under-reporting. 32-bit stack backtracking performance is therefore one of our future optimization directions. In addition, the open source version of Raphael still synchronizes its cache management with a global lock, which costs some performance. We will bring the latest optimizations into subsequent open source iterations.

Physical memory, virtual memory, threads, FDs, and the JNI Reference Table are all typical resources with an upper limit. Improper use of them causes stability problems that are difficult to investigate by conventional means. The same monitoring logic used for memory leaks applies to these other capped resources. Even for resources without a clear upper limit (Binder, traffic, time, and so on), a limit can be constructed for monitoring and attribution. Extending Raphael with such monitoring capabilities is something we will work on in the future.

Conclusion

Android native memory leaks have been a topic for a long time, yet until now the industry has had no reliable tools for them. Thanks to AOSP and other excellent open source projects (android-inline-hook, And64InlineHook, xHook, xDL), we had the opportunity to make some attempts of our own. Raphael is the initial exploration of the Watermelon Video basic technology team. In the long-term governance practice of many ByteDance apps (such as Watermelon Video, Douyin, and Toutiao), Raphael has not only solved a large number of difficult problems, but has also helped refine the tools and the methodology.

Although the Raphael-based native memory leak monitoring scheme is mature and stable, it runs inside the app while the app is running, which brings some performance loss and stability risk. The approach we advocate is to build and refine an offline memory leak prevention system on top of it, and to bring it online with care. Because there are many internal Raphael versions, and some involve other projects that have not been open sourced, we could only choose one stable, usable version for this release. Other optimizations will be opened up gradually.

Raphael has taken only a small step, and there is plenty of room for improvement. Open source is not the end of the road. We hope to pool ideas with the community and to explore and improve Android stability governance faster and further.


References

  1. Raphael open source address: github.com/bytedance/m…
  2. xHook
  3. xDL
  4. android-inline-hook
  5. And64InlineHook
  6. malloc debug
  7. LeakTracer
  8. A classic Raphael-based case study: Android Camera memory analysis
  9. Watermelon Video stability management system construction I: Tailor principle and practice
  10. Chronicles of Libunwind LLVM
  11. ART perspective | How to let the GC synchronously reclaim native memory
