This article starts from a native memory OOM problem that occurs during camera shooting on a class of Android devices, and analyzes it in depth with the help of a memory snapshot trimming tool and a native memory monitoring tool.

Background

Raphael is a native memory monitoring tool developed by the Watermelon Video Android team. It is widely used in ByteDance’s internal products (such as Watermelon Video, Douyin and Toutiao) to monitor native memory leaks. From TikTok versions 7.8.0 to 8.3.0 we collected logs from a large number of crashes caused by virtual memory exhaustion (such as pthread_create failures, GL errors and EGL_BAD_ALLOC). More than 60% of them involved camera-related memory leaks, and these accounted for more than 15% of all crashes (Java and native). Meanwhile, we received feedback from OPPO and other manufacturers that the native crash rate of the TikTok app on their new models was more than 3 times higher than on other models. Analysis of the logs provided by the manufacturers showed that the crashes were basically caused by virtual memory hitting its ceiling, and more than 80% of them contained logs of camera-related memory allocation failures.

The problem

Stack aggregation and per-SO memory usage statistics on the logs collected by native memory monitoring show that, by the time of the OOM, the total native memory intercepted by the tool had reached about 1.3 GB (the upper limit of native memory that a 32-bit application can use directly is about 2 GB). The largest portion is memory referenced indirectly by CameraMetadata objects, so the native memory leak is severe.
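The aggregation step can be illustrated with a minimal sketch. The record layout below (a library name plus a byte count) is a hypothetical stand-in, not Raphael's actual log format, and libexample.so is a made-up name. The sketch only shows the idea of summing unfreed allocations per shared object and ranking the results.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AllocationAggregator {

    /** One unfreed allocation. Hypothetical fields, not Raphael's log format. */
    static final class Allocation {
        final String library; // shared object whose code requested the memory
        final long bytes;     // size of the allocation

        Allocation(String library, long bytes) {
            this.library = library;
            this.bytes = bytes;
        }
    }

    /** Sums the unfreed bytes attributed to each library. */
    static Map<String, Long> bytesByLibrary(List<Allocation> unfreed) {
        Map<String, Long> totals = new HashMap<>();
        for (Allocation a : unfreed) {
            totals.merge(a.library, a.bytes, Long::sum);
        }
        return totals;
    }

    /** Returns library names ordered from the largest total to the smallest. */
    static List<String> rankBySize(Map<String, Long> totals) {
        List<String> names = new ArrayList<>(totals.keySet());
        names.sort(Comparator.comparing((String name) -> totals.get(name)).reversed());
        return names;
    }

    public static void main(String[] args) {
        List<Allocation> unfreed = new ArrayList<>();
        unfreed.add(new Allocation("libandroid_runtime.so", 4096));
        unfreed.add(new Allocation("libexample.so", 1024)); // made-up library name
        unfreed.add(new Allocation("libandroid_runtime.so", 8192));

        Map<String, Long> totals = bytesByLibrary(unfreed);
        for (String name : rankBySize(totals)) {
            System.out.println(name + ": " + totals.get(name) + " bytes");
        }
    }
}
```

The same grouping can be keyed by an aggregated stack instead of a library name to find the allocation site that dominates.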

Native memory allocation is very frequent and capturing a Java stack is time-consuming, so it is not practical to capture the Java stack on every intercepted native allocation. Native memory also differs from Java memory in that it is hard to draw an intuitive conclusion from the intercepted data alone, and problems caused by poor use of resources such as memory are often hard to attribute. In the intercepted data, CameraMetadata references the most memory, so we decided to start the analysis from there.

Preliminary analysis

Analyzing native memory allocation and release

As the intercepted stacks show, CameraMetadata objects are created through Java calls and ultimately allocated in the native layer (boot-framework.oat and libandroid_runtime.so). A CameraMetadata object owns two pieces of memory: the object itself, and the camera_metadata_t that mBuffer points to. The camera_metadata_t behind each CameraMetadata object's mBuffer is independent and does not overlap with any other.

Since the tool intercepted so many unfreed allocations, something must be wrong with how these allocations are freed, so we need to investigate the free logic of CameraMetadata. CameraMetadata::release() does not free the memory that mBuffer points to; it hands that memory over to another CameraMetadata object. The memory is actually freed in CameraMetadata::clear(), which is called in two scenarios: when the camera_metadata_t is reused and when the CameraMetadata object is destructed.

The camera_metadata_t behind each CameraMetadata.mBuffer is independent, so based on the stacks and allocation counts intercepted by the tool, a large number of CameraMetadata instances must exist in memory when the native OOM occurs. A C++ object's destructor is usually run by calling delete. It is hard to find where a CameraMetadata object is deleted in AOSP, because the variable name used at the delete site is unknown. By a basic C++ convention, however, memory is normally freed close to where it is created, and the creation sites are easy to find by searching globally for the string new CameraMetadata. Both creation and release turn out to be in [/frameworks/base/core/jni/android_hardware_camera2_CameraMetadata.cpp].

The JNI registration table in android_hardware_camera2_CameraMetadata.cpp shows that the Java class associated with these functions is android/hardware/camera2/impl/CameraMetadataNative, and that the CameraMetadata_close function corresponds to the Java method nativeClose. Looking further, in CameraMetadataNative nativeClose is called from the close function, and close is called from the finalize function.

From the above analysis, the corresponding native memory is reclaimed only when the CameraMetadataNative object's finalize method runs, and finalize runs on the FinalizerDaemon thread. Our guess was that whenever this kind of native OOM happens, there must be a large number of CameraMetadataNative objects in the Java layer that have not yet executed finalize.
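The release chain described above can be modeled in plain Java. The class below is a hypothetical stand-in, not the AOSP source: a counter replaces the real native allocator, and the field name mMetadataPtr is borrowed only for readability. It shows why the native memory stays allocated until finalize (or an earlier close) runs.

```java
import java.util.concurrent.atomic.AtomicLong;

final class NativeBackedObject {
    /** Simulated count of outstanding native bytes (stand-in for real native memory). */
    static final AtomicLong OUTSTANDING_BYTES = new AtomicLong();

    private final long size;
    private long mMetadataPtr; // non-zero while the simulated native buffer is alive

    NativeBackedObject(long size) {
        this.size = size;
        this.mMetadataPtr = 1; // pretend a native buffer was allocated
        OUTSTANDING_BYTES.addAndGet(size);
    }

    /** Releases the simulated native buffer; safe to call more than once. */
    synchronized void close() {
        if (mMetadataPtr != 0) {
            OUTSTANDING_BYTES.addAndGet(-size); // plays the role of nativeClose
            mMetadataPtr = 0;
        }
    }

    /** Called by the runtime's finalizer thread, and only after a GC has queued this object. */
    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() throws Throwable {
        try {
            close();
        } finally {
            super.finalize();
        }
    }

    public static void main(String[] args) {
        NativeBackedObject obj = new NativeBackedObject(1024);
        System.out.println("before close: " + OUTSTANDING_BYTES.get());
        obj.close();
        System.out.println("after close: " + OUTSTANDING_BYTES.get());
    }
}
```

If nothing calls close explicitly, the counter only drops once the garbage collector has found the object unreachable and its finalizer has run, which is the dependency the analysis points at.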

Checking the Java heap snapshot

Fortunately, Tailor gave us access to a large number of Java heap snapshot files corresponding to these native OOMs. The snapshots confirm our suspicion: there are a large number of CameraMetadataNative objects in the Java layer when such a native OOM occurs. In the snapshot shown in the figure, all CameraMetadataNative objects except 6 that are referenced by other code are in the FinalizerDaemon thread's queue, waiting for finalize to run. There are 6658 such objects in the snapshot, and only about 600 of them have a non-zero mMetadataPtr, meaning that their native memory still needs to be released in finalize. This matches the data intercepted by the tool and indirectly verifies the correctness and reliability of the native memory monitoring.

In-depth analysis

Checking finalize execution

Although the above analysis verified the problem and confirmed our conjecture, it did not reveal the underlying cause, so it does not yet help solve the problem. Why so many CameraMetadataNative objects are waiting for finalize is the next direction of investigation. Anyone who has worked on Java stability will know the well-known TimeoutException, which is basically caused by a finalize call timing out. Could this case be caused by some object's finalize timing out?

FinalizerDaemon’s source code shows that each time it executes an object’s finalize method, it records the current object in the finalizingObject field. If a finalize call had timed out, finalizingObject would have to be non-null. After going through the FinalizerDaemon thread state in all the relevant memory snapshots, we found that finalizingObject was null in every one of them. This result is surprising: the problem does not seem to be caused by an object’s finalize timing out.

Looking at the line finalizingReference = (FinalizerReference<?>) queue.remove(), the logic that follows it does not null-check finalizingReference, which means this call never returns null. Since it cannot return null, queue.remove() can only block and wait, and the source of ReferenceQueue.java confirms this.

The source code shows that goToSleep is a synchronized method and may also block. However, in all the relevant snapshots the needToWork field is false, which means goToSleep has already run (only FinalizerWatchdogDaemon.INSTANCE.goToSleep() sets it to false, and this private function is only called from the FinalizerDaemon thread), so the thread is almost certainly not blocked there.

The thread normally blocks in queue.remove() because objects that need finalize are added to FinalizerDaemon’s queue only during GC. If no GC happens for a while and the queue is empty, remove blocks until an object is added to the queue. As it happens, Tailor actively dumps a Java heap snapshot when such a native OOM occurs, which triggers a GC and a suspend, and this causes a large number of CameraMetadataNative objects to be added to FinalizerDaemon.queue at the same time.
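The blocking behavior can be reproduced on a desktop JVM with the standard java.lang.ref API. This is a minimal sketch rather than the FinalizerDaemon code: it uses a timed remove so that the demonstration terminates, and it enqueues a reference by hand where the runtime would do so after a GC. The class and method names are illustrative.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

final class QueueBlockingDemo {

    /**
     * Waits up to timeoutMillis for a reference to arrive in the queue.
     * Returns null when nothing was enqueued in time; the untimed remove()
     * would keep blocking instead.
     */
    static Reference<?> waitForReference(ReferenceQueue<Object> queue, long timeoutMillis) {
        try {
            return queue.remove(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();

        // Nothing has been enqueued yet, so the timed wait gives up.
        System.out.println("empty queue -> " + waitForReference(queue, 100));

        // Enqueue by hand; in the article's scenario a GC is what does this.
        Object referent = new Object();
        WeakReference<Object> ref = new WeakReference<>(referent, queue);
        ref.enqueue();
        System.out.println("same reference returned: " + (waitForReference(queue, 100) == ref));
    }
}
```

With no GC there is nothing to put into the queue, so a consumer thread calling the untimed remove() simply stays parked, which is consistent with the thread state seen in the snapshots.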

Analyzing GC strategy

According to the above analysis, these objects are added to FinalizerDaemon.queue only during GC, which indicates that no GC had happened for some time before the native OOM. As a result, a large number of CameraMetadataNative objects did not execute finalize in time, and the native OOM followed. An offline experiment confirmed this: after entering the shooting page and leaving the app idle, the Java heap triggered a GC only every 30 to 40 s or even longer. During each interval native memory kept growing, then dropped sharply after the GC, at which point both Java and native memory returned to normal. So although the problem is not a blocked finalize, its cause is finally narrowed down to the GC logic.

Readers familiar with GC will know that the ART virtual machine has many GC causes, of which kGcCauseForAlloc and kGcCauseBackground are the ones the virtual machine triggers most often. The program logic is relatively simple when the user stays on the camera page without doing anything: the camera service periodically (>= 30 times per second) calls into the application through Binder, which creates a CameraMetadataNative object, and the image captured by the camera is displayed on the camera page. In this process only CameraMetadataNative objects are created in the Java heap, and each of them occupies relatively little Java memory. If the Java heap has plenty of free space after a GC, the virtual machine will not trigger another GC for a long time. If native memory grows too much during this period, it hits the ceiling before the next GC and a native OOM occurs.

To sum up, the fundamental cause of this kind of native OOM is as follows. When the application's own native memory is already at a high level and the camera is started, the camera service continuously creates CameraMetadataNative objects on the application side through Binder communication. Creating a CameraMetadataNative object also creates or reuses a relatively large camera_metadata_t in the native layer through the JNI interface on the application side. Because the CameraMetadataNative object in the Java layer is relatively small, continuously creating such small objects rarely triggers a Java GC within a given period, so the native memory they indirectly reference keeps growing until it hits the virtual memory ceiling and the app crashes.
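A back-of-the-envelope estimate makes the mechanism concrete. The creation rate (about 30 objects per second) and the GC interval (30 to 40 s) come from the observations above. The per-object native size used here is a purely hypothetical figure, since the article does not state one, and buffer reuse would lower the real number.

```java
final class NativeGrowthEstimate {

    /** Native bytes that can pile up between two GCs if every object holds its own buffer. */
    static long bytesBetweenGcs(long objectsPerSecond, long nativeBytesPerObject, long secondsBetweenGcs) {
        return objectsPerSecond * nativeBytesPerObject * secondsBetweenGcs;
    }

    public static void main(String[] args) {
        long rate = 30;               // objects per second, from the article
        long interval = 35;           // seconds between GCs, middle of the observed 30-40 s
        long assumedSize = 64 * 1024; // HYPOTHETICAL native bytes per object

        long bytes = bytesBetweenGcs(rate, assumedSize, interval);
        System.out.println("objects created: " + rate * interval);
        System.out.println("native bytes accumulated: " + bytes);
    }
}
```

At the observed rate and interval, roughly a thousand objects and the native buffers they reference accumulate between two collections, while the Java heap itself barely grows.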

Solution

The cause of the problem is relatively simple, but choosing a fix is harder. Since GC is not timely, a simple solution is to trigger GC periodically on the shooting page. GC is time-consuming, however: if the interval is short, overly frequent GC seriously affects the shooting experience; if the interval is long, the native OOM is likely to recur.

An approach that actively triggers GC is hard to balance against its performance impact. In fact, the crux of the problem is not the Java layer but the native memory referenced by Java objects. If this memory is released in time, the problem is solved at its root. As the previous analysis shows, this memory is normally reclaimed by finalize after GC, but once a CameraMetadataNative object is known to be no longer in use, its native memory can be released ahead of time. Analysis of the source code shows that a CameraMetadataNative object is not used again after it has been handed to the application layer. So once the application layer has finished with a CameraMetadataNative object, the native memory it references can be released by calling its close function through reflection.
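A minimal sketch of such a reflective call follows. The helper and the stand-in class are hypothetical: the article does not show its implementation, nor how the application obtains the CameraMetadataNative instance, so a fake class with a private close method is used here. On a device, reflective access to framework internals may additionally be subject to platform restrictions.

```java
import java.lang.reflect.Method;

final class ReflectiveCloser {

    /**
     * Looks up a no-argument close() declared on the target's class and invokes it.
     * Returns true if the call succeeded, false if the method is missing or fails.
     */
    static boolean closeQuietly(Object target) {
        if (target == null) {
            return false;
        }
        try {
            Method close = target.getClass().getDeclaredMethod("close");
            close.setAccessible(true); // close() may be private
            close.invoke(target);
            return true;
        } catch (ReflectiveOperationException | RuntimeException e) {
            return false;
        }
    }

    /** Hypothetical stand-in for a framework object whose close() is private. */
    static final class FakeMetadata {
        private long nativePtr = 0x1234L;

        private void close() {
            nativePtr = 0; // the real method would free native memory here
        }

        boolean isReleased() {
            return nativePtr == 0;
        }
    }

    public static void main(String[] args) {
        FakeMetadata metadata = new FakeMetadata();
        System.out.println("closed: " + closeQuietly(metadata));
        System.out.println("released: " + metadata.isReleased());
    }
}
```

Returning a boolean instead of throwing keeps the shooting path unaffected when the method cannot be found; in that case the memory is still reclaimed later by finalize.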

Offline experiments also show that the growth rate of native memory drops significantly once the active release strategy is enabled. Small objects still accumulate in both the Java heap and the native layer during this period, but native memory now grows far more slowly than Java memory. In this scenario Java memory triggers a GC before native memory hits the ceiling, which greatly reduces the likelihood of a native OOM.

After the fix was released, the effect was obvious: this kind of crash (more than 15% of all Java and native crashes) was basically eliminated. In the memory monitoring logs collected afterwards, the memory associated with CameraMetadata is generally less than 2 MB.

Conclusion

Such problems have existed for a long time: since at least Android 4.4, this native memory has been released by the finalize function of CameraMetadataNative. In the past, shooting requirements were relatively simple, and most photos were taken with the ROM's built-in camera application. Because such apps are simple and their native memory level is very low, they rarely reach the virtual memory ceiling, so the problem was not exposed. With the rise of short-video apps, shooting requirements have become heavier (special effects, beautification and so on) and apps have become more complex. The native memory level of these apps keeps rising, and combined with native memory leaks and other factors, the problem is easily triggered when the app stays on the shooting page for a long time.

In addition, the app does not crash at the moment CameraMetadata fails to allocate memory; it crashes only when another memory allocation request follows (such as thread creation or a GL memory allocation). This is the root cause of many black-screen problems during shooting, so the fix also incidentally solves that long-standing problem.

This kind of problem is caused jointly by the application itself and by the design of the memory reclamation strategy. Besides minimizing leaks, applications should also strive to lower their native memory level. AOSP relies on the Java finalize method to release indirectly referenced native memory, which is a lazy design, and similar cases can be found throughout AOSP. In our own development, limited resources such as memory should be reclaimed promptly, and the life cycle of objects can even be bounded explicitly: once an object has done its job, the memory it occupies, including native memory, can be released proactively without relying on finalize.

The two tools mentioned in this article (the native memory monitoring tool Raphael and the Android heap snapshot trimming and compression tool Tailor) are efficient and practical basic tools that the Watermelon Video Android team developed during long-term memory optimization work. They are widely used in our company's internal apps and are our first choice for memory optimization and stability work. We will introduce their technical details in follow-up articles on monitoring tool construction and optimization practice.

Bytedance – Watermelon Video Android team

The Android team is the client team responsible for developing ByteDance’s Watermelon Video app. While keeping up with fast business iteration, the team continues to optimize performance and experience, improve R&D efficiency, and explore Flutter and other cross-platform solutions.