Preface

Memory problems are a classic challenge in the software world. They often hide deep in the code and show few signs before they surface; once they break out, the diverse root causes, difficulty of reproduction, scarce on-site information and elusive locations combine into a real headache.

WeChat has run into all kinds of memory problems over its many iterations, including but not limited to Activity leaks, unclosed Cursors, thread overuse, uncontrolled cache growth, and an .so library that quietly leaked native memory. Some problems even forced us to change WeChat's architecture (WebView kernel leaks in the 2.x era drove WeChat to its multi-process architecture). To this day WeChat is still occasionally challenged by memory problems, since new ones are always being introduced and lying dormant across successive iterations.

In the course of solving these problems we have accumulated a set of fairly effective optimization methods and tools, covering everything from online monitoring and reporting to testing and inspection in the development stage, to help prevent and solve problems, and we keep improving them. This article introduces this engineering optimization practice in the hope that it offers some reference value to you.

Activity leak detection

An Activity leak is the situation where an Activity is held, directly or indirectly, by a strong reference from an object whose lifetime far exceeds the Activity's, so that the Activity cannot be reclaimed by the GC for that whole period. Compared with other object leaks, Activities are special: on the one hand an Activity provides the Context for interacting with the system, and on the other it is the carrier of interaction between the user and the app, so it is very easy for the system or for business logic to accidentally hold one for a long time as a general-purpose tool object, and once a leak occurs the number of objects dragged down with it is also very large. Moreover, apart from rising memory usage, a build-up of such leaks shows no obvious symptom such as a crash before it erupts, so it is essential to proactively detect and troubleshoot Activity leaks in the testing stage to avoid OOMs and other problems in production.

In the early stage of automated testing, we automatically triggered an Hprof dump when memory usage reached a threshold and analyzed the archived Hprof manually with MAT. While new code landed slowly enough, this could just about cope. But as new business code poured into WeChat, the cycle of manually inspecting dumps, feeding results back to the owner of each Activity, and manually confirming each fix had to be repeated more and more often, and the manual approach could no longer keep up.

Then we tried LeakCanary. This tool not only gives very readable detection results but also points to troubleshooting solutions maintained by the open-source community, so it can completely replace manual labor in Activity leak detection and analysis. The only fly in the ointment is that LeakCanary bundles detection and analysis reporting together, a flow better suited to the same person developing and testing, and less friendly to batch automated testing and after-the-fact analysis.

Based on LeakCanary, we developed our own Activity leak detection and analysis solution, ResourceCanary, which is part of our internal quality monitoring platform Matrix and takes part in the daily automated testing process. Compared with LeakCanary, ResourceCanary makes the following improvements:

  • Separate detection and analysis logic.

    These two parts can in fact operate independently: the detection part is responsible for detecting leaks and producing the Hprof plus the necessary extra information, while the analysis part processes these artifacts to obtain the strong-reference chain causing the leak. Detection is thus no longer coupled with analysis and fault fixing: the automated test is run by the test platform, the analysis is completed offline on the monitoring platform's servers, and the relevant developers are then notified to fix the problems. None of the three interrupts the others, which keeps the automated process consistent.

  • Trim the Hprof file, to reduce the overhead of transferring and archiving Hprofs on the backend.

    For Activity leak analysis we only need the class and object descriptions in the Hprof plus the string data those descriptions reference; the rest can be stripped in place on the client side. Since the data we keep makes up only a small fraction of the Hprof, this step can shrink the file to about 1/10 of its original size, greatly reducing transfer and storage overhead (a rough sketch follows).
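As a rough illustration of the trimming idea, here is a minimal, hedged sketch that works at the record level of the standard Hprof binary layout (a null-terminated version string, 12 fixed header bytes, then tagged records) and simply drops record types the analysis never reads. Matrix's real trimmer goes further and also strips primitive-array payloads inside the heap-dump records themselves, which requires sub-record parsing omitted here.

import java.io.*;

// Sketch: copy only the Hprof records needed for leak-chain analysis.
// Record layout: u1 tag, u4 timestamp, u4 body length, body bytes.
public final class HprofTrimmer {
    private static final int TAG_STRING = 0x01;            // string table entries
    private static final int TAG_LOAD_CLASS = 0x02;        // class name bindings
    private static final int TAG_HEAP_DUMP = 0x0C;
    private static final int TAG_HEAP_DUMP_SEGMENT = 0x1C;
    private static final int TAG_HEAP_DUMP_END = 0x2C;

    public static void trim(File src, File dst) throws IOException {
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(new FileInputStream(src)));
             DataOutputStream out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(dst)))) {
            // Header: version string up to and including the NUL byte,
            // then id size (u4) and timestamp (u8).
            int b;
            while ((b = in.read()) > 0) {
                out.write(b);
            }
            out.write(0);
            byte[] fixedHeader = new byte[12];
            in.readFully(fixedHeader);
            out.write(fixedHeader);
            // Copy or skip whole records by tag.
            int tag;
            while ((tag = in.read()) >= 0) {
                int time = in.readInt();
                int length = in.readInt();
                boolean keep = tag == TAG_STRING || tag == TAG_LOAD_CLASS
                        || tag == TAG_HEAP_DUMP || tag == TAG_HEAP_DUMP_SEGMENT
                        || tag == TAG_HEAP_DUMP_END;
                if (keep) {
                    out.write(tag);
                    out.writeInt(time);
                    out.writeInt(length);
                    byte[] body = new byte[length];
                    in.readFully(body);
                    out.write(body);
                } else {
                    int remaining = length;
                    while (remaining > 0) {
                        int skipped = in.skipBytes(remaining);
                        if (skipped <= 0) throw new EOFException("truncated record");
                        remaining -= skipped;
                    }
                }
            }
        }
    }
}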

In actual operation, ResourceCanary helped us track down a number of typical leak scenarios, some of which are listed below:

  • Leaks caused by anonymous inner classes implicitly holding references to their outer class

public class ChattingUI extends MMActivity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_chatting_ui);
        // The anonymous IListener carries a hidden this$0 field that
        // holds the outer ChattingUI instance.
        EventCenter.addEventListener(new IListener<IEvent>() {
            @Override
            public void onEvent() {
                // ...
            }
        });
    }
}

// Inside EventCenter:
private static List<IListener> sListeners = new ArrayList<>();

public static void addEventListener(IListener cb) {
    // sListeners holds cb, and cb.this$0 holds ChattingUI; since
    // sListeners is static, this chain keeps ChattingUI from ever
    // being reclaimed, so the Activity leaks.
    sListeners.add(cb);
}
  • Activity leaks caused by an unregister function not being called as expected, for a variety of reasons

  • Activity leaks caused by system components, such as the SensorManager and InputMethodManager leaks mentioned by LeakCanary

There are also cases where a particularly time-consuming Runnable holds an Activity, or the Runnable itself is cheap but a time-consuming Runnable ahead of it blocks the executing thread, so that the Runnable is never removed from the queue; this too causes Activity leaks. The examples go on and on, but any scenario that builds a strong reference holding an Activity for a long time can leak it, and with it the large number of Views and other objects the Activity holds.
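As a hedged illustration of the queued-Runnable case (the handler field and method names here are hypothetical, not actual WeChat code):

public class ChattingUI extends MMActivity {
    // Static handler on the main looper, shared across the app (hypothetical).
    private static final Handler sWorkerHandler = new Handler(Looper.getMainLooper());

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        // The anonymous Runnable holds this$0 (the Activity). Until the
        // message is executed and removed from the queue, the Activity
        // stays strongly reachable, even long after onDestroy(); a slow
        // task ahead of it in the queue prolongs the leak further.
        sWorkerHandler.postDelayed(new Runnable() {
            @Override
            public void run() {
                refreshBadge(); // touches Activity state
            }
        }, 60 * 1000L);
    }

    private void refreshBadge() { /* ... */ }
}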

In fact, ResourceCanary's improvements of separating detection from analysis and slashing the size of Hprof files are significant, making automated Activity leak checking much easier. We embedded the ResourceCanary SDK into WeChat and report the problems found by the daily routine automated tests to WeChat's Matrix platform, which automatically performs statistics, stack extraction, owner attribution and ticket filing, then notifies the relevant developers to fix the problems and keeps following up on the fixes. This has been of great value in getting problems solved effectively.

In addition to confirming and fixing the leaks reported by the Matrix platform every day, the WeChat client also takes some proactive measures against leaks that cannot be fixed immediately, roughly including:

  • Proactively sever the Activity's references to its Views and recycle the Drawables inside those Views, to reduce the impact of an Activity leak (see the sketch after this list)

  • Prefer the Application Context when obtaining certain system service instances, so the service cannot hold the Activity and leak it

  • For known system memory leaks that the above two steps cannot handle, follow the workarounds suggested by LeakCanary
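A minimal sketch of the first measure, assuming it is invoked from the Activity's onDestroy(); the helper is ours for illustration, not necessarily how WeChat implements it:

// Walk the view tree, detach drawable callbacks and drop image references,
// so that even if the Activity leaks it drags far less memory with it.
static void cutViewReferences(View root) {
    if (root == null) {
        return;
    }
    Drawable background = root.getBackground();
    if (background != null) {
        background.setCallback(null); // a Drawable's callback points back at its View
    }
    if (root instanceof ImageView) {
        ((ImageView) root).setImageDrawable(null);
    }
    if (root instanceof ViewGroup) {
        ViewGroup group = (ViewGroup) root;
        for (int i = 0; i < group.getChildCount(); i++) {
            cutViewReferences(group.getChildAt(i));
        }
        if (!(root instanceof AdapterView)) {
            group.removeAllViews(); // AdapterView does not allow removeAllViews()
        }
    }
}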

Bitmap allocation and collection tracking

Bitmaps have always been big memory consumers in Android apps. Many Java and even native memory problems come down to improperly holding large numbers of Bitmaps.

At the same time, Bitmaps have several characteristics that make them easy to monitor:

  • The creation scenarios are relatively simple. Bitmaps are typically created either directly via Bitmap.createBitmap in the Java layer or decoded from files or network streams through BitmapFactory. As it happens, we already have a layer of encapsulation around the Bitmap creation interfaces that covers essentially all scenarios in which WeChat creates Bitmaps (including calls into external libraries that produce them). This unified interface lets us monitor Bitmap creation in one place, with no need for hacks such as instrumentation or hooks.

  • The creation frequency is low. Bitmap creation is nowhere near as frequent as common allocations such as malloc, and it is usually accompanied by time-consuming decoding or processing, so the performance requirements on the monitoring logic are not particularly strict. Even capturing the entire Java stack, or doing some filtering on it, is trivial compared with decoding or other image processing, so we can afford slightly more complex logic.

  • Bitmaps are ordinary Java objects. The lifecycle of a Bitmap object is governed by the JVM's GC like that of any other Java object, so we can track its destruction through WeakReferences and the like, rather than having to instrument destruction the way we instrument creation.

Given these characteristics, we added a very cost-effective piece of monitoring for Bitmaps: in the wrapper layer, every created Bitmap is added to a WeakHashMap together with its creation time, stack and other information; then, at a suitable time, we inspect the WeakHashMap to see which Bitmaps are still alive and judge whether Bitmap abuse or leakage has occurred.
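A minimal sketch of this record-and-inspect scheme; the class and method names are ours for illustration (the article does not show WeChat's actual wrapper layer):

import android.graphics.Bitmap;
import android.util.Log;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

public final class BitmapMonitor {
    static final class Record {
        final long createdAtMs = System.currentTimeMillis();
        final Throwable stack = new Throwable(); // captures the creation stack cheaply
        final int bytes;
        Record(Bitmap b) { bytes = b.getByteCount(); }
    }

    // Weak keys: an entry disappears once its Bitmap is GC'ed, so whatever
    // remains in the map is exactly the set of still-alive Bitmaps.
    private static final Map<Bitmap, Record> sLive =
            Collections.synchronizedMap(new WeakHashMap<Bitmap, Record>());

    // Called by the unified creation wrapper right after a Bitmap is made.
    public static void onCreated(Bitmap bitmap) {
        sLive.put(bitmap, new Record(bitmap));
    }

    // Called "at a suitable time", e.g. when heap usage crosses a threshold.
    public static void dumpLiveBitmaps() {
        synchronized (sLive) {
            for (Map.Entry<Bitmap, Record> e : sLive.entrySet()) {
                Record r = e.getValue();
                Log.w("BitmapMonitor", r.bytes + " bytes, created at " + r.createdAtMs, r.stack);
            }
        }
    }
}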

This monitoring has a very low performance cost and can stay enabled in release builds. Judging whether a leak has occurred costs a bit more and currently requires manual handling. The occasions on which we collect suspected leaks include:

  • In test environments such as Monkey tests, an "aggressive mode" is used: the Java heap is checked a few seconds after each Bitmap creation, and when Java memory usage exceeds a certain threshold the collection logic is triggered and the information of all surviving Bitmaps is written to a file. An Hprof is dumped as well to help find other memory leaks.

  • Release builds use a "conservative mode": only after an OOM is the information of Bitmaps occupying more than 1 MB of memory written to Xlog.

The threshold in aggressive mode is currently set at 200 MB, because the batch of Android devices we support that is most prone to OOM has a large-heap limit of 256 MB. Once the heap peaks at 200 MB and is not reclaimed in time, scenarios that need to decode a large image or the like have no way to temporarily allocate the tens of MB required for display, and an OOM results; so the Monkey test treats Java heap usage above 200 MB as an anomaly.

When Bitmap tracking was put into the Monkey test, the most prominent problem it exposed was cache abuse, most typically using a static LruCache to cache large Bitmaps:

private static LruCache<String, Bitmap> sBitmapCache = new LruCache<>(20);

public static Bitmap getBitmap(String path) {
    Bitmap bitmap = sBitmapCache.get(path);
    if (bitmap != null) {
        return bitmap;
    }
    bitmap = decodeBitmapFromFile(path);
    sBitmapCache.put(path, bitmap);
    return bitmap;
}

Code like the above caches repeatedly used Bitmaps to avoid the performance cost of repeated decoding, but because sBitmapCache is static and has no cleanup logic, the images cached in it are never released until the quota of 20 entries fills up and entries start being replaced. This LruCache limits the number of cached objects but not their total size (Java's object model offers no direct way to obtain an object's memory footprint), so a few large or long images in the cache can occupy a large amount of memory for a long time. On top of that, different features are unlikely to anticipate the caches the others might create, which further aggravates the problem. That is why we have also started pushing internally for a unified cache management component that controls cache policy and usage globally.
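Worth noting is that Android's LruCache can be budgeted by size rather than by entry count once sizeOf is overridden; a hedged sketch of such a byte-budgeted Bitmap cache (the 8 MB figure is illustrative):

// Evicts once the summed byte counts exceed the budget, instead of
// waiting for an entry-count quota to fill up.
private static final int MAX_CACHE_BYTES = 8 * 1024 * 1024; // 8 MB, illustrative

private static final LruCache<String, Bitmap> sBitmapCache =
        new LruCache<String, Bitmap>(MAX_CACHE_BYTES) {
            @Override
            protected int sizeOf(String key, Bitmap value) {
                return value.getByteCount();
            }
        };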

Native memory leak detection

A native-layer memory leak usually means that allocated memory is, for various reasons, never effectively freed, so available memory keeps shrinking until a crash. Since the native layer has no GC and its memory management is fully explicit, detection is indeed much more straightforward: intercept the functions related to memory allocation and freeing and check whether they pair up.

We first tried some mature tools on a single .so:

  • valgrind

    The app became noticeably sluggish, the test results were not very helpful, and Valgrind is a hassle to deploy on Android; rolling it out across hundreds of test machines was a big problem.

  • asan

    As its documentation says, the runtime cost in the test phase is indeed lower than Valgrind's, but the app still stuttered and easily hit ANRs during automated testing, and the stack-unwinding phase was prone to crashes. In addition, some of our legacy .so files developed bugs after being recompiled with Clang as asan requires, which became another obstacle to adopting this scheme.

Our guess is that besides the inherent overhead of these tools, heavyweight features such as double-free detection, address validity checking and out-of-bounds access detection add further runtime cost. Following this idea we tried the system's malloc_debug for detection, but malloc_debug crashed unavoidably in the stack-unwinding stage. According to online material and vendor feedback, this is because it depends on the _Unwind family of functions in the STL, which expect data structures that are defined differently in different STL implementations; and for various reasons some of the .so files under test were no longer in a position to be recompiled against a different STL. This situation forced us to develop our own solution.

Based on these earlier attempts, what we actually needed was a combination of two schemes. For libraries that are inconvenient to recompile, we use a scheme that requires no recompilation, giving up some information in exchange for the ability to locate leaks at all. For libraries that are easy to recompile, we use a scheme that, without requiring a Clang toolchain or introducing new bugs, still obtains what asan would give us: the stack of each leaked allocation. Both schemes, of course, have to be light enough not to interrupt the automated testing process with ANRs.

Space does not permit a full explanation of how the schemes work, but their rough ideas are as follows:

  • No recompilation: use a PLT hook to intercept the tested library's memory allocation functions and redirect them to our own implementations, which record the allocated address, size, source .so path and other information; periodically scan whether allocations and frees pair up, and output the recorded information for every unpaired allocation.

  • Recompilation: instrument every function with GCC's -finstrument-functions flag and simulate call-stack push and pop operations inside the instrumentation stubs; intercept the allocation and free functions with the linker's --wrap flag, redirect them to our own implementations, and record the allocated address, size, source .so and the current call stack kept by the stubs; periodically scan whether allocations and frees pair up, and output the recorded information for every unpaired allocation.

In actual measurement, the extra cost these two schemes add to each memory allocation is under 10 ns, and the change in overall overhead is almost negligible. Using the two in combination, we not only found a thorny new problem, but also scanned the basic network protocol .so library that had been in use for many years and successfully uncovered more than ten small memory leaks hidden for years, reducing memory address-space fragmentation.

Thread monitoring

Common OOM cases are mostly caused by memory leaks or by allocating large amounts of memory. The following thread-related case is rarer, but we do occasionally find such problems on our crash reporting system:

java.lang.OutOfMemoryError: pthread_create (1040KB stack) failed: Out of memory

Cause analysis

The root cause of this OutOfMemoryError is a failure to allocate the thread's stack when the thread is created. The exception is thrown during thread initialization in /art/runtime/thread.cc:

void Thread::CreateNativeThread(JNIEnv* env, jobject java_peer, size_t stack_size, bool is_daemon) {
  ...
  int pthread_create_result = pthread_create(&new_pthread, &attr, Thread::CreateCallback, child_thread);
  if (pthread_create_result != 0) {
    env->SetLongField(java_peer, WellKnownClasses::java_lang_Thread_nativePeer, 0);
    {
      std::string msg(StringPrintf("pthread_create (%s stack) failed: %s",
                                   PrettySize(stack_size).c_str(), strerror(pthread_create_result)));
      ScopedObjectAccess soa(env);
      soa.Self()->ThrowOutOfMemoryError(msg.c_str());
    }
  }
}

ART calls pthread_create to create the thread; if pthread_create_result is non-zero, initialization has failed. Under what circumstances does it fail? The concrete logic of pthread_create is completed in /bionic/libc/bionic/pthread_create.cpp:

int pthread_create(pthread_t* thread_out, pthread_attr_t const* attr,
                   void* (*start_routine)(void*), void* arg) {
  ...
  pthread_internal_t* thread = NULL;
  void* child_stack = NULL;
  int result = __allocate_thread(&thread_attr, &thread, &child_stack);
  if (result != 0) {
    return result;
  }
  ...
}

static int __allocate_thread(pthread_attr_t* attr, pthread_internal_t** threadp, void** child_stack) {
  size_t mmap_size;
  uint8_t* stack_top;
  ...
  attr->stack_base = __create_thread_mapped_space(mmap_size, attr->guard_size);
  if (attr->stack_base == NULL) {
    return EAGAIN;  // EAGAIN != 0
  }
  ...
  return 0;
}

As you can see, each thread needs an mmap of a certain stack size; by default a thread's initialization requires roughly 1 MB of mmap'ed space. A 32-bit app has a 4 GB virtual address space, of which a bit over 3 GB is actually usable, so the maximum number of threads a process could create is around 3000, and that is the ideal case: Linux also limits the number of threads a process may create (see /proc/pid/limits). When the system's thread limit is exceeded, the same OOM is raised.

It follows that capping the number of threads can to some extent prevent this OOM from occurring, so we also began to monitor the thread count in WeChat.

Monitoring and reporting

In grayscale builds, a timer dumps all of the application's threads every 10 minutes, and when the thread count exceeds a certain threshold the current thread list is reported as a warning. By capturing this abnormal state we found that in some special scenarios WeChat really did leak threads, or spawned a burst of threads in a short time, pushing the thread count very high (500+); creating another thread in that state easily triggers this OOM.
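A hedged sketch of such a periodic check (the threshold and tag are illustrative); note that Thread.getAllStackTraces() only sees Java threads, so native threads would additionally require reading /proc/self/status or counting /proc/self/task entries:

import android.util.Log;
import java.util.Map;

final class ThreadMonitor {
    private static final int THREAD_WARN_THRESHOLD = 300; // illustrative

    // Run from a timer, e.g. every 10 minutes.
    static void checkThreadCount() {
        Map<Thread, StackTraceElement[]> all = Thread.getAllStackTraces();
        if (all.size() > THREAD_WARN_THRESHOLD) {
            StringBuilder report = new StringBuilder("thread count=" + all.size() + "\n");
            for (Thread t : all.keySet()) {
                report.append(t.getName()).append(' ').append(t.getState()).append('\n');
            }
            Log.w("ThreadMonitor", report.toString()); // or upload to the reporting backend
        }
    }
}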

After locating and fixing these problems, this type of OOM did decrease a great deal both in the crash system and in vendor feedback. Monitoring the thread count and converging threads has thus become one of our effective means of reducing OOMs.

Memory monitoring

On Android you need to pay attention to two kinds of memory usage: physical memory and virtual memory. The Memory Profiler is the tool commonly used to view an app's memory usage.

[Figure: the Memory Profiler's default view of the process]

In the default view we can see the process's total memory footprint as well as the allocations of the Java heap, native heap, and the Graphics, Stack and Code subtypes. When system memory runs low, onLowMemory is triggered; at API level 14 and above there is the finer-grained onTrimMemory. In actual testing we found that onTrimMemory's ComponentCallbacks2.TRIM_MEMORY_COMPLETE is not equivalent to onLowMemory, so we recommend still listening for the onLowMemory callback.

In addition to the old coarse-grained reporting of oversized memory usage, we are building relatively fine-grained memory usage monitoring and integrating it into the Matrix platform. To implement the monitoring scheme we need the ability to obtain the various memory usage figures at runtime. Through ActivityManager's getProcessMemoryInfo we obtain the Debug.MemoryInfo of the WeChat process (note: this call can take a long time on low-end devices, so it must not be made on the main thread; we also time the call and disable the memory monitoring module on models where it is too slow). Through Debug.MemoryInfo's getMemoryStat method (API level 23 and above), we can obtain the same breakdown of memory usage as the Memory Profiler's default view. In addition, the Dalvik heap is available through Runtime, and the native heap through Debug.getNativeHeapAllocatedSize. With these, we can obtain WeChat's virtual memory and physical memory figures when low memory occurs, which is what the monitoring needs.
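A hedged sketch of gathering these figures off the main thread (error handling omitted; the summary keys are the ones documented for Debug.MemoryInfo.getMemoryStat on API level 23+):

import android.app.ActivityManager;
import android.content.Context;
import android.os.Build;
import android.os.Debug;
import android.os.Process;

final class MemorySampler {
    // Must not run on the main thread: getProcessMemoryInfo can take a
    // long time on low-end devices.
    static void sample(Context context) {
        ActivityManager am = (ActivityManager) context.getSystemService(Context.ACTIVITY_SERVICE);
        Debug.MemoryInfo info = am.getProcessMemoryInfo(new int[]{Process.myPid()})[0];

        if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.M) {
            // The same subtype breakdown the Memory Profiler's default view shows.
            String javaHeap = info.getMemoryStat("summary.java-heap");
            String nativeHeap = info.getMemoryStat("summary.native-heap");
            String graphics = info.getMemoryStat("summary.graphics");
            String stack = info.getMemoryStat("summary.stack");
            String code = info.getMemoryStat("summary.code");
            String totalPss = info.getMemoryStat("summary.total-pss");
        }

        // Dalvik heap via Runtime, native heap via Debug.
        Runtime rt = Runtime.getRuntime();
        long dalvikUsedBytes = rt.totalMemory() - rt.freeMemory();
        long nativeAllocatedBytes = Debug.getNativeHeapAllocatedSize();
    }
}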

Memory monitoring is divided into two scenarios: regular monitoring and low memory monitoring.

  • Regular memory monitoring – while WeChat is in use, the memory monitoring module samples memory usage at intervals that grow along a Fibonacci sequence (capped at 30 minutes), giving us a curve of WeChat's memory usage over time.

  • Low-memory monitoring – through the onLowMemory callback, or through the onTrimMemory callback with its level flags combined with ActivityManager.MemoryInfo's lowMemory flag, we can catch the moments when memory runs low. Note that the onLowMemory callback fires only when physical memory is low; when virtual memory exceeds its limit, an OOM exception is thrown instead. We therefore also monitor virtual memory usage and raise a low-memory alarm when it exceeds 90% of the upper limit (see the sketch below). Low-memory monitoring records how often low memory occurs, the memory usage at the time, and the WeChat scenario in which it happens.
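A hedged sketch of the virtual-memory side of the check, reading VmSize from /proc/self/status (the 4 GB ceiling applies to a 32-bit process):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

final class VmSizeChecker {
    private static final long VM_LIMIT_KB = 4L * 1024 * 1024; // 32-bit address space, in kB

    // True when VmSize exceeds 90% of the 32-bit address-space limit.
    static boolean isVirtualMemoryLow() throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("/proc/self/status"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("VmSize:")) { // e.g. "VmSize:  3276800 kB"
                    long kb = Long.parseLong(line.replaceAll("[^0-9]", ""));
                    return kb > VM_LIMIT_KB * 0.9;
                }
            }
        }
        return false;
    }
}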

Last-ditch protection

Beyond the problems and solutions above, we also adopted a last-ditch protection strategy to try against the various unknown and hard-to-solve problems.

Market-wide statistics show that tens of millions of users, 1.5%+ of the total, keep WeChat's main process alive for more than a day. If the application itself or the underlying system has even a slight memory leak, such long-running use will not cause an OOM in the short term, but memory occupancy keeps growing until the OOM state is eventually reached. With this in mind, we reasoned that if we can know a user's memory occupancy and current scenario in advance, we can protect against the abnormal state before it turns into an OOM that is uncontrollable and easily noticed by users.

How the protection works

An OOM gets the process killed; it is in essence the signal the system throws to handle the exception. If the application itself plays this role, it can handle such exceptions earlier and more flexibly than the system, according to the specific scenario. The biggest benefit is that, before coming close to triggering the system's exception, we can kill the process and restart it in a suitable scenario without the user noticing, bringing the application's memory footprint back to normal, which is no small advantage.

Here we mainly consider the following conditions:

  • WeChat has been backgrounded from the main UI and has stayed in the background for more than 30 minutes

  • The current time is between 2 a.m. and 5 a.m.

  • There is no foreground service (notification bar, music player bar, etc.)

  • The Java heap usage exceeds 85% of the process's maximum allocation || native memory exceeds 800 MB || VmSize exceeds 85% of 4 GB (WeChat is 32-bit); a sketch of these memory checks follows the list

  • Traffic consumption is light (less than 1 MB/min) && the process has no heavy CPU scheduling
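A minimal sketch of the memory conditions above, under the stated thresholds (Debug is android.os.Debug; VmSize is obtained as in the earlier /proc/self/status sketch):

// Checks only the memory clause of the kill-and-restart conditions.
static boolean memoryConditionMet(long vmSizeKb) {
    Runtime rt = Runtime.getRuntime();
    long usedHeap = rt.totalMemory() - rt.freeMemory();
    boolean javaHigh = usedHeap > rt.maxMemory() * 0.85;        // > 85% of max Java heap
    boolean nativeHigh = Debug.getNativeHeapAllocatedSize()
            > 800L * 1024 * 1024;                               // > 800 MB native
    boolean vmHigh = vmSizeKb > 4L * 1024 * 1024 * 0.85;        // > 85% of 4 GB (32-bit)
    return javaHigh || nativeHigh || vmHigh;
}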

When the conditions above are met, the current WeChat main process is killed, then re-spawned and initialized by the push process, completing the last-ditch protection. From the user's point of view, when WeChat is switched back to the foreground there is no initialization screen; it is still on the main UI, so nothing feels off. According to local tests and grayscale results, applying this last-ditch strategy can effectively reduce the OOMs users encounter: among the 50,000 grayscale users, 3 or 4 hit the strategy. Whether its specifics are reasonable, however, still needs more rigorous testing before it can be confirmed for release.

Conclusion

In the article above we have introduced as many facets of our memory optimization methods and engineering practice as possible; some less prominent issues and many details cannot be explored because space is limited. In general, the thread running through our optimization practice is to keep landing more tools and components in the development stage, systematically and gradually raising the degree of automation to improve the efficiency of finding problems.

It has to be said that even with all this effort, memory problems are still not completely wiped out. Some remain hard to locate for lack of information, some cannot be found in advance because test paths cannot cover them, and there are also compatibility problems, leaks introduced by external components, and so on. We still need to be more systematic and more automated, and that is what we continue to refine and improve.

We hope the experience here is helpful to you. There is no end to optimization, and WeChat is still working at it.