Analyzing and locating C++ memory leaks has always been a hard problem for developers on the Android platform. Because map rendering, navigation, and other core features have high performance requirements, the Amap app contains a large amount of C++ code, so solving this problem is critical to product quality. The Amap technical team has developed a set of solutions in practice.

The core of analyzing and locating memory leaks lies in two things: collecting statistics from the allocation functions and stack backtracking. Knowing only where memory was allocated, without the call stack that led there, makes the problem more complex and more costly to solve, so both are needed.

On Android, Bionic's malloc_debug module already does a fairly complete job of monitoring the memory allocation functions and collecting statistics on them, but the system lacks an efficient way to do stack backtracking. As Android has evolved, Google has provided some analysis methods that include stack backtracking, but these solutions have the following problems:

1. Stack backtracking goes through libunwind at every step. Capturing the stack this way is expensive; with a large amount of native code, frequent calls make the application sluggish.

2. They place requirements on the ROM, which is inconvenient for daily development and testing.

3. They are operated from the command line or DDMS, so every investigation requires preparing an environment and performing manual steps.

Therefore, efficient stack backtracking and a systematic Android native memory analysis system are both particularly important.

Amap has made improvements and extensions on these two points. With them, such problems can be found and fixed promptly through automated testing, which greatly improves development efficiency and reduces the cost of troubleshooting.

1. Stack backtracking acceleration

On the Android platform, stack backtracking mainly relies on libunwind, which is adequate in the vast majority of cases. However, the global lock and the unwind table parsing in libunwind's implementation cost performance, and under frequent multi-threaded calls they can make the application stutter to the point of being unusable.

Acceleration principle

The compiler's -finstrument-functions option inserts calls to custom functions at the beginning and end of every function at compile time: a call to __cyg_profile_func_enter at the beginning and a call to __cyg_profile_func_exit at the end. Both functions receive the address of the call site, so by recording these addresses the call stack can be obtained at any time.

Example of the effect after instrumentation:
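As a rough illustration, the sketch below writes out by hand what an instrumented function looks like conceptually. The names demo_func_enter, demo_func_exit, demo_log, and the event enum are hypothetical stand-ins introduced for this sketch; in a real build the compiler emits calls to __cyg_profile_func_enter and __cyg_profile_func_exit itself, and the exact code it generates may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event log, used only to make the hook order observable. */
enum demo_event { DEMO_ENTER, DEMO_BODY, DEMO_EXIT };

#define DEMO_MAX_EVENTS 16
static enum demo_event demo_events[DEMO_MAX_EVENTS];
static int demo_event_count = 0;

void demo_log(enum demo_event ev) {
    if (demo_event_count < DEMO_MAX_EVENTS)
        demo_events[demo_event_count++] = ev;
}

/* Stand-ins with the same parameters as __cyg_profile_func_enter/exit. */
void demo_func_enter(void* this_func, void* call_site) {
    (void)this_func;
    (void)call_site;
    demo_log(DEMO_ENTER);
}

void demo_func_exit(void* this_func, void* call_site) {
    (void)this_func;
    (void)call_site;
    demo_log(DEMO_EXIT);
}

/* What the programmer writes. */
int add_plain(int a, int b) {
    return a + b;
}

/* Roughly what the same function behaves like once instrumented:
 * a hook call on entry, the original body, a hook call before returning. */
int add_instrumented(int a, int b) {
    demo_func_enter((void*)add_instrumented, __builtin_return_address(0));
    int result = a + b;
    demo_log(DEMO_BODY);
    demo_func_exit((void*)add_instrumented, __builtin_return_address(0));
    return result;
}
```

Calling add_instrumented once records an enter event, the body, and an exit event, in that order, while the return value is the same as for add_plain.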

How should this call information be recorded? Each thread needs to read and write its own records without being affected by other threads. One option is thread synchronization, such as guarding reads and writes of a shared variable with a critical section or mutex, but that would be inefficient.

Can locking be avoided? Thread-local storage (TLS) comes to mind. TLS is a dedicated storage area that can only be accessed by the owning thread, so there are no thread-safety issues, which fits this scenario.

Therefore, stack backtracking is accelerated by having the compiler instrument the code to record the call stack and by storing that record in thread-local storage. The concrete implementation is as follows:

1. Use the compiler's -finstrument-functions option to insert the relevant code at compile time.

2. Call addresses are recorded in TLS as an array plus a cursor, which gives the fastest insertion, deletion, and retrieval.

Define the array + cursor data structure:

typedef struct {
    void* stack[MAX_TRACE_DEEP];
    int current;
} thread_stack_t;

Initialize the storage key for thread_stack_t in TLS:

static pthread_once_t sBackTraceOnce = PTHREAD_ONCE_INIT;
static pthread_key_t sBackTraceKey;

static void __attribute__((no_instrument_function))
destructor(void* ptr) {
    if (ptr) {
        free(ptr);
    }
}

static void __attribute__((no_instrument_function))
init_once(void) {
    pthread_key_create(&sBackTraceKey, destructor);
}

Allocate the thread_stack_t and store it in TLS on first use:

static thread_stack_t* __attribute__((no_instrument_function))
get_backtrace_info(void) {
    thread_stack_t* ptr = (thread_stack_t*) pthread_getspecific(sBackTraceKey);
    if (ptr)
        return ptr;

    ptr = (thread_stack_t*)malloc(sizeof(thread_stack_t));
    ptr->current = MAX_TRACE_DEEP - 1;
    pthread_setspecific(sBackTraceKey, ptr);
    return ptr;
}

3. Implement __cyg_profile_func_enter and __cyg_profile_func_exit to record call addresses in TLS. The second argument of __cyg_profile_func_enter, call_site, is the code address of the call site. On function entry it is written into the array already allocated in TLS and the cursor ptr->current moves left (toward index 0); on function exit, ptr->current moves right.

void __attribute__((no_instrument_function))
__cyg_profile_func_enter(void* this_func, void* call_site) {
    pthread_once(&sBackTraceOnce, init_once);
    thread_stack_t* ptr = get_backtrace_info();
    /* Record the call site and move the cursor toward index 0.
     * call_site is the return address; subtracting 4 presumably steps back
     * into the call instruction itself (4-byte instructions, as on ARM). */
    if (ptr->current > 0)
        ptr->stack[ptr->current--] = (void*)((long)call_site - 4);
}

void __attribute__((no_instrument_function))
__cyg_profile_func_exit(void* this_func, void* call_site) {
    pthread_once(&sBackTraceOnce, init_once);
    thread_stack_t* ptr = get_backtrace_info();
    /* Drop the most recent entry, clamping the cursor at the empty position. */
    if (++ptr->current >= MAX_TRACE_DEEP)
        ptr->current = MAX_TRACE_DEEP - 1;
}

Logical diagram:

Because entries are written in the direction opposite to the array's growth, the external interface for retrieving the stack is simpler and more efficient: a single memory copy yields the call stack with the most recent call site first and the oldest call site last.
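The effect of writing from the end of the array toward index 0 can be checked with a small single-threaded sketch. It mirrors the array + cursor logic of the code in this section but uses a plain struct instead of TLS; the demo_ names and the capacity DEMO_DEPTH are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

#define DEMO_DEPTH 8  /* hypothetical small capacity for illustration */

typedef struct {
    void* stack[DEMO_DEPTH];
    int current;  /* index of the next free slot; starts at the last index */
} demo_stack_t;

void demo_init(demo_stack_t* s) {
    s->current = DEMO_DEPTH - 1;
}

/* Function entry: store at the cursor, then move the cursor toward index 0. */
void demo_push(demo_stack_t* s, void* call_site) {
    if (s->current > 0)
        s->stack[s->current--] = call_site;
}

/* Function exit: move the cursor back toward the end of the array. */
void demo_pop(demo_stack_t* s) {
    if (++s->current >= DEMO_DEPTH)
        s->current = DEMO_DEPTH - 1;
}

/* One memcpy starting just after the cursor returns the newest entry first. */
int demo_copy(const demo_stack_t* s, void** out, int max) {
    int count = DEMO_DEPTH - 1 - s->current;
    if (count > max)
        count = max;
    if (count > 0)
        memcpy(out, &s->stack[s->current + 1], sizeof(void*) * (size_t)count);
    return count;
}
```

After pushing three call sites in the order A, B, C, demo_copy fills the output buffer as C, B, A without any reversal loop, which is the property the paragraph above describes.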

4. Provide an interface to obtain stack information.

int __attribute__((no_instrument_function))
get_tls_backtrace(void** backtrace, int max) {
    pthread_once(&sBackTraceOnce, init_once);
    int count = max;
    thread_stack_t* ptr = get_backtrace_info();
    if (MAX_TRACE_DEEP - 1 - ptr->current < count) {
        count = MAX_TRACE_DEEP - 1 - ptr->current;
    }
    if (count > 0) {
        memcpy(backtrace, &ptr->stack[ptr->current + 1], sizeof(void *) * count);
    }
    return count;
}

5. Compile the logic above into a dynamic library and have the other business modules depend on it at build time. Add -finstrument-functions to their compile flags so that they are instrumented; from then on, every function call is recorded in TLS. Callers can invoke get_tls_backtrace(void** backtrace, int max) from anywhere to obtain the call stack.

Performance comparison (measured with Google Benchmark; phone model: Huawei Vision 5S, Android 5.1):

• libunwind, single thread

• TLS approach, single thread

• libunwind, 10 threads

• TLS approach, 10 threads

Advantages and disadvantages

• Advantages: much faster, which makes it suitable for more frequent stack backtracking.

• Disadvantages: compiler instrumentation increases binary size, so the result cannot ship directly as a production build and is only suitable for memory-test packages. This can be handled through continuous integration: each time the C++ project is released as a library, build both the regular library and the corresponding memory-test library.

2. Systematization

The steps above solve the pain point of slow allocation-stack capture. Combined with tools provided by Google, such as DDMS and the command adb shell am dumpheap -n pid /data/local/tmp/heap.txt, they make it possible to investigate native memory leaks. However, troubleshooting this way is inefficient and requires some preparation of the phone environment.

Therefore, we decided to build a complete system to solve such problems more conveniently. The overall approach is outlined below:

• Memory monitoring uses libc's malloc_debug module. Enabling it the official way is cumbersome and ill-suited to automated testing, so instead you can compile a copy into your own project, hook all the memory functions, and redirect them to malloc_debug's monitoring functions. malloc_debug then monitors every memory allocation and release and collects the corresponding statistics.

• Use get_tls_backtrace to implement __LIBC_HIDDEN__ int32_t get_backtrace_external(uintptr_t* frames, size_t max_depth), which ties in the stack backtracking acceleration described above.

• Set up socket communication so that external programs can exchange data over a socket, which makes it easier to retrieve the memory data.

• Build a web front end where the collected memory data can be uploaded, parsed, and displayed. The addresses need to be resolved back to symbols with addr2line.

• Write test cases and integrate them with automated testing. When a test starts, memory information is collected over the socket and stored; when the test ends, it is uploaded to the platform for analysis and an evaluation email is sent. When an alert is raised, developers can investigate directly on the web front end using the memory curve and call stack information.
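To make the monitoring idea in the list above more concrete, here is a minimal sketch of the kind of bookkeeping a hooked allocator could perform: each live allocation is stored together with the call stack captured when it was made, and whatever is still in the table at the end of a test is a leak candidate. This is not malloc_debug's actual implementation; every demo_ name and both capacities are assumptions for illustration, the frames are passed in by the caller (in the real system they would come from something like get_tls_backtrace), and the table is deliberately not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_LIVE   32  /* hypothetical table capacity */
#define DEMO_MAX_FRAMES 8   /* hypothetical stack depth kept per allocation */

typedef struct {
    void*  ptr;                      /* NULL marks a free slot */
    size_t size;
    void*  frames[DEMO_MAX_FRAMES];  /* call stack captured at allocation time */
    int    frame_count;
} demo_record_t;

static demo_record_t demo_live[DEMO_MAX_LIVE];

/* Called from the allocation hook. Returns 1 if recorded, 0 otherwise. */
int demo_on_alloc(void* ptr, size_t size, void* const* frames, int frame_count) {
    if (ptr == NULL)
        return 0;
    if (frame_count < 0)
        frame_count = 0;
    if (frame_count > DEMO_MAX_FRAMES)
        frame_count = DEMO_MAX_FRAMES;
    for (int i = 0; i < DEMO_MAX_LIVE; i++) {
        if (demo_live[i].ptr == NULL) {
            demo_live[i].ptr = ptr;
            demo_live[i].size = size;
            demo_live[i].frame_count = frame_count;
            if (frame_count > 0)
                memcpy(demo_live[i].frames, frames,
                       sizeof(void*) * (size_t)frame_count);
            return 1;
        }
    }
    return 0;  /* table full */
}

/* Called from the free hook. Returns 1 if the pointer was being tracked. */
int demo_on_free(void* ptr) {
    if (ptr == NULL)
        return 0;
    for (int i = 0; i < DEMO_MAX_LIVE; i++) {
        if (demo_live[i].ptr == ptr) {
            demo_live[i].ptr = NULL;
            return 1;
        }
    }
    return 0;
}

/* Total bytes that were allocated but not yet freed. */
size_t demo_live_bytes(void) {
    size_t total = 0;
    for (int i = 0; i < DEMO_MAX_LIVE; i++)
        if (demo_live[i].ptr != NULL)
            total += demo_live[i].size;
    return total;
}

/* Looks up the record (including its call stack) of a still-live pointer. */
const demo_record_t* demo_find(const void* ptr) {
    for (int i = 0; i < DEMO_MAX_LIVE; i++)
        if (ptr != NULL && demo_live[i].ptr == ptr)
            return &demo_live[i];
    return NULL;
}
```

A test harness could sample demo_live_bytes periodically to draw the memory curve and, at the end of a run, send the remaining records with their frames to the analysis side for symbol resolution.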

Example of the system in action: