Bytedance Terminal Technology — Pang Xiangyu

Abstract

The AppHealth team (Client Infrastructure - AppHealth) has developed a detection tool for locating wild pointer (UseAfterFree), heap overflow (HeapBufferOverflow), and duplicate release (DoubleFree) problems. It is widely used for online problem detection in ByteDance's major apps. This article introduces the tool through its design principles and practical cases.

Background

As the technology stack of Android app development keeps extending into the Native layer, online stability problems are becoming more serious. More than half of the bugs in Android come from Memory Corruption. Such problems are hard to analyze and locate online for three reasons: first, they are difficult to reproduce offline; second, by the time the crash occurs, the crash site is no longer the original scene of the fault; third, their call stacks vary widely. As a result, these problems are difficult to analyze, locate, and fix in a short time.

What is a Memory Corruption problem

UseAfterFree

UseAfterFree (hereafter UAF): heap memory is accessed after it has been freed.

void HeapUseAfterFree() {
  int *ptr1 = (int*)malloc(4);
  if (ptr1 != NULL) {
    *ptr1 = 0xcccc;
    free(ptr1);           // free ptr1
    *ptr1 = 0xabcd;       // write through ptr1 after free; this usually does not crash here
  }
}

Here the UAF problem illustrates why the site of a Native crash is often not the original scene of the fault. Suppose the code above runs on thread A: line 2 requests a 4-byte block of memory, line 5 frees it, and before line 6 executes the CPU switches from thread A to thread B. Thread B then requests a 4-byte block, and the memory manager may, with some probability, hand thread B the block that ptr1 pointed to and that was just released. Thread B stores 0xFF into this memory through its own pointer, ptr2. When thread B's time slice ends, it yields the CPU, thread A resumes, and line 6 writes 0xabcd through ptr1. The memory behind ptr2 no longer holds 0xFF, the value thread B expects, so thread B's later logic misbehaves. This scenario occurs frequently in large apps.

DoubleFree

DoubleFree (hereafter DF): the same heap memory is freed twice.

void DoubleFree() {
  int *ptr = (int*)malloc(4);
  free(ptr);   // first release, valid
  free(ptr);   // second release of the same address: double free
}

DF means the same heap address is freed more than once. A typical scenario in real development: on thread A, a C++ class X allocates a block of heap memory and passes its address to a method of class Y, and Y releases the memory in its destructor after using it. Thread B then requests memory of the same size and is given this just-released address. At this point, class X on thread A runs its own destructor and frees the same address again, releasing memory that thread B now owns.

HeapBufferOverflow

HeapBufferOverflow (hereafter HBO): heap memory is accessed outside the bounds of the allocated block.

void HeapBufferOverflow() {
  char *ptr = (char*)malloc(sizeof(char) * 100);  // valid indexes are 0..99
  *(ptr + 101) = 'a';   // out-of-bounds writes past the end of the block
  *(ptr + 102) = 'b';
  *(ptr + 103) = 'c';
  *(ptr + 104) = 'd';
  free(ptr);
}

Heap out-of-bounds access is easy to understand, so we will not elaborate on it here.

Current state of tools

There are many excellent tools for Memory Corruption analysis, such as ASan (AddressSanitizer), HWASan, Valgrind, and coredump analysis. However, because of compatibility issues, performance and power overhead, high integration cost, and system limitations, these tools cannot be deployed at scale in online Android app clients. Complex problems that only appear in large-scale user scenarios are therefore hard to locate.

Tool comparison:

ByteDance's solution

Can we build an online tool that detects Memory Corruption problems? The answer is yes.

Before development, we first identified the requirements the tool has to meet:

  • Strong compatibility, low performance overhead, small memory consumption, and high stability;
  • Efficient and accurate stack backtracking, recording thread information, memory allocation size, and memory address information;
  • Configurable feature management, convenient online and offline use, and low integration cost;
  • Detection that is imperceptible to the user: no crash is triggered when an exception occurs.

The main idea is to manage the memory that the app allocates and frees in a unified way, so that allocation and release can be supervised. Monitoring every allocation and recording all the desired information would hurt performance. The tool therefore reserves a large chunk of memory through mmap and maintains it as its own memory pool, with an allocation policy based on random sampling. When an allocation matches the sampling rule, the tool serves and later releases it from the memory pool and controls the access permissions of that memory: it adds an isolation area (guard region) before and after the allocated block, sets freed memory to be neither readable nor writable, and marks the state of each block. A data structure records the thread information, the thread's stack frames, and the current state of the memory block, which is what makes detection possible. In addition, configurations are delivered dynamically from the server, so the feature can be managed online.
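The pool-plus-guard idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: the names guarded_slot, slot_init, slot_alloc and slot_free are hypothetical, the pool holds a single one-page slot, and alignment of the returned pointer is ignored.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical per-slot metadata; the field names are illustrative only. */
enum slot_state { SLOT_UNUSED, SLOT_ALLOCATED, SLOT_FREED };

struct guarded_slot {
    uint8_t *base;          /* start of the left guard page */
    size_t page;            /* system page size */
    size_t size;            /* size requested by the caller */
    enum slot_state state;  /* marks whether the block is live or freed */
};

/* Reserve guard | data | guard. Everything starts inaccessible. */
static int slot_init(struct guarded_slot *s) {
    void *mem;
    s->page = (size_t)sysconf(_SC_PAGESIZE);
    mem = mmap(NULL, 3 * s->page, PROT_NONE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    s->base = mem;
    s->size = 0;
    s->state = SLOT_UNUSED;
    return 0;
}

/* Open the data page and return a block that ends at the right guard page. */
static void *slot_alloc(struct guarded_slot *s, size_t size) {
    assert(s->state != SLOT_ALLOCATED);
    if (size == 0 || size > s->page) return NULL;
    if (mprotect(s->base + s->page, s->page, PROT_READ | PROT_WRITE) != 0)
        return NULL;
    s->size = size;
    s->state = SLOT_ALLOCATED;
    /* Right-aligned: writing one byte past the end hits the guard page. */
    return s->base + 2 * s->page - size;
}

/* Freed memory becomes unreadable and unwritable, so later use faults. */
static void slot_free(struct guarded_slot *s) {
    assert(s->state == SLOT_ALLOCATED);
    mprotect(s->base + s->page, s->page, PROT_NONE);
    s->state = SLOT_FREED;
}
```

With this layout, an overflow past the end of the block and any access after slot_free both land on inaccessible pages and raise SIGSEGV. A real pool would hold many slots, honor alignment, and also store the thread and stack information described above.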

Selection of Hook tools

To locate Memory Corruption problems, we need to hook the functions related to memory allocation and release in order to monitor memory. This raises the choice of hook scheme. For online use, the first considerations are efficiency, stability, and good compatibility.

Common types of online Hook tools are as follows:

After comparing the tools and running many experiments, we chose the dispatch table hook. malloc, free and related functions are called very frequently, and hooking the dispatch table is efficient and stable, has little impact on performance, and can be enabled at scale online. The principle is to find the address of the dispatch table and replace the addresses of the malloc-related functions in it, which is enough to hook those functions. Because this hooks the callee, there is no need to worry about libraries that are loaded later: their calls pass through the same table. Google/LLVM also uses this approach for malloc proxying. Memory Corruption problems are usually caused by the allocation and release of small blocks (within 4 KB), so there is no need to hook mmap-related functions for now. In terms of compatibility, we have adapted to Android 5.x through 11.
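The table-replacement principle can be illustrated with a self-contained sketch. The struct alloc_dispatch below is a hypothetical two-entry table invented for this example; it is not the layout of the real libc dispatch table, which is private and has more entries. The sketch only shows the mechanism of saving the original entries and swapping in wrappers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical dispatch table: callers reach malloc/free through it. */
struct alloc_dispatch {
    void *(*malloc_fn)(size_t);
    void (*free_fn)(void *);
};

static struct alloc_dispatch g_dispatch = { malloc, free };
static struct alloc_dispatch g_original;   /* saved entries for forwarding */
static size_t g_malloc_calls;              /* stand-in for monitoring logic */

static void *hooked_malloc(size_t size) {
    g_malloc_calls++;                      /* sampling and recording would go here */
    return g_original.malloc_fn(size);     /* forward to the real allocator */
}

static void hooked_free(void *ptr) {
    g_original.free_fn(ptr);
}

/* Replace the table entries once; every caller of the table is now hooked. */
static void install_hook(void) {
    assert(g_dispatch.malloc_fn != hooked_malloc);
    g_original = g_dispatch;
    g_dispatch.malloc_fn = hooked_malloc;
    g_dispatch.free_fn = hooked_free;
}

static void remove_hook(void) {
    g_dispatch = g_original;
}
```

Because only the table is patched, no caller has to be modified, which is why libraries loaded after the hook is installed are covered as well.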

Stack backtracking scheme selection

There are many stack backtracking approaches on Android. We surveyed and compared the mainstream schemes.

Based on comparison and experiments, we chose the FP (frame pointer) method for stack backtracking on Arm64 devices. Frame pointers are enabled by default on Arm64, and our experiments show that the 64-bit .so libraries in our online apps do not disable them. FP-based backtracking costs almost nothing: in actual tests, backtracking 15 stack frames takes 1 to 2 μs on average, while the other backtracking methods are basically at the millisecond level.
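As a rough sketch of why FP backtracking is so cheap, the walk below follows a chain of AArch64-style frame records, where each record stores the caller's frame pointer followed by the return address. The function name fp_unwind and the explicit stack bounds are assumptions made for this example; the tool's actual unwinder is not shown in the article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AArch64 frame record: x29 points at {caller's x29, saved x30}. */
struct frame_record {
    struct frame_record *prev;  /* caller's frame record */
    uintptr_t lr;               /* return address into the caller */
};

/* Follow the chain from fp and store up to max_frames return addresses.
 * Records outside [stack_lo, stack_hi) or misaligned records end the walk. */
static size_t fp_unwind(const struct frame_record *fp,
                        uintptr_t stack_lo, uintptr_t stack_hi,
                        uintptr_t *pcs, size_t max_frames) {
    size_t n = 0;
    assert(pcs != NULL);
    while (fp != NULL && n < max_frames) {
        uintptr_t addr = (uintptr_t)fp;
        if (addr < stack_lo || addr + sizeof(*fp) > stack_hi) break;
        if (addr % sizeof(void *) != 0) break;
        pcs[n++] = fp->lr;
        /* The stack grows down, so a caller's record sits at a higher address. */
        if ((uintptr_t)fp->prev <= addr) break;
        fp = fp->prev;
    }
    return n;
}
```

Each step is two loads and a few comparisons, with no unwind-table lookup. On a real Arm64 build with frame pointers, the starting record could come from __builtin_frame_address(0) and the bounds from the thread's stack attributes; both are left out here so the sketch stays platform independent.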

On Arm32, the frame pointer is disabled by default, so the app's allocation and release call stacks cannot be recorded with FP backtracking. We optimized libunwind_stack and use it on Arm32 to record the allocation and release stacks.

Dual-sampling memory allocation policy

Current Memory Corruption monitoring usually instruments the program to monitor all allocations and releases. A single swipe by the user can cause an app to allocate and free memory thousands to tens of thousands of times. Adding stack backtracking on top of every one of these calls seriously degrades the performance of the monitored program, causing lag and a poor user experience.

To solve this problem, we adopt a dual-sampling mechanism that controls both how many users are monitored and how much memory is monitored on each client. Dual sampling combines sampling through server-delivered configuration with random sampling of memory allocations on the client. Server-side configuration sampling sets the user sampling ratio on the server and samples in proportion by problem type, app version, and device model. Client-side sampling is implemented by an on-device random allocation sampling algorithm. Together they make both the number of monitored users and the amount of monitored memory on each device randomly sampled and configurable.
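The client-side half of dual sampling can be sketched as a cheap per-allocation decision. The article does not describe the actual algorithm, so everything here is an assumption for illustration: the sampler struct, the xorshift32 generator, and the "1 in rate" rule, where rate stands for a value that would arrive with the server-delivered configuration. The 4 KB cap mirrors the block size limit stated elsewhere in the article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SAMPLED_SIZE 4096u  /* only blocks up to 4 KB are monitored */

struct sampler {
    uint32_t rng_state;  /* xorshift32 state, must stay non-zero */
    uint32_t rate;       /* about 1 in `rate` allocations is sampled; 0 disables */
};

/* Small, fast pseudo-random generator (xorshift32). */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

static void sampler_init(struct sampler *s, uint32_t seed, uint32_t rate) {
    s->rng_state = seed ? seed : 1u;  /* a zero state would never change */
    s->rate = rate;
}

/* Returns 1 if this allocation should be served from the monitored pool. */
static int should_sample(struct sampler *s, size_t size) {
    assert(s->rng_state != 0);
    if (s->rate == 0 || size == 0 || size > MAX_SAMPLED_SIZE) return 0;
    return xorshift32(&s->rng_state) % s->rate == 0;
}
```

Allocations that are not sampled go straight to the system allocator, so the cost on the common path is one random draw and a comparison.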

Imperceptible detection

When a Memory Corruption problem occurs, it normally triggers a SIGSEGV crash. To keep the user unaware, the program must not crash and exit. We therefore register a SIGSEGV signal handler. When a managed memory block is freed, it is set to unreadable, so any code that later accesses it triggers SIGSEGV and enters the handler. The handler first checks whether the faulting address lies in our managed memory pool. If it does, the handler restores read and write permission on the corresponding memory segment, so that the process exit path is not triggered during signal handling and the Memory Corruption problem is detected without the user noticing. If the faulting address is not in our managed memory, the signal is forwarded to the original signal handler.
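The handler flow described above can be sketched with sigaction and mprotect. This is a simplified illustration, not the tool's code: the globals, install_detector and on_sigsegv are hypothetical names, the pool is a single region, and the step that records the exception, Free and Alloc stacks is reduced to a counter.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static uint8_t *g_pool;                   /* managed pool (single region here) */
static size_t g_pool_size;
static size_t g_page_size;
static struct sigaction g_prev_action;    /* handler installed before ours */
static volatile sig_atomic_t g_detected;  /* faults caught inside the pool */

static void on_sigsegv(int sig, siginfo_t *info, void *ctx) {
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t lo = (uintptr_t)g_pool;
    if (addr >= lo && addr < lo + g_pool_size) {
        /* Fault in managed memory: record it, reopen the page, resume. */
        uintptr_t page = addr & ~(uintptr_t)(g_page_size - 1);
        g_detected++;
        mprotect((void *)page, g_page_size, PROT_READ | PROT_WRITE);
        return;  /* the faulting instruction is retried and now succeeds */
    }
    /* Not our memory: hand the signal to whoever was registered before. */
    if (g_prev_action.sa_flags & SA_SIGINFO) {
        g_prev_action.sa_sigaction(sig, info, ctx);
    } else if (g_prev_action.sa_handler != SIG_DFL &&
               g_prev_action.sa_handler != SIG_IGN) {
        g_prev_action.sa_handler(sig);
    } else {
        signal(sig, SIG_DFL);  /* default action applies when the fault repeats */
    }
}

/* Map the pool and register the SIGSEGV handler. Returns 0 on success. */
static int install_detector(size_t pages) {
    struct sigaction sa;
    void *mem;
    assert(pages > 0);
    g_page_size = (size_t)sysconf(_SC_PAGESIZE);
    g_pool_size = pages * g_page_size;
    mem = mmap(NULL, g_pool_size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    g_pool = mem;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigsegv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    return sigaction(SIGSEGV, &sa, &g_prev_action);
}
```

After the handler returns, the kernel re-executes the faulting instruction, which now succeeds because the page is accessible again, so the app keeps running while the fault has been recorded.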

Solution process

Advantages and disadvantages of the scheme

  • Advantages

    • Usable online and offline with low integration cost: initialization relies only on the AAR component, and no additional steps are needed;
    • Configurable management: configurations are delivered from the cloud, and features can be switched on and off dynamically;
    • Sampled memory allocation management: the online memory pool is kept between 100 KB and 8 MB, with a total memory overhead of 700 KB to 8.6 MB.

  • Disadvantages

    • The size of monitored memory blocks is limited to a maximum of 4 KB;
    • Crashes caused by non-heap memory cannot be detected;
    • iOS and x86 are not supported yet, but will be supported later;
    • Android 4.4 and below are not supported yet, but may be supported later.

Online results and case studies

After the MemCorruption tool went live in several apps, more than 200 Memory Corruption problems in base libraries were found, and 30+ of them have been located and fixed with the help of the tool.

Case 1: UseAfterFree problem

The log contains the exception stack, the Free stack, and the Alloc stack. The Abort message records the size of the memory allocation, and the Free and Alloc stacks record the threads that freed and allocated the block. With this information you can trace how a block of memory was allocated and released, then locate the problem by reading the source code.

The following is a UAF problem detected online in a service SDK. The Abort message indicates a UAF: the memory size is 256 bytes, and UAF detection was triggered at memory 0x7A25A28B00, at an offset of 240 bytes. A non-zero offset like this typically appears when a member variable of a structure is accessed, which means the structure was used again after being released.
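How a report like this could be derived from the recorded block information is sketched below. The article does not show the tool's classification logic, so block_record, fault_report and classify_fault are hypothetical, and the rule used (inside a freed block means UAF, outside the block means overflow into the isolation area) is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum block_state { BLOCK_ALLOCATED, BLOCK_FREED };
enum fault_kind {
    FAULT_UNKNOWN,
    FAULT_USE_AFTER_FREE,
    FAULT_HEAP_BUFFER_OVERFLOW
};

/* Hypothetical metadata kept for each sampled block. */
struct block_record {
    uint64_t base;           /* start address of the block */
    size_t size;             /* size requested by the caller */
    enum block_state state;  /* live or already freed */
};

struct fault_report {
    enum fault_kind kind;
    int64_t offset;  /* fault address relative to base; negative = before block */
};

/* Classify a fault from the faulting address and the block's record. */
static struct fault_report classify_fault(const struct block_record *b,
                                          uint64_t fault_addr) {
    struct fault_report r;
    assert(b != NULL && b->size > 0);
    r.offset = (int64_t)(fault_addr - b->base);
    if (r.offset < 0 || (uint64_t)r.offset >= b->size) {
        r.kind = FAULT_HEAP_BUFFER_OVERFLOW;  /* landed in an isolation area */
    } else if (b->state == BLOCK_FREED) {
        r.kind = FAULT_USE_AFTER_FREE;        /* inside a block already freed */
    } else {
        r.kind = FAULT_UNKNOWN;
    }
    return r;
}
```

Under these assumptions, a fault 240 bytes into a freed 256-byte block is classified as use-after-free, while a fault at offset 256 or at a negative offset is classified as an overflow.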

  • Exception stack and Abort MSG

Trigger detection code logic

  • Free stack information

According to the stack of the Free operation, thread 11170 freed the corresponding memory. Combined with the code, the freed memory block can be identified as the variable m_pDefaultFilter, and m_filterType is a member variable of m_pDefaultFilter.

Free object memory code

  • Alloc Memory information

m_pDefaultFilter had already been freed, so the later access to its member variable m_filterType touched freed memory and triggered UAF detection. Compared with a traditional tombstone, which contains only the exception stack, the MemCorruption tool makes the problem much clearer and captures the original scene of the fault. This shortens troubleshooting time for developers and improves the efficiency of problem solving.

Case 2: DoubleFree problem

  • Exception stack and Abort MSG

  • Free stack information

  • Alloc stack information

From the information above, we know that a DoubleFree exception occurred in the libbinder library. There are two call paths through which free can release mData or mObjects of the Parcel class.

Path 1: recycle --> freeBuffer --> … --> freeData --> freeDataNoInit --> free;

Path 2: writeString --> nativeWritexxx --> writexxx16 --> writexxx --> continueWrite --> realloc --> free;

continueWrite and freeDataNoInit are both in Parcel.cpp and are not protected against concurrent access, so mData and mObjects can be released more than once when their lifetimes are affected by concurrency. Based on the exception stack, the problem was fixed by adding protection in the business code.

Conclusion

Memory Corruption is an unavoidable problem for C/C++ developers. The principle of the MemCorruption tool is not complicated. In large-scale user scenarios, it monitors memory allocation and release through sampling and detects problems without triggering a program crash. It can effectively detect Memory Corruption problems caused by low-probability and edge scenarios online, reducing app vulnerabilities and improving app stability.

The MemCorruption tool is just one part of ByteDance's efforts to address online Memory Corruption issues, and there is still a lot of work to be done. Please keep following the ByteDance terminal technology team; more is on the way.

Follow-up plan

To avoid affecting the performance of online apps, the MemCorruption tool currently limits the scope of memory monitoring, and we will expand this capability in the future. Memory Corruption problems also exist on iOS, so please stay tuned for the iOS version.

This tool will be available in APMPlus, a performance monitoring product in MARS, ByteDance's application development suite. APMPlus uses advanced data collection and monitoring technologies to provide enterprises with full-link application performance monitoring and meet their end-to-end monitoring needs. With non-intrusive monitoring and rich capabilities for restoring the scene of an exception, it helps enterprises troubleshoot and fix problems more efficiently, improve application quality, reduce costs, and increase revenue.

APMPlus is currently offering new users a free 30-day trial for a limited time. The trial covers App monitoring, Web monitoring, Server monitoring, and mini program monitoring: App monitoring and Web monitoring each include 5 million events, and Server and mini program monitoring are available for a limited time. You are welcome to try it.

Click the link to go to the official website for more product information.