preface

  • Since the Douyin team shared the article “The development practice of Douyin: a solution based on binary file rearrangement to improve APP startup speed by more than 15%”, binary rearrangement as a pre-main startup-time optimization has spread widely.

  • This article first explains the principle of binary rearrangement (the principle section of the Douyin team’s article only touches the surface, so many readers came away without really understanding it). It then introduces, and puts into practice, how clang instrumentation can solve the problem the Douyin team left open:

    Hooking objc_msgSend cannot cover pure Swift, block, or C++ methods.

    With that solved, we can arrive at a complete binary rearrangement scheme.

(This article is explained from first principles; readers already familiar with the material may find the pace wordy. To accommodate most readers, feel free to skip ahead using the table of contents.)

Before we get to binary rearrangement, we need a bit of background on what problem it is meant to solve.

Virtual memory vs. physical memory

For convenience, we distinguish three terms:

Virtual memory, physical memory, disk

1. No virtual memory

In the early days, an app’s entire data was loaded from disk into physical memory. Suppose physical memory is 120 MB, program A needs 10 MB, program B needs 100 MB, and program C needs 30 MB.

When we run program A first, memory 0 MB–10 MB is allocated to A.

When we then run program B, memory 10 MB–110 MB is allocated to B.

At this point 110 MB of memory is in use and only 10 MB remains.

When we want to run program C, even freeing the area occupied by A would give back only 10 MB, which is still not enough. The only option is to swap the area occupied by B out to disk, and then allocate memory 10 MB–40 MB to program C.

Problems:
  1. Low memory efficiency. When C replaces B, the whole of B has to be swapped out to disk and the whole of C swapped in, which means a large amount of data exchange.
  2. Security: programs are not isolated from each other. Memory is contiguous, so program A can reach program B’s memory simply via an offset.
  3. Relocation: the load address is different every time, so addresses have to be relocated on every load.

To solve the security problem, virtual memory was proposed.

To solve the efficiency and relocation problems, segmentation and paging were proposed.

2. Virtual memory

A virtual address space is created and mapped onto the actual physical space. Each process has its own separate virtual space and can only access its own virtual space, which effectively isolates programs from one another.

3. Segmentation and paging

  1. Segmentation

    To address the efficiency problem, the first idea was segmentation: map a segment of virtual space, as large as the memory the program needs, onto a region of physical address space. As shown in the figure below, each byte in the virtual space corresponds to a byte in the physical space. The mapping itself is set up by software, while the actual address translation is done by hardware.

    For example, when program App1 accesses address 0x00000000, the physical memory address actually accessed is 0x00C00000.

    Virtual memory plus segmentation solves the security and relocation problems, but it does not solve memory inefficiency: a lot of data is still exchanged as programs switch back and forth, even though we only use a fraction of each program’s data, and segmentation still reads and writes the entire program between memory and disk. To solve this, a finer-grained way of splitting and mapping memory was devised, one that takes full advantage of the principle of locality to improve memory utilization. That method is paging.

  2. Paging

    “The basic approach of paging is to divide the address space artificially into fixed-size pages. The page size is determined by the hardware, or the hardware supports several page sizes and the operating system chooses one; for example, Intel Pentium processors support 4KB or 4MB pages. The operating system can pick either 4KB or 4MB, but only one size at a time, so the page size is fixed for the whole system. Almost all current operating systems on PCs use 4KB pages. The PCs we use have a 32-bit virtual address space, i.e. 4GB, so at 4KB per page that is 1,048,576 pages. The same applies to the physical space.”

    “Let’s look at a simple example, as shown in the figure below. Each virtual space has 8 pages, and each page is 1KB, so the virtual address space is 8KB. Assume the computer has 13 address lines, i.e. 2^13 bytes of physical addressing capability, so the physical space could in theory be up to 8KB. But suppose that, for one reason or another, you can only afford 6KB of memory; then only the first 6KB of physical space actually matters.”

    “So, when we split the process’s virtual address space into pages, we load the frequently used data and code pages into memory and keep the less frequently used pages on disk, pulling them in when needed. In the example of Figure 1-6, assume there are two processes, Process1 and Process2. Some of their virtual pages are mapped to physical pages, e.g. VP0, VP1 and VP7 map to PP0, PP2 and PP3; some are on disk, e.g. VP2 and VP3 are in DP0 and DP1; and some, such as VP4, VP5 and VP6, have not yet been used or accessed. Here we call pages in the virtual space Virtual Pages (VP), pages in physical memory Physical Pages (PP), and pages on disk Disk Pages (DP). The lines in the figure represent the mappings, and we can see that some virtual pages from different spaces map to the same physical page, which is how memory can be shared.”

    “VP2 and VP3 of Process1 in Figure 1-6 are not in memory. When the process needs to use these pages, the hardware catches this — an event called a Page Fault — and the operating system takes over, reads VP2 and VP3 from disk, loads them into memory, and then maps those two physical pages to VP2 and VP3. Accessing and swapping data page by page like this is very convenient, and the hardware itself supports this page-granular approach.”

    “The implementation of virtual memory depends on hardware support, which differs between CPUs, but almost all hardware uses a component called the MMU (Memory Management Unit) to perform page mapping.”

Binary rearrangement optimization principle

As the introduction above shows, every time a page that has not yet been loaded is accessed, a Page Fault occurs, and every Page Fault blocks the process. Suppose we call methods in the order Method1 > Method4 > Method2, with Method1 on Page1 and Method4 on Page2. Calling Method1 triggers a Page Fault because Page1 is not yet in memory; calling Method4 triggers another one because Page2 also has to be loaded. That is two Page Faults.

If instead we place Method1 and Method4 on the same page, then when the system calls Method1, Page1 is loaded into memory, and the subsequent call to Method4 no longer triggers a Page Fault: one fault in total, which is exactly the optimization we are after.

In a real project, we group together as many of the functions that are called at startup as possible. This reduces the number of Page Faults generated during launch, which is the optimization commonly known as binary rearrangement.
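To get a feel for the magnitude, here is a back-of-the-envelope estimate with purely illustrative numbers: suppose launch touches 300 functions of about 0.5 KB each, scattered across the binary so that almost every one sits on a different 4 KB page — that is up to roughly 300 Page Faults. Packed contiguously, the same 300 × 0.5 KB ≈ 150 KB fits in about 150 / 4 ≈ 38 pages, i.e. roughly 38 Page Faults.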

How to view PageFault

  1. Open Instruments and select System Trace
  2. Click the record button in the upper left corner, stop recording once the home page appears, and select the process you want to inspect, as shown in the picture below

How does binary rearrangement work

First, Xcode links with ld, and ld has a parameter called Order File. Through this parameter we can configure the path of an order file.

In the order file, write the symbols you need in order.

When the project is built, Xcode reads this file, and the generated Mach-O lays out its symbols in the order given in the file.
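A minimal sketch of what such an order file might look like (the symbol names below are made up for illustration; the file path is set in Build Settings under Linking → Order File):

# hypothetical app.order — symbols listed in the order they should appear
_main
-[AppDelegate application:didFinishLaunchingWithOptions:]
-[HomeViewController viewDidLoad]
_someStartupCFunction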

1️ Q: Will there be a problem if a symbol in the order file is wrong or does not exist?

  • A: ld simply ignores those symbols. In fact, if you pass the link option -order_file_statistics, the missing symbols are printed as warnings in the build log (see the sketch after this list).

2️ Q: Some readers may wonder whether this approach affects App Store review.

  • A: First of all, the objc source code itself uses this approach.
  • Binary rearrangement merely rearranges the order of the function and symbol tables in the generated Mach-O.
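For reference, a sketch of how the statistics option mentioned above can be passed from Xcode: -order_file_statistics is an ld option, so when set in Build Settings → Other Linker Flags it goes through the clang driver via -Wl:

-Wl,-order_file_statistics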

Clang instrumentation

The official documentation

Configuration items

  • In the test project, under Build Settings -> Other C Flags, add -fsanitize-coverage=trace-pc-guard

  • Implement the following two methods in your code

    void __sanitizer_cov_trace_pc_guard_init(uint32_t *start, uint32_t *stop)
    void __sanitizer_cov_trace_pc_guard(uint32_t *guard) 
Official example:

The function description

// trace-pc-guard-cb.cc
#include <stdint.h>
#include <stdio.h>
#include <sanitizer/coverage_interface.h>
// The compiler inserts this callback as a module constructor into every DSO; start and stop mark the beginning and end of the guard section for the whole binary. It is called at least once per DSO and may be called multiple times with the same arguments.

// This callback is inserted by the compiler as a module constructor
// into every DSO. 'start' and 'stop' correspond to the
// beginning and end of the section with the guards for the entire
// binary (executable or DSO). The callback will be called at least
// once per DSO and may be called multiple times with the same parameters.
// uint32_t *start: pointer to the first 4-byte guard (one guard per instrumented symbol)
// uint32_t *stop: pointer just past the last guard
extern "C" void __sanitizer_cov_trace_pc_guard_init(uint32_t *start,
                                                    uint32_t *stop) {
  static uint64_t N;  // Counter for the guards.
  if (start == stop || *start) return;  // Initialize only once.
  printf("INIT: %p %p\n", start, stop);
  for (uint32_t *x = start; x < stop; x++)
    *x = ++N;  // Guards should start from 1.
}

// This callback is inserted by the compiler on every edge in the
// control flow (some optimizations apply).
// Typically, the compiler will emit the code like this:
// if(*guard)
// __sanitizer_cov_trace_pc_guard(guard);
// But for large functions it will emit a simple call:
// __sanitizer_cov_trace_pc_guard(guard);
// The callback function for each insertion
extern "C" void __sanitizer_cov_trace_pc_guard(uint32_t *guard) {
  if (!*guard) return;  // Duplicate the guard check.
  // If you set *guard to 0 this code will not be called again for this edge.
  // Now you can get the PC and do whatever you want:
  // store it somewhere or symbolize it and print right away.
  // The values of `*guard` are as you set them in
  // __sanitizer_cov_trace_pc_guard_init and so you can make them consecutive
  // and use them to dereference an array or a bit vector.
  void *PC = __builtin_return_address(0);
  char PcDescr[1024];
  // This function is a part of the sanitizer run-time.
  // To use it, link with AddressSanitizer or other sanitizer.
  __sanitizer_symbolize_pc(PC, "%p %F %L", PcDescr, sizeof(PcDescr));
  printf("guard: %p %x PC %s\n", guard, *guard, PcDescr);
}
Possible problems

This problem appears if the two callbacks above are not implemented.

A closer look at the two functions

void __sanitizer_cov_trace_pc_guard_init(uint32_t *start, uint32_t *stop): as the printed output shows, this callback receives start and stop pointers delimiting the guard region. It iterates over that range and writes an increasing number into each guard slot (these are the values later passed to void __sanitizer_cov_trace_pc_guard(uint32_t *guard), the part highlighted in blue in the figure).

At this point we wonder what this number means.

Let’s add a - (void)test method and try again: the printed count increases by one. So this contiguous region holds one guard per symbol — its length is the number of instrumented symbols.
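A minimal sketch to confirm this (an assumption based on the behaviour above, not part of the official example): since there is one 4-byte guard per instrumented symbol, stop - start gives the symbol count for the module.

#include <stdint.h>
#include <stdio.h>

void __sanitizer_cov_trace_pc_guard_init(uint32_t *start, uint32_t *stop) {
    static uint64_t N;
    if (start == stop || *start) return;              // initialize only once
    // One guard per instrumented symbol, so this prints the symbol count.
    printf("guard (symbol) count: %ld\n", (long)(stop - start));
    for (uint32_t *x = start; x < stop; x++)
        *x = ++N;                                     // guards start from 1
}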

Looking at the assembly, a call to void __sanitizer_cov_trace_pc_guard(uint32_t *guard) is inserted at the beginning of every instrumented function, before the function’s own code runs. Inside this callback we can call __builtin_return_address(0) to obtain the address the callback will return to, as shown in the figure below: the PC obtained in the callback is the address inside the method that is about to execute.

From that address we can then obtain the symbol name of the function:

#include <dlfcn.h>
/* Structure filled in by dladdr(). */
typedef struct dl_info {
        const char      *dli_fname;     /* Pathname of shared object */
        void            *dli_fbase;     /* Base address of shared object */
        const char      *dli_sname;     /* Name of nearest symbol */
        void            *dli_saddr;     /* Address of nearest symbol */
} Dl_info;
extern int dladdr(const void *, Dl_info *);


Dl_info info;
dladdr(PC, &info);

First import the library, then define a Dl_info struct variable and fill it in with dladdr. dli_sname is the name of the nearest symbol.

This lets us map a PC back to its symbol name. Since __sanitizer_cov_trace_pc_guard fires before each method executes, by recording the PC every time we can collect the symbol names of everything run up to the App’s first screen.
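As a standalone sanity check (a minimal sketch, independent of the instrumentation), dladdr can be tried on any known exported function pointer:

#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    Dl_info info;
    // Resolve the nearest symbol for the address of printf itself.
    if (dladdr((const void *)&printf, &info) && info.dli_sname) {
        printf("nearest symbol: %s\n", info.dli_sname);   // expected: "printf"
    }
    return 0;
}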

Pitfalls:

→ : Multithreading problem

Since methods are called from different threads, collecting the PCs cannot simply use a plain array; here a lock-free atomic queue is used instead:

#import <libkern/OSAtomic.h>
#include <dlfcn.h>

// Atomic queue
static OSQueueHead symboList = OS_ATOMIC_QUEUE_INIT;
// Node type used to record one collected symbol
typedef struct {
    void *pc;
    void *next;
} SymbolNode;

void __sanitizer_cov_trace_pc_guard(uint32_t *guard) {
    if (!*guard) return;  // Duplicate the guard check.
    void *PC = __builtin_return_address(0);
    SymbolNode *node = malloc(sizeof(SymbolNode));
    *node = (SymbolNode){PC, NULL};
    // Enqueue the node.
    // offsetof(SymbolNode, next) tells the queue where the next pointer lives inside a node.
    OSAtomicEnqueue(&symboList, node, offsetof(SymbolNode, next));
}

- (void)touchesBegan:(NSSet<UITouch *> *)touches withEvent:(UIEvent *)event {
    // Walk through the queue
    while (true) {
        // offsetof finds the offset of a member relative to its struct
        SymbolNode *node = OSAtomicDequeue(&symboList, offsetof(SymbolNode, next));
        if (node == NULL) break;
        Dl_info info;
        dladdr(node->pc, &info);

        printf("%s \n", info.dli_sname);
    }
}
→ : Infinite loop problem

The clang instrumentation described above inserts hook calls inside loops as well.

Once enqueueing and dequeueing are confirmed to work, and saving and reading in our own code looks fine, we hit this pit: the loop never ends. Why?

I won’t walk through the assembly here; the conclusion is:

In the assembly you can see that a method containing a while loop has calls to __sanitizer_cov_trace_pc_guard statically inserted at multiple points, including on the loop edges. Each such call enqueues another node, so the dequeue loop never empties and spins forever.
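A rough sketch of the effect (the commented-out calls stand for instrumentation inserted by the compiler; this is an illustration, not literal compiler output):

- (void)touchesBegan:(NSSet<UITouch *> *)touches withEvent:(UIEvent *)event {
    // __sanitizer_cov_trace_pc_guard(&guard_entry);     // inserted at function entry
    while (true) {
        // __sanitizer_cov_trace_pc_guard(&guard_edge);  // inserted on the loop edge: every
        //                                                // iteration enqueues a new node, so
        //                                                // the queue can never be drained
        SymbolNode *node = OSAtomicDequeue(&symboList, offsetof(SymbolNode, next));
        if (node == NULL) break;
    }
}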

→ : Solution

Change Other C Flags to the following, so that only function entries are instrumented rather than every control-flow edge:

-fsanitize-coverage=func,trace-pc-guard

→ : Another kind of infinite loop

If an OC method is called inside the while loop, __sanitizer_cov_trace_pc_guard fires again before that method runs and enqueues another node via OSAtomicEnqueue(&symboList, node, offsetof(SymbolNode, next));, so the queue never drains and the loop never ends (one possible fix is sketched after the code below).

- (void)touchesBegan:(NSSet<UITouch *> *)touches withEvent:(UIEvent *)event {
    // Walk through the queue
    while (true) {
        SymbolNode *node = OSAtomicDequeue(&symboList, offsetof(SymbolNode, next));
        if (node == NULL) break;
        // ...
        [self test];   // calling an OC method here re-triggers the guard callback
        // ...
    }
}
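A sketch of one way out (the same idea appears as the isFinding flag in the complete listing further below): stop collecting while the queue is being drained.

static BOOL isFinding = NO;

void __sanitizer_cov_trace_pc_guard(uint32_t *guard) {
    if (isFinding) return;   // skip collection while the queue is being drained
    void *PC = __builtin_return_address(0);
    SymbolNode *node = malloc(sizeof(SymbolNode));
    *node = (SymbolNode){PC, NULL};
    OSAtomicEnqueue(&symboList, node, offsetof(SymbolNode, next));
}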
→ : The load method

When a +load method runs, __sanitizer_cov_trace_pc_guard is called with *guard equal to 0 (the init callback has not assigned the guards yet at that point).

That is why load methods did not appear in the output above.

Solution: remove (or comment out) the early-return guard check in __sanitizer_cov_trace_pc_guard:

// if (!*guard) return;

Remaining refinement work

  • If you collect symbols with the atomic queue as above, note that it pops nodes last-in-first-out, so the dequeued symbols come out in reverse call order and need to be reversed.

  • We also need to de-duplicate the symbols.

  • The order file format requires C function and block symbols to be prefixed with an underscore (_); OC method symbols are kept as-is.

  • Write to the file.

  • For libraries introduced via CocoaPods: our company project uses use_modular_headers!, so the pods are built as static libraries and the settings below can be applied to them directly. For dynamic libraries, the two callbacks above need to be re-implemented inside each dynamic library.

    config.build_settings['LD_GENERATE_MAP_FILE'] = 'YES'
    config.build_settings['OTHER_CFLAGS'] = '$(inherited) -fsanitize-coverage=func,trace-pc-guard'
    config.build_settings['OTHER_SWIFT_FLAGS'] = '$(inherited) -sanitize-coverage=func -sanitize=undefined'
    config.build_settings['GCC_PREPROCESSOR_DEFINITIONS'] = '$(inherited) OrderFileDebug=1'
  • For Swift projects, add the following to Other Swift Flags:

    -sanitize-coverage=func -sanitize=undefined

The complete method is as follows:

#import <libkern/OSAtomic.h>
#include <dlfcn.h>
#import <os/signpost.h>

void __sanitizer_cov_trace_pc_guard_init(uint32_t *start, uint32_t *stop) {
    static uint64_t N;
    if (start == stop || *start) return;
    printf("INIT: %p %p\n", start, stop);
    for (uint32_t *x = start; x < stop; x++)
        *x = ++N;
}


// Define the atomic queue
OSQueueHead symbolList = OS_ATOMIC_QUEUE_INIT;

// Define the structure
typedef struct {
    void *pc;
    void *next;
} SYNode;

static BOOL isFinding = false;

void __sanitizer_cov_trace_pc_guard(uint32_t *guard)  {
    if (isFinding) {
        return;
    }
    void *PC = __builtin_return_address(0);
    // Create a structure
    SYNode *node = malloc(sizeof(SYNode));
    *node = (SYNode){PC, NULL};
    // The structure is pushed
    //offsetof: parameter 1 passes in the type to return the address of the next node to the parameter
    OSAtomicEnqueue(&symbolList, node, offsetof(SYNode, next));
}

@implementation LGClangTools
/// Drain the queue and write the collected symbols to an order file
+ (void)symbolNames {
#if OrderFileDebug == 1
    // Define an array
    isFinding = true;
    NSMutableArray<NSString *> *symbolNames = [NSMutableArray array];
    while (YES) {
        SYNode *node = OSAtomicDequeue(&symbolList, offsetof(SYNode, next));
        if (node == NULL) {
            break;
        }
        Dl_info info;
        dladdr(node->pc, &info);
        printf("%s",info.dli_sname);
        // Turn to string
        NSString *name = @(info.dli_sname);
        // Non-OC symbols (C functions, blocks) need a leading underscore
        BOOL isObjc = [name hasPrefix:@"+["] || [name hasPrefix:@"-["];
        NSString *symbolName = isObjc ? name : [@"_" stringByAppendingString:name];
        [symbolNames addObject:symbolName];
    }
    isFinding = false;
    // Traverse the array in reverse (the queue popped symbols in reverse call order)
    NSEnumerator * em = [symbolNames reverseObjectEnumerator];
    NSMutableArray * funcs = [NSMutableArray arrayWithCapacity:symbolNames.count];
    NSString * name;
    while (name = [em nextObject]) {
        // De-duplicate: only add the symbol name if it is not already in the array
        if (![funcs containsObject:name]) {
            [funcs addObject:name];
        }
    }
    // Remove the name of the current function itself
    [funcs removeObject:[NSString stringWithFormat:@"%s",__func__]];
    
    // Write to the file
    // 1. Join the symbols into a single string
    NSString * funcStr = [funcs componentsJoinedByString:@"\n"];
    NSString * filePath = [NSTemporaryDirectory() stringByAppendingPathComponent:@"hank.order"];
    NSData * file = [funcStr dataUsingEncoding:NSUTF8StringEncoding];
    [[NSFileManager defaultManager] createFileAtPath:filePath contents:file attributes:nil];
    
    NSLog(@ "% @",funcStr);
#else

#endif
}


@end
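A minimal sketch of how the collection might be triggered (an assumption, not from the original article): call it once after the first screen has finished rendering, for example from the root view controller.

- (void)viewDidAppear:(BOOL)animated {
    [super viewDidAppear:animated];
    static dispatch_once_t onceToken;
    dispatch_once(&onceToken, ^{
        // Dumps hank.order into NSTemporaryDirectory(); export it from the app sandbox
        // and feed its path to the Order File build setting.
        [LGClangTools symbolNames];
    });
}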

[Screenshots: before optimization vs. after optimization]

Startup measurement

Measurement methodology: to compare startup accurately before and after the order optimization, before each first launch of the app we open about 20 other apps on the phone and take 100 photos in a row with the camera, so that the measurement reflects a genuine cold start.

Cold start

[Screenshots: cold start after optimization vs. before optimization]

Cold start after optimization, second measurement

Summary of measurement results

After optimization: 1.644771 s total (1.615618 s + 29.09 ms)

Before optimization: 3.339297 s total (3.287467 s + 52.23 ms)

After optimization (second measurement): 1.891344 s total (1.864704 s + 26.64 ms)

Page Fault time after optimization: 646 ms

Page Fault time before optimization: 1.20 s

Page Fault time after optimization (second measurement): 592 ms

Warm start

[Screenshots: warm start after optimization vs. before optimization]

Conclusion

There is little difference between warm starts before and after optimization, roughly 0.15 s overall. For cold start, taking measurement error into account, the improvement is around 1.5 s, roughly 40%.

Reference article

juejin.cn/post/684490…

Reference books

Programmer's Self-Cultivation (Linking, Loading and Libraries)