Compared with an ordinary crash, a crash during startup does far more damage. With a robust build-and-release pipeline, most problems are found and fixed before a release goes live, but there is still a small chance of a production incident. Startup crashes generally have two characteristics: they cause severe damage, and they are hard to capture.

The startup process

A lot happens between the user tapping the app icon and the app becoming responsive. Although we would like to start the crash-monitoring tool as early as possible, the integrating app usually does not start it until the launch event, and any crash that occurs before that point is a startup crash. The stages before launch in which a startup crash can occur are listed below:

Initialize may run earlier or later in this order, but always between load and launch. As shown in the figure, if we want to monitor startup crashes, monitoring has to start in the load stage to get the best coverage.

How to monitor

The simplest way is to enable crash monitoring directly in the load method, regardless of whether the integrating app wants it enabled. But doing so exposes the app to the following risks:

  • Online switches such as A/B tests lose control over the monitoring tool

  • If starting the crash monitor itself crashes, the whole application crashes

  • If the tool starts in the load phase before the classes it uses are loaded, the recursive loading this triggers can crash, and that crash cannot be monitored

Based on these risks, the scheme for enabling crash monitoring should meet the following conditions:

  • The startup process does not depend on classes, which avoids crashes caused by recursive loading

  • If the process crashes, the log data must still be safely persisted
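
The first condition can be illustrated in plain C. What follows is a minimal sketch, not the article's implementation (which starts from the load method): it assumes the GCC/Clang `constructor` attribute, which runs a function before `main()`, and every name in it is hypothetical.

```c
/* Hypothetical flag standing in for "launch monitoring is active". */
static int g_monitor_started = 0;

/* Hypothetical entry point: plain C, touches no Objective-C class. */
void start_launch_monitor(void) {
    g_monitor_started = 1;
}

/* GCC and Clang run constructor functions before main(). */
__attribute__((constructor))
static void launch_monitor_bootstrap(void) {
    start_launch_monitor();
}

/* Lets later code check whether the early hook has already run. */
int launch_monitor_is_started(void) {
    return g_monitor_started;
}
```

Because nothing here sends an Objective-C message, the early start cannot trigger class loading on its own.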

Finally, the flow chart of monitoring is obtained:

Don't rely on classes

Not relying on classes means that the monitoring tool has to implement its functions with C interfaces. This is cumbersome, but the Runtime mechanism means that every method call goes through the objc_msgSend function as its entry point. If we hook this function and maintain a stack structure that records every call as it is pushed, tracing method calls becomes easy. fishhook provides the ability to hook functions:

__unused static id (*orig_objc_msgSend)(id, SEL, ...);

__attribute__((__naked__))
static void hook_Objc_msgSend() {
    /// save stack data
    /// push msgSend
    /// resume stack data
    /// call origin msgSend
    /// save stack data
    /// pop msgSend
    /// resume stack data
}

void observe_Objc_msgSend() {
    struct rebinding msgSend_rebinding = { "objc_msgSend", hook_Objc_msgSend, (void *)&orig_objc_msgSend };
    rebind_symbols((struct rebinding[1]){msgSend_rebinding}, 1);
}
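
The push, call original, pop order of the hook can be tried out without fishhook or assembly. The following is a simplified, portable analogue rather than fishhook itself: an ordinary function pointer stands in for the symbol slot that fishhook rewrites, and all names (`send_slot`, `rebind_send`, `hooked_send`) are hypothetical.

```c
#include <stddef.h>

typedef int (*send_fn)(int);

/* Hypothetical stand-in for the function being hooked. */
int real_send(int x) { return x * 2; }

/* The slot callers go through; fishhook rewrites the real one in the binary. */
send_fn send_slot = real_send;
send_fn orig_send = NULL;

int depth = 0;      /* current nesting of traced calls */
int max_depth = 0;  /* deepest nesting seen so far */

/* Replacement: record the call, forward to the original, unrecord. */
int hooked_send(int x) {
    depth++;                              /* push */
    if (depth > max_depth) max_depth = depth;
    int result = orig_send(x);            /* call the original */
    depth--;                              /* pop */
    return result;
}

/* Same shape as a rebinding: install the replacement, hand back the original. */
void rebind_send(send_fn replacement, send_fn *original) {
    *original = send_slot;
    send_slot = replacement;
}

/* Callers always dispatch through the slot. */
int call_send(int x) { return send_slot(x); }
```

After `rebind_send(hooked_send, &orig_send)`, callers get the same result as before, but every call passes through the recording wrapper.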

Implement msgSend

The __naked__ attribute tells the compiler not to generate a function prologue or epilogue, so the function gets no stack frame of its own for parameter information and the return address stays in the LR register. Since objc_msgSend itself works this way, register data must be saved and restored around the calls that record pushes and pops. objc_msgSend uses registers x0 to x9 to hold parameter information, and you can save and restore them manually through the sp register:

#define save() \
__asm volatile ( \
"stp x8, x9, [sp, #-16]!\n" \
"stp x6, x7, [sp, #-16]!\n" \
"stp x4, x5, [sp, #-16]!\n" \
"stp x2, x3, [sp, #-16]!\n" \
"stp x0, x1, [sp, #-16]!\n");

#define resume() \
__asm volatile ( \
"ldp x0, x1, [sp], #16\n" \
"ldp x2, x3, [sp], #16\n" \
"ldp x4, x5, [sp], #16\n" \
"ldp x6, x7, [sp], #16\n" \
"ldp x8, x9, [sp], #16\n");

#define call(b, value) \
__asm volatile ("stp x8, x9, [sp, #-16]!\n"); \
__asm volatile ("mov x12, %0\n" :: "r"(value)); \
__asm volatile ("ldp x8, x9, [sp], #16\n"); \
__asm volatile (#b " x12\n");

// The objc_msgSend replacement must be a naked function
__attribute__((__naked__))
static void hook_Objc_msgSend() {
    save()
    __asm volatile ("mov x2, lr\n");
    __asm volatile ("mov x3, x4\n");
    call(blr, &push_msgSend)
    resume()
    call(blr, orig_objc_msgSend)
    save()
    call(blr, &pop_msgSend)
    __asm volatile ("mov lr, x0\n");
    resume()
    __asm volatile ("ret\n");
}

Logging

Ordinary file I/O cannot guarantee that data survives a crash, so mmap is the best fit for this scenario. With mmap, data already written to the mapped region still reaches the file even if the application crashes unrecoverably. In addition, we only need to record the class and selector of each call on the stack, which takes very little memory as long as there is no deep recursion:

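// MB is assumed to be defined elsewhere as (1024 * 1024)
time_t ts = time(NULL);
const char *filePath = [NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES).lastObject stringByAppendingPathComponent: [NSString stringWithFormat: @"%ld", (long)ts]].UTF8String;

unsigned char *buffer = NULL;
// O_CREAT is needed because the log file does not exist yet
int fileDescriptor = open(filePath, O_RDWR | O_CREAT, 0644);
// The file must be at least as large as the mapping, otherwise writes to the buffer fault
ftruncate(fileDescriptor, MB * 4);
buffer = (unsigned char *)mmap(NULL, MB * 4, PROT_READ|PROT_WRITE, MAP_FILE|MAP_SHARED, fileDescriptor, 0);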
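
The key property of a shared mapping, that a plain memory write ends up in the file without any explicit write call, can be checked with a small C round trip. This is a minimal sketch under stated assumptions: the function name `mmap_log_roundtrip`, the temporary path passed by the caller, and the 4 KB `LOG_SIZE` are all hypothetical, chosen only to keep the demo small.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define LOG_SIZE 4096  /* hypothetical demo size */

/* Writes `text` through a shared mapping, then reads the file back with read().
   Returns 0 when the file content matches, -1 on any failure. */
int mmap_log_roundtrip(const char *path, const char *text) {
    size_t len = strlen(text);
    if (len >= LOG_SIZE) return -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    /* Grow the file to the mapping size before mapping it. */
    if (ftruncate(fd, LOG_SIZE) != 0) { close(fd); return -1; }

    unsigned char *buf = mmap(NULL, LOG_SIZE, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { close(fd); return -1; }

    memcpy(buf, text, len);      /* plain memory write, no write() call */
    munmap(buf, LOG_SIZE);

    /* Read the bytes back through the file descriptor to confirm them. */
    char check[LOG_SIZE];
    lseek(fd, 0, SEEK_SET);
    ssize_t n = read(fd, check, len);
    close(fd);
    unlink(path);                /* remove the demo file */
    return (n == (ssize_t)len && memcmp(check, text, len) == 0) ? 0 : -1;
}
```

For example, `mmap_log_roundtrip("/tmp/launch_log_demo.bin", "@Foo.bar")` should return 0 on a system where `/tmp` is writable.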

buffer is the region we write data into. To keep the recorded call stack accurate, the data in the buffer has to be updated every time a call is pushed or popped. One possible approach is to prefix each call record with the @ symbol, always keep the index of the last call record, and clear all data after that index when popping:

static inline void push_msgSend(id _self, Class _cls, SEL _cmd, uintptr_t lr) {
    _lastIdx = _length;
    buffer[_lastIdx] = '@';
    ......
}

static inline void pop_msgSend(id _self, SEL _cmd, uintptr_t lr) {
    ......
    buffer[_lastIdx] = '\0';
    _length = _lastIdx;
    // Use a signed index: with size_t, idx >= 0 is always true and the loop never ends
    ssize_t idx = (ssize_t)_lastIdx - 1;
    while (idx >= 0) {
        if (buffer[idx] == '@') {
            _lastIdx = idx;
            break;
        }
        idx--;
    }
}
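
The '@'-prefixed record scheme can be exercised on its own with an ordinary array in place of the mapped buffer. This is a minimal sketch: `push_record`, `pop_record`, `current_stack` and the 256-byte capacity are hypothetical, and it assumes record names never contain '@'.

```c
#include <string.h>

#define CAP 256               /* hypothetical capacity for the demo */

static char   log_buf[CAP];   /* stands in for the mmap'ed buffer */
static size_t log_len = 0;    /* bytes currently in use */

/* Appends "@name"; returns 0 on success, -1 if the record does not fit. */
int push_record(const char *name) {
    size_t n = strlen(name);
    if (log_len + 1 + n + 1 > CAP) return -1;
    log_buf[log_len] = '@';
    memcpy(log_buf + log_len + 1, name, n);
    log_len += 1 + n;
    log_buf[log_len] = '\0';
    return 0;
}

/* Removes the last record: scan back to its '@' and clear everything after. */
void pop_record(void) {
    size_t i = log_len;
    while (i > 0) {
        i--;
        if (log_buf[i] == '@') {
            memset(log_buf + i, 0, log_len - i);
            log_len = i;
            return;
        }
    }
}

/* The buffer content is the current call stack, oldest record first. */
const char *current_stack(void) { return log_buf; }
```

Pushing `Foo.init` and then `Bar.load` leaves `@Foo.init@Bar.load` in the buffer, and one pop trims it back to `@Foo.init`. Counting down from `log_len` with an unsigned index and testing `i > 0` before the decrement avoids wrapping below zero.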

Clear the log

Because objc_msgSend is called so frequently, this monitoring scheme is not suitable to keep running for long, so monitoring has to be turned off at some point. Since starting the regular crash monitor may itself crash, listening for the becomeActive notification is the most appropriate point to turn it off: by then the launch stage, in which the crash-monitoring tool starts, has passed, so the tool itself is known to be working:

[[NSNotificationCenter defaultCenter] addObserver: self selector: @selector(closeMsgSendObserve) name: UIApplicationDidBecomeActiveNotification object: nil];

- (void)closeMsgSendObserve {
    close(fileDescriptor);
    munmap(buffer, MB * 4);
    [[NSFileManager defaultManager] removeItemAtPath: _logPath error: nil];
}

Rollback

A rollback is needed when a crash has occurred. How to handle it depends on the log content:

  • The log file is empty

    This is the most dangerous case. An empty log file means the file was created but no method call was recorded. Most likely the crash happened during fishhook processing. In that case, simply turn the monitoring scheme off, even if it turns out not to be the cause, and ship a new version quickly

  • Log files are not empty

    If the log file is not empty, the crash was captured successfully. Upload the log file to the server and stop the loss promptly. Stop-loss measures come first, so that the application can keep running. Depending on the situation, the stop-loss and rollback options include the following:

    1. If the crash occurs in a feature component that does not affect normal business flows, disable that feature through an A/B switch, provided the component is controlled by such a switch

    2. If the crashing code affects normal business flows but the faulty code is short, try to fix it dynamically by delivering a patch package from the server, taking care that the patch does not introduce other problems

    3. If neither an A/B switch nor a patch package can solve the problem and the project has a sound component-based design, H5 pages can take over through routing and forwarding so that the application keeps working

    4. If there is no means of dynamic repair and the crash does not affect normal business flows, consider stopping all plug-ins and auxiliary components

    5. If none of the dynamic repair means in options 1, 2 and 3 is available, consider offering a rollback package through a third-party jailbreak market and prompting users to download and install it

    6. If none of the dynamic repair means in options 1, 2 and 3 is available, use TestFlight to move users to the new version quickly, in batches
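
The branch on log content can be sketched as a check that runs on the next launch. This is an illustration under stated assumptions, not the article's code: `classify_launch_log` and the enum names are hypothetical, and a file whose first byte is missing or `'\0'` is treated as empty, which assumes records start at offset 0 with the '@' marker.

```c
#include <stdio.h>

typedef enum {
    LOG_MISSING,      /* no leftover file: the previous launch cleaned up */
    LOG_EMPTY,        /* file exists but holds no record */
    LOG_HAS_RECORDS   /* file holds at least one call record */
} log_state;

/* Classifies a leftover launch log by looking at its first byte.
   Treating a leading '\0' as "empty" is an assumption about the format. */
log_state classify_launch_log(const char *path) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) return LOG_MISSING;
    int first = fgetc(f);
    fclose(f);
    if (first == EOF || first == '\0') return LOG_EMPTY;
    return LOG_HAS_RECORDS;
}
```

A caller would disable the monitoring scheme on `LOG_EMPTY`, and upload the file and apply one of the stop-loss options on `LOG_HAS_RECORDS`.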