xCrash details and source code analysis

I. Foreword

As the saying goes, a craftsman who wants to do good work must first sharpen his tools. When an application misbehaves or crashes, a handy log capture tool makes diagnosis much easier. Today's subject is xCrash, the log capture tool from iQIYI, which stands out for both its quality and its functionality.

II. xCrash overview

xCrash can capture Java crashes, native crashes and ANRs, which together cover the main kinds of failure an Android app runs into. According to the official documentation, the main advantages of the library are as follows:

Supports Android 4.0 - 10 (API level 14 - 29).
Supports armeabi, armeabi-v7a, arm64-v8a, x86 and x86_64.
Captures Java crashes, native crashes and ANRs.
Collects detailed statistics on the process, threads, memory, FDs and network.
Lets you choose which threads to dump with regular expressions.
Requires no root permission or any system permission.

And from a development standpoint, the architecture is pretty clear. Below is the official architecture diagram.

III. Initialization analysis

1. Initialization

The initialization code is as follows, and it is very simple.

public class MyCustomApplication extends Application {
    @Override
    protected void attachBaseContext(Context base) {
        super.attachBaseContext(base);
        xcrash.XCrash.init(this);
    }
}

I will not post more of the initialization code here and will describe it in words instead. Initialization mainly collects basic information such as the app ID and app version. Beyond that, the most important part is initializing the three handlers: JavaCrashHandler, NativeHandler and AnrHandler.

2. JavaCrashHandler initialization

JavaCrashHandler implements the UncaughtExceptionHandler interface, and its initialization is equally simple.

Thread.setDefaultUncaughtExceptionHandler(this);

This registers the handler through the interface the virtual machine provides for observing uncaught Java exceptions. The other important piece is the implementation of uncaughtException(), which is discussed later.

3. Initialize AnrHandler

AnrHandler initialization takes a few parameters and then watches the /data/anr directory for changes.

fileObserver = new FileObserver("/data/anr/", CLOSE_WRITE) {
    public void onEvent(int event, String path) {
        try {
            if (path != null) {
                String filepath = "/data/anr/" + path;
                if (filepath.contains("trace")) {
                    handleAnr(filepath);
                }
            }
        } catch (Exception e) {
            XCrash.getLogger().e(Util.TAG, "AnrHandler fileObserver onEvent failed", e);
        }
    }
};

try {
    fileObserver.startWatching();
} catch (Exception e) {
    fileObserver = null;
    XCrash.getLogger().e(Util.TAG, "AnrHandler fileObserver startWatching failed", e);
}

Of course, we all know that on newer versions of Android, apps can no longer access /data/anr. Does xCrash provide an alternative? It does: it captures SIGQUIT, the signal that ActivityManagerService sends to an app when an ANR occurs. More on that later.

4. Initialization of NativeHandler

The initialization of the NativeHandler is more complicated, and it is divided into a Java layer and a Native layer.

4.1 Java layer

The Java layer is relatively simple, mainly loading libxcrash.so and further calling nativeInit() to initialize the native layer.

System.loadLibrary("xcrash");

4.2 Native layer

nativeInit() maps to the JNI function xc_jni_init(), which performs its initialization in three smaller steps.

xc_common_init

This step initializes common parameters such as the OS and kernel version, app_version, app ID and log directory. The most important part is that two file descriptors are reserved up front, in case the process has run out of FDs by the time a log file needs to be opened.

    //create prepared FD for FD exhausted case
    xc_common_open_prepared_fd(1);
    xc_common_open_prepared_fd(0);

These two FDs are stored in xc_common_crash_prepared_fd and xc_common_trace_prepared_fd respectively. Note that, for now, both of them simply open "/dev/null".
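
The idea behind reserving a descriptor can be sketched as follows. This is a minimal illustration of the general technique, not xCrash's actual code: the names reserved_fd, reserve_fd() and open_log_with_reserve() are hypothetical, and the assumption is that the reserved slot is released only when opening the log file fails.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical name: the descriptor parked at startup, or -1 if none. */
static int reserved_fd = -1;

/* Call once at startup: occupy a descriptor slot by opening /dev/null. */
int reserve_fd(void)
{
    reserved_fd = open("/dev/null", O_RDWR);
    return reserved_fd >= 0 ? 0 : -1;
}

/* Call at crash time: open the log file, giving up the reserved slot if
 * the first attempt fails (for example because the FD table is full). */
int open_log_with_reserve(const char *path)
{
    int flags = O_WRONLY | O_CREAT | O_TRUNC;
    int fd = open(path, flags, 0644);
    if (fd < 0 && reserved_fd >= 0) {
        /* Free the parked slot so the retry has a descriptor to use. */
        close(reserved_fd);
        reserved_fd = -1;
        fd = open(path, flags, 0644);
    }
    return fd;
}
```

Calling reserve_fd() early and open_log_with_reserve() from the crash path means the log can still be opened when the process has leaked descriptors, which is a common cause of crashes in the first place.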

xc_crash_init

xcc_unwind_init() initializes the unwinder.

if api_level >= 16 && api_level <= 20: load libcorkscrew.so
if api_level >= 21 && api_level <= 23: load libunwind.so

xc_crash_init_callback() initializes the JNI callback. It starts a native thread that blocks on an eventfd until a native crash occurs and then notifies the Java layer.

Next comes the more important part, signal registration through xcc_signal_crash_register().

int xcc_signal_crash_register(void (*handler)(int, siginfo_t *, void *))
{
    stack_t ss;
    ......
    if(0 != sigaltstack(&ss, NULL)) return XCC_ERRNO_SYS;
    ......
    for(i = 0; i < sizeof(xcc_signal_crash_info) / sizeof(xcc_signal_crash_info[0]); i++)
        if(0 != sigaction(xcc_signal_crash_info[i].signum, &act, &(xcc_signal_crash_info[i].oldact)))
            return XCC_ERRNO_SYS;

    return 0;
}

The key calls here are sigaltstack() and sigaction(). sigaltstack() sets up an alternate stack for signal handlers, in effect an emergency stack. The reason is that, normally, when a signal handler is invoked the kernel builds its stack frame on the stack of the interrupted thread. That fails if the stack has already grown to its resource limit (RLIMIT_STACK, visible with the ulimit command and usually 8 MB) or, when RLIMIT_STACK is unlimited, has grown until it reached the boundary of mapped memory: there is no room left for the handler's frame. A stack overflow is exactly such a case, so without an alternate stack the handler for the resulting SIGSEGV could not run. The function then installs the handler for each signal through sigaction(); here we only need to look at which signals it installs.

{.signum = SIGABRT},   // abort
{.signum = SIGBUS},    // bus error (bad memory access, e.g. a misaligned address)
{.signum = SIGFPE},    // arithmetic / floating-point exception
{.signum = SIGILL},    // illegal instruction
{.signum = SIGSEGV},   // invalid memory reference
{.signum = SIGTRAP},   // breakpoint or trap instruction
{.signum = SIGSYS},    // bad system call
{.signum = SIGSTKFLT}  // coprocessor stack fault

The signal handler is xc_crash_signal_handler(), which we will come back to later. There is also another file descriptor, xc_crash_prepared_fd; how it differs from and relates to the previous two is not clear to me.

xc_trace_init

Trace handling is only enabled on Android 5.0 and above, because it is mainly used to obtain the ANR trace. xc_trace_init_callback() just looks up the Java method ID; the main work happens in xcc_signal_trace_register().

int xcc_signal_trace_register(void (*handler)(int, siginfo_t *, void *))
{
    ......
    //un-block the SIGQUIT mask for current thread, hope this is the main thread
    sigemptyset(&set);
    sigaddset(&set, SIGQUIT);
    if(0 != (r = pthread_sigmask(SIG_UNBLOCK, &set, &xcc_signal_trace_oldset))) return r;
    //register new signal handler for SIGQUIT
    ......
    if(0 != sigaction(SIGQUIT, &act, &xcc_signal_trace_oldact))
    {
        pthread_sigmask(SIG_SETMASK, &xcc_signal_trace_oldset, NULL);
        return XCC_ERRNO_SYS;
    }
    ......
}

The handler for SIGQUIT is xc_trace_handler(), which is also examined later. At the end, initialization also starts a thread whose entry function is xc_trace_dumper() and which waits for an ANR to occur. The waiting mechanism here uses eventfd as well.

5. Initialization summary

JavaCrashHandler initialization: registers its own UncaughtExceptionHandler through Thread.setDefaultUncaughtExceptionHandler().
AnrHandler initialization: watches the /data/anr directory for changes. On Android 5.0 and above, ANRs are instead detected by catching SIGQUIT.
NativeHandler initialization: reserves FDs, installs a set of signal handlers, and loads libcorkscrew.so or libunwind.so for unwinding and resolves the functions it needs from them.

IV. Exception handling analysis

1. Java exception handling

Java crash handling is relatively simple: wait for the callback in uncaughtException() and collect the relevant information. Since it is straightforward, I will not analyze it in detail here; interested readers can read the source. It is also worth looking at the Util class, which reads system files and contains a lot worth learning from, such as how to get meminfo and how to list the FDs the process has open.

2. ANR exception handling

2.1 Processing in the Java layer

AnrHandler#handleAnr() is also simple: in the Java layer it parses the trace file under /data/anr/ to see whether it contains information about the app's own process. If you are interested, you can analyze it yourself.

2.2 Processing in the native layer

For ANR handling in the native layer, the official project provides an architecture diagram of the implementation. Let's look at how it works in detail.

From the native initialization we know that xCrash catches SIGQUIT to detect an ANR, and that the signal is handled in xc_trace_handler().

XCC_UTIL_TEMP_FAILURE_RETRY(write(xc_trace_notifier, &data, sizeof(data)));

All the handler really does is send a notification through the eventfd. The notification is picked up by xc_trace_dumper().
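
The eventfd handoff between a signal handler and a waiting thread can be sketched as follows. This is a minimal sketch under assumptions, not the xCrash source: notifier_create(), notifier_post() and notifier_wait() are hypothetical names. It shows only the mechanism: the handler side adds to the eventfd counter with write(), and the waiting side blocks in read() until the counter is non-zero.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create the notifier; returns the eventfd descriptor, or -1 on error. */
int notifier_create(void)
{
    return eventfd(0, EFD_CLOEXEC);
}

/* Signal-handler side: add 1 to the counter. write() is async-signal-safe;
 * errno is saved and restored so the interrupted code does not see it change. */
int notifier_post(int fd)
{
    uint64_t one = 1;
    int saved_errno = errno;
    ssize_t n;
    do {
        n = write(fd, &one, sizeof(one));
    } while (n < 0 && errno == EINTR);
    errno = saved_errno;
    return n == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Waiting side: block until the counter is non-zero, then fetch it.
 * Reading resets the counter to zero. */
int notifier_wait(int fd, uint64_t *count)
{
    ssize_t n;
    do {
        n = read(fd, count, sizeof(*count));
    } while (n < 0 && errno == EINTR);
    return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

Keeping the handler down to a single write() and doing the real work in an ordinary thread avoids running anything that is not async-signal-safe inside the signal handler.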

xc_trace_dumper() first opens the log file with xc_common_open_trace_log() and writes the header with xc_trace_write_header(). Our focus is on how it dumps the ART trace.

xc_trace_load_symbols() loads the required symbols. xc_dl_create() and xc_dl_sym() are the two important functions here: xc_dl_create() finds the virtual address at which a shared object was mapped by mmap, and xc_dl_sym() computes the virtual address of a given symbol (function) in that shared object. The code looks up _ZNSt3__14cerrE in libc++.so, which is std::cerr, and, in libart.so, _ZN3art7Runtime9instance_E (art::Runtime::instance_) and _ZN3art7Runtime14DumpForSigQuitERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE (art::Runtime::DumpForSigQuit(std::ostream&)), resolving each to its address in the process's virtual address space. On Android L, _ZN3art3Dbg9SuspendVMEv and _ZN3art3Dbg8ResumeVMEv are needed as well.

Inside xc_dl_create(), xc_dl_find_map_start() obtains the base address of the shared object, xc_dl_file_open() maps the .so file with mmap(), and xc_dl_parse_elf() parses it. Readers are probably familiar with the ELF file format, so I will not analyze it here.
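
Finding where a shared object is mapped usually comes down to scanning /proc/self/maps. The following is a sketch of that general technique, not a copy of xc_dl_find_map_start(): parse_maps_line() and find_map_start() are hypothetical names, and matching on a path suffix at file offset 0 is an assumption made for the example. Paths containing spaces are not handled.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of /proc/<pid>/maps. Returns 1 and fills the outputs when
 * the line carries a pathname, 0 otherwise (e.g. anonymous mappings). */
int parse_maps_line(const char *line, unsigned long long *start,
                    unsigned long long *offset, char *path, size_t path_size)
{
    unsigned long long end;
    char perms[5];
    char name[256];

    /* Format: start-end perms offset dev inode pathname */
    if (sscanf(line, "%llx-%llx %4s %llx %*x:%*x %*u %255s",
               start, &end, perms, offset, name) != 5)
        return 0;
    snprintf(path, path_size, "%s", name);
    return 1;
}

/* Return the start address of the first mapping whose path ends with
 * `suffix` and whose file offset is 0, or 0 if there is none. */
unsigned long long find_map_start(const char *suffix)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    char path[256];
    unsigned long long start, offset, found = 0;
    size_t slen = strlen(suffix), plen;

    if (f == NULL) return 0;
    while (fgets(line, sizeof(line), f) != NULL) {
        if (!parse_maps_line(line, &start, &offset, path, sizeof(path))) continue;
        plen = strlen(path);
        if (offset == 0 && plen >= slen && strcmp(path + plen - slen, suffix) == 0) {
            found = start;
            break;
        }
    }
    fclose(f);
    return found;
}
```

With the base address in hand, a symbol's runtime address is obtained by adding the offset recorded for it in the ELF symbol table, adjusted for the load bias of the object; that ELF parsing step is left out of this sketch.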

xc_trace_libart_runtime_dump starts the dump

The relevant code is as follows:

        if(xc_trace_is_lollipop)
            xc_trace_libart_dbg_suspend();
        xc_trace_libart_runtime_dump(*xc_trace_libart_runtime_instance, xc_trace_libcpp_cerr);
        if(xc_trace_is_lollipop)
            xc_trace_libart_dbg_resume();

xc_trace_libart_runtime_dump is the resolved _ZN3art7Runtime14DumpForSigQuitERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE, that is, art::Runtime::DumpForSigQuit(std::ostream&). Calling it makes the runtime write its SIGQUIT dump to cerr. One detail: before the dump, the code redirects standard error to its own FD with dup2(). This happens just above the code shown earlier, as follows.

        if(dup2(fd, STDERR_FILENO) < 0)
        {
            if(0 != xcc_util_write_str(fd, "Failed to duplicate FD.\n")) goto end;
            goto skip;
        }

After that comes the rest of the log, such as the logcat output, open FDs and network information; have a look if you are interested. At this point, capturing the trace is complete.
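
The stderr redirection used in the dump step relies on dup2(), and can be sketched in isolation as follows. The helper names redirect_stderr() and restore_stderr() are hypothetical, and saving the old descriptor so it can be restored afterwards is an addition made for this example; the excerpt above does not show what xCrash does with stderr after the dump.

```c
#include <unistd.h>

/* Point STDERR_FILENO at target_fd. Returns a duplicate of the previous
 * stderr (to be passed to restore_stderr), or -1 on error. */
int redirect_stderr(int target_fd)
{
    int saved = dup(STDERR_FILENO);
    if (saved < 0) return -1;
    if (dup2(target_fd, STDERR_FILENO) < 0) {
        close(saved);
        return -1;
    }
    return saved;
}

/* Put the previous stderr back and release the saved duplicate. */
int restore_stderr(int saved_fd)
{
    int r = dup2(saved_fd, STDERR_FILENO);
    close(saved_fd);
    return r < 0 ? -1 : 0;
}
```

After redirect_stderr(fd), anything that writes to file descriptor 2, including code in other libraries that prints to cerr, ends up in fd. That is why the technique works for capturing output from a function the caller cannot modify.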

3. Native exception handling

For native crash handling, the official architecture diagram is as follows, and the flow is very clear.

As we saw during initialization, when a native crash occurs the signal handler xc_crash_signal_handler() takes over. So let's start with this function.

static void xc_crash_signal_handler(int sig, siginfo_t *si, void *uc)
{
    ......
    pid_t dumper_pid = xc_crash_fork(xc_crash_exec_dumper);
    ......
    int wait_r = XCC_UTIL_TEMP_FAILURE_RETRY(waitpid(dumper_pid, &status, __WALL));
}

Apart from basic operations such as opening the log file FD, the main thing this function does is create a child process through xc_crash_fork() and wait for it to exit.

The entry function of the child process is xc_crash_exec_dumper(). It first writes a series of parameters, such as the process PID and the crashing thread's TID, into a pipe connected to standard input, so that the dumper can read them from stdin. It then calls execl() to enter the real dumper program.

static int xc_crash_exec_dumper(void *arg)
{
  ......
  execl(xc_crash_dumper_pathname, XCC_UTIL_XCRASH_DUMPER_FILENAME, NULL);
}

This essentially runs libxcrash_dumper.so through execl(). Note that execl() itself does not create a new process: it replaces the program image of the child that was just forked. Despite its .so suffix, libxcrash_dumper.so has a main() entry point, located in xcd_core.c. For many readers this is probably the first time they see the familiar C main() function on Android.
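
The fork, pipe-to-stdin and exec sequence described above can be sketched as follows. This is an illustrative sketch, not the xCrash code: run_dumper() and exec_with_stdin() are hypothetical names, plain fork() and execv() are used in place of xc_crash_fork() and execl(), and the arguments are assumed to be small enough to fit in the pipe buffer.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs in the child: preload stdin with `args`, then replace the process
 * image with the program at `path`. Never returns. */
static void exec_with_stdin(const char *path, char *const argv[],
                            const void *args, size_t len)
{
    int p[2];
    if (pipe(p) != 0) _exit(100);
    /* The data stays in the pipe buffer until the new program reads it. */
    if (write(p[1], args, len) != (ssize_t)len) _exit(101);
    close(p[1]); /* so the reader sees EOF after the arguments */
    if (p[0] != STDIN_FILENO) {
        if (dup2(p[0], STDIN_FILENO) < 0) _exit(102);
        close(p[0]);
    }
    execv(path, argv);
    _exit(103); /* only reached if execv() failed */
}

/* Fork a child, hand it `args` on stdin, run `path`, and wait for it.
 * Returns the child's exit status, or -1 on error or abnormal exit. */
int run_dumper(const char *path, char *const argv[],
               const void *args, size_t len)
{
    int status;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) exec_with_stdin(path, argv, args, len);
    while (waitpid(pid, &status, 0) < 0)
        if (errno != EINTR) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Doing the dump in a separate program keeps the heavy work out of the crashed process, whose heap and locks may be in an inconsistent state; the child only uses async-signal-safe calls before it execs.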

The main() function is shown below. The implementation is concise and reflects the core logic of the dump flow in the diagram above.

int main(int argc, char** argv)
{
    (void)argc;
    (void)argv;
    
    //don't leave a zombie process
    alarm(30);

    //read args from stdin
    if(0 != xcd_core_read_args()) exit(1);

    //open log file
    if(0 > (xcd_core_log_fd = XCC_UTIL_TEMP_FAILURE_RETRY(open(xcd_core_log_pathname, O_WRONLY | O_CLOEXEC)))) exit(2);

    //register signal handler for catching self-crashing
    xcc_unwind_init(xcd_core_spot.api_level);
    xcc_signal_crash_register(xcd_core_signal_handler);

    //create process object
    if(0 != xcd_process_create(&xcd_core_proc,
                               xcd_core_spot.crash_pid,
                               xcd_core_spot.crash_tid,
                               &(xcd_core_spot.siginfo),
                               &(xcd_core_spot.ucontext))) exit(3);

    //suspend all threads in the process
    xcd_process_suspend_threads(xcd_core_proc);

    //load process info
    if(0 != xcd_process_load_info(xcd_core_proc)) exit(4);

    //record system info
    if(0 != xcd_sys_record(xcd_core_log_fd,
                           xcd_core_spot.time_zone,
                           xcd_core_spot.start_time,
                           xcd_core_spot.crash_time,
                           xcd_core_app_id,
                           xcd_core_app_version,
                           xcd_core_spot.api_level,
                           xcd_core_os_version,
                           xcd_core_kernel_version,
                           xcd_core_abi_list,
                           xcd_core_manufacturer,
                           xcd_core_brand,
                           xcd_core_model,
                           xcd_core_build_fingerprint)) exit(5);

    //record process info
    if(0 != xcd_process_record(xcd_core_proc,
                               xcd_core_log_fd,
                               xcd_core_spot.logcat_system_lines,
                               xcd_core_spot.logcat_events_lines,
                               xcd_core_spot.logcat_main_lines,
                               xcd_core_spot.dump_elf_hash,
                               xcd_core_spot.dump_map,
                               xcd_core_spot.dump_fds,
                               xcd_core_spot.dump_network_info,
                               xcd_core_spot.dump_all_threads,
                               xcd_core_spot.dump_all_threads_count_max,
                               xcd_core_dump_all_threads_whitelist,
                               xcd_core_spot.api_level)) exit(6);

    //resume all threads in the process
    xcd_process_resume_threads(xcd_core_proc);

#if XCD_CORE_DEBUG
    XCD_LOG_DEBUG("CORE: done");
#endif
    return 0;
}

I will not walk through each step, but the most important point is that the dumper obtains each thread's registers, backtrace and other information through ptrace. ptrace and ELF are both fairly complex, so I will not go into them here.

V. Summary

The xCrash code reads cleanly and its layering is very clear, which speaks to the author's strong skills. Because of my own limited level, the analysis may not go deep enough in places. If you find any mistakes, please point them out. Thank you.