Author: Zhang Wen, mobile development expert at Umeng+

“Crash”, “stall” and “abnormal exit” are three common conditions that affect app stability. When the crash rate exceeds 0.8% on iOS or 0.4% on Android, the number of active users declines significantly. Instability not only interrupts critical business flows, lowers retention, and damages brand reputation, it also leads directly to uninstalls and user churn. The resulting financial loss to developers should not be underestimated.

Do apps with low crash rates have higher quality? Can the stability of an app be judged directly by its crash rate?

First, to measure the quality of an app we need a unified standard: which metrics should serve as the criteria for stability? Taking the definition of stability used by Umeng+ U-APM as an example, the stability and quality of an app are generally evaluated along the following three dimensions:

Crashes, such as Java crashes and Native crashes, measured by the crash rate;

ANRs, measured by the ANR rate;

Abnormal exits, such as a kill by the low memory killer, removal from the task list, a system exception, power loss, or a shutdown or restart triggered by the user, measured by the abnormal exit rate.

A crash occurs when an exception in the program causes it to exit. Crashes include:

A Java crash occurs when an uncaught exception in Java code causes the program to exit abnormally. Examples include null pointer exceptions and array out-of-bounds exceptions.

A Native crash occurs when an error in Native code generates a signal that causes the program to exit abnormally. Examples include access to an illegal address and address alignment problems.
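The Java case can be illustrated with a minimal sketch built on the standard `Thread.setDefaultUncaughtExceptionHandler` API. The class name `JavaCrashCapture`, its `lastReport` field, and the helper `crashOnThread` are hypothetical names chosen for this illustration; they are not part of U-APM, and a real SDK would persist and upload the report rather than keep it in memory.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration; none of these names come from a real SDK.
class JavaCrashCapture {
    // Holds the most recent crash report so callers can inspect it.
    static final AtomicReference<String> lastReport = new AtomicReference<>();

    // Build a short text report from the crashing thread and the throwable.
    static String describe(Thread thread, Throwable error) {
        StringBuilder sb = new StringBuilder();
        sb.append("thread=").append(thread.getName())
          .append(" exception=").append(error.getClass().getName())
          .append(" message=").append(error.getMessage());
        for (StackTraceElement frame : error.getStackTrace()) {
            sb.append("\n  at ").append(frame);
        }
        return sb.toString();
    }

    // Install a process-wide handler and chain to the handler that was there before.
    static void install() {
        Thread.UncaughtExceptionHandler previous = Thread.getDefaultUncaughtExceptionHandler();
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            lastReport.set(describe(thread, error));
            if (previous != null) {
                previous.uncaughtException(thread, error);
            }
        });
    }

    // Run body on a new thread, wait for it to die, and return the captured report.
    static String crashOnThread(String name, Runnable body) {
        Thread worker = new Thread(body, name);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return lastReport.get();
    }

    public static void main(String[] args) {
        install();
        String report = crashOnThread("worker", () -> {
            int[] data = new int[1];
            data[2] = 42; // array out-of-bounds, never caught
        });
        System.out.println(report);
    }
}
```

Chaining to the previous handler matters because the platform or another library may already have installed one; replacing it silently would hide crashes from them.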

Capturing Java crashes is relatively simple, whereas capturing Native crashes requires some knowledge of the underlying system. Android is based on Linux, and most crashes are caused by coding errors or hardware errors. When the system hits an unrecoverable error, it triggers exception handling through an exception interrupt, and the handling of these interrupts is unified as signals. When an application receives a signal, it follows the kernel’s default action, which is one of Term, Ign, Core, Stop, or Cont. We can also use sigaction to register a handler for a signal and specify what to do, such as capturing crash information. Capturing has its own difficulties, especially in extreme conditions. On a stack overflow, for example, the stack space is already used up, so our signal handler cannot be called and no crash information can be captured. In that case we need to consider sigaltstack, which lets us allocate a block of memory on the heap as an “alternate signal stack” on which the signal handler runs and processes the crash information.

In addition to stable and safe capture, we also need to enrich the context of the crash site, such as Logcat output, the call stack, device information, and environment information, to give a comprehensive reference for locating and solving the problem later.

For crashes, we use the crash rate as the metric. It includes:

UV crash rate: the number of unique (deduplicated) users who experienced a crash / the total number of unique users;

PV crash rate: the number of crashes / the number of app launches;

Startup crash rate: the rate of crashes that happen while the app is starting. This metric is easy to overlook but very important. Startup is a critical stage in the app life cycle: ads, splash screens, and campaign content are all shown during it, and many initializations have to run at the same time. If an error occurs during startup, hot fixes and disaster-recovery degradation usually cannot remedy it.
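The three rates above can be sketched as simple ratios. The following is a minimal illustration, not U-APM's implementation: the `CrashRates` class and its `Launch` type are hypothetical, each launch is assumed to end in at most one crash, and the startup crash rate is assumed to be startup crashes divided by launches, since the text does not give its exact formula.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; these names are not part of U-APM.
class CrashRates {
    // One app launch. Assumption: a launch ends in at most one crash.
    static final class Launch {
        final String userId;
        final boolean crashed;
        final boolean crashedDuringStartup;

        Launch(String userId, boolean crashed, boolean crashedDuringStartup) {
            this.userId = userId;
            this.crashed = crashed;
            this.crashedDuringStartup = crashedDuringStartup;
        }
    }

    // UV crash rate: unique users with at least one crash / unique users.
    static double uvCrashRate(List<Launch> launches) {
        Set<String> allUsers = new HashSet<>();
        Set<String> crashedUsers = new HashSet<>();
        for (Launch launch : launches) {
            allUsers.add(launch.userId);
            if (launch.crashed) {
                crashedUsers.add(launch.userId);
            }
        }
        return allUsers.isEmpty() ? 0.0 : (double) crashedUsers.size() / allUsers.size();
    }

    // PV crash rate: crashes / launches.
    static double pvCrashRate(List<Launch> launches) {
        long crashes = launches.stream().filter(l -> l.crashed).count();
        return launches.isEmpty() ? 0.0 : (double) crashes / launches.size();
    }

    // Startup crash rate. Assumption: startup crashes / launches.
    static double startupCrashRate(List<Launch> launches) {
        long crashes = launches.stream()
                .filter(l -> l.crashed && l.crashedDuringStartup)
                .count();
        return launches.isEmpty() ? 0.0 : (double) crashes / launches.size();
    }

    public static void main(String[] args) {
        List<Launch> launches = Arrays.asList(
                new Launch("u1", true, true),    // crashed while starting
                new Launch("u1", false, false),
                new Launch("u2", false, false),
                new Launch("u3", true, false));  // crashed after startup
        System.out.println("UV crash rate:      " + uvCrashRate(launches));
        System.out.println("PV crash rate:      " + pvCrashRate(launches));
        System.out.println("Startup crash rate: " + startupCrashRate(launches));
    }
}
```

With this sample data, two of three unique users crashed while only two of four launches did, which shows why the UV and PV rates can diverge for the same app.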

ANR stands for Application Not Responding. When the application fails to respond in time for a certain period, the ANR dialog box pops up and lets the user choose between continuing to wait and forcing the app to close. From the user’s point of view an ANR can sometimes be worse than a crash, so developers should take ANRs as seriously as crashes.

ANR capture accuracy has improved step by step. In the early days, we used FileObserver to watch /data/anr/traces.txt for changes and then captured and reported them. Unfortunately, as Android versions were upgraded, the system and device manufacturers began to restrict access to system files and permissions, so this approach covered fewer and fewer devices and ANR capture became less accurate.

We then moved to capturing ANRs by monitoring how long the message queue takes to run: we post an empty message to the main thread’s Looper and check whether it has been executed after 5 seconds. However, this approach cannot truly capture ANRs (it produces both misses and false positives) and cannot obtain the complete ANR content. Later, following the way Android itself implements ANR, we built a real-time and accurate capture scheme that is compatible with all system versions. When the system_server process detects an ANR in an app, it sends SIGQUIT (signal 3) to the affected process. By default, libart.so receives this signal and calls the Java virtual machine’s dump method to generate traces.
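The earlier message-queue check (not the SIGQUIT-based scheme) can be sketched in plain Java. This is an illustration under stated assumptions: a single-thread `ExecutorService` stands in for the main thread's Looper, the class name `MainThreadWatchdog` and its methods are hypothetical, and the timeout is shortened from 5 seconds so the demo finishes quickly. On Android the empty message would be posted to the main Looper through a `Handler` instead.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; a single-thread executor stands in for the main Looper.
class MainThreadWatchdog {
    // Post a no-op task and report whether it ran within timeoutMillis.
    static boolean isResponsive(ExecutorService mainQueue, long timeoutMillis) {
        CountDownLatch ran = new CountDownLatch(1);
        mainQueue.execute(ran::countDown);
        try {
            return ran.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // The watchdog itself was interrupted; treat the check as failed.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Simulates a long-running task that blocks the queue.
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ExecutorService mainQueue = Executors.newSingleThreadExecutor();
        System.out.println("idle queue responsive: " + isResponsive(mainQueue, 200));

        mainQueue.execute(() -> sleepQuietly(600)); // block the queue
        System.out.println("blocked queue responsive: " + isResponsive(mainQueue, 200));

        mainQueue.shutdownNow();
    }
}
```

The sketch also hints at why this approach produces misses and false positives: it only observes that one message was late, not that the system actually declared an ANR, and it records nothing about what the thread was doing.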

We can intercept SIGQUIT so that we receive the ANR signal first and generate our own traces and ANR logs. After handling the signal, we pass it on to the system so that the system can generate its traces as well. Generating the traces quickly also matters, so that system_server does not resort to SIGKILL (signal 9) before they are complete. We have also enriched the captured content, which now includes the cause that triggered the ANR, the CPU usage of the top processes on the phone, the CPU usage of the top threads in the ANR process, the distribution of processing time across CPU cores, disk I/O wait time, and other important information. This provides strong support for analyzing, locating, and solving ANR problems.

Likewise, for ANRs we distinguish a UV ANR rate and a PV ANR rate, calculated in the same way as the crash rates above.

Besides crashes and ANRs, abnormal exits are easy to overlook, yet they are often the only way to find problems that are not otherwise caught, such as kills by the low memory killer and system restarts. For example, the number of unexpected app exits rises when there are compatibility problems, when the device restarts, or when a third-party library actively calls the exit function. The abnormal exit rate therefore helps us understand and measure the stability of an application more comprehensively.

To sum up, I think everyone now has the answer to the question raised at the beginning of the article: the crash rate alone is not enough to judge stability. Nor should we hide code quality problems by wrapping code in manual try-catch blocks just to avoid crashes, since that may still interrupt normal use and leave users feeling blocked. We should start from what users actually experience when using the app, and capture and handle problems promptly.

Improving app stability is a long-term, iterative process, and U-APM is a good tool for raising efficiency and reducing cost along the way. It provides collection, parsing, aggregation, and analysis capabilities. In the next installment, we will explain how to use U-APM to handle and resolve problems such as crashes and ANRs.