Author: Xianyu Technology – Weisong

Background

During rapid business iteration, Xianyu's stability has come under pressure, and ANR (Application Not Responding) problems are particularly prominent. On our user feedback platform we occasionally see reports that the Xianyu app freezes or stutters. When an ANR occurs, the system either shows a dialog prompting the user to close the app or kills the application process outright, which badly hurts the user experience and can even drive users away.

The difficulty with ANR is that it is extremely hard to reproduce offline; routine testing produces almost no ANR reports. Online, however, the combination of fragmented Android device models, varying system state, and different user habits means ANRs do occur. We therefore have to rely on monitoring and troubleshooting tools to solve them. This article describes Xianyu's approach to ANR governance in three parts: ANR monitoring, the troubleshooting system, and optimization cases.

Causes of ANR

To solve ANR problems, we first need to understand what causes them. The Android system applies timeouts to how quickly an application process responds for its components (Activity, Service, Receiver, Provider) and for input events. If the application process does not finish the task within the allotted time, an ANR is triggered. The causes fall into two categories:

1. The main thread is too busy to process critical messages in time: a time-consuming message is running, the message queue is congested so critical messages cannot be scheduled, or a deadlock has occurred.
2. The system is busy and the main thread cannot get scheduled: other threads or resources in the system or the application are overloaded (for example, high I/O or frequent memory churn), so the main thread is preempted.

Monitoring schemes

1. Listen for changes in the ANR directory

Use FileObserver to listen for changes to the /data/anr/traces.txt file and capture the scene for reporting. However, file permissions were tightened from Android 6.0 onward, and applications can no longer read this file. We used this scheme in the past, and it caused a large number of ANRs to go unreported on devices running newer Android versions.

2. Main thread timeout monitoring

Start a child thread that periodically posts a message to the main thread and, after an interval (for example, 5 seconds), checks whether the message has been consumed. If it has not, the main thread is stuck and an ANR may have occurred; the error information of the current process is then obtained from the system service to determine whether an ANR really happened. However, this scheme misses many ANRs, and polling performs poorly.
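The probe step of this scheme can be sketched in plain Java. This is only an illustration under stated assumptions: a generic Executor stands in for a Handler bound to the main Looper, and MainThreadProbe and isResponsive are hypothetical names, not part of any SDK.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

/** Posts a marker task to a target thread and checks whether it ran in time. */
final class MainThreadProbe {
    // Assumption: stands in for a Handler bound to the main Looper.
    private final Executor target;

    MainThreadProbe(Executor target) {
        this.target = target;
    }

    /** Returns true if the marker task was consumed within timeoutMs. */
    boolean isResponsive(long timeoutMs) {
        CountDownLatch consumed = new CountDownLatch(1);
        target.execute(consumed::countDown);
        try {
            return consumed.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return true; // interrupted while waiting: do not report a stall
        }
    }
}
```

On Android, a child thread would call isResponsive at a fixed interval with an executor that posts to the main Handler, and query ActivityManager.getProcessesInErrorState() only when it returns false.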

3. Listen for the SIGQUIT signal

When an ANR is triggered, the system service sends a SIGQUIT signal to the application process to make it dump traces. On the application side, we can listen for SIGQUIT to detect that an ANR has occurred. To exclude false positives caused by ANRs in other processes, we also obtain the error information of the current process from the system service and filter on it. This third scheme is accurate and has low overhead, and it is the monitoring scheme currently used by mainstream apps in the industry.
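The filtering step can be sketched as follows. ProcessError is a hypothetical, simplified stand-in for the per-process error entries returned by the system service (on Android, ActivityManager.getProcessesInErrorState()); the constant and field names are assumptions for illustration only.

```java
import java.util.List;

/** Hypothetical, simplified mirror of one per-process error entry. */
final class ProcessError {
    static final int CONDITION_NOT_RESPONDING = 2; // illustrative constant for this sketch
    final int pid;
    final int condition;
    final String message;

    ProcessError(int pid, int condition, String message) {
        this.pid = pid;
        this.condition = condition;
        this.message = message;
    }
}

final class AnrFilter {
    private AnrFilter() { }

    /** Returns this process's ANR entry, or null if the SIGQUIT was not caused by our own ANR. */
    static ProcessError findOwnAnr(List<ProcessError> errors, int myPid) {
        if (errors == null) {
            return null; // no process is currently in an error state
        }
        for (ProcessError e : errors) {
            if (e.pid == myPid && e.condition == ProcessError.CONDITION_NOT_RESPONDING) {
                return e;
            }
        }
        return null;
    }
}
```

A SIGQUIT handler would report an ANR only when findOwnAnr returns a non-null entry for the current process ID.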

Troubleshooting system

With a suitable monitoring scheme in place, we also need a solid troubleshooting system to attribute ANR problems to their causes.

1. ANR traces

After receiving the SIGQUIT signal, the Crash SDK calls the ART virtual machine's internal stack dump interface to obtain the ANR traces, which include the stacks of all threads in the ANR process. From these we can tell whether the cause is time-consuming work on the main thread, a deadlock, the main thread waiting on a lock, the main thread sleeping, and so on.

The following figure shows an ANR in the photo album scenario. The trace file shows that the main thread is waiting for a child thread.

The following figure shows an ANR in the WebView scenario. The trace file shows that the main thread is sleeping in a loop while it waits for resource initialization to finish.

2. Main thread message queue monitoring

Once the problems with clear stacks had been fixed using ANR traces, most of the remaining ANRs showed the main thread in nativePollOnce, as in the stack below. The stack consists entirely of system message queue code with no business code, which makes it hard to analyze. There are several ways the main thread can end up in nativePollOnce:

1. There is currently no message to process, so the thread sleeps and waits to be woken when a message is enqueued and written to the other end of the pipe.
2. The queue does contain pending messages, but a synchronization barrier has been set. If no asynchronous message is found after traversing the queue, the thread enters nativePollOnce and waits to be woken.
3. Dumping the traces takes too long, so the captured stack has drifted from the real cause: the time-consuming message ran before the dump.

For the second case, hooking the message queue can detect leaked synchronization barriers; small-scale online sampling with instrumentation found no such problem. For the third case, we monitor the main thread's message history before an ANR occurs and proactively report time-consuming messages. When an ANR occurs, the historical messages, the current message, and the messages still waiting in the queue are reported to the cloud through the Crash SDK.

Implementation scheme

We set a Printer on the main thread's Looper to monitor the dispatch of every message, recording its target, callback, what, and the current timestamp. A child thread also periodically samples the main thread's stack while a message is being processed, and the samples are associated with messages by timestamp, so we know what the main thread was executing during each message.

    public final class Looper {
        public static void loop() {
            ...
            for (;;) {
                ...
                final Printer logging = me.mLogging;
                if (logging != null) {
                    logging.println(">>>>> Dispatching to " + msg.target + " "
                            + msg.callback + ": " + msg.what);
                }
                ...
                try {
                    msg.target.dispatchMessage(msg);
                } finally {
                    ...
                }
                ...
                if (logging != null) {
                    logging.println("<<<<< Finished to " + msg.target + " " + msg.callback);
                }
            }
            ...
        }
    }

Because of the frequent string concatenation, this costs some performance, so it is only enabled for a small online sample.
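The timing side of this scheme can be sketched in plain Java. The class below only pairs the two log lines that Looper prints around each message and keeps the slow ones; stack sampling is left out. MessageTimeMonitor, MessageRecord, and the injected clock are hypothetical names for illustration. On Android, such a class would implement android.util.Printer and be installed with Looper.getMainLooper().setMessageLogging(...).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** One dispatched message and its wall-clock duration. */
final class MessageRecord {
    final String description; // target, callback and what, as printed by Looper
    final long startMs;
    final long wallMs;

    MessageRecord(String description, long startMs, long wallMs) {
        this.description = description;
        this.startMs = startMs;
        this.wallMs = wallMs;
    }
}

/** Pairs the "Dispatching" and "Finished" lines and records slow messages. */
final class MessageTimeMonitor {
    private static final String START = ">>>>> Dispatching to ";
    private static final String END = "<<<<< Finished to ";

    private final LongSupplier clock;   // injected so the sketch can be tested without sleeping
    private final long slowThresholdMs; // assumption: a threshold chosen by the caller
    private final List<MessageRecord> slowMessages = new ArrayList<>();
    private String current;
    private long currentStartMs;

    MessageTimeMonitor(LongSupplier clock, long slowThresholdMs) {
        this.clock = clock;
        this.slowThresholdMs = slowThresholdMs;
    }

    /** Called once before and once after each message is dispatched. */
    void println(String line) {
        if (line.startsWith(START)) {
            current = line.substring(START.length());
            currentStartMs = clock.getAsLong();
        } else if (line.startsWith(END) && current != null) {
            long wallMs = clock.getAsLong() - currentStartMs;
            if (wallMs >= slowThresholdMs) {
                slowMessages.add(new MessageRecord(current, currentStartMs, wallMs));
            }
            current = null;
        }
    }

    List<MessageRecord> slowMessages() {
        return slowMessages;
    }
}
```

Injecting the clock keeps the parsing logic separate from the time source; a real monitor would pass a monotonic clock and attach the sampled stacks to each record by timestamp.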

Results

Message queue monitoring shows, for example, a message with an execution time of 155 ms and a wall-clock time of 411 ms. The stack shows that the cause is a heavy initialization operation called on the main thread, which also makes cross-process calls. If this delays the execution of messages such as Receiver or Service callbacks, the system service triggers an ANR.

Optimization cases

With complete and accurate monitoring and troubleshooting in place, here are some of the optimizations we made.

1. SharedPreferences optimization

According to the online ANR traces, SharedPreferences (SP) causes three main types of ANR:

1. When handling certain messages, the main thread waits for the SP apply queue to finish persisting.
2. A synchronous SP commit.
3. The main thread blocks until SP finishes loading its data.

We tested the performance of MMKV and SP offline and found that MMKV solves all three problems. We measured the read and write performance of MMKV and SP on first install (1,000 iterations, each with a different key and value), and on second launch, reading a single value from each KV component.

We intercept all getSharedPreferences calls at compile time with an aspect, returning either the MMKV implementation or the system's original SharedPreferencesImpl according to a whitelist configuration, so the change is transparent to business code.
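The whitelist routing can be sketched as a small generic class. This is an illustration only: PreferencesRouter and the two factory parameters are hypothetical, and the type parameter T stands in for the SharedPreferences interface so that the sketch has no Android dependency.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/** Chooses a key-value backend per preferences file name, based on a whitelist. */
final class PreferencesRouter<T> {
    private final Set<String> mmkvWhitelist;
    private final Function<String, T> mmkvFactory;   // assumption: creates the MMKV-backed store
    private final Function<String, T> systemFactory; // assumption: calls the original getSharedPreferences

    PreferencesRouter(Set<String> mmkvWhitelist,
                      Function<String, T> mmkvFactory,
                      Function<String, T> systemFactory) {
        this.mmkvWhitelist = new HashSet<>(mmkvWhitelist); // defensive copy
        this.mmkvFactory = mmkvFactory;
        this.systemFactory = systemFactory;
    }

    /** Returns the MMKV-backed store for whitelisted names, the system store otherwise. */
    T get(String name) {
        return mmkvWhitelist.contains(name)
                ? mmkvFactory.apply(name)
                : systemFactory.apply(name);
    }
}
```

The compile-time aspect would redirect each getSharedPreferences(name, mode) call to a router like this, so files can be migrated one at a time by editing the whitelist.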

2. Reducing the cost of network broadcast listeners

The online ANR traces contain many IPC calls to getActiveNetworkInfo. Instrumentation showed that, on the one hand, the IPC itself is time-consuming, and on the other hand, there are too many broadcast receiver instances listening for network state, each of which queries the network state again. The costs add up, and an ANR occurs if they delay the scheduling of a critical message.

The fix is to dynamically proxy the IConnectivityManager interface, intercept getActiveNetworkInfo, and serve results from a cache first. A single global network broadcast listener fetches the network information over IPC on an asynchronous thread and updates the cache; other listeners then read the cache directly, avoiding repeated IPC calls.
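The cache-first idea can be sketched in plain Java. NetworkInfoCache is a hypothetical name, and the Supplier stands in for the real getActiveNetworkInfo IPC; the dynamic proxy that intercepts the call is not shown.

```java
import java.util.concurrent.Executor;
import java.util.function.Supplier;

/** Serves network info from a cache and refreshes it off the calling thread. */
final class NetworkInfoCache<T> {
    private final Supplier<T> ipcQuery; // assumption: wraps the expensive IPC query
    private volatile boolean loaded;
    private volatile T value;           // may legitimately be null (no active network)

    NetworkInfoCache(Supplier<T> ipcQuery) {
        this.ipcQuery = ipcQuery;
    }

    /** Returns the cached value; only the first call falls back to a real query. */
    T get() {
        if (!loaded) {
            value = ipcQuery.get();
            loaded = true;
        }
        return value;
    }

    /** Called by the single global listener when the network state changes. */
    void onNetworkChanged(Executor background) {
        background.execute(() -> {
            value = ipcQuery.get();
            loaded = true;
        });
    }
}
```

A separate loaded flag is used instead of a null check because the query can return null when there is no active network, and that result should be cached too.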

3. Delayed registration of components

During Application#onCreate, serially executed tasks block the main thread. Critical messages sent by the system cannot be dispatched on the main thread in the meantime, and an ANR occurs. The core idea of the fix is to avoid registering components such as Receivers and Services during startup, or to delay their registration until onCreate has completed.

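    import android.app.Application;
    import android.content.BroadcastReceiver;
    import android.content.Intent;
    import android.content.IntentFilter;
    import android.os.Handler;
    import android.os.Looper;

    public class MyApplication extends Application {
        // Posts deferred registrations to the main thread's message queue.
        private final Handler mainHandler = new Handler(Looper.getMainLooper());
        // Set once the serial initialization in onCreate() has finished.
        private volatile boolean isInitDone = false;

        @Override
        public void onCreate() {
            ...
            isInitDone = true;
        }

        @Override
        public Intent registerReceiver(final BroadcastReceiver receiver, final IntentFilter filter) {
            if (isInitDone) {
                return super.registerReceiver(receiver, filter);
            }
            // Still initializing: defer the registration until onCreate() has returned.
            mainHandler.post(new Runnable() {
                @Override
                public void run() {
                    MyApplication.super.registerReceiver(receiver, filter);
                }
            });
            // Callers do not receive a sticky Intent when registration is deferred.
            return null;
        }
    }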
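The same deferral idea can be shown without Android classes. The sketch below swaps the Handler post for an explicit pending list that is flushed once initialization completes; DeferredRegistrar and its methods are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Holds registrations back until initialization is done, then runs them in order. */
final class DeferredRegistrar {
    private final List<Runnable> pending = new ArrayList<>();
    private boolean initDone;

    /** Runs the registration now if init is done, otherwise queues it. */
    void register(Runnable registration) {
        synchronized (this) {
            if (!initDone) {
                pending.add(registration);
                return;
            }
        }
        registration.run(); // run outside the lock
    }

    /** Marks initialization as complete and flushes queued registrations. */
    void markInitDone() {
        List<Runnable> toRun;
        synchronized (this) {
            initDone = true;
            toRun = new ArrayList<>(pending);
            pending.clear();
        }
        for (Runnable r : toRun) {
            r.run();
        }
    }
}
```

Compared with posting to the main Handler, an explicit list makes the flush point visible, at the cost of having to call markInitDone at the end of startup.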

Summary and Outlook

After upgrading ANR monitoring, improving our troubleshooting capability, and shipping a series of optimizations, the ANR rate dropped by more than half, giving users a better experience. We hope this article gives developers some ideas for governing ANR in their own apps. Going forward, we will focus on two areas:

• Continue to improve ANR governance, for example by moving critical messages to asynchronous threads so that a congested main-thread queue cannot keep them from being scheduled.
• Strengthen defense mechanisms to keep the metrics from regressing, for example with automated offline stability testing that detects new issues early.