Preface

To be a good Android developer, you need a complete knowledge system. Here, let’s grow into what we want to be.

As we all know, mobile development has entered the second half of the game. To stand out from the crowd of developers, we need a deep understanding of one area. For Android developers, a few specializations are worth building a technical moat in:

  • 1. Performance optimization expert: capable of deep performance optimization and of building a systematic APM.
  • 2. Architect: rich experience in application architecture design, and familiarity with the implementation principles and architecture of the Android Framework layer and popular third-party libraries.
  • 3. Audio/video/image processing expert: mastering the NDK and going deeper into audio/video and image processing will let us shine in the years to come.
  • 4. Big front-end expert: a deep understanding of Flutter and its design principles and ideas will let us pick up front-end knowledge quickly.

Performance optimization is the most difficult of these specializations and has the highest technical barrier. To be a top performance optimization expert, you need deep and broad knowledge of many fields, including many of the skills involved in architecture, the NDK, and Flutter. Starting with this article, I’ll walk you through Android performance optimization step by step.

To fully understand the Android performance optimization body of knowledge, let’s take a look at the following diagram I have summarized:

To optimize application performance, we need to establish a systematic performance optimization scheme, which the industry calls APM (Application Performance Management). To help you quickly grasp the knowledge involved in APM, I have summarized it in a diagram, as shown below:

As we build APM and optimize for App performance, we must first address the issue of App stability. Now, let’s take a flight to explore the boundaries of Android stability optimization.

Mind mapping outline

Contents

  • One, Correct understanding
    • 1. Stability dimensions
    • 2. Points to note in stability optimization
    • 3. Indicators related to Crash
    • 4. Evaluation of Crash rate
    • 5. Key Crash issues
    • 6. Overall architecture of the APM Crash part
    • 7. Attribution of responsibility
  • Two, Crash optimization
    • 1. Single Crash solution
    • 2. Crash rate governance scheme
    • 3. Java Crash
    • 4. Java Crash processing flow
    • 5. Native Crash
    • 6. Difficult Crash solutions
    • 7. Process keep-alive
    • 8. Summary
  • Three, ANR optimization
    • 1. ANR monitoring implementation
    • 2. ANR optimization
    • 3. Some common questions about ANR
    • 4. Understanding the ANR trigger process
  • Four, Building a high availability solution for mobile business
    • 1. Importance of business high availability
    • 2. Building a business high availability solution
    • 3. Mobile disaster recovery scheme
  • Five, Long-term stability governance
    • 1. Development phase
    • 2. Test phase
    • 3. Code phase
    • 4. Release phase
    • 5. Operation and maintenance phase
  • Six, Stability optimization questions
    • 1. What stability improvements have you made?
    • 2. How do you ensure performance stability?
    • 3. How do you guarantee business stability?
    • 4. How do you stop losses quickly if abnormal conditions occur?
  • Seven,

One, Correct understanding

First of all, we must have a correct understanding of App stability: it is the most basic and most critical part of the App quality system. If our App is unstable and frequently fails to work properly, users will most likely uninstall it. Stability is therefore important, and a Crash is a P0 issue that must be resolved first. In addition, stability optimization covers a wide range: not only Crash, but also categories such as lag and power consumption.

1. Stability dimensions

Application stability can be divided into three dimensions, as shown below:

  • 1. Crash dimension: the most important indicator is the application’s Crash rate.
  • 2. Performance dimension: includes optimization directions such as startup speed, memory, and drawing. It is secondary to Crash: before building application performance systematically, we must first ensure that application functions are stable and available.
  • 3. Business high availability: a very critical step. We need to use various means to ensure the stability of the App’s main flow and core paths.

2. Points to note in stability optimization

When we optimize the stability of an application, we need to pay attention to the following three points:

1. Focus on prevention; monitoring is essential

In terms of stability, if an exception is discovered only after the App goes online, it has already caused losses. Therefore, stability optimization should focus on prevention. From the coding stage for developers, to the testing stage for testers, to the release process before launch and the operation and maintenance process after launch, every link needs to guard against abnormal conditions. If an exception does occur, we need to minimize the damage and expose as many problems as possible at the lowest cost.

Monitoring is also essential: no matter how well prevention is done, all kinds of anomalies will still appear online. So, in any case, we need comprehensive monitoring in order to detect problems more quickly.

2. Think deeper and pay attention to implicit information: for example, when solving a Crash, consider whether it points to a whole class of similar problems

When we see a Crash, we cannot simply fix that one Crash; we need to think deeper and consider whether the same type of Crash can occur elsewhere. If so, we must handle all such cases together and prevent them.

In addition, we need to pay attention to the implicit information around Crash. For example, in an interview, the interviewer asks what your application’s Crash rate is. On the surface the question is about a number, but it is really asking for implicit information: a high Crash rate suggests that the project lead’s architecture is not good enough and that there is a lot of room for optimization at every stage of the project, so the interviewer will not think well of you.

3. Long-term maintenance needs a scientific process

Building application stability is a gradual process. It is easy to optimize well in one version and then, if nobody looks after it, see things deteriorate in the next. Therefore, we must establish sound, scientific specifications for every stage of project development to preserve the optimization results over the long term.

3. Indicators related to Crash

To optimize for application stability, we need to understand some of the metrics associated with Crash.

1. UV and PV

  • PV (Page View): the number of page views.
  • UV (Unique Visitor): the number of distinct visitors; the same terminal is counted only once between 0:00 and 24:00.

2. UV, PV, startup, incremental, and stock Crash rates

  • UV Crash rate (Crash UV / DAU): based on per-user statistics, it is the percentage of users who experienced a crash within a period of time. Combined with the PV Crash rate, it is used to evaluate the scope of impact of crashes. It is important to stick to the same measurement method over time.
  • PV Crash rate: evaluates the severity of the impact of the related Crash.
  • Startup Crash rate: a Crash that occurs during the startup phase, before the user has fully opened the App. It is the most serious kind of Crash and does the most damage to users, and it cannot be rescued by a hotfix. (We will talk about that later.)
  • Incremental and stock Crash rates: incremental Crashes are newly introduced crashes, while stock Crashes are historical, long-standing bugs. Incremental Crashes are the focus of a new version, while stock Crashes are hard bones that need to be chewed on continuously. We should prioritize incremental Crashes and keep following up on stock Crashes.
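To make the formulas concrete, here is a minimal Java sketch of how these rates could be computed from raw counts. Only Crash UV / DAU comes from the text above; the class and method names are hypothetical, and the denominators chosen for the PV and startup rates (total PV and DAU) are assumptions, so adjust them to your own measurement definition.

```java
/** Hypothetical helper that turns raw daily counts into Crash rates. */
final class CrashRates {
    private CrashRates() {}

    /** UV Crash rate = users who crashed at least once / daily active users (DAU). */
    static double uvCrashRate(long crashUv, long dau) {
        return ratio(crashUv, dau);
    }

    /** PV Crash rate (assumed definition) = number of crashes / total PV. */
    static double pvCrashRate(long crashPv, long totalPv) {
        return ratio(crashPv, totalPv);
    }

    /** Startup Crash rate (assumed definition) = users who crashed during startup / DAU. */
    static double startupCrashRate(long startupCrashUv, long dau) {
        return ratio(startupCrashUv, dau);
    }

    /** Returns 0 when there is no traffic, to avoid dividing by zero. */
    private static double ratio(long numerator, long denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }
}
```

For example, 20 crashed users out of a DAU of 10,000 gives a UV Crash rate of 2 in 1,000. Whatever definitions you pick, keep them fixed so that rates stay comparable between versions.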

4. Evaluation of Crash rate

So, how low does an App’s Crash rate need to be to count as normal or excellent?

  • The combined Java and Native Crash rate must be below 2 in 1,000 (0.2%).
  • A Crash rate in the ten-thousandths range (a few in 10,000) is excellent. Note that 90% of crashes are relatively easy to solve, but solving the last 10% requires great effort.

5. Key Crash issues

Here we also need to focus on the key issues related to Crash. If an application crashes, we should reconstruct the Crash site as fully as possible. Therefore, we need to collect comprehensive information about the Crash, as shown below:

  • Stack, device, OS version, process, thread name, Logcat
  • Foreground or background state, usage duration, App version, minor version, channel
  • CPU architecture, memory information, thread count, resource pack information, and user behavior logs

After this information is collected and reported to the backend, it is aggregated and displayed in the APM backend. The information displayed is as follows:

  • Crash site information
  • Top crashing device models, OS versions, distribution versions, and regions
  • First version in which the Crash appeared, report trend, whether it is new, duration, and magnitude

Finally, based on this information, we can decide whether the Crash needs to be fixed immediately and in which version. For an example of APM aggregation display, refer to the aggregated views in the backend of the Bugly platform.

Next, let’s look at the overall architecture related to Crash.

6. Overall architecture of the APM Crash part

The overall architecture of the APM Crash part is divided, from top to bottom, into the collection layer, processing layer, display layer, and alarm layer. Now, let’s look at what each layer does in detail.

1) Collection layer

First, in the collection layer we need to obtain enough crash-related information to ensure that the problem can be located accurately. The following information needs to be collected:

  • Error stack
  • Device information
  • Operation log
  • Other information

2) Processing layer

Then, in the processing layer, we process the data collected by the App.

  • Data cleaning: filter out data that does not meet the requirements. For example, in some special cases the data collected by the App is incomplete, or an upload failure leaves the data incomplete. Such data cannot be displayed completely on the APM platform, so it must be filtered out first.
  • Data aggregation: in this layer, crash-related data is aggregated.
  • Dimension classification: for example, the Crashes of the Top device models, the Crash rate of the Top 10% of users, and other dimensions.
  • Trend comparison
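The cleaning, aggregation, and per-dimension classification steps above can be sketched in a few lines of Java. This is only an illustration under assumptions: the `Report` type, its two fields, and the rule that a report without a stack counts as incomplete are all hypothetical and not taken from any particular APM platform.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical processing-layer sketch: clean reports, then count them per dimension. */
final class CrashAggregator {
    /** Minimal crash report; the field set is an assumption for this example. */
    static final class Report {
        final String stack;        // may be null if collection or upload failed
        final String deviceModel;

        Report(String stack, String deviceModel) {
            this.stack = stack;
            this.deviceModel = deviceModel;
        }
    }

    /** Data cleaning: drop reports whose stack or device model is missing. */
    static List<Report> clean(List<Report> reports) {
        List<Report> kept = new ArrayList<>();
        for (Report r : reports) {
            if (r.stack != null && !r.stack.isEmpty() && r.deviceModel != null) {
                kept.add(r);
            }
        }
        return kept;
    }

    /** Aggregation: count reports per key, for example per stack or per device model. */
    static Map<String, Integer> countBy(List<Report> reports, Function<Report, String> key) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Report r : reports) {
            counts.merge(key.apply(r), 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling `countBy(cleaned, r -> r.stack)` groups identical stacks into one Crash, while `countBy(cleaned, r -> r.deviceModel)` gives the per-model view used for a Top models list.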

3) Display layer

After the processing layer comes the display layer, which shows the following types of information:

  • Data restoration
  • Dimension information
  • Initial version
  • Other information

4) Alarm layer

Finally comes the alarm layer. When a serious anomaly occurs, it notifies the relevant people for emergency handling. We can customize the alarm rules, for example: the overall Crash rate rises by more than 5% compared with the previous period, or compared with the same period earlier (for example, the 10th of this month versus the 10th of last month), or a single Crash suddenly surges. Alarms can be delivered by email, IM, phone call, SMS, and so on.
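As an illustration of such a rule, the following Java sketch checks the current Crash rate against two baselines. It assumes that "more than 5%" means a relative increase over the baseline; the class name, method names, and that interpretation are assumptions of this example, not a description of a specific alarm system.

```java
/** Hypothetical alarm rule: fire when the Crash rate grows too much against a baseline. */
final class CrashAlarmRule {
    private final double maxRelativeIncrease; // e.g. 0.05 for "more than 5%"

    CrashAlarmRule(double maxRelativeIncrease) {
        this.maxRelativeIncrease = maxRelativeIncrease;
    }

    /** True when currentRate exceeds baselineRate by more than the configured fraction. */
    boolean exceeds(double baselineRate, double currentRate) {
        if (baselineRate <= 0) {
            return currentRate > 0; // any crash is an increase over a zero baseline
        }
        return (currentRate - baselineRate) / baselineRate > maxRelativeIncrease;
    }

    /** Checks both baselines: the previous period and the same day of the previous month. */
    boolean shouldAlarm(double currentRate, double previousPeriodRate, double sameDayLastMonthRate) {
        return exceeds(previousPeriodRate, currentRate)
                || exceeds(sameDayLastMonthRate, currentRate);
    }
}
```

A rule for a single surging Crash could reuse `exceeds` with that Crash's own report counts as input.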

7. Attribution of responsibility

Finally, let’s look at the non-technical issues related to Crash. What we need to solve is how to keep the Crash rate low over the long term. We need to make sure the person responsible for a bug can be found quickly, so that developers can handle online bugs in time. The specific measures are as follows:

  • Set up a rotating task force: form a virtual task force whose members take turns tracking the online Crash rate of each version. This ensures that every Crash is followed up from the source.
  • Automatically match the responsible person: connect the APM platform to the bug tracking system, so that as soon as the APM backend finds an urgent bug, it is pushed to the bug tracking system, which reminds the person responsible.
  • Record the whole handling process: record every step of Crash handling to ensure that the handling of an urgent Crash is not delayed.

Two, Crash optimization

1. Single Crash solution

For a single Crash, we can follow these three steps:

1) Find the answer from the stack and on-site information

  • This solves 90% of problems
  • After fixing, consider the underlying cause of the Crash

2) Find commonalities: device model, OS, experiment switches, resource packs; consider the scope of impact

3) Reproduce offline and debug remotely

2. Crash rate governance scheme

To bring down an application’s Crash rate, we need to handle the following three types of Crash:

  • 1) Solve regular online Crashes
  • 2) For system-level Crashes, try to bypass them with hooks
  • 3) For difficult Crashes, focus on a breakthrough or switch to an alternative solution

3. Java Crash

A Java Crash occurs when an uncaught exception causes the program to exit abnormally. The hook for intercepting it is:

Thread.setDefaultUncaughtExceptionHandler();

By setting a custom UncaughtExceptionHandler, we can capture on-site information when a crash occurs. Note that this hook applies to a single process: in a multi-process App, the handler must be set in every process.
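A minimal sketch of such a handler in plain Java is shown below. The `CrashHandler` class and its `lastReport` field are hypothetical; a real handler would persist the report to disk and upload it later, which is omitted here. The sketch also chains to the previously installed handler so that the default behaviour is preserved.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical handler that records the crashing thread and its stack trace. */
final class CrashHandler implements Thread.UncaughtExceptionHandler {
    private final Thread.UncaughtExceptionHandler previous;
    private volatile String lastReport;

    private CrashHandler(Thread.UncaughtExceptionHandler previous) {
        this.previous = previous;
    }

    /** Installs the handler for the current process and remembers the old one. */
    static CrashHandler install() {
        CrashHandler handler = new CrashHandler(Thread.getDefaultUncaughtExceptionHandler());
        Thread.setDefaultUncaughtExceptionHandler(handler);
        return handler;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        StringWriter trace = new StringWriter();
        error.printStackTrace(new PrintWriter(trace, true));
        // A real implementation would write this to a file before the process dies.
        lastReport = "thread=" + thread.getName() + "\n" + trace;
        if (previous != null) {
            previous.uncaughtException(thread, error); // keep the original behaviour
        }
    }

    String lastReport() {
        return lastReport;
    }
}
```

In a multi-process App, `CrashHandler.install()` would need to run once in each process, for example early in application startup.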

Get stack information for the main thread:

Looper.getMainLooper().getThread().getStackTrace();

Get stack information for the current thread:

Thread.currentThread().getStackTrace();

Get stack information for all threads:

Thread.getAllStackTraces();

Third-party Crash monitoring tools, such as Fabric and Tencent Bugly, use string concatenation to convert the array StackTraceElement[] into strings for saving, reporting, or displaying.
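The string concatenation step could look roughly like the following Java sketch. The output layout (thread name in quotes, one "at" line per frame) and the `StackDumper` name are assumptions chosen for readability, not the format used by any specific tool.

```java
import java.util.Map;

/** Hypothetical helper that turns StackTraceElement[] arrays into report text. */
final class StackDumper {
    private StackDumper() {}

    /** Formats one thread: its name, then one indented "at ..." line per frame. */
    static String format(String threadName, StackTraceElement[] frames) {
        StringBuilder sb = new StringBuilder();
        sb.append('"').append(threadName).append("\"\n");
        for (StackTraceElement frame : frames) {
            sb.append("    at ").append(frame).append('\n');
        }
        return sb.toString();
    }

    /** Formats every live thread using Thread.getAllStackTraces(). */
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            sb.append(format(entry.getKey().getName(), entry.getValue())).append('\n');
        }
        return sb.toString();
    }
}
```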

So, how do we deobfuscate the uploaded stack information?

For this, we generally have two options, as follows:

  • 1. Each time an obfuscated APK is packaged, save the mapping file and upload it to the monitoring backend.
  • 2. Android’s native deobfuscation tool is retrace.jar. The monitoring backend uses it to parse each reported crash in real time. Parsing the mapping file and instantiating the corresponding object in retrace.jar is time-consuming, so the mapping object instances can be cached in memory. To prevent memory leaks and excessive memory usage, however, logic for periodic automatic eviction needs to be added.
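The caching idea in the second option can be sketched as a small LRU cache with idle-time eviction. Everything here is hypothetical: the `MappingCache` name, keying by app version, and the generic `M` standing in for whatever object the mapping parser produces. The retrace API itself is not used. Time is passed in explicitly so that a scheduler can call `evictIdle` periodically.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.function.Function;

/** Hypothetical cache of parsed mapping objects, keyed by app version. */
final class MappingCache<M> {
    private static final class Entry<T> {
        final T mapping;
        long lastUsedMillis;

        Entry(T mapping) {
            this.mapping = mapping;
        }
    }

    private final int maxEntries;
    private final long maxIdleMillis;
    private final Function<String, M> loader; // the expensive parse of a mapping file
    // accessOrder = true keeps the least recently used entry first.
    private final LinkedHashMap<String, Entry<M>> entries = new LinkedHashMap<>(16, 0.75f, true);

    MappingCache(int maxEntries, long maxIdleMillis, Function<String, M> loader) {
        this.maxEntries = maxEntries;
        this.maxIdleMillis = maxIdleMillis;
        this.loader = loader;
    }

    /** Returns the cached mapping, loading it on a miss and evicting the LRU entry if full. */
    synchronized M get(String appVersion, long nowMillis) {
        Entry<M> entry = entries.get(appVersion);
        if (entry == null) {
            entry = new Entry<>(loader.apply(appVersion));
            entries.put(appVersion, entry);
            if (entries.size() > maxEntries) {
                Iterator<String> eldest = entries.keySet().iterator();
                eldest.next();
                eldest.remove();
            }
        }
        entry.lastUsedMillis = nowMillis;
        return entry.mapping;
    }

    /** Periodic reclamation: drop entries that have not been used for maxIdleMillis. */
    synchronized void evictIdle(long nowMillis) {
        entries.values().removeIf(e -> nowMillis - e.lastUsedMillis > maxIdleMillis);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

Bounding both the entry count and the idle time addresses the two risks named above: unbounded growth and stale mappings for versions that no longer report crashes.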

How do we get logcat?

The logcat log path is: application layer -> liblog.so -> logd. The bottom layer uses a ring buffer to store data. There are three ways to obtain it:

1. Run the logcat command.

  • Pros: very simple, good compatibility.
  • Cons: the whole link is relatively long, controllability is poor, and the failure rate is high; in particular, it basically fails when the heap is corrupted or heap memory is insufficient.

2. Hook liblog.so.

Hook the __android_log_buf_write function in liblog.so and redirect the content to our own buffer.

  • Pros: simple, relatively good compatibility.
  • Cons: it has to stay enabled all the time.

3. Custom retrieval code. Port the underlying logcat implementation and interact with logd directly through a socket.

  • Pros: more flexible, resources are pre-allocated, and the success rate is relatively high.
  • Cons: very complex to implement.
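To illustrate the ring buffer idea mentioned above (and the kind of private buffer that redirected logs could be written to), here is a small Java sketch of the data structure only. It is not the logd implementation and does not perform any native hooking; the `LogRingBuffer` name and its line-based API are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fixed-size ring buffer that keeps only the most recent log lines. */
final class LogRingBuffer {
    private final String[] lines;
    private int next;   // index where the next line will be written
    private int count;  // number of valid lines, at most lines.length

    LogRingBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.lines = new String[capacity];
    }

    /** Appends a line, overwriting the oldest one once the buffer is full. */
    synchronized void append(String line) {
        lines[next] = line;
        next = (next + 1) % lines.length;
        if (count < lines.length) {
            count++;
        }
    }

    /** Returns the buffered lines from oldest to newest, e.g. to attach to a crash report. */
    synchronized List<String> snapshot() {
        List<String> out = new ArrayList<>(count);
        int start = (next - count + lines.length) % lines.length;
        for (int i = 0; i < count; i++) {
            out.add(lines[(start + i) % lines.length]);
        }
        return out;
    }
}
```

Because the storage is allocated up front, appending never allocates buffer memory at crash time, which is the property that makes pre-allocated buffers attractive here.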

How do we get the Java stack?

When a native crash occurs, unwinding only gives us the native stack, but we also want the Java stack of each thread at that moment. There are two approaches to this problem, as follows:

1. Thread.getAllStackTraces().

Pros:

  • Simple, good compatibility.

Cons:

  • The success rate is not high; it relies on a system interface and fails in extreme cases.
  • On Android 7.0 and later, this interface does not return the main thread stack.
  • Using the Java-layer interface requires suspending threads.

2. Hook libart.so.

Hook the relevant ThreadList and Thread functions to obtain the same stack as in an ANR log. For stability, this needs to be executed in a forked child process.

  • Pros: the information is very complete, basically the same as an ANR log, including native thread state, lock information, and so on.
  • Cons: it is a “black technology” hack with compatibility issues; when it fails, we can fall back to Thread.getAllStackTraces().

4. Java Crash processing flow

Having covered the basics of Java Crash, we can now look at how a Java Crash is processed. Here we use Gityuan’s flowchart to explain, as shown in the figure below:

1. The defaultUncaughtHandler, prepared in advance, handles the uncaught exception and outputs the basic information of the current crash.

2. The current process calls AMP.handleApplicationCrash, and the crash is passed to the system_server process through the binder IPC mechanism.

3. Next, in the system_server process, the binder server side executes AMS.handleApplicationCrash.

4. The ProcessRecord object of the target process is looked up in mProcessNames, and the process crash information is written to the /data/system/dropbox directory.

5. Run makeAppCrashingLocked:

  • Create an error receiver for the crashed application under the current user, and ignore broadcasts of the current application.
  • Stop the WMS freeze-screen messages in all activities of the current process, and perform some screen-related operations.

6. Run handleAppCrashLocked:

  • If the same process has not crashed twice within 1 minute, finish the activity running at the top of the stack.
  • If the same process crashes twice within 1 minute and is non-persistent, terminate all activities of the application and kill all processes in the same process group, then resume the first non-finishing activity at the top of the stack.
  • If the same process crashes twice within 1 minute and is persistent, only resume the first non-finishing activity at the top of the stack.

7. Send the SHOW_ERROR_MSG message through mUiHandler, and the crash dialog box is displayed.

8. At this point, the work in the system_server process is finished. Control returns to the crashed process, which starts killing itself.

9. When the crashed process is killed, system_server is notified through a binder death notification and executes appDiedLocked().

10. Finally, clean up the application’s information related to the four major components.
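The three handleAppCrashLocked branches described above boil down to a small decision table, sketched here in Java. This is a paraphrase of the prose, not the framework source: `CrashPolicy`, the `Action` names, and the parameters are all hypothetical.

```java
/** Hypothetical restatement of the repeated-crash decision described in the text. */
final class CrashPolicy {
    enum Action {
        FINISH_TOP_ACTIVITY,        // first crash within the window
        KILL_APP_THEN_RESUME_NEXT,  // repeated crash, non-persistent process
        RESUME_NEXT_ONLY            // repeated crash, persistent process
    }

    static final long CRASH_WINDOW_MS = 60_000L; // the "1 minute" window from the text

    private CrashPolicy() {}

    /**
     * @param lastCrashMs time of the previous crash of this process, or -1 if there was none
     * @param nowMs       time of the current crash
     * @param persistent  whether the process is a persistent process
     */
    static Action decide(long lastCrashMs, long nowMs, boolean persistent) {
        boolean secondCrashInWindow = lastCrashMs >= 0 && nowMs - lastCrashMs < CRASH_WINDOW_MS;
        if (!secondCrashInWindow) {
            return Action.FINISH_TOP_ACTIVITY;
        }
        return persistent ? Action.RESUME_NEXT_ONLY : Action.KILL_APP_THEN_RESUME_NEXT;
    }
}
```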

Refueling station: Binder death notification principle

We also need to understand the principle of binder’s death notification. The flowchart is as follows:

The Crash process hosts a Binder server, ApplicationThread, and during its creation the application process attaches to the system_server process by calling attachApplicationLocked(). As a result, the system_server process holds an ApplicationThreadProxy, which is the corresponding Binder client. When the process hosting the Binder server ApplicationThread (that is, the Crash process) dies, the Binder client receives the corresponding death notification and enters the binderDied flow.

5. Native Crash

Features:

  • Accessing an invalid address
  • Address alignment error
  • The program actively calls abort

All of these generate a signal, which causes the program to exit abnormally.

1. Qualified exception catching components

A qualified exception catching component needs the following capabilities:

  • Supports extension operations when a crash occurs
  • Prints logcat and other logs
  • Reports the number of crashes
  • Takes different recovery measures for different crashes
  • Can keep up with continuously evolving business needs

2. Existing solutions

1. Google Breakpad

  • Pros: authoritative, cross-platform
  • Cons: large code base

2. Logcat

  • Pros: relies on what the Android system already implements
  • Cons: a new process has to be started at crash time to filter logcat logs, which is unreliable

3. coffeecatch

  • Pros: simple implementation, easy to modify
  • Cons: compatibility issues

3. Native crash capture process

The process of Native crash capture involves three ends, and here we will learn about their corresponding processing.

1. Compile side

When compiling C/C++, the files that contain symbol information need to be kept.

2. Client

When a crash is captured, write as much useful information as you can gather to a log file, and then upload it to the server at an appropriate time.

3. Server

Read log files reported by the client, find suitable symbol files, and generate a readable C/C++ call stack.

4. Difficulty of Native crash capture

Core problem: how to ensure that the client can still generate crash logs in all kinds of extreme situations.

1. Log file creation fails because of a file handle leak?

Reserve a file handle (fd) in advance.

2. Log generation fails because of a stack overflow?

  • Use an alternate stack (signalstack) so that a stack overflow does not leave the process without space to run the signal handler. (signalstack: the system points the stack pointer here in dangerous situations, so that the signal handler can run on a new stack.)
  • In special cases the current stack needs to be replaced directly, so some space should also be reserved on the heap.

3. Log generation fails because heap memory is exhausted?

Refer to Breakpad, which re-wraps Linux Syscall Support to avoid calling libc directly to allocate heap memory.

4. Log generation fails because of heap corruption or a secondary crash?

Breakpad forks a child process, or even a grandchild process, to collect the crash site; even if a secondary crash occurs, only that part of the information is lost.

Here is what is not so good about Breakpad:

  • The generated minidump file is binary and contains a lot of unimportant information, so the files are large. On the other hand, a minidump can be debugged with GDB, and the passed-in parameters can be inspected.

Note that Chromium plans to replace Breakpad with Crashpad.

Want to follow Android’s text format and add more important information?

Modify Breakpad to add logcat information, Java call stack information, and other useful information.

5. Native crash capture registration

A Native Crash log is as follows:

The memory address after pc in the stack information is the address of the current function. We can use the following command to find the line of code where the error occurred:

arm-linux-androideabi-addr2line -e <path to .so file> <memory address>

Here are all the signals and what they mean:

#define SIGINT 2 // Program termination (e.g. Ctrl-C)
#define SIGQUIT 3 // Program exit (Ctrl-\)
#define SIGILL 4 // An illegal instruction was executed, or an attempt was made to execute a data segment; stack overflow
#define SIGTRAP 5 // Generated at a breakpoint; used by debuggers
#define SIGABRT 6 // Generated by calling abort; indicates a program exception
#define SIGIOT 6 // Same as SIGABRT
#define SIGBUS 7 // Illegal address, including memory address alignment errors
#define SIGFPE 8 // Arithmetic error, such as division by zero or overflow
#define SIGKILL 9 // Forcibly ends the program; cannot be blocked, handled, or ignored
#define SIGUSR1 10 // Not used, reserved
#define SIGSEGV 11 // Illegal memory access
#define SIGUSR2 12 // Not used, reserved
#define SIGPIPE 13 // Broken pipe; usually produced during interprocess communication
#define SIGALRM 14 // Timer signal
#define SIGTERM 15 // Ends the program; like a mild SIGKILL, it can be blocked and handled
#define SIGSTKFLT 16 // Coprocessor stack error
#define SIGCHLD 17 // Received by the parent process when a child process ends
#define SIGCONT 18 // Resumes a stopped process
#define SIGSTOP 19 // Stops the process; cannot be blocked, handled, or ignored
#define SIGTSTP 20 // Stops the process; can be handled or ignored
#define SIGTTIN 21 // When a background job tries to read from the user terminal, all processes in that job receive SIGTTIN
#define SIGTTOU 22 // Like SIGTTIN, but received when writing to the terminal
#define SIGURG 23 // Urgent or out-of-band data has arrived on a socket
#define SIGXCPU 24 // CPU time resource limit exceeded
#define SIGXFSZ 25 // File size resource limit exceeded
#define SIGVTALRM 26 // Virtual clock signal
#define SIGPROF 27 // Similar to SIGALRM/SIGVTALRM (profiling timer)
#define SIGWINCH 28 // Issued when the window size changes
#define SIGIO 29 // File descriptor ready; input/output operations can start
#define SIGPOLL SIGIO // Same as above, alias
#define SIGPWR 30 // Power exception
#define SIGSYS 31 // Invalid system call

The signals we generally care about are SIGILL (an illegal instruction was executed, an attempt was made to execute a data segment, or a stack overflow), SIGABRT (generated by calling abort, indicating a program exception), SIGBUS (illegal address, including memory address alignment errors, for example accessing a 4-byte integer whose address is not a multiple of 4), SIGFPE, SIGSEGV, SIGSTKFLT, and SIGSYS.
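On the reporting or server side it is often handy to map the signal number found in a crash log back to one of these names. The Java sketch below does that for the signals of concern, using the numbers from the listing above (the usual Linux values on ARM and x86). The `FatalSignals` class and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical lookup table for the crash-related signals listed above. */
final class FatalSignals {
    private static final Map<Integer, String> NAMES = new LinkedHashMap<>();

    static {
        NAMES.put(4, "SIGILL");
        NAMES.put(6, "SIGABRT");
        NAMES.put(7, "SIGBUS");
        NAMES.put(8, "SIGFPE");
        NAMES.put(11, "SIGSEGV");
        NAMES.put(16, "SIGSTKFLT");
        NAMES.put(31, "SIGSYS");
    }

    private FatalSignals() {}

    /** True if the signal number is one of the crash signals we subscribe to. */
    static boolean isFatal(int signo) {
        return NAMES.containsKey(signo);
    }

    /** Returns the symbolic name, or a generic label for signals outside the table. */
    static String nameOf(int signo) {
        String name = NAMES.get(signo);
        return name != null ? name : "signal " + signo;
    }
}
```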

The simplest way to subscribe to the signals raised on exceptions is to loop over all the signals of interest and call sigaction() for each one.

Note

  • JNI_OnLoad is the best place to install the signal initialization function.
  • When reporting, it is advisable to call methods in the Java layer so that reporting is handled in one place.

6. Crash analysis process

First, collect some information about the crash site, as follows:

1. Crash message

  • Process name and thread name
  • Crash stack and type
  • Sometimes you need to know the call stack of the main thread

2. System information

  • System run logs

    /system/etc/event-log-tags

  • Model, system, vendor, CPU, ABI, Linux version, etc

Note that we can look for commonalities in the following:

  • Device state
  • Whether the device is rooted
  • Whether it is an emulator

3. Memory information

Remaining system memory
/proc/meminfo

When the system’s available memory is less than 10% of MemTotal, problems such as OOM, heavy GC, and the system frequently killing and restarting processes occur very easily.
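The 10% check can be sketched in Java as follows. The sketch assumes that "available memory" corresponds to the MemAvailable field of /proc/meminfo (present on reasonably recent kernels); the `MemInfo` class and its methods are hypothetical. Parsing works on the file's text so that it can be exercised without a device.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical low-memory check based on the text of /proc/meminfo. */
final class MemInfo {
    private MemInfo() {}

    /** Reads the whole file; only works on Linux-based systems such as Android. */
    static String readProcMeminfo() throws IOException {
        return new String(Files.readAllBytes(Paths.get("/proc/meminfo")), StandardCharsets.UTF_8);
    }

    /** Returns the value in kB of a field such as "MemTotal", or -1 if it is absent. */
    static long fieldKb(String meminfo, String field) {
        for (String line : meminfo.split("\n")) {
            if (line.startsWith(field + ":")) {
                String value = line.substring(field.length() + 1).trim();
                return Long.parseLong(value.split("\\s+")[0]);
            }
        }
        return -1;
    }

    /** True when MemAvailable is below 10% of MemTotal. */
    static boolean isLowMemory(String meminfo) {
        long total = fieldKb(meminfo, "MemTotal");
        long available = fieldKb(meminfo, "MemAvailable");
        return total > 0 && available >= 0 && available * 10 < total;
    }
}
```

A crash reporter could call `isLowMemory(readProcMeminfo())` when building the report and attach the result as a tag.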

Application memory usage

Including Java memory, RSS, and PSS.

PSS and RSS are calculated from /proc/self/smaps, which also allows more detailed per-category statistics such as APK, dex, and so.

Virtual memory

Get size:

/proc/self/status

To obtain its specific distribution:

/proc/self/maps

Note that for a 32-bit process on a 32-bit CPU, memory allocation may fail once virtual memory reaches 3GB. On a 64-bit CPU, the virtual memory is usually 3 to 4GB. If 64-bit processes are supported, virtual memory is not an issue.

4. Resource information

If the application heap and device memory are both plentiful but memory allocation still fails, it may be related to a resource leak.

File handle fd

Get the fd limit:

/proc/self/limits

Generally, a process can open at most 1024 file handles. If the number exceeds 800, output all fds and their file names to the log.

The number of threads

Get the thread count:

/proc/self/status

Generally, a thread occupies 2MB of virtual memory. More than 400 threads is dangerous, so all tids and thread names should be output to the log for investigation.
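The fd and thread thresholds above can be checked with a few lines of Java. This sketch reads the Threads and VmSize fields from the text of /proc/self/status and counts the entries of /proc/self/fd. The `ResourceCheck` class, its method names, and the use of /proc/self/fd for counting are assumptions of this example; the 800 and 400 thresholds are the ones quoted in the text.

```java
import java.io.File;

/** Hypothetical resource checks based on the thresholds quoted above. */
final class ResourceCheck {
    static final int FD_WARN_THRESHOLD = 800;      // out of a typical limit of 1024
    static final int THREAD_WARN_THRESHOLD = 400;

    private ResourceCheck() {}

    /** Reads a numeric field such as "Threads" or "VmSize" from status text; -1 if absent. */
    static long statusField(String statusText, String name) {
        for (String line : statusText.split("\n")) {
            if (line.startsWith(name + ":")) {
                String value = line.substring(name.length() + 1).trim();
                return Long.parseLong(value.split("\\s+")[0]);
            }
        }
        return -1;
    }

    /** Number of open fds of this process, or -1 where /proc/self/fd is unavailable. */
    static int openFdCount() {
        String[] entries = new File("/proc/self/fd").list();
        return entries == null ? -1 : entries.length;
    }

    static boolean tooManyFds(int fdCount) {
        return fdCount > FD_WARN_THRESHOLD;
    }

    static boolean tooManyThreads(String statusText) {
        return statusField(statusText, "Threads") > THREAD_WARN_THRESHOLD;
    }
}
```

When either check fires, the reporter would dump the full fd list or the tid and thread name list into the log, as described above.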

JNI

JNI easily leads to crashes such as reference failures and reference table overflow.

Use DumpReferenceTables to collect statistics on the JNI reference tables, and then analyze whether there are problems such as JNI leaks.

Refueling station: The origin of dumpReferenceTables

It is defined in dalvik.system.VMDebug and is both a native method and a static method. It can be called from JNI like this:

// Look up the VMDebug class
jclass vm_class = env->FindClass("dalvik/system/VMDebug");
// Find the static method dumpReferenceTables, which takes no arguments and returns void
jmethodID dump_mid = env->GetStaticMethodID( vm_class, "dumpReferenceTables", "()V" );
// Invoke it to dump the JNI reference tables to the log
env->CallStaticVoidMethod( vm_class, dump_mid );

5. Application information

  • Crash scene
  • Critical operation path
  • Other custom information related to the application: running time, whether a patch is loaded, whether it is a fresh install or an upgrade.

6. Crash analysis process

Next, crash analysis:

1. Determine the focus
  • Identify severity
  • Prioritize Top crashes or crashes that have a significant impact on the business, such as crashes during startup or in the payment flow
  • Java crash: if it is an OOM, further check the memory information and resource information in the log
  • Native crash: check the signal, code, fault addr, and the Java stack at the time of the crash

Common crash types are:

  • SIGSEGV: null pointer, invalid pointer, etc.
  • SIGABRT: ANR, an explicit call to abort, etc.

If it is an ANR, first look at the main thread stack to see whether it is caused by waiting on a lock. Then look at iowait, CPU, GC, SystemServer, and other information in the ANR log to determine whether the ANR is caused by an I/O problem, CPU contention, or heavy GC.

Note: if the cause cannot be found in one crash log, look at more crash logs for the same crash point, or check the memory information and resource information to troubleshoot.

2. Look for commonalities

Device model, system, ROM, vendor, and ABI information can all serve as points of commonality, giving clear direction for reproducing the problem in the next step.

3. Try to reproduce

After reproducing the problem, add logs or debug with a debugger or GDB. If it cannot be reproduced, advanced techniques such as xlog logging, remote diagnosis, and dynamic analysis are available.

Refueling station: Solutions to system crashes

  • 1. Use the commonality information to find possible causes
  • 2. Try to avoid the crash in other ways
  • 3. Solve it with a hook

7. Practice: Use Breakpad to capture native crashes

First, here is the crash capture example project written by Mr. Zhang Shaowen for the Android Development Master Class. Breakpad has been integrated into the project to obtain system information and thread stack information when a native crash occurs. The detailed procedure for analyzing a native crash with Breakpad is as follows:

1. The example project is built with CMake, so NDK and CMake need to be downloaded first under SDK Tools in the SDK Manager of Android Studio.

2. After installing the example project, tap the CRASH button to trigger a native crash. If the sdcard permission has been granted, the generated crash information is stored in /sdcard/crashDump, which makes further analysis easier; otherwise it is stored in /data/data/com.dodola.breakpad/files/crashDump.

3. Use the adb pull command to copy the crash log file to a local directory on the computer:

adb pull /sdcard/crashDump/***.dmp ~/Documents/crash_log.dmp

4. Download and compile the Breakpad source code, then use the minidump_stackwalk tool in the src/processor directory to convert the dmp file to a txt file:

# Clone the Breakpad repository in the project directory
git clone https://github.com/google/breakpad.git
# Switch to the Breakpad root directory, then configure and compile
cd breakpad
./configure && make
# Use the minidump_stackwalk tool in the src/processor directory to convert the dmp file to a txt file
./src/processor/minidump_stackwalk ~/Documents/crashDump/crash_log.dmp >crash_log.txt

5. Open crash_log.txt, and you will see the following content:

Operating system: Android
                  0.0.0 Linux 4.4.78-perf-g539ee70 #1 SMP PREEMPT Mon Jan 14 17:08:14 CST 2019 aarch64
CPU: arm64
     8 CPUs

GPU: UNKNOWN

Crash reason:  SIGSEGV /SEGV_MAPERR
Crash address: 0x0
Process uptime: not available

Thread 0 (crashed)
 0  libcrash-lib.so + 0x650

The key information here is that the CPU is arm64 and that the crash occurred at offset 0x650 in libcrash-lib.so. Next, we need to translate this address into the corresponding line of code.

6. Use the addr2line tool provided in the NDK to symbolize the address.

For an arm64 so, use $NDKHOME/toolchains/aarch64-linux-android-4.9/prebuilt/darwin-x86_64/bin/aarch64-linux-android-addr2line.

For an arm so, use $NDKHOME/toolchains/arm-linux-androideabi-4.9/prebuilt/darwin-x86_64/bin/arm-linux-androideabi-addr2line.

As crash_log.txt shows, the CPU architecture of our device is arm64, so we need the aarch64-linux-android-addr2line command-line tool. The general form of the command is as follows:

// Note: on macOS, ./ means the executable is in the current directory
./aarch64-linux-android-addr2line -e <corresponding .so file> <address to resolve>

After the project is compiled, the corresponding .so file is located at Chapter01-master/sample/build/intermediates/merged_native_libs/debug/out/lib/arm64-v8a/libcrash-lib.so. Since my phone’s CPU architecture is arm64, I chose the libcrash-lib.so under arm64-v8a. Next, we run aarch64-linux-android-addr2line:

./aarch64-linux-android-addr2line -f -C -e ~/Documents/open-project/Chapter01-master/sample/build/intermediates/merged_native_libs/debug/out/lib/arm64-v8a/libcrash-lib.so 0x650

The meanings of the parameters are as follows:
-e --exe=<executable>: specifies the executable file whose addresses need to be translated.
-f --functions: displays the function name along with the file name and line number.
-C --demangle[=style]: decodes low-level symbol names into user-level names.

The output is:

Crash()
/Users/quchao/Documents/open-project/Chapter01-master/sample/src/main/cpp/crash.cpp:10

From this, we can see that the crash happened at line 10 of crash.cpp. We can then make the corresponding fix according to the specific situation of the project.

Tips: This is a common technique for debugging native-layer errors in NDK development (audio and video, image processing, OpenCV, hotfix framework development). It is highly recommended to master it.

6. Solutions to difficult crashes

Finally, here are solutions to some difficult crashes.

Problem 1: How do we solve the Toast BadTokenException on Android 7.0?

Follow the try-catch approach used in Android 8.0: proxy the mTN (Handler) inside Toast so that the exception can be caught.

Problem 2: How do we resolve the ANR caused by SharedPreferences apply?

Why does apply cause ANR?

When SP calls apply, it creates a wait lock and puts it into QueuedWork, wraps the actual data persistence into a task, and executes that task in an asynchronous queue. The lock is released when the task finishes. When an Activity handles onStop, or a Service handles onStop or onStartCommand, QueuedWork.waitToFinish() is executed, which waits for all wait locks to be released.

How do we solve it?

All ANRs of this kind are triggered by QueuedWork.waitToFinish(), so it is enough to manually empty the pending queue before this function is called.

Concretely: hook the Handler variable of ActivityThread and, once we have it, set a Callback on it. The Handler’s dispatchMessage processes the Callback first, so we can clear the queue inside the Callback. Note that clearing the queue requires a reflective call into QueuedWork.
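The blocking behavior and the effect of emptying the queue can be illustrated with a small plain-Java model. This is a simplified stand-in, not the real QueuedWork: the class PendingWriteModel and all of its methods are hypothetical, and the timeout parameter exists only so that the sketch terminates (the mechanism described above waits without one).

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Simplified model of the wait-lock queue; hypothetical, not the framework class. */
final class PendingWriteModel {
    private final ConcurrentLinkedQueue<CountDownLatch> waitLocks = new ConcurrentLinkedQueue<>();

    /** Models apply(): registers a wait lock that the asynchronous write releases later. */
    CountDownLatch apply() {
        CountDownLatch lock = new CountDownLatch(1);
        waitLocks.add(lock);
        return lock;
    }

    /** Models the asynchronous write finishing: releases and unregisters its wait lock. */
    void writeFinished(CountDownLatch lock) {
        lock.countDown();
        waitLocks.remove(lock);
    }

    /** Models the cleanup done from the hooked Callback: drops all registered wait locks. */
    void clearPending() {
        waitLocks.clear();
    }

    /**
     * Models waitToFinish() on the main thread. Returns true if every wait lock was
     * released within timeoutMs, false if the caller would have kept blocking.
     */
    boolean waitToFinish(long timeoutMs) {
        CountDownLatch lock;
        while ((lock = waitLocks.poll()) != null) {
            try {
                if (!lock.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                    return false;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

With an unfinished write in the queue, waitToFinish keeps the caller waiting. After clearPending, it returns immediately, which is the effect the hook aims for.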

Note

The apply mechanism itself has a fairly high failure rate (around 1.8%), so clearing the wait queue has little additional impact on persistence.

Problem 3: How do we resolve TimeoutException?

It is thrown by the system’s FinalizerWatchdogDaemon.

The watchdog monitors the running state of important services. When an important service stops responding, a Timeout exception crash occurs, and the watchdog is responsible for restarting the application. Once the watchdog is shut down (by calling its stop() method), no Timeout exception occurs even when an important service stops responding. This is a way of preventing the exception by unconventional means.

Workaround

Calling stop before Android 6.0 can run into a thread synchronization problem: before 6.0, the call to the interrupt method of threadToStop was not locked, so it could cause thread synchronization issues.

Note: calling stop has a certain probability of triggering a TimeoutException itself, even when no timeout has occurred.

Disadvantages

This is only a hack that avoids the exception being reported. It does not really solve the finalize timeout.

Problem 4: How do we solve the input method memory leak?

Use reflection to set the two View references held by the input method to null.

7. Keeping the process alive

We can use SyncAdapter to raise the priority of the process. It is an account synchronization mechanism provided by the Android system and belongs to the core process level, so the priority of a process that uses SyncAdapter is raised as well. The process priority value becomes 1, second only to the foreground process, which reduces the probability of the application being killed by the system.

8. Summary

For App crash optimization, we generally need to consider the following four points:

  • 1. Focus on prevention: pay attention to the entire application lifecycle, including developer training, compile-time checks, static scanning, standardized testing, gray release, and the release process.
  • 2. Do not simply use try catch to hide problems: start from the source, understand the root cause of the crash, and make sure the subsequent flow still runs correctly.
  • 3. Solve crashes from point to surface: consider how to solve a whole class of crashes.
  • 4. Crashes are closely related to memory, jank, and I/O.

III. ANR optimization

1. How to implement ANR monitoring

1. Use FileObserver to watch for changes to /data/anr/traces.txt

Disadvantages

Newer ROM versions require root permission.

Solution

Overseas, use Google Play services. In China, use Hardcoder.

2. Monitor the running time of the message queue

How jank monitoring works:

With the message queue mechanism of the main thread, if the application stutters, a time-consuming operation must have been executed in dispatchMessage. We set a Printer on the main thread’s Looper to measure the execution time of dispatchMessage. If it exceeds the threshold, jank has occurred, and we dump various information to help developers analyze the performance bottleneck.
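The timing logic behind such a Printer can be sketched in plain Java. The sketch relies on the fact that Looper logs a line starting with ">>>>> Dispatching" before dispatchMessage and a line starting with "<<<<< Finished" after it. The class name DispatchTimer and its threshold parameter are hypothetical, and the current time is passed in explicitly so that the logic does not depend on Android classes.

```java
/** Hypothetical timing logic for a main-thread Looper Printer. */
final class DispatchTimer {
    private final long thresholdMs;
    private long startMs = -1;

    DispatchTimer(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    /**
     * Feed every line the Looper prints, together with the current time.
     * Returns the dispatch duration in ms when a message exceeded the threshold, else -1.
     */
    long onLine(String line, long nowMs) {
        if (line.startsWith(">>>>> Dispatching")) {
            startMs = nowMs;                 // a message is about to be dispatched
        } else if (line.startsWith("<<<<< Finished") && startMs >= 0) {
            long costMs = nowMs - startMs;   // dispatchMessage has returned
            startMs = -1;
            if (costMs > thresholdMs) {
                return costMs;
            }
        }
        return -1;
    }
}
```

On Android, onLine would be called from a Printer registered through Looper.getMainLooper().setMessageLogging(...), and a positive return value is the point at which the stack and other information would be dumped.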

To monitor ANR, add a monitoring thread to the jank monitoring code. When the thread sends a message to the main thread, it saves a flag, and the flag is reset after the main thread has executed the message. If the flag has still not been reset after a certain period of time, an ANR may have occurred.
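A minimal sketch of this flag-and-reset check is shown below. It is an illustration under assumptions: StallDetector is a hypothetical name, and a java.util.concurrent.Executor stands in for posting a message to the main thread, so the logic can be tried without Android.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical flag-and-reset check run by a monitoring thread. */
final class StallDetector {
    private final Executor mainThread;   // stands in for a main-thread Handler
    private final AtomicBoolean pending = new AtomicBoolean(false);

    StallDetector(Executor mainThread) {
        this.mainThread = mainThread;
    }

    /**
     * Call once per check interval from the monitoring thread.
     * Returns true if the probe posted in an earlier interval has still not run,
     * which suggests that the main thread is blocked.
     */
    boolean checkAndProbe() {
        if (pending.get()) {
            return true;                                 // flag was never reset
        }
        pending.set(true);                               // save the state before posting
        mainThread.execute(() -> pending.set(false));    // the main thread resets the flag
        return false;
    }
}
```

A true result only means that the main thread did not run the probe in time, so it still needs the second confirmation before it is reported as an ANR.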

Disadvantages

  • It cannot accurately determine whether an ANR has really occurred. It only shows that the UI of the App is blocked, so a second check is needed. The second check asks the system whether the process has entered an error state and whether the error type is NOT_RESPONDING (value 2).

Before each ANR dialog appears, the native layer sends a signal, SIGQUIT (3). We can also listen for this signal.

  • The complete ANR log cannot be obtained.
  • It belongs to the scope of jank optimization.

3. Consider the application exit scenarios

  • Active exit: Process.killProcess(), exit(), and so on.
  • Crash
  • System restart: system exception, power failure, or user restart. This can be detected by checking whether the system uptime is smaller than the previously recorded value.
  • Killed by the system: killed by LMK, swiped away in the system task manager, and so on.

Note

Because uploading traces.txt is time-consuming, it is generally used offline. Online, it is recommended to combine ProcessErrorStateInfo with the stack information captured when the ANR occurs to upload ANRs in real time.

2. ANR optimization

The cause of ANR: something that must be done was not completed within the specified time.

ANR classification

Where it occurs

  • Activity: the onCreate method or an input event has not completed within 5s.
  • BroadcastReceiver: 10s in the foreground, 60s in the background.
  • ContentProvider: publish has not completed within 10s.
  • Service: 20s in the foreground, 200s in the background.
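For monitoring code, these limits can be kept in one place. The enum below is a hypothetical sketch: the names are illustrative rather than platform constants, and the values are simply the ones listed above.

```java
/** Hypothetical table of the ANR timeouts listed above, in milliseconds. */
enum AnrSource {
    INPUT(5_000, 5_000),
    BROADCAST(10_000, 60_000),
    CONTENT_PROVIDER(10_000, 10_000),
    SERVICE(20_000, 200_000);

    private final long foregroundMs;
    private final long backgroundMs;

    AnrSource(long foregroundMs, long backgroundMs) {
        this.foregroundMs = foregroundMs;
        this.backgroundMs = backgroundMs;
    }

    /** Returns the limit that applies to a foreground or background component. */
    long timeoutMs(boolean foreground) {
        return foreground ? foregroundMs : backgroundMs;
    }

    /** True if the elapsed time has reached the limit for this source. */
    boolean timedOut(long elapsedMs, boolean foreground) {
        return elapsedMs >= timeoutMs(foreground);
    }
}
```

For example, a broadcast that has been running for 12s has timed out in the foreground but not yet in the background.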

Causes

  • Time-consuming operations on the main thread
  • Complex layouts
  • I/O operations
  • Blocked by a synchronization lock held by a child thread
  • Blocked by the peer end of a Binder call
  • Binder is fully occupied, so the main thread cannot communicate with SystemServer
  • Unable to obtain system resources (CPU/RAM/IO)

From the perspective of processes, the causes are as follows:

  • Current process: the main thread itself is slow, the message queue of the main thread contains time-consuming operations, or the main thread is blocked by another child thread of the process.
  • Remote process: Binder calls and socket communication.

The core of how the Android system monitors ANR is message scheduling and timeout handling.

ANR analysis process

1. Obtain the logs

1. Capture a bugreport

adb shell bugreport > bugreport.txt

2. Pull /data/anr/traces.txt directly

adb pull /data/anr/traces.txt trace.txt

2. Search the log for “ANR in” and read the key information

  • Time of occurrence (may be delayed by 10 to 20s)

  • pid: when pid=0, the process was killed by LMK or crashed before the ANR occurred, so it could not receive the broadcast or key event from the system, which led to the ANR.

  • CPU load: Load: 7.58 / 6.21 / 4.83

    If the system load stays above 1.0, it should be brought down. If it reaches 5.0, the system has a serious problem.

  • CPU utilization

    CPU usage from 18101ms to 0ms ago
    28% 2085/system_server: 18% user + 10% kernel / faults: 8689 minor 24 major
    11% 752/[email protected]: 4% user + 6.9% kernel / faults: 2 minor
    9.8% 780/surfaceflinger: 6.2% user + 3.5% kernel / faults: 143 minor 4 major

The preceding information shows the CPU usage of the top processes.

Note

If CPU usage is low, the main thread may be blocked.

3. In bugreport.txt, search for the blocking log by pid and time of occurrence

----- pid 10494 at 2019-11-18 15:28:29 -----

4. Scroll down to the “main” thread to see the blocking log

"main" prio=5 tid=1 Sleeping
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x746bf7f0 self=0xe7c8f000
  | sysTid=10494 nice=-4 cgrp=default sched=0/0 handle=0xeb6784a4
  | state=S schedstat=( 5119636327 325064933 4204 ) utm=460 stm=51 core=4 HZ=100
  | stack=0xff575000-0xff577000 stackSize=8MB
  | held mutexes=

The meanings of the key fields are as follows:

  • tid: the thread number
  • sysTid: the kernel thread id; for the main thread it is the same as the process id
  • Waiting/Sleeping: the thread state
  • nice: a smaller value means a higher priority, ranging from -20 to 19
  • schedstat: the three values in parentheses are the Running time, the Runnable time (in ns), and the number of switches
  • utm: the time the thread has executed in user mode (in jiffies)
  • stm: the time the thread has executed in kernel mode (in jiffies)
  • sCount: the number of times the thread has been suspended
  • dsCount: the number of times the thread has been suspended for debugging
  • self: the address of the thread itself
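As a worked example of reading these fields, the sketch below parses a thread header into key-value pairs and converts utm + stm from jiffies to milliseconds using the HZ value in the same header. ThreadDumpHeader and its methods are hypothetical names, and the parser is only a rough illustration that assumes the header layout shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the header of one thread in a traces dump. */
final class ThreadDumpHeader {
    // First line, for example: "main" prio=5 tid=1 Sleeping
    private static final Pattern FIRST =
            Pattern.compile("^\"([^\"]+)\"(?: daemon)? prio=(\\d+) tid=(\\d+) (\\w+)");
    // Any key=value pair, with an optionally quoted value.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\"[^\"]*\"|\\S+)");

    static Map<String, String> parse(String header) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher first = FIRST.matcher(header);
        if (first.find()) {
            fields.put("name", first.group(1));
            fields.put("threadState", first.group(4));   // native-layer thread state
        }
        Matcher pair = PAIR.matcher(header);
        while (pair.find()) {
            fields.putIfAbsent(pair.group(1), pair.group(2));
        }
        return fields;
    }

    /** CPU time in ms: (utm + stm) jiffies, where one jiffy lasts 1000 / HZ ms. */
    static long cpuTimeMs(Map<String, String> fields) {
        long jiffies = Long.parseLong(fields.get("utm")) + Long.parseLong(fields.get("stm"));
        long hz = Long.parseLong(fields.get("HZ"));
        return jiffies * 1000 / hz;
    }
}
```

For the "main" thread above, utm=460 and stm=51 with HZ=100 give (460 + 51) * 1000 / 100 = 5110 ms of CPU time.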

Side note: thread states

Note that the thread states here are the thread states of the native layer. The mapping between native thread states and Java thread states is as follows:

enum ThreadState {
  //                                   Thread.State   JDWP state
  kTerminated = 66,                 // TERMINATED     TS_ZOMBIE    Thread.run has returned, but Thread* still around
  kRunnable,                        // RUNNABLE       TS_RUNNING   runnable
  kTimedWaiting,                    // TIMED_WAITING  TS_WAIT      in Object.wait() with a timeout
  kSleeping,                        // TIMED_WAITING  TS_SLEEPING  in Thread.sleep()
  kBlocked,                         // BLOCKED        TS_MONITOR   blocked on a monitor
  kWaiting,                         // WAITING        TS_WAIT      in Object.wait()
  kWaitingForLockInflation,         // WAITING        TS_WAIT      blocked inflating a thin-lock
  kWaitingForTaskProcessor,         // WAITING        TS_WAIT      blocked waiting for taskProcessor
  kWaitingForGcToComplete,          // WAITING        TS_WAIT      blocked waiting for GC
  kWaitingForCheckPointsToRun,      // WAITING        TS_WAIT      GC waiting for checkpoints to run
  kWaitingPerformingGc,             // WAITING        TS_WAIT      performing GC
  kWaitingForDebuggerSend,          // WAITING        TS_WAIT      blocked waiting for events to be sent
  kWaitingForDebuggerToAttach,      // WAITING        TS_WAIT      blocked waiting for debugger to attach
  kWaitingInMainDebuggerLoop,       // WAITING        TS_WAIT      blocking/reading/processing debugger events
  kWaitingForDebuggerSuspension,    // WAITING        TS_WAIT      waiting for debugger suspend all
  kWaitingForJniOnLoad,             // WAITING        TS_WAIT      waiting for execution of dlopen and JNI on load code
  kWaitingForSignalCatcherOutput,   // WAITING        TS_WAIT      waiting for signal catcher IO to complete
  kWaitingInMainSignalCatcherLoop,  // WAITING        TS_WAIT      blocking/reading/processing signals
  kWaitingForDeoptimization,        // WAITING        TS_WAIT      waiting for deoptimization suspend all
  kWaitingForMethodTracingStart,    // WAITING        TS_WAIT      waiting for method tracing to start
  kWaitingForVisitObjects,          // WAITING        TS_WAIT      waiting for visiting objects
  kWaitingForGetObjectsAllocated,   // WAITING        TS_WAIT      waiting for getting the number of allocated objects
  kWaitingWeakGcRootRead,           // WAITING        TS_WAIT      waiting on the GC to read a weak root
  kWaitingForGcThreadFlip,          // WAITING        TS_WAIT      waiting on the GC thread flip (CC collector) to finish
  kStarting,                        // NEW            TS_WAIT      native thread started, not yet ready to run managed code
  kNative,                          // RUNNABLE       TS_RUNNING   running in a JNI native method
  kSuspended,                       // RUNNABLE       TS_RUNNING   suspended by GC or debugger
};
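When reading traces, the native state has to be translated back to the Java state by hand. The helper below encodes the Thread.State column of the table above. It is a hypothetical sketch: NativeThreadStates is not a platform class, and it assumes the dump prints the state name without the leading "k" (as in "Sleeping").

```java
/** Hypothetical lookup from native thread state names to java.lang.Thread.State. */
final class NativeThreadStates {
    /** Accepts either the enum name ("kSleeping") or the name printed in a dump ("Sleeping"). */
    static Thread.State toJavaState(String nativeState) {
        String name = nativeState.startsWith("k") ? nativeState : "k" + nativeState;
        switch (name) {
            case "kTerminated":
                return Thread.State.TERMINATED;
            case "kStarting":
                return Thread.State.NEW;
            case "kRunnable":
            case "kNative":
            case "kSuspended":
                return Thread.State.RUNNABLE;
            case "kTimedWaiting":
            case "kSleeping":
                return Thread.State.TIMED_WAITING;
            case "kBlocked":
                return Thread.State.BLOCKED;
            default:
                // Every remaining entry in the table is a kWaiting... state.
                if (name.startsWith("kWaiting")) {
                    return Thread.State.WAITING;
                }
                throw new IllegalArgumentException("Unknown native state: " + nativeState);
        }
    }
}
```

This makes it explicit that a main thread shown as Native or Suspended still counts as RUNNABLE on the Java side.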

Other analysis methods: analyzing Java thread calls

  • Run the jps command to list all Java virtual machine processes running on the current system and obtain the pid of the application process.
  • Then use the jstack command to view the state and call relationships of all threads in the process, along with some simple analysis results.

3. Some common questions about ANR

1. Why does calling apply on SP cause ANR?

Although apply does not block the main thread when it is called, it shifts the waiting time onto the main thread.

2. How do we detect whether an abnormal exit occurred during the last run?

Set a flag when the application starts and update the flag on an active exit or a crash. On the next launch, checking this flag tells us how the previous run ended.
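The flag check can be sketched as follows. Everything here is an assumption made for illustration: ExitTracker, its key names, and the Map that stands in for persistent storage are hypothetical, and the uptime comparison follows the system-restart check from the exit scenarios listed earlier.

```java
import java.util.Map;

/** Hypothetical tracker that classifies how the previous run ended. */
final class ExitTracker {
    enum LastExit { FIRST_LAUNCH, ACTIVE_EXIT, CRASH, SYSTEM_RESTART, KILLED_BY_SYSTEM }

    private final Map<String, String> store;   // stands in for persistent key-value storage

    ExitTracker(Map<String, String> store) {
        this.store = store;
    }

    /** Call at startup: classifies the previous exit, then marks this run as in progress. */
    LastExit onStart(long uptimeMs) {
        String flag = store.get("exit_flag");
        long lastUptimeMs = Long.parseLong(store.getOrDefault("uptime_ms", "0"));
        LastExit result;
        if (flag == null) {
            result = LastExit.FIRST_LAUNCH;
        } else if (flag.equals("active_exit")) {
            result = LastExit.ACTIVE_EXIT;
        } else if (flag.equals("crash")) {
            result = LastExit.CRASH;
        } else if (uptimeMs < lastUptimeMs) {
            result = LastExit.SYSTEM_RESTART;      // uptime went backwards: the device rebooted
        } else {
            result = LastExit.KILLED_BY_SYSTEM;    // flag still "running" and no reboot seen
        }
        store.put("exit_flag", "running");
        store.put("uptime_ms", Long.toString(uptimeMs));
        return result;
    }

    /** Call right before an active exit such as killProcess() or exit(). */
    void onActiveExit() {
        store.put("exit_flag", "active_exit");
    }

    /** Call from the crash handler. */
    void onCrash() {
        store.put("exit_flag", "crash");
    }
}
```

Any result other than FIRST_LAUNCH or ACTIVE_EXIT can be treated as an abnormal exit. A reboot followed by a long delay before the next launch would be misread as KILLED_BY_SYSTEM, so a real implementation would need to refresh the recorded uptime periodically.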

4. Understand the ANR trigger process

Broadcast timeouts work roughly the same way as Service timeouts, with one very subtle difference: timeouts of statically registered broadcasts are affected by SharedPreferences (SP).

When SP has work that has not yet been synced to disk, the App waits for it to complete before informing the system that the broadcast has been handled. Only broadcasts registered statically in XML take unfinished SP work into account in timeout detection; dynamically registered broadcasts are not affected.

  • For Service, Broadcast, and Input, AMS.appNotResponding is eventually called after an ANR occurs.
  • For a provider, an ANR may occur during publish when its process is started. In this case, the process is killed directly and the related information is cleaned up, without an ANR dialog being shown.

1. The AMS.appNotResponding flow

  • Output the ANR reason to the EventLog. In other words, the timestamp closest to the ANR trigger point is the am_anr entry in the EventLog.
  • Collect and output the traces of every thread in the list of important processes. This step is time-consuming.
  • Output the CPU usage of each process and the CPU load.
  • Save the traces file and the CPU usage information to dropbox, that is, the data/system/dropbox directory (the most important information for ANR analysis).
  • Depending on the type of process, either kill it directly in the background or show a dialog to inform the user.

2. The AMS.dumpStackTraces flow

1. Collect the stacks of the firstPids processes:

  • The first is the process in which the ANR occurred;
  • The second is system_server;
  • The rest are all the persistent processes in mLruProcesses.

2. Collect the stacks of native processes (dumpNativeBacktraceToFile):

  • In order: the mediaserver, sdcard, and surfaceflinger processes.

3. Collect the stacks of the lastPids processes:

  • In order: the top 5 processes by CPU usage.

Note

While the traces above are exported, there is a 200ms sleep between processes.

IV. Building a high availability solution for mobile services

1. Importance of business high availability

The importance of business high availability can be summarized in five points:

  • High availability
  • Performance
  • Business
  • It focuses on the complete availability of user-facing functions
  • It has a real impact on revenue

2. Building a business high availability solution

Building a business high availability solution involves many details, but they can be summarized as follows:

  • Data collection
  • Sort out the main flow, core paths, and key nodes of the project
  • AOP automatic collection, unified reporting
  • Alarm strategy: threshold alarm, trend alarm, specific indicator alarm, direct report (or floor threshold)
  • Exception monitoring
  • Single-point tracing: specific problems that need specific analysis, full log retrieval, dedicated analysis
  • Fallback strategy
  • Configuration center, function switch
  • Jump distribution center (componentized routing)

3. Mobile disaster recovery solution

Disasters include:

  • Performance anomalies
  • Business anomalies

Traditional process:

User feedback, repackaging, channel update. This is unacceptable.

Building the disaster recovery solution

Building a disaster recovery solution can be divided into the following seven points, which we go through one by one.

1. Function switch

The configuration center and server-delivered configuration control whether a function is enabled.

Target scenarios

  • New functions
  • Code changes
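A function switch can be as small as a lookup with a safe default. The sketch below is only an illustration: FeatureSwitches and the feature key used in the usage note are hypothetical, and how the configuration reaches the client is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical client-side holder for switches delivered by the configuration center. */
final class FeatureSwitches {
    private final Map<String, Boolean> delivered = new ConcurrentHashMap<>();

    /** Applies a configuration received from the server. */
    void update(Map<String, Boolean> config) {
        delivered.putAll(config);
    }

    /** A feature the server has never mentioned falls back to the supplied default. */
    boolean isEnabled(String feature, boolean defaultValue) {
        return delivered.getOrDefault(feature, defaultValue);
    }
}
```

The entry point of a new function would be shown only if isEnabled("new_checkout", false) returns true, so delivering false for that key closes the entry without a new release.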

2. Jump center

  • Page navigation goes through the router, which decides whether to redirect the jump
  • A native bug that cannot be fixed by hotfix can be redirected to a temporary H5 page
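The redirect decision can be sketched as a table consulted on every jump. JumpCenter, the route names, and the example URL in the usage note are all hypothetical, and a real router would also carry parameters and open the page itself.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical jump center: every navigation asks it which target to open. */
final class JumpCenter {
    private final Map<String, String> redirects = new ConcurrentHashMap<>();

    /** Server-delivered rule: send a broken native page to a temporary H5 URL. */
    void redirect(String route, String h5Url) {
        redirects.put(route, h5Url);
    }

    /** Removes the rule once the native page has been fixed. */
    void restore(String route) {
        redirects.remove(route);
    }

    /** Returns the redirect target if one is configured, otherwise the original route. */
    String resolve(String route) {
        return redirects.getOrDefault(route, route);
    }
}
```

After redirect("native://order", "https://example.com/h5/order"), resolve("native://order") returns the H5 URL, while all other routes are unaffected.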

3. Dynamic repair

Hotfix capability that can be monitored, gray-released, rolled back, and cleaned up.

4. Combine push and pull, and call in multiple scenarios to guarantee the arrival rate

5. Incremental updates for Weex and RN

6. Safe mode

WeChat Reading, Mogujie, Taobao, Tmall, and other heavily operated Apps all use a safe mode to protect the startup flow of the client and to give users a chance to recover after a startup failure. Its core features are:

  • Automatic recovery based on crash information; after repeated startup failures, the application is reset to its freshly installed state
  • For severe bugs, it can block startup until the hotfix is applied

Safe mode design

Configuration backend: a unified configuration backend with a gray release mechanism

1. Client capabilities:

  • When the App crashes repeatedly, it can repair itself in graded steps without the user noticing
  • Synchronous hotfix capability
  • The ability to trigger a specified function
  • The ability to register specific functions, which makes it easy to extend safe mode later

2. Data statistics and alarms

  • Unified data platform
  • Monitoring and alarms so that problems are detected in time
  • View data such as the hotfix success rate

3. Quick testing

  • Optimize testing in the pre-release environment
  • Optimize regression verification, which is a pain point of safe mode

How Tmall safe mode works

1. How do we detect an abnormal exit?

Record a flag value when the App starts, and clear the flag value when any of the following conditions is met:

  • The App has run normally for 10 seconds after startup
  • The user exits the application
  • The user actively switches the App from the foreground to the background

If an exception occurs during startup, the flag value is not cleared, so the flag value tells us whether the client exited abnormally. Each time the client exits abnormally, the flag value is incremented by 1.

2. Graded policy execution in safe mode

Safe mode has two levels. Two consecutive crashes trigger level 1 safe mode, and three or more consecutive crashes trigger level 2 safe mode.

Business lines can register behaviors for level 1 safe mode, such as clearing cached data. When this mode is entered, the registered behaviors are used to try to repair the client. If level 1 safe mode cannot repair the App, it enters level 2 safe mode, which restores the App to its freshly installed state and empties the three root directories Document, Library, and Cache.
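The counter and the graded decision can be sketched together. This is a hypothetical illustration rather than the Tmall implementation: SafeModeGuard is an invented name, the counter would have to be persisted across launches in a real App, and the sketch assumes that level 2 begins at the third consecutive abnormal exit.

```java
/** Hypothetical launch guard: counts consecutive abnormal exits and picks a safe mode level. */
final class SafeModeGuard {
    static final int NONE = 0;
    static final int LEVEL_1 = 1;
    static final int LEVEL_2 = 2;

    private int abnormalExits;   // a real App would persist this value across launches

    /** Call at the very start of every launch. */
    int onLaunch() {
        int level = levelFor(abnormalExits);
        abnormalExits++;         // assume this launch fails until it is proven healthy
        return level;
    }

    /** Call after 10 s of normal running, or when the user exits or backgrounds the App. */
    void onHealthy() {
        abnormalExits = 0;
    }

    /** Assumption: 2 consecutive crashes give level 1, 3 or more give level 2. */
    static int levelFor(int consecutiveCrashes) {
        if (consecutiveCrashes >= 3) {
            return LEVEL_2;
        }
        if (consecutiveCrashes == 2) {
            return LEVEL_1;
        }
        return NONE;
    }
}
```

On LEVEL_1 the registered repair behaviors, such as clearing the cache, would run. On LEVEL_2 the App would be reset to its freshly installed state.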

3. Hotfix execution policy

Whenever a hotfix is found in the delivered configuration, the App applies it synchronously, blocking startup, so that the fix takes effect in time.

4. Gray release scheme

During a gray release, the delivered configuration contains both the gray configuration and the formal configuration, together with the gray probability. The App uses a specific algorithm to calculate whether it meets the gray condition. If it does, it uses the gray configuration.
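The text does not say which algorithm is used. One common choice, shown here purely as an assumption, is to hash a stable device identifier into a bucket from 0 to 99 and compare it with the gray percentage. GrayRelease and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical gray release check based on a stable hash of the device id. */
final class GrayRelease {
    /** Maps a device id to a stable bucket in [0, 100). */
    static int bucket(String deviceId) {
        CRC32 crc = new CRC32();
        crc.update(deviceId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    /** True if this device falls within the gray percentage (0 to 100). */
    static boolean inGray(String deviceId, int grayPercent) {
        return bucket(deviceId) < grayPercent;
    }

    /** Picks the gray configuration for gray devices and the formal one otherwise. */
    static String choose(String deviceId, int grayPercent, String grayConfig, String formalConfig) {
        return inGray(deviceId, grayPercent) ? grayConfig : formalConfig;
    }
}
```

Because the bucket depends only on the device id, a device stays in or out of the gray group across launches, and raising the percentage only adds devices to the group.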

Usability considerations

1. Integration cost

Complete documentation and a concise interface

2. Unified configuration backend

Configuration can be delivered per App and per version

3. Customization

Supports customization and lets the integrating party decide the specific behavior

4. Gray release mechanism

5. Data analysis

A unified data platform provides the basis for improving safe mode

6. Quick testing

Create more targeted test cases, such as simulating consecutive crashes

7. Exception circuit breaker

When multiple requests fail, the network library can proactively reject further requests.
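A minimal circuit breaker for this purpose might look like the sketch below. It is an illustration under assumptions: RequestCircuitBreaker, the failure threshold, and the cool-down period are hypothetical, and time is passed in explicitly to keep the logic easy to test.

```java
/** Hypothetical circuit breaker that a network library could consult before each request. */
final class RequestCircuitBreaker {
    private final int failureThreshold;
    private final long openMs;
    private int consecutiveFailures;
    private long openedAtMs = -1;   // -1 means the breaker is closed

    RequestCircuitBreaker(int failureThreshold, long openMs) {
        this.failureThreshold = failureThreshold;
        this.openMs = openMs;
    }

    /** Returns false while the breaker is open, meaning the request is rejected locally. */
    synchronized boolean allowRequest(long nowMs) {
        if (openedAtMs < 0) {
            return true;
        }
        if (nowMs - openedAtMs >= openMs) {
            openedAtMs = -1;          // cool-down is over: let traffic through again
            consecutiveFailures = 0;
            return true;
        }
        return false;
    }

    synchronized void onSuccess() {
        consecutiveFailures = 0;
    }

    synchronized void onFailure(long nowMs) {
        if (++consecutiveFailures >= failureThreshold) {
            openedAtMs = nowMs;       // too many failures in a row: open the breaker
        }
    }
}
```

With a threshold of 3 and a cool-down of 1000 ms, three failures in a row cause requests to be rejected for the next second, which protects both the client and a struggling server.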

Combined path of the disaster recovery solution

Function switch -> jump center -> dynamic repair -> safe mode

V. Long-term stability governance

To govern App stability over the long term, we need targeted measures in each phase: development => testing => code integration => release => operation and maintenance.

1. Development phase

  • Unified coding standards, stronger coding fundamentals, technical reviews, and a code review mechanism
  • Architecture optimization
  • Capability convergence
  • Unified fault tolerance: for example, the network library utils pre-check the returned information in one place, and if it is invalid, the rest of the flow is skipped.

2. Testing phase

  • Functional testing, automated testing, regression testing, and overlay installation testing
  • Boundary testing for special scenarios and device models: for example, the server returns abnormal data or the server goes down
  • Cloud testing platform: provides a wider range of device models for testing

3. Code integration phase

  • Compile-time checks and static scanning
  • Pre-compilation flow and automatic regression of the main flow

4. Release phase

  • Multiple rounds of gray release
  • Comprehensive coverage across scenarios and dimensions

5. Operation and maintenance phase

  • Intelligent monitoring
  • Rollback and degradation strategies
  • Hotfix and local disaster recovery solutions

VI. Stability optimization

1. What stability optimizations have you done?

As the project matured, the user base and DAU kept rising, and we ran into many stability problems that posed considerable challenges for our engineers. Users often experienced jank or found functions unavailable when using our App. We therefore launched a dedicated stability optimization effort, which consisted of three main parts:

  • Crash-focused optimization
  • Performance stability optimization
  • Business stability optimization

Through these three areas of optimization, we built a mobile high availability platform. We also took many measures to make the App truly highly available.

2. How do you achieve performance stability?

  • Comprehensive performance optimization: startup speed, memory optimization, rendering optimization
  • Offline: discover and fix problems
  • Online: mainly monitoring
  • Crash-focused optimization

We optimized across many dimensions: startup speed, memory, layout loading, jank, package size, network traffic, power consumption, and more.

Our optimization is divided into two levels, offline and online. Offline, we focus on discovering problems and solving them directly, so that as many as possible are fixed before release. Once online, our main goal is monitoring: by monitoring each performance dimension, we get alarms about abnormal situations as early as possible.

For the most serious online performance problem, the crash, we did dedicated optimization. We not only improved the crash metrics themselves, but also collected as much detailed information as possible at the time of the crash, combined with backend aggregation and alarms, so that we can locate problems quickly.

3. How do you guarantee business stability?

  • Data collection + alarms
  • Instrument and monitor the main flow and core paths of the project.
  • Also track how many exceptions occur at each step, so that we know the conversion rate of every business flow and of the corresponding pages
  • Combined with the dashboard, raise an alarm if a conversion rate falls below a certain value
  • Exception monitoring + single-point tracing
  • Fallback strategies, such as the Tmall safe mode

High availability for mobile services focuses on user functions being completely available. It mainly addresses online anomalies where the App does not crash and has no performance problem, yet a function is simply unavailable. We need to instrument the main flow and core paths of the project, calculate the real conversion rate of every step, and also know exactly how many exceptions occur at each step. That way we know the conversion rate of every business flow and of the corresponding pages. Combined with the dashboard data, if a conversion rate or a monitored success rate falls below a certain value, an online anomaly has very likely occurred. With the corresponding alarms in place, we no longer need to wait for user feedback. This is the foundation of business stability assurance.
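The threshold alarm on step conversion can be sketched in a few lines. FunnelMonitor is a hypothetical name, the per-step counts are assumed to come from the instrumentation described above, and the threshold is something each business flow would have to choose.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical funnel check: flags steps whose conversion from the previous step is too low. */
final class FunnelMonitor {
    /**
     * reached[i] is the number of users who reached step i of a business flow.
     * Returns the indexes of the steps whose conversion rate is below minRate.
     */
    static List<Integer> stepsBelow(long[] reached, double minRate) {
        List<Integer> alarms = new ArrayList<>();
        for (int i = 1; i < reached.length; i++) {
            if (reached[i - 1] == 0) {
                continue;   // nobody reached the previous step, so there is nothing to compare
            }
            double rate = (double) reached[i] / reached[i - 1];
            if (rate < minRate) {
                alarms.add(i);
            }
        }
        return alarms;
    }
}
```

For counts of 1000, 900, 300, and 290 with a minimum rate of 0.5, only step 2 is flagged, because 300 / 900 is about 0.33 while the other steps convert at 0.9 or higher.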

At the same time, some special cases need attention. For example, during development or inside catch blocks in the code, an exception is caught so that the program does not crash. This is actually unreasonable: the program did not crash, but the function has become unavailable. So we also need to report these caught exceptions, which lets us know exactly which exceptions caused the problems users ran into. In addition, there are online single-point problems, for example a user who taps login but can never get in. For such a problem we cannot find commonality with other problems, so we have to retrieve the corresponding detailed information.

Finally, if an anomaly does occur, we have a series of measures in place to stop the loss quickly. (=> 4)

4. How do you stop losses quickly when an anomaly occurs?

  • Function switches
  • Unified jump center
  • Dynamic repair: hotfix, resource pack update
  • Self-repair: safe mode

First, the App needs some advanced capabilities. Every new function we launch must come with a function switch, and the switch delivered by the configuration center decides whether the entry point of the new function is shown. If something goes wrong, we can urgently close the entry point of the new function, which keeps the App in a controllable state.

Next, we need to set up routed navigation for the App, so that every page jump is dispatched through the router. If a jump matches a new function that has a bug, we either do not jump or jump to a unified exception handling page. If neither of these two methods works, we can consider dynamic repair through hotfix. Hotfix solutions are now quite mature, and we can add hotfix capability to our project at low cost. If some functions are implemented with RN or Weex, even better: we can update dynamically by updating the resource pack. If none of this is possible, we can consider adding self-repair capability to the application. If the App fails to start several times, we can clear all cached data and reset the App to its freshly installed state. At the most severe level, we can block the main thread and let the user in only after the App has been hotfixed successfully.

VII. Summary

Android stability optimization requires long-term investment and continuous operation and maintenance. Above, we not only discussed in depth the troubleshooting process and solutions for Java crashes, native crashes, and ANR, but also analyzed their internal implementation principles and monitoring flow. To do stability optimization well, we must understand how the virtual machine runs, Linux signal handling, and memory allocation. Only with this underlying knowledge can we design better stability optimization schemes than others.

Thank you for reading this article. I hope you can share it with your friends or tech groups. It means a lot to me.

I hope we can become friends on GitHub and Juejin and share knowledge together.