I have been working on interview topics for a long time, and I hope they help those of you who are preparing for interviews or thinking about changing jobs. Today I have put together an article on crash optimization, and I will keep updating it.
Preface
What should developers do when an app crashes? Many people would say: use the log to find the code that crashes, catch the exception, and "swallow" every Java crash. Whether the program still behaves correctly afterwards is left to fate. That does work as an emergency measure, but what is the real cause of the crash? Shouldn't we address the root cause?
1. Crashes
Crash rate is a basic indicator of application quality. So how do we measure it objectively, and how should we think about the stability issues related to crashes?
Android crashes fall into two types:
- Java crashes
- Native crashes
Simply put, a Java crash is an uncaught exception in Java code that causes the program to exit abnormally. Native crashes are usually caused by invalid address access in native code, address alignment problems, or an explicit abort(), which raise the corresponding signal and cause the program to exit abnormally.
1.1 Crash collection
A "crash" means the program hit an exception, and a product's crash rate depends heavily on how we capture and handle those exceptions. Many small and medium-sized companies can rely on third-party services. There are plenty of platforms to choose from, including Alibaba's Umeng, Tencent's Bugly, NetEase's Yunbu, and Google's Firebase. Learn to make good use of them.
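To make the capture step concrete, here is a minimal sketch in plain Java of how Java crashes can be collected with a default uncaught-exception handler. The class name CrashCollector and the choice to print the report are assumptions for illustration; a real SDK would write the report to disk or upload it.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

final class CrashCollector {
    // Builds a plain-text report from the crashing thread and the throwable.
    static String buildReport(Thread thread, Throwable error) {
        StringWriter out = new StringWriter();
        out.write("thread: " + thread.getName() + "\n");
        error.printStackTrace(new PrintWriter(out, true));
        return out.toString();
    }

    // Installs a process-wide handler that records the report and then
    // delegates to the previous handler, so the default behavior is kept.
    static void install() {
        Thread.UncaughtExceptionHandler previous =
                Thread.getDefaultUncaughtExceptionHandler();
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            // Placeholder for "save or upload the report" (assumption).
            System.err.println(buildReport(thread, error));
            if (previous != null) {
                previous.uncaughtException(thread, error);
            }
        });
    }
}
```

Calling CrashCollector.install() once at startup is enough. Delegating to the previous handler matters because swallowing the exception here would hide the crash rather than report it.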
1.2 ANR
Is the crash rate completely equivalent to the stability of the application? The answer is absolutely not. After handling crashes, we often encounter the ANR (Application Not Responding) problem.
When an ANR occurs, the system pops up a dialog box that interrupts what the user is doing, which users find very hard to tolerate.
One way to detect an ANR is to use FileObserver to listen for changes to /data/anr/traces.txt. Unfortunately, many newer ROMs no longer allow apps to read this file. In that case you have to look for other approaches: overseas apps can use Google Play services, while in China WeChat uses the Hardcoder framework (Hardcoder is a standalone Android communication framework that lets the app and the vendor ROM "talk" in real time; its goal is to allocate system resources fully so as to improve the app's running speed and picture quality, and to improve people's experience on their phones). As a last resort, you can also root the phone and pull the trace file directly.
1.3 Application Exit
In addition to the common crashes, there are some situations that can cause an application to exit unexpectedly, such as:
- Active exit. Process.killProcess(), exit(), and so on
- Crash. A Java or native crash occurred
- System restart. The system malfunctions, loses power, or the user restarts it. We can detect this by checking whether the recorded time since boot is smaller than the previously recorded value
- Killed by the system. Killed by the low memory killer, swiped away in the system's task manager, and so on
- ANR
We can set a flag at startup and update it on an active exit or a crash, so that the next time the app starts we can check the flag to see whether the previous run ended abnormally. Of the five exit scenarios above, we exclude active exit and crash (crashes are counted separately) and aim to monitor abnormal exits in the remaining three. In theory, this capture mechanism can reach 100% coverage.
This abnormal-exit detection reflects problems that cannot be caught in the normal way, such as ANRs, the low memory killer, forced kills by the system, system crashes, and power loss. Of course, the abnormal-exit rate contains some false positives, such as users swiping the app away in the system's task manager. Across large volumes of production data, it can also help us find hidden problems in the code.
Depending on whether the app was in the foreground or background, abnormal exits can be divided into foreground and background abnormal exits. "Killed by the system" is the main cause of background abnormal exits. We naturally pay more attention to foreground abnormal exits, which are more closely related to ANR, OOM, and similar cases.
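The flag mechanism described above can be sketched as follows. This is a minimal illustration in plain Java, assuming the flag is kept in a small file; the class name ExitMarker, the method names, and the state strings are hypothetical, and an Android app would more likely keep the flag in its own private storage.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExitMarker {
    private final Path file;

    ExitMarker(Path file) {
        this.file = file;
    }

    // Call at startup. Returns true when the previous run left the marker at
    // "running", meaning it ended without a recorded exit or crash.
    boolean startAndCheckPreviousRun() {
        try {
            boolean abnormal = Files.exists(file)
                    && "running".equals(new String(
                            Files.readAllBytes(file), StandardCharsets.UTF_8).trim());
            write("running");
            return abnormal;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Call before an active exit.
    void markCleanExit() {
        writeUnchecked("exited");
    }

    // Call from the crash handler; crashes are counted separately.
    void markCrash() {
        writeUnchecked("crashed");
    }

    private void writeUnchecked(String state) {
        try {
            write(state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private void write(String state) throws IOException {
        Files.write(file, state.getBytes(StandardCharsets.UTF_8));
    }
}
```

If startAndCheckPreviousRun() returns true, the last run ended through one of the remaining scenarios (system restart, killed by the system, or ANR), and that can be reported as an abnormal exit.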
2. Crash handling
We run into all sorts of difficult problems at work every day, and crashes are among the more common ones. Solving problems is like solving crimes: it takes experience, and the more practiced we are at analysis, the faster and more accurately we can locate the cause. There are also plenty of techniques involved. What information at the "crime scene" should we pay attention to? How do we find more "witnesses" and "leads"? What is the general process of "investigating a case"? Which investigation methods suit which types of "cases"?
Believe that "there is always exactly one truth", and crashes are nothing to fear.
2.1 Crash Site
The crash site is our "first crime scene" and holds many valuable clues. The more information we can mine from it, the clearer the direction of analysis becomes, instead of relying on blind guesswork.
Crash information
From the basic crash information we can make a preliminary judgment about the crash.
Process name and thread name. Whether the crash occurred in a foreground or background process, and whether it occurred on the UI thread.
Crash stack and type. Whether the crash is a Java crash, a native crash, or an ANR; different types call for attention to different points. In particular, look at the top of the crash stack to see whether the crash is in system code or in the app's own code.
Keyword: FATAL
FATAL EXCEPTION: main
Process: com.cchip.csmart, PID: 27456
java.lang.NullPointerException: Attempt to invoke virtual method 'void android.widget.TextView.setText(int)' on a null object reference
    at com.cchip.alicsmart.activity.SplashActivity$1.handleMessage(SplashActivity.java:67)
    at android.os.Handler.dispatchMessage(Handler.java:102)
    at android.os.Looper.loop(Looper.java:179)
    at android.app.ActivityThread.main(ActivityThread.java:5672)
    at java.lang.reflect.Method.invoke(Native Method)
    at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:784)
    at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:674)
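Checking whether the top of the stack is app code or system code can be automated. The following is a rough sketch, assuming the app's code lives under a known package prefix; the class name StackTop and its method names are hypothetical.

```java
final class StackTop {
    // Returns the first frame whose class is in the app's package, or null
    // when the whole stack belongs to framework or library code.
    static StackTraceElement firstAppFrame(Throwable error, String appPackage) {
        for (StackTraceElement frame : error.getStackTrace()) {
            if (frame.getClassName().startsWith(appPackage + ".")) {
                return frame;
            }
        }
        return null;
    }

    // True when the top frame (where the exception was thrown) is app code.
    static boolean crashedInAppCode(Throwable error, String appPackage) {
        StackTraceElement[] frames = error.getStackTrace();
        return frames.length > 0
                && frames[0].getClassName().startsWith(appPackage + ".");
    }
}
```

For the trace above, a prefix of com.cchip.alicsmart would mark the top frame (SplashActivity$1.handleMessage) as app code. When crashedInAppCode is false but firstAppFrame is not null, the crash happened in system code that the app called, which is usually the next place to look.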
System information
The system information sometimes carries some key clues, which can be very helpful in solving problems.
Logcat. This includes the run logs of both the application and the system. Because of system permissions, the Logcat we can obtain may contain only entries related to the current app. The system's event Logcat records some basic information about how the app is running; its tags are defined in the file /system/etc/event-log-tags.
// system logcat:
10-25 17:13:47.788 21430 21430 D dalvikvm: Trying to load lib ...
// event logcat:
10-25 17:13:47.788 21430 21430 I am_on_resume_called: life cycle
10-25 17:13:47.788 21430 21430 I am_low_memory: the system memory is insufficient
10-25 17:13:47.788 21430 21430 I am_destroy_activity: destroys the Activity
10-25 17:13:47.888 21430 21430 I am_anr: ANR and the cause
10-25 17:13:47.888 21430 21430 I am_kill: the app was killed, and why
Device model, OS version, vendor, CPU, ABI, Linux version, and so on. We collect as many as dozens of dimensions, which is very helpful when looking for what a group of crashes has in common.
Memory information
OOM, ANR, virtual memory running out, etc. Many crashes are directly related to memory. If you split your phone’s memory into “less than 2GB” and “more than 2GB,” the crash rate for “less than 2GB” users is several times higher than that for “more than 2GB” users.
Remaining system memory. The system memory status can be read directly from the /proc/meminfo file. When the system has very little available memory (less than 10% of MemTotal), OOM, heavy GC, and the system repeatedly killing and restarting processes are all very common.
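As a minimal sketch of the 10% check, the lines of /proc/meminfo can be parsed into a map and compared. The class name MemInfo is hypothetical, and using the MemAvailable field as the measure of "available memory" is an assumption; older kernels do not expose it, in which case the check below simply reports false.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MemInfo {
    // Parses lines such as "MemTotal:  3882464 kB" into field name -> kB.
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> fields = new HashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String[] parts = line.substring(colon + 1).trim().split("\\s+");
            try {
                fields.put(line.substring(0, colon).trim(), Long.parseLong(parts[0]));
            } catch (NumberFormatException ignored) {
                // Skip lines whose value is not a number.
            }
        }
        return fields;
    }

    // True when MemAvailable is below the given fraction of MemTotal.
    static boolean isLowMemory(Map<String, Long> fields, double threshold) {
        Long total = fields.get("MemTotal");
        Long available = fields.get("MemAvailable");
        if (total == null || available == null || total == 0) {
            return false;
        }
        return (double) available / total < threshold;
    }
}
```

On a device this would be fed with Files.readAllLines(Paths.get("/proc/meminfo")) and a threshold of 0.10, and the result attached to the crash report.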
Memory used by the application. This includes Java memory, Resident Set Size (RSS), and Proportional Set Size (PSS), from which we can work out how much memory the application itself uses and how it is distributed. For PSS and RSS, more detailed breakdowns such as APK, DEX, and SO can be computed from /proc/self/smaps.
Virtual memory. The virtual memory size can be read from /proc/self/status, and its distribution from /proc/self/maps. We sometimes do not take virtual memory seriously, but many problems such as OOM and tgkill are caused by running out of virtual memory.
Name: com.xmamiga.name // process name
FDSize: 800 // Number of file descriptor slots currently allocated by the process
VmPeak: 3004628 kB // The peak virtual memory size of the current process
VmSize: 2997032 kB // The virtual memory size of the current process
Threads: 600 // The number of threads in the current process
In general, for a 32-bit process on a 32-bit CPU, memory allocations may start to fail once virtual memory reaches about 3GB. On a 64-bit CPU, a 32-bit process can generally use between 3 and 4GB of virtual memory. Of course, if we ship a 64-bit process, virtual memory is no longer an issue. Google Play has required 64-bit support since August 2019. In China, although more than 90% of devices support 64-bit, app stores do not support publishing separate packages per CPU architecture, so adoption will take longer.
Resource information
Sometimes both the application heap and the device memory are plentiful, yet memory allocation still fails. This is more likely related to a resource leak.
File handles (fd). A single process can usually open at most 1024 file handles. Once the count exceeds 800 the situation is already dangerous, and all fds and their corresponding file names should be written to the log to check for file or thread leaks.
opened files count 812:
0 -> /dev/null
1 -> /dev/log/main4
2 -> /dev/binder
3 -> /data/data/com.xmamiga.sample/files/test.config
...
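A listing like the one above can be produced by reading the symbolic links under /proc/self/fd. The sketch below takes the directory as a parameter so that it can be tried on any directory of symlinks; the class name FdDump is hypothetical, and whether an app may read every link target depends on the device.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

final class FdDump {
    // Lists every entry of fdDir as "<fd> -> <target>", sorted as text.
    static List<String> list(Path fdDir) {
        List<String> result = new ArrayList<>();
        try (Stream<Path> entries = Files.list(fdDir)) {
            entries.forEach(entry -> {
                String target;
                try {
                    target = Files.readSymbolicLink(entry).toString();
                } catch (IOException | RuntimeException e) {
                    // The fd may have been closed while we were listing.
                    target = "?";
                }
                result.add(entry.getFileName() + " -> " + target);
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(result);
        return result;
    }
}
```

Calling FdDump.list(Paths.get("/proc/self/fd")) when the count passes the 800 mark gives the names needed to see which kind of file is leaking.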
Thread count. The current number of threads can be read from the status file above. A single thread can take up about 2MB of virtual memory, so too many threads put pressure on both virtual memory and file handles. In my experience, more than 400 threads is dangerous. All thread ids and their thread names should be written to the log to check for thread-related problems.
threads count 412:
1820 com.xmamiga.crashsdk
1844 ReferenceQueueD
1869 FinalizerDaemon
...
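For the Java side, a similar listing can be sketched with Thread.getAllStackTraces(). Note the assumptions: this only sees Java threads, and the ids are Java thread ids rather than the kernel thread ids shown in the sample above. The class name ThreadDump is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ThreadDump {
    // Returns "<java thread id> <name>" for every live Java thread.
    static List<String> list() {
        List<String> result = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            result.add(t.getId() + " " + t.getName());
        }
        return result;
    }

    // True when the number of live Java threads is above the limit,
    // for example the 400 mentioned above.
    static boolean exceeds(int limit) {
        return Thread.getAllStackTraces().size() > limit;
    }
}
```

Threads created in native code without attaching to the VM do not appear here, so the count from the status file remains the more complete number.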
JNI. When using JNI, it is easy to cause crashes such as invalid references or reference table overflow if you are not careful.
Application information
Besides the system, the application itself knows more about its own state and can record a lot of relevant information.
- Crash scenario. Which Activity or Fragment the crash occurred in, and which business feature it belongs to.
- Critical action path. Unlike the detailed logs used during development, we can record the user's key action path, which is a great help in reproducing the crash.
- Other custom information. Different applications may care about different things.
2.2 Crash Analysis
With so much information from the scene, we can begin the real "case-solving" journey. The vast majority of "cases" can be solved if we put in the effort. Don't be afraid of problems. Through patient and careful analysis, you can usually spot some anomaly or key point; dare to doubt, and then verify.
Step 1: Identify the key points
The key is to find the important information in the logs and form a general judgment about the problem. In general, I suggest focusing on the following points in this step.
- Identify severity. Fixing crashes is also a matter of cost-effectiveness. We prioritize the top crashes and those that have a significant impact on the business, such as crashes in major features. Don't spend days fixing a crash in an obscure corner that may be removed in the next version anyway.
- Basic crash information. Determine the crash type and the exception description to get a rough idea of the crash. In general, most simple crashes can be resolved at this step.
Java crashes. Java crash types are fairly obvious. For example, NullPointerException is a null pointer and OutOfMemoryError means resources are insufficient. In the latter case, you need to further check the "Memory information" and "Resource information" in the logs.
Native crashes. Look at the signal, code, fault addr, and the Java stack at the time of the crash. For the meaning of each signal, see a crash signal reference. SIGSEGV and SIGABRT are the common ones. The former is usually caused by null or invalid pointers, and the latter mainly by ANR and abort() exits.
ANR. First check the main thread stack to see whether it is waiting on a lock. Then look at the ANR log for iowait, CPU, GC, system server, and similar information, to determine whether it is an I/O problem, CPU contention, or a large number of GCs stalling the app.
Step 2: Look for commonalities
If the methods above do not locate the problem, we can look for what this type of crash has in common. Once the commonalities are found, the differences become easier to see, which brings us a step closer to solving the problem.
The collected system information, such as model, OS version, ROM, vendor, and ABI, can be used as aggregation dimensions. Typical questions are whether the crash appears only on x86 phones, only on Samsung models, or only on Android 8.0. Application information can also be used as dimensions, such as the link being opened, the video being played, the country, the region, and so on.
Finding commonalities gives clearer guidance for the next step, reproducing the problem.
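Aggregating by a dimension amounts to grouping reports and counting them. The following is a small sketch in Java; the Report class and its three fields are hypothetical stand-ins for the dozens of dimensions a real crash report would carry.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class CrashAggregator {
    // One crash report reduced to a few of the collected dimensions.
    static final class Report {
        final String model;
        final String osVersion;
        final String abi;

        Report(String model, String osVersion, String abi) {
            this.model = model;
            this.osVersion = osVersion;
            this.abi = abi;
        }
    }

    // Counts reports per value of the chosen dimension, sorted by key.
    static Map<String, Long> countBy(List<Report> reports,
                                     Function<Report, String> dimension) {
        return reports.stream().collect(
                Collectors.groupingBy(dimension, TreeMap::new, Collectors.counting()));
    }
}
```

For example, countBy(reports, r -> r.osVersion) shows at a glance whether one OS version dominates; comparing that distribution with the distribution of all users tells us whether the skew is real.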
Step 3: Try to reproduce
If we already have a rough idea of the cause, we need to reproduce the crash to confirm more details. If we have no clue at all, we can try to reproduce it by following the user's action path and then analyze the cause.
"As long as I can reproduce it locally, I can fix it" is something many developers and testers say. The main reason is that with a stable reproduction path we can add logs or use the debugger, GDB, and other tools for further analysis.
We may run into all kinds of strange problems. For example, a vendor has changed the underlying implementation, or a new Android version has changed its behavior. These require searching Google and reading the source code, and sometimes digging into the vendor's ROM or flashing a ROM by hand. Many hard problems demand patience: repeated guessing, repeated verification, and a few gray hairs along the way. Still, how much effort to invest depends on how serious the problem is. Don't pick up the sesame seed and lose the watermelon.
2.3 System Crash
System crashes often leave us feeling helpless. They may be caused by a bug in a particular Android version or by a vendor's modification of the ROM. In this case the crash stack may contain none of our own code, which makes it hard to locate the problem directly. What we can do:
- Look for possible causes. Using the commonalities gathered above, first see whether it is a problem with a particular system version or with a particular vendor's ROM. Although the crash log may contain none of our own code, the action path and logs can still reveal a few suspicious points.
- Try to work around it. Look for suspicious code calls or inappropriate use of APIs, and see whether an alternative can avoid the problem.
- Hook. There are Java hooks and native hooks. For example, a crash may appear only on Android 7.0; in that case we can follow what Android 8.0 does and catch the exception directly.
With these steps, most system crashes can be resolved or worked around. Of course, there are always hard problems that depend on the user's real environment, and these call for capabilities such as dynamic tracing and debugging.
3. Summary
Fighting crashes is a long-term process, and we should try to prevent them as early as possible and nip them in the bud. As engineers, we should not blindly chase a lower crash rate. User experience comes first, and forcibly covering up problems is often counterproductive. try-catch should not be used to hide the real problem; we should understand the root cause of the crash and make sure the follow-up flow still runs correctly. When fixing crashes, we should also go from the single case to the general one: not only fix this crash, but also think about how to fix and prevent this whole class of crash.