Android performance optimization and Crash handling come up often in interviews. This article covers how the Meituan App handles Crashes. For more information, see Android Performance Optimization: Taking You Through Memory Optimization.

Original article: https://blog.csdn.net/MeituanTech/article/details/80701773

Crash rate is one of the key indicators of an App's quality. If the Crash rate is ignored, it keeps getting worse and eventually drives away a large number of users, causing immeasurable losses to the company. This article describes the practical work the Meituan Takeout Android client team did to bring the App's Crash rate down from 3/1000 to 2/10000, in the hope of offering some experience and inspiration to other teams.

Challenges and achievements

With high usage frequency, a fast-growing takeout business, and severe Android fragmentation, continuously reducing the Crash rate is a great challenge for the Meituan Takeout Android App. Through the all-out efforts of the team, the average Crash rate of the Meituan Takeout Android App dropped from 3/1000 to 2/10000, with a best value of about 1/10000 (Crash rate = number of Crashes / DAU).

Meituan Takeout has grown at an exponential rate since it was founded in 2013. The business has expanded from catering alone to more than ten major categories, including catering, supermarkets, fresh food, fruits and vegetables, medicines, flowers, cakes, and errands. The daily order volume of Meituan Takeout now exceeds 20 million, making it one of the most important businesses of Meituan Dianping. The Meituan Takeout client keeps gaining business modules, product complexity, and developers, all of which make it harder to reduce the App's Crash rate.

Crash governance practices

For Crash governance, we try to follow three principles:

  • From a single point to the whole class. When a Crash happens, we should not only fix that one Crash but also consider how to fix and prevent the whole class of Crashes it belongs to. Only then is the class of Crashes truly solved.
  • Do not swallow exceptions. Scattering try-catch around only adds branches to the business logic and hides the real problem. Understand the root cause of the Crash and fix that. The catch branch should provide a fallback that fits the business scenario so that the subsequent flow continues normally.
  • Prevention is better than cure. By the time a Crash happens, the loss has already been incurred, and whatever we do afterwards can only reduce it. Preventing Crashes in advance, as far as possible, nips them in the bud.

Regular Crash governance

Regular Crashes occur mainly because developers write code carelessly. They need to be handled in a centralized way according to their causes and the business itself. Common types include: null pointer, index out of bounds, type cast exception, entity object not serialized, number format exception, Activity or Service not found, and so on. These are the most common Crashes in an App and also the most likely to recur. Once the Crash stack is available they are generally easy to fix, so more attention should go to how to avoid them. Below are the two types that took up most of our governance work.

NullPointerException

NullPointerException is the Crash we encounter most frequently. It is generally caused by one of two situations:

  • The object itself was never initialized.
  • The object was initialized, but was then reclaimed or manually set to null before being used again.

The first case has many causes, such as developer error, failures when parsing data returned by an API, or static variables that are not re-initialized after the process is killed. We can:

  • Null-check any object that may be null before using it.
  • Get in the habit of using the @NonNull and @Nullable annotations.
  • Avoid static variables where possible; if there is no other way, store the data in SharedPreferences.
  • Consider using Kotlin.

The second case is mostly caused by code in Message, Runnable, or network callbacks that runs after the Activity/Fragment has been destroyed or removed. We can:

  • In Message and Runnable callbacks, check whether the Activity/Fragment has been destroyed or removed; add try-catch protection; remove all posted Runnables when the Activity/Fragment is destroyed.
  • Encapsulate LifecycleMessage/Runnable base components, and add a custom Lint check that prompts developers to use them.
  • In onDestroy() of BaseActivity and BaseFragment, cancel all requests issued by the current Activity/Fragment.
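
The lifecycle check from the list above can be sketched in plain Java. This is only an illustration under assumptions: `HostLifecycle`, `LifecycleRunnable`, and `FakeHost` are hypothetical names, not the team's actual components. On Android the host would be an Activity or Fragment, and the check would typically rely on methods such as `isFinishing()`, `isDestroyed()`, or `isAdded()`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an Activity/Fragment that can report whether it is destroyed. */
interface HostLifecycle {
    boolean isDestroyed();
}

/** Wraps a task so that it does nothing once its host has been destroyed. */
final class LifecycleRunnable implements Runnable {
    private final HostLifecycle host;
    private final Runnable task;

    LifecycleRunnable(HostLifecycle host, Runnable task) {
        this.host = host;
        this.task = task;
    }

    @Override
    public void run() {
        if (host.isDestroyed()) {
            return; // the host is gone; touching its views or fields could crash
        }
        task.run();
    }
}

/** Hypothetical host that tracks its own destroyed flag, as a BaseActivity might. */
final class FakeHost implements HostLifecycle {
    private boolean destroyed;
    final List<String> log = new ArrayList<>();

    void onDestroy() {
        destroyed = true;
    }

    @Override
    public boolean isDestroyed() {
        return destroyed;
    }
}
```

A callback wrapped this way runs normally while the host is alive and becomes a no-op after `onDestroy()`, which is the behavior a Lint rule could then require for every posted Runnable.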

IndexOutOfBoundsException

This type of Crash is common in ListView operations and multithreaded operations on containers.

In a ListView, IndexOutOfBoundsException often occurs because external code also holds a reference to the Adapter's data (for example, the data is assigned directly in the Adapter's constructor). If the external code then changes the data without calling notifyDataSetChanged() in time, a Crash may result. To address this we encapsulated a BaseAdapter in which the Adapter itself maintains the data and issues the change notifications. This also largely avoids the "The content of the adapter has changed but ListView did not receive a notification" error. Both types of Crash have now been resolved in a uniform way.
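
The idea of an adapter that owns its data can be sketched as follows. This is a plain-Java illustration, not the team's BaseAdapter: `OwnedDataAdapter` is a hypothetical name, and `onDataChanged()` stands in for Android's `notifyDataSetChanged()`.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Plain-Java sketch of an adapter that keeps a private copy of its data.
 * onDataChanged() stands in for Android's notifyDataSetChanged().
 */
abstract class OwnedDataAdapter<T> {
    private final List<T> items = new ArrayList<>();

    /** Replaces the content with a copy, so later changes to 'newItems' cannot affect the adapter. */
    public final void setData(List<? extends T> newItems) {
        items.clear();
        if (newItems != null) {
            items.addAll(newItems);
        }
        onDataChanged(); // every mutation goes through here, so the notification is never missed
    }

    public final int getCount() {
        return items.size();
    }

    public final T getItem(int position) {
        return items.get(position);
    }

    /** Called after every data change; an Android adapter would notify the ListView here. */
    protected abstract void onDataChanged();
}
```

Because callers can only change the content through `setData`, the list the view reads from and the notification it receives cannot drift apart.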

In addition, many containers are not thread-safe, so operating on them from multiple threads easily causes IndexOutOfBoundsException. The most common ones are ArrayList in the JDK and SparseArray and ArrayMap in Android. Note also that some classes use thread-unsafe containers internally, such as the ArrayMap inside Bundle.
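
One common way to make shared access safe, shown here only as a sketch, is to route every operation through a synchronized wrapper from the JDK. `SharedListDemo` is a hypothetical name; whether a synchronized wrapper, a concurrent collection, or confining the container to one thread is the better fit depends on the access pattern.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SharedListDemo {
    /** Adds 'perThread' items from each of 'threads' threads and returns the final size. */
    static int fill(List<Integer> target, int threads, int perThread) {
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    target.add(i);
                }
            });
            workers.add(worker);
            worker.start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return target.size();
    }

    /** Safe variant: every access goes through a synchronized wrapper around the ArrayList. */
    static int fillSafely(int threads, int perThread) {
        List<Integer> safe = Collections.synchronizedList(new ArrayList<Integer>());
        return fill(safe, threads, perThread);
    }
}
```

Passing a bare `ArrayList` to `fill` from several threads may lose elements or throw, which is the failure mode described above; the wrapped list always ends up with `threads * perThread` elements.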

System-level Crash governance

Android runs on many device models and is highly fragmented. Each hardware manufacturer may customize its ROM and change system methods, which can cause Crashes on specific models. Such Crashes are detected mainly through cloud testing platforms, automated testing, and online monitoring. In these cases the Crash stack alone rarely points directly at the problem. Some common approaches:

  1. Try to find the suspicious code that triggers the Crash, check whether a specific API is used or called improperly, and try to change the code logic to avoid it.
  2. Hook. Hooks are divided into Java Hook and Native Hook. A Java Hook relies mainly on reflection or dynamic proxies to change the behavior of the corresponding API, so you need to find a point that can be hooked; hook points are usually static variables. A Native Hook replaces the address of the old method in memory with that of the modified method, and must take the differences between Dalvik and ART into account. Native Hook is comparatively less compatible, so it should be paired with a degradation strategy.
  3. If neither of the first two methods solves the problem, the only option left is to decompile the ROM and look for a solution.

Let’s take an example of a Crash caused by customized system ROM. According to the Crash platform statistics, it was found that the Crash only occurred on vivo V3Max, and the Crash stack was as follows:

java.lang.RuntimeException: An error occured while executing doInBackground()
  at android.os.AsyncTask$3.done(AsyncTask.java:304)
  at java.util.concurrent.FutureTask.finishCompletion(FutureTask.java:355)
  at java.util.concurrent.FutureTask.setException(FutureTask.java:222)
  at java.util.concurrent.FutureTask.run(FutureTask.java:242)
  at android.os.AsyncTask$SerialExecutor$1.run(AsyncTask.java:231)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1112)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:587)
  at java.lang.Thread.run(Thread.java:818)
Caused by: java.lang.NullPointerException: Attempt to invoke interface method 'int java.util.List.size()' on a null object reference
  at android.widget.AbsListView$UpdateBottomFlagTask.isSuperFloatViewServiceRunning(AbsListView.java:7689)
  at android.widget.AbsListView$UpdateBottomFlagTask.doInBackground(AbsListView.java:7665)
  at android.os.AsyncTask$2.call(AsyncTask.java:292)
  at java.util.concurrent.FutureTask.run(FutureTask.java:237)
  ... 4 more

We found that the UpdateBottomFlagTask class does not exist in AbsListView on the stock system of the corresponding version, so we can infer that vivo's customized ROM for this version changed the system implementation. Since we could not locate any suspicious code of our own, we decided to solve the problem with a Hook. From the source code we found that AsyncTask's SerialExecutor is held in a static variable (SERIAL_EXECUTOR), which is a good hook point. We solved the problem by using reflection to replace it with an executor that adds try-catch. Because the field is declared final, we first need to modify its accessFlags through reflection, paying attention to the differences between ART and Dalvik. The code is as follows:

import android.os.AsyncTask;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Replaces the value of a static final field via reflection.
// Relies on the ART-internal "artField"/"accessFlags" members, so it only
// works on ART versions that have them; Dalvik is laid out differently.
public static void setFinalStatic(Field field, Object newValue) throws Exception {
    field.setAccessible(true);
    Field artField = Field.class.getDeclaredField("artField");
    artField.setAccessible(true);
    Object artFieldValue = artField.get(field);
    Field accessFlagsField = artFieldValue.getClass().getDeclaredField("accessFlags");
    accessFlagsField.setAccessible(true);
    // Clear the FINAL bit so that the field can be written.
    accessFlagsField.setInt(artFieldValue, field.getModifiers() & ~Modifier.FINAL);
    field.set(null, newValue);
}

private void initVivoV3MaxCrashHander() {
    if (!isVivoV3()) {
        return;
    }
    try {
        // Swap in an executor that runs each task inside try-catch.
        setFinalStatic(AsyncTask.class.getDeclaredField("SERIAL_EXECUTOR"), new SafeSerialExecutor());
        Field defaultfield = AsyncTask.class.getDeclaredField("sDefaultExecutor");
        defaultfield.setAccessible(true);
        defaultfield.set(null, AsyncTask.SERIAL_EXECUTOR);
    } catch (Exception e) {
        L.e(e);
    }
}
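
The article does not show SafeSerialExecutor itself. The following is a guess at what such a class could look like, written in plain Java: a serial executor in the style of AsyncTask's own SerialExecutor whose tasks are wrapped in try-catch. The class name `SafeSerialExecutorSketch`, the backing executor parameter, and the error reporting via `System.err` are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.concurrent.Executor;

/**
 * Sketch of a serial executor that runs tasks one at a time on a backing executor
 * and catches anything a task throws, so that a faulty task cannot crash the process.
 */
final class SafeSerialExecutorSketch implements Executor {
    private final ArrayDeque<Runnable> tasks = new ArrayDeque<>();
    private final Executor backing;
    private Runnable active;

    SafeSerialExecutorSketch(Executor backing) {
        this.backing = backing;
    }

    @Override
    public synchronized void execute(final Runnable r) {
        tasks.offer(() -> {
            try {
                r.run();
            } catch (Throwable t) {
                // Assumption: report instead of crashing; a real app would log or upload this.
                System.err.println("Task failed: " + t);
            } finally {
                scheduleNext(); // keep the queue moving even when a task fails
            }
        });
        if (active == null) {
            scheduleNext();
        }
    }

    private synchronized void scheduleNext() {
        active = tasks.poll();
        if (active != null) {
            backing.execute(active);
        }
    }
}
```

The important property is that a failing task neither propagates its exception nor blocks the tasks queued behind it.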

The Meituan Takeout App solved this Crash with the method above, but the takeout channel inside the Meituan App could not use it because of platform restrictions, so we tried decompiling the ROM. system.img is the image that stores system files in Android, and its file format is usually yaffs2 or ext. The ROM package, however, did not contain system.img directly; it contained three files instead: system.new.dat, system.patch.dat, and system.transfer.list. So we first needed to build system.img from these three files. After extracting the vivo ROM, though, we found that the manufacturer had split system.new.dat into fragments, as shown in the picture below:

Comparing the information in system.transfer.list with the sizes of the system.new.dat 1 2 3… files revealed a pattern: the number of blocks in each entry of system.transfer.list multiplied by 4KB was roughly the same as the size of the corresponding fragment file. So we made a bold guess: the vivo ROM simply splits system.new.dat into fragments in block order, and we only need to merge the fragments back into a single system.new.dat file before converting it to img. Finally, we unpacked system.img according to its file system format and obtained the framework directory, which contains framework.jar and boot.oat. Because Android 4.4 introduced the ART virtual machine, some JAR packages under system/framework are converted to OAT format in advance, so we also needed to unpack the corresponding OAT files with oat2dex to obtain the dex files, and then read the source code with dex2jar and JD-GUI.
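
Merging the fragments is a plain concatenation in block order. A minimal sketch, assuming the caller already knows the fragment files and their order (the article does not give the exact file names, so none are hard-coded here; `FragmentMerger` is a hypothetical name):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class FragmentMerger {
    /**
     * Concatenates the fragment files, in the given order, into a single output file
     * and returns the total number of bytes written.
     */
    static long merge(List<Path> fragmentsInOrder, Path output) {
        long total = 0;
        try (OutputStream out = Files.newOutputStream(output)) {
            for (Path fragment : fragmentsInOrder) {
                total += Files.copy(fragment, out); // appends this fragment's bytes to 'out'
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

The merged file can then be fed to whatever tool converts the .dat and transfer list into an image; that conversion step is outside this sketch.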

OOM

OOM is short for OutOfMemoryError. Among hard-to-solve Crashes, OOM sits firmly near the top of the list, because the Crash stack at the moment it happens is often not the root cause of the problem but merely the last straw that breaks the camel's back. Most OOMs are caused by the following:

  • Memory leaks. A large number of useless objects are not reclaimed in time. As a result, subsequent memory allocation fails.
  • There are too many large memory objects. The most common large object is Bitmap. Loading several large images at the same time will trigger OOM easily.

Memory leaks

A memory leak means that memory objects that are no longer in use cannot be released by the system. It is usually caused by faulty code logic. On the Android platform, the most common and most serious memory leak is a leaked Activity object. An Activity carries the entire UI of a screen, so leaking it means that the large number of resource objects it holds cannot be reclaimed either, which very easily leads to OOM. Common causes of Activity leaks are:

  • A Handler implemented as an anonymous inner class to process messages implicitly holds the Activity, so the Activity cannot be reclaimed.
  • Activity and Context are confused and misused: the Activity is passed in many places where only the Application Context is needed, such as registering receivers or calculating screen density.
  • View objects are mishandled. The Context of a View created with an Activity's LayoutInflater is actually that Activity. This is often overlooked and can leak the Activity in scenarios such as View reuse.
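
For the first cause in the list, a widely used remedy is to stop the callback from holding a strong reference to its target. The sketch below shows the pattern in plain Java; `WeakCallback` is a hypothetical name, and on Android the equivalent would be a static nested Handler subclass that holds a WeakReference to the Activity. The article itself does not prescribe this fix.

```java
import java.lang.ref.WeakReference;
import java.util.function.Consumer;

/** Sketch of a callback that does not keep its target alive. */
final class WeakCallback<T> implements Runnable {
    private final WeakReference<T> targetRef;
    private final Consumer<T> action;

    WeakCallback(T target, Consumer<T> action) {
        this.targetRef = new WeakReference<>(target);
        this.action = action; // must not capture the target itself, or the leak returns
    }

    @Override
    public void run() {
        T target = targetRef.get();
        if (target == null) {
            return; // the target has already been garbage collected; nothing to do
        }
        action.accept(target);
    }
}
```

The action receives the target as a parameter rather than capturing it, which is what keeps the callback from pinning the target in memory.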

For Activity leaks there is a very useful detection tool: LeakCanary, which automatically detects Activity leaks and shows a friendly notification when one occurs. To guard against developer oversight, we also report the leaks to the server so that they can be checked and fixed centrally. In addition, StrictMode can be used to detect Activity leaks and Closeable objects that were never closed.

On the Android platform, analyzing the memory of almost any application leads to the same conclusion: the objects that use the most memory are mostly Bitmaps. As phone screens get larger and resolutions higher, with 1080p and 2K screens now in the majority, we tend to use many high-definition images for better visual effect, and these are a major culprit behind OOM. For image memory optimization, there are several common approaches:

  • Try to use a mature image-loading library such as Glide. It provides a lot of general protection and reduces unnecessary human error.
  • Load images according to actual need, that is, according to the View size, so that lower-resolution models use as little memory as possible. Besides the commonly used BitmapFactory.Options#inSampleSize and the BitmapRequestBuilder#override provided by Glide, our image CDN server also supports real-time scaling, so images can be resized on the server side, which reduces memory pressure on the client.

Analyzing the details of the App's memory is the first step in solving the problem. We need a general picture of how much memory the App uses at runtime and how many objects of which types exist, and then make predictions based on the actual situation so that the analysis is targeted. Android Studio also provides an excellent Memory Profiler, heap dumps, and allocation tracking to help locate problems quickly.
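
To make the inSampleSize option mentioned above concrete, the sketch below computes a power-of-two sample size from the source image size and the target View size, following the approach commonly shown in Android's bitmap-loading guidance. It is pure arithmetic and does not call any Android API; `SampleSizeCalculator` is a hypothetical name, and the 4 bytes per pixel figure assumes the ARGB_8888 configuration.

```java
final class SampleSizeCalculator {
    /**
     * Returns the largest power-of-two sample size that keeps the decoded image
     * at least as large as the requested size in both dimensions.
     */
    static int calculateInSampleSize(int srcWidth, int srcHeight, int reqWidth, int reqHeight) {
        int inSampleSize = 1;
        if (reqWidth <= 0 || reqHeight <= 0) {
            return inSampleSize; // unknown target size: do not downsample
        }
        if (srcHeight > reqHeight || srcWidth > reqWidth) {
            int halfHeight = srcHeight / 2;
            int halfWidth = srcWidth / 2;
            while (halfHeight / inSampleSize >= reqHeight && halfWidth / inSampleSize >= reqWidth) {
                inSampleSize *= 2;
            }
        }
        return inSampleSize;
    }

    /** Approximate memory footprint after sampling, assuming 4 bytes per pixel (ARGB_8888). */
    static long decodedBytes(int srcWidth, int srcHeight, int inSampleSize) {
        long width = srcWidth / inSampleSize;
        long height = srcHeight / inSampleSize;
        return width * height * 4;
    }
}
```

For a 4000x3000 source shown in a 500x375 View, the sample size comes out as 8, which cuts the decoded size from about 48 MB to about 0.75 MB under the 4-bytes-per-pixel assumption.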

AOP-assisted enhancement

AOP stands for aspect-oriented programming. Since the Transform API was added in version 1.5.0 of the Android Gradle plugin, bytecode can be modified at compile time with official support, which makes compile-time AOP easy to implement. In certain cases, AOP can be used to handle uncaught exceptions automatically:

  • The method that throws the exception is well known, and it is called in a fairly fixed way.
  • The exception is handled in a uniform way.
  • It has nothing to do with business logic, that is, handling the exception automatically does not affect normal business logic.

Typical examples are reading Intent Extras parameters, reading SharedPreferences, parsing color strings, and showing and hiding Windows.

The principle is similar in all these cases, so we use Intent Extras as the example. The problem with reading Intent Extras is that the very commonly used Intent#getStringExtra method may throw a ClassNotFoundException when there is a logic error in the code or a malicious attack. Since it is not feasible to wrap every call in try-catch, we created a safer Intent utility class; in theory, as long as everyone uses it to read Intent Extras parameters, this type of Crash is prevented. AOP comes into its own when the codebase is large and old and there are many lines of business: modifying existing code is very expensive, and with many external SDK dependencies it is almost impossible to make everyone use our own utility class. We wrote a Gradle plugin that replaces calls to a particular method with calls to another method through simple configuration:

WaimaiBytecodeManipulator {
     replacements(
         "android/content/Intent.getIntExtra(Ljava/lang/String;I)I=com/waimai/IntentUtil.getInt(Landroid/content/Intent;Ljava/lang/String;I)I",
         "android/content/Intent.getStringExtra(Ljava/lang/String;)Ljava/lang/String;=com/waimai/IntentUtil.getString(Landroid/content/Intent;Ljava/lang/String;)Ljava/lang/String;",
         "android/content/Intent.getBooleanExtra(Ljava/lang/String;Z)Z=com/waimai/IntentUtil.getBoolean(Landroid/content/Intent;Ljava/lang/String;Z)Z",
         ...)
}

The configuration above replaces every call to Intent.getXXXExtra in the App code (including third-party libraries) with the safe version in the IntentUtil class. Of course, not every exception should simply be caught: logic errors must be exposed promptly during development and testing. IntentUtil therefore checks the App's runtime environment. In the Debug environment it rethrows the exception directly; in the Release environment it returns the default value when an exception is caught and reports the exception to the server.
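
The Debug/Release policy described above can be sketched independently of Android. This is an illustration only: `SafeReader`, the `debug` flag, and the `reporter` callback are hypothetical stand-ins for IntentUtil, the build type check, and the upload to the server, none of which are shown in the article.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Sketch of "throw in Debug, fall back and report in Release". */
final class SafeReader {
    private final boolean debug;
    private final Consumer<RuntimeException> reporter;

    SafeReader(boolean debug, Consumer<RuntimeException> reporter) {
        this.debug = debug;
        this.reporter = reporter;
    }

    /** Runs the risky read; on failure either rethrows (debug) or reports and returns the default. */
    <T> T read(Supplier<T> riskyRead, T defaultValue) {
        try {
            return riskyRead.get();
        } catch (RuntimeException e) {
            if (debug) {
                throw e; // expose logic errors during development and testing
            }
            reporter.accept(e); // release: report the problem and keep the user flow alive
            return defaultValue;
        }
    }
}
```

A replacement method such as the getString shown in the configuration could delegate to a helper like this, with the actual Intent read passed in as the supplier.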

Dependency library issues

Android Apps often depend on multiple AARs, and each AAR may exist in several versions. At packaging time Gradle decides which version to use according to its rules (by default the highest version is selected, or a forced version can be specified), and the other versions are discarded. If interdependent AARs have incompatible versions, the problem is not detected during packaging and only shows up when the related code runs, as NoClassDefFoundError, NoSuchFieldError, NoSuchMethodError, and similar exceptions. As shown in the figure, the order and store business libraries both depend on platform.aar, one on version 1.0 and the other on version 2.0. By default, only platform 2.0 is packaged into the APK. If a class or method used by the order library was removed in version 2.0, an exception may occur at runtime. Although SDKs try to stay backward compatible when they are upgraded, this is often not guaranteed, especially for third-party SDKs. In version 6.0 of the Meituan Takeout Android App, the hotfix feature was lost for exactly this reason. To find such problems early, we integrated the dependency checking plugin Defensor.

Defensor retrieves all input files (i.e., compiled class files) using DexTask at compile time and then checks for the presence of classes, fields, methods, etc. referenced in each file.

In addition, we wrote a Gradle plugin, SVD (Strict Version Dependencies), to pin the versions of important SDKs. At compile time the plugin checks whether the SDK version that Gradle finally uses matches the configured one. If it does not, the plugin aborts the build with an error and prints all dependencies of the conflicting SDK.

Crash prevention practices

It is unrealistic to reduce Crashes through conventions or guidelines alone. Because of the limits of organizational structure and of the individuals carrying them out, conventions and guidelines are easily ignored. Only engineering architecture and tooling can keep Crash prevention in force over the long term.

Impact of engineering architecture on Crash rate

In Crash governance, the impact of engineering architecture on Crash rate is often overlooked. Most Crashes are caused by poorly written code, and the thing programmers deal with most directly is the engineering architecture. In an architecture with fuzzy boundaries and chaotic layering, programmers are more likely to write code that causes Crashes, and even when they realize that something is written badly, it is very hard for them to improve it. By contrast, an architecture with clear layers and boundaries greatly reduces the probability of Crashes and makes them easier to govern and prevent. Here are a few examples from our practice.

Business module division

Crashes used to be handled by a few designated team members, while everyone on the team could submit code that might cause a Crash. If the person responsible for Crashes was busy with something else and temporarily not watching the App's Crash rate, the person whose code caused the Crash would never know about it.

Our answer to this problem is to modularize the App by business. With business modularization, each business has its own package name and a corresponding owner. When a module crashes, the problem can be assigned to the module's owner according to the package name and handled immediately. Business modularization is also an engineering architecture priority in its own right.

Unified handling of page jumps through routing

In the Takeout App, the most frequent operation is jumping between pages, and page jumps often cause ActivityNotFoundException. For example, we match a scheme, but the path of the other side's scheme has changed; or we call the phone's photo album, but the album application has been disabled or removed by the user. Solving this kind of Crash is actually simple: handle ActivityNotFoundException around startActivity. But in an App, Activities are started almost everywhere, and it is impossible to predict which call will cause ActivityNotFoundException. Our approach is to dispatch all page jumps through our own encapsulated scheme router. The advantage of scheme routing is that, architecturally, all businesses are decoupled: modules can jump to each other's pages and pass basic-type parameters without depending on each other. And because every page jump goes through the scheme router, handling ActivityNotFoundException in one place, inside the router, solves this type of Crash. The route design is shown in the schematic diagram.
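
The "one place to handle the exception" idea can be sketched in plain Java as below. This is not the schematic diagram and not the team's router: `SchemeRouter`, `TargetNotFoundException` (a stand-in for ActivityNotFoundException), and the `onNotFound` fallback are hypothetical names used for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical stand-in for android.content.ActivityNotFoundException. */
final class TargetNotFoundException extends RuntimeException {
    TargetNotFoundException(String path) {
        super("No page found for " + path);
    }
}

final class SchemeRouter {
    private final Map<String, Consumer<Map<String, String>>> pages = new HashMap<>();
    private final Consumer<String> onNotFound;

    SchemeRouter(Consumer<String> onNotFound) {
        this.onNotFound = onNotFound;
    }

    /** Modules register their pages by path, so callers never reference page classes directly. */
    void register(String path, Consumer<Map<String, String>> page) {
        pages.put(path, page);
    }

    /** Single entry point for every jump; returns false when the target could not be opened. */
    boolean open(String path, Map<String, String> params) {
        try {
            start(path, params);
            return true;
        } catch (TargetNotFoundException e) {
            onNotFound.accept(path); // the only place that needs a fallback (toast, report, ...)
            return false;
        }
    }

    // Stands in for startActivity(), which throws when nothing can handle the request.
    private void start(String path, Map<String, String> params) {
        Consumer<Map<String, String>> page = pages.get(path);
        if (page == null) {
            throw new TargetNotFoundException(path);
        }
        page.accept(params);
    }
}
```

Callers only ever see a boolean result, so no call site needs its own try-catch.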

Unified handling of dirty API data in the network layer

A large share of client Crashes are caused by dirty data returned by APIs. For example, when an API returns a null value, an empty array, or data of a type other than the agreed one, the App is likely to crash on receiving it, with a null pointer, index out of bounds, or type cast error. Dirty data of this kind is particularly likely to cause a large-scale Crash online. The earliest way our project used the network layer was this: the page listens for the success and failure callbacks of the network request, the JSON data is passed to the page, and the page parses the Model and initializes the View, as shown in the figure. The problem is that even though the network request succeeds, parsing the JSON into the Model may still go wrong, for example when no data or data of the wrong type is returned, and this dirty data causes problems in the UI layer that are visible to the user directly.

As the figure above shows, the network layer was only responsible for making the request, not for parsing the data; parsing was left to the page. So whenever we found a Crash caused by dirty data, all we could do was add various checks in the network callback to tolerate it. With hundreds of pages, it is impossible to plug every gap this way. Over several versions of refactoring, we redefined the responsibilities of the network layer, as shown in the figure below:

As the figure shows, the refactored network layer is responsible for both the network request and data parsing. Dirty data is now detected in the network layer and no longer affects the UI layer; all data returned to the UI layer has passed validation. After this refactoring, Crashes of this kind dropped significantly.
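
The refactored responsibility split can be sketched as follows, under assumptions: `UiCallback`, `ParsingNetworkLayer`, and the `parser` and `validator` hooks are hypothetical, and a real App would plug in its JSON library and Model checks where the sketch takes plain functions.

```java
import java.util.function.Function;
import java.util.function.Predicate;

/** Callback seen by the UI layer: it only ever receives validated models. */
interface UiCallback<M> {
    void onSuccess(M model);

    void onFailure(String reason);
}

final class ParsingNetworkLayer {
    /**
     * Parses and validates the raw response body before the UI sees it.
     * Parse errors and invalid models are turned into onFailure instead of reaching the page.
     */
    static <M> void deliver(String rawBody, Function<String, M> parser,
                            Predicate<M> validator, UiCallback<M> callback) {
        M model;
        try {
            model = parser.apply(rawBody);
        } catch (RuntimeException e) {
            callback.onFailure("parse error: " + e.getMessage());
            return;
        }
        if (model == null || !validator.test(model)) {
            callback.onFailure("dirty data");
            return;
        }
        callback.onSuccess(model);
    }
}
```

With this shape, a page only needs to render a model or show an error state; it never has to defend against half-parsed data.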

Large image monitoring

As mentioned above, large objects are one of the main causes of OOM, and Bitmap is the most common large object in an App, so it is necessary to monitor Bitmap objects that take up too much memory. We use AOP to hook the loading callbacks of three common image libraries and monitor image loading along two dimensions:

  1. The URL used to load the image. Except for static resources, all images in the Takeout App must be published to a dedicated image CDN server. A regular expression is matched against the URL when an image is loaded; besides restricting the CDN domain name, it requires the corresponding dynamic scaling parameters on every image request.
  2. The result of loading the image (that is, the Bitmap object). The memory a Bitmap occupies is proportional to its resolution, and it generally makes no sense for an ImageView to display an image larger than the ImageView itself. We therefore require that the resolution of a Bitmap displayed in an ImageView does not exceed the size of the View (an alarm threshold can also be set to reduce false positives).
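
The two checks can be sketched as a small plain-Java class. The URL pattern is supplied by the caller because the article does not give the real CDN rule; `ImageLoadMonitor` and the `tolerance` factor (the alarm threshold) are hypothetical names.

```java
import java.util.regex.Pattern;

final class ImageLoadMonitor {
    private final Pattern allowedUrl; // assumed rule: CDN host plus scaling parameters
    private final double tolerance;   // e.g. 1.5 allows bitmaps up to 1.5x the view size

    ImageLoadMonitor(String allowedUrlRegex, double tolerance) {
        this.allowedUrl = Pattern.compile(allowedUrlRegex);
        this.tolerance = tolerance;
    }

    /** Dimension 1: the URL must match the configured pattern in full. */
    boolean isUrlCompliant(String url) {
        return url != null && allowedUrl.matcher(url).matches();
    }

    /** Dimension 2: true when the bitmap exceeds the view size by more than the tolerance. */
    boolean isOversized(int bitmapWidth, int bitmapHeight, int viewWidth, int viewHeight) {
        if (viewWidth <= 0 || viewHeight <= 0) {
            return false; // the view has not been laid out yet, so it cannot be judged
        }
        return bitmapWidth > viewWidth * tolerance || bitmapHeight > viewHeight * tolerance;
    }
}
```

In an AOP hook on the image library's load callback, both checks would run once per loaded image, with the result either shown to the developer or reported.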

During development, when a non-compliant image is detected in the App, the offending ImageView is highlighted immediately and a dialog pops up showing the Activity that contains the ImageView, its XPath, and the URL used to load the image, as shown in the picture below, to help the developer locate and fix the problem. In the Release environment, the alarm information can be reported to the server so that the data can be observed in real time and problems handled promptly.

Lint check

We found that many online Crashes could in fact have been prevented by Lint checks during development. Lint is the static code inspection tool that Google provides for Android. It scans the code for potential problems and alerts developers so that they can fix them early, improving code quality.

However, the Lint rules that ship with Android (such as checking whether a higher-version API is used) are far from sufficient: they lack some checks that we consider necessary, and they cannot check code conventions. So we started developing custom Lint rules, and so far we have used them to implement checks for Crash prevention, Bug prevention, performance and security, and code conventions. For example, if a class implements the Serializable interface, the declared types of its member variables (including those inherited from the parent class) must also implement Serializable, which effectively avoids NotSerializableException. Another example: forcing the use of encapsulated utility classes such as ColorUtil and WindowUtil effectively avoids IllegalArgumentException caused by incorrect arguments and BadTokenException caused by an Activity that has already finished.
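
The Serializable rule can be illustrated with a reflection-based check. This is a sketch of the rule's logic only: a real Lint rule analyzes source code statically through the Lint API, which is not used here. `SerializableFieldCheck`, `DemoAddress`, and `DemoOrder` are hypothetical names.

```java
import java.io.Serializable;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

final class SerializableFieldCheck {
    /**
     * Lists the non-static, non-transient fields (including inherited ones)
     * whose declared type is neither primitive nor Serializable.
     */
    static List<String> findViolations(Class<?> type) {
        List<String> violations = new ArrayList<>();
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            for (Field field : c.getDeclaredFields()) {
                int mods = field.getModifiers();
                if (Modifier.isStatic(mods) || Modifier.isTransient(mods) || field.isSynthetic()) {
                    continue; // these fields are not part of the serialized form
                }
                Class<?> declared = field.getType();
                if (!declared.isPrimitive() && !Serializable.class.isAssignableFrom(declared)) {
                    violations.add(c.getSimpleName() + "." + field.getName());
                }
            }
        }
        return violations;
    }
}

/** Example type that does not implement Serializable. */
class DemoAddress {
    String city;
}

/** Example Serializable type with one field that would fail at serialization time. */
class DemoOrder implements Serializable {
    long id;
    String note;
    DemoAddress address;
}
```

Because the check looks at declared types, a field declared as an interface such as List is also flagged, which matches the strictness of the rule as described.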

Lint checks can run at multiple stages, including local manual checks, real-time checks while coding, compile-time checks, commit checks, and, in the CI system, Pull Request checks and packaging checks, as shown in the figure below. For more details, please refer to Meituan Takeaway Android Lint Code Checking Practices.

Duplicate resource check

In the previous article "Meituan Takeaway Android Platform Architecture Evolution Practice", we described how our platform evolved. A large part of that work was moving code down into lower layers, but an incomplete move leaves some classes and resources duplicated. Classes are not a problem because package names keep them apart. But for resources such as layouts and drawables, if two share the same name, the one in the lower layer is overridden by the one in the upper layer, and a changed view ID in the layout may then cause a null pointer. To avoid this, we wrote a Gradle plugin that hooks the MergeResource task and collects the resource files of all libraries and the main module. If duplicates are found, the build is aborted and the duplicated resource names and the corresponding library names are printed. Since some resources really do need to be overridden for styling reasons, we also set up a whitelist. In the same step we obtain all image resources, so we can easily monitor local image sizes as well, as shown in the picture.
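
The core of the duplicate check is a lookup from resource name to the modules that define it. A minimal sketch, assuming the plugin has already collected resource names per module (the hook into the Gradle task is not shown; `DuplicateResourceCheck` and the "type/name" key format are assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class DuplicateResourceCheck {
    /**
     * 'resourcesByModule' maps a module name to its resource names, e.g. "layout/item_order".
     * Returns every resource defined by two or more modules, with those modules,
     * skipping names on the whitelist.
     */
    static Map<String, List<String>> findDuplicates(Map<String, List<String>> resourcesByModule,
                                                    Set<String> whitelist) {
        Map<String, List<String>> owners = new TreeMap<>();
        for (Map.Entry<String, List<String>> module : resourcesByModule.entrySet()) {
            for (String resource : module.getValue()) {
                owners.computeIfAbsent(resource, key -> new ArrayList<>()).add(module.getKey());
            }
        }
        // Keep only real conflicts that are not explicitly allowed.
        owners.entrySet().removeIf(e -> e.getValue().size() < 2 || whitelist.contains(e.getKey()));
        return owners;
    }
}
```

A build plugin would fail the build when the returned map is not empty and print its entries as the error message.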

Crash monitoring and stop-loss practices

Monitoring

After the checks and tests described above, the application is released. We have established the monitoring process shown below to ensure timely feedback and handling when exceptions occur. First comes gray-release monitoring. The gray-release stage is where newly introduced Crashes are most easily exposed; if this stage is not handled well, those new Crashes go out to all users and the Crash rate rises. If conditions permit, gray-release strategies can be used during this period to expose more Crashes, for example gray release by channel, by city, by business scenario, or to newly installed users, trying to cover every branch. After the gray release ends, the full release begins. During the full release we also need daily Crash monitoring and alarms on abnormal Crash rates to guard against emergencies, such as online Crashes caused by a backend release or a misconfigured operation. Other monitoring is needed as well, for example the large image monitoring mentioned earlier, to avoid OOM caused by large images. The outputs include email notifications, IM notifications, and reports.

Stop loss

Despite everything described above, Crashes are still unavoidable. For example, some Crashes are not exposed during the gray release because the sample is too small, or some client features go online before the backend and are therefore not covered during the gray release. In these cases we need to consider how to stop the loss once something goes wrong.

If the problem is not serious and the cost of fixing it is high, the fix can be deferred to the next version. If the problem is serious and affects user experience or orders, it must be fixed. When fixing, business degradation should be considered first, mainly by checking whether the failing business has a fallback or an A/B strategy; this is the safest and most effective way. If the business cannot be degraded, a hotfix should be considered. The hotfix framework currently integrated into the Meituan Takeout Android App is the self-developed Robust, which can fix more than 90% of scenarios, with a hotfix success rate above 99%. If the problem occurs in a scenario that hotfix cannot cover, the only option is to force users to upgrade. Forced upgrades are a last resort because they take a long time to reach users and hurt the user experience.

Looking forward to

Crash’s self-repair

When selecting new technologies, we need to consider whether they can meet business needs, whether they are better than existing technologies and the cost of learning for the team. Compatibility and stability are also very important. But in the face of domestic environment, rich and colorful Android system in mass millions of magnitude of the App it is almost impossible to achieve flawless technical scheme and components, so normally if a technical implementation scheme can achieve 0.01 ‰, following the collapse of the rate and other solutions also does not have better performance, we think it is acceptable. However, even if the Crash rate is only 1/100,000, it still means that users are affected. We believe that Crash is the worst experience for users, especially for scenes involving transactions. Therefore, we must follow the principle that every order is very important and try our best to ensure users to execute the process smoothly.

In practice, some technical solutions compromise on compatibility and stability, often in exchange for performance or scalability advantages. In such cases, we can actually do more to improve the availability of the App. Just as many operating systems have a “compatibility mode” or “safe mode”, and many automated machines have a manual operation mode, an App can also implement a fallback degradation plan and set a trigger strategy for specific conditions, so that crashes are repaired automatically.

For example, the hardware acceleration mechanism introduced in Android 3.0 can improve the drawing frame rate and reduce CPU usage, but on some models it still causes rendering glitches or even crashes. We can record hardware-acceleration-related crashes in the App, or use detection code to actively check whether hardware acceleration is working correctly, and then decide whether to enable it. This lets the majority of users enjoy the benefits of hardware acceleration while ensuring that models with imperfect hardware acceleration support are not affected. There are other similar scenarios suited to automatic degradation, such as:
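One way to record feature-related crashes and decide whether to keep a feature enabled might look like the sketch below. It is plain Java so that it can run anywhere; the class `FeatureGuard`, the key `"hw_accel"`, and the stack-trace marker are all hypothetical, and the `Map` stands in for whatever persistent storage the App uses.

```java
import java.util.Map;

/** Hypothetical guard that switches a risky feature off after repeated related crashes. */
final class FeatureGuard {
    private final Map<String, Integer> store; // stand-in for persistent storage
    private final String featureKey;
    private final String stackMarker;         // substring that identifies related stack frames
    private final int maxCrashes;

    FeatureGuard(Map<String, Integer> store, String featureKey, String stackMarker, int maxCrashes) {
        this.store = store;
        this.featureKey = featureKey;
        this.stackMarker = stackMarker;
        this.maxCrashes = maxCrashes;
    }

    /** Called from the crash handler; counts the crash if any frame's class name contains the marker. */
    boolean recordIfRelated(Throwable crash) {
        for (StackTraceElement frame : crash.getStackTrace()) {
            if (frame.getClassName().contains(stackMarker)) {
                store.merge(featureKey, 1, Integer::sum);
                return true;
            }
        }
        return false;
    }

    /** The feature stays enabled until the recorded crash count reaches the limit. */
    boolean isFeatureEnabled() {
        return store.getOrDefault(featureKey, 0) < maxCrashes;
    }
}
```

On Android, the result of `isFeatureEnabled()` could then drive a decision such as falling back to software rendering for the affected views, for instance with `View.setLayerType(View.LAYER_TYPE_SOFTWARE, null)`. Matching on stack frames is only one possible trigger condition; which frames reliably indicate a hardware acceleration problem has to be determined from real crash data.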

  • Modules implemented with JNI can be degraded to a Java implementation if the SO library fails to load or an exception occurs at runtime.
  • RenderScript image blur can be degraded to an ordinary Java implementation of Gaussian blur when it fails.
  • When using the Retrofit network library, if the OkHttp3 or HttpURLConnection network channel shows a high failure rate, the App can actively switch to the other channel.
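The first two cases in the list share one pattern: try the preferred implementation, and if it fails, remember the failure and use the backup from then on. A minimal sketch follows; the class name `Degradable` is hypothetical, and the two suppliers stand in for the native and pure-Java implementations.

```java
import java.util.function.Supplier;

/** Hypothetical wrapper: uses the primary implementation until it fails once, then the backup. */
final class Degradable<T> {
    private final Supplier<T> primary;
    private final Supplier<T> backup;
    private volatile boolean degraded;

    Degradable(Supplier<T> primary, Supplier<T> backup) {
        this.primary = primary;
        this.backup = backup;
    }

    T get() {
        if (!degraded) {
            try {
                return primary.get();
            } catch (UnsatisfiedLinkError | RuntimeException e) {
                // UnsatisfiedLinkError is an Error, not an Exception, so it is listed explicitly.
                degraded = true;
            }
        }
        return backup.get();
    }

    boolean isDegraded() {
        return degraded;
    }
}
```

Once degraded, the wrapper never retries the primary path in this process, which avoids paying the failure cost on every call. Whether the degraded state should also be persisted across launches depends on whether the failure is tied to the device or to transient conditions.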

Such problems need to be analyzed case by case. If accurate trigger conditions and a stable repair scheme can be found, the stability of the App can be improved further.
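As an illustration of a trigger condition, the network channel switch mentioned in the list could be driven by the failure rate over a sliding window of recent requests. The sketch below assumes a hypothetical `ChannelSelector` with a window size and threshold chosen by the caller; how the selected channel is wired into Retrofit is left out.

```java
import java.util.ArrayDeque;

/** Hypothetical selector that flips channels when the recent failure rate exceeds a threshold. */
final class ChannelSelector {
    enum Channel { OKHTTP, URL_CONNECTION }

    private final int window;            // number of recent requests to consider
    private final double maxFailureRate; // e.g. 0.5 means more than half failed
    private final ArrayDeque<Boolean> results = new ArrayDeque<>();
    private int failures;
    private Channel current = Channel.OKHTTP;

    ChannelSelector(int window, double maxFailureRate) {
        this.window = window;
        this.maxFailureRate = maxFailureRate;
    }

    /** Records one request outcome and switches channel if the window is full and too many failed. */
    synchronized void report(boolean success) {
        results.addLast(success);
        if (!success) {
            failures++;
        }
        // Drop the oldest outcome once the window overflows.
        if (results.size() > window && !results.removeFirst()) {
            failures--;
        }
        if (results.size() == window && (double) failures / window > maxFailureRate) {
            current = (current == Channel.OKHTTP) ? Channel.URL_CONNECTION : Channel.OKHTTP;
            results.clear(); // start a fresh window for the new channel
            failures = 0;
        }
    }

    synchronized Channel current() {
        return current;
    }
}
```

Clearing the window after a switch gives the new channel a fresh start, so that failures recorded on the old channel do not immediately flip the selection back.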

Automatic retrieval of logs for specific Crash types

The takeout business develops rapidly. Although we use various tools and measures to avoid crashes during development, crashes are still inevitable. When a strange Crash occurs online, besides analyzing the Crash stack information, we can also use offline log retrieval and dynamic log delivery tools to reconstruct the scene at the time of the Crash and help developers locate the problem. However, both methods have their own limitations. Offline logs, as the name implies, are recorded in advance, so key information may be missing: logging in code usually covers only business-critical points, and it is not feasible to log inside a large number of ordinary methods. The limitation of dynamic logging (Holmes) is that each delivery targets only one device of one user with a known UUID. This does not suit online crashes that affect many users, because we cannot know which of the users who crashed will hit the same Crash again, so the delivery configuration is full of uncertainty.

We can improve Holmes so that it supports delivering dynamic logging in batches or even to all users, and reports the recorded logs only when a specific type of Crash occurs. This reduces the pressure on the log server and greatly improves the efficiency of locating problems, because we can be sure that a device reporting logs has actually just experienced that type of Crash, so analyzing its logs yields twice the result with half the effort.
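The idea of reporting logs only when a given Crash type occurs can be sketched with a bounded in-memory buffer and an uncaught exception handler. This is not how Holmes is implemented; `CrashTriggeredLogger`, its buffer capacity, and the `uploader` callback are hypothetical names used only for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical logger: keeps the latest lines in memory and hands them over only for one Crash type. */
final class CrashTriggeredLogger implements Thread.UncaughtExceptionHandler {
    private final ArrayDeque<String> buffer = new ArrayDeque<>();
    private final int capacity;
    private final Class<? extends Throwable> targetType;
    private final Consumer<List<String>> uploader;
    private final Thread.UncaughtExceptionHandler next; // previous handler, may be null

    CrashTriggeredLogger(int capacity, Class<? extends Throwable> targetType,
                         Consumer<List<String>> uploader, Thread.UncaughtExceptionHandler next) {
        this.capacity = capacity;
        this.targetType = targetType;
        this.uploader = uploader;
        this.next = next;
    }

    /** Appends a log line, discarding the oldest one when the buffer is full. */
    synchronized void log(String line) {
        if (buffer.size() == capacity) {
            buffer.removeFirst();
        }
        buffer.addLast(line);
    }

    @Override
    public void uncaughtException(Thread thread, Throwable crash) {
        if (targetType.isInstance(crash)) {
            List<String> snapshot;
            synchronized (this) {
                snapshot = new ArrayList<>(buffer);
            }
            uploader.accept(snapshot); // only devices that hit the target Crash report logs
        }
        if (next != null) {
            next.uncaughtException(thread, crash); // keep the existing crash reporting chain
        }
    }
}
```

The handler could be installed with `Thread.setDefaultUncaughtExceptionHandler`, passing the previously installed handler as `next`. In a real App the `uploader` would more likely persist the snapshot and upload it on the next launch, since network calls inside a crashing process are unreliable.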

Conclusion

As the business develops rapidly, the team is often not given enough time to work on Crash governance, even though the Crash rate is one of the most important indicators of an App. The team needs to work through individual Crash cases one by one, find the root cause of each, identify the most reasonable solution for that class of Crash, and establish a long-term mechanism to prevent it, rather than applying stopgap fixes that only make things worse later. Only in this way, as versions continue to iterate, can we get closer and closer to the goal of Crash governance.

References

  1. How did the team manage to reduce the Crash rate from 2.2% to 0.2%?
  2. Android runtime ART load OAT file process analysis
  3. Android dynamic log system Holmes
  4. A discussion on guarding against Android Hook techniques
  5. Meituan Takeaway Android Lint code checking practices
