Baymax: NetEase’s practice of automatic crash protection for running iOS apps

Copyright statement

This article is reposted from the official account of the NetEase Hangzhou Front-end Technology Department and is published with the author’s authorization.

Baymax, the healthcare robot in Disney’s Big Hero 6, is a chubby inflatable robot loved by everyone for his cute appearance and kind nature, and is nicknamed the “Cute God”.

The Baymax project is a monitoring system created to reduce problems such as memory leaks, UI stalls, and excessive power consumption that result from non-standard code written during development.

Now Baymax has gained a new function: automatic protection against crashes while the app is running, escorting the smooth operation of the app process!

The following sections describe in detail the motivation, design principles, and usage of this runtime crash auto-repair system.

Have you ever been about to lie down for a good night’s sleep when a dreaded phone call comes in, and it’s your boss? “Xiao Wang, the new version X.X.X has a problem. One operation crashes the app, so the new feature can’t be used. Find the cause quickly and ship a hotpatch!” Cursing inwardly and already drenched in sweat, you still have to pretend to be calm and answer: “OK, boss, I’ll look at it right away and do my best to solve it!” As you hurry to turn on your computer, you know you won’t be getting any sleep tonight.

Or have you been in this situation: your boss gathers everyone for the start-of-year KPI goal-setting meeting and says, “As a senior technical team, app performance is our first target, and one of the most important metrics is the app’s crash rate. Last year, our statistics showed a crash rate of 5 per 1,000, while our competitor’s was only 5 per 10,000, a tenfold difference! This year we need to overtake them, or at least match them.” You agree in principle, but privately you are a little skeptical. The other side has several times our development resources, all of them veterans, while most of our team are fresh graduates. Can this KPI really be met?

The crash auto-repair system of the Baymax health system is designed to deal with exactly these situations.

The original intention of the system is to reduce the app’s crash rate. It uses the dynamic features of the Objective-C language and the design ideas of AOP (Aspect-Oriented Programming) to achieve non-intrusive integration. While the app is running, it automatically captures, in real time, the destructive factors that would lead to a crash, and then neutralizes them through specific technical means, so that the app does not crash and can continue to operate normally.

The main function of the system can be summarized in one sentence: with zero intrusion into business code, it captures and eliminates the exceptions that would have crashed the app, keeps the app running normally, and then extracts the details of the crash and reports them to the user in real time.

A small example shows the role of the system intuitively:

Suppose the app sends someMethod: to testObj. testObj is a UIButton object, and UIButton does not implement someMethod:, so the method cannot be found in any of the relevant method lists and the app crashes.

However, with our crash protection system in place, the app does not crash when this code runs. Instead, the complete crash information (crash type, cause, and call stack) is printed in the Xcode console.

This shows that the Baymax system has captured the crash, eliminated it, and output its complete information.

Of course, the current system is not powerful enough to handle every kind of crash, but it will deal with common, high-frequency crashes one by one. The types of crash it can currently handle are as follows:

unrecognized selector crash
KVO crash
NSNotification crash
NSTimer crash
Container crash (array out of bounds, insert nil, etc)
NSString crash (string operation crash)
Bad Access Crash (wild pointer)
UI not on Main Thread Crash

For each type of crash, the protection system uses a different handling approach.

As mentioned above, the protection system currently covers 8 types of crash. The principles behind the protection for each of these 8 types are introduced in detail in the following sections:

3.1.1 Causes of crash

Unrecognized selector crashes account for a large share of app crashes. They are usually caused by sending an object a message for a method it does not implement.

For example, calling the following code produces a crash:

The crash manifests as shown in the following figure:

To solve this type of crash, we first need to understand exactly why and how it occurs.

3.1.2 Method call process

Let’s take a look at how a method call works at run time.

The method call flow in the runtime is as follows:

1. First, look for the invoked method in the method cache of the receiver’s class. If it is found, jump to the corresponding implementation and execute it.
2. If it is not found, look for the invoked method in the method list of the receiver’s class. If it is found, jump to the corresponding implementation and execute it.
3. If it is still not found, repeat steps 1 and 2 in the class that the superclass pointer points to.
4. Continue in this way up to the root class. If the method still cannot be found, switch to intercepting the call and go through the message forwarding mechanism.
5. If the interception methods have not been overridden, an error is reported.

3.1.3 Intercepting calls

As mentioned in the method call flow, if the method is not found, the call is intercepted.

So what is an intercepted call?

Intercepting a call means that, when no method can be found and before the program crashes, you get a chance to handle the situation by overriding four NSObject methods: resolveClassMethod:, resolveInstanceMethod:, forwardingTargetForSelector:, and forwardInvocation:.

The whole flow of intercepting calls is Objective-C’s message forwarding mechanism. The specific process is shown as follows:

As the figure above shows, when a method is missing, the runtime provides three chances to recover:

Calling resolveInstanceMethod: gives the class a chance to add the method
Calling forwardingTargetForSelector: lets another object handle the method
Calling forwardInvocation: lets the target method be executed in some other form

If none of these handles the message, doesNotRecognizeSelector: is called and throws an exception.
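The order of these three chances can be observed with a small probe. The following is a minimal sketch, assuming ARC and the Foundation framework; ForwardingProbe, IsProbeSelector, and ForwardingProbeDemo are hypothetical names introduced only for illustration. An ordered set is used because the runtime may consult resolveInstanceMethod: more than once for the same selector.

```objc
#import <Foundation/Foundation.h>
#import <objc/message.h>
#import <objc/runtime.h>

// Hypothetical probe class: records which forwarding hooks see the missing selector.
@interface ForwardingProbe : NSObject
+ (NSMutableOrderedSet<NSString *> *)trace;
@end

// Only the selector used by the demo is recorded; other lookups are ignored.
static BOOL IsProbeSelector(SEL sel) {
    return sel_isEqual(sel, NSSelectorFromString(@"someMethod"));
}

@implementation ForwardingProbe

+ (NSMutableOrderedSet<NSString *> *)trace {
    static NSMutableOrderedSet<NSString *> *trace;
    static dispatch_once_t once;
    dispatch_once(&once, ^{ trace = [NSMutableOrderedSet orderedSet]; });
    return trace;
}

// Chance 1: the class may add the method dynamically. Returning NO moves on.
+ (BOOL)resolveInstanceMethod:(SEL)sel {
    if (IsProbeSelector(sel)) [[ForwardingProbe trace] addObject:@"resolveInstanceMethod:"];
    return [super resolveInstanceMethod:sel];
}

// Chance 2: another object may take the message. Returning nil moves on.
- (id)forwardingTargetForSelector:(SEL)sel {
    if (IsProbeSelector(sel)) [[ForwardingProbe trace] addObject:@"forwardingTargetForSelector:"];
    return [super forwardingTargetForSelector:sel];
}

// Chance 3 needs a method signature so that the runtime can build an NSInvocation.
- (NSMethodSignature *)methodSignatureForSelector:(SEL)sel {
    NSMethodSignature *signature = [super methodSignatureForSelector:sel];
    return signature ?: [NSMethodSignature signatureWithObjCTypes:"v@:"];
}

// Swallow the invocation so that doesNotRecognizeSelector: is never reached.
- (void)forwardInvocation:(NSInvocation *)invocation {
    if (IsProbeSelector(invocation.selector)) [[ForwardingProbe trace] addObject:@"forwardInvocation:"];
}

@end

// Sends an unimplemented message and returns the hooks in the order they were reached.
static NSArray<NSString *> *ForwardingProbeDemo(void) {
    ForwardingProbe *probe = [ForwardingProbe new];
    ((void (*)(id, SEL))objc_msgSend)(probe, NSSelectorFromString(@"someMethod"));
    return [ForwardingProbe trace].array;
}
```

Calling ForwardingProbeDemo() returns the hook names in the order the runtime reached them, without raising an unrecognized selector exception.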

3.1.4 Unrecognized selector Crash prevention scheme

Since these remedies exist, we can take advantage of the message forwarding mechanism for protection. The question is which of the three steps is the most appropriate one to modify.

We chose the second step, forwardingTargetForSelector:. Here is why:

resolveInstanceMethod: requires dynamically adding methods that do not exist on the class itself, and these methods are redundant for the class
forwardInvocation: can forward a message to multiple objects in the form of an NSInvocation, but it is expensive because a new NSInvocation object must be created. In addition, forwardInvocation: is often overridden by users to build multi-layer message forwarding, so it is not suitable for being overridden again
forwardingTargetForSelector: forwards the message to a single object at low cost, and it is rarely overridden, so it is suitable for being overridden

Having chosen forwardingTargetForSelector:, we override this method on NSObject and do the following:

Dynamically create a stub class
Dynamically add the corresponding selector to the stub class, using a generic function that returns 0 as the IMP for that SEL
Forward the message directly to the stub object

The flow chart is as follows:

Note that if the object’s class itself overrides forwardInvocation:, forwardingTargetForSelector: should not be overridden for it; otherwise the class’s original message forwarding flow would be affected.

By overriding NSObject’s forwardingTargetForSelector: method, we can intercept unrecognized methods and forward the message to the stub object, so that the app continues to run normally.
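The scheme can be sketched roughly as follows. This is a simplified illustration rather than the SDK’s actual code: it assumes ARC and Foundation, uses a statically declared stub class instead of creating one at run time, and GuardStub, GuardStubReturnZero, and the guard_ prefix are hypothetical names.

```objc
#import <Foundation/Foundation.h>
#import <objc/message.h>
#import <objc/runtime.h>

// Hypothetical stub class that receives every message nobody else understands.
@interface GuardStub : NSObject
@end
@implementation GuardStub
@end

// Generic implementation shared by all dynamically added selectors: returns 0.
static long GuardStubReturnZero(id self, SEL _cmd) {
    return 0;
}

@interface NSObject (UnrecognizedSelectorGuard)
- (id)guard_forwardingTargetForSelector:(SEL)selector;
@end

@implementation NSObject (UnrecognizedSelectorGuard)

- (id)guard_forwardingTargetForSelector:(SEL)selector {
    // After the exchange below, this call reaches the original implementation.
    id target = [self guard_forwardingTargetForSelector:selector];
    if (target != nil) {
        return target;
    }
    // Leave classes that implement their own forwardInvocation: logic untouched.
    Method own = class_getInstanceMethod(object_getClass(self), @selector(forwardInvocation:));
    Method root = class_getInstanceMethod([NSObject class], @selector(forwardInvocation:));
    if (own != root) {
        return nil;
    }
    // Give the stub class an implementation for this selector ("q@:" = returns a
    // 64-bit integer, takes self and _cmd), then forward the message to a stub.
    class_addMethod([GuardStub class], selector, (IMP)GuardStubReturnZero, "q@:");
    NSLog(@"[guard] unrecognized selector -[%@ %@] intercepted",
          NSStringFromClass([self class]), NSStringFromSelector(selector));
    return [GuardStub new];
}

@end

// Exchanges the two implementations once for the whole process.
static void GuardInstallUnrecognizedSelectorProtection(void) {
    static dispatch_once_t once;
    dispatch_once(&once, ^{
        Method original = class_getInstanceMethod([NSObject class], @selector(forwardingTargetForSelector:));
        Method guarded = class_getInstanceMethod([NSObject class], @selector(guard_forwardingTargetForSelector:));
        method_exchangeImplementations(original, guarded);
    });
}

// Sends an unimplemented selector to a plain NSObject and returns the stub's result.
static long GuardUnknownSelectorDemo(void) {
    GuardInstallUnrecognizedSelectorProtection();
    NSObject *object = [NSObject new];
    SEL missing = NSSelectorFromString(@"someMethod:");
    return ((long (*)(id, SEL, id))objc_msgSend)(object, missing, nil);
}
```

With the protection installed, GuardUnknownSelectorDemo() returns the stub’s 0 instead of raising an exception. A return value of 0 is only safe for callers that can tolerate a zero or nil result, which is one reason the intercepted call is logged.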

3.2.1 Causes of KVO crash

KVO, or Key-Value Observing, provides an observation mechanism: when a property of the specified object is modified, the observer receives a notification. In simple terms, every time a property of the observed object changes, KVO automatically notifies the corresponding observer.

The KVO mechanism is used in many iOS development scenarios. However, careless use can lead to many crashes, so a way to automatically capture the KVO crashes caused by such carelessness would be valuable.

First, let’s look at two scenarios that lead to KVO crashes:

The observed object is deallocated while it is still registered with KVO, as shown in the following figure

Observers are added or removed repeatedly (the registered and removed KVO observers do not match), as shown in the following figure

3.2.2 KVO Crash Protection Scheme

A typical KVO diagram for an object looks like this:

An observed object has several observers, and each observer observes several key paths.

If there are too many observers and key paths, the observed object’s whole KVO relationship easily becomes confused. As a result, some relationships are still registered when the observed object is deallocated, and registered observers no longer match removed observers.

The author has also encountered KVO observers being added or removed repeatedly in multithreaded situations. Problems of this kind are often well hidden and hard to troubleshoot at the code level.

As shown above, most KVO crashes are caused by disorder in the observed object’s KVO relationships. So how can these chaotic relationships be managed? The observed object can hold a KVO delegate, and all KVO-related operations go through it. The delegate maintains the entire KVO relationship in a map, as shown in the diagram below:

There are two advantages to this:

If KVO observers are added or removed repeatedly (the registered and removed observers do not match), the delegate can block these abnormal operations directly.
Before the observed object is deallocated, the delegate can automatically remove all KVO relationships related to it, avoiding the crash caused by an observed object that is still registered with KVO when it is deallocated.
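The bookkeeping such a delegate performs can be sketched as below. This is a minimal illustration that assumes ARC and Foundation and leaves out the swizzling and the dealloc hook; KVOGuardProxy, attachObserver:forKeyPath:, detachObserver:forKeyPath:, DemoModel, and DemoObserver are hypothetical names, not the SDK’s API.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate: the only object that is actually registered with KVO.
@interface KVOGuardProxy : NSObject
- (instancetype)initWithObserved:(NSObject *)observed;
- (BOOL)attachObserver:(NSObject *)observer forKeyPath:(NSString *)keyPath;  // NO on duplicate add
- (BOOL)detachObserver:(NSObject *)observer forKeyPath:(NSString *)keyPath;  // NO on mismatched remove
@end

@implementation KVOGuardProxy {
    __unsafe_unretained NSObject *_observed;
    NSMutableDictionary<NSString *, NSHashTable<NSObject *> *> *_observersByKeyPath;
}

- (instancetype)initWithObserved:(NSObject *)observed {
    if ((self = [super init])) {
        _observed = observed;
        _observersByKeyPath = [NSMutableDictionary dictionary];
    }
    return self;
}

- (BOOL)attachObserver:(NSObject *)observer forKeyPath:(NSString *)keyPath {
    if (observer == nil || keyPath.length == 0) return NO;
    @synchronized (self) {
        NSHashTable<NSObject *> *observers = _observersByKeyPath[keyPath];
        if ([observers containsObject:observer]) return NO;  // same pair already recorded
        if (observers == nil) {
            // First observer for this key path: register the proxy itself exactly once.
            observers = [NSHashTable hashTableWithOptions:NSPointerFunctionsWeakMemory |
                                                          NSPointerFunctionsObjectPointerPersonality];
            _observersByKeyPath[keyPath] = observers;
            [_observed addObserver:self forKeyPath:keyPath options:NSKeyValueObservingOptionNew context:NULL];
        }
        [observers addObject:observer];
        return YES;
    }
}

- (BOOL)detachObserver:(NSObject *)observer forKeyPath:(NSString *)keyPath {
    @synchronized (self) {
        NSHashTable<NSObject *> *observers = _observersByKeyPath[keyPath];
        if (observer == nil || ![observers containsObject:observer]) return NO;  // never added
        [observers removeObject:observer];
        if (observers.allObjects.count == 0) {
            // Last observer gone: undo the proxy's own registration.
            [_observersByKeyPath removeObjectForKey:keyPath];
            [_observed removeObserver:self forKeyPath:keyPath];
        }
        return YES;
    }
}

// Fan the callback out to the observers that are still alive.
- (void)observeValueForKeyPath:(NSString *)keyPath ofObject:(id)object
                        change:(NSDictionary<NSKeyValueChangeKey, id> *)change context:(void *)context {
    NSArray<NSObject *> *observers;
    @synchronized (self) { observers = _observersByKeyPath[keyPath].allObjects; }
    for (NSObject *observer in observers) {
        [observer observeValueForKeyPath:keyPath ofObject:object change:change context:context];
    }
}

@end

// Demo-only model and observer classes.
@interface DemoModel : NSObject
@property (nonatomic, copy) NSString *name;
@end
@implementation DemoModel
@end

@interface DemoObserver : NSObject
@property (nonatomic) NSInteger callbackCount;
@end
@implementation DemoObserver
- (void)observeValueForKeyPath:(NSString *)keyPath ofObject:(id)object
                        change:(NSDictionary<NSKeyValueChangeKey, id> *)change context:(void *)context {
    self.callbackCount += 1;
}
@end

// Exercises a duplicate add and a mismatched remove; returns YES if both were blocked.
static BOOL KVOGuardDemo(void) {
    DemoModel *model = [DemoModel new];
    DemoObserver *observer = [DemoObserver new];
    KVOGuardProxy *proxy = [[KVOGuardProxy alloc] initWithObserved:model];
    BOOL first = [proxy attachObserver:observer forKeyPath:@"name"];
    BOOL duplicate = [proxy attachObserver:observer forKeyPath:@"name"];
    model.name = @"Baymax";
    BOOL removed = [proxy detachObserver:observer forKeyPath:@"name"];
    BOOL mismatched = [proxy detachObserver:observer forKeyPath:@"name"];
    return first && !duplicate && observer.callbackCount == 1 && removed && !mismatched;
}
```

Because the proxy is the only object registered with the system’s KVO machinery, a duplicate add or a mismatched remove never reaches it and so cannot raise an exception there.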

The relevant KVO methods are swizzled, and each is transformed as described below.

The transformation process for adding an observer is shown as follows:

The process above transfers all KVO-related observer information for the observed object to the KVO delegate and prevents the same kvoInfo from being added more than once.

The transformation process for removing an observer is shown as follows:

When an observer is removed for a keypath, if the delegate’s kvoInfoMap has no entry whose key is that keypath, the delegate does not hold an observer for the keypath, which means a mismatched observer is being removed. Continuing at this point would crash the app, so the flow should be interrupted immediately and the abnormal information collected.

When the keypath’s KVOInfo list (infoArray) becomes empty, the delegate no longer holds any observer associated with the keypath. At that point the delegate’s own observation of the keypath should be removed by calling the original removeObserver method.

The flow also checks for observer == nil: if an observer has become nil and the keypath then changes, a crash can occur because no observer can be found, so this step is needed to prevent that.

The transformation process for the observeValueForKeyPath method is shown as follows:

The most important change is that the delegate’s observeValueForKeyPath method passes the callback on to the real KVO observers: it finds the observers stored in the keypath’s KVOInfo via keyInfoMap and invokes each original observer’s response method.

At the same time, while infoArray is traversed, any entry with info.observer == nil must be cleared promptly, to avoid a crash caused by a value change after the KVO observer has been released.

Finally, to handle the crash caused by an observed object that is still registered with KVO when it is deallocated, NSObject’s dealloc is swizzled. When an object is deallocated, all KVO-related data in its KVO delegate is cleared automatically and the delegate is then set to nil, which avoids that crash.

3.3.1 Causes of crash

If an object has been added as a notification observer and is still registered with the NSNotificationCenter when it is deallocated, an NSNotification crash occurs.

This type of crash typically comes from careless code: the programmer adds an object as an observer to NSNotificationCenter and forgets to remove it when the object is deallocated.

Fortunately, Apple has handled this case since iOS 9, so notification crashes no longer happen on iOS 9 and later even if the developer does not remove the observer.

However, for users on versions before iOS 9, NSNotification crash protection is still needed.

3.3.2 NSNotification Crash prevention Scheme

The protection principle for NSNotification crashes is very simple: use method swizzling to hook NSObject’s dealloc and call [[NSNotificationCenter defaultCenter] removeObserver:self] before the object is actually deallocated.

Note that not every object needs this. If an object has never been added to NSNotificationCenter as an observer, calling removeObserver before its dealloc is unnecessary. So we hook NSNotificationCenter’s

addObserver:(id)observer selector:(SEL)aSelector name:(NSString *)aName object:(id)anObject

method and dynamically attach a flag to the observer when it is added. When the observer is deallocated, the flag tells us whether removeObserver needs to be called.
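A rough sketch of the flag-plus-dealloc idea follows. It assumes ARC and Foundation, stores the flag as an associated object (the text does not say how the flag is stored), and replaces NSObject’s dealloc for the whole process, which is intrusive and is shown only to illustrate the mechanism. All Guard and Demo names are hypothetical, and GuardAutoRemovedCount exists only so the demo can observe the effect.

```objc
#import <Foundation/Foundation.h>
#import <objc/runtime.h>

static const void *GuardIsObserverKey = &GuardIsObserverKey;
static void (*GuardOriginalDealloc)(__unsafe_unretained id, SEL);
static NSUInteger GuardAutoRemovedCount = 0;  // demo-only counter, not thread-safe

@interface NSNotificationCenter (Guard)
- (void)guard_addObserver:(id)observer selector:(SEL)selector name:(NSNotificationName)name object:(id)object;
@end

@implementation NSNotificationCenter (Guard)
- (void)guard_addObserver:(id)observer selector:(SEL)selector name:(NSNotificationName)name object:(id)object {
    if (observer != nil) {
        // Flag the observer so that its dealloc knows it has to unregister.
        objc_setAssociatedObject(observer, GuardIsObserverKey, @YES, OBJC_ASSOCIATION_RETAIN_NONATOMIC);
    }
    [self guard_addObserver:observer selector:selector name:name object:object];  // original method
}
@end

// Replacement for -[NSObject dealloc]: unregister flagged objects, then run the original.
static void GuardDealloc(__unsafe_unretained id self, SEL _cmd) {
    if (objc_getAssociatedObject(self, GuardIsObserverKey) != nil) {
        [[NSNotificationCenter defaultCenter] removeObserver:self];
        GuardAutoRemovedCount += 1;
    }
    GuardOriginalDealloc(self, _cmd);
}

static void GuardInstallNotificationProtection(void) {
    static dispatch_once_t once;
    dispatch_once(&once, ^{
        Class center = [NSNotificationCenter class];
        method_exchangeImplementations(
            class_getInstanceMethod(center, @selector(addObserver:selector:name:object:)),
            class_getInstanceMethod(center, @selector(guard_addObserver:selector:name:object:)));
        // ARC forbids @selector(dealloc), so the selector is looked up by name.
        Method dealloc = class_getInstanceMethod([NSObject class], NSSelectorFromString(@"dealloc"));
        GuardOriginalDealloc = (void (*)(__unsafe_unretained id, SEL))method_getImplementation(dealloc);
        method_setImplementation(dealloc, (IMP)GuardDealloc);
    });
}

// Demo-only observer class.
@interface DemoListener : NSObject
@end
@implementation DemoListener
- (void)handle:(NSNotification *)note {}
@end

// Returns how many flagged observers were unregistered automatically during the demo.
static NSUInteger GuardNotificationDemo(void) {
    GuardInstallNotificationProtection();
    NSUInteger before = GuardAutoRemovedCount;
    @autoreleasepool {
        DemoListener *listener = [DemoListener new];
        [[NSNotificationCenter defaultCenter] addObserver:listener selector:@selector(handle:)
                                                     name:@"GuardDemoNotification" object:nil];
        // listener goes out of scope here without an explicit removeObserver: call
    }
    return GuardAutoRemovedCount - before;
}
```

Objects that were never registered as observers carry no flag, so their dealloc only pays for one associated-object lookup before the original implementation runs.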

3.4.1 Causes of NSTimer Crash

Scheduled tasks are used often during development, but the scheduledTimerWithTimeInterval:target:selector:userInfo:repeats: interface of NSTimer has a problem when used for repeating tasks: NSTimer strongly references the target instance. You therefore need to invalidate the timer at an appropriate time. Otherwise, because the timer strongly references the target, the target cannot be released, which leaks memory and can even cause a crash when the scheduled task fires. How the crash manifests depends on the selector that the target executes.

Also, if an NSTimer repeats a task indefinitely, the target’s selector may keep being called when it is no longer valid, which wastes the app’s CPU, memory, and other resources.

Therefore, it is necessary to design a scheme that effectively prevents the misuse of NSTimer.

3.4.2 NSTimer Crash Prevention Scheme

As the analysis above shows, NSTimer’s problems mainly come from not being invalidated at a proper time, and from the memory leak caused by NSTimer’s strong reference to its target.

The key to solving the NSTimer problem lies in two points:

Whether NSTimer can avoid strongly referencing its target
Finding an appropriate time to invalidate the NSTimer automatically once it is certain that the timer is no longer valid

The first issue, the strong reference to the target, can be solved with the following scheme:

Add a stubTarget layer between NSTimer and the target. stubTarget acts as a bridge for communication between NSTimer and the target.

NSTimer strongly references stubTarget, while stubTarget weakly references the target. The relationship between the target and NSTimer thus becomes a weak reference, which means the target can be released freely, solving the circular reference problem.

As mentioned above, stubTarget is responsible for the communication between NSTimer and the target. The implementation is divided into two steps:

Step 1. Swizzle NSTimer’s scheduledTimerWithTimeInterval:target:selector:userInfo:repeats:. In the new method, a stubTarget object is created dynamically; it holds a weak reference to the original target and stores the selector, timer, targetClass, and other properties. The stubTarget is then passed to the original method as the timer’s target, with stubTarget’s fireProxyTimer: as the selector callback. The flow is shown as follows:

Step 2. stubTarget’s fireProxyTimer: handles and dispatches the selector callback. The process is shown as follows:

With stubTarget introduced, the original target is free of NSTimer’s strong reference and can be released.

Whenever the NSTimer callback fireProxyTimer: runs, it checks whether the original target has been released. If it has, the NSTimer is no longer valid. Continuing to call the original target’s selector at that point would probably crash and serves no purpose, so the NSTimer is invalidated and the error data is reported. In this way the NSTimer is invalidated automatically at the appropriate time.
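The stubTarget idea can be sketched without any swizzling. The following assumes ARC and Foundation; fireProxyTimer: is the name used in the text, while GuardTimerStub, GuardScheduledTimer, and DemoTicker are hypothetical, and GuardScheduledTimer merely stands in for the swizzled scheduling method.

```objc
#import <Foundation/Foundation.h>
#import <objc/message.h>

// The stubTarget from the text: strongly held by the timer, weakly holds the real target.
@interface GuardTimerStub : NSObject
@property (nonatomic, weak) id target;
@property (nonatomic, assign) SEL selector;
- (void)fireProxyTimer:(NSTimer *)timer;
@end

@implementation GuardTimerStub
- (void)fireProxyTimer:(NSTimer *)timer {
    id target = self.target;
    if (target == nil) {
        // The real target has been released: stop the timer instead of messaging it.
        [timer invalidate];
        NSLog(@"[guard] timer invalidated because its target was released");
        return;
    }
    if ([target respondsToSelector:self.selector]) {
        ((void (*)(id, SEL, NSTimer *))objc_msgSend)(target, self.selector, timer);
    }
}
@end

// Hypothetical helper standing in for the swizzled
// scheduledTimerWithTimeInterval:target:selector:userInfo:repeats: method.
static NSTimer *GuardScheduledTimer(NSTimeInterval interval, id target, SEL selector, id userInfo, BOOL repeats) {
    GuardTimerStub *stub = [GuardTimerStub new];
    stub.target = target;
    stub.selector = selector;
    return [NSTimer scheduledTimerWithTimeInterval:interval
                                            target:stub
                                          selector:@selector(fireProxyTimer:)
                                          userInfo:userInfo
                                           repeats:repeats];
}

// Demo-only target class.
@interface DemoTicker : NSObject
@end
@implementation DemoTicker
- (void)tick:(NSTimer *)timer {}
@end

// Schedules a repeating timer, drops the target, and lets the run loop fire the timer once.
static BOOL GuardTimerDemo(void) {
    NSTimer *timer = nil;
    __weak DemoTicker *weakTicker = nil;
    @autoreleasepool {
        DemoTicker *ticker = [DemoTicker new];
        weakTicker = ticker;
        timer = GuardScheduledTimer(0.01, ticker, @selector(tick:), nil, YES);
    }
    BOOL targetReleased = (weakTicker == nil);  // the timer did not keep the target alive
    [[NSRunLoop currentRunLoop] runUntilDate:[NSDate dateWithTimeIntervalSinceNow:0.2]];
    return targetReleased && !timer.isValid;    // the next fire invalidated the timer
}
```

In the demo, the repeating timer does not keep its target alive, and the first fire after the target is released invalidates the timer.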

3.5.1 Cause of Container Crash

A Container crash is a crash in a container class, commonly NSArray, NSMutableArray, NSDictionary, NSMutableDictionary, or NSCache. Common misuse such as out-of-bounds access and inserting nil causes such crashes. The causes are relatively simple, so they are not described further here.

Although this kind of crash is easy to track down, it still accounts for a sizable share of app crashes, so protection is necessary.

3.5.2 Container Crash Prevention Scheme

The protection scheme for Container crashes is also simple: method swizzle the error-prone APIs of NSArray, NSMutableArray, NSDictionary, NSMutableDictionary, and NSCache, and add conditional checks in the new methods to make these APIs safe. The details are not elaborated here.
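As an illustration, the following sketch guards two NSMutableArray methods. It assumes ARC and Foundation, covers only addObject: and objectAtIndex: on the concrete class behind [NSMutableArray array], and uses hypothetical guard_ names; a real implementation would need to cover many more methods and concrete classes.

```objc
#import <Foundation/Foundation.h>
#import <objc/runtime.h>

@interface NSMutableArray (ContainerGuard)
- (void)guard_addObject:(id)object;
- (id)guard_objectAtIndex:(NSUInteger)index;
@end

@implementation NSMutableArray (ContainerGuard)

- (void)guard_addObject:(id)object {
    if (object == nil) {
        NSLog(@"[guard] ignored an attempt to add nil to an array");
        return;
    }
    [self guard_addObject:object];  // original implementation after the exchange
}

- (id)guard_objectAtIndex:(NSUInteger)index {
    if (index >= self.count) {
        NSLog(@"[guard] index %lu is out of bounds (count %lu)",
              (unsigned long)index, (unsigned long)self.count);
        return nil;
    }
    return [self guard_objectAtIndex:index];  // original implementation after the exchange
}

@end

// Simplified swizzle helper: swaps the implementations of two selectors on a class.
static void GuardExchange(Class cls, SEL original, SEL replacement) {
    Method a = class_getInstanceMethod(cls, original);
    Method b = class_getInstanceMethod(cls, replacement);
    if (a != NULL && b != NULL) {
        method_exchangeImplementations(a, b);
    }
}

static void GuardInstallContainerProtection(void) {
    static dispatch_once_t once;
    dispatch_once(&once, ^{
        // NSMutableArray is a class cluster, so the concrete class is looked up at run time.
        Class concrete = object_getClass([NSMutableArray array]);
        GuardExchange(concrete, @selector(addObject:), @selector(guard_addObject:));
        GuardExchange(concrete, @selector(objectAtIndex:), @selector(guard_objectAtIndex:));
    });
}

// Adds nil and reads past the end; returns YES if both calls were absorbed.
static BOOL GuardContainerDemo(void) {
    GuardInstallContainerProtection();
    NSMutableArray *items = [NSMutableArray arrayWithObject:@"a"];
    id missing = nil;
    [items addObject:missing];
    id outOfRange = [items objectAtIndex:5];
    return items.count == 1 && outOfRange == nil;
}
```

Both operations in the demo normally raise an exception; with the guard installed they are logged and ignored. Subscript access such as items[5] goes through a different method and is not covered by this sketch.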

The causes of NSString/NSMutableString crashes and their protection scheme are similar to those of Container crashes.

3.7.1 Cause of the wild pointer crash

Among all app crashes, those caused by accessing wild pointers account for a large share. A wild pointer crash typically shows up as Exception Type: SIGSEGV, Exception Codes: SEGV_ACCERR, or as in the following figure:

Solving wild pointer crashes is often tricky. On the one hand, the scenario that caused the crash is hard to reproduce; on the other hand, the console information after the crash offers limited help. To make wild pointers easier to find during debugging, Xcode provides the Zombie mechanism, which reports the class of the object involved when a wild pointer is accessed, so wild pointer problems can be solved at the development stage. For wild pointer problems in production, however, there is still no good way to locate them.

Therefore, because wild pointers occur frequently and are hard to locate, a dedicated layer of protection for them is warranted.

3.7.2 Protection scheme for Wild Pointer Crash

Since Xcode provides a Zombie mechanism for detecting wild pointers, we can implement a similar zombie mechanism ourselves and add full method interception and message forwarding for zombie instances. A wild pointer access then no longer crashes; only the crash-related information is recorded.

Note also that the zombie mechanism requires an object’s pointer and memory to be kept when the object is released. As the app runs, more and more objects are created and released, so the retained memory grows, which obviously affects the performance of a running app. A mechanism for releasing zombie objects is therefore needed to keep the memory impact of the zombie mechanism bounded.

The implementation of the zombie mechanism is divided into the following four steps:

Step 1. Use method swizzling to replace NSObject’s allocWithZone: method. In the new method, determine whether objects of this type need wild pointer protection. If so, use objc_setAssociatedObject to set a flag on the object. Flagged objects will later go through the zombie process.

The flow chart is as follows:

The flag is used because many system classes, such as NSString and UIView, are created and released very frequently, while the probability of wild pointers in their instances is very low. Wild pointer problems mostly occur in the classes we write ourselves. Setting a flag at creation time filters out instances that do not need wild pointer protection and so makes the scheme more efficient.

In addition, we added a blacklist mechanism, because certain classes (for example, NSBundle) are not suitable for the zombie mechanism and crash if added to it. All classes that are part of the zombie mechanism itself also must not be flagged, because they would cause circular references and calls during release, leading to memory leaks and even stack overflows.

Step 2. Use method swizzling to replace NSObject’s dealloc method. For a flagged instance, call objc_destructInstance to release the properties the instance references, then change the instance’s isa to HTZombieObject and save the original class name on the instance with objc_setAssociatedObject.

The flow chart is as follows:

The reason for calling objc_destructInstance is as follows.

This follows the NSZombie implementation in the Objective-C runtime: dealloc eventually calls object_dispose, which does three things.

Call objc_destructInstance to release the instances referenced by this instance
Change the isa of this instance to stubClass so that it accepts arbitrary method calls
Free the memory

The official documentation describes objc_destructInstance as follows:

Destroys an instance of a class without freeing memory and removes any associated references this instance might have had.

That is, objc_destructInstance releases the references associated with the instance but does not free the memory of the instance itself.

Step 3. HTZombieObject handles all intercepted methods through forwardingTargetForSelector: in the message forwarding mechanism. Based on the selector, a handling method is dynamically added to the responder, an HTStubObject instance. objc_getAssociatedObject is then used to retrieve the previously saved original class name of the instance, and the error data is recorded.

The flow chart is as follows:

HTZombieObject is handled in the same way as an unrecognized selector crash. The main purpose is to intercept all messages sent to HTZombieObject and replace them with a function that does nothing and returns, so that the program does not crash.

Step 4. When the app enters the background, or when the number of unreleased instances reaches the upper limit, the original dealloc method is called in the ht_freeSomeMemory method to release all zombie instances.

To sum up, the protection process for bad access crashes is as follows:

3.7.3 Associated Risks

Wild pointer protection prevents the crash by dynamically inserting an empty implementation. However, how the business layer behaves afterwards is hard to determine, and the business logic may enter an abnormal state. A plan is needed for how to present such problems to users.
Because the release of some instances is delayed, total memory usage is affected to some extent. The memory buffer is currently capped at about 2M, so the impact should not be large, but there may be some potential risks.
Delayed release rests on the assumption that the relevant code will be called within a certain period of time, so the zombie protection only works while the actual instance is still held in the zombie cache. If the instance has really been released, accessing the wild pointer will still crash.

Updating the UI on a non-main thread can crash the app, so this also needs protection.

The current preliminary approach is to swizzle the following three methods of UIView:

- (void)setNeedsLayout;
- (void)setNeedsDisplay;
- (void)setNeedsDisplayInRect:(CGRect)rect;

In each swizzled method, the call is wrapped as

dispatch_async(dispatch_get_main_queue(), ^{ /* call the original method */ });

to transfer the corresponding UI update to the main thread, while recording the error information.
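The thread check at the heart of this approach can be sketched as a small helper. This assumes ARC and Foundation; GuardPerformOnMainThread is a hypothetical function that a swizzled setNeedsLayout, setNeedsDisplay, or setNeedsDisplayInRect: could call with a block that invokes the original method. UIKit itself is left out so that the sketch stays self-contained.

```objc
#import <Foundation/Foundation.h>

// Runs the work immediately on the main thread, or redirects it to the main queue.
// Returns YES if the work ran directly, NO if it had to be redirected.
static BOOL GuardPerformOnMainThread(dispatch_block_t work) {
    if ([NSThread isMainThread]) {
        work();
        return YES;
    }
    NSLog(@"[guard] UI update requested on a background thread; moved to the main queue");
    dispatch_async(dispatch_get_main_queue(), work);
    return NO;
}
```

Running the work synchronously when the caller is already on the main thread keeps the original ordering of UI calls; only off-main calls are deferred and logged.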

However, after implementing this, we found that these three methods do not cover all UI update operations on UIView. On the other hand, enumerating and swizzling every UI update method of UIView would be clumsy and inefficient.

So the author is still looking for a better solution to this problem.

Currently, the SDK implements the following functions and configurations:

1. Configure the crash type to be protected

You can select the crash protection types you need and configure them through the following interface:

- (void)configSafetyGuardService:(HTSafetyGuardType)SafetyGuardType;

SafetyGuardType can be configured as follows:

HTSafetyGuardType_None
HTSafetyGuardType_All
HTSafetyGuardType_UnrecognizedSelector
HTSafetyGuardType_KVO
HTSafetyGuardType_BadAccess
HTSafetyGuardType_Notification
HTSafetyGuardType_Timer
HTSafetyGuardType_Container
HTSafetyGuardType_String
HTSafetyGuardType_UI

You can choose the type of protection according to your own project needs.

2. Enable or disable the security protection function in real time

After the configuration is complete, call - (void)start to enable protection. The protection switch takes effect in real time (no app restart is needed), so you can turn protection on or off at any time.

- (BOOL)isWorking indicates whether protection is currently active.

Use the - (void)start interface to enable protection in real time

Use the - (void)stop interface to disable protection in real time

3. Configure whitelists and blacklists to specify the classes and objects for which protection should be added or removed

Because classes differ, you may decide that some classes do not need protection, so a blacklist is provided. Blacklisted classes and their subclasses are not protected.

The whitelist exists because, during development, the author found that some system classes did not need to be protected, so the overall protection scope was narrowed to all user-defined classes. Later, however, it turned out that most crashes were closely related to a few commonly used system classes (such as NSString, NSDictionary, and UIView), so protection had to be enabled for them. The whitelist is therefore provided for the system classes that need protection.

Note: Because of its particular nature, wild pointer protection does not use this whitelist and blacklist; it maintains its own. For details, see 3.7 Protection against Wild Pointer Crash.

4. Set an exception handler to specify custom operations to perform after a crash is captured

After a crash occurs and is captured and handled by our system, users may need further processing, such as uploading tracking data. To do this, set a handler. The HTExceptionHandler returns the crash information in the form of an HTCrashInfo.

HTCrashInfo contains:

Crash type: crashType
Call stack of the crashing thread: callStackSymbols
Description of the crash: crashDescription
Extended information: userinfo

Detailed information about the above interfaces can be found in htSafetyGuardService.h. (Note that HTSafetyGuardService is a singleton.)

Because the SDK has not yet gone through complete functional and performance testing, it is not publicly available for the time being. Once the author feels that the project quality has reached a certain standard, the SDK will be released. If you are interested in this project, you can contact [email protected].
