0x1 Why Is Crash Protection Required
In product development, the crash rate is a key indicator, and one that almost every department in a team should watch or help improve. It reflects both the quality of the whole product and the overall technical ability of the team. A lower crash rate earns the product a better reputation among users, and working toward it helps team members grow, deepens their understanding of the lower layers of iOS, and pays off in future development.
0x2 Why write this article
The trigger was that my own project got burned by the FB SDK. On July 10, 2020, bad data sent by the FB backend caused a large number of apps using the FB SDK to crash on launch, affecting a very large number of users. Most apps, ours included, had no relevant protection or fault tolerance at the time, so the crash rate soared immediately. Shipping a fix meant going through the release process again, so in practice we could only wait for the FB backend to recover. I felt helpless and passive, and decided to build a relatively complete crash protection system so that this would not happen again. The second reason is that after the incident I went through material online and looked at how other teams handle this, and found that approaches, methods, and results vary widely. So I decided to record the good ideas and methods I found, combined with some of my own ideas and experience. Finally, knowledge needs to be distilled, accumulated, and shared, and writing it down also consolidates and deepens my own understanding.
0x3 How to do it
The crash scenario itself was very simple. A parameter that was supposed to be a Dictionary arrived from the FB backend as a String, which inevitably crashed during parsing. One layer of parameter validation would have been enough to prevent it.
However, most apps failed to handle even such a simple problem, which suggests that something in the process had gone unnoticed and that this was only the tip of the iceberg. There must be problems, or at least room for optimization, in our mechanism.
If you want to avoid this situation, you should do the following:
I: Crash handling process
In iOS, the process can be summarized in four steps:
- Crash prevention: use hooks and similar techniques to validate the input parameters of container-like classes, so that the crash does not happen in the first place.
- Crash interception: if prevention fails, intercept the crash at this stage so that the exception can be detected.
- Crash reporting: for crashes that were prevented or captured, generate useful logs and reports, and restore the stack as completely as possible.
- Crash follow-up: protect the user experience as much as possible after a crash, and crash gracefully.
II: Crash prevention
There are two main approaches to crash protection: AOP is usually used for non-memory problems, and zombie objects are used for memory problems.
AOP:
There is already plenty of material and sample code online about AOP on iOS, so I will not repeat it here. However, when AOP is applied to methods that are called this frequently, there are quite a few pitfalls to watch out for.
- Scope of AOP's impact: we initially hooked the array-related methods in the ordinary way and then saw a large number of crashes like this one: [UIKeyboardLayoutStar release]: message sent to deallocated instance UIKeyboardLayoutStar
From some other scenarios we could confirm that hooking the NSMutableArray methods was affecting calls made by system classes.
Debugging in Xcode showed the cause. A hook works by exchanging methods so that a user-defined function runs before the original system implementation. In some extreme cases (multithreading, for example), a variable is released inside the hook function, and when the original system implementation then performs its normal release, the object is released twice.
[Figure: general flow of the repeated release]
This scenario was hard to reproduce in testing, but once the app was live and the user base was large enough, the problem showed up. The fix is simple: compile the hook code under MRC where possible and wrap it in an autorelease pool, so that internal variables are released at the end of the current runloop.
- AOP performance: as described above, AOP adds one more layer of method invocation. Combined with the iOS method lookup and forwarding process, it is easy to see that AOP costs performance, and because crash protection hooks frequently called methods, that cost cannot be ignored.
As can be seen from the figure above, the method calling process eventually returns the corresponding IMP pointer for the caller to invoke. Because OC is a dynamic language, the runtime cannot know when a developer will insert or exchange a function, so every call has to go through this lookup and verification logic.
Anyone who has used AOP knows that a layer of validation is done before the methods are exchanged:
#import <objc/runtime.h>

+ (void)hookClass:(Class)classObject isClassMetohd:(BOOL)classMethod fromSelector:(SEL)fromSelector toSelector:(SEL)toSelector {
    Class class = classObject;
    Method fromMethod = class_getInstanceMethod(class, fromSelector);
    Method toMethod = class_getInstanceMethod(class, toSelector);
    if (classMethod) {
        // Class methods live on the metaclass.
        class = object_getClass(classObject);
        fromMethod = class_getClassMethod(class, fromSelector);
        toMethod = class_getClassMethod(class, toSelector);
    }
    // If the class only inherits fromSelector, add it to this class first
    // so that the superclass implementation is not exchanged by accident.
    if (class_addMethod(class, fromSelector, method_getImplementation(toMethod), method_getTypeEncoding(toMethod))) {
        class_replaceMethod(class, toSelector, method_getImplementation(fromMethod), method_getTypeEncoding(fromMethod));
    } else {
        method_exchangeImplementations(fromMethod, toMethod);
    }
}
So inside the hooked method (toSelector in the code above), when we need to call back into the original method, we can call the corresponding function pointer (IMP) directly instead of sending another message.
Finally, I tested calling the IMP directly, once in a demo and once in the app. The test data is as follows, and the difference is clear. This is also one reason why Swift and other static languages are faster than OC.
Zombie:
Using zombie objects to track down memory problems has long been Apple's main approach. Xcode has a related setting that can be turned on for debugging. But once this technique is shipped to production for protection or monitoring, there are many issues to consider.
- Zombie entry point: that is, where to turn an object into a zombie. Some SDKs use dealloc as the entry point. That works, but it is not optimal, for two reasons:
  - 1. Apple no longer allows dealloc to be called explicitly under ARC; at present it can only be invoked through performSelector or other dynamic calls.
  - 2. It is easy to miss objc_destructInstance, the function in which all member variables and properties are released.
In conclusion, I chose to generate zombie objects in the free function.
- Zombie memory threshold: zombie objects occupy memory, and in a production environment memory must be handled carefully and with complete logic. Once a certain memory threshold is exceeded, zombie objects need to be cleared promptly. Determining that threshold is the key, and it raises two questions:
  - 1. Memory behavior is strongly tied to the device model. How should the threshold be adjusted for different models?
  - 2. How can it be adjusted flexibly and dynamically according to what is happening online?
The bottom line is that retaining zombie objects must never trigger a memory warning, so I tested the memory warning threshold on most models:
As can be seen from the figure above, a memory warning is triggered when the app's memory usage reaches 57% to 69% of total device memory. In addition, part of the iPhone's memory is reserved by the system and is not available to developers, so the usable memory is only about 50%. From this I derived the following formulas:
Formula 1 (a memory warning must not be triggered): Y = 0.5 * deviceMem - currentAppMem
Formula 2: Y = min(0.5 * deviceMem - currentAppMem, currentAppMem)
These two formulas look complete, but there is still room for optimization: not every variable in the app can become a zombie object, and only some of them need to be monitored. This gives the final formula for the memory threshold:
Y = min(0.5 * deviceMem - currentAppMem, currentAppMem / N)
Since the app's memory usage changes constantly, a timer can be added to update the value at regular intervals.
Another advantage of delivering N dynamically is that the backend can tune it according to the memory-related crashes seen online. If the crash volume is high, more memory may be needed to retain zombie objects, so N can be lowered; otherwise it can be raised. This way model differences can be ignored and the threshold configured remotely based on crash data.
As the online data in the figure shows, the zombie memory threshold increases as N decreases but never exceeds the memory warning threshold, which keeps memory healthy.
The figure below shows the number of wild pointer problems captured for different values of N; each app can adjust N according to its own business situation.
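The final threshold formula can be sketched in plain C (which also compiles inside an Objective-C file). This is a minimal sketch, not the author's implementation: the function name zombie_budget_bytes and its parameters are hypothetical, all values are assumed to be in the same unit, and how deviceMem and currentAppMem are measured is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Zombie memory threshold:
 *   Y = min(0.5 * deviceMem - currentAppMem, currentAppMem / N)
 * Returns 0 when the app already uses half of device memory or more,
 * because the first term would otherwise be zero or negative.
 * Hypothetical helper: names and units are illustrative assumptions. */
uint64_t zombie_budget_bytes(uint64_t device_mem,
                             uint64_t current_app_mem,
                             uint32_t n)
{
    assert(n > 0); /* N comes from remote config and must be positive */

    uint64_t half = device_mem / 2;
    if (current_app_mem >= half) {
        return 0; /* no headroom left below the 50% line */
    }

    uint64_t headroom = half - current_app_mem; /* 0.5 * deviceMem - currentAppMem */
    uint64_t share = current_app_mem / n;       /* currentAppMem / N */
    return headroom < share ? headroom : share;
}
```

A timer could call this function periodically with fresh memory readings and the latest remotely configured N, and shrink the zombie cache whenever the result drops below the space currently in use.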
- Zombie update strategy: the usual practice is to check, when a new zombie object is added, whether the threshold has been exceeded, and to delete earlier zombie objects before adding the new one. This cleanup depends on new objects being added. If none are added, the cache never shrinks: once zombie space has been allocated it is not released, the cache cannot clean itself, and the app holds extra memory for no reason.
So, borrowing from LRU (least recently used) logic, the cache is also checked every 30 seconds, and zombie objects that have gone unused for more than 30 seconds are deleted. 30 seconds is an empirical value: extensive testing showed that memory problems generally occur within 30 seconds after an object is destroyed, and the probability of one occurring after 30 seconds is very small. This gives the cache self-cleaning logic.
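The two cleanup paths described above (evicting on insert when the threshold is exceeded, and a periodic 30-second sweep) might look roughly like this in plain C. It is a simplified sketch under stated assumptions, not the author's code: the ZombieCache type, its functions, and the fixed capacity of 64 slots are all hypothetical, timestamps are supplied by the caller in seconds, and "unused" is approximated by the time since the object entered the cache.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ZOMBIE_CAPACITY 64     /* hypothetical slot count */
#define ZOMBIE_MAX_AGE_SEC 30  /* empirical value from the text */

typedef struct {
    void    *ptr;       /* memory kept alive as a zombie */
    size_t   size;      /* bytes it occupies */
    uint64_t added_at;  /* seconds, from a caller-supplied monotonic clock */
} ZombieEntry;

typedef struct {
    ZombieEntry entries[ZOMBIE_CAPACITY]; /* ring buffer, oldest at head */
    size_t head, count, total_bytes;
    void (*release)(void *);              /* really frees the memory */
} ZombieCache;

void zombie_cache_init(ZombieCache *c, void (*release)(void *))
{
    c->head = c->count = c->total_bytes = 0;
    c->release = release ? release : free; /* default to the C free() */
}

static void zombie_cache_drop_oldest(ZombieCache *c)
{
    ZombieEntry *e = &c->entries[c->head];
    c->total_bytes -= e->size;
    c->release(e->ptr);
    c->head = (c->head + 1) % ZOMBIE_CAPACITY;
    c->count--;
}

/* Insert path: evict the oldest zombies until the new one fits in budget.
 * Returns 1 if the pointer was retained, 0 if it was freed immediately. */
int zombie_cache_add(ZombieCache *c, void *ptr, size_t size,
                     uint64_t now, size_t budget)
{
    if (size > budget) {  /* can never fit: do not keep it as a zombie */
        c->release(ptr);
        return 0;
    }
    while (c->count > 0 &&
           (c->count == ZOMBIE_CAPACITY || c->total_bytes + size > budget)) {
        zombie_cache_drop_oldest(c);
    }
    ZombieEntry *slot = &c->entries[(c->head + c->count) % ZOMBIE_CAPACITY];
    slot->ptr = ptr;
    slot->size = size;
    slot->added_at = now;
    c->count++;
    c->total_bytes += size;
    return 1;
}

/* Timer path: free zombies older than ZOMBIE_MAX_AGE_SEC even when nothing
 * new is being added. Returns the number of entries released. */
size_t zombie_cache_sweep(ZombieCache *c, uint64_t now)
{
    size_t dropped = 0;
    while (c->count > 0) {
        const ZombieEntry *oldest = &c->entries[c->head];
        if (now <= oldest->added_at ||
            now - oldest->added_at <= ZOMBIE_MAX_AGE_SEC) {
            break; /* entries are in insertion order, so the rest are newer */
        }
        zombie_cache_drop_oldest(c);
        dropped++;
    }
    return dropped;
}
```

Because entries are stored in insertion order, the sweep only has to look at the head of the ring buffer. A real implementation would also need locking, since objects can be destroyed on any thread.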
Tests with Instruments show that the zombie logic has no significant impact on the app's own memory.