Why online detection is necessary

Ordinary iOS crashes are now rare, and the crashes that remain are often hard to reproduce or lack useful information. My online statistics show that more than 60% of the remaining crashes that have not been located and fixed are caused by wild pointers. The typical stack, like the other stack shapes we see, has one characteristic: it crashes in release or retain.

If your app is tackling wild pointer crashes for the first time, try to simulate and reproduce them offline first. Xcode ships a number of excellent tools for this: Address Sanitizer, Malloc Scribble, Zombie Objects, and so on.

In practice, however, it is hard to reproduce the problem with these tools, and this has a lot to do with the nature of the app. If the app is a platform that aggregates many businesses, each developed by a separate team with its own entry points, trigger logic, and even its own staged-rollout (gray release) strategy, we might not even find an entry point to test. In that situation we can only rely on online means to collect and troubleshoot problems.

Technical research

The articles I consulted are listed in the references at the end; read them if you are interested. The general ideas fall into two categories:

  • hook free

    In free, the memory is not actually released but kept. The hook then checks whether the block is an Objective-C object. If it is, the object's class is switched with object_setClass to a custom class, and the stack and class information are captured through message forwarding.

  • hook dealloc

    In dealloc, check whether wild pointer detection is enabled. If not, release the object normally. Otherwise, change the object's isa, keep the object in a memory pool instead of freeing it, and let message forwarding intercept the stack and the class name when the object is called again.

I have looked at many schemes in the industry, and the ones for Objective-C are basically the same: keep the object alive, then use message forwarding to capture the stack and type information when it is called again. But practical issues, such as how to control memory growth and exactly which classes to monitor, are rarely described. The schemes listed above work in the debug or staged-rollout phase, but they may cause performance problems if used online. So for wild pointer detection to go online, the core problem is how to limit the performance cost while still ensuring effective coverage.
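To make the memory-growth question concrete, here is a minimal sketch of one possible way to bound what such a scheme retains: a FIFO pool with a byte budget that really frees the oldest entries once the budget is exceeded. This is an illustration under assumptions, not code from any of the surveyed schemes. The names ZombiePool, pool_push and pool_trim are hypothetical, plain malloc/free stand in for object memory, and there is no locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical quarantine pool: holds "dead" blocks instead of freeing them
   right away, and evicts the oldest ones once a byte budget is exceeded. */
typedef struct PoolNode {
    void *block;             /* the retained memory */
    size_t size;             /* bytes charged against the budget */
    struct PoolNode *next;
} PoolNode;

typedef struct {
    PoolNode *head;          /* oldest entry, evicted first */
    PoolNode *tail;          /* newest entry */
    size_t used;             /* bytes currently held */
    size_t budget;           /* upper bound on retained bytes */
    size_t count;            /* number of retained blocks */
} ZombiePool;

void pool_init(ZombiePool *p, size_t budget) {
    p->head = p->tail = NULL;
    p->used = 0;
    p->budget = budget;
    p->count = 0;
}

/* Really release the oldest entries until `used` fits the budget again. */
void pool_trim(ZombiePool *p) {
    while (p->used > p->budget && p->head != NULL) {
        PoolNode *old = p->head;
        p->head = old->next;
        if (p->head == NULL) p->tail = NULL;
        p->used -= old->size;
        p->count--;
        free(old->block);
        free(old);
    }
}

/* Take ownership of `block`. Returns 0 if it was pooled, -1 if it was freed
   immediately (larger than the whole budget, or bookkeeping failed). */
int pool_push(ZombiePool *p, void *block, size_t size) {
    PoolNode *node;
    assert(p != NULL && block != NULL);
    if (size > p->budget || (node = malloc(sizeof *node)) == NULL) {
        free(block);
        return -1;
    }
    node->block = block;
    node->size = size;
    node->next = NULL;
    if (p->tail) p->tail->next = node; else p->head = node;
    p->tail = node;
    p->used += size;
    p->count++;
    pool_trim(p);
    return 0;
}
```

With a fixed budget, how long an object survives in the pool depends entirely on how fast new objects arrive, which is why the choice of monitored classes matters so much.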

Exploration and Practice

At first I implemented the solutions from those articles directly in the project and monitored all Objective-C classes. While debugging, I found that merely launching the app took a long time. Too many types were being monitored, including many that are released very frequently, and the result was a severe performance degradation. So I made several optimizations, mainly the following:

  • Filter out non-essential types
  • Avoid performance-intensive operations in dealloc
  • Use a thread pool to move bulk work onto background threads

How do we filter out irrelevant types?

Let's answer the first question first: how to keep irrelevant types out of monitoring. During development I found that many low-level XPC classes, and many other types we had never seen, were being monitored, which was clearly wasteful. To narrow the range of monitored types I made two improvements. Let's start with the first:

Optimization based on dynamic libraries

As we all know, the main executable of our app depends on a number of system libraries, which are linked explicitly. Those dynamic libraries in turn depend on other system libraries, which I call secondary dynamic libraries here. Classes from these secondary libraries are clearly of no concern to us.

Therefore, these classes from secondary system libraries need to be excluded at run time.

So how do you determine which library a class comes from?

My first thought was dladdr, which returns details about an address at run time, including the image it belongs to. With that we could tell directly whether a class needs to be monitored. If you try this, however, you hit an obvious problem: the function is far too slow to support a large number of frequent calls. So my optimization is to take the address of the class name first and determine which library it belongs to from the address range that contains it.

Concretely: at startup, read all images and the address range of each image's __TEXT segment, and store them in an unordered_map. This is only the general idea; a real implementation also has to handle apps that use a segment migration scheme. After this optimization the lookup is much faster.
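The range lookup can be sketched as follows. This is a minimal illustration, not the author's code: instead of an unordered_map it uses a sorted array with binary search, which is a natural fit for "which interval contains this address". The ImageRange type and the function names are hypothetical, and the ranges are assumed not to overlap. On a device the ranges would presumably be filled in at startup from the dyld image list (for example via `_dyld_image_count` and `_dyld_get_image_header`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One loaded image: half-open range [start, end) of its __TEXT segment. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
    int monitored;   /* 1 if classes from this image should be tracked */
} ImageRange;

static int cmp_range(const void *a, const void *b) {
    const ImageRange *x = a, *y = b;
    return (x->start > y->start) - (x->start < y->start);
}

/* Call once at startup, after all ranges have been collected. */
void ranges_sort(ImageRange *ranges, size_t n) {
    qsort(ranges, n, sizeof ranges[0], cmp_range);
}

/* Binary search for the image whose range contains addr; NULL if none. */
const ImageRange *ranges_find(const ImageRange *ranges, size_t n,
                              uintptr_t addr) {
    size_t lo = 0, hi = n;
    assert(ranges != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < ranges[mid].start) {
            hi = mid;
        } else if (addr >= ranges[mid].end) {
            lo = mid + 1;
        } else {
            return &ranges[mid];
        }
    }
    return NULL;
}
```

In a dealloc hook, the address passed to `ranges_find` would be the class name pointer, and a NULL result or `monitored == 0` means the object is released normally.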

Of course, this is not a perfect solution. During development I found that the wild pointer cache pool was exhausted very quickly: in the startup phase, a 30 MB pool was used up within 6 to 11 seconds. That consumption rate is too fast. In my view, under normal use an object should be able to stay in the pool for about 30 seconds for the scheme to be considered acceptable. To investigate, I wrote the type and size of every pooled object to a file. The log showed that even with the dynamic library filter in place, many types we had never seen were still being monitored. This is easy to understand: UIKitCore, libobjc, and other common dynamic libraries contain many types that we never use. So we need to move to finer-grained monitoring and monitor only the system classes we actually use.

Optimization based on bind information

The difficulty in monitoring only the system classes we use is this: how do we get all the system classes used in the project? This step is covered in another article of mine, which explores iOS 15 bind information starting from wild pointer detection (a long but rewarding read), so I won't repeat it here.

What not to do in dealloc

The whole idea is to hook dealloc, so we inevitably inject our own code into the dealloc stage. Here are a few small points to note:

  • Do not call any Objective-C code
  • Do not use objc_setAssociatedObject
  • Do not start processing right away; check isTaggedPointer first

The first point is easy to understand: any Objective-C code you call inside dealloc can cause additional objects to be released, and those releases are themselves monitored.

The second point: objc_setAssociatedObject has a significant memory cost (around 96 bytes), which only becomes noticeable when it is called at very high volume.

The third point, which many schemes do not mention, is the tagged pointer. A tagged pointer stores its value in the pointer itself and is not a heap object, so there is no need to process it further and waste the cache pool. For an introduction to tagged pointers, see the ByteDance APM article "TaggedPointer: Why does an object security cushion fail".
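The tagged pointer check itself is a single bit test. The sketch below assumes the arm64 layout, in which the most significant bit of the pointer marks a tagged pointer. That layout is a private detail of the Objective-C runtime and may differ on other architectures or change between OS versions, so treat the mask as an assumption. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static_assert(sizeof(void *) <= sizeof(uint64_t),
              "pointers must fit in 64 bits");

/* Assumption: arm64 layout, where the top bit marks a tagged pointer. */
#define TAGGED_POINTER_MASK (UINT64_C(1) << 63)

bool is_tagged_pointer(uint64_t ptr_bits) {
    return (ptr_bits & TAGGED_POINTER_MASK) != 0;
}

/* Hypothetical early-out at the top of a dealloc hook: nil and tagged
   pointers own no heap memory, so they never enter the pool. */
bool should_track(uint64_t ptr_bits) {
    if (ptr_bits == 0) return false;
    if (is_tagged_pointer(ptr_bits)) return false;
    return true;
}
```

Doing this test before anything else keeps the cheapest rejection on the hottest path.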

Multithreading and memory optimization

Multithreading

During development the performance numbers were unsatisfactory, so I wrapped the batch release of objects and the insertion of objects into the pool as tasks, added them to a task queue, and processed the tasks on multiple threads. I have since forgotten the exact performance problem 😭. My impression is that without this processing the frame drops were obvious. Recently, out of curiosity, I switched pool insertion back to synchronous processing and found no obvious frame drops, which is rather embarrassing 😓.

Memory optimization

The memory optimizations are mostly about capturing stacks. Practice shows that knowing the class name and the stack where the wild pointer was used is not enough. As a developer, I also want to know exactly where the object was released. So I added the ability to record the release stack. It is of course not enabled for everything, only for types specified in the configuration. Two interesting questions come up here:

  • Can the stack size be optimized?
  • Can memcpy replace vm_read_overwrite when capturing the stack?

Before it is symbolicated, a stack is just a list of addresses. Assuming our stack contains 10 frames, that is 640 bits (80 bytes) of data. On iOS, however, a pointer does not actually use all 64 bits. Each address can therefore be replaced by its offset from (uint64_t)&_mh_execute_header, so that 32 bits represent what previously took 64. The memory consumed by recorded stacks drops by nearly half.
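As a worked example of the saving: 10 frames at 8 bytes each take 80 bytes, while 10 offsets at 4 bytes each take 40 bytes. The sketch below shows the encoding under stated assumptions. `base` stands in for the address of `_mh_execute_header`, and the FRAME_OUT_OF_RANGE sentinel for frames that cannot be expressed as a 32-bit offset (for example, frames below the base or more than 4 GB above it) is a choice made for this sketch, not something the article specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Marker stored when a frame cannot be expressed relative to base
   (an assumption of this sketch). */
#define FRAME_OUT_OF_RANGE UINT32_MAX

/* Store each return address as a 32-bit offset from base.
   Returns the number of frames that did not fit. */
size_t compress_frames(const uint64_t *frames, size_t n,
                       uint64_t base, uint32_t *out) {
    size_t lost = 0;
    assert(frames != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (frames[i] < base || frames[i] - base >= FRAME_OUT_OF_RANGE) {
            out[i] = FRAME_OUT_OF_RANGE;
            lost++;
        } else {
            out[i] = (uint32_t)(frames[i] - base);
        }
    }
    return lost;
}

/* Rebuild the absolute address; returns 0 for an out-of-range marker. */
uint64_t expand_frame(uint32_t offset, uint64_t base) {
    return offset == FRAME_OUT_OF_RANGE ? 0 : base + offset;
}
```

A real implementation would need its own policy for frames that fall outside the 32-bit window, since dropping them loses information.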

There is another interesting question. Much of the stack-capturing code you see uses vm_read_overwrite. Why not use memcpy, or simply dereference the pointers? vm_read_overwrite is very slow; even at a small rollout percentage, calling it in dealloc will make the app stutter. So can we use memcpy in dealloc at all?

This question made me realize that I did not really understand the memory mechanism, and that the memory part is very interesting.

Simply put, vm_read_overwrite reads addresses safely: it checks the address, so even an illegal address does not stop the program from running normally. memcpy, on the other hand, is direct but unsafe. Most stack-capturing code uses vm_read_overwrite because it backtraces other threads without suspending all threads, so it may read an address that has become invalid, and vm_read_overwrite handles that case safely. We can therefore use memcpy instead of vm_read_overwrite as long as we can guarantee that the stack of the thread being walked is not destroyed during the walk. In our case the backtrace runs synchronously on the thread that is executing dealloc, so memcpy can be used to optimize the call.
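A frame-pointer walk that uses plain memcpy might look like the following minimal sketch. It assumes the usual frame record layout of [previous frame pointer][return address] and adds a bounds check against the current thread's stack range, which is a precaution added for this sketch rather than something the article describes. On a device the starting frame pointer would typically come from `__builtin_frame_address(0)` and the bounds from `pthread_get_stackaddr_np` and `pthread_get_stacksize_np`. The names FrameRecord and walk_frames are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Frame record when frame pointers are kept:
   [previous frame pointer][return address]. */
typedef struct {
    uintptr_t prev_fp;
    uintptr_t return_addr;
} FrameRecord;

/* Walk a frame-pointer chain with memcpy. Each frame pointer is checked
   against [stack_lo, stack_hi) before it is read. Returns frames written. */
size_t walk_frames(uintptr_t fp, uintptr_t stack_lo, uintptr_t stack_hi,
                   uintptr_t *out, size_t max_frames) {
    size_t n = 0;
    assert(out != NULL && stack_lo <= stack_hi);
    while (n < max_frames) {
        FrameRecord rec;
        /* Stop on anything outside the stack or misaligned. */
        if (fp < stack_lo || fp >= stack_hi) break;
        if (stack_hi - fp < sizeof rec) break;
        if (fp % sizeof(uintptr_t) != 0) break;
        memcpy(&rec, (const void *)fp, sizeof rec);
        out[n++] = rec.return_addr;
        /* Callers live at higher addresses; anything else is a broken
           chain (or the end of it), so stop instead of looping. */
        if (rec.prev_fp <= fp) break;
        fp = rec.prev_fp;
    }
    return n;
}
```

Because the walk only touches memory inside the running thread's own stack, no read can hit an unmapped page, which is the guarantee vm_read_overwrite otherwise provides.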

Online results

After all that, do you dare to use this online? Does it actually find problems? The code has now been online for some time, covering 300,000 users. In general it does collect real problems. For example, investigation based on the captured information showed that the following problem was caused by improper use of multiple threads.

Abstracted, the problem looks like this:

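// Assumes dic is a nonatomic strong property (the declaration is not shown).
self.dic = [NSDictionary new];
for (int i = 0; i < 3000; i++) {
    dispatch_async(dispatch_get_global_queue(0, 0), ^{
        // Concurrent setter calls race: two threads can release the same old value.
        self.dic = @{@"name": @(i)};
    });
}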

Deficiencies and Improvements

A technical solution should not only present its advantages but also show its disadvantages and shortcomings. At this stage I am dissatisfied with three points:

  • Memory, memory, memory! The actual memory consumption is currently larger than the consumption we record. Some of the functions we call consume memory internally in ways we have not yet tracked down, so in extreme cases the actual memory keeps growing while the memory cache is not released. The classic example is objc_setAssociatedObject, which maintains an internal unordered_map.
  • No dedicated control platform. The rollout is currently controlled on the server side, which makes flexible control inconvenient. My idea is to control the rollout by percentage of device model, OS, app version, and so on, and to distribute detection flexibly across devices. For example, if there are 8000 classes in total, they can be distributed to 8 kinds of devices according to the proportion of devices, so that each device carries only 1000 monitoring tasks and the pressure is greatly reduced. This is still being planned; the holiday is coming, so it will wait until after the New Year ~
  • No general symbolication platform. The stacks we report are not symbolicated yet and have to be symbolicated locally, which puts demands on developers and is inconvenient. Also being planned; holiday first, more after the New Year ~
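The distributed detection idea from the second point can be sketched as a stable hash that assigns each class name to one bucket, with each device monitoring only the classes in its own bucket. This is a hypothetical illustration of a plan the article only outlines. It uses equal-sized buckets rather than buckets weighted by device share, the function names are invented, and how a device obtains its bucket number (for example from the server configuration) is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, a simple stable string hash; any stable hash would do. */
uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    assert(s != NULL);
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Map a class name to one of bucket_count groups. */
uint32_t class_bucket(const char *class_name, uint32_t bucket_count) {
    assert(bucket_count > 0);
    return fnv1a(class_name) % bucket_count;
}

/* A device monitors a class only when the class falls in its own bucket. */
int device_monitors_class(uint32_t device_bucket, const char *class_name,
                          uint32_t bucket_count) {
    return class_bucket(class_name, bucket_count) == device_bucket;
}
```

With 8 buckets, every class is still covered by one group of devices, while each device carries roughly one eighth of the monitoring load.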

References

1. Baymax Health System: a runtime crash auto-repair system for iOS apps

2. JJException

3. iOS wild pointer positioning: wild pointer sniffer

4. Summary of iOS wild pointer positioning

5. Brief discussion on Crash capture and protection in iOS

6. xiejunyi's Blog

7. Why does Tagged Pointer fail