preface

There are many measures of application performance. From a user's perspective, lag is the most obvious one, but an app that never seems to lag can still have performance problems. From a development perspective, code and algorithms are judged by space complexity and time complexity, which correspond to two important pieces of computer hardware: memory and the CPU. Only when an application performs well both outwardly (no visible lag) and inwardly (efficient use of memory and CPU) can it truly be said to be performing well. An application performance monitoring (APM) system is therefore of great help to developers: it can reveal the performance bottlenecks in your application.

CPU

Threads are the smallest unit of program execution; in other words, our application is actually composed of multiple threads running on the CPU. To figure out how much CPU your application is using, you need to sum the CPU usage of all the threads in the process. The thread_basic_info structure encapsulates the basic information about a single thread:

struct thread_basic_info {
    time_value_t    user_time;      /* user run time */
    time_value_t    system_time;    /* system run time */
    integer_t       cpu_usage;      /* scaled cpu usage percentage */
    policy_t        policy;         /* scheduling policy in effect */
    integer_t       run_state;      /* run state (see below) */
    integer_t       flags;          /* various flags (see below) */
    integer_t       suspend_count;  /* suspend count for thread */
    integer_t       sleep_time;     /* number of seconds that thread
                                   has been sleeping */
};

The question is how to get this information. iOS is built on the Darwin kernel, which provides the task_threads interface to get the list of all threads in a task, and the thread_info interface to get information about an individual thread:

kern_return_t task_threads
(
    task_inspect_t target_task,
    thread_act_array_t *act_list,
    mach_msg_type_number_t *act_listCnt
);

kern_return_t thread_info
(
    thread_inspect_t target_act,
    thread_flavor_t flavor,
    thread_info_t thread_info_out,
    mach_msg_type_number_t *thread_info_outCnt
);

In the first function, target_task identifies the target process; use mach_task_self() for the current process. Its other two parameters are pointers that return the thread list and the thread count, respectively. In the second function, the flavor parameter selects which kind of thread information to retrieve via macro definitions; THREAD_BASIC_INFO is used here. The various parameter types involved (task_inspect_t, thread_act_array_t, and so on) are mostly aliases of the mach_port_t type.

Putting this together gives the following code for obtaining the application's CPU usage. Each thread's cpu_usage field is a scaled integer, where TH_USAGE_SCALE represents 100% of one core:

- (double)currentUsage {
    double usageRatio = 0;
    thread_info_data_t thinfo;
    thread_act_array_t threads;
    thread_basic_info_t basic_info_t;
    mach_msg_type_number_t count = 0;

    if (task_threads(mach_task_self(), &threads, &count) == KERN_SUCCESS) {
        for (int idx = 0; idx < count; idx++) {
            // thread_info updates this in/out count, so reset it for each thread
            mach_msg_type_number_t thread_info_count = THREAD_INFO_MAX;
            if (thread_info(threads[idx], THREAD_BASIC_INFO, (thread_info_t)thinfo, &thread_info_count) == KERN_SUCCESS) {
                basic_info_t = (thread_basic_info_t)thinfo;
                if (!(basic_info_t->flags & TH_FLAGS_IDLE)) {
                    usageRatio += basic_info_t->cpu_usage / (double)TH_USAGE_SCALE;
                }
            }
        }
        assert(vm_deallocate(mach_task_self(), (vm_address_t)threads, count * sizeof(thread_t)) == KERN_SUCCESS);
    }
    return usageRatio * 100.;
}
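The arithmetic inside the loop is worth isolating. On Darwin, cpu_usage is reported as an integer scaled so that TH_USAGE_SCALE equals one fully busy core, and summing the per-thread values gives a process total that can exceed 100% on multi-core devices. A minimal, portable C sketch of just that accumulation (TH_USAGE_SCALE is defined locally here as a stand-in for the Mach header constant, and the sample values are illustrative):

```c
#include <assert.h>

/* Local stand-in for Darwin's TH_USAGE_SCALE: cpu_usage is scaled so
 * that this value corresponds to 100% of one core. */
#define TH_USAGE_SCALE 1000

/* Sums scaled per-thread usage values into a percentage, mirroring
 * the accumulation loop in -currentUsage above. */
static double total_cpu_percent(const int *cpu_usage, int thread_count) {
    double usageRatio = 0;
    for (int i = 0; i < thread_count; i++) {
        usageRatio += cpu_usage[i] / (double)TH_USAGE_SCALE;
    }
    return usageRatio * 100.0;
}
```

Note the cast to double before dividing: without it, every thread contributing less than a full core would truncate to zero.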

practice

Because hardware performance has grown explosively, it is more revealing to profile on older devices when optimizing code. Also, high CPU utilization is not in itself a problem: higher usage means the CPU is being used more fully, and on multi-core devices usage can legitimately exceed 100%. Instead of fearing the number, watch the FPS to judge whether the load actually affects the user experience.

Take a recent optimization of the author's as an example. On an iPhone 5c, a performance bottleneck appeared the first time the city-selection list was entered after the application started. CPU monitoring showed utilization reaching around 187% during the process, while the frame rate dropped to around 27 fps. The original pseudocode looked like this:

static BOOL kmc_should_update_cities = YES;

- (void)fetchCities {
    if (kmc_should_update_cities) {
        [self fetchCitiesFromRemoteServer: ^(NSDictionary * allCities) {
            [self updateCities: allCities];
            [allCities writeToFile: [self localPath] atomically: YES];
        }];
    } else {
        [self fetchCitiesFromLocal: ^(NSDictionary * allCities) {
            [self updateCities: allCities];
        }];
    }
}

The data processing in the code above happens almost entirely on child threads; the main thread does very little work, yet the lag still occurred. My guess is that the CPU schedules all threads equally and does not grant the main thread any extra processing time, so when the child threads' workload exceeds a certain threshold, the main thread is still affected. Moving time-consuming tasks onto child threads is therefore necessary, but not always sufficient. Profiling made it easy to find the most CPU-intensive part of the code: the file write.

The concurrency flaw I described in my earlier article on multithreading traps applies to the code above as well: writing the data to disk consumes CPU resources until the write operation completes. If the other cores are also busy at that moment, the application can hardly avoid lag. The author's solution was to use a stream to write the data to disk in fragments; thanks to the design of NSStream itself, the write proceeds smoothly:

case NSStreamEventHasSpaceAvailable: {
    LXDDispatchQueueAsyncBlockInUtility(^{
        uint8_t * writeBytes = (uint8_t *)_writeData.bytes;
        writeBytes += _currentOffset;
        NSUInteger dataLength = _writeData.length;
            
        NSUInteger writeLength = (dataLength - _currentOffset > kMaxBufferLength) ? kMaxBufferLength : (dataLength - _currentOffset);
        uint8_t buffer[writeLength];
        (void)memcpy(buffer, writeBytes, writeLength);
        writeLength = [self.outputStream write: buffer maxLength: writeLength];
        _currentOffset += writeLength;
    });
} break;
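Stripped of the NSStream specifics, the fragmenting idea is simply: write a large buffer in fixed-size chunks so that no single call monopolizes the CPU. A minimal sketch in portable C, where the chunk size kMaxBufferLength is an illustrative assumption rather than a measured value:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative chunk size; the real value should come from profiling. */
#define kMaxBufferLength 1024

/* Writes `length` bytes from `bytes` to `fp` in chunks of at most
 * kMaxBufferLength, mirroring the offset bookkeeping in the
 * NSStreamEventHasSpaceAvailable handler above.
 * Returns the total number of bytes written. */
static size_t chunked_write(FILE *fp, const uint8_t *bytes, size_t length) {
    size_t offset = 0;
    while (offset < length) {
        size_t remaining = length - offset;
        size_t writeLength = remaining > kMaxBufferLength ? kMaxBufferLength : remaining;
        size_t written = fwrite(bytes + offset, 1, writeLength, fp);
        if (written == 0) break;  /* stop on write failure */
        offset += written;
    }
    return offset;
}
```

In the NSStream version, the same bookkeeping is driven by delegate events instead of a loop, which is what lets the writes interleave with other work.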

After replacing the file write with the stream write, the frame rate when re-entering the page rose to about 40 fps, but a short lag remained after the operation. A further guess was that while the stream writes asynchronously, the data processing also runs asynchronously, and those threads compete with the main thread for CPU resources. The data processing was therefore deferred until the stream was ready:

case NSStreamEventOpenCompleted: {
    [self fetchCitiesFromLocal: ^(NSDictionary * allCities) {
        [self updateCities: allCities];
    }];
} break;

After this step, the frame rate stayed above 52 fps, dropping only to around 40 during page transitions. Other optimizations included lazy loading of data during UI presentation, but the two steps above better illustrate CPU-oriented optimization, so I won't go into more detail.

memory

A process's memory usage information is likewise stored in a structure, mach_task_basic_info, which holds several memory statistics:

#define MACH_TASK_BASIC_INFO     20         /* always 64-bit basic info */
struct mach_task_basic_info {
    mach_vm_size_t  virtual_size;       /* virtual memory size (bytes) */
    mach_vm_size_t  resident_size;      /* resident memory size (bytes) */
    mach_vm_size_t  resident_size_max;  /* maximum resident memory size (bytes) */
    time_value_t    user_time;          /* total user run time for
                                           terminated threads */
    time_value_t    system_time;        /* total system run time for
                                           terminated threads */
    policy_t        policy;             /* default policy for new threads */
    integer_t       suspend_count;      /* suspend count for task */
};

The corresponding fetch function is task_info; it takes the target task, the flavor of information to retrieve, a structure to receive the information, and a count variable:

kern_return_t task_info
(
    task_name_t target_task,
    task_flavor_t flavor,
    task_info_t task_info_out,
    mach_msg_type_number_t *task_info_outCnt
);

Since the sizes in mach_task_basic_info are in bytes, one more layer of conversion is needed before they can be displayed. To make that conversion convenient, the author uses a structure to store the memory-related information:

#ifndef NBYTE_PER_MB
#define NBYTE_PER_MB (1024 * 1024)
#endif

typedef struct LXDApplicationMemoryUsage {
    double usage;   ///< used memory (MB)
    double total;   ///< total device memory (MB)
    double ratio;   ///< occupancy ratio
} LXDApplicationMemoryUsage;

The code to get memory usage is as follows:

- (LXDApplicationMemoryUsage)currentUsage {
    struct mach_task_basic_info info;
    mach_msg_type_number_t count = MACH_TASK_BASIC_INFO_COUNT;
    if (task_info(mach_task_self(), MACH_TASK_BASIC_INFO, (task_info_t)&info, &count) == KERN_SUCCESS) {
        return (LXDApplicationMemoryUsage){
            // Cast before dividing: both operands are integers, so the
            // division would otherwise truncate.
            .usage = (double)info.resident_size / NBYTE_PER_MB,
            .total = (double)[NSProcessInfo processInfo].physicalMemory / NBYTE_PER_MB,
            .ratio = (double)info.resident_size / [NSProcessInfo processInfo].physicalMemory,
        };
    }
    return (LXDApplicationMemoryUsage){ 0 };
}
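The conversion itself is simple but easy to get wrong: resident_size and physicalMemory are 64-bit integers, so dividing them without a cast silently truncates toward zero. A small C sketch of the conversion (the sample sizes are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define NBYTE_PER_MB (1024 * 1024)

/* Converts a byte count to megabytes as a floating-point value. */
static double bytes_to_mb(uint64_t bytes) {
    return (double)bytes / NBYTE_PER_MB;
}

/* Occupancy ratio: resident memory over total physical memory.
 * The cast is essential; integer division here would yield 0. */
static double usage_ratio(uint64_t resident, uint64_t physical) {
    return (double)resident / physical;
}
```

This is why the Objective-C version above casts each operand to double before dividing.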

display

Compared with other device information, memory and CPU monitoring don't offer much that is interesting: retrieving both is a boring, fixed piece of code, and the numbers are mostly displayed during the development phase to observe performance. Designing a good sampling cycle and presentation is the relatively fun part of the process. The author's final monitoring view looks like this:

For some reason, task_info always reports about 20 MB more memory than Xcode itself shows, so subtract that when you use it. To ensure the overlay always stays on top, we create a UIWindow singleton, set its windowLevel to CGFLOAT_MAX, and override a few setters so those properties cannot be modified afterwards:

- (instancetype)initWithFrame: (CGRect)frame {
    if (self = [super initWithFrame: frame]) {
        [super setUserInteractionEnabled: NO];
        [super setWindowLevel: CGFLOAT_MAX];

        self.rootViewController = [UIViewController new];
        [self makeKeyAndVisible];
    }
    return self;
}

- (void)setWindowLevel: (UIWindowLevel)windowLevel { }
- (void)setBackgroundColor: (UIColor *)backgroundColor { }
- (void)setUserInteractionEnabled: (BOOL)userInteractionEnabled { }

The three tab fields are drawn asynchronously so that updating the text does not affect the main thread. The core code:

CGSize textSize = [attributedText.string boundingRectWithSize: size options: NSStringDrawingUsesLineFragmentOrigin attributes: @{ NSFontAttributeName: self.font } context: nil].size;
textSize.width = ceil(textSize.width);
textSize.height = ceil(textSize.height);
    
CGMutablePathRef path = CGPathCreateMutable();
CGPathAddRect(path, NULL, CGRectMake((size.width - textSize.width) / 2, 5, textSize.width, textSize.height));
CTFramesetterRef frameSetter = CTFramesetterCreateWithAttributedString((CFAttributedStringRef)attributedText);
CTFrameRef frame = CTFramesetterCreateFrame(frameSetter, CFRangeMake(0, attributedText.length), path, NULL);
CTFrameDraw(frame, context);
    
UIImage * contents = UIGraphicsGetImageFromCurrentImageContext();
UIGraphicsEndImageContext();
CFRelease(frameSetter);
CFRelease(frame);
CFRelease(path);
dispatch_async(dispatch_get_main_queue(), ^{
    self.layer.contents = (id)contents.CGImage;
});

other

In addition to monitoring the CPU and memory consumed by the application itself, Darwin provides interfaces for monitoring the memory and CPU usage of the device as a whole, and I encapsulated two additional classes to capture those numbers. Finally, everything is wrapped in a single LXDResourceMonitor class, which uses an enumeration of bit flags to control what is monitored:

typedef NS_ENUM(NSInteger, LXDResourceMonitorType) {
    LXDResourceMonitorTypeDefault = (1 << 2) | (1 << 3),
    LXDResourceMonitorTypeSystemCpu = 1 << 0,          ///< monitor system CPU usage, low priority
    LXDResourceMonitorTypeSystemMemory = 1 << 1,       ///< monitor system memory usage, low priority
    LXDResourceMonitorTypeApplicationCpu = 1 << 2,     ///< monitor application CPU usage, high priority
    LXDResourceMonitorTypeApplicationMemoty = 1 << 3,  ///< monitor application memory usage, high priority
};

Bit operations are used here, which is more concise and efficient than the alternatives. This largely completes the APM series; beyond the APM techniques commonly seen online, the author will also cover RunLoop-based optimization techniques.
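Checking and combining such flags takes only plain bitwise operators. A minimal C sketch of the idea, with enum names simplified from the LXDResourceMonitorType values above:

```c
#include <assert.h>

/* Mirrors the LXDResourceMonitorType bit flags above
 * (names shortened for illustration). */
enum {
    MonitorSystemCpu         = 1 << 0,
    MonitorSystemMemory      = 1 << 1,
    MonitorApplicationCpu    = 1 << 2,
    MonitorApplicationMemory = 1 << 3,
    MonitorDefault           = (1 << 2) | (1 << 3),
};

/* Returns nonzero when `type` includes `flag`. */
static int monitors(int type, int flag) {
    return (type & flag) != 0;
}
```

Adding a flag is `type |= flag`, and removing one is `type &= ~flag`, which is why a bitmask beats an array of booleans here.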

The Demo in this