This article analyzes common performance problems in building iOS interfaces and their solutions in detail. It also presents an open-source Weibo feed implementation and shows, with real code, how to build smooth interactions.

Index

The demo project
How the screen displays an image
Causes and solutions of interface lag
Causes and solutions of CPU resource consumption
Causes and solutions of GPU resource consumption
AsyncDisplayKit
    The origin of ASDK
    ASDK resources
    Basic principles of ASDK
    ASDK asynchronous and concurrent operation
    Runloop task distribution
Weibo Demo performance optimization tips
    Pre-layout
    Pre-rendering
    Asynchronous drawing
    Global concurrency control
    More efficient asynchronous image loading
    Other areas that can be improved
How to measure interface fluency

The demo project

Before we start the technical discussion, you can download the Demo I wrote and run it on a real device: github.com/ibireme/YYK… . The Demo contains a Weibo feed list, a post view, and a Twitter feed list. To be fair, I copied all the interfaces and interactions from the official apps, and all the data was captured from the official apps as well. You can also capture your own data and replace the data in the Demo for comparison. Although the official apps have far more functionality behind them, that does not make a big difference to interaction performance.

The Demo requires iOS 6 at a minimum, so you can try it on older devices. In my tests, even on an iPhone 4S or an iPad 3, the Demo list maintained a smooth 50–60 FPS during fast scrolling, while the list views of apps such as Weibo and WeChat Moments stuttered quite badly when scrolled.

The Weibo Demo has about 4000 lines of code, and the Twitter Demo only about 2000. The only third-party dependency is YYKit, and the number of files is small, which makes the code easy to browse. All right, on to the main text.

How the screen displays an image

Let’s start with the old CRT displays. A CRT’s electron gun scans from top to bottom, line by line; when the scan is finished the display shows one frame, and the gun then returns to its initial position for the next scan. To synchronize the display with the system’s video controller, the display (or other hardware) generates a series of timing signals using a hardware clock. When the gun moves to a new line and is ready to scan, the display emits a horizontal synchronization signal, or HSync; when a frame has been drawn and the gun has returned to its initial position, the display emits a vertical synchronization signal, or VSync, before the next frame is drawn. The display usually refreshes at a fixed rate, which is the frequency at which the VSync signal is generated. Although most of today’s devices use LCD screens, the principle remains the same.

Generally speaking, the CPU, GPU, and display in a computer system cooperate as follows: the CPU computes the display content and submits it to the GPU; after the GPU finishes rendering, it puts the result into the frame buffer; the video controller then reads the frame buffer line by line according to the VSync signal and passes the data to the display, possibly after digital-to-analog conversion.

In the simplest case there is only one frame buffer, and reading from and refreshing that buffer creates efficiency problems. To solve them, display systems usually introduce two buffers, i.e. a double buffering mechanism. The GPU pre-renders one frame into a buffer for the video controller to read, and once the next frame is rendered, the GPU simply points the video controller at the second buffer. This greatly improves efficiency.

Double buffering solves the efficiency problem, but it introduces a new one. If the GPU submits a new frame and swaps the two buffers while the video controller is still reading, that is, while the screen has displayed only part of a frame, the video controller will show the lower half of the new frame on screen, causing the image to tear.

To solve this problem, GPUs usually have a mechanism called vertical synchronization (VSync, also written as V-Sync). When VSync is enabled, the GPU waits for a VSync signal from the display before rendering a new frame and updating the buffer. This solves the tearing problem and improves smoothness, but it requires more computing resources and introduces some latency.

So what do mainstream mobile devices do? From what can be found online, iOS devices always use double buffering with VSync enabled. On Android, Google did not introduce this mechanism until version 4.1; Android currently uses triple buffering plus VSync.

Causes and solutions of interface lag

After a VSync signal arrives, the system’s graphics service notifies the App through CADisplayLink and similar mechanisms, and the App’s main thread starts computing the display content on the CPU: view creation, layout calculation, image decoding, text drawing, and so on. The CPU then submits the computed content to the GPU, which transforms, composites, and renders it. The GPU submits the rendering result to the frame buffer and waits for the next VSync signal before it is displayed on screen. Because of the VSync mechanism, if the CPU or GPU does not finish submitting its content within one VSync interval, that frame is discarded and shown at a later opportunity, while the display keeps the previous content unchanged. That is why the interface stutters.

In other words, no matter whether it is the CPU or the GPU that blocks the display pipeline, frames will be dropped. Therefore, during development you need to evaluate and optimize CPU and GPU pressure separately.

Causes and solutions of CPU resource consumption

Object creation

Creating an object allocates memory, adjusts properties, and sometimes even reads files, all of which consume CPU resources. You can optimize performance by replacing heavyweight objects with lightweight ones. For example, CALayer is much lighter than UIView, so content that does not need to respond to touch events is better displayed with CALayer. If an object does not involve UI operations, try to create it on a background thread; unfortunately, controls that contain a CALayer can only be created and manipulated on the main thread. Creating view objects through a Storyboard is much more expensive than creating them directly in code, so Storyboard is not a good choice for performance-sensitive interfaces.

Try to delay object creation as long as possible and spread the creation work across multiple tasks. This is somewhat troublesome to implement and does not bring many advantages, but if it is easy to do, it is worth trying. If objects can be reused and the cost of reuse is lower than releasing and creating new ones, such objects should be kept in a cache pool and reused whenever possible.

Object adjustment

Adjusting objects is another frequent source of CPU consumption. A special note on CALayer: CALayer has no properties of its own internally. When a property method is called, it temporarily adds the method to the object at runtime through resolveInstanceMethod and stores the corresponding property value in an internal Dictionary; it also notifies its delegate, creates animations, and so on, which is very resource-intensive. UIView’s display-related properties (frame/bounds/transform, for example) are actually mapped from CALayer properties, so adjusting these UIView properties consumes far more resources than adjusting ordinary properties. You should minimize unnecessary property changes in your application.

When the view hierarchy is adjusted, there are also many method calls and notifications between UIView and CALayer, so when optimizing performance you should try to avoid adjusting the view hierarchy or adding and removing views.

Object destruction

Destroying objects consumes few resources individually, but it adds up. Usually, when a container class holds a large number of objects, the cost of destroying them becomes noticeable. Likewise, if an object can be released on a background thread, move it there. Here’s a little tip: capture the object in a block, dispatch the block to a background queue, and send the object a harmless message to avoid compiler warnings, so that it is destroyed on the background thread.

NSArray *tmp = self.array;
self.array = nil;
dispatch_async(queue, ^{
    [tmp class];
});

Layout calculation

View layout calculation is the most common source of CPU consumption in an App. If the view layout is computed ahead of time on a background thread and the result is cached, this part rarely causes performance problems.

No matter which technology you use to lay out your views, it ultimately comes down to adjusting UIView properties such as frame/bounds/center. As mentioned above, adjusting these properties is very expensive, so try to compute the layout in advance and, when needed, apply the corresponding properties in a single pass, instead of calculating and adjusting them repeatedly and frequently.

Autolayout

Autolayout is a technology promoted by Apple itself, and in most cases it greatly improves development efficiency, but for complex views it can often cause serious performance problems. As the number of views grows, the CPU consumption caused by Autolayout rises exponentially. For more information, see pilky.me/36/. If you don’t want to adjust frame and similar properties by hand, you can replace them with utility methods (such as the common left/right/top/bottom/width/height shortcut properties), or use frameworks such as ComponentKit or AsyncDisplayKit.
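
As a rough illustration of those shortcut properties, here is a minimal sketch of a UIView category (the category and property names are my own; YYKit and similar libraries ship more complete versions). Setting view.left or view.width this way costs no more than a plain frame assignment:

#import <UIKit/UIKit.h>

@interface UIView (LayoutShortcut)
@property (nonatomic) CGFloat left;   // frame.origin.x
@property (nonatomic) CGFloat top;    // frame.origin.y
@property (nonatomic) CGFloat width;  // frame.size.width
@end

@implementation UIView (LayoutShortcut)
- (CGFloat)left { return self.frame.origin.x; }
- (void)setLeft:(CGFloat)left {
    CGRect frame = self.frame;
    frame.origin.x = left;
    self.frame = frame;
}
- (CGFloat)top { return self.frame.origin.y; }
- (void)setTop:(CGFloat)top {
    CGRect frame = self.frame;
    frame.origin.y = top;
    self.frame = frame;
}
- (CGFloat)width { return self.frame.size.width; }
- (void)setWidth:(CGFloat)width {
    CGRect frame = self.frame;
    frame.size.width = width;
    self.frame = frame;
}
@end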

Text calculation

If an interface contains a large amount of text (as in Weibo or WeChat Moments feeds), calculating the width and height of that text takes up a large share of resources and is unavoidable. If you have no special requirements for text display, you can follow the approach of UILabel’s internal implementation: use [NSAttributedString boundingRectWithSize:options:context:] to calculate the text size and -[NSAttributedString drawWithRect:options:context:] to draw the text. Although these two methods perform well, they still need to run on background threads to avoid blocking the main thread.
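
For example, here is a minimal sketch of measuring attributed text off the main thread (the function name and the completion block are my own illustration):

#import <UIKit/UIKit.h>

static void MeasureTextAsync(NSAttributedString *text, CGFloat maxWidth,
                             void (^completion)(CGSize size)) {
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        // Measure on a background thread; boundingRectWithSize is safe to call here.
        CGRect rect = [text boundingRectWithSize:CGSizeMake(maxWidth, CGFLOAT_MAX)
                                         options:NSStringDrawingUsesLineFragmentOrigin | NSStringDrawingUsesFontLeading
                                         context:nil];
        CGSize size = CGSizeMake(ceil(rect.size.width), ceil(rect.size.height));
        dispatch_async(dispatch_get_main_queue(), ^{
            completion(size); // apply the cached size to the UI on the main thread
        });
    });
}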

If you draw text with CoreText, you can create a CoreText typesetting object yourself, do the calculation yourself, and keep the object around so it can be used again for the later drawing.
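
A minimal sketch of that approach with CTFramesetter is shown below (in real code you would keep the framesetter cached together with the computed size, so the later drawing pass can reuse it with CTFrameDraw):

#import <Foundation/Foundation.h>
#import <CoreText/CoreText.h>

static CGSize TextSizeWithCoreText(NSAttributedString *text, CGFloat maxWidth) {
    CTFramesetterRef framesetter =
        CTFramesetterCreateWithAttributedString((__bridge CFAttributedStringRef)text);
    CGSize fit = CTFramesetterSuggestFrameSizeWithConstraints(framesetter,
                                                              CFRangeMake(0, text.length),
                                                              NULL,
                                                              CGSizeMake(maxWidth, CGFLOAT_MAX),
                                                              NULL);
    CFRelease(framesetter); // in practice, keep the framesetter for the drawing pass instead
    return CGSizeMake(ceil(fit.width), ceil(fit.height));
}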

Text rendering

All the text controls you can see on screen, including UIWebView, are ultimately typeset and drawn into bitmaps through CoreText at the lowest level. Common text controls (UILabel, UITextView, and so on) do both typesetting and drawing on the main thread, so when a large amount of text is displayed, CPU pressure becomes very high. There is only one solution: a custom text control that draws text asynchronously with TextKit or the lower-level CoreText. Once a CoreText object is created, the text’s width and height can be read from it directly, avoiding repeated calculation (once when the UILabel is resized and again when the UILabel is drawn). A CoreText object also takes up little memory and can be cached for multiple later renders.

Image decoding

When you create an image with the UIImage or CGImageSource methods, the image data is not decoded immediately. Only when the image is set on a UIImageView or on calayer.contents and the CALayer is about to be submitted to the GPU is the data inside the CGImage decoded. This step happens on the main thread and cannot be avoided. To get around this mechanism, a common approach is to draw the image into a CGBitmapContext on a background thread and then create the image directly from that bitmap. The popular networked image libraries all provide this feature.
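
A minimal sketch of that trick is shown below; it is meant to run on a background queue, and the bitmap format used here is one common choice, not the only one:

#import <UIKit/UIKit.h>

static UIImage *DecodedImage(UIImage *image) {
    CGImageRef cgImage = image.CGImage;
    if (!cgImage) return image;
    size_t width  = CGImageGetWidth(cgImage);
    size_t height = CGImageGetHeight(cgImage);
    CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB();
    CGContextRef ctx = CGBitmapContextCreate(NULL, width, height, 8, 0, colorSpace,
                                             kCGBitmapByteOrder32Host | kCGImageAlphaPremultipliedFirst);
    CGColorSpaceRelease(colorSpace);
    if (!ctx) return image;
    // Drawing the image here forces the compressed data to be decoded into the bitmap.
    CGContextDrawImage(ctx, CGRectMake(0, 0, width, height), cgImage);
    CGImageRef decoded = CGBitmapContextCreateImage(ctx);
    CGContextRelease(ctx);
    UIImage *result = [UIImage imageWithCGImage:decoded
                                          scale:image.scale
                                    orientation:image.imageOrientation];
    CGImageRelease(decoded);
    return result;
}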

Image drawing

Drawing an image usually refers to drawing into a canvas with the methods that begin with CG, and then creating and displaying an image from that canvas. The most common place this happens is inside [UIView drawRect:]. Since CoreGraphics methods are usually thread-safe, image drawing can easily be moved to a background thread. A simple asynchronous drawing process looks roughly like this (the real thing is more complicated, but the principle is the same):

- (void)display {
    dispatch_async(backgroundQueue, ^{
        CGContextRef ctx = CGBitmapContextCreate(...);
        // draw in context...
        CGImageRef img = CGBitmapContextCreateImage(ctx);
        CFRelease(ctx);
        dispatch_async(mainQueue, ^{
            layer.contents = (__bridge id)img; // bridge cast needed under ARC
            CGImageRelease(img);               // the layer retains the contents, so drop our reference
        });
    });
}

Causes and solutions of GPU resource consumption

Compared with the CPU, what the GPU can do is relatively simple: it takes the submitted textures and vertex descriptions, applies transforms, blends and renders them, and then outputs the result to the screen. What you usually see on screen consists mainly of textures (bitmaps) and shapes (vector graphics simulated with triangles).

Texture rendering

All bitmaps, including images, text, and rasterized content, are eventually submitted from memory to video memory and bound as GPU textures. Both the submission to video memory and the GPU’s transformation and rendering of those textures consume a lot of GPU resources. When a large number of images is displayed within a short time (for example, when a TableView full of images is scrolled quickly), CPU usage stays low while GPU usage is very high, and the interface still drops frames. The only way to avoid this is to minimize the number of images displayed in a short period and to merge multiple images into one for display wherever possible.

When an image is too large and exceeds the GPU’s maximum texture size, it must first be preprocessed by the CPU, which brings extra resource consumption for both the CPU and the GPU. Currently, on the iPhone 4S and later, the maximum texture size is 4096 × 4096; more details can be found here: iosres.com. So try not to let images or views exceed this size.

Blending of views

When multiple views (or CALayers) are displayed on top of one another, the GPU first blends them together. If the view structure is too complex, this blending process also consumes a lot of GPU resources. To reduce GPU consumption in this situation, an application should keep the number of views and the depth of nesting as low as possible, and mark opaque views with the opaque property to avoid useless alpha-channel compositing. Of course, you can also pre-render multiple views into a single image for display.

Graphics generation

CALayer’s border, rounded corners, shadow, and mask, as well as CAShapeLayer’s vector graphics display, usually trigger offscreen rendering, which usually happens on the GPU. When a list view shows a large number of CALayers with rounded corners and is scrolled quickly, you can observe that the GPU is at full load while CPU usage stays low; the interface still scrolls normally, but the average frame rate drops to a very low level. To avoid this, you can try turning on the CALayer.shouldRasterize property, but that shifts the offscreen rendering onto the CPU. For cases where only rounded corners are needed, you can also simulate the same visual effect by overlaying the original view with an already-drawn rounded-corner image. The most radical solution is to draw the content that needs to be displayed as an image on a background thread, avoiding rounded corners, shadows, masks, and similar properties altogether.
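
A minimal sketch of the pre-drawn rounded-corner image idea; it only uses an image context, so it can run on a background thread, and the result is simply assigned to the layer contents instead of using cornerRadius and masksToBounds:

#import <UIKit/UIKit.h>

static UIImage *RoundedImage(UIImage *image, CGFloat cornerRadius) {
    CGRect rect = (CGRect){CGPointZero, image.size};
    UIGraphicsBeginImageContextWithOptions(rect.size, NO, image.scale);
    // Clip to a rounded rect, then draw the original image into it.
    [[UIBezierPath bezierPathWithRoundedRect:rect cornerRadius:cornerRadius] addClip];
    [image drawInRect:rect];
    UIImage *result = UIGraphicsGetImageFromCurrentImageContext();
    UIGraphicsEndImageContext();
    return result;
}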

AsyncDisplayKit

AsyncDisplayKit is Facebook’s open source library for keeping iOS interfaces smooth, and I’ve learned a lot from it, so I’ll spend a lot of time on it below.

The origin of ASDK

ASDK was written by Scott Goodson (LinkedIn). He worked at Apple on built-in iOS apps such as Stocks, Calculator, Maps, Clock, Settings, and Safari, and of course on the UIKit framework itself. After joining Facebook, he worked on Paper and created and open-sourced AsyncDisplayKit. He is currently responsible for iOS development and user-experience improvements at Pinterest and Instagram.

ASDK was open-sourced in June 2014, with version 1.0 released in October of that year. ASDK 2.0 is currently on the way; it adds more layout-related code, to which the ComponentKit team contributed a lot. The current version on the GitHub master branch is 1.9.1, which already includes all of the 2.0 content.

ASDK resources

To understand the principles and details of ASDK, the following videos are a good place to start:

NSLondon — Scott Goodson — Behind AsyncDisplayKit
2015.03.02 MCE 2015 — Scott Goodson — Effortless Responsiveness with AsyncDisplayKit
AsyncDisplayKit 2.0: Intelligent User Interfaces — NSSpain 2015

The first two videos are similar in content: both introduce the fundamentals of ASDK, with brief introductions to other projects such as POP. The last video adds an introduction to the new features of ASDK 2.0.

In addition, you can browse the ASDK-related discussions in the project’s GitHub Issues, for example on why it does not support Storyboard and Autolayout, and on Runloop dispatch.

You can also go to the Google Group to read and join more discussions: groups.google.com/forum/#!for…

Basic principles of ASDK

ASDK’s view is that the tasks blocking the main thread fall mainly into three categories. Text and layout calculation, rendering, decoding, and drawing can all be performed asynchronously in various ways, but UIKit and Core Animation related operations must be performed on the main thread. ASDK’s goal is to move these tasks off the main thread as much as possible, and to optimize the ones that cannot be moved.

To achieve this, ASDK attempts to encapsulate UIKit components:

This is the familiar UIView/CALayer relationship: the View holds the Layer for display, and most of the display properties in the View are actually mapped from the Layer; the Layer’s delegate here is the View, which is notified when a property changes or an animation is produced. UIView and CALayer are not thread-safe and can only be created, accessed, and destroyed on the main thread.

For this, ASDK created the ASDisplayNode class, which wraps the common view properties (such as frame/bounds/alpha/transform/backgroundColor/superNode/subNodes), and then implemented the ASNode → UIView relationship in the same way as UIView → CALayer.

When there is no need to respond to touch events, an ASDisplayNode can be set to be layer-backed, that is, the ASDisplayNode itself takes over the role the UIView would have played, saving even more resources.

Unlike UIView and CALayer, ASDisplayNode is thread-safe and can be created and modified on background threads. A Node does not create its UIView or CALayer when it is first created; only when the view or layer property is first accessed on the main thread does it generate the corresponding object internally. When its properties (such as frame/transform) are changed, they are not synchronized to its view or layer immediately; instead, the changed values are stored in internal intermediate variables and, when needed, applied to the internal view or layer in a single pass later through a certain mechanism.

By simulating and wrapping UIView/CALayer in this way, developers can replace UIView with ASNode in their code, which greatly reduces development and learning costs while gaining the many performance optimizations built into ASDK. For convenience, ASDK wraps a large number of common controls as ASNode subclasses, such as Button, Control, Cell, Image, ImageView, Text, TableView, and CollectionView. Using these controls, developers can avoid using UIKit controls directly and get a more complete performance boost.

ASDK layer precomposition

 

Sometimes a layer contains many sub-layers that do not need to respond to touch events and do not need animation or position adjustment. For these, ASDK implements a technique called pre-composing, which composites the sub-layers into a single image. During development, ASNode has already replaced UIView and CALayer; by using the various Node controls directly and setting them to layer-backed, ASNode can even avoid creating the internal UIViews and CALayers altogether through pre-composing.

In this way, a large layer hierarchy is drawn into a single image by one big drawing pass, which brings a big performance boost: the CPU avoids the cost of creating and operating UIKit objects, the GPU avoids the cost of compositing and rendering multiple textures, and fewer bitmaps mean a smaller memory footprint.

ASDK asynchronous and concurrent operation

iOS devices have had dual-core CPUs since the iPhone 4S, and current iPads have even moved to 3-core CPUs. Making full use of multiple cores and executing tasks concurrently is an important way to keep the interface smooth. ASDK breaks layout calculation, text typesetting, and image/text/graphics rendering into smaller tasks and executes them asynchronously and concurrently with GCD. If the developer uses the ASNode-based controls, these concurrent operations run automatically in the background without extra configuration.

Runloop task distribution

Runloop task distribution is one of ASDK’s core technologies, but the ASDK introduction videos and documentation do not describe it in much detail, so I will analyze it a bit more here. If you are not familiar with Runloop, you can read my earlier article, Understanding RunLoop in Depth, which also touches on ASDK.

The iOS display is driven by the VSync signal, which is generated by the hardware clock and fires 60 times per second (the exact value depends on the hardware; it is about 59.97 on a real iPhone). After the iOS graphics service receives the VSync signal, it notifies the App through IPC. After startup, the App’s Runloop registers a corresponding CFRunLoopSource that receives this clock signal notification through mach_port, and the Source callback then drives the animation and display of the entire App.

Core Animation registers an Observer in the RunLoop that listens for the BeforeWaiting and Exit events. This Observer has an order of 2,000,000, a lower priority than the other common Observers. When a touch event arrives, the RunLoop is woken up and the code in the App runs: creating and adjusting the view hierarchy, setting a UIView’s frame, modifying a CALayer’s transparency, adding animations to views, and so on. These operations are eventually captured by CALayer and submitted to an intermediate state through CATransaction (the CATransaction documentation mentions this briefly, but not completely). When all the above operations are done and the RunLoop is about to go to sleep (or exit), the Observers listening for those events are notified. In its callback, the Observer registered by Core Animation then merges all the intermediate states and submits them to the GPU for display; if there is an animation, Core Animation triggers this process repeatedly through mechanisms such as DisplayLink.

ASDK mimics Core Animation’s mechanism here: any modification or commit to an ASNode eventually produces some task that must be executed on the main thread. When such a task arises, ASNode wraps it with ASAsyncTransaction(Group) and submits it to a global container. ASDK also registers an Observer in the RunLoop that watches the same events as Core Animation but with a lower priority. So, after Core Animation has finished its work and before the RunLoop goes to sleep, ASDK executes all the tasks submitted within that loop. The code is in this file: ASAsyncTransactionGroup.
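
The registration itself is just a CFRunLoopObserver. Below is a minimal sketch of the idea (the order value and the callback body are illustrative, not ASDK’s actual code):

#import <Foundation/Foundation.h>

static void RegisterCommitObserver(void (^flushTasks)(void)) {
    // Watch the same activities Core Animation watches, but with a larger order value
    // so this observer fires after CA's own observer (which uses order 2,000,000).
    CFRunLoopObserverRef observer = CFRunLoopObserverCreateWithHandler(
        kCFAllocatorDefault,
        kCFRunLoopBeforeWaiting | kCFRunLoopExit,
        true,        // repeats
        0xFFFFFF,    // order
        ^(CFRunLoopObserverRef obs, CFRunLoopActivity activity) {
            flushTasks(); // commit the queued node changes to their views/layers here
        });
    CFRunLoopAddObserver(CFRunLoopGetMain(), observer, kCFRunLoopCommonModes);
    CFRelease(observer);
}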

Through this mechanism, the ASDK can synchronize asynchronous, concurrent operations to the main thread when appropriate, with good performance.

Other

ASDK also packages some advanced features, such as preloading for scrolling lists and the new layout modes added in 2.0. ASDK is a very large library; I do not recommend converting your whole App to be ASDK-driven. It is enough to use ASDK to optimize the interfaces that most need better interaction performance.

Weibo Demo performance optimization tips

To demonstrate YYKit’s capabilities, I implemented the Weibo and Twitter demos and did a lot of performance tuning on them. Below are some of the optimization techniques I used.

Pre-layout

When the API’s JSON data arrives, I compute on a background thread all the data each Cell will need and package it into a layout object, CellLayout. CellLayout contains the CoreText layout results of all the text, the height of every control inside the Cell, and the overall height of the Cell. Each CellLayout takes up little memory, so once generated they can all be cached in memory for later use. With this in place, the TableView does not need any extra computation when it asks for each row’s height, and when a CellLayout is set on a Cell, the Cell does not compute any layout internally either.
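
A minimal sketch of this structure (the class, property, and helper names are illustrative, not the Demo’s actual code):

@interface CellLayout : NSObject
@property (nonatomic, strong) NSAttributedString *text; // typeset text
@property (nonatomic, assign) CGRect textFrame;         // frame of the text inside the cell
@property (nonatomic, assign) CGFloat cellHeight;       // total height of the cell
@end

@implementation CellLayout
@end

// Inside the list view controller (StatusModel and -layoutForModel: are assumed helpers):
- (void)appendModels:(NSArray *)models {
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        NSMutableArray *layouts = [NSMutableArray array];
        for (StatusModel *model in models) {
            [layouts addObject:[self layoutForModel:model]]; // measure text, compute frames
        }
        dispatch_async(dispatch_get_main_queue(), ^{
            [self.layouts addObjectsFromArray:layouts];
            [self.tableView reloadData];
        });
    });
}

- (CGFloat)tableView:(UITableView *)tableView heightForRowAtIndexPath:(NSIndexPath *)indexPath {
    CellLayout *layout = self.layouts[indexPath.row];
    return layout.cellHeight; // no layout work happens on the main thread here
}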

For an ordinary TableView, computing the layout in advance on a background thread is a very important optimization. To reach maximum performance, you may need to sacrifice some development speed: avoid technologies such as Autolayout and limit your use of text controls such as UILabel. If your performance requirements are not that high, however, you can try the TableView’s height estimation and cache each Cell’s height. Here is an open-source project from the Baidu Zhidao team that can help you do this easily: FDTemplateLayoutCell.

Pre-rendering

In one redesign, Weibo changed the profile picture to a circle, and I followed suit in the Demo. When an avatar is downloaded, I pre-render it as a circle on a background thread and save the result separately in an ImageCache.
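
A minimal sketch of that step (the cache and function names are illustrative); it is called on a background queue right after the avatar has been downloaded and decoded:

#import <UIKit/UIKit.h>

static NSCache *RoundAvatarCache(void) {
    static NSCache *cache;
    static dispatch_once_t once;
    dispatch_once(&once, ^{ cache = [NSCache new]; });
    return cache;
}

static void CacheRoundAvatar(UIImage *avatar, NSString *key) {
    CGRect rect = (CGRect){CGPointZero, avatar.size};
    UIGraphicsBeginImageContextWithOptions(rect.size, NO, avatar.scale);
    [[UIBezierPath bezierPathWithOvalInRect:rect] addClip]; // circular clip
    [avatar drawInRect:rect];
    UIImage *round = UIGraphicsGetImageFromCurrentImageContext();
    UIGraphicsEndImageContext();
    [RoundAvatarCache() setObject:round forKey:key];
}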

For a TableView, offscreen rendering of Cell contents brings a large GPU cost. In the Twitter Demo, I deliberately used a lot of layer rounded corners for convenience. If you swipe the list quickly on a low-performance device (such as the iPad 3), you can see that although the list does not lag much, the overall average frame rate has dropped. Viewed with Instruments, the GPU is running at full capacity while the CPU is fairly idle. To avoid offscreen rendering, you should avoid using the layer’s border, corner radius, shadow, and mask, and instead pre-draw the corresponding content on a background thread.

Asynchronous drawing

I only used asynchronous drawing on the controls that display text, but the effect was still good. Borrowing from the principles of ASDK, I implemented a simple asynchronous drawing control. I extracted this code separately and put it here: YYAsyncLayer. YYAsyncLayer is a CALayer subclass; when it needs to display content (for example, after [layer setNeedsDisplay] is called), it asks its delegate, a UIView, for an asynchronous drawing task. During asynchronous drawing, the layer passes along a BOOL(^isCancelled)() block that the drawing code can call at any time to check whether the drawing task has already been cancelled.

When the TableView scrolls quickly, a large number of asynchronous drawing tasks is submitted to background threads. Sometimes, when scrolling is very fast, a drawing task may be cancelled before it finishes. If it keeps drawing anyway, a lot of CPU resources are wasted, and the thread may even be blocked, delaying subsequent drawing tasks. My approach is to detect as early and as often as possible whether the current drawing task has been cancelled: before drawing each line of text, I call isCancelled() to check, so that a cancelled task can stop promptly without affecting later work.
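
In rough pseudo-form, the drawing callback looks like this (the method and parameter names are my own; YYAsyncLayer’s real interface differs in detail):

// Runs on a background queue; ctx is the bitmap context being drawn into.
- (void)drawTextLines:(NSArray *)lines
            inContext:(CGContextRef)ctx
          isCancelled:(BOOL (^)(void))isCancelled {
    for (id line in lines) {
        if (isCancelled()) return;          // bail out as early as possible
        [self drawLine:line inContext:ctx]; // assumed helper that draws one line
    }
}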

Some third-party Weibo clients (such as VVebo and Moke) use a technique that skips drawing Cells entirely during very fast scrolling; see this project for an implementation: VVeboTableViewDemo. The idea is that while scrolling, as soon as the finger is lifted, the position where scrolling will stop is calculated immediately and the few Cells near that position are pre-drawn, while the Cells currently scrolling past are ignored. This approach is rather tricky but gives a significant scrolling performance improvement; its only drawback is that a lot of blank space appears during fast scrolling. If you don’t want the hassle of asynchronous drawing but still want smooth scrolling, this technique is a good choice.

Global concurrency control

When I used a concurrent queue to do a large amount of drawing work, I occasionally ran into this problem:

When a large number of tasks is submitted to a background queue and some of them are locked for some reason (in this case, a lock inside CGFont), putting their threads to sleep or blocking them, the concurrent queue creates new threads to execute the other tasks. When this happens frequently, or when the App submits many tasks to many concurrent queues, the App may have dozens of threads running, being created, and being destroyed at the same time. The CPU implements thread concurrency with time-slice round-robin scheduling, and although a concurrent queue can control thread priority, when a large number of threads is created and destroyed at the same time these operations still eat into the main thread’s CPU resources. ASDK has a feed-list Demo, SocialAppLayout; when there are many cells in the list and you scroll very fast, there is still a small amount of stuttering, which I cautiously suspect may be related to this problem.

This problem cannot be avoided with concurrent queues, while with a single serial queue the multi-core CPU cannot be fully utilized. So I wrote a simple tool, YYDispatchQueuePool, which creates, for each priority, as many serial queues as there are CPU cores and hands one back in rotation each time a queue is requested from the pool. I moved all the asynchronous operations in the App, including image decoding, object release, and asynchronous drawing, into these global serial queues according to their priority, avoiding the performance problems that excessive threading can cause.
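
The core idea fits in a few lines. Below is a minimal sketch (the class name is my own; the real YYDispatchQueuePool also handles several priorities):

#import <Foundation/Foundation.h>
#import <libkern/OSAtomic.h>

@interface SerialQueuePool : NSObject
- (dispatch_queue_t)queue; // returns one of the pooled serial queues, round-robin
@end

@implementation SerialQueuePool {
    NSArray *_queues;
    volatile int32_t _counter;
}
- (instancetype)init {
    if (self = [super init]) {
        // One serial queue per CPU core keeps the total thread count bounded.
        NSUInteger count = [NSProcessInfo processInfo].activeProcessorCount;
        NSMutableArray *queues = [NSMutableArray array];
        for (NSUInteger i = 0; i < count; i++) {
            [queues addObject:dispatch_queue_create("com.example.pool.serial", DISPATCH_QUEUE_SERIAL)];
        }
        _queues = queues.copy;
    }
    return self;
}
- (dispatch_queue_t)queue {
    uint32_t index = (uint32_t)OSAtomicIncrement32(&_counter);
    return _queues[index % _queues.count];
}
@end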

More efficient asynchronous image loading

SDWebImage still had a few performance issues in this Demo, and some parts did not meet my needs, so I implemented a higher-performance image loading library myself. For displaying a simple single image, UIView.layer.contents is enough and there is no need to pay the extra resource cost of a UIImageView, so I added setImageWithURL-style methods to CALayer. I also routed image decoding and other operations through YYDispatchQueuePool to keep the App’s total thread count under control.

Other areas that can be improved

After the optimizations above, the Weibo Demo is already very smooth, but in my mind there are still a few further optimizations that could be applied. Limited by time and energy, I have not implemented them; they are briefly listed below:

There are many visual elements in the list that do not need to respond to touch events; these could be pre-drawn into a single image using ASDK’s pre-composing technique.

Further reduce the number of layers in each Cell and replace UIView with CALayer.

At present, every Cell is of the same type but displays different content; for example, some Cells have pictures while others have cards. Splitting Cells by type, and further reducing the unnecessary view objects and operations inside each Cell, should have some effect.

Split the tasks that must run on the main thread into sufficiently small blocks and schedule them through the Runloop: in each loop, check when the next VSync will arrive, and defer any tasks that cannot finish before it to the next opportunity. This is only an idea of mine, and it may not be feasible.

How to measure interface fluency

Finally, “premature optimization is the root of all evil”. When requirements are uncertain and performance problems are not obvious, there is no need to spend effort on optimization; just implement the functionality correctly first. When you do optimize performance, it is best to follow the cycle of modify code → Profile → modify code, and focus on the places that are most worth optimizing.

If you need a clear FPS indicator, try KMCGeigerCounter. It detects CPU-side stutters with the built-in CADisplayLink and monitors the GPU with a 1×1 SKView. This project has two minor problems: although the SKView can monitor GPU stutters, introducing the SKView itself adds a little extra CPU/GPU cost; and the project has some compatibility issues with iOS 9 that need slight adjustments.

I also wrote a simple FPS indicator myself: FPSLabel. It is only a few dozen lines of code and uses only CADisplayLink to monitor CPU-side stutters. It is not as complete as the tool above, but it is good enough for daily use.
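
The core of such an indicator is just a CADisplayLink callback that counts frames per second. A minimal sketch (the class name is my own; note that CADisplayLink retains its target, so FPSLabel actually uses a weak proxy to break the retain cycle, which is omitted here):

#import <UIKit/UIKit.h>
#import <QuartzCore/QuartzCore.h>

@interface FPSCounter : NSObject
@end

@implementation FPSCounter {
    CADisplayLink *_link;
    NSUInteger _count;
    NSTimeInterval _lastTime;
}
- (instancetype)init {
    if (self = [super init]) {
        _link = [CADisplayLink displayLinkWithTarget:self selector:@selector(tick:)];
        [_link addToRunLoop:[NSRunLoop mainRunLoop] forMode:NSRunLoopCommonModes];
    }
    return self;
}
- (void)tick:(CADisplayLink *)link {
    if (_lastTime == 0) { _lastTime = link.timestamp; return; }
    _count++;
    NSTimeInterval delta = link.timestamp - _lastTime;
    if (delta < 1) return;            // update roughly once per second
    double fps = _count / delta;
    _count = 0;
    _lastTime = link.timestamp;
    NSLog(@"%.1f FPS", fps);          // a real indicator would update a label here
}
- (void)dealloc {
    [_link invalidate];
}
@end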

Finally, with Instruments’ GPU Driver preset, you can monitor CPU and GPU resource consumption in real time. Under this preset you can see almost all display-related data, such as the number of textures, the frequency of CA commits, and GPU utilization; it is the best tool for pinpointing interface stuttering problems.