Background

Recently, after some investigation, the company decided to replace FaceMagic with SenseTime's product, which is said to have a better recognition rate and better performance. To optimize further, SenseTime also added frame-interval detection on short notice: detection runs only on every other frame, so in theory the detection work is cut roughly in half and performance should nearly double. Yet the stress test showed no real improvement.

So we had to rethink the whole picture. A face/gesture-recognition gift consists of recognition + playback, so if recognition was fine, had the playback side been overlooked? Sure enough, testing showed the culprit was the heavy gift resources. Unlike FaceMagic, which uses skeletal animation, SenseTime uses traditional frame-by-frame animation, so a short animation may need hundreds of images played in sequence. With that in mind, we started deleting unnecessary frames and shrinking each remaining frame as much as possible. After the resource pack was optimized, our SDK colleagues ran a first stress test and found that CPU usage dropped significantly. Everyone was pleased, the change was quickly integrated into the project, and a real stress test was started in the live-broadcast room. But the improvement was nowhere near as large as in the SDK colleagues' demo.

While I was still puzzled, communication with the SDK colleagues revealed the problem: their demo set the resource pack just once at the beginning (the SenseTime API requires setting a decompressed resource pack) and then looped recognition + animation playback. In our live-broadcast room, however, every gift message sets the resource again (which means another decompression) before starting recognition + playback. (I also suspect that, because the live-streaming app is tight on memory, the decoded images in the cache are easily evicted, causing repeated file IO and image decoding.)

So we can sum up two crucial optimization points:

  1. Avoid frequently decompressing resource packs. Each decompression is an expensive operation. This is a fairly low-level point that is easy to overlook when looking at the big picture, and it will not be discussed further here.
  2. Frame-by-frame animations involve many large, high-quality images. Could the PNG/JPEG images be decoded into bitmaps once and saved locally, so that the CPU avoids repeated, frequent decoding? With this question in mind, the discussion begins.

The origin and extension of the image optimization idea

There are many ways to make an app's UI smoother. For images, one idea is to decode them asynchronously in advance, instead of letting the main thread decode each image right before it is submitted to the GPU, as stock UIKit does. The originator of this idea should be Facebook's ASDK; ibireme's YYImage borrows it as well. (Appendix 1 at the end of this article is the force-decoding approach I learned from them.)

But for something as heavy as frame-by-frame animation, I don't think decoding asynchronously in advance solves the real pain point: the CPU is our bottleneck, and early asynchronous decoding still decodes every single time. Could we instead decode only once, cache the decoded bitmap locally, and afterwards load the bitmap directly? Before answering that, let's look at how iOS displays an image.

How iOS displays an image

When we write code, we initialize a UIImage object with a given image and assign it to UIImageView.image, and the image appears on screen. How does this display process actually work?

  1. self.testImageView.image = [UIImage imageNamed:@"xxx.png"]; finds xxx.png in the application's main bundle and creates a CGImageRef pointing to it.
  2. RunLoop’s CATransaction catches the changes to the testImageView property and is ready to submit the data to the GPU for rendering.
  3. Before submitting data from the model layer tree to the render layer tree, the CPU loads the PNG image into memory and decodes it into a bitmap.
  4. Driven by OpenGL, the GPU renders bitmap data to the screen.

Note 1: The first step does not load the image into memory, let alone decode it. I verified this with the following code:

    NSMutableArray *array = [NSMutableArray array];
    for (NSInteger i = 0; i < 100; i++) {
        [array addObject:[UIImage imageNamed:[NSString stringWithFormat:@"xiongbao_%ld.png", (long)i]]];
    }
    self.imageView.animationImages = [array copy];

Xcode shows no memory change after this code runs; loading and decoding happen only when the images are actually displayed (in this case, when [self.imageView startAnimating] is called).

Note 2: The model layer tree, presentation layer tree, and render layer tree are covered in Apple's official documentation, and further explanations can be found online. My understanding: the layer we normally manipulate directly through UIView is the model layer. During an animation it is the presentation layer that actually does the work; the model layer's property values only represent the final state at the end of the animation, which is why we read CALayer.presentationLayer to get the current values while an animation is running. The render layer is responsible for rendering; both the model layer and the presentation layer hand their data to the render layer when something actually needs to be drawn.
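For instance, a tiny snippet (the view name is hypothetical) that reads the in-flight value while an animation is running:

    // During an animation, the model layer already holds the final value, while the
    // presentation layer reflects what is currently on screen (nil when nothing is animating).
    CALayer *modelLayer = self.testImageView.layer;
    CALayer *liveLayer = modelLayer.presentationLayer;
    NSLog(@"final position: %@, on-screen position: %@",
          NSStringFromCGPoint(modelLayer.position),
          NSStringFromCGPoint((liveLayer ?: modelLayer).position));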

PNG? Bitmap?

PNG and JPEG are specific file formats, each with its own representation and compression scheme, designed to make images easy to store and transfer. Setting hardware decoding aside, the GPU can be thought of as a simpleton that only knows how to render pixels, very fast and in parallel. It has no idea what PNG or JPEG is, so what ultimately gets sent to the GPU must be raw pixel data, that is, a bitmap.

It is worth mentioning that a 100 KB PNG on disk may take more than 1 MB once decoded into a bitmap. The size of the bitmap equals the image's pixel width × pixel height × the bytes occupied by a single pixel.
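To see the difference concretely, the snippet below (reusing the image name from the appendix) compares the compressed file size on disk with the decoded bitmap size that ImageIO reports; treat it as an illustrative sketch rather than a benchmark:

    NSString *path = [[NSBundle mainBundle] pathForResource:@"xiongbao_7" ofType:@"png"];
    NSUInteger fileSize = [[NSData dataWithContentsOfFile:path] length]; // compressed PNG bytes on disk

    UIImage *image = [UIImage imageWithContentsOfFile:path];
    CGImageRef imageRef = image.CGImage;
    // Decoded size = bytesPerRow × height, where bytesPerRow ≈ pixel width × bytes per pixel (plus padding)
    size_t bitmapSize = CGImageGetBytesPerRow(imageRef) * CGImageGetHeight(imageRef);

    NSLog(@"PNG on disk: %lu bytes, decoded bitmap: %zu bytes", (unsigned long)fileSize, bitmapSize);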

File IO vs. CPU decoding

So here’s the question:

  • Using the PNG image = read ~100 KB from disk + have the CPU decode that 100 KB into a bitmap.
  • Using the cached bitmap = read 1+ MB from disk, with essentially no CPU decoding.

In essence, it is a trade-off between file IO and CPU computation.

So far I have not found a reliable way to measure precisely what it costs an iPhone app to read 1 MB from disk versus what it costs the CPU to decode 100 KB. But the trade-off really comes down to where our application's bottleneck is. With every feature turned on, the CPU is currently the bottleneck in our live-broadcast room: sustained high CPU load heats the phone and triggers frequency throttling, which is followed by dropped frames, so reducing CPU load is the top priority.

Another surprise after ASDK

The world is large: whatever we can think of, someone has usually pioneered it already. While attempting this optimization I came across FastImageCache, and it surprised me as much as my first encounter with ASDK did. The optimization ideas below therefore draw on FastImageCache, which I studied with admiration.

Color Copied Images

This is one of the checkboxes under Core Animation in Xcode's Instruments; it flags images that had to be copied. FastImageCache does not cover this particular optimization, but it naturally came to mind while I was reading its README. What does it mean?

As mentioned above, getting an image onto the screen ultimately depends on GPU rendering; the details involve the lower layers that the iOS frameworks wrap, plus the specialized field of graphics rendering, which we will not go into here. iOS drives the GPU through OpenGL, and OpenGL's support for color formats is described as follows:

For color formats, there are more possibilities. GL_RED, GL_GREEN, and GL_BLUE represent transferring data for those specific components (GL_ALPHA cannot be used). GL_RG represents two components, R and G, in that order. GL_RGB and GL_BGR represent those three components, with GL_BGR being in reverse order. GL_RGBA and GL_BGRA represent those components; the latter reverses the order of the first three components. These are the only color formats supported (note that there are ways around that).


From OpenGL Wiki: https://www.khronos.org/opengl/wiki/Pixel_Transfer

So OpenGL (ES) supports RGB, BGR, RGBA, BGRA and a handful of other color formats. When an image's color space is not supported, the CPU has to convert it into a supported one first, and that conversion inevitably means allocating new memory and writing data into it, in other words a copy, which adds extra load on the CPU. That is exactly what the Color Copied Images option flags.

Note 1: While exploring this point, I checked our app and found that some icons in the project use a grayscale color space (see Grayscale image - Wikipedia), which triggers the copy operation. I confirmed this with our designer, whose reply was that those images were probably produced with Photoshop early on, and that everything generated with Sketch today is RGB, even the black-and-white images. I don't find that explanation entirely convincing; in principle the original intent was to keep the files small, since a grayscale pixel takes only 8 bits and the resulting image is much smaller.
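As a small illustration of how such images could be found programmatically (my own sketch, not anything from FastImageCache or Instruments), you can query the color space model of the decoded CGImage and flag anything that is not RGB:

    // Flag images whose color space is not RGB (e.g. grayscale icons), since they
    // would be converted (copied) by the CPU before rendering. The asset name is hypothetical.
    UIImage *icon = [UIImage imageNamed:@"some_icon"];
    CGColorSpaceRef space = CGImageGetColorSpace(icon.CGImage);
    if (CGColorSpaceGetModel(space) != kCGColorSpaceModelRGB) {
        NSLog(@"Non-RGB image detected: this will trigger a color copy");
    }

The fix is simply to re-export the asset in an RGB color space, or to redraw it into a device RGB bitmap context, as in Appendix 1.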

Note 2: We have been saying OpenGL all along, but on mobile devices it is really OpenGL ES, an embedded subset of OpenGL.

Pixel alignment

This is one of FastImageCache's concerns (Instruments has a related detection option as well). For example, if the CPU always reads a fixed 8 bytes at a time and is handed 41 bytes of data, it has to read 6 times and do extra work on the sixth read, which contains only 1 useful byte. That naturally puts an extra burden on the CPU. The solution is to pad the data out to 48 bytes without affecting the original content.

Let me try to explain this in slightly more technical terms:

  1. Why does the CPU read a fixed number of bytes at a time? This is related to the CPU's cache line size (Data structure alignment - Wikipedia answers part of the puzzle; the rest you can Google yourself). From the FastImageCache source code, the alignment value it uses for iPhone is 64 bytes. I have not found official Apple information on the cache line size of the A-series processors, but I have seen discussions suggesting the A9's cache line is 64 bytes, which matches FastImageCache's choice.
  2. Bitmap data is a matrix, stored in memory as a two-dimensional array. Note the force-decoding code in the appendix: one argument to CGBitmapContextCreate(...) is bytesPerRow, the number of bytes per row. FastImageCache makes bytesPerRow an integer multiple of 64; an aligned bytes-per-row value must also be a multiple of 8 pixels × bytes per pixel, so it is not only a multiple of 8 but a multiple of bytesPerPixel as well. Presumably the purpose is to guarantee that each row stores only whole pixels, so that no pixel's data straddles two rows. This is another point worth optimizing (a small sketch follows this list).
  3. You might also wonder why the alignment is done per row rather than on the buffer as a whole. That is because the GPU processes an image row by row, rendering multiple rows in parallel.
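Here is a minimal sketch of what such row alignment might look like, under the assumption of a 64-byte cache line; it is my own illustration, not FastImageCache's actual code:

    // Round bytesPerRow up to a multiple of 64 bytes (assumed cache-line size).
    // With 4 bytes per pixel, a multiple of 64 is automatically a multiple of bytesPerPixel.
    static size_t AlignedBytesPerRow(size_t widthInPixels, size_t bytesPerPixel) {
        size_t bytesPerRow = widthInPixels * bytesPerPixel;
        size_t alignment = 64;
        return ((bytesPerRow + alignment - 1) / alignment) * alignment;
    }

    // Usage when creating a force-decoding context (cf. Appendix 1):
    // CGContextRef context = CGBitmapContextCreate(NULL, width, height, 8,
    //     AlignedBytesPerRow(width, 4), colorSpace, kCGImageAlphaNoneSkipLast);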

mmap

Besides pixel alignment, mmap (mmap - Wikipedia) is another of FastImageCache's concerns.

What does this optimization mean? Normally, when the system loads an image (or any other file), it first copies the data from disk into kernel space in memory, and then copies it into the process's user space when the process needs it. In other words, loading an image involves two copies. And if another process also needs the image, the data is copied from kernel space into that process's user space all over again (there is no inter-process sharing).

mmap, on the other hand, relies on demand paging (Demand paging - Wikipedia): the image data is copied from disk directly into memory and the page table is updated, so only one copy is performed. When another process accesses the image and triggers a page fault, it can map the pages already in memory via the updated page table, achieving memory sharing. Of course, shared memory also means that writes need locking.

So, compared with a traditional file read, mmap reduces both the number of copies and memory consumption. It is especially suitable for our scenario, where images are only displayed and never written.
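For illustration, here is a minimal sketch of mapping a cached bitmap file with the POSIX mmap API; this is my own example rather than FastImageCache's implementation, and on iOS a similar effect can also be obtained with NSData's NSDataReadingMappedIfSafe option:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    // Map a file read-only; pages are pulled in lazily on first access (demand paging),
    // and read-only mappings of the same file can be shared across processes.
    static const void *MapBitmapFile(const char *path, size_t *outLength) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;

        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return NULL; }

        void *bytes = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd); // the mapping stays valid after the descriptor is closed

        if (bytes == MAP_FAILED) return NULL;
        if (outLength) *outLength = (size_t)st.st_size;
        return bytes; // call munmap() when finished
    }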

Conclusion

Going back to the beginning, there are several optimization points for scenarios that require a large number of images:

  1. Use bitmaps directly.
  2. Avoid color spaces that OpenGL (ES) does not support.
  3. Ensure image data is memory-aligned.
  4. Use mmap instead of traditional file reads.

Finally, I would like to pay tribute to the author of FastImageCache, and to marvel at how much a computer engineer's knowledge base and fundamentals determine which layer of the stack you are able to think about while writing code.



Appendix 1

As a tribute to these pioneers, here is the force-decoding approach I learned from them: use Core Graphics to draw the image into an off-screen bitmap context, then save the decoded result as a bitmap:

    // Requires ImageIO and MobileCoreServices (for kUTTypeBMP).
    - (void)createBitmap {
        UIImage *pngImage = [UIImage imageNamed:@"xiongbao_7.png"];
        CGFloat width = pngImage.size.width;
        CGFloat height = pngImage.size.height;
        CGImageRef pngImageRef = pngImage.CGImage;

        // Draw the PNG into an off-screen bitmap context; this forces decoding.
        CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB();
        CGContextRef context = CGBitmapContextCreate(NULL, width, height, 8, width * 4,
                                                     colorSpace, kCGImageAlphaNoneSkipLast);
        CGContextDrawImage(context, CGRectMake(0, 0, width, height), pngImageRef); // decode
        CGImageRef bitmapImageRef = CGBitmapContextCreateImage(context);

        // Save the decoded bitmap. (The main bundle is read-only on a real device;
        // in practice you would write to a writable directory such as Documents.)
        NSString *path = [NSString stringWithFormat:@"%@/xiongbao_7.bitmap", [[NSBundle mainBundle] bundlePath]];
        CFURLRef url = CFURLCreateWithFileSystemPath(kCFAllocatorDefault, (__bridge CFStringRef)path, kCFURLPOSIXPathStyle, false);
        CGImageDestinationRef dest = CGImageDestinationCreateWithURL(url, kUTTypeBMP, 1, NULL);
        CGImageDestinationAddImage(dest, bitmapImageRef, NULL);
        CGImageDestinationFinalize(dest);

        // Release resources
        CGContextRelease(context);
        CGColorSpaceRelease(colorSpace);
        CGImageRelease(bitmapImageRef);
        CFRelease(url);
        CFRelease(dest);
    }
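As a companion (my own addition, not part of the original appendix), the saved file could later be loaded back directly; BMP is essentially uncompressed, so the CPU cost of reading it is far lower than decoding the original PNG:

    // Load the previously saved bitmap instead of decoding the PNG again.
    - (UIImage *)loadCachedBitmap {
        NSString *path = [NSString stringWithFormat:@"%@/xiongbao_7.bitmap", [[NSBundle mainBundle] bundlePath]];
        return [UIImage imageWithContentsOfFile:path];
    }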

References

iOS image loading speed extreme optimization - FastImageCache analysis « Bang's blog

FastImageCache source analysis: http://www.cairuitao.com/fastimagecache/source analysis

http://blog.csdn.net/mg0832058/article/details/5890688

Shared memory for Linux inter-process communication - Gordon0918 - cnblogs

OS X - CGImageRef width doesn't agree with bytes-per-row - Stack Overflow