Storing and displaying images inevitably involves file I/O, GPU rendering, and related problems. This article looks at them from the operating system's perspective and analyzes how to speed up image I/O, improve UIImageView rendering efficiency on iOS, and reduce memory use, which is very helpful when building photo albums and similar image-heavy applications.

Image data copy?

When we use the following Objective-C code to fetch an image from the network and load it into a UIImageView:

    NSURL *url = [NSURL URLWithString:@"https://img.alicdn.com/bao/uploaded/i2/2836521972/TB2cyksspXXXXanXpXXXXXXXXXX_!!0-paimai.jpg"];
    __weak typeof(self) weakSelf = self;
    NSURLSessionDataTask *task = [[NSURLSession sharedSession]
        dataTaskWithRequest:[[NSURLRequest alloc] initWithURL:url]
          completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
        // The UIImage is created on the session's background queue
        UIImage *image = [UIImage imageWithData:data];
        // UIKit views must be updated on the main queue
        dispatch_async(dispatch_get_main_queue(), ^{
            [weakSelf.imageView setImage:image];
        });
    }];
    [task resume];

Run the above code and view the CPU consumption in the Time Profiler instrument:

The profile shows two problems:

1. The application calls CA::Render::copy_image. Core Animation makes a copy of the image data before rendering because the data is not byte-aligned. The same thing happens when we load the image with imageWithContentsOfFile:.

2. The application calls CA::Render::create_image_from_provider, which is where the image is actually decoded. UIImage does not decode the image when it is loaded; decoding is deferred until the image is displayed or the decoded data is otherwise needed. This strategy saves memory, but decoding at display time takes a lot of main-thread CPU time and makes the interface stutter.

To solve these two problems, we load images with a third-party library called FastImageCache. In the official demo, FICDPhoto originally loaded its source image with imageWithContentsOfFile:, which caused CA::Render::copy_image to copy the image data, so we changed it to the following method:

    - (UIImage *)sourceImage {
        __block UIImage *sourceImage = [UIImage imageWithContentsOfFile:[_sourceImageURL path]];
        if (!sourceImage) {
            __block BOOL finished = NO;
            NSURLSessionDataTask *task = [[NSURLSession sharedSession]
                dataTaskWithRequest:[[NSURLRequest alloc] initWithURL:_sourceImageURL]
                  completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
                // Lock before signalling: a mutex may only be unlocked by the thread that locked it
                pthread_mutex_lock(&_mutex);
                sourceImage = [UIImage imageWithData:data];
                finished = YES;
                pthread_cond_signal(&_signal);
                pthread_mutex_unlock(&_mutex);
            }];
            pthread_mutex_lock(&_mutex);
            [task resume];
            // Block the caller until the download finishes; the loop guards against spurious wakeups
            while (!finished) {
                pthread_cond_wait(&_signal, &_mutex);
            }
            pthread_mutex_unlock(&_mutex);
        }
        return sourceImage;
    }

With this change, Instruments shows that CA::Render::copy_image and CA::Render::create_image_from_provider no longer occur when an image is rendered to the UIImageView.

How to optimize?

How does FastImageCache solve these two problems?

Let’s look at the first question. Why does Core Animation copy data before rendering?

Before we do that, let’s take a look at some of the theories of computer systems, and when we’re done, we’ll look back and the answer will be clearer.

As we all know, an image is simply a file made up of bytes. When we display an image on screen, the CPU loads those bytes from memory, and the GPU turns them into an array of red (R), green (G), and blue (B) primary color values, which is then shown on the display (what we call rendering). So let’s take a look at how byte data is loaded from memory into the CPU.

Those of us who have studied operating systems know that all byte data travels over buses, the transmission paths that connect the CPU to main memory and main memory to disks and other major hardware devices. The basic unit of transfer is the word. A digital signal has only two levels, high and low, so it can only distinguish 0 and 1. Data is therefore expressed in binary, and the precision of a signal is measured in bits. This is the essence of how byte data moves across the bus.

On 64-bit systems, a word is eight bytes in size.

Memory is accessed in units called chunks, and the size of a chunk depends on the hardware. Most file systems are built on a block device, a hardware abstraction layer that accesses a specified chunk of data.

Consider a 32-bit system where memory is delivered to the CPU in 4-byte blocks. The following figure illustrates how the CPU reads 4 bytes at a 4-byte memory access granularity:

If we fetch four unaligned bytes of data, the CPU performs additional work to access the data: loading two chunks of data, removing unwanted bytes, and combining them. This process certainly degrades performance and wastes CPU cycles in order to get the correct data from memory.

Therefore, data is stored aligned to these boundaries, which is known as byte alignment. Data is then read and written in the first form shown in the figure above, which saves CPU cycles and speeds up access.

Looking back, the image data we read from memory is also just a pile of bytes. If the image bytes are not aligned, the following happens:

Data is transferred in words and memory is read in blocks, so image data read from memory inevitably carries other “impurity bytes”. The GPU, however, needs clean data in the ideal layout, because impurity bytes would corrupt the generated image. So Core Animation has to strip the impurity bytes and copy the data into that ideal layout. This is not peculiar to Core Animation: anyone processing the image would have to do it, otherwise the image data would be garbled. Core Animation simply encapsulates this low-level work for us.
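The cost of a misaligned access can be made concrete with a little arithmetic. The following C sketch is only an illustration of the 4-byte example from the text; the function names `chunks_touched` and `is_aligned` are made up for this article and do not come from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of fixed-size memory chunks that an access of `size` bytes
 * starting at address `addr` has to touch. `granularity` is the chunk
 * size in bytes (4 in the example from the text). */
size_t chunks_touched(uintptr_t addr, size_t size, size_t granularity)
{
    assert(granularity != 0 && size != 0);
    uintptr_t first = addr / granularity;              /* chunk holding the first byte */
    uintptr_t last = (addr + size - 1) / granularity;  /* chunk holding the last byte */
    return (size_t)(last - first + 1);
}

/* An address is aligned when it is an exact multiple of the granularity. */
int is_aligned(uintptr_t addr, size_t granularity)
{
    assert(granularity != 0);
    return addr % granularity == 0;
}
```

With a granularity of 4, a 4-byte read at address 0 touches one chunk, while the same read at address 1 touches two, which is the extra load-and-combine work described above.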

From what we learned, we realized that we needed to byte-align the images we saved.

The cache is accessed in blocks of bytes, and the block size depends on the CPU's cache. On ARMv7 it is 32 bytes and on the A9 it is 64 bytes. On the A9, Core Animation should therefore read, store, and render data in 64-byte blocks (8 words of 8 bytes each). Aligning the image data to 64 bytes prevents Core Animation from making another copy of the data, which saves both memory and copy time. (Because the images are byte-aligned when stored, they are still byte-aligned when read back.)

How can byte block alignment prevent Core Animation from copying image data?

The following is the code form for practical operation:

Calculate the size of the image in bytes:

    /** FICImageTable.m */
    CGSize pixelSize = [_imageFormat pixelSize];
    // Bytes per pixel, e.g. 4 bytes for FICImageFormatStyle32BitBGRA (32 bits)
    NSInteger bytesPerPixel = [_imageFormat bytesPerPixel];
    _imageRowLength = (NSInteger)FICByteAlignForCoreAnimation((size_t)(pixelSize.width * bytesPerPixel));
    _imageLength = _imageRowLength * (NSInteger)pixelSize.height;

The FICByteAlignForCoreAnimation function byte-aligns the row size. _imageRowLength is the number of bytes per row of the image after alignment to the byte block, and the number of bytes required for the whole image is the image height * _imageRowLength.

Align the bytes required for each row of the image to the byte block:

    inline size_t FICByteAlignForCoreAnimation(size_t bytesPerRow) {
        return FICByteAlign(bytesPerRow, 64); // 64 is related to the CPU cache
    }

Round width up to a multiple of alignment:

    inline size_t FICByteAlign(size_t width, size_t alignment) {
        return ((width + (alignment - 1)) / alignment) * alignment;
    }
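A worked example makes the rounding easier to follow. The C sketch below repeats the arithmetic of FICByteAlign under the illustrative names `byte_align` and `aligned_row_length` (both invented here); the 100-pixel width is an assumed value, not one from FastImageCache.

```c
#include <assert.h>
#include <stddef.h>

/* Round `width` up to the next multiple of `alignment`
 * (the same arithmetic as FICByteAlign). */
size_t byte_align(size_t width, size_t alignment)
{
    assert(alignment != 0);
    return ((width + (alignment - 1)) / alignment) * alignment;
}

/* Bytes per row once every row is padded to a 64-byte boundary. */
size_t aligned_row_length(size_t pixel_width, size_t bytes_per_pixel)
{
    return byte_align(pixel_width * bytes_per_pixel, 64);
}
```

For a 100-pixel-wide 32-bit BGRA image, a row needs 100 * 4 = 400 bytes, which is rounded up to 448 (7 * 64). A 100-row image therefore occupies 448 * 100 = 44800 bytes, 4800 of which are padding. A 128-pixel-wide image needs exactly 512 bytes per row and gets no padding.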

Create the Chunk corresponding to the Entry; the Chunk is page-aligned:

    // Length of each entry in bytes. Besides the image data, FastImageCache
    // stores two UUIDs (32 bytes in total) as metadata for each image.
    _entryLength = (NSInteger)FICByteAlign(_imageLength + sizeof(FICImageTableEntryMetadata), (size_t)[FICImageTable pageSize]);
    entryData = [[FICImageTableEntry alloc] initWithImageTableChunk:chunk bytes:mappedEntryAddress length:(size_t)_entryLength];

Why page alignment? Both disk storage and physical memory are managed in pages, so the page is the block size for transfers between them. Aligning each entry to a page boundary saves CPU cycles when entryData is read, just as byte alignment does.

Create the bitmap image with _imageRowLength as its bytes-per-row. Because the image rows are aligned to the byte block, the CA::Render::copy_image copy of the image data is avoided.
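The entry length calculation can be sketched in plain C as well. The names `entry_length` and `system_page_size` are hypothetical, and the 4096-byte fallback is an assumption for illustration only; the real page size should be queried at run time, which is what `sysconf(_SC_PAGESIZE)` does on POSIX systems.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Length of one entry: image bytes plus metadata bytes, rounded up to a
 * whole number of pages. */
size_t entry_length(size_t image_length, size_t metadata_length, size_t page_size)
{
    assert(page_size != 0);
    size_t raw = image_length + metadata_length;
    return ((raw + page_size - 1) / page_size) * page_size;
}

/* Page size reported by the operating system. The fallback value is an
 * assumption used only if the query fails. */
size_t system_page_size(void)
{
    long size = sysconf(_SC_PAGESIZE);
    return size > 0 ? (size_t)size : 4096;
}
```

With 44800 bytes of image data and 32 bytes of metadata, the entry takes 45056 bytes (11 pages) when pages are 4 KB and 49152 bytes (3 pages) when pages are 16 KB, so every entry starts on a page boundary.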


    // Create the CGDataProviderRef that supplies the image data, together with
    // the function that releases it
    CGDataProviderRef dataProvider = CGDataProviderCreateWithData((__bridge_retained void *)entryData, [entryData bytes], [entryData imageLength], _FICReleaseImageData);

    CGSize pixelSize = [_imageFormat pixelSize];
    CGBitmapInfo bitmapInfo = [_imageFormat bitmapInfo];
    NSInteger bitsPerComponent = [_imageFormat bitsPerComponent];
    NSInteger bitsPerPixel = [_imageFormat bytesPerPixel] * 8;
    CGColorSpaceRef colorSpace = [_imageFormat isGrayscale] ? CGColorSpaceCreateDeviceGray() : CGColorSpaceCreateDeviceRGB();

    // _imageRowLength is passed as bytesPerRow, so every row starts on a 64-byte boundary
    CGImageRef imageRef = CGImageCreate((size_t)pixelSize.width, (size_t)pixelSize.height, (size_t)bitsPerComponent, (size_t)bitsPerPixel, (size_t)_imageRowLength, colorSpace, bitmapInfo, dataProvider, NULL, false, (CGColorRenderingIntent)0);
    CGDataProviderRelease(dataProvider);
    CGColorSpaceRelease(colorSpace);

File reading, resource consumption?

When we get image data from disk, we must call the read() function to read image bytes from disk. Let’s see how the read() function is called:

The figure above shows when an application calls the read() function:

  1. The CPU receives an interrupt signal and enters kernel mode.

  2. In kernel mode, the kernel checks the cache for the image data. If the data is there, it is returned; otherwise the kernel goes on to physical memory.

  3. The kernel checks physical memory for the corresponding physical page (both disk and physical memory are divided into pages, usually 4 KB per page on a 64-bit system). If the page is present, its image data is returned; if not, a page fault exception is raised.

  4. The page fault handler accesses the disk, finds the image data, and loads it into a page, which replaces an existing physical page in physical memory as a new page. The page data is then cached as byte blocks in the cache.

  5. The exception handler sends an interrupt signal and control returns to the kernel, which reloads the byte blocks from the cache and places them in the kernel buffer.

  6. Since the memory address space of the kernel program is completely different from that of the user program, for virtual memory, the kernel program must make a copy of the data bytes in the kernel buffer before it can be returned to the user program.

(PS: the user program and the kernel program may be the same program, merely running in a different CPU mode, or they may be two different programs.)
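From the application's side, the whole sequence above hides behind a single call. The C sketch below shows a conventional read() loop using the standard POSIX calls `open`, `read`, and `close`; the helper name `read_file` is invented for this article. Every call to read() enters the kernel, and the bytes arrive in `buffer` through the copy described in step 6.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to `capacity` bytes of the file at `path` into `buffer`.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_file(const char *path, void *buffer, size_t capacity)
{
    assert(path != NULL && buffer != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    size_t total = 0;
    while (total < capacity) {
        /* Each read() is a system call: the kernel copies bytes from its
         * own buffer into our user-space buffer. */
        ssize_t n = read(fd, (char *)buffer + total, capacity - total);
        if (n < 0) {
            if (errno == EINTR) {
                continue; /* interrupted by a signal, try again */
            }
            close(fd);
            return -1;
        }
        if (n == 0) {
            break; /* end of file */
        }
        total += (size_t)n;
    }
    close(fd);
    return (ssize_t)total;
}
```

The caller has to allocate `buffer` itself, so for a large image the data exists twice for a while: once in the kernel's cache and once in the application's buffer.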

CPU read memory page process:

Of course, the analysis above covers the logical and hardware levels. At the software level, a disk request is handled as follows:

The figure above shows the layers that a read system call passes through in kernel space. As the figure shows, a read request to the disk:

  • First passes through the virtual file system layer (VFS Layer)

  • Then the specific file system layer (e.g. Ext2)

  • Then the cache layer (Page Cache Layer)

  • Then the generic block layer (Generic Block Layer)

  • Then the I/O scheduling layer (I/O Scheduler Layer)

  • Then the block device driver layer (Block Device Driver Layer)

  • And finally the physical block device layer (Block Device Layer)

Read system call processing hierarchy in kernel space

(I will leave this part open for now and cover it in detail in a later article.)

From the analysis of read() above, we know that reading a file from disk is a complex and resource-intensive operation (especially for large files). Because physical memory and cache are limited, once we stop accessing the image data its pages can be chosen as victim pages and evicted from physical memory and the cache. When our application calls read() again later, it has to go through the whole process again.

So how can we optimize this to speed up our image I/O?

How to optimize?

On a closed system like iOS the means of optimization are limited, because we cannot touch the kernel directly. But is optimization impossible? The operating system provides the user-level system calls mmap/munmap, which implement memory mapping. So what can memory mapping bring to reading and writing files?

The answer is that it optimizes away the work in steps 1 and 6 above: the switch into kernel mode for each read and the kernel-to-user memory copy. Let’s first look at what a memory map is.

The operating system initializes the contents of a virtual memory area by associating it with an object on a disk. This process is called Memory mapping.

The following figure shows how memory mapping works:

The following figure shows the location of the memory-mapped region in the process:

When a disk file is memory-mapped into the application, its contents are associated directly with addresses in user space. Reading the file then becomes an ordinary memory access: there is no read() system call for each access and no copy from a kernel buffer into a user buffer. The kernel is still involved the first time a page is touched, when a page fault brings the page in, but after that all reads are served from user space.
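A minimal read-only mapping in C looks like the sketch below. It uses the standard POSIX calls `open`, `fstat`, `mmap`, and `munmap`; the `mapped_file` struct and the two helper names are invented for this article and are not part of FastImageCache.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* A file mapped read-only into the process's address space. */
typedef struct {
    const unsigned char *bytes;
    size_t length;
} mapped_file;

/* Map the whole file at `path`. Returns 0 on success, -1 on failure. */
int map_file_readonly(const char *path, mapped_file *out)
{
    assert(path != NULL && out != NULL);
    out->bytes = NULL;
    out->length = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    struct stat info;
    if (fstat(fd, &info) != 0 || info.st_size <= 0) {
        close(fd);
        return -1;
    }
    void *address = mmap(NULL, (size_t)info.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (address == MAP_FAILED) {
        return -1;
    }
    out->bytes = address;
    out->length = (size_t)info.st_size;
    return 0;
}

/* Release the mapping created by map_file_readonly. */
void unmap_file(mapped_file *file)
{
    if (file->bytes != NULL) {
        munmap((void *)file->bytes, file->length);
    }
    file->bytes = NULL;
    file->length = 0;
}
```

After `map_file_readonly` succeeds, `file.bytes[i]` reads the i-th byte of the file directly. No buffer has to be allocated by the caller, and pages are loaded lazily when they are first accessed.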

Ok, so with that said, how do we do it?

In FastImageCache, each Chunk directly maps one portion of the file into a memory region:

    // FICImageTableChunk.m
    - (instancetype)initWithFileDescriptor:(int)fileDescriptor index:(NSInteger)index length:(size_t)length {
        self = [super init];
        if (self != nil) {
            _index = index;
            _length = length;
            _fileOffset = _index * _length;
            // Map this part of the file into memory as a shared mapping
            _bytes = mmap(NULL, _length, (PROT_READ | PROT_WRITE), (MAP_FILE | MAP_SHARED), fileDescriptor, _fileOffset);
            if (_bytes == MAP_FAILED) {
                NSLog(@"Failed to map chunk. errno=%d", errno);
                _bytes = NULL;
                self = nil;
            }
        }
        return self;
    }

From the FastImageCache architecture analysis article, we know that each image corresponds to one Entry. So why does memory mapping work on Chunks?

Memory mapping pays off more the larger the mapped data is. For small amounts of data, it is cheaper to call read() and let the kernel copy the bytes directly.

To make the mapped data larger, FastImageCache even stores the image already decoded:

    - (void)setEntryForEntityUUID:(NSString *)entityUUID sourceImageUUID:(NSString *)sourceImageUUID imageDrawingBlock:(FICEntityImageDrawingBlock)imageDrawingBlock {
        if (entityUUID != nil && sourceImageUUID != nil && imageDrawingBlock != NULL) {
            [_lock lock]; // Recursive lock

            // Find or create the Entry index
            NSInteger newEntryIndex = [self _indexOfEntryForEntityUUID:entityUUID];
            if (newEntryIndex == NSNotFound) {
                newEntryIndex = [self _nextEntryIndex];
                if (newEntryIndex >= _entryCount) {
                    // Determine how many chunks we need to support new entry index.
                    // Number of entries should always be a multiple of _entriesPerChunk
                    NSInteger numberOfEntriesRequired = newEntryIndex + 1;
                    NSInteger newChunkCount = _entriesPerChunk > 0 ? ((numberOfEntriesRequired + _entriesPerChunk - 1) / _entriesPerChunk) : 0;
                    NSInteger newEntryCount = newChunkCount * _entriesPerChunk;
                    [self _setEntryCount:newEntryCount];
                }
            }

            if (newEntryIndex < _entryCount) {
                CGSize pixelSize = [_imageFormat pixelSize];
                CGBitmapInfo bitmapInfo = [_imageFormat bitmapInfo];
                CGColorSpaceRef colorSpace = [_imageFormat isGrayscale] ? CGColorSpaceCreateDeviceGray() : CGColorSpaceCreateDeviceRGB();
                NSInteger bitsPerComponent = [_imageFormat bitsPerComponent];

                // Create context whose backing store *is* the mapped file data
                FICImageTableEntry *entryData = [self _entryDataAtIndex:newEntryIndex];
                if (entryData != nil) {
                    [entryData setEntityUUIDBytes:FICUUIDBytesWithString(entityUUID)];
                    [entryData setSourceImageUUIDBytes:FICUUIDBytesWithString(sourceImageUUID)];

                    // Update our book-keeping
                    _indexMap[entityUUID] = @((NSUInteger) newEntryIndex);
                    [_occupiedIndexes addIndex:(NSUInteger) newEntryIndex];
                    _sourceImageMap[entityUUID] = sourceImageUUID;

                    // Record the access for the most-recently-used strategy that loads and releases memory
                    [self _entryWasAccessedWithEntityUUID:entityUUID];
                    [self saveMetadata];

                    // Unique, unchanging pointer for this entry's index
                    NSNumber *indexNumber = [self _numberForEntryAtIndex:newEntryIndex];

                    // Relinquish the image table lock before calling potentially slow imageDrawingBlock to unblock other FIC operations
                    [_lock unlock];

                    // Create a bitmap context so the image data is drawn into place
                    CGContextRef context = CGBitmapContextCreate([entryData bytes], (size_t) pixelSize.width, (size_t) pixelSize.height, (size_t) bitsPerComponent, (size_t) _imageRowLength, colorSpace, bitmapInfo);
                    CGContextTranslateCTM(context, 0, pixelSize.height);
                    CGContextScaleCTM(context, _screenScale, -_screenScale);

                    @synchronized(indexNumber) {
                        // Call drawing block to allow clients to draw into the context (this decodes the image)
                        imageDrawingBlock(context, [_imageFormat imageSize]);
                        CGContextRelease(context);

                        // Write the data back to the filesystem
                        [entryData flush];
                    }
                } else {
                    [_lock unlock];
                }

                CGColorSpaceRelease(colorSpace);
            } else {
                [_lock unlock];
            }
        }
    }

A recursive lock is used here, mainly to prevent a deadlock when the same thread acquires the lock more than once. I will share more about the uses of recursive locks when I get the chance.

The -_entryDataAtIndex: method creates the Chunk, but at that point the Chunk holds no data; it is just a file mapping region.

-_indexOfEntryForEntityUUID: looks up the Entry index for the entity, and a new index is taken from -_nextEntryIndex if none exists. The Entry provides the memory for the image bytes plus the metadata bytes for the UUIDs. We then call CGBitmapContextCreate to create a bitmap context backed by that memory, call imageDrawingBlock to draw the image bytes into place, and finally call the Entry's flush to synchronize the image data to disk, which completes the image storage.

Existing problems

But does memory mapping have no drawbacks?

As we learned above, a memory map corresponds directly to a region of virtual memory, so it occupies part of our process's virtual address space for as long as the mapping exists. When the mapped region is very large, it can even interfere with the heap allocations of our program and make performance worse. For this reason, FastImageCache limits the size of each Chunk: a Chunk targets about 2 MB of data.

    NSInteger goalChunkLength = 2 * (1024 * 1024);
    NSInteger goalEntriesPerChunk = goalChunkLength / _entryLength;
    _entriesPerChunk = (NSUInteger)MAX(4, goalEntriesPerChunk); // At least 4 entries per chunk
    _chunkLength = (size_t)(_entryLength * _entriesPerChunk);    // Chunk size in bytes

The actual size varies with the size of the image, but it usually stays close to 2 MB.
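To see what these numbers look like in practice, here is the same sizing rule as a C sketch. The function and macro names are invented for illustration, and the entry lengths used in the example are assumed values.

```c
#include <assert.h>
#include <stddef.h>

#define GOAL_CHUNK_LENGTH ((size_t)2 * 1024 * 1024) /* 2 MB target per chunk */
#define MIN_ENTRIES_PER_CHUNK ((size_t)4)           /* never fewer than 4 entries */

/* How many entries fit into one chunk for a given entry length in bytes. */
size_t entries_per_chunk(size_t entry_length)
{
    assert(entry_length != 0);
    size_t goal = GOAL_CHUNK_LENGTH / entry_length;
    return goal > MIN_ENTRIES_PER_CHUNK ? goal : MIN_ENTRIES_PER_CHUNK;
}

/* Size of one chunk in bytes, which is also the length of each mapping. */
size_t chunk_length(size_t entry_length)
{
    return entry_length * entries_per_chunk(entry_length);
}

/* File offset at which the chunk with the given index starts. */
size_t chunk_file_offset(size_t index, size_t entry_length)
{
    return index * chunk_length(entry_length);
}
```

With a 45056-byte entry, a chunk holds 46 entries and is 2072576 bytes long, just under 2 MB. With a 1 MB entry, the minimum of 4 entries applies and the chunk grows to 4 MB, so large image formats can exceed the target. Because every entry length is a multiple of the page size, every chunk offset is too, which mmap requires of its offset argument.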

To be continued… (I will keep updating this as I learn new methods.)

With the methods above, we can effectively speed up file I/O for our images. Many users today keep tens of gigabytes of pictures on their phones, so when we build a photo album or photo beautification application, this performance cannot be ignored. According to Moore’s Law, computer performance doubles about every 18-24 months, but users’ demands and usage patterns grow over time as well! Only with a deep understanding of how computers work can we write high-performance code and avoid being helpless in the face of high concurrency and high memory use.

Original article: http://simplecodesky.com/2018/04/10/ios-efficient-image-io/