Author | Chen Yu (cen for)
Editor | Orange
Alibaba Tao Technology, New Retail Product
Preface
Metal is a low-level graphics programming interface similar to OpenGL ES: its APIs let you operate the GPU directly. It was first released at WWDC 2014. Metal is exclusive to Apple platforms, which means it cannot be cross-platform the way OpenGL ES is, but in exchange it can fully exploit the GPU power of Apple's devices for complex computation. Game engines such as Unity optimize their 3D capabilities through Metal, and the App Store even has a dedicated section for Metal games.
The Xianyu team was an early adopter of Flutter on the client side, and the current Xianyu project is a fairly complex Native/Flutter hybrid. As a consumer-facing (2C) application, performance and user experience have always been a focus of the Xianyu technical team during development. Metal's low-level interface for operating the GPU directly will undoubtedly give the team some new ideas for breaking through performance bottlenecks.
The following sections describe Metal's new features from this year's conference, and the technical inspiration and reflections these features bring to the Xianyu team and the wider Tao (Taobao) technology organization.
Metal new features
Harness Apple GPUs with Metal
This session introduces the principles and workflow of Apple GPUs in graphics rendering, which are fairly low-level hardware topics. When you build apps or games with Metal, the GPU's Tile-Based Deferred Rendering (TBDR) architecture brings significant performance improvements. The session covers the architecture and capabilities of the GPU, as well as the principles and process of image rendering on the TBDR architecture. Since it does not involve upper-level software development, I will not retell the video in detail here; see the session video Harness Apple GPUs with Metal.
Optimize Metal apps and games with GPU counters
This session introduces Xcode's GPU performance analysis tooling: Instruments now supports GPU performance analysis. It then analyzes GPU performance bottlenecks from many angles, along with the optimization points to apply when a bottleneck appears. In short, the goal is to use performance analysis tools to make our app or game render more smoothly. The session is divided into five parts:
✎ Apple GPU Architecture and GPU Performance Counters
This part is a quick review of the architecture and rendering process of Apple GPUs. Many rendering tasks execute on different hardware units, such as the ALU and the TPU, and these units have different throughput characteristics, so there are many GPU performance metrics to consider. This is why GPU performance counters are introduced: they measure GPU utilization, and values that are too high or too low can both signal rendering performance bottlenecks. The tooling walkthrough in Optimize Metal Apps and Games with GPU Counters (6:37 to 9:57) mainly uses Instruments; for details on those tools, see the WWDC19 session video Getting Started with Instruments.
✎ Performance Bottleneck Analysis
This chapter focuses on the various aspects that cause GPU performance bottlenecks and their optimization points. It is mainly divided into six aspects, as shown in the figure below:
✎ Arithmetic
GPUs use the Arithmetic Logic Unit (ALU) to process various operations, such as bitwise and relational operations; it is part of the shader core. Complex operations or high-precision floating-point operations can create a performance bottleneck here, so the session offers some optimization suggestions:
As shown in the figure above, we can replace complex operations with approximations or lookup tables. We can also replace full-precision floating-point numbers with half-precision ones: avoid implicit conversions, and avoid 32-bit floating-point inputs where they are not needed. Finally, make sure all shaders are compiled with Metal's "-ffast-math".
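As a host-side sketch of the fast-math advice: `fastMathEnabled` on MTLCompileOptions is the runtime counterpart of the "-ffast-math" flag. The `shaderSource` string below is a hypothetical placeholder for your own MSL code, not something from the session:

```swift
import Metal

// Sketch: request fast math when compiling shaders at run time.
// `shaderSource` is a hypothetical placeholder for your MSL source.
let device = MTLCreateSystemDefaultDevice()!
let options = MTLCompileOptions()
options.fastMathEnabled = true  // relaxed floating-point semantics, faster ALU code
let library = try device.makeLibrary(source: shaderSource, options: options)
```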
✎ Texture Read and Write

The GPU uses the Texture Processing Unit (TPU) to handle texture reads and writes. Naturally there can be performance bottlenecks in both, so optimization points are given for reading and writing separately:
Read
We can use mipmaps, as shown above. We can also consider changing the filtering options, for example using bilinear instead of trilinear filtering, and reducing the pixel size. Make sure texture compression is used: block compression (such as ASTC) for assets, and lossless texture compression for textures generated at run time.
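The read-side suggestions can be sketched on the host as follows; this is a minimal illustration assuming `device` and `commandQueue` already exist:

```swift
import Metal

// Sketch: allocate a texture with a full mip chain, then let the GPU
// generate the mip levels after level 0 has been filled.
// Assumes `device: MTLDevice` and `commandQueue: MTLCommandQueue` exist.
let desc = MTLTextureDescriptor.texture2DDescriptor(
    pixelFormat: .rgba8Unorm,
    width: 1024,
    height: 1024,
    mipmapped: true)          // reserves storage for every mip level
let texture = device.makeTexture(descriptor: desc)!

let commandBuffer = commandQueue.makeCommandBuffer()!
let blit = commandBuffer.makeBlitCommandEncoder()!
blit.generateMipmaps(for: texture)   // builds levels 1...N from level 0
blit.endEncoding()
commandBuffer.commit()
```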
Write
As shown in the figure above, we should pay attention to the pixel size, as well as the number of unique MSAA samples per pixel. We can also try to optimize the shader's write logic.
✎ Tile Memory Load and Store
Tile memory is a set of high-performance on-chip memory that stores threadgroup and imageblock data. Tile memory is accessed when pixel data is read from or written to an imageblock or threadgroup, for example when using tile shaders or compute dispatches. When the GPU performance counters reveal a bottleneck here, we can optimize as shown in the figure below.
Consider reducing the parallelism of threadgroup or SIMD-group/quadgroup operations. Also make sure that threadgroup memory allocations and accesses are aligned to 16 bytes. Finally, consider reordering the memory access pattern.
✎ Buffer Read and Write
In Metal, device buffers are read and written by the shader cores. If a performance bottleneck shows up here, we can optimize as shown below:
We can pack the data more tightly, for example by using smaller types such as packed_half3. We can also try vectorized loads and stores, for example via SIMD types. Avoid register spills, and consider using textures to balance the workload.
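To make the packing advice concrete, here is a hypothetical CPU-side layout that mirrors MSL's packed_half3 (the `PackedHalf3` struct is illustrative, not a Metal API, and Float16 requires a recent Swift on arm64):

```swift
import Metal

// A tightly packed half-precision 3-vector: 3 x 2 bytes = 6 bytes,
// versus SIMD3<Float>, which occupies 16 bytes due to alignment.
struct PackedHalf3 {
    var x: Float16
    var y: Float16
    var z: Float16
}

let normals = [PackedHalf3(x: 0, y: 1, z: 0)]
// Assumes `device: MTLDevice` exists; upload the packed data.
let buffer = device.makeBuffer(
    bytes: normals,
    length: normals.count * MemoryLayout<PackedHalf3>.stride,
    options: .storageModeShared)
```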
✎ GPU Last Level Cache
If the GPU performance counters show an excessively high value here, we can optimize as shown below:
If the texture or buffer counters also show excessively high values, optimize those first. We can consider shrinking the size of the working set, and if the shader uses device atomics, we can try refactoring the code to use threadgroup atomics instead.
✎ Fragment Input Interpolation
Fragment inputs are interpolated by the shader core during the rendering stage, using a dedicated fragment-input interpolator. This is a fixed-function, high-precision unit, so there is little we can tune, as shown in the figure below:
Remove as many vertex attributes passed to the fragment shader as possible.
Memory bandwidth
Memory bandwidth is another important factor that affects GPU performance. If you see a high value in the memory bandwidth section of the GPU performance counters, optimize as shown below:
If the texture and buffer counters also display high values, optimize those first. Here, too, the main remedy is to reduce the size of the working set. Beyond that, load only the data needed for the current render pass, store only the data needed by future render passes, and make sure texture compression is used.
Occupancy
If we see low overall utilization, the shader may have exhausted some internal resource, such as tile or threadgroup memory. It could also be that threads complete execution faster than the GPU can create new ones.
Avoid redrawing
We can use GPU counters to measure areas that are drawn repeatedly, and hidden surface removal (HSR) should be used efficiently to avoid such redraw, drawing in the order shown in the figure.
Build GPU binaries with Metal
This session introduces a Metal programming workflow that improves the rendering pipeline by optimizing Metal's rendering and compilation model. The optimization can greatly reduce the load time of pipeline state objects (PSOs) at application startup, especially on first launch, making graphics rendering more efficient overall. The session is divided into four parts:
Overview of Metal’s Shader compilation model
Metal Shading Language is Apple's shader programming language. Metal compiles this source into an intermediate representation called AIR, which is then compiled further on the device to generate the specific machine code required by each GPU. The whole process is shown below:
This process occurs every time a pipeline is created. Apple currently caches some Metal function variants to speed up pipeline recompilation and re-creation, but loading can still take too long. Moreover, under the current compilation model, applications cannot reuse previously generated machine-code subroutines across different PSOs (pipeline state objects).
So we need a way to reduce the cost of the whole pipeline compilation (source code -> AIR -> GPU binaries), plus a mechanism to share subroutines and functions between different PSOs, so the same code does not have to be compiled or loaded into memory multiple times. With such tools, developers can optimize their app's first-launch experience.
Metal Binary file introduction
One solution is Metal binary archives, which developers can now use to directly control PSO caching in binary form. Developers can collect compiled PSOs, store them on device, and even distribute them to other compatible devices (same GPU and same operating system); the binary archive can be treated as an asset. Here are some code snippets and diagrams:
```swift
// Create an empty binary archive
let descriptor = MTLBinaryArchiveDescriptor()
descriptor.url = nil
let binaryArchive = try device.makeBinaryArchive(descriptor: descriptor)
```
```swift
// Populating an archive
// Render pipelines
try binaryArchive.addRenderPipelineFunctions(with: renderPipelineDescriptor)
// Compute pipelines
try binaryArchive.addComputePipelineFunctions(with: computePipelineDescriptor)
// Tile render pipelines
try binaryArchive.addTileRenderPipelineFunctions(with: tileRenderPipelineDescriptor)
```
```swift
// Use compiled functions from the archive to build a pipeline state object
let renderPipelineDescriptor = MTLRenderPipelineDescriptor()
// ...
renderPipelineDescriptor.binaryArchives = [binaryArchive]
let renderPipeline = try device.makeRenderPipelineState(descriptor: renderPipelineDescriptor)
```
```swift
// Serialize the archive to disk
let documentsURL = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask).first!
let archiveURL = documentsURL.appendingPathComponent("binaryArchive.metallib")
try binaryArchive.serialize(to: archiveURL)
```
```swift
// Deserialize an archive from disk
let documentsURL = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask).first!
let serializedURL = documentsURL.appendingPathComponent("binaryArchive.metallib")
let descriptor = MTLBinaryArchiveDescriptor()
descriptor.url = serializedURL
let binaryArchive = try device.makeBinaryArchive(descriptor: descriptor)
```
In short, binary archives give developers a way to manage the pipeline cache manually, so that it can be harvested on one device and deployed to other compatible devices. This greatly reduces pipeline creation time after the first install of a game or application, and after a device restart on iOS, optimizing both the first-launch and cold-start experience.
Metal supports dynamic libraries
Dynamic libraries allow developers to write reusable library code while reducing the time and memory cost of recompilation. This feature lets developers dynamically link compute shaders against libraries, and like binary archives, dynamic libraries are serializable and transferable. This is another solution to the requirements above.
At PSO generation time, each application has to generate machine code for its library code, and compiling multiple pipelines against the same library produces duplicate machine code. The heavy compilation leads to longer pipeline load times and increased memory use. Dynamic libraries solve this problem.
The Metal dynamic library allows developers to dynamically link, load, and share utility functions as machine code. Code can be reused across multiple compute pipelines, eliminating repeated compilation and storage of identical subroutines. An MTLDynamicLibrary is serializable and can be treated as an asset of the application; in essence, it is a collection of exported functions that compute pipelines can call.
The general workflow is as follows. First create an MTLLibrary as the designated dynamic library, which compiles our Metal source into AIR. Then call makeDynamicLibrary, which requires a unique installName that the linker will use to load the dynamic library at pipeline-creation time; this call compiles the dynamic library down to machine code, completing its creation.
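The steps just described might be sketched as follows. `utilitySource` and `dylibURL` are hypothetical placeholders, and the install name shown is only an example:

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!

// 1. Compile the reusable utility code to AIR, tagged with the
//    install name the linker will look for at pipeline-creation time.
let options = MTLCompileOptions()
options.libraryType = .dynamic
options.installName = "@executable_path/libUtility.metallib"
let airLibrary = try device.makeLibrary(source: utilitySource, options: options)

// 2. Compile the AIR into a machine-code dynamic library.
let dynamicLibrary = try device.makeDynamicLibrary(library: airLibrary)

// 3. Optionally serialize it to ship or cache as an asset.
try dynamicLibrary.serialize(to: dylibURL)
```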
A dynamic library is loaded and used by setting the libraries property of MTLCompileOptions. The code is as follows:
```swift
// Compile a library that links against a dynamic utility library
let options = MTLCompileOptions()
options.libraries = [utilityDylib]
let library = try device.makeLibrary(source: kernelStr, options: options)
```
Introduction to Development Tools
This part introduces the specific tools and workflows for building Metal binary archives and dynamic libraries; see Build GPU Binaries with Metal (from 22:51).
Debug GPU-side errors in Metal
This session covers GPU-side errors. Today, when an application hits a GPU-side error, the error log often fails to let developers intuitively locate the offending code and its call stack. The latest Xcode therefore enhances GPU-side debugging: as with errors raised on the CPU side, you can not only locate the cause of the error but also inspect the error call stack and related information in detail, letting developers fix GPU-side rendering errors in their code much more effectively.
Enhanced Command Buffer Errors
This is the current error log report. As we can see, GPU-side error logs are not like API-side error logs: they do not let developers quickly locate the cause of the error or the offending code.
The latest Metal debugging tools enhance this capability, allowing errors in shader code to be located and categorized just like errors in API code.
We can enable the enhanced command buffer error mechanism with the following code:
```swift
// Enable the enhanced command buffer error mechanism
let desc = MTLCommandBufferDescriptor()
desc.errorOptions = .encoderExecutionStatus
let commandBuffer = commandQueue.makeCommandBuffer(descriptor: desc)
```
There are five possible error states:
We can also print the error with the following code:
```swift
if let error = commandBuffer.error as NSError? {
    if let encoderInfos = error.userInfo[MTLCommandBufferEncoderInfoErrorKey] as? [MTLCommandBufferEncoderInfo] {
        for info in encoderInfos {
            print(info.label + info.debugSignposts.joined())
            if info.errorState == .faulted {
                print(info.label + " faulted!")
            }
        }
    }
}
```
Developers can enable the enhanced error mechanism both during development and in testing.
Shader Validation
As shown above, when a rendering error occurs on the GPU side, this feature can automatically catch it, locate it in the code, and capture the backtrace.
We can enable this feature in Xcode as follows:
1. Enable the two Metal validation options (API Validation and Shader Validation)
2. Enable the automatic breakpoint for issues, and configure its type and category
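Outside of Xcode, the same shader validation can also be switched on via environment variables when launching a binary from the command line; `MyMetalApp` below is a hypothetical placeholder:

```shell
# Enable Metal shader validation (and the API validation layer)
# for a process launched outside Xcode. MyMetalApp is hypothetical.
MTL_SHADER_VALIDATION=1 MTL_DEBUG_LAYER=1 ./MyMetalApp
```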
The video uses a demo to show the entire workflow; for details, see Debug GPU-side Errors in Metal (11:25 to 14:45). The figures below show the general workflow:
This demo application is clearly rendering with some glitches, but since it is a GPU-side issue it is hard for developers to pin down. However, once Shader Validation is enabled using the workflow described above, Xcode automatically breaks at the point where the exception occurred and displays the diagnostic information, which can greatly improve the efficiency of fixing such errors.
Gain insights into your Metal app with Xcode 12
This session focuses on Xcode 12's new tools for debugging and analyzing Metal apps, as shown below:
It is mainly divided into two parts:
Metal Debugger
This tool lets developers capture any frame they want to analyze while the app is running, then enter the various views Xcode provides, including overview, dependency, memory, bandwidth, GPU, and shader views, to analyze and debug that frame in detail. The whole flow is easier to follow on video, so I will not go into details here; see Gain Insights into Your Metal App with Xcode 12.
Metal System Trace
Compared with the debugger above, its main purpose is to let developers capture the application's behavior over time, which helps in debugging problems such as stutters, dropped frames, and memory leaks, whereas the debugger mainly analyzes a single frame.
It offers a timeline tool that lets developers see how the GPU executes the various command buffers while the application runs, and a shader timeline that shows the shaders as their code executes. Then there is the GPU counters tool, analyzed in detail earlier, which is mainly used to diagnose GPU drawing performance. The final tool is the memory allocation tracker, which lets developers inspect allocations and deallocations during execution, helping to fix memory leaks or reduce the application's memory footprint.
Technical inspiration and thinking
WWDC 20's Metal sessions provide GPU-level debugging and performance analysis tools. For mature, large, and complex projects, they offer new ideas for breaking through performance bottlenecks and delivering a better user experience.
As an e-commerce app, Xianyu will inevitably hit performance bottlenecks as features accumulate and the project grows more complex, and so far the team has met that challenge mainly with engineering-level optimization. From the perspective of Flutter, the Metal debugging and performance analysis tools from WWDC 20 supply additional optimization ideas, opening up new possibilities for tuning and breaking through performance bottlenecks in future iOS applications.
As for cross-platform frameworks, Apple has its own SwiftUI, which was also a focus of this conference. For both Flutter and SwiftUI, however, the ultimate goals are the same: break performance bottlenecks and optimize the application, which means developing, debugging, and profiling at the GPU level. Understanding GPUs and programming at the GPU level will surely be an indispensable skill for client-side developers in the future.