I used to think it was unnecessary to know much about GPU architecture. After all, hardware details are supposed to be hidden behind the graphics API: there are many GPU architectures, and every vendor and model differs, so developers shouldn't have to descend to that level. But a recent look at Metal, especially the features added since 2.0, drove home that graphics APIs are truly in their third generation now. Of course, DX12, Vulkan, and Metal were all born as "modern graphics APIs"; I'm just a little behind the curve. One trait of these modern APIs is that they leave more of the details of driving the hardware to the developer. What really made me take a closer look at GPU architecture, though, is that Metal offers two quite different sets of APIs for desktop GPUs and mobile GPUs, which makes graphics programming on desktop and mobile quite different.

Desktop GPU Architecture — IMR (Immediate Mode Rendering)



IMR is the GPU architecture we are most widely familiar with. Take NVIDIA cards as an example: from Tesla all the way to Turing, they have remained IMR designs. The architecture is also a natural fit for the older graphics APIs and the classic rendering pipeline. Each draw call reaches the graphics card and is executed immediately, running the entire pipeline from start to finish before writing the results into the Frame Buffer. There is a problem with this architecture, however. With depth testing enabled, the output of every fragment must be depth-tested against the value in the Depth Buffer, and if it passes the test, both the Depth Buffer and the Frame Buffer must be updated. That is one read and two writes to System Memory per fragment, and the number of fragments is huge, which puts enormous pressure on System Memory. IMR's answer is to equip the GPU with a large enough cache and wide enough bandwidth. The trade-off is that graphics cards keep growing physically to accommodate more cache, and the frequent, high-bandwidth memory accesses burn so much power and generate so much heat that a dedicated fan has to be added. These costs are acceptable on the desktop, but they become a nightmare on mobile, where both physical space and power are at a premium, so a new GPU architecture had to be introduced.
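To make the cost concrete, here is a toy Python model (an illustrative sketch, not real driver code) of the IMR depth pass described above: every fragment costs a depth read against System Memory, and every passing fragment adds a depth write and a color write, so traffic scales with the total number of fragments rather than with visible pixels.

```python
def imr_depth_pass(fragments, width, height):
    """Toy IMR model: every fragment touches 'system memory' directly.
    Each fragment costs one depth read; a passing fragment adds one
    depth write and one color write."""
    depth = [[float("inf")] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    reads = writes = 0
    for x, y, z, c in fragments:
        reads += 1                      # depth read from system memory
        if z < depth[y][x]:             # less-than depth test
            depth[y][x] = z             # depth write
            color[y][x] = c             # color write to the frame buffer
            writes += 2
    return reads, writes

# Two full-screen layers drawn back-to-front: every fragment passes,
# so traffic tracks the 32 submitted fragments, not the 16 visible pixels.
W = H = 4
far  = [(x, y, 1.0, "far")  for y in range(H) for x in range(W)]
near = [(x, y, 0.5, "near") for y in range(H) for x in range(W)]
reads, writes = imr_depth_pass(far + near, W, H)
print(reads, writes)  # 32 64
```

Drawing the same scene front-to-back would cut the writes but not the reads; either way, every fragment goes through System Memory.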

Mobile GPU Architecture — TBR (Tile Based Rendering)



The TBR architecture adds a small cache close to the GPU, usually called Tile Memory (also known as the on-chip buffer). For cost, power, and other reasons this cache is not large, on the order of tens of KB. First, the entire screen is divided into many small blocks called tiles, typically 32 x 32 pixels, so that Tile Memory is large enough to hold all the data belonging to one tile. When a draw call arrives at the graphics card, it is not rendered immediately as in IMR. Instead, the vertex data is run through the vertex shader, clipped, and binned by tile, and the binned results are stored in a buffer in System Memory known as the Parameter Buffer (PB, holding the primitive lists and vertex data). The GPU then moves on to the next draw call. Once the vertex data of all draw calls has been processed and stored in the PB (or the PB reaches a capacity threshold), the next stage of the pipeline begins: the GPU fetches the corresponding vertex data from the PB one tile at a time and performs rasterization, fragment shading, and per-fragment operations. The frequent System Memory accesses of per-fragment processing are replaced by very cheap accesses to Tile Memory; only once the tile's fragments are fully resolved in Tile Memory is the result written back to System Memory, and then the next tile is processed. Instead of IMR's unbounded number of per-fragment reads and writes to System Memory, TBR performs a limited number of block writes (one per tile). The PB itself also lives in System Memory, but it is accessed at vertex granularity (and vertices are obviously far fewer than fragments) and the data is specially compressed, so the substitution is still a win.
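The binning step can be sketched in Python (a toy model, with axis-aligned bounding boxes standing in for real clipped primitives): the vertex stage sorts primitives into per-tile lists, a stand-in for the Parameter Buffer, and the per-tile stage then touches System Memory only once per tile.

```python
def bin_primitives(primitives, tile_size):
    """Vertex stage of a toy TBR: sort primitives into per-tile lists
    (a stand-in for the Parameter Buffer in System Memory)."""
    pb = {}
    for prim_id, (x0, y0, x1, y1) in enumerate(primitives):
        for ty in range(y0 // tile_size, y1 // tile_size + 1):
            for tx in range(x0 // tile_size, x1 // tile_size + 1):
                pb.setdefault((tx, ty), []).append(prim_id)
    return pb

def render_tiles(pb, tiles_x, tiles_y):
    """Per-tile stage: depth/color work happens in Tile Memory; System
    Memory sees one block write per tile, however deep the overdraw."""
    writes = 0
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            _work_list = pb.get((tx, ty), [])  # shaded entirely on-chip
            writes += 1                        # single resolve per tile
    return writes

# 64x64 screen with 32x32 tiles -> 4 tiles; two overlapping quads
# given as inclusive pixel bounding boxes (x0, y0, x1, y1).
pb = bin_primitives([(0, 0, 40, 40), (20, 20, 63, 63)], 32)
block_writes = render_tiles(pb, 2, 2)
print(sorted(pb))    # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(block_writes)  # 4
```

Both quads overlap all four tiles here, so each tile's list holds both primitive IDs, yet System Memory still sees exactly four block writes.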
However, there are some problems with this architecture. The rendering pipeline is split in the middle, which makes switching the Frame Buffer at that point cumbersome. In TBR, a Frame Buffer switch forces all cached rendering work to be drawn, and some of the results may then have to be transferred into the new Frame Buffer, which increases the data copying between Tile Memory and System Memory. Frame Buffer switches should therefore be used with care on mobile devices. On the other hand, because draws are cached rather than rendered immediately, there is a lot of room for optimization.
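A toy accounting model (illustrative only, not a real driver) shows why batching draws by render target matters: each mid-frame target change forces the cached tile work to be flushed, so the flush count follows the number of switches, not the number of draw calls.

```python
def count_flushes(commands):
    """Toy model: each render-target change mid-frame flushes the
    cached tile work to System Memory; one final resolve ends the frame."""
    flushes = 0
    current = None
    for target, _draw in commands:
        if current is not None and target != current:
            flushes += 1    # forced flush of cached tiles
        current = target
    flushes += 1            # final resolve at end of frame
    return flushes

# Same four draw calls, two orderings.
interleaved = [("fbo0", "a"), ("fbo1", "b"), ("fbo0", "c"), ("fbo1", "d")]
batched     = [("fbo0", "a"), ("fbo0", "c"), ("fbo1", "b"), ("fbo1", "d")]
flushes_interleaved = count_flushes(interleaved)
flushes_batched = count_flushes(batched)
print(flushes_interleaved, flushes_batched)  # 4 2
```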

More Powerful Mobile GPU Architecture — TBDR (Tile Based Deferred Rendering)



Putting TBDR here might lead to the misconception that TBDR is simply an upgraded version of TBR. TBDR is Imagination's own mobile GPU architecture, widely used in its PowerVR products. Because of its significant advantages over other TBR architectures of the same period, Apple favored it and installed it in the iPhone; even after moving to fully in-house A-series processors in recent years, the iPhone has continued to use a TBDR architecture. TBDR's advantage is that the vertex data cached in the PB can be used to filter out occluded fragments before they enter the rest of the pipeline, attacking a long-standing problem of the traditional rendering pipeline: overdraw. The key to achieving this step is HSR (Hidden Surface Removal).



As shown in the figure above, the Image Synthesis Processor (ISP) fetches the current tile's vertex data (and only the vertex data) from the PB. The ISP rasterizes the primitives, interpolates a depth value for each covered pixel, and performs the depth and stencil tests. If a fragment passes, the on-chip depth and stencil buffers are updated and the primitive's ID is recorded in the tag buffer. When the ISP has processed all the primitives in a tile, the tag buffer holds, for each pixel, the ID of the single visible primitive. The visible primitives are then fetched from the PB, primitive by primitive, and the remaining varying data (such as UVs) is interpolated by the TSPF and passed to the fragment shader. This also explains the two branches of vertex data in the overall TBDR flowchart: *1 represents the vertex data flowing into the ISP, and *2 the other data flowing through the TSPF.
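The tag-buffer mechanism can be sketched per tile in Python (a toy model; each primitive is given as a depth plus a list of covered pixels): the ISP depth-tests every covering fragment but records only the winning primitive ID, so the fragment shader later runs at most once per pixel.

```python
def hsr_tile(primitives, tile_w, tile_h):
    """Toy HSR: the ISP depth-tests every covering primitive per pixel
    and records only the winner's ID in the tag buffer; the fragment
    shader then runs once per pixel, not once per covering fragment."""
    depth = [[float("inf")] * tile_w for _ in range(tile_h)]
    tag   = [[None] * tile_w for _ in range(tile_h)]
    coverage = 0
    for prim_id, (z, pixels) in enumerate(primitives):
        for x, y in pixels:
            coverage += 1                 # one ISP depth test
            if z < depth[y][x]:
                depth[y][x] = z
                tag[y][x] = prim_id       # remember the visible primitive
    shaded = sum(1 for row in tag for t in row if t is not None)
    return tag, coverage, shaded

# Three stacked full-tile quads at different depths (primitive 1 nearest).
full = [(x, y) for y in range(4) for x in range(4)]
prims = [(0.9, full), (0.5, full), (0.7, full)]
tag, coverage, shaded = hsr_tile(prims, 4, 4)
print(coverage, shaded)  # 48 16 -> 48 covering fragments, only 16 shaded
print(tag[0][0])         # 1 -> the nearest primitive wins every pixel
```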
TBDR also has some drawbacks. Fragments that may be discarded in the fragment shader (alpha testing) have depths that cannot be trusted until the shader has actually run, so HSR cannot resolve them up front. These fragments must be submitted to the fragment shader in advance, via the GCS in the figure above, and returned to the ISP once their exact depth is known. This blocks the processing of other fragments, so it is a relatively expensive cost. Semi-transparent objects (alpha blending) are also a problem: since a pixel's final color is not determined by the nearest fragment alone, the cached, partially rendered tiles are forced to be drawn, which again increases copying between Tile Memory and System Memory. Objects should therefore be grouped: draw opaque objects first with depth testing on, then draw transparent objects with depth writes disabled.
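The alpha-blending point can be verified with one pixel's worth of arithmetic: under the standard "over" operator, the final color keeps a contribution from every translucent layer, so keeping only the nearest fragment (as HSR does for opaque geometry) computes the wrong color.

```python
def blend_over(src_rgb, src_a, dst_rgb):
    """Standard straight-alpha 'over' blending."""
    return tuple(src_a * s + (1 - src_a) * d for s, d in zip(src_rgb, dst_rgb))

background = (1.0, 1.0, 1.0)                 # white
layers = [((1.0, 0.0, 0.0), 0.5),            # red layer, farther
          ((0.0, 0.0, 1.0), 0.5)]            # blue layer, nearer

# Correct result: blend every layer back-to-front.
color = background
for rgb, a in layers:
    color = blend_over(rgb, a, color)
print(color)         # (0.5, 0.25, 0.75) -- the red layer still shows through

# Keeping only the nearest fragment loses the red layer entirely.
nearest_only = blend_over(layers[-1][0], layers[-1][1], background)
print(nearest_only)  # (0.5, 0.5, 1.0)
```

Because the correct answer depends on every layer behind the nearest one, the hardware cannot discard hidden translucent fragments the way it discards hidden opaque ones.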
Although HSR technology is patented by Imagination, and in a narrow sense only graphics cards that use HSR can be called TBDR, other vendors have made their own improvements to the TBR architecture, such as Arm's Forward Pixel Kill and Qualcomm's FlexRender, both aimed at reducing overdraw. There is also a software-level technique called Tile-Based Deferred Shading, but despite the similar name it operates at a completely different level from TBDR.

Looking Back at IMR, TBR, and TBDR

Because of Tile Memory, IMR and TBR have completely different rendering flows, and some rules that hold on desktop are completely different on mobile, beyond the already mentioned issues of FBO switching, alpha testing, and alpha blending. On IMR, for example, clearing every frame is entirely unnecessary when the whole screen is redrawn anyway, covering the previous frame. On TBR, by contrast, not clearing each frame means copying the previous frame's contents into Tile Memory at the start of every tile, so clearing each frame is required to prevent the unnecessary copies. Texture sampling can also become an expensive operation, given how precious Tile Memory is and how costly System Memory access is. Texture data is stored in System Memory, with only a small amount of recently accessed data held in the on-chip cache, so using compressed textures lets the limited cache hold more texture data. Conversely, an LUT (look-up table), whose access pattern defies the spatial locality of normal texture sampling, greatly reduces the cache hit rate and should be used sparingly.
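The clear-versus-load difference can be expressed as a toy traffic count (illustrative numbers only; this mirrors the load action of a render pass in a modern API): a "load" must fetch the tile's previous contents from System Memory before shading can begin, while a "clear" only initializes Tile Memory on-chip.

```python
def tile_start_cost(load_action, pixels_per_tile):
    """System-memory reads needed before shading a tile can begin."""
    if load_action == "clear":
        return 0                 # initialize Tile Memory on-chip: free
    if load_action == "load":
        return pixels_per_tile   # fetch previous contents, per pixel
    raise ValueError(load_action)

# 64 tiles of 32x32 pixels each.
tiles, pixels_per_tile = 64, 32 * 32
traffic = {a: tiles * tile_start_cost(a, pixels_per_tile)
           for a in ("load", "clear")}
print(traffic)  # {'load': 65536, 'clear': 0}
```

This is why, for instance, Metal's render passes ask for an explicit load action per attachment: declaring that the previous contents are not needed removes the per-tile fetch entirely.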
TBR, an architecture "compromised" into existence by the constraints of mobile, has these limitations but also unexpected power. The magical Tile Memory not only makes reads and writes nearly free, it also provides a temporary scratch space for the rendering pipeline. The significance of this cache is that rendering work that used to take multiple passes can become a single pass. To enable this optimization, Metal 2 introduced features such as imageblocks and tile shaders, which let developers program the data in Tile Memory directly, and which make mobile and desktop implementations of a rendering pipeline diverge considerably.
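The payoff can be estimated with a back-of-the-envelope traffic count (illustrative numbers; a 4-channel G-buffer at 1080p is assumed): keeping the intermediate data in Tile Memory removes the G-buffer's round trip through System Memory entirely.

```python
def multipass_traffic(pixels, gbuffer_channels):
    """Classic two-pass deferred shading: pass 1 writes the G-buffer to
    System Memory, pass 2 reads it all back for lighting, then writes color."""
    gbuffer_writes = pixels * gbuffer_channels
    gbuffer_reads  = pixels * gbuffer_channels
    color_writes   = pixels
    return gbuffer_writes + gbuffer_reads + color_writes

def single_pass_traffic(pixels):
    """Tile-memory version (in the spirit of Metal 2's imageblocks): the
    G-buffer never leaves Tile Memory; only final color is resolved."""
    return pixels

pixels = 1920 * 1080
print(multipass_traffic(pixels, 4))  # 18662400
print(single_pass_traffic(pixels))   # 2073600
```

Under these assumptions the single-pass variant touches System Memory nine times less, which is exactly the kind of saving that makes tile-memory programming worthwhile on mobile.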