What do we talk about when we talk about optimization

Geometric data optimization

Reduce the number of vertices

  • What reducing vertices buys us
    • Fewer vertices for the Shader Core to read from System Memory/DRAM (bandwidth pressure)
    • Less VS execution (compute pressure)
    • For mobile GPUs, this also means a faster Binning Pass (mainly bandwidth pressure)
  • Three ways to reduce vertices
    • Mesh simplification (reducing the polygon count)
    • LOD
    • Normal maps instead of high-poly geometric detail

Reduce the amount of data per vertex

  • Less data per vertex means less read/write overhead in the VAF (Vertex Attribute Fetch) stage and the Binning Pass.
  • We can use fast vertex data compression/decoding schemes in the VS [21][22], trading a small amount of computation for less bandwidth. A minimal sketch of one common scheme follows.
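
As a concrete illustration, here is a minimal VS sketch of one popular scheme, octahedral normal encoding, which stores a unit normal as two components in [-1,1] (e.g. uploaded as SNORM16) instead of three floats. This shows the general idea only, not necessarily the exact scheme of [21][22]; attribute names and locations are illustrative.

```glsl
#version 300 es
// Octahedral normal decoding in the VS: the normal arrives as two
// components in [-1,1] (e.g. stored as SNORM16) instead of 3 floats.
layout(location = 0) in vec3 aPosition;
layout(location = 1) in vec2 aOctNormal; // hypothetical packed attribute

uniform mat4 uMVP;
out vec3 vNormal;

vec3 decodeOctNormal(vec2 e)
{
    vec3 n = vec3(e, 1.0 - abs(e.x) - abs(e.y));
    // Fold the lower hemisphere back over the octahedron's edges.
    if (n.z < 0.0)
        n.xy = (1.0 - abs(n.yx)) * sign(n.xy);
    return normalize(n);
}

void main()
{
    vNormal = decodeOctNormal(aOctNormal);
    gl_Position = uMVP * vec4(aPosition, 1.0);
}
```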

Avoid small triangles

  • The most obvious drawback of small triangles is that they occupy very few pixels on screen while still paying full per-triangle cost, so most of the work is visually wasted.
  • The hardware pipeline also pays a fixed per-triangle cost in Triangle Setup and Triangle Clipping — the “fixed cost of constructing a triangle”. Article [19] likewise recommends avoiding triangles smaller than 32 pixels; my guess is that a triangle covering fewer than 32 pixels cannot fill a 32-thread warp in the PS, so some lanes are wasted. The GPU Driven Pipelines proposed in recent years use a Compute Shader to cull small triangles before rasterization [23]; a sketch of that idea follows.
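
A minimal sketch of that idea, under simplifying assumptions (positions already transformed to clip space in an SSBO, no near-plane handling; buffer names and bindings are illustrative): the compute shader culls triangles whose screen-space bounding box rounds to zero extent on either axis — such a triangle cannot cover any pixel center — and compacts the survivors into a new index buffer.

```glsl
#version 430
layout(local_size_x = 64) in;

// Hypothetical inputs: positions already transformed to clip space.
layout(std430, binding = 0) readonly buffer Positions   { vec4 clipPos[]; };
layout(std430, binding = 1) readonly buffer InIndices   { uint inIdx[];  };
layout(std430, binding = 2) writeonly buffer OutIndices { uint outIdx[]; };
layout(binding = 3) uniform atomic_uint uOutTriCount;

uniform vec2 uViewportSize;
uniform uint uTriCount;

void main()
{
    uint tri = gl_GlobalInvocationID.x;
    if (tri >= uTriCount) return;

    // Project the three vertices to pixel coordinates (w > 0 assumed;
    // a real implementation must also handle near-plane clipping).
    vec2 p[3];
    for (int i = 0; i < 3; ++i) {
        vec4 c = clipPos[inIdx[tri * 3u + uint(i)]];
        p[i] = (c.xy / c.w * 0.5 + 0.5) * uViewportSize;
    }

    // A bounding box that rounds to the same pixel boundary on either
    // axis cannot enclose a pixel center: cull the triangle.
    vec2 bbMin = min(p[0], min(p[1], p[2]));
    vec2 bbMax = max(p[0], max(p[1], p[2]));
    if (any(equal(round(bbMin), round(bbMax)))) return;

    // Survivor: append its three indices to the compacted buffer.
    uint slot = atomicCounterIncrement(uOutTriCount) * 3u;
    outIdx[slot + 0u] = inIdx[tri * 3u + 0u];
    outIdx[slot + 1u] = inIdx[tri * 3u + 1u];
    outIdx[slot + 2u] = inIdx[tri * 3u + 2u];
}
```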

Optimizing the index buffer

  • The more often a vertex is re-referenced while it is still resident in the cache, the higher the cache hit rate and, accordingly, the lower the bandwidth overhead.
  • The index buffer can therefore be rearranged so that references to the same vertex land as close together as possible within a window of indices [24].

Interleaved Attributes vs. Separate Attributes

  • Interleaved attributes: if a set of attributes is always used together in the VS (e.g. skinning weights and skinning indices, or Normal/Tangent), we should interleave them in one buffer, which also reduces the number of Graphics API bind calls.
  • If attributes are used at very different frequencies across vertex shaders (for example, position is used by almost every shader, but vertex color only rarely), then we should store them in separate buffers.
  • This is the same principle as the AoS vs. SoA distinction: maximize cache utilization. The minimum unit of a cache load is a cache line, usually 64/128 bytes, so lay memory out so that each load brings as much useful data into the cache as possible. The sketch below illustrates the difference.
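
To make the cache-line argument concrete, here is a sketch using SSBOs in a compute shader (for vertex attributes the same choice is expressed on the CPU side through buffer bindings and strides; names and bindings are illustrative). In a pass that touches only positions, the interleaved (AoS) layout wastes half of every cache line on unused colors:

```glsl
#version 430
layout(local_size_x = 64) in;

// AoS: position and color interleaved per vertex.
struct VertexAoS {
    vec4 position;   // 16 bytes
    vec4 color;      // 16 bytes -- loaded into cache even when unused
};
layout(std430, binding = 0) readonly buffer AoS { VertexAoS aos[]; };

// SoA: each attribute tightly packed in its own buffer.
layout(std430, binding = 1) readonly buffer Positions { vec4 positions[]; };

layout(std430, binding = 2) writeonly buffer Results { vec4 results[]; };
uniform bool uUseSoA;

void main()
{
    uint i = gl_GlobalInvocationID.x;
    // Position-only pass: with SoA a 64-byte cache line delivers 4
    // useful positions; with AoS only 2, the rest being dead colors.
    vec4 p = uUseSoA ? positions[i] : aos[i].position;
    results[i] = p * 2.0;
}
```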

Object optimization

Sorting based on camera distance / Z pre-pass

  • This helps on GPUs with Early-Z (IMR/TBR), but not on TBDR, where the hardware’s hidden-surface removal already eliminates overdraw before shading.

Sort based on material / RenderState

  • Reduce draw calls
    • RenderState is a collective term from state-machine-based graphics APIs like OpenGL: buffer/texture bindings, framebuffer switches, shader switches, and Depth/Stencil/Culling/Blend modes are all state switches, and each has a performance cost. The costs involved include command validation and generation on the driver side, reconfiguration of the GPU’s internal hardware state machine, video memory reads and writes, and synchronization between CPU and GPU. Reference [25] contains a figure that roughly quantifies the cost of the various state changes.

Texture optimization

Texture optimization comes down to a single core idea: being Cache Friendly.

Reduce texture size

  • Many people assume the biggest win from reducing texture size is video memory, and that is probably true on console/mobile platforms and in scenarios limited by video memory. But from a performance perspective, the biggest benefit of reducing texture size is an increased cache hit rate. Suppose you replace a 1024×1024 texture with a 1×1 texture under the same shader: you will see the shader run faster, because across all the shader invocations that sample this texture, the data is actually read from memory only once, and every access after that is served from the cache. In other words, what we care about is how many pixels’ worth of PS execution each cache line can cover (for RGBA8 texels, a 64-byte cache line holds 16 texels). The smaller the texture, the more sampled pixels each cache line covers.

Use compressed textures

  • The idea is similar to vertex compression: sacrifice some computation on on-the-fly decompression in exchange for less bandwidth consumption. DXT/PVRTC/ASTC all follow this idea. The same idea applies to compact G-buffer layouts; for example, CryEngine used Best Fit Normals and the YCbCr color space to compress its G-buffer [27][28]. A sketch of the color-space half of that trick follows.
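
For illustration, here is a BT.601-style RGB↔YCbCr conversion pair of the kind used for G-buffer color [28], as drop-in GLSL functions. This sketch omits the half-resolution/checkerboard chroma storage and reconstruction that make the scheme actually save bandwidth:

```glsl
// RGB -> YCbCr (BT.601). Store Y at full resolution and CbCr at
// reduced resolution (or checkerboarded) to cut color bandwidth.
vec3 rgbToYCbCr(vec3 rgb)
{
    float y  = dot(rgb, vec3(0.299, 0.587, 0.114));
    float cb = 0.5 + (rgb.b - y) * 0.564;
    float cr = 0.5 + (rgb.r - y) * 0.713;
    return vec3(y, cb, cr);
}

// Inverse transform, applied when the G-buffer is read back.
vec3 yCbCrToRgb(vec3 ycc)
{
    float y  = ycc.x;
    float cb = ycc.y - 0.5;
    float cr = ycc.z - 0.5;
    return vec3(y + 1.402 * cr,
                y - 0.344 * cb - 0.714 * cr,
                y + 1.772 * cb);
}
```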

Merge textures into a Texture Atlas

  • This is done to reduce texture binding overhead, which is essentially another form of reducing state switching. If you have Bindless Textures [25], this optimization won’t help much.

The use of Mipmap

  • We usually use mipmaps to prevent flicker in texture sampling where UV changes quickly between adjacent pixels (usually in the distance). The root cause of the flicker is that the texels fetched for adjacent pixels are discontinuous — which also means cache misses. A mipmapped texture stores each mip level contiguously in physical memory, so when sampling through a mipmap the fetches land close together, the cache hit rate is higher, and performance is therefore better than without mipmaps.

Store results in a Buffer or a Texture?

  • Sometimes we run general-purpose computation on the GPU and store the result in a buffer or texture. If you have the choice, prefer storing non-image data in a buffer (e.g. particle velocity/position or a skinning palette). A buffer is laid out linearly, while a texture is tiled into blocks for 2D locality, and that tiled layout tends to work against cache hits under non-texture (linear) access patterns.

Of course, being Cache Friendly also means less heat on mobile platforms.

Shader optimization

Fewer branches?

We’ve already explained how the GPU implements branching, so when it comes to reducing branches we shouldn’t simply assume that branching is always bad for performance. More precisely: if the branch condition is only known at shader run time and its outcome varies a lot within a warp, we should avoid the branch, and if it is unavoidable, try to hoist the computation common to both sides out of it. Most Deferred Shading frameworks in recent years use a Material ID to select the shading model and rely on dynamic branches to run the different shading calculations in the shading stage. Because the shading model varies little across screen space (most pixels use standard PBR shading), the warps are largely coherent and the branches cause no real performance problem.

Additionally, we often use Uber Shaders to implement many material variants that share a lot of common computation. There are two ways to achieve this: use macro definitions to generate different sub-shaders from the Uber Shader, or use uniform-based branches inside a single Uber Shader. The former may incur shader-switching overhead, while the latter may be better for performance (of course, it depends on the case). Both approaches are sketched below.
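Here is a minimal sketch of both approaches in one fragment shader, using a hypothetical optional detail-map feature (all names are illustrative):

```glsl
#version 300 es
precision mediump float;

// Variant A: compile with or without USE_DETAIL_MAP defined to get two
// sub-shaders (a program switch is then needed between draws).
// Variant B: keep one program and gate the feature on a uniform; every
// thread in a warp takes the same path, so divergence cost is nil.

in vec2 vUV;
out vec4 fragColor;

uniform sampler2D uBaseMap;
uniform sampler2D uDetailMap;
uniform bool uUseDetailMap; // Variant B switch

void main()
{
    vec3 color = texture(uBaseMap, vUV).rgb;

#ifdef USE_DETAIL_MAP            // Variant A: decided at compile time
    color *= texture(uDetailMap, vUV * 8.0).rgb;
#else
    if (uUseDetailMap)           // Variant B: uniform branch at run time
        color *= texture(uDetailMap, vUV * 8.0).rgb;
#endif

    fragColor = vec4(color, 1.0);
}
```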

Specify the exact data type

For the ALU, the latency/throughput of many mathematical operations depends on the data bit width, so calculations should use the lowest-precision data type the algorithm allows. In GLSL, for example, highp/mediump/lowp specify the computation precision in the current shader, as in the sketch below. In addition, mixed int/float calculations require extra conversion instructions, so avoid int data when you don’t need it.
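A minimal GLSL ES sketch (names illustrative); on many mobile GPUs, mediump maps to FP16 and can substantially raise ALU throughput relative to highp:

```glsl
#version 300 es
precision highp float; // default for this shader

in mediump vec2 vUV;
out mediump vec4 fragColor;

uniform sampler2D uAlbedo;
uniform mediump vec3 uTint;

void main()
{
    // Color math rarely needs FP32: keep it in mediump.
    mediump vec3 c = texture(uAlbedo, vUV).rgb * uTint;
    fragColor = vec4(c, 1.0);
}
```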

Vector algorithm or scalar algorithm?

In the past, many Shader Core designs were vector-based: the ALU could only perform vector addition, subtraction, multiplication and division, and even scalar math was issued as vector operations. For that kind of design there are tricks for vectorizing calculations as much as possible [17]. Modern Shader Cores are mostly scalar-based, so this no longer deserves much attention. However, we should still delay mixing scalars into vector math as long as possible, so that scalars combine with scalars first, as in this example:
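(A minimal sketch; the uniform names are illustrative.)

```glsl
#version 300 es
precision mediump float;

out vec4 fragColor;
uniform vec3 uDir;
uniform float uIntensity;
uniform float uAttenuation;

void main()
{
    // Bad: two vector multiplies (3 + 3 scalar ops once scalarized).
    vec3 bad  = (uDir * uIntensity) * uAttenuation;

    // Good: fold the scalars first, then one vector multiply
    // (1 scalar op + 3 scalar ops).
    vec3 good = uDir * (uIntensity * uAttenuation);

    fragColor = vec4(good - bad, 1.0); // same value either way
}
```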

Cache intermediate results in textures?

Most of the time this means caching mathematical intermediate results in a texture whose texel values carry no visual information, just pure numbers. For example, the Marschner hair model uses a LUT to store BRDF terms [29], and UE4 uses a LUT to store the ambient (environment) BRDF of its PBR model [30].

There are two performance costs associated with LUT:

(1) The texture content is numerical, so only lossless formats can be used and it cannot be block-compressed; its bytes-per-pixel is therefore relatively high, and it consumes more read bandwidth than an ordinary texture.

(2) LUT sampling is driven by UVs computed per pixel, and the UVs of adjacent pixels usually have no spatial continuity, which means nearly every LUT sample is a cache miss; sampling of this kind is therefore slower than ordinary texture mapping.

Conclusion: try to replace LUT sampling with a fitted analytic function wherever possible. On mobile GPUs, never try to optimize a segment of shader code with a LUT; on desktop GPUs, weigh the trade-off seriously before using one. A classic example of such a fit follows.
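
For instance, Karis’s mobile approximation collapses UE4’s 2D environment-BRDF LUT [30] into a few ALU instructions. The constants below follow that published approximation; treat this as a sketch and verify against the original before relying on it:

```glsl
// Analytic fit replacing UE4's 2D environment-BRDF LUT
// (after Karis, "Physically Based Shading on Mobile").
vec3 envBRDFApprox(vec3 specularColor, float roughness, float NoV)
{
    const vec4 c0 = vec4(-1.0, -0.0275, -0.572,  0.022);
    const vec4 c1 = vec4( 1.0,  0.0425,  1.04,  -0.04);
    vec4 r = roughness * c0 + c1;
    float a004 = min(r.x * r.x, exp2(-9.28 * NoV)) * r.x + r.y;
    vec2 AB = vec2(-1.04, 1.04) * a004 + r.zw;
    // The same scale/bias the LUT would have returned, but pure ALU:
    // no texture fetch, hence no LUT cache miss.
    return specularColor * AB.x + AB.y;
}
```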