1. Foreword

Since Epic released the UE5 technology demo at the beginning of this year, discussion of UE5 has never stopped. The technical discussion centers mainly on two new features: the global illumination technology Lumen and the extreme-detail geometry technology Nanite. Several articles [1][2] have already introduced Nanite in detail. This article starts from a RenderDoc frame capture and the UE5 source code, combined with existing technical material, and aims to provide an intuitive, high-level understanding of Nanite, clarifying its algorithmic principles and design ideas without going too deep into source-level implementation details.

2. What do we need for next-gen model rendering?

To analyze Nanite's technical points, we should first start from the requirements. Over the last decade, triple-A games have evolved around two things: interactive cinematic narrative and open worlds. To make cutscenes believable, character models must be rendered in great detail; for a sufficiently flexible and rich open world, map size and object counts grow exponentially. Both trends greatly increase the detail and complexity required of a scene: the number of objects in it, and the level of detail required for each model.

There are usually two bottlenecks in complex scene rendering:

  1. CPU-side validation and CPU-GPU communication overhead incurred by each Draw Call;
  2. Overdraw caused by inaccurate culling, and the resulting waste of GPU computing resources.

In recent years, rendering optimizations have often revolved around these two problems and have formed some technical consensus in the industry.

To address the overhead of CPU-side validation and state switching, we have a new generation of graphics APIs (Vulkan, DX12 and Metal). They are designed to let drivers do less validation on the CPU side; to expose separate queues (Compute/Graphics/DMA) so different workloads can be submitted independently; to require developers to handle CPU-GPU synchronization themselves; and to make full use of multi-core CPUs by submitting commands to the GPU from multiple threads. Thanks to these optimizations, the number of Draw Calls the new generation of graphics APIs can handle is an order of magnitude higher than that of the previous generation (DX11, OpenGL) [3].

Another optimization direction is to reduce data traffic between the CPU and the GPU, and to cull more accurately the triangles that do not contribute to the final image. From this idea, the GPU Driven Pipeline was born. For more information on the GPU Driven Pipeline and culling, see my earlier article [4].

Thanks to the increasing use of the GPU Driven Pipeline in games, the vertex data of models is further split into fine-grained clusters (also called meshlets), so that the granularity of each cluster better fits the cache size of the vertex processing stage. Cluster culling (frustum culling, occlusion culling and backface culling) has gradually become the best practice for optimizing complex scenes, and GPU vendors have gradually acknowledged this new vertex processing flow.

However, the traditional GPU Driven Pipeline relies on Compute Shaders for culling: the culled Vertex/Index Buffer has to be stored in a GPU buffer and fed back into the graphics pipeline through the Execute Indirect API, which invisibly adds a round of buffer reads and writes. In addition, the vertex data is read twice (once by the Compute Shader during culling, and again by Vertex Attribute Fetch in the graphics pipeline during drawing).

For these reasons, and to further improve the flexibility of vertex processing, NVIDIA first introduced the concept of the Mesh Shader [5], hoping to gradually remove the fixed-function units of the traditional vertex processing stage (hardware units such as VAF and PD) and hand their work to developers through a programmable pipeline (Task Shader/Mesh Shader).





Cluster diagram



The traditional GPU Driven Pipeline relies on CS culling; the culled data is passed back to the vertex processing pipeline through VRAM



With a Mesh-Shader-based pipeline, cluster culling becomes part of the vertex processing stage, avoiding unnecessary Vertex Buffer loads/stores

3. Is that enough?

So far, the problems around object counts, vertex counts and triangle counts have been greatly alleviated. But high-poly models and pixel-sized small triangles put new pressure on two other parts of the rendering pipeline: rasterization and overdraw.

Does soft rasterization stand a chance against hard rasterization?

To understand this, you first need to know what hardware rasterization does and what use cases it was designed for; interested readers are recommended to read this article [6]. In short: traditional rasterization hardware assumes that input triangles are considerably larger than a pixel, and based on this assumption hardware rasterization is usually hierarchical.

Taking NVIDIA's rasterizer as an example, a triangle usually goes through two stages: Coarse Raster and Fine Raster. The Coarse Raster takes a triangle as input and rasterizes it in units of 8×8 pixel blocks, producing the set of blocks the triangle covers (you can also think of it as a coarse raster pass over a framebuffer 1/8 × 1/8 the original size).

At this stage, a low-resolution Z-buffer is used to completely cull occluded blocks, which NVIDIA calls Z-Cull. Blocks that survive the Coarse Raster are sent to the Fine Raster stage, which finally generates the pixels for shading; this is where the familiar Early-Z happens. Because mipmap sampling requires information about each pixel's neighbors (the difference of the sampled UVs is the basis for computing the mipmap level), the Fine Raster does not output individual pixels, but small 2×2 pixel quads.
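To make this concrete, here is a minimal C++ sketch of how a rasterizer can derive the mipmap level from the UV differences inside a 2×2 quad. The function and its names are illustrative, not an actual driver interface:

```cpp
#include <algorithm>
#include <cmath>

struct Vec2 { float x, y; };

// Derive a mip level from the normalized UVs of three pixels of a 2x2 quad:
// uv00 (top-left), uv10 (right neighbor), uv01 (bottom neighbor).
float MipLevelFromQuad(Vec2 uv00, Vec2 uv10, Vec2 uv01,
                       float texWidth, float texHeight) {
    // Finite differences between neighboring pixels of the quad; this is
    // exactly what ddx/ddy expose in HLSL.
    float dudx = (uv10.x - uv00.x) * texWidth;
    float dvdx = (uv10.y - uv00.y) * texHeight;
    float dudy = (uv01.x - uv00.x) * texWidth;
    float dvdy = (uv01.y - uv00.y) * texHeight;
    // Standard isotropic LOD: log2 of the longest footprint axis in texels.
    float lenX = std::sqrt(dudx * dudx + dvdx * dvdx);
    float lenY = std::sqrt(dudy * dudy + dvdy * dvdy);
    return std::max(0.0f, std::log2(std::max(lenX, lenY)));
}
```

This dependence on neighbors is the reason the hardware must shade whole quads even when a triangle covers only one of the four pixels.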

For triangles close to pixel size, the waste in hardware rasterization is obvious. First, the Coarse Raster stage is almost useless, since these triangles are usually smaller than 8×8. The situation is even worse for long, thin triangles, which often span multiple blocks: the Coarse Raster not only fails to cull anything, it adds extra computation. Moreover, for large triangles the quad-based Fine Raster only generates a few useless pixels along the triangle edges, a small fraction of the triangle's area; but for small triangles, the pixel quads can in the worst case generate four times as many pixels as the triangle actually covers. These extra pixels all enter the Pixel Shader stage, significantly reducing the proportion of useful lanes in each warp.



Rasterization waste caused by pixel quads on small triangles

For these reasons, software rasterization (implemented in Compute Shaders) does have a chance to beat hardware rasterization in the specific case of pixel-sized small triangles. This is one of Nanite's core optimizations, and it improves UE5's small-triangle rasterization performance by a factor of three [7].

Deferred Material

Overdraw has long been a performance bottleneck in graphics rendering, and many optimizations have grown up around it. On mobile there is the familiar Tile Based Rendering architecture [8]. Over the evolution of the rendering pipeline, Z-Prepass, Deferred Rendering, Tile Based Rendering and other staged pipeline designs have been proposed one after another. These different pipeline frameworks all try to solve the same problem: when the number of light sources grows and material complexity increases, how do we avoid excessive branching in shader logic and reduce unnecessary overdraw? On this topic you can also read my article [9].

Typically, a deferred rendering pipeline needs a set of render targets called the G-Buffer, which stores all the material information needed for lighting. In today's AAA games material types are often complex, and the amount of information the G-Buffer has to store grows year by year. Take Killzone 2 (2009) as an example; its G-Buffer layout is as follows:

Excluding the Lighting Buffer, it needs 4 G-Buffer textures, 16 bytes per pixel in total. By 2016, Uncharted 4's G-Buffer layout looked like this:



The number of G-Buffer textures is 8 (32 bytes per pixel). In other words, at the same resolution, the bandwidth required by the G-Buffer doubled due to the increase in material complexity and fidelity, and that is before taking into account the steadily growing screen resolutions over the years.

For scenes with heavy overdraw, the read/write bandwidth of G-Buffer drawing can easily become a performance bottleneck. The academic community therefore proposed a new rendering pipeline: the Visibility Buffer [10][11]. Instead of producing a bloated G-Buffer, Visibility-Buffer-based algorithms replace it with a much leaner buffer. A Visibility Buffer usually needs the following information:

  1. Instance ID: which instance the current pixel belongs to (16-24 bits);
  2. Primitive ID: which triangle of that instance the current pixel belongs to (8-16 bits);
  3. Barycentric Coord: where the pixel lies inside the triangle, expressed in barycentric coordinates (16 bits);
  4. Depth Buffer: the depth of the current pixel (16-24 bits);
  5. Material ID: the material of the current pixel (8-16 bits).

With the above, we only need to store roughly 8-12 bytes per pixel to represent the material information of all geometry in the scene. At the same time, we need to maintain a global vertex buffer and a global material table, holding the vertex data of all geometry in the current frame together with material parameters and textures.

In the lighting (shading) phase, the Instance ID and Primitive ID are used to index the relevant triangle in the global vertex buffer. The vertex attributes (UV, tangent space, etc.) are then interpolated with the pixel's barycentric coordinates to obtain per-pixel values. The next step is to look up the material via the Material ID, sample its textures, and feed everything into the lighting calculation to complete the shading. This approach is sometimes referred to as Deferred Texturing.
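The shading-time lookup can be pictured with the following C++ sketch. The buffer layouts and helper names are hypothetical, invented for illustration; a real engine packs these tables far more tightly:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical global tables for a Visibility-Buffer renderer.
struct Vertex   { float px, py, pz; float u, v; };
struct Instance { uint32_t firstIndex; uint32_t materialId; };

struct Scene {
    std::vector<Vertex>   vertices;   // global vertex buffer
    std::vector<uint32_t> indices;    // global index buffer
    std::vector<Instance> instances;  // per-instance table
};

// Resolve one pixel: fetch the triangle via its IDs, then interpolate
// attributes with the stored barycentric coordinates (b0 + b1 + b2 == 1).
void ShadePixel(const Scene& s, uint32_t instanceId, uint32_t primitiveId,
                float b0, float b1, float b2) {
    const Instance& inst = s.instances[instanceId];
    uint32_t base = inst.firstIndex + primitiveId * 3;
    const Vertex& v0 = s.vertices[s.indices[base + 0]];
    const Vertex& v1 = s.vertices[s.indices[base + 1]];
    const Vertex& v2 = s.vertices[s.indices[base + 2]];

    // Barycentric interpolation of UV; other attributes follow the same pattern.
    float u = b0 * v0.u + b1 * v1.u + b2 * v2.u;
    float v = b0 * v0.v + b1 * v1.v + b2 * v2.v;
    // ...look up the material by inst.materialId, sample at (u, v), shade...
    (void)u; (void)v;
}
```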

Here is the rendering pipeline flow based on G-Buffer:

Here is the flow of the render pipeline based on Visibility-Buffer:

Intuitively, the Visibility Buffer reduces the storage bandwidth of the information needed for shading (G-Buffer -> Visibility Buffer), and it defers the reads of geometry and texture data used in lighting to the shading phase, so pixels not visible on screen no longer need to read that data at all, only vertex positions. For these two reasons, a Visibility Buffer can significantly reduce bandwidth overhead compared with a traditional G-Buffer in complex, high-resolution scenes. On the other hand, maintaining global geometry and material data adds complexity to the engine design and reduces the flexibility of the material system, and in some cases it requires graphics API features such as Bindless Textures [12] that are not yet fully supported across all hardware.

4. Implementation in Nanite

Rome was not built in a day. Any technological breakthrough in a mature academic or engineering field builds on the thinking and practice of its predecessors, which is why we spent so much space on the technical background. Nanite is an excellent engineering practice built on these earlier solutions, combined with the compute power of current hardware and driven by the requirements of next-generation games.

Its core idea can be divided into two parts: optimization of vertex processing and optimization of pixel processing. Vertex processing is optimized mainly along the lines of the GPU Driven Pipeline; pixel processing is optimized by combining software rasterization with the Visibility Buffer idea. With the help of a RenderDoc frame capture of the UE5 "Valley of the Ancient" demo and the related source code, we can get a glimpse of the Nanite technology. The overall algorithm flow is as follows:

Instance Cull && Persistent Cull

Having walked through the development of the GPU Driven Pipeline in detail, Nanite's implementation is not hard to understand. In a preprocessing step, each Nanite mesh is split into clusters of 128 triangles, and the whole mesh is organized into a BVH (Bounding Volume Hierarchy) tree whose leaf nodes each represent a cluster. Culling is divided into two steps: frustum culling and occlusion culling based on an HZB (hierarchical Z-buffer). Instance Cull works per mesh, and each mesh that passes it sends the root node of its BVH to the Persistent Cull stage for hierarchical culling (if a BVH node is culled, its children are not processed any further).
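A schematic C++ view of the data involved; the field names are invented for illustration and are not UE5's actual structures:

```cpp
#include <cstdint>

// Schematic layout for Nanite-style cluster culling.
struct Cluster {                      // one BVH leaf payload
    float    boundsMin[3], boundsMax[3];
    uint32_t firstTriangle;           // offset into the cluster's index data
    uint32_t triangleCount;           // 128 triangles per cluster
};

struct BVHNode {
    float    boundsMin[3], boundsMax[3];
    uint32_t firstChild;              // child node index, or cluster index if leaf
    uint16_t childCount;
    uint16_t isLeaf;                  // leaves reference clusters
};
// Culling tests a node's bounds against the view frustum and the HZB; if the
// node fails, its entire subtree (and all clusters below it) is skipped.
```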

This raises a question: how do we map the number of culling tasks in the Persistent Cull phase to the number of Compute Shader threads? The simplest way is to give each BVH tree its own thread, i.e. one thread per Nanite mesh. But because meshes vary in complexity, their BVH trees differ greatly in node count and depth; such an arrangement gives each thread a very different amount of work, threads end up waiting for each other, and parallelism suffers. Could we instead assign a separate thread to every BVH node that needs processing? That would be ideal, but in reality we cannot know in advance how many BVH nodes will be processed, because the whole culling is hierarchical and dynamic: whether a child is processed depends on whether its parent was culled.

Nanite's solution is to launch a fixed number of threads, with each thread fetching BVH nodes from a global FIFO task queue. If a node passes the culling test, the thread appends all of that node's children to the tail of the queue, then loops back to fetch a new node, until the queue is empty and no new nodes are being produced. This is the classic multi-threaded producer-consumer model, except that every thread acts as both producer and consumer. In this way Nanite keeps the processing time of each thread roughly equal.
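A minimal CPU stand-in for this pattern, using C++ atomics in place of the GPU's globally coherent buffers. Names and the queue layout are illustrative; the real version is a persistent-thread Compute Shader, and this sketch omits overflow handling:

```cpp
#include <atomic>
#include <cstdint>

constexpr uint32_t kCapacity = 1u << 20;
uint32_t              g_queue[kCapacity];
std::atomic<uint32_t> g_head{0};     // next slot to consume
std::atomic<uint32_t> g_tail{0};     // next slot to produce
std::atomic<int32_t>  g_pending{0};  // nodes queued or still being processed

// Placeholder culling test and child iteration (frustum + HZB would go here).
bool PassesCulling(uint32_t /*node*/) { return true; }
void ForEachChild(uint32_t /*node*/, void (*visit)(uint32_t)) { /* enqueue children */ }

void Push(uint32_t node) {
    g_pending.fetch_add(1);
    g_queue[g_tail.fetch_add(1) % kCapacity] = node;
}

// Every thread runs this loop, acting as both producer and consumer.
void PersistentCullThread() {
    for (;;) {
        uint32_t head = g_head.load();
        if (head == g_tail.load()) {
            if (g_pending.load() == 0) return;  // drained, nothing in flight
            continue;                           // another thread may still push
        }
        if (!g_head.compare_exchange_weak(head, head + 1)) continue;
        uint32_t node = g_queue[head % kCapacity];
        if (PassesCulling(node))
            ForEachChild(node, Push);  // surviving node enqueues its children
        g_pending.fetch_sub(1);        // this node is fully processed
    }
}
```

The `g_pending` counter is what lets threads distinguish "queue momentarily empty" from "all work done", which is the subtle part of any self-feeding work queue.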

The whole culling phase is divided into two passes: the Main Pass and the Post Pass (a console variable can restrict it to the Main Pass only). The logic of the two passes is essentially the same; the only difference is that the Main Pass performs occlusion culling against an HZB built from the previous frame's data, while the Post Pass uses the current frame's HZB, built after the Main Pass finishes. This prevents the previous frame's HZB from wrongly rejecting meshes that are actually visible this frame.

It is worth noting that Nanite does not use Mesh Shaders, partly because their hardware support is not yet widespread. On the other hand, since Nanite uses software rasterization, the Mesh Shader output would still have to be written back to a GPU buffer as input to the software rasterizer, so it would save little bandwidth compared with the Compute Shader approach.

Rasterization

After culling, each cluster is sent to a different rasterizer depending on its screen-space size. Large triangles and non-Nanite meshes still go through hardware rasterization, while small triangles are rasterized in software by a Compute Shader. The Nanite Visibility Buffer is an R32G32_UINT texture (8 bytes per pixel): bits 0-6 of channel R store the Triangle ID, bits 7-31 store the Cluster ID, and the 32 bits of channel G store depth:



Cluster ID



Triangle ID



Depth
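Given this layout, packing and unpacking a Visibility Buffer pixel could look like the following sketch (illustrative code that instantiates the layout described above, not UE5's own):

```cpp
#include <cstdint>

// R = [ClusterID : 25 bits | TriangleID : 7 bits], G = 32-bit depth.
struct VisPixel { uint32_t r, g; };

VisPixel PackVisPixel(uint32_t clusterId, uint32_t triangleId, uint32_t depthBits) {
    VisPixel p;
    p.r = (clusterId << 7) | (triangleId & 0x7Fu);  // 7 bits: 0..127 triangles
    p.g = depthBits;
    return p;
}

void UnpackVisPixel(VisPixel p, uint32_t& clusterId,
                    uint32_t& triangleId, uint32_t& depthBits) {
    triangleId = p.r & 0x7Fu;
    clusterId  = p.r >> 7;
    depthBits  = p.g;
}
```

Note how the 7-bit Triangle ID matches the 128 triangles per cluster fixed in the preprocessing step.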

The logic of the software rasterizer is relatively simple: it is based on a scanline algorithm, with one Compute Shader dispatch launched per cluster. At the start of the Compute Shader, all clip-space vertex positions of the cluster are computed and cached in shared memory. Each CS thread then reads the index buffer of its triangle and the transformed vertex positions, computes the triangle's edges, performs backface culling and small-triangle culling (anything smaller than a pixel), and finally completes the Z-test with an atomic operation, writing the data into the Visibility Buffer. It is worth mentioning that, to keep the software rasterization logic simple and efficient, Nanite meshes do not support skeletal animation, vertex deformation or masked (alpha-tested) materials.
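The atomic Z-test can be pictured as follows: treat each pixel as one 64-bit word with depth in the high bits, so a single atomic max both performs the depth comparison and writes the IDs. This is a minimal C++ stand-in for what the shader does with a 64-bit InterlockedMax; the exact UE5 packing may differ:

```cpp
#include <atomic>
#include <cstdint>

// One 64-bit word per pixel: depth in the high 32 bits, payload (cluster +
// triangle IDs) in the low 32 bits. Cleared to 0 each frame.
std::atomic<uint64_t>* g_visBuffer;  // width * height entries
int g_width;

void WritePixel(int x, int y, uint32_t depthBits, uint32_t payload) {
    uint64_t value = (uint64_t(depthBits) << 32) | payload;
    std::atomic<uint64_t>& dst = g_visBuffer[y * g_width + x];
    // Larger depth wins (e.g. reversed-Z, where greater == closer); the CAS
    // loop emulates an atomic max, so depth test and ID write are one step.
    uint64_t prev = dst.load(std::memory_order_relaxed);
    while (value > prev && !dst.compare_exchange_weak(prev, value)) {}
}
```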

Emit Targets

To keep the data structure compact and reduce read/write bandwidth, all the data produced by software rasterization lives in the Visibility Buffer. But to composite these pixels with those produced by hardware rasterization, the extra information in the Visibility Buffer eventually has to be written out to the Depth/Stencil Buffer and the Motion Vector Buffer. This phase consists of several full-screen passes:

(1) Emit Scene Depth/Stencil/Nanite Mask/Velocity Buffer. This step outputs up to four buffers, depending on which render targets the final scene needs. The Nanite Mask uses 0/1 to mark whether the current pixel is an ordinary mesh or a Nanite mesh (determined from the Cluster ID at the corresponding position in the Visibility Buffer). For Nanite pixels, the depth in the Visibility Buffer is converted from UINT to float and written to the Scene Depth Buffer; depending on whether the Nanite mesh receives decals, the corresponding stencil value is written to the Scene Stencil Buffer; and the motion vector of the current pixel is computed from its position in the previous frame and written to the Velocity Buffer. Non-Nanite pixels are simply discarded.



Nanite Mask



Velocity Buffer

Scene Depth/Stencil Buffer

(2) Emit Material Depth. This step generates a Material ID Buffer, except that instead of living in a UINT texture, the UINT material ID is stored as float in a Depth/Stencil target of format D32S8 (the reason will become clear later). In theory up to 2^32 materials are supported (in practice only 14 bits are used for the Material ID), and the Nanite Mask is written into the Stencil Buffer.



Material Depth Buffer
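Storing the ID as depth is what later allows per-pixel culling via hardware depth-equal testing. Here is a minimal sketch of one plausible UINT-to-float mapping, assuming the 14-bit ID is simply quantized into the depth range; the actual UE5 conversion may differ:

```cpp
#include <cstdint>

// Illustrative only: map a 14-bit material ID into the [0,1) float depth
// range so it can live in a D32 target and be compared with Depth Test Equal.
constexpr uint32_t kMaxMaterials = 1u << 14;

float MaterialIdToDepth(uint32_t materialId) {
    return float(materialId) / float(kMaxMaterials);
}

uint32_t DepthToMaterialId(float depth) {
    return uint32_t(depth * float(kMaxMaterials) + 0.5f);
}
```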

Classify Materials && Emit G-Buffer

We have already introduced the principle of the Visibility Buffer in detail. One possible implementation of the shading stage is to maintain a global material table storing material parameters and texture indices, find the material via each pixel's Material ID, parse its parameters, and fetch the texture data through techniques such as Virtual Texturing or Bindless Textures/Texture Arrays. This works for a simple material system, but UE has an extremely complex one: there are many Shading Models, and under the same Shading Model each material's parameters can be computed by an arbitrarily complex node graph in the material editor. This node-graph-driven dynamic shader code generation clearly cannot be implemented with the scheme above.

To keep each material's shader code dynamically generated by the material editor, each material's Pixel Shader must still be executed at least once. But all we have is the material ID in screen space, so instead of drawing each object and running its shader in object space, Nanite executes each material's shader as a screen-space pass, decoupling the visibility calculation from the material evaluation; hence the name Deferred Material. This, however, raises a new performance problem: a scene often contains thousands of materials, and drawing a full-screen pass per material would create enormous overdraw and bandwidth pressure. Reducing this meaningless overdraw becomes the new challenge.

For this reason, rather than issuing a true full-screen pass per material in the Base Pass, Nanite divides the screen into 8×8 blocks; for example, with a screen size of 800×600, each material generates 100×75 blocks, each mapped to its screen position. To be able to cull whole blocks, after Emit Targets Nanite runs a CS that classifies the Material IDs present in each block. Since the depth values encoding the Material IDs are pre-sorted, this CS only needs the minimum and maximum Material Depth of each 8×8 block, stored as a Material ID Range in an R32G32_UINT texture:



Material ID Range
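The classification pass can be pictured as follows: a C++ sketch of the per-block min/max reduction the CS performs (names and layouts illustrative, with a serial loop standing in for the thread groups):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// For every 8x8 tile, record the min/max material depth seen in that tile.
struct Range { float minDepth, maxDepth; };

std::vector<Range> BuildMaterialIdRange(const std::vector<float>& materialDepth,
                                        int width, int height) {
    int tilesX = (width + 7) / 8, tilesY = (height + 7) / 8;
    std::vector<Range> ranges(tilesX * tilesY, {1.0f, 0.0f});  // inverted init
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            Range& r = ranges[(y / 8) * tilesX + (x / 8)];
            float d = materialDepth[y * width + x];
            r.minDepth = std::min(r.minDepth, d);
            r.maxDepth = std::max(r.maxDepth, d);
        }
    return ranges;
}
```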

With this texture, the VS of each material samples the Material ID Range at its own block's position. If the current material's ID lies inside the range, its PS runs as usual; otherwise no pixel in this block uses the material, and the whole block can be culled simply by setting the vertex position to NaN in the VS, which causes the GPU to discard the corresponding triangles. Since a block usually contains only a few material types, this method effectively reduces unnecessary overdraw.
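The per-block test in the VS can be sketched like this, assuming a hypothetical SampleMaterialIdRange helper; a NaN output position causes the hardware to discard the primitive (the real shader is HLSL generated by UE's material system):

```cpp
#include <cmath>

struct Range { float minDepth, maxDepth; };

// Stub standing in for a read of the R32G32 Material ID Range texture.
Range SampleMaterialIdRange(int /*tileX*/, int /*tileY*/) { return {0.0f, 1.0f}; }

// Per-vertex logic of a material's tile quad: if this material's depth-encoded
// ID falls outside the tile's [min, max] range, output NaN so the GPU culls it.
void TileVertex(int tileX, int tileY, float thisMaterialDepth,
                float quadCornerX, float quadCornerY, float outPos[4]) {
    Range r = SampleMaterialIdRange(tileX, tileY);
    bool present = thisMaterialDepth >= r.minDepth &&
                   thisMaterialDepth <= r.maxDepth;
    float w = present ? 1.0f : std::nanf("");
    outPos[0] = quadCornerX; outPos[1] = quadCornerY;
    outPos[2] = 0.0f;        outPos[3] = w;   // NaN w kills the primitive
}
```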

In fact, this is not the first time tile-based classification has been used to reduce material branching and simplify shading logic. Uncharted 4 used a similar idea in its deferred lighting [13]: because its materials span multiple Shading Models, and to avoid launching a separate full-screen CS per Shading Model, they divided the screen into 16×16 blocks, counted the Shading Models present in each block, and launched per-block CS dispatches covering only the Shading Models within that block's range. This avoids multiple full-screen passes, or an uber-shader full of branching logic, and greatly improves the performance of deferred lighting.



Shading Model ranges counted per block in Uncharted 4

The Material Depth Buffer comes into play after the per-block culling. In the Base Pass PS phase it is bound as the Depth/Stencil target, depth/stencil testing is enabled, and the compare function is set to Equal. The PS executes only for pixels whose Material ID equals that of the material being drawn (depth test passes) and which belong to a Nanite mesh (stencil test passes). With the hardware's Early Z/Stencil, this completes the material ID culling on a per-pixel basis. The principle of the whole draw-and-cull scheme is shown below:



Red indicates the culled regions

The whole Base Pass is divided into two parts. First, the G-Buffer of non-Nanite meshes is drawn; this part still runs in object space, consistent with UE4's logic. Then the Nanite mesh G-Buffer is drawn as described above: the vertex positions of the pixel's triangle are found via its Cluster ID and Triangle ID and transformed to clip space; from the clip-space vertex positions and the pixel's depth, the barycentric coordinates of the current pixel and the clip-space position gradients (DDX/DDY) are computed; and by interpolating with these barycentric coordinates and gradients, all vertex attributes (UV, normal, vertex color, etc.) and their gradients are obtained (the gradients are used to select the mipmap level when sampling).

So far, we have analyzed Nanite’s technical background and complete implementation logic.

References
[1] A Macro View of Nanite
[3] Vulkan API Overhead Test Added to 3DMark
[4] From Software to Hardware
[5] Mesh Shading: Towards Greater Efficiency of Geometry Processing
[6] A Trip Through the Graphics Pipeline
[7] Nanite | Inside Unreal
[8] Tile-Based Rendering
[9] Rendering Pipelines in Game Engines
[10] The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading
[11] Triangle Visibility Buffer
[12] Bindless Texture
[13] Lighting in Uncharted 4

This is the 983rd article from UWA (Yuhu Technology). Thanks to the author, Luo Cheng. You are welcome to share it, but please do not reprint it without the author's authorization. If you have any unique insights or discoveries, you are also welcome to contact us and discuss. (QQ group: 793972859)

Author's homepage: https://www.zhihu.com/people/… . The author currently works on the engine middle-platform team of Tencent's Game R&D Efficiency Department. Thanks again to Luo Cheng for sharing.