As terminal playback devices upgrade, audience expectations for video quality keep rising: demand has shifted from HD to 4K, and recently 8K has also become popular. Beyond the need for higher resolution, the video acquisition process inevitably introduces defects such as blur from fast-moving objects, quality loss from compression, missing detail from poor shooting or lighting settings, and added noise. Classical approaches generally rely on interpolation. They are very fast, but detail-rich images become blurred after upscaling, and denoising is difficult. Deep learning methods, thanks to their huge parameter space, can fit the quality-restoration process well, providing more detail while increasing resolution and achieving a double improvement in both picture quality and resolution.

However, compared with traditional methods, the running time of deep learning models increases dramatically: processing a single video can take hours or days, which is hard to reconcile with mass video production. Against this background, this article introduces iQiyi's optimization, acceleration and production practice for a 4K video super-resolution model, which improved the model's performance on GPU by a factor of 10.

1. Deployment challenges of complex models

1.1 Methods provided by Nvidia for model acceleration

Starting with the Volta architecture, Nvidia introduced Tensor Core, a domain-specific accelerator (DSA) for the matrix operations that dominate deep learning workloads. The accelerator has been very successful, significantly reducing the inference time of large deep learning models. However, Tensor Core inherits the common weakness of DSAs: its programming model is complex enough that an average programmer would struggle to reach its theoretical throughput through the underlying APIs alone. To lower this barrier, Nvidia built the TensorRT framework [1], which takes the input/output tensor shapes of each layer and, through hand-tuned and assembled kernels, finds the implementation with the best Tensor Core performance, exposing it to the TensorRT upper layers through an abstract interface.

During model compilation, TensorRT runs the candidate kernels suited to the current layer configuration in turn and keeps the fastest one, which is recorded in the compiled TensorRT engine. The engine is the final deployment binary. At inference time, TensorRT loads the engine, extracts the kernel names and their launch parameters, and launches the kernels one by one in the model's execution order to complete inference.
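As a point of reference, building an engine from an ONNX file with the TensorRT Python API looks roughly like this (a minimal sketch against the TensorRT 8.x API; file names are placeholders):

```python
import tensorrt as trt

# Build a TensorRT engine from an ONNX file (sketch).
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("subgraph.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)   # allow FP16 Tensor Core kernels

# Kernel timing/selection happens here; the winners are frozen into the engine.
engine_bytes = builder.build_serialized_network(network, config)
with open("subgraph.engine", "wb") as f:
    f.write(engine_bytes)
```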

Although TensorRT makes it convenient to get large speedups from Tensor Core, the complexity of Tensor Core means that TensorRT exposes few interfaces and its core operator implementations remain closed source, so there are still many restrictions on deep optimization of a model. For example, TensorRT's built-in operators are developed around NCHW and only accept NCHW tensor input, while the underlying Tensor Core kernels require NHWC input, so layout-conversion kernels end up inserted around the tensor inputs and outputs of each engine.

1.2 iQiyi's practice in optimizing complex video inference models

To further improve the inference performance of the model, iQiyi analyzed the underlying mechanisms of TensorRT in detail. This article covers the following points:

A. How to convert complex models to TensorRT for inference.

B. The internal mechanism of TensorRT's INT8 quantized inference, and how to further improve the inference accuracy of INT8 quantized models for video.

2. Converting complex models to TensorRT

Anyone who has deployed a model with TensorRT has probably run into the same headache: some operators are not supported, or, in the case of a Torch model, many operators require the developer to write custom CUDA implementations. This is not a problem unique to TensorRT. For the general deep learning compiler that TensorRT aims to be, models and frameworks iterate very quickly and their computational requirements are endless; normalizing all of them into a general IR representation is already very difficult, let alone pushing every model to peak performance.

For unsupported ops and custom CUDA kernels, our practice comes down to two words: "split" and "fuse".

2.1 Model splitting ("split")

In general, model inference is simply the replay of a complete computational graph. Since it is a graph, splitting it into subgraphs and bridging the corresponding inputs and outputs should have no effect on the computed results.

So we use a trick: isolate the subgraphs that can be exported to ONNX and converted to TensorRT normally, and compile them into TensorRT engines. Then, in a standalone executable, the subgraph engines are replayed in sequence, and the ops of the original model that could not be converted are bridged in between with custom CUDA kernels.

Figure 1 shows the specific split applied to EDVR [2]. For EDVR, the DCN custom op and PixelShuffle cannot be exported to ONNX and converted to TensorRT normally, so the corresponding modules are excluded. We define a new class with Torch's nn.Module for each remaining block and export those blocks to ONNX.

Figure 1 Splitting of the original computational graph in EDVR
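As a rough illustration of this export step, the sketch below wraps an ONNX-exportable sub-block of EDVR in a fresh nn.Module and exports it. The attribute names `conv_first` / `feature_extraction` and the input shape are illustrative, not the exact EDVR field names:

```python
import torch
import torch.nn as nn

class FeatureExtractorBlock(nn.Module):
    """Hypothetical wrapper around an exportable EDVR sub-block
    (the convolutions before the DCN alignment module)."""
    def __init__(self, edvr_model):
        super().__init__()
        # Reuse the already-trained sub-modules; names are illustrative.
        self.conv_first = edvr_model.conv_first
        self.feature_extraction = edvr_model.feature_extraction

    def forward(self, x):
        feat = torch.relu(self.conv_first(x))
        return self.feature_extraction(feat)

# Export just this subgraph; DCN / PixelShuffle stay outside and are
# bridged later with custom CUDA kernels.
block = FeatureExtractorBlock(edvr_model).eval()
dummy = torch.randn(1, 3, 540, 960)          # one low-res frame (N, C, H, W)
torch.onnx.export(block, dummy, "edvr_feat.onnx",
                  input_names=["frame"], output_names=["feat"],
                  opset_version=11)
```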

2.2 Operator fusion ("fuse")

Take the EDVR model as an example: one of the ops that cannot be converted is PixelShuffle. This op is a common upsampling operation in super-resolution networks, used at the end of the network to expand the features to the target size. Since it cannot be converted directly into a TensorRT engine, it has to be implemented independently as a custom CUDA kernel that bridges the convolution parts before and after it.

But doing this directly is not friendly to acceleration. As mentioned earlier, the input to the Tensor Core kernels inside TensorRT is strictly NHWC, while TensorRT's own interfaces are NCHW. So for a single engine, TensorRT inserts an NCHW-to-NHWC conversion at the input and, after the sequence of operations, an NHWC-to-NCHW conversion at the output. This means our bridging scheme triggers three kernels (output conversion, PixelShuffle, input conversion), and each of them runs for a long time because of the very large pixel count.

Figure 2 PixelShuffle fusion in TensorRT

The blue dashed box in Figure 2 marks the kernels introduced by TensorRT's own layout conversions together with the PixelShuffle bridge. It is clear that the three kernels can be merged: from the end of convolution A to the start of convolution B, the pixels are merely rearranged three times according to fixed rules. There is no arithmetic in between; the whole chain is a pure permutation.
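As a sanity check on the "pure permutation" claim, the following PyTorch sketch composes the three rearrangements (NHWC→NCHW, PixelShuffle, NCHW→NHWC) into a single index remapping. The production replacement is a hand-written CUDA kernel; this is only the reference math:

```python
import torch
import torch.nn.functional as F

def fused_nhwc_pixel_shuffle(x_nhwc: torch.Tensor, r: int) -> torch.Tensor:
    """One rearrangement equivalent to NHWC->NCHW, pixel_shuffle(r), NCHW->NHWC."""
    n, h, w, c = x_nhwc.shape
    c_out = c // (r * r)
    x = x_nhwc.view(n, h, w, c_out, r, r)   # split channels into (C_out, r, r)
    x = x.permute(0, 1, 4, 2, 5, 3)         # interleave the r factors with H and W
    return x.reshape(n, h * r, w * r, c_out)

# Check against the three-kernel reference path.
x = torch.randn(1, 64, 64, 64 * 4)          # NHWC, upscale factor r = 2
ref = F.pixel_shuffle(x.permute(0, 3, 1, 2), 2).permute(0, 2, 3, 1)
assert torch.equal(fused_nhwc_pixel_shuffle(x, 2), ref)
```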

The principle is simple, but the implementation is tricky. Since TensorRT is closed source, the rearrangements cannot be eliminated without working around TensorRT itself. From the inference code's point of view, the bridge between the two convolutions is only PixelShuffle; the code does not even know that the two "redundant" rearrangement kernels inside TensorRT exist.

So here we have to disassemble at a finer granularity: break the TensorRT engine down into individual kernels visible to the executable, rather than treating it as a black-box engine binary. We will not elaborate on the dismantling process here; interested readers can look up CUDA hooking techniques to build a similar CUDA kernel replay mechanism.

To put it simply, we recorded TensorRT's execution trace like a tape recorder: the tape is cut into strips according to the playlist, and the "music" can then be replayed in a new order. This exposes the previously hidden rearrangement kernels to the executable, and following their logic we hand-optimized a single kernel that runs in place of the three.

3. TensorRT INT8 inference

When the Volta architecture first shipped the Tensor Core accelerator, Nvidia only supported FP16. Full INT8 support was added with the Turing architecture, with even lower precisions (INT4/INT1) following in later architectures. Tensor Core is Nvidia's successful answer to the wave of deep learning hardware accelerators, but its software support, as mentioned earlier, has many limitations. The same is true for quantization.

TensorRT's support for quantization predates its support for Tensor Core: INT8 quantization has been available since TensorRT 4 [3]. Early quantization used the dp4a instruction, supported from the Pascal architecture onward, which packs four 8-bit operations into a single 32-bit instruction, giving INT8 inference a qualitative speedup on the architectures of that time.

After TensorRT added Tensor Core support, it kept the existing quantization framework, which only supports post-training quantization: a KL-divergence approximation derives the INT8 quantized model from the full-precision model. This is not to say that TensorRT internally supports nothing besides post-training quantization, but ordinary users can only reach what the TensorRT API exposes, and the API supports nothing beyond post-training quantization, so other methods are effectively out of reach.
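For reference, the post-training path exposed by the TensorRT Python API looks roughly like this (a minimal sketch: preprocessing, multiple input tensors and calibration-cache handling are omitted, and the class name is a placeholder):

```python
import numpy as np
import pycuda.autoinit          # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class FrameCalibrator(trt.IInt8EntropyCalibrator2):
    """Minimal PTQ calibrator: feeds a few video frames so TensorRT can
    collect activation histograms and pick INT8 scales via KL divergence."""
    def __init__(self, frames, batch_size=1):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.frames = frames                        # list of NCHW float32 arrays
        self.batch_size = batch_size
        self.index = 0
        self.d_input = cuda.mem_alloc(frames[0].nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.index >= len(self.frames):
            return None                             # calibration finished
        cuda.memcpy_htod(self.d_input,
                         np.ascontiguousarray(self.frames[self.index]))
        self.index += 1
        return [int(self.d_input)]

    def read_calibration_cache(self):
        return None

    def write_calibration_cache(self, cache):
        pass

# Attach to the builder config from the engine-building sketch above:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = FrameCalibrator(calib_frames)
```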

3.1 The internal mechanism of TensorRT INT8 inference

Starting with TensorRT 7, Nvidia began to expose the INT8 quantization process at a finer granularity. This is partly thanks to Torch/TensorFlow's support for fake (pseudo) quantization and ONNX's op representation of it. Unfortunately, TensorRT 7 still had problems supporting this new INT8 conversion mode. The biggest one is that the bias coefficient in convolutions is converted incorrectly: a coefficient that should be multiplied is divided instead. As a result, if a convolution in the CNN has a bias, its accuracy drops sharply. [Note: this issue has been fixed in the latest TensorRT 8.]

Under pressure to get the business model online, we could not wait for Nvidia's next release to fix the problem, so we turned to dismantling again.

To understand the execution of TensorRT INT8 in detail, we disassembled the INT8 convolution kernel and read the assembly code carefully. The assembly shows that the INT8 path is not purely integer arithmetic: floating-point operations appear in the middle.

Figure 3 Quantization scaling process

When TensorRT converts the model, it scales the original floating-point weight range to the INT8 integer range [-128, 127] with an appropriate scaling factor S_W, as shown in Figure 3. At the same time, during finetuning/calibration of the model, the value ranges of the inputs and outputs of each convolution layer are collected, so that S_I and S_O can scale the original floating-point input and output into the INT8 range. Inside the engine binary the weights are already baked in as INT8, and the input is INT8 before entering the convolution layer, so the input-times-weight multiplications of the convolution are integer operations. Because the product of two INT8 values overflows easily, the multiply-accumulate of the convolution is carried out in INT32, and the accumulated result is also INT32. The bias then has to be added. However, the INT32 result still carries the factors S_I and S_W of the input and weight, so to keep it on the same scale as the bias, the INT32 accumulator is first converted to floating point, divided by S_I and S_W in turn, and then the bias is added to obtain the correct result. At this point the result is still floating point; to output an integer, it is multiplied by S_O to get the final output.

Expressed as a formula, the output is

(I_Q × W_Q / (S_I × S_W) + B) × S_O

which can be rewritten as

I_Q × W_Q × S_O / (S_I × S_W) + B × S_O

Here TensorRT makes an optimization: it folds S_O / (S_I × S_W) into a single new coefficient and stores the result of B × S_O directly in the engine binary, so that only one extra FMA is needed after the convolution multiply-accumulate to obtain the final result. The whole process is shown in Figure 4.

Figure 4 Distribution of values inside the TensorRT INT8 kernel
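The folded epilogue can be written out in a few lines. The NumPy sketch below mirrors the formula above; the function name and the round/clamp details are illustrative:

```python
import numpy as np

def int8_conv_epilogue(acc_int32, s_i, s_w, s_o, bias):
    """Simulate TensorRT's INT8 epilogue: one FMA per output element.

    acc_int32 : INT32 accumulator of the int8 x int8 convolution (I_Q * W_Q)
    s_i, s_w  : scale factors mapping float input / weight to INT8
    s_o       : scale factor mapping float output to INT8
    bias      : original float bias B
    """
    alpha = s_o / (s_i * s_w)        # folded coefficient, precomputed at build time
    beta = bias * s_o                # folded bias, stored in the engine
    out_f = acc_int32.astype(np.float32) * alpha + beta   # the single FMA
    return np.clip(np.rint(out_f), -128, 127).astype(np.int8)
```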

On top of TensorRT 7, we extracted the kernel parameters and reassigned the weight/bias coefficients to fix the problem in the original implementation.

3.2 Further improving the accuracy of INT8 inference

Although INT8 greatly improves inference efficiency, and QAT quantization improves accuracy compared with PTQ, INT8 is still less accurate than full precision overall. To further improve the inference accuracy of the INT8 model, we adopted two methods: embedding the TensorRT kernel into the finetune process, and computing scaling factors in real time.

  • Embedding the TensorRT kernel in the finetune process

The QAT finetune process is a fake-quantization process. For PyTorch, it simply wraps the inputs and outputs: the weights/inputs/outputs are scaled to the integer range following the scaling process described in Section 3.1, truncated with a round, and then converted back to floating point with the same coefficients. The quantization error introduced by this operation simulates reasonably well how the deployed operator behaves on the actual hardware.
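A minimal sketch of this fake-quantization step (per-tensor scale, following the float-to-INT8 scaling convention used above):

```python
import torch

def fake_quant(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Fake quantization as used in QAT finetuning: scale to the INT8 range,
    round and clamp, then immediately return to float with the same scale.
    Only the rounding/clamping error is kept; the tensor stays floating point."""
    x_int = torch.clamp(torch.round(x * scale), -128, 127)
    return x_int / scale

# Example: per-tensor scale chosen from the observed dynamic range.
w = torch.randn(64, 64, 3, 3)
s_w = 127.0 / w.abs().max()
w_fq = fake_quant(w, s_w)        # what the convolution "sees" during QAT
```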

Although PyTorch's quantization tries to simulate the actual quantized inference computation, its results still differ from those of, for example, TensorRT's INT8 operators in real inference, and this difference is a source of error in the final inference.

To eliminate this error source, we embedded the TensorRT kernel into the finetune process of PyTorch quantization-aware training, ensuring that the operator used in quantization training is exactly the operator used in TensorRT inference. Simply put, the earlier "tape recorder" is brought out again; because it is now used during training, the kernel's weight parameters must be refreshed from their current values before each computation.
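One way to wire this up is a custom autograd Function whose forward pass calls the replayed TensorRT INT8 kernel and whose backward pass falls back to ordinary float convolution gradients, in the spirit of the straight-through estimator. The sketch below assumes a hypothetical `trt_int8_conv_kernel` wrapper around the replayed kernel; it illustrates the idea rather than the exact production code:

```python
import torch

class TRTInt8Conv(torch.autograd.Function):
    """Forward: the replayed TensorRT INT8 conv kernel (weights re-quantized
    from their current float values before every call).
    Backward: ordinary float conv gradients as a straight-through surrogate."""

    @staticmethod
    def forward(ctx, x, weight, bias, stride, padding):
        ctx.save_for_backward(x, weight)
        ctx.stride, ctx.padding = stride, padding
        # trt_int8_conv_kernel is a hypothetical bridge to the kernel
        # obtained by the disassembly/replay described in Section 2.
        return trt_int8_conv_kernel(x, weight, bias, stride, padding)

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        grad_x = torch.nn.grad.conv2d_input(x.shape, weight, grad_out,
                                            stride=ctx.stride, padding=ctx.padding)
        grad_w = torch.nn.grad.conv2d_weight(x, weight.shape, grad_out,
                                             stride=ctx.stride, padding=ctx.padding)
        grad_b = grad_out.sum(dim=(0, 2, 3))      # NCHW bias gradient
        return grad_x, grad_w, grad_b, None, None
```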

  • Real-time computation of scaling factors

In Section 3.1, when explaining TensorRT's internal INT8 computation, we mentioned that the input/output scaling factors are computed from an empirical value range collected on the finetune dataset. This introduces another error: during actual inference, differences in video-frame content are likely to make the value range of the feature maps produced by a convolution inconsistent with the range seen on the finetune dataset. Continuing to use the old factors then degrades the accuracy of inference on the new content.

To solve this problem, we introduced dynamic updates of the scaling factors, especially the output scaling factors. Generally speaking, the scaling factor of the input (i.e., the original frame) can stay fixed; if the output scaling factor can be computed dynamically, the cascade structure lets it become the input scaling factor of the next convolution, so every scaling factor in the network can be kept up to date.

Here we disassembled TensorRT further and added a new assembly module to the original INT8 kernel, so that the output of the INT8 convolution is no longer INT8 but FP16. With this change, the output scaling factor no longer needs to take part in the convolution operator itself. A subsequent kernel uses a reduce to find the maximum of the FP16 output and determine its scale factor, and the FP16 output is then scaled to INT8 as the input of the next layer. The overall process is shown in Figure 5.

Figure 5 Overall INT8 precision optimization process
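The per-layer requantization step can be sketched as follows (a PyTorch stand-in for the reduce-and-rescale kernels; the real implementation lives in the modified CUDA/assembly kernel chain):

```python
import torch

def requantize_dynamic(out_fp16: torch.Tensor):
    """Per-frame requantization: the conv writes FP16, a reduction finds the
    absolute maximum, and the output scale is derived on the fly instead of
    being taken from the finetune-dataset statistics."""
    amax = out_fp16.abs().amax()                  # the reduce kernel in the real pipeline
    s_o = 127.0 / amax.clamp(min=1e-6)            # dynamic output scale factor
    out_int8 = torch.clamp(torch.round(out_fp16.float() * s_o), -128, 127).to(torch.int8)
    return out_int8, s_o                          # s_o becomes S_I of the next layer
```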

4. Performance improvement results

In the overall optimization of the EDVR deployment, besides the "split" and "fuse" optimizations described above, there were several other optimizations. As shown in Figure 6, the second optimization step removes redundant video-memory accesses in the DCN custom op, greatly improving the efficiency of the custom operator itself. In the third step we fused operators such as LeakyReLU, gaining roughly another 150 ms. In the fourth and fifth steps we focused on eliminating pairs of intermediate-state format conversions that cancel each other out, which brought EDVR to 380 ms at FP16 precision. Finally, the successful application of INT8 further improved the model's inference efficiency to a single-frame speed of 180 ms at 1080p.

Figure 6 EDVR step-by-step optimization results

5. Outlook

We have successfully increased the inference speed of the super-resolution model by a factor of 10 through deep customization of TensorRT and INT8 quantization, but this is just the beginning. As Nvidia's architectures evolve, we see room for further performance gains: new hardware features such as structured sparsity and ultra-low-precision networks add more tools and weapons for optimization.

At the same time, the current degree of manual optimization is still high, and we plan to explore more automated, compiler-based model optimization methods in the future.

References

[1] developer.nvidia.com/zh-cn/tenso…

[2] EDVR:arxiv.org/abs/1905.02…

[3] ondemand.gputechconf.com/gtc/2017/pr…