General-purpose computing on graphics processing units (GPGPU) uses a GPU to perform general-purpose computing tasks that were originally handled by CPUs. Thanks to the powerful parallel processing capability of modern GPUs, general-purpose GPU computing far outperforms traditional CPUs on massively parallel, compute-intensive algorithms such as matrix multiplication and convolution. CUDA is NVIDIA's high-performance GPGPU computing solution, and at present most deep learning inference tasks can be accelerated with CUDA.
To leverage the computing power of the CUDA platform, NVIDIA has introduced the highly optimized deep learning and linear algebra operator libraries cuDNN, cuBLAS, and CUTLASS, as well as TensorRT, a deep learning inference framework for the CUDA platform.
Basic operator libraries such as cuDNN and cuBLAS perform well on common convolution layers and can usually meet users' needs. However, when faced with highly customized algorithms, these libraries often cannot fully exploit the hardware. This is caused by the long tail problem of operator optimization: the libraries ship with many general convolution optimization strategies, but these strategies cannot cover every case, so the convolution layers in a real algorithm may not benefit from them and the hardware's potential goes unused.
Another problem with the basic operator libraries is that users cannot customize the operators themselves: when an algorithm developer wants to add a new activation function to a convolution operator, or needs a special convolution operator (such as LocalConv), the libraries offer no help.
CUTLASS is a linear algebra template library from NVIDIA. It defines a series of highly optimized operator components, and developers can combine these components to build linear algebra operators with performance comparable to cuDNN and cuBLAS. However, CUTLASS only supports matrix multiplication and does not support convolution operators, so it is difficult to apply it directly to inference deployment in computer vision.
TensorRT is a powerful deep learning inference deployment framework that performs well on the CUDA platform and is now mature and easy to use. However, TensorRT has its own problems. For developers, TensorRT is a black box: users have no way to exercise fine-grained control over its internal implementation details.
For example, when deploying a quantized network, developers cannot control the quantization details inside TensorRT, which may cause accuracy mismatches between deployment and training. Another example: during inference, users cannot finely control how much GPU memory each operator uses; TensorRT sometimes consumes a lot of GPU memory when running a network, and users have no good way to optimize it.
To support deep learning inference deployment on the CUDA platform, various open source frameworks have also launched their own solutions.
Most open source training frameworks deploy to the CUDA platform through model conversion tools that translate the network into a TensorRT-supported format and then hand the inference task over to TensorRT. However, operator definitions differ slightly between training frameworks, which inevitably leads to losses in performance and accuracy during model conversion.
As a cross-platform deep learning inference framework, TVM supports the CUDA platform well. TVM defines a series of matrix multiplication and convolution templates based on its operator optimization primitives, which can be tuned at runtime to reach optimal performance. However, the code automatically generated by TVM on the CUDA platform still falls well short of hand-tuned operator libraries such as cuDNN and cuBLAS, and TVM's performance tuning takes a long time. These two issues prevent TVM from being widely used in real inference deployment scenarios.
Because the official libraries could not meet the customization requirements of algorithm development, and the open source community had not optimized the CUDA platform deeply enough to meet the performance requirements of algorithm deployment, MegEngine carried out secondary development on top of CUTLASS and added CUTLASS support for convolution operators. The long tail problem in operator optimization can be solved through user-defined block sizes. At the same time, the framework reuses the highly optimized operator components in CUTLASS and distills a set of convolution operator optimization strategies for the CUDA platform, so that users can develop customized convolution operators at a lower development cost.
Convolutional operator development framework based on CUTLASS
Long tail problem of operator optimization
In actual model inference deployments, official libraries such as cuDNN often perform poorly. For example, cuDNN is only well optimized when the number of output channels is at least 64; when the channel count is smaller, cuDNN pads the output channels up to 64 and launches correspondingly more threads, which both wastes computing resources and fails to deliver good operator performance.
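To make the cost of this padding concrete, here is a rough back-of-the-envelope sketch. It is our own illustration rather than a cuDNN measurement, and it assumes a same-padded 3×3 convolution so the output keeps the input's spatial size; it simply counts multiply-accumulate operations before and after padding the output channels from 32 to 64:

#include <cstdint>
#include <cstdio>

// Rough MAC count of a convolution layer: N * K * H_out * W_out * C * R * S.
// Purely illustrative; assumes same padding so H_out = H and W_out = W.
uint64_t conv_macs(uint64_t n, uint64_t k, uint64_t h, uint64_t w,
                   uint64_t c, uint64_t r, uint64_t s) {
    return n * k * h * w * c * r * s;
}

int main() {
    // Example shape used later in this article: N=16, C=64, H=92, W=160, 3x3 kernel.
    uint64_t useful = conv_macs(16, 32, 92, 160, 64, 3, 3);  // K = 32 real output channels
    uint64_t padded = conv_macs(16, 64, 92, 160, 64, 3, 3);  // K padded up to 64
    std::printf("useful MACs: %llu, padded MACs: %llu, wasted: %.0f%%\n",
                (unsigned long long)useful, (unsigned long long)padded,
                100.0 * (double)(padded - useful) / (double)padded);
    return 0;
}

Under these assumptions roughly half of the launched work computes results that are thrown away.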
With MegEngine's open source CUTLASS operator development framework, we can easily customize an operator for this smaller number of output channels.
For example, when the input feature map has dimensions N=16, C=64, H=92, W=160, the convolution kernel is 3×3, and the number of output channels is 32, we can add a new block size with the following code to handle the case of 32 output channels:
// define the layout of the input feature map tensor
using LayoutSrc = cutlass::layout::TensorNCxHWx<32>;
// define the layout of the input weight tensor
using LayoutFilter = cutlass::layout::TensorCxRSKx<32>;
// define the thread block tile size: M, N, K
using ThreadBlockShape = cutlass::gemm::GemmShape<32, 64, 64>;
// define the warp tile size: M, N, K
using WarpShape = cutlass::gemm::GemmShape<32, 16, 64>;
// define the TensorCore instruction size: M, N, K
using InstructionShape = cutlass::gemm::GemmShape<8, 8, 16>;
// define the convolution post-processing (epilogue) operator
using EpilogueOp = cutlass::epilogue::thread::BiasAddLinearCombinationReluClamp<
        int8_t, 8, int32_t, int32_t, float>;
// define the convolution operator
using Convolution = cutlass::convolution::device::Convolution<
        int8_t,        // data type of the input feature map
        LayoutSrc,     // layout of the input feature map
        int8_t,        // data type of the weight
        LayoutFilter,  // layout of the weight
        int8_t,        // data type of the output tensor
        LayoutSrc,     // layout of the output tensor
        int32_t,       // data type of the bias
        LayoutSrc,     // layout of the bias
        int32_t,       // data type of the internal accumulation in the matrix multiplication
        cutlass::convolution::ConvType::kConvolution,
        cutlass::arch::OpClassTensorOp,
        cutlass::arch::Sm75,
        ThreadBlockShape, WarpShape, InstructionShape, EpilogueOp,
        cutlass::convolution::threadblock::ConvolutionNCxHWxThreadblockSwizzle<
                cutlass::convolution::ConvType::kConvolution>,
        2,   // 2 enables the shared memory ping-pong prefetch optimization
        16,  // tensor alignment: determines the bit width of the load/store instructions;
        16>; // wider instructions give higher throughput and better operator performance

Convolution conv_op;
typename Convolution::Arguments args{...};
conv_op.initialize(args, workspace);
// run the convolution operator
conv_op();
In tests on a T4 card, this CUTLASS custom operator is 26% faster than cuDNN.
Under common convolution parameters, the performance of convolution operators defined with CUTLASS is also comparable to cuDNN. We measured the performance of some common convolution layers in ResNet50 on a T4 card:
Among the 17 selected convolution layers, CUTLASS outperforms cuDNN on 11 of them, and the remaining 6 reach more than 80% of cuDNN's performance.
Operator fusion
With the TensorCore INT8 compute units introduced in NVIDIA's Turing architecture GPUs, the computing power of the GPU has increased dramatically, but its memory bandwidth has not grown correspondingly, so memory access often becomes the bottleneck of inference performance. In this situation we need to fuse memory-bound operators with compute-bound operators to eliminate the extra round trips to global memory that the memory-bound operators cause. Here we use TensorCore INT8 inference acceleration as an example to show how MegEngine and CUTLASS perform operator fusion.
The 8-bit quantized convolution layers on the CUDA platform use the NCHW4 data Layout. Unlike the common NCHW Layout, NCHW4 packs every four channels together and stores them contiguously in memory; the W, H, C and N dimensions are then laid out in the Tensor with increasing strides. To use TensorCore for acceleration, we need to convert the Tensor Layout to NCHW32, which is similar to NCHW4 except that 32 channels are packed together and stored contiguously.
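To make the two Layouts concrete, the sketch below (our own illustration, not MegEngine code) computes the linear offset of element (n, c, h, w) in an NCxHWx Tensor, where the pack size x is 4 for NCHW4 and 32 for NCHW32:

#include <cstddef>

// Offset of element (n, c, h, w) in an NCxHWx tensor of shape (N, C, H, W):
// every x channels are packed together and stored contiguously, so the
// logical dimensions become [N, C/x, H, W, x]. Illustrative only.
std::size_t ncxhwx_offset(std::size_t n, std::size_t c, std::size_t h, std::size_t w,
                          std::size_t C, std::size_t H, std::size_t W, std::size_t x) {
    std::size_t c_outer = c / x;  // which packed channel block
    std::size_t c_inner = c % x;  // position inside the packed block
    return (((n * (C / x) + c_outer) * H + h) * W + w) * x + c_inner;
}
// ncxhwx_offset(n, c, h, w, C, H, W, 4)  -> offset in the NCHW4 layout
// ncxhwx_offset(n, c, h, w, C, H, W, 32) -> offset in the NCHW32 layout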
When using MegEngine for inference deployment, as long as the user turns on the TensorCore optimization option, MegEngine inserts the appropriate Tensor Reformat operators during the graph optimization phase to convert the Layout, as shown in the first stage of the transformation in Figure 2. MegEngine then removes the redundant Tensor Reformat operators, producing the computation sequence of the second stage in Figure 2.
Combined with CUTLASS, MegEngine can optimize the computation graph further. First, we observe that the Pooling operator and the Reformat operator connected to it can be swapped. After exchanging their order, the three operators at the top of the graph, Elemwise, Convolution and Reformat, can be fused into a single Super Conv through CUTLASS, yielding the final computation graph in Figure 2. In the optimized graph, the memory-bound operators introduced for TensorCore have all been fused into the convolution operator, so the optimized network enjoys the full acceleration of TensorCore without any extra Tensor Reformat overhead.
So how do we use CUTLASS's operator fusion capability? CUTLASS already provides high-performance read and write components for both the NCHW4 and NCHW32 Layouts, so a fused Convolution+Reformat operator can be defined simply by combining the Convolution operator with the corresponding Epilogue operator. The sample code below (Figure 3) shows how to use CUTLASS to define a convolution operator whose input Tensor is in NCHW4 Layout and whose output Tensor is in NCHW32 Layout.
// define the layout of the input feature map tensor
using LayoutSrc = cutlass::layout::TensorNCxHWx<4>;
// define the layout of the input weight tensor
using LayoutFilter = cutlass::layout::TensorCxRSKx<4>;
// define the layout of the output tensor
using LayoutDst = cutlass::layout::TensorNCxHWx<32>;
// define the thread block tile size: M, N, K
using ThreadBlockShape = cutlass::gemm::GemmShape<64, 128, 32>;
// define the warp tile size: M, N, K
using WarpShape = cutlass::gemm::GemmShape<64, 32, 32>;
// define the instruction size: M, N, K
using InstructionShape = cutlass::gemm::GemmShape<1, 1, 4>;
// define the convolution post-processing (epilogue) operator
using EpilogueOp = cutlass::epilogue::thread::BiasAddLinearCombinationReluClamp<
        int8_t, 4, int32_t, int32_t, float>;
// define the convolution operator
using Convolution = cutlass::convolution::device::Convolution<
        int8_t,        // data type of the input feature map
        LayoutSrc,     // layout of the input feature map
        int8_t,        // data type of the weight
        LayoutFilter,  // layout of the weight
        int8_t,        // data type of the output tensor
        LayoutDst,     // layout of the output tensor
        int32_t,       // data type of the bias
        LayoutDst,     // layout of the bias
        int32_t,       // data type of the internal accumulation in the matrix multiplication
        cutlass::convolution::ConvType::kConvolution,
        cutlass::arch::OpClassSimt,
        cutlass::arch::Sm61,
        ThreadBlockShape, WarpShape, InstructionShape, EpilogueOp,
        cutlass::convolution::threadblock::ConvolutionNCxHWxThreadblockSwizzle<
                cutlass::convolution::ConvType::kConvolution>,
        2,   // 2 enables the shared memory ping-pong prefetch optimization
        4,   // tensor alignment: determines the bit width of the load/store instructions;
        16>; // wider instructions give higher throughput and better operator performance

Convolution conv_op;
typename Convolution::Arguments args{...};
conv_op.initialize(args, workspace);
// run the convolution operator
conv_op();
We tested the first convolution layer of ResNet50 on a T4 card. When the output Tensor is in NCHW4 Layout, the convolution takes 3.03 ms and the separate Tensor Reformat operator takes another 0.309 ms. The fused Convolution+Reformat operator also takes 3.03 ms, but the Tensor Reformat overhead disappears after fusion (3.03 ms + 0.309 ms ≈ 3.34 ms before fusion versus 3.03 ms after), improving performance by about 10%.
Custom convolution operator
In some highly customized scenarios, algorithm engineers propose new convolution operators to improve network performance, such as the Local operator in recognition tasks and the CondConv operator proposed by Google Brain. These operators introduce more parameters to improve the model's inference accuracy.
However, on the CUDA platform these operators are often not well optimized, which prevents them from being adopted in real inference tasks. We find that their computation is roughly the same as that of an ordinary convolution; only the way the convolution kernel is accessed differs slightly.
By defining a Prologue operator that pre-processes the convolution kernel before the CUTLASS convolution operator runs, we can change how the kernel is accessed while reusing CUTLASS's high-performance convolution components, and thus implement well-performing Local and CondConv operators. In Megvii's face recognition business, we implemented a high-performance quantized CondConv operator based on CUTLASS, which has landed in production and delivered a free accuracy gain without affecting inference performance.
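To illustrate why CondConv fits this "same main loop, different kernel access" pattern, the sketch below follows the CondConv paper [5]; it is our own illustration and not MegEngine's actual Prologue implementation. Each sample's effective kernel is a routing-weighted sum of several expert kernels, and this combination is exactly the step a Prologue can perform before the ordinary convolution runs.

#include <cstddef>
#include <vector>

// Illustrative CondConv weight combination (Yang et al., 2019), not MegEngine code.
// experts:         num_experts expert kernels, each flattened to kernel_elems floats
// routing_weights: per-sample routing weights produced by a small routing network
std::vector<float> condconv_combine(const std::vector<std::vector<float>>& experts,
                                    const std::vector<float>& routing_weights) {
    const std::size_t kernel_elems = experts.front().size();
    std::vector<float> combined(kernel_elems, 0.0f);
    for (std::size_t e = 0; e < experts.size(); ++e)
        for (std::size_t i = 0; i < kernel_elems; ++i)
            combined[i] += routing_weights[e] * experts[e][i];
    // The combined kernel is then consumed by a normal convolution main loop;
    // folding this step into a Prologue lets the CUTLASS convolution components
    // be reused unchanged.
    return combined;
}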
Custom activation function
Currently, the only activation function supported by the convolution operators in NVIDIA's cuDNN library is ReLU. If algorithm engineers want to use a novel activation function in their model (for example, HSwish), it cannot be fused into the convolution operator, which increases model inference time. In scenarios with strict inference latency requirements, the new activation function therefore cannot really be put into production.
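For reference, HSwish (introduced in MobileNetV3) is defined as HSwish(x) = x · ReLU6(x + 3) / 6; since it is not a simple ReLU-style clamp, cuDNN cannot fuse it into the convolution operator.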
With CUTLASS, the problem of custom activation functions can be solved much more easily: we only need to add a new Epilogue operator that implements the activation function. For example, the following code defines the HSwish activation function:
template <typename ElementOutput_, int Count,
          typename ElementAccumulator_ = ElementOutput_,
          typename ElementBias_ = ElementOutput_,
          typename ElementCompute_ = ElementOutput_,
          FloatRoundStyle Round = FloatRoundStyle::round_to_nearest,
          typename Policy = NumericArrayConverterPolicy<
                  ElementOutput_, Count, ElementAccumulator_, ElementBias_,
                  ElementCompute_, Round>>
class BiasAddLinearCombinationHSwishClamp {
    /// Params, constructors, typedefs, etc. are defined here; part of the code is omitted
    /// ...
public:
    CUTLASS_HOST_DEVICE
    FragmentOutput operator()(FragmentAccumulator const& accumulator,
                              FragmentBias const& bias,
                              FragmentOutput const& source) const {
        SourceConverter source_converter;
        AccumulatorConverter accumulator_converter;
        BiasConverter bias_converter;

        ComputeFragment converted_source = source_converter(source);
        ComputeFragment converted_accumulator = accumulator_converter(accumulator);
        ComputeFragmentBias converted_bias = bias_converter(bias);

        ComputeFragment intermediate;
        multiplies<ComputeFragment> mul_add_source;
        multiply_add<ComputeFragment> mul_add_accumulator;
        multiply_add<ComputeFragmentBias> mul_add_bias;
        HSwish<ComputeFragment> hswish;
        minimum<ComputeFragment> min_accumulator;
        maximum<ComputeFragment> max_accumulator;

        /// calculate the linear combination of accumulator, source and bias
        intermediate = mul_add_source(gamma_, converted_source);
        intermediate = mul_add_accumulator(alpha_, converted_accumulator, intermediate);
        intermediate = mul_add_bias(beta_, converted_bias, intermediate);

        /// apply the HSwish activation
        intermediate = hswish(scale_, inv_scale_, intermediate);

        /// clamp the result to the representable range of the output type
        ElementCompute const kClamp = ElementCompute(
                (1U << (sizeof_bits<ElementOutput>::value - 1)) - 1);
        intermediate = max_accumulator(intermediate, -kClamp - ElementCompute(1));
        intermediate = min_accumulator(intermediate, kClamp);

        /// convert to the output data type
        OutputConverter destination_converter;
        return destination_converter(intermediate);
    }
};
We then simply pass the newly defined Epilogue operator into the Convolution template to obtain a convolution operator that fuses the new activation function.
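For example, reusing the types from the first listing, only the EpilogueOp line changes. This is a sketch: the surrounding template parameters are assumed to be the same as in that earlier NCHW32 TensorCore example.

// Use the HSwish epilogue defined above instead of the ReLU one; everything
// else is assumed unchanged from the earlier NCHW32 TensorCore listing.
using EpilogueOp = cutlass::epilogue::thread::BiasAddLinearCombinationHSwishClamp<
        int8_t, 8, int32_t, int32_t, float>;

using ConvolutionHSwish = cutlass::convolution::device::Convolution<
        int8_t, LayoutSrc,      // input feature map: data type and layout
        int8_t, LayoutFilter,   // weight: data type and layout
        int8_t, LayoutSrc,      // output: data type and layout
        int32_t, LayoutSrc,     // bias: data type and layout
        int32_t,                // accumulator data type
        cutlass::convolution::ConvType::kConvolution,
        cutlass::arch::OpClassTensorOp, cutlass::arch::Sm75,
        ThreadBlockShape, WarpShape, InstructionShape,
        EpilogueOp,             // the custom HSwish epilogue defined above
        cutlass::convolution::threadblock::ConvolutionNCxHWxThreadblockSwizzle<
                cutlass::convolution::ConvType::kConvolution>,
        2, 16, 16>;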
Inference deployment on the CUDA platform
At present, the latest version of MegEngine has integrated the convolution operators implemented with CUTLASS.
Dump the quantized model as described in the documentation and you can use MegEngine to complete the inference deployment.
[Documentation]
megengine.org.cn/doc/advance…
We can use the load_and_run tool to measure the model's inference speed.
[How to use load_and_run]
megengine.org.cn/doc/advance…
For example, the test results for ResNet-18 are as follows:
./load_and_run resnet18.mge --input ./cat.npy --enable-nchw32 --fast-run
mgb load-and-run: using MegBrain 8.9999.0(0) and MegDNN 9.3.0
[09 14:14:14 [email protected]:1169][WARN] enable nchw32 optimization
load model: 3018.428ms
=== prepare: 182.441ms; going to warmup
[09 14:11:11 [email protected]:492][ERR] timeout is set, but no fork_exec_impl not given; timeout would be ignored
[09 14:11:11 [email protected]:492][ERR] timeout is set, but no fork_exec_impl not given; timeout would be ignored
[09 14:11:11 [email protected]:492][ERR] timeout is set, but no fork_exec_impl not given; timeout would be ignored
warmup 0: 481.411ms
=== going to run input for 10 times
iter 0/10: 19.432ms (exec=0.754,device=19.307)
iter 1/10: 18.537ms (exec=0.899,device=18.497)
iter 2/10: 18.802ms (exec=0.727,device=18.762)
iter 3/10: 18.791ms (exec=0.653,device=18.759)
iter 4/10: 18.614ms (exec=0.761,device=18.585)
iter 5/10: 18.529ms (exec=0.708,device=18.499)
iter 6/10: 18.660ms (exec=0.706,device=18.634)
iter 7/10: 18.917ms (exec=0.667,device=18.894)
iter 8/10: 20.090ms (exec=0.655,device=19.070)
iter 9/10: 20.21ms (exec=0.630,device=19.187)
=== finished test #0: time=188.586ms avg_time=18.859ms sd=0.304ms minmax=18.529,19.432
As you can see, on a T4 card the end-to-end time of ResNet-18 is about 18.86 ms; deployed with TensorRT, the end-to-end time is about 16.85 ms. MegEngine's inference performance on the CUDA platform is therefore around 90% of TensorRT's, roughly comparable. In deployment scenarios that demand low inference latency together with heavy customization and strict accuracy requirements, MegEngine's CUDA backend is a viable choice for inference deployment.
Conclusion
This article has introduced the CUTLASS-based convolution operator optimization framework in the latest version of MegEngine. In the next few articles, we will continue to explain how CUTLASS optimizes convolution operators and how to use CUTLASS to add a high-performance custom convolution operator to MegEngine.
With the help of the CUTLASS framework, developers can build convolution operators with custom block sizes to solve the long tail problem in inference optimization, support custom activation functions, fuse convolution operators with memory-bound operators, and implement well-performing variants of the convolution operator.
We welcome you to try MegEngine's CUDA inference deployment and its CUTLASS-based convolution operator customization, and we look forward to your feedback, so that MegEngine and the CUTLASS convolution framework can serve a wide range of deep learning developers in highly customized inference deployment scenarios.
References
[1] Kerr, A., (2020). Developing CUDA kernels to push tensor cores to the absolute limit on NVIDIA A100. In: GPU Technology Conference.
[2] Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., & Shelhamer, E. (2014). cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759.
[3] Vanholder, H. (2016). Efficient Inference with TensorRT. In: GPU Technology Conference.
[4] Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., … & Guestrin, C. (2018). TVM: An automated end-to-end optimizing compiler for deep learning. In: Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI). (pp. 578-594).
[5] Yang, B., Bender, G., Le, Q.V., & Ngiam, J. (2019). CondConv: Conditionally parameterized convolutions for efficient inference. In: Advances in Neural Information Processing Systems. (pp. 1305-1316).
[6] Ma, N., Zhang, X., Huang, J., & Sun, J. (2020). WeightNet: Revisiting the design space of weight networks. In: Proceedings of the European Conference on Computer Vision (ECCV).
Welcome to visit:
- MegEngine website: megengine.org.cn
- MegEngine GitHub (welcome to star): github.com/MegEngine