Author: Chen Zhenhuan | Megvii Technology, MegEngine architect

Background

In recent years, automatic mixed precision (AMP) has been adopted more and more widely in algorithm research, since every major deep learning training framework now offers it as a simple, low-cost, and effective way to accelerate training. However, most articles about mixed precision training stop at the framework interface, the basic principles (such as how to avoid the precision loss of the FP16 type and how to avoid NaNs), and a few usage tips. That is enough for researchers who treat the deep learning framework as a black-box tool, but if you want to go one step further down toward lower-level acceleration, the terminology of GPU architectures and CUDA programming can easily confuse readers without the background knowledge.

Starting from the Tensor Core involved in mixed precision training, this article uses code examples to help you understand the details of training acceleration with Tensor Cores, from the framework level all the way down to CUDA programming.

Tensor Core principle

First, a brief introduction to what mixed precision and Tensor Core are. Mixed precision here refers to a technique at the level of the underlying hardware operator: half precision (FP16) is used for the inputs and outputs, while full precision (FP32) is used for the intermediate result calculation so that not too much precision is lost. It does not mean mixing FP16 and FP32 tensors at the network level. The underlying hardware in question is the Tensor Core, so having Tensor Cores on the GPU is a necessary condition for accelerating training with mixed precision.

Figure 1

Tensor Core literally translates as "tensor core". Physically, it is a dedicated region on an NVIDIA GPU (the large dark-green blocks in Figure 2), sitting alongside the regular CUDA Cores (the light-green blocks and the small dark-green ones) and the newer RT Cores (Ray Tracing, light yellow). The CUDA Cores are divided by data type, with many small cores of each type: the INT32 Cores and FP32 Cores in the figure number 4×16 each, and dedicated compute cards (such as the V100 and A100) also include FP64 Cores. The Tensor Core, by contrast, is not subdivided further at the architecture and interface level; it can be viewed as a single discrete computing unit on the GPU (although there are some differences internally).

Figure 2: An SM of the Turing-architecture 2080 Ti graphics card [1]

In the logical (mathematical) sense, whereas an FP32 Core can only operate on two numbers at a time (the left half of each of the two diagrams in Figure 3), a Tensor Core takes the product of two 4×4 FP16 matrices in one go and adds it to another 4×4 matrix, i.e. D = A * B + C, which is why it is called a Tensor Core. Thanks to this hardware design, a Tensor Core can theoretically achieve 8 times the computational throughput of FP32 Cores (on the Volta and Turing architectures) without a significant increase in die area or power consumption. Mixed precision gets its training acceleration precisely by exploiting the Tensor Core.

Figure 3
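
Written out element-wise, the operation a Tensor Core performs per instruction, with FP16 inputs A and B and an FP16 or FP32 accumulator C, is simply

$$
D_{ij} = \sum_{k=0}^{3} A_{ik}\,B_{kj} + C_{ij}, \qquad i,j \in \{0,1,2,3\}
$$

with the products and the sum carried out at full precision before the result is written back, which is exactly where the "mixed" in mixed precision comes from.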

The comparison here is at the level of the Streaming Multiprocessor (SM), which is the basic unit on a GPU for executing a complete computing task. A GPU generally contains dozens of SMs (for example, the V100 contains 80 SMs and the A100 108, out of 128 on the full GA100 chip), and each SM contains various computing cores and storage resources (Figure 2 shows one complete SM).

On the V100, each SM has 8 Tensor Cores and 64 FP32 Cores; a Tensor Core performs 4×4×4 = 64 FMAs per clock cycle, while an FP32 Core performs one, which is where the factor of 8 comes from. On the A100, each SM instead has 4 third-generation Tensor Cores, each four times as powerful as its predecessor at 256 FMAs per clock cycle, which works out to twice the V100 per SM and 16 times the FP32 Cores.
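
Spelled out per SM and per clock cycle, the ratios quoted above come from:

$$
\begin{aligned}
\text{V100:}\;& 8 \times 64 = 512 \ \text{FMA (Tensor Core)} &&\text{vs.}\quad 64 \times 1 = 64 \ \text{FMA (FP32)} &&\Rightarrow\ 8\times \\
\text{A100:}\;& 4 \times 256 = 1024 \ \text{FMA (Tensor Core)} &&\text{vs.}\quad 64 \ \text{FMA (FP32)} &&\Rightarrow\ 16\times,\ \text{i.e.}\ 2\times \text{V100}
\end{aligned}
$$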

Understanding from the CUDA interface level

Show me the code. Let us now look at what Tensor Core means at the level of the code interface, so that we can understand the acceleration principle underlying mixed precision training.

To run your calculations on Tensor Cores you need to go through NVIDIA's CUDA API. Since the Volta architecture introduced the Tensor Core, a new CUDA interface had to be exposed for it: CUDA 9.0 added the WMMA (warp-level matrix multiply and accumulate) API, which uses Tensor Cores to perform matrix computation. Three of its interfaces matter for this article:

void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned ldm, layout_t layout);
void store_matrix_sync(T* mptr, const fragment<...> &a, unsigned ldm, layout_t layout);
void mma_sync(fragment<...> &d, const fragment<...> &a, const fragment<...> &b, const fragment<...> &c, bool satf=false);

In use, you first load data from the data pointer mptr into a fragment with load_matrix_sync, then compute on the four fragments with mma_sync (d = a * b + c), and finally write the output fragment back to the output pointer mptr with store_matrix_sync. The simplest example, multiplying and accumulating two 16×16 matrices, is shown below [2]:

#include <mma.h>
using namespace nvcuda;
       
__global__ void wmma_ker(half *a, half *b, float *c) {
   // Declare the fragments
   wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
   wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
   wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
 
   // Initialize the output to zero
   wmma::fill_fragment(c_frag, 0.0f);
 
   // Load the inputs
   wmma::load_matrix_sync(a_frag, a, 16);
   wmma::load_matrix_sync(b_frag, b, 16);
 
   // Perform the matrix multiplication
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
 
   // Store the output
   wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

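For completeness, here is a minimal host-side sketch of how this kernel could be launched; the unified-memory allocation, the all-ones initialization, and the single-warp launch configuration are assumptions chosen for brevity (compiled with something like nvcc -arch=sm_70), not part of the original example.

#include <cuda_fp16.h>

int main() {
    half *a, *b;
    float *c;
    // One 16x16 tile per operand: FP16 inputs, FP32 accumulator
    cudaMallocManaged(&a, 16 * 16 * sizeof(half));
    cudaMallocManaged(&b, 16 * 16 * sizeof(half));
    cudaMallocManaged(&c, 16 * 16 * sizeof(float));
    for (int i = 0; i < 16 * 16; ++i) {
        a[i] = __float2half(1.0f);
        b[i] = __float2half(1.0f);
    }

    // WMMA is warp-level: one warp (32 threads) handles the whole 16x16 tile
    wmma_ker<<<1, 32>>>(a, b, c);
    cudaDeviceSynchronize();

    // With all-ones inputs, every element of c should now be 16 (the shared dimension)
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}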

By now a few questions will have piled up: what exactly is a warp? And if a Tensor Core multiplies and accumulates 4×4 matrices, why is the unit here 16×16? It all comes down to how the GPU performs parallel computation.

We all know that a GPU has a large number of cores, for example the 64 FP32 Cores in one SM. To manage them efficiently, the hardware binds threads together in groups: the threads in a group execute the same instruction and advance in lockstep. Such a group is called a warp (a term from weaving, extending the "thread" metaphor). At the CUDA level, results are synchronized at warp granularity, which is why all three of the functions above end in "_sync".
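
To make the warp idea concrete, here is a tiny sketch that has nothing to do with WMMA itself: the 32 threads (lanes) of one warp cooperatively sum a per-thread value with the __shfl_down_sync intrinsic, whose _sync suffix carries the same meaning as in the WMMA functions above, namely that all lanes of the warp participate together.

// Warp-level reduction: after the loop, lane 0 holds the sum of `v`
// across all 32 threads of the warp. Each step halves the number of
// lanes that still carry partial sums.
__device__ float warp_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffff, v, offset);
    }
    return v;
}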

This grouping is also visible in the hardware. Looking again at the structure of the Turing SM (Figure 2), we can see that it is divided into four identical parts (one is shown in Figure 4 below), called sub-cores, and the orange block in each is the Warp Scheduler, whose job is to dispatch tasks to warps.

Figure 4: A sub-core of the Turing architecture SM

The scheduler can issue only one instruction per clock cycle, so to keep every unit busy, an issued instruction usually executes over multiple clock cycles (similar to pipelining). In current GPU designs a warp scheduler drives 32 threads, which you can think of as one task containing 32 sub-tasks, while each sub-core has only 16 FP32 Cores, so issuing one FP32 instruction for a whole warp takes two clock cycles.

Figure 5: Instruction flow in the Turing-architecture sub-core [4]

For FP16 matrix multiplication on the Tensor Core, the Volta hardware instruction multiplies an 8×4 matrix by a 4×8 matrix and accumulates into the corresponding 8×8 matrix in one go, taking four clock cycles to complete; amortized, one clock cycle corresponds exactly to one 4×4 matrix multiply-accumulate, consistent with the claimed Tensor Core throughput. In the CUDA API, to increase instruction overlap and improve parallel efficiency, this m8n8k4 shape is promoted to a minimum of m16n16k16, which is why wmma::mma_sync takes 16×16 as its smallest unit.

In fact, matrix computation in CUDA usually cuts large matrices into fixed-size tiles. That is why the interface's inputs are called fragments rather than tensors: each thread's fragment is responsible for only part of a tile.
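
As a small illustration (a common WMMA usage pattern, not taken from the kernel above), element-wise post-processing of an accumulator fragment loops over frag.num_elements, the handful of values this particular thread owns, rather than over all 16×16 values of the tile; collectively the 32 threads of the warp cover the whole tile. The variable alpha here is a hypothetical scaling factor.

// Hypothetical epilogue: scale the accumulator tile by alpha.
// Each thread sees only its own share of the 16x16 tile through its fragment.
for (int i = 0; i < c_frag.num_elements; ++i) {
    c_frag.x[i] *= alpha;
}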

So, through WMMA, CUDA dispatches m8n8k4 matrix multiply-accumulates to the Tensor Cores at the granularity of a whole warp, one every four clock cycles, and out of such pieces the 16×16 tile computation is assembled; this is exactly what the "warp-level" in the API's name refers to.

We have now looked at the code for driving the Tensor Core through the WMMA CUDA API, and taken a brief glance at what happens in the hardware underneath. Some of the terms and descriptions above are simplified rather than strictly rigorous, for the sake of comprehension, but they should give you a feel for how the GPU performs parallel computation.

Understanding from the framework usage level

At the framework level we generally do not call Tensor Cores directly through CUDA; instead, the computation goes through cuDNN, NVIDIA's library of DNN operators. On the one hand cuDNN hides many hardware details; on the other hand, its implementations are good enough in most common cases that there is no need to reinvent the wheel.

Next, we walk through the computation flow using the convolution operation, the most common one in mixed precision training. Let's first look at the convolution API [5] in cuDNN:

cudnnStatus_t cudnnConvolutionForward(
    cudnnHandle_t                       handle,
    const void                         *alpha,
    const cudnnTensorDescriptor_t       xDesc,
    const void                         *x,
    const cudnnFilterDescriptor_t       wDesc,
    const void                         *w,
    const cudnnConvolutionDescriptor_t  convDesc,
    cudnnConvolutionFwdAlgo_t           algo,
    void                               *workSpace,
    size_t                              workSpaceSizeInBytes,
    const void                         *beta,
    const cudnnTensorDescriptor_t       yDesc,
    void                               *y)

A few of the terms here need explanation:

  • cudnnStatus_t: cuDNN interfaces are designed to write their results through output pointers in the parameter list (such as y), so the return value carries only status information, i.e. whether the call succeeded or failed.
  • cudnnHandle_t: a handle is the channel for communicating with the device, similar to a file handle; every cuDNN call needs a handle to the CUDA device.
  • cudnnTensorDescriptor_t and cudnnFilterDescriptor_t: both are data descriptors containing layout, dtype, and so on, since the data itself is passed in only as a void* pointer (x and w here).
  • cudnnConvolutionDescriptor_t: an operation descriptor, similar in spirit to the data descriptors, describing parameters and attributes of the op itself; for convolution these include pad, stride, and dilation.
  • cudnnConvolutionFwdAlgo_t: literally, the algorithm used for forward convolution. Since convolution can be computed in many different ways, each suited to different data scenarios, the algorithm to use must be specified here.
  • workSpace: unlike upper-layer code, which can create data objects anytime and anywhere, at the device level the memory needed for a computation must be declared in advance. The workspace is the extra "working space" the computation requires besides its inputs and outputs, which you can loosely think of as its space complexity.

Having looked at the parameters of this API, it becomes much clearer how to use the interface. Rather than teaching you how to fill in each parameter of a ready-made interface, we would like you to think about the logical relationship between these parameters. Specifically: which parameters do you think determine whether this convolution runs on Tensor Cores?

The WMMA interface restricts the matrix tile to 16×16 and the input data to half precision (the accumulator c may be float). So the descriptors of w/x/y must have an influence (the data pointers themselves carry no such information, so they do not): we need to set the data type in the data descriptors to half precision, and each relevant dimension must be a multiple of 8 (rather than of 16, because the internal implementation does some extra processing).
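
As a sketch of what that looks like in code (the NCHW shapes here are made up for illustration; checkCudnnErr is the same error-checking helper used in the snippet further below), the FP16 data descriptors might be set up like this:

// Input tensor x: FP16, NCHW layout, with N/C/H/W kept multiples of 8
cudnnTensorDescriptor_t xDesc;
checkCudnnErr( cudnnCreateTensorDescriptor(&xDesc) );
checkCudnnErr( cudnnSetTensor4dDescriptor(
    xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF,
    /*n=*/32, /*c=*/64, /*h=*/56, /*w=*/56) );

// Filter w: FP16 as well
cudnnFilterDescriptor_t wDesc;
checkCudnnErr( cudnnCreateFilterDescriptor(&wDesc) );
checkCudnnErr( cudnnSetFilter4dDescriptor(
    wDesc, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW,
    /*k=*/128, /*c=*/64, /*h=*/3, /*w=*/3) );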

Next, consider the convolution operator itself. Even when the data type and dimensions meet the requirements, the operation could still be run on ordinary CUDA Cores, so there must also be parameters that control the operator's behavior. Looking back at the list above, it is not hard to guess that they are the operation descriptor and the algorithm. An algorithm is usually thought of as the logic of a computation, independent of the device it runs on (the same algorithm should follow the same steps on GPU and CPU), but the device does limit which algorithms can run. In fact, for two-dimensional NCHW convolution, algorithms such as FFT, GEMM, and WINOGRAD support both Tensor Cores and FP32 CUDA Cores, while some algorithms can only run on CUDA Cores.
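
If you want to see which algorithms can actually be served by Tensor Cores for a given configuration, one way, sketched here under the assumption that the handle and the four descriptors have already been created, is to query cuDNN's algorithm heuristics and inspect the mathType it reports for each candidate:

// Ask cuDNN to rank the candidate forward-convolution algorithms
cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
int returned = 0;
checkCudnnErr( cudnnGetConvolutionForwardAlgorithm_v7(
    handle, xDesc, wDesc, cudnnConvDesc, yDesc,
    CUDNN_CONVOLUTION_FWD_ALGO_COUNT, &returned, perf) );

for (int i = 0; i < returned; ++i) {
    // perf[i].algo is the candidate algorithm; perf[i].mathType tells whether it
    // can use Tensor Cores (CUDNN_TENSOR_OP_MATH) or only ordinary CUDA Cores
    // (CUDNN_DEFAULT_MATH).
    if (perf[i].status == CUDNN_STATUS_SUCCESS &&
        perf[i].mathType == CUDNN_TENSOR_OP_MATH) {
        // this algorithm is Tensor Core capable for the given descriptors
    }
}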

So we finally arrive at the parameter that really controls whether the Tensor Core is used: the convolution operation descriptor. Besides the generic parameters such as pad, stride, and dilation, the operation descriptor contains an important parameter called mathType, whose default value is CUDNN_DEFAULT_MATH; if you want to use the Tensor Core, it must be changed to CUDNN_TENSOR_OP_MATH, a value whose name clearly points to Tensor Core operations.

One more thing is worth mentioning. As we all know, a key point of mixed precision training is that the intermediate results of FP16 computation are kept in FP32 and only converted back to FP16 at the end, so that accuracy does not drop. This is not actually a limitation of the Tensor Core, which is perfectly capable of computing in FP16 all the way through, so it also has to be controlled in the operation descriptor: the relevant parameter is the dataType attribute of convDesc, which we need to set to single precision (CUDNN_DATA_FLOAT) rather than half precision (CUDNN_DATA_HALF) to preserve accuracy.

Finally, a brief look at the convDesc setup code:

// Create a descriptor
checkCudnnErr( cudnnCreateConvolutionDescriptor( &cudnnConvDesc ));
 
// Set common parameters, including dataType (the last item)
checkCudnnErr( cudnnSetConvolutionNdDescriptor(
    cudnnConvDesc,
    convDim,
    padA,
    convstrideA,
    dilationA,
    CUDNN_CONVOLUTION,
    CUDNN_DATA_FLOAT) );
 
// Set the mathType
checkCudnnErr( cudnnSetConvolutionMathType(cudnnConvDesc, CUDNN_TENSOR_OP_MATH) );

As for the remaining workSpace parameter, it actually depends on all of the parameters above: only once the data attributes, the algorithm, the operator attributes, and the computation behavior are all known can the required "working space" size be determined (in practice it is queried from cuDNN, as in the sketch below), so we will not go into it further here.
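
Putting the pieces together, a sketch of those final two steps, querying the workspace size and launching the convolution itself, could look like the following; algo, the descriptors, and the device pointers d_x / d_w / d_y are assumed to have been prepared already, and alpha / beta are the usual scaling factors applied as y = alpha * conv(x, w) + beta * y.

// How much scratch memory does the chosen algorithm need?
size_t workSpaceSize = 0;
checkCudnnErr( cudnnGetConvolutionForwardWorkspaceSize(
    handle, xDesc, wDesc, cudnnConvDesc, yDesc, algo, &workSpaceSize) );

void *workSpace = nullptr;
cudaMalloc(&workSpace, workSpaceSize);

// Run the convolution: FP16 in/out, FP32 accumulation, on Tensor Cores
float alpha = 1.0f, beta = 0.0f;
checkCudnnErr( cudnnConvolutionForward(
    handle, &alpha, xDesc, d_x, wDesc, d_w,
    cudnnConvDesc, algo, workSpace, workSpaceSize,
    &beta, yDesc, d_y) );

cudaFree(workSpace);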

To sum up, NVIDIA's interface design is quite sophisticated: a concise and well-reasoned set of parameters lets us control the underlying hardware's computation logic from a high level of abstraction. By analyzing the logic behind the interface design, we have also walked through the complete picture of how an operator uses the Tensor Core for mixed precision computation.

Conclusion

For most framework users and algorithm researchers, the Tensor Core, the hardware behind mixed precision training, has long been something whose mathematical meaning is roughly understood but whose details remain vague. This article first connected it to the physical GPU chip, then explained at the lower CUDA interface level how the Tensor Core is driven to perform matrix computation, and finally returned to the framework level to describe in detail how the convolution operator performs mixed precision computation.

You should now understand where some of the familiar restrictions on AMP come from, such as why a particular graphics card gets no acceleration (Tensor Cores require Volta or later) and why the Tensor Core wants dimensions that are multiples of eight; and going a step further, the code-level explanation above should also make clear how the hardware decides whether to run in FP16 or FP32.

We hope that readers who have never touched CUDA programming come away with a deeper understanding of the computing principles underlying mixed precision training, as well as a basic picture of GPU computing and CUDA programming.

Links:

  • GitHub: MegEngine Tianyuan
  • Website: MegEngine - deep learning, simple development
  • Welcome to join the MegEngine technical exchange QQ group: 1029741705

References

  • [1] NVIDIA Turing GPU Architecture white paper
  • [2] Warp Matrix Functions, CUDA C++ Programming Guide, CUDA Toolkit Documentation (nvidia.com) (the WMMA CUDA API)
  • [3] Overall flow design (1): the hierarchical structure of a CUDA program
  • [4] J. Burgess, "RTX on the NVIDIA Turing GPU," IEEE Micro, vol. 40, no. 2, pp. 36-44, 2020.
  • [5] cudnnConvolutionForward, API Reference, NVIDIA cuDNN Documentation (the cuDNN convolution forward API)