Author: Zhang Xiao | MegEngine Architect, Megvii

Preface

In May 2020, Nvidia released Ampere, a new generation of its GPU architecture. The part most closely tied to Deep Learning is the third-generation TensorCore, which offers strong performance and supports a richer set of DL (Deep Learning) data types, including the new TensorFloat-32 (TF32) and BFloat16 (BF16) compute units, as well as INT8, INT4, and INT1 units that provide full support for DL inference. To exploit these compute units, convolution and matrix multiplication operators used to be hand-written in GPU assembly by senior HPC engineers. However, manual operator optimization cannot keep up with so many data types, so the optimization of DL applications increasingly relies on automated tools such as deep learning compilers. Following this trend, Nvidia developed CUTLASS, a linear algebra template library that abstracts a series of high-performance building blocks which can be used to generate convolution and matrix multiplication operators for various data types and compute units. MegEngine builds on top of CUTLASS, which lets us develop new high-performance operators efficiently and migrate to new GPU architectures quickly. In the previous article, we briefly introduced MegEngine's underlying convolution operator implementation; this article takes a deeper look at that implementation for the MegEngine CUDA platform, and interprets and supplements Nvidia's CUTLASS Implicit GEMM convolution documentation.

Accordingly, the CUDA knowledge readers should have before reading this article includes:

  • When accessing Global Memory, adjacent threads in the same Warp should access consecutive addresses so that the access requests are coalesced, maximizing Global Memory throughput.
  • Global Memory should be accessed with the widest possible data type (e.g., Float4) so that each memory instruction fetches as much data as possible (a minimal sketch of such a 128-bit access follows this list).
  • CUDA's Shared Memory is divided into 32 banks of 4 bytes each. A bank conflict occurs when threads in the same Warp access different addresses within the same bank; an access pattern free of bank conflicts maximizes Shared Memory throughput.
  • The GPU has four levels of storage: Global Memory (video memory), L2, L1 (Shared Memory), and registers. The latency of accessing video memory directly is very high, so optimizing computation-intensive operators such as GEMM and Convolution requires:
    • Reducing Global Memory access requests by caching data in L1/Shared Memory and registers.
    • Hiding the unavoidable Global Memory access latency behind a large amount of computation.
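As a concrete illustration of the first two points, here is a minimal sketch (kernel name and sizes are our own, not from MegEngine) of a copy kernel in which adjacent threads of a warp read adjacent 16-byte chunks with 128-bit loads, so that a warp's requests coalesce into contiguous transactions:

#include <cstdint>
#include <cuda_runtime.h>

// Each thread moves one 16-byte chunk; adjacent threads touch adjacent chunks,
// so the 32 requests of a warp coalesce, and each request uses the widest
// 128-bit type. src and dst are assumed to be 16-byte aligned.
__global__ void copy_128bit(const int8_t* __restrict__ src,
                            int8_t* __restrict__ dst, int n_bytes) {
    int chunk = blockIdx.x * blockDim.x + threadIdx.x;  // one 16B chunk per thread
    int offset = chunk * 16;
    if (offset + 16 <= n_bytes) {
        int4 v = *reinterpret_cast<const int4*>(src + offset);  // 128-bit load
        *reinterpret_cast<int4*>(dst + offset) = v;             // 128-bit store
    }
}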

First, we need to understand some of the abstractions introduced by CUTLASS:

  • TileIterator: used to access the data of a Tile in memory. TileIterator implements the advance() method and supports data types such as Matrix and Tensor.
  • Fragment: an array type used to hold the data read in by a TileIterator. A Fragment usually resides in registers. (A minimal sketch of the TileIterator/Fragment pattern follows this list.)
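The following sketch is not CUTLASS's actual class, only an illustration of the TileIterator/Fragment idea: load() fills a register-resident Fragment, and advance() moves the iterator to the next Tile. The real CUTLASS iterators are templates with many more parameters (thread maps, layouts, predicates).

#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical, simplified tile iterator used only to illustrate the concept.
struct Fragment {
    int8_t storage[16];  // data held in registers after a load
};

struct TileIterator {
    const int8_t* ptr;       // current tile position in Global Memory
    int64_t advance_stride;  // distance (in bytes) to the next tile

    __device__ void load(Fragment& frag) const {
        // each thread loads its own 128-bit piece of the tile
        *reinterpret_cast<int4*>(frag.storage) =
                *reinterpret_cast<const int4*>(ptr + threadIdx.x * 16);
    }
    __device__ void advance() { ptr += advance_stride; }  // step to the next tile
};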

Next we briefly review the high-performance GEMM pipeline designed by CUTLASS. Operators implemented according to this pipeline can reach more than 90% of cuBLAS performance on the CUDA platform. The following figure illustrates the pipelined GEMM operator designed by CUTLASS:

  1. The first row of the figure shows how PredicatedTileIterator and SmemTileIterator cooperate to transfer data from Global Memory to Shared Memory.
  2. The second row shows how the WarpTileIterator is responsible for moving data from Shared Memory into Fragment registers.
  3. The third row shows the WarpMmaOperator performing the Matrix-Multiply-Add operation on the matrix data held in Fragment registers.

Implicit GEMM algorithm

Convolution maps to matrix multiplication

Let's first look at the definition of the forward convolution operator. Suppose the input feature map is x, the weight of the convolution layer is w, and the output is y, where x, y, and w are all 4-dimensional Tensors. The four dimensions of x are NxICxIHxIW, the four dimensions of w are OCxICxFHxFW, and the four dimensions of y are NxOCxOHxOW. The mathematical relationship between the output y and the input x and weight w can then be written as


$$\text{y}(\text{n}, \text{oc}, \text{oh}, \text{ow}) = \sum_{\text{ic}} \sum_{\text{fh}} \sum_{\text{fw}} \text{x}(\text{n}, \text{ic}, \text{ih}, \text{iw}) \cdot \text{w}(\text{oc}, \text{ic}, \text{fh}, \text{fw})$$

The lowercase letters in the formula denote the coordinates of the Tensors in each dimension, where the relationship between ih, iw and oh, ow, fh, fw can be written as

ih = oh * stride_h - pad_h + fh
iw = ow * stride_w - pad_w + fw

Here stride_h, stride_w, pad_h, and pad_w are parameters of the convolution layer. According to the principle of the im2col algorithm, the convolution defined by the formula above can be transformed into a matrix multiplication, i.e.

C = Matmul(A, B)

Among them

  • Matrix A is transformed from the weight; it is a matrix of size $\text{OC} \times \text{IC} \cdot \text{FH} \cdot \text{FW}$.
  • Matrix B is transformed from the feature map; it is a matrix of size $\text{IC} \cdot \text{FH} \cdot \text{FW} \times \text{N} \cdot \text{OH} \cdot \text{OW}$.
  • Matrix C represents the output Tensor y; it is a matrix of size $\text{OC} \times \text{N} \cdot \text{OH} \cdot \text{OW}$.
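As a concrete example, suppose N = 16, IC = OC = 64, FH = FW = 3, and OH = OW = 56 (these numbers are for illustration only). Then A is a 64 x 576 matrix, B is a 576 x 50176 matrix, and C is a 64 x 50176 matrix; in other words GEMM_M = 64, GEMM_N = 16 * 56 * 56 = 50176, and GEMM_K = 64 * 3 * 3 = 576.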

The correspondence between the elements of the matrices and the elements of the Tensors at each position is as follows.

The relationship between the matrix subscripts i, j, k and the Tensor coordinates is

i = oc
j = n * OH * OW + oh * OW + ow
k = ic * FH * FW + fh * FW + fw

When j is known, the following relations can be used to compute the coordinates in the feature map

n = j / (OH * OW)
j_res = j % (OH * OW)
oh = j_res / OW
ow = j_res % OW

When k is known, the coordinates in the weight can be derived

ic = k / (FH * FW)
k_res = k % (FH * FW)
fh = k_res / FW
fw = k_res % FW

Combining oh, ow, fh, and fw, we can then compute ih and iw. Based on the discussion above, the convolution can be written in the Implicit GEMM form:

GEMM_M = OC
GEMM_N = N * OH * OW
GEMM_K = IC * FH * FW
for i in range(GEMM_M):
    oc = i
    for j in range(GEMM_N):
        accumulator = 0
        n = j // (OH * OW)
        j_res = j % (OH * OW)
        oh = j_res // OW
        ow = j_res % OW
        for k in range(GEMM_K):
            ic = k // (FH * FW)
            k_res = k % (FH * FW)
            fh = k_res // FW
            fw = k_res % FW
            ih = oh * stride_h - pad_h + fh
            iw = ow * stride_w - pad_w + fw
            # out-of-range (ih, iw) fall in the zero padding and contribute 0
            accumulator = accumulator + x(n, ic, ih, iw) * w(oc, ic, fh, fw)
        y(n, oc, oh, ow) = accumulator

The Implicit GEMM algorithm above is still serial; next we transform it into a parallel algorithm on CUDA. First, we partition the whole computation into blocks and let each thread block be responsible for computing a TILE_M x TILE_N sub-matrix of the output. The algorithm then takes the following form:

for i_out in range(GEMM_M // TILE_M):
    for j_out in range(GEMM_N // TILE_N):
        ThreadblockConvolution(x, w, y, i_out, j_out)

def ThreadblockConvolution(x, w, y, i_out, j_out):
    accumulator[TILE_M, TILE_N] = 0
    for i_in in range(TILE_M):
        oc = i_out * TILE_M + i_in
        for j_in in range(TILE_N):
            j = j_out * TILE_N + j_in
            n = j // (OH * OW)
            j_res = j % (OH * OW)
            oh = j_res // OW
            ow = j_res % OW
            for k in range(GEMM_K):
                ic = k // (FH * FW)
                k_res = k % (FH * FW)
                fh = k_res // FW
                fw = k_res % FW
                ih = oh * stride_h - pad_h + fh
                iw = ow * stride_w - pad_w + fw
                accumulator(i_in, j_in) = accumulator(i_in, j_in) \
                                        + x(n, ic, ih, iw) * w(oc, ic, fh, fw)
            y(n, oc, oh, ow) = accumulator(i_in, j_in)

To improve memory access efficiency, we can further tile along the GEMM_K dimension, caching a TILE_M x TILE_K tile of matrix A and a TILE_K x TILE_N tile of matrix B in Shared Memory each time to avoid repeated Global Memory accesses. The algorithm then takes the following form:

for i_out in range(GEMM_M // TILE_M):
    for j_out in range(GEMM_N // TILE_N):
        ThreadblockConvolution(x, w, y, i_out, j_out)

def ThreadblockConvolution(x, w, y, i_out, j_out):
    accumulator[TILE_M, TILE_N] = 0
    smem_A[TILE_M, TILE_K] = 0
    smem_B[TILE_K, TILE_N] = 0
    for k_out in range(GEMM_K // TILE_K):
        load_tile_to_smem(w, smem_A)  # weight tile -> tile of matrix A
        load_tile_to_smem(x, smem_B)  # feature map tile -> tile of matrix B
        WarpGemm(smem_A, smem_B, accumulator)
    for i_in in range(TILE_M):
        oc = i_out * TILE_M + i_in
        for j_in in range(TILE_N):
            j = j_out * TILE_N + j_in
            n = j // (OH * OW)
            j_res = j % (OH * OW)
            oh = j_res // OW
            ow = j_res % OW
            y(n, oc, oh, ow) = accumulator(i_in, j_in)

def WarpGemm(smem_A, smem_B, accumulator):
    for i_in in range(TILE_M):
        for j_in in range(TILE_N):
            for k_in in range(TILE_K):
                accumulator(i_in, j_in) = accumulator(i_in, j_in) \
                                        + smem_A(i_in, k_in) * smem_B(k_in, j_in)

Because we can directly reuse the high-performance WarpMmaOperator already implemented in CUTLASS, implementing a convolution operator based on Implicit GEMM only requires the following:

  1. Adapting DeviceConvolution, KernelConvolution, and ThreadblockConvolution to support Tensor and Convolution Layer parameters.
  2. Adding a PredicatedTileIterator that reads a Tile of the Tensor into Shared Memory and implicitly organizes the data it reads into a matrix.
  3. In the main loop of the algorithm, having the WarpTileIterator read data from Shared Memory and the WarpGemmOperator perform the warp-level GEMM.
  4. Adapting the EpilogueOperator of the convolution operator so that the accumulator data is written back to the Tensor in Global Memory.

Next we introduce MegEngine's underlying convolution implementation, taking the INT8 TensorCore convolution operator as an example. This article focuses on how items 2, 3, and 4 are implemented. Refer to the previous article for how to use the convolution operators that have already been written.

Global Memory Data Layout

To maximize the throughput of the TensorCore convolution operator, MegEngine uses 128-bit Global Memory access instructions, which requires 128-bit address alignment when accessing a Tensor's data. MegEngine therefore uses the NCHW32 format to store Tensors. The NCHW32 format has the following characteristics:

  • The Tensor's channel dimension is split into groups of 32 channels, and each group of 32 channels is stored contiguously.
  • The remaining dimensions of the Tensor are stored in the order W, H, C, N from fastest-varying to slowest-varying.

Since a 32-channel-aligned storage format is used, the convolution layer requires the number of channels of the input and output feature maps to be a multiple of 32.
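The following helper (our own illustration, not MegEngine's code) shows the element offset implied by this NCHW32 description: the 32 channels of a group are contiguous, and the remaining dimensions vary in the order W, H, C/32, N from fastest to slowest.

#include <cstddef>

// Offset (in elements) of x(n, c, h, w) in an NCHW32 tensor, C a multiple of 32.
inline size_t nchw32_offset(int n, int c, int h, int w, int C, int H, int W) {
    int c_outer = c / 32;  // which group of 32 channels
    int c_inner = c % 32;  // position inside the group (fastest-varying)
    return ((((size_t)n * (C / 32) + c_outer) * H + h) * W + w) * 32 + c_inner;
}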

Preprocessing access offset

On the GEMM_K dimension, MegEngine's convolution implementation iterates in the order $(\text{IC}/32)\cdot\text{FH}\cdot\text{FW}\cdot 32$, which can be written as the following pseudocode:

kInterleaved = 32
for ic_out in range(IC//kInterleaved):
    for fh in range(FH):
        for fw in range(FW):
            for ic_in in range(kInterleaved):
                # do mma

If written as a single loop, it would look like this:

kInterleaved = 32
for k in range(GEMM_K):
    chw = k // kInterleaved
    ic_in = k % kInterleaved
    ic_out = chw // (FH * FW)
    chw_res = chw % (FH * FW)
    fh = chw_res // FW
    fw = chw_res % FW
    offset = ic_out * C_STRIDE + fh * H_STRIDE + fw * W_STRIDE  # pointer = base + offset
    # do mma

As you can see, if the pointer offset is computed directly, a lot of division and remainder operations are introduced in each iteration. On the CUDA platform, however, integer division and remainder are very expensive, so we precompute these address offsets on the host, store them in the kernel param buffer, and read them from Constant Memory when needed, avoiding the division and remainder operations. For each thread, the offset by which the pointer moves in the main loop looks like this:

If the address increment at each step is denoted delta, then delta is periodic with period FH*FW, i.e.:

delta(step, TILE_K) = delta(step + (FH * FW), TILE_K)

So we only need about $O(\text{FH}\cdot\text{FW})$ storage space. The address-offset computation logic can be found in conv2d_tile_iterator_nt_src_fprop_precomp.h. Since the kernel param buffer is 4KB and we use about 3KB of it to store the address increments, MegEngine's convolution implementation requires that the FH*FW of the Convolution Layer not be too large; in practice, 3x3, 5x5, and 7x7 convolutions can all be handled. The iteration order of Nvidia's official implementation differs slightly from the one described in this article:
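A host-side sketch of this precomputation might look as follows; the function and stride names are our own, but it shows why only about FH*FW increments need to be stored: the k -> (ic_out, fh, fw) decomposition makes the increment sequence periodic.

#include <cstdint>
#include <vector>

// Precompute the address increment applied at each step of the main loop.
// Because offset(step + FH*FW) = offset(step) + C_STRIDE, the increments repeat
// with period FH*FW, so only FH*FW entries go into the kernel param buffer.
std::vector<int64_t> precompute_deltas(int FH, int FW, int64_t C_STRIDE,
                                       int64_t H_STRIDE, int64_t W_STRIDE) {
    auto offset_of = [&](int step) -> int64_t {
        int ic_out = step / (FH * FW);
        int res = step % (FH * FW);
        int fh = res / FW, fw = res % FW;
        return ic_out * C_STRIDE + fh * H_STRIDE + fw * W_STRIDE;
    };
    std::vector<int64_t> deltas(FH * FW);
    for (int step = 0; step < FH * FW; ++step)
        deltas[step] = offset_of(step + 1) - offset_of(step);  // per-step increment
    return deltas;
}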

  • The official implementation needs to pad IC up to a multiple of TILE_K, so some computation is wasted when the number of channels is small.
  • In the official implementation, a thread block spans a large address range when accessing the input feature map, which reduces memory-access locality and is not cache friendly.

Therefore, the MegEngine implementation has the advantage in terms of performance, while the official implementation places fewer restrictions on the Convolution Layer parameters and generalises better.

Warp-level MMA (Matrix Multiply-Add) instructions

CUDA 10.2 introduces new warp-level mma and ldmatrix instructions. With the mma instruction, users can drive TensorCore to perform high-speed matrix multiply-add operations, and with ldmatrix they can precisely control how each Warp feeds data to TensorCore. The mma instruction is used as follows:

unsigned A, B;  // input matrix fragment data
int C[2], D[2]; // accumulators
asm volatile(
    "mma.sync.aligned.m8n8k16.row.col.satfinite.s32.s8.s8.s32 {%0,%1}, {%2}, {%3}, {%4,%5};\n"
    : "=r"(D[0]), "=r"(D[1])
    : "r"(A), "r"(B), "r"(C[0]), "r"(C[1]));

The semantics of this instruction are that the 32 threads of a single Warp synchronously perform an 8x8x16 matrix multiply-add. It has three input operands: an 8x16 matrix A and a 16x8 matrix B participate in the matrix multiplication, and the data of both input matrices is distributed across the 32 threads of the Warp. The layout of matrix A is shown in the figure below:

  • The 32 threads of the Warp are divided into eight groups of four, and each group is responsible for reading one row of the 8x16 matrix.
  • Each thread in a group reads four adjacent int8 values of that row, which exactly fills one 32-bit register.

The layout of matrix B is similar, as shown in the figure below:

  • The 32 threads are again divided into eight groups of four, and each group is responsible for reading one column of the 16x8 matrix.
  • Each thread in a group reads four adjacent int8 values of that column.

The data of matrix C and of the output matrix D involved in the accumulation is also distributed across the 32 threads, and their layout is shown in the figure below:

  • Again, each group of four threads is responsible for reading/writing one row of data.
  • Each thread is responsible for two adjacent int32 values of that row, which together make up 64 bits of data. (A sketch of the full lane-to-element mapping follows this list.)
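The lane-to-element mapping implied by the three layouts above can be summarized in code. This helper is our own (derived from the description here, not taken from CUTLASS) and assumes the m8n8k16 int8 mma shown earlier:

// For lane L of the warp: group = L / 4, pos = L % 4.
struct LaneMapping {
    int a_row, a_col;  // lane holds A[a_row][a_col .. a_col+3]  (4 x int8)
    int b_row, b_col;  // lane holds B[b_row .. b_row+3][b_col]  (4 x int8)
    int c_row, c_col;  // lane holds C[c_row][c_col .. c_col+1]  (2 x int32)
};

__device__ inline LaneMapping mma_lane_mapping(int lane) {
    int group = lane / 4;  // 8 groups of 4 threads
    int pos   = lane % 4;  // position within the group
    return LaneMapping{group, pos * 4,   // one row of A, 4 adjacent int8
                       pos * 4, group,   // one column of B, 4 adjacent int8
                       group, pos * 2};  // one row of C/D, 2 adjacent int32
}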

Analyzing the mma instruction, we can see that if the data in Global Memory / Shared Memory is stored in a RowMajor or ColumnMajor format, then when the same Warp performs two spatially consecutive 8x8x16 matrix multiply-adds, the data read by each thread is strided, and each load brings only 32 bits into registers; such narrow load instructions generally cannot saturate the memory bandwidth. Therefore, Nvidia provides the ldmatrix instruction, which lets the same Warp read four 8x16 matrices into registers at once, so that each thread in the Warp reads 128 bits of data in a single load, maximizing bandwidth utilization. ldmatrix is used as follows:

unsigned addr;  // shared memory pointer
int x, y, z, w; // loaded data
int4 data;      // loaded fragment
asm volatile("ldmatrix.sync.aligned.x4.m8n8.shared.b16 {%0, %1, %2, %3}, [%4];"
    : "=r"(x), "=r"(y), "=r"(z), "=r"(w)
    : "r"(addr));
data = make_int4(x, y, z, w);

The instruction above reads exactly four 8x16 matrices, with each thread responsible for reading one row of matrix data. After the load, the data is exchanged between threads so that the matrix data is redistributed to the appropriate threads. The process is illustrated in the figure below:

This section introduced the TensorCore-related mma and ldmatrix instructions. To make good use of these two high-performance instructions, we also need to design a clever Shared Memory layout for the data that eliminates bank conflicts when reading from Shared Memory, thereby improving Shared Memory read efficiency.

Data layout in Shared Memory

Before introducing the data layout in Shared Memory, we need to understand the memory-access characteristics of Shared Memory. Shared Memory is divided into 32 banks of 4 bytes each; when threads in the same Warp access different addresses within the same bank, a bank conflict occurs and memory-access efficiency drops. When threads in the same Warp access data of different widths, the behavior differs (a small helper that maps an address to its bank is sketched after this list):

  • If each thread accesses 32 bits of Shared Memory data, the access is completed in one phase.
  • If each thread accesses 64 bits of Shared Memory data, the access is completed in two phases:
    • Phase 1: the first 16 threads access 128 bytes of data.
    • Phase 2: the last 16 threads access 128 bytes of data.
  • If each thread accesses 128 bits of Shared Memory data, the access is completed in four phases:
    • In each phase, eight threads access 128 bytes of data.
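The helper below is our own illustration of the 4-byte / 32-bank rule; within any one access phase, two threads conflict only if they touch different 4-byte words that map to the same bank.

// Bank index of a Shared Memory byte offset: 32 banks, 4 bytes per bank.
__host__ __device__ inline int smem_bank(unsigned byte_offset) {
    return (byte_offset / 4) % 32;
}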

If no bank conflict occurs in any phase of the accesses above, Shared Memory access efficiency is maximized. The usual way to avoid bank conflicts is to pad the Shared Memory layout so that the data accessed by different threads is staggered across banks. The problem is that padding increases the amount of Shared Memory the kernel needs, and since the L1 cache (Shared Memory) on each SM is limited, padding reduces the kernel's occupancy, which in turn degrades kernel performance. CUTLASS therefore designed an interleaved Shared Memory layout that lets threads access Shared Memory without bank conflicts and without any padding. (A sketch of the padding approach, for contrast, follows this paragraph.)

Next, let's take a 64x64 matrix as an example to describe the data layout in Shared Memory. Since threads read data at a granularity of 128 bits, i.e. 16 int8 values, we always show the layout in groups of 16 elements. If the matrix were organized in RowMajor format, the logical layout would look like this:
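The sketch below illustrates only the padding workaround; the tile size and pad width are illustrative, not tuned, and it is not how CUTLASS lays out data. Without padding, the row stride is 64 bytes = 16 words, so the rows of one column map to just two banks (0 and 16) and a column-wise warp access is heavily conflicted; padding each row by 4 bytes makes the stride 17 words, which is coprime with 32, so consecutive rows of a column land in different banks, at the price of extra Shared Memory per block.

#include <cstdint>
#include <cuda_runtime.h>

__global__ void padded_smem_example() {
    __shared__ int8_t tile[64][64 + 4];  // 4 padding bytes per row
    int lane = threadIdx.x & 31;
    tile[lane][0] = 0;  // column access: conflict-free thanks to the padding
}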

As you can see from the picture

  • Each group of 16 elements is called a Vector; different Vectors are shown in different colors.
  • The 32 adjacent elements in each row are called a Crosswise, which is exactly the data of one group of 32 channels in the NCHW32 format.

In the physical storage of Shared Memory, the data of the matrix is rearranged, as shown in the figure below:

We can see that the physical layout of Shared Memory has the following characteristics:

  • The Crosswise blocks of every four rows are stored together as one group in Shared Memory, followed by the Crosswise blocks of the next four rows.
  • Each group contains 8 Vectors and occupies 128 bytes, which map exactly onto the 32 different banks of Shared Memory.
  • The data within each group is staggered (interleaved), which guarantees that no bank conflict occurs when ldmatrix reads it.

Video Memory -> Shared Memory data transfer

In this section we look at how data is moved from Global Memory to Shared Memory. The Global Memory to Shared Memory transfer is done by Conv2dTileSrcIteratorFpropPrecomp. This article does not interpret the code in detail, but instead describes how the threads move the data, to help readers build an intuitive picture and understand the code more easily. Taking the logical Shared Memory layout from the previous section as an example, the logical layout of the data read by the threads of one Warp is shown in the figure below; each thread reads 16 int8 values, which form one Vector.

In actual physical video memory, the distribution of data accessed by threads is shown below:

  • We can see that each thread reads 128 bits of data.
  • The data read by adjacent threads is physically contiguous.

Therefore, the pattern in which threads read data from Global Memory satisfies the requirements of coalesced memory access, and at the same time uses the widest available access width, maximizing video memory bandwidth utilization. Next, let's look at how the data read by the threads maps to physical addresses in Shared Memory:

  • Every eight threads write 128 bytes of data to the Shared Memory, which falls into exactly 32 different banks of the Shared Memory.
  • The access by the whole Warp is completed in four phases, with no bank conflict in any phase.

Here is an example of the Warp writing to Shared Memory:
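Independently of the figure, the following generic CUDA sketch (not the actual Conv2dTileSrcIteratorFpropPrecomp code) shows the access pattern just described: each thread performs one 128-bit load from Global Memory and one 128-bit store to Shared Memory, with adjacent lanes touching adjacent 16-byte chunks.

#include <cstdint>
#include <cuda_runtime.h>

__global__ void warp_tile_copy(const int8_t* __restrict__ gmem_tile) {
    __shared__ int8_t smem_tile[64 * 64];  // one 64x64 int8 tile
    int lane = threadIdx.x & 31;
    // Coalesced 128-bit load: adjacent lanes read adjacent 16-byte Vectors
    // (gmem_tile is assumed 16-byte aligned).
    int4 v = reinterpret_cast<const int4*>(gmem_tile)[lane];
    // 128-bit store: each phase of 8 lanes writes 128 bytes, covering all 32 banks.
    // (The real iterator applies the interleaved layout here; we store linearly.)
    reinterpret_cast<int4*>(smem_tile)[lane] = v;
}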

Shared Memory -> Register data transfer

The Shared Memory to register transfer is done by MmaTensorOpMultiplicandTileIterator. In each iteration, the same Warp reads four 8x16 matrices into registers, with each thread reading one row of data. For example, in the first iteration, the logical layout of the data read by the threads is as follows:

The actual physical layout of data in Shared Memory is as follows:

You can see:

  • Each thread reads 128 bits of data, so the access is carried out in four phases.
  • In each phase, the data read by the eight active threads falls into exactly 32 different banks of Shared Memory, so there is no bank conflict between them.

By the second iteration, the physical layout of the data accessed by each thread is as follows:

Again, no bank conflict occurs in any phase of this access.

Writing the Accumulator back to Global Memory

In the INT8 case, the same Warp is responsible for outputting a 64x64 block of results, and the kernel writes back to Global Memory eight times, each time writing back a 32x8 matrix. This ensures that for a Tensor stored in NCHW32 format, each write-back has the 32 threads of the Warp writing exactly 256 physically contiguous bytes, with each thread writing 8 bytes, so the stores can use a 64-bit-wide data type and maximize bandwidth utilization. Due to the nature of the mma instruction, the data of the output matrix is scattered across the threads. To make the accesses coalesced, i.e. to make the addresses written by adjacent threads contiguous, we use Shared Memory to exchange data among the 32 threads of the Warp. After the exchange, each thread holds the data of 8 consecutive channels, and the addresses written by the threads are contiguous, so the write-back to Global Memory satisfies the coalescing requirements. The data-exchange process between threads is shown in the figure below:

In each iteration, the 32 threads of the Warp write a 32x16 block of data into Shared Memory. Then, as shown in the figure below, each thread reads the data of 8 consecutive channels back into its registers.

The data exchange in Shared Memory is accomplished by the following two iterators:

  • InterleavedTileIteratorTensorOp writes a 32x8 block of data into Shared Memory in each iteration.
  • InterleavedSharedLoadIteratorTensorOp is responsible for reading the data of 8 consecutive channels into the Fragment registers.

After a thread has read the exchanged data into its Fragment registers, the EpilogueOp performs operations such as BiasAdd on top of the convolution result. Taking BiasAddLinearCombinationRelu as an example, it actually performs the following computation:

accumulator = conv(x, w)
y = alpha * accumulator + beta * bias + gamma * z
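Expanded element-wise, the computation looks like the sketch below (the function name, plain NCHW indexing, and data types are our own assumptions; it only illustrates that bias is broadcast along the output-channel dimension and that z is added element-wise):

#include <cstdint>

void epilogue_reference(const int32_t* acc, const int32_t* bias, const int32_t* z,
                        float* y, float alpha, float beta, float gamma,
                        int N, int OC, int OH, int OW) {
    for (int n = 0; n < N; ++n)
        for (int oc = 0; oc < OC; ++oc)
            for (int oh = 0; oh < OH; ++oh)
                for (int ow = 0; ow < OW; ++ow) {
                    int64_t i = (((int64_t)n * OC + oc) * OH + oh) * OW + ow;
                    y[i] = alpha * acc[i] + beta * bias[oc]  // per-channel bias
                         + gamma * z[i];                     // fused ElemwiseAdd
                }
}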

Here bias is a per-channel Tensor representing the bias of each output channel, and z is a Tensor with the same shape as the convolution output, used to fuse Convolution with ElemwiseAdd. Finally, the output of the EpilogueOp is actually written back to the Tensor in Global Memory by TensorPredicatedTileIteratorTensorOp. The pattern in which each thread writes back its data is shown in the figure below:

As you can see, the pattern in which the threads write back satisfies the coalesced-access requirements and therefore maximizes Global Memory write efficiency.
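A minimal sketch of this final write-back pattern (our own kernel, with simplified indexing) is shown below: after the Shared Memory exchange, each lane holds the int8 results of 8 consecutive channels and issues one 64-bit store, so the 32 lanes of the warp write 256 physically contiguous bytes.

#include <cstdint>
#include <cuda_runtime.h>

__global__ void writeback_64bit(const int8_t* __restrict__ exchanged,
                                int8_t* __restrict__ gmem_out) {
    int lane = threadIdx.x & 31;
    // 8 consecutive channels = 8 int8 = one 64-bit (int2) value per lane.
    int2 v = reinterpret_cast<const int2*>(exchanged)[lane];
    // Adjacent lanes write adjacent 8-byte chunks: 256 contiguous bytes per warp.
    reinterpret_cast<int2*>(gmem_out)[lane] = v;
}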

Conclusion

In this article, we introduced the basic principles of MegEngine's convolution operator. The operator's performance can reach more than 80% of cuDNN's; the benchmark results can be found in the previous article.

MegEngine will continue to optimize its convolution implementation to further improve operator performance. So far, two optimizations are planned:

  • Borrow the mask handling from Nvidia's official CUTLASS Implicit GEMM Convolution implementation to improve the efficiency of predicate (mask) evaluation in the TileIterator.
  • The current convolution implementation still has bank conflicts when it uses Shared Memory to exchange data during write-back to video memory. Two further optimizations will be considered:
    • Explore the Shared Memory data layout to eliminate the bank conflicts and optimize the efficiency of the Shared Memory data exchange.
    • Explore the layout of the weight Tensor in Global Memory to improve the locality of the accumulator on each thread and avoid the data exchange through Shared Memory altogether.

References

  • Warp-level Matrix Fragment Mma PTX document
  • CUTLASS Implicit GEMM Convolution
  • Volta architecture and performance optimization
  • Developing CUDA kernels to push Tensor Cores to the absolute limit on Nvidia A100