Author: Ma | MegEngine architect, Megvii

Preface

Single-precision general matrix multiplication (SGEMM) is a case that almost no CUDA learner can get around. This classic compute-intensive workload demonstrates the optimization techniques commonly used in GPU programming, and whether one can write an efficient SGEMM Kernel is also a good test of a CUDA programmer's understanding of the GPU architecture. This article introduces CUDA SGEMM optimization techniques for readers who have read the CUDA C++ Programming Guide and are familiar with CUDA programming.

CUDA matrix multiplication optimization techniques in detail

Naive implementation and analysis: what exactly is so bad about it?

I have interviewed many college students with experience in CUDA programming. When asked to write an SGEMM Kernel in CUDA, I usually get an answer like the following:

__global__ void matrixMul(const float *A, const float *B, float *C,
                          int M, int N, int K) {
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    if (ty < M && tx < N) {
        float c = 0;
        for (int i = 0; i < K; ++i) {
            c += A[ty * K + i] * B[i * N + tx];
        }
        C[ty * N + tx] = c;
    }
}

Such a naive Kernel is certainly not what the author expects, because its performance can be expected to reach less than 1/10 of cuBLAS, which clearly does not meet our need for high performance. So how bad is this naive implementation exactly?

Looking at the code, we can see that before each FMA (fused multiply-add) the thread has to read one element of A and one element of B. As we know, reading Global Memory is very expensive, usually taking hundreds of cycles, while an FMA takes only a few cycles, so most of the time is spent on memory access. A sharp student might suggest moving tiles of A and B into Shared Memory first to reduce the memory-access overhead. That is indeed a good idea, but it only reduces the access latency from hundreds of cycles to tens of cycles; it does not change the nature of the problem. The key issue is that the main loop consists of two load instructions and one FMA instruction, so compute instructions account for only 1/3 of the total. This compute-to-memory-access ratio is too low to hide the access latency, and the performance is therefore far from ideal.

Let's think about what the hot loop would look like if, with Shared Memory tiling, each thread computed a 4×4 block of results instead of just one:

    float c[4][4] = {{0}};
    float a_reg[4];
    float b_reg[4];
    for (int i = 0; i < K; i += TILE_K) {
        __syncthreads();
        // transfer tile from global mem to shared mem
        load_gmem_tile_to_smem(A, i, smemA);
        load_gmem_tile_to_smem(B, i, smemB);
        __syncthreads();
    #pragma unroll
        for (int j = 0; j < TILE_K; ++j) {
            // load tile from shared mem to register
            load_smem_tile_to_reg(smemA, j, a_reg);
            load_smem_tile_to_reg(smemB, j, b_reg);
            // compute matrix multiply accumulate 4x4
            mma4x4(a_reg, b_reg, c);
        }
    }

According to the analysis, reading from smemA into the register array a_reg takes 4 load operations, and similarly for B, so the compute-to-memory-access instruction ratio of the main loop becomes 16/8. Compared with the previous situation, the proportion of compute instructions is greatly increased. A large enough compute-to-memory-access ratio improves the utilization of the compute units and hides the access latency. We can increase this ratio further so that the Kernel's performance approaches the theoretical peak.
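For concreteness, here is a minimal sketch of what the mma4x4 helper in the pseudocode above might look like (an assumption for illustration, not the exact implementation): it performs 16 FMA operations on the 8 values that the two load_smem_tile_to_reg calls (4 loads each) placed in registers.

    // Hypothetical sketch of mma4x4: 16 FFMAs per inner iteration, fed by the
    // 8 register values loaded with 4 + 4 LDS instructions, which is where the
    // 16/8 compute-to-memory-access ratio above comes from.
    __device__ __forceinline__ void mma4x4(const float a_reg[4],
                                           const float b_reg[4],
                                           float c[4][4]) {
    #pragma unroll
        for (int i = 0; i < 4; ++i)
    #pragma unroll
            for (int j = 0; j < 4; ++j)
                c[i][j] += a_reg[i] * b_reg[j];  // one FFMA per (i, j) pair
    }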

Matrix partitioning and resource allocation

Obviously, we cannot use a single block to compute a very large matrix: that would leave most of the SMs (Streaming Multiprocessors) idle. So we need to partition the matrix into blocks, as shown in the figure below:

Different block sizes have their own advantages and disadvantages for matrix multiplications of different shapes. This article uses a 128×128 block as the example.

As seen in the previous section, increasing the compute-to-memory-access ratio is very beneficial. Can it then be increased indefinitely? The answer is no. To increase it, each thread must compute a larger block of results, which requires more registers, but the number of registers is limited. Taking a Turing-architecture GPU as an example, a single SM has 65536 registers in total, and due to instruction-encoding restrictions a single thread can use at most 255 registers. Moreover, using more registers is not always better. Occupancy is the ratio of the number of active warps to the maximum number of concurrent warps supported by an SM. High Occupancy does not necessarily mean high performance, but switching between warps can hide part of the latency. The number of active warps per SM depends on the resources a block uses: the number of registers per thread and the amount of Shared Memory per block. Occupancy can be computed with the CUDA_Occupancy_Calculator.xls tool shipped with the CUDA Toolkit.
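Besides the spreadsheet, Occupancy for a given launch configuration can also be queried programmatically with the CUDA occupancy API. A minimal sketch (the kernel name sgemm_128x128 and the 256-thread block size are assumptions matching the discussion below; substitute the real kernel):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder for the real SGEMM kernel under discussion.
    __global__ void sgemm_128x128(const float *A, const float *B, float *C,
                                  int M, int N, int K) { /* body elided */ }

    void report_occupancy() {
        int max_blocks_per_sm = 0;
        // Static __shared__ usage is taken from the compiled kernel; pass any
        // dynamically allocated Shared Memory size as the last argument.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &max_blocks_per_sm, sgemm_128x128, /*blockSize=*/256,
            /*dynamicSMemSize=*/0);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int active_warps = max_blocks_per_sm * 256 / 32;
        int max_warps = prop.maxThreadsPerMultiProcessor / 32;
        printf("active warps per SM: %d, occupancy: %.1f%%\n",
               active_warps, 100.f * active_warps / max_warps);
    }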

If each thread computes 128 results, the required block size is 128 threads. A single thread then needs 128 registers just to hold the accumulators, plus the registers needed for the Gmem-to-Smem and Smem-to-Reg stages, for a total of at least 180 or so. Computing Occupancy shows that the number of active warps is only 8 and Occupancy is 25%. If the block size is set to 256 instead, each thread only needs to compute 64 results. Adjusting the register and Shared Memory usage and observing Occupancy shows that limiting each thread to 128 registers and the Shared Memory per block to 32 KB yields 16 active warps, which is a better choice:

With this configuration the compute-to-memory-access ratio is 64/4 (using vector reads), which is enough to hide the access latency.
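Spelling out that 64/4 figure (a sketch of the arithmetic, assuming the per-thread 8×8 result layout used later in this article):

    // Per inner iteration, each thread of the 256-thread block performs
    // 8 x 8 = 64 FFMAs and, with float4 (LDS.128) vector reads, issues only
    // 2 + 2 = 4 Shared Memory load instructions for its A and B fragments.
    constexpr int kThreadTileM = 8, kThreadTileN = 8;
    constexpr int kFfmaPerIter = kThreadTileM * kThreadTileN;          // 64
    constexpr int kLdsPerIter  = kThreadTileM / 4 + kThreadTileN / 4;  // 4
    static_assert(kFfmaPerIter / kLdsPerIter == 16,
                  "compute-to-memory-access ratio 64 / 4 = 16");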

The ultimate memory-access optimization

Generally, after choosing an appropriate block-level resource allocation, using Shared Memory to reduce access latency, and unrolling the loops, an SGEMM Kernel can already reach a good level of performance (about 80% of cuBLAS), but this is not the end of our journey. First, we can use the vector load instruction LDS.128 to optimize Shared Memory access (corresponding to the float4 data type), which greatly reduces the number of memory-access instructions and further improves the compute-to-memory-access ratio. For this, we need to transpose the A tile before storing it into smemA:
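A minimal sketch of what this transposed store could look like (the signature and indexing are assumptions for illustration; the pseudocode below names this helper store_reg_to_smem_tile_transpose): the float4 loaded from the row-major A matrix holds 4 consecutive K elements of one row m, and each of them is scattered to a different k row of the Shared Memory tile so that later LDS.128 reads along the M dimension are contiguous.

    // Hypothetical sketch: smem_a is viewed as TILE_K rows of 128 floats
    // (the transposed A tile). One float4 load of A covers elements
    // (m, k..k+3); the transpose scatters them to smem_a[k..k+3][m].
    __device__ __forceinline__ void store_a_frag_transposed(
            float4 ldg_a, int m, int k, float *smem_a) {
        smem_a[(k + 0) * 128 + m] = ldg_a.x;
        smem_a[(k + 1) * 128 + m] = ldg_a.y;
        smem_a[(k + 2) * 128 + m] = ldg_a.z;
        smem_a[(k + 3) * 128 + m] = ldg_a.w;
    }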

At the same time, our Kernel uses 256 threads to compute a 128×128 block. To allow coalesced access to Shared Memory, we split the 256 threads into two dimensions:

    int tx = threadIdx.x % 16;
    int ty = threadIdx.x / 16;

and have each thread read data from Shared Memory as follows:

Finally, a single thread computes 2×2 groups of 4×4 results, with the layout shown in the figure:
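One plausible way for a thread to gather its fragments under this layout (an assumption based on the description above; the original figure fixes the exact offsets) is to read two float4 values per matrix per k step, 64 floats apart, matching the load_smem_tile_to_reg calls in the pseudocode below:

    // Hypothetical sketch: smem is the (transposed) A tile or the B tile,
    // stored as TILE_K rows of 128 floats (32 float4 per row). Each thread
    // reads two float4 fragments per row k, 64 floats (16 float4) apart.
    __device__ __forceinline__ void load_smem_tile_to_reg(const float4 *smem,
                                                          int k,
                                                          float4 frag[2]) {
        int ty = threadIdx.x / 16;                // use tx = threadIdx.x % 16 for B
        const float4 *row = smem + k * (128 / 4); // 32 float4 per TILE_K row
        frag[0] = row[ty];                        // floats ty*4 .. ty*4+3
        frag[1] = row[ty + 16];                   // floats ty*4+64 .. ty*4+67
    }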

In addition, micro-benchmarks show that on Turing (Tesla T4) the Global Memory access latency is about 300 cycles and the Shared Memory access latency is about 30 cycles. We therefore need to fully apply the idea of Prefetch to hide the latencies of reading Global Memory into intermediate registers, writing tiles from those registers into Shared Memory, and reading tiles out of Shared Memory, so that the compute units do not stall for too long. The resulting pseudocode looks like this:

    #define TILE_K 16
    __shared__ float4 smemA[2][TILE_K * 128 / 4];
    __shared__ float4 smemB[2][TILE_K * 128 / 4];
    float4 c[8][2] = {{make_float4(0.f, 0.f, 0.f, 0.f)}};
    float4 ldg_a_reg[2];
    float4 ldg_b_reg[2];
    float4 a_reg[2][2];
    float4 b_reg[2][2];

    // transfer first tile from global mem to shared mem
    load_gmem_tile_to_reg(A, 0, ldg_a_reg);
    load_gmem_tile_to_reg(B, 0, ldg_b_reg);

    store_reg_to_smem_tile_transpose(ldg_a_reg, 0, smemA[0]);
    store_reg_to_smem_tile(ldg_b_reg, 0, smemB[0]);
    __syncthreads();

    // load first tile from shared mem to register
    load_smem_tile_to_reg(smemA[0], 0, a_reg[0]);
    load_smem_tile_to_reg(smemB[0], 0, b_reg[0]);

    int i = 0;
    int write_stage_idx = 1; // ping-pong switch
    do {
        i += TILE_K;
        // load next tile from global mem
        load_gmem_tile_to_reg(A, i, ldg_a_reg);
        load_gmem_tile_to_reg(B, i, ldg_b_reg);

        int load_stage_idx = write_stage_idx ^ 1;

    #pragma unroll
        for (int j = 0; j < TILE_K - 1; ++j) {
            // load next tile from shared mem to register
            load_smem_tile_to_reg(smemA[load_stage_idx], j + 1, a_reg[(j + 1) % 2]);
            load_smem_tile_to_reg(smemB[load_stage_idx], j + 1, b_reg[(j + 1) % 2]);
            // compute matrix multiply accumulate 8x8
            mma8x8(a_reg[j % 2], b_reg[j % 2], c);
        }

        if (i < K) {
            // store next tile to shared mem
            store_reg_to_smem_tile_transpose(ldg_a_reg, 0, smemA[write_stage_idx]);
            store_reg_to_smem_tile(ldg_b_reg, 0, smemB[write_stage_idx]);
            // with double buffering, only one sync is needed
            __syncthreads();
            // switch buffers
            write_stage_idx ^= 1;
        }

        // load first tile of the next iter from shared mem to register
        load_smem_tile_to_reg(smemA[load_stage_idx ^ 1], 0, a_reg[0]);
        load_smem_tile_to_reg(smemB[load_stage_idx ^ 1], 0, b_reg[0]);
        // compute last tile mma 8x8
        mma8x8(a_reg[1], b_reg[1], c);
    } while (i < K);

    store_c(c, C);

Note: Global Memory can only be read and written with float4 when M, N, and K are multiples of 4. In addition, when writing back, the results within a warp need to be exchanged through Shared Memory so that each warp can write back a contiguous chunk of memory with wide Store instructions.

So far, we have obtained a fully optimized SGEMM Kernel. In addition, the Ampere architecture adds the LDGSTS instruction, which transfers data blocks from Global Memory to Shared Memory without going through intermediate registers, allowing SGEMM performance to be optimized further.
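On Ampere, the Gmem-to-Smem stage can be expressed through the cp.async path using the CUDA 11 pipeline primitives; a minimal sketch (the function and its indexing are hypothetical, shown only to illustrate that the copy bypasses the staging registers):

    #include <cuda_pipeline.h>

    // Hypothetical sketch: copy one float4 of a tile straight from Global
    // Memory into Shared Memory with cp.async (LDGSTS), without going through
    // the ldg_a_reg / ldg_b_reg staging registers used in the pseudocode above.
    __device__ void copy_tile_elem_async(const float4 *gmem_tile,
                                         float4 *smem_tile, int idx) {
        __pipeline_memcpy_async(&smem_tile[idx], &gmem_tile[idx], sizeof(float4));
        __pipeline_commit();       // submit the batch of async copies
        __pipeline_wait_prior(0);  // a real kernel would overlap compute before waiting
    }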

The performance comparison

To prevent cuBLAS from selecting a split-K Kernel, we fix K at 1024 and take M = N = 2048, 4096, 8192, and 16384 as test cases, comparing the SGEMM Kernel above against cuBLAS (test GPU: Tesla T4, core clock locked at 1100 MHz):

As can be seen, the SGEMM Kernel implemented here reaches, on average, 97.5% of cuBLAS performance.

Beyond cuBLAS: tune the Kernel using SASS

At this point some readers will still have a question: we seem to have used every optimization we can think of, so why does the CUDA C Kernel we wrote still trail cuBLAS by a certain margin? The answer is that a large portion of the kernels used by cuBLAS are not CUDA kernels compiled by NVCC, but deeply tuned versions written in SASS, the native assembly language of NVIDIA GPUs. Despite continued advances in the NVCC compiler, especially in CUDA 11, which have narrowed the gap between compiled kernels and hand-written assembly considerably, register Bank conflicts and register Reuse (both concepts are discussed in more detail below) still cannot be completely avoided, so a gap remains. Even a pseudo-assembly language like PTX offers no precise control over register allocation and faces the same dilemma as CUDA C. Therefore, to fully exploit the GPU's performance limits, precise control over instructions and registers is required, and that can only be done in the GPU's native assembly language, SASS. There has been quite a bit of research in this area, such as Citadel's "Dissecting the NVIDIA XXX GPU Architecture via Microbenchmarking" series of papers, which systematically tests, analyzes, and summarizes the underlying architectures; although some conclusions may not be entirely accurate, they are of high reference value overall. This work has also spawned a series of open-source SASS assemblers such as KeplerAs, maxas (the most mature and influential), turingas, and CuAssembler, making it possible to write high-performance Kernels directly in SASS.

Register Bank conflict

We know that Shared Memory has Bank conflicts; register Bank conflicts are a similar concept. An NVIDIA GPU has an independent Register File on each SM, and the Register File is divided into several banks. Taking Maxwell as an example, if two or more of the source registers required by an instruction come from the same bank, a conflict occurs and the instruction effectively has to be reissued, wasting a cycle. The Register File of Maxwell/Pascal has 4 banks, and a register with id r belongs to bank r % 4 (for example, R0 belongs to Bank 0 and R5 belongs to Bank 1), so an instruction such as FFMA R1, R0, R4, R1 produces a register Bank conflict because R0 and R4 are both in Bank 0. The Turing architecture improves on this: its Register File is divided into 2 banks, each with 2 ports, so a Bank conflict arises only when the ids of the three source registers are all odd or all even, which greatly alleviates register Bank conflicts.
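To make the Maxwell/Pascal mapping concrete, here is a tiny compile-time illustration (host-side C++ only, just the id % 4 rule described above):

    // Maxwell/Pascal: a register with id r lives in bank r % 4.
    constexpr int reg_bank(int reg_id) { return reg_id % 4; }
    // FFMA R1, R0, R4, R1: sources R0 and R4 both map to Bank 0 -> conflict.
    static_assert(reg_bank(0) == reg_bank(4), "R0 and R4 share Bank 0");
    // FFMA R1, R0, R5, R1: R0 (Bank 0) and R5 (Bank 1) -> no conflict.
    static_assert(reg_bank(0) != reg_bank(5), "R0 and R5 are in different banks");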

To alleviate register Bank conflicts, the Maxwell SGEMM SASS Kernel in maxas carefully arranges the registers that participate in the FFMA computation (see the maxas SGEMM documentation), as shown in the figure below:

With this clever arrangement of the C registers, register Bank conflicts are greatly reduced, but they still cannot be completely avoided (for example, in the part marked by the black box in the figure above, the registers used for A/B still produce Bank conflicts). Register Reuse is needed to eliminate this remaining part.

Register Reuse

Register Reuse is a mechanism NVIDIA introduced in Maxwell to alleviate register Bank conflicts: a Reuse Cache is added to the operand Collector unit (the unit that reads operands). The Reuse Cache is read-only, and whether an instruction fetches an operand through it is specified by the instruction's control code. Disassembling kernels with cuobjdump reveals instructions carrying a .reuse flag, which indicates that the register is fetched from the Reuse Cache rather than from the Register File, thereby avoiding register Bank conflicts:

# Maxwell GPU
FFMA R2, R64.reuse, R73, R2;   # R64 is written into the Reuse Cache
FFMA R3, R64.reuse, R72, R3;   # R64 is fetched from the Reuse Cache, avoiding a conflict with R72

However, the .reuse flag must be used under certain conditions (a register cannot be flagged .reuse if it will be overwritten before the next use); using it carelessly can pick up a stale value and cause incorrect results. .reuse is better thought of as a flag that keeps the register's value held in the Reuse Cache. When NVCC compiles a CUDA Kernel, register allocation and instruction scheduling prevent it from making full use of the Reuse Cache to avoid Bank conflicts. Counting the .reuse flags on all FFMA instructions in the main loop shows a Reuse Cache utilization of only about 20%, whereas the maxas SASS Kernel achieves a Reuse rate of up to 49% through careful design.

In the end, an SGEMM Kernel fine-tuned with SASS can indeed surpass cuBLAS; interested readers can build the SGEMM Kernel in maxas themselves and test it on a Maxwell or Pascal GPU. However, although SASS can fully exploit GPU performance, three problems remain: 1. third-party NVIDIA GPU assemblers rely on reverse engineering of the GPU architecture and may contain unknown bugs, because not every low-level detail of the hardware has been uncovered; 2. assembly Kernels are hard to develop and even harder to debug; 3. the ISA (instruction set) differs between NVIDIA GPU generations, so the corresponding assemblers and assembly Kernels have to be developed again for each generation. Because of these problems, writing Kernels in SASS is time-consuming and laborious, and it is not recommended unless you are chasing the last bit of performance.

An extension of GEMM: Optimizing convolution operations

As is well known, an optimized convolution can be realized by mapping the convolution to a matrix multiplication via im2col. For the SGEMM Kernel above, we only need to slightly modify the stage that moves data from Global Memory to Shared Memory, replacing the address mapping with the im2col mapping, and the SGEMM Kernel becomes a Kernel that computes Conv; this is the Implicit GEMM algorithm used by cuDNN for convolution. If the pointer offsets are computed directly during im2col, a large number of integer divisions and modulo operations are introduced, which is not a small cost. Instead, the address offsets can be precomputed on the host side and passed into the kernel as parameters, to be read from constant memory when needed, avoiding the integer division and modulo and yielding Implicit Precomp GEMM; see our previous article, MegEngine TensorCore Implementation Principles for convolution operators. Guided by this approach, we have implemented efficient INT8/INT4 convolutions based on Cutlass (MegEngine cutlass), which is still being iterated on, and you are welcome to try it out.
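A minimal sketch of the host-side precomputation described above (all names, bounds, and the NCHW layout here are assumptions for illustration): the offsets along the GEMM K dimension are computed once, with the divisions and modulos on the CPU, uploaded to constant memory, and simply looked up inside the kernel.

    #include <cuda_runtime.h>

    // Hypothetical sketch: precompute im2col offsets (in elements, NCHW input)
    // for each position along the GEMM K dimension (k = c * KH * KW + kh * KW + kw),
    // so the kernel avoids integer division / modulo when walking the input.
    constexpr int kMaxFilterVolume = 1024;              // assumed upper bound on C*KH*KW
    __constant__ int kIm2colOffsets[kMaxFilterVolume];  // read inside the conv kernel

    void upload_im2col_offsets(int C, int KH, int KW, int H, int W) {
        int offsets[kMaxFilterVolume];
        for (int k = 0; k < C * KH * KW; ++k) {
            int c  = k / (KH * KW);
            int kh = (k / KW) % KH;
            int kw = k % KW;
            offsets[k] = (c * H + kh) * W + kw;         // offset of (c, kh, kw) in the input
        }
        cudaMemcpyToSymbol(kIm2colOffsets, offsets, sizeof(int) * C * KH * KW);
    }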

Conclusion

This article has described in detail how to write an efficient CUDA SGEMM Kernel, introduced SASS programming as a means of extreme performance tuning, and briefly extended the idea to optimizing convolutions with Implicit GEMM. We hope it offers some inspiration to readers interested in squeezing the most out of hardware performance. Finally, you are welcome to join the MegEngine team to optimize the training and inference performance of deep learning.

Editor's note: after reading this article, are you interested in joining the development of a deep learning framework? The MegEngine team is hiring! We look forward to your joining us. To submit a resume or learn more, add WeChat: Duoduo715495

Framework development engineer (C++)

Job Description:

  1. Responsible for the design, evolution, implementation, maintenance, and optimization of MegEngine, Megvii's core deep learning framework
  2. Optimize MegEngine's performance on various computing platforms (CUDA / Arm / x86, etc.)
  3. Continue to explore advanced methods for improving deep learning frameworks (e.g. graph optimization, automatic code generation, ultra-low-bit quantization, sparsification, etc.)

Skills required:

  1. 1-3 years of experience in performance optimization (x86, CUDA, ARM, OpenCL, etc.)
  2. Solid data structures and algorithms skills, able to use C++ proficiently to implement relatively complex algorithms
  3. Basic knowledge of deep learning and deep learning frameworks (Caffe, TensorFlow, PyTorch, etc.) is preferred
  4. Experience in complex system design is a plus