Contents

  • Preface
  • The GPU architecture
  • GPU processing unit
  • A conceptual GPU
  • GPU threads and SM
  • GPU thread
  • SM
  • Addition
  • Unified memory
  • Matrix multiplication
  • Finally

Preface

Before getting into actual CUDA programming, let's take a look at the architecture of the GPU. Compared with the CPU it is crude but powerful (insert grin here).


The GPU architecture

GPU processing unit

Start with the diagram of a conceptual GPU core. You will find that it differs from a CPU core: it lacks the L3 cache, branch prediction, and so on, but the number of ALUs is increased and the pool of execution context storage is enlarged. The context storage pool is divided into four parts, meaning four instruction streams can be interleaved: when stream 1 blocks, the core immediately switches to stream 2; when stream 2 blocks, it switches to stream 3, and so on, which hides latency. Of course, how much context storage to provide is a careful trade-off; more is not always better. In total, this core contains eight ALUs and four sets of execution contexts, each with eight contexts. Such a core can therefore interleave four instruction streams and keep 32 program fragments in flight concurrently.


A conceptual GPU

Duplicate the processing unit above 16 times and you get a GPU. A real GPU is certainly not that simple, which is why this is only a conceptual GPU.

This conceptual GPU contains 16 processing units, 128 ALUs, 64 execution contexts, and 512 concurrent program fragments. For comparison with real hardware, the GTX 480 has 480 CUDA cores (ALUs) and 177.4 GB/s of memory bandwidth, while the GTX 980 Ti has 2816 CUDA cores and 336.5 GB/s of memory bandwidth. Bandwidth is still the bottleneck: although it is an order of magnitude higher than CPU bandwidth, the bandwidth of the GTX 980 Ti is only about twice that of the much older GTX 480.


GPU threads and SM

Since no machine runs on a GPU alone, real systems are generally heterogeneous: CPU + GPU. One thing to pay attention to is the distinction between Host (the CPU side) and Device (the GPU side).
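To make the Host/Device split concrete, here is a minimal sketch of my own (not from the original article; it assumes a GPU of compute capability 2.0 or later so that device-side printf works). The __global__ function runs on the Device, while main runs on the Host and launches it.

#include <stdio.h>

// Runs on the Device (GPU); __global__ marks a kernel entry point.
__global__ void helloFromDevice() {
    printf("hello from device thread %d\n", threadIdx.x);
}

// Runs on the Host (CPU) and launches work on the Device.
int main() {
    helloFromDevice<<<1, 4>>>();   // 1 block of 4 threads
    cudaDeviceSynchronize();       // wait for the GPU before the program exits
    return 0;
}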


GPU thread

In the CUDA architecture, the smallest unit the graphics chip executes is the thread. Several threads can form a block. Threads within a block can access the same shared memory and can synchronize with each other quickly. Threads in different blocks cannot access the same shared memory, so they cannot communicate or synchronize directly; the degree to which threads in different blocks can cooperate is therefore low, as shown in the figure above.

Corresponding to thread, block, and grid there are different kinds of memory; the core concept is still the thread. (A code sketch after the list shows where each kind appears.)

  • Registers: each processor has a set of local 32-bit registers;
  • Parallel data cache, i.e., shared memory, shared by all scalar processor cores; this is where the shared memory space resides;
  • A read-only constant cache, shared by all scalar processor cores, which speeds up reads from the constant memory space (a read-only region of device memory);
  • A read-only texture cache, shared by all scalar processor cores, which speeds up reads from the texture memory space (a read-only region of device memory); each multiprocessor accesses the texture cache through a texture unit that implements different addressing modes and data filtering.
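As a rough sketch of where these memory spaces show up in code (my own illustration with made-up names; texture memory is omitted): ordinary local variables normally live in registers, __shared__ declares per-block shared memory, and __constant__ declares constant memory that the host fills with cudaMemcpyToSymbol.

// Illustrative kernel only: register, shared, and constant memory in one place.
__constant__ float scale;                  // constant memory, read-only on the device
                                           // (set from the host with cudaMemcpyToSymbol)

__global__ void scaleArray(float *out, const float *in, int n) {
    __shared__ float tile[256];            // shared memory, one copy per block
    int i = threadIdx.x + blockIdx.x * blockDim.x;   // i is typically held in a register

    if (i < n) tile[threadIdx.x] = in[i];  // stage data in shared memory
    __syncthreads();                       // threads in the same block synchronize here
    if (i < n) out[i] = tile[threadIdx.x] * scale;
}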


SM

As shown above, one of the core components of GPU hardware is the SM, short for Streaming Multiprocessor. An SM's core components include CUDA cores (ALUs), shared memory, registers, and so on. An SM can execute hundreds of threads concurrently, depending on how many resources it has. When a kernel is executed, the thread blocks in its grid are distributed to SMs; a given thread block can only be scheduled on one SM, while one SM can generally schedule multiple thread blocks, depending on its capabilities. The thread blocks of one kernel may therefore be spread across multiple SMs, so the grid is only a logical layer, while the SM is the physical layer of execution. Here is the information for my GT 750M graphics card:
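Similar device information can be printed with cudaGetDeviceProperties; here is a rough sketch of mine (the struct fields are standard CUDA runtime fields, the program itself is just an illustration):

#include <stdio.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                     // properties of device 0

    printf("name: %s\n", prop.name);
    printf("SM count: %d\n", prop.multiProcessorCount);
    printf("warp size: %d\n", prop.warpSize);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}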

The SM adopts the SIMT (Single-Instruction, Multiple-Thread) architecture, and its basic execution unit is the warp. A warp contains 32 threads that execute the same instruction at the same time, but each thread has its own instruction address counter and register state, and therefore its own independent execution path.
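Assuming the usual warp size of 32 (it can be confirmed via the warpSize field shown above), the warp and lane of a thread in a one-dimensional block can be computed like this (a small illustration of mine):

__global__ void whoAmI() {
    int warpId = threadIdx.x / 32;   // which warp inside the block
    int laneId = threadIdx.x % 32;   // position of the thread inside its warp
    printf("block %d, warp %d, lane %d\n", blockIdx.x, warpId, laneId);
}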


Addition

Let's try a matrix addition (element-wise over an array of floats) with CUDA:

#include <stdio.h>
#include <math.h>

__global__ void add(float * x, float *y, float * z, int n){
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        int stride = blockDim.x * gridDim.x;
        
        for (int i = index; i < n; i += stride){
                z[i] = x[i] + y[i];
        }
}

int main(){
        int N = 1 << 20;
        int nBytes = N * sizeof (float);
        float *x, *y, *z;
        x = (float*)malloc(nBytes);
        y = (float*)malloc(nBytes);
        z = (float*)malloc(nBytes);

        for (int i = 0; i < N; i++) {
                x[i] = 10.0;
                y[i] = 20.0;
        }

        float *d_x, *d_y, *d_z;
        cudaMalloc((void**)&d_x, nBytes);
        cudaMalloc((void**)&d_y, nBytes);
        cudaMalloc((void**)&d_z, nBytes);

        cudaMemcpy((void*)d_x, (void*)x, nBytes, cudaMemcpyHostToDevice);
        cudaMemcpy((void*)d_y, (void*)y, nBytes, cudaMemcpyHostToDevice);
        
        dim3 blockSize(256);
        // 4096
        dim3 gridSize((N + blockSize.x - 1) / blockSize.x);
        
        add<<<gridSize, blockSize>>>(d_x, d_y, d_z, N);

        cudaMemcpy((void*)z, (void*)d_z, nBytes, cudaMemcpyDeviceToHost);

        float maxError = 0.0;
        for (int i = 0; i < N; i++){
                maxError = fmax(maxError, (float)(fabs(z[i] - 30.0)));
        }
        printf("max error: %.4f\n", maxError);

        cudaFree(d_x);
        cudaFree(d_y);
        cudaFree(d_z);
        free(x);
        free(y);
        free(z);

        return 0;
}

Since I SSH into the Linux host from a Mac, Chinese characters show up garbled, so I didn't write comments in the code. A few quick notes:

  • The program adds 10.0 and 20.0 element by element over 1M floats. We don't print the results directly; the final loop computes each element's error against the expected value 30.0 and outputs the maximum error. If it prints 0, everything is correct.
  • CUDA parts: cudaMalloc((void**)&d_x, nBytes); allocates space on the GPU rather than the CPU. cudaMemcpy((void*)d_x, (void*)x, nBytes, cudaMemcpyHostToDevice); copies the data from the CPU to the GPU; note that the second argument is the source and the last argument is the copy direction. dim3 blockSize(256); requests 256 threads per block, and dim3 gridSize() computes the number of blocks by the same logic. Finally, cudaMemcpy((void*)z, (void*)d_z, nBytes, cudaMemcpyDeviceToHost); copies the result from the GPU back to the CPU; note that the direction argument differs from before. Remember to free the memory you allocated. (A sketch of error checking for these calls follows below.)
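One more note of my own: none of the CUDA calls above check their return values. A minimal error-checking sketch (the macro name is mine) that can be wrapped around runtime calls and kernel launches:

#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: check a CUDA runtime call and abort with a message on failure.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void**)&d_x, nBytes));
//   add<<<gridSize, blockSize>>>(d_x, d_y, d_z, N);
//   CUDA_CHECK(cudaGetLastError());   // checks that the kernel launch succeeded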

Unified memory

Time waits for no one. In newer versions of CUDA, allocating memory has been upgraded: CUDA 6.x introduces Unified Memory. For details, see the link I gave, which is very thorough. Simply put, you allocate once instead of allocating on the CPU, then on the GPU, and then copying back and forth, which is clumsy.

#include <stdio.h>
#include <math.h>

__global__ void add(float *x, float *y, float *z, int n){
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride){
        z[i] = x[i] + y[i];
    }
}

int main()
{
    int N = 1 << 20;
    int nBytes = N * sizeof(float);

    float *x, *y, *z;
    cudaMallocManaged((void**)&x, nBytes);
    cudaMallocManaged((void**)&y, nBytes);
    cudaMallocManaged((void**)&z, nBytes);

    for (int i = 0; i < N; ++i) {
        x[i] = 10.0;
        y[i] = 20.0;
    }

    dim3 blockSize(256);                 // 4096 blocks
    dim3 gridSize((N + blockSize.x - 1) / blockSize.x);

    add<<<gridSize, blockSize>>>(x, y, z, N);
    cudaDeviceSynchronize();

    float maxError = 0.0;
    for (int i = 0; i < N; i++){
        maxError = fmax(maxError, (float)(fabs(z[i] - 30.0)));
    }
    printf("max error: %.4f\n", maxError);
    
    cudaFree(x);
    cudaFree(y);
    cudaFree(z);

    return 0;
}

cudaMallocManaged((void**)&x, nBytes); allocates unified memory that both the host and the device can access directly. cudaDeviceSynchronize(); waits for the kernel to finish so that the results can be read correctly on the host. Please refer to the earlier link for details.
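As an aside of my own (this assumes CUDA 8.0 or later with a Pascal-or-newer GPU, which the GT 750M above is not, so treat it purely as a sketch): managed memory can also be prefetched to reduce page-fault overhead. The snippet reuses the x, y, z and nBytes variables from the example above.

int device = 0;
cudaGetDevice(&device);

// Hint the driver to migrate the managed buffers to the GPU before the kernel runs.
cudaMemPrefetchAsync(x, nBytes, device, 0);
cudaMemPrefetchAsync(y, nBytes, device, 0);

add<<<gridSize, blockSize>>>(x, y, z, N);
cudaDeviceSynchronize();

// Prefetch the result back to the host (cudaCpuDeviceId) before reading it on the CPU.
cudaMemPrefetchAsync(z, nBytes, cudaCpuDeviceId, 0);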


Matrix multiplication

Anyone who has studied linear algebra knows that matrix multiplication is not difficult, just tedious. You may have been lazy and leaned on MATLAB, but now you can consider CUDA (insert grin here). Let's define a Matrix:

#include <stdio.h>
#include <math.h>

struct Matrix
{
    int width;
    int height;
    float *elements;
};

Define the element access functions and the multiplication kernel:

__device__ float getElement(Matrix *A, int row, int col)
{
        return A->elements[row * A->width + col];
}

__device__ void setElement(Matrix *A, int row, int col, float value)
{
        A->elements[row * A->width + col] = value;
}

__global__ void matMulKernel(Matrix *A, Matrix *B, Matrix *C)
{
        float Cvalue = 0.0;
        int row = threadIdx.y + blockIdx.y * blockDim.y;
        int col = threadIdx.x + blockIdx.x * blockDim.x;
        
        for (int i = 0; i < A->width; ++i)
        {
                Cvalue += getElement(A, row, i) * getElement(B, i, col);
        }
        setElement(C, row, col, Cvalue);
}

The main function is not hard to write; the main thing is to understand how the work is scheduled:

int main()
{
    int width = 1 << 10;
    int height = 1 << 10;
    
    Matrix *A, *B, *C;
    
    cudaMallocManaged((void**)&A, sizeof(Matrix));
    cudaMallocManaged((void**)&B, sizeof(Matrix));
    cudaMallocManaged((void**)&C, sizeof(Matrix));
    
    int nBytes = width * height * sizeof(float);
    
    cudaMallocManaged((void**)&A->elements, nBytes);
    cudaMallocManaged((void**)&B->elements, nBytes);
    cudaMallocManaged((void**)&C->elements, nBytes);

    A->height = height;
    A->width = width;
    B->height = height;
    B->width = width;
    C->height = height;
    C->width = width;
    
    for (int i = 0; i < width * height; ++i) {
        A->elements[i] = 1.0;
        B->elements[i] = 2.0;
    }

    dim3 blockSize(32, 32);
    dim3 gridSize((width + blockSize.x - 1) / blockSize.x,
                  (height + blockSize.y - 1) / blockSize.y);

    matMulKernel<<<gridSize, blockSize>>>(A, B, C);
    cudaDeviceSynchronize();

    float maxError = 0.0;
    for (int i = 0; i < width * height; ++i)
        maxError = fmax(maxError, fabs(C->elements[i] - 2 * width));

    printf ("max fault: %f\n", maxError);

    return 0;
}

It's almost the same as addition, just with more matrix bookkeeping. Without further ado ~ the matrix multiplication code is adapted from this article.
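As a closing sketch of my own (not from the referenced article): the shared memory described earlier can be used to tile this multiplication so that each block reloads far fewer values from global memory. It assumes square matrices whose width is a multiple of the tile size, which holds for the 1024 x 1024 case above.

#define TILE 32

__global__ void matMulTiled(const float *A, const float *B, float *C, int width) {
    __shared__ float As[TILE][TILE];   // tile of A cached in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B cached in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < width / TILE; ++t) {
        // Each thread loads one element of the current A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * width + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * width + col];
        __syncthreads();               // tiles are fully loaded before they are used

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // finish using the tiles before the next load
    }
    C[row * width + col] = acc;
}

It would be launched with dim3 blockSize(TILE, TILE) and the same gridSize computation as in the main function above.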


Finally

I've been writing about CUDA for a while now; the next update should cover the parts of kernels you really must understand. If you have any comments or suggestions, see you in the comments section.