The original link

Series of articles:

  • Cuda Programming 101: Basic concepts and programming models for Cuda programming
  • Cuda Programming 102: PERFORMANCE related topics for Cuda programs
  • Cuda programming 103: Cuda multi-card programming
  • Cuda tips: NVCC-code,-arch,-gencodeoptions

The basic idea

Before getting into coding, a more important topic is what types of problems are suitable for GPU solving.

Compared with CPU, GPU has an amazing number of cores, arithmetic units and memory bandwidth. Given a problem, if there is a way to break it down into multiple independent sub-problems to be solved in parallel, the GPU is likely to deliver better performance than the CPU. By “independent”, we mean that the decomposed sub-problems satisfy:

  • Avoid synchronization between subproblems as much as possible
  • Subproblems rely on each other as much as possible using global memory synchronization state
  • Avoid synchronous relationships between subproblems as much as possible

Matrix multiplication is a good example. There is no dependence between the calculation of each element in the matrix multiplication result and it can be parallelized by GPU. Of course, for some problems, it may not be possible to come up with a parallel solution immediately, but there are efficient parallel problem decomposition methods, such as:

(Thinking question)

  • Merge two ordered arrays
  • Find the prefix sum of an array

If the problem at hand can be decomposed smoothly, it is possible to solve it efficiently with the hardware features and programming model provided by GPU.

Programming model

Hardware perspective

  • A GPU consists of multiple Streaming Multiprocessors (SM for short).
  • Each SM contains multiple cores, which are the units that actually perform the computation.
  • As shown in the figure below, there are 28 SM on a 1080TI, and 128 cores on each SM, totaling 3584 CUDA cores.

Programming point of view

  • A programmer writes a function that is executed by multiple threads in parallel on the GPU and calls it from the CPU code. Such a function is called a kernel.
  • Multiple GPU threads form a Thread block
  • For a kernel function, the programmer specifies how many thread blocks to start and how many threads are in each block
  • Each thread can determine which block it is in and how many threads it is in the block. For a parallel processing task, Thread can use this information to determine which molecular problem it should work on.

Execution perspective

  • Each Thread block is scheduled to execute on one of the SMS
  • For each thread in a thread block, a warp is formed for each 32 threads. The SM is scheduled by warp. In a warp, all threads execute the same Instruction stream, that is, Single Instruction Multiple Thread(SIMT). If there are branch statements during execution, threads executing different branches need to wait for each other. For example, for the following statements, only half of the threads in a warp are operating at any one time, rather than each executing its own branch. Improper branching statements can degrade performance when writing the kernel.
if (threadIdx.x % 2 == 0) {
    // Some work
} else {
    // Other work
}
Copy the code

Say something else

Why is a dispatch unit of 32 threads called WARP? The reason is that warp means thread wire and warp is a weaving related device for holding threads together, hence the metaphor:

Resource constraints

Just as writing CPU code is limited by the number of CPU cores, memory space, and access speed, the GPU programming model also needs to be aware of related resource constraints:

  • The maximum number of threads in each thread block is 1024
  • The number of thread blocks when starting the kernel.
  • The maximum number of Thread blocks that each SM can process at the same time is 32
  • The maximum size of Shared memory for 1080ti is 96kB
  • GPU access speed, 1080Ti upper limit is 484GB/s, if really limited to this limit means that the code is written very well
  • . (There are many more)

To the point code

The common flow in CUDA programming is:

  • Move CPU data to GPU
  • Write a kernel that defines the computation we want to perform
  • Start the kernel
  • Transport the results from the GPU back to the CPU

The vectorAdd in the Cuda sample code does the concatenation of two arrays h_A and h_B of length numElements and stores the result into h_C. Let’s use vectorAdd as an example to illustrate this process:

  • The first is to move the CPU data to the GPU
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
Copy the code
  • And then we define our addition
__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if(i < numElements) { C[i] = A[i] + B[i]; }}Copy the code

What we do is that each thread is responsible for calculating its own array subscript based on its own Thread block and threadIdx, and then performing an adjoin on the element corresponding to this subscript.

  • Next we start the kernel, where<<<blocksPerGrid, threadsPerBlock>>>Specifies the number of thread blocks and the number of threads in each block.
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
Copy the code
  • Finally, we carry the results back to the CPU.
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
Copy the code

The most important part of a CUDA program is done. The complete code also contains the contents of memory allocation, error checking of CUDA calls, etc. The complete code can be seen at samples/0_Simple/vectorAdd under the CUDA installation directory.

Some caveats

  • Kernel execution is asynchronous and immediately returned to the CPU code upon startup. If you time it, you will find that the time is very short. In fact, this time is only the time when the kernel starts.
  • Although kernel execution is asynchronous, cudaMemcpy blocks.
  • In the example code, a cudaMemcpy call from the CPU to the GPU actually completes two memory copies, one from the original CPU memory to a page-lock CPU memory, and then from this memory to the GPU memory.

More topics

  • GPU memory tier
  • Impact of GPU memory pattern on performance
  • Impact of GPU branch statements on performance
  • Synchronous operation and atomic operation in GPU
  • Data transfer between CPU and GPU, PCIe, and Page-lock memory
  • CUDA debugger and profiler

Think about the answers

  • Merge two ordered arrays: for ordered arrays with array length n and M and T threads, the array with length N can be divided into T equal parts. For the start and end points of each subarray, the corresponding upper and lower bounds in array M can be found bisdically, and then parallel merge is carried out based on this.
  • Parallel Prefix Sum (Scan) with CUDA

Reference

  • CUDA C Programming Guide
  • Udacity Intro to Parallel Programming