“Python is easy to use but slow”, “C++ is fast but needs far more code than Python to implement the same function”. Do you hear statements like these every day at school and at work? In the age of Python, how often do you get stuck in your code worrying about speed? “My code has only run 10 steps, and my neighbor’s has already finished its first epoch.” Is this a distortion of human nature or a decline of science? Xunzi said, “A gentleman is not different by nature; he is simply good at making use of external things.” This installment of source-code interpretation takes you inside the (mysterious) C++ / CUDA extensions in PyTorch.

  • The topic of this issue: combining the strengths of Python and C++. Adding C++ / CUDA extensions to PyTorch lets us make better use of our tools instead of being limited by them.
  • Source: MMCV, PyTorch.
  • Note: C++ / CUDA extensions generally come in a “precompiled” mode and a “just-in-time” (JIT) mode. This installment focuses on the “precompiled” mode.

1. Start with how the extension is called

Before adding extensions to speed up your own code, let’s look at a classic example. For readers who know a bit about detection or segmentation, NMS is probably the most common operator implemented as a C++ / CUDA extension.

import torch
from mmcv import _ext as ext_module

def nms(boxes, scores, iou_threshold, offset=0):
    inds = NMSop.apply(boxes, scores, iou_threshold, offset)
    dets = torch.cat((boxes[inds], scores[inds].reshape(-1, 1)), dim=1)
    return dets, inds

class NMSop(torch.autograd.Function):

    @staticmethod
    def forward(ctx, bboxes, scores, iou_threshold, offset):
        inds = ext_module.nms(
            bboxes, scores, iou_threshold=float(iou_threshold), offset=offset)
        return inds

    @staticmethod
    def symbolic(g, bboxes, scores, iou_threshold, offset):
        pass  # related to ONNX export

NMSop is a subclass of torch.autograd.Function (see the earlier installment on torch.autograd). The core of NMSop’s forward is a call to mmcv._ext.nms, yet no _ext module can be found anywhere in the MMCV source code. The file mmcv/_ext.cpython-xxx.so only appears after MMCV has been compiled (MMCV_WITH_OPS=True python setup.py build_ext --inplace); only then does import mmcv._ext succeed in Python. So the C++ extension is evidently built through setup.py.

2. setup.py — the build script of the extension

A precompiled extension has to be built ahead of time, and setup.py — a build script based on setuptools — is where that happens. The extensions of a Python package can therefore be traced back to its setup.py file. Here is a snippet of MMCV’s setup.py:

setup(
    name='mmcv',
    install_requires=install_requires,
    # the C++ / CUDA extensions that need to be built
    ext_modules=get_extensions(),
    # cmdclass specifies the behavior of the `python setup.py build_ext` command
    cmdclass={'build_ext': torch.utils.cpp_extension.BuildExtension})

You can see that one of the key arguments of the setup function, ext_modules, has to be a list of extensions — the extensions that actually need to be compiled. Here this argument is produced by the get_extensions function, which is defined as follows (excerpt):

def get_extensions():
    # (excerpt: definitions of cuda_args, define_macros, extra_compile_args omitted)
    extensions = []
    ext_name = 'mmcv._ext'
    from torch.utils.cpp_extension import CppExtension, CUDAExtension
    if torch.cuda.is_available():
        # CUDA is available: compile every file under ./mmcv/ops/csrc/pytorch
        extra_compile_args['nvcc'] = [cuda_args] if cuda_args else []
        op_files = glob.glob('./mmcv/ops/csrc/pytorch/*')
        extension = CUDAExtension
    else:
        # C++-only build: compile just the .cpp files
        op_files = glob.glob('./mmcv/ops/csrc/pytorch/*.cpp')
        extension = CppExtension
    include_path = os.path.abspath('./mmcv/ops/csrc')
    ext_ops = extension(
        name=ext_name,                            # extension (module) name
        sources=op_files,                         # source files
        include_dirs=[include_path],              # header search paths
        define_macros=define_macros,              # predefined macros
        extra_compile_args=extra_compile_args)    # other compile options
    extensions.append(ext_ops)
    return extensions

In the code above we finally meet mmcv._ext — it is the name given to the newly defined extension. In other words, the mmcv._ext module mentioned earlier gets its module name right here in setup.py. We can also see that the function used to create the extension depends on the environment: when CUDA is not available, CppExtension is called and only the .cpp files are compiled; when CUDA is available, CUDAExtension is called and all files (both .cpp and .cu) are compiled. CppExtension and CUDAExtension are both built on top of setuptools.Extension; both add the system’s torch/include directory to include_dirs at C++ compile time, and CUDAExtension additionally adds the CUDA-related libraries and header directories to the default build search paths. From setup.py we also learn that the source files of the extension live in ./mmcv/ops/csrc/pytorch/ inside MMCV. Other information — include_dirs, define_macros, extra_compile_args and so on — is combined by BuildExtension (in torch/utils/cpp_extension.py) into the final gcc / nvcc compile commands. Below is an excerpt of BuildExtension:

def build_extensions(self):
    self._check_abi()
    # register .cu / .cuh as recognized source extensions
    self.compiler.src_extensions += ['.cu', '.cuh']
    original_compile = self.compiler._compile

    def unix_wrap_compile(obj, src, ext, cc_args, extra_postargs, pp_opts):
        # (excerpt: cflags is derived from extra_postargs in lines omitted here)
        try:
            original_compiler = self.compiler.compiler_so
            if _is_cuda_file(src):
                # compile .cu files with nvcc instead of the default compiler
                nvcc = _join_cuda_home('bin', 'nvcc')
                self.compiler.set_executable('compiler_so', nvcc)
                if isinstance(cflags, dict):
                    cflags = cflags['nvcc']
                cflags = COMMON_NVCC_FLAGS + ['--compiler-options', "'-fPIC'"] + \
                    cflags + _get_cuda_arch_flags(cflags)
            elif isinstance(cflags, dict):
                cflags = cflags['cxx']
            # append -std=c++11 if it was not specified explicitly
            if not any(flag.startswith('-std=') for flag in cflags):
                cflags.append('-std=c++11')
            original_compile(obj, src, ext, cc_args, cflags, pp_opts)
        finally:
            # put the original compiler back in place
            self.compiler.set_executable('compiler_so', original_compiler)

    self.compiler._compile = unix_wrap_compile

With that clear, we run MMCV_WITH_OPS=True python setup.py build_ext --inplace and look at the output:

/usr/local/cuda-10.0/bin/nvcc -DMMCV_WITH_CUDA -I/home-to/mmcv/ops/csrc -I/home-to/torch/include -I/home-to/torch/include/torch/csrc/api/include -I/home-to/torch/include/TH -I/home-to/torch/include/THC -I/usr/local/cuda-10.0/include -I/home-to/python3.7m -c ./mmcv/ops/csrc/pytorch/nms_cuda.cu -o build/temp.linux-x86_64-3.7/./mmcv/ops/csrc/pytorch/nms_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -std=c++11

gcc -pthread -B compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DMMCV_WITH_CUDA -I/home-to/mmcv/ops/csrc -I/home-to/torch/include -I/home-to/torch/include/torch/csrc/api/include -I/home-to/torch/include/TH -I/home-to/torch/include/THC -I/usr/local/cuda-10.0/include -I/home-to/python3.7m -c ./mmcv/ops/csrc/pytorch/nms.cpp -o build/temp.linux-x86_64-3.7/./mmcv/ops/csrc/pytorch/nms.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11

...

From the output above we can see that:

  1. The build automatically calls nvcc for CUDA files and gcc for .cpp files.
  2. Because CUDAExtension is used, the header directories of Python, PyTorch, CUDA, etc. are added to the include path automatically, along with GPU architecture flags (-gencode) and optimization flags (-O3, etc.).
  3. The option -DTORCH_EXTENSION_NAME=_ext assigns _ext to the TORCH_EXTENSION_NAME macro. This is certainly not an idle detail; to find out what it is used for, read on to the next section.

3. PYBIND11_MODULE — bridge between Python and C++

We have compiled the extension with setup.py. One question remains, however: why can the compiled C++ / CUDA binary be called directly from Python? Going through the compiled files again, one of them, pybind.cpp, looks suspicious. Opening it we find:

#include <torch/extension.h>

// forward declaration of nms, which is implemented elsewhere
Tensor nms(Tensor boxes, Tensor scores, float iou_threshold, int offset);

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("nms", &nms, "nms (CPU/CUDA) ",
        py::arg("boxes"), py::arg("scores"),
        py::arg("iou_threshold"), py::arg("offset"));
}

PYBIND11_MODULE is a macro defined in the pybind11 repository (see pybind11/include/pybind11/pybind11.h). pybind11 is a library for creating Python bindings of C++ code. Having located the source, let’s analyze it a bit further.

PYBIND11_MODULE provides the entry point through which the Python interpreter accesses the C++ code. In the example above, TORCH_EXTENSION_NAME is exactly the macro that appeared in the compile command earlier, and it corresponds to the name of the extension; it is therefore expanded to _ext (note: without double quotes). m stands for the module instance corresponding to TORCH_EXTENSION_NAME (any identifier could be used here). Each m.def inside the braces defines one member function of _ext, generally in the form m.def("function name", pointer to the C++ implementation, "docstring", argument list). In this way nms becomes a member function of mmcv._ext, whose implementation is the already-declared nms function (analyzed in the next section), and from mmcv._ext import nms then works in Python. If the definition is still unclear, we can let the C++ compiler expand the macro:

Tensor nms(Tensor boxes, Tensor scores, float iou_threshold, int offset);

static void pybind11_init__ext(pybind11::module &);

extern "C" __attribute__ ((visibility("default"))) PyObject *PyInit__ext() {
  // some code omitted
  auto m = pybind11::module("_ext");  // the module instance m is created inside the macro
  try {
    pybind11_init__ext(m);
    return m.ptr();
  }
  // ... (catch blocks omitted)
}

void pybind11_init__ext(pybind11::module &m) {
  // register the member function
  m.def("nms", &nms, "nms (CPU/CUDA) ",
        py::arg("boxes"), py::arg("scores"),
        py::arg("iou_threshold"), py::arg("offset"));
}

PyObject *PyInit_<name> is declared in Python.h and is the official way to define a Python module in C/C++ (see the official Python documentation). The _ext that follows PyInit_ here comes from the TORCH_EXTENSION_NAME macro. In other words, this declares a new Python module named _ext.

4. cpp/cu files — the concrete implementation of the operator

After analyzing PYBIND11_MODULE, the only piece left unexplained is the function Tensor nms(Tensor boxes, Tensor scores, float iou_threshold, int offset);. It is defined in mmcv/ops/csrc/pytorch/nms.cpp:

#include <torch/extension.h>

Tensor nms(Tensor boxes, Tensor scores, float iou_threshold, int offset) {
  if (boxes.device().is_cuda()) {
    // CUDA implementation
    return nms_cuda(boxes, scores, iou_threshold, offset);
  } else {
    // CPU implementation
    return nms_cpu(boxes, scores, iou_threshold, offset);
  }
}

As you can see, the actual implementation is dispatched to nms_cuda or nms_cpu depending on the device of the input Tensor. Let’s look at the CPU (cpp) implementation first.

4.1 cpp operator implementation

#include <torch/extension.h>
using namespace at;  // use the at namespace so Tensor can be written directly

Tensor nms_cpu(Tensor boxes, Tensor scores, float iou_threshold, int offset) {
  // ... (sorting by score, area computation, etc. omitted)
  for (int64_t _i = 0; _i < nboxes; _i++) {
    if (select[_i] == false) continue;
    // ...
    for (int64_t _j = _i + 1; _j < nboxes; _j++) {
      if (select[_j] == false) continue;
      // ... (computation of the intersection `inter` omitted)
      auto ovr = inter / (iarea + areas[_j] - inter);
      // suppress box _j if its IoU with box _i is above the threshold
      if (ovr >= iou_threshold) select[_j] = false;
    }
  }
  return order_t.masked_select(select_t);
}

The above is the core of nms_cpu; readers who want a deeper understanding of the algorithm can read the full source. Note the two nested for loops — exactly the kind of hotspot that motivates writing a C++ / CUDA extension for NMS. The code should be easy to follow for anyone with some C++ background (int64_t is the fixed-width 64-bit integer typedef standardized in C99 for portability across platforms), but a few unfamiliar types do appear, most notably the Tensor type.

The Tensor type is brought in by torch/extension.h; it comes from at, one of the three main namespaces of PyTorch’s C++ API (at, torch, and c10).

Tips: ATen (A Tensor Library) declares and defines everything related to Tensor operations and is the namespace most commonly used when writing C++ extensions for PyTorch. c10 (Caffe Tensor Library) is the foundation underlying ATen; it contains PyTorch’s core abstractions, including the actual implementations of the Tensor and Storage data structures. The Tensor defined in the torch namespace adds autograd functionality on top of ATen’s, but it is rarely seen in C++ extensions.

This Tensor type is very powerful and supports essentially all of PyTorch’s operations (the +, -, *, /, >, < operators, .view, .unsqueeze, and so on); its full API is listed in the official Tensor reference. The at namespace likewise provides almost all Tensor-related functions (at::ones, at::zeros, at::where, etc.), documented in the ATen API reference. In short, simply adding #include <torch/extension.h> to your source file lets you call essentially everything PyTorch supports from C++.
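To make this concrete, here is a small illustrative sketch — the helper toy_score_filter is made up for this article and is not part of MMCV — showing how ordinary Tensor operations look in C++ with ATen; it works on CPU and GPU tensors alike:

#include <torch/extension.h>
using namespace at;

// Hypothetical helper: keep only the boxes whose score exceeds `thr`.
// Plain ATen calls, no manual memory management needed.
Tensor toy_score_filter(Tensor boxes, Tensor scores, float thr) {
  Tensor mask = scores > thr;               // boolean mask of shape (N,)
  Tensor inds = mask.nonzero().squeeze(1);  // indices of the kept boxes
  return boxes.index_select(0, inds);       // same as boxes[inds] in Python
}

Such a function could then be exposed to Python with one more m.def call, exactly like nms above.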

4.2 CUDA operator implementation

4.2.1 CUDA programming basics

This section draws in part on a minimalist introductory tutorial on CUDA programming.

Basic concepts

CUDA is a general-purpose parallel computing platform and programming model built on NVIDIA GPUs. With CUDA we can use the GPU’s parallel computing engine to solve complex computational problems more efficiently. CUDA’s syntax is mostly identical to C++; the conventional source file suffix is .cu and the header suffix is .cuh. CUDA programming is heterogeneous: the CPU handles serial code with complex logic, while the GPU focuses on data-parallel, compute-intensive code to maximize efficiency. The CPU side is called the host, and the GPU side is called the device.

CUDA program design process

In general, the execution of a CUDA program follows this flow:

  1. Allocate host memory and initialize the data
  2. Allocate device memory and copy the data from host to device
  3. Launch the CUDA kernel to perform the computation on the device
  4. Copy the result from the device back to the host
  5. Free the memory allocated on the device and the host

For a PyTorch CUDA extension, the input and output Tensors are already on the GPU, so only step 3 is left for us to do. This saves valuable effort and lets us concentrate on the kernel implementation itself.
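For reference, below is a minimal self-contained sketch of the classic five-step flow, using a toy add_one kernel that has nothing to do with NMS; in a PyTorch extension, steps 1, 2, 4 and 5 are handled by the Tensor machinery and only the kernel launch of step 3 remains.

#include <cuda_runtime.h>
#include <cstdio>

// Step 3: the kernel that runs on the device.
__global__ void add_one(float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += 1.0f;
}

int main() {
  const int n = 1 << 20;
  size_t bytes = n * sizeof(float);

  // Step 1: allocate host memory and initialize the data.
  float* h_x = (float*)malloc(bytes);
  for (int i = 0; i < n; i++) h_x[i] = 0.0f;

  // Step 2: allocate device memory and copy host -> device.
  float* d_x;
  cudaMalloc(&d_x, bytes);
  cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

  // Step 3: launch the kernel.
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  add_one<<<blocks, threads>>>(d_x, n);

  // Step 4: copy the result device -> host.
  cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
  printf("h_x[0] = %f\n", h_x[0]);  // expect 1.000000

  // Step 5: free the memory allocated on the device and the host.
  cudaFree(d_x);
  free(h_x);
  return 0;
}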

CUDA keywords for specifying a function’s device

Because CUDA programming is heterogeneous, a function may be defined and called on different devices, so CUDA adds function type qualifiers that specify where a function executes and from where it may be called (a small example follows this list):

  • __global__: executed on the device, called from the host (and, on some specific GPUs, also from the device). The return type must be void, variadic arguments are not supported, and it cannot be a class member function. Note that a kernel defined with __global__ is launched asynchronously: the host does not wait for the kernel to finish before executing the next statement.
  • __device__: executed on the device and callable only from the device; it cannot be combined with __global__.
  • __host__: executed on the host and callable only from the host; usually omitted. It cannot be combined with __global__, but it can be combined with __device__, in which case the function is compiled for both the device and the host.
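The sketch below (the function names are made up for illustration) shows the three qualifiers side by side:

#include <cuda_runtime.h>

// __device__: runs on the device, callable only from device code.
__device__ float square(float x) { return x * x; }

// __global__: a kernel; launched from the host and must return void.
__global__ void square_all(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = square(in[i]);
}

// __host__ __device__: compiled for both sides, callable from either.
__host__ __device__ float clamp01(float x) {
  return x < 0.f ? 0.f : (x > 1.f ? 1.f : x);
}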

The logical thread hierarchy in CUDA

When a kernel is launched on the device, a very large number of threads are started, and all the threads launched by one kernel are organized in a two-level hierarchy. Together they form a grid, and all threads in the same grid share the same global memory space. The grid is further divided into blocks, and each block contains many threads. This two-level organization is shown in the figure below for a 2-dim grid of 2-dim blocks. grid and block are both declared as variables of type dim3; dim3 can be viewed as a struct with three unsigned int members (x, y, z), and when it is constructed with only one or two dimensions, the remaining dimensions default to 1. On the GPU, threads are scheduled by the streaming multiprocessors (SMs) in units of warps (“thread bundles”); a warp contains 32 threads, so blocks are usually designed to contain a multiple of 32 threads.

To picture this hierarchy, think of all the threads started by one kernel as a residential compound: the compound is the grid, and it consists of buildings, the blocks. Each building (block) has its own three-dimensional coordinates (x, y, z) within the compound, and every thread inside a building is likewise indexed by its three-dimensional coordinates (x, y, z) within that block.
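As a small illustrative sketch of how these coordinates give each thread a unique identity (the kernel name who_am_i is made up, and it uses the built-in variables gridDim and blockDim described in the following subsections):

#include <cuda_runtime.h>

// For a 2-dim grid of 2-dim blocks, flatten the (block, thread) coordinates
// into one globally unique linear thread index.
__global__ void who_am_i(int* out) {
  int block_id = blockIdx.y * gridDim.x + blockIdx.x;            // which "building"
  int thread_in_block = threadIdx.y * blockDim.x + threadIdx.x;  // which "room"
  int global_id = block_id * (blockDim.x * blockDim.y) + thread_in_block;
  out[global_id] = global_id;
}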

Calling a kernel function in CUDA

A kernel function (kernel) is a function that is executed in parallel by the threads on the device. A kernel is declared with the __global__ qualifier and launched with the <<<grid, block>>> execution configuration, which specifies how many threads will execute it; grid and block must be defined before the launch. In CUDA every thread runs the kernel function and is assigned a unique thread ID, which can be read inside the kernel through the built-in variable threadIdx. The following code launches a kernel with the thread hierarchy described above.

dim3 grid(3, 2);
dim3 block(5, 3);
kernel_fun<<<grid, block>>>(params...);

Writing a CUDA kernel function

The parts of the computation that need to run in parallel are executed by different threads inside the kernel. To let each thread know which piece of work it is responsible for, CUDA provides two built-in coordinate variables, blockIdx and threadIdx, which together uniquely identify a thread. Both are of type dim3 (with x, y, z members): blockIdx is the position of the thread’s block within the grid, and threadIdx is the position of the thread within its block; the grid and block themselves are defined at kernel launch time, and their dimensions are available inside the kernel as gridDim and blockDim. The kernel below is a typical design for matrix addition: at run time each element is handled by a different thread, determined by its blockIdx and threadIdx coordinates, which yields efficient parallelism.

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) 
{ 
    int i = blockIdx.x * blockDim.x + threadIdx.x; 
    int j = blockIdx.y * blockDim.y + threadIdx.y; 
    if (i < N && j < N) 
        C[i][j] = A[i][j] + B[i][j]; 
}
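A typical launch for this kernel (a sketch assuming N, A, B and C are already defined as in the kernel above) covers the whole N x N matrix with 16 x 16 thread blocks; the boundary check i < N && j < N inside the kernel takes care of the ragged edge:

// one thread per matrix element
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);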

This covers the basics needed to implement a general CUDA operator; in the next section we analyze the NMS CUDA operator as a concrete example.

4.2.2 CUDA operator instance

#include <torch/extension.h>
// (other includes for the CUDA runtime / ATen CUDA utilities omitted)
using namespace at;

#define DIVUP(m, n) ((m) / (n) + ((m) % (n) > 0))

int const threadsPerBlock = sizeof(unsigned long long int) * 8;  // 64

__device__ inline bool devIoU(float const *const a, float const *const b,
                              const int offset, const float threshold) {
  // defined on the device and called from the kernel below: returns whether
  // the IoU of two boxes exceeds the threshold (body omitted)
}

__global__ void nms_cuda(const int n_boxes, const float iou_threshold,
                         const int offset, const float *dev_boxes,
                         unsigned long long *dev_mask) {
  const int row_start = blockIdx.y;  // position of this block within the grid
  const int col_start = blockIdx.x;  // position of this block within the grid
  const int tid = threadIdx.x;       // thread id (0-63)

  if (row_start > col_start) return;

  const int row_size =
      fminf(n_boxes - row_start * threadsPerBlock, threadsPerBlock);
  const int col_size =
      fminf(n_boxes - col_start * threadsPerBlock, threadsPerBlock);

  __shared__ float block_boxes[threadsPerBlock * 4];
  if (tid < col_size) {
    block_boxes[tid * 4 + 0] = dev_boxes[(threadsPerBlock * col_start + tid) * 4 + 0];
    block_boxes[tid * 4 + 1] = dev_boxes[(threadsPerBlock * col_start + tid) * 4 + 1];
    block_boxes[tid * 4 + 2] = dev_boxes[(threadsPerBlock * col_start + tid) * 4 + 2];
    block_boxes[tid * 4 + 3] = dev_boxes[(threadsPerBlock * col_start + tid) * 4 + 3];
  }
  __syncthreads();

  if (tid < row_size) {
    const int cur_box_idx = threadsPerBlock * row_start + tid;
    const float *cur_box = dev_boxes + cur_box_idx * 4;
    int i = 0;
    unsigned long long int t = 0;
    int start = 0;
    if (row_start == col_start) {
      start = tid + 1;
    }
    for (i = start; i < col_size; i++) {
      if (devIoU(cur_box, block_boxes + i * 4, offset, iou_threshold)) {
        t |= 1ULL << i;
      }
    }
    dev_mask[cur_box_idx * gridDim.y + col_start] = t;
  }
}

Tensor NMSCUDAKernelLauncher(Tensor boxes, Tensor scores, float iou_threshold,
                             int offset) {
  // entry point of the CUDA implementation
  at::cuda::CUDAGuard device_guard(boxes.device());

  auto order_t = std::get<1>(scores.sort(0, /*descending=*/true));
  auto boxes_sorted = boxes.index_select(0, order_t);

  int boxes_num = boxes.size(0);
  const int col_blocks = DIVUP(boxes_num, threadsPerBlock);
  // mask records, for every pair of bboxes, whether their IoU is greater than
  // the threshold. One long long can hold 64 bool values, so the storage
  // shrinks by a factor of 64: a (boxes_num, boxes_num / 64) buffer suffices.
  Tensor mask =
      at::empty({boxes_num, col_blocks}, boxes.options().dtype(at::kLong));
  dim3 blocks(col_blocks, col_blocks);
  dim3 threads(threadsPerBlock);  // 64 threads per block, one per bit of a long long
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  // kernel launch: <<<blocks, threads, shared_memory_size, stream>>>
  nms_cuda<<<blocks, threads, 0, stream>>>(
      boxes_num, iou_threshold, offset, boxes_sorted.data_ptr<float>(),
      (unsigned long long*)mask.data_ptr<int64_t>());

  at::Tensor mask_cpu = mask.to(at::kCPU);
  unsigned long long* mask_host =
      (unsigned long long*)mask_cpu.data_ptr<int64_t>();

  std::vector<unsigned long long> remv(col_blocks);
  memset(&remv[0], 0, sizeof(unsigned long long) * col_blocks);

  at::Tensor keep_t =
      at::zeros({boxes_num}, boxes.options().dtype(at::kBool).device(at::kCPU));
  bool* keep = keep_t.data_ptr<bool>();

  for (int i = 0; i < boxes_num; i++) {
    int nblock = i / threadsPerBlock;
    int inblock = i % threadsPerBlock;

    if (!(remv[nblock] & (1ULL << inblock))) {
      keep[i] = true;
      // mark every box overlapping with box i by setting its bit in remv
      unsigned long long* p = mask_host + i * col_blocks;
      for (int j = nblock; j < col_blocks; j++) {
        remv[j] |= p[j];
      }
    }
  }
  AT_CUDA_CHECK(cudaGetLastError());
  return order_t.masked_select(keep_t.to(at::kCUDA));
}