
PaddlePaddle advanced tutorial: implementing and using a custom CPU operator

An operator (Op for short) is the basic building block of a neural network. In a network model, an operator corresponds to the computation logic of a layer: for example, the convolution layer is an operator, and the weighted-sum computation in a fully-connected layer (FC layer) is an operator. Learning how to implement a custom operator in C++ gives you a deeper understanding of the underlying logic of how a neural network runs.

  • Tensor: a multidimensional array with an arbitrary number of dimensions; different Tensors can have different dtypes and shapes.
  • Operator (Op for short): the unit that performs all computation on Tensors. You can think of it as a function whose inputs and outputs are Tensors. PaddlePaddle defines a large number of operators covering the Tensor computations of common neural network models, such as conv2d, pool2d, relu, etc.

I. The underlying principle

All Op operators in PaddlePaddle are registered in OpInfoMap. When an Op is called from Python to perform a computation, TraceOp looks up the corresponding Op in OpInfoMap and invokes its compute kernel to carry out the computation and return the result.

Because hardware differs, the same operator needs a different kernel for each device. For example, an operator written for the CPU can only run on the CPU; to run on another computing platform, you need to implement a kernel for that platform. This article focuses on how to implement an operator that runs on the CPU.

For example, when Y = relu(X) is executed in dynamic graph mode, the framework uses TraceOp to do the following:

  1. Call the forward computation function of the relu operator to compute Y.
  2. Create the backward Op and the input/output variables it needs (no computation happens at this point; the backward computation is performed later when backward() is called).

From the underlying principle above, it is not hard to see that the most critical parts of an operator are the forward computation and the backward computation; they are the core of an operator.

When Y = custom_relu(X) is executed in dynamic graph mode, the execution flow of a C++ custom operator is the same as that of a native operator.

The difference from a native operator is that native operators are compiled together with the framework, while custom operators can be compiled separately. In the end, both are registered in OpInfoMap.

II. C++ custom operator format

1. Basic format

Before writing the operator, we need to include the PaddlePaddle extension header file:

#include "paddle/extension.h"

Operator computation functions have specific format requirements that must be followed when coding. The basic form is as follows:

std::vector<paddle::Tensor> OpFunction(const paddle::Tensor& x, ..., int attr, ...) {
  ...
}

This is the fixed format that every operator written in C++ for Paddle must follow. In other words, this is the operator interface that Paddle provides; you just need to define your operator according to this interface.
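
As an illustration, here is a minimal sketch of a hypothetical operator that follows this interface, with one input Tensor and one integer attribute. The name PowerForward and the float32-only simplification are assumptions of this sketch, not part of this tutorial's Sin example; the next sections show how to support multiple data types properly.

#include <cmath>
#include <vector>
#include "paddle/extension.h"

// Hypothetical example: raise every element of x to the power `exponent`.
// For brevity it assumes a float32 CPU input.
std::vector<paddle::Tensor> PowerForward(const paddle::Tensor& x, int exponent) {
  auto out = paddle::Tensor(paddle::PlaceType::kCPU, x.shape());
  const float* x_data = x.data<float>();
  float* out_data = out.mutable_data<float>(x.place());
  for (int64_t i = 0; i < x.size(); ++i) {
    out_data[i] = std::pow(x_data[i], exponent);
  }
  return {out};
}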

2. Adapt to multiple data types

In practice, an operator usually needs to support multiple data types, which requires a function template. Add the following declaration above the operator function defined with the interface above:

template <typename data_t>

Note that the template parameter must be named data_t (it stands for the concrete data type) and cannot be changed to any other name; otherwise compilation fails when the dispatch macros described below are used.

Then use the switch-case statement to support multiple data types:

switch(x.type()) {
  case paddle::DataType::FLOAT32:
    ...
    break;
  case paddle::DataType::FLOAT64:
    ...
    break;
  default:
    PD_THROW(
      "function  ... is not implemented for data type `",
      paddle::ToString(x.type()), "`");
}
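
For example, the "..." branches above typically just call a templated compute kernel with the concrete type spelled out. A minimal sketch, using a hypothetical scale_kernel that is not part of the Sin operator implemented later:

// Hypothetical templated kernel: multiplies every element by 2.
template <typename data_t>
void scale_kernel(const data_t* x_data, data_t* out_data, int64_t numel) {
  for (int64_t i = 0; i < numel; ++i) {
    out_data[i] = 2 * x_data[i];
  }
}

// Inside the operator function, dispatch on the input's runtime dtype:
switch (x.type()) {
  case paddle::DataType::FLOAT32:
    scale_kernel<float>(x.data<float>(), out.mutable_data<float>(x.place()), x.size());
    break;
  case paddle::DataType::FLOAT64:
    scale_kernel<double>(x.data<double>(), out.mutable_data<double>(x.place()), x.size());
    break;
  default:
    PD_THROW("scale is not implemented for data type `",
             paddle::ToString(x.type()), "`");
}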

If you don’t want to write the switch-case yourself, you can also use the officially provided dispatch macros, such as PD_DISPATCH_FLOATING_TYPES.
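
A minimal sketch of the equivalent dispatch with the macro, reusing the hypothetical scale_kernel above (the same pattern appears in the Sin operator later in this article); inside the lambda, the macro provides the template parameter data_t:

PD_DISPATCH_FLOATING_TYPES(
    x.type(), "scale_kernel", ([&] {
      scale_kernel<data_t>(x.data<data_t>(),
                           out.mutable_data<data_t>(x.place()),
                           x.size());
    }));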

3. Derivation of dimensions and types

The PaddlePaddle framework supports both dynamic graph and static graph execution modes. In static graph mode, the shape and dtype of each Tensor must be derived at network-construction time so that a correct model description can be generated for subsequent graph optimization and execution. Therefore, in addition to the operator's computation functions, we also need to implement shape and dtype derivation for the forward computation.

The InferShape and InferDtype functions also have a required format; their parameters and return values must correspond one-to-one with the input and output Tensors of the forward computation.
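
A minimal sketch of the expected signatures for an operator with a single input and a single output whose shape and dtype simply follow the input (the function names here are placeholders; the Sin example later in this article uses exactly this pattern):

std::vector<std::vector<int64_t>> OpInferShape(std::vector<int64_t> x_shape) {
  return {x_shape};
}

std::vector<paddle::DataType> OpInferDtype(paddle::DataType x_dtype) {
  return {x_dtype};
}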

For a custom operator with only one input Tensor and one output Tensor, if the output Tensor has the same shape and dtype as the input Tensor, the InferShape and InferDtype implementations can be omitted; in all other scenarios these two functions must be implemented.

4. Custom operator registration

Finally, the PD_BUILD_OP family of macros needs to be called to build the operator's description and to associate the computation functions with the shape and dtype derivation functions. The format is as follows:

PD_BUILD_OP(op_name)
    .Inputs({"X"})
    .Outputs({"Out"})
    .SetKernelFn(PD_KERNEL(ReluCPUForward))
    .SetInferShapeFn(PD_INFER_SHAPE(ReluInferShape))
    .SetInferDtypeFn(PD_INFER_DTYPE(ReluInferDtype));

PD_BUILD_GRAD_OP(op_name)
    .Inputs({"X", "Out", paddle::Grad("Out")})
    .Outputs({paddle::Grad("X")})
    .SetKernelFn(PD_KERNEL(ReluCPUBackward));

Note that:

  • PD_BUILD_OP is used to build the forward operator. The name in parentheses is the operator name, which is also the interface name used later on the Python side. Note that the operator name is written without quotation marks, and it must not clash with the name of any existing operator in PaddlePaddle.
  • PD_BUILD_GRAD_OP is used to build the backward operator corresponding to a forward operator, and PD_BUILD_DOUBLE_GRAD_OP is used to build the second-order gradient operator corresponding to that backward operator. Paddle currently supports higher-order derivatives only up to the second order.

III. Implementing the CPU operator

The following uses the relatively simple sin function as an example to implement a custom CPU operator.

1. Import necessary header files

#include "paddle/extension.h"
#include <vector>
#define CHECK_CPU_INPUT(x) PD_CHECK(x.place() == paddle::PlaceType::kCPU, #x " must be a CPU Tensor.")

Here we include the PaddlePaddle extension header and define a macro that checks that the input Tensor is a CPU Tensor.

2. Implement the forward calculation function

To support multiple data types, we first write the compute kernel as a function template.

The core of the forward pass is the compute kernel itself. The C++ standard library already provides the basic math functions we need, which can be used directly; the usual form is std::sin(x), std::cos(x), and so on.

// Forward compute kernel
template <typename data_t>
void sin_cpu_forward_kernel(const data_t* x_data,
                            data_t* out_data,
                            int64_t x_numel) {
  for (int i = 0; i < x_numel; ++i) {
    out_data[i] = std::sin(x_data[i]);
  }
}

Then we implement the computation function following the forward-pass format given earlier:

std::vector<paddle::Tensor> sin_cpu_forward(const paddle::Tensor& x) {
  // Check the input
  CHECK_CPU_INPUT(x);
  // Declare the output variable out with two parameters: the device it runs on and its shape
  auto out = paddle::Tensor(paddle::PlaceType::kCPU, x.shape());

  // Computation: dispatch on the data type and call the kernel defined above
  PD_DISPATCH_FLOATING_TYPES(
      x.type(), "sin_cpu_forward_kernel", ([&] {
        sin_cpu_forward_kernel<data_t>(
            x.data<data_t>(),                     // input memory address
            out.mutable_data<data_t>(x.place()),  // allocate memory for the output
            x.size());                            // number of elements
      }));

  return {out};
}

3. Implement the backward calculation function

This part requires some mathematical background: you need to understand how to compute partial derivatives and the concept of gradients in a neural network. I looked up some material while implementing it, and share it here:

  • 3Blue1Brown: www.3blue1brown.com/lessons/bac…
  • Neural network gradient descent method and its implementation
  • WolframAlpha: www.wolframalpha.com/

The last site can compute partial derivatives directly, which is convenient. For example, here we need the derivative of the sin function: d(sin(x))/dx = cos(x).

The hardest part of backpropagation is computing the gradient. Once you can compute the gradient, the rest is straightforward and similar to the forward pass: by the chain rule, the gradient passed back to x is the incoming gradient grad_out multiplied by cos(x).

// Backward compute kernel: by the chain rule, grad_x = grad_out * cos(x),
// since the derivative of sin(x) is cos(x)
template <typename data_t>
void sin_cpu_backward_kernel(const data_t* grad_out_data,
                             const data_t* x_data,
                             data_t* grad_x_data,
                             int64_t numel) {
  for (int i = 0; i < numel; ++i) {
    grad_x_data[i] = grad_out_data[i] * std::cos(x_data[i]);
  }
}

std::vector<paddle::Tensor> sin_cpu_backward(const paddle::Tensor& x,          // forward input
                                             const paddle::Tensor& out,        // forward output
                                             const paddle::Tensor& grad_out) { // gradient of the output
  // Declare the gradient variable for the input
  auto grad_x = paddle::Tensor(paddle::PlaceType::kCPU, x.shape());

  // Computation: dispatch on the data type and call the kernel defined above
  PD_DISPATCH_FLOATING_TYPES(
      out.type(), "sin_cpu_backward_kernel", ([&] {
        sin_cpu_backward_kernel<data_t>(
            grad_out.data<data_t>(),                 // memory address of the output gradient
            x.data<data_t>(),                        // memory address of the forward input
            grad_x.mutable_data<data_t>(x.place()),  // allocate memory for the input gradient
            out.size());                             // number of elements
      }));

  return {grad_x};
}

4. Dimension and type derivation

The InferDtype and InferShape functions are implemented according to the format:

std::vector<std::vector<int64_t>> sinInferShape(std::vector<int64_t> x_shape) {
  return {x_shape};
}

std::vector<paddle::DataType> sinInferDtype(paddle::DataType x_dtype) {
  return {x_dtype};
}

Because the input and output of sin(x) have the same shape and dtype, the InferShape and InferDtype implementations could actually be omitted here.

5. Custom operator registration

Finally, complete the registration of the custom operator according to the format given earlier:

PD_BUILD_OP(custom_sin_cpu)
    .Inputs({"X"})
    .Outputs({"Out"})
    .SetKernelFn(PD_KERNEL(sin_cpu_forward))
    .SetInferShapeFn(PD_INFER_SHAPE(sinInferShape))
    .SetInferDtypeFn(PD_INFER_DTYPE(sinInferDtype));

PD_BUILD_GRAD_OP(custom_sin_cpu)
    .Inputs({"X", "Out", paddle::Grad("Out")})
    .Outputs({paddle::Grad("X")})
    .SetKernelFn(PD_KERNEL(sin_cpu_backward));

IV. Using the custom CPU operator

Custom operators can be compiled and loaded on the fly with JIT (just-in-time) compilation. In this project the operator source file is custom_op/custom_sin_cpu.cc, and it can be loaded directly:

from paddle.utils.cpp_extension import load
custom_ops = load(
    name="custom_jit_ops",
    sources=["custom_op/custom_sin_cpu.cc"])

custom_sin_cpu = custom_ops.custom_sin_cpu
Compiling user custom op, it will cost a few seconds.....

Using this operator is also very simple, as shown below:

import paddle
import paddle.nn.functional as F
import numpy as np

# Define the execution environment
device = 'cpu'
paddle.set_device(device)

# Convert input data to tensors
data = np.random.random([4, 12]).astype(np.float32)
x = paddle.to_tensor(data, stop_gradient=False)

# Call a custom operator for forward calculation
y = custom_sin_cpu(x)
# Call custom operator to implement back propagation
y.mean().backward()

print("Forward calculation result: {}".format(y))
print("Gradient result: {}".format(y.grad))
Forward calculation result: Tensor(shape=[4, 12], dtype=float32, place=CPUPlace, stop_gradient=False,
       [[0.53175545, 0.70284414, 0.44641760, 0.82928818, 0.45170310, 0.07087017, 0.77653980, 0.71543890, 0.30254266, 0.37284735, 0.10566728, 0.51137722],
        [0.05868274, 0.77604854, 0.50411993, 0.62174445, 0.71051770, 0.04676604, 0.47530916, 0.05187472, 0.05436167, 0.71679759, 0.74827725, 0.70496327],
        [0.78653520, 0.33197609, 0.27495766, 0.83881938, 0.17083500, 0.25208664, 0.55356687, 0.06564844, 0.02807573, 0.66028857, 0.29398340, 0.69536334],
        [0.39080915, 0.03133771, 0.46310377, 0.79298347, 0.79788220, 0.74418354, 0.02709462, 0.72110707, 0.81954306, 0.40375820, 0.48059800, 0.81256640]])
Gradient result: Tensor(shape=[4, 12], dtype=float32, place=CPUPlace, stop_gradient=False,
       [[0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333],
        [0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333],
        [0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333],
        [0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333, 0.02083333]])

To verify the operator's correctness, we can compare it against Paddle's built-in operator and check whether the forward results and gradients match:

import paddle
import paddle.nn.functional as F
import numpy as np

device = 'cpu'
paddle.set_device(device)

data = np.random.random([4, 12]).astype(np.float32)

x_target = paddle.to_tensor(data, stop_gradient=False)
y_target = paddle.sin(x_target)
y_target.mean().backward()

x = paddle.to_tensor(data, stop_gradient=False)
y = custom_sin_cpu(x)
y.mean().backward()

# If the output is True, the result is correct
print("sin_result: ",paddle.allclose(y_target, y).numpy())
print("sin_grad_result: ",paddle.allclose(y_target.grad, y.grad).numpy())
sin_result:  [ True]
sin_grad_result:  [ True]

The output shows that our custom operator is functionally correct.

V. Summary and reflections

Finally, to summarize, the main idea of a C++ custom operator comes down to three points:

  1. Implement the forward and backward computation functions.
  2. Wrap the forward and backward functions and register the operator.
  3. Compile, load, and call the operator.

In my view, the first point is the most important, especially computing the gradient for backpropagation, which requires some mathematical background and a solid understanding of how neural networks work.