Abstract: Describes how to add a new hardware back end to MindSpore.
This article is shared from the Huawei Cloud community post "How to Add a New Hardware Backend to MindSpore? Quickly Build a Test Environment!", author: HWCloudAI.
MindSpore is a new-generation open-source AI computing framework developed by Huawei. As the all-scenario deep learning framework best matched to the computing power of the Ascend AI processor, it provides data scientists and algorithm engineers with a design-friendly, high-efficiency development experience and promotes a flourishing ecosystem of AI software and hardware applications.
MindSpore supports heterogeneous computing power. In addition to Huawei's Ascend NPU, MindSpore also supports CPU (e.g. MKLDNN) and GPU (e.g. CUDA kernels) operators. (Note: MindSpore supports running a whole network on different hardware platforms, but it does not support running different partitions of the same network on different hardware platforms, which differs from TensorFlow's graph-partition style of heterogeneous execution.)
At present, the AI chip industry is a crowded, busy place: manufacturers large and small, new and established, at home and abroad, are all launching their own AI acceleration chips. By now it should be clear to everyone that the success of hardware depends on the software stack and ecosystem behind it. MindSpore not only serves as the software stack underpinning Huawei's own AI hardware, but also aims to earn its own place in the broader AI ecosystem.
MindSpore is still in the stage of promotion and improvement. This article introduces how to add a new hardware backend to MindSpore, and along the way introduces the directory structure of the MindSpore source code. We hope to provide useful information and a reference for AI hardware manufacturers and interested developers at home and abroad, so that they can use MindSpore as a framework for testing and integrating AI chips and quickly build a test environment for whole-network models.
This article is based on the MindSpore r1.1 source code: gitee.com/mindspore/m… For installing MindSpore and its software version requirements, please refer to: www.mindspore.cn/install/
The test case
This article focuses on running a simple Dense layer network (www.mindspore.cn/doc/api\_py…) on a new hardware backend. Note: this article targets the basic static graph execution mode: www.mindspore.cn/doc/program…
import mindspore
import numpy as np
import mindspore.nn as nn
from mindspore import context, Tensor

context.set_context(device_target="CPU", mode=context.GRAPH_MODE)

# 32, 16
net = nn.Dense(32, 16, weight_init='ones', bias_init=1.2)  #, activation='relu')
# 48, 32
input_data = Tensor(np.ones([48, 32]).astype(np.float32), mindspore.float32)
output = net(input_data)
print(output.asnumpy())
Note: the ReLU activation is commented out here, so this Dense layer is equivalent to a small network with only two operators (MatMul + BiasAdd). The result of this use case is a 48 x 16 2-D matrix in which every element has the value 33.2.
This article walks through, from top to bottom, the components that need to be modified in MindSpore to support a new hardware backend. We will call the new hardware to be supported XPU. The effect we want to achieve after modifying the MindSpore code is to change device_target in the above use case to XPU and run the Dense layer on the XPU accelerator, e.g.
context.set_context(device_target="XPU", mode=context.GRAPH_MODE)
Note: this article does not show the implementation details of specific classes and functions. For concrete implementations, refer to the already supported hardware backends in the corresponding directories, such as CPU, GPU, and Ascend.
Add a new device target parameter option
First, we need to add the new target to valid_targets in the front-end ME Python layer: gitee.com/mindspore/m…
def set_device_target(self, target):
    valid_targets = ["CPU", "GPU", "Ascend", "Davinci", "XPU"]
    if not target in valid_targets:
        raise ValueError(f"Target device name {target} is invalid! It must be one of {valid_targets}")
    if target == "Davinci":
        target = "Ascend"
    self.set_param(ms_ctx_param.device_target, target)
    if self.enable_debug_runtime and target == "CPU":
        self.set_backend_policy("vm")
Then we need to add the new target in the C++ ms_context component: gitee.com/mindspore/m…
const int kGraphMode = 0;
const int kPynativeMode = 1;
const char kCPUDevice[] = "CPU";
const char kGPUDevice[] = "GPU";
const char kXPUDevice[] = "XPU";  // Add new hardware target
const char kAscendDevice[] = "Ascend";
const char kDavinciInferenceDevice[] = "AscendInference";
const char kDavinciDevice[] = "Davinci";
const char KNpuLog[] = "_npu_log";
const unsigned int MAX_CALL_DEPTH_DEFAULT = 1000;
const std::set<std::string> kTargetSet = {kCPUDevice, kGPUDevice, kXPUDevice, kAscendDevice, kDavinciDevice};
Add a new Runtime Device
The runtime device directory (gitee.com/mindspore/m…) contains components related to hardware devices: device-side address spaces, device-side memory management (allocation and reclamation), kernel runtime components, and hardware-related communication components, such as the MPI component that supports distributed communication. We first add a folder named xpu under this directory (note: modify the CMakeLists.txt to add the new folder):
Here are the three new basic components to create for the XPU accelerator:
xpu_device_address: describes the memory address information on the accelerator device side and the API for memory transfer between the host side and the device side. For example, on an NVIDIA GPU it could be a wrapper around cudaMemcpyAsync. xpu_device_address.h:
#include <string>
#include <vector>
#include "runtime/device/device_address.h"
#include "utils/shape_utils.h"

namespace mindspore {
namespace device {
namespace xpu {
class XPUDeviceAddress : public DeviceAddress {
 public:
  XPUDeviceAddress(void *ptr, size_t size) : DeviceAddress(ptr, size) {}
  XPUDeviceAddress(void *ptr, size_t size, const string &format, TypeId type_id)
      : DeviceAddress(ptr, size, format, type_id) {}
  ~XPUDeviceAddress() override = default;

  bool SyncDeviceToHost(const ShapeVector &shape, size_t size, TypeId type, void *host_ptr) const override;
  bool SyncHostToDevice(const ShapeVector &shape, size_t size, TypeId type, const void *host_ptr) const override;
  DeviceAddressType DeviceType() const override { return DeviceAddressType::kXPU; }
};
}  // namespace xpu
}  // namespace device
}  // namespace mindspore
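As a rough idea of the corresponding implementation, below is a minimal sketch of xpu_device_address.cc. The call xpu_memcpy and its direction flags are hypothetical placeholders for whatever host/device copy API the accelerator actually provides (on an NVIDIA GPU this would be the cudaMemcpyAsync wrapper mentioned above); ptr_ is the device buffer pointer held by the DeviceAddress base class.

// xpu_device_address.cc -- a minimal sketch, not the real implementation.
// xpu_memcpy / XPU_DEVICE_TO_HOST / XPU_HOST_TO_DEVICE are hypothetical vendor driver APIs.
#include "runtime/device/xpu/xpu_device_address.h"

namespace mindspore {
namespace device {
namespace xpu {
bool XPUDeviceAddress::SyncDeviceToHost(const ShapeVector &, size_t size, TypeId, void *host_ptr) const {
  if (ptr_ == nullptr || host_ptr == nullptr) {
    return false;
  }
  // Copy `size` bytes from the device buffer (ptr_) back into the host buffer.
  return xpu_memcpy(host_ptr, ptr_, size, XPU_DEVICE_TO_HOST) == 0;
}

bool XPUDeviceAddress::SyncHostToDevice(const ShapeVector &, size_t size, TypeId, const void *host_ptr) const {
  if (ptr_ == nullptr || host_ptr == nullptr) {
    return false;
  }
  // Copy `size` bytes from the host buffer into the device buffer (ptr_).
  return xpu_memcpy(ptr_, host_ptr, size, XPU_HOST_TO_DEVICE) == 0;
}
}  // namespace xpu
}  // namespace device
}  // namespace mindspore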
xpu_resource_manager: responsible for managing, allocating and scheduling memory and other resources on the device side. xpu_resource_manager.h:
#include <vector>
#include <map>
#include "backend/session/kernel_graph.h"
#include "backend/session/session_basic.h"
#include "runtime/device/device_address.h"
#include "runtime/device/xpu/xpu_simple_mem_plan.h"
namespace mindspore {
namespace device {
namespace xpu {
class XPUResourceManager {
public:
XPUResourceManager() = default;
~XPUResourceManager();
void AssignMemory(const session::KernelGraph *graph);
void IncreaseAddressRefCount(const session::KernelGraph *graph);
void DecreaseAddressRefCount(const AnfNodePtr &kernel);
void *MemMalloc(size_t mem_size);
void MemFree(void *ptr);
private:
void MemFree();
XPUSimpleMemPlan mem_plan_;
size_t mem_size_{0};
uint8_t *mem_ptr_{nullptr};
bool dynamic_malloc_{false};
std::map<void *, size_t> dynamic_mem_;
};
} // namespace xpu
} // namespace device
} // namespace mindspore
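To illustrate the idea, here is a minimal sketch of the dynamic allocation path in xpu_resource_manager.cc. The calls xpu_malloc and xpu_free are hypothetical placeholders for the accelerator's device memory allocator; the static planning path driven by XPUSimpleMemPlan is omitted.

// xpu_resource_manager.cc (fragment) -- a minimal sketch of the dynamic allocation path.
// xpu_malloc / xpu_free are hypothetical placeholders for the vendor's device allocator.
#include "runtime/device/xpu/xpu_resource_manager.h"

namespace mindspore {
namespace device {
namespace xpu {
void *XPUResourceManager::MemMalloc(size_t mem_size) {
  void *ptr = nullptr;
  if (xpu_malloc(&ptr, mem_size) != 0) {
    MS_LOG(EXCEPTION) << "Malloc device memory failed, size: " << mem_size;
  }
  // Remember the block so that MemFree(ptr) can validate and release it later.
  dynamic_mem_[ptr] = mem_size;
  return ptr;
}

void XPUResourceManager::MemFree(void *ptr) {
  auto iter = dynamic_mem_.find(ptr);
  if (iter != dynamic_mem_.end()) {
    dynamic_mem_.erase(iter);
    xpu_free(ptr);
  }
}
}  // namespace xpu
}  // namespace device
}  // namespace mindspore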
xpu_kernel_runtime: the hardware operator execution control module, mainly responsible for starting up the hardware runtime (Init()), executing the network on the hardware (Run(..)), and releasing device resources (ReleaseDeviceRes()). xpu_kernel_runtime.h:
#include <memory>
#include <vector>
#include <string>
#include <map>
#include <set>
#include "runtime/device/kernel_runtime.h"
#include "runtime/device/kernel_runtime_manager.h"
#include "backend/session/kernel_graph.h"
#include "backend/session/session_basic.h"
#include "runtime/device/xpu/xpu_resource_manager.h"
#include "backend/session/anf_runtime_algorithm.h"
#include "utils/any.h"
namespace mindspore {
namespace device {
namespace xpu {
class XPUKernelRuntime : public KernelRuntime {
public:
XPUKernelRuntime() = default;
~XPUKernelRuntime() override = default;
bool Init() override;
void ReleaseDeviceRes() override;
bool Run(session::KernelGraph *graph, bool is_task_sink) override;
void AssignKernelAddress(session::KernelGraph *kernel_graph);
void CreateOutputTensors(session::KernelGraph *kernel_graph, const std::vector<tensor::TensorPtr> &inputs,
VectorRef *outputs);
void BindInputOutput(session::KernelGraph *kernel_graph, const std::vector<tensor::TensorPtr> &inputs,
VectorRef *outputs);
protected:
bool SyncStream() override { return true; };
DeviceAddressPtr CreateDeviceAddress(void *device_ptr, size_t device_size, const string &format,
TypeId type_id) override;
private:
XPUResourceManager resource_manager_;
std::set<DeviceAddressPtr> bound_addresses_;
std::map<AnfNodePtr, tensor::TensorPtr> input_param_tensor_map_;
};
MS_REG_KERNEL_RUNTIME(kXPUDevice, XPUKernelRuntime);
} // namespace xpu
} // namespace device
} // namespace mindspore
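The skeleton below sketches what Init() and Run() might look like. xpu_init() is a hypothetical placeholder for the accelerator's driver/runtime initialization; the gathering of each kernel's input/workspace/output device addresses is omitted here (the CPU kernel runtime is a complete reference for that part).

// xpu_kernel_runtime.cc (fragment) -- a minimal sketch, assuming a hypothetical xpu_init() driver call.
#include "runtime/device/xpu/xpu_kernel_runtime.h"

namespace mindspore {
namespace device {
namespace xpu {
bool XPUKernelRuntime::Init() {
  // Bring up the device: create contexts, streams, etc.
  if (xpu_init() != 0) {
    MS_LOG(ERROR) << "Init XPU runtime failed.";
    return false;
  }
  return true;
}

void XPUKernelRuntime::ReleaseDeviceRes() {
  // Free the device memory held by resource_manager_ and shut the device down here.
}

bool XPUKernelRuntime::Run(session::KernelGraph *graph, bool /*is_task_sink*/) {
  MS_EXCEPTION_IF_NULL(graph);
  // Launch every kernel of the graph in topological order.
  for (const auto &kernel : graph->execution_order()) {
    auto kernel_mod = AnfAlgo::GetKernelMod(kernel);
    MS_EXCEPTION_IF_NULL(kernel_mod);
    std::vector<kernel::AddressPtr> inputs, workspaces, outputs;
    // Filling inputs/workspaces/outputs with the device addresses assigned in
    // AssignKernelAddress(..) is omitted in this sketch.
    if (!kernel_mod->Launch(inputs, workspaces, outputs, nullptr)) {
      MS_LOG(ERROR) << "Launch kernel failed: " << kernel->fullname_with_scope();
      return false;
    }
  }
  return true;
}
}  // namespace xpu
}  // namespace device
}  // namespace mindspore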
Add a new Target session
MindSpore's session provides an environment for op kernel execution and tensor evaluation. The session is the core module that controls the data flow graph representing the neural network. It carries out three main steps: graph compilation (kernel generation), graph optimization, and graph execution. MindSpore has a session component for each supported backend hardware platform; the related code can be found in the backend/session directory: gitee.com/mindspore/m…
We create a new session class for XPU: xpu_session.h
#include <string>
#include <memory>
#include <map>
#include <vector>
#include "backend/session/session_basic.h"
#include "backend/session/kernel_graph.h"
#include "runtime/device/xpu/xpu_kernel_runtime.h"  // use the new xpu kernel runtime
#include "backend/session/session_factory.h"

namespace mindspore {
namespace session {
class XPUSession : public SessionBasic {
 public:
  XPUSession() = default;
  ~XPUSession() override = default;
  void Init(uint32_t device_id) override { InitExecutor(kXPUDevice, device_id); }

  GraphId CompileGraphImpl(const AnfNodePtrList &lst, const AnfNodePtrList &outputs) override;
  void RunGraphImpl(const GraphId &graph_id, const std::vector<tensor::TensorPtr> &inputs, VectorRef *outputs) override;
  void Optimize(const std::shared_ptr<KernelGraph> &kernel_graph);

 protected:
  void UnifyMindIR(const KernelGraphPtr &graph) override { return; }
  void CreateOutputTensors(const GraphId &graph_id, const std::vector<tensor::TensorPtr> &input_tensors, VectorRef *,
                           std::map<tensor::TensorPtr, session::KernelWithIndex> *tensor_to_node) override;

 private:
  void SetKernelInfo(const KernelGraph *kernel_graph);
  void BuildKernel(const KernelGraph *kernel_graph);
  device::xpu::XPUKernelRuntime *runtime_ = dynamic_cast<device::xpu::XPUKernelRuntime *>(
    device::KernelRuntimeManager::Instance().GetKernelRuntime(kXPUDevice, 0));
};
MS_REG_SESSION(kXPUDevice, XPUSession);
}  // namespace session
}  // namespace mindspore
In the graph compilation step (CompileGraphImpl(..)), we mainly need to build (BuildKernel(..)) the kernel corresponding to each node op in the neural network data flow graph, and save the kernel information of each node in the graph (SetKernelInfo(..)) so that it can be used later in the graph execution step (RunGraphImpl(..)).
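A minimal sketch of these two entry points in xpu_session.cc might look like the following; the helper calls mirror the pattern of the existing CPU session, and details such as graph optimization passes and the tensor-to-node bookkeeping are omitted.

// xpu_session.cc (fragment) -- a minimal sketch following the CPU session's pattern.
#include "backend/session/xpu_session.h"

namespace mindspore {
namespace session {
GraphId XPUSession::CompileGraphImpl(const AnfNodePtrList &lst, const AnfNodePtrList &outputs) {
  auto graph_id = graph_sum_;
  auto graph = ConstructKernelGraph(lst, outputs);  // build the kernel graph from the ANF node list
  MS_EXCEPTION_IF_NULL(graph);
  SetKernelInfo(graph.get());  // pick a kernel (device, format, dtype) for every node
  Optimize(graph);             // hardware-specific graph optimization passes
  BuildKernel(graph.get());    // instantiate each kernel object; InitKernel(..) runs here
  runtime_->AssignKernelAddress(graph.get());  // plan and assign device memory (could also be deferred to the first run)
  return graph_id;
}

void XPUSession::RunGraphImpl(const GraphId &graph_id, const std::vector<tensor::TensorPtr> &inputs,
                              VectorRef *outputs) {
  auto kernel_graph = GetGraph(graph_id);
  MS_EXCEPTION_IF_NULL(kernel_graph);
  runtime_->CreateOutputTensors(kernel_graph.get(), inputs, outputs);  // create the output tensor objects
  runtime_->BindInputOutput(kernel_graph.get(), inputs, outputs);      // copy input data in, bind output buffers
  runtime_->Run(kernel_graph.get(), false);                            // launch every kernel; LaunchKernel(..) runs here
}
}  // namespace session
}  // namespace mindspore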
Add kernels for the new hardware
The operator kernels MindSpore supports for each hardware backend can be found in the backend/kernel_compiler directory: gitee.com/mindspore/m…
Here we can see that each folder represents a different kernel type for the supported hardware backends:
cpu: operators that call MKLDNN (oneDNN), as well as operators written in pure C++.
gpu: operators that call cuDNN/cuBLAS, operators written in CUDA, and NCCL operators for distributed training.
ascend: operators related to Huawei's Da Vinci AI chips, organized in folders such as tbe, aicpu, akg, hccl, etc.
To introduce the components needed to add kernel support for our new hardware backend, first create a folder named xpu in the directory above (again, note the CMakeLists.txt change to add the folder). In the new folder we first create the base class for XPU kernels: xpu_kernel.h:
#include <string>
#include <vector>
#include <memory>
#include <numeric>
#include <functional>
#include "backend/kernel_compiler/kernel.h"
#include "ir/anf.h"
#include "backend/session/anf_runtime_algorithm.h"
#include "utils/ms_utils.h"
using mindspore::kernel::Address;
using mindspore::kernel::AddressPtr;
namespace mindspore {
namespace kernel {
class XPUKernel : public kernel::KernelMod {
public:
XPUKernel() = default;
~XPUKernel() override = default;
void Init(const CNodePtr &kernel_node);
virtual void InitKernel(const CNodePtr &kernel_node) = 0;
bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
const std::vector<AddressPtr> &outputs, void * stream_ptr) override {
return Launch(inputs, workspace, outputs);
};
virtual bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
const std::vector<AddressPtr> &outputs) = 0;
const std::vector<size_t> &GetInputSizeList() const override { return input_size_list_; }
const std::vector<size_t> &GetOutputSizeList() const override { return output_size_list_; }
const std::vector<size_t> &GetWorkspaceSizeList() const override { return workspace_size_list_; }
void SetOpName(const std::string &op_name) { op_name_ = op_name; }
const std::string GetOpName() const { return op_name_; }
protected:
virtual void InitInputOutputSize(const CNodePtr &kernel_node);
std::vector<size_t> input_size_list_ = {};
std::vector<size_t> output_size_list_ = {};
std::vector<size_t> workspace_size_list_ = {};
std::string bin_path_ = {};
std::string tilingName_ = {};
};
} // namespace kernel
} // namespace mindspore
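For reference, the base class's default InitInputOutputSize(..) can simply derive the input/output byte sizes from each node's device shapes, roughly as in the sketch below (float32 elements are assumed for brevity; a real implementation would look the element size up from the dtype).

// xpu_kernel.cc (fragment) -- a rough sketch of the default size computation, assuming float32 elements.
#include "backend/kernel_compiler/xpu/xpu_kernel.h"

namespace mindspore {
namespace kernel {
void XPUKernel::InitInputOutputSize(const CNodePtr &kernel_node) {
  size_t input_num = AnfAlgo::GetInputTensorNum(kernel_node);
  for (size_t i = 0; i < input_num; ++i) {
    std::vector<size_t> shape = AnfAlgo::GetInputDeviceShape(kernel_node, i);
    // bytes = product of dims * element size
    size_t size = std::accumulate(shape.begin(), shape.end(), sizeof(float), std::multiplies<size_t>());
    input_size_list_.push_back(size);
  }
  size_t output_num = AnfAlgo::GetOutputTensorNum(kernel_node);
  for (size_t i = 0; i < output_num; ++i) {
    std::vector<size_t> shape = AnfAlgo::GetOutputDeviceShape(kernel_node, i);
    size_t size = std::accumulate(shape.begin(), shape.end(), sizeof(float), std::multiplies<size_t>());
    output_size_list_.push_back(size);
  }
}
}  // namespace kernel
}  // namespace mindspore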
Currently, popular frameworks generally organize operator kernels by operator name (opcode), with one kernel per op; see for example MindSpore's MKLDNN CPU kernels in the mindspore/mindspore repo. The advantage of this form is that the code files in the repo are very clear and the specific attributes of each operator are easy to express; the disadvantage is that there may be some duplicated code logic. Because the use case in this article is very simple, we only need to support two operators, MatMul and BiasAdd, and we will instead implement kernel classes named by the number of tensor inputs and outputs.
Since both MatMul and BiasAdd are two-input, one-output operators, we define our kernel class in two_in_one_out_xpu_kernel.h:
#include "backend/kernel_compiler/xpu/xpu_kernel.h" // xpu kernel base class #include "backend/kernel_compiler/xpu/xpu_kernel_factory.h" #include <stdio.h> #include <limits.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <dirent.h> #include <algorithm> #include <fstream> #include <iostream> namespace mindspore { namespace kernel { class TwoInOneOutXPUKernel : public XPUKernel { public: TwoInOneOutXPUKernel() = default; ~TwoInOneOutXPUKernel() override = default; void InitKernel(const CNodePtr &kernel_node) override; bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace, const std::vector<AddressPtr> &outputs) override; private: bool NeedsFormatTransformation(); char trans_a_{TRANSPOSE_NO}; char trans_b_{TRANSPOSE_NO}; int32_t dim_m_{0}; int32_t dim_n_{0}; int32_t dim_k_{0}; std::vector<size_t> inputA_shape_; std::vector<size_t> inputB_shape_; std::vector<size_t> output_shape_; size_t input_a_size_ = 0; size_t input_b_size_ = 0; size_t output_size_ = 0; void *inputA_data_ = nullptr; void *inputB_data_ = nullptr; void *output_data_ = nullptr; }; MS_REG_XPU_KERNEL( TwoInOneOutXPU, mindspore::device::xpu::KernelAttr().AddInputAttr(kNumberTypeFloat32).AddInputAttr(kNumberTypeFloat32).AddOutputAttr(kNu mberTypeFloat32), TwoInOneOutXPUKernel); } // namespace kernel } // namespace mindsporeCopy the code
"backend/kernel_compiler/xpu/xpu_kernel_factory.h" is where we create the kernel factory class; for details, refer to cpu_kernel_factory.h: gitee.com/mindspore/m…
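If you do not want to copy the full CPU factory, a stripped-down sketch of what xpu_kernel_factory.h needs to provide is shown below. XPUKernelFactory, XPUKernelRegistrar and the simplified MS_REG_XPU_KERNEL macro are names assumed for this example; the real cpu_kernel_factory.h additionally records the KernelAttr (input/output dtypes and formats) so that the right kernel variant can be matched at SetKernelInfo time.

// xpu_kernel_factory.h -- a stripped-down sketch modeled on cpu_kernel_factory.h.
// The KernelAttr argument is ignored here; a real factory keeps it for dtype/format matching.
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include "backend/kernel_compiler/xpu/xpu_kernel.h"

namespace mindspore {
namespace kernel {
using XPUKernelCreator = std::function<std::shared_ptr<XPUKernel>()>;

class XPUKernelFactory {
 public:
  static XPUKernelFactory &GetInstance() {
    static XPUKernelFactory instance;
    return instance;
  }
  // Remember how to build a kernel object for a given op name.
  void Register(const std::string &op_name, XPUKernelCreator &&creator) {
    creators_.emplace(op_name, std::move(creator));
  }
  std::shared_ptr<XPUKernel> Create(const std::string &op_name) {
    auto iter = creators_.find(op_name);
    return iter == creators_.end() ? nullptr : (iter->second)();
  }

 private:
  std::map<std::string, XPUKernelCreator> creators_;
};

class XPUKernelRegistrar {
 public:
  XPUKernelRegistrar(const std::string &op_name, XPUKernelCreator &&creator) {
    XPUKernelFactory::GetInstance().Register(op_name, std::move(creator));
  }
};

#define MS_REG_XPU_KERNEL(OPNAME, ATTR, OPCLASS)               \
  static const XPUKernelRegistrar g_##OPNAME##_xpu_kernel_reg( \
    #OPNAME, []() { return std::make_shared<OPCLASS>(); });
}  // namespace kernel
}  // namespace mindspore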
The two basic functions of each kernel are InitKernel(..) and LaunchKernel(..), responsible for kernel initialization and execution respectively. Note that for a typical static graph such as a CNN, InitKernel(..) runs only once, when the kernel is created (during the session's CompileGraph step above), while LaunchKernel(..) is called every time the graph executes. For example, to run inference on 64 images with a network batch size of 32, the graph needs to be executed twice; that is, for each kernel, InitKernel(..) is called once and LaunchKernel(..) is called twice.
We will not go into the implementation details of the MatMul and BiasAdd kernels here, but instead introduce some basic MindSpore APIs needed when writing an operator kernel:
Obtain the input and output shape information of TwoInOneOutXPUKernel:
inputA_shape_ = AnfAlgo::GetInputDeviceShape(kernel_node, 0);
inputB_shape_ = AnfAlgo::GetInputDeviceShape(kernel_node, 1);
output_shape_ = AnfAlgo::GetOutputDeviceShape(kernel_node, 0);
Get operator attribute information, e.g. the transpose flags of MatMul:
bool trans_a = AnfAlgo::GetNodeAttr<bool>(kernel_node, TRANSPOSE_A);
bool trans_b = AnfAlgo::GetNodeAttr<bool>(kernel_node, TRANSPOSE_B);
Get the input and output memory pointers in Launch:
auto input_a = reinterpret_cast<float *>(inputs[0]->addr);
auto input_b = reinterpret_cast<float *>(inputs[1]->addr);
auto output = reinterpret_cast<float *>(outputs[0]->addr);
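Although this article does not walk through the real MatMul/BiasAdd kernel implementations, the sketch below shows a naive host-side reference Launch(..) for TwoInOneOutXPUKernel, purely for illustration: it assumes dim_m_/dim_n_/dim_k_ were filled in InitKernel(..) from the shape information shown above, handles only float32, and ignores the transpose attributes. On a real accelerator the loops would be replaced by binding the buffers and calling the vendor's GEMM and bias-add routines.

// A naive host-side reference, for illustration only (float32 only, no transpose handling).
// dim_m_, dim_n_, dim_k_ are assumed to have been set in InitKernel(..) from the shapes above.
bool TwoInOneOutXPUKernel::Launch(const std::vector<AddressPtr> &inputs,
                                  const std::vector<AddressPtr> & /*workspace*/,
                                  const std::vector<AddressPtr> &outputs) {
  if (GetOpName() == "MatMul") {
    auto input_a = reinterpret_cast<float *>(inputs[0]->addr);  // [M, K]
    auto input_b = reinterpret_cast<float *>(inputs[1]->addr);  // [K, N]
    auto output = reinterpret_cast<float *>(outputs[0]->addr);  // [M, N]
    for (int32_t m = 0; m < dim_m_; ++m) {
      for (int32_t n = 0; n < dim_n_; ++n) {
        float acc = 0.0f;
        for (int32_t k = 0; k < dim_k_; ++k) {
          acc += input_a[m * dim_k_ + k] * input_b[k * dim_n_ + n];
        }
        output[m * dim_n_ + n] = acc;
      }
    }
    return true;
  }
  if (GetOpName() == "BiasAdd") {
    auto input = reinterpret_cast<float *>(inputs[0]->addr);  // [M, N]
    auto bias = reinterpret_cast<float *>(inputs[1]->addr);   // [N]
    auto output = reinterpret_cast<float *>(outputs[0]->addr);
    for (int32_t m = 0; m < dim_m_; ++m) {
      for (int32_t n = 0; n < dim_n_; ++n) {
        output[m * dim_n_ + n] = input[m * dim_n_ + n] + bias[n];
      }
    }
    return true;
  }
  return false;
}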
Other points to note
Like other mainstream frameworks, MindSpore has its own set of standards and conventions. Here are some of the pitfalls the author has run into:
The default tensor format in MindSpore is NCHW. If the hardware backend you are adding supports other formats, be careful to add format conversions. The conversion can be done before and after each kernel invocation (low efficiency), or a graph optimization pass can insert format conversion nodes efficiently from the perspective of the whole network.
Precision conversion: if your hardware platform only supports certain precisions, such as FP16, while the network is FP32, pay attention to precision conversion. Like the format conversion above, precision conversion can be done on the host side or (if the hardware supports it) on the device side.
In the code logic of each kernel, distinguish which data stays unchanged and which changes and must be re-initialized before each execution, so that the different pieces of logic are placed correctly into InitKernel(..) or LaunchKernel(..).
MindSpore has its own conventions for some of the Python front-end layer APIs. For example, in the Dense layer (gitee.com/mindspore/m…), the second input matrix is transposed:
self.matmul = P.MatMul(transpose_b=True)
self.batch_matmul = P.BatchMatMul(transpose_b=True)
self.activation = get_activation(activation) if isinstance(activation, str) else activation
if activation is not None and not isinstance(self.activation, (Cell, Primitive)):
raise TypeError("The activation must be str or Cell or Primitive,"" but got {}.".format(activation))
self.activation_flag = self.activation is not None
For debugging, you can set the following environment variables to get more log output:
export GLOG_v=1
export SLOG_PRINT_TO_STDOUT=1
For changes to the CMake files, when first testing you can simply add the new files under if (ENABLE_CPU). CPU acts as a baseline platform for MindSpore: whether you build for GPU or for Huawei's D/Ascend target, the CPU-related files are always built.
Conclusion
This article describes how to modify the MindSpore source code to add a new hardware backend, based on the author's own understanding of MindSpore. The success of an open-source software framework depends on the support of the community and the participation of many vendors. I hope this article serves as a catalyst for more hardware vendors and developers to participate in the ecological development of MindSpore. You are also welcome to share criticism and discuss together!