
Follow my official account [Jizhi Vision] for more shared notes.

Hi, I’m Jizhi Vision. This article discusses how inference engines organize the inference process, covering NVIDIA TensorRT, Huawei CANN, and TVM.

Users and most developers rarely need to look closely at the inference engine itself. With TensorRT, for example, you only need to know the usage flow: how to generate an engine and how to call it to run inference (doInference). But have you wondered how the engine is created, how it is loaded for inference, and what an engine actually is? Unlike .pth, .cfg, and .pb files, it cannot be inspected visually with Netron. Here we discuss how TensorRT, CANN, and TVM organize inference.

1, TensorRT

TensorRT is a very useful high-performance inference framework, which can be used in the following two ways:

(1) Embed TensorRT into a mature AI framework, as in TF-TRT, Torch-TRT, ONNX-TRT, TVM-TRT, etc. Most of these integrations run the operators that TensorRT supports through TensorRT, while operators that TensorRT does not support fall back to the original framework;

(2) Use the C++ API or Python API directly to build a TensorRT inference engine. For operators that TensorRT does not support, you can assemble them from finer-grained operators or substitute an equivalent (such as replacing upsample with resize), or even write a custom operator directly in CUDA.

TensorRT’s inference process is as follows:
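The fallback behavior in way (1) can be pictured with a small sketch. This is a toy model in plain Python, not the code of any real integration: the `TRT_SUPPORTED` set, the operator names, and the `partition` helper are all hypothetical. It only shows the idea of grouping consecutive supported operators into segments for TensorRT and leaving the rest to the original framework.

```python
from itertools import groupby

# Hypothetical set of operator names the accelerated backend can run.
TRT_SUPPORTED = {"conv", "relu", "pool", "matmul"}


def partition(ops, supported=TRT_SUPPORTED):
    """Group consecutive ops into (backend, [ops]) segments.

    Runs of supported ops are assigned to "tensorrt"; every other run
    falls back to the original "framework".
    """
    segments = []
    for is_supported, run in groupby(ops, key=lambda op: op in supported):
        backend = "tensorrt" if is_supported else "framework"
        segments.append((backend, list(run)))
    return segments


# "custom_nms" is a made-up operator that the toy backend does not support.
model_ops = ["conv", "relu", "custom_nms", "matmul", "relu"]
segments = partition(model_ops)
for backend, run in segments:
    print(backend, run)
```

Each unsupported operator splits the graph into more segments, which is one reason why replacing or customizing operators, as in way (2), can be attractive.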

Network Definition is TensorRT’s representation of the model after parsing. The Builder takes the Network Definition and generates an executable for the target hardware, namely the Engine. For online inference, you can pass the Engine straight to the Runtime. In practice, we often need to save the Engine as an offline .eng model to decouple it from the Runtime, since the Runtime usually runs at the user’s site while everything before it is usually done in-house. That is where serialize and deserialize come in.

Serialize writes the Engine out as a binary .eng file, which TensorRT calls an Optimized Plan. This is the offline model.

At deployment time, as long as we have the offline model in hand, we deserialize it back into a hardware-executable Engine and then run inference.

So the whole TensorRT flow, from creating the engine (createEng) to doInference, is as follows:
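The build, serialize, deserialize, and run sequence can be sketched as a toy model in plain Python. This does not use the TensorRT API: `build_engine`, `serialize`, `deserialize`, and `do_inference` are hypothetical stand-ins, the "network" is just a list of scalar operations, and JSON stands in for the binary plan format. The point is only that everything before `serialize` can happen in-house, while the deployment side needs nothing but the serialized bytes.

```python
import json


def build_engine(network, device):
    """Toy 'Builder': turn a network definition into a device-specific plan.

    The only optimization here is fusing a ("scale", a) that is directly
    followed by a ("shift", b) into one ("scale_shift", a, b) step.
    """
    plan, i = [], 0
    while i < len(network):
        op = network[i]
        nxt = network[i + 1] if i + 1 < len(network) else None
        if op[0] == "scale" and nxt is not None and nxt[0] == "shift":
            plan.append(["scale_shift", op[1], nxt[1]])
            i += 2
        else:
            plan.append(list(op))
            i += 1
    return {"device": device, "plan": plan}


def serialize(engine):
    """Write the engine out as bytes (the 'offline model')."""
    return json.dumps(engine).encode("utf-8")


def deserialize(blob):
    """Rebuild an executable engine from the offline model."""
    return json.loads(blob.decode("utf-8"))


def do_inference(engine, x):
    """Execute the plan step by step on a scalar input."""
    for step in engine["plan"]:
        if step[0] == "scale":
            x = x * step[1]
        elif step[0] == "shift":
            x = x + step[1]
        elif step[0] == "scale_shift":
            x = x * step[1] + step[2]
        else:
            raise ValueError(f"unknown step {step[0]!r}")
    return x


# In-house side: define, build, serialize.
network = [("scale", 2.0), ("shift", 1.0), ("scale", 3.0)]
blob = serialize(build_engine(network, device="toy-gpu"))

# User side: only the bytes are needed.
engine = deserialize(blob)
result = do_inference(engine, 4.0)
print(result)
```

Note that the deserialized plan has two steps, not three: the original operators are already gone, which hints at why an engine file cannot be viewed layer by layer like the source model.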

2, CANN

You may be less familiar with Ascend CANN than with TensorRT, so first a brief introduction.

Compute Architecture for Neural Networks (CANN) is a heterogeneous computing architecture launched by Huawei for all AI scenarios. On the top it supports the mainstream front-end AI frameworks in the industry, and on the bottom it shields users from hardware differences across its chip series. With its rich software stack it enables AI applications in all scenarios (I find Huawei likes the word “enable”, so I use it too, ha ha). The architecture of CANN is as follows:

Let’s talk about how CANN creates the .om file (Ascend’s offline model).

In fact, CANN exposes much more information than TensorRT. The logical architecture of the CANN TBE software stack is as follows:

Graph Engine (GE) + Fusion Engine (FE) + Tensor Boost Engine (TBE) together play roughly the role of the Builder in TensorRT.

GE is the graph engine, FE is the fusion engine, and TBE is the tensor acceleration engine. GE mainly serves as the entry point: it parses the front-end framework model, links the back ends, and handles scheduling. The main function of FE is operator fusion and UB fusion. TBE provides operator compilation and task execution on Huawei accelerator cards. Combined with the figure above, the whole process is divided into several stages: 2~4 are operator adaptation and replacement, 5 is subgraph splitting and optimization, and 6 is scheduling and pipeline orchestration. It is not hard to see that after stage 6.1 the concept of a network layer no longer exists in the actual inference process. At that point the offline .om model is made up of one taskInfo after another, and at runtime the tasks are dispatched by reading the information in the .om file.
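To make the "no more layers, only taskInfo" point concrete, here is a minimal sketch in plain Python. It is not the real .om format or any CANN API: the `TaskInfo` fields, the `FUSABLE` rule, and `lower_to_tasks` are assumptions made up for illustration. It flattens named layers into operators, fuses adjacent pairs, and emits an ordered task list in which the layer names no longer appear.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInfo:
    """Hypothetical task record: which kernel to launch, on which stream."""
    task_id: int
    kernel: str
    stream: int


# Hypothetical fusion rule: these adjacent operator pairs become one kernel.
FUSABLE = {("conv", "relu"): "conv_relu"}


def lower_to_tasks(layers):
    """Flatten (layer_name, [ops]) pairs into an ordered list of TaskInfo."""
    # Drop the layer boundaries and keep only the operator sequence.
    ops = [op for _, layer_ops in layers for op in layer_ops]

    # Fuse adjacent operator pairs that match a fusion rule.
    kernels, i = [], 0
    while i < len(ops):
        pair = tuple(ops[i:i + 2])
        if pair in FUSABLE:
            kernels.append(FUSABLE[pair])
            i += 2
        else:
            kernels.append(ops[i])
            i += 1

    # Schedule everything on a single stream, in order.
    return [TaskInfo(task_id=n, kernel=k, stream=0) for n, k in enumerate(kernels)]


layers = [
    ("backbone.block1", ["conv", "relu"]),
    ("head.fc", ["matmul", "softmax"]),
]
tasks = lower_to_tasks(layers)
for task in tasks:
    print(task)
```

The runtime in this picture only walks the task list; names such as `backbone.block1` survive nowhere in the output.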

3, TVM

As is well known, Ascend CANN builds on TVM, so CANN’s model compilation process is similar to TVM’s. TVM’s model compilation process is as follows; its offline model is produced after relay.build -> Graph Optimize.

The overall TVM compilation flow is as follows:

TVM uses relay.frontend.from_<framework> (for example relay.frontend.from_onnx) to convert an AI framework model into Relay IR and performs graph optimization on the Relay IR. Finally comes codegen, which mainly handles memory allocation and code generation for the specified hardware target. The result is serialized as TVM’s offline model (.json and .params).
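The split between a .json file and a .params file can be illustrated with a toy sketch in plain Python. This is not TVM’s actual file format and does not call TVM: `save_params`, `load_params`, and the node layout are hypothetical. It only shows the idea of storing the graph structure as readable JSON and the weights as a separate binary blob.

```python
import json
import struct


def save_params(weights):
    """Pack {name: [floats]} into bytes: a JSON index, then raw float32 data."""
    index = [(name, len(vals)) for name, vals in weights.items()]
    header = json.dumps(index).encode("utf-8")
    body = b"".join(struct.pack(f"<{len(v)}f", *v) for v in weights.values())
    # A 4-byte little-endian length prefix tells the loader where the index ends.
    return struct.pack("<I", len(header)) + header + body


def load_params(blob):
    """Inverse of save_params."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    index = json.loads(blob[4:4 + header_len].decode("utf-8"))
    offset, weights = 4 + header_len, {}
    for name, count in index:
        weights[name] = list(struct.unpack_from(f"<{count}f", blob, offset))
        offset += 4 * count
    return weights


# Graph structure: which nodes exist and how they are wired (made-up layout).
nodes = [
    {"op": "input", "name": "x"},
    {"op": "dense", "name": "fc", "inputs": ["x", "fc_w"]},
]
# Values chosen to be exactly representable in float32.
weights = {"fc_w": [0.5, -1.25, 2.0]}

graph_json = json.dumps({"nodes": nodes})   # contents of the toy .json
params_blob = save_params(weights)          # contents of the toy .params

restored_graph = json.loads(graph_json)
restored_weights = load_params(params_blob)
print(restored_graph["nodes"][1]["name"], restored_weights)
```

Keeping the two apart means the graph can be inspected as text while the usually much larger weights stay in a compact binary form.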

Above we discussed how TensorRT, CANN, and TVM organize inference. If anything is wrong, corrections and discussion are welcome. I hope this post helps your learning.


【Model Inference】 On how inference engines organize the inference process