
Welcome to follow my public account [Jizhi Vision]; reply 001 to get the Google programming style guide.

Hi, I'm Jizhi. This article walks through how the Tengine TensorRT backend is organized.

So let’s start.

1. Attaching the backend

First, find a TRT backend example in the examples directory of the Tengine project; here we take tm_classification_trt for illustration.

In the main() of tm_classification_trt.cpp, the classification function is called as follows, with the algorithm's configuration parameters passed in:

if (tengine_classify(model_file, image_file, img_h, img_w, mean, scale, loop_count, num_thread, cpu_affinity) < 0)
    return -1;

Step into tengine_classify:

int tengine_classify(const char* model_file, const char* image_file, int img_h, int img_w, const float* mean, const float* scale, int loop_count, int num_thread, int affinity)
{
    /* set runtime options */
    struct options opt;
    opt.num_thread = num_thread;
    opt.cluster = TENGINE_CLUSTER_ALL;
    opt.precision = TENGINE_MODE_FP32;
    opt.affinity = affinity;

    /* inital tengine */
    if (init_tengine() != 0)
    {
        fprintf(stderr, "Initial tengine failed.\n");
        return -1;
    }
    fprintf(stderr, "tengine-lite library version: %s\n", get_tengine_version());

    /* create NVIDIA TensorRT backend */
    context_t trt_context = create_context("trt", 1);
    int rtt = add_context_device(trt_context, "TensorRT");
    if (0 > rtt)
    {
        fprintf(stderr, "add_context_device NV TensorRT DEVICE failed.\n");
        return -1;
    }

    /* create graph, load tengine model xxx.tmfile */
    graph_t graph = create_graph(trt_context, "tengine", model_file);
    if (NULL == graph)
    {
        fprintf(stderr, "Create graph failed.\n");
        return -1;
    }
    ...
}

Focus on how the TRT backend is attached in the code above:

/* create NVIDIA TensorRT backend */
context_t trt_context = create_context("trt", 1);           // Create the context for the TRT backend
int rtt = add_context_device(trt_context, "TensorRT");      // Attach the TRT backend device

The backend-attach logic of add_context_device is implemented as follows:

int add_context_device(context_t context, const char* dev_name)
{
    struct context* ctx = (struct context*)context;
    if (NULL == ctx)
    {
        TLOG_ERR("Tengine: Context pointer is null.\n");
        return -1;
    }

    if (NULL != ctx->device)
    {
        TLOG_ERR("Tengine: Context(%s) is not multi-device collaborative.\n", ctx->name);
        return -1;
    }

    struct device* selected_device = find_device_via_name(dev_name);       // match "TensorRT"
    if (NULL == selected_device)
    {
        TLOG_ERR("Tengine: Device(%s) is not found(may not registered).\n", dev_name);
        return -1;
    }

    ctx->device = selected_device;
    return 0;
}

The pointer to the device struct whose name matches "TensorRT" is assigned to ctx->device, which completes the attachment of the TensorRT backend. This set of interfaces is generally shared across the different backends.
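For intuition, find_device_via_name can be thought of as a simple name lookup over the devices that each backend registers at init time. The sketch below is hypothetical: the g_device_registry / g_device_count names are illustrative only, not Tengine's actual internals, and it uses the device struct shown in the next section.

#include <stddef.h>
#include <string.h>

/* Hypothetical registry; real Tengine tracks its registered devices differently. */
static struct device* g_device_registry[16];
static int g_device_count = 0;

struct device* find_device_via_name_sketch(const char* dev_name)
{
    for (int i = 0; i < g_device_count; i++)
    {
        /* Compare against each registered device's name ("TensorRT", "CPU", ...). */
        if (0 == strcmp(g_device_registry[i]->name, dev_name))
            return g_device_registry[i];
    }
    return NULL;  /* not found: the backend was probably not compiled in or registered */
}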

2. From backend attachment to the TensorRT implementation

Now for the TRT-specific part. Through the device struct obtained above, the framework links to the TRT device backend implementation. Let's look at the definition of the device struct:

typedef struct device
{
    const char* name;
    struct interface* interface; //!< device scheduler operation interface
    struct allocator* allocator; //!< device allocation operation interface
    struct optimizer* optimizer; //!< device optimizer operation interface
    struct scheduler* scheduler; //!< device scheduler
    void* privacy;               //!< device privacy data
} ir_device_t;

This struct is hooked up to the TRT backend implementation in trt_device.cc with the following code:

static struct trt_device nv_trt_dev = {
        .base = {
                .name       = TRT_DEVICE_NAME,
                .interface  = &trt_interface,
                .allocator  = &trt_allocator,
                .optimizer  = &trt_optimizer,
                .scheduler  = nullptr,
                .privacy    = nullptr,
        },
};

Here we mainly look at the interface member, which covers the implementation of the network operators and the organization of the network structure:

static struct interface trt_interface = {
        .init           = trt_dev_init,
        .pre_run        = trt_dev_prerun,
        .run            = trt_dev_run,
        .post_run       = trt_dev_postrun,
        .async_run      = nullptr,
        .async_wait     = nullptr,
        .release_graph  = nullptr,
        .release_device = trt_dev_release,
};
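For context, the scheduler side of the framework drives a subgraph through these hooks in a fixed order. The sketch below is simplified and not Tengine's actual scheduler code; the run/post_run signatures are assumed to mirror the pre_run signature of trt_dev_prerun shown next.

static int run_subgraph_on_device(struct device* dev, struct subgraph* subgraph, void* options)
{
    /* Assumed signatures; only pre_run's is confirmed by trt_dev_prerun below. */
    int ret = dev->interface->pre_run(dev, subgraph, options);  /* trt_dev_prerun: build the TRT engine */
    if (0 != ret)
        return ret;

    ret = dev->interface->run(dev, subgraph);                   /* trt_dev_run: launch inference */
    if (0 != ret)
        return ret;

    return dev->interface->post_run(dev, subgraph);             /* trt_dev_postrun: release resources */
}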

Look at prerun, where the TRT network execution structure is built:

int trt_dev_prerun(struct device* dev, struct subgraph* subgraph, void* options)
{
    subgraph->device_graph = new TensorRTEngine;                    // Build the TRT network execution structure
    auto engine = (TensorRTEngine*)subgraph->device_graph;
    ir_graph_t* graph = subgraph->graph;

    if (nullptr != options)
    {
        struct trt_option* opt = (struct trt_option*)options;
        engine->SetOption(opt);
        return engine->PreRun(subgraph, opt);
    }
    else
    {
        return engine->PreRun(subgraph, nullptr);
    }
}

The key line is:

subgraph->device_graph = new TensorRTEngine;

The TRT network execution structure is constructed when the TensorRTEngine class is instantiated. The class is defined as follows:

class TensorRTEngine
{
public:
    TensorRTEngine();
    ~TensorRTEngine() = default;
    int PreRun(struct subgraph* subgraph, struct trt_option* opt);
    int Run(struct subgraph* subgraph);
    int PoseRun(struct subgraph* subgraph);
    void SetOption(trt_opt_t* opt);

private:
    int Build(struct subgraph* subgraph);
    void SetRange(struct graph* ir_graph, uint16_t id, nvinfer1::ITensor* trt_tensor);
    void SetRange(struct tensor* ir_tensor, nvinfer1::ITensor* trt_tensor);
    bool check_if_input_in_map(uint16_t& id, std::map<uint16_t, uint16_t>& map);
    int get_type(int mode, nvinfer1::DataType& type);

private:
    size_t card_id;
    uint16_t tensor_swap_count;
    std::map<uint16_t, nvinfer1::ITensor*> tensor_real_map;
    std::map<uint16_t, uint16_t> tensor_swap_map;
    std::map<uint16_t, nvinfer1::ILayer*> layer_map;
    std::vector<void*> io_tensors;
    std::vector<void*> host_buffer;
    nvinfer1::DataType precision;

private:
    trt_opt_t option;

private:
    bool AddTensor(struct graph* ir_graph, struct tensor* ir_tensor);
    bool AddAbsVal(struct graph* ir_graph, struct node* node);
    bool AddAddN(struct graph* ir_graph, struct node* node);
    bool AddBatchNormNode(struct graph* ir_graph, struct node* node);
    bool AddConcatNode(struct graph* ir_graph, struct node* node);
    bool AddConvolutionNode(struct graph* ir_graph, struct node* node);
    bool AddDeConvolutionNode(struct graph* ir_graph, struct node* node);
    bool AddCropNode(struct graph* ir_graph, struct node* node);
    bool AddDropoutNode(struct graph* ir_graph, struct node* node);
    bool AddEltwiseLayer(struct graph* ir_graph, struct node* node);
    bool AddFlattenNode(struct graph* ir_graph, struct node* node);
    bool AddFullyConnectedNode(struct graph* ir_graph, struct node* node);
    bool AddHardSwishNode(struct graph* ir_graph, struct node* node);
    bool AddInstanceNormNode(struct graph* ir_graph, struct node* node);
    bool AddInterpNode(struct graph* ir_graph, struct node* node);
    bool AddMishNode(struct graph* ir_graph, struct node* node);
    bool AddPadNode(struct graph* ir_graph, struct node* node);
    bool AddPermuteNode(struct graph* ir_graph, struct node* node);
    bool AddPoolingNode(struct graph* ir_graph, struct node* node);
    bool addReLUNode(struct graph* ir_graph, struct node* node);
    bool AddReductionNode(struct graph* ir_graph, struct node* node);
    bool AddReshapeNode(struct graph* ir_graph, struct node* node);
    bool AddResizeNode(struct graph* ir_graph, struct node* node);
    bool AddTanhNode(struct graph* ir_graph, struct node* node);
    bool AddTranspose(struct graph* ir_graph, struct node* node);
    bool AddSliceNode(struct graph* ir_graph, struct node* node);
    bool AddSoftmaxNode(struct graph* ir_graph, struct node* node);
    bool AddSplitNode(struct graph* ir_graph, struct node* node);
    bool AddSqueezeNode(struct graph* ir_graph, struct node* node);
    bool AddUpSampleNode(struct graph* ir_graph, struct node* node);

private:
    nvinfer1::IBuilder* builder;
    nvinfer1::INetworkDefinition* network;
    nvinfer1::IBuilderConfig* config;
    nvinfer1::ICudaEngine* engine;
    nvinfer1::IExecutionContext* context;
};
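The builder / network / config / engine / context members at the bottom are the standard TensorRT building blocks. As a reminder of how such objects are typically created, here is a generic TensorRT sketch (standard TensorRT 7.x/8.x API usage, not a copy of the Tengine code; the TrtLogger name is illustrative):

#include <NvInfer.h>
#include <cstdio>

class TrtLogger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::fprintf(stderr, "TensorRT: %s\n", msg);
    }
};

static TrtLogger g_logger;

void create_trt_objects_sketch()
{
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(g_logger);
    const auto flag = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(flag);
    nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
    // The Add*Node methods then populate `network` layer by layer, and the
    // engine/context are built from it afterwards (see PreRun below).
}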

Now look at TensorRTEngine::Build, which constructs the TRT network from the Tengine subgraph: it first adds the network's input/variable tensors, then adds the operator implementations node by node, and finally marks the network outputs:

int TensorRTEngine::Build(struct subgraph* subgraph)
{
    const auto cuda_status = cudaSetDevice(this->option.gpu_index);

    struct graph* ir_graph = subgraph->graph;

    for (uint16_t i = 0; i < subgraph->node_num; i++)
    {
        uint16_t node_id = subgraph->node_list[i];
        auto ir_node = get_ir_graph_node(ir_graph, node_id);

        // Add network data
        for (uint8_t j = 0; j < ir_node->input_num; j++)
        {
            struct tensor* ir_tensor = get_ir_graph_tensor(ir_graph, ir_node->input_tensors[j]);
            if (TENSOR_TYPE_INPUT == ir_tensor->tensor_type || TENSOR_TYPE_VAR == ir_tensor->tensor_type)
            {
                if (!AddTensor(ir_graph, ir_tensor))
                {
                    TLOG_ERR("Tengine: Cannot add input tensor(id: %d, name: %s) from node(id: %d, name: %s).\n", ir_tensor->index, ir_tensor->name, ir_node->index, ir_node->name);
                    return -5;
                }
            }
        }
    }

    for (uint16_t i = 0; i < subgraph->node_num; i++)
    {
        uint16_t node_id = subgraph->node_list[i];
        auto ir_node = get_ir_graph_node(ir_graph, node_id);
        auto op_type = ir_node->op.type;

        // Add network operator implementation
        switch (op_type)
        {
            case OP_ABSVAL:
                if (!AddAbsVal(ir_graph, ir_node))
                {
                    TLOG_ERR("Tengine: Cannot add AbsVal op(%d).\n", ir_node->index);
                    return -6;
                }
                break;
            case OP_ADD_N:
                if (!AddAddN(ir_graph, ir_node))
                {
                    TLOG_ERR("Tengine: Cannot add AddN op(%d).\n", ir_node->index);
                    return -6;
                }
                break;
            case OP_BATCHNORM:
                if (!AddBatchNormNode(ir_graph, ir_node))
                {
                    TLOG_ERR("Tengine: Cannot add BatchNorm op(%d).\n", ir_node->index);
                    return -6;
                }
                break;
            case OP_CONST:
                continue;
            case OP_CONCAT:
                if (!AddConcatNode(ir_graph, ir_node))
                {
                    TLOG_ERR("Tengine: Cannot add Concat op(%d).\n", ir_node->index);
                    return -6;
                }
                break;
            case OP_CONV:
            {
                if (!AddConvolutionNode(ir_graph, ir_node))
                {
                    TLOG_ERR("Tengine: Cannot add Convolution op(%d).\n", ir_node->index);
                    return -6;
                }
                break;
            }
            ...
        }
    }

    // Mark the network outputs
    for (uint8_t i = 0; i < subgraph->output_num; i++)
    {
        struct tensor* output_tensor = get_ir_graph_tensor(ir_graph, subgraph->output_tensor_list[i]);
        uint16_t output_node_id = output_tensor->producer;
        nvinfer1::ILayer* layer = layer_map[output_node_id];
        layer->setPrecision(nvinfer1::DataType::kFLOAT);
        for (int j = 0; j < layer->getNbOutputs(); j++)
        {
            layer->setOutputType(j, nvinfer1::DataType::kFLOAT);
        }

        //layer->getOutput(i)->setName(output_tensor->name);
        auto trt_tensor = this->tensor_real_map[this->tensor_swap_map[output_tensor->index]];
        trt_tensor->setName(output_tensor->name);
        this->network->markOutput(*trt_tensor);
    }
}
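To make the Add*Node calls above concrete, here is a simplified sketch of what such a method plausibly looks like, using ReLU as the example. This is not the real Tengine addReLUNode (which, among other things, maintains the tensor_swap_count bookkeeping); it only illustrates the pattern: fetch the input ITensor, add the matching TRT layer, and register the output.

// Simplified, illustrative sketch only -- not the actual Tengine code.
bool TensorRTEngine::addReLUNode(struct graph* ir_graph, struct node* node)
{
    struct tensor* input_tensor  = get_ir_graph_tensor(ir_graph, node->input_tensors[0]);
    struct tensor* output_tensor = get_ir_graph_tensor(ir_graph, node->output_tensors[0]);

    // The ITensor produced earlier (by AddTensor or by a previous layer).
    nvinfer1::ITensor* trt_input = tensor_real_map[tensor_swap_map[input_tensor->index]];

    // Map the Tengine ReLU node onto a TensorRT activation layer.
    nvinfer1::IActivationLayer* layer = this->network->addActivation(*trt_input, nvinfer1::ActivationType::kRELU);
    if (nullptr == layer)
        return false;

    layer->setName(node->name);
    this->layer_map[node->index] = layer;

    // Register the layer output so downstream nodes can find it by tensor id.
    nvinfer1::ITensor* trt_output = layer->getOutput(0);
    tensor_real_map[output_tensor->index] = trt_output;
    tensor_swap_map[output_tensor->index] = output_tensor->index;

    return true;
}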

Then look at TensorRTEngine::PreRun, which builds the TRT inference engine.

Depending on the configuration, it may or may not go through model serialization and deserialization, but the end goal is always to create the IExecutionContext:

this->context = engine->createExecutionContext();
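For reference, the engine-build sequence that PreRun boils down to is the standard TensorRT flow sketched below (with or without the serialize/deserialize detour); this is a generic sketch, not the exact Tengine code.

// Rough sketch of the standard TensorRT engine-build sequence (not the exact
// Tengine PreRun code); precision flags and the calibrator are set on `config`
// beforehand, as in the INT8 snippet below.
this->config->setMaxWorkspaceSize(1u << 30);  // e.g. 1 GB of scratch space for tactic selection
this->engine = this->builder->buildEngineWithConfig(*this->network, *this->config);
if (nullptr == this->engine)
    return -1;

this->context = this->engine->createExecutionContext();
if (nullptr == this->context)
    return -1;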

Take a look at the INT8 quantization operation on the Tengine TRT backend:

case nvinfer1::DataType::kINT8:
{
    if (this->builder->platformHasFastInt8())
    {
        struct tensor* input = get_ir_graph_tensor(ir_graph, subgraph->input_tensor_list[0]);
        if (nullptr != input && 1 <= input->quant_param_num)
        {
            this->config->setFlag(nvinfer1::BuilderFlag::kINT8);
            this->config->setInt8Calibrator(nullptr);              // Pass in the TensorRT format INT8 calibration table
            this->precision = nvinfer1::DataType::kINT8;
        }
        else
        {
            TLOG_ERR("Tengine: Try enable INT8, but network does not have quant params, rollback to FP32.\n");
        }
    }
    else
    {
        TLOG_ERR("Tengine: Target inference precision(%d) is not supported, rollback.\n", opt->precision);
    }
    break;
}

The quantization flow above may leave you a little confused. Indeed, it only quantizes the model weights and does not calibrate the activation values: this->config->setInt8Calibrator(nullptr) is exactly where a TensorRT-format INT8 calibration table (via a calibrator) would need to be supplied.
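For completeness: to actually calibrate activations, you would implement one of TensorRT's calibrator interfaces and pass it in where the nullptr is. The skeleton below is illustrative only; IInt8EntropyCalibrator2 and its four methods are TensorRT's real interface, but the class name and the calib.table file are hypothetical.

#include <NvInfer.h>
#include <fstream>
#include <iterator>
#include <vector>

class MyEntropyCalibrator : public nvinfer1::IInt8EntropyCalibrator2
{
public:
    int getBatchSize() const noexcept override { return 1; }

    bool getBatch(void* bindings[], const char* names[], int nbBindings) noexcept override
    {
        // Copy the next calibration batch into the device buffers in `bindings`;
        // return false once all calibration images have been consumed.
        return false;
    }

    const void* readCalibrationCache(size_t& length) noexcept override
    {
        // Reuse a previously written calibration table if one exists.
        std::ifstream in("calib.table", std::ios::binary);
        if (!in) { length = 0; return nullptr; }
        cache_.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
        length = cache_.size();
        return cache_.data();
    }

    void writeCalibrationCache(const void* cache, size_t length) noexcept override
    {
        std::ofstream out("calib.table", std::ios::binary);
        out.write(static_cast<const char*>(cache), length);
    }

private:
    std::vector<char> cache_;
};

// Usage: pass a calibrator instance to config->setInt8Calibrator(...) instead of nullptr.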

Taken together, this is really a fairly direct splice of Tengine onto TensorRT:

(1) The output files of Tengine's quantization module are not well used for the TensorRT quantization hookup, even though Tengine's quantization module itself is actually quite clear;

(2) The TRT implementation inside the Tengine TRT backend is relatively self-contained, meaning you could lift the TRT backend out and get a fairly complete TRT inference project, with not much Tengine flavor to it apart from the network-loading part.


OK, that's the Tengine TensorRT backend organization process. I hope this sharing helps you a little.


[Public account link]

Model Inference: Talk about the Tengine TensorRT Backend Organization Process