One last burst of enthusiasm before graduation!

This article is a quick-start guide to the things you need to know when using OpenVINO. On Intel x86 CPUs, OpenVINO has a big advantage over TVM and LibTorch; it is fair to say that no other framework matches OpenVINO's inference speed on x86. OpenVINO genuinely surprised me in my own tests and is worth a try. In addition, Intel is actively developing OpenVINO (as you can tell from the release cadence), and there are related events and contests you can take part in.

See the Neural Compute Stick in the lower right corner of the picture? (Thanks to the friend who provided it.)

What is OpenVINO?

Similar to TensorRT, OpenVINO is a set of deep learning tools developed by a hardware vendor for its own hardware platforms, including an inference library, a model optimizer, and a series of other functions related to deploying deep learning models. Put more simply: if you want to deploy a deep learning model on an Intel CPU or Intel embedded device and would otherwise reach for LibTorch or TVM, consider OpenVINO, which is specifically optimized for several generations of Intel CPUs and other Intel hardware.

Official documentation: docs.openvinotoolkit.org/latest/inde…

Looking at the workflow, it is similar to TensorRT and other deployment tools: train the model, parse it into OpenVINO-specific .xml and .bin files, and then pass those to the Inference Engine for inference. Also like TensorRT, the inference engine itself is closed source: you are given the .so files and you call them.

Installation

Installing it by following the official process is simple and quick; what you get is a pre-compiled library plus headers, documentation, and tools. I have installed it on both Ubuntu and macOS: download the installer, run it, and you are done. On Windows it can be a bit more of a hassle.

The official instructions may look terse, but they actually cover many of the details for you, so it is worth reading the official documentation carefully.

This tutorial assumes that OpenVINO is installed correctly, the environment variables are set, and the official verification samples run fine.

Development

Remember to enable the environment variables first, or make them global (making them global may conflict with other installed libraries, such as OpenCV). OpenVINO's environment setup is convenient: once the variables are set, CMake can locate the OpenVINO packages that your project depends on.

Activation command:

source /opt/intel/openvino/bin/setupvars.sh

In my case I develop on Ubuntu but do not want to pollute the global environment, so I simply do the following:

prototype@prototype-X299-UD4-Pro:~/Downloads/clion-2018.3/bin$ source /opt/intel/openvino/bin/setupvars.sh
[setupvars.sh] OpenVINO environment initialized
prototype@prototype-X299-UD4-Pro:~/Downloads/clion-2018.3/bin$ sh clion.sh

That is, I launch CLion from the same shell in which the OpenVINO environment was sourced. In other words, if you want to debug OpenVINO code from an IDE, CLion is recommended; and since OpenVINO needs quite a few environment variables, launching the IDE this way lets you avoid making them global (which could conflict with your other variables).

When OpenVino is installed, it already comes with many common libraries, such as OpenCV, which is specially compiled and optimized for Intel processors and has better ability to process video and image streams.

Advantages

Having used TVM, TensorRT, LibTorch, TFLite, and a series of other desktop- and mobile-side inference libraries (NCNN, MNN, TNN, etc.), I have found that it is genuinely hard to unify neural network inference: every vendor wants to do its own thing. There are a lot of these libraries, but the principles and usage of most of them are similar (TVM being the exception, since it is a neural network compiler with automatic search-based optimization). Of course, some large companies also develop their own inference libraries without open-sourcing them.

In my opinion, OpenVINO is a relatively mature and rapidly developing inference library. It provides plenty of demos and samples, is easy to get started with, can be used for rapid deployment and development, and its performance on Intel hardware exceeds that of most open-source libraries.

I have also tried TVM's CPU optimization. In my own tests OpenVINO was faster than TVM, though this is by no means a rigorous comparison, and it may simply be that I am not very familiar with tuning TVM on my own CPU.

The Neural Compute Stick

The Neural Compute Stick (Intel Neural Compute Stick 2, built around a Myriad X VPU) is an accelerator-like device and one of the hardware platforms OpenVINO supports. It supports slightly fewer operators than the CPU does, but enough for most models. You can test the stick's performance directly with the official samples and the official benchmark: without modifying any code, plug it into USB 3.0 and add -d MYRIAD to the command line, as shown below.
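For example, the official benchmark_app tool can target the stick or the CPU just by changing the -d flag; a sketch of the two invocations (the model path is illustrative, and the HRNet IR is only generated later in this article):

./benchmark_app -m pose_hrnet_w32_256x192.xml -d MYRIAD   # run on the Neural Compute Stick 2
./benchmark_app -m pose_hrnet_w32_256x192.xml -d CPU      # same model on the host CPU for comparison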

After a simple performance test, the compute stick's speed on HRNet-w32-256x192 was close to that of the i5-7360U in my MacBook Pro 2017: in the benchmark the stick reached 18 fps, while the MacBook only managed around 14 fps (both deployed with OpenVINO). Even the official multi-person human-pose demo ran at 10 fps, and PoseNet-224x224 hit 99 fps.

HRNet source code and weights used: github.com/stefanopini… , pose_hrnet_w32_256x192.pth

Overall, the computing power of this little stick exceeded my expectations.

A practical example

Here is a simple end-to-end example, using the human pose estimation model HRNet with the pose_hrnet_w32_256x192.pth weights. For comparison, after optimizing this model with TVM at opt_level=3, it runs in about 62 ms per inference on an Intel® Core™ i7-7800X CPU @ 3.50GHz × 12 (using two cores).

Export the ONNX model

We take pose_hrnet_w32_256x192.pth and export it as an ONNX model.

from SimpleHRNet import SimpleHRNet
import torch

# Load the HRNet-w32 model (17 keypoints, 256x192 input) with the pretrained weights
model = SimpleHRNet(
    32, 17, 'scripts/weights/pose_hrnet_w32_256x192.pth',
    model_name='HRNet',
    resolution=(256, 192),
    multiperson=False,
    return_bounding_boxes=False,
    max_batch_size=1,
)

# SimpleHRNet wraps the actual nn.Module; grab it for export
model = model.model

# Dummy input with the model's expected NCHW shape
example = torch.rand(1, 3, 256, 192)

torch.onnx.export(model,
                  example,
                  "scripts/weights/pose_hrnet_w32_256x192.onnx",
                  verbose=True,
                  export_params=True,
                  opset_version=11
                  )

Note that HRNet involves a large number of upsample operations, and ONNX only gained relatively complete support for them in recent opsets, so you need to set opset_version=11 when exporting; otherwise the export fails.

Convert the ONNX model to IR

OpenVINO reads models in its IR format (.xml and .bin), so the next step is to convert the .onnx model (every inference tool ships a front end for parsing the various model formats).

Start by installing the libraries required for the transformation model according to the official tutorial.

The conversion script is the Python file mo.py under /opt/intel/openvino/deployment_tools/model_optimizer/ (adjust the path to your own installation). Enter that directory and run:

python3 mo.py --input_model <INPUT_MODEL>.onnx

It looks like we are ready to convert.

Unfortunately, things are obviously not that simple.

If we try to convert the pose_hrnet_w32_256x192.onnx from the previous step, an error is reported: OpenVINO's ONNX front end does not support the opset 11 Resize (upsample) operation, so it cannot infer the shapes before and after that node (generally speaking, the front-end parser needs to deduce the output shape of each node before it can proceed to the next one):

Related questions: software.intel.com/en-us/forum…

So what can we do? There is no official support so far, so I wrote a simple temporary workaround. Because OpenVINO's model-parsing front end is open source, we can modify it directly. The conversion code lives under /opt/intel/openvino/deployment_tools/model_optimizer; reading it, you can see that OpenVINO's conversion first parses the .onnx model and records its parameters, then replaces the ops with OpenVINO's own representation.

Since the problem is the Resize operator, the relevant conversion code is /opt/intel/openvino/deployment_tools/model_optimizer/extensions/ops/upsample.py. The shape-inference code sits in the upsample_infer method; in other words, OpenVINO's parser cannot deduce the dimensions before and after this model's Resize operators, namely out_height and out_width. Since the code cannot deduce them, we deduce them ourselves.

There are several ways to get these values. You can run the model's forward pass directly in PyTorch and observe the output dimensions of every resize, or obtain them from another platform's front-end parser. I used TVM's ONNX front end, which does support the opset 11 Resize/Upsample operator, to read off the dimensions of each Resize node.
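For the PyTorch route, here is a minimal sketch (reusing the model object from the ONNX export step above, and assuming the upsampling is done through nn.Upsample modules, as in the official HRNet code); it prints the output shape of every upsample during one forward pass, which gives the out_height/out_width values we need. Matching these shapes to the Resize_NNN node names still requires checking the exported ONNX graph, since those names are assigned during export.

import torch

# `model` is the HRNet nn.Module prepared in the export step above
def print_upsample_shapes(model, example):
    hooks = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Upsample):
            hooks.append(module.register_forward_hook(
                lambda m, inp, out, name=name: print(name, tuple(out.shape))))
    with torch.no_grad():
        model(example)
    for h in hooks:
        h.remove()

print_upsample_shapes(model, torch.rand(1, 3, 256, 192))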

As a result, the upsample_infer method in upsample.py was changed to:

@staticmethod
def upsample_infer(node: Node):
    layout = node.graph.graph['layout']
    assert len(layout) == 4

    input_shape = node.in_node(0).shape
    if input_shape is None:
        return
    temp_name = node.soft_get('name')

    # Hard-coded output sizes for this particular model's Resize nodes
    if temp_name in ['Resize_331', 'Resize_526', 'Resize_721', 'Resize_916',
                     'Resize_1174', 'Resize_1206', 'Resize_1512', 'Resize_1544']:
        out_height, out_width = 32, 24
    elif temp_name in ['Resize_1247', 'Resize_1585']:
        out_height, out_width = 16, 12
    else:
        out_height, out_width = 64, 48

    node['height_scale'] = out_height / input_shape[2]
    node['width_scale'] = out_width / input_shape[3]

    assert node.has('width_scale') and node.has('height_scale')
    node.out_node().shape = shape_for_layout(layout,
                                             batch=input_shape[get_batch_dim(layout, 4)],
                                             features=input_shape[get_features_dim(layout, 4)],
                                             height=out_height,
                                             width=out_width)

Here Resize_331 and so on are the nodes in the model that perform upsampling; the dimensions obtained in the previous step are simply hard-coded via if-else. This is a throwaway hack that only fits this particular model.

Another thing to note: in the replacement step, the OpenVINO front end substitutes its own operator structure for the parsed parameters. That code is in /opt/intel/openvino/deployment_tools/model_optimizer/extensions/middle/UpsampleToResample.py, where height_scale and width_scale are picked up as follows:

    ...
    height_scale = scales[2]
    width_scale = scales[3]
    if len(scales) == 5:
        depth_scale = scales[4]
else:
    height_scale = upsample['height_scale']
    width_scale = upsample['height_scale']

So in the previous step we need to compute height_scale and width_scale ourselves and assign them as attributes on the Node object:

node['height_scale'] = out_height/input_shape[2]
node['width_scale'] = out_width / input_shape[3]

With that, the operator that could not be converted is taken care of.

Conversion output:

Model Optimizer arguments:
Common parameters:
    - Path to the Input Model:  /home/prototype/Desktop/Deep-Learning/Pytorch-Learn/tvm_code/weights/pose_hrnet_w32_256x192.onnx
    - Path for generated IR:    /opt/intel/openvino_2020.2.120/deployment_tools/model_optimizer/.
    - IR output name:           pose_hrnet_w32_256x192
    - Log level:                ERROR
    - Batch:                    Not specified, inherited from the model
    - Input layers:             Not specified, inherited from the model
    - Output layers:            Not specified, inherited from the model
    - Input shapes:             Not specified, inherited from the model
    - Mean values:              [0.485,0.456,0.406]
    - Scale values:             [0.229,0.224,0.225]
    - Scale factor:             Not specified
    - Precision of IR:          FP32
    - Enable fusing:            True
    - Enable grouped convolutions fusing:   True
    - Move mean values to preprocess section:       False
    - Reverse input channels:   False
ONNX specific parameters:
Model Optimizer version:        2020.2.0-60-g0bc66e26ff

[ SUCCESS ] Generated IR version 10 model.
[ SUCCESS ] XML file: /opt/intel/openvino_2020.2.120/deployment_tools/model_optimizer/./pose_hrnet_w32_256x192.xml
[ SUCCESS ] BIN file: /opt/intel/openvino_2020.2.120/deployment_tools/model_optimizer/./pose_hrnet_w32_256x192.bin
[ SUCCESS ] Total execution time: 52.49 seconds.
[ SUCCESS ] Memory consumed: 1693 MB.

Another thing to note: so that the model normalizes the input image itself at inference time, I passed --mean_values [0.485,0.456,0.406] and --scale_values [0.229,0.224,0.225] to the converter.

With these options, the converted model expects input data in the 0-1 range with RGB channel order (since the ONNX model's input order is RGB), and the input images must be prepared accordingly later on.
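Putting the options together, the conversion command I ended up running looked roughly like this (run from the model_optimizer directory; the .onnx path is abbreviated, and --reverse_input_channels is the alternative if you would rather keep feeding BGR frames and let the converter swap the channels for you):

python3 mo.py --input_model pose_hrnet_w32_256x192.onnx \
              --mean_values [0.485,0.456,0.406] \
              --scale_values [0.229,0.224,0.225]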

Inference

The inference process follows the usual steps: load the model, set up the inputs and outputs, and run. The official flow chart is fairly self-explanatory. The inference code in this section is very close to the official example, with some modifications for the different model.

The code can be adapted directly from the official demos; here I use human_pose_estimation_demo as the base. I recommend going through that official example first, since the rest of the deployment code builds on it. HRNet is a top-down pose estimator, while the official demo is built around OpenPose, a bottom-up one; but since both are pose-estimation examples it is relatively easy to modify.

Initialize the Core

The first step is to initialize the Core. The official human_pose_estimation_demo has a HumanPoseEstimator class whose private inference-engine members are:

InferenceEngine::Core ie;
std::string targetDeviceName;
InferenceEngine::CNNNetwork network;
InferenceEngine::ExecutableNetwork executableNetwork;
InferenceEngine::InferRequest::Ptr requestNext;
InferenceEngine::InferRequest::Ptr requestCurr;

In the constructor, we first read the model's .xml and .bin (only the .xml path is needed; the .bin path is derived from it), then check that the model's input and output dimensions are as expected (throwing an error if not), set the input and output data types, and finally load the network onto the target device with executableNetwork = ie.LoadNetwork(network, targetDeviceName) and create the inference requests:

network = ie.ReadNetwork(modelPath);

const auto& inputInfo = network.getInputsInfo();
if (inputInfo.size() != 1) {
    throw std::runtime_error(modelPath + ": expected to have 1 input");
}

const auto& imageInputInfo = *inputInfo.begin();
const auto& imageInputDims = imageInputInfo.second->getTensorDesc().getDims();
if (imageInputDims.size() != 4 || imageInputDims[0] != 1 || imageInputDims[1] != 3) {
    throw std::runtime_error(
        modelPath + ": expected \"" + imageInputInfo.first + "\" to have dimensions 1x3xHxW");
}

inputLayerSize = cv::Size(imageInputDims[3], imageInputDims[2]);
// input needs to be FP32
imageInputInfo.second->setPrecision(InferenceEngine::Precision::FP32);
imageInputInfo.second->setLayout(InferenceEngine::Layout::NCHW);

InferenceEngine::OutputsDataMap outputInfo = network.getOutputsInfo();
// there is only one output in HRNet
auto outputIt = outputInfo.begin();
const auto& resOutputInfo = *outputIt++;

resBlobName = resOutputInfo.first;
auto output_data = resOutputInfo.second;
output_data->setPrecision(InferenceEngine::Precision::FP32);

const auto& resOutputDims = resOutputInfo.second->getTensorDesc().getDims();
if (resOutputDims.size() != 4 || resOutputDims[0] != 1
        || resOutputDims[1] != keypointsNumber) {
    throw std::runtime_error(
        modelPath + ": expected \"" + resBlobName + "\" to have dimensions "
            "1x" + std::to_string(keypointsNumber) + "xHFMxWFM");
}

executableNetwork = ie.LoadNetwork(network, targetDeviceName);
requestNext = executableNetwork.CreateInferRequestPtr();
requestCurr = executableNetwork.CreateInferRequestPtr();

Once the model is loaded, we need to read an image and fill the model's input. The main work is converting the video frame read by OpenCV into the format the inference engine expects; the steps are similar to those in TVM, LibTorch, and TensorRT:

CV_Assert(image.type() == CV_8UC3);
// Get the buffer address of the model's input blob; the preprocessed data is written there
InferenceEngine::Blob::Ptr input = requestNext->GetBlob(network.getInputsInfo().begin()->first);
auto buffer = input->buffer().as<InferenceEngine::PrecisionTrait<InferenceEngine::Precision::FP32>::value_type *>();
cv::Mat resizedImage;

// The model expects RGB input, so swap the channel order and divide by 255 to normalize to 0-1
cv::resize(image, resizedImage, cv::Size(inputLayerSize.width, inputLayerSize.height), 0, 0, cv::INTER_CUBIC);
cv::cvtColor(resizedImage, resizedImage, cv::COLOR_BGR2RGB);
cv::Mat tensor;
resizedImage.convertTo(tensor, CV_32FC3, 1.0 / 255);

// Point each plane at the blob's buffer, then split the HWC tensor into three
// CHW planes; cv::split writes the input data straight into the buffer.
std::vector<cv::Mat> planes(3);
for (size_t pId = 0; pId < planes.size(); pId++) {
    planes[pId] = cv::Mat(inputLayerSize, CV_32FC1, buffer + pId * inputLayerSize.area());
}
cv::split(tensor, planes);

This copies the frame data into the inference engine's input buffer (see the code above), after which inference can run. There are two ways to run it, synchronous and asynchronous, which is where I think OpenVINO differs slightly from other frameworks' inference flow: users can rely on OpenVINO's built-in asynchronous mode to improve the overall inference FPS.

startCurr() and startNext() are member functions that simply call requestCurr->StartAsync() and requestNext->StartAsync() to kick off inference, and readyCurr() checks whether the current request has finished. Together they cover both modes: in synchronous mode, startCurr() is called and readyCurr() is polled inside the while(true) loop until the result comes back; in asynchronous mode, while waiting for the current result you can already call startNext() to start inference on the next frame, then collect each result in turn.

while (true) {
    ...
    if (isAsyncMode) {
        if (isModeChanged) {
            estimator.startCurr();
        }
        if (!isLastFrame) {
            estimator.startNext();
        }
    } else if (!isModeChanged) {
        estimator.startCurr();
    }

    if (estimator.readyCurr()) {
        poses = estimator.postprocessCurr();
        std::cout << "pose get!" << std::endl;
    }
    ...
}

Specific definitions of relevant member functions are shown:

void HumanPoseEstimator::startCurr() {
    requestCurr->StartAsync();
}

void HumanPoseEstimator::startNext() {
    requestNext->StartAsync();
}

bool HumanPoseEstimator::readyCurr() {
    if (InferenceEngine::OK == requestCurr->Wait(InferenceEngine::IInferRequest::WaitMode::RESULT_READY)) {
        return true;
    } else {
        return false;
    }
}

Internally, scheduling is handled through OpenVINO's use of TBB, a multi-threaded task scheduling library developed by Intel that distributes work across threads quickly and safely; this is what makes the asynchronous mode described above work.

A detailed introduction: www.edge-ai-vision.com/2020/03/max…

Asynchrony and synchronization

One noteworthy feature of OpenVINO is this built-in synchronous/asynchronous mechanism. There are two calls, inferRequest->Infer() and inferRequest->StartAsync(). Infer() starts inference but blocks the main thread, i.e. you must wait for the model to finish before doing the next step; StartAsync() returns immediately and the code keeps running, because the inference work is moved to another thread, so the main thread is not blocked and the program does not stall here.
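The same two modes exist in the Python API, which may make the difference easier to see at a glance. A minimal sketch (API names as in the 2020.x openvino.inference_engine Python module; the model path and the input-name lookup are assumptions for illustration):

from openvino.inference_engine import IECore
import numpy as np

ie = IECore()
net = ie.read_network(model="pose_hrnet_w32_256x192.xml",
                      weights="pose_hrnet_w32_256x192.bin")
input_name = next(iter(net.inputs))        # newer releases expose this as net.input_info
output_name = next(iter(net.outputs))
exec_net = ie.load_network(network=net, device_name="CPU", num_requests=2)

frame = np.random.rand(1, 3, 256, 192).astype(np.float32)   # stand-in for a preprocessed frame

# Synchronous: infer() blocks the caller until the result is ready
res = exec_net.infer(inputs={input_name: frame})

# Asynchronous: start_async() returns immediately; collect the result later with wait()
exec_net.start_async(request_id=0, inputs={input_name: frame})
# ... preprocess the next frame, render the previous one, etc. ...
if exec_net.requests[0].wait(-1) == 0:     # 0 means StatusCode.OK
    res = exec_net.requests[0].outputs[output_name]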

Earlier OpenVINO releases used OpenMP for multi-threading, and the official docs also discuss some of the small issues with OpenMP; if you want to dig deeper, have a look here: docs.openvinotoolkit.org/latest/_doc…

Getting the results

Finally, the inference results are fetched with requestCurr->GetBlob() and post-processed:

std::vector<HumanPose> HumanPoseEstimator::postprocessCurr() {
    InferenceEngine::Blob::Ptr pafsBlob = requestCurr->GetBlob(pafsBlobName);
    InferenceEngine::Blob::Ptr heatMapsBlob = requestCurr->GetBlob(heatmapsBlobName);
    InferenceEngine::SizeVector heatMapDims = heatMapsBlob->getTensorDesc().getDims();
    std::vector<HumanPose> poses = postprocess(
            heatMapsBlob->buffer(),
            heatMapDims[2] * heatMapDims[3],
            keypointsNumber,
            pafsBlob->buffer(),
            heatMapDims[2] * heatMapDims[3],
            pafsBlob->getTensorDesc().getDims()[1],
            heatMapDims[3], heatMapDims[2], imageSize);

    return poses;
}

If the overall deployment flow is still unclear, refer to the official deployment tutorial: docs.openvinotoolkit.org/latest/_doc…

Optimization steps

For how to get the best performance out of a model, see the official documentation: docs.openvinotoolkit.org/latest/_doc…

The official demos report three timings on the output:

  • OpenCV Cap/Render Time: the time spent capturing frames and rendering the results with OpenCV during the demo.
  • Detection Time: the actual model inference time.
  • Wallclock time: the total end-to-end time, i.e. inference time plus capture/render time.

Another example (generating a .so)

This is a quick walkthrough of wrapping the OpenVINO inference code into a dynamic library (.so) that can be called from Python to return results. We often need some front-end code to wrap the back-end inference code, and this is how most applications are structured.

Taking the official human_pose_estimation_demo as the starting point, we mark a few functions for export:

#ifndef DFROBOT_2D_POSE_C_API_H
#define DFROBOT_2D_POSE_C_API_H

#define EXPORT_DLL __attribute__((visibility("default")))

struct Points{

    float data_x[18];
    float data_y[18];
    float score;

};

extern "C" {

    EXPORT_DLL int runInference();
    EXPORT_DLL bool isReady();
    EXPORT_DLL Points getResult();

}

Then modify the CMake file: change add_executable(${IE_SAMPLE_NAME} ${IE_SAMPLE_SOURCES} ${IE_SAMPLE_HEADERS}) to add_library(${IE_SAMPLE_NAME} SHARED ${LIBSOURCES}) so that a .so is generated instead of an executable. One pitfall: if the estimator (and thus the InferenceEngine::Core) is a global variable, it is initialized the moment the .so is loaded by the external code, i.e. executableNetwork = ie.LoadNetwork(network, targetDeviceName) runs at load time. With the model targeted at the CPU, the code hung at that point, probably because of a conflict with OpenVINO's threading:

...
using namespace InferenceEngine;
using namespace human_pose_estimation;

// Core is initialized in the constructor when this global object is defined, which causes the problem
HumanPoseEstimator estimator("human-pose-estimation-0001.xml", "CPU", false);
std::vector<HumanPose> poses;

bool poseReady = false;

bool isReady()
{
    return poseReady;
}

int runInference() {
    try {

        cv::VideoCapture cap;

        cap.open("action_demo.mp4");

        // read input (video) frame
        cv::Mat curr_frame; cap >> curr_frame;
        cv::Mat next_frame;

        if (!cap.grab()) {
            throw std::logic_error("Failed to get frame from cv::VideoCapture");
        }
        ...

There are two solutions:

  • Switch the model execution environment to GPU or MYRIAD(Compute stick);
  • Make the global object a static local object inside a function, so it is initialized only once, the first time that function is called.

This makes it easy to package OpenVINO inference as a .so.
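To show the Python side, the exported functions can be called through ctypes; a rough sketch (the library file name depends on your CMake target and is assumed here, and since runInference() as written above runs its own video loop, a real application would probably call it from a separate thread):

import ctypes

class Points(ctypes.Structure):
    # mirrors the Points struct declared in the header above
    _fields_ = [("data_x", ctypes.c_float * 18),
                ("data_y", ctypes.c_float * 18),
                ("score", ctypes.c_float)]

lib = ctypes.CDLL("./libhuman_pose_estimation_demo.so")   # assumed output name
lib.runInference.restype = ctypes.c_int
lib.isReady.restype = ctypes.c_bool
lib.getResult.restype = Points

lib.runInference()                  # runs the inference loop defined in the .so
if lib.isReady():
    pose = lib.getResult()
    keypoints = [(pose.data_x[i], pose.data_y[i]) for i in range(18)]
    print(keypoints, pose.score)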

Some summary notes and pitfalls

As a tool for Intel-CPU deployment, OpenVINO is similar in spirit to most other inference frameworks and, like TensorRT, comes with rich samples and documentation (TVM and other community-driven open-source frameworks have less), so it is relatively easy to get started with.

The advantages are obvious: model optimization on Intel CPUs is very good, most models convert in one go, ONNX conversion rarely fails, multiple cores are fully utilized, and the number of threads can be set flexibly. For CPU deployment, OpenVINO is my first recommendation.

The disadvantage is that the front ends for converting models from other frameworks are not perfect yet. That is normal: new operators keep appearing and Intel's engineers need time to catch up. If you are in a hurry, you can modify the official conversion source code (Python) yourself, which is fairly easy.

Operator support for each framework's models: docs.openvinotoolkit.org/latest/_doc…

Operator support for each device: docs.openvinotoolkit.org/latest/_doc…

Model conversion pitfalls

As shown in the walkthrough above, the model converter does not support every op, so we either wait for official support or add the conversion code for the missing operators ourselves. The conversion code is written in Python, so it is relatively easy to modify.

Numerical range of the Openvino model

In general, PyTorch and TensorFlow pre-trained models expect input values in the 0-1 range, but the official OpenVINO example models expect 0-255, because most of them were converted from Caffe.

Input channel BGR

OpenVINO's default input channel order is BGR (like OpenCV). However, the model we convert (e.g. an ONNX model exported from PyTorch) usually expects RGB. The official converter does not swap the channels for you by default; you need to pass --reverse_input_channels during conversion (or, as I did above, convert BGR to RGB yourself during preprocessing).

In short, the channel order of the input image should be consistent with that of the model.

ReShape

This is a handy feature of OpenVINO: the input spatial dimensions can be changed after the model has already been converted. It only works, however, if the model is not too complex and contains no Resize ops; in my tests it did not work for HRNet.

docs.openvinotoolkit.org/latest/_doc…
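For illustration, reshaping through the Python API looks roughly like this (a sketch; the model files are placeholders, and as noted above this fails for HRNet because of its Resize ops):

from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")

input_name = next(iter(net.inputs))            # net.input_info on newer releases
net.reshape({input_name: (1, 3, 384, 288)})    # new NCHW input shape
exec_net = ie.load_network(network=net, device_name="CPU")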

Afterword

It feels meaningful to write this article right before graduation, in the last stretch of my graduate-school life and at the end of my last small project. With a lot to sort out after graduation, I wrote it in a bit of a hurry and not as carefully as I would like, but I still wanted to write down what I had figured out, and I hope it is of some help to you. Finally, I hope my future work goes smoothly and that I can find time amid the busyness to slow down and enjoy life.

Get in touch

If you are like-minded, Lao Pan is happy to chat; if you like Lao Pan's content, please follow and support. The blog publishes one in-depth original article every week; follow the public account "Oldpan Blog" so you don't miss the latest posts. Lao Pan has also organized some of his private collection of resources; reply "888" on the public account to get Lao Pan's learning roadmap and article index, with more waiting for you to dig into. If you don't want to miss Lao Pan's latest posts, check out the mystery link.