Contents

One, environment installation

1.1 Basic Environment Introduction

1.2 Serialize the PTH model and export to PT

1.3 Download libtorch

1.4 Install OpenCV

1.5 Create a Win32 C++ console project

Two, the complete inference code

Three, testing


One, environment installation

1.1 Basic Environment Introduction

The experimental environment in this article is Windows 10, with Visual Studio 2019 as the IDE. The end result is a Win32 C++ program that batch-processes portrait photos for matting without any user intervention. The focus here is deploying a PTH model trained with PyTorch so that it can be called from C++ without a Python runtime, i.e. a true production-level deployment. See my other blog tutorials 1 and 2 for details on how to train a portrait matting model with PyTorch.

1.2 Serialize the PTH model and export to PT

After training, you have a PTH model. The following Python code serializes and exports it (converting the dynamic graph into a static graph):

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
@file: pth2pt.py
"""
import torch
import torch.backends.cudnn as cudnn
import numpy as np
from torch import nn
from model.SHM import SHM
from utils import *
import time
import cv2

# test image
img_id = '16'
imgPath = './results/' + img_id + '.png'

# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cpu")

if __name__ == '__main__':
    # load the trained PTH checkpoint
    checkpoint = "./results/shm_temp.pth"
    checkpoint = torch.load(checkpoint)
    model = SHM()
    model = model.to(device)
    model.load_state_dict(checkpoint['model'])
    model.eval()

    # read and preprocess the test image
    img_org = cv2.imread(imgPath, cv2.IMREAD_COLOR)
    width = img_org.shape[1]
    height = img_org.shape[0]
    img = cv2.resize(img_org, (320, 320), interpolation=cv2.INTER_CUBIC)
    img = (img.astype(np.float32) - (114., 121., 134.,)) / 255.0
    h, w, c = img.shape
    img = torch.from_numpy(img.transpose((2, 0, 1))).view(c, h, w).float()
    img = img.view(1, 3, h, w)
    img = img.to(device)

    # model inference and serialized export
    with torch.no_grad():
        alpha = model(img)
        alpha = alpha.squeeze(0).float().mul(255).add_(0.5).clamp_(0, 255).permute(1, 2, 0).to('cpu', torch.uint8).numpy()
        traced_script_module = torch.jit.trace(model, img)
        traced_script_module.save("./results/matting.pt")

1.3 Download libtorch

Download libtorch from the official website. Since we need the GPU for inference later, download the CUDA build of libtorch. Pay particular attention to version consistency: whichever CUDA version is installed on the final inference machine is the CUDA version of libtorch you need to download. This article uses the CUDA 10.1 build. The details are shown in the figure below:
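Beyond matching the versions at download time, a quick sanity check after the project is set up (see 1.5) is a tiny program that only asks libtorch whether it can see the GPU. This is just a verification sketch, not part of the matting program:

#include <torch/torch.h>
#include <iostream>

int main()
{
    // If this prints "no" on a machine with a CUDA-capable GPU, re-check the
    // CUDA/libtorch version match and the /INCLUDE linker workaround in 1.5.
    std::cout << "CUDA available: " << (torch::cuda::is_available() ? "yes" : "no") << std::endl;
    std::cout << "CUDA device count: " << torch::cuda::device_count() << std::endl;
    return 0;
}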

1.4 Install OpenCV

Since we need image loading and other image operations, the open-source framework OpenCV is used here. See my other blog post for installation details. This article uses OpenCV 4.2.
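Once OpenCV is installed and linked into the project (see the configuration in 1.5), a few lines are enough to confirm the linkage; test.jpg below is just a placeholder file name:

#include <opencv2/opencv.hpp>
#include <iostream>

int main()
{
    // CV_VERSION comes from the OpenCV headers, e.g. "4.2.0"
    std::cout << "OpenCV version: " << CV_VERSION << std::endl;
    cv::Mat img = cv::imread("test.jpg");
    std::cout << (img.empty() ? "failed to load test.jpg" : "image loaded") << std::endl;
    return 0;
}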

1.5 Create a Win32 C++ console project

This article uses VS2019 to create a C++ project, as follows:

After creating the project, switch the platform to x64:

Then the project configuration is as follows:

(1) Property configuration — VC++ directories

Here I placed the downloaded libtorch inside the project directory. Readers can follow the paths shown above, adjusting them to their own locations.
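Since the screenshot is not reproduced here, a typical set of include directories, assuming libtorch was unpacked into the project root and OpenCV was installed under D:\opencv (both paths are assumptions to be adapted), looks like:

$(ProjectDir)libtorch\include
$(ProjectDir)libtorch\include\torch\csrc\api\include
D:\opencv\build\include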

(2) Property configuration — library directories

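Likewise, an assumed example of the library directories (again, adapt the paths to your own install locations):

$(ProjectDir)libtorch\lib
D:\opencv\build\x64\vc15\lib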
(3) Linker — input

Add the following lib files:

opencv_world420.lib
asmjit.lib
c10.lib
c10_cuda.lib
caffe2_detectron_ops_gpu.lib
caffe2_module_test_dynamic.lib
caffe2_nvrtc.lib
clog.lib
cpuinfo.lib
dnnl.lib
fbgemm.lib
gloo.lib
gloo_cuda.lib
libprotobuf.lib
libprotobuf-lite.lib
libprotoc.lib
mkldnn.lib
torch.lib
torch_cpu.lib
torch_cuda.lib

The list above includes every lib file that ships with libtorch, plus the OpenCV lib. Strictly speaking, not all of the libtorch libs are needed, but to avoid missing-symbol problems they are all listed here. Then copy all the DLLs from libtorch into the project root directory, as shown below:

Last but not least, there is an issue in older versions of VS that requires a manual addition before our code can actually use the GPU. Go to Linker — Command Line — Additional Options and add: /INCLUDE:?warp_size@cuda@at@@YAHXZ

The details are shown in the figure below:

If we skip this step, the code will not be able to use the GPU later.
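If you prefer to keep this workaround in the source code rather than in the project settings, MSVC's linker comment pragma can pass the same option. This is an equivalent alternative, not something required in addition to the setting above:

// Force the linker to keep this CUDA-related symbol so the GPU path is not
// stripped; equivalent to /INCLUDE:?warp_size@cuda@at@@YAHXZ on the linker command line.
#pragma comment(linker, "/INCLUDE:?warp_size@cuda@at@@YAHXZ")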

Two, the complete inference code

With the configuration complete, we can write the code. The complete program is shown below; it implements end-to-end matting:

/************************************************************************
* Copyright (c) 2020-2025
* All rights reserved.
*
* File name  : PortraitOpt
* Description: C++ based image matting (works on both GPU and CPU)
* Created    : 2020-10-09
* Author     : Qian Bin
* Version    : V1.0.0
************************************************************************/

// import the libtorch deep learning library
#undef UNICODE
#include <torch/script.h>

#include <iostream>
#include <cstring>
#include "time.h"

#include <opencv2/opencv.hpp>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgproc/types_c.h>
#include <opencv2/videoio/videoio.hpp>

// namespaces
using namespace std;
using namespace cv;

int main()
{
    // determine the current device type (CPU or GPU)
    torch::DeviceType device_type = at::kCPU;
    if (torch::cuda::is_available())
        device_type = at::kCUDA;

    // load the matting model
    torch::jit::script::Module model;
    try
    {
        model = torch::jit::load("matting.pt");
    }
    catch (const c10::Error& e)
    {
        cout << "can't load model" << endl;
        return 0;
    }
    model.eval();
    model.to(device_type);

    // collect the images to process
    cv::String pattern = "imgs/*.jpg";
    vector<cv::String> fn;
    glob(pattern, fn, false);
    vector<Mat> images;
    int imgNum = fn.size();
    std::cout << "Current number of images to process: " << imgNum << endl;

    // ------------------------- process the images in a loop -------------------------
    for (int picIndex = 0; picIndex < imgNum; picIndex++)
    {
        Mat img = imread(fn[picIndex]);
        clock_t start, finish;
        double duration;
        char imgpath[256];
        start = clock();

        Mat orgimg = img.clone();
        int org_width = img.cols;
        int org_height = img.rows;

        // preprocess: resize to the network input size and convert to a tensor
        resize(img, img, Size(320, 320), 0, 0, INTER_CUBIC);
        std::vector<int64_t> sizes = { 1, img.rows, img.cols, 3 };
        at::TensorOptions options(at::ScalarType::Byte);
        at::Tensor tensor_image = torch::from_blob(img.data, at::IntList(sizes), options);
        tensor_image = tensor_image.toType(at::kFloat);
        tensor_image = tensor_image.permute({ 0, 3, 1, 2 });
        tensor_image[0][0] = tensor_image[0][0].sub(114).div(255.0);
        tensor_image[0][1] = tensor_image[0][1].sub(121).div(255.0);
        tensor_image[0][2] = tensor_image[0][2].sub(134).div(255.0);
        tensor_image = tensor_image.to(device_type);

        // model inference
        std::vector<torch::jit::IValue> inputs;
        inputs.push_back(tensor_image);
        at::Tensor out_tensor = model.forward(inputs).toTensor();
        out_tensor = out_tensor.squeeze().detach();
        out_tensor = out_tensor.mul(255).add_(0.5).clamp_(0, 255).to(torch::kU8);
        out_tensor = out_tensor.to(torch::kCPU);

        // copy the uint8 alpha matte into an OpenCV image and restore the original size
        cv::Mat alphaImg(img.rows, img.cols, CV_8UC1);
        std::memcpy((void*)alphaImg.data, out_tensor.data_ptr(), sizeof(torch::kU8) * out_tensor.numel());
        resize(alphaImg, alphaImg, Size(org_width, org_height), 0, 0, INTER_CUBIC);

        // composite onto a white background
        Mat bg(orgimg.size(), orgimg.type(), Scalar(255, 255, 255));
        Mat alpha, comp;
        cvtColor(alphaImg, alpha, COLOR_GRAY2BGR);
        comp = orgimg.clone();
        for (int i = 0; i < alpha.rows; i++)
            for (int j = 0; j < alpha.cols; j++)
            {
                Vec3b alpha_p = alpha.at<Vec3b>(i, j);
                Vec3b bg_p = bg.at<Vec3b>(i, j);
                Vec3b img_p = orgimg.at<Vec3b>(i, j);
                if (alpha_p[0] > 210)
                {
                    alpha_p[0] = 255;
                    alpha_p[1] = 255;
                    alpha_p[2] = 255;
                }
                else
                {
                    alpha_p[0] = 0;
                    alpha_p[1] = 0;
                    alpha_p[2] = 0;
                }
                comp.at<Vec3b>(i, j)[0] = int(img_p[0] * (alpha_p[0] / 255.0) + bg_p[0] * (1.0 - alpha_p[0] / 255.0));
                comp.at<Vec3b>(i, j)[1] = int(img_p[1] * (alpha_p[1] / 255.0) + bg_p[1] * (1.0 - alpha_p[1] / 255.0));
                comp.at<Vec3b>(i, j)[2] = int(img_p[2] * (alpha_p[2] / 255.0) + bg_p[2] * (1.0 - alpha_p[2] / 255.0));
            }

        finish = clock();
        duration = (double)(finish - start) / CLOCKS_PER_SEC;
        printf("matting time: %f ms\n", duration * 1000);

        // save the composite result
        sprintf_s(imgpath, "results/%d_matting.jpg", picIndex);
        imwrite(imgpath, comp);

        // manually release per-image resources
        inputs.clear();
        img.release();
        orgimg.release();
        alphaImg.release();
        alpha.release();
        comp.release();
    }
    return 0;
}
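A few practical notes on running the program: it expects matting.pt and an imgs folder in the working directory, and writes the composites to a results folder. cv::imwrite does not create directories, so the results folder must exist beforehand, otherwise the output images will fail to save.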

Three, testing

The test results are as follows:

The first one or two images are slow, presumably because the first forward pass pays for GPU initialization and scheduling; after that, the per-image inference speed is normal (a warm-up sketch for this is given at the end of the article). The matting effect is shown below:

The original:

Matting results:
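As noted above, the first image or two pays the GPU initialization cost. If that matters for your use case, a common trick is to run one throwaway forward pass before the timing loop starts. A minimal sketch, reusing the model and device_type variables from the code in section Two:

// Warm-up: one dummy forward pass so CUDA context creation and kernel
// selection are not billed to the first real image.
at::Tensor dummy = torch::zeros({ 1, 3, 320, 320 }).to(device_type);
std::vector<torch::jit::IValue> warmup_inputs;
warmup_inputs.push_back(dummy);
model.forward(warmup_inputs);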
