SOLOv2 TensorRT Acceleration I: The Acceleration Plan
We have a WeChat group with front-line AI algorithm engineers from Tencent, Alibaba and other companies. If you would like to join the discussion, add WeChat: JintianandMerry to be pulled into the group, and please note “communication group”.
I’ve written about SOLOv2 before, but never about deploying it. Recently I wanted to deploy an instance segmentation algorithm with the goal of decent accuracy and real-time speed. I tried a few two-stage schemes, but the pitfalls ran too deep and I had to give up. In the end I settled on deploying SOLOv2, for two main reasons:
- The SOLOv2 model is simple, and it is one of the few models with a reasonable balance between accuracy (around 38 mAP) and speed;
- SOLOv2 is a genuine single-stage method. It does not need to predict boxes first and then regress masks from them, unlike CenterMask, CondInst and other methods, even though those also claim to be single-stage.
Of course, this is only feasible in principle. In practice SOLOv2 has its own deep pitfalls, namely dynamic convolution and Matrix NMS. Those of you familiar with deep learning deployment will know that dynamic convolution is currently not supported in TensorRT; in other words, it needs special handling. The other is Matrix NMS, which you can either implement as a CUDA kernel or fold into the model, because unlike ordinary NMS it relies heavily on matrix operations, and if you try to write it as plain post-processing code you will probably get stuck. In any case, we worked through all of these difficulties, so it makes a good example. The final acceleration result is as follows:
The result shown above is roughly on par with the SOLOv2 COCO mAP 35.5 model we tuned: 55 ms in Python versus 23 ms in TensorRT, more than twice as fast. The overall acceleration is summarized below:
| model | latency | mAP | evaluation | input |
|---|---|---|---|---|
| Python | 55 ms | 35.5 | decent | 732×704 |
| TensorRT (C++) | 23 ms (GTX 1080 Ti, FP32) | 35.5 | so fast your hair falls out | 732×704 |
In short, we have an instance segmentation model with high accuracy and high speed, and it runs with pure C++ inference!
Pitfall notes
The numbers above show how surprisingly fast SOLOv2 can be after acceleration. Next I will share the problems encountered along the way; it is fair to say the pits were as deep as the Himalayas are high.
1. Dynamic convolution is not supported
First of all, dynamic convolution is not supported, which many people may never have run into, simply because they never had to. Even I, someone who had already deployed PanopticFCN, tripped over it.
If you have studied instance segmentation, you will notice that the dynamic convolution used by PanopticFCN is not quite the same as in SOLOv2, but it is essentially the same thing; SOLOv2 just exposes the problem more directly.
So what is the difference, and why are they essentially the same? Well, if you only consider dynamic convolution with a 1×1 kernel, a stride of 1, a dilation of 1 and no padding, the convolution degenerates into an ordinary matrix multiplication. Note that this is matmul, not element-wise *. So the two are essentially the same thing, as the authors of PanopticFCN write it:
```python
torch.matmul(a, b)
```
It essentially replaces the dynamic convolution with a matrix multiplication. So now you know how to work around dynamic convolution, right?
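To make this concrete, here is a minimal sketch (shapes are made up, roughly in the SOLOv2 ballpark) verifying that a 1×1 dynamic convolution with stride 1 and no padding is exactly a matrix multiplication over the channel dimension:

```python
import torch
import torch.nn.functional as F

# Dynamic 1x1 convolution: each instance predicts its own kernel.
# feat:    mask features,     shape (1, C, H, W)
# kernels: predicted kernels, shape (N, C) -- one C-dim kernel per instance
feat = torch.randn(1, 256, 100, 152)
kernels = torch.randn(64, 256)

# 1) As a convolution: treat each predicted kernel as a 1x1 conv filter.
out_conv = F.conv2d(feat, kernels.view(64, 256, 1, 1), stride=1)  # (1, 64, H, W)

# 2) As a matrix multiplication: flatten the spatial dims and matmul.
out_mm = torch.matmul(kernels, feat.view(256, -1)).view(1, 64, 100, 152)

print(torch.allclose(out_conv, out_mm, atol=1e-4))  # True
```

The matmul form is the one that exports cleanly to ONNX and maps onto a plain GEMM in TensorRT, whereas a convolution whose weights are computed at runtime is exactly the part TensorRT rejects.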
2. Matrix NMS
In fact, this part could also have been written as a CUDA kernel, but I was lazy and did not do it; instead I folded it directly into the ONNX graph. I measured this part and it does not actually take much time, so compared with writing your own plugin the difference should not be large.
I make it sound easy, but there is a lot going on inside that ONNX graph... I will not walk through it here; if you are interested, try it yourself.
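For reference, here is a sketch of Matrix NMS following the pseudocode in the SOLOv2 paper; the function and argument names are mine, and the graph actually exported may differ in detail:

```python
import torch

def matrix_nms(masks, scores, sigma=2.0, kernel='gaussian'):
    # masks:  (N, H, W) binary masks, already sorted by descending score
    # scores: (N,) confidence scores; returns decayed scores to threshold later
    n = masks.size(0)
    flat = masks.reshape(n, -1).float()
    inter = torch.mm(flat, flat.t())                 # pairwise mask intersections
    areas = flat.sum(dim=1).expand(n, n)
    union = areas + areas.t() - inter
    ious = (inter / union).triu(diagonal=1)          # IoU with higher-scored masks only
    ious_cmax = ious.max(dim=0)[0].expand(n, n).t()  # each mask's max IoU with a higher-scored one
    if kernel == 'gaussian':
        decay = torch.exp(-(ious ** 2 - ious_cmax ** 2) / sigma)
    else:                                            # linear kernel
        decay = (1 - ious) / (1 - ious_cmax)
    decay = decay.min(dim=0)[0]
    return scores * decay
```

Because everything here is a dense tensor operation (mm, triu, max, exp) with no data-dependent control flow, it traces into ONNX directly, which is what makes folding it into the model practical.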
3. How the grid mask is traced to ONNX
After working through this rather complex model, I became increasingly aware of the need for a unified graph representation, and of how important and necessary it is for deployment frameworks to support logic inside the model graph; purely data-driven static graphs struggle to keep up with ever-growing model complexity. The most direct example of this in SOLOv2 is the grid mask handling.
I first tried to export this part from the MMDetection implementation of SOLOv2 and found its way of writing things really hard to deal with. The Detectron2-style version supports it much better: you only need to change how linspace is written so that ONNX can map it onto an operator it actually implements.
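As an illustration of the kind of rewrite involved (not necessarily the exact change made in the Detectron2-style code), the normalized coordinate grid that SOLOv2 builds with torch.linspace can be expressed with torch.arange, which has exported to ONNX more reliably across opsets:

```python
import torch

def coord_grid(h, w):
    # Same values as torch.linspace(-1, 1, w) / (-1, 1, h), built from arange
    # so the exporter can map it onto a plain Range op.
    x = torch.arange(w, dtype=torch.float32) / max(w - 1, 1) * 2 - 1
    y = torch.arange(h, dtype=torch.float32) / max(h - 1, 1) * 2 - 1
    xx = x.view(1, w).expand(h, w)
    yy = y.view(h, 1).expand(h, w)
    return torch.stack([xx, yy], dim=0)  # (2, h, w) coordinate feature
```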
4. Output mask processing
As you may know, there are two places in SOLOv2 where the number of outputs changes. Once the output count changes, your exported model is no longer independent of the input: you need an image that actually produces detections to trace and export a graph with detection outputs, and that graph does not necessarily carry over to a different image.
SOLOv2 indeed has this problem. One change happens where the feature encoding is combined with the predicted kernels, and the other comes from the two threshold checks plus the final Matrix NMS filtering. After so much screening, the number of instances that come out is not fixed.
But it is not unsolvable. The current solution is to fix the number of outputs, at the cost of carrying a lot of output tensors.
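A minimal sketch of the fixed-output idea, with hypothetical names: rather than boolean indexing, which yields a data-dependent number of detections, keep every candidate and zero out the ones below threshold, so the exported shapes stay static and the real filtering happens in C++ after inference:

```python
import torch

def filter_static(scores, masks, score_thr=0.1):
    # scores: (N,) per-candidate confidence, masks: (N, H, W) soft masks.
    # Boolean indexing like scores[scores > thr] changes N at runtime and
    # breaks static-shape export; multiplying by a mask keeps N fixed.
    keep = (scores > score_thr).float()
    return scores * keep, masks * keep[:, None, None]
```

Carrying all of those fixed-size output tensors around is what brings us to the next problem: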
5. Simple and crude post-processing makes the end-to-end time long
Many people writing a tutorial or blog only report the forward-inference time and never mention the pre- and post-processing time; some run just the network-inference part, look at the number and call it done. But that is not the whole story; there is a lot of detail hiding in there. For example, few people realize that an OpenCV resize can take twice as long as your network. You may not believe it, but try it yourself.
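If you doubt the resize claim, it is easy to measure on your own machine; the sizes below are just examples matching the table above:

```python
import time
import cv2
import numpy as np

img = np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8)

# Time the preprocessing resize on its own before blaming the network.
t0 = time.perf_counter()
for _ in range(100):
    resized = cv2.resize(img, (732, 704))  # (width, height)
print('resize:', (time.perf_counter() - t0) / 100 * 1000, 'ms')
```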
The other issue is simple and crude post-processing. A peculiarity of SOLOv2 is that its outputs are large, because its masks are at the resolution of the original image, and the consequence is that the output tensors take up a lot of space. Many people run the forward pass and leave it at that, but copying the output back to the CPU and visualizing it can take twice as long as the forward pass itself.
So simple and crude post-processing is not workable here. In the end we select the outputs we need from the fixed-size outputs, and some further work is required; our solution is to write a CUDA kernel to handle it.
SOLOv2 application cases
This is only a preliminary result, but the speed already meets deployment requirements and the accuracy is also high. Here we share some further examples of using SOLOv2.
In my opinion, SOLOv2 is suitable for the following application scenarios:
- Scenes requiring instance segmentation, such as panel inspection, top-view detection, irregularly shaped object detection, and so on;
- Real-time scenarios such as industrial defect detection where traditional detection algorithms fall short; SOLOv2 can help, and for productization we have a TensorRT solution that is easy to deploy.
1. Chip detection and identification
2. Garbage detection of cans
3. Pedestrian detection and counting
Overall, SOLOv2 still has a place in industry. It has its shortcomings, but I believe that with continued optimization the model can stand shoulder to shoulder with other detection models. At the very least, for some instance segmentation tasks other detection methods may not be able to do the job at all, and even when they can, they are rarely this fast.
Conclusion
This article has outlined some of the problems encountered during deployment, the acceleration results and some further use cases. If you are interested in SOLOv2 and have one of the following needs:
- If you have your own dataset that fits these scenarios, you can join our Knowledge Planet and contact the owner to obtain the corresponding training code and acceleration code;
- If you need TensorRT acceleration for other models, you can also contact us; there are plenty of interesting and useful algorithms we can deploy to industry.
Finally, the next article will cover the technical details of SOLOV2 TensorRT acceleration and how to do the post-processing.
More
If you want to learn artificial intelligence and are interested in cutting-edge AI technology, you can join our Knowledge Planet to get first-hand news, the latest academic trends, industry news and more! Your support will encourage us to create more often, and we will help you go deeper into your deep learning journey!
Previous articles
zhuanlan.zhihu.com/p/165009477
zhuanlan.zhihu.com/p/149398749
zhuanlan.zhihu.com/p/147622974
zhuanlan.zhihu.com/p/144727162