WeChat’s “Scan” object recognition has been live for some time and has attracted considerable attention. Compared with the “snap a photo” interaction of competing products in the industry, “Scan” recognizes objects as you sweep the camera over them, a more convenient user experience. That convenience is inseparable from efficient object detection on the mobile device, which this article will demystify.


Author | arlencai, Applied Researcher, Tencent WXG

I. Background

Automatic frame selection is the highlight of “Scan” recognition, bringing a more convenient user experience. Compared with the “snap a photo” interaction, the difficulty of “Scan” lies in automatically selecting the image frames that contain an object, which is inseparable from efficient object detection on the mobile device.

II. The Problem

Object detection for “Scan” is general object detection in an open environment. The complex and diverse forms of objects demand strong generalization from the model, while the computational bottleneck of mobile devices demands that the model remain highly real-time.

Should mobile detection for “Scan” be class-wise or object-ness based?

The advantage of class-wise detection (i.e., traditional object detection) is that it outputs the location and the category of each object simultaneously. However, object categories in an open environment are difficult to define precisely and cover completely.

Therefore, we define the problem as object-ness detection (i.e., subject detection): we only care whether something is an object and where it is, not which specific category it belongs to.

Object-ness detection generalizes better to diverse objects and greatly lightens the model, helping to guarantee real-time performance. This definition is what distinguishes the mobile detection of “Scan” from related competing products.

III. Model Selection

Object detection algorithms have evolved rapidly in recent years. Faced with a wide variety of detection models (see Figure 1), the most appropriate one is the best one.

1. One-stage

In terms of hierarchical structure, detection models can be divided into two-stage and one-stage.

(1) Two-stage detectors are represented by the R-CNN series (Fast R-CNN [1], Faster R-CNN [2], Mask R-CNN [3]). The first stage of the model outputs rough object candidate proposals; the second stage further regresses object coordinates and classifies object categories.

The advantages of two-stage detectors are as follows: RoIPool’s scale normalization of candidate boxes is robust to small objects, and the second-stage region classification is friendlier when many categories must be detected.

(2) One-stage detectors are represented by the YOLO and SSD series (YOLOv1–v3 [4-6], SSD [7], RetinaNet [8]). They are characterized by a fully convolutional network (FCN) that directly outputs object coordinates and categories, which facilitates acceleration on mobile devices.

For the subject-detection scenario of “Scan” recognition, the demand for small objects and many categories is weaker than the demand for real-time performance, so we chose a one-stage model structure.

2. Anchor-free

(1) Anchors are a feature of the R-CNN and SSD series of detection methods. In one-stage detectors, anchors of various shapes are generated by sliding windows as candidate boxes; in two-stage detectors, the RPN selects suitable candidate boxes from the anchors for second-stage classification and regression.

Anchors provide a shape prior for the detector, which effectively reduces the complexity of the detection task, but the empirically chosen anchor parameters greatly affect model performance.

(2) With developments in network structure (e.g., FPN [9] and DeformConv [10]) and loss functions (e.g., Focal Loss [8] and IoU Loss [11]), anchor-free detectors have gradually gained new vitality. Scale-robust network structures enhance the model’s expressive power, while robust loss functions address sample imbalance and localization metrics.

Anchor-free methods are represented by YOLOv1–v2 [4-5] and their derivatives (DenseBox [12], DuBox [13], FoveaBox [14], FCOS [15], CornerNet [16], CenterNet [17], etc.). They discard the candidate-box shape prior and directly classify object categories and regress object coordinates.

In the “Scan” scenario, complex and diverse object shapes pose great challenges to anchor design, so we chose an anchor-free model structure.

3. Light-head

Over the past year, anchor-free detectors have advanced rapidly. However, for mobile applications, most one-stage, anchor-free detectors still have the following deficiencies:

(1) Multi-head: To strengthen detection of multi-scale objects, most detectors (such as FoveaBox [14], DuBox [13] and FCOS [15]) adopt multi-head outputs to improve scale robustness.

Low-level features serve small-object detection, while high-level features serve large-object detection. However, a multi-output network structure is not friendly to mobile acceleration.

(2) Post-process: To compensate for the missing anchor prior and to merge multi-head results, most detectors rely on complex post-processing, such as non-maximum suppression (NMS) and various tricks, which are generally hard to parallelize.

To sum up, we selected CenterNet as the mobile detection model for “Scan” (see Figure 2). CenterNet is a one-stage, anchor-free detection method; its single-head output and regression of a Gaussian response map free it from NMS post-processing.

CenterNet turns object detection into a standard keypoint-estimation problem: a fully convolutional network produces a heatmap of object centers (each peak is a center point), and the object’s width and height are predicted at the corresponding peak.
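To make this concrete, below is a minimal sketch (not WeChat’s production code) of how such a heatmap can be decoded for a single object-ness class: a 3×3 max-pooling pass keeps only local maxima, so no NMS is needed. The tensor layouts, top-k value and confidence threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, k=10, thresh=0.3):
    """heatmap: [1, 1, H, W] sigmoid scores; wh: [1, 2, H, W] width/height maps."""
    # Keep only local maxima: a cell survives if it equals the 3x3 max around it.
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    heatmap = heatmap * (pooled == heatmap).float()

    scores, indices = heatmap.view(-1).topk(k)        # top-k candidate centers
    w = heatmap.shape[3]
    ys = torch.div(indices, w, rounding_mode="floor")  # recover 2-D coordinates
    xs = indices % w

    boxes = []
    for score, x, y in zip(scores, xs, ys):
        if score < thresh:
            continue
        bw, bh = wh[0, 0, y, x], wh[0, 1, y, x]       # predicted size at the peak
        boxes.append((float(x - bw / 2), float(y - bh / 2),
                      float(x + bw / 2), float(y + bh / 2), float(score)))
    return boxes
```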

In addition, we introduced Gaussian sampling, Gaussian weighting, GIoU Loss [19] and other techniques from TTFNet [18] to accelerate CenterNet training: training on MS-COCO takes only 5 hours on four Tesla P4 GPUs, which saves a great deal of time for model tuning and optimization.
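GIoU Loss [19] follows a standard published formulation, so a short sketch may help: it penalizes the gap between the union of the two boxes and their smallest enclosing box, giving a useful gradient even when the boxes do not overlap.

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """pred, target: [N, 4] boxes as (x1, y1, x2, y2). Returns mean 1 - GIoU."""
    # Intersection area
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)

    # Smallest enclosing box C
    cx1 = torch.min(pred[:, 0], target[:, 0])
    cy1 = torch.min(pred[:, 1], target[:, 1])
    cx2 = torch.max(pred[:, 2], target[:, 2])
    cy2 = torch.max(pred[:, 3], target[:, 3])
    area_c = (cx2 - cx1) * (cy2 - cy1)

    giou = iou - (area_c - union) / (area_c + eps)
    return (1 - giou).mean()
```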

IV. Optimization

In view of the detection requirements on mobile devices, we replaced CenterNet’s ResNet-18 backbone with the more mobile-friendly ShuffleNetV2 [20].

However, the efficiency gain from swapping the backbone alone is limited, so we carried out targeted model optimization.

1. Large receptive field (Large RF)

Moving from ResNet to ShuffleNetV2 mainly affects the depth and receptive field of the model. In CenterNet, which regresses heatmaps, the receptive field of the model is extremely important.

How can we enlarge the model’s receptive field while keeping the network light?

From AlexNet to VGG, large convolution kernels were decomposed into multiple small ones (one 5×5 → two 3×3): with the same receptive field, two 3×3 convolutions cost only 18/25 of the parameters and computation of one 5×5 convolution (2 × 3² = 18 versus 5² = 25 weights per channel pair).

However, this no longer holds in the era of depth-wise convolution.

In ShuffleNet, a 5×5 depth-wise convolution obtains twice the receptive field of a 3×3 depth-wise convolution while adding only a very small amount of computation (see Figure 3).
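The arithmetic is easy to verify. The snippet below counts multiply-accumulates per output position; the channel width C = 116 is an assumption taken from a ShuffleNetV2 1× stage, not a figure from the article.

```python
# Multiply-accumulate counts per output position for standard vs. depth-wise
# convolution, showing why a 5x5 kernel is cheap only in the depth-wise case.
C = 116  # assumed channel count (a ShuffleNetV2 1x stage width)

standard_3x3  = 3 * 3 * C * C   # standard conv: k^2 * C_in * C_out
standard_5x5  = 5 * 5 * C * C
depthwise_3x3 = 3 * 3 * C       # depth-wise conv: k^2 * C (one filter per channel)
depthwise_5x5 = 5 * 5 * C

print(standard_5x5 / standard_3x3)    # ~2.78x: expensive for standard conv
print(depthwise_5x5 / depthwise_3x3)  # also ~2.78x, but of a term C times smaller
print(depthwise_5x5 / standard_3x3)   # 5x5 depth-wise is ~2% of a 3x3 standard conv
```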

Therefore, we replaced all depth-wise convolutions in ShuffleNetV2 with 5×5 depth-wise convolutions.

Since no ImageNet-pre-trained 5×5 model was available, we applied a simple trick: zero-padding the convolution kernels of the pre-trained 3×3 ShuffleNetV2 model to obtain a 5×5 large-kernel ShuffleNetV2.
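In code, the trick amounts to padding a one-pixel zero border around each pre-trained 3×3 kernel, so the enlarged 5×5 kernel initially computes exactly the same function; the module names in the usage comment are hypothetical. A minimal sketch:

```python
import torch.nn.functional as F

def expand_kernel_3x3_to_5x5(weight_3x3):
    """weight_3x3: [C_out, C_in, 3, 3] pre-trained weights -> [C_out, C_in, 5, 5]."""
    return F.pad(weight_3x3, (1, 1, 1, 1))  # zero-pad the last two (spatial) dims

# Usage (hypothetical module names): copy the expanded weights into the new conv,
# whose padding is also grown from 1 to 2 so feature-map sizes stay unchanged.
# new_conv.weight.data.copy_(expand_kernel_3x3_to_5x5(old_conv.weight.data))
```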

2. Light Head

CenterNet’s detection head uses a U-Net-like [21] upsampling structure to effectively fuse low-level details and improve detection of small objects.

However, CenterNet’s detection head was not optimized for mobile, so we “ShuffleNet-ified” it (see the red box in Figure 4).

First, all ordinary 3×3 convolutions in the detection head are replaced with 5×5 depth-wise convolutions, and the deformable convolutions (DeformConv) are likewise converted to depth-wise deformable convolutions.

Second, following ShuffleNet’s channel-compression technique, the residual fusion of multi-layer features in CenterNet is changed to concatenation of channel-compressed features, as sketched below.
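The article does not publish the exact head configuration, but the two changes above suggest a building block along the following lines; the channel sizes, the paired point-wise convolution and the activation placement are assumptions.

```python
import torch
import torch.nn as nn

class LightHeadBlock(nn.Module):
    """Illustrative head block: 5x5 depth-wise + 1x1 point-wise convolution, with
    concat fusion of a channel-compressed skip feature instead of a residual add."""
    def __init__(self, channels, skip_channels, compressed=24):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 5, padding=2,
                            groups=channels, bias=False)   # 5x5 depth-wise
        self.pw = nn.Conv2d(channels, channels, 1, bias=False)  # 1x1 point-wise
        self.bn = nn.BatchNorm2d(channels)
        # Compress the lateral (skip) feature before concatenation.
        self.compress = nn.Conv2d(skip_channels, compressed, 1, bias=False)

    def forward(self, x, skip):
        x = torch.relu(self.bn(self.pw(self.dw(x))))
        skip = self.compress(skip)
        return torch.cat([x, skip], dim=1)  # concat instead of residual add
```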

With Large RF and Light Head, the optimized model achieves excellent FLOPs, parameter counts and mAP on the MS-COCO dataset, as shown in Table 1.

Figure 4: CenterNet detection head structure optimization

3. Pyramid Interpolation Module (PIM)

However, DeformConv is not friendly to mobile acceleration, so we needed to design a replacement for it.

DeformConv adaptively extracts multi-scale information and plays a large role in small-object detection on MS-COCO. Since “Scan” has little demand for small objects, DeformConv mainly serves here to provide features at multiple scales.

For this, we drew on the Pyramid Pooling Module (PPM) of the image segmentation method PSPNet [22] (see Figure 5) and propose the Pyramid Interpolation Module (PIM), which fuses multi-scale features and interpolates (upsamples) the feature map at the same time (see the blue box in Figure 4).

PIM mainly comprises three branches, each performing 2× upsampling: dilated deconvolution, convolution + upsampling, and global average pooling + fully connected layer.

Among them, “dilated deconvolution” corresponds to large-scale features; “convolution + upsampling” corresponds to small-scale features; and “global average pooling + fully connected” corresponds to global features.
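Based on this description alone (exact kernel sizes, dilations and channel splits are not given in the article), a plausible PIM sketch looks like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PIM(nn.Module):
    """Sketch of the Pyramid Interpolation Module as described in the text:
    three 2x-upsampling branches fused by concatenation. Hyperparameters here
    are assumptions, not the production configuration."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Branch 1: dilated deconvolution -> large-scale features
        # (kernel/stride/padding chosen so the output is exactly 2H x 2W).
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2,
                                         padding=2, output_padding=1, dilation=2)
        # Branch 2: convolution + bilinear upsampling -> small-scale features
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        # Branch 3: global average pooling + fully connected -> global features
        self.fc = nn.Linear(in_ch, out_ch)

    def forward(self, x):
        n, _, h, w = x.shape
        b1 = self.deconv(x)                                    # [N, out, 2H, 2W]
        b2 = F.interpolate(self.conv(x), scale_factor=2,
                           mode="bilinear", align_corners=False)
        g = self.fc(x.mean(dim=(2, 3)))                        # [N, out]
        b3 = g[:, :, None, None].expand(-1, -1, 2 * h, 2 * w)  # broadcast globally
        return torch.cat([b1, b2, b3], dim=1)
```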

On the ShuffleNetV2 ×0.5 backbone, Table 2 compares the influence of various upsampling methods on detection performance. It shows that PIM effectively replaces DeformConv in “Scan” recognition.

Figure 5: The Pyramid Pooling Module of PSPNet

V. Deployment

With the optimizations above, we finally adopted the best-performing configuration in Table 2 as the mobile detection model for “Scan” object recognition.

The model is trained with MMDetection on the PyTorch framework. For mobile deployment, we use the NCNN framework: the PyTorch model is converted to ONNX and then to an NCNN model, with parameters quantized to 16 bits during conversion.
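A hedged sketch of that pipeline is shown below, with a torchvision ShuffleNetV2 standing in for the trained detector; file names and the input resolution are placeholders, and the ncnn converter invocations in the comments should be checked against your ncnn version’s documentation.

```python
import torch
import torchvision

# Stand-in network for illustration; in practice this is the trained detector.
model = torchvision.models.shufflenet_v2_x0_5().eval()
dummy = torch.randn(1, 3, 224, 224)  # input resolution is an assumption
torch.onnx.export(model, dummy, "model.onnx", opset_version=11,
                  input_names=["image"], output_names=["out"])

# Then convert with ncnn's command-line tools (verify flags for your version):
#   onnx2ncnn model.onnx model.param model.bin
#   ncnnoptimize model.param model.bin model-fp16.param model-fp16.bin 65536
```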

In addition, to further reduce the model size and speed up inference, we fuse the three consecutive linear operations Conv/BN/Scale in the network into a single Conv layer, which reduces the number of parameters by about 5% and increases speed by about 5%–10% without affecting accuracy.
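Conv/BN folding is standard algebra: BN applies a per-channel scale and shift, which can be absorbed into the convolution’s weights and bias. A minimal sketch (for inference only):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm (and its scale/shift) into the preceding convolution:
    gamma * (Wx + b - mean) / sqrt(var + eps) + beta is itself a convolution."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups,
                      bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sigma
    fused.weight.data = conv.weight.data * scale[:, None, None, None]
    bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused

# Quick check in inference mode: outputs should match to float precision.
conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16).eval()
x = torch.randn(1, 8, 32, 32)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```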

In the end, the “Scan” mobile detection model is only 436 KB, and single-frame detection on the A11 CPU of an iPhone 8 takes only 15 ms.

VI. Outlook

At present, “Scan” mobile detection is only a first step, and the development of mobile object detection as a whole is just beginning.

Beyond the “Scan” object-detection scenario, CenterNet still faces the following problems in general object detection:

How do we contain the explosive growth of the detection head as the number of categories increases? Is there a more general alternative to deformable convolution? Can the U-Net-style upsampling structure be further optimized?

There is a long way to go, and we will explore these questions in future work.

We welcome discussion, and we welcome you to join us: WeChat is recruiting talent in computer vision and natural language processing. Please send your resume to [email protected].

References:

[1] Girshick, Ross. “Fast R-CNN.” International Conference on Computer Vision (2015): 1440-1448.

[2] Ren, Shaoqing, et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” IEEE Transactions on Pattern Analysis and Machine Intelligence 39.6 (2017): 1137-1149.

[3] He, Kaiming, et al. “Mask R-CNN.” International Conference on Computer Vision (2017): 2980-2988.

[4] Redmon, Joseph, et al. “You Only Look Once: Unified, Real-Time Object Detection.” Computer Vision and Pattern Recognition (2016): 779-788.

[5] Redmon, Joseph, and Ali Farhadi. “YOLO9000: Better, Faster, Stronger.” Computer Vision and Pattern Recognition (2017): 6517-6525.

[6] Redmon, Joseph, and Ali Farhadi. “YOLOv3: An Incremental Improvement.” arXiv preprint (2018).

[7] Liu, Wei, et al. “SSD: Single Shot MultiBox Detector.” European Conference on Computer Vision (2016): 21-37.

[8] Lin, Tsung-Yi, et al. “Focal Loss for Dense Object Detection.” International Conference on Computer Vision (2017): 2999-3007.

[9] Lin, Tsung-Yi, et al. “Feature Pyramid Networks for Object Detection.” Computer Vision and Pattern Recognition (2017): 936-944.

[10] Dai, Jifeng, et al. “Deformable Convolutional Networks.” International Conference on Computer Vision (2017): 764-773.

[11] Yu, Jiahui, et al. “UnitBox: An Advanced Object Detection Network.” ACM Multimedia (2016): 516-520.

[12] Huang, Lichao, et al. “DenseBox: Unifying Landmark Localization with End to End Object Detection.” arXiv preprint (2015).

[13] Chen, Shuai, et al. “DuBox: No-Prior Box Objection Detection via Residual Dual Scale Detectors.” arXiv preprint (2019).

[14] Kong, Tao, et al. “FoveaBox: Beyond Anchor-based Object Detector.” arXiv preprint (2019).

[15] Tian, Zhi, et al. “FCOS: Fully Convolutional One-Stage Object Detection.” International Conference on Computer Vision (2019): 9627-9636.

[16] Law, Hei, and Jia Deng. “CornerNet: Detecting Objects as Paired Keypoints.” European Conference on Computer Vision (2018): 765-781.

[17] Zhou, Xingyi, Dequan Wang, and Philipp Krahenbuhl. “Objects as Points.” arXiv preprint (2019).

[18] Liu, Zili, et al. “Training-Time-Friendly Network for Real-Time Object Detection.” arXiv preprint (2019).

[19] Rezatofighi, Hamid, et al. “Generalized Intersection over Union: A Metric and a Loss for Bounding Box Regression.” Computer Vision and Pattern Recognition (2019): 658-666.

[20] Ma, Ningning, et al. “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design.” European Conference on Computer Vision (2018): 122-138.

[21] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” Medical Image Computing and Computer-Assisted Intervention (2015): 234-241.

[22] Zhao, Hengshuang, et al. “Pyramid Scene Parsing Network.” Computer Vision and Pattern Recognition (2017): 6230-6239.

[23] Li, Zeming, et al. “Light-Head R-CNN: In Defense of Two-Stage Object Detector.” arXiv preprint (2017).

[24] Wang, Robert J., Xiang Li, and Charles X. Ling. “Pelee: A Real-Time Object Detection System on Mobile Devices.” Neural Information Processing Systems (2018): 1967-1976.
