Computer vision is a key enabling technology for autonomous driving, and the Meituan autonomous delivery team has long been actively exploring this field. Recently, CenterMask, an image instance segmentation algorithm proposed by the high-precision map group, was accepted to CVPR 2020. This article introduces the method.

CVPR, short for the IEEE Conference on Computer Vision and Pattern Recognition, is known together with ICCV and ECCV as one of the top three conferences in computer vision. This year's CVPR received 6,656 submissions and accepted 1,470, an acceptance rate of about 22%.

Background

The significance of one-stage instance segmentation

Image instance segmentation is one of the most important and fundamental problems in computer vision, with important applications in many fields such as map element extraction and perception for autonomous vehicles. Unlike object detection and semantic segmentation, instance segmentation needs to simultaneously locate, classify, and segment every instance (object) in an image. In this sense, instance segmentation combines the characteristics of object detection and semantic segmentation, and is therefore more challenging. At present, mainstream instance segmentation algorithms (e.g., Mask R-CNN[1]) are widely built on two-stage object detection networks (the Faster R-CNN[2] family).

In 2019, one-stage anchor-free object detection saw a new wave of progress, and many excellent one-stage detectors were proposed, such as CenterNet[3] and FCOS[4]. Compared with two-stage algorithms, these methods directly predict all the information a bounding box requires, such as position, box size, and category, without relying on pre-set anchors, which makes them simple, flexible, and fast. It is therefore natural to ask whether instance segmentation can also adopt such a one-stage anchor-free design to achieve a better balance between speed and accuracy. In this paper, we analyze the two main difficulties of this problem and propose the CenterMask method to address them.

The difficulties of one-stage instance segmentation

Compared with one-stage object detection, one-stage instance segmentation is more difficult. Unlike a bounding box in object detection, which can be represented by its corner coordinates, the shape and size of an instance mask are far more flexible and hard to represent with a fixed-size vector. Starting from the problem itself, one-stage instance segmentation faces two main difficulties:

  • How to distinguish different object instances, especially those of the same category. Two-stage methods use a Region of Interest (ROI) to limit the scope of a single object and only need to segment the region inside the ROI, which greatly reduces interference from other objects. A one-stage method, however, must segment all objects in the image directly.
  • How to preserve pixel-level location information. This is a problem shared by two-stage and one-stage instance segmentation. Segmentation is in essence a pixel-level task, and the precision of segmentation at object edges strongly affects the final result. However, most existing instance segmentation methods resize fixed-size features to the size of the original object, or describe the contour with a fixed number of points, neither of which preserves the spatial information of the original image.

Related work

Following the taxonomy used in object detection, existing instance segmentation methods can be roughly divided into two categories: two-stage and one-stage instance segmentation.

  • Two-stage instance segmentation follows a detect-then-segment pipeline: the whole image is first detected to obtain bounding boxes, and the regions inside the boxes are then segmented to obtain a mask for each object. The main representative is Mask R-CNN[1], which adds a mask segmentation branch to the Faster R-CNN[2] network to segment each Region of Interest (ROI). Because mapping ROIs of different sizes to masks of the same scale loses positional accuracy, the method introduces RoIAlign to restore a certain degree of location information. PANet[5] improves on Mask R-CNN by enhancing the propagation of information through the network. Mask Scoring R-CNN[6] improves mask quality by introducing a scoring module for masks. These two-stage methods achieve state-of-the-art accuracy, but they are complicated and time-consuming, which has motivated the search for simpler and faster one-stage instance segmentation algorithms.
  • Existing one-stage instance segmentation algorithms can be roughly divided into two categories: global-image-based and local-region-based methods. Global methods first generate global feature maps and then combine them to obtain the final mask of each instance. For example, InstanceFCN[7] first uses a fully convolutional network[8] to produce instance-sensitive score maps that encode the relative positions of object instances, and then uses an assembling module to output the segmentation of each object. YOLACT[9] first generates multiple prototype masks for the whole image and then combines them with per-instance mask coefficients to produce each instance's segmentation (a minimal sketch of this prototype-combination idea is given after this list). Global methods preserve object location information well and achieve pixel-level feature alignment, but they perform poorly when different objects overlap. In contrast, local-region-based methods output instance segmentations directly from local information. PolarMask[10] represents each instance by its contour, described as a polygon formed by rays emitted from the object center; however, a polygon with a fixed number of vertices cannot accurately describe object edges, and such contour-based methods cannot represent objects with holes. TensorMask[11] uses 4D tensors to represent the masks of objects at different spatial locations and introduces aligned representations and a tensor bipyramid to better recover spatial details, but these feature-alignment operations make the whole network even slower than the two-stage Mask R-CNN.
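
For concreteness, here is a minimal sketch of the YOLACT-style prototype combination described above (not the official implementation; all tensor sizes are illustrative assumptions):

```python
import torch

# A minimal sketch of YOLACT-style mask assembly (not the official code).
# prototypes: k global prototype masks of size (H, W), shared by all instances.
# coefficients: one k-dimensional mask coefficient vector per detected instance.
k, H, W, num_instances = 32, 138, 138, 5          # illustrative sizes
prototypes = torch.randn(k, H, W)                  # output of the prototype branch
coefficients = torch.randn(num_instances, k)       # output of the prediction head

# Each instance mask is a linear combination of the prototypes,
# followed by a sigmoid to map values into [0, 1].
masks = torch.sigmoid(
    torch.einsum("nk,khw->nhw", coefficients, prototypes)
)
print(masks.shape)  # torch.Size([5, 138, 138])
```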

Different from the above methods, our proposed CenterMask network contains both a global saliency-map branch and a local shape-prediction branch, so it can distinguish different object instances while maintaining pixel-level feature alignment.

Introduction to CenterMask

This work aims to propose a one-stage image instance segmentation algorithm that predicts masks without relying on pre-set ROI regions, which requires the model to simultaneously locate, classify, and segment the objects in an image. To achieve this, we split instance segmentation into two parallel sub-tasks and combine their results to obtain the final segmentation of each instance.

The first branch, the Local Shape branch, extracts coarse shape information from the representation at each object's center point; it constrains the spatial extent of each object and thus naturally distinguishes different instances. The second branch, the Global Saliency branch, predicts a saliency map over the whole image, which preserves precise location information and enables fine segmentation. Finally, the coarse but instance-aware Local Shape and the fine but instance-unaware Global Saliency are combined to obtain the segmentation of each object.

1. Overall network framework

The overall network structure of CenterMask is shown in Figure 2. Given an input image, the features extracted by the backbone network are fed into five parallel branches. The Heatmap and Offset branches predict the coordinates of all center points, following the general pipeline of keypoint estimation. The Shape and Size branches predict the Local Shape at each center, and the Saliency branch predicts the Global Saliency map. The predicted Local Shape contains coarse but instance-aware shape information, while the Global Saliency map contains fine but instance-unaware saliency information. Finally, the Local Shape at each location is multiplied by the Global Saliency at the corresponding positions to obtain the segmentation of each instance. The Local Shape and Global Saliency branches are detailed below.
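
The following sketch illustrates how the five parallel heads could sit on top of a shared backbone; the head structure, channel counts, and hyper-parameters (C, num_classes, S) are our assumptions for illustration, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class CenterMaskHeads(nn.Module):
    """Illustrative sketch of the five parallel CenterMask heads
    (not the authors' implementation); C is the backbone channel width,
    num_classes and S are hypothetical hyper-parameters."""
    def __init__(self, C=64, num_classes=80, S=32):
        super().__init__()
        def head(out_ch):  # a small conv head per branch
            return nn.Sequential(
                nn.Conv2d(C, C, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(C, out_ch, 1),
            )
        self.heatmap = head(num_classes)  # center-point heatmap per class
        self.offset = head(2)             # sub-pixel offset of each center
        self.shape = head(S * S)          # S^2-dim local shape vector per location
        self.size = head(2)               # predicted (h, w) of each mask
        self.saliency = head(1)           # global saliency (foreground probability)

    def forward(self, feat):
        return {
            "heatmap": torch.sigmoid(self.heatmap(feat)),
            "offset": self.offset(feat),
            "shape": self.shape(feat),
            "size": self.size(feat),
            "saliency": torch.sigmoid(self.saliency(feat)),
        }

feat = torch.randn(1, 64, 128, 128)  # backbone feature map (assumed size)
outs = CenterMaskHeads()(feat)
print({k: tuple(v.shape) for k, v in outs.items()})
```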

2. Local Shape prediction

To distinguish instances at different locations, each instance's mask is modeled from its center point, defined as the center of the object's bounding box. An intuitive idea is to represent the mask directly with the image features extracted at the object's center point, but fixed-size image features are difficult to match to objects of different sizes. We therefore split the representation of the object mask into two parts: the shape and the size of the mask. The shape is represented by fixed-size image features, and the size (height and width) is represented by a two-dimensional vector; both can be obtained simultaneously from the representation at the object's center point. As shown in Figure 3, P denotes the image features extracted by the backbone network, and the shape and size branches predict the two kinds of information. Let F_shape (of size H x W x S^2) denote the feature map produced by the shape branch, and F_size (of size H x W x 2) the feature map produced by the size branch. If the center of an object lies at (x, y), the shape of the object is the vector F_shape(x, y) of size 1 x 1 x S^2, which is reshaped into an S x S matrix; the size of the object is F_size(x, y) = (h, w), where h and w are the predicted height and width. Resizing the S x S matrix to h x w yields the Local Shape representation of the object.
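
Below is a minimal sketch of how the Local Shape at a single center point could be decoded from the two branch outputs; the tensor layout, the bilinear interpolation mode, and the helper name local_shape_at are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def local_shape_at(f_shape, f_size, x, y, S=32):
    """Decode the Local Shape of an object centered at (x, y).
    f_shape: (S*S, H, W) output of the shape branch.
    f_size:  (2, H, W) output of the size branch, holding (h, w).
    Illustrative sketch, not the authors' implementation."""
    shape_vec = f_shape[:, y, x]                   # (S*S,) vector at the center
    shape_mat = shape_vec.view(1, 1, S, S)         # reshape to an S x S matrix
    h, w = f_size[:, y, x].round().int().tolist()  # predicted mask height/width
    # Resize the coarse S x S shape to the predicted object size h x w.
    local_shape = F.interpolate(shape_mat, size=(h, w),
                                mode="bilinear", align_corners=False)
    return torch.sigmoid(local_shape[0, 0])        # (h, w) coarse mask

S, H, W = 32, 128, 128
f_shape = torch.randn(S * S, H, W)
f_size = torch.rand(2, H, W) * 60 + 4              # fake sizes in [4, 64]
mask = local_shape_at(f_shape, f_size, x=40, y=50, S=S)
print(mask.shape)
```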

3. Global Saliency

Although the Local Shape representation above can generate a mask for each instance, the mask is obtained by resizing a fixed-size feature, so it only describes coarse shape information and cannot preserve spatial detail (especially at object edges). How to recover fine spatial location information from fixed-size features is a problem common to instance segmentation methods. Instead of the complex feature-alignment operations used elsewhere, we adopt a simpler and faster approach. Inspired by the way semantic segmentation finely segments the whole image directly, we propose predicting a full-image saliency map to achieve feature alignment. In parallel with the Local Shape branch, the Global Saliency branch predicts a global feature map from the backbone features, indicating whether each pixel in the image belongs to the foreground (an object region) or the background.
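
Continuing the sketch above, the final instance mask could be obtained by multiplying the Local Shape with the corresponding crop of the Global Saliency map; the crop placement around the center and the 0.5 threshold below are our assumptions, not details confirmed by the text:

```python
import torch

def combine(local_shape, saliency, cx, cy, thresh=0.5):
    """Multiply the coarse Local Shape with the matching crop of the
    Global Saliency map to get a fine, instance-aware mask.
    Illustrative sketch, not the authors' implementation.
    local_shape: (h, w) coarse mask placed around center (cx, cy).
    saliency:    (H, W) full-image foreground probability map."""
    h, w = local_shape.shape
    y0, x0 = cy - h // 2, cx - w // 2         # top-left of the object box
    crop = saliency[y0:y0 + h, x0:x0 + w]     # pixel-aligned saliency crop
    # Element-wise product: the shape gates which saliency pixels belong
    # to this instance; threshold to obtain the binary mask.
    return (local_shape * crop) > thresh

saliency = torch.rand(128, 128)
local_shape = torch.rand(40, 30)
inst_mask = combine(local_shape, saliency, cx=60, cy=70)
print(inst_mask.shape, inst_mask.dtype)
```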

Experimental results

1. Visualization results

To verify the effect of the proposed Local Shape and Global Saliency branches, we visualize the segmentation results of each branch in isolation, as shown in Figure 4. Panel (a) shows the output of a network with only the Local Shape branch: although the predicted masks are coarse, this branch distinguishes different objects well. Panel (b) shows the output with only the Global Saliency branch: when objects do not occlude each other, fine segmentation can be achieved with the Saliency branch alone. Panel (c) shows the performance of CenterMask in complex scenes; from left to right are the results with only the Local Shape branch, with only the Global Saliency branch, and with both. When objects occlude each other, the Saliency branch alone cannot segment them well, whereas combining the Shape and Saliency branches distinguishes different instances while still segmenting finely.

2. Comparison with other methods

The accuracy (AP) and speed (FPS) of CenterMask compared with other methods on the COCO test-dev dataset are shown in Figure 5. Two models surpass our method in accuracy, the two-stage Mask R-CNN and the one-stage TensorMask, but they run about 4 FPS and 8 FPS slower than our method, respectively. Our method outperforms the other one-stage instance segmentation algorithms in both speed and accuracy, achieving a good balance between the two. Visual comparisons between CenterMask and other methods are shown in Figure 6.

In addition, we transplanted the proposed Local Shape and Global Saliency branches onto FCOS, a mainstream one-stage object detection network; the results are shown in Figure 7. The best model achieves an accuracy of 38.5, demonstrating the general applicability of the method.

Future work

First, as our initial attempt at one-stage instance segmentation, CenterMask achieves a good balance between speed and accuracy, but in essence it is still not completely free from the influence of object detection. In the future we hope to explore a method that does not rely on box cropping, to simplify the whole pipeline. Second, since CenterMask's Global Saliency prediction is inspired by semantic segmentation, and panoptic segmentation combines instance segmentation with semantic segmentation, we expect our method to find good applications in panoptic segmentation. We also hope more work combining semantic and instance segmentation will appear in the future.

For more details, see the paper: CenterMask: Single Shot Instance Segmentation With Point Representation.

Recruitment information

Meituan's autonomous delivery center is recruiting for a large number of positions, sincerely seeking experts in perception, high-precision maps, decision planning, and prediction algorithms, as well as autonomous vehicle system development engineers and experts. The center relies on autonomous driving technology and sensors such as cameras and lidar to perceive the surrounding environment in real time and, through high-precision localization and intelligent decision planning, to guarantee full-scenario real-time delivery capability. Please send your resume to [email protected] (subject: Meituan Unmanned Vehicle Team).

References

  • [1] He K, Gkioxari G, Dollár P, et al. Mask r-cnn[C]//Proceedings of the IEEE international conference on computer vision. 2017: 2961-2969.
  • [2] Ren S, He K, Girshick R, et al. Faster r-cnn: Towards real-time object detection with region proposal networks[C]//Advances in neural information processing systems. 2015:91-99.
  • [3] Zhou X, Wang D, Krähenbühl P. Objects as points[J]. arXiv preprint arXiv:1904.07850, 2019.
  • [4] Tian Z, Shen C, Chen H, et al. Fcos: Fully convolutional one-stage object detection[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019:9627-9636.
  • [5] Liu S, Qi L, Qin H, et al. Path aggregation network for instance segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 8759-8768.
  • [6] Huang Z, Huang L, Gong Y, et al. Mask scoring r-cnn[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 6409-6418.
  • [7] Dai J, He K, Li Y, et al. Instance-sensitive fully convolutional networks[C]//European Conference on Computer Vision. Springer, Cham, 2016:534-549.
  • [8] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 3431-3440.
  • [9] Bolya D, Zhou C, Xiao F, et al. YOLACT: real-time instance segmentation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 9157-9166.
  • [10] Xie E, Sun P, Song X, et al. Polarmask: Single shot instance segmentation with polar representation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2020.
  • [11] Chen X, Girshick R, He K, et al. Tensormask: A foundation for dense object segmentation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019.

To read more technical articles, please scan the QR code to follow the Meituan technical team's WeChat official account!