BlendMask uses a better-designed Blender module to fuse top-level and low-level semantic information and extract more accurate instance segmentation features. The model reaches state-of-the-art results, yet its structure is simple and its inference is not slow. The highest accuracy reaches 41.3 AP, and the real-time version, BlendMask-RT, reaches 34.2 mAP at 25 FPS. The optimization approach of the paper is well worth studying, and the paper is worth reading
BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation
- Address:Arxiv.org/abs/2001.00…
Introduction
In the early days, there were two main kinds of dense instance segmentation models: the top-down approach and the bottom-up approach
Top-down approach
The top-down model first obtains box regions by some method, and then extracts a mask from the pixels in each region. This kind of model generally has the following problems:
- Local consistency between features and masks is lost. The paper discusses this for DeepMask, which uses a fully connected (FC) layer to predict the mask
- Feature extraction is redundant: mask features are extracted again for each Bbox
- Location information is lost because the convolutions operate on downsampled feature maps
Bottom-up approach
In the bottom-up model, a per-pixel prediction is made for the whole image: each pixel generates a feature vector, and the pixels are then grouped by some method. Because the prediction is per-pixel and the stride is small, local consistency and location information are well preserved. However, several problems remain:
- The heavy reliance on the quality of per-pixel prediction leads to suboptimal segmentation
- Because the mask is extracted from low-dimensional features, the segmentation capability for complex scenes (with many categories) is limited
- A complex post-processing approach is required
Hybridizing approach
Considering the above problems, the paper combines the top-down and bottom-up strategies, using instance-level information (Bbox) to crop and weight the per-pixel predictions. Although FCIS and YOLACT already have similar ideas, the paper argues that they do not handle top-level and bottom-level features properly. High-level features contain overall instance information, while low-level features retain better location information. The paper focuses on how to combine high-level and low-level features. The main contributions are as follows:
- A proposal-based instance mask merging method, the Blender, is proposed. On COCO it improves on the merging methods of YOLACT and FCIS by 1.9 mAP and 1.3 mAP respectively
- BlendMask, a simple network built on FCOS, is proposed
- The inference time of BlendMask does not grow with the number of predictions, as it does for two-stage detectors
- The accuracy and speed of BlendMask are better than those of Mask R-CNN, and its mask mAP is 1.1 higher than that of the best fully convolutional instance segmentation network, TensorMask
- Because the bottom module can segment multiple objects simultaneously, BlendMask can be used directly for panoptic segmentation
- The mask output of Mask R-CNN is fixed to 28 × 28, while the mask output resolution of BlendMask can be larger and is not restricted by the FPN
- BlendMask is versatile and flexible, and can be used for other instance-level recognition tasks, such as key-point detection, with a few minor modifications
Our methods
Overall pipeline
The bottom module predicts score maps (the bases), and the top layer predicts instance attentions. The Blender module then merges the bases and the attentions. The overall architecture is shown in Figure 2
- Bottom module
The score maps predicted by the bottom module are referred to in the paper as the bases B. B has size N × K × (H/s) × (W/s), where N is the batch size, K is the number of bases, H × W is the size of the input, and s is the output stride of the score maps.
The paper adopts the decoder of DeepLab V3+. The decoder takes two inputs, a low-level feature and a high-level feature. The high-level feature is upsampled and then fused with the low-level feature. The input of the bottom module can be backbone features or a feature pyramid, as in YOLACT or Panoptic FPN
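The fusion step described above can be sketched in a few lines of NumPy. This is only an illustration of "upsample the high-level feature, then fuse it with the low-level feature", not the paper's implementation: it assumes nearest-neighbor upsampling and 1×1 projections in place of the bilinear upsampling and 3×3 convolutions of the real DeepLab V3+ decoder, and all channel counts, weights, and function names are hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Upsample a (C, H, W) feature map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def conv1x1(x, weight):
    """1x1 convolution: weight is (C_out, C_in), x is (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)

def fuse_decoder(low, high, w_low, w_out):
    """Upsample the high-level feature, concatenate it with the
    channel-reduced low-level feature, and project to K bases."""
    factor = low.shape[1] // high.shape[1]
    high_up = upsample_nearest(high, factor)
    low_reduced = conv1x1(low, w_low)
    fused = np.concatenate([high_up, low_reduced], axis=0)
    return conv1x1(fused, w_out)

rng = np.random.default_rng(0)
low = rng.standard_normal((64, 32, 32))    # hypothetical low-level feature
high = rng.standard_normal((64, 8, 8))     # hypothetical high-level feature
w_low = rng.standard_normal((16, 64))      # reduce low-level channels
w_out = rng.standard_normal((4, 64 + 16))  # project to K = 4 bases
bases = fuse_decoder(low, high, w_low, w_out)
print(bases.shape)
```

The output keeps the spatial size of the low-level feature, which is why the bases preserve location information better than the top-level predictions.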
- Top Layer
A convolution layer is appended to each detection tower to predict the top-level attentions A. In YOLACT, for each pyramid level with resolution W_l × H_l, the output has size N × K × H_l × W_l, that is, one overall weight for each channel of the corresponding bases. The paper instead outputs A with size N × (K·M·M) × H_l × W_l, where M × M is the resolution of the attention. The attention therefore gives a weight for each pixel of each base channel, which is more fine-grained and is applied as an element-wise operation (described later).
Because each attention is a 3-D tensor (K × M × M), it can learn instance-level information, such as the rough shape and pose of the object. M is kept relatively small, since only a coarse prediction is needed. Typically its maximum value is 14, and the attention is produced with K·M·M output channels. Before the attentions are sent to the next module, the FCOS post-processing method selects the top D Bboxes P = {p_d | d = 1, ..., D} and the corresponding attentions A = {a_d | d = 1, ..., D}. Specifically, the top D Bboxes whose classification confidence exceeds a threshold are kept. The threshold is usually 0.05
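As a rough sketch of the selection step, the snippet below keeps the predictions whose classification confidence exceeds the threshold, takes the D highest-scoring ones, and reshapes each flat attention vector of length K·M·M into a K × M × M tensor. The function name, the argument layout, and the default D are assumptions made for illustration, and the NMS step that is part of real FCOS post-processing is omitted.

```python
import numpy as np

def select_top_d(boxes, scores, attns, K, M, D=100, score_thresh=0.05):
    """Keep the top-D predictions above the confidence threshold.

    boxes:  (P, 4) box coordinates
    scores: (P,)   classification confidences
    attns:  (P, K*M*M) flattened attentions, one row per prediction
    """
    keep = np.flatnonzero(scores > score_thresh)
    order = keep[np.argsort(-scores[keep])][:D]
    return boxes[order], scores[order], attns[order].reshape(-1, K, M, M)

# Toy example with 4 predictions, K = 4 bases and 7x7 attentions.
boxes = np.arange(16, dtype=float).reshape(4, 4)
scores = np.array([0.9, 0.01, 0.5, 0.3])
attns = np.zeros((4, 4 * 7 * 7))
sel_boxes, sel_scores, sel_attns = select_top_d(boxes, scores, attns, K=4, M=7, D=2)
print(sel_scores, sel_attns.shape)
```

With D = 2, the prediction with score 0.01 is dropped by the threshold and the one with score 0.3 is dropped by the top-D cut.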
- Blender module
The Blender module is the key part of BlendMask. It combines the position-sensitive bases according to the attentions to produce the final output
Blender module
The inputs to the Blender module are the bottom-level bases B, the selected top-level attentions A, and the Bboxes P
First, the RoIPooler of Mask R-CNN crops the region of the bases corresponding to each Bbox p_d and resizes it into a fixed R × R feature map r_d. Specifically, RoIAlign with sampling ratio = 1 is used, which samples only 1 point per bin, whereas Mask R-CNN samples 4 points per bin. During training, the GT Bboxes are used directly as the proposals, while during inference the detection results of FCOS are used directly
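To make "1 point per bin" concrete, here is a simplified NumPy sketch of RoIAlign with sampling ratio 1: the box is split into R × R bins and each bin is represented by a single bilinear sample at its center. It is a sketch under stated assumptions, not the Mask R-CNN implementation: the box is assumed to be given in feature-map coordinates already, pixel i is assumed to sit at coordinate i (real implementations apply a half-pixel offset), and the function names are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample feat (K, H, W) at float coordinates y, x."""
    _, H, W = feat.shape
    y = np.clip(y, 0, H - 1)
    x = np.clip(x, 0, W - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bot = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bot * wy

def roi_align_1pt(feat, box, R):
    """Crop box = (x1, y1, x2, y2) from feat into a (K, R, R) map,
    taking one sample at the center of each of the R x R bins."""
    x1, y1, x2, y2 = box
    ys = y1 + (np.arange(R) + 0.5) * (y2 - y1) / R
    xs = x1 + (np.arange(R) + 0.5) * (x2 - x1) / R
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return bilinear_sample(feat, yy, xx)

# A horizontal ramp: the value at (y, x) equals x, so sampled values
# should equal the x coordinates of the bin centers.
ramp = np.tile(np.arange(8.0), (1, 8, 1))
crop = roi_align_1pt(ramp, (0, 0, 4, 4), R=2)
print(crop[0])
```

Sampling 4 points per bin, as Mask R-CNN does, would average four such bilinear samples inside each bin instead of taking one.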
The attention size M is smaller than R, so each attention a_d must be interpolated from M × M to R × R, giving a'_d.
Then the K channels of a'_d are normalized with softmax along the K dimension to produce a set of score maps s_d
Then, for each region r_d, the element-wise product with the corresponding score map s_d is computed, and the K results are summed to obtain the final mask m_d = Σ_k (s_d^k ∘ r_d^k), where k = 1, ..., K indexes the bases
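The three blending steps (interpolate the attention to R × R, softmax over the K channels, then a weighted sum of the cropped bases) can be sketched for a single instance as below. The names r_d for the cropped bases and a_d for the attention are notation chosen here for illustration, nearest-neighbor interpolation is assumed for the attention, and the sizes in the demo are arbitrary.

```python
import numpy as np

def resize_nearest(a, R):
    """Nearest-neighbor resize of a (K, M, M) attention to (K, R, R)."""
    M = a.shape[-1]
    idx = (np.arange(R) * M) // R
    return a[:, idx][:, :, idx]

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def blend(r_d, a_d):
    """Blend cropped bases r_d (K, R, R) with attention a_d (K, M, M)."""
    R = r_d.shape[-1]
    a_up = resize_nearest(a_d, R)    # M x M -> R x R
    s_d = softmax(a_up, axis=0)      # normalize over the K bases
    return (s_d * r_d).sum(axis=0)   # element-wise product, sum over K

rng = np.random.default_rng(0)
r_d = rng.standard_normal((4, 56, 56))   # K = 4 bases cropped to R = 56
a_d = rng.standard_normal((4, 14, 14))   # attention with M = 14
mask = blend(r_d, a_d)
print(mask.shape)
```

Because the softmax weights sum to 1 at every pixel, each output pixel is a convex combination of the K base values at that location. A uniform attention therefore reduces the blend to a plain average of the bases.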
Figure 1 visualizes the Blender module in action, showing the characteristics of the attentions and the bases, as well as the fusion process
Configurations and baselines
The hyperparameters of BlendMask are as follows:
- R, the resolution of the bottom-level RoI
- M, the top-level prediction resolution
- K, the number of bases (channels)
- The input to a bottom module can be a backbone network or FPN feature
- The sampling method of the bases can be nearest neighbor or bilinear pooling
- The interpolation method of the top-level attentions can be nearest neighbor or bilinear sampling
The paper uses the abbreviation R_K_M to denote a model. In the baseline, backbone features C3 and C5 are used as the input of the bottom module, the top-level attentions use nearest-neighbor interpolation, and the bottom level uses bilinear interpolation, consistent with RoIAlign
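One way to keep these choices together is a small configuration object. The sketch below is purely illustrative: the class and field names are hypothetical and do not come from the paper or any released code, the model name is assumed to be R, K, and M joined with underscores, and the example values are arbitrary rather than the paper's baseline.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlendMaskConfig:
    R: int                            # bottom-level RoI resolution
    K: int                            # number of bases
    M: int                            # top-level attention resolution
    bottom_input: str = "C3,C5"       # backbone or FPN features
    top_interp: str = "nearest"       # attention interpolation
    bottom_interp: str = "bilinear"   # base sampling

    @property
    def name(self) -> str:
        """Model abbreviation, assumed to be R_K_M."""
        return f"{self.R}_{self.K}_{self.M}"

    @property
    def attention_channels(self) -> int:
        """Output channels of the top layer: K * M * M."""
        return self.K * self.M * self.M

cfg = BlendMaskConfig(R=56, K=4, M=14)   # arbitrary example values
print(cfg.name, cfg.attention_channels)
```

The attention_channels property makes the cost of a larger M visible: the top layer's output grows with M squared, while K only scales it linearly.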
Semantics encoded in learned bases and attentions
The visualization results of the bases and attentions are shown in Figure 3. The paper argues that BlendMask can extract two kinds of location information:
- Whether pixels are on objects (Semantic masks)
- Whether the pixel is located on a specific part of the object (position-sensitive features), such as the upper left corner, lower right corner
The red and blue bases detect points on the upper-right and lower-left parts of the target respectively, the yellow base most likely detects the semantic mask, and the green base activates on the boundary of the object. The position-sensitive features help instance-level segmentation, and the semantic mask complements the position-sensitive features to make the final result smoother. Because it learns more accurate features, BlendMask uses far fewer bases than YOLACT and FCIS (4 vs. 32 vs. 49).
Experiment
Ablation experiments
- Merging methods: Blender vs. YOLACT vs. FCIS
In the paper, the Blender is modified to reproduce the merging methods of the other two algorithms for the experiment. As Table 1 shows, the Blender's merging method is better than the other two
- Top and bottom resolutions
As Table 2 shows, accuracy increases as the resolution increases. To keep a good trade-off between cost and accuracy, the ratio R/M is kept greater than 4. In general, the inference time is relatively stable
- Number of bases
Table 3 shows that K=4 is optimal
- Bottom feature locations: backbone vs. FPN
As you can see from Figure 4, using FPN features as input to the bottom module not only improves accuracy, but also reduces inference time
- Interpolation method: nearest vs. bilinear
For the interpolation of the top-level attentions, bilinear interpolation is 0.2 AP higher than nearest neighbor
For the bottom-level score maps, bilinear interpolation is 2 AP higher than nearest neighbor
- Other improvements
The paper also tries other tricks to improve the network's performance. Although these tricks help to some extent, they are not added to the final network
Main result
- Quantitative results
The results show that BlendMask is better than existing instance segmentation algorithms in both accuracy and speed. However, with an R-50 backbone and without multi-scale, BlendMask is worse than Mask R-CNN
- Real-time setting
To compare with YOLACT, the paper builds a compact version, BlendMask-RT: 1) the number of convolutions in the prediction head is reduced 2) the classification tower and the box tower are merged 3) Proto-FPN is used and P7 is removed. According to the results, BlendMask-RT is 7 ms faster and 3.3 AP higher than YOLACT
- Qualitative results
Figure 4 shows the visualizations. You can see that BlendMask is better than Mask R-CNN, because the mask resolution of BlendMask is 56 while the mask resolution of Mask R-CNN is only 28. In addition, YOLACT has difficulty distinguishing adjacent instances, while BlendMask does not
Discussions
- Comparison with Mask R-CNN
The structure of BlendMask is similar to that of Mask R-CNN. It is accelerated by computing the position-sensitive feature maps once, before RoI sampling, which removes repeated mask feature extraction, and by replacing the original complex global feature computation with the attention-guided Blender. Another advantage of BlendMask is that it produces high-quality masks whose output resolution is not limited by top-level sampling. For Mask R-CNN, increasing the resolution increases the computation time of the head, and the head must be made deeper to extract accurate mask features. In addition, the inference time of Mask R-CNN increases with the number of Bboxes, which is not friendly to real-time use. Finally, the Blender module is very flexible: the top-level instance attention prediction needs only one convolution layer, so it is almost free to add to other detection algorithms
- Panoptic Segmentation
BlendMask can perform the panoptic segmentation task by using the semantic segmentation branch of Panoptic-FPN. The results show that BlendMask is more effective
Conclusion
BlendMask extracts more accurate instance segmentation features by combining top-level and low-level semantic information in a better-designed Blender module. The model draws on the structure of several excellent algorithms, such as YOLACT, FCOS, and Mask R-CNN. It is somewhat tricky, but it is a valuable reference. The BlendMask model is simple, reaches state-of-the-art results, and is not slow at inference. The highest accuracy reaches 41.3 AP, and the real-time version, BlendMask-RT, reaches 34.2 mAP at 25 FPS. The experiments in the paper are thorough, and it is worth reading