Preface

This article comprehensively reviews the progress of single-stage instance segmentation. Nineteen related papers, grouped into three categories (local-mask-based, global-mask-based, and location-based segmentation), are analyzed, and their advantages and disadvantages are discussed.


Instance segmentation is a challenging computer vision task that requires predicting object instances and their per-pixel segmentation masks. This makes it a hybrid of semantic segmentation and object detection.

Since Mask R-CNN, SOTA methods for instance segmentation have mainly been Mask R-CNN and its variants (PANet, Mask Scoring R-CNN, etc.). They follow a detect-then-segment approach: first perform object detection to extract a bounding box around each object instance, then perform binary segmentation inside each bounding box to separate the foreground (object) from the background.

In addition to this top-down detect-then-segment approach, there are other instance segmentation paradigms. One example is to treat instance segmentation as a bottom-up pixel-assignment problem, as done in SpatialEmbedding (ICCV 2019). However, these methods generally perform worse than the detect-then-segment SOTA, so we won't cover them in detail in this article.

However, Mask R-CNN is quite slow and cannot be used in many real-time applications. In addition, the masks predicted by Mask R-CNN have a fixed resolution and are therefore not fine enough for large objects with complex shapes. Spurred by advances in anchor-free object detection methods such as CenterNet and FCOS, there has been a wave of research on single-stage instance segmentation. Many of these methods are faster and more accurate than Mask R-CNN, as shown in the figure below.

Inference times of recent single-stage approaches, tested on a Tesla V100 GPU

This article reviews the latest progress in single-stage instance segmentation, focusing on mask representation, a key aspect of instance segmentation.

Local mask and global mask

One of the core questions in instance segmentation is the representation or parameterization of the instance mask: 1) whether to use a local or a global mask, and 2) how to represent/parameterize the mask.

Mask representation: local masks and global masks

There are two main ways to represent instance masks: local masks and global masks.

The global mask is what we ultimately want. It has the same spatial extent as the input image, although its resolution may be smaller, such as 1/4 or 1/8 of the original image. It has the natural advantage of a uniform resolution (and therefore a fixed-length representation) for both large and small objects. This does not sacrifice resolution for larger objects, and the fixed resolution makes batching for optimization convenient.

A local mask is usually more compact, since it does not waste representation on large background regions the way a global mask does. It must be paired with the mask's location in order to be restored to a global mask, and the local mask size depends on the object size. To enable efficient batch processing, however, instance masks need a fixed-length parameterization. The simplest solution is to resize the instance mask to a fixed resolution, as Mask R-CNN does. As we will see below, there are more efficient ways to parameterize local masks.
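To make the local-to-global relationship concrete, here is a minimal sketch (shapes, box format, and the 0.5 threshold are all assumed; OpenCV is used only for resizing) of restoring a Mask R-CNN-style fixed-resolution local mask to a global mask:

```python
import cv2
import numpy as np

# Minimal sketch (shapes assumed): restore a fixed-resolution local mask
# (e.g., Mask R-CNN's 28x28) to a global mask using its bounding box.
def local_to_global(local_mask, box, img_h, img_w):
    x1, y1, x2, y2 = box                       # box in pixel coordinates
    resized = cv2.resize(local_mask.astype(np.float32), (x2 - x1, y2 - y1))
    canvas = np.zeros((img_h, img_w), dtype=np.uint8)
    canvas[y1:y2, x1:x2] = (resized > 0.5).astype(np.uint8)
    return canvas

global_mask = local_to_global(np.ones((28, 28)), (30, 40, 130, 200), 480, 640)
```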

Depending on whether a local or a global mask is used, single-stage instance segmentation methods can be largely divided into local-mask-based and global-mask-based methods.

Local-mask-based methods

Local-mask-based methods output instance masks directly on each local region.

Explicitly encoded contours

A bounding box is, in a sense, a coarse mask that approximates the mask's contour with the smallest enclosing rectangle. ExtremeNet (Bottom-up Object Detection by Grouping Extreme and Center Points, CVPR 2019) uses four extreme points (thus eight degrees of freedom instead of the traditional four-DoF bounding box), and this richer parameterization can be naturally extended to an octagonal mask by extending each extreme point in both directions along its corresponding edge to a segment one quarter of the edge length.

Since then, there has been a series of attempts to encode/parameterize the contours of instance masks into fixed-length coefficients, given different decomposition bases. These methods regress the center of each instance (not necessarily the bbox center) and the contour relative to that center.

ESE-Seg (Explicit Shape Encoding for Real-Time Instance Segmentation, ICCV 2019) designs an inner-center radius shape signature for each instance and fits it with Chebyshev polynomials.

PolarMask (PolarMask: Single Shot Instance Segmentation with Polar Representation, CVPR 2020) uses rays emitted from the center at constant angular intervals to describe the contour.

FourierNet (FourierNet: Compact Mask Representation for Instance Segmentation Using Differentiable Shape Decoders) introduces a decoder based on the Fourier transform and achieves smoother boundaries than PolarMask.
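As a concrete illustration of such contour parameterizations, here is a minimal sketch (ray count, conventions, and OpenCV rasterization are assumed; this is not PolarMask's actual implementation) that decodes a polar contour, i.e., one radius per fixed angle, into a binary mask:

```python
import cv2
import numpy as np

# Minimal sketch: decode a PolarMask-style polar contour (one ray length per
# fixed angle around the instance center) into a binary mask via a polygon.
def polar_to_mask(center, radii, img_h, img_w):
    angles = np.linspace(0, 2 * np.pi, num=len(radii), endpoint=False)
    xs = center[0] + radii * np.cos(angles)    # contour vertices
    ys = center[1] + radii * np.sin(angles)
    pts = np.stack([xs, ys], axis=1).astype(np.int32)
    mask = np.zeros((img_h, img_w), dtype=np.uint8)
    cv2.fillPoly(mask, [pts], 1)               # rasterize the polygon
    return mask

mask = polar_to_mask(center=(64, 64), radii=np.full(36, 20.0), img_h=128, img_w=128)
```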

Various contour-based methods

These methods typically use 20 to 40 coefficients to parameterize the mask contour. They are fast at inference and easy to optimize. However, their disadvantages are also obvious. First, let's be honest: visually they all look pretty awful. They do not depict masks accurately, and they cannot describe objects with holes in the center.

This line of methods is interesting but not very promising. The complex topology of instance masks is difficult to handle with an explicit encoding of their contours.

Structured 4D tensors

TensorMask (TensorMask: A Foundation for Dense Object Segmentation, ICCV 2019) is one of the first works to demonstrate the idea of dense mask prediction by predicting a mask at each location of the feature map. TensorMask still predicts masks within regions of interest rather than global masks, but it can perform instance segmentation without first running object detection.

TensorMask uses structured 4D tensors to represent masks in the spatial domain (two dimensions iterate over all possible positions in the image, and two dimensions represent the mask at each position). It also introduces aligned representations and a tensor bipyramid to recover spatial details, but these alignment operations make the network even slower than the two-stage Mask R-CNN. In addition, to achieve good performance, it needs to be trained with a schedule six times longer than the standard COCO object detection pipeline (the 6x schedule).

Compact mask encoding

Like natural images, natural object masks are not random: instance masks lie on a manifold of much lower intrinsic dimension than the pixel space.

MEInst (Mask Encoding for Single Shot Instance Segmentation, CVPR 2020) distills the mask into a compact, fixed-dimensional representation. Using a simple linear transformation (PCA), MEInst compresses a 28×28 local mask into a 60-dimensional feature vector. The paper also tries directly regressing the 784-dim (28×28) vector on a one-stage object detector (FCOS) and obtains reasonable results, with only a 1-2 AP drop.
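Below is a minimal sketch of this PCA-based encoding; the 28×28 and 60-dim numbers follow the paper, while the random placeholder masks and pipeline details are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Minimal sketch of MEInst-style mask encoding: PCA compresses 28x28 local
# masks into 60-dim codes a detector can regress as fixed-length targets.
side, n_components = 28, 60
masks = (np.random.rand(10000, side * side) > 0.5).astype(np.float32)  # placeholder GT masks

pca = PCA(n_components=n_components)
codes = pca.fit_transform(masks)              # (10000, 60) regression targets

# At inference, decode a predicted 60-dim code back into a 28x28 binary mask.
recon = pca.inverse_transform(codes[:1]).reshape(side, side)
binary_mask = (recon > 0.5).astype(np.uint8)
```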

This means that directly predicting high-dimensional masks (the natural representation in TensorMask) is not entirely impossible, but it is hard to optimize. A compact mask representation is easier to optimize and runs faster at inference. MEInst is most similar in spirit to Mask R-CNN and can be used directly with most other object detection algorithms.

Global-mask-based methods

Global-mask-based methods first generate intermediate, shared feature maps from the whole image, then assemble the extracted features to form the final mask for each instance. This has recently been the mainstream approach to single-stage instance segmentation.

Prototypes and Coefficients

YOLACT (YOLACT: Real-time Instance Segmentation, ICCV 2019) is one of the earliest attempts at real-time instance segmentation. YOLACT splits instance segmentation into two parallel tasks: generating a set of prototype masks and predicting mask coefficients for each instance.

Prototype masks are generated with an FCN and can benefit directly from improvements in semantic segmentation. Coefficients are predicted as additional features of the bounding box. These two parallel steps are followed by an assembly step: a simple linear combination via matrix multiplication, followed by cropping with the predicted bounding box of each instance. Cropping relieves the network of the burden of suppressing noise outside the bounding box, but if the bounding box contains part of another instance of the same class, you will still see some leakage.

The prediction of the prototype masks is critical to ensuring the high resolution of the final instance masks, comparable to that of semantic segmentation. The prototype masks depend only on the input image and are independent of category and specific instance. This distributed representation is compact because the number of prototype masks is independent of the number of instances, which makes YOLACT's mask computation cost constant (unlike Mask R-CNN, whose cost is linear in the number of instances).
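The assembly step amounts to a single matrix multiplication. Here is a minimal sketch with assumed shapes (32 prototypes at 138×138; the tanh on the coefficients follows the paper, everything else is illustrative):

```python
import numpy as np

# Minimal sketch of YOLACT's assembly: linearly combine k prototype masks with
# per-instance coefficients, apply a sigmoid, and crop with the predicted box.
k, h, w = 32, 138, 138
prototypes = np.random.rand(h, w, k)          # FCN prototype branch output
coeffs = np.random.randn(k)                   # per-instance mask coefficients

mask = prototypes @ np.tanh(coeffs)           # (h, w) linear combination
mask = 1.0 / (1.0 + np.exp(-mask))            # sigmoid

x1, y1, x2, y2 = 30, 40, 100, 120             # predicted bounding box
cropped = np.zeros_like(mask)
cropped[y1:y2, x1:x2] = mask[y1:y2, x1:x2]    # suppress out-of-box noise
```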

Looking back, InstanceFCN (Instance-sensitive Fully Convolutional Networks, ECCV 2016) and MSRA's follow-up FCIS (Fully Convolutional Instance-aware Semantic Segmentation, CVPR 2017) appear to be special cases of YOLACT. Both use an FCN to generate multiple instance-sensitive score maps encoding the relative positions of object instances, then apply an assembly module to output instance masks. The position-sensitive score maps can be viewed as prototype masks, but InstanceFCN and FCIS combine them with a fixed set of spatial pooling operations rather than learned linear coefficients.

InstanceFCN [b] and FCIS [c] use fixed pooling operations for instance segmentation

BlendMask (BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation, CVPR 2020) builds on YOLACT, but instead of predicting one scalar coefficient per prototype mask, BlendMask predicts a low-resolution (7×7) attention map to blend the cropped prototype masks within each bounding box. The attention maps are predicted as high-dimensional (7×7=49-d) features attached to each bounding box. Interestingly, BlendMask uses four prototype masks, but it even works with just one.
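A minimal sketch of this blending, with assumed shapes (4 bases, one 7×7 attention map per base, upsampled to the box size; BlendMask additionally normalizes the attention across bases, which is omitted here):

```python
import cv2
import numpy as np

# Minimal sketch of BlendMask-style blending: per-instance low-resolution
# attention maps are upsampled to the box size and used as per-pixel weights
# over the cropped prototype bases.
k, h, w, r = 4, 56, 56, 7
bases = np.random.rand(k, h, w)               # prototype masks cropped to the box
attn = np.random.rand(k, r, r)                # 7x7 attention maps for this instance

attn_up = np.stack([cv2.resize(a.astype(np.float32), (w, h)) for a in attn])
mask = (bases * attn_up).sum(axis=0)          # (h, w) per-pixel weighted blend
```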

CenterMask (CenterMask: Single Shot Instance Segmentation with Point Representation, CVPR 2020) works in almost exactly the same way as BlendMask and explicitly uses a single prototype mask (named the global saliency map).

CenterMask uses CenterNet as its backbone, while BlendMask uses FCOS, a similarly anchor-free, single-stage detector, as its backbone.

CenterMask architecture. BlendMask has a very similar pipeline.

Note that both BlendMask and CenterMask still depend on the detected bounding boxes: before blending, the attention map (or saliency map) and the cropped prototype mask must be resized to the size of the bounding box.

CondInst (Conditional Convolutions for Instance Segmentation) goes further and completely eliminates any dependence on bounding boxes. Instead of assembling cropped prototype masks, it borrows the idea of dynamic filters and predicts the parameters of a lightweight FCN head. The FCN head has three layers and only 169 parameters in total. Surprisingly, the authors show that even if the "prototype" is just a plain 2-channel CoordConv input, the network can still reach 31 AP on COCO. We will discuss this in the implicit representation section below.
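The 169 parameters break down as three 1×1 conv layers over 8 mask-branch channels plus 2 relative-coordinate channels: (10×8+8) + (8×8+8) + (8×1+1) = 169. A minimal sketch of such a dynamically parameterized head follows (channel sizes follow the paper; everything else, including the split layout, is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a CondInst-style dynamic mask head: the controller predicts
# a flat 169-dim vector per instance, which is reshaped into three 1x1 convs.
def dynamic_mask_head(features, params):
    # features: (1, 10, H, W) mask-branch features + 2 rel-coord channels
    # params:   (169,) dynamically predicted weights and biases
    w1, b1, w2, b2, w3, b3 = torch.split(params, [10 * 8, 8, 8 * 8, 8, 8, 1])
    x = F.relu(F.conv2d(features, w1.view(8, 10, 1, 1), b1))
    x = F.relu(F.conv2d(x, w2.view(8, 8, 1, 1), b2))
    x = F.conv2d(x, w3.view(1, 8, 1, 1), b3)
    return torch.sigmoid(x)                   # (1, 1, H, W) global mask

mask = dynamic_mask_head(torch.randn(1, 10, 100, 152), torch.randn(169))
```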

BlendMask/CenterMask and CondInst are both extensions of YOLACT:

  • BlendMask/CenterMask blend cropped prototype masks with finer-grained attention within each bbox. YOLACT is a special case of BlendMask/CenterMask in which the attention map resolution is 1×1.

  • CondInst blends the prototype masks with a deeper convolution whose filters are dynamically predicted. YOLACT is a special case of CondInst in which the FCN head is a single 1×1 conv layer.

Using a separate branch to predict prototype masks allows these methods to benefit from semantic segmentation as an auxiliary task (usually a 1-2 AP gain). They can also be naturally extended to perform panoptic segmentation.

Some technical details about the number of parameters needed to represent each instance mask are listed below. These global-mask-plus-coefficient methods use 32, 196, and 169 parameters per instance mask, respectively:

  • YOLACT uses 32 prototype masks + a 32-dim mask coefficient vector + box cropping;

  • BlendMask uses 4 prototype masks + four 7×7 attention maps + box cropping;

  • CondInst uses CoordConv + three dynamic 1×1 conv layers (169 parameters).

SOLO and SOLOv2: segmenting objects by locations

SOLO is one of these works and deserves its own section. The papers are insightful and well written. To me they are works of art (like CenterNet, another of my favorites).

SOLOv1 architecture

The first author of the paper explained the motivation behind SOLO on Zhihu, which I quote below:

“Semantic segmentation predicts the semantic category of each pixel in the image. Analogously, for instance segmentation, we propose to predict the ‘instance category’ of each pixel. The key question now is: how do we define instance categories?”

Two object instances in the input image are the same instance if they have exactly the same shape and position. Any two different instances differ in either position or shape. Since shape is generally difficult to describe, we approximate shape by size.

Thus, the “instance category” is defined by location and size. Location is classified by center position: SOLO approximates the center position by dividing the input image into an S × S grid, giving S² location classes. Size is handled by assigning objects of different sizes to different levels of the feature pyramid (FPN). So for each pixel, SOLO only needs to decide which S × S grid cell (and which FPN level) the pixel and its corresponding instance category belong to. Hence SOLO only needs to solve two pixel-level classification problems, similar to semantic segmentation.
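A minimal sketch of this location assignment (grid size, image size, and the center-of-mass convention are assumed for illustration):

```python
# Minimal sketch of SOLO's location assignment: divide the image into an
# S x S grid and assign an instance to the cell containing its mask center.
S = 20
img_h, img_w = 800, 1216
cx, cy = 500.0, 300.0                     # instance center (center of mass of the mask)

grid_i = int(cy / img_h * S)              # row of the responsible grid cell
grid_j = int(cx / img_w * S)              # column
instance_category = grid_i * S + grid_j   # one of the S^2 location classes
```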

Now the other key question is how is the mask represented?

Instance masks are represented directly as global masks stacked into S² channels. This is an ingenious design that solves many problems at once. First, many previous works stored 2D masks as flattened vectors, which quickly becomes unmanageable as increasing mask resolution makes the channel count explode; a global mask naturally preserves the spatial relationships among the pixels of the mask. Second, generating global masks keeps the masks at high resolution. Third, the number of predicted masks is fixed, regardless of how many objects are in the image. This is similar to the prototype-mask line of work, and we will see how the two streams merge in SOLOv2.

SOLO casts instance segmentation as a classification-only problem and removes any dependence on regression. This makes SOLO naturally independent of object detection. SOLO and CondInst are two works that operate directly on global masks, and they are truly box-free methods.

Global masks predicted by SOLO. The masks are redundant and sparse, and robust to object localization error.

Resolution tradeoff

From the global masks predicted by SOLO, we can see that the masks are relatively insensitive to localization error, because masks predicted by adjacent channels are very similar. This creates a tradeoff between the resolution (and therefore accuracy) of the object location and that of the instance mask.

TensorMask's idea of 4D structured tensors is sound in theory, but difficult to implement in practice within the current framework of NHWC tensor formats. Flattening a two-dimensional tensor with spatial semantics into a one-dimensional vector inevitably loses some spatial detail (much like doing semantic segmentation with a fully connected network), and even representing a low-resolution 128×128 image this way has its limits. Either the two position dimensions or the two mask dimensions must sacrifice resolution. Most previous works considered positional resolution more important and downsampled/compressed the mask, compromising the expressiveness and quality of the mask. TensorMask tries to strike a balance, but the tedious operations lead to slow training and inference. SOLO realized that we do not need high-resolution location information, and borrowed from YOLO by compressing locations into a coarse S × S grid. In this way, SOLO keeps the global masks at high resolution.

I naively thought SOLO might work by predicting the S² × H × W global masks as additional flattened HW-dimensional features attached to each of the S² grid cells, as in YOLO. I was wrong: keeping the global masks at full resolution rather than flattening them into vectors is actually the key to SOLO's success.

Decoupled SOLO and Dynamic SOLO

As mentioned above, the global masks SOLO predicts across S² channels are highly redundant and sparse. Even at a coarse resolution of S=20 there are 400 channels, and there cannot possibly be so many objects in an image that every channel contains a valid instance mask.

In Decoupled SOLO, the original mask tensor M of shape H × W × S² is replaced by two tensors X and Y, each of shape H × W × S. For an object at grid position (i, j), M_ij is approximated by the element-wise multiplication X_i ⊗ Y_j. This reduces the number of channels from 400 to 40, and experiments show no performance degradation.
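A minimal sketch of the decoupled reconstruction (shapes and the channel-indexing convention are assumed):

```python
import numpy as np

# Minimal sketch of Decoupled SOLO: instead of S^2 mask channels, predict S
# channels per axis; the mask for grid cell (i, j) is their element-wise product.
S, H, W = 20, 200, 304
X = np.random.rand(H, W, S)           # one branch of the decoupled mask head
Y = np.random.rand(H, W, S)           # the other branch

i, j = 4, 7                           # grid cell responsible for this instance
mask_ij = X[:, :, i] * Y[:, :, j]     # (H, W), approximates M[:, :, i * S + j]
```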

SOLO vs Decoupled SOLO vs SOLOv2

Now it is natural to ask: can we borrow YOLACT's prototype-mask idea, predicting far fewer masks and combining them with coefficients predicted at each grid cell? SOLOv2 does just that.

In SOLOv2, there are two branches, a feature branch and a kernel branch. The feature branch predicts E prototype masks, and the kernel branch predicts a size-D kernel at each of the S² grid cell locations. As we saw in the YOLACT section above, this dynamic-filter approach is the most flexible. When D=E, it is a simple linear combination of prototype masks (equivalently, a 1×1 conv), the same as YOLACT. The paper also tries 3×3 conv kernels (D=9E). This can be taken even further by predicting the weights and biases of a lightweight multi-layer FCN, as in CondInst.
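A minimal sketch of the D=E case, where each predicted kernel acts as a 1×1 conv over the E shared prototype feature maps (all shapes assumed):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of SOLOv2-style dynamic convolution (D = E case): the kernel
# branch predicts one 1x1 conv kernel per grid cell; applying it to the shared
# feature branch output yields one global mask per cell.
E, S, H, W = 256, 20, 200, 304
features = torch.randn(1, E, H, W)            # feature branch (prototype) output
kernels = torch.randn(S * S, E)               # kernel branch output

masks = torch.sigmoid(F.conv2d(features, kernels.view(S * S, E, 1, 1)))  # (1, S^2, H, W)
```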

Now that the global mask branch is decoupled from dedicated locations, we can observe that the new prototype masks exhibit more complex patterns than those in SOLO. They are still position-sensitive and quite similar to YOLACT's prototype masks.

Implicit representation of masks

The idea of dynamic filters used in CondInst and SOLOv2 may sound fancy at first, but viewed as a natural extension of the coefficient list used for linear combination, it is actually quite simple.

We can view this progression as parameterizing the mask with coefficients, then with attention maps, and finally with the dynamic filters of a small neural-network head. The idea of using a neural network to dynamically encode a geometric entity has also been explored recently in 3D learning. Traditionally, 3D shapes are encoded as voxels, point clouds, or meshes. Occupancy Networks (Occupancy Networks: Learning 3D Reconstruction in Function Space, CVPR 2019) proposes encoding a shape as a neural network, treating the continuous decision boundary of a deep network as the 3D surface: the network takes a 3D point and decides whether it lies inside the encoded shape. This allows the 3D mesh to be extracted at any desired resolution at inference time.

Implicit representation proposed in Occupancy Networks

Can we learn a neural network (a set of dynamic filters) for each object instance, such that the network takes a 2D point and outputs whether that point belongs to the object mask? This naturally yields a global mask, and at any desired resolution.
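Purely as an illustration (the tiny architecture below is invented, not CondInst's or Occupancy Networks' actual design), a coordinate-based MLP shows how an implicit mask can be rendered at arbitrary resolution:

```python
import torch
import torch.nn as nn

# Minimal sketch of an implicit mask: a per-instance MLP maps a 2D coordinate
# to occupancy, so the mask can be rendered at any resolution by querying a
# denser coordinate grid with the very same network.
mlp = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))

def render_mask(mlp, h, w):
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (h*w, 2) query points
    return torch.sigmoid(mlp(coords)).reshape(h, w)         # occupancy per pixel

low_res = render_mask(mlp, 56, 56)      # same network, different resolutions
high_res = render_mask(mlp, 448, 448)
```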

Revisiting CondInst's ablation study: it achieves decent performance with only the CoordConv input (which amounts to uniform spatial sampling), even without prototype masks. Since this operation is decoupled from the resolution of the prototype masks, it would be interesting to feed the CoordConv input at a higher resolution, obtain a higher-resolution global mask, and see whether this improves performance. I firmly believe that implicit encoding of instance masks is the future.

With only CoordConv inputs and no prototype masks, CondInst can still achieve good performance

Final words

Most single-stage instance segmentation efforts build on anchor-free object detection, such as CenterNet and FCOS. Perhaps unsurprisingly, many of these papers come from the same lab at the University of Adelaide where FCOS was created. They recently open-sourced their platform at github.com/aim-uofa/Ad…

Many of the recent approaches are fast and achieve real-time or near-real-time performance (30+ FPS). NMS is often the bottleneck for real-time instance segmentation. For truly real-time performance, YOLACT uses Fast NMS and SOLOv2 uses Matrix NMS.
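As an example of how NMS can be vectorized for speed, here is a minimal sketch of the Fast NMS idea from YOLACT (simplified and box-based; the real implementation differs in details):

```python
import torch

def pairwise_iou(a, b):
    # a, b: (n, 4) boxes as (x1, y1, x2, y2); returns the (n, n) IoU matrix.
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

# Minimal sketch of Fast NMS: one IoU matrix, upper triangle only, so every box
# can be suppressed by any higher-scoring box (even one already suppressed).
# This trades a little accuracy for a fully parallel, loop-free computation.
def fast_nms(boxes, scores, iou_threshold=0.5):
    order = scores.argsort(descending=True)
    iou = pairwise_iou(boxes[order], boxes[order]).triu(diagonal=1)
    keep = iou.max(dim=0).values <= iou_threshold
    return order[keep]
```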

Afterword.

  • Predicting high-dimensional feature vectors for instance masks is tricky. Almost all approaches focus on how to compress the mask into a low-dimensional representation. These methods typically use 20 to 200 parameters to describe a mask, with varying degrees of success. There is likely a fundamental lower bound on the number of parameters needed to represent a mask's shape.

  • Hand-designed contour parameterizations are not very promising.

  • Local masks inherently depend on object detection. I hope to see more research on directly generating global masks.

  • The implicit representation of masks is expressive, compact, and can be generated at any resolution. CondInst has the potential to generate higher-resolution global masks by harnessing the power of implicit representations.

  • SOLO is simple, and SOLOv2 is fast and accurate. I hope to see more future research along this line.


References

1. SOLO: Segmenting Objects by Locations, arXiv 12/2019

2. SOLOv2: Dynamic, Faster and Stronger, arXiv 03/2020

3. YOLACT: Real-time Instance Segmentation, ICCV 2019

4. PolarMask: Single Shot Instance Segmentation with Polar Representation, CVPR 2020 oral

5. ESE-Seg: Explicit Shape Encoding for Real-Time Instance Segmentation, ICCV 2019

6. PointRend: Image Segmentation as Rendering, CVPR 2020 oral

7. TensorMask: A Foundation for Dense Object Segmentation, ICCV 2019

8. BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation, CVPR 2020

9. CenterMask: Single Shot Instance Segmentation with Point Representation, CVPR 2020

10. MEInst: Mask Encoding for Single Shot Instance Segmentation, CVPR 2020

11. CondInst: Conditional Convolutions for Instance Segmentation, arXiv 03/2020

12. Occupancy Networks: Learning 3D Reconstruction in Function Space, CVPR 2019

13. FCOS: Fully Convolutional One-Stage Object Detection, ICCV 2019

14. Mask R-CNN, ICCV 2017 Best paper

15. PANet: Path Aggregation Network for Instance Segmentation, CVPR 2018

16. Mask Scoring R-CNN, CVPR 2019

17. InstanceFCN: Instance-sensitive Fully Convolutional Networks, ECCV 2016

18. FCIS: Fully Convolutional Instance-aware Semantic Segmentation, CVPR 2017

19. FCN: Fully Convolutional Networks for Semantic Segmentation, CVPR 2015

20. CoordConv: An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution, NeurIPS 2018

21. Associative Embedding: End-to-End Learning for Joint Detection and Grouping, NeurIPS 2017

22. SpatialEmbedding: Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering Bandwidth, ICCV 2019

By Patrick Langechuan Liu

Compiled by: CV Technical Guide

Original link: towardsdatascience.com/single-stag…
