Preface

This article introduces the shortcomings of existing instance segmentation methods and the difficulties the Transformer faces in instance segmentation, and presents SOTR, a high-quality Transformer-based instance segmentation model.

Experiments show that SOTR not only provides a new framework for instance segmentation, but also outperforms SOTA instance segmentation methods on the MS COCO dataset.

SOTR: Segmenting Objects with Transformers

Code: github.com/easton-cau/…

Background

Modern instance segmentation methods are usually based on CNNs and follow the detect-before-segmentation paradigm, which consists of a detector that identifies and localizes all objects and a mask branch that generates the segmentation masks. The success of this paradigm is attributed to the advantages of translation equivariance and locality, but it faces the following obstacles: 1) due to the limited receptive field, CNNs lack the feature coherence in high-level visual semantic information needed to relate instances, resulting in sub-optimal segmentation of large objects; 2) both segmentation quality and inference speed depend heavily on the object detector, which performs poorly in complex scenes.

To overcome these shortcomings, many recent studies have moved away from detect-before-segmentation in favor of bottom-up strategies, which learn per-pixel embeddings and instance-aware features and then use post-processing to group pixels into instances based on those embeddings. These methods therefore preserve position and local-coherence information well. However, the main disadvantages of bottom-up models are unstable clustering (such as fragmented and joint masks) and poor generalization across datasets of different scenes.

In addition, the Transformer easily captures global features and naturally models long-range semantic dependencies. In particular, self-attention, the key mechanism of the Transformer, broadly aggregates feature and position information from the entire input field. Transformer-based models can therefore better distinguish overlapping instances of the same semantic category, making them better suited than CNNs to high-level visual tasks.

However, Transformer-based approaches still have shortcomings. On the one hand, the Transformer is poor at extracting low-level features, leading to false predictions on small objects. On the other hand, because attention is computed over full feature maps, it requires a lot of memory and time, especially during training.

Contributions

To overcome these shortcomings, this paper proposes SOTR, an innovative bottom-up model that cleverly combines the advantages of CNNs and Transformers.

SOTR focuses on making better use of the semantic information extracted by the Transformer. To reduce the storage and computational complexity of the conventional self-attention mechanism, the paper proposes twin attention, which adopts a sparse representation of the conventional attention matrix.

1. The paper proposes SOTR, an innovative CNN-Transformer hybrid instance segmentation framework. It effectively models local connections and long-range dependencies by combining a CNN backbone with Transformer encoders over the input field, making it highly expressive. More importantly, SOTR simplifies the pipeline by segmenting object instances directly, without relying on box detection.

2. Twin attention, a new position-sensitive self-attention mechanism tailored to the Transformer. Compared with the original Transformer, this well-designed structure saves significant computation and memory, especially for large inputs in dense prediction tasks such as instance segmentation.

3. Unlike purely Transformer-based models, the proposed SOTR does not require pre-training on large datasets for its inductive biases to generalize well, which makes SOTR easier to apply when data is insufficient.

4. On the MS COCO benchmark, SOTR achieves 40.2% AP with a ResNet-101-FPN backbone, exceeding most SOTA methods in accuracy. Moreover, thanks to the global information extracted by the Twin Transformer, SOTR performs significantly better on medium objects (59.0%) and large objects (73.0%).

Methods

SOTR is a CNN-Transformer hybrid instance segmentation model that can simultaneously learn 2D representations and easily capture long-range information. It follows the direct-segmentation paradigm: the input feature map is first divided into patches, and the model then predicts the category of each patch while dynamically segmenting each instance.

Specifically, the model consists of three parts: 1) a backbone module that extracts image features, especially low-level and local features, from the input image; 2) a Transformer that models global and semantic dependencies; 3) a multi-level upsampling module that performs dynamic convolution between the generated feature map and the corresponding convolution kernels to produce the final segmentation masks.

SOTR is built on a simple FPN backbone with minimal modifications. The model flattens the FPN features P2-P6 and supplements them with positional embeddings before feeding them into the Transformer. Two heads are added after the Transformer to predict instance classes and generate dynamic convolution kernels. The multi-level upsampling module takes the P2-P4 features from the FPN and the P5 feature from the Transformer as inputs and generates the final prediction with the dynamic convolution operation shown in the red box in the figure.
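
To make the data flow concrete, the following shape walkthrough summarizes the pipeline for a single FPN level. The shapes use the paper's own symbols (C channels, an N×N patch grid, M classes, D kernel parameters); the layout is only an illustrative summary, not the authors' code.

```python
# Illustrative SOTR data flow (shapes use the paper's symbols):
#
#   P_i (H_i × W_i × C)               FPN feature, divided into an N×N patch grid
#     + positional embeddings
#     -> Transformer encoder          N×N×C features with global context
#          -> class head              N×N×M per-patch category scores
#          -> kernel head             N×N×D per-patch dynamic convolution kernels
#
#   P2-P4 (FPN) + P5 (Transformer)
#     -> multi-level upsampling       unified mask feature F
#
#   masks = dynamic_conv(F, kernels)  # one mask per patch, filtered by Matrix NMS
```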

Twin attention

Self-attention requires computation quadratic in both time and memory, which incurs high cost on high-dimensional sequences such as images and hinders the scalability of the model in different settings. To alleviate this problem, the paper proposes a twin attention mechanism that simplifies the attention matrix to a sparse representation.

The strategy limits the receptive field to a designed block pattern with a fixed stride. It first computes attention within each column, keeping elements in different columns independent; this aggregates contextual information between elements at the horizontal scale (Figure (1)). Similar attention is then performed within each row to take full advantage of feature interactions at the vertical scale (Figure (2)). The attention at these two scales is connected in sequence to form the final attention, which has a global receptive field covering information from both dimensions.

Given the FPN feature map Fi of size H×W×C (the i-th FPN layer), SOTR first divides Fi into an N×N grid of patches Pi of size N×N×C and then stacks them into fixed blocks along the vertical and horizontal directions. Positional embeddings are added to the blocks to preserve position information, with the column and row positional embedding spaces being 1×N×C and N×1×C, respectively. Both attention layers adopt the multi-head attention mechanism. To facilitate multi-layer connection and post-processing, all sub-layers in twin attention produce N×N×C outputs.
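
As a concrete illustration, here is a minimal PyTorch sketch of twin attention, applying multi-head attention first within columns and then within rows of the patch grid. The head count is an assumption and the positional embeddings are omitted for brevity; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class TwinAttention(nn.Module):
    """Minimal sketch of twin attention: multi-head attention applied first
    within each column, then within each row of the N×N patch grid."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.col_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.row_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                          # x: (B, N, N, C)
        B, N1, N2, C = x.shape
        # Column attention: each of the N2 columns is an independent
        # sequence of length N1 (elements in different columns stay separate).
        cols = x.permute(0, 2, 1, 3).reshape(B * N2, N1, C)
        cols, _ = self.col_attn(cols, cols, cols)
        x = cols.reshape(B, N2, N1, C).permute(0, 2, 1, 3)
        # Row attention: each of the N1 rows is a sequence of length N2;
        # stacking both directions yields a global receptive field.
        rows = x.reshape(B * N1, N2, C)
        rows, _ = self.row_attn(rows, rows, rows)
        return rows.reshape(B, N1, N2, C)
```

For example, `TwinAttention(256)(torch.randn(1, 36, 36, 256))` returns a tensor of the same N×N×C shape, as required for stacking multiple layers.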

The twin attention mechanism effectively reduces the memory and computational complexity from O((H×W)^2) to O(H×W^2 + W×H^2).
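
For intuition (a worked example, not a figure from the paper): on a 64×64 feature map, the original self-attention scales with (64×64)^2 ≈ 1.7×10^7 pairwise interactions, while twin attention needs only 64×64^2 + 64×64^2 ≈ 5.2×10^5, a 32× reduction; in general the saving factor is roughly H×W/(H+W).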

Transformer layer

Three different Transformer layers based on the encoder serve as the basic building blocks (as shown below). The original Transformer layer resembles the encoder used in NLP (Figure (a)) and consists of two parts: 1) a multi-head self-attention mechanism after layer normalization; 2) a multi-layer perceptron after layer normalization. Residual connections join the two parts. Finally, K of these Transformer layers are connected in series, and the resulting multi-dimensional sequence features serve as the outputs for subsequent prediction in the different functional heads.

To achieve the best trade-off between computational cost and feature-extraction effectiveness, the authors follow the original Transformer layer design and only substitute twin attention for multi-head self-attention in the pure Twin Transformer layer (Figure (b)).

To further improve the performance of the Twin Transformer, a hybrid Twin Transformer layer is also designed, as shown in Figure (c). It adds two 3×3 convolution layers, connected by a Leaky ReLU layer, on each twin attention module. Introducing convolution complements the attention mechanism, captures local information better, and enhances the feature representation.
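
Below is a minimal sketch of such a hybrid layer, reusing the TwinAttention sketch above and assuming a standard pre-norm Transformer skeleton; the MLP ratio and the exact placement of the convolution branch are assumptions.

```python
import torch.nn as nn

class HybridTwinTransformerLayer(nn.Module):
    """Sketch of the hybrid layer in Figure (c): twin attention plus two 3×3
    convolutions joined by a Leaky ReLU, inside the usual LayerNorm /
    residual / MLP structure. Details beyond the text are assumptions."""

    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = TwinAttention(dim)          # sketch from the previous section
        self.conv = nn.Sequential(              # local-information branch
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                       # x: (B, N, N, C)
        y = self.attn(self.norm1(x))
        # Convolutions expect channel-first maps, so permute to (B, C, N, N).
        y = y + self.conv(y.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        x = x + y                               # residual connection
        return x + self.mlp(self.norm2(x))
```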

Functional head

Feature maps from the Transformer module are fed into different functional heads for subsequent predictions. The class head consists of a single linear layer and outputs N×N×M classification results, where M is the number of classes.

Since each patch is assigned only one category for the single object centered in it (as in YOLO), the paper uses multi-level prediction and shares the heads across feature levels to further improve model performance and efficiency on objects of different scales.

The kernel head, parallel to the class head, is also composed of linear layers and outputs an N×N×D tensor for subsequent mask generation, where the tensor denotes N×N convolution kernels with D parameters each.
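
A minimal sketch of the two parallel heads follows; the single linear layer per head matches the text, while the layer sizes are left as parameters.

```python
import torch.nn as nn

class FunctionalHeads(nn.Module):
    """Sketch of the parallel class and kernel heads described above."""

    def __init__(self, dim, num_classes, kernel_dim):
        super().__init__()
        self.class_head = nn.Linear(dim, num_classes)   # -> N×N×M scores
        self.kernel_head = nn.Linear(dim, kernel_dim)   # -> N×N×D kernel params

    def forward(self, x):               # x: (B, N, N, C) from the Transformer
        return self.class_head(x), self.kernel_head(x)
```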

During training, Focal Loss is applied to the classification, and all supervision of these convolution kernels comes from the final mask loss.
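
As a sketch of that supervision: classification can use the standard sigmoid focal loss (available, e.g., as torchvision.ops.sigmoid_focal_loss), while each predicted mask is supervised by a Dice loss such as the one below (Dice Loss is introduced in the Mask section); the smoothing constant here is an assumption.

```python
import torch

def dice_loss(pred_logits, target, eps=1.0):
    """Standard Dice loss for per-mask supervision (eps smoothing is assumed)."""
    pred = pred_logits.sigmoid().flatten(1)            # (num_masks, H*W)
    target = target.flatten(1).float()
    inter = (pred * target).sum(-1)
    union = (pred * pred).sum(-1) + (target * target).sum(-1)
    return 1.0 - (2.0 * inter + eps) / (union + eps)   # one loss value per mask
```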

Mask

A simple approach to constructing instance-aware and position-sensitive mask feature representations is to predict on the feature map at each scale separately, but this costs extra time and resources. Inspired by Panoptic FPN, a multi-level upsampling module is designed instead to merge the features from each FPN layer and the Transformer into a unified mask feature.

First, the relatively low-resolution feature map P5, which carries position information, is obtained from the Transformer and fused with P2-P4 from the FPN. The feature map at each scale passes through 3×3 Conv, Group Norm, and ReLU stages. Then P3-P5 are bilinearly upsampled by 2×, 4×, and 8×, respectively, to (H/4, W/4) resolution. Finally, after P2-P5 are added together, point-wise convolution and upsampling produce the final unified H×W feature map.
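
A minimal sketch of this multi-level upsampling module follows; the channel counts and group number are illustrative assumptions, and the final upsampling to H×W is omitted.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelUpsampling(nn.Module):
    """Sketch of the module above: per-level 3×3 Conv + GroupNorm + ReLU,
    bilinear upsampling of P3-P5 to P2's (H/4, W/4) resolution, summation,
    and point-wise fusion. Sizes are assumptions."""

    def __init__(self, in_ch=256, out_ch=128, groups=32):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                          nn.GroupNorm(groups, out_ch),
                          nn.ReLU(inplace=True))
            for _ in range(4)                     # one block per level P2-P5
        ])
        self.fuse = nn.Conv2d(out_ch, out_ch, 1)  # point-wise convolution

    def forward(self, p2, p3, p4, p5):
        feats = [blk(p) for blk, p in zip(self.blocks, (p2, p3, p4, p5))]
        size = feats[0].shape[-2:]                # (H/4, W/4) reference size
        merged = feats[0]
        for f in feats[1:]:                       # 2x, 4x, 8x upsampling
            merged = merged + F.interpolate(f, size=size, mode="bilinear",
                                            align_corners=False)
        return self.fuse(merged)                  # unified mask feature
```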

For mask prediction, SOTR generates the mask for each patch by performing dynamic convolution on the unified feature map described above. Given the predicted convolution kernels K (N×N×D) from the kernel head, each kernel is responsible for generating the mask of the instance in the corresponding patch. The operation is as follows:

Z = F ∗ K

where ∗ denotes the convolution operation, F is the unified feature map, and Z is the finally generated masks. Note that the value of D depends on the shape of the convolution kernel, i.e. D = λ^2·C, where λ is the kernel size. The final instance segmentation masks are produced by Matrix NMS, and each mask is supervised independently by Dice Loss.
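
A minimal sketch of this dynamic convolution, assuming λ = 1 so that each predicted kernel reduces to a 1×1 convolution (D = C); the memory layout of the predicted kernels is also an assumption.

```python
import torch
import torch.nn.functional as F

def dynamic_mask_head(feat, kernels, kernel_size=1):
    """Sketch of Z = F * K: reshape each predicted kernel into a λ×λ×C
    convolution kernel (D = λ²·C) and convolve it with the unified feature
    map, yielding one candidate mask per patch."""
    C, H, W = feat.shape                                  # unified feature F
    k = kernels.reshape(-1, C, kernel_size, kernel_size)  # N·N kernels
    masks = F.conv2d(feat.unsqueeze(0), k,
                     padding=kernel_size // 2)            # (1, N·N, H, W)
    return masks.squeeze(0)                               # one mask per patch
```

Each output channel is then the mask candidate for the instance anchored at the corresponding patch; Matrix NMS removes the duplicates.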

Conclusion

1. Comparison between original Transformer, Pure Twin Transformer and Hybrid Twin Transformer

As shown in the table above, the proposed pure and hybrid Twin Transformers greatly exceed the original Transformer on all metrics. This means that the Twin Transformer architecture not only successfully captures long-range dependencies in the vertical and horizontal dimensions, but is also better suited to combining with a CNN backbone to learn image features and representations.

Between the pure and hybrid Twin Transformers, the latter works much better. This is because the 3×3 Conv can extract local information and improve feature expressiveness, enhancing the effectiveness of the Twin Transformer.

2. Visualization of masks

3. Detailed comparison with other methods

SOTR performs better than Mask R-CNN and BlendMask in two cases:

1) Mask R-CNN and BlendMask fail to detect objects with complex shapes that are easily missed (such as the carrots in front of the train, the sleeping elephant, and the driver of the small carriage).

2) Overlapping objects that cannot be separated by an exact boundary (e.g. the people in front of the train). SOTR predicts masks with clearer boundaries, while SOLOv2 tends to segment such objects into separate parts (for example, dividing a train into head and body) and sometimes fails to exclude the background. Thanks to the Transformer, SOTR obtains more comprehensive global information and avoids splitting objects in this way.

In addition, SOLOv2 has a higher false-positive rate than SOTR, designating non-existent objects as instances.

4. Inference speed comparison
