Preface:
Video instance segmentation (VIS) is a task that requires classifying, segmenting and tracking the objects of interest in a video simultaneously. The paper proposes a novel Transformer-based video instance segmentation framework, VisTR, which treats the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
Given a video clip consisting of multiple image frames as input, VisTR directly outputs a sequence of masks for each instance in the video. At its core is a new and effective instance sequence matching and segmentation strategy, which supervises and segments instances as a whole at the sequence level. VisTR segments and tracks instances from the perspective of similarity learning, greatly simplifying the overall pipeline and differing substantially from existing methods.
VisTR is the fastest of the existing VIS models and achieves the best result among single-model methods on the YouTube-VIS dataset. This is the first time a simpler and faster Transformer-based framework for video instance segmentation has been demonstrated with competitive accuracy.
The starting point
SOTA approaches typically develop complex pipelines to solve this task. Top-down methods follow the tracking-by-detection paradigm and rely heavily on image-level instance segmentation models and complex hand-designed rules to associate instances. Bottom-up methods separate object instances by clustering learned pixel embeddings. Because of their heavy reliance on dense prediction quality, these methods often require multiple steps to generate masks iteratively, which makes them slow. Therefore, there is a strong need for a simple, end-to-end trainable VIS framework.
Here, we take a closer look at the video instance segmentation task. Video frames contain much richer information than a single image, such as motion patterns and the temporal consistency of instances, providing useful clues for instance segmentation and classification. At the same time, better-learned instance features help track instances across frames. In essence, both instance segmentation and instance tracking are concerned with similarity learning: instance segmentation learns similarity at the pixel level, while instance tracking learns similarity between instances. It is therefore natural to address these two subtasks in a single framework so that they benefit each other. The goal here is to develop such an end-to-end VIS framework, one that is simple and powerful without bells and whistles.
The main contributions
- We propose a new Transformer-based video instance segmentation framework, called VisTR, which treats the VIS task as a direct end-to-end parallel sequence decoding/prediction problem. This framework is very different from existing methods and greatly simplifies the entire pipeline.
- VisTR solves VIS from a new perspective of similarity learning. Instance segmentation learns similarity at the pixel level, and instance tracking learns similarity between instances. As a result, instance tracking is realized seamlessly and naturally within the same instance segmentation framework.
- Key to the success of VisTR is a new strategy for instance sequence matching and segmentation, tailored to this framework. This carefully designed strategy enables supervising and segmenting instances as a whole at the sequence level.
- VisTR achieves strong results on the YouTube-VIS dataset, reaching 38.6% mask mAP at 57.7 FPS, which is the best and fastest among methods using a single model.
Methods
The overall VisTR architecture is shown in Figure 2. It consists of four main components: a CNN backbone that extracts compact feature representations of multiple frames, an encoder-decoder Transformer that models similarity among pixel-level and instance-level features, an instance sequence matching module that supervises the model at the sequence level, and an instance sequence segmentation module.
Transformer Encoder
The Transformer encoder is used to model the similarity among all pixel-level features in a clip. First, a 1×1 convolution is applied to the backbone feature map to reduce the channel dimension from C to d (d < C), generating a new feature map F1.
To form a clip-level feature sequence that can be fed into the Transformer encoder, the spatial and temporal dimensions of F1 are flattened into one dimension, yielding a 2D feature map of size d × (T·H·W). Note that the temporal order always matches that of the initial input. Each encoder layer has a standard architecture consisting of a multi-head self-attention module and a fully connected feed-forward network (FFN).
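A minimal sketch of how this clip-level feature sequence could be prepared, assuming a backbone output of shape (T, C, H, W) for a clip of T frames; the names (`input_proj`) and sizes (d = 384, T = 36) below are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

T, C, H, W = 36, 2048, 10, 18   # hypothetical clip length and backbone output size
d = 384                          # reduced channel dimension (d < C)

features = torch.randn(T, C, H, W)            # per-frame backbone features
input_proj = nn.Conv2d(C, d, kernel_size=1)   # 1x1 convolution: C -> d

f1 = input_proj(features)                     # (T, d, H, W)
# Flatten the temporal and spatial dimensions into one sequence dimension,
# keeping the original frame order, to obtain a d x (T*H*W) feature map.
seq = f1.permute(1, 0, 2, 3).flatten(1)       # (d, T*H*W)
print(seq.shape)                              # torch.Size([384, 6480])
```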
Transformer Decoder
The Transformer decoder is designed to decode, from the pixel-level features, the instance-level features that represent the instances in each frame. Inspired by DETR, a fixed number of learnable input embeddings, called instance queries, is introduced to query instance features from the pixel features.
Assuming the model decodes n instances per frame, the number of instance queries for T frames is N = n · T. Instance queries are learned by the model and have the same dimension as the pixel features. Taking the encoder output E and the N instance queries Q as input, the Transformer decoder outputs N instance features, denoted O in Figure 2.
The overall prediction follows the order of the input frames, and the order of instance predictions is the same across frames. Therefore, instances can be tracked across frames simply by linking the predictions with the corresponding index.
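A minimal sketch of querying N = n · T instance-level features from the encoder output with learned query embeddings, using standard PyTorch modules; shapes, the frame-major query ordering, and all names are assumptions for illustration, not the paper's exact code:

```python
import torch
import torch.nn as nn

d, T, n = 384, 36, 10
N = n * T                                         # total number of instance queries

encoder_output = torch.randn(T * 10 * 18, 1, d)   # E: (T*H*W, batch, d)
instance_queries = nn.Embedding(N, d)             # learned query embeddings Q

decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

queries = instance_queries.weight.unsqueeze(1)    # (N, batch, d)
O = decoder(tgt=queries, memory=encoder_output)   # N instance features
# Assuming frame-major query ordering, queries sharing the same instance index
# across frames can be linked directly for tracking.
O = O.view(T, n, 1, d)
```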
Instance Sequence Matching
The decoder outputs a fixed number of prediction sequences that are unordered with respect to the ground truth, with n instance sequences per clip (n predictions per frame). As in DETR, the Hungarian algorithm is used for matching.
Although the task is instance segmentation, bounding boxes are still predicted, as in object detection, to make the combinatorial-optimization matching tractable. The normalized center coordinates, width and height of each bounding box are predicted by an FFN, i.e. fully connected layers.
The class probabilities of each box are computed with a softmax, giving n · T bounding boxes in total. The class probability distributions and bounding boxes obtained above are then used to match the predicted instance sequences with the ground truth.
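A minimal sketch of this instance-sequence-level bipartite matching, assuming the matching cost combines class probabilities and an L1 box distance as described above; the shapes, equal cost weights, and variable names are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
from scipy.optimize import linear_sum_assignment

n, T, num_classes, num_gt = 10, 36, 41, 3

pred_logits = torch.randn(n, T, num_classes)     # class logits per query per frame
pred_boxes = torch.rand(n, T, 4)                 # normalized (cx, cy, w, h)
gt_labels = torch.randint(0, num_classes, (num_gt,))
gt_boxes = torch.rand(num_gt, T, 4)              # ground-truth box sequences

prob = pred_logits.softmax(-1).mean(1)           # (n, num_classes), averaged over frames
cost_class = -prob[:, gt_labels]                 # (n, num_gt)
cost_box = torch.cdist(pred_boxes.flatten(1),    # L1 distance over the whole sequence
                       gt_boxes.flatten(1), p=1)

cost = cost_class + cost_box                     # combined matching cost
row_ind, col_ind = linear_sum_assignment(cost.numpy())  # Hungarian algorithm
# row_ind[i] is the predicted instance sequence matched to ground truth col_ind[i].
```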
Finally, the Hungarian loss is computed over the matched pairs, taking into account both the class probability distributions and the bounding-box positions. The box losses follow the DETR design, combining an L1 loss and a generalized IoU loss. The overall training loss consists of the class (label) loss, the bounding-box loss, and the instance sequence (mask) loss.
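As a reference, a sketch of the DETR-style Hungarian training loss described above, assuming the standard notation: σ is the optimal assignment found by the Hungarian algorithm, and c_i, b_i, m_i are the ground-truth class, box sequence and mask sequence of the i-th instance:

$$
\mathcal{L}_{\text{Hung}}(y,\hat{y}) = \sum_{i=1}^{N}\Big[-\log \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\mathcal{L}_{\text{box}}\big(b_i,\hat{b}_{\sigma(i)}\big) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\mathcal{L}_{\text{mask}}\big(m_i,\hat{m}_{\sigma(i)}\big)\Big]
$$

where $\mathcal{L}_{\text{box}}$ combines the L1 loss and the generalized IoU loss.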
Conclusion
The following figure shows a visualization of VisTR on the YouTube-VIS validation dataset. Each row contains images sampled from the same video. VisTR tracks and segments challenging instances well, such as: (a) overlapping instances, (b) relative position changes between instances, (c) confusion caused by similar instances of the same class, and (d) instances in different poses.
Paper: End-to-end Video Instance Segmentation with Transformers
Get the paper: reply with the keyword "0005" in the CV Technical Guide public account.
Code: git.io/VisTR
This article comes from the paper sharing series of the public account CV Technical Guide.