Preface

This paper introduces an end-to-end Transformer model for visual tracking that captures global feature dependencies across spatial and temporal information in video sequences. It achieves SOTA performance on five challenging short- and long-term benchmarks while running in real time, up to six times faster than Siam R-CNN.

Learning Spatio-Temporal Transformer for Visual Tracking

Code: github.com/researchmm/…

Background

Convolution kernels are not good at modeling long-range correlations of image content and features because they only process local neighborhoods, whether in space or in time. Current popular trackers, including offline Siamese trackers and online learning models, are almost all built on convolution operations. As a result, these methods model local relationships of image content well but are limited in capturing long-range global interactions. This defect reduces a model's ability to handle scenarios where global context is important for locating the target, such as objects that undergo large-scale changes or move in and out of view frequently.

Both spatial and temporal information are important for target tracking. The former contains the appearance information used for target localization, while the latter captures changes in the target's state across frames. Previous Siamese trackers use only spatial information for tracking, while online approaches use historical predictions for model updates. Although these methods are successful, they do not explicitly model the relationship between space and time.

Contribution

Inspired by the recent detection Transformer (DETR), this paper proposes a new end-to-end tracking architecture that uses an encoder-decoder Transformer to improve on traditional convolutional models.

The new architecture consists of three key components: an encoder, a decoder, and a prediction head.

1. The encoder takes as input the initial target template, the current search image, and a dynamically updated template. The self-attention modules in the encoder learn the relationships among these inputs through their feature dependencies. Because the template is updated throughout the video sequence, the encoder can capture both spatial and temporal information about the target.

2. The decoder learns a query embedding to predict the spatial location of the target object.

3. A corner-based prediction head estimates the bounding box of the target object in the current frame, and a learned score head controls the update of the dynamic template image.

In summary, this work makes three contributions.

1. It proposes a new Transformer architecture dedicated to visual tracking that can capture global feature dependencies of spatial and temporal information in video sequences, and it introduces dynamically updated templates.

2. The whole method is end-to-end and requires no post-processing steps such as cosine windows or bounding-box smoothing, which greatly simplifies existing tracking pipelines.

3. The proposed tracker achieves SOTA performance on five challenging short- and long-term benchmarks while running at real-time speed.

Methods

This paper proposes a spatio-temporal Transformer network for visual tracking called STARK. It starts from a simple baseline that directly uses an encoder-decoder Transformer for tracking and considers only spatial information. The paper then extends this baseline to learn both spatial and temporal representations for target localization, introducing a dynamic template and an update controller to capture changes in the appearance of the target object.

The Baseline method

Figure 2 shows the baseline approach.

The baseline is mainly composed of three parts: a convolutional backbone, an encoder-decoder Transformer, and a bounding-box prediction head.

The input images are first processed by the CNN backbone; the resulting feature maps are flattened and concatenated into a feature sequence, to which sinusoidal position embeddings are added before it is fed into the Transformer encoder. A single query vector is randomly initialized, and the decoder takes this target query together with the enhanced feature sequence from the encoder as input. Unlike DETR, which uses 100 object queries, only one query is fed into the decoder to predict one bounding box for the target object. In addition, since there is only one prediction, the Hungarian algorithm that DETR uses for prediction association is removed. The target query can attend to all positions of the template and search-region features to learn a robust representation for the final bounding-box prediction.
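The following is a minimal PyTorch sketch (not the official STARK implementation) of this baseline flow: backbone features of the template and search images are flattened, concatenated, given sinusoidal position embeddings, and passed through a standard encoder-decoder Transformer with a single learnable target query. The stand-in backbone, layer counts, and feature dimensions are assumptions chosen only for illustration.

```python
# Minimal sketch of the baseline pipeline described above (assumed shapes and modules).
import torch
import torch.nn as nn


class BaselineSketch(nn.Module):
    def __init__(self, dim=256, heads=8, enc_layers=6, dec_layers=6):
        super().__init__()
        # A real implementation would use a ResNet-style backbone; a single strided
        # convolution stands in here just to produce a feature map.
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), enc_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), dec_layers)
        # Unlike DETR's 100 object queries, a single target query is used.
        self.target_query = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, template, search):
        seq = []
        for img in (template, search):
            f = self.backbone(img)                      # (B, C, H, W)
            b, c, h, w = f.shape
            f = f.flatten(2).permute(0, 2, 1)           # flatten to (B, HW, C)
            seq.append(f + self.pos_embed(h * w, c, f.device))
        memory = self.encoder(torch.cat(seq, dim=1))    # joint template+search sequence
        query = self.target_query.expand(memory.size(0), -1, -1)
        return self.decoder(query, memory), memory      # target embedding + encoder memory

    @staticmethod
    def pos_embed(n, dim, device):
        # Standard 1-D sinusoidal position embedding (simplified for the sketch).
        pos = torch.arange(n, device=device).unsqueeze(1)
        i = torch.arange(0, dim, 2, device=device)
        angles = pos / (10000 ** (i / dim))
        emb = torch.zeros(n, dim, device=device)
        emb[:, 0::2] = torch.sin(angles)
        emb[:, 1::2] = torch.cos(angles)
        return emb
```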

DETR uses a three-layer perceptron to predict target coordinates. However, as GFLoss (Generalized Focal Loss) points out, directly regressing coordinates is equivalent to fitting a Dirac delta distribution, which does not account for ambiguity and uncertainty in the dataset. Such a representation is not flexible and is not robust to challenges such as occlusion and clutter in target tracking.

To improve the quality of box estimation, a new prediction head is designed that estimates the probability distributions of the box corners. As shown in Figure 3, the search-region features are first extracted from the output sequence of the encoder, and the similarity between these features and the output embedding of the decoder is computed. The re-weighted feature sequence is then reshaped back into a 3D feature map and passed through an L-layer Conv-BN-ReLU fully convolutional network, which outputs two probability maps: one for the top-left corner and one for the bottom-right corner of the bounding box. Unlike DETR's direct coordinate regression, the box is derived from these corner probability distributions.
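Below is a hedged sketch of such a corner-based head. The similarity re-weighting, the L-layer Conv-BN-ReLU stack, and the soft-argmax read-out of the corner probability maps follow the description above; the feature-map size, the number of layers, and the expectation-based corner computation are assumptions made for illustration rather than the paper's exact design.

```python
# Sketch of a corner-based prediction head (assumed sizes; not the official code).
import torch
import torch.nn as nn


class CornerHeadSketch(nn.Module):
    def __init__(self, dim=256, feat_size=20, num_layers=3):
        super().__init__()
        self.feat_size = feat_size

        def fcn():
            layers, c = [], dim
            for _ in range(num_layers):
                layers += [nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU()]
            layers += [nn.Conv2d(c, 1, 1)]          # 1-channel corner logit map
            return nn.Sequential(*layers)

        self.tl_head = fcn()                        # top-left corner branch
        self.br_head = fcn()                        # bottom-right corner branch

    def forward(self, search_feats, target_embed):
        # search_feats: (B, HW, C) search-region features from the encoder
        # target_embed: (B, 1, C) decoder output for the single target query
        b, hw, c = search_feats.shape
        sim = torch.softmax((search_feats @ target_embed.transpose(1, 2)) / c ** 0.5, dim=1)
        feats = search_feats * sim                  # re-weight features by similarity
        feats = feats.permute(0, 2, 1).reshape(b, c, self.feat_size, self.feat_size)
        coords = []
        for head in (self.tl_head, self.br_head):
            prob = torch.softmax(head(feats).flatten(1), dim=1)   # corner probability map
            prob = prob.reshape(b, self.feat_size, self.feat_size)
            grid = torch.arange(self.feat_size, device=prob.device, dtype=prob.dtype)
            # Expected (soft-argmax) corner location instead of a single regressed point.
            x = (prob.sum(dim=1) * grid).sum(dim=1)
            y = (prob.sum(dim=2) * grid).sum(dim=1)
            coords.append(torch.stack([x, y], dim=1))
        return torch.cat(coords, dim=1)             # (B, 4): (x_tl, y_tl, x_br, y_br)
```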

The proposed method

The figure shows the spatio-temporal tracking framework proposed in this paper; pink highlights the differences from the purely spatial architecture.

Unlike the baseline, which uses only the initial and current frames, the spatio-temporal approach introduces a dynamically updated template sampled from intermediate frames as an additional input (the key addition of the paper), as shown in the figure. Beyond the spatial information of the initial template, the dynamic template captures changes in the target's appearance over time, providing additional temporal information. The feature maps of this triplet are flattened and concatenated, then fed to the encoder, which extracts discriminative spatio-temporal features by modeling global relationships among all elements along both the spatial and temporal dimensions.
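A minimal sketch of how this triplet input could be assembled, reusing the backbone, position embedding, and encoder from the baseline sketch above (all names are placeholders, not the official API):

```python
# Sketch of building the spatio-temporal encoder input from the triplet.
import torch


def encode_triplet(backbone, pos_embed, encoder, init_template, dyn_template, search):
    seq = []
    for img in (init_template, dyn_template, search):
        f = backbone(img)                        # (B, C, H, W)
        b, c, h, w = f.shape
        f = f.flatten(2).permute(0, 2, 1)        # (B, HW, C)
        seq.append(f + pos_embed(h * w, c, f.device))
    # Self-attention over the joint sequence mixes spatial and temporal information.
    return encoder(torch.cat(seq, dim=1))
```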

There are cases in which the dynamic template should not be updated during tracking. For example, the cropped template is unreliable when the target is completely occluded or moves out of view, or when the tracker drifts. For simplicity, the dynamic template is considered suitable for updating as long as the search region contains the target. To automatically determine whether the current state is reliable, the paper adds a simple score prediction head, a three-layer perceptron followed by a sigmoid activation. The current state is regarded as reliable if the score is higher than a threshold τ.
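A sketch of such a score head is shown below; the hidden width and the example threshold are illustrative assumptions, not values taken from the paper.

```python
# Sketch of the confidence score head: a three-layer MLP followed by a sigmoid.
import torch
import torch.nn as nn


class ScoreHeadSketch(nn.Module):
    def __init__(self, dim=256, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, target_embed):
        # target_embed: (B, 1, C) decoder output for the single target query
        return torch.sigmoid(self.mlp(target_embed.squeeze(1)))  # (B, 1) confidence in [0, 1]


def should_update(confidence: torch.Tensor, tau: float = 0.5) -> bool:
    # tau is a hyper-parameter; 0.5 here is only an illustrative default.
    return bool(confidence.item() > tau)
```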

Training and inference

As recent work has pointed out, jointly learning localization and classification can lead to suboptimal solutions for both tasks, so it helps to decouple them. Therefore, the training process is divided into two stages, with localization as the primary task and classification as the secondary task.

Specifically, in the first stage, the entire network except the score head is trained end-to-end using only localization-related losses. At this stage, all search images are guaranteed to contain the target object, so the model learns localization ability. In the second stage, only a binary cross-entropy loss, L_ce = −[y·log(P) + (1 − y)·log(1 − P)], where y is the ground-truth label and P is the predicted confidence, is used to optimize the score head.

All other parameters are frozen so that the localization ability is not affected. In this way, after the two training stages, the final model has learned both localization and classification abilities.
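The second stage could look roughly like the following sketch, in which all localization-related parameters are frozen and only the score head is optimized with binary cross-entropy. The model interface and data fields are assumed names, not the official training code.

```python
# Sketch of the second training stage: freeze the tracker, train only the score head.
import torch
import torch.nn as nn


def train_stage_two(model, score_head, loader, epochs=1, lr=1e-4):
    for p in model.parameters():
        p.requires_grad = False                     # freeze localization-related weights
    optimizer = torch.optim.Adam(score_head.parameters(), lr=lr)
    bce = nn.BCELoss()
    for _ in range(epochs):
        for templates, search, label in loader:     # label: 1 if the target is present, else 0
            with torch.no_grad():
                target_embed, _ = model(templates, search)   # reuse the frozen tracker
            conf = score_head(target_embed)
            loss = bce(conf.squeeze(1), label.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```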

During inference, the two templates and their corresponding features are initialized from the first frame. For each subsequent frame, a search region is cropped and fed into the network, producing a bounding box and a confidence score. The dynamic template is updated only when the update interval is reached and the confidence score is higher than the threshold τ. For efficiency, the update interval is set to Tu frames. The new template is cropped from the original image and fed into the backbone for feature extraction.
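Putting the pieces together, the inference loop described above might be sketched as follows. The cropping helpers, the model's return signature, and the default values of the update interval and threshold are all illustrative assumptions.

```python
# Schematic inference loop: refresh the dynamic template only at the update interval
# and only when the predicted confidence exceeds tau. `crop_template` and
# `crop_search_region` are hypothetical helpers; `model` is assumed to return the
# predicted box and the target embedding.
def track(model, score_head, frames, init_box, update_interval=200, tau=0.5):
    init_template = crop_template(frames[0], init_box)
    dyn_template = init_template                    # both templates start from frame 1
    box, results = init_box, []
    for t, frame in enumerate(frames[1:], start=1):
        search = crop_search_region(frame, box)     # crop around the previous prediction
        box, target_embed = model(init_template, dyn_template, search)
        conf = score_head(target_embed)
        if t % update_interval == 0 and float(conf) > tau:
            dyn_template = crop_template(frame, box)  # reliable state: refresh the template
        results.append(box)
    return results
```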

Conclusion

Compared with previous long-term trackers, the framework of the proposed method is much simpler. Previous approaches typically consist of multiple components, such as a base tracker, a target verification module, and a global detector. In contrast, the proposed approach is a single network learned in an end-to-end manner. Extensive experiments show that the proposed method establishes new SOTA performance on both short-term and long-term tracking benchmarks.

For example, the paper's spatio-temporal Transformer tracker outperforms Siam R-CNN by 3.9% (AO score) on GOT-10k and by 2.3% (Success) on LaSOT. In addition, the tracker runs in real time and is about six times faster than Siam R-CNN on a Tesla V100 GPU (30 fps vs. 5 fps), as shown in the figure.

Comparison with SOTA on LaSOT: Success is plotted against tracking speed in frames per second (FPS).

Comparison with other SOTA methods on multiple datasets.

Speed, computation and parameters
