Overview

A new MOT method recently proposed by Zhejiang University and Alibaba's DAMO Academy currently ranks among the top on several datasets commonly used in the MOT Challenge. The paper's title, Tracklets Predicting Based Adaptive Graph Tracking, already names its two biggest innovations: feature extraction based on trajectory prediction, and feature aggregation based on an adaptive graph network. Most existing multi-object tracking methods link the detections of the current frame to historical tracklets using a linear combination of the cosine distance between appearance features and the IoU of target bounding boxes as the matching metric. This has two problems: first, the features extracted for the same target in two different frames (the current frame and the previous frame) tend to be inconsistent; second, it is unreasonable for feature extraction to consider only appearance while ignoring positional relations and tracklet information.

Therefore, this paper proposes a new high-accuracy end-to-end multi-object tracking framework, TPAGT (an earlier version was called FGAGT; presumably the authors felt TPAGT better describes the work). The method solves both problems above and achieves a new SOTA on multiple datasets.

  • Tracklets Predicting Based Adaptive Graph Tracking
  • Paper address: arxiv.org/abs/2010.09…
  • Source code: not yet open-sourced

Introduction

First of all, TPAGT follows the common two-stage MOT paradigm: detection is performed first, then target features are extracted at the detected locations, and finally an association algorithm (usually the Hungarian algorithm) produces the results. Single-stage methods integrate detection and feature extraction, trading accuracy for speed, so their accuracy is lower than that of two-stage methods. As a two-stage method, TPAGT's accuracy gains come from its innovations, but the corresponding speed is relatively slow; the paper does not report a concrete inference speed, so we can only wait for the code release to test it.

Let's start with a few issues that current MOT approaches don't address.

  1. Feature inconsistency: the features of a target on a tracklet come from the previous frame, not the current frame (this is easy to understand: a tracklet's features can only be extracted where the target was last detected). However, during motion the target's pose, the lighting, and the viewing angle may all change, so even when detection is accurate, features of the same target extracted from different images are inconsistent. Such inconsistency has a strongly negative impact on data association.
  2. Since DeepSORT, feature extractors have focused mainly on appearance information. Many MOT methods dropped motion modeling, and the feature-extraction branch became a ReID branch, since re-identification models concentrate on appearance. However, the positional relations between targets and the history carried by tracklets are also important for MOT.
  3. Sample imbalance: a tracklet can match at most one detection box; that matched pair is a continuing positive example, while every unmatched tracklet-detection pair is a negative example. Obviously, positives are far fewer than negatives, and the imbalance across sample types is further aggravated by the appearance of a small number of new targets and the disappearance of old ones.

TPAGT tackles these problems one by one. The main one is that tracklet features are inconsistent with the current frame. How to solve it? Since a target is not stationary in the image, using its position at the previous moment is unreasonable; instead, motion estimation is applied to the previous frame to predict each target's bbox in the current frame, and features are re-extracted there. Next is feature fusion: since the relations between objects naturally form a graph, the author uses a GNN (graph neural network) for information aggregation, with adaptively learned edge weights so as to better capture global spatio-temporal information. Finally, a Balanced MSE Loss, essentially a weighted MSE, addresses the sample imbalance.

Framework design

Tracklets predicting based feature re-extracting

The figure above shows the overall framework; let me first walk through the network pipeline. The input consists of the current frame image, the current-frame detection results, and the historical-frame detection results. The image is fed into the backbone to obtain a feature map (ResNet101+FPN works best here), and the bboxes (the detected bboxes for the current frame, and the optical-flow-predicted bboxes for the previous frame) are mapped onto this feature map; region appearance features are extracted via RoI Align and passed through a fully connected layer (this is similar to proposal feature extraction in Faster R-CNN; see my earlier blog if unfamiliar). Then, combining the position information of the current frame with the historical-frame information, the graph network adaptively learns the feature fusion and computes the similarity. Given the similarity matrix, the Hungarian algorithm yields the matching result.

A possible misunderstanding of the above: it may look as though features are re-extracted from the current feature map for the predicted bboxes of every past frame. In fact, no. For one thing, the features of frame t-2 were already re-extracted when frame t-1 was processed, and extracting them again on the current frame with the old bboxes would suffer severe misalignment; for another, it would greatly increase the computational cost of the network, which is completely unnecessary. The figure in the paper is slightly misleading on this point, so it can be examined carefully once the code is released.

Previous MOT motion modeling is mainly represented by Kalman-filter state estimation, optical flow, and displacement prediction. This paper uses sparse optical flow on box centers to predict bbox motion. Because targets sometimes move at high speed, an appropriate optical flow method must be chosen to handle such motion; the paper uses pyramidal optical flow, which is robust to large displacements (see my earlier blog for details). The figure below shows the target positions in the current frame predicted by pyramidal optical flow (Figure b), while Figure c shows the GT boxes.
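As a minimal sketch of this step, assuming the per-center displacements have already been estimated by some pyramidal sparse optical flow (e.g. OpenCV's `cv2.calcOpticalFlowPyrLK`); the helper name `predict_boxes` is mine, not the paper's:

```python
import numpy as np

def predict_boxes(boxes, displacements):
    """Shift last-frame boxes (x1, y1, x2, y2) into the current frame by
    translating each box with the optical-flow displacement (dx, dy)
    estimated at its center point."""
    boxes = np.asarray(boxes, dtype=float)      # (K, 4)
    d = np.asarray(displacements, dtype=float)  # (K, 2)
    # Apply (dx, dy) to both corners, i.e. add (dx, dy, dx, dy) per box.
    return boxes + np.tile(d, (1, 2))

# Example: a box whose center moved 5 px right and 2 px up.
print(predict_boxes([[0, 0, 10, 10]], [[5, -2]]).tolist())
```

Features are then re-extracted at these predicted positions on the current feature map, which is exactly what removes the feature inconsistency described earlier.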

Adapted Graph Neural Network

Now for the adaptive graph neural network. Treating tracklets and detections as a bipartite graph is not new, but TPAGT is one of the few works to aggregate features over it; before this, motion and appearance features were fused only by hand-designed rules, so adaptive feature aggregation via a graph network is quite an advanced idea. Each detected target and each tracklet is a node. As shown in the figure above, there are no edges among the detections themselves or among the tracklets themselves, but every tracklet is connected to every detection. The graph network learns a state embedding h_v for each node, i.e., a feature vector that aggregates information from the other nodes, so that h_v ultimately contains information about its neighbor nodes.

The state embeddings to be learned are updated by the formula below: the first line is the node update for detections, the second for tracklets. There are M detections and N tracklets in total. Some notation for the first line (the second is analogous): f denotes a neural network operation and can be understood as a learned fitting function; h_{d,c}^{i} is the state embedding of the i-th detection at layer c. At initialization (c = 0), h_{d,0}^{i} = f_d^{i} and h_{t,0}^{j} = f_t^{j}, i.e., the extracted feature vectors themselves; e_{d,c}^{i,j} is the edge weight between the i-th detection and the j-th tracklet at layer c. The author uses only a single-layer GNN with the adaptive mechanism added, so the single-layer learning is described in detail below.


\begin{aligned} h_{d, c+1}^{i} &=f\left(h_{d, c}^{i},\left\{h_{t, c}^{j}, e_{d, c}^{i, j}\right\}_{j=1}^{N}\right), \quad i=1,2, \cdots, M \\ h_{t, c+1}^{j} &=f\left(h_{t, c}^{j},\left\{h_{d, c}^{i}, e_{t, c}^{j, i}\right\}_{i=1}^{M}\right), \quad j=1,2, \cdots, N \end{aligned}

First, the edge weights are not randomly initialized; instead, prior information from the features and positions of the nodes is used, mainly by computing a normalized distance similarity between the feature vectors of the nodes. The detailed steps of graph information aggregation are as follows.

  1. Calculate the initial similarity

    \begin{array}{c} s_{i, j}=\frac{1}{\left\|f_{d}^{i}-f_{t}^{j}\right\|_{2}+1 \times 10^{-16}} \\ s_{i, j}=\frac{s_{i, j}}{\sqrt{s_{i, 1}^{2}+s_{i, 2}^{2}+\cdots+s_{i, j}^{2}+\cdots+s_{i, N}^{2}}} \\ \mathbf{S}_{\mathrm{ft}}=\left[s_{i, j}\right]_{M \times N}, \quad i=1, \cdots, M, \quad j=1, \cdots, N \end{array}
  2. Compose the edge weights from the IoU and the initial similarity above (w is learnable, indicating the relative importance of position versus appearance information)

\mathrm{E}=w \times \mathrm{IOU}+(1-w) \times \mathbf{S}_{\mathrm{ft}}
  3. Aggregate node features according to the adaptive edge weights above (⊙ denotes element-wise product)

\mathbf{F}_{\mathrm{t}}^{\mathrm{ag}}=\mathbf{E} \mathbf{F}_{t}=\mathbf{E}\left[f_{t}^{1}, f_{t}^{2}, \cdots, f_{t}^{N}\right]^{T}

\mathbf{H}_{\mathrm{d}}=\sigma\left(\mathbf{F}_{d} W_{1}+\operatorname{Sigmoid}\left(\mathbf{F}_{d} W_{a}\right) \odot \mathbf{F}_{\mathrm{t}}^{\mathrm{ag}} W_{2}\right)

\mathbf{H}_{\mathrm{t}}=\sigma\left(\mathbf{F}_{t} W_{1}+\operatorname{Sigmoid}\left(\mathbf{F}_{t} W_{a}\right) \odot \mathbf{F}_{\mathrm{d}}^{\mathrm{ag}} W_{2}\right)
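The three steps above can be sketched in numpy as follows. The shapes, the random inputs, and the value of the learnable scalar w are illustrative assumptions, and σ is taken to be ReLU, which the formulas do not pin down:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, D = 3, 4, 8                      # M detections, N tracklets, feature dim D
F_d, F_t = rng.normal(size=(M, D)), rng.normal(size=(N, D))
iou = rng.uniform(size=(M, N))         # stand-in for the detection/tracklet IoU matrix

# Step 1: initial similarity = inverse feature distance, row-normalized.
dist = np.linalg.norm(F_d[:, None, :] - F_t[None, :, :], axis=-1)
S_ft = 1.0 / (dist + 1e-16)
S_ft /= np.linalg.norm(S_ft, axis=1, keepdims=True)

# Step 2: edge weights mix position (IoU) and appearance (S_ft); w is learnable.
w = 0.5
E = w * iou + (1 - w) * S_ft           # (M, N)

# Step 3: aggregate neighbor features, then a gated update on each side.
W1, W2, Wa = (rng.normal(size=(D, D)) for _ in range(3))
F_t_ag = E @ F_t                       # tracklet info gathered at the detections
F_d_ag = E.T @ F_d                     # detection info gathered at the tracklets
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
relu = lambda x: np.maximum(x, 0.0)
H_d = relu(F_d @ W1 + sigmoid(F_d @ Wa) * (F_t_ag @ W2))   # (M, D)
H_t = relu(F_t @ W1 + sigmoid(F_t @ Wa) * (F_d_ag @ W2))   # (N, D)
```

The sigmoid term acts as a per-feature gate deciding how much aggregated neighbor information to mix into each node's own features.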

Existing graph tracking methods require an additional fully connected layer to reduce the dimensionality of the feature vectors and then compute similarity via Euclidean distance. TPAGT instead simply normalizes the features from the single-hidden-layer graph network and multiplies them to obtain the final similarity matrix, as shown below. Its values lie between 0 and 1; the larger the value, the more similar the two targets. The training objective drives feature vectors of the same target to be as close as possible and those of different targets to be as orthogonal as possible, which is equivalent in spirit to a triplet loss, but simpler.


h_{d}^{i}=\frac{h_{d}^{i}}{\left\|h_{d}^{i}\right\|_{2}}, \quad h_{t}^{j}=\frac{h_{t}^{j}}{\left\|h_{t}^{j}\right\|_{2}}, \quad \mathbf{S}_{\mathrm{out}}=\mathbf{H}_{\mathrm{d}} \mathbf{H}_{\mathrm{t}}^{\mathrm{T}}
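This final step is just a cosine-style product of the row-normalized embeddings; a small sketch (the helper name is mine):

```python
import numpy as np

def similarity(H_d, H_t):
    """Row-normalize both embedding matrices, then S_out = H_d @ H_t.T.
    With non-negative (ReLU) features, every entry lands in [0, 1]."""
    H_d = H_d / np.linalg.norm(H_d, axis=1, keepdims=True)
    H_t = H_t / np.linalg.norm(H_t, axis=1, keepdims=True)
    return H_d @ H_t.T

# Identical unit features give similarity 1, orthogonal ones give 0.
print(similarity(np.eye(2), np.eye(2)))
```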

Balanced MSE Loss

After obtaining the final similarity matrix, supervised training can be carried out: the GT label is 1 for the same target and 0 for different targets. The figure below is the author's visualization, where each row represents a detection and each column a tracklet. A green row indicates a detection that matched no tracklet, i.e., a new target; conversely, red columns represent targets that have disappeared. With 1 marking positive pairs and 0 negative pairs, there is obviously a severe imbalance between positives and negatives. Therefore, the MSE is weighted according to the pair type (the weights are hyperparameters), as in the formula below.


\begin{aligned} \mathcal{L} &=\alpha E_{c 0}+\beta E_{c 1}+\gamma E_{n e}+\delta E_{d}+\varepsilon E_{w} \\ &=\sum_{i=1}^{M} \sum_{j=1}^{N}\left[\begin{array}{c} \alpha\left(\hat{S}_{i, j}-S_{i, j}\right)^{2} \cdot \mathbb{I}_{\text {continue }} \cdot \mathbb{I}_{S_{i, j}=0}+\beta\left(\hat{S}_{i, j}-S_{i, j}\right)^{2} \cdot \mathbb{I}_{\text {continue }} \cdot \mathbb{I}_{S_{i, j}=1} \\ +\gamma\left(\hat{S}_{i, j}-S_{i, j}\right)^{2} \cdot \mathbb{I}_{n e w}+\delta\left(\hat{S}_{i, j}-S_{i, j}\right)^{2} \cdot \mathbb{I}_{\text {disap }}+\varepsilon\|W\|_{2}^{2} \end{array}\right] \end{aligned}
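A numpy sketch of this loss (the ε‖W‖² weight-decay term is omitted since it only involves network parameters; the mask construction and argument names are my illustration):

```python
import numpy as np

def balanced_mse(S_pred, S_gt, new_rows, disap_cols, a=1.0, b=10.0, g=5.0, d=5.0):
    """Weighted MSE over the similarity matrix.

    new_rows:   indices of detections with no matching tracklet (new targets)
    disap_cols: indices of tracklets with no matching detection (disappeared)
    The remaining ("continuing") pairs split into negatives (gt=0, weight a)
    and positives (gt=1, weight b); weights are hyperparameters."""
    M, N = S_gt.shape
    err = (S_pred - S_gt) ** 2
    is_new = np.zeros((M, N), dtype=bool); is_new[new_rows, :] = True
    is_disap = np.zeros((M, N), dtype=bool); is_disap[:, disap_cols] = True
    cont = ~(is_new | is_disap)
    weight = (a * (cont & (S_gt == 0)) + b * (cont & (S_gt == 1))
              + g * is_new + d * is_disap)
    return float((weight * err).sum())
```

Setting b, g, d larger than a up-weights the rare positive, new-target, and disappearing-target pairs against the abundant continuing negatives.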

Inference design

At inference time we obtain a similarity matrix; how do we use it? Suppose there are M detections and N tracklets, so the matrix is M×N. An augmented M×M matrix, each entry being a threshold value, is appended on the right, as shown in the figure below; the Hungarian algorithm thus becomes a matching method with filtering. Since no similarity in rows 3 and 8 below exceeds the threshold (0.2), those detections become new targets.
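This filtered matching can be sketched with scipy's Hungarian solver (the threshold 0.2 follows the text; the helper name is mine):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_with_threshold(S_out, thresh=0.2):
    """Append an M x M block filled with the threshold, then run the
    Hungarian algorithm maximizing total similarity. A detection assigned
    to an augmented column beat no tracklet's threshold -> new target."""
    M, N = S_out.shape
    augmented = np.hstack([S_out, np.full((M, M), thresh)])
    rows, cols = linear_sum_assignment(augmented, maximize=True)
    matches = [(i, j) for i, j in zip(rows, cols) if j < N]
    new_targets = [i for i, j in zip(rows, cols) if j >= N]
    return matches, new_targets

S = np.array([[0.90, 0.10],
              [0.10, 0.05],
              [0.05, 0.80]])
print(match_with_threshold(S))  # detection 1 exceeds no threshold -> new target
```

Tracklets left unmatched after this step are the disappeared targets.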

Experiments and analysis

The detection part uses FairMOT's detection results, i.e., CenterNet as the detector. For the feature extraction part, the paper uses ResNet101-FPN as the backbone, pre-trained on COCO and then fine-tuned for 30 epochs on MOT datasets. For other training details, please refer to the paper; I will not go into them here. The authors evaluated on both the public and private tracks, with results as follows.

In addition, abundant ablation experiments were performed to demonstrate the robustness of TPAGT.

Conclusion

The paper proposes a feature re-extraction strategy and introduces an AGNN for feature fusion, yielding the TPAGT framework: an end-to-end learnable framework that directly outputs the similarity matrix. It achieves SOTA performance on both tracks of the MOT Challenge.

References

[1] Shan C, Wei C, Deng B, et al. Tracklets Predicting Based Adaptive Graph Tracking[J]. arXiv:2010.09015 [cs], 2020.