This ICCV paper is said to be the first to introduce meta-learning into target tracking: it keeps the Siamese-network architecture, but applies the idea of meta-learning when updating the model online.
Motivation
- the number of positive samples is highly limited
- overfitting
During tracking, the model needs to be updated when the appearance of the target changes. These update operations typically rely on stochastic gradient descent (SGD), Lagrange multipliers, ridge regression, and similar methods, which are inefficient, usually running below 20 FPS, and so cannot meet real-time demands. In addition, updates are usually made with only a handful of target appearance templates collected in the course of tracking; because positive samples are scarce, the model easily overfits and loses its generalization ability.
Introduction
Based on the above background and motivation, this paper proposes an end-to-end visual tracking network, which mainly consists of two parts:
- Siamese matching network for target search
- meta-learning network for adaptive feature space.
On top of the SiamFC tracking algorithm, referred to as the matching network, a meta-learner network is added to dynamically generate part of the matching network's parameters at run time. With the meta-learner network, the matching network can adapt to changes in target appearance, and only a forward pass is needed to compute the dynamically added parameters, so the method stays fast, reaching 48 fps. The overall process is shown in the figure below: the trained meta-learner network provides additional convolution kernels and channel-wise attention information to the matching network (here the SiamFC network). In this way, the feature space can be modified adaptively, based on new appearance templates obtained in the course of tracking, without overfitting.
Meta-learning
To put it simply, after learning many tasks, people can quickly adapt to a new task with the help of previously learned knowledge and only a small number of new samples, just as someone who plays League of Legends well can quickly pick up Honor of Kings. The concept of meta-learning was in fact proposed long ago, but only recently has it been applied to deep learning and reinforcement learning with some success, either improving performance or improving training efficiency. The current understanding of meta-learning can be summarized as "learning to learn". In other words, it learns hyperparameters, where a hyperparameter is understood as the part of algorithm design that is not driven by data: it can be the model structure or the training parameters, and it can also modify the model structure dynamically depending on the situation. Meta-learning, in short, is about learning to replace what people do by hand when designing algorithms. In this paper, the idea of meta-learning is mainly applied to updating the weights of the last convolution layer (conv5): a weight is generated that adapts to the existing tracking templates, and the original weight is then combined with the weight produced by the meta-learner network to form the final adaptive weight.
Here I would like to share some meta-learning background from Hung-yi Lee's machine learning lectures: when training a meta-learning network, we do not pursue the optimum of a single task, but look for a global solution from which every task can quickly reach its own optimum, as illustrated by the two figures from the lecture.
SiameseFc
The SiamFC tracking algorithm neither updates the model nor maintains templates; instead, it uses two fully convolutional CNNs to form a Siamese network, extracts convolutional features, correlates them, and generates a heatmap to predict the target location, as shown in the figure below. Of the two inputs, one is the target template from the first frame, and the other is a larger area around the target (usually set to 4x the target size, i.e. the search region). The core computation is the cross-correlation of the two feature maps, f_w(x, z) = ϕ_w(x) ∗ ϕ_w(z).
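To make the cross-correlation step concrete, here is a minimal PyTorch sketch (not the authors' code): the template features act as a convolution kernel slid over the search-region features. The feature shapes are illustrative values chosen so the output is the 17×17 response map, and the channel count of 192 follows the conv5 size given later.

```python
import torch
import torch.nn.functional as F

def cross_correlation(feat_x, feat_z):
    """Correlate template features with search-region features.

    feat_x: template features of shape (1, C, Hx, Wx)
    feat_z: search-region features of shape (1, C, Hz, Wz), Hz > Hx
    Returns a response map of shape (1, 1, Hz-Hx+1, Wz-Wx+1).
    """
    # Using the template features as a convolution kernel implements
    # the sliding-window similarity of SiamFC.
    return F.conv2d(feat_z, feat_x)

# Toy example with random features (spatial sizes are assumptions).
feat_x = torch.randn(1, 192, 8, 8)    # template features
feat_z = torch.randn(1, 192, 24, 24)  # search-region features
response = cross_correlation(feat_x, feat_z)
print(response.shape)  # torch.Size([1, 1, 17, 17])
```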
The loss function of the network is as follows, where y[u] is the ground-truth label, +1 inside the target box and -1 outside it:
ℓ(y, v) = (1/|D|) Σ_{u∈D} log(1 + exp(−y[u]·v[u]))
where v[u] is the predicted response at position u and D is the set of positions in the response map.
Ok, let’s get into the main content!
Algorithm
The network structure of the paper is as follows. The blue part is the SiamFC structure. The paper uses the conv5 features to compute a loss, feeds the gradient to the meta-learner network to produce a weight and an attention, concatenates this weight with the original conv5 weight, uses the new weight to compute the final features, and finally cross-correlates the two feature maps.
Components
First of all, let's look at the blue part, which is essentially a SiamFC network: a CNN with 5 convolutional layers, with 2 pooling layers of kernel size 3 and stride 2 applied after the first two convolutional layers, and a Batch Normalization layer inserted after each convolutional layer. The kernel sizes and input/output channels of each layer are W1: 11×11×3×128, W2: 5×5×128×256, W3: 3×3×256×384, W4: 3×3×384×256, and W5: 1×1×256×192. For the input, an RGB image of size 127×127×3 is used for x and an RGB image of size 255×255×3 is used for z, and the matching network generates a response map of size 17×17. It is essentially convolutional feature extraction followed by cross-correlation:
f_w(x, z) = ϕ_w(x) ∗ ϕ_w(z)
where x is the template, z is the search region, w = {w1, w2, …, wN} is the set of weights of each layer, and ϕ_w(⋅) denotes the features extracted by the N-layer feature extractor when the weights of the entire network are w.
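As a rough sketch of this matching network with the layer sizes listed above, assuming SiamFC-style strides and ReLU placement (the strides and activations are my assumptions; the paper only fixes the kernel/channel sizes, the two pooling layers, and BatchNorm after each convolution):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingNetwork(nn.Module):
    """Matching-network sketch with the kernel sizes listed above.

    Stride 2 on conv1 and the lack of padding follow the SiamFC design;
    they are assumptions, not details stated in the text above.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, 11, stride=2), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(128, 256, 5), nn.BatchNorm2d(256), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, 3), nn.BatchNorm2d(384), nn.ReLU(),
            nn.Conv2d(384, 256, 3), nn.BatchNorm2d(256), nn.ReLU(),
            nn.Conv2d(256, 192, 1), nn.BatchNorm2d(192),   # conv5 (w5)
        )

    def forward(self, x, z):
        # x: 127x127x3 template, z: 255x255x3 search region
        fx, fz = self.features(x), self.features(z)
        return F.conv2d(fz, fx)   # cross-correlation -> 17x17 response map

net = MatchingNetwork()
resp = net(torch.randn(1, 3, 127, 127), torch.randn(1, 3, 255, 255))
print(resp.shape)  # torch.Size([1, 1, 17, 17])
```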
The meta-learner network takes M context patches z = {z1, …, zM} and the target x, and computes additional parameters that enable adaptive updates to the tracking template.
First, the last layer of the matching network, conv5, is used to compute the average negative gradient:
δ = −(1/M) Σ_{i=1..M} ∂L(f_w(x, zi), yi) / ∂w5
where yi is the binary label map (response map) for zi computed from the ground truth. The meta-learner is designed on the empirical observation that δ differs from target to target and changes as the target's appearance changes.
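A sketch of how δ could be computed against the conv5 kernel of the matching-network sketch above; the loop over context patches and the helper names are my own:

```python
import torch

def average_negative_gradient(matching_net, loss_fn, x, context_patches, labels):
    """Compute delta: the loss gradient w.r.t. the conv5 kernel w5,
    averaged over the M context patches and negated (a sketch)."""
    w5 = matching_net.features[14].weight   # conv5 kernel in the sketch above
    grads = []
    for z_i, y_i in zip(context_patches, labels):
        loss = loss_fn(matching_net(x, z_i), y_i)
        (g,) = torch.autograd.grad(loss, w5)
        grads.append(g)
    # In PyTorch layout this has shape (192, 256, 1, 1),
    # i.e. the paper's 1x1x256x192 kernel.
    return -torch.stack(grads).mean(dim=0)
```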
Then, with δ as input, the meta-learner network g(·) generates target-specific weights corresponding to that input:
wtarget = g(δ)
The two weights are then concatenated to update the original conv5 weight of the matching network:
[W5,wtarget] of size 1×1×256×(192+32)
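A small sketch of how the adapted conv5 kernel could be formed and applied. Note that PyTorch stores convolution weights as (out, in, kH, kW), so the paper's 1×1×256×192 kernel becomes a (192, 256, 1, 1) tensor here:

```python
import torch
import torch.nn.functional as F

def adaptive_conv5(feat4, w5, w_target):
    """Apply conv5 with the adapted kernel [w5, w_target] (a sketch).

    feat4:    conv4 output, shape (1, 256, H, W)
    w5:       original conv5 kernel, shape (192, 256, 1, 1)
    w_target: meta-learner output,  shape (32, 256, 1, 1)
    Returns features with 192 + 32 = 224 channels.
    """
    w_adapt = torch.cat([w5, w_target], dim=0)   # concat along output channels
    return F.conv2d(feat4, w_adapt)

feat4 = torch.randn(1, 256, 24, 24)
out = adaptive_conv5(feat4, torch.randn(192, 256, 1, 1), torch.randn(32, 256, 1, 1))
print(out.shape)  # torch.Size([1, 224, 24, 24])
```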
The paper also says that channel-wise sigmoid attention weights are generated here, but the concrete implementation is not described in the paper, and I did not see how it is used later.
Tracking algorithm
For the tracking process, K context images are first saved as zmem = {z1, …, zK}, together with their corresponding response maps ŷmem = {ŷ1, …, ŷK}. At each frame, if the peak target response is greater than a threshold τ, the patch is added to the template pool. In the formula, p corresponds to a position in the set P of all possible positions in the response map, and ρ(·) is a normalizing function. Later, when computing the δ fed into the meta-learner network, the M templates with the lowest response entropy, i.e. those with larger and more reliable responses, are selected from the template pool and used to update the network.
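A sketch of this template-pool bookkeeping: patches are stored when their peak response exceeds τ, and the M lowest-entropy (most reliable) ones are later picked to compute δ. The exact form of the normalization ρ(·) and the eviction policy are my assumptions:

```python
import torch

def maybe_store(template_pool, response_pool, patch, response, tau, K):
    """Add the current patch when its peak response exceeds tau;
    keep at most K entries (oldest dropped, an assumed policy)."""
    if response.max().item() > tau:
        template_pool.append(patch)
        response_pool.append(response)
        if len(template_pool) > K:
            template_pool.pop(0)
            response_pool.pop(0)

def response_entropy(response, eps=1e-12):
    """Entropy of a normalized response map; a sharper (more reliable)
    detection has lower entropy. rho(.) is a simple shift-and-normalize here."""
    p = response.flatten()
    p = p - p.min()
    p = p / (p.sum() + eps)
    return float(-(p * (p + eps).log()).sum())

def select_reliable_templates(template_pool, response_pool, M):
    """Pick the M stored templates whose responses have the lowest entropy."""
    order = sorted(range(len(response_pool)),
                   key=lambda i: response_entropy(response_pool[i]))
    return [template_pool[i] for i in order[:M]]
```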
A cosine window h is then applied to obtain a new response map ŷ ⊗ h, which penalizes large displacements and ensures that the estimated target position and size change smoothly.
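A minimal sketch of this windowing step, using a Hann window as h:

```python
import torch

def apply_cosine_window(response):
    """Element-wise multiply the response map with a Hann (cosine) window
    so that large displacements from the window center are penalized."""
    h, w = response.shape[-2:]
    win = torch.outer(torch.hann_window(h, periodic=False),
                      torch.hann_window(w, periodic=False))
    return response * win
```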
The whole tracking process is as follows:
Implementation and Training
The meta-learner network takes the loss gradient δ from (2) as input; this information comes from the matching network and describes its state in the current feature space. The meta-learner network g(·) then learns the mapping from the loss gradient to the adaptive weight wtarget, which describes a target-specific feature space, i.e. the so-called target-specific weights. The meta-learner network can be trained with a loss function that measures how well the adaptive weight wtarget fits the new examples {z1, …, zM′}.
Matching Network
Network
- 5 convolutional layers; 2 pooling layers of kernel size 3 and stride 2 are applied after the first two convolutional layers
- W1: 11×11×3×128, W2: 5×5×128×256, W3: 3×3×256×384, W4: 3×3×384×256, W5: 1×1×256×192
- Inputs: 127×127×3 for x, 255×255×3 for z; response map: 17×17
Train
During training, pairs (x, z) are randomly sampled from target trajectories in the selected video sequences. Then the ground-truth response map y ∈ {−1, +1}^(17×17) is generated, with value +1 at the target position and −1 elsewhere. The loss function L(f_w(x, z), y) is the logistic loss:
L(f_w(x, z), y) = Σ_{p∈P} ζ(y[p]) · log(1 + exp(−y[p] · f_w(x, z)[p]))
where p is a position in the set P of all possible positions in the response map, and ζ(y[p]) is a weighting function that reduces the label imbalance. The loss is optimized with the Adam optimizer at a learning rate of 1e-4 with a batch size of 8, for 95,000 iterations.
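A sketch of this balanced logistic loss; the exact form of the weighting ζ is not given above, so inverse-frequency weighting is assumed here:

```python
import torch

def weighted_logistic_loss(response, y):
    """Logistic loss over the response map with per-position weights zeta(y[p]);
    here zeta balances positives and negatives by their frequency (an assumption)."""
    pos = (y > 0).float()
    neg = (y < 0).float()
    zeta = pos / pos.sum().clamp(min=1) + neg / neg.sum().clamp(min=1)
    loss = torch.log1p(torch.exp(-y * response))
    return (zeta * loss).sum()

# Optimization as stated above (matching_net from the earlier sketch):
# optimizer = torch.optim.Adam(matching_net.parameters(), lr=1e-4)
```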
Meta-learner Network
Network
- 3 fully connected layers; each intermediate layer is followed by a dropout layer with keep probability 0.7 during training
- Input: gradient δ of size 1×1×256×192
- Output: wtarget of size 1×1×256×32, giving [W5, wtarget] of size 1×1×256×(192+32)
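A sketch of the meta-learner g(·) with these sizes; the hidden width is my own choice since only the input/output sizes and the dropout keep probability are specified (keep probability 0.7 corresponds to p=0.3 in PyTorch):

```python
import torch
import torch.nn as nn

class MetaLearner(nn.Module):
    """Meta-learner sketch: 3 fully connected layers with dropout after the
    intermediate layers. Hidden width 512 is an assumption."""
    def __init__(self, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(256 * 192, hidden), nn.ReLU(), nn.Dropout(p=0.3),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p=0.3),
            nn.Linear(hidden, 256 * 32),
        )

    def forward(self, delta):
        # delta: averaged negative gradient, shape (192, 256, 1, 1) in PyTorch
        # layout (the paper's 1x1x256x192)
        w_target = self.net(delta.flatten())
        return w_target.view(32, 256, 1, 1)   # target-specific conv5 kernel

g = MetaLearner()
w_target = g(torch.randn(192, 256, 1, 1))
print(w_target.shape)  # torch.Size([32, 256, 1, 1])
```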
Train
During training, M′ (M′ ≥ M) context patches are randomly sampled from a training sequence, i.e. zreg = {z1, …, zM′}, and M of them are used to compute the gradient δ with formula (2); the binary response maps used there are generated assuming the target is located at the center of each zi. The purpose is to train the meta-learner network not to pursue the optimum of a single task, but to find a solution from which every task can reach its own optimum. For this reason the target is assumed to be at the center of each context patch, so that no matter where the target moves, the adapted features respond strongly to it. The following loss is optimized with respect to the meta-learner network only, while the matching network is kept fixed:
Σ_{i=1..M′} L(f_{[w5, wtarget]}(x, zi), yi)
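Putting the pieces together, a sketch of one meta-learner training step under the assumptions of the earlier sketches; the optimizer holds only the meta-learner's parameters, so the matching network stays fixed:

```python
import torch
import torch.nn.functional as F

def adapted_features(matching_net, img, w_target):
    """Features computed with the adapted conv5 kernel [w5, w_target]
    (uses the hypothetical helpers and layer indices from the sketches above)."""
    feat4 = matching_net.features[:14](img)            # up to the conv4 output
    return adaptive_conv5(feat4, matching_net.features[14].weight, w_target)

def meta_learner_step(matching_net, meta_learner, loss_fn, optimizer,
                      x, patches, labels, M):
    """One training step for the meta-learner g(.) (a sketch)."""
    # delta from M of the M' patches, treated as a constant input to g(.)
    delta = average_negative_gradient(matching_net, loss_fn,
                                      x, patches[:M], labels[:M]).detach()
    w_target = meta_learner(delta)

    feat_x = adapted_features(matching_net, x, w_target)
    loss = 0.0
    for z_i, y_i in zip(patches, labels):              # all M' regression patches
        feat_z = adapted_features(matching_net, z_i, w_target)
        response = F.conv2d(feat_z, feat_x)            # 17x17 response map
        loss = loss + loss_fn(response, y_i)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # optimizer wraps meta_learner.parameters() only
    return loss.item()
```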
Experimental Results
Quantitative results on the OTB [51] and LaSOT [12] datasets, where MLT is the proposed algorithm. The algorithm shows good performance on the OTB datasets, outperforms other algorithms on the large-scale LaSOT dataset, and gains additional performance from the extra feature space provided by the meta-learner.
- MLT-mt: only adds the meta-learner
- MLT-mt-ft: additionally fine-tunes with the Adam optimizer