This article summarizes a 2021 survey paper that briefly reviews existing SOTA models and MOT algorithms, discusses deep learning in multi-object tracking, introduces evaluation metrics, datasets, and benchmark results, and closes with a conclusion.
This article is from the public account CV Technical Guide, which focuses on computer vision technique summaries, the latest technology tracking, and interpretations of classic papers.
Multi-target tracking (MTT) in video surveillance is an important and challenging task that has attracted extensive attention from researchers because of its potential applications in many fields. Multi-target tracking requires locating each target individually in every frame, which remains a major challenge because target appearance can change abruptly and extreme occlusion can occur. In addition, a multi-target tracking framework must perform several tasks: target detection, trajectory estimation, inter-frame association, and re-identification. Various approaches have been proposed, each making assumptions that constrain the problem to a particular setting. This article reviews MTT models that exploit the representational power of deep learning.
Multi-target tracking consists of two main tasks: target detection and tracking. To distinguish individual objects within a group, the MTT algorithm associates a unique ID with each detected object, and that ID stays with the object for as long as it is tracked. These IDs are then used to generate the motion trajectory of each tracked object.
The accuracy of target detection determines the effectiveness of a target tracking system. The accuracy of an MTT model is strongly affected by scale variation, frequent ID switches, rotation, illumination changes, and other factors. Figure 1 shows the output of an MTT algorithm. In addition, a multi-target tracking system must handle complex issues such as background clutter, recovering lost tracks, and track initialization and termination. To overcome these problems, researchers have proposed a variety of strategies based on deep neural networks.
Classification of MTT algorithms
MOT implementations can be divided into detection-based tracking (DBT) and detection-free tracking (DFT), depending on how objects are initialized. Most MTT models, however, have standardized on detection-based tracking, where detections (objects identified in each frame) are obtained as a step prior to tracking. Since DBT requires a target detector to identify targets, performance depends largely on the quality of the detector, so choosing a detection framework is crucial.
Detection-free tracking (DFT)
In DBT, the detector's output is used as the tracker's input and fed to a motion-prediction algorithm that estimates where each object will move in the following frames. In detection-free tracking this is not the case: DFT-based models require a fixed number of objects to be manually initialized in the first frame and then localized in subsequent frames.
DFT is difficult because only limited and ambiguous information about the object to be tracked is available. The initial bounding box is only a rough approximation of the object of interest against the background, and the object's appearance can change dramatically over time.
Online Tracking
Online tracking algorithms, also known as sequential tracking, generate predictions for the current frame using only past and present information. Such algorithms process frames one step at a time, which is essential in applications such as autonomous driving and robot navigation.
Batch Tracking
To determine the identity of an object in a given frame, batch tracking (offline tracking) methods also use information from future frames. They exploit global information, which improves tracking quality; however, due to computational and memory constraints, it is not always possible to process all frames at once.
Deep learning algorithms
The main steps shared by most algorithms are as follows:
Detection stage: the input frame is analyzed and targets are localized with bounding boxes across the sequence of frames.
Feature extraction / motion prediction stage: the detections are analyzed to extract appearance, motion, or interaction features.
Affinity computation stage: the extracted features are used to compute a similarity/distance between pairs of detections (or between detections and existing tracks).
Association stage: the similarity/distance measures are used to assign the same ID to detections that correspond to the same target (see the sketch after this list for how the stages fit together).
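To make these four stages concrete, here is a minimal sketch of a tracking-by-detection loop. The detect, extract_features, and affinity callables and the greedy matching rule are illustrative assumptions, not a specific published tracker.

```python
import numpy as np

def track_sequence(frames, detect, extract_features, affinity, min_affinity=0.3):
    """Minimal tracking-by-detection loop covering the four stages listed above."""
    tracks = {}          # track_id -> latest feature vector
    trajectories = {}    # track_id -> list of (frame_index, box)
    next_id = 0

    for t, frame in enumerate(frames):
        boxes = detect(frame)                                  # 1. detection
        feats = [extract_features(frame, b) for b in boxes]    # 2. feature extraction

        assigned = set()
        for tid, track_feat in tracks.items():
            if not feats:
                break
            # 3. affinity: score this track against every unassigned detection
            scores = [affinity(track_feat, f) if j not in assigned else -np.inf
                      for j, f in enumerate(feats)]
            j_best = int(np.argmax(scores))
            # 4. association: keep the same ID if the best score is good enough
            if scores[j_best] >= min_affinity:
                tracks[tid] = feats[j_best]
                trajectories[tid].append((t, boxes[j_best]))
                assigned.add(j_best)

        for j, (b, f) in enumerate(zip(boxes, feats)):
            if j not in assigned:                              # unmatched detection -> new track
                tracks[next_id] = f
                trajectories[next_id] = [(t, b)]
                next_id += 1

    return trajectories
```

In practice the greedy matching above is usually replaced by a global assignment such as the Hungarian algorithm, as discussed in the association section below.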
Detection stage
The detection stage mainly relies on established object detection algorithms.
YOLO, a single convolutional neural network, directly predicts multiple bounding boxes and class probabilities from the full image; it trains on full images and directly optimizes detection performance while learning a generalizable representation of the target. However, YOLO imposes strict spatial constraints on bounding-box prediction, which limits the number of nearby objects the model can predict. Groups of small objects, such as flocks of birds, are also problematic for the model.
Faster R-CNN, a single unified object detection network built entirely from deep CNNs, improves detection accuracy and efficiency while reducing computational overhead. It uses a training scheme that alternates between fine-tuning the region-proposal stage and fine-tuning the detection stage while keeping the proposals fixed, yielding a unified, deep-learning-based detection system that runs at near-real-time frame rates.
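As a concrete illustration of this detection stage, the sketch below runs a pretrained Faster R-CNN from torchvision on a single frame and keeps high-confidence boxes; the choice of torchvision and the confidence threshold are illustrative assumptions, not part of the reviewed methods.

```python
import torch
import torchvision

# Off-the-shelf Faster R-CNN detector (torchvision >= 0.13; older versions use pretrained=True).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect(frame, score_thresh=0.5):
    """Return [x1, y1, x2, y2] boxes above a confidence threshold for one RGB frame."""
    # frame: HxWx3 uint8 numpy array -> CxHxW float tensor scaled to [0, 1]
    img = torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        output = model([img])[0]          # dict with 'boxes', 'labels', 'scores'
    keep = output["scores"] > score_thresh
    return output["boxes"][keep].cpu().numpy()
```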
In some surveillance scenes, occlusion is so frequent that the full shape of an object, such as a person's body, cannot be detected.
To address this, Khan et al. proposed a temporal-consistency model trained to detect only head positions; similarly, other techniques track only the head rather than the entire body.
Bewley et al. proposed the SORT framework to harness the power of CNN-based detection, achieving best-in-class speed and accuracy in MOT by focusing on frame-to-frame prediction and association. Built on Kalman filters and the Hungarian algorithm, it reached top-ranked performance when detections obtained from Aggregated Channel Features (ACF) were replaced with detections computed by Faster R-CNN. In some cases, CNNs are used in the detection step for purposes other than producing the target bounding box.
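The core of SORT's association step is an IoU cost between the Kalman-predicted boxes of existing tracks and the new detections, solved with the Hungarian algorithm. Below is a minimal sketch of that matching step; the Kalman filter itself is omitted and the IoU threshold is an illustrative value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_boxes, detections, iou_threshold=0.3):
    """SORT-style matching of Kalman-predicted track boxes to new detections."""
    if len(predicted_boxes) == 0 or len(detections) == 0:
        return [], list(range(len(predicted_boxes))), list(range(len(detections)))

    # Cost = 1 - IoU, minimized globally by the Hungarian algorithm.
    cost = np.array([[1.0 - iou(p, d) for d in detections] for p in predicted_boxes])
    rows, cols = linear_sum_assignment(cost)

    matches, unmatched_tracks, unmatched_dets = [], [], []
    for r, c in zip(rows, cols):
        if 1.0 - cost[r, c] >= iou_threshold:
            matches.append((r, c))          # track r keeps its ID via detection c
        else:
            unmatched_tracks.append(r)      # too little overlap: treat both as unmatched
            unmatched_dets.append(c)
    unmatched_tracks += [r for r in range(len(predicted_boxes)) if r not in rows]
    unmatched_dets += [c for c in range(len(detections)) if c not in cols]
    return matches, unmatched_tracks, unmatched_dets
```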
For multi-target tracking of vehicles, Min proposed an upgraded ViBe that combines a robust detection strategy with a binary classifier for robust and accurate identification of multiple vehicles. While the ViBe algorithm detects vehicle candidates, a CNN is used to eliminate false positives; the approach effectively suppresses dynamic noise and quickly removes ghosting and residual shadows of objects.
Feature extraction / motion prediction stage
The performance of deep models improves when they are used to learn MOT-specific features such as spatial and temporal attention maps or temporal sequences. Some end-to-end deep learning models can extract not only appearance descriptors but also motion-information features.
Wang et al. proposed one of the first methods to apply deep learning in an MOT pipeline. The system exploits the strengths of single-target trackers and addresses the drift caused by occlusion without sacrificing computational efficiency. To improve feature extraction, the network uses two stacked autoencoder layers, and a support vector machine is then used to compute affinities. A visibility map of the target is learned and used to infer a spatial attention map that weights the features; the visibility map can also be used to estimate occlusion. This is referred to as a temporally aware mechanism.
The most commonly used CNN-based methods can be further divided into classic CNN and Siamese CNN for feature extraction.
Classic CNN
Kim et al. argued that the classic Multiple Hypothesis Tracking (MHT) technique is well suited to the current visual-tracking landscape. Advances in modern detection-based tracking and efficient feature representations for object appearance open new possibilities for the MHT framework. They improved MHT by incorporating a regularized least-squares framework to train an appearance model for each tracked target online.
SORT achieves good accuracy and precision at high frame rates but produces a relatively high number of identity switches. Wojke et al. improved it by integrating appearance information, replacing the association metric with one based on a convolutional neural network (CNN) trained to discriminate pedestrians on a large-scale person re-identification dataset. Compared with SORT, the upgraded tracker reduced the number of identity switches from 1,423 to 781, a reduction of about 45%, achieving competitive performance while maintaining real-time speed.
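The appearance metric behind this improvement is typically a cosine distance between L2-normalized re-identification embeddings. The sketch below computes such a distance matrix; the embedding network that produces the features is assumed to exist and is not shown.

```python
import numpy as np

def cosine_distance_matrix(track_embeddings, det_embeddings):
    """Pairwise cosine distance between track and detection appearance embeddings.

    Inputs are (N, D) and (M, D) arrays of re-ID features; rows are L2-normalized,
    so distance = 1 - cosine similarity.
    """
    a = np.array(track_embeddings, dtype=float)
    b = np.array(det_embeddings, dtype=float)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return 1.0 - a @ b.T

# Identical embeddings give distance ~0, orthogonal ones ~1.
tracks = np.array([[1.0, 0.0], [0.0, 1.0]])
dets = np.array([[1.0, 0.0]])
print(cosine_distance_matrix(tracks, dets))   # [[0.], [1.]]
```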
Siamese CNN
Siamese CNNs have proved useful in MOT because the goal of feature learning in this setting is to measure the similarity between a detection and an existing track.
Leal-Taixé et al. proposed a two-stage detection-matching strategy that offers a new perspective on the data-association challenge in pedestrian tracking. They applied CNNs to multi-person tracking and proposed learning whether two detections belong to the same trajectory, avoiding hand-crafted features for data association. The learning framework is divided into two stages.
First, a CNN is pre-trained in a Siamese configuration to measure the similarity of two equally sized image regions; the CNN features are then combined with contextual features to produce a match prediction. By formulating tracking as a linear program and combining the deep features and motion information through gradient boosting, they solve the tracking problem effectively.
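A Siamese network of this kind shares one CNN between the two image crops and scores whether they show the same identity. The PyTorch sketch below is a generic illustration; the toy backbone, crop size, and comparison head are my own choices, not the exact architecture of Leal-Taixé et al.

```python
import torch
import torch.nn as nn

class SiameseAffinity(nn.Module):
    """Shared CNN encoder + small head that scores whether two crops match."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(           # toy backbone for 3x64x64 crops
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.head = nn.Sequential(              # compares the two embeddings
            nn.Linear(2 * embed_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, crop_a, crop_b):
        za = self.encoder(crop_a)               # same weights for both branches
        zb = self.encoder(crop_b)
        logit = self.head(torch.cat([za, zb], dim=1))
        return torch.sigmoid(logit)             # probability the crops match

# Usage: score a batch of detection pairs (random tensors stand in for image crops).
model = SiameseAffinity()
a, b = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
print(model(a, b).shape)                        # torch.Size([4, 1])
```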
Affinity computation stage
While some implementations use deep learning models to directly output affinity scores without an explicit distance measure between features, other approaches compute the affinity between tracks and detections by applying a distance measure to features obtained from a CNN.
Milan et al. tackled data association and trajectory estimation entirely with neural networks. In online MOT, the state of a tracked target is estimated with a recursive Bayesian filter consisting of an observation-prediction step and an update step. The model uses an RNN to emulate this process: the target state, the current observations, their matching matrix, and the existence probability are fed into the network, which outputs the predicted and updated target state as well as an existence probability used to decide whether the target should be terminated. The approach achieves good tracking results.
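As a loose illustration of such an RNN-based predict/update step: the sketch below is an abstraction only, and the dimensions, the GRU cell, and the single-target interface are my assumptions, not Milan et al.'s exact architecture.

```python
import torch
import torch.nn as nn

class RNNStateEstimator(nn.Module):
    """Abstracted RNN-based predict/update step for one target (illustrative only)."""
    def __init__(self, state_dim=4, hidden_dim=64):
        super().__init__()
        self.cell = nn.GRUCell(state_dim * 2, hidden_dim)   # [previous state, observation] in
        self.to_state = nn.Linear(hidden_dim, state_dim)    # updated state out
        self.to_exist = nn.Linear(hidden_dim, 1)            # existence / termination score

    def forward(self, prev_state, observation, hidden):
        inp = torch.cat([prev_state, observation], dim=1)
        hidden = self.cell(inp, hidden)
        new_state = self.to_state(hidden)                   # Bayesian-filter-style update
        p_exist = torch.sigmoid(self.to_exist(hidden))      # keep or terminate the track
        return new_state, p_exist, hidden

# One step: a 4-D box state plus one matched observation for a batch of 2 targets.
est = RNNStateEstimator()
state, obs = torch.rand(2, 4), torch.rand(2, 4)
h = torch.zeros(2, 64)
new_state, p_exist, h = est(state, obs, h)
```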
Instead of computing the affinity between targets and detections, Chen et al. proposed computing the affinity between sampled particles and the tracked target; detections that do not match any tracked object are used to create new tracks and to recover lost objects. Although it is an online algorithm, at the time of publication it achieved the best results on MOT15 with both public and private detections.
Tracking / association stage
Deep learning has been used in some MTT models to improve the association step.
Ma et al. extended a Siamese tracker network with a bi-directional GRU that decides where a track should be cut. For each detection, the network extracts appearance features and feeds them to the bi-directional GRU; its outputs are temporally pooled in Euclidean space to give an overall track feature. During tracking, tracks are split into smaller tracklets at points where the local distance between the bi-directional GRU outputs is large, and these tracklets are then re-linked into long tracks using the global similarity from the temporal pooling. On the MOT16 dataset, the results obtained by this method are comparable to the latest SOTA.
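The key building block here, a bi-directional GRU over per-detection features followed by temporal pooling into a single track embedding, can be sketched as follows; the feature and hidden sizes are illustrative, not the exact values used by Ma et al.

```python
import torch
import torch.nn as nn

class TrackEmbedder(nn.Module):
    """Bi-directional GRU over per-detection appearance features, temporally pooled
    into one track-level embedding (illustrative sizes only)."""
    def __init__(self, feat_dim=128, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, det_feats):
        # det_feats: (batch, track_length, feat_dim) sequence of detection features
        out, _ = self.gru(det_feats)        # (batch, track_length, 2 * hidden_dim)
        return out.mean(dim=1)              # temporal average pooling -> track feature

embedder = TrackEmbedder()
tracklet = torch.rand(1, 10, 128)           # one tracklet of 10 detections
print(embedder(tracklet).shape)             # torch.Size([1, 128])
```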
Lehn et al. proposed carrying out the association task collaboratively with multiple deep reinforcement learning (RL) agents. A prediction network and a decision network are the two key components of the model: using the most recent track trajectories, a CNN serves as the prediction network and is trained to predict target motion in new frames.
Other methods
In addition to models based on the above four steps, there are several other approaches.
Jiang et al. used a deep RL agent to perform bounding-box regression, improving the performance of existing tracking algorithms. A VGG-16 CNN extracts appearance features, which are combined with a history of the target's last ten movements; the network then chooses among actions such as moving the bounding box, scaling it, or terminating. On the MOT15 dataset, applying this bounding-box regression on top of several state-of-the-art MOT algorithms improved MOTA by 2 to 7 absolute points, making it one of the top methods with public detections.
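Bounding-box regression of this kind usually follows the standard R-CNN parameterization: shift the box center by a fraction of its size and rescale the width and height in log space. The small sketch below shows that standard transform, not necessarily the exact action space used by Jiang et al.

```python
import numpy as np

def apply_box_deltas(box, deltas):
    """Standard bounding-box regression: shift the center and rescale the size.

    box: [x1, y1, x2, y2]; deltas: [dx, dy, dw, dh] predicted by the network.
    """
    w, h = box[2] - box[0], box[3] - box[1]
    cx, cy = box[0] + 0.5 * w, box[1] + 0.5 * h
    cx += deltas[0] * w                  # move the center by a fraction of the size
    cy += deltas[1] * h
    w *= np.exp(deltas[2])               # log-space scaling keeps width/height positive
    h *= np.exp(deltas[3])
    return [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h]

# A small positive dw/dh grows the box slightly around the same center.
print(apply_box_deltas([10, 10, 50, 90], [0.0, 0.0, 0.1, 0.1]))
```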
Xiang et al. deployed MetricNet for pedestrian tracking, combining an affinity model with trajectory estimation from a Bayesian filter. A VGG-16 CNN, trained for person re-identification, performs feature extraction and bounding-box regression. The motion model has two parts: one takes trajectory coordinates as input, and the other performs the Bayesian-filter update using the detected box and outputs the target's updated position. On MOT16 and MOT15, the algorithm obtained the best and second-best scores among online methods.
Recent advances in model-free single-object tracking (SOT) algorithms have encouraged their use in multi-object tracking (MOT) to improve track recovery and reduce dependence on external detectors. However, SOT algorithms are usually designed to distinguish a target from its surroundings, and they often run into problems when the target is surrounded by similar-looking objects, as commonly happens in MOT.
Chu et al. proposed a model to address this robustness issue and eliminate the dependence on external detectors. The algorithm uses three different CNNs: PafNet distinguishes tracked targets from the background, a second CNN distinguishes between the tracked targets themselves, and a third convolutional network decides whether the tracking model needs to be refreshed. A support vector machine classifier and the Hungarian algorithm associate unmatched detections to recover from target occlusion. The algorithm was tested on the MOT15 and MOT16 datasets, producing the best overall results on the first and the best results among online methods on the second.
Evaluation metrics
The most relevant are Classical Metrics and CLEAR MOT Metrics.
The classical metrics highlight the weak points of an algorithm, for example the number of mostly tracked (MT) trajectories, mostly lost (ML) trajectories, ID switches, and so on.
The CLEAR MOT metrics include MOTA (multi-object tracking accuracy) and MOTP (multi-object tracking precision). MOTA combines the false-positive, false-negative, and mismatch rates into a single value, giving a reasonable overall measure of tracking performance; despite some flaws and criticism, it is by far the most widely used evaluation metric. MOTP describes how precisely objects are localized, using bounding-box overlap and/or distance measures.
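For reference, the two scores reduce to simple formulas once the per-frame error counts have been accumulated; the sketch below assumes that aggregation over frames has already been done.

```python
def mota(fn, fp, id_switches, num_gt):
    """MOTA = 1 - (missed detections + false positives + ID switches) / ground-truth objects,
    with all counts summed over every frame of the sequence."""
    return 1.0 - (fn + fp + id_switches) / float(num_gt)

def motp(total_overlap, num_matches):
    """MOTP: average bounding-box overlap (or distance) over all matched track-GT pairs."""
    return total_overlap / float(num_matches)

# Example with made-up counts: 100 GT objects, 5 misses, 3 false positives, 2 ID switches.
print(mota(fn=5, fp=3, id_switches=2, num_gt=100))   # 0.9
```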
Benchmark datasets
Benchmark datasets include MOTChallenge, KITTI, and UA-DETRAC.
The MOTChallenge dataset is the largest and most complete pedestrian tracking benchmark currently available and provides plenty of data for training deep models. MOT15, the original MOTChallenge dataset, contains videos with a wide range of characteristics, so a model needs to generalize well to obtain good results. MOT16 and MOT19 are later, revised versions.
Benchmark results
The published results on the MOTChallenge MOT15 and MOT16 datasets, collected by Gioele et al. from the corresponding publications, are listed below to allow a clear comparison between the methods discussed in this work.
Because detection quality affects performance, the results are split into models that use public detections and models that use private detections. The methods are further divided into two categories: online and offline.
For each method, the tables report the year of the reference publication and its operating mode; MOTA, MOTP, IDF1, mostly tracked (MT) and mostly lost (ML), expressed as percentages; the absolute numbers of false positives (FP), false negatives (FN), ID switches (IDS) and fragmentations (Frag); and the algorithm speed in frames per second (Hz).
For each metric, an up arrow (↑) indicates that higher is better, while a down arrow (↓) indicates that lower is better. Among models running in the same mode (batch/online), the best value of each statistic is highlighted in bold. Only results from the models reviewed in this work are listed in Tables 2 and 3.
In practice, models that use deep learning and operate online currently achieve the best results, although this may simply reflect the greater research emphasis on online approaches, which are increasingly popular in the MOT deep learning community. Heavy fragmentation is a common problem with online methods and is not reflected in MOTA scores: when a target is occluded or its detection is lost, an online algorithm cannot look ahead to re-identify the target and fill in the missing track segment.
Conclusion
This article has reviewed the use of deep learning for MTT, discussing deep-learning-based solutions for each of the four steps of the MTT pipeline and covering a number of SOTA MOT techniques.
The evaluation of MOT algorithms, including evaluation metrics and benchmark results on publicly available datasets, was also briefly discussed. The recent introduction of deep models into single-object trackers has led to high-performance online trackers, while batch processing, on the other hand, benefits from introducing deep models into global graph-optimization algorithms.