ECCV 2020, one of the top three international conferences on computer vision, has announced its paper decisions. This year ECCV received 5,025 valid submissions and accepted 1,361 of them, an acceptance rate of 27%, lower than the previous edition's. Of the accepted papers, 104 were orals (2% of submissions) and 161 were spotlights (5% of submissions); the rest are posters.
ECCV (European Conference on Computer Vision) is one of the world's top computer vision conferences, held every two years. As artificial intelligence advances and computer vision research and applications deepen, each edition attracts a large number of submissions; this year's submission count was more than twice that of ECCV 2018, a record high. Amid this increasingly fierce competition, Tencent Youtu Lab had a total of 8 papers accepted at ECCV, covering hot and frontier areas such as object tracking, person re-identification, face recognition, human pose estimation, action recognition, and object detection, demonstrating Tencent's research and innovation strength in computer vision.
The following are some of Tencent Youtu's papers accepted at ECCV 2020:
Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking
Most existing multi-object tracking (MOT) algorithms follow the traditional tracking-by-detection framework of three modules: object detection, feature extraction, and data association; a few MOT algorithms merge two of the three modules to achieve partially end-to-end tracking. This paper proposes Chained-Tracker (CTracker), the first method in the industry to adopt a two-adjacent-frame input mode, integrating all three modules into a single network for end-to-end joint detection and tracking, and the first to reformulate the data-association problem in tracking as a box-pair regression problem. The network takes two adjacent frames, called a chain node, as input, and outputs pairs of detection boxes belonging to the same target across those two frames; box pairs from adjacent nodes are then associated through their common frame. To further improve tracking, a joint attention module is designed to highlight informative regions for box-pair regression, comprising an object-attention mechanism in the object-classification branch and an identity-attention mechanism in the ID-verification branch. Without introducing any extra data, CTracker achieves state-of-the-art results on both MOT16 and MOT17, with MOTA of 67.6 and 66.6 respectively.
Algorithm framework diagram:
Network structure diagram:
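To make the chaining step of CTracker concrete, here is a minimal sketch in plain Python (illustrative only, not the authors' released code): box pairs output by adjacent chain nodes are linked by IoU matching on the frame they share.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def link_chain_nodes(pairs_t, pairs_t1, thresh=0.5):
    """Greedily link box pairs from node t (frames t, t+1) and
    node t+1 (frames t+1, t+2) via their shared frame t+1."""
    matches, used = [], set()
    for i, (_, box_shared) in enumerate(pairs_t):
        best_j, best_iou = -1, thresh
        for j, (box_cand, _) in enumerate(pairs_t1):
            if j in used:
                continue
            ov = iou(box_shared, box_cand)
            if ov > best_iou:
                best_j, best_iou = j, ov
        if best_j >= 0:
            matches.append((i, best_j))  # pair i continues as pair j
            used.add(best_j)
    return matches
```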
Do Not Disturb Me: Person Re-identification Under the Interference of Other Pedestrians
Traditional person re-identification (Re-ID) assumes that each cropped image contains only one person. In crowded scenes, however, an off-the-shelf detector may produce bounding boxes that contain multiple people, include a large proportion of background pedestrians, or suffer from occlusion. Features extracted from such pedestrian-interfered images carry distracting information and lead to incorrect retrieval results. To address this, the paper proposes a new deep network, PISNet, which first uses a query-guided attention module to enhance the target's features in the gallery image. In addition, a reverse-attention module and a multi-person separation loss are proposed to drive the attention module to suppress interference from other pedestrians. The method is evaluated on two new pedestrian-interference datasets, and the results show that it outperforms existing Re-ID methods.
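A rough PyTorch sketch of the query-guided attention idea (shapes and layer choices here are assumptions, not the released PISNet code): the pooled query descriptor scores every spatial location of the gallery feature map, so regions resembling the query are enhanced and other pedestrians are suppressed.

```python
import torch
import torch.nn as nn

class QueryGuidedAttention(nn.Module):
    """Illustrative module: weight gallery feature locations by their
    similarity to a pooled query descriptor."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, gallery_feat, query_feat):
        # gallery_feat: (B, C, H, W); query_feat: (B, C) pooled query descriptor
        g = self.proj(gallery_feat)                          # (B, C, H, W)
        q = query_feat.unsqueeze(-1).unsqueeze(-1)           # (B, C, 1, 1)
        attn = torch.sigmoid((g * q).sum(1, keepdim=True))   # (B, 1, H, W)
        return gallery_feat * attn                           # re-weighted features
```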
Improving Face Recognition from Hard Samples via Distribution Distillation Loss
Current deep-learning-based face recognition algorithms handle easy samples well but still perform poorly on hard samples (low resolution, large pose, etc.). There are two main lines of work on this problem. The first designs task-specific structures or loss functions by fully exploiting prior knowledge about the particular facial variation to be handled; such methods usually do not transfer easily to other variation types. The second designs loss functions that reduce intra-class distance and enlarge inter-class distance to obtain more discriminative facial features; such methods generally show a significant performance gap between easy and hard samples. To improve performance on hard samples, we propose a loss function based on distribution distillation. Specifically, using a pre-trained recognition model we first construct two similarity distributions: a teacher distribution built from easy samples and a student distribution built from hard samples. We then pull the student distribution toward the teacher distribution with a distribution distillation loss, which shrinks the overlap between the similarity distributions of positive (same-identity) and negative (different-identity) pairs among hard samples and thereby improves recognition on them. Extensive experiments on common large-scale face benchmarks and on multiple benchmarks with different variation types (race, resolution, pose) verify the effectiveness of the method.
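The core idea can be sketched as follows (the binning, temperature, and KL direction are illustrative choices, not the paper's exact formulation): build differentiable histograms of cosine similarities for easy (teacher) and hard (student) pairs, then penalize the divergence between them.

```python
import torch

def soft_histogram(sims, bins=30, t=0.1):
    """Differentiable histogram of cosine similarities in [-1, 1]."""
    centers = torch.linspace(-1, 1, bins, device=sims.device)
    weights = torch.exp(-((sims.unsqueeze(1) - centers) ** 2) / (2 * t ** 2))
    hist = weights.sum(0)
    return hist / hist.sum()

def distribution_distillation_loss(teacher_sims, student_sims):
    """KL(teacher || student): pull the hard-sample (student) similarity
    distribution toward the easy-sample (teacher) one."""
    p = soft_histogram(teacher_sims)
    q = soft_histogram(student_sims)
    return torch.sum(p * (torch.log(p + 1e-9) - torch.log(q + 1e-9)))
```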
Adversarial Semantic Data Augmentation for Human Pose Estimation
Human pose estimation aims to localize the keypoint coordinates of the human body. Current state-of-the-art methods still struggle in three difficult scenarios: severe occlusion, interference from nearby people, and symmetric appearance, largely because training data for these scenarios is scarce. Previous methods mainly augment training data with global spatial transforms such as scaling, rotation, and translation; such conventional augmentation does little to help in the three scenarios above. This paper proposes Adversarial Semantic Data Augmentation (ASDA): the human body is segmented into semantic parts, and these parts are recombined at different granularities to simulate the three difficult scenarios. When recombining body parts, each part has its own spatial-transform parameters, so occlusion by others, crossed arms, complex movements, and so on can be composed flexibly. To make the pose-estimation network robust to these difficult scenes, a generative network (G) controls the spatial-transform parameters of each body part, while the pose-estimation network (D) acts as the discriminator and learns from the hard samples generated by G. G and D are trained adversarially: G continuously generates diverse hard samples to confuse the pose network, and the pose network improves its prediction accuracy on difficult scenes in the process.
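The adversarial loop can be summarized in a short PyTorch sketch (g_net, pose_net, augment, and criterion are placeholder names; the real method re-pastes segmented body parts according to the predicted transform parameters):

```python
def adversarial_step(g_net, pose_net, augment, images, heatmaps,
                     g_opt, d_opt, criterion):
    """One illustrative training step of the G/D game."""
    params = g_net(images)                    # per-part transform parameters
    hard_images = augment(images, params)     # recombine body parts

    # D step: train the pose network on the generated hard samples
    d_opt.zero_grad()
    d_loss = criterion(pose_net(hard_images.detach()), heatmaps)
    d_loss.backward()
    d_opt.step()

    # G step: maximize the pose loss so G keeps producing confusing samples
    g_opt.zero_grad()
    g_loss = -criterion(pose_net(hard_images), heatmaps)
    g_loss.backward()
    g_opt.step()
```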
Face Anti-spoofing via Disentangled Representation Learning
Liveness detection (face anti-spoofing) determines whether the subject in an authentication scenario is a live person, defending against photos, masks, screen replays, and other attacks to keep face recognition secure. Existing RGB-based liveness detection methods usually extract discriminative features directly from images, but such features may contain information irrelevant to the liveness task, such as illumination, background, or identity, which hurts generalization in practice. Targeting generalization, this paper makes the following contributions from a feature-disentanglement perspective:
1. A disentanglement framework that decouples image features into liveness-related and liveness-irrelevant features, and uses the liveness-related features for real/spoof discrimination.
2. Constraints from low-level texture features and high-level depth features are combined to further promote the disentanglement of liveness features.
3. The factors that influence liveness features, such as attack medium and capture device, are explored and analyzed to deepen understanding of the nature of the liveness task.
Experiments on several academic datasets demonstrate the effectiveness of this auxiliary-constraint-based feature-disentanglement method for the liveness task.
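A minimal sketch of the decoupling idea (dimensions and layers are assumptions, not the paper's architecture): split the latent code into a liveness-related part and a liveness-irrelevant part, and classify real vs. spoof only from the former; the remainder would feed reconstruction-style constraints in the full method.

```python
import torch.nn as nn

class DisentangledFASNet(nn.Module):
    """Illustrative encoder that splits its latent code into
    liveness-related and liveness-irrelevant parts."""
    def __init__(self, latent_dim=256, live_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        self.live_dim = live_dim
        self.classifier = nn.Linear(live_dim, 2)  # real vs. spoof

    def forward(self, x):
        z = self.encoder(x)
        z_live, z_rest = z[:, :self.live_dim], z[:, self.live_dim:]
        # z_rest would drive reconstruction / swapping losses in the full model
        return self.classifier(z_live), z_live, z_rest
```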
SSCGAN: Facial Attribute Editing via Style Skip Connections
Existing face attribute editing methods usually adopt an encoder-decoder structure in which attribute information is expressed as a one-hot vector and concatenated with the image or feature maps. However, such operations only learn local semantic mappings and ignore global facial statistics. In this paper, we propose to address the problem by modifying global information (style features) at the channel level. We design a style-skip-connection-based generative adversarial network (SSCGAN) for accurate facial attribute manipulation. Specifically, we inject the target attribute information along multiple style skip connection paths between the encoder and the decoder. Each connection extracts the style feature of a hidden layer in the encoder and applies a residual-based mapping function to transfer it into the space of the target attribute; the adjusted style feature is then used to modulate the corresponding hidden-layer features of the decoder. In addition, to avoid losing spatial information (such as hair texture or pupil position), we further introduce a skip-connection-based spatial information transfer module. By manipulating global style and local spatial information together, the proposed method achieves better results in both attribute-generation accuracy and image quality. Experimental results show that the proposed algorithm outperforms existing methods.
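One style skip connection might look like the following sketch (layer sizes and the modulation form are assumptions; the paper's exact design may differ): channel statistics from an encoder layer are shifted toward the target attribute with a residual mapping, then used to modulate the matching decoder layer.

```python
import torch
import torch.nn as nn

class StyleSkipConnection(nn.Module):
    """Illustrative style skip connection with channel-wise modulation."""
    def __init__(self, channels, attr_dim):
        super().__init__()
        self.mapper = nn.Sequential(
            nn.Linear(channels + attr_dim, channels), nn.ReLU(),
            nn.Linear(channels, channels),
        )
        self.to_scale = nn.Linear(channels, channels)
        self.to_shift = nn.Linear(channels, channels)

    def forward(self, enc_feat, dec_feat, attr):
        style = enc_feat.mean(dim=(2, 3))                         # (B, C) global style
        style = style + self.mapper(torch.cat([style, attr], 1))  # residual mapping
        scale = self.to_scale(style).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(style).unsqueeze(-1).unsqueeze(-1)
        return dec_feat * (1 + scale) + shift                     # modulate decoder
```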
Temporal Distinct Representation Learning for Action Recognition
2D convolutional neural networks have been widely and successfully used in image recognition, and researchers are now applying them to video modeling. However, because a 2D network shares parameters across the frames of a video, it extracts repeated, redundant information and, in particular, tends to ignore important inter-frame changes at the spatial-semantic level. This work tackles the problem in two ways:
First, a channel-wise sequential attention mechanism, the Progressive Enhancement Module (PEM), is designed to progressively activate discriminative channels in the features and thereby avoid repeated information extraction.
Second, a Temporal Diversity Loss (TD Loss) is designed to force the convolution kernels to attend to and capture inter-frame changes rather than regions that look similar across frames.
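A simplified stand-in for such a loss (not the paper's exact TD Loss) penalizes the cosine similarity between the features of adjacent frames:

```python
import torch.nn.functional as F

def temporal_diversity_loss(feats):
    """feats: (B, T, D) per-frame feature vectors. Minimizing the mean
    cosine similarity between neighboring frames pushes their features apart."""
    a = F.normalize(feats[:, :-1], dim=-1)
    b = F.normalize(feats[:, 1:], dim=-1)
    return (a * b).sum(-1).mean()
```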
Our method achieves state-of-the-art results on the temporally sensitive Something-Something V1 and V2 benchmarks. It also brings a significant accuracy gain on Kinetics, a large-scale dataset with weaker temporal dependency.
Structure diagram:
Results illustration:
Dive Deeper Into Box for Object Detection
Anchor-free detection models currently achieve the highest detection performance thanks to accurate bounding-box estimation. However, anchor-free detectors still localize object boundaries imperfectly, and the highest-confidence bounding boxes leave considerable room for improvement. In this work, we use a boundary-reordering box-recombination strategy to generate better bounding boxes during training, yielding tighter fits to objects. In addition, we observed a semantic inconsistency between bounding-box classification and localization regression in existing methods, so we filter the classification and regression targets during training to provide semantically consistent learning objectives. Experiments show that our optimizations are highly effective in improving detection performance.
Method diagram:
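As a hedged illustration of the recombination strategy (a simplified reading, not the paper's exact algorithm): during training, new boxes can be built by mixing the left/top/right/bottom borders of candidate boxes and keeping the combination that best fits the ground truth.

```python
import itertools

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def recombine_boxes(candidates, gt_box):
    """Mix borders across candidate boxes; return the recombination
    with the highest IoU against the ground-truth box."""
    best, best_iou = None, -1.0
    for box in itertools.product([c[0] for c in candidates],
                                 [c[1] for c in candidates],
                                 [c[2] for c in candidates],
                                 [c[3] for c in candidates]):
        if box[0] >= box[2] or box[1] >= box[3]:
            continue  # skip degenerate boxes
        ov = iou(box, gt_box)
        if ov > best_iou:
            best, best_iou = box, ov
    return best, best_iou
```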