1. Introduction

The highly anticipated International Conference on Computer Vision (ICCV 2021) was held online from October 11 to 17, attracting extensive attention from computer vision researchers worldwide. This year the Alibaba Cloud multimedia AI team (formed jointly by Alibaba Cloud Video Cloud and the DAMO Academy vision team) took part in the Masked Face Recognition (MFR) global challenge and, across its five tracks, won one first place, one second place and two third places, reflecting our deep technical accumulation and industry-leading position in the field of face recognition.

2. Introduction to the competition

The Masked Face Recognition (MFR) Challenge is a global competition jointly organized by Imperial College London, Tsinghua University and InsightFace.ai to address the challenges that mask wearing during the COVID-19 pandemic poses to face recognition algorithms. The competition ran for more than four months, from June 1 to October 11, and attracted nearly 400 teams from around the world, making it by far the largest and most widely attended authoritative competition in the field of face recognition. According to official statistics, it received more than 10,000 submissions, and the teams competed fiercely.

2.1 Training data set

Only the three officially provided data sets may be used for training; additional data sets and pre-trained models are not allowed, so as to ensure a fair and impartial comparison of algorithms. The three official data sets are the small-scale MS1M, the medium-scale Glint360K and the large-scale WebFace260M. The number of identities (IDs) and images contained in each data set is shown in the following table:

2.2 Evaluation data set

The evaluation set contains on the order of trillions of positive and negative pairs, making it the largest and most comprehensive authoritative evaluation benchmark in the industry. Notably, none of the evaluation data is publicly available; only a submission interface is provided for automatic back-end evaluation, which prevents algorithms from overfitting the test set. Detailed statistics for the InsightFace track evaluation data set are shown in the following table:

The detailed statistics of the WebFace260M track evaluation dataset are shown in the following table:

2.3 Evaluation metrics

The evaluation criteria include not only accuracy metrics but also constraints on feature dimensionality and inference time, which brings them closer to real business scenarios. The detailed metrics are shown in the following table:

3. Solutions

Below, we break down our solution in terms of data, models, loss functions and other components.

3.1 Data cleaning based on self-learning

It is well known that noisy data is pervasive in face recognition training sets: images of the same person may be scattered across different IDs, or images of multiple people may be mixed under a single ID, and such noise significantly degrades the performance of the recognition model. To address this, we propose a self-learning-based data cleaning framework, as shown in the figure below:

First, we train an initial model M0 on the raw data and then use it to perform feature extraction, ID merging, inter-class cleaning and intra-class cleaning. For each ID, the DBSCAN clustering algorithm is used to compute a center feature, which is then used for similarity retrieval; retrieval is powered by Proxima, a high-dimensional vector retrieval engine developed by DAMO Academy that can quickly and accurately recall the top-K documents most similar to a query. We then train a new model M1 on the cleaned data and repeat the cycle of cleaning and retraining. Through this iterative self-learning, the data quality keeps improving and the model keeps getting stronger. The inter-class and intra-class cleaning steps are illustrated in the figure below, and a minimal code sketch of the two passes follows:
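The sketch below assumes per-image features have already been extracted and L2-normalized; the DBSCAN parameters, the merge threshold and the helper names are illustrative choices rather than the settings used in the competition, and a brute-force matrix product stands in for the Proxima retrieval engine.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def id_center(feats, eps=0.5, min_samples=2):
    """Estimate an ID's center feature: cluster its images with DBSCAN and
    average the largest cluster (fall back to the mean if everything is noise)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(feats)
    valid = labels[labels >= 0]
    if valid.size == 0:
        center = feats.mean(axis=0)
    else:
        main = np.bincount(valid).argmax()
        center = feats[labels == main].mean(axis=0)
    return center / np.linalg.norm(center)

def inter_class_clean(centers, merge_thr=0.6):
    """Merge IDs whose center features are highly similar (split identities)."""
    sim = centers @ centers.T                 # cosine similarity between ID centers
    merged = {}
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if sim[i, j] > merge_thr:
                merged[j] = merged.get(i, i)  # fold ID j into ID i's group
    return merged

def intra_class_clean(feats, center, thr=0.25):
    """Drop images whose similarity to their (updated) ID center falls below thr."""
    return feats @ center > thr               # boolean mask of images to keep
```

In line with the pipeline above, `inter_class_clean` is run first so that the ID centers can be recomputed before `intra_class_clean` filters individual images.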

It is worth noting that, unlike the CAST[1] data cleaning framework, our pipeline performs inter-class cleaning first and intra-class cleaning afterwards. In this way, the ID center features can be updated after inter-class cleaning, making the whole process more thorough and more effective. To verify the impact of data cleaning on final performance, we conducted a series of comparative experiments on the MS1M data set; the results are shown in the following table:

The threshold in the table is the similarity threshold used for intra-class cleaning. When it is set too low (e.g. 0.05), noise is left uncleaned and performance is suboptimal; when it is set too high (e.g. 0.50), noise is removed but hard samples are removed as well, which weakens the model's generalization ability and degrades performance on the evaluation set. We therefore chose an intermediate threshold of 0.25, which removes a large amount of noise while retaining hard samples and achieves the best results on all evaluation metrics. We also plot the relationship between the similarity threshold and the number of remaining images, as shown in the figure below:

3.2 Mask wearing data generation

To address the shortage of masked-face data, a feasible approach is to draw masks onto existing unmasked face images. However, most existing approaches simply paste a mask at a fixed position, which produces unrealistic masked images and lacks flexibility. We therefore draw on the ideas of PRNet[2,3] and adopt an image fusion scheme[4] to obtain masked-face images that better match real-world conditions, as shown below.

The principle of this scheme is to generate UV texture maps of the mask image and the original face image separately via 3D reconstruction, and then synthesize the masked-face image in texture space. During data generation we used eight types of masks, meaning that eight differently styled masked images can be generated for every image in the existing data set. The UV-mapping-based scheme overcomes the poor seams and deformation between the original image and the mask that arise in traditional planar projection. Moreover, thanks to the rendering process, the mask can be rendered with different variations, such as adjusting its angle and the lighting. Examples of generated masked-face images are shown below:
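As a concrete illustration of the texture-space blending step described above, here is a minimal sketch that assumes the UV texture maps of the face and of the mask (plus a binary alpha map for the mask region) have already been produced by a PRNet-style 3D reconstruction; the rendering of the blended texture back to the image plane is omitted, and all array and function names are illustrative.

```python
import numpy as np

def blend_uv_textures(face_uv, mask_uv, mask_alpha, light_gain=1.0):
    """Blend a mask texture onto a face texture in UV space.

    face_uv:    (H, W, 3) float32 UV texture map of the face, values in [0, 1]
    mask_uv:    (H, W, 3) float32 UV texture map of the mask, on the same UV layout
    mask_alpha: (H, W, 1) float32 map, 1 where the mask covers the face, 0 elsewhere
    light_gain: simple scalar used to imitate different lighting on the mask region
    """
    mask_region = np.clip(mask_uv * light_gain, 0.0, 1.0)
    blended = mask_alpha * mask_region + (1.0 - mask_alpha) * face_uv
    return blended.astype(np.float32)

# Usage sketch: one unmasked face combined with each of the 8 mask styles
# face_uv = reconstruct_uv(image)              # hypothetical PRNet-style step
# for mask_uv, mask_alpha in mask_templates:   # 8 mask templates in UV space
#     masked_texture = blend_uv_textures(face_uv, mask_uv, mask_alpha)
#     masked_image = render_from_uv(masked_texture, mesh)  # hypothetical renderer
```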

While training models on the generated masked data, we found that the proportion of masked images influences model performance to varying degrees. We therefore set the proportion of masked data to 5%, 10%, 15%, 20% and 25% respectively; the experimental results are shown in the table below:

As the table shows, when the proportion of masked data is 5%, the model achieves the highest performance on the MR-ALL evaluation set; when the proportion is raised to 25%, performance on the masked evaluation set improves significantly, but performance on MR-ALL drops noticeably. This indicates that when masked and normal data are mixed for training, the mixing ratio is an important parameter affecting model performance. In the end we chose a masked-data proportion of 15%, which strikes a good balance between masked and normal performance.
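As a sketch of how the masked-data proportion can be controlled during training, the wrapper below applies a mask-synthesis transform to each sample with probability p (0.15 in our final setting). The class and the `add_mask` callable are illustrative rather than our actual training code; `add_mask` stands for the UV-based generation described above.

```python
import random
from torch.utils.data import Dataset

class MaskMixDataset(Dataset):
    """Wraps a face dataset and replaces a sample with its masked version
    with probability `mask_ratio`, so masked images make up ~mask_ratio of training data."""

    def __init__(self, base_dataset, add_mask, mask_ratio=0.15):
        self.base = base_dataset      # yields (image, label) pairs
        self.add_mask = add_mask      # callable: image -> masked image (e.g. UV-based synthesis)
        self.mask_ratio = mask_ratio

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        image, label = self.base[idx]
        if random.random() < self.mask_ratio:
            image = self.add_mask(image)
        return image, label
```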

3.3 NAS-based backbone network

Different backbone networks differ greatly in feature extraction ability. In face recognition, the baseline backbone commonly used in industry is the IR-100 proposed in ArcFace[5]. In this competition, we adopted the zero-shot NAS paradigm (Zen-NAS[6]) proposed by DAMO Academy to search the model space for backbones with stronger representational power. Unlike traditional NAS methods, Zen-NAS uses the Zen-Score in place of the evaluated accuracy of a searched model; notably, the Zen-Score correlates with the model's final performance, which makes the whole search process very efficient. The core algorithm of Zen-NAS is shown in the figure below:

Starting from the IR-SE baseline backbone, we used Zen-NAS to search three structure-related variables: the number of channels in the input layer, the number of channels in each block, and the number of times different blocks are stacked, under the constraint that the searched backbone satisfies the inference time limit of each track. An interesting finding is that the backbone found by Zen-NAS performs almost the same as IR-SE-100 on the MS1M small-data track, but clearly outperforms the baseline on the WebFace260M large-data track. The reason may be that with more data the usable search space grows, and with it the probability of finding a more powerful model.
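The sketch below illustrates the zero-shot search loop in simplified form: candidate backbones are scored without training by a Zen-Score-like expressivity proxy (here only the feature perturbation term, omitting the BN-variance correction of the original paper), and candidates that violate the latency budget are rejected. The sampling, mutation and latency measurement details are placeholders rather than Zen-NAS's actual implementation.

```python
import time
import torch

@torch.no_grad()
def zen_like_score(backbone, input_size=(3, 112, 112), eps=1e-2, repeats=4, device="cpu"):
    """Simplified expressivity proxy: expected log-norm of the feature response
    to a small input perturbation (the core idea behind the Zen-Score)."""
    backbone.eval().to(device)
    scores = []
    for _ in range(repeats):
        x = torch.randn(8, *input_size, device=device)
        delta = torch.randn_like(x)
        f1 = backbone(x)                       # pre-classifier features
        f2 = backbone(x + eps * delta)
        scores.append(torch.log((f2 - f1).norm() / eps + 1e-12).item())
    return sum(scores) / len(scores)

def measure_latency(backbone, input_size=(3, 112, 112), runs=10, device="cpu"):
    backbone.eval().to(device)
    x = torch.randn(1, *input_size, device=device)
    with torch.no_grad():
        start = time.time()
        for _ in range(runs):
            backbone(x)
    return (time.time() - start) / runs

def search(sample_candidate, latency_budget, n_iters=100):
    """Random-search variant of the zero-shot loop; Zen-NAS itself evolves
    channel widths and block depths by mutation."""
    best, best_score = None, float("-inf")
    for _ in range(n_iters):
        cand = sample_candidate()              # hypothetical generator of backbone variants
        if measure_latency(cand) > latency_budget:
            continue                           # reject candidates over the inference-time limit
        score = zen_like_score(cand)
        if score > best_score:
            best, best_score = cand, score
    return best
```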

3.4 Loss function

One of the baseline loss functions we used in this competition is the Curricular Loss[7], which mimics curriculum learning during training by ordering samples from easy to hard. However, training data sets are often extremely imbalanced: popular identities may contain thousands of images while unpopular ones often contain only one. To alleviate the long-tail problem caused by this imbalance, we incorporated the idea of the Balanced Softmax Loss[8] into the Curricular Loss and propose a new loss function, the Balanced Curricular Loss, whose formula is shown below:
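The exact formulation appears in our workshop paper[10]; the version below is a plausible reconstruction, assuming the Balanced Softmax class prior n_y (the number of training images of identity y) is folded into the CurricularFace softmax:

```latex
% Balanced Curricular Loss -- hedged reconstruction, not the verbatim published formula
\mathcal{L}_{BCL} = -\log
\frac{n_{y_i}\, e^{s\cos(\theta_{y_i}+m)}}
     {n_{y_i}\, e^{s\cos(\theta_{y_i}+m)} + \sum_{j\neq y_i} n_j\, e^{s\,N(t,\cos\theta_j)}},
\qquad
N(t,\cos\theta_j)=
\begin{cases}
\cos\theta_j, & \cos(\theta_{y_i}+m)\ge \cos\theta_j\\[2pt]
\cos\theta_j\,(t+\cos\theta_j), & \text{otherwise}
\end{cases}
```

Here s is the feature scale, m the angular margin, and N(t, cos θ_j) the CurricularFace modulation of negative logits with the adaptively updated parameter t.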

On the MS1M track we compared the Balanced Curricular Loss (BCL) with the original Curricular Loss (CL); the results are shown in the following table:

It can be seen that, compared with the Curricular Loss, the Balanced Curricular Loss brings significant improvements on both the Mask and MR-ALL metrics, which verifies its effectiveness.

3.5 Knowledge distillation

Because the competition constrains the model's inference time, a submission is invalidated if the model exceeds the time limit. We therefore use knowledge distillation to transfer the representational power of a large model to a small one, and run inference with the small model to satisfy the time constraint. The knowledge distillation framework adopted in this competition is shown in the figure below:

Here, the distillation loss is the simple L2 loss, used to transfer the feature information of the teacher model, while the student model is trained with the Balanced Curricular Loss; the final loss is a weighted sum of the distillation loss and the training loss. After distillation, the student model even surpassed the teacher on some evaluation metrics while its inference time was greatly reduced, which substantially improved our results on the MS1M small-data track.
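A minimal sketch of the distillation objective described above, assuming the teacher and student embeddings share the same dimensionality (otherwise a linear projection would be needed). The weight `alpha`, the `margin_head` module and the use of plain cross-entropy as a stand-in for the Balanced Curricular Loss of Section 3.4 are all illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, margin_head, images, labels, alpha=1.0):
    """One training step: L2 feature-matching loss to the frozen teacher
    plus the student's own classification loss, combined as a weighted sum."""
    with torch.no_grad():
        t_feat = F.normalize(teacher(images))      # teacher embeddings (frozen)

    s_feat = F.normalize(student(images))          # student embeddings

    distill_loss = F.mse_loss(s_feat, t_feat)      # simple L2 distillation loss
    logits = margin_head(s_feat, labels)           # margin-based logits for the student
    cls_loss = F.cross_entropy(logits, labels)     # placeholder for the Balanced Curricular Loss

    return cls_loss + alpha * distill_loss
```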

3.6 Model and data parallelism

The WebFace260M track provides more than 2 million training IDs and more than 40 million images, so under the traditional multi-machine, multi-GPU data-parallel training mode a single GPU can no longer hold the complete model (the classification FC layer in particular). Partial FC[9] distributes the FC layer evenly across GPUs; each GPU computes the results of the sub-FC layer stored in its own memory, and the approximate full FC result is then obtained through synchronous communication among all GPUs. The Partial FC scheme is illustrated below:

With Partial FC, model parallelism and data parallelism can be used simultaneously, so that large models that previously could not be trained can now be trained normally. In addition, negative-class sampling can be used to further increase the training batch size and shorten the training time.
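Below is a simplified single-process sketch of the Partial FC idea: each rank stores only its own shard of the class-center matrix and, before computing logits, keeps the positive centers that fall in its shard plus a random subset of negatives. The real Partial FC additionally all-gathers embeddings and labels across GPUs and synchronizes the softmax denominator with collective communication, which is omitted here; all names and the sample rate are illustrative.

```python
import torch
import torch.nn.functional as F

class PartialFCShard(torch.nn.Module):
    """One rank's shard of the fully connected (class-center) layer."""

    def __init__(self, embedding_size, num_classes, rank, world_size, sample_rate=0.3):
        super().__init__()
        assert num_classes % world_size == 0, "illustrative: assume an even split"
        self.num_local = num_classes // world_size
        self.class_start = rank * self.num_local
        self.sample_rate = sample_rate
        # only 1/world_size of all class centers lives on this GPU
        self.weight = torch.nn.Parameter(0.01 * torch.randn(self.num_local, embedding_size))

    def forward(self, embeddings, labels):
        # positive classes owned by this shard must always be kept
        local = labels - self.class_start
        positives = torch.unique(local[(local >= 0) & (local < self.num_local)])

        # add randomly sampled negative centers up to the sampling budget
        budget = max(int(self.sample_rate * self.num_local), positives.numel())
        perm = torch.randperm(self.num_local, device=embeddings.device)
        sampled = torch.unique(torch.cat([positives, perm[:budget]]))

        # cosine logits against the sampled subset of this shard's centers
        logits = F.linear(F.normalize(embeddings), F.normalize(self.weight[sampled]))
        return logits, sampled
```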

3.7 Other tricks

Throughout the competition we also tried strategies such as data augmentation, label reconstruction and learning-rate schedule changes; the ones that proved effective are shown in the figure below:

4. Competition results

In this competition, our Mind_FT team competed in the five tracks of the InsightFace and WebFace260M challenges and won one first place (WebFace260M SFR), one second place (InsightFace unconstrained) and two third places (WebFace260M Main and InsightFace MS1M). A screenshot of the final official WebFace260M leaderboard is shown below:

At the workshop held after the competition, we were invited to share our solution with researchers worldwide. In addition, the paper we submitted for this competition was accepted by the ICCV 2021 Workshop[10]. Finally, here are the certificates we received in this competition:

5. EssentialMC2 introduction and open source

EssentialMC2, multimedia cognitive computing based on entity spatio-temporal relation reasoning, is the core algorithm architecture distilled from the long-term research of DAMO Academy's digital-intelligence media group on video understanding. Its core content consists of three basic modules: representation learning (MHRL), relation reasoning (MECR2) and open-set learning (MOSL3), which optimize the video understanding framework from the perspectives of basic representation, relation reasoning and learning methods respectively. On top of these three modules we have built a code framework suitable for large-scale video understanding research, development and training, and open-sourced it together with the group's recently published papers and competition results.

essmc2 is the deep learning training framework package that accompanies EssentialMC2 and is suitable for developing and training large-scale video understanding algorithms. The main goal of open-sourcing it is to provide a large number of verifiable algorithms and pre-trained models so that users can iterate quickly at low cost; we also hope to build an influential open-source ecosystem in the video understanding field and attract more contributors to the project.

The central design idea of essmc2 is "configuration is object": through simple configuration files and a Registry design pattern, objects such as model definitions, optimizers, data sets and pre-processing pipelines can be constructed and used quickly, which fits the constant-adjustment, constant-experiment workflow of everyday deep learning. At the same time, a unified interface makes switching between single-machine and distributed training seamless: users define their job once and can run it on a single GPU, on multiple GPUs on one machine, or in a distributed environment, achieving both ease of use and high portability. The first usable version of essmc2 has been released and we welcome you to try it; more algorithms and pre-trained models will be added in the future. Link: github.com/alibaba/Ess… .
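The snippet below is a generic illustration of the "configuration is object" idea, with a hand-rolled registry and a `build` helper; it is not essmc2's actual API, whose class and function names may differ.

```python
# Generic Registry + config pattern (illustrative, not essmc2's real interface)
MODELS = {}

def register(name):
    """Decorator that records a class in the registry under `name`."""
    def wrap(cls):
        MODELS[name] = cls
        return cls
    return wrap

def build(cfg):
    """Instantiate an object from a config dict: {'type': name, **kwargs}."""
    cfg = dict(cfg)
    cls = MODELS[cfg.pop("type")]
    return cls(**cfg)

@register("ir_se_100")
class IRSE100:
    def __init__(self, embedding_size=512, dropout=0.0):
        self.embedding_size = embedding_size
        self.dropout = dropout

# A config file then only needs to describe objects declaratively:
config = {"type": "ir_se_100", "embedding_size": 512}
model = build(config)   # the same declarative config can drive single-GPU or distributed runs
```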

6. Product landing

With the rise of Internet video, VR, the metaverse and other applications, the amount of unstructured video content is growing rapidly, and quickly identifying and accurately understanding this content has become a key link in mining its value. People are an important element of video: high-accuracy person recognition technology can quickly extract key information about the people in a video and enable intelligent applications such as person-based clip editing and person search. Furthermore, analyzing multi-dimensional content such as visuals, speech and text, and recognizing richer entity labels such as people, events, objects, scenes and logos, produces structured video information and helps extract key information more comprehensively. These structured entity labels in turn serve as the basis for semantic reasoning: by fusing multi-modal information, the core content of a video can be understood at a higher semantic level, enabling category and topic understanding. The high-accuracy person recognition and video analysis technology of the Alibaba Cloud multimedia AI team has been integrated into the EssentialMC2 core algorithm architecture and delivered as a product that supports multi-dimensional analysis of videos and images and outputs structured labels (click to try it: Retina Video Cloud Multimedia AI Experience Center – smart label product, retina.aliyun.com/#/Label).

Multimedia AI products

By comprehensively analyzing the visual, text, speech and behavioral information in a video and applying multi-modal information fusion and alignment technology, the smart label product recognizes content with high accuracy; combined with video category analysis, it outputs multi-dimensional scene labels that match the video content.

Category labels: perform high-level semantic analysis of video content to understand categories and topics. Video classification labels are divided into level-1, level-2 and level-3 categories, supporting media asset management and personalized recommendation applications.

Entity labels: used for recognizing video content, covering video category and topic, film/TV/variety IP, people, actions and events, objects, scenes, logos and on-screen text, with supporting knowledge graph information for people and IP. Among these, film/TV/variety IP search is based on video fingerprinting: the target video is compared and retrieved against the film and TV resources in the library. It supports IP recognition for more than 60,000 movies, TV dramas, variety shows, cartoons and music videos and can identify which IP content a target video contains, supporting accurate personalized recommendation, copyright retrieval and other applications. Based on data sources such as Youku, Douban and Baike, a knowledge graph covering film and TV, music, people, landmarks and objects has been built; for entity labels hit during recognition, the corresponding knowledge graph information can be output and used for media asset association and related-content recommendation.

Keyword labels: support speech recognition and OCR on video, combined with NLP analysis of the recognized speech and text, to output keywords related to the video content for fine-grained content matching and recommendation.

A comprehensive label system with flexible customization

The smart label product is trained on PGC and UGC video content from platforms such as Youku, Tudou and UC overseas, providing a comprehensive, high-quality video label system. In addition to the general label taxonomy, it supports open multi-level customization, including face self-registration, custom entity labels and other extensions; for customers with their own label systems, one-to-one label customization services are provided through label mapping and custom training, helping customers solve their video processing efficiency problems in a more targeted way.

High-quality human-machine collaboration service

For business scenarios that demand high accuracy, the smart label product supports introducing human review to form an efficient, professional human-machine collaboration service: the AI recognition algorithms and human annotators complement each other to provide accurate video labels for personalized business scenarios. The human-machine collaboration system includes mature platform tools and a professional annotation team; standardized delivery processes such as staff training, trial runs, quality inspection and acceptance guarantee annotation quality and enable high-quality, low-cost data annotation services. In this AI-plus-human mode, manual annotation supplements and corrects the AI results, ensuring accurate, high-quality output and improving business efficiency and user experience.

Video label recognition in the sports and film & television industries

Video label recognition in the media and e-commerce industries

All of the above capabilities have been integrated into the Alibaba Cloud Video Cloud smart label product, which provides high-quality video analysis and human-machine collaboration services. You are welcome to learn more and try it out (smart label product: retina.aliyun.com/#/Label) to build more efficient and intelligent video applications.

[1] Zheng Zhu et al. WebFace260M: A benchmark unveiling the power of million-scale deep face recognition. CVPR 2021.
[2] Yao Feng et al. Joint 3D face reconstruction and dense alignment with position map regression network. ECCV 2018.
[3] Jun Wang et al. FaceX-Zoo: A PyTorch toolbox for face recognition. arXiv:2101.04407, 2021.
[4] Jiankang Deng et al. Masked Face Recognition Challenge: The InsightFace Track Report. arXiv, 2021.
[5] Jiankang Deng et al. ArcFace: Additive angular margin loss for deep face recognition. CVPR 2019.
[6] Ming Lin et al. Zen-NAS: A Zero-Shot NAS for High-Performance Image Recognition. ICCV 2021.
[7] Yuge Huang et al. CurricularFace: Adaptive curriculum learning loss for deep face recognition. CVPR 2020.
[8] Jiawei Ren et al. Balanced meta-softmax for long-tailed visual recognition. NeurIPS 2020.
[9] Xiang An et al. Partial FC: Training 10 million identities on a single machine. ICCV 2021.
[10] Tao Feng et al. Towards Mask-robust Face Recognition. ICCV 2021.
