In this CSDN Technology Open Course Plus session, Dr. Xiangju Lu, a scientist at iQiyi, shared the company's work on multimodal character recognition. These course notes cover iQiyi's main research in multimodal technology and how these technologies are applied in iQiyi videos.
【Introduction of the Lecturer】
Dr. Xiangju Lu
Dr. Xiangju Lu, an iQiyi scientist and leader of the PersonAI team, focuses on character recognition and related AI technologies, and is responsible for iQiyi's multimodal character recognition, intelligent creation, and other businesses. Dr. Lu organized the "iQIYI Multi-modal Video Character Recognition Competition", released iQIYI-VID, the world's first film and television character database, built databases of one million characters and 40,000 cartoon characters, and applied these technologies to products such as the "Scan" feature of the iQiyi app and AI Radar.
I. Introduction to the Multimodal Technology Foundation
First, consider a question: is character recognition simply equivalent to face recognition? In our current work it is not. Why? Because in video, especially in variety shows or action movies, the face alone cannot cover every situation, and identifying a person often requires other attributes. Take the right-hand picture in the figure below: you can tell it is Guo Degang, but a face recognition system has no way to identify him, because his face is not visible, only the back of his head. So our technology also covers body-based recognition, which corresponds to the person re-ID task in surveillance. In addition, clothing, hairstyles, voiceprints, and biometrics such as fingerprints and irises all need to be recognized in video. Person recognition in video scenes has therefore become a comprehensive, multi-cue recognition task.
Second, how do we identify virtual characters? We call them virtual characters because they are not real people; they include cartoon characters, two-dimensional (anime-style) characters, and animation and game characters. This category keeps growing and has become a very important demand in the entertainment industry. Our research is driven by these practical requirements, and based on these applications we will share our main algorithms for character recognition and virtual character recognition.
II. Multimodal Technology Interpretation (I): Character Recognition (IQFace)
This part introduces the basic multimodal technology behind character recognition. Driven by the demands of iQiyi's video content, we need more than face recognition: when face information is insufficient or unclear, other information must assist in locating the character. Among that other information, the first thing that comes to mind is voice. In silent scenes, we use action and posture information (such as a person's back view) and clothing information, combined with the scene (fighting, walking, surveillance), to judge a character's identity. These are the main categories of information our business has to handle.
Through face detection, facial landmark localization, face recognition, and analysis of attributes such as age, gender, and expression, we build a better understanding of each character. We also extract special attributes driven by actual business needs, such as the distinctive temperament of certain artists in video; this part is "tailored" to the business scenario. Beyond the face, we use body information such as human pose estimation (body shape, clothing), behavioral data (gestures and movements), and person re-ID features, as well as voiceprint features extracted from the character's voice. All of this helps us analyze and judge a character's identity. In practical engineering, face, body, and voice together constitute our multimodal information for recognition.
With the basic multimodal data in place, the next step is the multimodal algorithm itself. The figure shows our overall algorithm framework and engineering logic.
At present, our face-related algorithms use a face database with 5.5 million IDs, of which around 300,000 celebrities can be identified directly by name. To support training on character data at this scale, we developed a customized distributed framework. Existing open-source frameworks are better suited to simple tasks and struggle to meet customization requirements, so our in-house framework achieves large improvements in both overall training accuracy and training speed.
We optimize both model parallelism and data parallelism, including GPU utilization and inter-process communication. For recognition accuracy, we evaluated on our own datasets: the first is a middle-school student database, whose data distribution focuses on the practical scenario of ID-photo matching; the second is the iQiyi employee database, built from our internal staff, which contains large variations in faces, poses, and expressions; the third is the dataset released for the iQiyi multimodal character recognition competition, which targets identification of celebrities in video data.
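The in-house distributed framework itself is not public; as a rough illustration of the data-parallel side of such a setup, the sketch below uses PyTorch's DistributedDataParallel with placeholder hyperparameters. It is not iQiyi's actual framework, which also covers model parallelism and customized communication.

```python
# Minimal data-parallel training sketch (PyTorch DDP), illustrative only.
# Assumes the usual MASTER_ADDR/MASTER_PORT environment is already set up.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train(rank, world_size, model, dataset, epochs=1):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(model.cuda(rank), device_ids=[rank])
    sampler = DistributedSampler(dataset)            # each rank sees its own shard
    loader = DataLoader(dataset, batch_size=128, sampler=sampler)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        sampler.set_epoch(epoch)                      # reshuffle shards every epoch
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images.cuda(rank)), labels.cuda(rank))
            loss.backward()                           # gradients are all-reduced across ranks
            optimizer.step()
    dist.destroy_process_group()
```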
Real business scenarios bring many requirements related to face attributes. We now support 27 face attributes, including common ones (expression, gender, age) and distinctive ones such as temperament and micro-expressions. A micro-expression here refers to the activated state of a basic facial action unit, also called an AU. Besides the eleven common AUs, we handle categories that the business strongly needs, such as sticking out the tongue, rolling the eyes, pouting, and raising the eyebrows. In this context we proposed an innovative approach: using a micro-expression and meme (expression-pack) database to automatically generate meme data from video. Concretely, we extract the facial micro-expression features of each meme in the library, match them, together with the meme's caption, against micro-expression material of characters in long videos, and then transfer the caption, so that memes are generated automatically. This method is used not only for facial micro-expression generation but also for cartoon character micro-expression generation.
With so much face data, handling noise is a very hard task. The figure shows our noise-processing pipeline, which combines algorithms with manual assistance and reduces the noise ratio of the face dataset to a very low level, greatly improving model accuracy. Model speed was optimized through quantization, pruning, distillation, and other steps, and a customized, optimized CPU version saved a lot of resources.
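Quantization, pruning, and distillation are standard compression techniques; as one hedged example of the distillation step, a student model can be trained against a larger teacher with a loss like the sketch below (the temperature `T` and weight `alpha` are illustrative values, not numbers from the talk).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend a soft-target KL term (teacher guidance) with the hard-label CE term.

    T and alpha are illustrative hyperparameters, not iQiyi's production settings.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale to compensate for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```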
Beyond data with known IDs, we should also make full use of iQiyi's video resources to obtain unlabeled data to assist face model training. The following describes how we use this unlabeled data for training. A paper on this work, on optimizing face recognition models with unlabeled data, was published this year at the ICCV 2019 Workshop.
Obtaining known IDs for all data is difficult and requires a lot of manual annotation, but unlabeled data is easy to collect, so we use massive amounts of unlabeled data to assist face recognition model training. The main idea is to use the unlabeled data to fill in the unknown parts of the data distribution, making the labeled data distribution tighter: the margin between labeled classes grows larger while each class becomes more compact, which finally yields better classification. The specific method is shown in the figure below: the unlabeled data contributes an additional loss, which is superimposed on the original training loss to assist the final model training.
- Specific model and algorithm interpretation: Unknown Identity Rejection (UIR) Loss
To make use of unlabeled data, we designed a semi-supervised loss function, Unknown Identity Rejection (UIR) Loss. Face recognition is an open-set problem: people in an open environment fall into two groups, labeled classes and unlabeled classes. During training, for labeled classes, each sample's feature needs to approach the centroid vector of its class in the classification layer. For unlabeled samples, which belong to none of the classes in the classification layer, the model needs to "reject" them, that is, keep their features sufficiently far from every class center. In Figure (a), the two vectors are the centroid vectors of two labeled classes and the dots are sample features. In Figure (b), after an unlabeled sample is added and pushed far enough from both centers, the labeled classes become sparser in feature space and the inter-class distance increases.
For a CNN classification model, applying softmax to the output of the fully connected classification layer gives the probability of belonging to each labeled class. An unlabeled sample belongs to none of these classes, so ideally all of its class probabilities should be small enough to be filtered out by a threshold, increasing the rejection rate. With this in mind, the problem can be translated to:
This is a multi-objective minimization problem, which can be transformed into:
Therefore, UIR Loss is obtained, namely:
The total loss of the model is the loss on labeled classes plus the UIR loss on unlabeled data:
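The formulas referenced in the last few sentences appeared as slides and are not reproduced in these notes. The block below is a hedged LaTeX reconstruction that only follows the prose description (all class probabilities $p_1,\dots,p_n$ of an unlabeled sample should be small, and the UIR term is added to the supervised loss); the authoritative formulation is in the UIR paper listed at the end.

```latex
% Hedged reconstruction of the slide formulas; see the UIR paper for the exact form.

% Goal for an unlabeled sample: every labeled-class probability should be small
\min\;(p_1,\, p_2,\, \dots,\, p_n)

% One scalarization of this multi-objective problem
\max\; \prod_{i=1}^{n} \bigl(1 - p_i\bigr)
\;\;\Longleftrightarrow\;\;
\min\; -\sum_{i=1}^{n} \log\bigl(1 - p_i\bigr)

% UIR loss on an unlabeled sample (averaged over the n labeled classes)
\mathcal{L}_{\mathrm{UIR}} \;=\; -\frac{1}{n}\sum_{i=1}^{n} \log\bigl(1 - p_i\bigr)

% Total training loss: supervised loss on labeled data plus a weighted UIR term
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{labeled}} \;+\; \lambda\,\mathcal{L}_{\mathrm{UIR}}
```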
The block diagram of the model is shown below. Unlabeled data and labeled data are both fed in as inputs; features are obtained through the backbone network, output probabilities come from the fully connected layer, and the two loss terms are computed from these probabilities respectively.
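As a rough PyTorch-style sketch of the training step this diagram describes, under the reconstruction above: `uir_training_step`, the weight `lam`, and the epsilon are placeholders for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def uir_training_step(backbone, classifier, labeled_x, labels, unlabeled_x,
                      lam=1.0, eps=1e-8):
    """One combined step: supervised CE on labeled faces + UIR term on unlabeled faces.

    A sketch based on the description above, not the authors' code; `lam` is an
    illustrative weight for the UIR term.
    """
    # Labeled branch: ordinary classification loss over known identities.
    logits_l = classifier(backbone(labeled_x))
    loss_labeled = F.cross_entropy(logits_l, labels)

    # Unlabeled branch: probabilities over the known identities should all stay small.
    probs_u = F.softmax(classifier(backbone(unlabeled_x)), dim=1)
    loss_uir = -torch.log(1.0 - probs_u + eps).mean()

    return loss_labeled + lam * loss_uir
```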
Experimental results
We used the MS1MV2 dataset, a cleaned version of MS-Celeb-1M, as labeled data, with about 5 million images covering 90,000 identities. The unlabeled data was crawled from the Internet and cleaned to keep its overlap with the labeled data low, yielding about 4.9 million unlabeled images.
The method was validated on the iQIYI-VID, Trillion-Pairs, and IJB-C test sets with four kinds of backbone networks. The experimental results show that adding UIR Loss with unlabeled data improves model performance. Due to space limits, only part of the ResNet100 results on IJB-C are included here; please refer to the paper for the other results.
II. Multimodal Technology Interpretation (II): Virtual Character Recognition (iCartoonFace)
With a preliminary understanding of multimodal technology for real-person recognition, we now introduce the technology and experience of virtual character recognition. What does it involve? Generally speaking, virtual character recognition covers cartoon, animation, and game characters, and all other created virtual figures.
The first challenge of virtual character recognition is the data source: the amount of image and identity information is not enough for real business, and the quality of existing annotations is also low, so we spent a lot of early-stage effort on data cleaning and labeling. So far we have accumulated more than 40,000 characters and nearly 500,000 training images, with a labeling accuracy of 98%. The annotations include the detection bounding box, pose, gender, color, and so on.
Model training follows data preparation. One kind of data needs special attention during training, as shown in the figure below: different characters with very small differences, and the same character with very large differences, are both hard for the model and very common in real videos. In practical engineering we handle them with special treatment of the model itself or of the evaluation protocol.
We draw on loss functions such as Softmax, SphereFace, CosFace, and ArcFace to make intra-class distributions tighter and inter-class differences larger, improving discrimination accuracy in practical applications.
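As a hedged sketch of the additive-angular-margin idea behind losses like ArcFace (the scale `s` and margin `m` below are illustrative values, and this is a generic implementation, not iQiyi's):

```python
import torch
import torch.nn.functional as F

def arcface_logits(embeddings, weight, labels, s=64.0, m=0.5):
    """ArcFace-style logits: add an angular margin m to the target class only.

    `embeddings` (B, d) and `weight` (C, d) are L2-normalized inside; s and m are
    illustrative hyperparameters. The margin tightens intra-class distributions
    and widens inter-class gaps, as described above.
    """
    cos = F.linear(F.normalize(embeddings), F.normalize(weight))   # (B, C) cosine similarities
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, num_classes=weight.size(0)).bool()
    cos_margin = torch.where(target, torch.cos(theta + m), cos)    # margin on the true class
    return s * cos_margin                                          # feed into cross_entropy

# usage sketch: loss = F.cross_entropy(arcface_logits(emb, W, y), y)
```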
In addition, we fuse real data with cartoon data to make up for the shortage of virtual character data. In the figure below, A shows the distribution before fusion and B after fusion with human faces: the cartoon character distributions become tighter and the inter-class distances larger. Experimental data also proves the effectiveness of this method.
The relevant paper has not been formally published yet; please stay tuned for our updates.
III. Multimodal Database and Multimodal Algorithms
After two years of accumulation, iQiyi's multimodal database, built for video tasks in real scenes, has become the industry's first multimodal dataset of its kind, with clean labels and the largest scale, and it aims to provide more help for everyone's research.
Based on the multimodal database, we designed a multimodal recognition algorithm architecture that uses four kinds of features: face, head, body, and voiceprint, and we propose a multi-modal attention model to fuse these four features.
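As a hedged illustration of how an attention module might weight the four per-modality features before fusion; the feature dimensions, the softmax gating, and the class name below are assumptions for the sketch, not the published model.

```python
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    """Fuse face / head / body / voiceprint features with learned attention weights.

    A sketch of the general idea only; dimensions and the gating network are
    placeholders, not iQiyi's actual architecture.
    """
    def __init__(self, dims=(512, 256, 256, 128), fused_dim=512):
        super().__init__()
        # Project each modality into a shared space, then score its importance.
        self.proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])
        self.score = nn.Linear(fused_dim, 1)

    def forward(self, feats):                       # feats: list of 4 tensors, each (B, d_i)
        projected = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)  # (B, 4, fused_dim)
        weights = torch.softmax(self.score(projected), dim=1)                      # (B, 4, 1)
        return (weights * projected).sum(dim=1)                                    # (B, fused_dim)
```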
Multimodal Character Recognition Dataset IQIYI-VID
Challenge.ai.iqiyi.com/detail?race… .
This article mainly describes the collection and annotation process of the dataset and does not yet cover the specific multimodal algorithm. For more on iQiyi's multimodal algorithms, please stay tuned; we will give a detailed interpretation after publication.
Based on this dataset, many research teams made improvements in data augmentation, cross-validation, and training with noise-free samples. As shown in the figure below, the main body of the model is a network of fully connected layers that can receive information from both deep and shallow layers. Skip connections are added between the two dense layers to fuse information from different layers, following the residual block idea, and dropout and batch norm are added to prevent overfitting.
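A hedged sketch of such a fully connected block with a skip connection, batch norm, and dropout in the residual style just described; the widths and dropout rate are placeholders, not the competition teams' exact settings.

```python
import torch
import torch.nn as nn

class DenseResidualBlock(nn.Module):
    """Two dense layers with a skip connection, BatchNorm, and Dropout.

    Mirrors the description above in spirit; hyperparameters are illustrative.
    """
    def __init__(self, dim=512, p_drop=0.5):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.bn1 = nn.BatchNorm1d(dim)
        self.fc2 = nn.Linear(dim, dim)
        self.bn2 = nn.BatchNorm1d(dim)
        self.drop = nn.Dropout(p_drop)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        h = self.act(self.bn1(self.fc1(x)))
        h = self.drop(self.bn2(self.fc2(h)))
        return self.act(h + x)      # skip connection fuses shallow and deep information
```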
IV. Application and Practice Cases of Multimodal Technology in Video Scenes: Only Look at TA and AI Radar
When using the iQiyi app, you may have tried the "Only Look at TA" feature, and later the AI Radar feature on the TV side. Behind these everyday applications stand the multimodal database and the multimodal algorithms. Regarding the multimodal algorithms everyone cares about, I would like to share the following points:
1. Everyone focuses on how to weight and combine the multiple modalities, but the multimodal algorithm is a very complicated problem and the data is very noisy. A machine learning model may not be able to recognize all features, and not many features actually contribute positively, so we cannot rely only on adjusting weights. We should start from the model's learning process and use algorithms to work out which features play a key role in which situations.
2. Micro-expression feature matching is based on face similarity plus the similarity of each AU. Caption matching works by downloading many memes with captions from the Internet and matching them against expressions extracted from the video; if the match is good, the caption is transferred.
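A hedged sketch of the matching logic in point 2: combine face-feature cosine similarity with AU (micro-expression) similarity and keep the best-scoring meme. The weights, threshold, and the `meme_bank` structure are assumptions for illustration, not production values.

```python
import numpy as np

def match_meme(video_face, video_aus, meme_bank, w_face=0.6, w_au=0.4, thresh=0.8):
    """Pick the best meme (expression-pack image + caption) for a video frame.

    `meme_bank` is assumed to be a list of dicts with keys 'face', 'aus', 'caption';
    the weights and threshold are illustrative.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    best, best_score = None, -1.0
    for meme in meme_bank:
        score = w_face * cosine(video_face, meme["face"]) + w_au * cosine(video_aus, meme["aus"])
        if score > best_score:
            best, best_score = meme, score
    # Transfer the caption only if the combined similarity is high enough.
    return best["caption"] if best is not None and best_score >= thresh else None
```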
Display of the "Only Look at TA" feature in the iQiyi app
For those interested in many of the research papers and databases mentioned above, please refer to:
Papers and links:
- Unknown Identity Rejection Loss: Utilizing Unlabeled Data for Face Recognition
Arxiv.org/pdf/1910.10…
- iCartoonFace: A Benchmark of Cartoon Person Recognition
Arxiv.org/pdf/1907.13…
- iQIYI-VID: A Large Dataset for Multi-Modal Person Identification
Arxiv.org/abs/1811.07…
This article is from AI Technology Base camp