Introduction: In the third technical session of the “Open your eyes and Sound in the World” series, Hu Yifeng, senior computer vision algorithm engineer at NetEase Yidun, gave a talk titled “Exploration and Practice of Video Deep Forgery Detection Technology in the Field of Content Security”.

Lecturer profile: Hu Yifeng, senior computer vision algorithm engineer at NetEase Yidun, is mainly responsible for developing, deploying and optimizing image and video AI algorithms in the field of content security. He has extensive R&D and project experience in many areas, including identification of prohibited, politically sensitive, and violent and terrorist content, logo recognition, image retrieval, and video deep forgery detection.

The “double-edged sword” effect of AI technology application

AI has been a hot topic in recent years and has penetrated every aspect of life: AI + security, AI + transportation, AI + healthcare, AI + retail, and so on. Among the many mature AI applications, face technology is one of the most widely deployed, and is commonly seen in intelligent security, financial transactions, public transportation and other fields. Many people have already paid by face or passed an entry gate by face.

With the rapid development of AI, the quality of AI-generated content has improved significantly. Using text, voice, images, video and other media as carriers, automatic generation technology is widely used to imitate and forge human thoughts, behaviors and characteristics. To some extent this reduces labor and other costs and brings convenience and enjoyment to our lives. The synthetic data and virtual content it produces can also open up new application scenarios in some vertical fields, or directly drive technical progress in them.

However, everything has two sides, and technological development is a “double-edged sword” as well. While enjoying the convenience of face technology, people are inevitably exposed to the risks and hidden dangers of its abuse. As applications such as automatic face beautification, skin smoothing and intelligent photo retouching become popular, the security risks and “black and gray market” problems created by AI generation technology keep growing. Face technology in particular, as one of the most widely deployed AI scenarios, faces increasingly serious security, ethical and moral challenges. The combination of AI generation technology and face technology in video, known as “video deep forgery”, has become a “disaster zone” of AI abuse.

Video deep forgery technology

From a technical perspective, video deep forgery methods fall into four main types. The first is entire-face synthesis, which generally uses GAN-related algorithms to generate virtual faces that do not exist in real life, and is often seen in virtual scenes such as games. The second is AI face swapping, in which real faces are substituted for one another. Because it is highly targeted and entertaining, this kind of content often spreads widely, so it is a core research subject in both academia and industry. Face swapping is the most widely used video deep forgery method and the one with the greatest potential for harm. The third is face attribute editing, mainly the editing of hairstyle, hair color, eyes, skin color and other important attributes, which is common in selfie and beauty apps. The fourth is expression manipulation, which gives a face different expressions such as joy, anger, sorrow and happiness, or transfers A's expression onto B's face.

In terms of specific algorithms, forgery is mainly implemented with GANs, autoencoders, style transfer and similar methods, and also involves operations such as keypoint localization, alignment, segmentation and blending.

Beyond methods and algorithms, there are now many public face forgery datasets. To a certain extent, these provide data support for innovation and iteration in video deep forgery detection algorithms and promote their development. However, video deep forgery detection is an open-set problem with continuous adversarial evolution, and it is not realistic to solve it well using only models trained on public data. A more systematic and comprehensive solution design is needed, and this is both the focus and the difficulty of the deep forgery detection business.

Video deep forgery detection methods and difficulties

To counter video deep forgery, detection methods mainly include handcrafted features, CNN, CNN + handcrafted features, CRNN, Transformer and so on. These methods cover the main directions of face forgery detection and also trace its overall evolution.

The first is handcrafted features, such as eye blinking and head pose. Compared with real faces, forged or swapped faces inevitably contain some inconsistencies, which we call “forgery traces”. Mining handcrafted features from the statistics of these traces is a traditional and effective approach. Such features tend to be highly targeted but generalize poorly; in particular, once attack videos have been heavily post-processed, their effectiveness drops sharply. Some current research therefore combines traditional features with CNN features and focuses on fusing features and classifiers, with handcrafted features supplementing the deep learning features. Here, handcrafted features means non-end-to-end methods that inject prior knowledge gained through statistical observation.
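As an illustration of a handcrafted cue, the sketch below computes the widely used eye aspect ratio (EAR) from six eye landmarks and counts blinks in a sequence of per-frame values. It is a minimal sketch, not the speaker's method: the landmark ordering, the function names, the 0.2 threshold and the two-frame minimum are all assumptions chosen for illustration, and the landmarks are assumed to come from some external keypoint detector.

```python
import math


def eye_aspect_ratio(eye):
    """Eye aspect ratio from six (x, y) landmarks ordered p1..p6.

    Assumption: p1 and p4 are the horizontal eye corners, p2/p3 lie on
    the upper lid and p6/p5 on the lower lid.
    """
    p1, p2, p3, p4, p5, p6 = eye
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = 2.0 * math.dist(p1, p4)
    return vertical / horizontal


def count_blinks(ear_series, closed_thresh=0.2, min_closed_frames=2):
    """Count runs of at least `min_closed_frames` frames with EAR below threshold."""
    blinks, run = 0, 0
    for ear in ear_series:
        if ear < closed_thresh:
            run += 1
        else:
            if run >= min_closed_frames:
                blinks += 1
            run = 0
    # A blink may still be in progress when the clip ends.
    if run >= min_closed_frames:
        blinks += 1
    return blinks
```

A clip whose blink count is implausibly low for its duration could then be flagged for closer inspection; a cue like this is easy to defeat with post-processing, which is why it is usually combined with learned features.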

With the development of deep learning, end-to-end face forgery detection using deep learning directly has also become a hot research topic. Most deep learning methods turn face forgery detection into a “face detection + classification” problem. Face detection first locates the face, the box is expanded appropriately, and the cropped patch is then sent to a downstream binary classifier that decides whether the face is forged. This approach is direct and the pipeline is simple. Since face detection is now relatively mature, it is generally not the hard part of the task, so the whole task becomes a face patch classification problem. Once it is a classification problem, mature techniques such as semi-supervised learning and transfer learning can be applied more directly to face forgery detection.
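The “detect, expand, crop, classify” flow can be sketched as follows. This is a minimal sketch under stated assumptions: images are plain lists of pixel rows, face boxes are `(x1, y1, x2, y2)` tuples produced by some external detector, `classifier` is any callable supplied by the caller, and the 1.3 expansion factor is an illustrative value, not one given in the talk.

```python
def expand_box(box, img_w, img_h, scale=1.3):
    """Enlarge a face box about its center by `scale`, clipped to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    nx1 = max(0, int(round(cx - half_w)))
    ny1 = max(0, int(round(cy - half_h)))
    nx2 = min(img_w, int(round(cx + half_w)))
    ny2 = min(img_h, int(round(cy + half_h)))
    return nx1, ny1, nx2, ny2


def crop(image, box):
    """Return the sub-image covered by `box` (image is a list of rows)."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def classify_faces(image, boxes, classifier, scale=1.3):
    """Run the hypothetical `classifier` on each expanded face patch."""
    img_h, img_w = len(image), len(image[0])
    return [classifier(crop(image, expand_box(b, img_w, img_h, scale)))
            for b in boxes]
```

Expanding the box keeps some context around the face, such as the blending boundary, which a tight crop would discard.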

With the spread of Deepfake and the growing popularity of short videos, the main battlefield of face forgery and face swapping has moved to video. Research on face forgery has therefore naturally incorporated video sequence information, using RNNs and LSTMs to encode sequence features for detection. This line of work is also a focus of current research.

From this overview of forgery and detection, it is easy to see that the two are locked in an adversarial process, and many detection methods are tailored to particular forgery methods. This also reflects a current difficulty in academia: detection methods do not generalize well, and the same method may perform very differently on different data. The problem is amplified further in industry, because we face not one or a few methods, nor fixed datasets, but an open-set problem with a massive number of forgery methods and unknown Internet data. The many unknown methods and the constant adversarial evolution therefore make video deep forgery detection very hard in practice.

The difficulty does not lie only in the forgery algorithms themselves. In fact, we have found that post-processing of forged content is one of the biggest challenges for detection. To hide their traces, many forgery methods apply strong post-processing as a countermeasure. Popular tools for skin whitening, skin smoothing and the like also act, in effect, as post-processing. Such processing largely covers up forgery traces and makes detection much harder.

In addition, the wide distribution of data is a more general problem, one that face recognition also encounters.

NetEase Yidun video deep forgery detection solution

In view of the above difficulties, our overall solution, shown in the figure below, adopts the “face detection + classification” approach, where classification is the binary decision of whether a face is forged. We chose this as the main approach because it is currently the best-performing and most widely used method in academia, and face detection is already a mature technology in industry. This lets us focus our effort on the downstream classification, and framing detection as a classification problem makes it easier to draw on the industry's advanced techniques and get twice the result with half the effort.

How do we address the problems above: many forgery methods, many post-processing methods and a wide data distribution? At the data level, we use current semi-supervised techniques to mine hard samples, improve the accuracy of mined data, reduce annotation cost and improve the ability to learn from noisy labels. We also generate adversarial training material directly, from the perspective of forgery methods and post-processing. Both of these are fusion at the data level. At the algorithm level, common and effective methods and features are all part of our solution, with selection and fusion carried out at the feature level. Fusion at the final decision level is also very important.

You may already be familiar with semi-supervised methods. They fit the video deep forgery detection problem well, as can be seen from the direct relationship between semi-supervised approaches and the difficulties discussed earlier. The main families of semi-supervised methods are generative methods, consistency regularization, pseudo-labeling, and hybrid methods.

The first is generative methods. As mentioned above, much deepfake data is generated by GANs, which connects naturally with generative semi-supervised methods. This corresponds to the difficulty, mentioned above, of there being many generation methods. The second is consistency regularization. Its core idea is that the model's outputs for differently transformed versions of the same input should be consistent, which improves the model's generalization and its robustness to transformations. This corresponds to the difficulty of many post-processing methods: unlabeled data can be used to make the model more robust to such transformations. The third is pseudo-labeling, whose core idea is to expand the distribution of the training data with pseudo labels and thereby improve model performance. This corresponds to the problems of many forgery methods and a wide data distribution.
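Two of these ideas can be sketched in a few lines. The code below is a minimal sketch, not the speaker's implementation: it assumes the model outputs a probability that a face is fake, the function names are hypothetical, the 0.95 confidence threshold is an illustrative value, and mean squared error is only one of several possible consistency penalties.

```python
def select_pseudo_labels(probs, threshold=0.95):
    """Pick confident unlabeled samples and assign them pseudo labels.

    probs: predicted P(fake) for each unlabeled sample.
    Returns (index, label) pairs, with label 1 for fake and 0 for real.
    """
    selected = []
    for i, p in enumerate(probs):
        if p >= threshold:
            selected.append((i, 1))
        elif p <= 1.0 - threshold:
            selected.append((i, 0))
    return selected


def consistency_loss(probs_a, probs_b):
    """Mean squared difference between predictions on two transformed
    views (for example, original and blurred) of the same samples."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both views must cover the same samples")
    return sum((a - b) ** 2 for a, b in zip(probs_a, probs_b)) / len(probs_a)
```

Neither function needs ground-truth labels, which is what makes both usable on large volumes of unlabeled Internet data.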

Applying semi-supervised methods to deepfake detection therefore addresses these difficulties well.

Beyond semi-supervised learning, since deepfake detection is an adversarial problem, generating training data ourselves and training the initial model with supervision is also the most direct and effective approach. Here too, multiple generation methods and multiple post-processing methods have to be covered.
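One way to cover multiple post-processing methods when generating training data is to apply a random chain of operations to each forged sample. The sketch below is illustrative only: it works on grayscale images stored as lists of rows, and the 3x3 box blur and intensity quantization are crude, assumed stand-ins for skin smoothing and compression rather than the operations used in the actual system.

```python
import random


def box_blur(image):
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [image[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def quantize(image, step=16):
    """Round every pixel down to a multiple of `step` (coarse compression stand-in)."""
    return [[(int(v) // step) * step for v in row] for row in image]


def random_postprocess(image, rng):
    """Apply a random subset of the operations in random order."""
    ops = [box_blur, quantize]
    rng.shuffle(ops)
    for op in ops[: rng.randint(0, len(ops))]:
        image = op(image)
    return image
```

Calling `random_postprocess(face, random.Random(seed))` with different seeds yields differently degraded copies of the same forged face, so the classifier sees each forgery both with and without its traces partially hidden.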

There is currently a great deal of academic research on deepfake detection. As mentioned above, most of it designs network structures and loss functions to extract and fuse more robust features, including conventional embedding features, frequency-domain features, sequence features, handcrafted features and forgery-trace features. Some of this fusion happens at the feature level and some at the decision level. It should be noted that deepfake detection is a highly targeted task, from training data through to algorithm. To generalize well on an open-set problem, besides seeking breakthroughs in any single algorithm, fusing and selecting among multiple methods is one of the most important and effective approaches.

Fusion at the decision level is therefore essential. Model fusion is the most direct and effective way to improve model performance and generalization, and the idea is common in other AI problems, but one point specific to this task needs to be made. The usual consensus is that fusing models that are different but have similar metrics on the test set improves the result. Deepfake detection models, however, generalize poorly across forgery methods, and two models often behave very differently on the same data. The fusion strategy therefore needs more careful design, with additional selection strategies.
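A decision-level fusion step with a selectable strategy might look like the sketch below. It is an assumption-laden illustration rather than the fusion actually used: each model is assumed to output a probability that the face is fake, the weights are assumed to come from validation performance, and the two strategies shown (weighted mean and max) are just common examples.

```python
def fuse_scores(scores, weights=None, strategy="weighted_mean"):
    """Fuse per-model P(fake) scores for one face into a single score.

    strategy="weighted_mean": average the scores, optionally weighted.
    strategy="max": trust whichever model is most confident the face is fake,
    which can help when each model only recognises some forgery methods.
    """
    if not scores:
        raise ValueError("at least one score is required")
    if strategy == "max":
        return max(scores)
    if strategy == "weighted_mean":
        if weights is None:
            weights = [1.0] * len(scores)
        if len(weights) != len(scores):
            raise ValueError("weights and scores must have the same length")
        return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    raise ValueError(f"unknown strategy: {strategy}")
```

When models specialise in different forgery methods, plain averaging can dilute the one model that detects the forgery; the max rule avoids this at the cost of more false positives, which is the kind of trade-off a selection strategy has to weigh.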

Of course, running multiple models limits speed to some extent. In offline scenarios with high speed requirements, the models are further distilled so that a small model absorbs the capabilities of multiple large models.
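Distilling several large models into one small model can be sketched as training the student against the teachers' combined soft predictions. The code below is a minimal sketch under assumptions not stated in the talk: teachers are combined by a simple average of their fake probabilities, and the student is penalised with binary cross-entropy against that soft target; the function names are hypothetical.

```python
import math


def teacher_soft_target(teacher_probs):
    """Average the P(fake) predictions of several teacher models for one sample."""
    return sum(teacher_probs) / len(teacher_probs)


def distillation_loss(student_prob, soft_target, eps=1e-7):
    """Binary cross-entropy between the student's P(fake) and the soft target."""
    # Clamp to avoid log(0) for saturated student outputs.
    p = min(max(student_prob, eps), 1.0 - eps)
    return -(soft_target * math.log(p)
             + (1.0 - soft_target) * math.log(1.0 - p))
```

The loss is smallest when the student reproduces the teachers' averaged score, so minimising it over many samples pushes the small model toward the ensemble's behaviour while only one model has to run at inference time.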

NetEase Yidun video deep forgery detection results

Like our other services covering politically sensitive, terrorist and prohibited content, video deep forgery detection is framed as an open-set problem that requires iterative optimization. NetEase Yidun has designed a complete solution covering the overall approach, the data and the models. As for results, in the video deep forgery detection track of the second China Artificial Intelligence Competition, NetEase Yidun stood out from 188 enterprises, universities and research institutes and earned the highest A-level certificate with the top score.

Today's talk covered forgery and its detection from four angles: background, forgery methods, detection methods and technical solutions. We hope it is helpful to readers interested in AI and forgery detection.
