A Heart of the Machine original, by Qiu Lulu.
As practitioners in the (pseudo-)AI industry, our colleagues in the Heart of the Machine editorial department believe they can tell "artificial intelligence" from "artificial stupidity" with a fair degree of confidence. But when I put the following screenshot from the iQiyi app in front of them, the "recognizers" of the editorial department admitted that this time they could not be sure.
In any popular video, "bullet comments covering the face" is almost inevitable. In this video, however, the dense bullet screen seems so impressed by Li Jian's artistry that it very precisely "detours around" his handsome face. There are occasional glitches, but they look more like an algorithm's mistakes than a human's.
It is well known in academia that while several major research teams have reported "better-than-human" results on object detection, image segmentation still appears to leave plenty of room for improvement. Google's DeepLabv3+, released in February, achieved state-of-the-art performance on the PASCAL VOC 2012 segmentation benchmark with an IOU of 89%, and that required pre-training on an internal dataset of 300 million images (JFT-300M). On Cityscapes, the figure is only 82.1%.
Given this state of research, can image segmentation actually be used in industry? Is the "face versus background" segmentation in the iQiyi app driven by artificial intelligence or by human labor? Carrying the editorial department's laundry list of questions, we contacted the iQiyi Technology Product Center and found Feng Wei, a researcher at the center and the algorithm lead of the project named "AI barrage mask". He gave us very detailed answers.
Question 1: Is it segmentation? What kind of segmentation?
First of all, the thing we most wanted to know: is this "barrage mask" artificial intelligence or artificial labor?
Is it image segmentation? It is! What kind of image segmentation? Semantic segmentation!
More specifically, it is a two-class semantic segmentation: each pixel in the image is assigned to either a "foreground" or a "background" category, and the system generates a corresponding mask file from the segmentation result.
The algorithm is based on Google's DeepLabv3 model. The technical team also tried other segmentation models such as FCN, but DeepLab's model really is a breakthrough.
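To make the idea concrete, here is a minimal sketch, not iQiyi's code, of producing a binary person/background mask with an off-the-shelf DeepLabv3 model from torchvision. The pretrained model is a 21-class VOC model, so the sketch simply treats its "person" class as foreground.

```python
# Minimal sketch (not iQiyi's pipeline): a binary person/background mask
# from torchvision's pretrained DeepLabv3. VOC class 15 is "person".
import torch
from torchvision import models, transforms
from PIL import Image

model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def person_mask(frame: Image.Image) -> torch.Tensor:
    """Return an HxW uint8 tensor: 1 = foreground (person), 0 = background."""
    x = preprocess(frame).unsqueeze(0)      # 1 x 3 x H x W
    with torch.no_grad():
        logits = model(x)["out"][0]         # 21 x H x W class scores
    classes = logits.argmax(0)              # per-pixel class ids
    return (classes == 15).to(torch.uint8)  # keep only the "person" class
```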
Feng Wei also showed us some segmentation results from variety-show and TV-drama scenarios.
V.qq.com/x/page/z135… (barrage mask effect on The Rap of China)
Why use image segmentation to build a "barrage mask" in the first place?
iQiyi's image segmentation technology had actually been sitting in reserve for quite a while; it was originally intended for background replacement in short videos.
Background replacement means cutting the subject out of a short video recorded by the user and placing it against a different background. From a technical point of view, however, acceptable segmentation of a single image does not add up to acceptable segmentation of a video: small discontinuities between the segmentation results of adjacent frames make the segmented edges jitter constantly, and that incoherence badly hurts the user experience.
Are there scenarios less demanding than background replacement? Yes: keep the original background and insert a dynamic layer between the original background and the segmented portrait layer. That way the segmentation edge still sits against the original background, so errors are far less conspicuous. This is where the barrage mask came from.
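That layering trick is easy to express in code. Below is a minimal sketch, with illustrative names of our own rather than iQiyi's player code, of how the layers might be composited: draw the bullet comments over the whole frame, then paint the segmented foreground back on top so the comments appear to pass behind the person.

```python
# Minimal layering sketch (illustrative, not iQiyi's player code).
import numpy as np

def composite(frame: np.ndarray, danmu_frame: np.ndarray,
              fg_mask: np.ndarray) -> np.ndarray:
    """frame / danmu_frame: HxWx3 uint8 images, where danmu_frame is the
    frame with bullet comments drawn on it; fg_mask: HxW uint8, 1 = person."""
    out = danmu_frame.copy()                  # comments cover everything...
    out[fg_mask == 1] = frame[fg_mask == 1]   # ...except the foreground
    return out
```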
"Once the technology was ready, we kept demoing our capabilities to the various business departments, so the product folks came up with a lot of good ideas," Feng Wei said.
In fact, the deep learning models behind the barrage mask do not just segment; they also recognize. Before a video is segmented, a "scene recognition model" first classifies each frame to determine whether it is a near shot or a long shot.
The point of this scene recognition step is to decide which frames enter the segmentation model: close-up and medium shots go on to generate a mask, while long shots generate no mask, and there the bullet screen covers the whole picture as before. This neatly suppresses the problem of mask jitter between frames.
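The gating logic as described could look roughly like the following sketch; the shot-type labels and function names are our own illustrative assumptions, not iQiyi's API.

```python
# Hedged sketch of the scene-recognition gate: only near/medium shots are
# sent to the segmentation model; long shots get no mask at all.
def build_masks(frames, classify_shot, segment):
    masks = []
    for frame in frames:
        shot = classify_shot(frame)        # e.g. "close", "medium", "long"
        if shot in ("close", "medium"):
            masks.append(segment(frame))   # per-pixel foreground mask
        else:
            masks.append(None)             # no mask: danmu covers the frame
    return masks
```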
It is worth mentioning that this scene recognition classifier is itself an example of accumulating technology and reusing it in a new scenario: it was previously used mainly for iQiyi's intelligent post-production assistance and other features.
After segmentation, the system applies morphological operations such as erosion and dilation to refine the foreground regions output by the segmentation module, and discards foreground regions whose share of the frame is too small, according to the needs of the application scenario.
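In OpenCV terms, that post-processing step might look like the sketch below; the kernel size and the area threshold are illustrative assumptions.

```python
# Hedged sketch of the morphological clean-up: erode, dilate, then drop
# foreground components that occupy too small a share of the frame.
import cv2
import numpy as np

def refine_mask(mask: np.ndarray, min_area_ratio: float = 0.01) -> np.ndarray:
    """mask: HxW uint8 array, 1 = foreground. Returns a cleaned-up mask."""
    kernel = np.ones((5, 5), np.uint8)
    m = cv2.erode(mask, kernel)    # shave off ragged edge pixels
    m = cv2.dilate(m, kernel)      # grow the main regions back
    n, labels, stats, _ = cv2.connectedComponentsWithStats(m)
    out = np.zeros_like(m)
    min_area = min_area_ratio * m.shape[0] * m.shape[1]
    for i in range(1, n):          # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            out[labels == i] = 1
    return out
```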
After this series of processing steps, the pipeline moves on to mask-file generation, compression, and the rest of the production flow.
Question 2: Did you have to label your own data? How much data was labeled?
Yes, we did! Tens of thousands of images.
General-purpose segmentation models are trained on general-purpose datasets such as MS COCO, and when applied directly to variety-show footage the results are mediocre.
"Scene cuts and stage lighting are two situations that general segmentation models struggle to handle. So we picked out tens of thousands of images from typical scenes ourselves, and it took the team three weeks to label them," Feng Wei said.
Consistency between the training and test distributions was also well looked after: "The first program to ship the barrage-mask feature was The Rap of China Season 2, so we built the training set from The Rap of China Season 1 and Hot-Blood Dance Crew, which were made by the same production team."
It is worth noting that because the final mask does not need to be "fine" down to individual hairs, the labeling work is considerably easier than typical semantic segmentation annotation. Feng Wei showed us some samples from the supplementary training set: "It doesn't need to be precise to the pixel; the straight-edged parts of a figure can simply be boxed out."
After the general semantic segmentation model was fine-tuned on this dedicated dataset, IOU improved from 87.6% to 93.6%.
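That fine-tuning step is a standard recipe; a minimal sketch, assuming a PyTorch setup and illustrative hyperparameters, might look like this:

```python
# Hedged fine-tuning sketch: start from a general DeepLab checkpoint and
# continue training on the show-specific foreground/background dataset.
import torch
from torchvision import models

model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT")
# Swap the 21-class VOC head for a 2-class (foreground/background) head.
model.classifier[4] = torch.nn.Conv2d(256, 2, kernel_size=1)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

def train_epoch(loader):
    model.train()
    for images, targets in loader:     # targets: N x H x W long tensors {0, 1}
        logits = model(images)["out"]  # N x 2 x H x W
        loss = criterion(logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```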
Question 3: How efficient is it? Is it fast? Is it expensive?
At inference time, a single GPU needs several minutes to segment a one-minute video, so processing time stays within the same order of magnitude as the video's length.
In actual production, the system often faces tighter deadlines. "The production team of The Rap of China has strict confidentiality requirements: if, say, the program goes online at eight o'clock on Saturday, we may only get the footage at four o'clock. So we control the concurrency of the production service by sharding the video into segments, connect all the shards at the business layer through a message queue, and give each shard's production job its own status monitoring and retry mechanism. In the end, running on multiple GPUs in parallel, a 90-minute video takes about 40 minutes to process."
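As a rough illustration of that shard-and-retry design, here is a sketch using a thread pool in place of iQiyi's message-queue infrastructure; all names and parameters are our own assumptions.

```python
# Hedged sketch of sharded production: segment shards concurrently and
# retry any shard that fails, then reassemble results in order.
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_video(shards, segment_shard, num_workers=4, max_retries=3):
    """shards: list of video segments; segment_shard: one GPU job per shard."""
    def run_with_retry(shard):
        for attempt in range(max_retries):
            try:
                return segment_shard(shard)
            except Exception:
                if attempt == max_retries - 1:
                    raise                  # monitoring would flag this shard
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = {pool.submit(run_with_retry, s): i
                   for i, s in enumerate(shards)}
        results = [None] * len(shards)
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```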
The team is also testing barrage masks in real-time scenarios such as live galas.
Question 4: Are upgrades planned? Besides keeping the bullet screen off faces, what else can this do?
First, there is an upgraded version of the anti-blocking feature itself: moving from semantic segmentation to instance segmentation, turning "everyone's mask" into "your idol's exclusive anti-blocking halo".
Image segmentation tasks come in several varieties. Semantic segmentation only asks the system to assign all the "people" in an image to the single category "person". Beyond that, there are instance segmentation, which must separate different people into different instances, and panoptic segmentation, which must label every pixel, background included.
iQiyi's technical team is also working on instance segmentation based on Mask R-CNN, combined with one of iQiyi's strengths, celebrity face recognition, to build a "fan-exclusive barrage mask".
"For example, if you like Wu Yifan, then when other stars appear the bullet screen still covers them; only when Wu Yifan appears does it detour around him." It sounds like a design with a keen grasp of fan psychology.
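Here is a rough sketch of how instance masks and face recognition could be combined for this, using torchvision's Mask R-CNN; `identify_face` stands in for iQiyi's celebrity face recognition and is hypothetical, as is the score threshold.

```python
# Hedged sketch: keep only the instance masks whose face matches the chosen
# star. identify_face() is a hypothetical stand-in, not a real library call.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def star_mask(frame, star_name, identify_face, score_thresh=0.7):
    """frame: 3xHxW float tensor. Returns a combined HxW uint8 mask."""
    with torch.no_grad():
        pred = model([frame])[0]
    keep = (pred["labels"] == 1) & (pred["scores"] > score_thresh)  # 1 = person
    combined = torch.zeros(frame.shape[1:], dtype=torch.uint8)
    for mask in pred["masks"][keep]:
        region = (mask[0] > 0.5).to(torch.uint8)       # soft mask -> binary
        if identify_face(frame, region) == star_name:  # hypothetical call
            combined |= region
    return combined
```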
Another direction is to extend the category boundaries of semantic segmentation itself. For example, can you separate the pixels that are in focus from the pixels that are out of focus?
This idea, too, comes from real needs: "In The Story of Yanxi Palace, the segmentation model doesn't just pick out the lead actor occupying the center of the frame. When the lead appears together with other characters, an out-of-focus, completely blurred little eunuch in the corner gets segmented out as well. That part is actually unwanted; segmenting it out hurts the user experience instead." In other words, what the system really wants to separate is the "in focus" part of the shot from the "out of focus" part, but since no model exists for that particular segmentation task, "person" is used as a proxy for "in focus". Such ill-defined categories remain an open problem. Developing some new taxonomy might be a solution, but it is not one that tens of thousands of fine-tuning samples can buy.
Semantic segmentation itself can also be extended to many other application scenarios, such as recognizing products, which is likewise very useful.
"For example, if a mobile phone maker sponsors a certain show but is not a sponsor of our platform, we have to blur its logo or cut the product out and replace it. Today that is still done manually by editors."
Beyond that, there is combining tracking algorithms with segmentation, plus model acceleration and model compression for mobile devices, and so on… It sounds like the researchers at the Technology Product Center have a work schedule stretching all the way to the year 8102!
Back in the editorial department, when I shared iQiyi's practice with colleagues, we all came away with the same impression: the barrage mask's final product experience is very good. If it had to be summed up in one sentence, it would be: calibrate your expectations of the model, and do what you can with what you have.
A model with segmentation accuracy around 80% may still be a "baby", but if you don't force it to do painstaking work it cannot handle, and instead hand it simple scenes where hair-level precision doesn't matter, backed by a series of engineering practices (a recognition model to exclude the difficult scenes, graphics methods to polish the segmentation output), the system as a whole can still deliver a very good result.
The philosophy of deep learning may be end-to-end, but the problem it must face is that reality is always messier than the training set, and there will always be moments when the model "fails". Breaking the leap to heaven into steps, like the old joke about putting an elephant into the refrigerator in three, shipping a usable version first and then iterating on new problems as they appear: isn't that also a good choice?