Author: Feng Longtao
Background
With the rapid growth of live streaming and short-video platforms, Douyin now has more than 600 million active users. Every day, a vast number of UGC videos are uploaded to the server through different channels, such as mobile clients (iOS/Android), the web, and PC. The video service platform then transcodes each video into different tiers, applies enhancement, and performs other processing. After transcoding, the system delivers videos at different tiers and resolutions according to the user’s viewing environment. Faced with the challenges of such diverse, large-scale video, controlling and optimizing the end-to-end video quality experience has become critical to maintaining and optimizing the entire video pipeline. Live video likewise needs corresponding monitoring services to guarantee its quality. In addition, quality data is an important indicator for measuring the quality of our video service.
We therefore built a multimedia quality evaluation system to obtain the objective and subjective quality of multimedia more accurately, along with quality-related indicators and problems. Our entire multimedia quality evaluation system is shown in the figure below.
Our video quality evaluation algorithms fall into two groups. Full-reference metrics, including the commonly used PSNR, SSIM, and VMAF, are mainly used to evaluate quality changes during video transcoding and distribution. No-reference metrics include the video clarity algorithm VQScore, the aesthetic algorithm AESMOD, attribute algorithms such as noise, color, and block effect detection, source detection, and real-time live monitoring. The no-reference algorithms are mainly used to analyze the quality of user-uploaded videos and can be attached more flexibly to any node of the video processing pipeline.
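As a small illustration of what a full-reference metric computes, the sketch below implements PSNR between a reference frame and a distorted frame using its standard definition. The function name `psnr` and the 8-bit peak value of 255 are assumptions made for the example; the article does not describe how the production system computes it.

```python
import numpy as np

def psnr(reference: np.ndarray, distorted: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    if reference.shape != distorted.shape:
        raise ValueError("frames must have the same shape")
    # Work in float to avoid uint8 wrap-around when subtracting.
    diff = reference.astype(np.float64) - distorted.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(peak ** 2 / mse))
```

A full-reference metric like this needs the source frame, which is why it suits transcoding checks, whereas a no-reference metric has to work from the distorted frame alone.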
Video clarity algorithm VQScore
Video Quality Score (VQScore) is a no-reference video quality evaluation metric. No-reference quality evaluation means that quality is assessed independently, without a reference video, with the machine imitating human visual perception of the video. Because the ultimate consumer of video is the user, accurately capturing users’ visual perception of a video enables many applications: video quality monitoring, quality-based recommendation, processing algorithm optimization driven by subjective perception, low-quality video screening, and so on.
Imitating human subjective perception is very difficult because it is affected by many factors, and the industry does not yet have a good solution. VQScore is designed to address this situation and the long-standing pain points in the company’s business.
The following figure shows the flow chart of the whole VQScore system. After a multimedia file is input into the system, a decoding strategy is chosen according to the characteristics of the file type and the requirements of the downstream algorithms, using different frame extraction methods and scene switch detection. Quality indicators in different dimensions are then obtained from different multimedia quality algorithms. Finally, in the decision-making stage, the final quality score and the per-dimension quality indicators are produced according to business requirements and different fusion strategies. Together, the quality score and quality indicators measure the quality of the multimedia and point to specific quality problems.
Decoding strategy
The multimedia files entering the system are usually compressed, while the downstream quality algorithms require raw pixel information, so the files must be decompressed (decoded) first. Because the downstream algorithms depend on the decoded result, several points need attention in this process. Typical ones are:
- A video is usually redundant in the time domain, and the contents of adjacent frames are similar. To avoid redundant computation, frames are sampled from the video. Depending on the characteristics of the service and the video, the frame extraction strategy can be uniform sampling, sampling according to spatial complexity, sampling of consecutive frames at intervals, or other methods.
- Some videos contain scene switches, which come in various forms such as fade in and fade out, white flash, and flip. The content of a frame inside a scene switch is usually meaningless and disturbs the quality algorithms. Scene switch frames therefore need to be identified during frame extraction and removed.
- For ultra-high-resolution videos and pictures, when computing resources and time complexity are constrained, it is necessary to sacrifice part of the resolution information and quality by downsampling the original video or picture.
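The three points above can be sketched on already-decoded grayscale frames held as NumPy arrays. This is a minimal sketch under stated assumptions, not the system’s actual implementation: the function names, the histogram-distance heuristic for spotting transition frames, the 0.5 threshold, and the stride-based downsampling are all hypothetical choices made for illustration.

```python
import numpy as np

def uniform_frame_indices(num_frames: int, num_samples: int) -> list:
    """Pick up to num_samples evenly spaced frame indices (uniform frame extraction)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)
    positions = np.linspace(0, num_frames, num_samples, endpoint=False)
    return [int(i) for i in positions.astype(int)]

def gray_histogram(frame: np.ndarray, bins: int = 32) -> np.ndarray:
    """Normalized gray-level histogram of one 8-bit frame."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def transition_frame_mask(frames, threshold: float = 0.5) -> list:
    """Flag frames whose histogram differs strongly from BOTH neighbors.

    Assumed heuristic: a frame inside a flash or fade looks unlike the
    frames on either side, while the first frame after a hard cut still
    matches its successor and is therefore kept.
    """
    hists = [gray_histogram(f) for f in frames]
    # L1 distance between consecutive histograms, each value in [0, 2].
    dists = [float(np.abs(hists[i] - hists[i - 1]).sum()) for i in range(1, len(hists))]
    mask = []
    for i in range(len(frames)):
        prev_d = dists[i - 1] if i > 0 else 0.0
        next_d = dists[i] if i < len(dists) else 0.0
        mask.append(bool(prev_d > threshold and next_d > threshold))
    return mask

def downsample(frame: np.ndarray, max_side: int = 1080) -> np.ndarray:
    """Reduce resolution by integer striding so the longer side is at most max_side."""
    height, width = frame.shape[:2]
    factor = int(np.ceil(max(height, width) / max_side))
    return frame if factor <= 1 else frame[::factor, ::factor]
```

In use, the sampled frames would be filtered with the mask before being passed to the quality algorithms. Plain striding is the cheapest option but can alias; a low-pass filter before decimation would be the more careful choice.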
Multimedia quality algorithm
The multimedia quality algorithms are the core of the whole system. Measuring multimedia quality is a complicated and subjective problem, especially when a computer is used to predict users’ perception of quality. Our multimedia quality algorithm is therefore composed of quality algorithms in many dimensions, which measure video quality from different angles and also provide more detailed quality information for different applications. Here are some of the algorithms used in the system:
- Subjective quality prediction algorithm: Drawing on the current mainstream no-reference quality evaluation algorithms in academia, we use deep learning to predict the subjective quality of videos. Deep learning first requires a data set to train the neural network. Following the current mainstream subjective quality labeling method, Mean Opinion Score (MOS), we ask different users to rate the subjective quality of videos taken from our target business. Because subjective quality is inherently subjective, each user’s ratings fluctuate, so increasing the number of ratings per video is one way to reduce fluctuation and improve the reliability of the final result. Once a reliable business-related data set is available, a model can be selected according to time complexity and network performance. In most scenarios, we use the common ResNet50 backbone to extract deep features from images and videos, and then regress the predicted subjective quality score. After sufficient training, a stably converged model can match or even exceed the accuracy of users’ own subjective labeling. Although the algorithm achieves excellent correlation metrics and low generalization error on our self-built data sets, its performance ceiling still depends on the subjectively annotated data. Incomplete scene coverage and noisy subjective annotations limit the algorithm in practical business, so while optimizing the algorithm we also need to keep iterating on and expanding the subjective annotation data set at multiple levels. In addition, the black-box nature of deep learning makes the predictions hard to interpret, and it cannot give a reasonable attribution for changes in video quality. To locate and solve business problems quickly and accurately, indicators from other dimensions therefore also need to be integrated.
- Noise detection algorithm: Video sources vary widely. Many videos are shot and produced by non-professional users, with poorly performing equipment and under bad conditions such as low light, which introduces noise into the video. Noise gives users a poor subjective experience and, because it adds useless and harmful information, also increases the bit rate of the video. For this common problem, we designed an algorithm to detect and measure the noise intensity in a video. Noise and normal content have different characteristics in space and time, so we extract spatial features such as Sobel and GLCM and temporal features such as optical flow magnitude and direction, and use a deep learning method to regress the noise intensity from these features. The result is an accurate prediction of noise intensity that supports quality services and denoising services.
- Block effect detection algorithm: Uploaded videos come with many different compression parameters and encoders, and some have been transcoded multiple times. Excessive compression causes block-effect distortion, which degrades the subjective quality perceived by users and also increases the bit rate of the video. We therefore designed a block effect detection algorithm to detect and measure the block effect intensity in a video. The block effect has a very regular distribution in space, so we analyze the spatial distribution of block boundaries and use the Fourier transform to move from the spatial domain to the frequency domain, where the regularity is more apparent and reflects the strength of the block effect.
- Contrast detection algorithm: Contrast is an important aspect of video quality, and a video with poor contrast usually has correspondingly poor quality. We therefore designed a video contrast detection algorithm that uses the entropy of the distribution of the video’s gray component to represent contrast, and it produces detection results consistent with users’ subjective judgments.
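The two signal-processing indicators in this list, block effect and contrast, can be sketched with NumPy alone. This is an illustrative sketch rather than the production VQScore code: the function names `blockiness` and `gray_entropy`, the 8-pixel block size, the use of horizontal boundaries only, and the peak-to-background ratio used as the block effect score are all assumptions made for the example.

```python
import numpy as np

def gray_entropy(gray: np.ndarray) -> float:
    """Shannon entropy (bits) of the 8-bit gray-level histogram.

    A flat, low-contrast frame concentrates its pixels in few levels and
    scores near 0; a frame using all 256 levels equally scores 8.
    """
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def blockiness(gray: np.ndarray, block: int = 8) -> float:
    """Strength of the block-periodic component in horizontal pixel differences.

    Returns the mean spectral magnitude at the block frequency and its
    harmonics divided by the mean magnitude of all other non-DC bins.
    """
    g = gray.astype(np.float64)
    # Average absolute difference across every vertical column boundary.
    profile = np.abs(np.diff(g, axis=1)).mean(axis=0)
    # Trim so that a period of `block` samples lands on an exact FFT bin.
    n = (profile.size // block) * block
    if n < 2 * block:
        return 0.0
    profile = profile[:n]
    spectrum = np.abs(np.fft.rfft(profile - profile.mean()))
    k = n // block  # bin index whose period equals the block size
    harmonics = np.arange(k, spectrum.size, k)
    peak = spectrum[harmonics].mean()
    rest = np.delete(spectrum[1:], harmonics - 1).mean()
    return float(peak / (rest + 1e-12))
```

A frame with strong 8x8 blocking yields a score far above 1, while smooth or noise-like content stays near or below 1. A fuller version would also analyze vertical differences and handle block grids that are shifted or rescaled by earlier transcoding.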
Decision-making stage
Once the quality indicators of each dimension have been obtained, a final decision-making step determines the final video quality. The following decision-making strategies are first determined according to service requirements and the characteristics of the input multimedia files:
- Time-domain fusion strategy: The quality of different video frames needs to be fused, for example by time-domain averaging or hysteresis fusion. This step can also be integrated into the algorithms above for end-to-end optimization. It sometimes takes business requirements into account as well, such as increasing the weights of poor-quality frames to detect problems better.
- Fusion strategy for the quality indicators: The quality indicators of different dimensions also need to be fused. The specific strategy usually depends on the video type and business requirements. For example, when videos with strong noise need to be screened out, the weight of noise detection in the quality score is increased.
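Both fusion steps can be sketched as simple weighted averages. The code below is a hedged illustration, not the system’s actual decision logic: it assumes every score lies on a common 0 to 100 scale where higher is better, the function names are hypothetical, and the exponential weighting of poor frames is just one possible way to emphasize them. Hysteresis fusion is not shown.

```python
import numpy as np

def temporal_mean(frame_scores) -> float:
    """Plain time-domain average of per-frame quality scores."""
    return float(np.mean(frame_scores))

def temporal_low_quality_weighted(frame_scores, sharpness: float = 0.1) -> float:
    """Weighted average that gives more weight to low-scoring frames.

    Weights decay exponentially with the distance from the worst frame,
    so a short burst of bad frames pulls the result down more than it
    would in a plain mean. `sharpness` controls how strong the effect is.
    """
    scores = np.asarray(frame_scores, dtype=np.float64)
    weights = np.exp(-sharpness * (scores - scores.min()))
    return float((weights * scores).sum() / weights.sum())

def fuse_indicators(indicators: dict, weights: dict) -> float:
    """Weighted average of per-dimension scores using business-defined weights."""
    total = sum(weights[name] for name in indicators)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * value for name, value in indicators.items()) / total
```

For example, raising the hypothetical `noise` weight from 1 to 3 makes a noisy video score lower overall, which matches the screening case described in the list above.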
Application scenarios
The goal of the VQScore clarity algorithm is to accurately capture all information related to multimedia quality, which is critical for video companies. Its application scenarios are very broad:
- Multimedia quality monitoring: Real-time and offline monitoring of VOD and live-streaming quality, to understand the video quality at each node of the video pipeline, raise alarms promptly, and pinpoint specific video quality problems.
- Low-quality screening/suppression: Low-quality videos can be screened out from many thousands of videos without manual labor, and can then be enhanced or suppressed in recommendation.
- Video quality optimization: Video quality indicators make it possible to locate quality problems and choose optimization strategies, so that video quality is improved precisely, computing resources are saved, and enhancement algorithms are prevented from degrading the video through over-processing.
- Source quality inspection: Videos uploaded by users are inspected for quality and timely feedback is given to users, so as to improve the quality of the original uploads.
Introduction to aesthetic quality algorithms
Objective aesthetic quality assessment algorithms study the public’s aesthetic experience of different pictures. Although individuals have different aesthetics because of their personal experience and background, in today’s highly information-based society the general public’s perception of beauty and ugliness is relatively consistent. An objective aesthetic quality assessment algorithm can quantify aesthetic factors automatically, which clearly benefits application scenarios such as assisting users in taking photos, video cover extraction, and user recommendation.
The earliest research on aesthetic quality evaluation was the work of James Wang at Pennsylvania State University. It mainly applied empirical photography knowledge, such as the rule of thirds, and evaluated the aesthetics of pictures by extracting basic indicators such as brightness, color, saturation and hue, rule-of-thirds composition, texture, and depth of field. However, the prediction accuracy still differed considerably from subjective perception. With the growth of subjectively annotated data sets and the rise of deep learning, aesthetic evaluation based on deep learning has gradually become the mainstream method.
Here we train the aesthetic quality assessment algorithm on the commonly used Aesthetic Visual Analysis (AVA) data set, which contains about 250K photographic works on various themes collected from DPChallenge, a professional photography website. Each picture was peer-rated by 100 to 200 professional users in the community, and the score reflects both the quality of the photographic work and the theme of the challenge. Our deep learning algorithm mainly uses the common ResNet18 backbone to extract visual features from images, and adds a hyper-column structure to strengthen low-level and mid-level visual features alongside the high-level semantic information. We also added scene classification as an auxiliary task to improve training through multi-task learning, and achieved a binary classification accuracy of 80%+ on the AVA data set. The aesthetic algorithm can also be applied directly to video by extracting frames and evaluating them.
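To make the binary accuracy figure concrete, the sketch below shows how it is typically computed on AVA, where each image carries a histogram of votes for the scores 1 to 10. The threshold of 5 on the mean score is the convention commonly used in the AVA literature and is an assumption here, since the article does not state which threshold was used; the function names are hypothetical.

```python
import numpy as np

def mean_score(vote_counts) -> float:
    """Mean aesthetic score from a 10-bin vote histogram (scores 1..10)."""
    counts = np.asarray(vote_counts, dtype=np.float64)
    if counts.shape != (10,):
        raise ValueError("expected one vote count for each score 1..10")
    scores = np.arange(1, 11)
    return float((scores * counts).sum() / counts.sum())

def binary_accuracy(predicted_scores, vote_histograms, threshold: float = 5.0) -> float:
    """Fraction of images where prediction and ground truth fall on the same side of the threshold."""
    truth = np.array([mean_score(h) for h in vote_histograms]) > threshold
    predicted = np.asarray(predicted_scores, dtype=np.float64) > threshold
    return float((truth == predicted).mean())
```

For video, the same model output can be pooled over the extracted frames, for example with a simple average, before it is compared or reported.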
Conclusion
A systematic quality assessment algorithm is an indispensable part of optimizing the end-to-end subjective experience of video and multimedia users. The video clarity and aesthetic algorithms introduced in this article already serve different business scenarios such as video transcoding, live streaming, and enhancement, for purposes such as quality control, quality optimization, and on-demand recommendation. However, in the face of complex and diverse real business scenarios, we still need to keep expanding the volume of subjectively annotated videos, cover more comprehensive scene types, and improve the scene generalization ability and fine-grained evaluation accuracy of the clarity algorithm, so as to better serve different video services such as PGC and UGC.