Tencent operates a number of video business lines: video on demand (Tencent Video, Penguin Pictures), short video (Weishi, WeSing), live streaming (Now Live, Penguin Esports), and real-time transmission (QQ and WeChat audio/video calls, wireless screen casting, and Tencent Meeting).

Users have different expectations for different products. On a 27-inch monitor with an ideal network connection, can a high-definition video render every strand of hair? In a weak network environment such as 3G, can a video call avoid frequent stuttering?

For a service provider, all of these problems boil down to one goal: delivering the best possible viewing experience under different network conditions. Along the video pipeline, we can accurately measure most modules, such as capture, upload, preprocessing, transcoding, and distribution. The part we know the least about is also the part that matters the most: how users actually experience the video they watch.

This article reviews the progress of video quality assessment in the industry and proposes a full-reference video quality assessment algorithm based on a 3D convolutional neural network.

What is Video Quality Assessment?

The purpose of video quality assessment is to accurately measure the quality of video content as perceived by the human eye. Uncompressed source video has too high a bit rate to be suitable for Internet transmission, so we must encode it with standard codecs such as H.264/AVC or HEVC, or with proprietary codecs, to reduce the bitstream size. However, video compression inevitably introduces distortion. Taking H.264/AVC as an example, Figure 1 shows compression distortion. The left side of the white line is the uncompressed original picture: the texture of the bricks on the ground is clearly visible, and the blue sky in the background transitions naturally. The right side of the white line is the compressed low-bit-rate picture: the compression distortion is obvious, the brick texture becomes blurred, and the blue sky shows unnatural banding caused by blocking artifacts.

Figure 1. Screenshot of H.264 compression distortion. The left side of the white line is the HD source video; the right side is the low-bit-rate compressed video.

In industry and academia, there are two common ways to evaluate video quality: subjective experiments and objective algorithms. Each has its own application scenarios and limitations.

Subjective experiments let us measure video quality accurately. For some core questions, such as codec performance comparisons, we still need subjective experiments to obtain definitive answers. Subjective scores are also typically used as the ground truth for validating the performance of objective quality assessment algorithms. A complete subjective experiment generally includes: 1) selecting representative source videos, 2) applying the video processing schemes to be evaluated, 3) designing the experiment according to ITU standards, 4) recruiting volunteers to watch and rate the videos, 5) collecting the subjective ratings and removing invalid data, and 6) modeling the data and drawing conclusions. The ITU has a series of standards that guide how to run subjective experiments, such as ITU-T P.910 [2] and ITU-R BT.2022 [3], which we will not expand on here.
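As a simplified illustration of step 5, a minimal sketch of turning raw ratings into mean opinion scores (MOS) while discarding clearly inconsistent viewers might look like the following. This is only a toy z-score screen on hypothetical data, not the full ITU screening procedure.

```python
import numpy as np

def mean_opinion_scores(ratings, z_thresh=2.0):
    """ratings: array of shape (num_viewers, num_videos), e.g. scores in [1, 5]."""
    per_video_mean = ratings.mean(axis=0)
    per_video_std = ratings.std(axis=0) + 1e-8
    # Average absolute z-score of each viewer across all videos.
    viewer_dev = np.abs((ratings - per_video_mean) / per_video_std).mean(axis=1)
    valid = viewer_dev < z_thresh       # keep viewers that broadly agree with the panel
    return ratings[valid].mean(axis=0)  # MOS per video

# Hypothetical example: 20 viewers rating 6 videos on a 5-point scale.
ratings = np.random.randint(1, 6, size=(20, 6)).astype(float)
print(mean_opinion_scores(ratings))
```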

Clearly, subjective experiments are long, time-consuming, and labor-intensive. It is not feasible to rely on subjective scoring to validate every video quality requirement. Fortunately, we can use objective quality assessment algorithms to approximate subjective scores. Developing objective algorithms that are both accurate and fast, however, remains a challenging task.

Traditional VQA algorithms cannot effectively use the motion information in video

An objective video quality assessment algorithm computes a quality score automatically, without human raters. From an industry perspective, the classical objective algorithms are PSNR, SSIM [4], and MS-SSIM [5]. These algorithms measure the difference between the distorted video and the lossless source in terms of classical signal fidelity, and then map that difference to perceived quality. More recent algorithms include VQM [6], which extracts joint spatio-temporal features along multiple dimensions to approximate subjective quality. The current mainstream algorithm is VMAF [7], which uses machine learning to fuse several objective image quality metrics. Thanks to this fusion design, VMAF can flexibly incorporate new objective metrics, and by retraining on new datasets it can easily be adapted to more specialized video quality assessment tasks.
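For a concrete reference point, a minimal PSNR computation between a reference frame and a distorted frame might look like the sketch below (assuming 8-bit frames stored as NumPy arrays). SSIM and VMAF are considerably more involved and are usually taken from existing implementations such as FFmpeg's libvmaf.

```python
import numpy as np

def psnr(reference, distorted, max_val=255.0):
    """PSNR in dB between two 8-bit frames of identical shape."""
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10((max_val ** 2) / mse)

# Hypothetical frames: a clean frame and a noisy version of it.
ref = np.random.randint(0, 256, size=(1080, 1920), dtype=np.uint8)
dist = np.clip(ref + np.random.normal(0, 5, ref.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ref, dist):.2f} dB")
```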

Image quality assessment mainly measures how perceptible in-picture distortion is under the influence of spatial masking. Video quality assessment depends not only on distortion within each picture, but also on temporal distortion and the temporal masking effect. Masking can be loosely understood as the complexity of the background: the more complex the background, the stronger the masking effect, and vice versa. In Figure 1, for example, the skateboarder is moving quickly, the masking effect is strong, and distortion in that region is harder to detect; the blue sky in the background is a large smooth area, the masking effect is weak, and even subtle compression distortion is easy to spot. Therefore, an objective video quality assessment algorithm must take the motion inherent in video into account.

Academia has proposed many strategies for this. The most common approach is to extract two kinds of features: one describing picture quality and the other describing the amount of motion. Mainstream motion features include TI (Temporal Information), motion vectors, and optical flow. The biggest drawback of this approach is that it completely decouples picture information from motion information: the video is no longer treated as three-dimensional data, but as two-dimensional data plus one-dimensional data.
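For illustration, TI in the spirit of ITU-T P.910 can be computed as the maximum over time of the spatial standard deviation of frame differences. A rough sketch, assuming the clip is stored as a NumPy array of grayscale frames:

```python
import numpy as np

def temporal_information(frames):
    """TI per ITU-T P.910: max over time of the spatial std of frame differences.
    frames: NumPy array of shape (num_frames, height, width), grayscale."""
    diffs = frames[1:].astype(np.float64) - frames[:-1].astype(np.float64)
    return diffs.std(axis=(1, 2)).max()

# Hypothetical clip: 30 random 480p grayscale frames.
clip = np.random.randint(0, 256, size=(30, 480, 854), dtype=np.uint8)
print(f"TI = {temporal_information(clip):.2f}")
```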

To address this problem, another more intuitive method is to take three-dimensional slices of the video [8]. As shown in Figure 2, we use (x, y, t) to label the spatial and temporal axes. A slice perpendicular to the time axis, i.e. in the (x, y) plane, is just a traditional video frame; a slice parallel to the time axis, i.e. in the (x, t) or (y, t) plane, is a two-dimensional spatio-temporal slice. The latter two slice types contain motion information to some extent. Applying an image quality assessment algorithm to all three slice types and then combining the slice scores yields a noticeable improvement in accuracy. However, 3D slices still do not make full use of the motion information.

Figure 2. Schematic diagram of video slices in 3D space
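In code, the slicing of Figure 2 amounts to nothing more than different index patterns over a (t, y, x) array. A minimal sketch with hypothetical dimensions:

```python
import numpy as np

# Hypothetical grayscale clip stored as (t, y, x): 60 frames of 240x320.
video = np.random.randint(0, 256, size=(60, 240, 320), dtype=np.uint8)

xy_slice = video[10, :, :]   # ordinary frame at t = 10, shape (240, 320)
xt_slice = video[:, 120, :]  # fix row y = 120, shape (60, 320): x vs. time
yt_slice = video[:, :, 160]  # fix column x = 160, shape (60, 240): y vs. time

# An image quality metric can be applied to each slice type and the
# per-orientation scores combined afterwards.
print(xy_slice.shape, xt_slice.shape, yt_slice.shape)
```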

Many image quality assessment algorithms are based on the classical DCT or wavelet transform, extracting feature vectors from the transform coefficients. For video, a natural extension is to use 3D transforms, such as the 3D DCT or the 3D wavelet transform, and then extract features from the 3D transform coefficients for quality assessment. This approach preserves the joint spatio-temporal information of the video, but the 3D transform introduces high computational complexity.
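To get a feel for the 3D-transform approach, the sketch below applies a separable 3D DCT (via SciPy's dctn) to a small spatio-temporal block and measures how much energy falls into the low-frequency corner; an actual algorithm would derive richer features from these coefficients.

```python
import numpy as np
from scipy.fft import dctn

# Hypothetical spatio-temporal block: 8 frames of an 8x8 patch.
block = np.random.rand(8, 8, 8)

# Separable 3D DCT over (t, y, x); quality features would be derived from coeffs.
coeffs = dctn(block, type=2, norm="ortho")

# Example statistic: fraction of energy in the lowest-frequency 4x4x4 corner.
low = np.sum(coeffs[:4, :4, :4] ** 2)
total = np.sum(coeffs ** 2)
print(f"low-frequency energy ratio: {low / total:.3f}")
```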

Learning joint spatio-temporal features of videos with a 3D Convolutional Neural Network (C3D)

In recent years, deep learning has achieved remarkable results on many computer vision tasks. Some researchers have extended two-dimensional neural networks to three-dimensional ones to better handle video tasks [9]. We use three-dimensional convolutional neural networks to learn spatio-temporal features and apply them to video quality assessment. We first present the basic two- and three-dimensional convolution modules, and then introduce the proposed network structure.

Figure 3a shows a 2D convolution kernel operating on a 2D input. To avoid ambiguity, assume the convolution is performed on a two-dimensional image: the input image is H×W, the kernel is K×K, and the temporal depth of both the image and the kernel is 1. After convolution, the output is still two-dimensional; neither the input nor the output contains any motion information.

Figure 3b shows a 2D convolution kernel operating on a 3D input. Assume the input is an H×W video containing L frames. The depth of the kernel is no longer 1 but equals the number of frames. After convolution, the output is still two-dimensional, with the same size as in Figure 3a. This operation does use motion information from neighboring frames, but it consumes all of the motion information in a single convolution step.

Figure 3c shows a 3D convolution kernel operating on a 3D input. Compared with Figure 3b, the kernel depth here is d, with d less than L. After the 3D convolution, the output is still three-dimensional. When d = 1, it is equivalent to applying the convolution of Figure 3a frame by frame, so motion information across frames is not used. When d = L, it is equivalent to Figure 3b. When d lies between the two, 3D convolution uses motion information in a more controllable way: a larger d aggregates motion information more quickly, while a smaller d extracts it more gradually.

Figure 3. Schematic diagram of 2D and 3D convolution operations
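The three cases of Figure 3 can be reproduced in a few lines of PyTorch, looking only at tensor shapes with random weights; the kernel depth d controls how many neighboring frames each output sample mixes together.

```python
import torch
import torch.nn as nn

L, H, W = 16, 64, 64  # hypothetical clip: 16 frames of 64x64

# (a) 2D kernel on a 2D input: one frame in, one frame out, no motion involved.
frame = torch.randn(1, 1, H, W)
out_a = nn.Conv2d(1, 1, kernel_size=3, padding=1)(frame)

# (b) 2D kernel on a 3D input: treat the L frames as L input channels;
#     all temporal information is consumed in a single convolution step.
clip_2d = torch.randn(1, L, H, W)
out_b = nn.Conv2d(L, 1, kernel_size=3, padding=1)(clip_2d)

# (c) 3D kernel of depth d < L: the output keeps a temporal axis, so motion
#     information is aggregated gradually, layer by layer.
clip_3d = torch.randn(1, 1, L, H, W)
out_c = nn.Conv3d(1, 1, kernel_size=(3, 3, 3), padding=1)(clip_3d)

print(out_a.shape)  # torch.Size([1, 1, 64, 64])
print(out_b.shape)  # torch.Size([1, 1, 64, 64])
print(out_c.shape)  # torch.Size([1, 1, 16, 64, 64])
```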

On this basis, we designed our own video quality assessment algorithm, C3DVQA. The core idea is to use 3D convolution to learn joint spatio-temporal features that better characterize video quality.

Figure 4 shows the proposed network structure. Its inputs are the distorted video and the residual video. The network first uses two 2D convolution layers to extract spatial features frame by frame. After these features are concatenated, they still retain the temporal relationship between neighboring frames. The network then uses four 3D convolution layers to learn joint spatio-temporal features. The 3D convolution output describes the spatio-temporal masking effect of the video, and we use it to simulate how the human eye perceives the video residual: where the masking effect is weak, the residual is more easily perceived; where the masking effect is strong, the complex background hides the distortion better.

Finally, the network has a pooling layer and fully connected layers. The input to the pooling layer is the residual weighted by the masking effect, which represents the residual as perceived by the human eye. The fully connected layers learn the nonlinear regression from the globally pooled perceptual features to the target quality score range.

Figure 4. The network structure proposed in this paper. It contains two 2D convolution layers, four 3D convolution layers, a pooling layer, and fully connected layers. Convolution parameters are given as (channels, kernel size, stride, padding).
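To make the data flow concrete, below is a much-simplified PyTorch sketch of this kind of architecture. Channel counts, kernel sizes, and the pooling/regression head are placeholders, not the exact configuration of Figure 4: the distorted frames pass through 2D convolutions frame by frame, the stacked feature maps go through 3D convolutions that output a sensitivity map, the squared residual is weighted by that map, and pooling plus fully connected layers regress the score.

```python
import torch
import torch.nn as nn

class SimpleC3DVQA(nn.Module):
    """Much-simplified sketch of a C3DVQA-style network (placeholder sizes)."""

    def __init__(self, ch=16):
        super().__init__()
        # Two 2D conv layers, applied frame by frame to the distorted video.
        self.spatial = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Four 3D conv layers learning joint spatio-temporal features; the
        # single-channel output acts as a visual-sensitivity (masking) map.
        self.temporal = nn.Sequential(
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, 1, 3, padding=1), nn.Sigmoid(),
        )
        # Regression head: pooled perceived residual -> quality score.
        self.fc = nn.Sequential(nn.Linear(1, 32), nn.ReLU(inplace=True), nn.Linear(32, 1))

    def forward(self, distorted, residual):
        # distorted, residual: (N, 1, T, H, W)
        n, c, t, h, w = distorted.shape
        feat = self.spatial(distorted.transpose(1, 2).reshape(n * t, c, h, w))
        feat = feat.reshape(n, t, -1, h, w).transpose(1, 2)  # (N, ch, T, H, W)
        sensitivity = self.temporal(feat)                    # (N, 1, T, H, W)
        perceived = (residual ** 2) * sensitivity            # residual weighted by masking
        pooled = perceived.mean(dim=(1, 2, 3, 4)).unsqueeze(1)
        return self.fc(pooled).squeeze(1)                    # predicted score per clip

# Hypothetical input: a batch of one 8-frame, 112x112 grayscale clip.
dist = torch.randn(1, 1, 8, 112, 112)
resid = torch.randn(1, 1, 8, 112, 112)
print(SimpleC3DVQA()(dist, resid).shape)  # torch.Size([1])
```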

Results

We verified the performance of the proposed algorithm on two video quality datasets, LIVE and CSIQ. The LIVE database contains 10 reference videos, each with 15 distorted versions. The CSIQ dataset contains 12 reference videos, each with 18 distorted versions. We use the standard PLCC and SROCC as evaluation criteria to compare the performance of different algorithms.
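Both criteria are available directly in SciPy. Given predicted scores and subjective MOS values for a set of test videos, they can be computed as follows (in practice PLCC is often computed after a nonlinear mapping of the predictions, which is omitted here):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores: model predictions vs. subjective MOS for 15 videos.
predicted = np.array([2.1, 3.4, 4.0, 1.8, 3.9, 2.7, 4.5, 3.1, 2.2, 3.6, 4.2, 1.5, 2.9, 3.3, 4.1])
mos       = np.array([2.0, 3.5, 4.2, 1.6, 3.7, 2.9, 4.6, 3.0, 2.5, 3.4, 4.0, 1.7, 2.8, 3.5, 4.3])

plcc, _ = pearsonr(predicted, mos)    # linear correlation (prediction accuracy)
srocc, _ = spearmanr(predicted, mos)  # rank correlation (prediction monotonicity)
print(f"PLCC = {plcc:.3f}, SROCC = {srocc:.3f}")
```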

Because both databases are relatively small, we follow the practice of another deep learning paper [10]: we randomly select 80% of the reference videos, together with their distorted versions, for training, and use the rest for testing. We repeat this partition 20 times, training the model from scratch each time. Scatter plots of the quality estimates are shown in Figure 5.

Figure 5. Scatter plots of the quality estimation results, where each point represents one test video. The y-axis is the estimated quality and the x-axis is the subjective score. The left plot shows results on LIVE, and the right plot shows results on CSIQ.

We compared against commonly used full-reference quality assessment algorithms, including PSNR, MOVIE [11], ST-MAD [12], VMAF, and DeepVQA [10]. Each run yields a PLCC and an SROCC; in the table below, we report the median over the repeated experiments as the final performance.
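The evaluation protocol itself is easy to express in code. The sketch below assumes a hypothetical train_and_evaluate helper that trains on one split and returns (PLCC, SROCC) on the held-out videos:

```python
import numpy as np

def evaluate_protocol(reference_ids, train_and_evaluate, runs=20, seed=0):
    """Repeat a random 80/20 split of the reference videos, train from scratch
    each time, and report the median PLCC/SROCC over all runs."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(runs):
        ids = rng.permutation(reference_ids)
        split = int(0.8 * len(ids))
        train_refs, test_refs = ids[:split], ids[split:]
        results.append(train_and_evaluate(train_refs, test_refs))  # hypothetical helper
    plcc, srocc = np.median(np.array(results), axis=0)
    return plcc, srocc
```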

We can clearly see that the proposed C3DVQA significantly outperforms traditional algorithms such as PSNR, MOVIE, ST-MAD, and VMAF on both databases. It is worth noting that DeepVQA, which is also based on deep learning, also performs well. We attribute these improvements to two factors: 1) using CNNs to learn quality-related features outperforms traditional hand-crafted feature extraction; 2) both DeepVQA and C3DVQA learn joint spatio-temporal features, and explicitly using motion information better characterizes video quality.

Table 1. Performance comparison of different full-reference algorithms on the LIVE and CSIQ databases

Conclusion

This article briefly reviews the current state of video quality assessment in academia and industry. For complexity reasons, the industry still prefers less complex schemes based on image quality assessment. The downside is that such schemes cannot learn the spatio-temporal properties of the video as a whole; their accuracy suffers, but they offer a reasonable compromise between performance and complexity.

We propose a full-reference algorithm based on a 3D convolutional neural network. By learning joint spatio-temporal features of the video, it better addresses the loss of motion information. Compared with traditional feature extraction algorithms, our algorithm substantially improves accuracy.

Of course, this is just a beginning, and much work remains. We need a detailed complexity analysis, especially for scenarios where no GPU is available. We also want to know how the trained models perform on other databases, not only PGC videos but also UGC videos.

The good news is that we plan to open-source the training code and models to the industry, so that anyone can easily train and test on their own databases for specific video business scenarios. We also welcome any form of collaboration, whether contributing databases, contributing pre-trained models, or simply raising problems encountered in real business scenarios.

Acknowledgements

This work is based on a joint project between Tencent Multimedia Lab and Professor Li Ge's team at the Shenzhen Graduate School of Peking University.

References

1 Wang, Haiqiang et al. “VideoSet: A large-scale compressed video quality dataset based on JND measurement.” 2017.

2 ITU-T P.910. "Subjective video quality assessment methods for multimedia applications." 1999.

3 ITU-R BT.2022. “General viewing conditions for subjective assessment of quality of SDTV and HDTV television pictures on flat panel displays.” 2012.

4 Wang, Zhou et al. “Image quality assessment: from error visibility to structural similarity.” 2004.

5 Wang, Zhou et al. “Multiscale structural similarity for image quality assessment.” 2003.

6 Wolf, Stephen et al. “Video quality model for variable frame delay (VQM-VFD).” 2011.

7 Li, Zhi et al. "Toward a Practical Perceptual Video Quality Metric." 2016.

8 Phong Vu, et al. “ViS3: an algorithm for video quality assessment via analysis of spatial and spatiotemporal slices.” 2014.

9 Tran Du, et al. “Learning Spatiotemporal Features with 3D Convolutional Networks.” 2015.

10 Woojae Kim, et al. “Deep video quality assessor: From spatio-temporal visual sensitivity to a convolutional neural aggregation network.” 2018.

11 Seshadrinathan, Kalpana, et al. “Motion tuned spatiotemporal quality assessment of natural videos.” 2009.

12 Vu, Phong V. et al. "A spatiotemporal most-apparent-distortion model for video quality assessment." 2011.