Author: Zhao Shijie
1. Background
ByteDance's video services receive a large number of uploaded videos every day. On the server side we find that many of them are low-resolution, low-quality videos, owing to the uneven performance of users' capture devices. To improve the sharpness of such videos, we can apply super-resolution algorithms on the server side.
Super-resolution is a technique for increasing the resolution of images and videos. Resolution describes the level of detail an image holds, i.e. the number of pixels in each direction; the higher the resolution, the more detail the image can represent.
Fig.1. Images at different resolutions [10]
1.1 Super-resolution based on neural network
Taking SRCNN (Super-Resolution Convolutional Neural Network) as an example, we briefly introduce neural-network-based super-resolution. The network takes a low-resolution image as input and outputs a high-resolution image, in three main steps: patch extraction and representation, non-linear mapping, and reconstruction.
Fig.2. Schematic diagram of SRCNN [2]
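The three steps above map directly onto SRCNN's three convolutional layers. A minimal PyTorch sketch (layer sizes follow the 9-1-5 configuration from the paper; the input is assumed to be a bicubic-upsampled single-channel image):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRCNN(nn.Module):
    """Three-stage SRCNN: patch extraction, non-linear mapping, reconstruction."""
    def __init__(self):
        super().__init__()
        # 1) Patch extraction and representation: 9x9 conv -> 64 feature maps
        self.extract = nn.Conv2d(1, 64, kernel_size=9, padding=4)
        # 2) Non-linear mapping: 1x1 conv -> 32 feature maps
        self.map = nn.Conv2d(64, 32, kernel_size=1)
        # 3) Reconstruction: 5x5 conv back to one channel
        self.reconstruct = nn.Conv2d(32, 1, kernel_size=5, padding=2)

    def forward(self, x):
        x = F.relu(self.extract(x))
        x = F.relu(self.map(x))
        return self.reconstruct(x)

# The network is fully convolutional, so the spatial size is preserved
y = SRCNN()(torch.randn(1, 1, 33, 33))
```

Because the network operates on an already-upsampled input, the output has the same spatial size as the input.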
The network is typically trained with high-resolution images as labels; the low-resolution inputs are obtained by degrading them with Bicubic or Bilinear downsampling, and an L1 or L2 loss provides the supervision.
Fig. 3. Schematic diagram of SRCNN effect [2]
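The training recipe above can be sketched as follows (a toy example; `hr` stands in for a batch of high-resolution label images, and `pred` for the network output):

```python
import torch
import torch.nn.functional as F

hr = torch.rand(4, 1, 64, 64)               # high-resolution labels
# Degrade to obtain the low-resolution input (bicubic downsampling) ...
lr = F.interpolate(hr, scale_factor=0.5, mode='bicubic', align_corners=False)
# ... then upsample back so input and label share a spatial size, as in SRCNN
lr_up = F.interpolate(lr, scale_factor=2.0, mode='bicubic', align_corners=False)

pred = lr_up                                 # stand-in for network(lr_up)
loss_l1 = F.l1_loss(pred, hr)                # L1 supervision
loss_l2 = F.mse_loss(pred, hr)               # L2 (MSE) supervision
```

In a real training loop, `pred` would be the network output and either loss would be backpropagated through the model.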
A network trained this way significantly outperforms interpolation-based scaling methods such as Bicubic on the objective metric PSNR (peak signal-to-noise ratio), an important objective measure for super-resolution algorithms.
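PSNR is computed directly from the mean squared error between two images; a minimal NumPy version for 8-bit images:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')              # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 10.0)                # constant error of 10 -> MSE = 100
print(psnr(a, b))                        # ~28.13 dB
```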
Fig.4. Comparison of objective metrics of more super-resolution algorithms [8]
SRCNN was followed by much further research on super-resolution networks, such as EDSR and RCAN, but their overall frameworks and training methods remain largely the same as SRCNN's.
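For instance, EDSR stacks simplified residual blocks (batch normalization removed, with a small residual scaling factor for training stability). A sketch of one such block in PyTorch, using the 0.1 residual scale from the paper:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """EDSR-style residual block: conv-ReLU-conv plus a scaled skip connection."""
    def __init__(self, channels=64, res_scale=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.res_scale = res_scale

    def forward(self, x):
        return x + self.res_scale * self.body(x)

out = ResBlock()(torch.randn(2, 64, 16, 16))
```

The full EDSR model chains dozens of these blocks before a final upsampling stage.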
3. Subjective effect optimization of super-resolution
3.1 Poor results on real-world scenes
In practice, we often find that directly applying the deep-learning super-resolution networks above yields poor subjective results. For example, running the pre-trained EDSR network on the emoji images below produces output similar to Bicubic, with very limited improvement over the input. Below we introduce several ways to optimize the subjective quality of super-resolution networks.
Fig.5. Web images [9]: upsampling comparison between EDSR and Bicubic
3.2 Super-resolution network based on GAN
We first introduce the generative-adversarial approach. Super-resolution is a one-to-many task: a single low-resolution image may correspond to countless high-resolution images. Training with an L1 or L2 loss in this setting drives the network toward the average of all plausible high-resolution images, which looks blurry to the eye. GAN-based super-resolution methods such as SRGAN and ESRGAN counter this by adding an adversarial loss, which pushes the output toward the manifold of natural-looking images rather than their average.
Fig.6. Super resolution based on adversarial generative network [4]
Fig.7. Super resolution based on adversarial generative network [4]
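The generator-side objective in SRGAN/ESRGAN-style training combines a pixel loss with an adversarial term from a discriminator. A toy sketch (the discriminator here is a placeholder, and the loss weight is illustrative; the real methods also add a perceptual loss on VGG features):

```python
import torch
import torch.nn as nn

disc = nn.Sequential(                      # placeholder discriminator
    nn.Conv2d(1, 8, 3, padding=1), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1),
)

sr = torch.rand(2, 1, 32, 32, requires_grad=True)   # generator output
hr = torch.rand(2, 1, 32, 32)                        # ground truth

pixel_loss = nn.functional.l1_loss(sr, hr)
# Adversarial loss: push the discriminator to call the SR output "real" (label 1)
logits = disc(sr)
adv_loss = nn.functional.binary_cross_entropy_with_logits(
    logits, torch.ones_like(logits))
total = pixel_loss + 1e-3 * adv_loss                 # illustrative weighting
```

The discriminator itself is trained with the opposite labels, alternating with the generator.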
Figure 8 shows the subjective and objective performance of several super-resolution networks on the Set14 test set. Note that the PSNR metric does not track subjective quality: the PSNR of the GAN-based super-resolution algorithms is even lower than that of Bicubic.
Fig.8. Subjective and objective performance of adversarial generative network [4]
3.3 Training data optimization based on real downsampling
Another optimization point is the training data. The super-resolution training pairs discussed above are usually produced with Bicubic or another known downsampling method, whereas real-world low-resolution images rarely follow such a degradation. The resulting gap between the training data and real inputs makes the super-resolution results unsatisfactory. Here we introduce a realistic downsampling method, also based on an adversarial generative network. As shown in Fig. 9, for a high-resolution image we train a downsampling generator G and a discriminator D so that the low-resolution images produced by G are close to real low-resolution images, thereby obtaining a realistic downsampler G. With G in hand, we can use high-resolution images to generate a large number of training pairs that match the real downsampling degradation.
Fig.9. Kernelgan-based real down-sampling generation method [6]
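Once a realistic degradation has been estimated (KernelGAN ultimately yields an explicit blur kernel), training pairs can be generated by convolving the HR image with the kernel and subsampling with the scale-factor stride. A minimal NumPy sketch with a hypothetical 3×3 kernel:

```python
import numpy as np

def degrade(hr, kernel, scale):
    """Valid convolution with `kernel`, then subsampling with stride `scale`."""
    kh, kw = kernel.shape
    H, W = hr.shape
    out_h = (H - kh) // scale + 1
    out_w = (W - kw) // scale + 1
    lr = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y, x = i * scale, j * scale
            lr[i, j] = np.sum(hr[y:y + kh, x:x + kw] * kernel)
    return lr

hr = np.ones((16, 16))
kernel = np.full((3, 3), 1.0 / 9.0)   # hypothetical estimated kernel (normalized)
lr = degrade(hr, kernel, scale=2)
```

A production pipeline would vectorize this (e.g. with FFT-based convolution), but the logic is the same.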
3.4 Introduction of prior information
A representative method is SFTGAN, which takes a semantic segmentation map as the prior and uses Spatial Feature Transform (SFT) layers to modulate intermediate features according to the scene category.
Fig. 10. Schematic diagram of SFTGAN structure [5]
In practice, however, semantic information in real scenes is often too complex for a segmentation network to analyze and express, so combining a general segmentation network rarely brings an obvious subjective improvement. If we restrict ourselves to specific scenes, though, prior knowledge can help: DICGAN, for example, exploits facial landmark priors for face super-resolution.
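The prior is typically injected by modulating intermediate features with scale and shift maps derived from it. A sketch of an SFT-style layer (channel sizes are illustrative, not the ones from the paper):

```python
import torch
import torch.nn as nn

class SFTLayer(nn.Module):
    """Spatial Feature Transform: F' = gamma * F + beta, where gamma and beta
    are predicted from a condition map (e.g. a segmentation prior)."""
    def __init__(self, feat_ch=64, cond_ch=8):
        super().__init__()
        self.to_gamma = nn.Conv2d(cond_ch, feat_ch, 1)
        self.to_beta = nn.Conv2d(cond_ch, feat_ch, 1)

    def forward(self, feat, cond):
        return self.to_gamma(cond) * feat + self.to_beta(cond)

feat = torch.randn(1, 64, 16, 16)     # intermediate SR features
cond = torch.randn(1, 8, 16, 16)      # condition maps from a segmentation prior
out = SFTLayer()(feat, cond)
```

Because gamma and beta vary spatially, different regions (sky, grass, buildings) can be reconstructed with different texture statistics.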
Fig.11. Schematic diagram of DICGAN structure [1]
Fig.12. Subjective and objective performance of DICGAN in CelebA and Helen data sets [1]
3.5 Considering noise and other degradations in the input
In real life, low-resolution inputs tend to suffer from degradations other than downsampling, such as noise, compression artifacts, and blur. If these factors are ignored when training the super-resolution network, its subjective results will be poor. We should therefore introduce the corresponding degradations during training, according to the application scenario.
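A toy degradation pipeline in NumPy (a box blur standing in for camera blur, plus additive Gaussian noise; a real pipeline would also include, e.g., JPEG compression):

```python
import numpy as np

def add_degradation(img, noise_sigma=5.0, seed=0):
    """Blur with a 3x3 box filter, add Gaussian noise, clip to the 8-bit range."""
    rng = np.random.default_rng(seed)
    x = img.astype(np.float64)
    # 3x3 box blur via an edge-padded neighborhood average
    p = np.pad(x, 1, mode='edge')
    x = sum(p[i:i + x.shape[0], j:j + x.shape[1]]
            for i in range(3) for j in range(3)) / 9.0
    x += rng.normal(0.0, noise_sigma, x.shape)   # additive Gaussian noise
    return np.clip(x, 0, 255)

clean = np.full((32, 32), 128.0)
degraded = add_degradation(clean)
```

During training, such a function would be applied on the fly to each HR image before downsampling, with the degradation strengths randomized per sample.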
In actual server-side use, we first run quality analysis on the input. Our in-house no-reference quality assessment algorithm estimates the input video's quality along multiple dimensions, and the video is then processed by the super-resolution model trained with the matching degradation.
Fig.13. Our self-developed no-reference video quality assessment system
4. Summary
Combining the optimization points above, we revisit the low-resolution emoji images of Figure 5. The input is a face scene suffering from severe low-quality degradation. After applying the series of algorithms discussed above, the subjective quality of the result is greatly improved.
This paper has discussed subjective-quality optimization for deep-learning-based super-resolution, covering generative models, the introduction of semantic information, training-data optimization, and accounting for input degradation. As the video business keeps expanding, we receive a steady stream of submissions every day; the videos are diverse, their quality is uneven, and users demand ever-higher clarity. To improve video resolution, we need to combine multiple algorithmic techniques and tune them for each specific scene, so as to bring users a better experience.
Fig.14. Left: Bicubic upsampling result; right: our super-resolution optimization result
5. References
[1] C. Ma, Z. Jiang, Y. Rao, J. Lu and J. Zhou. Deep Face Super-Resolution With Iterative Collaboration Between Attentive Recovery and Landmark Estimation, 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, pp. 5568-5577, 2020.
[2] C. Dong, C. C. Loy, K. He and X. Tang. Image Super-Resolution Using Deep Convolutional Networks, in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295-307, Feb. 2016.
[3] C. Ledig et al., Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network, 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp. 105-114, 2017.
[4] X. Wang et al. ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks, ECCV 2018 Workshops.
[5] X. Wang, K. Yu, C. Dong and C. Change Loy. Recovering Realistic Texture in Image Super-Resolution by Deep Spatial Feature Transform, 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 606-615, 2018.
[6] S. Bell-Kligler, A. Shocher and M. Irani. Blind Super-Resolution Kernel Estimation using an Internal-GAN, NeurIPS 2019.
[7] B. Lim, S. Son, H. Kim, S. Nah and K. M. Lee. Enhanced Deep Residual Networks for Single Image Super-Resolution, CVPRW 2017.
[8] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong and Y. Fu. Image Super-Resolution Using Very Deep Residual Channel Attention Networks, ECCV 2018.
[9] baike.baidu.com/item/ Black question mark
[10] zh.wikipedia.org/wiki/ Resolution