This article comes from the RTC developer community. The author, Xiaoran Wu, is a senior video engineer at Sonnet who focuses on video codecs and related research; his technical interests include multimedia architecture and deep learning.

After AlphaGo defeated Lee Sedol and Ke Jie, more industries began trying to optimize existing technical solutions with machine learning. Machine learning has been applied to real-time audio and video for years; the real-time image recognition we shared earlier is just one application. We can also use deep learning for super-resolution. This time we will cover the basic deep-learning framework for super-resolution and the various network models derived from it, some of which perform well in real time.

Machine learning and deep learning

Developers with little exposure to machine learning and deep learning may confuse the two, or even assume that machine learning simply is deep learning. In fact, one picture makes the distinction easy.

The idea of artificial intelligence dates to the 1950s, followed by early applications such as chess programs. In the 1970s, however, limited hardware performance and a lack of training data sent artificial intelligence into a trough. Artificial intelligence covers a great deal: machine learning, scheduling algorithms, expert systems and so on. Only in the 1980s did more machine learning applications emerge, such as using algorithms to analyze data and make judgments or predictions. Machine learning in turn includes decision trees, neural networks and more. Deep learning is a machine learning method that grew out of neural networks.

What is super resolution?

Super-resolution is a concept rooted in the human visual system. David Hubel and Torsten Wiesel, winners of the 1981 Nobel Prize in Physiology or Medicine, discovered that the human visual system processes information hierarchically. The first layer takes in the raw input: when people see a face, they first perceive edges, points and lines. The second layer then identifies basic elements of the image, such as the eyes, ears and nose. Finally an object model is formed, in this case a complete face.

The convolutional neural network in deep learning (shown in the picture below) imitates this processing in the human visual system, which is why computer vision is one of the best applications of deep learning. Super-resolution is a classic problem in computer vision.

Super-resolution is a method of improving image resolution in software or hardware. Its core idea is to trade temporal bandwidth for spatial resolution. Simply put, when I cannot capture one ultra-high-resolution image, I can capture several images instead and combine those low-resolution images into one high-resolution image. This process is called super-resolution reconstruction.
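As a minimal illustration of this time-for-resolution trade (a toy sketch, not an actual reconstruction algorithm), fusing several noisy captures of the same scene recovers a cleaner estimate than any single capture does:

```python
import numpy as np

rng = np.random.default_rng(0)
scene = np.linspace(0.0, 1.0, 64)             # idealized 1-D "scene"
frames = [scene + rng.normal(0, 0.1, 64)      # each capture adds sensor noise
          for _ in range(16)]

single_err = np.mean((frames[0] - scene) ** 2)
fused_err = np.mean((np.mean(frames, axis=0) - scene) ** 2)
# averaging 16 frames cuts the noise power roughly 16-fold
assert fused_err < single_err
```

Real multi-frame super-resolution goes further: it aligns the sub-pixel shifts between frames before fusing them, rather than simply averaging.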

Why can super resolution improve image resolution by taking more images?

The answer is jitter. Camera anti-shake deals with the obvious jitter, but tiny jitter is always present. Between successive images of the same scene there are subtle, sub-pixel differences. These tiny shifts actually carry additional information about the scene, and combining the images produces a clearer result.

One might ask: now that phones carry 20-megapixel front and rear cameras, why do we still need super-resolution? Does the technology have little use left?

Not at all. Anyone who knows photography understands why: on the same sensor, the higher the resolution of the captured image, the smaller the area of each individual photosite, and the less light each one gathers. Once pixel density passes a certain point, noise increases sharply and directly degrades image quality. Super-resolution sidesteps this problem. It has many applications, for example:

  • Digital HDTV: improving the resolution of lower-resolution content

  • Microscopic imaging: Synthesis of a series of low resolution microscopic images to obtain high resolution images

  • Satellite image: used for remote sensing satellite imaging to improve image accuracy

  • Video Recovery: This technique can be used to recover video, such as old movies

However, in many cases we have only a single image and cannot capture several, so how do we do super-resolution then? That is where machine learning comes in. A typical example is the “black technology” Google presented in 2017, which uses machine learning to remove the mosaic from images. Of course, this technique has limitations: the neural network was trained on facial images, so if the mosaicked image you feed it is not a face, it cannot be restored.

Principle of super-resolution neural network

The Super-Resolution Convolutional Neural Network (SRCNN) was the first deep learning model applied to super-resolution. Its principle is simple: a three-layer neural network consisting of:

  • Feature extraction: the low-resolution image is first upscaled into a blurry image by bicubic interpolation, and image features are extracted from it; the channel count is 3, the convolution kernel size is F1 × F1, and the number of kernels is N1.

  • Nonlinear mapping: low-resolution features are mapped to high-resolution ones; the convolution kernel size is 1 × 1;

  • Image reconstruction: details are restored to obtain a clear high-resolution image; the convolution kernel size is F3 × F3.
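The three layers above are small enough to count by hand. As a back-of-the-envelope sketch (not the authors' code), the weight count for the classic 9-1-5 configuration on a single luminance channel reproduces the roughly 8,032 weights reported for SRCNN:

```python
def srcnn_weights(c, f1, n1, f2, n2, f3):
    """Weight count (biases excluded) of the three SRCNN conv layers."""
    extraction = c * f1 * f1 * n1       # patch extraction: c -> n1 feature maps
    mapping = n1 * f2 * f2 * n2         # non-linear mapping: n1 -> n2
    reconstruction = n2 * f3 * f3 * c   # reconstruction back to c channels
    return extraction + mapping + reconstruction

# 9-1-5 network on one channel: 5184 + 2048 + 800 weights
print(srcnn_weights(1, 9, 64, 1, 32, 5))  # 8032
```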

Hyper-parameter tuning is a relatively mysterious part of neural networks, and also the most criticized one. Many people find tuning reminiscent of traditional Chinese medicine: it usually lacks a theoretical basis. Below are examples of the training time and PSNR (a metric of picture quality; higher is better) obtained with N1 set to different values.

In training, using Mean Squared Error (MSE) as the loss function helps achieve a higher PSNR.
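The connection is direct: PSNR is a monotone decreasing function of MSE, so minimizing MSE maximizes PSNR. A minimal implementation for 8-bit images might look like:

```python
import math

def psnr(mse, max_val=255.0):
    """Peak signal-to-noise ratio in dB; lower MSE means higher PSNR."""
    return 10.0 * math.log10(max_val ** 2 / mse)

# an MSE one-hundredth of the peak power corresponds to 20 dB
print(round(psnr(255.0 ** 2 / 100.0), 1))  # 20.0
```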

How do the trained results look? The following table compares several traditional methods with SRCNN. The image sets are in the left column, and each method's training time and peak signal-to-noise ratio are listed on the right. Although traditional methods beat deep learning on a few images, overall deep learning comes out slightly ahead while even taking less time.

They say a picture is worth a thousand words, so what do the actual results look like? Consider the two groups of pictures below. The first image in each group is the original low-resolution image, followed by high-resolution enlargements produced by the different methods. Compared with traditional methods, SRCNN produces clearer edges and recovers details better. This was the original super-resolution deep learning model.

Nine super-resolution neural network models

SRCNN was the first super-resolution neural network model. After it appeared, many more super-resolution network models followed. Let's look at a few:

FSRCNN

Compared with SRCNN, this method does not apply bicubic interpolation to the original image first; it processes the small low-resolution image directly. After features are extracted, the feature maps are shrunk, and then mapping, expanding and deconvolution layers produce the high-resolution image. The benefit is that working on shrunken feature maps reduces training time. Moreover, if you need outputs at different resolutions, you can retrain only the deconvolution layer, which saves even more time.
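The upscaling in this design happens in the final deconvolution (transposed convolution) layer, whose output size follows the standard formula out = (in − 1) · stride − 2 · pad + kernel; setting the stride equal to the scale factor does the enlargement. A quick sketch (the kernel and padding values here are illustrative, not FSRCNN's exact hyper-parameters):

```python
def deconv_out(size, kernel, stride, pad):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * pad + kernel

# a 32-pixel feature map becomes 96 pixels for 3x super-resolution
print(deconv_out(32, kernel=9, stride=3, pad=3))  # 96
```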

ESPCN

This model is trained on the small image, and then r² channels are extracted. For example, to enlarge the image 3 times, r (the scale factor) is 3 and the channel count is 9. Each pixel's 9 channel values are then rearranged into a 3 × 3 block of pixels, so that every input pixel expands into a small matrix, achieving the super-resolution effect.
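This channel-to-pixel rearrangement (sub-pixel convolution, often called pixel shuffle) can be sketched in a few lines of NumPy, here for a single image in (channels, height, width) layout:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange an (r*r, H, W) tensor into an (H*r, W*r) image."""
    c, h, w = x.shape
    assert c == r * r
    # split the channels into an r x r sub-pixel grid, then interleave
    return x.reshape(r, r, h, w).transpose(2, 0, 3, 1).reshape(h * r, w * r)

# one 4-channel "pixel" becomes a 2x2 block of output pixels
x = np.arange(4).reshape(4, 1, 1)
print(pixel_shuffle(x, 2).tolist())  # [[0, 1], [2, 3]]
```

Deep learning frameworks ship this operation built in (e.g. PyTorch's `nn.PixelShuffle`); the sketch above only shows the index gymnastics.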

The experimental results for real-time video super-resolution are also very good: tripling the resolution of 1080p HD video takes 0.435 s per frame with SRCNN but only 0.038 s with ESPCN.

VDSR

This was one of the award-winning models of 2016. Those of us who work on video codecs all know that there are residuals between images. VDSR observes that the low-frequency components of the original low-resolution image and the high-resolution image are almost identical; what is missing is the high-frequency component, i.e. the image detail. So training only needs to target the high-frequency components.

Its input is therefore split into two parts: one path passes the whole original image through unchanged, while the other is trained to produce the residual; adding the two together yields the high-resolution image. This greatly accelerates training and improves convergence.
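The intuition that the residual carries only small high-frequency detail is easy to check numerically. In this sketch a crude nearest-neighbour upscale stands in for the interpolated input; a real VDSR would predict the residual with its deep network:

```python
import numpy as np

# a smooth synthetic "high-resolution" image (an 8x8 gradient)
hr = np.add.outer(np.linspace(0, 1, 8), np.linspace(0, 1, 8))
lr = hr[::2, ::2]                                     # downsampled version
up = np.repeat(np.repeat(lr, 2, axis=0), 2, axis=1)   # crude upscaling

residual = hr - up   # what VDSR-style training would target
# the residual is much smaller than the image itself
assert np.abs(residual).mean() < np.abs(hr).mean()
```

Because the target values are near zero, the network can use a higher learning rate without diverging, which is where the speed-up comes from.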

DRCN

It is still divided into three stages, but the nonlinear mapping stage uses a recursive network: the data loops through the same layer multiple times. Unrolling this loop is equivalent to a series of convolution layers that all share one set of parameters.
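The point of the recursion is weight sharing: however many times the data loops through the mapping layer, the parameter count stays that of a single layer. A tiny sketch, with a stand-in function playing the role of the shared-weight layer:

```python
def recurse(x, layer, depth):
    """Apply the same layer (same weights) `depth` times, as in DRCN."""
    for _ in range(depth):
        x = layer(x)
    return x

# a stand-in for one shared-weight convolution layer
layer = lambda x: 0.5 * x + 1.0
print(recurse(8.0, layer, 3))  # 2.75, i.e. layer(layer(layer(8.0)))
```

Unrolled, `recurse(x, layer, 3)` is exactly a chain of three identical layers, which is why the recursion adds depth without adding parameters.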

RED

Each convolution layer is paired with a deconvolution layer; put simply, the image is encoded and then decoded. The benefits are that it mitigates the vanishing gradient problem and restores a cleaner image. Its idea is similar to VDSR: the middle convolution and deconvolution layers are trained on the residual between the original and target images, and the original image is added to the trained output to obtain the high-resolution result.

DRRN

In this model you can see traces of both DRCN and VDSR. It uses a deeper network structure to improve performance, with many image-enhancement layers: a blurry image is enhanced layer by layer until it becomes sharp and high-definition. The source code can be found on GitHub under the user Tyshiwo.

LapSRN

LapSRN is special in that it introduces a pyramid-style hierarchical network: each level only doubles the size of its input and adds the residual to produce its result. When an image is magnified 8 times, this staged processing is much more efficient. Moreover, each level yields a usable output of its own.
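Because each level only doubles the resolution, an 8x result is just three ×2 stages, and every stage produces an intermediate image that can be used directly. A sketch of the size progression:

```python
def lapsrn_sizes(base, total_scale):
    """Output sizes for a pyramid of x2 levels, one entry per level."""
    sizes, size, scale = [], base, 1
    while scale < total_scale:
        size *= 2          # each pyramid level upscales by exactly 2
        scale *= 2
        sizes.append(size)
    return sizes

# a 60-pixel input yields 2x, 4x and 8x outputs along the way
print(lapsrn_sizes(60, 8))  # [120, 240, 480]
```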

SRDenseNet

It introduces the Dense Block structure: the features trained in each layer are passed on to the next, and all features are concatenated together. The advantages are that it alleviates the vanishing gradient problem and reduces the number of parameters; in addition, later layers can reuse the features obtained earlier without retraining them.
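Concatenating every earlier layer's features means the channel count grows linearly with depth, while each layer only has to learn its own small increment (the growth rate). A sketch of the bookkeeping, with illustrative numbers rather than SRDenseNet's exact configuration:

```python
def dense_block_channels(c_in, growth, layers):
    """Input channel count seen by each layer of a dense block."""
    counts = []
    for i in range(layers):
        # every previous layer's `growth` output channels are concatenated
        counts.append(c_in + i * growth)
    return counts

print(dense_block_channels(16, 16, 4))  # [16, 32, 48, 64]
```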

SRGAN

It uses perceptual and adversarial losses to improve the realism of the restored images.

This model contains two networks: a generator and a discriminator. The former generates a high-resolution image, and the latter judges whether that image is an original. Whenever the verdict is “no”, the generator trains and generates again, until it can fool the discriminator.
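This tug-of-war can be written down as two coupled losses. The sketch below uses the standard adversarial objective only (SRGAN additionally adds a perceptual loss), with D(·) denoting the discriminator's probability that an image is real:

```python
import math

def discriminator_loss(d_real, d_fake):
    """The discriminator wants real -> 1 and fake -> 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """The generator wants its fakes to be scored as real."""
    return -math.log(d_fake)

# a confident discriminator: low loss for D, high loss for G,
# which pressures the generator to improve
d = discriminator_loss(0.9, 0.1)
g = generator_loss(0.1)
assert d < g
```

Training alternates between the two: one step lowers the discriminator loss, the next lowers the generator loss, until neither can improve at the other's expense.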

These neural network models can all be applied to video processing, though real applications must still consider many factors, such as system platform, hardware configuration and performance optimization. Beyond super-resolution, machine learning and real-time audio/video intersect in many other scenarios, such as audio/video experience optimization, porn-content detection and QoE improvement. We will invite technical experts from Google, Meitu, Sogou and other companies to share more hands-on experience at the RTC 2018 Real-Time Internet Conference in September.