Article | He Ming, audio and video algorithm engineer at NetEase Yunxin

Vision is the main channel through which humans obtain information, and vast amounts of video are produced and transmitted every day. Uncompressed video occupies enormous storage space and transmission bandwidth. Take a common 30 fps HD video as an example: a YUV420 stream stored in AVI format grows by more than 2 GB per minute and requires about 40 MB/s of transmission bandwidth. The video we watch on phones, computers and other electronic devices is therefore compressed before being transmitted and stored. The technology that compresses this video information is called video codec technology.
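The bandwidth figures above can be checked with a short calculation. This sketch assumes 720p resolution for the "HD" example and ignores container overhead:

```python
# Raw bandwidth of uncompressed 720p YUV420 video at 30 fps.
# In YUV420, every four luma samples share one U and one V sample,
# so each pixel costs 1.5 bytes on average.
WIDTH, HEIGHT, FPS = 1280, 720, 30
BYTES_PER_PIXEL = 1.5  # 1 byte Y + 0.25 byte U + 0.25 byte V

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL  # one frame
bandwidth = frame_bytes * FPS                   # bytes per second
per_minute = bandwidth * 60                     # bytes per minute

print(f"{bandwidth / 2**20:.1f} MB/s")    # ~39.6 MB/s
print(f"{per_minute / 2**30:.1f} GB/min") # ~2.3 GB/min
```

This matches the article's figures of roughly 40 MB/s and over 2 GB per minute.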

1. Color space

Our eyes obtain visual information through rods and cones: rods sense changes in light and shade, giving brightness information, while cones sense color. Following this biological principle, computers capture video signals as separate luminance and chrominance components. Because the human eye contains far more rod cells than cone cells, it is more sensitive to brightness than to color, so the YUV420 format is commonly used to capture video. As shown in Figure 1, every four luma pixels (Y) share one group of chroma samples (U and V).

Figure 1 YUV420 color space

The captured pixels are stored in binary. To cover the color space of human vision, each component is generally stored with 256 levels; that is, luma and chroma values range from 0 to 255. Different combinations of values represent different colors: for example, a pixel is pink when Y, U and V are all 255 and dark green when all three are 0. In YUV420, each pixel therefore needs 1.5 bytes on average, and an image contains a huge number of them: a 720p frame has about 920,000 pixels, a 1080p frame about 2.07 million, and one second of video contains 30 such frames. Raw YUV data of this size must be compressed before it can be stored and transmitted.
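The color claims above can be verified by converting YUV to RGB. This sketch assumes the common full-range BT.601 conversion formulas:

```python
def yuv_to_rgb(y, u, v):
    """Convert one YUV pixel to RGB using the full-range BT.601 formulas."""
    r = y + 1.402 * (v - 128)
    g = y - 0.344 * (u - 128) - 0.714 * (v - 128)
    b = y + 1.772 * (u - 128)
    clamp = lambda x: max(0, min(255, round(x)))  # keep results in [0, 255]
    return clamp(r), clamp(g), clamp(b)

print(yuv_to_rgb(255, 255, 255))  # (255, 121, 255) -- pink/magenta
print(yuv_to_rgb(0, 0, 0))        # (0, 135, 0)     -- dark green
```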

2. Compression principle

Compressing video differs from compressing generic files because video contains a great deal of redundant information. As shown in Figure 2, the similarity between adjacent frames constitutes temporal redundancy, and the similarity between adjacent blocks within the same frame constitutes spatial redundancy. Because human eyes are more sensitive to low-frequency information, there is also perceptual redundancy.

Figure 2 redundancy information in the video

The basic principle of video compression is to remove this redundancy, shrinking the video by a factor of 300 to 500. The common compression techniques are predictive coding, transform coding and entropy coding. As shown in Figure 3, passing the input video through each coding module in turn and emitting a bitstream is the video encoding process; the corresponding entropy decoding, inverse transform and prediction decoding steps restore the bitstream to a YUV420 video stream. A video codec can thus be viewed as a compressor and decompressor, but because compression algorithms differ, the decoding algorithm must match the encoding algorithm; each matched pair of encoding and decoding algorithms is called a video standard.

Figure 3 video coding technology

Among the video standards in common use today, the H.26x series is the benchmark, and H.264 is the most widely deployed; thanks to the popularity of the open-source encoder x264, it is sometimes informally called the x264 standard. The development of the H.26x series has consistently led the industry, and the latest H.266/VVC standard incorporates many new techniques, which can be summarized in the following aspects:

Figure 4 Block partitioning in H.265

Block partitioning: With the exception of end-to-end deep-learning approaches, traditional video codecs process video block by block, and the trend is clear: the largest blocks keep getting bigger, the smallest blocks keep getting smaller, and the number of block shapes keeps growing. For relatively static regions, large blocks improve compression efficiency: one or two flag bits or a handful of residual values can represent a large block, greatly compressing the picture. For regions with more motion, small blocks improve picture quality by expressing motion details more completely. To separate moving and static regions more precisely, newer partitioning schemes add rectangular blocks of various shapes to the original square partitions. In engineering practice, increasingly complex partitioning consumes a great deal of computation, so many fast algorithms try to predict the partitioning mode; applying machine-learning and deep-learning models to this prediction can obtain the partitioning quickly with little loss in quality.
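The split-or-keep decision can be illustrated with a toy quadtree partitioner. Real encoders decide by rate-distortion cost; this sketch substitutes a simple pixel-variance threshold (an assumption for illustration) to show how flat regions stay as large blocks while detailed regions are subdivided:

```python
def quadtree_partition(block, x, y, size, min_size=8, threshold=100.0):
    """Recursively split a square block while its pixel variance is high.

    `block` is a 2-D list of luma samples; returns a list of
    (x, y, size) leaf blocks. Variance thresholding stands in for the
    rate-distortion decision a real encoder makes.
    """
    pixels = [block[y + j][x + i] for j in range(size) for i in range(size)]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    if size <= min_size or var < threshold:
        return [(x, y, size)]  # flat region or minimum size: one leaf
    half = size // 2
    leaves = []
    for dy in (0, half):       # recurse into the four quadrants
        for dx in (0, half):
            leaves += quadtree_partition(block, x + dx, y + dy, half,
                                         min_size, threshold)
    return leaves

# A 16x16 frame: flat left half, busy right half.
frame = [[0 if i < 8 else (i * 31 + j * 17) % 256 for i in range(16)]
         for j in range(16)]
print(quadtree_partition(frame, 0, 0, 16))
```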

Figure 5 Intra-frame prediction

Intra prediction: Intra prediction belongs to predictive coding. In a video sequence, some frames and blocks cannot obtain their prediction from a reference frame; these are called I frames or intra-predicted blocks. All blocks in an I frame are intra predicted, but intra-predicted blocks can also appear in P frames and B frames. Intra prediction works as follows: for the block being predicted, the encoder first gathers the reconstructed reference pixels bordering it (padding by extension at frame edges), uses angular or planar prediction to extrapolate a candidate for the current block from those pixel values, compares each candidate with the original image, and selects the prediction mode with the smallest loss. Because the pixels used for intra prediction all come from the current frame, with no reference-frame information, intra prediction is typically used for the first frame of a sequence or for regions where the video content changes drastically.
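The mode-selection loop described above can be sketched in a few lines. This is a simplified illustration, not the standard's exact predictors: it implements only DC, vertical and horizontal modes and uses the sum of absolute differences (SAD) as the loss measure:

```python
def predict_intra(above, left, block, mode):
    """Predict an NxN block from the reference pixels bordering it.

    `above` and `left` each hold N reconstructed neighbour pixels;
    the modes are simplified DC / vertical / horizontal predictors.
    """
    n = len(block)
    if mode == "dc":           # fill with the mean of the neighbours
        dc = (sum(above) + sum(left) + n) // (2 * n)
        return [[dc] * n for _ in range(n)]
    if mode == "vertical":     # copy the row above downwards
        return [list(above) for _ in range(n)]
    if mode == "horizontal":   # copy the left column rightwards
        return [[left[j]] * n for j in range(n)]
    raise ValueError(mode)

def best_intra_mode(above, left, block):
    """Pick the mode with the smallest sum of absolute differences."""
    def sad(pred):
        return sum(abs(p - o) for pr, br in zip(pred, block)
                   for p, o in zip(pr, br))
    return min(("dc", "vertical", "horizontal"),
               key=lambda m: sad(predict_intra(above, left, block, m)))

# A block with strong vertical stripes is best predicted vertically.
above = [10, 200, 10, 200]
left = [100, 100, 100, 100]
block = [[10, 200, 10, 200] for _ in range(4)]
print(best_intra_mode(above, left, block))  # vertical
```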

Figure 6 Inter-frame prediction

Inter prediction: The counterpart of intra prediction is inter prediction; both belong to predictive coding. The reference image for inter prediction comes from a reference frame, so inter prediction cannot be used for the first frame or when the reference frame is missing. Its key steps are motion search and motion compensation: motion search finds the image block on the reference frame closest to the current block and generates a motion vector, while motion compensation reconstructs the current frame from the reference-frame information. In the latest inter-prediction techniques, motion information can describe translation, scaling and rotation. Because a motion vector does not necessarily point at an integer-pixel position, motion compensation also involves sub-pixel interpolation. Inter prediction can dramatically improve compression: for example, if the reference frame contains a highly similar block, the current block can be coded in skip mode, where a single flag bit stands in for all the YUV information of the original block.
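Motion search can be illustrated with the simplest strategy, an exhaustive full search minimizing SAD (real encoders use faster patterns such as diamond search, but the objective is the same):

```python
def full_search(ref, cur, bx, by, n, search_range):
    """Exhaustive motion search: find the motion vector (dx, dy) whose
    reference block minimises SAD against the current NxN block."""
    h, w = len(ref), len(ref[0])
    best = (0, 0, float("inf"))
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + n > w or ry + n > h:
                continue  # candidate block falls outside the frame
            sad = sum(abs(ref[ry + j][rx + i] - cur[by + j][bx + i])
                      for j in range(n) for i in range(n))
            if sad < best[2]:
                best = (dx, dy, sad)
    return best  # (dx, dy, sad)

# The reference frame holds a bright square; in the current frame it
# has moved 2 pixels right and 1 pixel down.
ref = [[255 if 2 <= i < 6 and 2 <= j < 6 else 0 for i in range(16)]
       for j in range(16)]
cur = [[255 if 4 <= i < 8 and 3 <= j < 7 else 0 for i in range(16)]
       for j in range(16)]
print(full_search(ref, cur, 4, 3, 4, 4))  # (-2, -1, 0): perfect match
```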

Figure 7 16×16 DCT transform kernel

Transform and quantization: Transform and quantization are used together. As noted above, human eyes are insensitive to high-frequency information, so that information should be compressed away, which is easiest to do in the frequency domain; hence the image must be transformed. Common transforms include the Hadamard transform, the integer DCT and the integer DST. Because predictive coding is applied first, the transform usually operates on residual information. Depending on the target compression rate, the transformed coefficients are quantized, keeping only the low-frequency information to which human eyes are more sensitive. During decoding, the corresponding inverse quantization and inverse transform recover the (lossy) residual coefficients.
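The transform-then-quantize step can be sketched with the 4×4 integer core transform from H.264. This is a simplification: real encoders fold the transform's normalisation factors into the quantisation tables, while this sketch applies only the core matrix and a crude uniform quantiser:

```python
# 4x4 integer core transform matrix from H.264 (W = C X C^T).
C = [[1,  1,  1,  1],
     [2,  1, -1, -2],
     [1, -1, -1,  1],
     [1, -2,  2, -1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def transform_quantize(residual, qstep):
    """Apply the core transform, then uniform quantisation: small
    (mostly high-frequency) coefficients are driven to zero."""
    coeffs = matmul(matmul(C, residual), transpose(C))
    return [[c // qstep for c in row] for row in coeffs]

# A flat residual block keeps only its DC coefficient after quantisation.
flat = [[4] * 4 for _ in range(4)]
print(transform_quantize(flat, 16))  # only the top-left entry survives
```

Note how all the energy of the flat block concentrates in the single DC coefficient, which is exactly what makes the transform useful before entropy coding.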

Figure 8 CABAC encoder framework

Entropy coding: The flags and residual coefficients still need a further layer of coding to compress the information. Some key syntax elements can be compressed with methods such as Exp-Golomb or run-length coding, while the bulk of the residual coefficients and frame-level information is now usually compressed with context-model-based entropy coding. The basic principle of entropy coding is to spend more bits on low-probability symbols and fewer bits on high-probability symbols; with a good context model, most symbols become high-probability and are therefore cheap to code. Unlike predictive coding and transform quantization, entropy coding is lossless.
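The "fewer bits for likely symbols" principle is easy to see in the unsigned Exp-Golomb code ue(v) used for many H.26x syntax elements, where small (more probable) values get shorter codewords:

```python
def exp_golomb_encode(v):
    """Unsigned Exp-Golomb code ue(v): write v + 1 in binary, preceded
    by one leading zero per bit after the first."""
    bits = bin(v + 1)[2:]                  # binary string of v + 1
    return "0" * (len(bits) - 1) + bits

for v in range(5):
    print(v, exp_golomb_encode(v))
# 0 -> 1, 1 -> 010, 2 -> 011, 3 -> 00100, 4 -> 00101
```

A value of 0 costs a single bit, while larger, rarer values cost progressively more; the code is also prefix-free, so the decoder can parse it without delimiters.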

Figure 9 The four edge-offset modes of the SAO loop filter

Loop filtering: Because subsequent frames are compensated from the information of earlier frames, any loss or error in a reference frame propagates through motion compensation to the whole sequence, and within a frame it can spread across the whole picture. To reduce this loss, each frame is post-processed after encoding; the filters that process these frames, bringing them closer to the original video sequence, are called loop filters. Many deep-learning-based post-processing techniques have now been applied to loop filtering, with good results in both encoding and decoding.
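As a concrete example of a loop filter, the SAO edge-offset classification from HEVC (Figure 9) compares each pixel with its two neighbours along one of four directions and assigns it to a category; the decoder then adds a signalled offset per category. A minimal sketch of the classification rule:

```python
def sao_edge_category(a, c, b):
    """Classify pixel c against its neighbours a and b along one SAO
    edge-offset direction (horizontal, vertical, or a diagonal)."""
    if c < a and c < b:
        return 1   # local valley
    if (c < a and c == b) or (c == a and c < b):
        return 2   # concave corner
    if (c > a and c == b) or (c == a and c > b):
        return 3   # convex corner
    if c > a and c > b:
        return 4   # local peak
    return 0       # monotonic or flat: no offset applied

print(sao_edge_category(100, 90, 100))   # 1: valley, likely ringing
print(sao_edge_category(90, 100, 90))    # 4: peak
print(sao_edge_category(90, 100, 110))   # 0: monotonic edge, untouched
```

Valleys are nudged up and peaks nudged down, which suppresses the ringing artifacts that transforms and quantization introduce around edges.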

Figure 10 WPP parallel processing

Beyond the techniques above, in engineering practice rate control, parallelization and instruction-set optimization also affect encoder performance. Video codec technology is an integration of many algorithms, and different combinations of them form the various video coding standards. Besides the H.26x standards, there are also the AV1 standard from the Alliance for Open Media, the AVS standards, and others.

3. Challenges and development of video coding technology

Judging from current technical requirements, future video coding must face the challenges of higher resolutions, higher frame rates, wider color gamuts and HDR video. At the same time, facing new forms of video content such as panoramic video, point clouds and deep-learning feature maps, video coding technology must keep pace with the times and keep developing. Existing techniques are still flourishing, and future techniques are already on the horizon.

Live preview

Video codec technology has always been at the core of video content applications: collecting and distributing video across all kinds of platforms and channels involves video codecs. In RTC business scenarios, questions such as how to build an efficient, fast video codec engine, how to optimize and improve existing codec technology, how to implement a private protocol on top of public protocols, and how to restructure the codec framework all deserve attention.

At 19:30 tonight, He Ming, an audio and video algorithm engineer at NetEase Yunxin, will present in detail NetEase Yunxin's optimizations and practice of codec technology in RTC business scenarios, as well as its future development directions.

For more technical content, follow the [NetEase Smart Enterprise Technology+] official account.