preface

The H264 video compression algorithm is now undoubtedly the most widely used of all video compression technologies. With the release of open-source implementations such as x264, OpenH264, and FFmpeg, most users no longer need to study the details of H264, which significantly lowers the barrier to using it.

However, to use H264 well, we still need to understand its basic principles. Today we'll walk through the basics of H264.

H264 overview

H264 mainly uses the following methods to compress video data:

  • Intra-frame prediction removes spatial data redundancy.
  • Inter-frame prediction (motion estimation and compensation) removes temporal data redundancy.
  • An integer discrete cosine transform (DCT) converts spatially correlated data into independent frequency-domain data, which is then quantized.
  • CABAC entropy coding losslessly compresses the result.

After compression, frames are divided into I frames, P frames, and B frames:

  • I frame: key frame, compressed with intra-frame compression only.
  • P frame: forward-predicted frame; it references only previously encoded frames and uses inter-frame compression.
  • B frame: bidirectionally predicted frame; it references both the frames before and after it and uses inter-frame compression.

In addition to I/P/B frames, there is the group of pictures (GOP).

  • GOP: the sequence of images between two I frames; each such sequence contains exactly one I frame. As shown below:
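As a toy illustration of this structure (the frame pattern below is an arbitrary example in display order, and the helper function is hypothetical, not part of any H264 API):

```python
# A GOP modeled as a list of frame types; the I frame opens the sequence.
# The pattern below (display order) is a common example, not mandated by H264.
gop = ["I", "B", "B", "P", "B", "B", "P", "B", "B", "P"]

def is_valid_gop(frames):
    """A GOP starts with an I frame and contains no other I frames."""
    return len(frames) > 0 and frames[0] == "I" and "I" not in frames[1:]

print(is_valid_gop(gop))         # True
print(is_valid_gop(["P", "B"]))  # False: no leading I frame
```

In practice, encoders transmit frames in decoding order rather than display order, because B frames reference frames that come after them in display order.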

Here we will describe the H264 compression technology in detail.

H264 compression technology

The basic principle of H264 is actually quite simple, so let's briefly walk through how H264 compresses data. The video frames captured by the camera (say, 30 frames per second) are sent to the H264 encoder's buffer, and the encoder first divides each picture into macroblocks.

Take this picture:

Divided into macroblocks

By default, H264 uses a 16×16 area as a macroblock; it can also use 8×8 blocks.

After dividing the image into macroblocks, the encoder calculates each macroblock's pixel values.

In the same way, the pixel values of every macroblock in the image are calculated; after processing, all the macroblocks look like this:
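A minimal sketch of this step, assuming a grayscale frame stored as a NumPy array (the function name is illustrative, not an encoder API):

```python
import numpy as np

MB = 16  # H264's default macroblock size

def macroblock_means(frame, mb=MB):
    """Split a grayscale frame into mb x mb macroblocks and return the
    mean pixel value of each block (frame dims assumed divisible by mb)."""
    h, w = frame.shape
    blocks = frame.reshape(h // mb, mb, w // mb, mb)
    return blocks.mean(axis=(1, 3))

# A synthetic 64x64 "frame": left half dark, right half bright.
frame = np.zeros((64, 64), dtype=np.uint8)
frame[:, 32:] = 200

means = macroblock_means(frame)
print(means.shape)  # (4, 4): a 4x4 grid of 16x16 macroblocks
```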

Dividing into sub-blocks

H264 uses 16×16 macroblocks for relatively flat image areas. However, for higher compression, a 16×16 macroblock can be further divided into smaller sub-blocks. Sub-block sizes can be 8×16, 16×8, 8×8, 4×8, 8×4, or 4×4, which is very flexible.

In the previous picture, most of the 16×16 macroblock in the red box is blue background, but parts of the three eagles also fall inside it. To handle those parts better, H264 divides the 16×16 macroblock into several sub-blocks.

In this way, intra-frame compression yields more efficient data. The image below shows the result of compressing the above macroblock with MPEG-2 and H264 respectively: the left half is MPEG-2's sub-block partitioning and compression, and the right half is H264's. H264's partitioning method is clearly more advantageous.

Once the macroblocks are divided, the images in the H264 encoder's buffer can be grouped.

The frame group

Video data contains two kinds of redundancy: temporal redundancy and spatial redundancy, of which temporal redundancy is the greater. Let's talk about temporal redundancy first.

Why is temporal redundancy the greatest? Suppose the camera captures 30 frames per second; those 30 frames are mostly correlated with one another. And it is not just 30 frames: dozens or even hundreds of consecutive frames can be very closely related.

For such closely related frames, we really only need to store one frame's data in full; the other frames can be predicted from it according to certain rules. That is why temporal redundancy is the largest redundancy in video data.

To compress the data by prediction, the video frames must be grouped. So how do we decide whether certain frames are similar enough to be grouped together? As an example, here is a captured video of a billiard ball rolling from the top right to the bottom left.


The H264 encoder takes adjacent frames out in sequence, compares them macroblock by macroblock, and computes the similarity of the two frames, as shown below:

Through macroblock scanning and macroblock search, the encoder finds that the correlation between the two frames is very high, and indeed that the whole group of frames is highly correlated, so they can be grouped together. The rule of thumb is: across several adjacent images, if the pixel differences are generally within 10%, the change in luma is no more than 2%, and the change in chroma is within 1%, such images can be placed in one group.
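The grouping rule of thumb above can be sketched as follows; the function and its global statistics are simplified illustrations, since real encoders decide at the macroblock level rather than with whole-frame averages:

```python
import numpy as np

def can_group(prev, curr, pixel_tol=0.10, luma_tol=0.02, chroma_tol=0.01):
    """Toy grouping test on YUV frames (arrays of shape (H, W, 3), values 0-255),
    using the thresholds quoted in the text."""
    diff = np.abs(curr.astype(int) - prev.astype(int))
    changed_ratio = np.mean(diff.max(axis=2) > 0)  # fraction of changed pixels
    luma_change = np.mean(diff[..., 0]) / 255.0    # Y plane
    chroma_change = np.mean(diff[..., 1:]) / 255.0  # U, V planes
    return (changed_ratio <= pixel_tol
            and luma_change <= luma_tol
            and chroma_change <= chroma_tol)

prev = np.full((16, 16, 3), 128, dtype=np.uint8)
curr = prev.copy()
curr[0, 0, 0] += 4            # one pixel's luma nudged slightly
print(can_group(prev, curr))  # True: the frames are nearly identical
```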

Within such a group, after encoding we retain the complete data of only the first frame; the other frames are computed by reference to it. We call the first frame the IDR/I frame and the other frames P/B frames, and the encoded group of frames is called a GOP.

Motion estimation and compensation

After the H264 encoder groups the frames, it needs to calculate the motion vectors of the objects within the frame group. Using the billiard-ball video above as an example, let's see how the motion vector is calculated.

The H264 encoder first fetches two frames of video sequentially from the head of the buffer and performs a macroblock scan. When an object is found in one image, a search is performed in the neighbourhood (the search window) of the other image. If the object is found in the other image as well, its motion vector can be calculated. The picture below shows the billiard ball's position after the search.

From the difference in the ball's position, the direction and distance of its motion can be calculated. H264 records the distance and direction the ball moves in each frame, producing the following image.

Once the motion vectors are calculated, the identical parts (the green area) are subtracted out, leaving the compensation (residual) data. Finally, we only need to compress and store this compensation data, and the original image can be restored when decoding. The compressed compensation data takes only a small amount of space to record, as follows:
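The motion search and compensation described above can be sketched as a brute-force SAD block match; this is a simplified stand-in for a real encoder's motion search, and all names here are illustrative:

```python
import numpy as np

def find_motion_vector(ref, block, top, left, radius=4):
    """Exhaustive block matching: scan a (2*radius+1)^2 window in the
    reference frame for the position minimizing the sum of absolute
    differences (SAD), and return the motion vector (dy, dx)."""
    bh, bw = block.shape
    best, best_mv = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + bh > ref.shape[0] or x + bw > ref.shape[1]:
                continue
            sad = np.abs(ref[y:y+bh, x:x+bw].astype(int) - block.astype(int)).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv

# Reference frame with a bright 4x4 "ball"; in the current frame it has
# moved down 2 rows and right 3 columns.
ref = np.zeros((16, 16), dtype=np.uint8)
ref[4:8, 4:8] = 255
cur = np.zeros((16, 16), dtype=np.uint8)
cur[6:10, 7:11] = 255

mv = find_motion_vector(ref, cur[6:10, 7:11], top=6, left=7)
print(mv)  # (-2, -3): the block came from 2 rows up and 3 columns left

# Compensation: subtract the motion-shifted prediction to get the residual.
predicted = ref[6 + mv[0]: 6 + mv[0] + 4, 7 + mv[1]: 7 + mv[1] + 4]
residual = cur[6:10, 7:11].astype(int) - predicted
print(residual.sum())  # 0: after compensation nothing is left to encode
```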

Motion estimation plus compensation is what we call inter-frame compression; it removes the temporal redundancy of video frames. Besides inter-frame compression, intra-frame compression is also needed, which removes spatial redundancy. Now let's introduce intra-frame compression.

Intra-frame prediction

The human eye has limited ability to resolve images: it is very sensitive to low-frequency luminance but not very sensitive to high-frequency luminance. Based on such studies, the data in an image that the human eye is insensitive to can be removed. This is the idea behind intra-frame prediction.

H264's intra-frame compression is very similar to JPEG's. After an image is divided into macroblocks, each block can be predicted using one of 9 modes, and the encoder selects the prediction mode whose result is closest to the original picture.

The following diagram shows the process of predicting each macroblock in the whole image.

The image after intra-frame prediction is compared with the original image as follows:

Then the predicted image is subtracted from the original image to obtain the residual values.
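A sketch of this predict-and-subtract step, using three of the standard's nine 4×4 intra prediction modes (function names are illustrative, and border handling is simplified):

```python
import numpy as np

def intra_predict_4x4(above, left):
    """Three of H264's 4x4 intra prediction modes (the standard has nine):
    vertical copies the row above, horizontal copies the left column,
    DC fills the block with the mean of both borders."""
    return {
        "vertical": np.tile(above, (4, 1)),
        "horizontal": np.tile(left.reshape(4, 1), (1, 4)),
        "dc": np.full((4, 4), round((above.sum() + left.sum()) / 8)),
    }

def best_mode(block, above, left):
    """Pick the mode whose prediction has the lowest SAD against the block,
    and return (mode name, residual) -- the residual is what gets transformed."""
    preds = intra_predict_4x4(above, left)
    name = min(preds, key=lambda m: np.abs(block - preds[m]).sum())
    return name, block - preds[name]

# A block whose vertical stripes match its top neighbours exactly.
above = np.array([10, 20, 30, 40])
left = np.array([10, 10, 10, 10])
block = np.tile(above, (4, 1))

mode, residual = best_mode(block, above, left)
print(mode)            # vertical
print(residual.sum())  # 0: the prediction matches perfectly
```

The decoder only needs the chosen mode and the residual to rebuild the block, which is why storing the mode (next paragraph) is essential.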

We also store the prediction-mode information obtained earlier, so that the original image can be recovered when decoding. The effect is as follows:

After intra-frame and inter-frame compression, the data is significantly reduced, but there is still room for further compression.

DCT on the residual data

An integer discrete cosine transform can be applied to the residual data to remove its remaining correlation and compress it further. In the figure below, the left side is a macroblock of original data and the right side is the corresponding macroblock of residual data.

The residual macroblock in numeric form is shown below:

Applying the DCT to the residual macroblock gives:

After removing the associated data, we can see that the data is further compressed.
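H264's 4×4 core transform really is just an integer matrix multiply; a minimal sketch (quantization and the associated scaling step, which this transform is designed to be paired with, are omitted):

```python
import numpy as np

# H264's 4x4 forward integer transform matrix, an integer approximation of the DCT.
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def forward_transform(residual):
    """Core 4x4 integer transform: W = Cf @ X @ Cf^T."""
    return Cf @ residual @ Cf.T

# A flat residual block: all energy ends up in the top-left (DC) coefficient.
flat = np.full((4, 4), 5)
coeffs = forward_transform(flat)
print(coeffs[0, 0])              # 80: 5 summed over all 16 samples
print(np.count_nonzero(coeffs))  # 1: every AC coefficient is zero
```

Concentrating the energy into a few coefficients like this is what makes the subsequent quantization and entropy coding so effective.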

Even after the DCT, the job is not done: the data still needs to be losslessly compressed with CABAC.

CABAC

The compression described above is lossy; that is to say, after the image is compressed, it cannot be restored exactly. CABAC, by contrast, is a lossless compression technique.

The best-known lossless compression technique is probably Huffman coding, which assigns short codes to high-frequency symbols and long codes to low-frequency symbols. The VLC used in MPEG-2 is such an algorithm. Taking A–Z as an example, where A is high-frequency data and Z is low-frequency data, let's see how it works.
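A minimal Huffman coder sketches the idea; this illustrates variable-length coding generally, not MPEG-2's actual VLC tables:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code: frequent symbols get short codes, rare ones long codes."""
    freq = Counter(text)
    # Heap entries: (frequency, tiebreak index, {symbol: code-so-far}).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, i, c2 = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, i, merged))
    return heap[0][2]

text = "AAAAAAAABBBCZ"  # A is high-frequency, Z is low-frequency
codes = huffman_codes(text)
print(len(codes["A"]) < len(codes["Z"]))  # True: frequent symbols get shorter codes
```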

CABAC also gives short codes to high-frequency data and long codes to low-frequency data, but in addition it compresses according to context, which makes it much more efficient than VLC. The effect is as follows:

Now replace A–Z with a video frame, and it looks like this.

It is obvious from the figure above that lossless compression with CABAC is much more efficient than with VLC.

Summary

That wraps up our overview of the principles of H264 encoding. This article mainly covered the following points:

  1. A brief introduction to some basic concepts in H264, such as I/P/B frames and the GOP.
  2. A detailed explanation of the basic principles of H264 encoding, including:
  • macroblock partitioning
  • frame grouping
  • the principle of intra-frame compression
  • the principle of inter-frame compression
  • the DCT on residual data
  • the principle of CABAC compression

I hope the above content is helpful to you. Thank you very much!