This is part two of the Audio and Video Mastery Advanced series, written so that even beginners can follow along. Because of its length, the material cannot fit into a single article, so it is published in several installments; if you need the complete PDF document, you can contact me via my profile page to get the PDF version.
The H264 video compression algorithm is today, without doubt, the most widely used and most popular of all video compression technologies. With the release of open-source libraries such as x264/OpenH264 and FFmpeg, most users no longer need to study the details of H264, which significantly lowers the barrier to using it.
However, to use H264 well, we still need to understand its basic principles. Today we will take a look at the basics of H264.
H264 overview
H264 compression technology mainly uses the following methods to compress video data:
- Intra-frame prediction removes spatial data redundancy.
- Inter-frame prediction (motion estimation and compensation) removes temporal data redundancy.
- An integer discrete cosine transform (DCT) converts spatially correlated data into largely uncorrelated frequency-domain data, which is then quantized.
- CABAC entropy coding losslessly compresses the result.
After compression, frames are divided into I frames, P frames, and B frames:
- I frame: key frame, compressed with intra-frame compression only.
- P frame: forward-referencing frame. When compressed, it refers only to a previously processed frame, using inter-frame compression.
- B frame: bidirectionally referencing frame. When compressed, it refers to the frames both before and after it, using inter-frame compression.
- In addition to I/P/B frames, there is also the image sequence: the GOP.
GOP: the image sequence between two I frames; an image sequence contains only one I frame. As shown below:
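Alongside the figure, here is a tiny illustrative sketch of a GOP layout in Python (the pattern and function name are my own invention; a real encoder chooses frame types adaptively):

```python
# Illustrative sketch: assign frame types within one GOP (hypothetical pattern).

def gop_frame_types(gop_size: int = 9, b_frames: int = 2) -> list[str]:
    """Return a frame-type pattern for one GOP, e.g. I B B P B B P B B."""
    types = ["I"]  # every GOP starts with exactly one I frame
    while len(types) < gop_size:
        types += ["B"] * b_frames + ["P"]
    return types[:gop_size]

print(" ".join(gop_frame_types()))  # I B B P B B P B B
```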
Here we will describe the H264 compression technology in detail.
H264 compression technology
The basic principle of H264 is actually quite simple; let us briefly walk through how H264 compresses data. The video frames captured by the camera (at, say, 30 frames per second) are sent to the H264 encoder's buffer. The encoder first divides each picture into macroblocks.
Take this picture:
Dividing the image into macroblocks
By default, H264 uses a 16×16 area as a macroblock, which can also be divided into 8×8 blocks.
After the macroblocks are divided, the pixel values of each macroblock are calculated.
In the same way, the pixel values of every macroblock in the image are calculated; after processing, all the macroblocks look like this.
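As a rough sketch of this division step (numpy assumed; the function name is mine, not an H264 API), the following splits a grayscale frame into 16×16 macroblocks and computes a mean pixel value for each:

```python
import numpy as np

def split_into_macroblocks(frame: np.ndarray, size: int = 16):
    """Yield (row, col, block) for each size x size macroblock of a frame.

    Assumes the frame dimensions are multiples of `size`; real encoders
    pad the picture out to macroblock boundaries first.
    """
    h, w = frame.shape[:2]
    for y in range(0, h, size):
        for x in range(0, w, size):
            yield y, x, frame[y:y + size, x:x + size]

# Example with a fake 64x64 grayscale frame:
frame = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
for y, x, block in split_into_macroblocks(frame):
    print(f"macroblock at ({y},{x}): mean pixel value {block.mean():.1f}")
```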
Dividing macroblocks into sub-blocks
H264 uses whole 16×16 macroblocks for relatively flat parts of an image. For higher compression, however, a 16×16 macroblock can be further divided into smaller sub-blocks of size 16×8, 8×16, 8×8, 8×4, 4×8, or 4×4, which is very flexible.
In the picture above, most of the 16×16 macroblock inside the red box is blue background, but parts of the three eagles also fall inside this macroblock. To handle the eagles' image data better, H264 divides that 16×16 macroblock into several sub-blocks.
In this way, intra-frame compression produces more efficient data. The image below shows the result of compressing the macroblock above with MPEG-2 and with H264: the left half is MPEG-2's sub-block partitioning and compression result, and the right half is H264's. It is clear that H264's partitioning method has the advantage.
Once the macroblocks are divided, all the images in the H264 encoder's buffer can be grouped.
Frame grouping
Video data has two kinds of redundancy: temporal redundancy and spatial redundancy, of which temporal redundancy is the largest. Let us talk about temporal redundancy first.
Why is temporal redundancy the largest? Suppose the camera captures 30 frames per second; those 30 frames are mostly closely correlated, and in fact dozens or even hundreds of consecutive frames may be particularly closely related.
For such closely related frames, we really only need to save the data of one frame; the other frames can then be predicted from it according to certain rules. That is why the temporal redundancy of video data is the largest.
To compress data by prediction, the video frames must first be grouped. So how do we decide that certain frames are close enough to be put into one group? Take an example: below are captured video frames of a billiard ball rolling from the top right toward the bottom left.
The H264 encoder takes two adjacent frames at a time and compares them macroblock by macroblock, calculating the similarity of the two frames. The diagram below:
Through macroblock scanning and macroblock search we find that the correlation between the two frames is very high, and indeed that the whole group of frames is highly correlated, so they can be put into one group. The rule of thumb is: in several adjacent images, if at most 10% of the pixels differ, the luminance difference changes by no more than 2%, and the chrominance difference changes by no more than 1%, then we consider that such images can be placed in one group.
Within such a group of frames, after encoding we keep the complete data of the first frame only; the other frames are computed with reference to it. We call the first frame the IDR/I frame and the other frames P/B frames, and the encoded group of frames is called a GOP.
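A minimal sketch of such a grouping test, using the rough thresholds quoted above (numpy assumed; function and constant names are mine, and real encoders use far more sophisticated rate-distortion measures):

```python
import numpy as np

# Rough thresholds quoted in the text above (illustrative only).
MAX_CHANGED_PIXEL_RATIO = 0.10   # at most 10% of pixels may differ
MAX_LUMA_CHANGE = 0.02           # mean luminance change <= 2%
MAX_CHROMA_CHANGE = 0.01         # mean chrominance change <= 1%

def frames_belong_together(y1, y2, uv1, uv2) -> bool:
    """Decide whether two YUV frames are similar enough to share one GOP."""
    changed = np.mean(np.abs(y1.astype(int) - y2.astype(int)) > 10)  # "differs"
    luma_change = abs(float(y1.mean()) - float(y2.mean())) / 255.0
    chroma_change = abs(float(uv1.mean()) - float(uv2.mean())) / 255.0
    return (changed <= MAX_CHANGED_PIXEL_RATIO
            and luma_change <= MAX_LUMA_CHANGE
            and chroma_change <= MAX_CHROMA_CHANGE)
```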
Motion estimation and compensation
After the H264 encoder has grouped the frames, it must calculate the motion vectors of the objects within the frame group. Using the billiard-ball frames above as an example, let us see how it does this.
The H264 encoder first fetches two frames of video data in sequence from the head of the buffer and performs a macroblock scan. When an object is found in one of the images, a search is performed in the neighborhood (the search window) of the other image. If the same object is found there, the object's motion vector can be calculated. The picture below shows the billiard ball's position located by the search.
From the difference between the two ball positions, the direction and distance of the motion can be calculated. H264 then records the distance and direction the ball moves in each frame, producing the following image.
Once the motion vectors are calculated, the identical parts (the green background) are subtracted away, leaving the compensation (residual) data. In the end, we only need to compress and save this compensation data, from which the decoder can restore the original images; the compressed compensation data takes only a small amount of storage. As follows:
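Here is a minimal Python sketch of the block-matching idea behind motion estimation (numpy assumed; names mine): a full search over a window in the reference frame for the offset that minimizes the sum of absolute differences (SAD). Real encoders use much faster search strategies.

```python
import numpy as np

def find_best_match(ref: np.ndarray, block: np.ndarray,
                    by: int, bx: int, search: int = 8):
    """Full search for `block` (originally at (by, bx)) inside `ref`.

    Returns the (dy, dx) motion vector minimizing the sum of absolute
    differences (SAD) over a +/- `search` pixel window.
    """
    n = block.shape[0]
    best_sad, best_mv = float("inf"), (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + n > ref.shape[0] or x + n > ref.shape[1]:
                continue  # candidate window falls outside the frame
            cand = ref[y:y + n, x:x + n]
            sad = np.abs(cand.astype(int) - block.astype(int)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

# Fake a frame pair where everything moved by (dy=2, dx=-3):
prev = np.random.randint(0, 256, (144, 176), dtype=np.uint8)
curr = np.roll(prev, shift=(2, -3), axis=(0, 1))
mv, _ = find_best_match(curr, prev[32:48, 48:64], 32, 48)
print(mv)  # (2, -3)
```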
We call motion estimation plus compensation inter-frame compression; it removes the temporal redundancy between video frames. Besides inter-frame compression, intra-frame compression is also needed, which removes the spatial redundancy. Let us now introduce intra-frame compression technology.
Intra-frame prediction
The human eye's perception of an image is uneven: it is very sensitive to low-frequency luminance but much less sensitive to high-frequency luminance. Based on such studies, the data in an image that the human eye is not sensitive to can be removed. Thus the intra-frame prediction technique was proposed.
H264's intra-frame compression is very similar to JPEG's. After an image is divided into macroblocks, each macroblock can be predicted using one of 9 modes, and the encoder finds the prediction mode whose result is closest to the original picture.
The following diagram illustrates the process of predicting each macroblock across the whole image.
The image after intra-frame prediction is compared with the original image as follows:
Then the intra-frame predicted image is subtracted from the original image to obtain the residual values.
We then store the prediction-mode information obtained earlier together with the residuals, so that the original image can be recovered when decoding. The effect is as follows:
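A small sketch of this residual idea (numpy assumed; function names mine): the encoder keeps only the mode plus original minus predicted, and the decoder rebuilds the block as predicted plus residual.

```python
import numpy as np

def residual(original: np.ndarray, predicted: np.ndarray) -> np.ndarray:
    """Encoder side: only this difference (plus the mode) gets coded."""
    return original.astype(int) - predicted.astype(int)

def reconstruct(predicted: np.ndarray, res: np.ndarray) -> np.ndarray:
    """Decoder side: prediction regenerated from the mode, plus the residual."""
    return np.clip(predicted.astype(int) + res, 0, 255).astype(np.uint8)

block = np.random.randint(90, 110, (4, 4), dtype=np.uint8)  # "original" block
pred = np.full((4, 4), 100, dtype=np.uint8)                 # some prediction
assert (reconstruct(pred, residual(block, pred)) == block).all()
```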
After intra-frame and inter-frame compression, the data is significantly reduced, but there is still room for optimization.
Applying DCT to the residual data
An integer discrete cosine transform can be applied to the residual data to remove the correlation in the data and compress it further. In the figure below, the left side is a macroblock of original data and the right side is the corresponding macroblock of residual data.
The residual macroblock written out as numbers is shown below:
The residual macroblock after the DCT:
With the correlated data removed, we can see that the data can be compressed further.
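H264 actually uses an exact integer approximation of the DCT on small blocks, but the floating-point sketch below (scipy and numpy assumed) illustrates the same energy-compaction effect: after the transform, most of the energy sits in the top-left, low-frequency corner, so the remaining coefficients can be quantized coarsely.

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block: np.ndarray) -> np.ndarray:
    """2-D type-II DCT with orthonormal scaling, applied along both axes."""
    return dct(dct(block.T, norm="ortho").T, norm="ortho")

def idct2(coeffs: np.ndarray) -> np.ndarray:
    """Inverse 2-D DCT: recover the block from its coefficients."""
    return idct(idct(coeffs.T, norm="ortho").T, norm="ortho")

# A residual block holds mostly small values; transform, crudely quantize,
# and invert to see how little is lost.
residual_block = np.random.randint(-5, 6, (8, 8))
coeffs = dct2(residual_block)
quantized = np.round(coeffs / 10) * 10        # toy uniform quantizer
print(np.round(idct2(quantized)).astype(int))
```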
But the DCT alone is not the end of the story: CABAC lossless compression still has to be applied afterwards.
Motion coding in plain English
This is the first frame, P1 (our reference frame).
This is the second frame, P2 (the frame to be encoded).
The two images were captured from a video about 1-2 seconds apart, which is close to a realistic situation. Below we run several motion searches.
Search demo 1: this is a demo program; selecting any 16×16 block on P2 with the mouse finds the best-matching (BestMatch) macroblock on P1. Although the vehicle is moving from far to near, the closest macroblock position is still found.
Search demo 2: overhead wires crossing (P1 above, P2 below).
Search demo 3: an advertising poster at a newsstand.
Similarly, the macroblock position closest to the poster in P2 is successfully found in P1.
Full-image search: P2′ is reconstructed from P1 plus the motion-vector data (for every macroblock of P2, the most similar position is searched in P1); in other words, the picture P2′ most similar to P2 is pieced together from macroblocks taken at various positions in P1. The effect is as follows:
It looks a little broken, doesn't it? Now we take the pixel difference of P2′ and P2 to get the difference image D2 = (P2 − P2′) / 2 + 0x80:
And this is P2′ with the error D2 added back: it is clearly legible and essentially restores the original image P2.
Since D2 compresses to only about 5 KB and the motion vectors compress to about 2 KB, we need only roughly 7 KB of additional data to fully represent P2 by referring to P1. If P2 were compressed independently with a good-quality lossy method, it would take at least 50-60 KB, so this saves roughly 8× the space. That is the basic principle behind what is called motion coding.
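A sketch of the D2 mapping used above (numpy assumed; names mine). Note the divide-by-two deliberately drops one bit of precision, which is exactly the precision trade-off discussed below.

```python
import numpy as np

def encode_diff(p2: np.ndarray, p2_pred: np.ndarray) -> np.ndarray:
    """D2 = (P2 - P2') / 2 + 0x80: map signed errors into the 0..255 range."""
    return ((p2.astype(int) - p2_pred.astype(int)) // 2 + 0x80).astype(np.uint8)

def decode_diff(p2_pred: np.ndarray, d2: np.ndarray) -> np.ndarray:
    """P2 ~= P2' + 2 * (D2 - 0x80); off by at most 1-2 due to the halving."""
    restored = p2_pred.astype(int) + (d2.astype(int) - 0x80) * 2
    return np.clip(restored, 0, 255).astype(np.uint8)
```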
In practice, the reference frame is not necessarily the previous frame or the I frame of the same GOP, because when the GOP interval is long, later pictures may differ greatly from the I frame. A common practice is therefore to select, from the previous 15 frames, the frame with the smallest error as the reference frame. Also, although a color picture has the three YUV components, the bulk of the prediction and selection work is usually judged on the Y-component (grayscale) frame.
In addition, rather than saving the error (P2 − P2′) / 2 + 0x80 directly, practical systems store it more efficiently, for example with variable precision: differences in [−64, 64] stored with a precision of 1, and differences in [−255, −64] and [64, 255] with a precision of 2-3, which is closer to reality.
Finally, in many places above the data was simply stored with plain LZMA2; in actual use, entropy coding is generally introduced to reorganize the data first and then compress it, which performs much better.
CABAC
The intra-frame compression described above is a lossy compression technique: once the image has been compressed, it cannot be restored exactly. CABAC, in contrast, is a lossless compression technique.
The best-known lossless compression technique is probably Huffman coding, in which high-frequency symbols are given short codes and low-frequency symbols long codes, thereby compressing the data. The VLC used in MPEG-2 is such an algorithm. Take A-Z as an example, where A is high-frequency data and Z is low-frequency data; let us see how it works.
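As an aside, here is a minimal Huffman-coding sketch in plain Python (standard library only; names mine). It shows the general idea behind VLC-style coding, not MPEG-2's actual code tables:

```python
import heapq
from collections import Counter

def huffman_codes(data: str) -> dict[str, str]:
    """Build a code table where frequent symbols get shorter codes."""
    # Heap entries: (frequency, tiebreaker, {symbol: code-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        count += 1
        heapq.heappush(heap, (f1 + f2, count, merged))
    return heap[0][2]

print(huffman_codes("AAAAABBBCZ"))
# e.g. {'A': '0', 'B': '11', 'C': '100', 'Z': '101'} - frequent A gets 1 bit
```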
CABAC also gives short codes to high-frequency data and long codes to low-frequency data, but in addition it compresses adaptively according to context, which makes it much more efficient than VLC. The effect is as follows:
Now replace A-Z with a video frame, and it looks like this.
It is obvious from the figure above that the lossless compression scheme using CABAC is far more efficient than VLC.
H264 coding (intra-frame prediction)
Prediction? The word seems to carry a magical power, as if it could lead you into the future. Is that really so? And is intra prediction even more powerful? What does it actually do?
Intra-frame prediction can prevent video aliasing.
In intra prediction mode, the prediction block P is formed from already encoded and reconstructed blocks and the current block. For luminance pixels, the P block is computed for 4×4 sub-blocks or for 16×16 macroblocks. A 4×4 luminance sub-block has 9 optional prediction modes, each sub-block predicted independently, which suits image coding with a lot of detail; a 16×16 luminance block has 4 prediction modes, which suit flat-region image coding; chroma blocks likewise have 4 prediction modes, similar to those of the 16×16 luminance block. The encoder usually chooses the prediction mode that minimizes the difference between the P block and the block being encoded.
4×4 luminance prediction modes
As shown in Figure 6.14, the pixels A-M above and to the left of the 4×4 luminance block have already been encoded and reconstructed, and they serve as the prediction reference pixels in both encoder and decoder. The pixels a-p are the pixels to be predicted, computed from the values of A-M using one of the 9 modes. Mode 2 (DC prediction) can work from whichever of A-M have been encoded, while each of the other modes can be used only when all of the reference pixels it requires are available. The arrows in Figure 6.15 indicate the prediction direction of each mode. For modes 3-8, the predicted pixels are weighted averages of A-M; for example, in mode 4, d = round(B/4 + C/2 + D/4).
Table 1 4×4 luminance prediction modes

Mode | Description
---|---
Mode 0 (vertical) | Derive the pixel values vertically from A, B, C, and D
Mode 1 (horizontal) | Derive the pixel values horizontally from I, J, K, and L
Mode 2 (DC) | Predict all pixels from the mean of A-D and I-L
Mode 3 (diagonal down-left) | Derive the pixel values by interpolating at 45°
Mode 4 (diagonal down-right) | Derive the pixel values by interpolating at 45°
Mode 5 (vertical-right) | Derive the pixel values by interpolating at 26.6°
Mode 6 (horizontal-down) | Derive the pixel values by interpolating at 26.6°
Mode 7 (vertical-left) | Derive the pixel values by interpolating at 26.6°
Mode 8 (horizontal-up) | Derive the pixel values by interpolating at 26.6°
Table 2 16×16 luminance prediction modes
Mode | Description
---|---
Mode 0 (vertical) | Derive the pixel values from the pixels above
Mode 1 (horizontal) | Derive the pixel values from the pixels to the left
Mode 2 (DC) | Derive the pixel values from the mean of the upper and left pixels
Mode 3 (plane) | Fit a linear "plane" function to the upper and left pixels; suitable for areas of smoothly varying luminance
8×8 chroma block prediction modes
The 8×8 chroma components of each coded macroblock are predicted from the previously encoded chroma pixels above and to the left, and the two chroma components usually use the same prediction mode.
The 4 prediction modes are similar to the 4 modes of 16×16 intra prediction, but the mode numbers differ: DC (mode 0), horizontal (mode 1), vertical (mode 2), and plane (mode 3).
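Before turning to how the chosen mode is signaled, here is a sketch of three of the nine 4×4 luminance modes plus the pick-the-smallest-error rule described above (numpy assumed; function names mine):

```python
import numpy as np

def predict_4x4(mode: int, above: np.ndarray, left: np.ndarray) -> np.ndarray:
    """above holds A..D (the row on top), left holds I..L (the left column)."""
    if mode == 0:   # vertical: copy A..D downwards
        return np.tile(above[:4].astype(int), (4, 1))
    if mode == 1:   # horizontal: copy I..L to the right
        return np.tile(left[:4].astype(int).reshape(4, 1), (1, 4))
    if mode == 2:   # DC: fill with the mean of A..D and I..L
        return np.full((4, 4), round((int(above[:4].sum()) + int(left[:4].sum())) / 8))
    raise NotImplementedError("modes 3-8 interpolate along diagonal directions")

def best_mode(block: np.ndarray, above: np.ndarray, left: np.ndarray) -> int:
    """Choose the mode minimizing the sum of absolute errors, as described above."""
    sae = {m: np.abs(block.astype(int) - predict_4x4(m, above, left)).sum()
           for m in (0, 1, 2)}
    return min(sae, key=sae.get)
```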
For the current block C, the codec determines the most probable prediction mode as follows:
```
most_probable_mode = min(prediction mode of A, prediction mode of B)
# if neighbouring block A or B is unavailable, most_probable_mode = 2 (DC)
```
For example, if the prediction modes of blocks A and B are 3 and 1 respectively, then:

```
most probable mode for block C = min(3, 1) = 1
```
The encoder sends a flag for each 4×4 block, which the decoder interprets as follows:
```
if flag == 1:
    prediction mode = most_probable_mode
else:  # flag == 0, and rem_intra4x4_pred_mode is also sent
    if rem_intra4x4_pred_mode < most_probable_mode:
        prediction mode = rem_intra4x4_pred_mode
    else:
        prediction mode = rem_intra4x4_pred_mode + 1
```
In this way, rem_intra4x4_pred_mode needs only 8 values (0 to 7) to signal all 9 prediction modes.
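Putting the rule together, here is a small self-contained sketch of this signaling logic (function names are mine, for illustration only):

```python
def most_probable_mode(mode_a: int | None, mode_b: int | None) -> int:
    """min of the neighbours' modes; DC (mode 2) if a neighbour is unavailable."""
    if mode_a is None or mode_b is None:
        return 2
    return min(mode_a, mode_b)

def encode_mode(mode: int, mpm: int) -> tuple[int, int | None]:
    """Return (flag, rem_intra4x4_pred_mode) for the chosen mode."""
    if mode == mpm:
        return 1, None                       # the flag alone is enough
    rem = mode if mode < mpm else mode - 1   # skip the MPM slot: 8 values cover 9 modes
    return 0, rem

def decode_mode(flag: int, rem: int | None, mpm: int) -> int:
    """Invert encode_mode at the decoder."""
    if flag == 1:
        return mpm
    return rem if rem < mpm else rem + 1

# Example from the text: neighbours A and B use modes 3 and 1, so MPM = 1;
# every one of the 9 modes round-trips through the flag + remainder.
mpm = most_probable_mode(3, 1)
assert all(decode_mode(*encode_mode(m, mpm), mpm) == m for m in range(9))
```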