I. Foreword

In the previous article, "Audio and Video Learning (1) — Basic Knowledge Preparation", we got a general picture of the basics of audio and video. In this article we take a deeper look at video coding technology.

1. Why encode video

Consider one hour of uncompressed movie at 1920 × 1080, with RGB24 pixel data, at 25 frames per second: 3600 × 25 × 1920 × 1080 × 3 bytes = 559.872 GB. A movie of more than 500 GB is obviously far too large: it takes a long time to download, occupies a lot of storage, and streams poorly, while the movies we usually watch are only a few gigabytes. So we compress the video, and encoding is that compression process.
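The arithmetic above can be checked with a few lines of Python (a minimal sketch; the variable names are my own):

```python
# Rough size of one hour of uncompressed 1080p RGB24 video at 25 fps.
width, height = 1920, 1080
bytes_per_pixel = 3          # RGB24: one byte each for R, G, B
fps = 25
seconds = 3600

frame_size = width * height * bytes_per_pixel      # bytes per frame
total = seconds * fps * frame_size                 # bytes per hour

print(frame_size)            # 6220800 bytes, about 6 MB per frame
print(total / 10**9)         # 559.872 (decimal gigabytes)
```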

2. How video is encoded

Frame rate (FPS) is the number of frames displayed per second; it can also be expressed as a frequency in Hz. Because of the persistence of vision, once the frame rate reaches a certain value the human eye perceives the pictures as continuous motion. Above roughly 16 frames per second the brain regards the motion as coherent; cartoons and movies typically use 25 frames per second.
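The frame rate directly determines how long each frame stays on screen. A small helper (hypothetical, for illustration only) makes the numbers concrete:

```python
# At a given frame rate, each frame stays on screen for 1/fps seconds.
def frame_duration_ms(fps):
    return 1000.0 / fps

print(frame_duration_ms(25))   # 40.0 ms per frame (film/TV)
print(frame_duration_ms(16))   # 62.5 ms, roughly the threshold for perceived motion
```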

  • Redundancy in video information

There are three types of redundancy in video information: spatial redundancy, temporal redundancy, and statistical redundancy. Each type is handled in a different way:

(1) Spatial redundancy: compressed with intra-frame predictive coding;

(2) Temporal redundancy: compressed with motion search and motion compensation;

(3) Statistical redundancy: compressed with entropy coding.

Coding is divided into soft coding and hard coding:

  • Soft encoding: encoding on the CPU. Characteristics: direct and simple, parameters are easy to adjust, and upgrades are easy; but it loads the CPU, and its performance is lower than hard encoding, while its quality at low bit rates is usually slightly better than hard encoding.

  • Hard encoding: encoding on hardware other than the CPU, such as the GPU on a graphics card. Characteristics: high performance, but quality at low bit rates is usually lower than soft encoding. However, some products have ported excellent soft-encoding algorithms onto GPU hardware, achieving quality essentially equal to soft encoding.

Among the many video compression techniques, the H.264 video compression algorithm is the most widely used. Let's get to know H.264.

II. H.264

1. Introduction

Two international organizations develop video codec standards. One is the International Telecommunication Union (ITU-T), which produced standards such as H.261, H.263, and H.263+; the other is the International Organization for Standardization (ISO), which produced standards such as MPEG-1, MPEG-2, and MPEG-4. H.264 is a digital video coding standard jointly developed by the Joint Video Team (JVT) formed by the two organizations, so it is both ITU-T H.264 and ISO/IEC MPEG-4 Advanced Video Coding (AVC).

  • Advantages: the biggest advantage of H.264 is its high data compression ratio. At the same image quality, the compression ratio of H.264 is more than twice that of MPEG-2 and 1.5 to 2 times that of MPEG-4.

2. VCL and NAL layers

An H.264 elementary stream (raw stream) consists of one NALU after another. Functionally it is divided into two layers: the VCL (Video Coding Layer) and the NAL (Network Abstraction Layer).

  • VCL: the Video Coding Layer, containing the core compression engine and the syntax-level definitions of blocks, macroblocks, and slices. It is designed to encode efficiently and to compress the raw video data as independently of the network as possible.
  • NAL: the Network Abstraction Layer. On Ethernet each packet is at most 1500 bytes, while a frame is often larger than that, so the data output by the VCL must be split into multiple packets in a defined format and given packet headers, so that it can be transmitted or stored over networks of different rates. All of this packing and unpacking is handled by the NAL layer, which covers all syntax levels above the slice level.

Before VCL data is transmitted or stored, it is mapped into NALUs; H.264 data consists of a series of NALUs, as shown in the figure below:

One NALU = one NALU header describing the encoded video data + one Raw Byte Sequence Payload (RBSP)

In an H.264 bit stream, the first NALU is the SPS, the second NALU is the PPS, and the third NALU is an IDR (Instantaneous Decoder Refresh) frame.

  • SPS: Sequence Parameter Set, which stores information that applies to a whole sequence of frames.
  • PPS: Picture Parameter Set, which stores information about a frame. The decoder must obtain the SPS and PPS before it can decode the data that follows.
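The NALU layout described above can be illustrated with a toy Annex-B parser. This is a simplified sketch, not a production parser: it assumes a byte stream delimited by 0x000001 / 0x00000001 start codes and reads the nal_unit_type from the low five bits of each NALU's first byte (5 = IDR, 7 = SPS, 8 = PPS):

```python
# Minimal Annex-B parser sketch: split a raw H.264 byte stream on start
# codes (0x000001 / 0x00000001) and read each NALU's header byte.
NAL_TYPES = {1: "non-IDR slice", 5: "IDR", 6: "SEI", 7: "SPS", 8: "PPS"}

def split_nalus(stream: bytes):
    nalus, i = [], 0
    while i < len(stream):
        three = stream.find(b"\x00\x00\x01", i)   # next 3-byte start code
        if three == -1:
            break
        start = three + 3
        nxt = stream.find(b"\x00\x00\x01", start)
        # trim the leading zero of a following 4-byte start code (toy-level
        # handling; real payloads can legitimately end in 0x00)
        end = len(stream) if nxt == -1 else (nxt - 1 if stream[nxt - 1] == 0 else nxt)
        nalus.append(stream[start:end])
        i = start
    return nalus

def nal_unit_type(nalu: bytes) -> int:
    # NALU header byte: forbidden_zero_bit(1) | nal_ref_idc(2) | nal_unit_type(5)
    return nalu[0] & 0x1F

# Toy stream: SPS (0x67), PPS (0x68), IDR slice (0x65) with dummy payloads.
stream = b"\x00\x00\x00\x01\x67\xAA" b"\x00\x00\x00\x01\x68\xBB" b"\x00\x00\x01\x65\xCC"
for n in split_nalus(stream):
    print(NAL_TYPES.get(nal_unit_type(n)))   # SPS, PPS, IDR
```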

3. Frames, slices and macroblocks

(1) Frame

I-frame (key frame)

I-frames, also known as keyframes, use intra-frame compression technology.

  • Understanding I-frames: when we shoot video, the content rarely changes much within one second, yet the camera captures dozens of frames in that second (a movie, for example, runs at 25 frames per second). The changes within such a group of frames are very small, so to compress the data there is no need to save every frame completely; only the first frame is fully preserved, as the key frame. The I-frame is critical: all subsequently decoded data depends on it.

P frame (forward reference frame)

P-frames, also called forward reference frames, use inter-frame compression. A P-frame stores only the data that differs from the previous frame, referencing only that previous frame during compression. The first frame of the video is saved completely as the key frame, and each subsequent frame depends on the one before it: the second frame depends on the first, and every later frame stores only its difference from its predecessor. This greatly reduces the data and achieves a high compression rate.

B frame (bidirectional reference frame)

B frames, also known as bidirectional reference frames, use inter-frame compression. Compression references both the preceding and the following frame, yielding a higher compression rate and less stored data: the more B frames, the higher the compression rate. That is the advantage of B frames. Their biggest disadvantage is latency: a B frame can only be decoded with reference to a later frame, so the decoder must wait for that later frame to arrive over the network. If the network is good, decoding is fast; if it is poor, decoding is slower, and lost packets must be retransmitted. For interactive live broadcasts, where timeliness matters most, B frames are therefore generally not used. In pan-entertainment live streaming, where some delay is acceptable, B frames can be used when a relatively high compression ratio is required.

(2) GOF (Group of Frames) or GOP (Group of Pictures)

We call a group of frames a GOF or GOP: the data from one I-frame up to the next I-frame. The figure below shows the decoding order and display order of the three frame types within a GOP; note that both decoding and display start from the I-frame.
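The decode-versus-display reordering can be sketched as follows. Here decode_order is a hypothetical helper (not part of any codec API) that defers each B frame until its forward reference, the next I or P frame, has been decoded:

```python
# Toy illustration of decode order vs display order in a GOP with B frames.
# A B frame references a *later* frame, so that frame must be decoded first.
# Display order:  I0 B1 B2 P3  ->  Decode order:  I0 P3 B1 B2
gop = [("I", 0), ("B", 1), ("B", 2), ("P", 3)]

def decode_order(frames):
    out, pending = [], []
    for kind, idx in frames:
        if kind == "B":
            pending.append((kind, idx))      # defer until forward ref decoded
        else:
            out.append((kind, idx))          # I/P frame decodes immediately...
            out.extend(pending)              # ...then the B frames that need it
            pending.clear()
    out.extend(pending)
    return out

print(decode_order(gop))   # [('I', 0), ('P', 3), ('B', 1), ('B', 2)]
```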

(3) Slice

The main function of a slice is to act as a carrier for macroblocks. Slices exist to limit the spread of bit errors: coded slices are mutually independent, and prediction within one slice may not reference macroblocks in another slice, which ensures that a prediction error in one slice does not propagate to others. An image can contain one or more slices, and each slice contains a whole number of macroblocks: at least one, and at most all the macroblocks of the entire image. A slice subdivides into "slice header + slice data"; because one frame of data may not be transmitted in a single pass, the header information must be recorded.

(4) Macroblock

The macroblock is the main carrier of video information, because it contains the luma and chroma of every pixel. The main task of video decoding is to provide an efficient way to recover the pixel arrays of macroblocks from the bit stream. Composition: a macroblock consists of a 16×16 block of luma samples plus an additional 8×8 Cb and 8×8 Cr block of chroma samples. In each image, a number of macroblocks are arranged into slices. A macroblock contains the macroblock type, prediction type, Coded Block Pattern, Quantization Parameter, and the pixel luma and chroma sample data, among other fields.

(5) The relationship between frame, slice and macro block

We know that a frame represents one image, where: 1 frame = n slices; 1 slice = n macroblocks; 1 macroblock = one 16×16 block of YUV data.
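A quick sanity check of these relationships for a 1080p frame. Here macroblock_count is a helper of my own; the rounding-up reflects how encoders pad picture dimensions to whole macroblocks (1080 is not divisible by 16, so the coded height is padded to 1088 and cropped on display):

```python
import math

# How many 16x16 macroblocks does one 1080p frame contain?
def macroblock_count(width, height, mb=16):
    return math.ceil(width / mb) * math.ceil(height / mb)

# In 4:2:0 sampling, each macroblock carries a 16x16 luma block
# plus one 8x8 Cb and one 8x8 Cr block.
bytes_per_mb = 16 * 16 + 8 * 8 + 8 * 8   # 384 raw sample bytes

print(macroblock_count(1920, 1080))      # 120 * 68 = 8160 macroblocks
```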


Here is the complete bit stream hierarchy of H.264:

4. The principle of predictive coding

To understand H.264 coding, we first need to understand one concept: predictive coding.

  • Predictive coding: for information with contextual correlation, predictive coding is a very simple and effective method. Its output is no longer the original signal value but the difference between the predicted value and the actual value of the signal. This design exploits the fact that large numbers of adjacent signal values are identical or similar: by coding only the differences between them, the amount of data needed to store and transmit the original information is greatly reduced. Let's illustrate how predictive coding works with a small example. Assume the following 10 numbers:

2, 2, 2, 7, 2, 2, 2, 2, 2, 13

We can also express them in the following way:

prediction = 2;
difference = { (4, 5), (10, 11) };

This says that the predicted value of the signal is 2, and that residuals exist at the 4th and 10th elements, with differences of 5 and 11 respectively. Predictive coding not only reduces the number of bits needed to represent pixel information, it also preserves the picture quality of the video image.
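This example can be turned into a small encode/decode round trip. Note that predict_encode and predict_decode are illustrative names of my own, not part of any standard API:

```python
# Predictive coding sketch: transmit one predicted value plus
# (position, difference) pairs only where the signal deviates from it.
def predict_encode(signal, prediction):
    return [(i, x - prediction) for i, x in enumerate(signal, 1) if x != prediction]

def predict_decode(residuals, prediction, length):
    signal = [prediction] * length
    for pos, diff in residuals:
        signal[pos - 1] = prediction + diff
    return signal

signal = [2, 2, 2, 7, 2, 2, 2, 2, 2, 13]
residuals = predict_encode(signal, 2)
print(residuals)                               # [(4, 5), (10, 11)]
assert predict_decode(residuals, 2, 10) == signal   # lossless round trip
```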

H.264 was not the first standard to adopt predictive coding; early compression standards already represented the pixels to be encoded as prediction + residual. In those standards, however, predictive coding was used only for inter-frame prediction, to remove temporal redundancy, while intra-frame coding still applied DCT and entropy coding directly, and the resulting compression efficiency could not meet the new demands of the multimedia field. The H.264 standard analyzes the spatial correlation within an I-frame in depth and adopts a variety of predictive coding modes, further compressing the spatial redundancy inside the I-frame. This greatly improves I-frame coding efficiency and lays the foundation for H.264's breakthrough in compression ratio.

(1) Intra-frame predictive compression

Intra-frame predictive compression solves the problem of spatial data redundancy. What is spatial redundancy? Within the width and height of a single picture there is a great deal of color and luminance data that the human eye can hardly distinguish. Such data can be called redundant, and it can simply be compressed away.

(2) Inter-frame predictive compression

Inter-frame predictive compression solves the problem of data redundancy in the time domain. As illustrated earlier, the data captured by the camera over a short period changes little, so we compress the data that stays the same across that period. This is called time-domain data compression.
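A minimal sketch of the idea, using one-dimensional "frames" for simplicity:

```python
# Inter-frame prediction sketch: store only per-pixel differences from the
# previous frame; unchanged regions become runs of zeros, which compress well.
frame1 = [10, 10, 10, 50, 50, 10]
frame2 = [10, 10, 10, 52, 55, 10]   # only two pixels changed

residual = [b - a for a, b in zip(frame1, frame2)]
print(residual)                      # [0, 0, 0, 2, 5, 0]  (mostly zeros)

# The decoder reconstructs frame2 from frame1 + residual.
reconstructed = [a + r for a, r in zip(frame1, residual)]
assert reconstructed == frame2
```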

This summary references and excerpts the following articles. Many thanks to their authors for sharing!

1. Lei Xiaohua's video course "Making a Video Player Based on FFmpeg+SDL — Section 1: Outline and Basic Knowledge of Video and Audio" (PS: a tribute to Mr. Lei Xiaohua, a master of audio and video; thank you for the achievements you selflessly shared with us during your life)

2. "H.264/AVC Layered Structure and Picture Division" by Cloud-directed

3. "H264 Series (13): Hierarchical Structure and Syntactic Elements" by Hefei Laopi

4. "Understanding the Structure of Video Coding H264" by Abson on Jianshu

5. "H.264/AVC Video Codec Technology Explained, Part 23: Inter-frame Predictive Coding (1): Basic Principles of Inter-frame Predictive Coding"

Please credit the original source when reprinting; not for commercial distribution. (Author: fan how much)