Preface: with live streaming "thriving" at home (ps: in fact most of it is losing money and just hyping a concept), I believe many friends have also plunged into the vast ocean of technology (ps: in reality, moving bricks day after day). The mysteries of audio/video have shut many friends out, and this blogger too has only just glimpsed the doorway and is still groping forward. OK, without further ado, today we will talk about the structure of the H.264 encoding format that we so often use in video coding. I believe many friends are already familiar with H.264 and have their own understanding of it, but it is a huge standard with countless algorithms inside, so this blogger will not go into anything too profound. If you are not interested in H.264, please take a detour; if you are an expert, please feel free to point out this blogger's mistakes.

Let’s start with some familiar official quotes about H.264

H.264: the purpose of the H.264/AVC project was to create a standard that could provide good video quality at lower bit rates than previous video compression standards (i.e., half the bit rate of MPEG-2, H.263, or MPEG-4 Part 2, or less), without adding too much complexity to the design. Advantages: 1) network friendliness — it can be used over all kinds of transmission networks; 2) high video compression ratio — the original target was about twice that of H.263 and MPEG-4, and this has since been achieved.

So obviously, when do you need compression? When the file is too large, of course. Think about it: video is like the flip books of childhood — 24 or more pictures per second make the image appear continuous. That is the principle of video. But have you ever wondered how big a single picture is? At a screen resolution of 1280 × 720, one uncompressed frame is about 2.64 MB, which at 24 frames per second comes to over 60 MB of raw data for each second of video. At that rate most of us would struggle to download even a short clip, let alone watch a live stream. That clearly won't do, so we have to compress — and because H.264 offers a high compression ratio and good network compatibility, most people choose H.264 as the encoding format for live streaming.

Encoding flow: what is the encoding and decoding flow of H.264? It can be divided into five parts: inter-frame prediction (motion estimation and compensation) and intra-frame prediction, transform and inverse transform, quantization and inverse quantization, loop filtering, and entropy coding. It looks very sophisticated, and in fact it is, because it involves a great many algorithms and a lot of specialist knowledge. We will not explain them in depth here; interested students can look them up online — there is more than enough material to put you to sleep. See the detailed H.264 documentation.


Introduction of the principle

The H.264 raw code stream (also known as the elementary stream) is composed of one NALU after another, and its functions are divided into two layers: VCL (Video Coding Layer) and NAL (Network Abstraction Layer). VCL data is the output of the encoding process and represents the compressed, encoded video data sequence. Before VCL data is transmitted or stored, it is mapped or encapsulated into NAL units (NALU). Each NALU consists of a NALU header plus a Raw Byte Sequence Payload (RBSP). The basic structure of the RBSP is the raw encoded data followed by trailing bits: one "1" bit and then as many "0" bits as needed for byte alignment.

The NALU header + RBSP in the figure above together form one NALU (NAL unit), and each unit is transmitted independently. To put it plainly, every structure in H.264 is built on NALUs: understand the NALU and you understand the structure of H.264.


An image associated with NALU:

How does NALU come from a single image, and why is H.264 so amazing?

After passing through the H.264 encoder, a picture is encoded into one or more slices, and the carrier of these slices is the NALU. Let's look at the relationship between NALUs and slices.

The concept of a slice is different from that of a frame. A frame describes a whole image: one frame corresponds to one image. The slice is a new concept introduced in H.264: an image is divided into slices so it can be encoded and transported more efficiently and robustly. An image contains one or more slices.

As can be seen from the figure above, slices are carried and transmitted by NALUs, but that does not mean every NALU carries a slice: it is a sufficient but not a necessary condition, because a NALU may also carry other information that describes the video.


What is a slice?

The main function of a slice is to serve as a carrier for macroblocks (ps: the concept of macroblocks is introduced below). Slices were created to limit the spread of bit errors. How do they limit it? Slices are transmitted independently of each other, and the prediction of one slice (intra or inter) must not use macroblocks in other slices as references, so an error in one slice cannot propagate into its neighbors.

Let’s use a picture to illustrate the specific structure of slice:

We can understand it this way: an image/frame can contain one or more slices, and each slice contains an integer number of macroblocks — at least one, and at most enough to cover the whole image.

From the structure above, it is easy to see that each slice also contains a header and data: 1. The slice header contains the slice type, the types of macroblocks within the slice, the slice's frame number, which image the slice belongs to, and the settings and parameters of the corresponding frame. 2. The slice data holds the macroblocks — this is where the pixel data is stored.

What is a macro block?

The macroblock is the main carrier of video information because it contains the luminance and chrominance information of each pixel. The main task of video decoding is to provide an efficient way to recover the pixel arrays from the code stream. Composition: a macroblock consists of a 16×16 block of luminance samples plus one 8×8 Cb and one 8×8 Cr block of chrominance samples (in 4:2:0 sampling). Within each image, the macroblocks are arranged into slices.

Let’s take a look at the structure of the macroblock:

From the figure above, we can see that a macroblock contains the macroblock type, prediction type, Coded Block Pattern, Quantization Parameter, and the luminance and chrominance pixel data, etc.


The relationship between slice types and macro block types

For slice, it can be divided into the following types:

0: P-slice. Consists of P-macroblocks (each macroblock is predicted using one reference frame) and/or I-macroblocks.
1: B-slice. Consists of B-macroblocks (each macroblock is predicted using one or two reference frames) and/or I-macroblocks.
2: I-slice. Contains only I-macroblocks; each macroblock is predicted from previously coded blocks of the same slice.
3: SP-slice. Consists of P- and/or I-macroblocks and lets you switch between encoded streams.
4: SI-slice. Consists of a special type of SI-macroblocks and lets you switch between encoded streams.

I slice: contains only I macroblocks. An I macroblock uses decoded pixels from the current slice as references for intra-frame prediction (decoded pixels from other slices cannot be used as references).

P slice: can contain P and I macroblocks. A P macroblock uses a previously encoded image as the reference image for inter-frame prediction. An inter-coded macroblock can be further partitioned into 16×16, 16×8, 8×16 or 8×8 luminance blocks (plus the associated chrominance samples); if the 8×8 partition is chosen, each 8×8 sub-macroblock can be further split into 8×8, 8×4, 4×8 or 4×4 luminance blocks (plus the associated chrominance samples).

B slice: can contain B and I macroblocks. A B macroblock uses bidirectional reference images (a past and a future encoded frame) for inter-frame prediction.

SP slice (switch P): used for switching between different encoded streams; contains P and/or I macroblocks.

SI slice: used as a mandatory switch point in the Extended profile; it contains a special type of coded macroblock called the SI macroblock. SI is also a mandatory feature of the Extended profile.


With so many small pieces dissected, it's time for the world map: our overall NALU structure is ready. The image below is taken from the H.264 documentation.


In fact, the code stream structure of H.264 is not as complex as we might think: each GOP of the encoded video is preceded by its SPS and PPS. So our overall structure looks like this:

A GOP (group of pictures) describes the number of frames between one I frame and the next. Increasing the GOP size effectively reduces the size of the encoded video, but it also reduces error resilience and seeking precision. Which to choose depends on your requirements.


Off-topic :(to be continued)

So what information does the type field in the NALU header determine? Take a look at the nal_unit_type_e enumeration:

enum nal_unit_type_e
{
    NAL_UNKNOWN   = 0,   /* unknown/unused */
    NAL_SLICE     = 1,   /* non-IDR slice */
    NAL_SLICE_DPA = 2,   /* data partition A */
    NAL_SLICE_DPB = 3,   /* data partition B */
    NAL_SLICE_DPC = 4,   /* data partition C */
    NAL_SLICE_IDR = 5,   /* slice of an IDR image; ref_idc != 0 */
    NAL_SEI       = 6,   /* supplemental enhancement information; ref_idc == 0 */
    NAL_SPS       = 7,   /* sequence parameter set: information shared by a whole
                            image sequence, i.e. all images between two IDR images,
                            such as image size, video format, etc. */
    NAL_PPS       = 8,   /* picture parameter set: information shared by all
                            slices of one image */
    NAL_AUD       = 9,   /* access unit delimiter */
    NAL_FILLER    = 12,  /* filler data (dummy metadata, used for padding bytes) */
    /* ref_idc == 0 for types 6, 9, 10 (indicating that the next image is an
       IDR image), 11 (indicating that there are no more images in the stream),
       and 12 */
};

Parameter sets (SPS/PPS) are a new concept in the H.264 standard: a way to improve error-recovery ability by restructuring the video bitstream. Ps: the meaning of each type is described in the comments above.

Among the NALU types above, the concept of the slice is now very clear, but SEI, SPS, PPS and so on also use the NALU as their carrier.

Today we are not going to summarize the role of these types in the overall process, but we will pick out two types that fit our topic today, PPS and SPS.


So after today's talk about the H.264 code stream structure, I believe you now have a general outline of it. To sum it up in one sentence:

In H.264, syntactic elements are organized into five levels: sequence, image, slice, macroblock and sub-macroblock.

I hope you can take this to heart. After all, typing and drawing everything by hand is not easy; if you found it useful, please follow, and if you have spare change, feel free to leave a tip.


Final bonus: at the end of the article, for the convenience of friends who want to learn, this blogger wrote example code for iOS devices using FFmpeg. Its main function is to record YUV-format files in real time with the camera, encode them to H.264, and then parse the NALU data of every frame in real time. I hope it helps you find your way in. The project address