
Why?

I'd bet there are a few good video files sitting on your computer somewhere, and even if there aren't right now, there surely were at some point. There was a time when movies could be downloaded freely, and plenty of us only regretted not grabbing more of the blockbusters of those years once that time was gone.

OK, back to the point. In daily life we have all seen video files in AVI, MP4, RMVB, FLV, and other formats, yet very few people dig into what those files actually are. In fact, all of the formats above are container (encapsulation) formats.

What is a container format?

A container format is a specification for packaging audio data and video data into a single file. The differences between container formats are not huge; each has its own strengths.

Playing a video file from the Internet in a video player

Playing a network video goes through the following steps: protocol de-encapsulation, container de-encapsulation, audio/video decoding, and audio/video synchronization. If a local file is played, the protocol step is skipped; the other steps are the same. A code sketch of the whole pipeline follows the list of steps below.

  • Protocol de-encapsulation: parses the streaming-media protocol data into the corresponding standard container-format data.
  • Container de-encapsulation: separates the container data into compressed audio-stream data and compressed video-stream data.
  • Decoding: decodes the compressed video/audio data into uncompressed raw video/audio data.
  • Audio/video synchronization: synchronizes the decoded video and audio according to the parameter information obtained during de-encapsulation, and sends them to the system's graphics card and sound card for playback.
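As a rough illustration of these four steps (the original article contains no code; this sketch assumes FFmpeg's libavformat/libavcodec and omits all error handling and cleanup):

    #include <libavformat/avformat.h>
    #include <libavcodec/avcodec.h>

    /* Minimal sketch: libavformat handles the protocol and container
     * layers, libavcodec decodes; a real player would then sync on
     * frame->pts and hand the raw frame to the display. */
    int play(const char *url)
    {
        AVFormatContext *fmt = NULL;
        if (avformat_open_input(&fmt, url, NULL, NULL) < 0) /* protocol layer */
            return -1;
        avformat_find_stream_info(fmt, NULL);

        int v = av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, NULL, 0);
        const AVCodec *dec = avcodec_find_decoder(fmt->streams[v]->codecpar->codec_id);
        AVCodecContext *ctx = avcodec_alloc_context3(dec);
        avcodec_parameters_to_context(ctx, fmt->streams[v]->codecpar);
        avcodec_open2(ctx, dec, NULL);

        AVPacket *pkt = av_packet_alloc();
        AVFrame *frame = av_frame_alloc();
        while (av_read_frame(fmt, pkt) >= 0) {      /* de-encapsulation */
            if (pkt->stream_index == v) {
                avcodec_send_packet(ctx, pkt);      /* decoding */
                while (avcodec_receive_frame(ctx, frame) == 0) {
                    /* synchronization/display would happen here,
                     * driven by frame->pts */
                }
            }
            av_packet_unref(pkt);
        }
        /* cleanup omitted for brevity */
        return 0;
    }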

Why encode video data?

The main purpose of video coding is to compress raw pixel data (RGB, YUV, etc.) into a video code stream, reducing the amount of data. Here is an example. Take a phone screen of 1280 × 720 (the "720P" option we usually pick in video apps) at 30 frames per second (30 pictures per second). Even if each pixel took only a single bit, one second of video would be 1280 × 720 × 30 / 8 = 3,456,000 bytes ≈ 3.456 MB, one minute would be about 207.36 MB, and a typical movie would be around 18 GB of traffic; real raw pixel data (24-bit RGB, or YUV) is many times larger still. Imagine how scary that would be for storage or network transmission.
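Those numbers are easy to verify; this small program (not from the original post) redoes the arithmetic:

    #include <stdio.h>

    int main(void)
    {
        double w = 1280, h = 720, fps = 30;

        /* the deliberately optimistic case above: 1 bit per pixel */
        double one_bpp = w * h * fps / 8.0;
        printf("1 bpp : %.3f MB/s, %.2f MB/min\n",
               one_bpp / 1e6, one_bpp * 60 / 1e6); /* 3.456 MB/s, 207.36 MB/min */

        /* real raw RGB at 24 bits per pixel is 24 times larger */
        double rgb24 = w * h * 3 * fps;
        printf("RGB24 : %.2f MB/s\n", rgb24 / 1e6); /* 82.94 MB/s */
        return 0;
    }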

It is for exactly this reason that we encode video data: to reduce the data volume as far as possible while losing as little definition as possible. H.264 is currently the most widely used coding format, and the rest of this article focuses on the structure of the H.264 code stream.

Code stream structure

Refreshing the concept of an "image"

In everyday thinking, an image is simply an image; in H.264, however, "image" is a collective concept.

Frames, top fields, and bottom fields can all be called images. A frame is usually a complete image. When a video signal is captured with progressive scanning, each scan produces a complete image, that is, one frame. When it is captured with interlaced scanning (odd and even lines), each scanned image is split into two parts, each called a "field": the "top field" and the "bottom field", in order. The concepts of "frame" and "field" lead to two coding methods: frame coding and field coding. Progressive scanning suits moving images, so frame coding is the better choice for moving images; interlaced scanning suits non-moving images, so field coding is the better choice for non-moving images.

The H.264 raw bitstream

  • Structure: the bitstream consists of one NALU after another, and functionally it is divided into two layers: the VCL (Video Coding Layer) and the NAL (Network Abstraction Layer).
    • VCL: contains the core compression engine and the syntax-level definitions for blocks, macroblocks, and slices; it is designed for coding that is as efficient and as network-independent as possible.
    • NAL: is responsible for adapting the bit strings generated by the VCL to a wide variety of networks and environments; it covers all syntax levels above the slice level.

  • Composition: NALU (NAL Unit) = NALU header + RBSP. Before VCL data is transmitted or stored, the encoded VCL data is mapped or encapsulated into NAL Units. Each NALU consists of a Raw Byte Sequence Payload (RBSP) and a NALU header describing it. The basic structure of an RBSP is the raw encoded data followed by trailing bits: one "1" bit and as many "0" bits as needed for byte alignment.
A raw H.264 NALU therefore usually consists of [StartCode] [NALU Header] [NALU Payload].

  • StartCode: indicates the start of a NALU and must be either "00 00 00 01" or "00 00 01" (hex); a minimal scanner is sketched below.
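This sketch (not from the original article) finds start codes in a buffer:

    #include <stddef.h>

    /* Return the offset of the first byte AFTER the next start code
     * ("00 00 01" or "00 00 00 01"), i.e. the NALU header byte,
     * or -1 if no start code is found. */
    long next_nalu(const unsigned char *buf, size_t len, size_t from)
    {
        for (size_t i = from; i + 3 <= len; i++) {
            if (buf[i] == 0 && buf[i + 1] == 0) {
                if (buf[i + 2] == 1)
                    return (long)(i + 3);   /* 3-byte start code */
                if (i + 4 <= len && buf[i + 2] == 0 && buf[i + 3] == 1)
                    return (long)(i + 4);   /* 4-byte start code */
            }
        }
        return -1;
    }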

  • NALU Header: a single byte identifying the NALU (the original post lists the NAL header types in a table; the nal_unit_type_e enumeration later in this article covers them). It packs three fields: forbidden_zero_bit (1 bit), nal_ref_idc (2 bits), and nal_unit_type (5 bits).

For example:

    00 00 00 01 06: SEI information
    00 00 00 01 07: SPS
    00 00 00 01 08: PPS
    00 00 00 01 05: IDR slice
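Extracting the three header fields from that byte is plain bit masking (a sketch, not from the original article; the struct name is made up):

    /* Split the one-byte NALU header that follows the start code. */
    typedef struct {
        int forbidden_zero_bit; /* must be 0 in a valid stream */
        int nal_ref_idc;        /* 0 = not used as a reference */
        int nal_unit_type;      /* 5 = IDR, 6 = SEI, 7 = SPS, 8 = PPS, ... */
    } NaluHeader;

    NaluHeader parse_nalu_header(unsigned char b)
    {
        NaluHeader h;
        h.forbidden_zero_bit = (b >> 7) & 0x01;
        h.nal_ref_idc        = (b >> 5) & 0x03;
        h.nal_unit_type      =  b       & 0x1F;
        return h;
    }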
  • RBSP: a NAL packet stores its payload data in the Raw Byte Sequence Payload (RBSP); an RBSP is an SODB (String Of Data Bits) with the trailing bits described above appended.

  • The relationship between a frame and NALUs: after passing through the H.264 encoder, a frame is encoded into one or more slices, and the carrier of those slices is the NALU. Note that a slice is not the same concept as a frame: a frame describes one image and corresponds to one image, while the slice, a new concept introduced by H.264, divides the encoded image into pieces in an efficient way. An image has at least one slice, possibly more. Slices are always carried and transmitted in NALUs, but a NALU does not necessarily carry a slice; this is a sufficient but not a necessary condition, because a NALU may also carry other information that describes the video.

What is a slice?

The main function of a slice is to act as the carrier of macroblocks (ps: the concept of a macroblock is introduced below). Slices were created to limit the spread of transmission bit errors: slices are coded independently of one another, so an error in one slice does not propagate into others.

As the slice-structure figure in the original post shows, each slice consists of a slice header and slice data:

  • The slice header contains the slice type, the types of the macroblocks in the slice, the slice's frame number, which image the slice belongs to, and the settings and parameters of the corresponding frame.
  • The slice data contains the macroblocks; this is where the pixel data is actually stored.

What is a macroblock?

  • The macroblock is the main carrier of video information; it contains the luminance and chrominance information of every pixel. The main task of video decoding is to provide an efficient way to recover the pixel arrays inside macroblocks from the code stream.

  • Macroblock composition: a macroblock consists of one 16×16 block of luminance (luma) pixels together with one attached 8×8 Cb and one 8×8 Cr block of chrominance (chroma) pixels (i.e. 4:2:0 sampling). Within each image, macroblocks are arranged in the form of slices.

Here is the structure of the macroblock (shown as a figure in the original post):
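The figure itself is not reproduced here; as a rough stand-in, the pixel payload of one decoded 4:2:0 macroblock can be modeled like this (the type name is illustrative, not from the H.264 specification):

    /* One decoded macroblock in 4:2:0 sampling: a 16x16 block of
     * luma samples plus one 8x8 block each of Cb and Cr chroma. */
    typedef struct {
        unsigned char luma[16][16]; /* Y  */
        unsigned char cb[8][8];     /* Cb */
        unsigned char cr[8][8];     /* Cr */
    } Macroblock420;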

The relationship between slice types and macroblock types

Slices can be divided into the following types:

  • P-slice. Consists of P-macroblocks (each macroblock is predicted using one reference frame) and/or I-macroblocks.

  • B-slice. Consists of B-macroblocks (each macroblock is predicted using one or two reference frames) and/or I-macroblocks.

  • I-slice. Contains only I-macroblocks. Each macroblock is predicted from previously coded blocks of the same slice.

  • SP-slice. Consists of P- and/or I-macroblocks and lets you switch between encoded streams.

  • SI-slice. Consists of a special type of SI-macroblocks and lets you switch between encoded streams.

  • I slice: contains only I macroblocks. An I macroblock uses decoded pixels from the current slice as the reference for intra-frame prediction (decoded pixels from other slices may not be used as references for intra-frame prediction).

  • P slice: may contain P and I macroblocks. A P macroblock uses a previously encoded image as the reference image for inter-frame prediction. An inter-coded macroblock can be further partitioned into macroblock partitions of 16×16, 16×8, 8×16, or 8×8 luma pixels (with the attached chroma pixels); if the 8×8 partition is chosen, each 8×8 sub-macroblock can be divided again into sub-partitions of 8×8, 8×4, 4×8, or 4×4 luma pixels (with the attached chroma pixels).

  • B slice: may contain B and I macroblocks. A B macroblock uses bidirectional reference images (a preceding and a following encoded frame) for inter-frame prediction.

  • SP slice (switching P): used for switching between different coded streams; contains P and/or I macroblocks.

  • SI slice: a slice type required by the Extended profile; it contains a special type of coded macroblock called the SI macroblock. SI is likewise a required feature of the Extended profile.

The overall structure

The code-stream structure of H.264 is not that complicated. After encoding, each group of pictures (GOP) in the video is preceded by its sequence parameter set (SPS) and picture parameter set (PPS), so the overall layout is: SPS, PPS, and then the coded images of the GOP.

The GOP (group of pictures) mainly describes how many frames lie between one I frame and the next I frame. Enlarging the GOP effectively reduces the size of the encoded video, but it also lowers video quality; which trade-off to choose depends on the requirements.

NALU header types

    enum nal_unit_type_e
    {
        NAL_UNKNOWN   = 0,  /* unused */
        NAL_SLICE     = 1,  /* non-IDR slice */
        NAL_SLICE_DPA = 2,  /* slice data partition A */
        NAL_SLICE_DPB = 3,  /* slice data partition B */
        NAL_SLICE_DPC = 4,  /* slice data partition C */
        NAL_SLICE_IDR = 5,  /* IDR slice; ref_idc != 0 */
        NAL_SEI       = 6,  /* supplemental enhancement information; ref_idc == 0 */
        NAL_SPS       = 7,  /* sequence parameter set: all the information of one
                             * image sequence, i.e. everything between two IDR
                             * images, such as image size and video format */
        NAL_PPS       = 8,  /* picture parameter set: the information shared by
                             * all slices of one image */
        NAL_AUD       = 9,  /* access unit delimiter */
        NAL_FILLER    = 12, /* filler data (dummy metadata, used to pad bytes) */
        /* ref_idc == 0 for types 6, 9, 10 (end of sequence: the next image is
         * an IDR image), 11 (end of stream: no more images follow), and 12 */
    };

Ps: the parameter set is a new concept in the H.264 standard; it is a method of enhancing error-recovery capability by improving the structure of the video bitstream.

Supplementary notes

  • I, P, B frames and PTS/DTS

A comparison of the three frame types:

    • I frame (intra-coded frame): usually the first frame of each GOP, moderately compressed; it serves as a reference point for random access and can be regarded as the product of compressing a single image.
    • P frame (forward-predictive coded frame): compresses the transmitted data by exploiting the temporal redundancy with earlier encoded frames in the image sequence; also called a predictive frame.
    • B frame (bidirectionally predictive coded frame): compresses the transmitted data by exploiting the temporal redundancy with both the preceding and the following encoded frames; also called a bidirectional prediction frame.
I, P, B frames
  • I frame: can be decompressed by the decoder into a single complete image on its own.
  • P frame: references the I frame or P frame before it to generate a complete image.
  • B frame: references the preceding I or P frame and the following P frame to generate a complete image.
DTS, PTS
  • PTS (Presentation Time Stamp): indicates when the decoded frame should be displayed.
  • DTS (Decoding Time Stamp): indicates when the bitstream should be fed into the decoder for decoding.
The two differ whenever B frames are present: frames in display order I B B P are decoded in the order I P B B, because both B frames need the later P frame as a reference.
GOP

A GOP is a group of pictures: one group of consecutive pictures. A GOP is usually described by two numbers, for example M = 3 and N = 12: M defines the distance between the I frame and the P frame, and N defines the distance between two I frames. The GOP structure is then I BBP BBP BBP BB, followed by the next I.
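Under those assumptions (M = 3, N = 12), the display-order pattern can even be generated mechanically (an illustrative snippet, not from the original post):

    #include <stdio.h>

    /* Print one GOP in display order: an I frame first, then a P frame
     * every M frames with B frames in between, N frames in total. */
    int main(void)
    {
        int M = 3, N = 12;
        for (int i = 0; i < N; i++)
            putchar(i == 0 ? 'I' : (i % M == 0 ? 'P' : 'B'));
        putchar('\n'); /* prints IBBPBBPBBPBB; the next GOP starts with I */
        return 0;
    }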

IDR

The first image in a coded sequence is called an IDR image (Instantaneous Decoding Refresh image), and every IDR image is an I-frame image.

Both I and IDR frames use intra-frame prediction. An I frame does not reference any other frame, but P and B frames that come after an ordinary I frame may still reference frames before that I frame; IDR frames do not allow this.

  • Core function: H.264 introduces the IDR image for decoder resynchronization. As soon as the decoder decodes an IDR image, it empties the reference-frame queue, outputs or discards all previously decoded data, looks up the parameter sets again, and starts a new sequence. This provides an opportunity to resynchronize if a major error occurred in the previous sequence: images after an IDR image are never decoded using data from images before it. A sketch of this rule appears below.
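In code terms the rule might be sketched like this (the helper functions are hypothetical stand-ins for real decoder internals, not an actual decoder API):

    #include <stdio.h>

    /* Hypothetical stand-ins for real decoder internals. */
    static void flush_reference_frames(void) { printf("reference queue emptied\n"); }
    static void refetch_parameter_sets(void) { printf("parameter sets re-read\n"); }

    /* Resynchronization rule: an IDR slice (nal_unit_type 5) resets the
     * decoder; nothing decoded after it may reference anything before it. */
    static void on_nalu(int nal_unit_type)
    {
        if (nal_unit_type == 5) { /* NAL_SLICE_IDR */
            flush_reference_frames();
            refetch_parameter_sets();
        }
    }

    int main(void) { on_nalu(5); return 0; }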
Intra-frame prediction and inter-frame prediction
  • Intra-frame prediction (also called intra-frame compression)

The original post shows an image divided into numbered blocks: the coding of block 6 can be inferred and calculated from the coding of the already-coded neighboring blocks 1, 2, 3, 4, and 5, so block 6 itself does not need to be fully encoded, only the prediction residual, which compresses the data and saves space. A simplified sketch of one such prediction mode follows.
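As a concrete, simplified instance of this idea, H.264's 4×4 DC intra mode predicts a block as the average of the reconstructed pixels above it and to its left, and only the residual is coded (a sketch, not the full spec logic):

    /* Simplified 4x4 DC intra prediction: predict every sample of the
     * current block as the rounded mean of the 4 reconstructed samples
     * above and the 4 to the left, then keep only the residual. */
    void intra_dc_4x4(const unsigned char top[4], const unsigned char left[4],
                      const unsigned char block[4][4], int residual[4][4])
    {
        int sum = 0;
        for (int i = 0; i < 4; i++)
            sum += top[i] + left[i];
        int dc = (sum + 4) >> 3;                    /* rounded mean of 8 samples */

        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                residual[y][x] = block[y][x] - dc;  /* this is what gets coded */
    }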

  • Inter-frame prediction (also called inter-frame compression)

The original post compares two consecutive frames; the difference between them is actually very small, which is exactly why inter-frame compression makes sense. Several important concepts are involved: block matching, residuals, motion search (motion estimation), and motion compensation. The most common form of inter-frame compression is block matching: find the block in a previous frame that is most similar to the current block, so that instead of encoding the current block's content, only the difference (the residual) between the current block and the matched one needs to be encoded. The process of finding the best-matching block is called motion search, also known as motion estimation. Motion compensation is the process of reconstructing the current block from the residual and the matched block. A minimal motion-search sketch follows.
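Here is a minimal full-search block-matching sketch (illustrative only; real encoders use far smarter search patterns and sub-pixel refinement):

    #include <stdlib.h>
    #include <limits.h>

    /* Sum of absolute differences between two 16x16 blocks. */
    static int sad16(const unsigned char *cur, const unsigned char *ref, int stride)
    {
        int s = 0;
        for (int y = 0; y < 16; y++)
            for (int x = 0; x < 16; x++)
                s += abs(cur[y * stride + x] - ref[y * stride + x]);
        return s;
    }

    /* Full-search motion estimation: try every offset within +/-range
     * around the block at (bx, by) and keep the smallest SAD. The chosen
     * motion vector plus the residual is what gets encoded; motion
     * compensation later rebuilds the block as reference + residual.
     * The caller must keep the search window inside the frame. */
    void motion_search(const unsigned char *cur, const unsigned char *ref,
                       int stride, int bx, int by, int range,
                       int *mv_x, int *mv_y)
    {
        int best = INT_MAX;
        for (int dy = -range; dy <= range; dy++)
            for (int dx = -range; dx <= range; dx++) {
                int s = sad16(cur + by * stride + bx,
                              ref + (by + dy) * stride + (bx + dx), stride);
                if (s < best) { best = s; *mv_x = dx; *mv_y = dy; }
            }
    }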

Reference articles: Understand H264 in Simple Terms; Learn the H264 Structure from Zero; Fundamentals of Audio and Video.