I. Video files
1.1 Video Files and Playback
A video file can be viewed as having two parts: video and audio. A complete video file may contain multiple video streams for different scenes, as well as multiple audio streams.
These video and audio streams are wrapped together in an encapsulation (container) format, producing the familiar MP4, MKV, and AVI files that can be transmitted over the network.
When we receive a video file, we can play it with playback software that supports its format. Playback mainly involves the following steps:
- Decapsulation (demuxing): obtain the compressed audio data and video data separately. Common compressed audio formats include AAC and MP3; common compressed video formats include H.264 and MPEG.
- Decoding: call the decoders to decode the audio and video separately, producing the corresponding sample data: audio sample data (PCM) and video pixel data (YUV).
- Audio and video synchronization.
- Playback: output through the display and the speakers.
II. Coding
2.1 Why encode?
Encoding serves two main purposes. One is to produce a unified data format that is convenient to store and transmit; the other is to remove redundant data.
Imagine one second of 1080p video at 30 frames per second with 32-bit color. If each frame were stored as raw pixels, the data size would be:
32 bit × 30 × 1080 × 1920 ≈ 237 MB. Unless there are special needs, storing and transmitting video this way is obviously unacceptable.
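A quick back-of-the-envelope check of that figure (a minimal sketch; the resolution, frame rate, and bit depth are the values assumed above):

```python
# Raw size of one second of 1920x1080, 30 fps, 32-bit video.
bits_per_pixel = 32
width, height = 1920, 1080
fps = 30

bits_per_second = bits_per_pixel * width * height * fps
megabytes = bits_per_second / 8 / 1024 / 1024
print(f"{megabytes:.0f} MB per second of raw video")  # ~237 MB
```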
If we instead apply encoding algorithms such as MPEG-4 or H.264 to eliminate the redundancy in the video, the actual file size after compression is greatly reduced.
2.2 Data redundancy
As mentioned above, the main purpose of encoding is compression: every encoding method aims to make the video smaller. The core idea is to remove redundant information, which mainly falls into the following categories:
1. Spatial redundancy: redundancy caused by the strong correlation between adjacent pixels in an image.
For example, in a video screenshot whose background is entirely black, we do not actually need to store a black value for every pixel of the frame (e.g. 1124 × 772); we can store the black region once and only store the colors of the remaining pixels.
2. Temporal redundancy: the correlation between different frames in a video image sequence.
Simply put, if frame A and frame B are consecutive frames and the picture changes little between them, frame B does not need to store a complete picture; it only needs to record the changes relative to frame A.
3. Visual redundancy: parts of the image data that the human eye cannot perceive, or is not sensitive to, can be compressed away.
For example, during image encoding and decoding, the image changes somewhat because of noise introduced by compression or by quantization truncation. If these changes cannot be perceived visually, the image is still considered good enough.
In fact, the resolving power of the human visual system is only about 2^6 (64) gray levels, while images are generally quantized with 2^8 (256) gray levels. This kind of redundancy is called visual redundancy.
4. Information entropy redundancy: also known as coding redundancy. The number of bits actually used to express a piece of information is always larger than the theoretical minimum number of bits required; this gap is called information entropy redundancy.
Entropy coding (for example, Huffman coding) reduces it. As a simple run-length style example: 111122222 can be expressed as 4152 (four 1s, five 2s), so nine symbols are represented by four (see the sketch after this list).
5. Knowledge redundancy: some images contain information related to certain prior knowledge (for example, the regular structure of a human face).
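As a toy illustration of the idea in item 4, here is a minimal sketch of a run-length style scheme (this is only the counting idea from the example above, not the actual Huffman algorithm):

```python
from itertools import groupby

def run_length_encode(s: str) -> str:
    # Each run of repeated symbols becomes <count><symbol>: "111122222" -> "4152"
    return "".join(f"{len(list(group))}{symbol}" for symbol, group in groupby(s))

print(run_length_encode("111122222"))  # 4152
```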
2.3 How to encode
Video compression introduces the concept of I, P, and B frames. As mentioned above, frames that change little need only record the data of the changed parts; during decoding, the picture is reconstructed from a complete frame plus the changed data. Frames are divided into three types:
- I frame: key frame. It holds a complete picture of a frame; decoding requires only this frame's data.
- P frame: difference frame. It records the difference between this frame and a previous key frame (or P frame). When decoding, the previously cached picture is overlaid with the difference defined by this frame to generate the final picture.
- B frame: bidirectional difference frame. It records the differences between this frame and both the preceding and the following frames. To decode a B frame, you need not only the previously cached picture but also the decoded following picture; the final picture is obtained by superimposing the preceding and following pictures with this frame's data. B frames have the highest compression rate, because they are compressed by comparing against both the previous and the following frame; this front-and-back comparison also makes B frames the most CPU-intensive to decode.
In general, the I frame has the lowest compression rate, because it retains the most raw image data. A P frame needs only an I frame (or a P frame) plus the changed data, while a B frame needs the preceding I frame or P frame and a following P frame to generate a complete picture.
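A toy sketch of the difference-frame idea (purely illustrative: real decoders work on macroblocks with motion compensation, and the pixel values here are made up):

```python
# Pretend a frame is just a flat list of pixel values.
i_frame = [10, 10, 10, 200, 200, 10, 10, 10]   # I frame: a complete picture
p_diff  = {3: 210, 4: 190}                      # P "frame": only the changed pixels

def decode_p_frame(reference, diff):
    # Overlay the recorded changes on the cached reference picture.
    frame = list(reference)
    for index, value in diff.items():
        frame[index] = value
    return frame

print(decode_p_frame(i_frame, p_diff))  # [10, 10, 10, 210, 190, 10, 10, 10]
```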
In terms of the kinds of redundancy removed:
- I frames remove spatial, visual, and information entropy redundancy.
- P frames and B frames remove temporal redundancy, so both depend on I frames.
In an MPEG-4 encoded video file, each frame begins with the fixed start code 00 00 01 B6, and the next two bits then give its I/P/B frame type (a small sketch after this list shows how to check them):
- 00: I frame
- 01: P frame
- 10: B frame
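A minimal sketch of that check (assuming an MPEG-4 Part 2 elementary stream; the file name video.m4v is only an example):

```python
FRAME_TYPES = {0b00: "I frame", 0b01: "P frame", 0b10: "B frame"}

def count_frame_types(path: str) -> dict:
    counts = {"I frame": 0, "P frame": 0, "B frame": 0}
    with open(path, "rb") as f:
        data = f.read()
    pos = data.find(b"\x00\x00\x01\xb6")
    while pos != -1 and pos + 4 < len(data):
        # The two bits right after the 00 00 01 B6 start code give the frame type.
        frame_type = FRAME_TYPES.get(data[pos + 4] >> 6)
        if frame_type:
            counts[frame_type] += 1
        pos = data.find(b"\x00\x00\x01\xb6", pos + 4)
    return counts

print(count_frame_types("video.m4v"))  # counts depend on the file
```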
III. Introduction to H.264
H.264 is the mainstream bitstream format. Its main features are high image quality, high compression ratio, and good network friendliness. H.264 is based on MPEG-4 technology, and its encoding process mainly includes:
- Inter-frame prediction
- Intra-frame prediction
- Transform and inverse transform
- Quantization and inverse quantization
- In-loop filtering
- Entropy coding
The specific details will not be expanded on here. However, we need to clarify the difference between H.264, MPEG, MP4, and AVI. MPEG and H.264 describe the video stream obtained after compressing the original video; that stream can then be further encapsulated into a specific video file such as MP4. In other words, H.264 is a coding standard, while MP4 is an encapsulation (container) format.
3.1 Layers
H.264 bitstream files are divided into two layers:
- Video Coding Layer (VCL): responsible for efficiently representing the video content. VCL data is the output of the encoding process and represents the compressed video data sequence.
- Network Abstraction Layer (NAL): responsible for packaging and transmitting the data in the manner required by the network. Whether the data is played locally or transmitted over the network, it passes through this layer.
In fact, these two "layers" follow the same idea as the layers in computer networking: a video is encoded by the VCL and then packaged at the NAL layer, much as an HTTP payload is wrapped into packets that can be transmitted over a TCP link; the receiving end unpacks at the same layers and decodes to recover the original video.
Similarly, in everyday programming, data is organized into JavaBeans; to transmit a JavaBean over the network it is serialized, sent over TCP, and then deserialized on the other end so the data can be extracted.
Note that:
- A NAL unit carries a slice, and a frame is not necessarily a single NAL unit: a frame may be split into multiple slices, and therefore multiple NAL units.
3.2 Slices and macroblocks
After a frame of an image is encoded by an H.264 encoder, it becomes one or more slices, each containing an integer number of macroblocks (at least one); the NAL unit is the carrier of these slices and the macroblocks they contain.
In the figure, white horizontal lines divide the image into five horizontal regions, each of which is a slice; within a slice (as hinted by the white line in the second region, which is not fully drawn), the picture is further divided into macroblocks. The NAL layer, in turn, carries the different slices containing the macroblocks.
3.3 Annex B format
The default output of an H.264 encoder is the start code + NALU (NAL Unit), and the start code is 0x00000001 or 0x000001.
The first byte after the start code is the NAL header. (A byte has eight bits, i.e. two hexadecimal digits.) The lower five bits of this byte give the NALU type. For example, 0x67 is 0110 0111 in binary, and its lower five bits, 00111, indicate the type of the current NAL unit.
Five bits can represent 32 different NAL unit types.
The metadata of the video file is given in the first few NAL units of the H.264 stream, so you can see that from the fourth NAL unit onward the start code plus header is basically 00 00 00 01 41, while the first three differ: 67 (hex) -> 0110 0111 (bin) -> type 00111 (7), the sequence parameter set. Similarly, 68 corresponds to type 8, the picture parameter set.
It is worth noting that the third one, 65, corresponds to type 5, the coded slice of an IDR picture. That may sound unfamiliar, but it is in fact the key frame (I frame) mentioned above. Note that the sequence parameter set and the picture parameter set must be followed by an I frame.
Why? Because only key frames can be decoded independently; subsequent P frames depend on the preceding key frame.
The type corresponding to 41 (hex) is 1, a coded slice of a non-IDR picture, i.e. a P frame (marked later as a P slice). In fact, their lengths already hint at this: I frames are longer, while P frames are shorter. As a key frame, an I frame must keep enough data to decode a complete picture and therefore needs more information, whereas a P frame only records the changes, so it is naturally shorter.
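To see these headers for ourselves, here is a small sketch that walks an Annex B file and prints the type of each NAL unit (assumptions: a file like the hello.h264 produced in the next section, and only a few common type codes mapped to names):

```python
NAL_TYPES = {1: "non-IDR slice (P/B)", 5: "IDR slice (I frame)", 6: "SEI", 7: "SPS", 8: "PPS"}

def list_nal_units(path: str):
    with open(path, "rb") as f:
        data = f.read()
    i = 0
    while i < len(data) - 4:
        # Match either the 4-byte (00 00 00 01) or 3-byte (00 00 01) start code.
        if data[i:i + 4] == b"\x00\x00\x00\x01":
            header = data[i + 4]
            i += 4
        elif data[i:i + 3] == b"\x00\x00\x01":
            header = data[i + 3]
            i += 3
        else:
            i += 1
            continue
        nal_type = header & 0x1F  # lower five bits of the NAL header
        print(f"header=0x{header:02x}  type={nal_type} ({NAL_TYPES.get(nal_type, 'other')})")

list_nal_units("hello.h264")
```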
3.4 H.264 conversion practice and inspection
First, if we do not already have an H.264 file, we need a tool: FFmpeg, which can be downloaded from the official FFmpeg website. If you already have an H.264 file, you can skip this step.
Second, we need a video; I personally downloaded an MV from a music app.
Then download a tool, BSAnalyser, which is used to parse H.264 files. It can be found on GitHub; if the connection is slow, it can also be downloaded from the Gitee mirror:
After that, add FFmpeg's bin directory to the Path. Then run the following command on the video in CMD:
ffmpeg -i cp.mp4 -vcodec h264 -preset fast -b:v 2000k hello.h264
cp.mp4 is our source video file, and hello.h264 is the output H.264 file.
We drag the generated hello.h264 into BSAnalyser to view the parsed result.
3.5 H.264 and MPEG video coding standards
The more important codec standards include the International Telecommunication Union (ITU-T) H.261, H.263, H.264, etc., and the MPEG series of standards (MPEG-1, MPEG-2, MPEG-4 AVC, etc.) from the Moving Picture Experts Group, established in 1988 by ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission).
Among them, ITU-T H.264 / MPEG-4 AVC is a standard jointly developed by ITU-T and ISO/IEC. ITU-T calls it H.264, while ISO/IEC places it in the MPEG series and calls it MPEG-4 AVC.
H.265, also known as High Efficiency Video Coding (HEVC), is the successor of the ITU-T H.264/MPEG-4 AVC standard.
Although the newer H.265 has been released, mainstream video today is still mostly encoded with H.264.