
The last article introduced the basics of audio and video development. This one covers the main parameters of audio and video frames and how to analyze them, as well as audio-video synchronization. The main contents are as follows:

  1. Audio frame
  2. Video frame
  3. PTS and DTS
  4. Audio and video synchronization

Audio frame

The concept of an audio frame is not as clear-cut as that of a video frame: for almost all video encoding formats, a frame can simply be regarded as one encoded image, whereas an audio frame depends on the encoding format. A PCM audio stream, for example, can be played directly without any framing. The following uses the MPEG audio frame format as an example.

Frame size

Frame size refers to the number of samples per frame. It is constant for a given MPEG version and layer, as follows:

            MPEG 1   MPEG 2   MPEG 2.5
Layer I       384      384       384
Layer II     1152     1152      1152
Layer III    1152      576       576

Frame length

Frame length refers to the length of each compressed frame, including the frame header and the padding. It is not constant, because of padding and bit-rate changes. The padding bit is read from bit 9 of the frame header.

Padding is used to make the frame sizes fit the target bit rate exactly. For example, a 128 kbps, 44.1 kHz Layer II stream uses mostly 418-byte frames and some 417-byte frames to achieve an exact 128 kbps bit rate. The padding slot is 32 bits long for Layer I, and 8 bits long for Layer II and Layer III.

In other words, the padding for Layer I is 4 bytes, while for Layer II and Layer III it is 1 byte. This value must be taken into account when reading an MPEG file in order to locate adjacent frames. The frame length is calculated as follows:

// Layer I (SampleSize = 384), unit: bytes
FrameLengthInBytes = SampleSize / 8 * BitRate / SampleRate + Padding * 4
                   = 48 * BitRate / SampleRate + Padding * 4
// Layer II & III (SampleSize = 1152), unit: bytes
FrameLengthInBytes = SampleSize / 8 * BitRate / SampleRate + Padding
                   = 144 * BitRate / SampleRate + Padding

SampleSize is the number of samples per frame, which is fixed and given in the frame-size table above. Padding is the padding bit, BitRate is the bit rate, and SampleRate is the sampling rate; the bit rate and sampling rate are both read from the frame header.

If an MP3 file has a bit rate of 320 kbps, a sampling rate of 44.1 kHz, and no padding, its frame length is 144 × 320000 / 44100 ≈ 1044 bytes.
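
As a minimal sketch of this calculation in Python (the function name is mine, and the header fields are assumed to have been extracted already):

# Minimal sketch: MPEG-1 audio frame length from header fields.
# bit_rate is in bits per second, sample_rate in Hz, padding is 0 or 1.
def frame_length_bytes(layer: int, bit_rate: int, sample_rate: int, padding: int) -> int:
    if layer == 1:
        # Layer I: 384 samples per frame, 4-byte slots
        return (12 * bit_rate // sample_rate + padding) * 4
    # Layer II & III: 1152 samples per frame, 1-byte slots
    return 144 * bit_rate // sample_rate + padding

# 320 kbps, 44.1 kHz, Layer III, no padding -> 1044 bytes
print(frame_length_bytes(3, 320000, 44100, 0))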

Bit rate

The bit rate (in kbps) is obtained from bits 12 to 15 of the MPEG audio frame header:

bits   V1,L1   V1,L2   V1,L3   V2,L1   V2,L2&L3
0000   free    free    free    free    free
0001   32      32      32      32      8
0010   64      48      40      48      16
0011   96      56      48      56      24
0100   128     64      56      64      32
0101   160     80      64      80      40
0110   192     96      80      96      48
0111   224     112     96      112     56
1000   256     128     112     128     64
1001   288     160     128     144     80
1010   320     192     160     160     96
1011   352     224     192     176     112
1100   384     256     224     192     128
1101   416     320     256     224     144
1110   448     384     320     256     160
1111   bad     bad     bad     bad     bad

Notes on the table:

  • V1: MPEG Version 1
  • V2: MPEG Version 2 and Version 2.5
  • L1: Layer I
  • L2: Layer II
  • L3: Layer III

MPEG files can also use a variable bit rate (VBR), meaning the bit rate changes from frame to frame, so it should be read from each frame header rather than assumed constant.

Sampling rate

The sampling rate, in Hz, is obtained from bits 10 to 11 of the MPEG audio frame header, as follows:

bits   MPEG1      MPEG2      MPEG2.5
00     44100      22050      11025
01     48000      24000      12000
10     32000      16000      8000
11     reserved   reserved   reserved
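
As a minimal sketch, here is how these fields could be read from a 4-byte frame header in Python; the bit positions are counted from the least-significant bit of the 32-bit header, as in the sections above, and the lookup tables cover only the MPEG-1 Layer III column and MPEG1 row as an assumption for brevity:

import struct  # not required; shown only to emphasize these are raw bytes

# Minimal sketch: read bit rate, sampling rate, and padding from
# a 4-byte MPEG audio frame header (MPEG-1 Layer III values only).
BITRATE_KBPS = [0, 32, 40, 48, 56, 64, 80, 96,
                112, 128, 160, 192, 224, 256, 320, 0]   # V1,L3 column
SAMPLE_RATE_HZ = [44100, 48000, 32000, 0]               # MPEG1 row

def parse_header(header: bytes):
    h = int.from_bytes(header, "big")                # the 32-bit header value
    bitrate = BITRATE_KBPS[(h >> 12) & 0xF] * 1000   # bits 12-15
    sample_rate = SAMPLE_RATE_HZ[(h >> 10) & 0x3]    # bits 10-11
    padding = (h >> 9) & 0x1                         # bit 9
    return bitrate, sample_rate, padding

# 0xFFFBE000: MPEG-1 Layer III, 320 kbps, 44.1 kHz, no padding
print(parse_header(bytes.fromhex("FFFBE000")))  # (320000, 44100, 0)
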
Duration of each frame

The duration of a frame is calculated as follows:

// unit: ms
FrameTime = SampleSize / SampleRate * 1000

SampleSize is the number of samples per frame (the frame size), and SampleRate is the sampling rate.

For example, each frame of an MP3 file sampled at 44.1 kHz lasts 1152 / 44100 × 1000 ≈ 26 ms, which is where the often-quoted fixed per-frame playing time of MP3 comes from.
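
The same calculation as a quick Python check (the function name is mine):

# Per-frame duration in milliseconds: samples per frame / sampling rate
def frame_duration_ms(sample_size: int, sample_rate: int) -> float:
    return sample_size / sample_rate * 1000

print(frame_duration_ms(1152, 44100))  # ~26.12 ms for MP3 at 44.1 kHz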

Video frame

Video compression applies different algorithms to different video frames to reduce the amount of data. Usually only the differences between images are encoded, so information that stays the same does not need to be sent repeatedly. These different algorithms are generally called picture types or frame types. The three main picture types are I, P, and B, with the following characteristics:

  • I frame: intra-coded frame, usually the first frame of each GOP (described below). It has the lowest compression ratio and can be decoded without reference to any other frame; it is a complete picture on its own. I frames are generally used for random access and as references for decoding other pictures.
  • P frame: forward-predicted frame, which encodes the difference from the previous I or P frame and needs that reference to produce a complete picture. P frames compress better than I frames and save space, so they are also called delta frames.
  • B frame: bidirectionally predicted frame, which encodes differences relative to both the preceding and following frames. It needs the previous I or P frame and the following P frame to produce a complete picture, and it has the highest compression ratio.

The frames above are usually divided into macroblocks, the basic unit of motion prediction. A complete image is split into several macroblocks (MPEG-2 and earlier codecs, for example, define a macroblock as 16×16 pixels), and the prediction type is chosen per macroblock rather than being the same for the entire image, as follows:

  • I frame: contains only intra macroblocks.
  • P frame: can contain intra macroblocks or predicted macroblocks.
  • B frame: can contain intra, predicted, and bidirectionally predicted macroblocks.

A schematic diagram of I, P, and B frames is as follows:

In the H.264/MPEG-4 AVC standard, the granularity of prediction types is reduced to the slice level. A slice is a spatially distinct region of a frame that is encoded separately from any other region in the same frame, and I slices, P slices, and B slices take the place of I, P, and B frames. That is all we will say about slices here.

As mentioned above, GOP is short for Group of Pictures. Each GOP starts with an I frame, and the remaining frames are P and B frames, as shown below:

The order shown in the figure above is:

I1, B2, B3, B4, P5, B6, B7, B8, P9, B10, B11, B12, I13

The decoding order is as follows:

I1, P5, B2, B3, B4, P9, B6, B7, B8, I13, B10, B11, B12

The subscript numbers represent the PTS of the original frames, which can be understood as their positions within the GOP.

DTS and PTS

  • DTS (Decoding Time Stamp): indicates when a compressed frame should be decoded; it tells the player when to decode this frame's data.
  • PTS (Presentation Time Stamp): indicates when the decoded frame should be displayed; it tells the player when to present this frame's data during playback.

For audio, DTS and PTS are identical. For video, they can differ because B frames are bidirectionally predicted: if a GOP contains no B frames, DTS and PTS are the same; otherwise they differ, as in the following example.

Display order           I1   B2   B3   P4   B5   P6
PTS                     1    2    3    4    5    6
Decoding order          I1   P4   B2   B3   P6   B5
PTS in decoding order   1    4    2    3    6    5

The order in which the receiver decodes frames from the stream is clearly not the display order, so the decoded frames must be reordered by PTS before being displayed.
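
A minimal sketch of this reordering in Python (the buffer size and the (pts, frame) representation are illustrative assumptions, not from any particular player):

import heapq

# Minimal sketch: reorder frames from decoding order into
# presentation (PTS) order using a small min-heap buffer.
def reorder_by_pts(decoded_frames, buffer_size=3):
    # decoded_frames: iterable of (pts, frame) pairs in decoding order
    heap = []
    for item in decoded_frames:
        heapq.heappush(heap, item)
        if len(heap) > buffer_size:
            yield heapq.heappop(heap)   # smallest buffered PTS is safe to show
    while heap:
        yield heapq.heappop(heap)

# Decoding order from the example above, as (PTS, frame) pairs
decode_order = [(1, "I1"), (4, "P4"), (2, "B2"), (3, "B3"), (6, "P6"), (5, "B5")]
print([f for _, f in reorder_by_pts(decode_order)])
# ['I1', 'B2', 'B3', 'P4', 'B5', 'P6']

In a real player the buffer depth must cover the maximum reordering distance of the stream; three frames happen to suffice for this example.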

Audio and video synchronization

To outline the playback pipeline: a microphone and a camera capture data, which is encoded by the audio and video encoders respectively and then multiplexed, i.e. packaged into a media container format. When a media file is received, it must be demultiplexed to separate the audio and video streams; each stream is then decoded, and audio and video are played back independently. Because their playback rates differ, audio and video can drift apart. The two metrics that govern playback are:

  • Audio: sampling rate
  • Video: Frame rate

Sound cards and video cards generally play on a per-frame basis, so the duration of each audio frame and each video frame can be calculated in the same way. For example:

As calculated above, each frame of an MP3 file sampled at 44.1 kHz lasts about 26 ms. If the video frame rate is 30 fps, each video frame lasts 1000 / 30 ≈ 33 ms. If, in the ideal case, both streams could be played exactly according to these calculated durations, the audio and video would stay synchronized.

In practice, however, various factors get in the way: the decoding and rendering time differs from frame to frame (a video frame with rich detail may decode and render more slowly than a simple one), and calculation errors accumulate. There are three main approaches to audio-video synchronization:

  • Video syncs to audio
  • Audio syncs to video
  • Audio and video synced to external clock

Video is generally synchronized to the audio clock, mainly because human hearing is more sensitive to delays and stutters than human vision, so the audio should be kept playing normally as far as possible. Synchronization here tolerates a certain delay, as long as it stays within an acceptable range, and works like a feedback mechanism: when the video falls behind the audio, the player speeds up video playback, dropping frames if necessary to catch up; when the video runs ahead of the audio, the player slows video playback down.
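
A minimal sketch of this feedback loop in Python (the threshold value and function name are illustrative assumptions; real players tune these):

# Minimal sketch: pace video frames against the audio clock.
SYNC_THRESHOLD_MS = 40   # assumed tolerance before correcting

def video_frame_delay_ms(frame_pts_ms, audio_clock_ms, frame_duration_ms):
    diff = frame_pts_ms - audio_clock_ms   # > 0: video ahead, < 0: video behind
    if diff < -SYNC_THRESHOLD_MS:
        return 0.0                         # video lags: show now (or drop frames)
    if diff > SYNC_THRESHOLD_MS:
        return frame_duration_ms + diff    # video ahead: wait longer
    return frame_duration_ms               # within tolerance: normal pacing

# Frame with PTS 1000 ms while the audio clock reads 1100 ms: video lags
print(video_frame_delay_ms(1000, 1100, 33.3))  # 0.0 -> display or drop immediately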

You can follow my personal WeChat official account to practice, exchange, and learn together.