
The last article introduced the basics of audio and video development. This one covers the main parameters of audio and video frames and how to analyze them, as well as audio-video synchronization. The main contents are as follows:

  1. Audio frame
  2. Video frame
  3. PTS and DTS
  4. Audio and video synchronization

Audio frame

The concept of an audio frame is not as clear-cut as that of a video frame: for almost all video encoding formats, a frame can simply be regarded as one encoded image, whereas an audio frame depends on the encoding format. A PCM audio stream, for example, can be played directly without any framing. The following uses the MPEG audio frame format as an example.

Frame size

Frame size refers to the number of samples per frame. It is constant for a given MPEG version and layer, as follows:

            MPEG 1   MPEG 2   MPEG 2.5
Layer I       384      384       384
Layer II     1152     1152      1152
Layer III    1152      576       576

Frame length

Frame length refers to the length of each compressed frame, including the frame header and the padding. It is not constant, because of padding and bit-rate changes. The padding bit is read from bit 9 of the frame header.

Padding is used to make the frame sizes fit the target bit rate exactly. For example, a 128 kbps, 44.1 kHz Layer II stream uses mostly 418-byte frames and some 417-byte frames to achieve an exact 128 kbps bit rate. The padding slot is 32 bits long for Layer I, and 8 bits long for Layer II and Layer III.

In other words, the padding for Layer I is 4 bytes, while for Layer II and Layer III it is 1 byte. This value must be taken into account when reading an MPEG file in order to locate adjacent frames. The frame length is calculated as follows:

// Layer I (SampleSize = 384), unit: bytes
FrameLengthInBytes = SampleSize / 8 * BitRate / SampleRate + Padding * 4
                   = 48 * BitRate / SampleRate + Padding * 4
// Layer II & III (SampleSize = 1152), unit: bytes
FrameLengthInBytes = SampleSize / 8 * BitRate / SampleRate + Padding
                   = 144 * BitRate / SampleRate + Padding

SampleSize is the number of samples per frame, which is fixed and given in the frame-size table above. Padding is the padding bit, BitRate is the bit rate, and SampleRate is the sampling rate; the bit rate and sampling rate are both read from the frame header.

If an MP3 file has a bit rate of 320 kbps, a sampling rate of 44.1 kHz, and no padding, its frame length is 144 × 320000 / 44100 ≈ 1044 bytes.
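
As a minimal sketch of this calculation in Python (the function name is mine, and the header fields are assumed to have been extracted already):

# Minimal sketch: MPEG-1 audio frame length from header fields.
# bit_rate is in bits per second, sample_rate in Hz, padding is 0 or 1.
def frame_length_bytes(layer: int, bit_rate: int, sample_rate: int, padding: int) -> int:
    if layer == 1:
        # Layer I: 384 samples per frame, 4-byte slots
        return (12 * bit_rate // sample_rate + padding) * 4
    # Layer II & III: 1152 samples per frame, 1-byte slots
    return 144 * bit_rate // sample_rate + padding

# 320 kbps, 44.1 kHz, Layer III, no padding -> 1044 bytes
print(frame_length_bytes(3, 320000, 44100, 0))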

Bit rate

The bit rate (in kbps) is obtained from bits 12 to 15 of the MPEG audio frame header:

bits   V1,L1   V1,L2   V1,L3   V2,L1   V2,L2&L3
0000   free    free    free    free    free
0001   32      32      32      32      8
0010   64      48      40      48      16
0011   96      56      48      56      24
0100   128     64      56      64      32
0101   160     80      64      80      40
0110   192     96      80      96      48
0111   224     112     96      112     56
1000   256     128     112     128     64
1001   288     160     128     144     80
1010   320     192     160     160     96
1011   352     224     192     176     112
1100   384     256     224     192     128
1101   416     320     256     224     144
1110   448     384     320     256     160
1111   bad     bad     bad     bad     bad

Notes on the table:

  • V1: MPEG Version 1
  • V2: MPEG Version 2 and Version 2.5
  • L1: Layer I
  • L2: Layer II
  • L3: Layer III

MPEG files can also use a variable bit rate (VBR), meaning the bit rate changes from frame to frame, so it should be read from each frame header rather than assumed constant.

Sampling rate

The sampling rate, in Hz, is obtained from bits 10 to 11 of the MPEG audio frame header, as follows:

bits   MPEG1      MPEG2      MPEG2.5
00     44100      22050      11025
01     48000      24000      12000
10     32000      16000      8000
11     reserved   reserved   reserved
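
As a minimal sketch, here is how these fields could be read from a 4-byte frame header in Python; the bit positions are counted from the least-significant bit of the 32-bit header, as in the sections above, and the lookup tables cover only the MPEG-1 Layer III column and MPEG1 row as an assumption for brevity:

import struct  # not required; shown only to emphasize these are raw bytes

# Minimal sketch: read bit rate, sampling rate, and padding from
# a 4-byte MPEG audio frame header (MPEG-1 Layer III values only).
BITRATE_KBPS = [0, 32, 40, 48, 56, 64, 80, 96,
                112, 128, 160, 192, 224, 256, 320, 0]   # V1,L3 column
SAMPLE_RATE_HZ = [44100, 48000, 32000, 0]               # MPEG1 row

def parse_header(header: bytes):
    h = int.from_bytes(header, "big")                # the 32-bit header value
    bitrate = BITRATE_KBPS[(h >> 12) & 0xF] * 1000   # bits 12-15
    sample_rate = SAMPLE_RATE_HZ[(h >> 10) & 0x3]    # bits 10-11
    padding = (h >> 9) & 0x1                         # bit 9
    return bitrate, sample_rate, padding

# 0xFFFBE000: MPEG-1 Layer III, 320 kbps, 44.1 kHz, no padding
print(parse_header(bytes.fromhex("FFFBE000")))  # (320000, 44100, 0)
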
Duration of each frame

The duration of a frame is calculated as follows:

// unit: ms
FrameTime = SampleSize / SampleRate * 1000

SampleSize is the number of samples per frame (the frame size), and SampleRate is the sampling rate.

For example, each frame of an MP3 file sampled at 44.1 kHz lasts 1152 / 44100 × 1000 ≈ 26 ms, which is where the often-quoted fixed per-frame playing time of MP3 comes from.
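
The same calculation as a quick Python check (the function name is mine):

# Per-frame duration in milliseconds: samples per frame / sampling rate
def frame_duration_ms(sample_size: int, sample_rate: int) -> float:
    return sample_size / sample_rate * 1000

print(frame_duration_ms(1152, 44100))  # ~26.12 ms for MP3 at 44.1 kHz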

Video frame

Video compression applies different algorithms to different video frames to reduce the amount of data. Usually only the differences between images are encoded, so information that stays the same does not need to be sent repeatedly. These different algorithms are generally called picture types or frame types. The three main picture types are I, P, and B, with the following characteristics:

  • I frame: intra-coded frame, usually the first frame of each GOP (described below). It has the lowest compression ratio and can be decoded without reference to any other frame; it is a complete picture on its own. I frames are generally used for random access and as references for decoding other pictures.
  • P frame: forward-predicted frame, which encodes the difference from the previous I or P frame and needs that reference to produce a complete picture. P frames compress better than I frames and save space, so they are also called delta frames.
  • B frame: bidirectionally predicted frame, which encodes differences relative to both the preceding and following frames. It needs the previous I or P frame and the following P frame to produce a complete picture, and it has the highest compression ratio.

The frames above are usually divided into macroblocks, the basic unit of motion prediction. A complete image is split into several macroblocks (MPEG-2 and earlier codecs, for example, define a macroblock as 16×16 pixels), and the prediction type is chosen per macroblock rather than being the same for the entire image, as follows:

  • I frame: contains only intra macroblocks.
  • P frame: can contain intra macroblocks or predicted macroblocks.
  • B frame: can contain intra, predicted, and bidirectionally predicted macroblocks.

A schematic diagram of I, P, and B frames is as follows:

In the H.264/MPEG-4 AVC standard, the granularity of prediction types is reduced to the slice level. A slice is a spatially distinct region of a frame that is encoded separately from any other region in the same frame, and I slices, P slices, and B slices take the place of I, P, and B frames. That is all we will say about slices here.

As mentioned above, GOP is short for Group of Pictures. Each GOP starts with an I frame, and the remaining frames are P and B frames, as shown below:

The order shown in the figure above is:

I1, B2, B3, B4, P5, B6, B7, B8, P9, B10, B11, B12, I13

The decoding order is as follows:

I1, P5, B2, B3, B4, P9, B6, B7, B8, I13, B10, B11, B12

The subscript numbers represent the PTS of the original frames, which can be understood as their positions within the GOP.

DTS and PTS

  • DTS (Decoding Time Stamp): indicates when a compressed frame should be decoded; it tells the player when to decode this frame's data.
  • PTS (Presentation Time Stamp): indicates when the decoded frame should be displayed; it tells the player when to present this frame's data during playback.

For audio, DTS and PTS are identical. For video, they can differ because B frames are bidirectionally predicted: if a GOP contains no B frames, DTS and PTS are the same; otherwise they differ, as in the following example.

Display order           I1   B2   B3   P4   B5   P6
PTS                     1    2    3    4    5    6
Decoding order          I1   P4   B2   B3   P6   B5
PTS in decoding order   1    4    2    3    6    5

The order in which the receiver decodes frames from the stream is clearly not the display order, so the decoded frames must be reordered by PTS before being displayed.
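
A minimal sketch of this reordering in Python (the buffer size and the (pts, frame) representation are illustrative assumptions, not from any particular player):

import heapq

# Minimal sketch: reorder frames from decoding order into
# presentation (PTS) order using a small min-heap buffer.
def reorder_by_pts(decoded_frames, buffer_size=3):
    # decoded_frames: iterable of (pts, frame) pairs in decoding order
    heap = []
    for item in decoded_frames:
        heapq.heappush(heap, item)
        if len(heap) > buffer_size:
            yield heapq.heappop(heap)   # smallest buffered PTS is safe to show
    while heap:
        yield heapq.heappop(heap)

# Decoding order from the example above, as (PTS, frame) pairs
decode_order = [(1, "I1"), (4, "P4"), (2, "B2"), (3, "B3"), (6, "P6"), (5, "B5")]
print([f for _, f in reorder_by_pts(decode_order)])
# ['I1', 'B2', 'B3', 'P4', 'B5', 'P6']

In a real player the buffer depth must cover the maximum reordering distance of the stream; three frames happen to suffice for this example.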

Audio and video synchronization

To outline the playback pipeline: a microphone and a camera capture data, which is encoded by the audio and video encoders respectively and then multiplexed, i.e. packaged into a media container format. When a media file is received, it must be demultiplexed to separate the audio and video streams; each stream is then decoded, and audio and video are played back independently. Because their playback rates differ, audio and video can drift apart. The two metrics that govern playback are:

  • Audio: sampling rate
  • Video: Frame rate

Sound cards and video cards generally play on a per-frame basis, so the duration of each audio frame and each video frame can be calculated in the same way. For example:

As calculated above, each frame of an MP3 file sampled at 44.1 kHz lasts about 26 ms. If the video frame rate is 30 fps, each video frame lasts 1000 / 30 ≈ 33 ms. If, in the ideal case, both streams could be played exactly according to these calculated durations, the audio and video would stay synchronized.

In practice, however, various factors get in the way: the decoding and rendering time differs from frame to frame (a video frame with rich detail may decode and render more slowly than a simple one), and calculation errors accumulate. There are three main approaches to audio-video synchronization:

  • Video syncs to audio
  • Audio syncs to video
  • Audio and video synced to external clock

Video is generally synchronized to the audio clock, mainly because human hearing is more sensitive to delays and stutters than human vision, so the audio should be kept playing normally as far as possible. Synchronization here tolerates a certain delay, as long as it stays within an acceptable range, and works like a feedback mechanism: when the video falls behind the audio, the player speeds up video playback, dropping frames if necessary to catch up; when the video runs ahead of the audio, the player slows video playback down.
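
A minimal sketch of this feedback loop in Python (the threshold value and function name are illustrative assumptions; real players tune these):

# Minimal sketch: pace video frames against the audio clock.
SYNC_THRESHOLD_MS = 40   # assumed tolerance before correcting

def video_frame_delay_ms(frame_pts_ms, audio_clock_ms, frame_duration_ms):
    diff = frame_pts_ms - audio_clock_ms   # > 0: video ahead, < 0: video behind
    if diff < -SYNC_THRESHOLD_MS:
        return 0.0                         # video lags: show now (or drop frames)
    if diff > SYNC_THRESHOLD_MS:
        return frame_duration_ms + diff    # video ahead: wait longer
    return frame_duration_ms               # within tolerance: normal pacing

# Frame with PTS 1000 ms while the audio clock reads 1100 ms: video lags
print(video_frame_delay_ms(1000, 1100, 33.3))  # 0.0 -> display or drop immediately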

You can follow my personal WeChat official account to practice, exchange, and learn together.