This is Section 7 of the FFmpeg player development learning notes: "Audio and Video Synchronization".
Generally speaking, audio and video synchronization means that the sound being played matches the picture currently displayed. Imagine watching a movie where a character's mouth moves but no sound comes out, or where a fierce battle scene is accompanied not by gunfire but by dialogue. That is a very poor experience.
✅ Section 1 – Hello FFmpeg
✅ Section 2 – Software-decode the video stream, render RGB24
✅ Section 3 – Understand YUV
✅ Section 4 – Hardware decoding, render YUV with OpenGL
✅ Section 5 – Render YUV with Metal
✅ Section 6 – Decode audio, play it with AudioQueue
🔔 Section 7 – Audio and video synchronization
📗 Section 8 – Perfect playback control
📗 Section 9 – Double-speed playback
📗 Section 10 – Add video filter effects
📗 Section 11 – Audio changes
The demo for this section is at github.com/czqasngit/f… The example code provides both Objective-C and Swift implementations. This article cites the Objective-C code for ease of illustration, since pointer handling in Swift reads less cleanly.
Goals
- The background of audio and video synchronization and the causes of desynchronization
- Audio and video synchronization schemes and how to choose one
- Implementing audio and video synchronization in code
The background of audio and video synchronization and the causes of desynchronization
The video stream and the audio stream each carry the information that determines how fast they should play. A video's frame rate indicates how many frames (images) are displayed per second, and the audio sample rate indicates how many audio samples are played per second. From these values, the playback time of any given frame (or sample) can be obtained with a simple calculation, and at that pace audio and video each play independently. Under ideal conditions they would stay synchronized with no deviation. In practice, if only this simple calculation is used, audio and video slowly drift out of sync: either the video runs ahead or the audio does.

What is needed is a linearly increasing reference clock that both the video and the audio playback speeds are measured against: the side that is running fast slows down and waits, and the side that is running slow catches up. Audio and video synchronization is therefore a dynamic process; being in sync is momentary, while being slightly out of sync is the normal state. Once a reference clock is chosen as the standard, the fast one waits for the slow one and the slow one hurries to catch up, in a continuous game of "you wait for me, then I catch up".
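As a minimal illustration of the "simple calculation" mentioned above (hypothetical helper functions, not code from the demo), the nominal presentation time of a frame or sample can be derived directly from the frame rate and the sample rate:

// Hypothetical helpers illustrating the naive timing calculation.
// Using only these formulas, small errors accumulate and A/V drift appears over time.

// Video: nominal display time of the n-th frame at a given frame rate (fps).
static double VideoFramePresentationTime(int64_t frameIndex, double fps) {
    return (double)frameIndex / fps;          // e.g. frame 150 at 25 fps -> 6.0 s
}

// Audio: nominal play time of the n-th sample at a given sample rate (Hz).
static double AudioSamplePresentationTime(int64_t sampleIndex, double sampleRate) {
    return (double)sampleIndex / sampleRate;  // e.g. sample 88200 at 44100 Hz -> 2.0 s
}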
Audio and video synchronization schemes and how to choose one
There are three ways to deal with audio and video synchronization:
1. Synchronize the video clock to the audio clock
With the audio clock as the master clock, the audio plays naturally at its own pace. When a video frame is about to be played, the time at which the current video frame will finish displaying is compared with the current audio clock. If the video frame's end time is later than the audio clock (the video is running fast), the video playback thread sleeps for the difference so that the frame stays in step with the audio clock. If the video frame's end time is earlier than the audio clock (the video is running slow), the current video frame is discarded and the next one is read, until the video is back in sync with the audio clock.
2. Synchronize the audio clock to the video clock
With the video clock as the master clock, the video plays naturally. The synchronization logic mirrors scheme 1: if the audio is running fast, the audio playback thread pauses for the time difference; if the audio is running slow, the current audio frame is discarded, so that the audio frames being played stay in step with the video clock.
3. Synchronize both the audio and video clocks to an external clock
The synchronization logic is the same as in schemes 1 and 2. Note that the external clock should have at least millisecond precision to keep the synchronization accurate. The comparison that all three schemes share is sketched below.
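A minimal sketch of that shared decision, assuming a master clock value in seconds and a tolerance value (the names below are illustrative and do not come from the demo):

#import <Foundation/Foundation.h>

/// Hypothetical sketch of the master-clock comparison shared by all three schemes.
/// 'frameEnd' is when the current frame of the slave stream would finish playing,
/// 'master' is the current master clock reading, 'tolerance' is the allowed error,
/// all expressed in seconds.
typedef NS_ENUM(NSInteger, FFSyncAction) {
    FFSyncActionPlay,   // within tolerance: play the frame as-is
    FFSyncActionWait,   // slave is ahead of the master: sleep for the difference
    FFSyncActionDrop,   // slave is behind the master: drop the frame, read the next one
};

static FFSyncAction FFSyncDecide(double frameEnd, double master, double tolerance) {
    if (frameEnd - master > tolerance) return FFSyncActionWait;
    if (master - frameEnd > tolerance) return FFSyncActionDrop;
    return FFSyncActionPlay;
}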
Synchronization Scheme Selection
All three schemes above can achieve audio and video synchronization, so how do we choose the most suitable one?
- Human eyes and ears have different sensitivities to images and sound. The eye rarely notices when a frame, or even several frames, is occasionally dropped, because consecutive pictures are highly coherent and the difference between two adjacent frames is often very small; the eye is far less sensitive than the ear. When something changes in the sound, such as a missing or abnormal sample, the ear detects it immediately.
- On most platforms, playing sound costs less than rendering video. Audio data processing is simpler and the data volume is smaller, so the audio thread is much less likely to stutter.
- On macOS/iOS, taking AudioQueue playback as an example, the audio playback buffers are reused, and that reuse is driven by callbacks on the dedicated thread that actually plays the sound. Unlike video, where every frame is rendered individually, pausing audio costs more than discarding data.
Summing up these three points, this article chooses the first scheme: synchronizing the video clock to the audio clock.
Implementing audio and video synchronization in code
1. Audio and video synchronization basics
There are two kinds of timestamps in FFmpeg: DTS (Decoding Time Stamp) and PTS (Presentation Time Stamp). As the names imply, the former is the time at which data should be decoded and the latter is the time at which it should be displayed. To understand these two concepts in detail, you need to understand the concepts of packet and frame in FFmpeg.
In FFmpeg, the AVPacket structure describes a compressed packet (before decoding or after encoding), and the AVFrame structure describes a raw frame (after decoding or before encoding). For video, an AVFrame is one picture of the video; when that picture is shown to the user is determined by its PTS. DTS is a member of AVPacket and indicates when the compressed packet should be decoded. If the frames of a video were encoded in their input order (that is, their display order), decoding order and display order would be the same. In fact, in most codec standards (such as H.264 or HEVC), the encoding order differs from the input order, hence the need for two different timestamps, PTS and DTS.
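Both timestamps are expressed in units of the stream's time_base, so converting them to seconds is a common first step. A small sketch (the surrounding variables are assumed to come from the usual read-and-decode loop, not from the demo code):

// Hypothetical sketch: converting FFmpeg timestamps (in the stream's time_base
// units) into seconds. 'formatContext', 'videoStreamIndex', 'packet' and 'frame'
// are assumed to exist in the calling context.
AVStream *stream = formatContext->streams[videoStreamIndex];
double timeBase  = av_q2d(stream->time_base);    // e.g. 1/90000 for many video streams

double decodeTime  = packet->dts * timeBase;     // when this packet should be decoded (DTS)
double displayTime = frame->pts  * timeBase;     // when this decoded frame should be shown (PTS)
// (In real code, check for AV_NOPTS_VALUE before using either timestamp.)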
Basic Concepts:
I frame: intra-coded picture, usually the first frame of each GOP (Group of Pictures, a video compression structure used by MPEG). It is moderately compressed and serves as a reference point for random access; an I frame can be regarded as a complete image after compression.
P frame: forward-predicted frame, also called a predictive frame. It compresses the transmitted data by removing the temporal redundancy relative to the previously encoded frame in the image sequence.
B frame: bi-directionally predicted (interpolated) frame. It compresses the transmitted data by exploiting the temporal redundancy between the encoded frames before and after it in the source image sequence.
PTS: Presentation Time Stamp. The PTS indicates when a decoded video frame should be displayed.
DTS: Decode Time Stamp. The DTS indicates when the bitstream read into memory should be fed to the decoder.
The order of DTS and PTS should be the same in the absence of B frames.
IPB frame differences:
I frame: can be decompressed by the video decoder into a single complete picture on its own.
P frame: needs the preceding I frame or P frame as a reference to produce a complete picture.
B frame: needs both the preceding I or P frame and the following P frame as references to produce a complete picture.
The frames between two I frames form a GOP. In x264, the bf parameter sets the number of B frames placed between two P frames (or between an I frame and a P frame). From the above it follows that, when B frames are used, the last frame of a GOP must be a P frame.
Differences between DTS and PTS:
DTS is mainly used on the decoding side and comes into play during decoding; PTS is mainly used for synchronization and output, and comes into play during display. In the absence of B frames, DTS and PTS come in the same order; once B frames are present, decode order and display order differ, as the example below shows.
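A classic illustration of how B frames pull DTS and PTS apart (frame numbers are display positions; the values are illustrative):

Display order (PTS): I1  B2  B3  P4   -> shown in the order 1, 2, 3, 4
Decode order  (DTS): I1  P4  B2  B3   -> decoded in the order 1, 4, 2, 3

The two B frames need P4 as a backward reference, so P4 must be decoded before them even though it is displayed after them.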
2. Add clock data to audio and video data buffer objects
@interface FFQueueAudioObject : NSObject
@property (nonatomic, assign, readonly)float pts;
@property (nonatomic, assign, readonly)float duration;
- (instancetype)initWithLength:(int64_t)length pts:(float)pts duration:(float)duration;
- (uint8_t *)data;
- (int64_t)length;
- (void)updateLength:(int64_t)length;
@end
@interface FFQueueVideoObject : NSObject
@property (nonatomic, assign)double unit;
@property (nonatomic, assign)double pts;
@property (nonatomic, assign)double duration;
- (instancetype)init;
- (AVFrame *)frame;
@end
The pts and duration fields are added to the audio and video buffer objects above.
pts: the time at which the current frame is played or displayed.
duration: how long the current frame plays or is displayed. An audio frame contains many audio samples, so its duration comes from them, while the display duration of each video frame can be calculated from the FPS.
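A sketch of how these two fields might be computed when frames come out of the decoder (the time-base handling follows standard FFmpeg usage; the exact code in the demo may differ):

// Hypothetical sketch: computing pts/duration in seconds for the buffer objects.
// 'audioStream' and 'videoStream' are the AVStreams, 'frame' is the decoded AVFrame.

// Audio: pts from the stream time_base, duration from sample count / sample rate.
double audioPts      = frame->pts * av_q2d(audioStream->time_base);
double audioDuration = (double)frame->nb_samples / frame->sample_rate;

// Video: pts from the stream time_base, duration from the frame rate (1 / FPS).
double fps           = av_q2d(videoStream->avg_frame_rate);
double videoPts      = frame->pts * av_q2d(videoStream->time_base);
double videoDuration = fps > 0 ? 1.0 / fps : 0;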
3. Synchronize the video clock to the audio clock
pthread_mutex_lock(&(self->mutex));
double ac = self->audio_clock;
pthread_mutex_unlock(&(self->mutex));
FFQueueVideoObject *obj = NULL;
int readCount = 0;
/// First read one video frame
obj = [self.videoFrameCacheQueue dequeue];
readCount ++;
double vc = obj.pts + obj.duration;
if (ac - vc > self->tolerance_scope) {
    /// The video is too slow: keep discarding frames until it catches up with the audio clock.
    /// The lag that can build up while staying in sync is relatively limited.
    while (ac - vc > self->tolerance_scope) {
        FFQueueVideoObject *_nextObj = [self.videoFrameCacheQueue dequeue];
        if (!_nextObj) break;
        obj = _nextObj;
        vc = obj.pts + obj.duration;
        readCount ++;
    }
} else if (vc - ac > self->tolerance_scope) {
    /// The video is too fast: pause before rendering the current video frame.
    float sleep_time = vc - ac;
    usleep(sleep_time * 1000 * 1000);
} else {
    /// Within the error tolerance: no processing required.
}
tolerance_scope is the allowed error: as long as the time difference between audio and video is smaller than this value, they are considered synchronized. This is because absolute time consistency is impossible to achieve, and the time calculations themselves lose some precision.
- Get the current audio clock time (this is the time at which the current audio frame finishes playing; a sketch of how this clock is kept up to date follows this list)
- Read a video frame and calculate the time at which it will finish playing
- Compare the audio time with the video time and synchronize accordingly
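For the first step to work, audio_clock has to be kept up to date by the audio side. A minimal sketch, assuming the audio playback path updates the shared clock under the same mutex whenever it hands a buffer to AudioQueue (the method name is hypothetical; the demo's actual code may differ):

/// Hypothetical sketch: updating the shared audio clock from the audio playback path.
/// 'obj' is the FFQueueAudioObject about to be played; pts + duration is the time
/// at which this audio buffer will finish playing.
- (void)updateAudioClockWithObject:(FFQueueAudioObject *)obj {
    pthread_mutex_lock(&(self->mutex));
    self->audio_clock = obj.pts + obj.duration;
    pthread_mutex_unlock(&(self->mutex));
}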
With that, audio and video synchronization is complete. Go watch a video now and you will no longer find the mouth movements and the voice out of step 🏄🏄🏄🏄.
Conclusion
- Understood the background of audio and video synchronization and the causes of desynchronization
- Understood the audio and video synchronization schemes and chose the reasonable one: synchronizing the video clock to the audio clock
- Implemented audio and video synchronization in code
For more content, please follow the WeChat official account "Program Ape Moving Bricks".