Any audio and video capture and playback system built on network transmission faces the problem of audio and video synchronization, and WebRTC, as a representative modern real-time audio and video communication system on the Internet, is no exception. This article analyzes the principle of audio and video synchronization and how WebRTC implements it.
Author: Liang Yi
Proofreader: Tai Yi
Timestamps
Synchronization is fundamentally a problem of pacing, which involves the correspondence between time and the audio/video media streams; this is where the concept of the timestamp comes in.
The timestamp defines the sampling instant of the media payload data and is taken from a monotonically, linearly increasing clock whose resolution is determined by the sampling frequency of the RTP payload. Audio and video use different sampling frequencies: audio is typically sampled at 16 kHz, 44.1 kHz, 48 kHz, and so on, while video is characterized by its frame rate, such as 25 fps, 29.97 fps, or 30 fps.
Traditionally, the audio timestamp increases at the sampling rate. For example, at 16 kHz with one frame captured every 10 ms, the timestamp of each frame is numerically 16 x 10 = 160 larger than that of the previous frame; that is, the audio timestamp grows by 16 per millisecond. Video timestamps are usually computed against a 90 kHz clock, i.e. 90,000 clock ticks per second; 90 kHz is used because it is an integer multiple of the common video frame rates mentioned above. So the timestamp of a video frame grows at 90 per millisecond.
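To make the arithmetic concrete, here is a tiny stand-alone snippet (illustrative only, not WebRTC code) that computes the per-frame increments described above:

```cpp
#include <cstdint>
#include <iostream>

int main() {
  const int audio_sample_rate_hz = 16000;  // 16 kHz audio
  const int audio_frame_ms = 10;           // one frame every 10 ms
  const int video_clock_rate_hz = 90000;   // 90 kHz video RTP clock
  const int video_frame_ms = 40;           // 25 fps -> 40 ms per frame

  // Per-frame increment = ticks per millisecond * frame duration in ms.
  uint32_t audio_increment = audio_sample_rate_hz / 1000 * audio_frame_ms;  // 16 * 10 = 160
  uint32_t video_increment = video_clock_rate_hz / 1000 * video_frame_ms;   // 90 * 40 = 3600

  std::cout << "audio timestamp += " << audio_increment << " per frame\n";
  std::cout << "video timestamp += " << video_increment << " per frame\n";
  return 0;
}
```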
Timestamp generation
Audio frame timestamp generation
The timestamp of WebRTC's audio frames accumulates from 0, starting with the first packet. Each frame increases it by encoded frame length (ms) x sampling rate / 1000. With a 16 kHz sampling rate and a 20 ms encoded frame length, each audio frame increases the timestamp by 20 x 16000 / 1000 = 320. This is the raw audio frame timestamp before packetization; when the frame is packed into an RTP packet, a random offset (generated in the constructor) is added to the accumulated value, and the result is sent as the RTP packet timestamp. Note that the same logic also applies to video packetization.
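The original code is not reproduced here; below is a minimal sketch of the idea, assuming a per-sender random offset generated once at construction. The class and its names are illustrative, not actual WebRTC types:

```cpp
#include <cstdint>
#include <random>

class AudioTimestampGenerator {
 public:
  explicit AudioTimestampGenerator(int sample_rate_hz)
      : sample_rate_hz_(sample_rate_hz),
        // Random offset chosen once, analogous to the offset generated in the
        // RTP sender's constructor.
        timestamp_offset_(std::mt19937{std::random_device{}()}()) {}

  // Called once per encoded frame; frame_length_ms is e.g. 20 ms.
  uint32_t NextRtpTimestamp(int frame_length_ms) {
    // Raw media timestamp accumulates by frame length (ms) * sample rate / 1000.
    raw_timestamp_ +=
        static_cast<uint32_t>(frame_length_ms * sample_rate_hz_ / 1000);
    // The value written into the RTP header is the raw timestamp plus the offset.
    return raw_timestamp_ + timestamp_offset_;
  }

 private:
  const int sample_rate_hz_;
  const uint32_t timestamp_offset_;
  uint32_t raw_timestamp_ = 0;  // starts at 0 for the first frame
};
```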
Video frame timestamp generation
WebRTC generates video frame timestamps in a completely different way from audio. The timestamp of a video frame is derived from the system clock, taken at some point between the completion of capture and the start of encoding (this pipeline is long, and with different configurations different video frames obtain it at different positions along the way). The current system time timestamp_us_ is read first, the corresponding NTP time ntp_time_ms_ is calculated from it, and finally the raw video frame timestamp timestamp_rtp_ is derived from that NTP time; the calculation lives in the OnFrame function.
Why do video frames use a different timestamp mechanism than audio frames? As far as I know, the sampling interval and clock of audio capture devices are generally quite accurate, one frame every 10 ms, 100 frames per second, with little jitter, while the frame interval of video is much larger: 25 frames per second means 40 ms per frame. If video timestamps were also incremented by the sampling rate the way audio timestamps are, they could drift away from the actual clock. Instead, each frame's timestamp is computed from the system clock at the moment the frame is taken, so the correspondence between the real video frames and actual time is preserved.
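As a rough sketch of that flow (assuming the real logic sits around an OnFrame-style handler; the helper below and its ntp_offset_ms parameter are illustrative, not the actual WebRTC implementation):

```cpp
#include <cstdint>

constexpr int64_t kVideoRtpClockRateKhz = 90;  // 90 kHz -> 90 ticks per ms

struct VideoFrameTimestamps {
  int64_t timestamp_us;    // capture time from the system clock
  int64_t ntp_time_ms;     // NTP time derived from the system clock
  uint32_t timestamp_rtp;  // raw RTP timestamp of the frame
};

// ntp_offset_ms is the (assumed) delta between the local clock and NTP time.
VideoFrameTimestamps MakeVideoTimestamps(int64_t now_us, int64_t ntp_offset_ms) {
  VideoFrameTimestamps ts;
  ts.timestamp_us = now_us;
  ts.ntp_time_ms = now_us / 1000 + ntp_offset_ms;
  // The raw RTP timestamp grows at 90 ticks per millisecond of NTP time;
  // the cast to uint32_t reflects the 32-bit RTP timestamp wrapping.
  ts.timestamp_rtp = static_cast<uint32_t>(kVideoRtpClockRateKhz * ts.ntp_time_ms);
  return ts;
}
```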
As with audio, a random offset (different from the audio offset) is added to the raw video frame timestamp when it is written into the RTP packet, and the result is sent as the RTP packet timestamp. It is important to note that the computed NTP timestamp is not sent with the RTP packet at all: there is no NTP field in the RTP header, and this value is not carried even in extension fields such as the time-related video extensions.
Core basis for audio and video synchronization
As can be seen from the above, an RTP packet carries only the independent, monotonically increasing timestamp of its own stream. In other words, the audio and video timestamps are completely independent and unrelated, so synchronization cannot be achieved from this information alone: the times of the two streams cannot be correlated. What we need is a mapping relationship that ties the two independent timestamps together.
This is where the Sender Report (SR) packet in RTCP comes into play (see RFC 3550 for details). One function of the SR packet is to tell us, for each stream, how the RTP timestamp corresponds to NTP time. The SR packet carries an NTP timestamp and an RTP timestamp, and according to RFC 3550 the two correspond to the same instant: the moment the SR packet was generated. This is the core basis for audio and video synchronization; all other calculations revolve around it.
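For reference, the sender-info block of an SR packet can be pictured as the following struct, following the field layout described in RFC 3550, section 6.4.1 (a simplified representation, not WebRTC's own type):

```cpp
#include <cstdint>

// The key for synchronization is that the NTP timestamp and rtp_timestamp
// below describe the same instant: the moment the SR was generated.
struct RtcpSenderInfo {
  uint32_t ntp_seconds;    // NTP timestamp, most significant word
  uint32_t ntp_fraction;   // NTP timestamp, least significant word
  uint32_t rtp_timestamp;  // RTP timestamp corresponding to the same instant
  uint32_t packet_count;   // sender's packet count
  uint32_t octet_count;    // sender's octet count
};
```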
SR packet generation
As the discussion above shows, the NTP time and the RTP timestamp are two representations of the same instant, with different precision and units: NTP time is an absolute time in milliseconds, while the RTP timestamp is a monotonically increasing value tied to the sampling frequency of the media. SR packets are built in RTCPSender::BuildSR(const RtcpContext& ctx).
The calculation idea is as follows. First, get the NTP time for the current moment (when the SR packet is generated); this comes directly from the passed-in parameter ctx. Next, compute what the RTP timestamp should be at this moment. It can be extrapolated from the timestamp of the last RTP packet sent (last_rtp_timestamp_), the system time at which that frame was captured (last_frame_capture_time_ms_), the per-millisecond growth rate of the current media stream's timestamp (rtp_rate), and the time elapsed from last_frame_capture_time_ms_ to now. Note that last_rtp_timestamp_ is the raw media-stream timestamp, not the randomly offset RTP packet timestamp, so the offset timestamp_offset_ must be added. last_rtp_timestamp_ and last_frame_capture_time_ms_ are updated each time an RTP packet is sent.
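A condensed sketch of that calculation might look like the following (member variables are passed as parameters here; this is an illustration of the formula, not the actual RTCPSender::BuildSR code):

```cpp
#include <cstdint>

uint32_t RtpTimestampForSr(int64_t now_ms,
                           uint32_t last_rtp_timestamp,         // raw media timestamp
                           int64_t last_frame_capture_time_ms,  // its capture time
                           uint32_t timestamp_offset,           // random RTP offset
                           int rtp_rate /* ticks per ms, e.g. 90 for video */) {
  // Extrapolate from the last sent frame to "now" at rtp_rate ticks per ms,
  // then add the random offset so the value matches the RTP packet timestamps.
  return timestamp_offset + last_rtp_timestamp +
         static_cast<uint32_t>((now_ms - last_frame_capture_time_ms) * rtp_rate);
}
```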
Computation of audio and video synchronization
Because the audio stream and the video stream on the same machine share the same local system clock, their NTP times sit on the same time base and the same coordinate system. Plot NTP time on the horizontal axis X (in ms) and the RTP timestamp on the vertical axis Y, and the principle of audio and video synchronization becomes very simple: take the two most recent SR points of a stream; two points determine a straight line, and from that line the NTP time corresponding to any RTP timestamp can be computed. Since the NTP times of audio and video are on the same base, the difference between the two can then be calculated. Taking two audio SR packets as an example, they determine the line mapping RTP to NTP; given any rtp_a, the corresponding ntp_a is computed. Similarly, for any video packet rtp_v, the corresponding ntp_v can be computed.
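A minimal sketch of this two-point computation, ignoring 32-bit RTP wraparound for clarity (WebRTC's own RTP-to-NTP estimator is more involved; the function below is illustrative only):

```cpp
#include <cstdint>

struct SrPoint {
  int64_t ntp_ms;          // NTP time of the SR, in ms
  uint32_t rtp_timestamp;  // RTP timestamp of the SR
};

// Given two SR points, estimate the NTP time of an arbitrary RTP timestamp.
int64_t RtpToNtpMs(const SrPoint& sr1, const SrPoint& sr2, uint32_t rtp) {
  // Slope of the line: RTP ticks per millisecond of NTP time.
  double rate = static_cast<double>(sr2.rtp_timestamp - sr1.rtp_timestamp) /
                static_cast<double>(sr2.ntp_ms - sr1.ntp_ms);
  // Solve the line equation: ntp = sr2.ntp_ms + (rtp - sr2.rtp_timestamp) / rate.
  return sr2.ntp_ms +
         static_cast<int64_t>((static_cast<double>(rtp) - sr2.rtp_timestamp) / rate);
}
```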
In WebRTC, the rate (slope) and offset of this line are computed from the SR reports, along the lines of the sketch above. WebRTC then computes the NTP times corresponding to the most recently received audio RTP packet and the most recently received video RTP packet, and from them derives the out-of-sync duration introduced by network transmission. Next, based on the current audio and video JitterBuffer and playout buffer sizes, it obtains the out-of-sync duration introduced by playback. From these two values the final audio/video out-of-sync duration is derived in StreamSynchronization::ComputeRelativeDelay(). The result then passes through StreamSynchronization::ComputeDelays(), which applies exponential smoothing and a series of checks, and the resulting minimum delays are finally applied to the audio and video playout buffers via syncable_audio_->SetMinimumPlayoutDelay(target_audio_delay_ms) and syncable_video_->SetMinimumPlayoutDelay(target_video_delay_ms), respectively.
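Conceptually, the network-induced part of the out-of-sync duration can be pictured as follows (an illustrative helper, not the actual StreamSynchronization code):

```cpp
#include <cstdint>

// Positive result: video lags audio by this many milliseconds on the network.
// Capture times are the NTP times recovered via the RTP-to-NTP mapping above;
// receive times are local arrival times of the latest packets of each stream.
int64_t RelativeNetworkDelayMs(int64_t audio_receive_time_ms,
                               int64_t audio_capture_ntp_ms,
                               int64_t video_receive_time_ms,
                               int64_t video_capture_ntp_ms) {
  return (video_receive_time_ms - audio_receive_time_ms) -
         (video_capture_ntp_ms - audio_capture_ntp_ms);
}
```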
This whole sequence of operations is driven periodically by a timer through RtpStreamsSynchronizer::Process().
One more thing to note: if the sampling rate is known, the mapping can be calculated from a single SR packet; without SR packets, accurate audio and video synchronization is not possible. In WebRTC, SR packets are the means of achieving audio and video synchronization, and the core basis is the NTP time and RTP timestamp they carry. If you can picture the coordinate diagram built from the last two (NTP time, RTP timestamp) points (it really is simple: just solve the line equation to compute the NTP time), then you truly understand the principle of audio and video synchronization in WebRTC. If anything is missing or wrong, you are welcome to discuss!