The time it takes to display the first frame in a real-time audio/video call is an important user-experience metric. This article analyzes the receiving side in order to understand and optimize the first-frame display time of video.
Process overview
The sender captures audio and video data and produces encoded frames through the encoder. The frames are packaged into RTP packets and sent to the receiver over the ICE channel. The receiver takes the payload out of each RTP packet and reassembles the payloads into frames. The audio/video decoder then decodes the frame data into video images or audio PCM data.
The parameter tuning discussed in this article happens in step 4 of the figure above. Since we are on the receiving end, we receive the Offer from the remote peer, so SetRemoteDescription is called first and then SetLocalDescription, as shown in the blue part of the figure below:
Parameter adjustment
Video parameter adjustment
When SetRemoteDescription is called from the Signal thread, the VideoReceiveStream object is created in the Worker thread. The call path is SetRemoteDescription → VideoChannel::SetRemoteContent_w → creation of WebRtcVideoReceiveStream. WebRtcVideoReceiveStream holds a stream_ member of type VideoReceiveStream, created through webrtc::VideoReceiveStream* Call::CreateVideoReceiveStream. Immediately after creation the stream is started by calling its Start() method. At this point the VideoReceiveStream contains an RtpVideoStreamReceiver object that is ready to process incoming video RTP packets. After the receiver creates the answer with createAnswer, it sets the local description with setLocalDescription. In the Worker thread the corresponding setLocalContent_w method configures the receive channel parameters according to the SDP, which eventually calls WebRtcVideoReceiveStream::SetRecvParameters. WebRtcVideoReceiveStream::SetRecvParameters is implemented as follows:
void WebRtcVideoChannel::WebRtcVideoReceiveStream::SetRecvParameters(
    const ChangedRecvParameters& params) {
  bool video_needs_recreation = false;
  bool flexfec_needs_recreation = false;
  if (params.codec_settings) {
    ConfigureCodecs(*params.codec_settings);
    video_needs_recreation = true;
  }
  if (params.rtp_header_extensions) {
    config_.rtp.extensions = *params.rtp_header_extensions;
    flexfec_config_.rtp_header_extensions = *params.rtp_header_extensions;
    video_needs_recreation = true;
    flexfec_needs_recreation = true;
  }
  if (params.flexfec_payload_type) {
    ConfigureFlexfecCodec(*params.flexfec_payload_type);
    flexfec_needs_recreation = true;
  }
  if (flexfec_needs_recreation) {
    RTC_LOG(LS_INFO) << "MaybeRecreateWebRtcFlexfecStream (recv) because of "
                        "SetRecvParameters";
    MaybeRecreateWebRtcFlexfecStream();
  }
  if (video_needs_recreation) {
    RTC_LOG(LS_INFO)
        << "RecreateWebRtcVideoStream (recv) because of SetRecvParameters";
    RecreateWebRtcVideoStream();
  }
}
According to the SetRecvParameters code above, VideoReceiveStream is restarted whenever codec_settings, rtp_header_extensions, or flexfec_payload_type is not empty. video_needs_recreation indicates whether VideoReceiveStream should be restarted; the restart releases the previously created VideoReceiveStream and builds a new one. Take codec_settings as an example: suppose the initial video codec list supports H264 and VP8. If the peer only supports H264, the negotiated codec list contains only H264, so codec_settings in SetRecvParameters is non-empty (H264). However, the VideoReceiveStream before and after renegotiation both use H264, so there is no need to rebuild it. We can configure the initial list of supported video codecs and RTP extensions so that the receive parameters in the generated local SDP and the remote SDP are consistent, and then compare whether the codec_settings are actually equal; only if they differ is video_needs_recreation set to true. With this change SetRecvParameters no longer triggers the VideoReceiveStream restart logic. In debug mode, the absence of the "RecreateWebRtcVideoStream (recv) because of SetRecvParameters" log after the modification confirms that VideoReceiveStream is not restarted.
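A minimal sketch of this guard, assuming a hypothetical helper RecvCodecsEqual and a hypothetical member current_codec_settings_ (these names are illustrative, not WebRTC's):

// Sketch only: avoid setting video_needs_recreation when the newly negotiated
// codec list is identical to the one already configured. RecvCodecsEqual() is
// a hypothetical helper comparing payload types, names and codec parameters.
if (params.codec_settings) {
  bool changed =
      !RecvCodecsEqual(current_codec_settings_, *params.codec_settings);
  ConfigureCodecs(*params.codec_settings);
  if (changed) {
    video_needs_recreation = true;  // restart only when the codecs really differ
  }
}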
Audio parameter adjustment
Similar to the video case above, the audio stream is also re-created because the RTP extensions do not match: the previous AudioReceiveStream is released and a new AudioReceiveStream is created. Reference code:
bool WebRtcVoiceMediaChannel::SetRecvParameters(
    const AudioRecvParameters& params) {
  TRACE_EVENT0("webrtc", "WebRtcVoiceMediaChannel::SetRecvParameters");
  RTC_DCHECK(worker_thread_checker_.CalledOnValidThread());
  RTC_LOG(LS_INFO) << "WebRtcVoiceMediaChannel::SetRecvParameters: "
                   << params.ToString();
  // TODO(pthatcher): Refactor this to be more clean now that we have
  // all the information at once.

  if (!SetRecvCodecs(params.codecs)) {
    return false;
  }

  if (!ValidateRtpExtensions(params.extensions)) {
    return false;
  }
  std::vector<webrtc::RtpExtension> filtered_extensions = FilterRtpExtensions(
      params.extensions, webrtc::RtpExtension::IsSupportedForAudio, false);
  if (recv_rtp_extensions_ != filtered_extensions) {
    recv_rtp_extensions_.swap(filtered_extensions);
    for (auto& it : recv_streams_) {
      it.second->SetRtpExtensionsAndRecreateStream(recv_rtp_extensions_);
    }
  }
  return true;
}
The constructor of AudioReceiveStream starts the audio device by calling StartPlayout of the AudioDeviceModule, and its destructor stops the audio device by calling StopPlayout. Restarting AudioReceiveStream therefore triggers StartPlayout/StopPlayout multiple times. In our tests these unnecessary operations caused a short break in audio playback when entering a video conference room. The solution is again to configure the initial list of supported audio codecs and RTP extensions so that the receive parameters in the generated local SDP and the remote SDP are consistent, avoiding the AudioReceiveStream restart logic. In addition, most audio codecs are implemented inside WebRTC itself, so removing unused audio codecs also reduces the size of the corresponding WebRTC library files.
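For example, if only Opus is ever negotiated, the receiving side can build its audio decoder factory from Opus alone, so that the initially configured codec list already matches what will be negotiated. A minimal sketch using WebRTC's decoder factory template (header paths and factory wiring may differ between WebRTC versions):

#include "api/audio_codecs/audio_decoder_factory_template.h"
#include "api/audio_codecs/opus/audio_decoder_opus.h"

// Sketch only: a decoder factory restricted to Opus. Passing this factory to
// the PeerConnectionFactory keeps the local audio codec list minimal, so the
// negotiated receive parameters match the initial configuration and
// AudioReceiveStream does not need to be re-created after SetRemoteDescription.
rtc::scoped_refptr<webrtc::AudioDecoderFactory> CreateOpusOnlyDecoderFactory() {
  return webrtc::CreateAudioDecoderFactory<webrtc::AudioDecoderOpus>();
}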
Audio and video affect each other
There are three very important threads inside WebRTC: the Worker thread, the Signal thread, and the Network thread. Calls to the PeerConnection API go from the Signal thread to the Worker thread. Media data processing is done in the Worker thread, while the Network thread handles network-related work. As explained in the channel.h file, a method whose name ends with _w runs on the Worker thread, and the call from the Signal thread to the Worker thread is a synchronous operation. In the code below, InvokeOnWorker is a synchronous call, and SetLocalContent_w and SetRemoteContent_w are methods that run on the Worker thread.
bool BaseChannel::SetLocalContent(const MediaContentDescription* content,
                                  SdpType type,
                                  std::string* error_desc) {
  TRACE_EVENT0("webrtc", "BaseChannel::SetLocalContent");
  return InvokeOnWorker<bool>(
      RTC_FROM_HERE,
      Bind(&BaseChannel::SetLocalContent_w, this, content, type, error_desc));
}

bool BaseChannel::SetRemoteContent(const MediaContentDescription* content,
                                   SdpType type,
                                   std::string* error_desc) {
  TRACE_EVENT0("webrtc", "BaseChannel::SetRemoteContent");
  return InvokeOnWorker<bool>(
      RTC_FROM_HERE,
      Bind(&BaseChannel::SetRemoteContent_w, this, content, type, error_desc));
}
The SDP information in setLocalDescription and setRemoteDescription is pushed down to each audio/video RtpTransceiver through PeerConnection::PushdownMediaDescription, which sets the SDP information on the corresponding channels. Because the transceivers are processed one after another, a long-running SetRemoteContent_w for audio (for example the time spent in InitPlayout of the AudioDeviceModule) delays the subsequent SetRemoteContent_w for video. PushdownMediaDescription code:
RTCError PeerConnection::PushdownMediaDescription(
    SdpType type,
    cricket::ContentSource source) {
  const SessionDescriptionInterface* sdesc =
      (source == cricket::CS_LOCAL ? local_description()
                                   : remote_description());
  RTC_DCHECK(sdesc);

  // Push down the new SDP media section for each audio/video transceiver.
  for (const auto& transceiver : transceivers_) {
    const ContentInfo* content_info =
        FindMediaSectionForTransceiver(transceiver, sdesc);
    cricket::ChannelInterface* channel = transceiver->internal()->channel();
    if (!channel || !content_info || content_info->rejected) {
      continue;
    }
    const MediaContentDescription* content_desc =
        content_info->media_description();
    if (!content_desc) {
      continue;
    }
    std::string error;
    bool success = (source == cricket::CS_LOCAL)
                       ? channel->SetLocalContent(content_desc, type, &error)
                       : channel->SetRemoteContent(content_desc, type, &error);
    if (!success) {
      LOG_AND_RETURN_ERROR(RTCErrorType::INVALID_PARAMETER, error);
    }
  }
  ...
}
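To confirm that a slow audio channel is delaying the video channel, the per-channel call inside the loop above can be timed. A minimal sketch with std::chrono, intended to replace the success assignment in the loop (the logging format is illustrative):

#include <chrono>

// Sketch only: time the per-channel pushdown to see how long the audio channel
// vs. the video channel spends in SetLocalContent/SetRemoteContent.
auto start = std::chrono::steady_clock::now();
bool success = (source == cricket::CS_LOCAL)
                   ? channel->SetLocalContent(content_desc, type, &error)
                   : channel->SetRemoteContent(content_desc, type, &error);
auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start)
                      .count();
RTC_LOG(LS_INFO) << "PushdownMediaDescription: SetContent took "
                 << elapsed_ms << " ms";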
Other issues affecting the first frame display
Android video width and height need 16-byte alignment
AndroidVideoDecoder is WebRTC's hardware video decoder class for the Android platform. AndroidVideoDecoder uses the MediaCodec API to drive the hardware decoder. The MediaCodec APIs related to decoding are listed below; a sketch of the corresponding call sequence follows the list.
- dequeueInputBuffer: if the return value is greater than or equal to 0, it is the index of an input buffer that can be filled with encoded data. This operation is synchronous.
- getInputBuffer: combined with the index returned by dequeueInputBuffer, returns the ByteBuffer to be filled with encoded data.
- queueInputBuffer: after the application has copied the encoded data into the ByteBuffer, this call tells MediaCodec the index of the input buffer that has been filled.
- dequeueOutputBuffer: if the return value is greater than or equal to 0, it is the index of an output buffer holding decoded data. This operation is synchronous.
- getOutputBuffer: combined with the index returned by dequeueOutputBuffer, returns the ByteBuffer holding decoded data.
- releaseOutputBuffer: tells the decoder that processing of the output data is finished and releases the ByteBuffer back to the codec.
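The sketch below shows one pass of that call sequence using the NDK AMediaCodec C API, which mirrors the Java MediaCodec methods listed above (AndroidVideoDecoder itself goes through the Java API via JNI). It is a simplified illustration: error handling is omitted and the decoder is assumed to be already configured and started.

#include <media/NdkMediaCodec.h>
#include <cstring>

// Sketch only: feed one encoded frame to the decoder and poll for output.
// `codec` is an already configured/started H.264 decoder; `data`/`size` hold
// one encoded frame and `pts_us` is its presentation timestamp.
void DecodeOneFrame(AMediaCodec* codec, const uint8_t* data, size_t size,
                    int64_t pts_us) {
  // 1. Ask for an input buffer index (dequeueInputBuffer).
  ssize_t in_index = AMediaCodec_dequeueInputBuffer(codec, 10000 /* us */);
  if (in_index >= 0) {
    // 2. Get the buffer and copy the encoded frame into it (getInputBuffer).
    size_t capacity = 0;
    uint8_t* in_buf = AMediaCodec_getInputBuffer(codec, in_index, &capacity);
    if (in_buf != nullptr && size <= capacity) {
      memcpy(in_buf, data, size);
      // 3. Hand the filled buffer back to the codec (queueInputBuffer).
      AMediaCodec_queueInputBuffer(codec, in_index, 0, size, pts_us, 0);
    }
  }

  // 4. Check whether a decoded frame is ready (dequeueOutputBuffer).
  AMediaCodecBufferInfo info;
  ssize_t out_index = AMediaCodec_dequeueOutputBuffer(codec, &info, 10000);
  if (out_index >= 0) {
    // 5. Consume the decoded frame, then release the buffer (releaseOutputBuffer).
    AMediaCodec_releaseOutputBuffer(codec, out_index, /*render=*/true);
  } else if (out_index == AMEDIACODEC_INFO_OUTPUT_BUFFERS_CHANGED ||
             out_index == AMEDIACODEC_INFO_OUTPUT_FORMAT_CHANGED) {
    // The codec reconfigured its buffers/format; no decoded frame yet.
  }
}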
In practice we found that the video sent by the sender needs to have its width and height 16-byte aligned, because on some Android phones the decoder requires 16-byte alignment. Video decoding on Android starts by feeding the data to be decoded to MediaCodec with queueInputBuffer, and then polling dequeueOutputBuffer to see whether a decoded video frame is available. If the frame size is not 16-byte aligned, dequeueOutputBuffer first returns MediaCodec.INFO_OUTPUT_BUFFERS_CHANGED once instead of successfully decoding a frame from the start. Tests show that the first frame of a stream whose width and height are not 16-byte aligned is decoded about 100 ms later than that of a 16-byte-aligned stream.
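On the sending side, the capture or encode resolution can therefore be rounded up to the next multiple of 16 before encoding. A minimal sketch of the rounding arithmetic (plain arithmetic, not a WebRTC API):

// Sketch only: round a dimension up to the next multiple of 16 so the encoded
// stream satisfies decoders that require 16-aligned width/height.
int AlignTo16(int dimension) {
  return (dimension + 15) & ~15;
}

// Example: 640 stays 640, 540 becomes 544.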
The server needs to forward the keyframe request
On iOS devices, after the app goes into the background, WebRTC's video decoding fails: VTDecompressionSessionDecodeFrame returns kVTInvalidSessionErr, indicating that the decoding session is invalid. This triggers a keyframe request from the receiver to the server. The server must forward this keyframe request from the receiver to the sender. If the server does not forward it, the receiver has no image to render for a long time and shows a black screen; in that case the receiver can only recover once the sender happens to generate a new keyframe and send it.
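How the forwarding is wired up depends entirely on the server implementation; the sketch below only illustrates the rule with hypothetical Session and RtcpPacket types.

// Sketch only: when a subscriber sends a PLI/FIR keyframe request, relay it to
// the publisher of that stream. `Session` and `RtcpPacket` are hypothetical
// placeholders for whatever types the forwarding server actually uses.
void OnRtcpFromSubscriber(Session* subscriber, const RtcpPacket& packet) {
  if (packet.IsPli() || packet.IsFir()) {
    Session* publisher = subscriber->PublisherOfSubscribedStream();
    if (publisher != nullptr) {
      publisher->SendKeyFrameRequest();  // ask the sender for a new keyframe
    }
  }
}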
Some examples of data-dropping logic inside WebRTC
On the path from packet reception to the decoder, WebRTC also verifies the validity of the data and discards data that does not pass the checks.
Example 1
The PacketBuffer keeps first_seq_num_, the smallest sequence number currently in the cache (this value is updated over time). When a packet is inserted with PacketBuffer::InsertPacket, if the seq_num of the packet to be inserted is smaller than first_seq_num_, the packet is discarded. If packets keep being discarded for this reason, the video will not be displayed or will stall.
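RTP sequence numbers are 16-bit values that wrap around, so "smaller than first_seq_num_" is evaluated with wraparound in mind. A standalone sketch of that kind of comparison (similar in spirit to WebRTC's AheadOf helper, not the exact PacketBuffer code):

#include <cstdint>

// Sketch only: wraparound-aware comparison of 16-bit RTP sequence numbers.
// Returns true if `a` is newer than `b`.
bool IsNewer(uint16_t a, uint16_t b) {
  return a != b && static_cast<uint16_t>(a - b) < 0x8000;
}

// A packet older than the buffer's first_seq_num_ would be dropped:
// if (IsNewer(first_seq_num_, incoming_seq_num)) { /* drop packet */ }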
Example 2
Normally, the picture ID and timestamp of the frames in a FrameBuffer keep increasing. If the FrameBuffer receives a frame whose picture_id is smaller than or equal to the picture ID of the last decoded frame, there are two cases:
- If the frame's timestamp is newer than that of the last decoded frame and the frame is a keyframe, it is kept (the buffer is cleared and decoding continues from this frame).
- All other frames are discarded. The code is as follows:
auto last_decoded_frame = decoded_frames_history_.GetLastDecodedFrameId();
auto last_decoded_frame_timestamp =
    decoded_frames_history_.GetLastDecodedFrameTimestamp();
if (last_decoded_frame && id <= *last_decoded_frame) {
  if (AheadOf(frame->Timestamp(), *last_decoded_frame_timestamp) &&
      frame->is_keyframe()) {
    // If this frame has a newer timestamp but an earlier picture id then we
    // assume there has been a jump in the picture id due to some encoder
    // reconfiguration or some other reason. Even though this is not according
    // to spec we can still continue to decode from this frame if it is a
    // keyframe.
    RTC_LOG(LS_WARNING) << "A jump in picture id was detected, clearing buffer.";
    ClearFramesAndHistory();
    last_continuous_picture_id = -1;
  } else {
    RTC_LOG(LS_WARNING) << "Frame with (picture_id:spatial_id) ("
                        << id.picture_id << ":"
                        << static_cast<int>(id.spatial_layer)
                        << ") inserted after frame ("
                        << last_decoded_frame->picture_id << ":"
                        << static_cast<int>(last_decoded_frame->spatial_layer)
                        << ") was handed off for decoding, dropping frame.";
    return last_continuous_picture_id;
  }
}
Therefore, for the received stream to play smoothly, the sender and the forwarding server must ensure that the picture_id and timestamp of the video frames are correct. WebRTC contains plenty of other frame-dropping logic as well. If the network is fine and data is being received continuously, but the video stalls or stays black, the problem most likely lies in the stream itself.
Conclusion
By analyzing the processing logic on the WebRTC audio/video receiving side, this article listed several points that can optimize first-frame display: adjusting the parts of the local SDP and remote SDP that affect receive-side processing so as to avoid restarting the Audio/Video ReceiveStream; the Android decoder's requirement on video width and height; the server's handling of keyframe requests; and some of the frame-dropping logic in the WebRTC code. Together these points reduce the first-frame display time of video in the RongCloud SDK and improve the user experience.