This post is based on the session shared by Peikang at WWDC 2021. The speaker, Peikang, is from Apple's Video Coding and Processing team. The translator, Tao Jinliang, is a senior audio and video development engineer at NetEase Yunxin with many years of client-side audio and video experience.
Support for low latency encoding has become an important part of video application development, with broad applications in low-latency live streaming and RTC. This session focuses on how VideoToolbox supports low latency H.264 hardware encoding. VideoToolbox is a low-level framework that provides direct access to hardware encoders and decoders; it offers video compression and decompression services as well as conversion between raster image formats stored in CoreVideo pixel buffers. The goal is to minimize end-to-end latency and reach new levels of performance, ultimately enabling optimal real-time communication and high-quality video playback.
Session video: https://developer.apple.com/videos/play/wwdc2021/10158
Low latency coding is important for many video applications, especially real-time video communication applications. In this presentation, I'll introduce a new encoding mode for low latency coding in VideoToolbox. The goal of this new mode is to optimize the existing encoder pipeline for real-time video communication applications. So what do real-time video communication applications need? First, we need to minimize end-to-end delay in communication.
We also expect greater interoperability, so that video applications can communicate with more devices. The encoder pipeline should remain efficient when there are multiple receivers in a call, and the application needs to render the video at the best possible visual quality. Finally, we need a reliable mechanism to recover communication from errors introduced by network loss.
The low latency video coding that I'm going to talk about today is optimized for all of these aspects. With low latency coding, our real-time applications can reach new levels of performance. In this presentation, I'll start with an overview of low latency video coding, so that we have a basic understanding of how low latency is achieved in the pipeline. Then I'll show you how to use the VTCompressionSession API to build the pipeline and encode in low latency mode. Finally, I'll discuss several features that we introduced in low latency mode.
Low latency video coding
Let me start with an overview of low latency video coding. This is a schematic of the video encoder pipeline on Apple platforms. VideoToolbox takes a CVImageBuffer as the input image and asks the video encoder to run a compression algorithm such as H.264 to reduce the size of the raw data. The output compressed data is wrapped in a CMSampleBuffer and can be transmitted over the network for video communication. As the figure above shows, end-to-end latency is affected by two factors: processing time and network transmission time.
To minimize processing time, low-latency mode eliminates frame reordering and follows a one-in-one-out encoding pattern. In addition, the rate controller in the low latency coding mode can adapt to network changes faster, so the delay caused by network congestion can also be minimized. With these two optimizations, we can already see a significant performance improvement over the default mode. For 720p 30fps video, low latency encoding can reduce latency by up to 100 milliseconds. This savings is critical for videoconferencing.
By reducing latency in this way, we can achieve a more efficient coding pipeline for real-time communications such as video conferencing and live streaming.
In addition, the low latency mode always uses a hardware-accelerated video encoder to save power. Note that the video codec type supported in this mode is H.264, and it is available on both iOS and macOS.
Use low latency mode in VideoToolbox
Next, I want to talk about how to use low latency mode in VideoToolbox. I'll first review the use of VTCompressionSession and then show the steps required to enable low latency encoding.
The use of VTCompressionSession
When using VTCompressionSession, we first create a session with the VTCompressionSessionCreate API. We then configure the session, for example setting the target bit rate, through the VTSessionSetProperty API. If no configuration is provided, the encoder runs with its default behavior.
After the session is created and properly configured, we can pass CVImageBuffers to the session by calling VTCompressionSessionEncodeFrame, and retrieve the encoding results from the output handler provided during session creation.
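As a rough illustration, here is a minimal Swift sketch of encoding one frame and retrieving the result from an output handler; the function name is hypothetical, and the session, pixel buffer, and timestamps are assumed to come from your own capture pipeline.

```swift
import VideoToolbox
import CoreMedia

// Minimal sketch: encode one captured frame and receive the compressed
// result in an output handler. `session`, `pixelBuffer`, `pts`, and
// `duration` are assumed to be provided by the capture pipeline.
func encode(_ pixelBuffer: CVImageBuffer,
            pts: CMTime,
            duration: CMTime,
            with session: VTCompressionSession) {
    let status = VTCompressionSessionEncodeFrame(
        session,
        imageBuffer: pixelBuffer,
        presentationTimeStamp: pts,
        duration: duration,
        frameProperties: nil,
        infoFlagsOut: nil
    ) { encodeStatus, _, sampleBuffer in
        // The compressed frame arrives as a CMSampleBuffer.
        guard encodeStatus == noErr, let sampleBuffer = sampleBuffer else { return }
        // Hand the CMSampleBuffer to the packetizer / network layer here.
        _ = sampleBuffer
    }
    assert(status == noErr, "VTCompressionSessionEncodeFrame failed: \(status)")
}
```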
Enabling low latency encoding in a compression session is easy: the only change we need to make is during session creation, as described in the following steps and illustrated in the code sketch after them:
- First, we need a CFMutableDictionary for the encoder specification, which specifies the particular video encoder that the session must use.
- Then we set the EnableLowLatencyRateControl flag in the encoderSpecification.
- Finally, we pass this encoderSpecification to VTCompressionSessionCreate, and the session will run in low latency mode.
The configuration steps are the same as usual. For example, we can use the AverageBitRate property to set the target bit rate.
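Putting these steps together, a minimal Swift sketch of creating and configuring a low latency session might look like the following; a bridged Swift dictionary stands in for the CFMutableDictionary, and the resolution and bit rate values are illustrative only.

```swift
import VideoToolbox

// Sketch: create an H.264 compression session in low latency mode.
var session: VTCompressionSession?

// 1. Encoder specification requesting the low latency rate control mode
//    (this also selects the hardware-accelerated video encoder).
let encoderSpecification = [
    kVTVideoEncoderSpecification_EnableLowLatencyRateControl: true
] as CFDictionary

// 2. Pass the encoderSpecification to VTCompressionSessionCreate.
let status = VTCompressionSessionCreate(
    allocator: kCFAllocatorDefault,
    width: 1280,
    height: 720,
    codecType: kCMVideoCodecType_H264,
    encoderSpecification: encoderSpecification,
    imageBufferAttributes: nil,
    compressedDataAllocator: nil,
    outputCallback: nil,
    refcon: nil,
    compressionSessionOut: &session
)

// 3. Configure the session as usual, e.g. the target bit rate (1 Mbps here).
if status == noErr, let session = session {
    VTSessionSetProperty(session,
                         key: kVTCompressionPropertyKey_AverageBitRate,
                         value: 1_000_000 as CFNumber)
    VTCompressionSessionPrepareToEncodeFrames(session)
}
```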
Okay, so we've covered the basics of VideoToolbox's low latency mode. Next, I'd like to introduce the new features in this mode that can further help us develop real-time video applications.
VideoToolbox new features for low latency mode
So far, we’ve discussed the latency benefits of using a low-latency mode, and the rest of the benefits can be achieved through the features I’ll describe.
The first feature is new profile support: we have enhanced interoperability by adding two new profiles to the pipeline. We'll also talk about temporal layered SVC, which is very useful in video conferencing. The maximum frame quantization parameter (max QP) can be used for fine-grained control of image quality. Finally, we improve error resilience by adding support for Long Term Reference (LTR) frames.
New profile support
Let's talk about the new profile support. A profile defines a set of encoding algorithms that the decoder must be able to support. The profile determines which algorithms are used for inter-frame compression in video encoding (for example, whether B-frames, CABAC, or particular color spaces are supported). The higher the profile, the more advanced the compression features used, and the higher the corresponding hardware requirements for the codec. In order to communicate with the receiver, the encoded bitstream should conform to a specific profile that the decoder supports.
In VideoToolbox, we support a range of profiles, such as Baseline Profile, Main Profile, and High Profile. Today, we add two new profiles to that list: Constrained Baseline Profile (CBP) and Constrained High Profile (CHP).
CBP is primarily used for low-cost applications, while CHP has more advanced algorithms for better compression ratios. We should first check the decoder capabilities to determine which profile to use.
To use CBP, simply set the ProfileLevel session property to ConstrainedBaseline_AutoLevel. Similarly, we can set the ProfileLevel to ConstrainedHigh_AutoLevel to use CHP.
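As a small sketch, the corresponding VideoToolbox constants look like this; the helper function names are hypothetical, and `session` is assumed to be a low latency compression session that has already been created.

```swift
import VideoToolbox

// Sketch: pick a profile based on what the receiver's decoder supports.
func useConstrainedBaseline(on session: VTCompressionSession) {
    VTSessionSetProperty(session,
                         key: kVTCompressionPropertyKey_ProfileLevel,
                         value: kVTProfileLevel_H264_ConstrainedBaseline_AutoLevel)
}

func useConstrainedHigh(on session: VTCompressionSession) {
    VTSessionSetProperty(session,
                         key: kVTCompressionPropertyKey_ProfileLevel,
                         value: kVTProfileLevel_H264_ConstrainedHigh_AutoLevel)
}
```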
Temporal layered SVC
Now let's talk about temporal layered SVC. We can use temporal layering to improve the efficiency of multi-party video calls.
For example, consider a simple three-party video conference. In this scenario, receiver A has lower bandwidth at 600 kbps, while receiver B has higher bandwidth at 1,000 kbps. Normally, the sender would need to encode two bitstreams to match each receiver's downlink bandwidth, which is not optimal.
This scenario can be handled more efficiently with temporal layered SVC: the sender only needs to encode one bitstream, and the bitstream output can then be divided into two layers.
Let’s see how this process works. This is a sequence of encoded video frames in which each frame uses the previous frame as a predictive reference.
We can pull half of the frames into another layer, and we can change the reference so that only the frames in the original layer are used for prediction. The original layer is called the base layer, and the newly built layer is called the enhancement layer. The enhancement layer can be used as a supplement to the base layer to increase the frame rate.
For receiver A, we can send base layer frames because the base layer itself is already decodable. More importantly, since the base layer contains only half of the frames, the data rate transmitted is very low.
Receiver B, on the other hand, can enjoy smoother video because it has enough bandwidth to receive both base layer frames and enhancement layer frames.
Let's take a look at a video encoded with temporal layered SVC. I'm going to show two videos: one from the base layer alone, and one from the base layer plus the enhancement layer. The base layer by itself plays fine, but we may notice that the video isn't very smooth. If we play the second video, we can immediately see the difference: the right-hand video has a higher frame rate than the left-hand one because it contains both the base layer and the enhancement layer.
The left-hand video uses only 50% of the input frame rate and 60% of the target bit rate. These two videos require the encoder to encode only one bitstream at a time, which is much more power efficient when we're doing multi-party video conferencing.
Another benefit of temporal layering is error resilience. As we can see, the frames in the enhancement layer are not used for prediction, so nothing depends on them. If one or more enhancement layer frames are lost during network transmission, other frames are not affected, which makes the whole session more robust.
Enabling temporal layering is simple. We created a new session property in low latency mode called BaseLayerFrameRateFraction. We only need to set this property to 0.5, which means that half of the input frames are assigned to the base layer and the rest to the enhancement layer.
We can check the layer information from the sample buffer attachments. For base layer frames, the IsDependedOnByOthers sample attachment will be true; otherwise it will be false.
We can also optionally set the target bit rate for each layer. Remember that we use the session property AverageBitRate to configure the overall target bit rate. Once the target bit rate is configured, we can set the new BaseLayerBitRateFraction property to control the percentage of the target bit rate given to the base layer. If this property is not set, a default value of 0.6 is used. We recommend keeping the base layer bit rate fraction in the range of 0.6 to 0.8.
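A short Swift sketch of the properties above; the function names are hypothetical, `session` is assumed to be a low latency compression session, and the sample buffers are assumed to come from its output handler.

```swift
import VideoToolbox
import CoreMedia

// Sketch: assign half of the frames to the base layer and give the
// base layer 60% of the overall target bit rate.
func configureTemporalLayering(on session: VTCompressionSession) {
    VTSessionSetProperty(session,
                         key: kVTCompressionPropertyKey_BaseLayerFrameRateFraction,
                         value: 0.5 as CFNumber)
    VTSessionSetProperty(session,
                         key: kVTCompressionPropertyKey_BaseLayerBitRateFraction,
                         value: 0.6 as CFNumber)
}

// Sketch: check whether an encoded frame belongs to the base layer by
// reading the IsDependedOnByOthers sample attachment.
func isBaseLayerFrame(_ sampleBuffer: CMSampleBuffer) -> Bool {
    guard let attachments = CMSampleBufferGetSampleAttachmentsArray(
            sampleBuffer, createIfNecessary: false) as? [[CFString: Any]],
          let first = attachments.first else {
        return false
    }
    return first[kCMSampleAttachmentKey_IsDependedOnByOthers] as? Bool ?? false
}
```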
Maximum frame QP
Now, let’s look at the maximum frame quantization parameter or maximum frame QP. Frame QP is used to adjust image quality and data rate.
We can use low frame QP to produce high quality images. But in this case, the image size will be very large.
On the other hand, we can use high frame QP to generate low quality but small size images.
In low latency mode, the encoder adjusts the frame QP using factors such as image complexity, input frame rate, and video motion to produce the best visual quality under the constraint of the current target bit rate. So we encourage you to rely on the encoder's default behavior for adjusting frame QP.
However, in some cases where the client has specific requirements for video quality, we can control the maximum frame QP used by the encoder. When the maximum frame QP is used, the encoder will always select a frame QP less than this limit, so the client can have fine-grained control over the image quality.
It is worth noting that even if a maximum frame QP is specified, the normal rate control is still in effect. If the encoder reaches the maximum frame QP limit but the bit rate budget runs out, it will start dropping frames to preserve the target bit rate.
An example of using this capability is transferring screen-content video over a poor network. The trade-off is to sacrifice frame rate in order to send sharp images of the screen content, which can be achieved by setting the maximum frame QP.
We can pass the maximum frame QP using the new session property MaxAllowedFrameQP. According to the standard, the maximum frame QP must be between 1 and 51.
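For example, a minimal sketch of capping the frame QP; the function name and the value 30 are illustrative choices, and `session` is again assumed to be a low latency compression session.

```swift
import VideoToolbox

// Sketch: limit the maximum frame QP, e.g. for screen-content sharing
// over a poor network. Valid values are 1...51 per the H.264 standard.
func limitFrameQP(on session: VTCompressionSession, maxQP: Int = 30) {
    VTSessionSetProperty(session,
                         key: kVTCompressionPropertyKey_MaxAllowedFrameQP,
                         value: maxQP as CFNumber)
}
```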
Long Term Reference Frame (LTR)
Let's talk about the last feature we developed in low latency mode: long-term reference frames. Long-term reference frames, or LTR, can be used for error resilience. Let's look at the diagram showing the encoder, the sender client, and the receiver client in the pipeline.
Suppose the video is being communicated over a network with poor connectivity. Frames may be lost due to transmission errors. When the receiver client detects a frame loss, it can request a frame refresh to reset the session. When the encoder receives the request, it usually encodes a key frame for refresh purposes, but key frames are usually quite large. A large key frame takes longer to reach the receiver, and because network conditions are already poor, a large frame can exacerbate congestion. So, can we use a predicted frame instead of a key frame for the refresh? The answer is yes, provided we have frame acknowledgement. Let's see how it works.
First, we need to decide which frames to acknowledge. We call these frames long-term reference frames, or LTR, and the encoder decides which frames they are. When the sender client transmits an LTR frame, it also requests an acknowledgement from the receiver client. If the LTR frame is successfully received, the receiver returns an acknowledgement. Once the sender client receives the acknowledgement and passes the information to the encoder, the encoder knows which LTR frames have been received by the other side.
Now consider the case of a bad network again: when the encoder receives a refresh request, since it now has a set of acknowledged LTRs, it can encode a predicted frame from one of them. A frame encoded this way is called an LTR-P. The encoded size of an LTR-P is usually much smaller than that of a key frame, so it is easier to transmit.
Now, let's talk about the LTR APIs. Note that frame acknowledgement needs to be handled by the application layer and can be done through mechanisms such as RPSI messages in the RTP control protocol (RTCP). Here we focus only on how the encoder and the sender client communicate during this process. With low latency encoding enabled, we can enable this feature by setting the EnableLTR session property.
When an LTR frame is encoded, the encoder signals a unique frame token in the RequireLTRAcknowledgementToken sample attachment.
The sender client is responsible for reporting acknowledged LTR frames to the encoder via the AcknowledgedLTRTokens frame property. Since more than one acknowledgement can arrive at a time, we use an array to hold these frame tokens.
We can request a frame refresh at any time through the ForceLTRRefresh frame property. Once the encoder receives this request, an LTR-P will be encoded. If no acknowledged LTR is available, the encoder will generate a key frame instead.
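The sketch below shows how these pieces might fit together in Swift. The property and attachment keys are the ones described above; the function names and the acknowledgement transport (for example RTCP RPSI) are assumptions left to the application, and the token is assumed to appear in the per-sample attachment dictionary.

```swift
import VideoToolbox
import CoreMedia

// Sketch: enable LTR when configuring the low latency session.
func enableLTR(on session: VTCompressionSession) {
    VTSessionSetProperty(session,
                         key: kVTCompressionPropertyKey_EnableLTR,
                         value: kCFBooleanTrue)
}

// Sketch: per-frame properties to pass to VTCompressionSessionEncodeFrame,
// either reporting tokens acknowledged by the receiver or forcing a refresh.
func ltrFrameProperties(acknowledgedTokens: [NSNumber],
                        forceRefresh: Bool) -> CFDictionary? {
    var properties: [CFString: Any] = [:]
    if !acknowledgedTokens.isEmpty {
        properties[kVTEncodeFrameOptionKey_AcknowledgedLTRTokens] = acknowledgedTokens
    }
    if forceRefresh {
        properties[kVTEncodeFrameOptionKey_ForceLTRRefresh] = true
    }
    return properties.isEmpty ? nil : (properties as CFDictionary)
}

// Sketch: read the token that the receiver must acknowledge from an encoded frame.
func ltrToken(in sampleBuffer: CMSampleBuffer) -> NSNumber? {
    guard let attachments = CMSampleBufferGetSampleAttachmentsArray(
            sampleBuffer, createIfNecessary: false) as? [[CFString: Any]] else {
        return nil
    }
    return attachments.first?[kVTSampleAttachmentKey_RequireLTRAcknowledgementToken] as? NSNumber
}
```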
Conclusion
That concludes the translation of Peikang's session at WWDC 2021. If there are any inaccuracies in the translation, please let us know.
At present, NetEase Yunxin has implemented software-encoded SVC and a long-term reference frame scheme on the client side, and its servers also support SVC-based forwarding. SVC gives the media server an additional means of controlling the forwarded video rate; combined with streams of different sizes and rates, plus client-side downlink bandwidth detection and congestion control, it helps NetEase Yunxin continuously refine its products in pursuit of the best viewing experience. We believe the techniques shared here will soon be put to good use in Yunxin's products.
Session video: https://developer.apple.com/videos/play/wwdc2021/10158
For more technical content, follow the "NetEase Smart Enterprise Technology+" WeChat official account.