Author: Wang Xinghe, Senior Audio and Video Engine Development Engineer, NetEase Yunxin (NetEase Intelligent Enterprise)

With the arrival of AI and 5G, audio and video applications are becoming more and more widespread, and users' expectations for audio and video quality keep rising. Video resolution has progressed from HD to ultra HD and VR, and frame rates of 60 fps and even 120 fps are appearing in applications. Interactive applications also place higher demands on end-to-end latency. At the same time, the hardware capabilities of devices have advanced by leaps and bounds. It is predictable that with 5G the amount of data transmitted over the network will grow explosively, and this volume of data will pose great challenges to network transmission. So how do we guarantee a high-quality audio and video experience for users? Transmission plays a major role.

Common network problems include packet loss, jitter, congestion, and delay. The impact of these problems on the video experience is described below.

Video frames are usually split into multiple packets for transmission. When packet loss occurs on the network, the receiver cannot assemble the frame. This not only prevents that frame from being decoded, but also makes the rest of the GOP, starting from that frame, undecodable. The user sees the picture freeze until a complete I-frame reaches the receiving end and playback recovers.
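
As a rough illustration of why this happens (hypothetical frame and packet structures, not any particular engine's depacketizer), a frame is playable only if all of its packets arrived and its reference frame was itself decodable, so one missing packet can invalidate the rest of the GOP until the next I-frame:

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    frame_id: int
    is_keyframe: bool
    expected_packets: int
    received_packets: set = field(default_factory=set)

    def complete(self) -> bool:
        return len(self.received_packets) == self.expected_packets

def decodable_frames(gop: list[Frame]) -> list[int]:
    """Return ids of frames that can be decoded, assuming each
    P-frame references the previous frame in the GOP."""
    out, prev_ok = [], False
    for f in gop:
        ok = f.complete() and (f.is_keyframe or prev_ok)
        if ok:
            out.append(f.frame_id)
        prev_ok = ok
    return out

# One lost packet in frame 1 stalls decoding until the next I-frame.
gop = [Frame(0, True, 5, {0, 1, 2, 3, 4}),
       Frame(1, False, 3, {0, 2}),          # packet 1 lost
       Frame(2, False, 3, {0, 1, 2})]
print(decodable_frames(gop))                # -> [0]
```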

Jitter is another major cause of stutter. It includes sending jitter, network transmission jitter, and jitter introduced by packet loss recovery. Any link in the end-to-end chain can introduce jitter and affect the smoothness of the picture. The Jitter Buffer at the receiving end is responsible for eliminating jitter and providing a stable decoding frame rate, but it does so at the cost of delay.

When network congestion occurs, audio and video playback deteriorates. The direct consequence of congestion is bursty packet loss or bursty jitter. If congestion is not predicted in time and the amount of data being sent is not reduced, the receiving end will suffer from stutter, long delay, and poor picture quality.

Packet loss, jitter, and congestion tend to be accompanied by an increase in latency. ITU-T G.114 notes that when the end-to-end delay exceeds 400 ms, the vast majority of interactive experiences become unacceptable.

We use QoE to represent the subjective experience of users at the receiving end, while QoS measures the transmission quality of the network and provides high-quality network service through a set of quantitative indicators.

So what are QoS’s weapons against these network problems?

1 FEC

FEC (forward error correction) is a common anti-packet-loss technique. The sender transmits redundant data alongside the media data to resist packet loss on the network. When the receiver detects that media packets have been lost, it can use the received redundant data to recover them. The advantage of FEC is low packet loss recovery delay; the disadvantage is that the redundant data occupies extra bandwidth. In bandwidth-limited scenarios the original video bit rate is squeezed, resulting in lower picture quality. The recovery capability of FEC is affected not only by the amount of redundant data but also by the group size: the larger the group, the stronger the resistance to packet loss, but the larger the recovery delay.
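
As a minimal sketch of the idea (simple XOR-based FEC over a group of packets; the packet layout is hypothetical), the sender appends one parity packet per group and the receiver can rebuild any single lost packet in that group:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    # Pad to the same length, then XOR byte by byte.
    n = max(len(a), len(b))
    a, b = a.ljust(n, b"\x00"), b.ljust(n, b"\x00")
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(group: list[bytes]) -> bytes:
    """Sender side: one redundant packet protecting the whole group."""
    parity = b""
    for pkt in group:
        parity = xor_bytes(parity, pkt)
    return parity

def recover(received: dict[int, bytes], group_size: int, parity: bytes) -> bytes:
    """Receiver side: rebuild the single missing packet of the group."""
    missing = [i for i in range(group_size) if i not in received]
    assert len(missing) == 1, "XOR parity can only repair one loss per group"
    rebuilt = parity
    for pkt in received.values():
        rebuilt = xor_bytes(rebuilt, pkt)
    return rebuilt

group = [b"pkt-0", b"pkt-1", b"pkt-2", b"pkt-3"]
parity = make_parity(group)
received = {0: group[0], 1: group[1], 3: group[3]}   # packet 2 lost
print(recover(received, 4, parity))                  # -> b"pkt-2"
```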

[Figure: relationship between FEC group size and anti-packet-loss capability]
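
A back-of-the-envelope model makes the trade-off concrete (assuming independent packet losses, which real networks only approximate): a group of k media packets plus r redundant packets is fully recoverable as long as no more than r of the k+r packets are lost, so at the same redundancy ratio a larger group resists loss better, but the receiver must wait for the whole group before it can repair anything:

```python
from math import comb

def recovery_probability(k: int, r: int, loss_rate: float) -> float:
    """P(at most r losses among k+r packets) under independent random loss."""
    n = k + r
    return sum(comb(n, i) * loss_rate**i * (1 - loss_rate)**(n - i)
               for i in range(r + 1))

# Same 20% redundancy ratio, growing group size, 10% random loss:
for k, r in [(5, 1), (10, 2), (20, 4)]:
    # Recovery probability rises with group size, but so does group delay.
    print(k, r, round(recovery_probability(k, r, 0.10), 3))
```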

FEC codec algorithms include XOR, RS (Reed-Solomon), and others. The main metrics for an FEC algorithm are computational performance, anti-packet-loss capability, and recovery delay. Among them, RS coding based on a Cauchy matrix is a popular choice; combined with an appropriate grouping strategy, it performs well on all three metrics. Different algorithms suit different application scenarios, for example fountain codes and convolutional codes; fountain codes combined with receiver feedback perform better in scenarios with large packet loss fluctuation, such as wireless channels.

In general, FEC trades bit rate for anti-packet-loss capability, and its advantage is low recovery delay. The key to FEC is designing a reasonable redundancy strategy and group size so as to strike an effective balance among loss resistance, video bit rate, and recovery delay.

2 Packet loss and retransmission

Unlike FEC, retransmission does not consume much extra bit rate. Only when the receiver detects packet loss does it feed back the loss information (NACK) to the sender, which then retransmits the corresponding data. The receiver repeats the retransmission request for the same packet every RTT until the packet is successfully received. The biggest advantage of retransmission is high bit rate utilization; the disadvantage is that it introduces additional packet-loss-recovery jitter and prolongs delay. Obviously, the larger the network RTT, the worse the recovery effect of retransmission.

To cope with packet loss in both directions, the receiver can send retransmission requests for the same packet at intervals of 1/2 RTT, reducing the chance that the request itself is lost, while the sender throttles its responses for the same sequence number to roughly one per RTT to avoid wasting retransmission bit rate. Handling retransmission requests and responses at these intervals strikes an effective balance between bit rate and loss resistance and maximizes the benefit. A good retransmission strategy also needs to consider whether to tolerate out-of-order delivery and how to reasonably limit the retransmission rate.
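
A minimal receiver-side sketch of this idea (hypothetical names, not WebRTC's actual NACK module): track the gaps in the sequence number space and re-request each missing packet no more often than roughly every 1/2 RTT, while the sender would independently throttle its answers per sequence number to about once per RTT:

```python
import time

class NackTracker:
    def __init__(self):
        self.highest_seq = -1
        self.missing = {}            # seq -> timestamp of last NACK sent

    def on_packet(self, seq: int):
        """Called for every received media packet."""
        for gap in range(self.highest_seq + 1, seq):
            self.missing.setdefault(gap, 0.0)   # newly detected hole, never NACKed
        self.highest_seq = max(self.highest_seq, seq)
        self.missing.pop(seq, None)             # late or retransmitted arrival

    def build_nack(self, rtt_s: float, now: float | None = None) -> list[int]:
        """Sequence numbers to re-request; each at most once per rtt/2."""
        now = time.monotonic() if now is None else now
        resend = [s for s, t in self.missing.items() if now - t >= rtt_s / 2]
        for s in resend:
            self.missing[s] = now
        return sorted(resend)

tracker = NackTracker()
for seq in (0, 1, 4, 5):                 # packets 2 and 3 were lost
    tracker.on_packet(seq)
print(tracker.build_nack(rtt_s=0.08))    # -> [2, 3]
print(tracker.build_nack(rtt_s=0.08))    # -> [] (too soon to re-request)
```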

3 Jitter Buffer

An important link on the receiving side is the Jitter Buffer. It eliminates data jitter at the lowest possible cost in buffer delay and provides a smooth playback frame rate. Because video is decoded and played frame by frame, the delay calculation of the Jitter Buffer is based on video frames rather than video packets. The input to the Jitter Buffer is the jitter of each video frame. Many factors cause frame jitter, including capture jitter, encoding jitter, sending jitter, network jitter, and jitter introduced by packet loss repair. In short, jitter introduced by any link before decoding is aggregated into the Jitter Buffer module to be smoothed out.
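
A simplified sketch of how a frame-level jitter estimate can drive the buffer delay (a toy EWMA filter, not WebRTC's actual Kalman-based estimator): smooth the deviation between frame arrival spacing and capture spacing, and hold frames roughly that long before decoding:

```python
class FrameJitterEstimator:
    """Tracks how much frame arrivals deviate from their capture cadence."""
    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha           # EWMA smoothing factor
        self.prev_arrival = None
        self.prev_capture = None
        self.jitter_ms = 0.0

    def on_frame(self, arrival_ms: float, capture_ms: float) -> float:
        if self.prev_arrival is not None:
            # Deviation between arrival spacing and capture spacing.
            d = abs((arrival_ms - self.prev_arrival) -
                    (capture_ms - self.prev_capture))
            self.jitter_ms += self.alpha * (d - self.jitter_ms)
        self.prev_arrival, self.prev_capture = arrival_ms, capture_ms
        return self.jitter_ms

est = FrameJitterEstimator()
# Frames captured every 33 ms but arriving unevenly over the network.
for arrival, capture in [(0, 0), (35, 33), (90, 66), (100, 99)]:
    target_delay_ms = est.on_frame(arrival, capture)
print(round(target_delay_ms, 1))   # buffer delay grows with observed jitter
```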

Making effective use of retransmission's anti-packet-loss capability requires a stretching strategy in the Jitter Buffer. Retransmission only ensures that the data can reach the receiving end; in addition, the receiving end needs a large enough Jitter Buffer to wait for these late frames, otherwise even data successfully retransmitted to the receiver will be discarded for arriving too late.

Retransmission combined with the Jitter Buffer stretching strategy is a technique that trades delay for loss resistance. RTT is the key factor in how cost-effective this trade is: the smaller the RTT, the greater the retransmission gain; the larger the RTT, the worse the gain, and the more FEC is needed to resist packet loss.
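
Putting the two together, one way to express the stretch (a hypothetical policy, not the exact formula of any particular engine) is to let the target buffer delay cover both the jitter estimate and one NACK round trip when retransmission is enabled:

```python
def target_buffer_delay_ms(jitter_ms: float, rtt_ms: float,
                           nack_enabled: bool, max_delay_ms: float = 400.0) -> float:
    """Hold frames long enough for jitter, plus one NACK round trip
    (request + retransmission) when retransmission is enabled."""
    delay = jitter_ms
    if nack_enabled:
        delay += rtt_ms          # wait for at least one retransmission attempt
    return min(delay, max_delay_ms)

print(target_buffer_delay_ms(20, 50, nack_enabled=True))    # 70.0  -> NACK is worthwhile
print(target_buffer_delay_ms(20, 300, nack_enabled=True))   # 320.0 -> better to lean on FEC
```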

4 Long-term reference frames

Besides conventional means such as retransmission and FEC, there is the long-term reference frame technique, also known as reference frame selection, which is accomplished by the network module and the encoder working together. In RTC scenarios, the usual encoding reference strategy is to reference the previous frame, because the shorter the reference distance, the better the compression. For real-time reasons only I-frames and P-frames are encoded, with no B-frames. The long-term reference frame is a cross-frame reference selection strategy: it breaks the traditional rule of always referencing the immediately preceding frame and can choose reference frames more flexibly.

The purpose of the long-term reference frame strategy is that, in a packet loss scenario, the receiver can keep showing pictures without waiting for packet loss recovery. Its biggest advantage is low latency, since there is no need to wait for retransmission; the cost is compression efficiency, so some picture quality is sacrificed at the same bit rate, but in some scenarios this trade-off is worth it. In addition, long-term reference frames greatly improve resistance to sudden packet loss. When the network suddenly loses packets, the immediate recovery effect of FEC and retransmission is generally poor, especially on networks with a non-trivial RTT, whereas the long-term reference frame can bypass the lost frame and use any later frame that arrives intact for decoding and display, thus improving fluency under burst packet loss.

Given its characteristics and purpose, long-term reference frame technology requires feedback from the receiver: the encoder selects the reference frame for encoding according to the frame information fed back by the receiver. In the case of packet loss, the receiver quickly restores the picture at the cost of a short-term drop in frame rate. Because the scheme depends on receiver feedback, a long feedback delay under high RTT leads to a large reference distance; once the reference distance exceeds the encoder's reference buffer limit, the reference frame can no longer be found. Long-term reference frames are therefore suited to low-RTT scenes. In a multi-party conference scenario, having too many downlink receivers also restricts which frames can be chosen as references.
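
A sketch of this feedback-driven selection (hypothetical structures; real encoders expose it through their own LTR APIs): the receiver acknowledges frames it has decoded, and when loss is reported the encoder references the newest acknowledged frame instead of the lost previous frame, provided that frame is still within its reference buffer:

```python
class LtrController:
    """Sender-side bookkeeping for long-term reference (LTR) selection."""
    def __init__(self, ref_buffer_size: int = 8):
        self.ref_buffer_size = ref_buffer_size
        self.acked = []                  # frame ids the receiver confirmed decoded
        self.next_frame_id = 0

    def on_receiver_ack(self, frame_id: int):
        self.acked.append(frame_id)

    def pick_reference(self, loss_reported: bool) -> int | None:
        """Return the frame id to reference, or None to force a keyframe."""
        prev = self.next_frame_id - 1
        if not loss_reported:
            return prev if prev >= 0 else None    # normal case: previous frame
        # Loss reported: fall back to the newest acked frame still in the buffer.
        for fid in reversed(self.acked):
            if self.next_frame_id - fid <= self.ref_buffer_size:
                return fid
        return None                               # too old -> request a keyframe

    def encode_next(self, loss_reported: bool) -> tuple[int, int | None]:
        ref = self.pick_reference(loss_reported)
        frame_id = self.next_frame_id
        self.next_frame_id += 1
        return frame_id, ref

ctrl = LtrController()
for _ in range(5):
    ctrl.encode_next(loss_reported=False)
ctrl.on_receiver_ack(2)                        # receiver decoded frame 2
print(ctrl.encode_next(loss_reported=True))    # -> (5, 2): skip lost frames, reference 2
```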

In general, long-term reference frames suit 1v1 communication scenes and weak-network scenes with low RTT accompanied by packet loss or congestion. In such scenes they provide better real-time performance and fluency than the traditional previous-frame reference. Combined with the contribution of retransmission and FEC to loss resistance, the overall ability to withstand weak networks is greatly improved.

5 Large and small streams and SVC

Long-term reference frames work better in 1v1 scenarios, while multi-party scenarios need large and small streams and SVC to come into play.

With large and small streams, the media server can forward a stream of the appropriate quality according to the actual downlink bandwidth: when bandwidth is sufficient it forwards the high-quality stream, and when bandwidth is insufficient it forwards the low-quality stream. The advantages of this mechanism are: 1) it provides the media server with two video bit rates without having to adjust the source bit rate; 2) when downlink receivers have different bandwidths, forwarding can be done flexibly, avoiding the situation where a single encoded source forces receivers to affect one another.
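
On the server side, the selection logic can be as simple as the sketch below (bit rates and names are hypothetical; real SFUs add hysteresis and bandwidth probing): pick, per subscriber, the best simulcast stream that fits the estimated downlink bandwidth:

```python
# Hypothetical simulcast ladder: (name, bit rate in kbps) from the same source.
SIMULCAST_STREAMS = [("high", 1200), ("low", 300)]

def select_stream(downlink_kbps: float) -> str:
    """Forward the best stream that fits within the subscriber's bandwidth."""
    for name, bitrate in SIMULCAST_STREAMS:      # ordered high -> low
        if downlink_kbps >= bitrate:
            return name
    return SIMULCAST_STREAMS[-1][0]              # always send at least the low stream

for bw in (2000, 800, 150):
    print(bw, "->", select_stream(bw))           # 2000->high, 800->low, 150->low
```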

SVC offers something similar to large and small streams; the difference is that SVC provides optional layers at different frame rates. The media server can select different SVC layers to forward, reducing the bit rate by reducing the frame rate.
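
With temporal SVC the same idea applies per layer rather than per stream: each extra temporal layer roughly adds frame rate and bit rate, and the server drops the top layers first when bandwidth is tight (the layer bit rates below are illustrative assumptions):

```python
# Hypothetical temporal layers of one SVC stream: cumulative fps / kbps.
SVC_LAYERS = [
    {"layers": 3, "fps": 30,  "kbps": 1000},
    {"layers": 2, "fps": 15,  "kbps": 700},
    {"layers": 1, "fps": 7.5, "kbps": 450},
]

def select_svc(downlink_kbps: float) -> dict:
    """Keep as many temporal layers as the downlink can carry."""
    for cfg in SVC_LAYERS:                      # ordered most -> fewest layers
        if downlink_kbps >= cfg["kbps"]:
            return cfg
    return SVC_LAYERS[-1]                       # worst case: base layer only

print(select_svc(800))    # drops the top layer: 15 fps at ~700 kbps
```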

When bandwidth is insufficient, different users weigh clarity against fluency differently. SVC and the large/small stream mechanism provide flexible means to meet these different application requirements.

6 Scenario Differentiation

We roughly divide application scenarios into two categories: communication scenarios and live broadcast scenarios. Communication scenarios are straightforward, for example video conferencing, hosts talking with guests on mic, and online interviews. Such interactive scenarios have high real-time requirements and are characterized by low delay with fluency first. Live broadcast scenes, such as a host selling goods or a teacher teaching online, are characterized by the host or teacher speaking alone most of the time, so they tolerate higher delay and prioritize clarity. The two scenarios therefore have different strategy tendencies: in communication scenarios, FEC and retransmission are used more as aids to improve real-time performance, while in live scenes retransmission and FEC are used to improve clarity.
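
In practice this often boils down to a per-scenario QoS profile like the sketch below (parameter names and values are illustrative assumptions, not NetEase Yunxin's actual configuration): the communication profile caps buffer delay and leans on FEC, while the live profile allows a deeper buffer and relies more on retransmission for clarity:

```python
QOS_PROFILES = {
    "communication": {            # low delay, fluency first
        "max_jitter_buffer_ms": 200,
        "nack_enabled": True,
        "fec_ratio": 0.20,        # spend bit rate on FEC to avoid waiting
        "prefer": "framerate",    # degrade resolution before frame rate
    },
    "live": {                     # higher delay tolerated, clarity first
        "max_jitter_buffer_ms": 1000,
        "nack_enabled": True,
        "fec_ratio": 0.10,        # rely more on retransmission
        "prefer": "resolution",   # degrade frame rate before resolution
    },
}

def qos_profile(scenario: str) -> dict:
    return QOS_PROFILES[scenario]

print(qos_profile("communication")["max_jitter_buffer_ms"])   # 200
```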

Conclusion

This article has introduced the basic QoS strategies for combating weak networks. Beyond the technologies above, many other modules are involved in balancing the three dimensions of delay, clarity, and fluency. There is rarely a perfect technology; as the saying goes, you cannot have both the fish and the bear's paw. What we can do is design balancing strategies that draw on each technique's strengths and avoid its weaknesses, and under different network conditions and different application scenarios, combine them like a coordinated set of punches to maximize the benefit.