This is the second article in a series based on the talk “Architecture Design of Ultra-HD 4K Video Low-Delay Live Broadcasting and RTC Fusion”, given at QCon Shanghai by Wu Tong, senior technical architect for multimedia at NetEase Yunxin. The first article in the series is “Ultra-HD 4K Video Low-latency Live Streaming and RTC Convergence Architecture Design ①: 5G and the Future Network Landscape”. In this article, Wu Tong starts from the choice of transport protocol and congestion control algorithm, and shares NetEase Yunxin's practice in optimizing the BBR algorithm for low-latency live streaming and RTC scenarios, as well as the design of NetEase Yunxin's low-latency topology.
Live streaming has boomed in recent years, as everyone has seen first-hand. Platforms of all kinds have flourished since 2015, and in 2016 and 2017 people began to look for a differentiated experience in interactive co-hosted (lianmai) live streaming. Live streaming has now entered its second half, and low-latency live streaming has become the new breakthrough point.
I believe many of you have seen Zootopia. Flash the sloth is adorable, but no one wants the female host of a live stream to react as slowly as he does.
So what factors determine low latency?
(1) Transmission dimension
- Transport protocol
- Congestion control algorithm
- Server distribution network
- The first mile: distribution and routing
(2) Other dimensions
- Capture, encoding, and pre-processing delay at the sender
- Jitter buffer processing delay at the receiver
- Decoding, post-processing, rendering, and playback delay at the receiver
Today, in the interest of time, we will focus on the determinants of the transport dimension.
Transport protocol selection
First let’s look at transport protocol selection.
1. The RTMP protocol
RTMP is currently the most widely used protocol for pushing and pulling live streams. Standard RTMP uses TCP as its transport layer, so its shortcomings are obvious:
(1) Poor resilience to packet loss. When a packet is lost, the send window shrinks sharply and the send rate drops;
(2) The overall delay is uncontrollable, because it depends mainly on ACK-based reliable delivery to the receiver;
(3) Congestion control is expensive to optimize. TCP's congestion control is implemented in the operating system kernel, so in practice it is usually tuned only in the server kernel, while clients have to rely on whatever congestion control each platform's operating system provides.
In summary, standard RTMP is not suitable for low-latency scenarios.
2. The QUIC protocol
Most of you are familiar with QUIC (Quick UDP Internet Connections). In recent years, as Google's QUIC protocol has been studied in more depth, many teams and projects have begun to replace traditional RTMP over TCP with RTMP over QUIC.
QUIC is an excellent protocol and is now the standard transport for HTTP/3. Its advantages are well known:
(1) 0-RTT connections. Because QUIC runs over UDP, it has the freedom to redesign protocol features, including a 1-RTT handshake for the first connection and a 0-RTT handshake for subsequent reconnections;
(2) It supports multiplexing as HTTP/2 does and, because it runs over UDP, it also solves the head-of-line blocking problem;
(3) FEC is supported, so resilience to packet loss is better than with TCP;
(4) Because it is implemented at the application layer, congestion control can be improved flexibly. Various congestion control algorithms can be plugged into QUIC.
In short, QUIC is excellent. As a low-latency live streaming protocol, however, it also has some drawbacks:
(1) It is by nature a reliable protocol, so we cannot easily take active control of the protocol's delay;
(2) As a general-purpose protocol, it is not media-aware: it does not understand what the audio and video data it carries means, and treats all of it alike.
3. The SRT protocol
Another transport protocol has emerged in recent years: Secure Reliable Transport (SRT). It was created by Haivision and Wowza and open-sourced in 2017. It has the following characteristics:
- Security: SRT supports AES encryption to secure video transmission end to end;
- Reliability: SRT uses forward error correction (FEC) to keep transmission stable;
- Low latency: SRT is built on UDT, an older reliable transport protocol based on UDP. Native UDT has relatively high transmission delay, so SRT heavily optimizes the congestion control strategy to reduce it.
SRT is still a relatively niche protocol. We did not choose it as our transport protocol for two main reasons:
(1) The protocol is highly complex;
(2) Its send rate backs off heavily under packet loss.
4. Self-developed low delay transmission protocol
In fact, the designs of QUIC and SRT have a lot in common:
(1) Both use UDP as the underlying transport, which I think is very much the right direction;
(2) Both adopt FEC to improve resilience to packet loss;
(3) Both redesign the congestion control algorithm to achieve more accurate bandwidth estimation and congestion detection, and thereby reduce transmission delay.
We therefore chose to develop our own low-latency transport protocol on top of UDP and to use it in both low-latency live streaming and RTC scenarios.
Here are the main features of the protocol we designed:
(1) The protocol can describe the reliability and priority requirements of each packet, so that packets can be handled differently according to reliability and priority along the whole transmission path;
(2) The protocol is designed entirely for audio and video media and can easily describe all kinds of media types. H.264, H.265, VP8, VP9, AV1, and VVC can all be described directly in the protocol;
(3) The protocol can describe video slice information, temporal-layer information, and multi-stream information, which lets the server apply per-layer QoS policies and makes it easier for the receiver to reassemble video frames;
(4) It supports a single global sequence number at the transport layer, which simplifies feedback handling and is very useful for calculating packet loss and jitter;
(5) It supports FEC. Used sensibly, FEC is very effective against random packet loss, especially in high-RTT scenarios such as cross-border transmission;
(6) Packets are fragmented at the application layer so that they stay below the MTU, which helps against packet loss and against problems on intermediate routes;
(7) It supports rich media extension information such as timestamps, video rotation angle, and audio volume. Here we partly borrowed from the design of RTP, while also simplifying it;
(8) The protocol carries application-layer routing information, which makes server cascading and edge acceleration much easier to implement.
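As a rough illustration of features (1) and (4), the sketch below packs a per-packet priority, a reliability level, and a transport-wide sequence number into a small fixed header. The field layout, the field widths, and the name MediaPacketHeader are all hypothetical; the article does not describe the actual wire format.

```python
import struct
from dataclasses import dataclass

# Hypothetical 8-byte header in network byte order (not the real wire format):
#   B  media type
#   B  priority (high 4 bits) | reliability (low 4 bits)
#   H  transport-wide sequence number shared by all streams
#   I  media timestamp
_HEADER = struct.Struct("!BBHI")


@dataclass
class MediaPacketHeader:
    media_type: int     # assumed codes, e.g. 0 = audio, 1 = video
    priority: int       # 0..15, higher means more important
    reliability: int    # 0..15, e.g. 0 = best effort, 1 = may be retransmitted
    transport_seq: int  # one global order for the whole connection
    timestamp: int

    def pack(self) -> bytes:
        flags = ((self.priority & 0x0F) << 4) | (self.reliability & 0x0F)
        return _HEADER.pack(
            self.media_type,
            flags,
            self.transport_seq & 0xFFFF,
            self.timestamp & 0xFFFFFFFF,
        )

    @classmethod
    def unpack(cls, data: bytes) -> "MediaPacketHeader":
        media_type, flags, seq, ts = _HEADER.unpack_from(data)
        return cls(media_type, flags >> 4, flags & 0x0F, seq, ts)
```

Because every packet carries its own priority and reliability, a relay server could, for example, drop low-priority packets first under congestion without parsing the codec payload.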
RTP is in fact already an excellent media transport protocol; we designed our own so that it would be more convenient to use in our cloud communication scenarios and server architecture. If you have no special requirements, you can use RTP directly and implement the more complex features with RTP header extensions.
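To make the FEC feature (5) more concrete, here is a minimal sketch of the simplest FEC scheme: one XOR parity packet per group, which can rebuild a single lost packet without waiting a round trip for a retransmission. It is only an illustration. The article does not say which FEC code the protocol uses, and the function names are made up for this example.

```python
from typing import List


def xor_parity(packets: List[bytes]) -> bytes:
    """Build one parity packet: the byte-wise XOR of all packets,
    with shorter packets treated as zero-padded."""
    parity = bytearray(max(len(p) for p in packets))
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received: List[bytes], parity: bytes, missing_len: int) -> bytes:
    """Rebuild the single lost packet of a group from the surviving
    packets and the parity packet."""
    rebuilt = bytearray(parity)
    for packet in received:
        for i, byte in enumerate(packet):
            rebuilt[i] ^= byte
    return bytes(rebuilt[:missing_len])


# Usage: protect a group of three packets, then lose the second one.
group = [b"frame-A", b"frame-BB", b"fr-C"]
parity = xor_parity(group)
restored = recover_missing([group[0], group[2]], parity, len(group[1]))
```

The cost is one extra packet per group, which is why FEC pays off most when the RTT is high and retransmission would arrive too late.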
With the protocol chosen, the next step is to choose a congestion control algorithm.
Congestion control algorithm selection
Congestion control has always been a hot topic in audio and video, and it is a direction the industry keeps exploring.
1. TCP congestion control algorithm
TCP's algorithms, from the early Reno and BIC to the widely used CUBIC, rest on clever design and mathematical principles. But because they are loss-based, they are too sensitive to packet loss and their delay is uncontrollable, so they cannot be used for low-latency transmission.
2. GCC
GCC is a congestion control algorithm commonly used in RTC and is the default in open-source WebRTC. It is based mainly on delay plus packet loss and has three core modules:
(1) Arrival-time filter;
(2) Overuse detector;
(3) Rate controller.
GCC has been iterated on continuously in recent years. In early versions, bandwidth was estimated at the receiver: the sender carried abs-send-time in the media packet header, the receiver estimated bandwidth with a Kalman filter, and the estimate was returned to the sender in REMB messages. Later, bandwidth estimation moved to the sender: the sender carries a unified transport-wide sequence number in the media packet header, the receiver reports per-packet information, including arrival times, back to the sender in transport-cc feedback packets, and the sender estimates bandwidth with a linear filter.
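The delay-based part of sender-side GCC can be sketched roughly as follows: compare the spacing of packets on arrival with their spacing when sent, accumulate the differences, and fit a line to see whether queuing delay is trending upward. This is a simplified illustration, not WebRTC's implementation; real GCC also groups packets into bursts, smooths the samples, and compares the trend against an adaptive threshold. The function names are assumptions.

```python
from typing import List, Tuple


def delay_gradients(samples: List[Tuple[float, float]]) -> List[float]:
    """samples: (send_time_ms, arrival_time_ms) per packet, in send order.
    Returns (arrival spacing) - (send spacing) for each consecutive pair."""
    return [
        (a1 - a0) - (s1 - s0)
        for (s0, a0), (s1, a1) in zip(samples, samples[1:])
    ]


def least_squares_slope(xs: List[float], ys: List[float]) -> float:
    """Slope of the best-fit line of ys over xs (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den if den else 0.0


def queue_delay_trend(samples: List[Tuple[float, float]]) -> float:
    """Positive: queuing delay is growing (overuse). Near zero: stable."""
    accumulated, total = [], 0.0
    for gradient in delay_gradients(samples):
        total += gradient
        accumulated.append(total)
    arrivals = [arrival for _, arrival in samples[1:]]
    return least_squares_slope(arrivals, accumulated)


# Packets sent every 20 ms; in the first trace each one waits 2 ms longer.
growing = [(0, 50), (20, 72), (40, 94), (60, 116)]
stable = [(0, 50), (20, 70), (40, 90), (60, 110)]
```

A persistently positive trend would lead the rate controller to decrease the send rate; a trend near zero lets it hold or increase.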
Because GCC is relatively mature in the industry, I will not go into more detail here, and will simply share some problems we found in practice:
(1) Bandwidth estimation is not accurate enough, especially at low bandwidth;
(2) It backs off excessively when competing with TCP;
(3) Under delay jitter, the estimated bandwidth fluctuates widely.
Some of these problems will gradually be addressed as GCC continues to evolve, but others cannot be solved within the GCC framework. So we turned to BBR.
3. BBR
BBR (Bottleneck Bandwidth and Round-trip propagation time) was proposed and promoted by Google in 2016. It is based on bandwidth and delay feedback. BBR has many advantages, of which I think the three core ones are:
(1) Its guiding idea is to avoid queuing as far as possible and keep delay low;
(2) Accurate bandwidth estimation;
(3) In most scenarios, it can compete with traditional loss-based TCP congestion control algorithms without losing out.
NetEase Yunxin has been using BBR in its projects since 2018 and has optimized it for low-latency use. Today I would like to talk briefly about applying BBR to low-latency live streaming and RTC.
BBR application in low delay live broadcast and RTC scenarios
The core of BBR is to find the maximum bandwidth and the minimum delay of the current link. Their product is the bandwidth-delay product (BDP), the maximum amount of data the network link can hold in flight. Knowing the BDP answers the question of how much data to send, and knowing the maximum bandwidth answers the question of how fast to send it. In other words, BBR answers two questions: whether to send, and how fast.
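A small worked example of these two quantities, using assumed numbers (a 2.5 Mbps bottleneck and a 40 ms minimum RTT). The cwnd_gain default of 2 follows BBR's usual steady-state setting; the function names are illustrative.

```python
def bdp_bytes(max_bw_bps: float, min_rtt_ms: float) -> float:
    """Bandwidth-delay product in bytes: max bandwidth x min RTT."""
    return max_bw_bps * min_rtt_ms / 1000 / 8


def pacing_rate_bps(max_bw_bps: float, pacing_gain: float = 1.0) -> float:
    """How fast to send: the estimated bandwidth scaled by the current gain."""
    return max_bw_bps * pacing_gain


def inflight_cap_bytes(max_bw_bps: float, min_rtt_ms: float,
                       cwnd_gain: float = 2.0) -> float:
    """How much to send: at most cwnd_gain x BDP may be unacknowledged."""
    return cwnd_gain * bdp_bytes(max_bw_bps, min_rtt_ms)


# Assumed link: 2.5 Mbps bottleneck, 40 ms minimum RTT.
bdp = bdp_bytes(2_500_000, 40)           # 12500.0 bytes fill the pipe
rate = pacing_rate_bps(2_500_000, 1.25)  # 3125000.0 bps while probing upward
cap = inflight_cap_bytes(2_500_000, 40)  # 25000.0 bytes in flight at most
```

If the sender already has more than the cap in flight, it waits; otherwise it sends at the pacing rate.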
Let's look at a classic graph from a talk by BBR's authors. The horizontal axis is the amount of data in flight on the link, and the vertical axes are RTT and bandwidth. At first the RTT stays constant while the bandwidth keeps rising, because the network is not yet congested. Once the bandwidth stops rising, the RTT keeps increasing until packets are lost: the network is now congested and packets accumulate in the router's buffer, so the RTT grows while the bandwidth does not.
The red dotted boxes in the figure mark the maximum bandwidth and minimum delay under ideal conditions. To compute the BDP we need both, but the minimum RTT and the maximum bandwidth cannot be measured at the same time, so they must be probed separately.
To probe the maximum bandwidth, keep increasing the amount of data sent and fill the network buffer until the bandwidth has stopped increasing for a period of time.
To probe the minimum RTT, drain the buffer as far as possible so that the transmission delay is as low as possible.
BBR’s state machine is divided into four phases: Startup, Drain, ProbeBW, and ProbeRTT.
- Startup
Startup is similar to slow start in conventional congestion control, with a gain of 2/ln2 ≈ 2.89. The send rate is multiplied by this gain every round trip, and BBR enters the Drain state once the estimated bandwidth stops growing, meaning the pipe is full.
- Drain
In the Drain state the gain is ln2/2 ≈ 0.35, so sending slows down. Over about one round trip, BBR drains the data queued during Startup. The amount of data sent but not yet acknowledged is called inflight. When inflight < BDP, the queue has been drained and BBR moves on; while inflight > BDP, it stays in Drain.
- ProbeBW
ProbeBW is the steady state: a maximum bottleneck bandwidth has been measured and queuing is kept to a minimum. From then on BBR stays in ProbeBW (unless it enters the ProbeRTT state described below), cycling round trip by round trip through the gains [1.25, 0.75, 1, 1, 1, 1, 1, 1], so that the send rate hovers around the point where the bottleneck bandwidth stops growing. BBR should spend about 98% of its time in ProbeBW.
- ProbeRTT
Any of the three states above can transition into ProbeRTT. If no smaller RTT has been observed for more than ten seconds, BBR enters ProbeRTT and reduces the amount of data sent, clearing the path so that the RTT can be measured accurately. It leaves this state after at least 200 ms or one round trip. On exit, it checks whether the pipe is full: if not, it returns to Startup; if so, it enters ProbeBW.
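The four states and their gains can be summarized in a compact sketch. This is a simplified model of the transitions described above, not a full BBR implementation: the inputs (pipe_full, min_rtt_stale, probe_rtt_done) are assumed to be computed elsewhere, and the function names are illustrative.

```python
import math

STARTUP_GAIN = 2 / math.log(2)   # about 2.885
DRAIN_GAIN = 1 / STARTUP_GAIN    # about 0.347, the inverse of the Startup gain
PROBE_BW_GAINS = [1.25, 0.75, 1, 1, 1, 1, 1, 1]


def pacing_gain(state: str, cycle_index: int = 0) -> float:
    """Gain applied to the estimated bottleneck bandwidth in each state."""
    if state == "Startup":
        return STARTUP_GAIN
    if state == "Drain":
        return DRAIN_GAIN
    if state == "ProbeBW":
        return PROBE_BW_GAINS[cycle_index % len(PROBE_BW_GAINS)]
    return 1.0  # ProbeRTT: the small window, not the pacing gain, limits sending


def next_state(state: str, *, pipe_full: bool, inflight: float, bdp: float,
               min_rtt_stale: bool, probe_rtt_done: bool = False) -> str:
    """One step of the simplified state machine."""
    if state != "ProbeRTT" and min_rtt_stale:
        return "ProbeRTT"   # no lower RTT seen for about ten seconds
    if state == "Startup" and pipe_full:
        return "Drain"      # bandwidth stopped growing
    if state == "Drain" and inflight <= bdp:
        return "ProbeBW"    # the queue built in Startup is gone
    if state == "ProbeRTT" and probe_rtt_done:
        return "ProbeBW" if pipe_full else "Startup"
    return state
```

Note that the eight ProbeBW gains average to exactly 1, so over a full cycle the sender transmits at the estimated bottleneck bandwidth while still probing briefly for more.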
BBR has many advantages, but it does run into problems in low-latency live streaming and RTC scenarios:
(1) Slow convergence;
(2) Insufficient resilience to packet loss: with the original algorithm, once packet loss exceeds 20% the bandwidth drops off a cliff;
(3) ProbeRTT sends only 4 packets, which does not suit low-latency streaming media applications.
How do we solve these problems?
(1) Slow convergence
- Method 1: as proposed in BBR v2, drain the queue completely in one go during the 0.75x phase of ProbeBW;
- Method 2: randomize the six 1x steady-sending phases to shorten the time needed for draining.
(2) Insufficient resilience to packet loss: compensating each ProbeBW cycle for the non-congestion packet loss rate effectively improves resilience to packet loss.
(3) ProbeRTT sends only 4 packets: following BBR v2, run ProbeRTT once every 2.5 s and send at 0.5 x BDP.
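Fixes (2) and (3) can be illustrated with a short sketch. The article does not give the formulas NetEase Yunxin uses, so both functions below are assumptions: one plausible reading of "compensating for non-congestion loss" is to scale the gain by 1 / (1 - loss rate) so that the delivered rate still matches the estimated bandwidth, and the ProbeRTT window is shown as half a BDP with the original 4-packet floor. The names, the 1200-byte packet size, and the 50% loss clamp are hypothetical.

```python
def compensated_pacing_gain(base_gain: float, random_loss_rate: float,
                            max_loss: float = 0.5) -> float:
    """Assumed compensation: send enough extra that, after random
    (non-congestion) loss, about base_gain x bandwidth still arrives."""
    loss = min(max(random_loss_rate, 0.0), max_loss)
    return base_gain / (1.0 - loss)


def probe_rtt_cwnd_bytes(bdp_bytes: float, mss: int = 1200,
                         original: bool = False) -> float:
    """Window used while probing the minimum RTT.
    original=True: the 4-packet window of the original algorithm.
    Otherwise: 0.5 x BDP, never below 4 packets."""
    if original:
        return 4 * mss
    return max(0.5 * bdp_bytes, 4 * mss)
```

For a 12500-byte BDP the half-BDP window is 6250 bytes, compared with 4800 bytes for four 1200-byte packets; the gap widens as the BDP grows, which is what keeps a high-bitrate stream from stalling during ProbeRTT.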
With these optimizations in place, let's compare BBR with GCC.
In the first scenario, the bandwidth is limited to 300K and the limit is lifted after a while. We can see that:
(1) BBR estimates bandwidth more accurately than GCC (GCC: 250K, BBR: 300K);
(2) After the limit is lifted, GCC needs more than 20 seconds to recover to the maximum bandwidth, while BBR needs only 2 seconds.
The second scenario has a 2.5 Mbps bandwidth limit, 200 ms of delay, and 200 ms of jitter. GCC's bandwidth estimate reaches only 300K, while BBR's fluctuates around 2.5 Mbps.
Server distribution network
Next, let's talk about the server distribution network for low-latency live streaming.
The distribution network of a traditional CDN is generally a tree or a variant of one. It scales well and can easily serve a large number of users, but it also has drawbacks in low-latency or RTC scenarios:
(1) As the number of users sending upstream data grows, the top-level node easily becomes the system's bottleneck and single point of failure;
(2) The deeper the tree, the greater the end-to-end delay of media data, so live streaming over a traditional CDN has a delay of at least 3 to 4 seconds.
So how does NetEase Yunxin design its low-latency topology?
Overall, we adopt a hybrid architecture that fuses a mesh topology, a tree topology, and proxy acceleration.
The center of the topology is a core routing and forwarding network in which every pair of nodes is directly connected. This design keeps the center stable and highly available, reduces the complexity of routing data between central nodes, and keeps transmission delay within the center low.
Outside the central mesh, we use a two-layer tree to provide scalability at the edge and let users connect to a nearby node. The root of this two-layer tree is not fixed: if a leaf node's current root goes down, the leaf automatically attaches to another node in the mesh, which preserves high availability.
Proxy nodes exist to handle special cases, for example when no leaf node offers good access for the users of a small carrier; such users can connect through a proxy acceleration node instead. The biggest difference between proxy acceleration nodes and leaf nodes is their machine resource requirements. A proxy node is a stateless, pure acceleration node that is invisible to the business layer, running on a high-performance server fully customized for our transport protocol.
We usually choose BGP data centers for the mesh nodes and single-line data centers for the leaf nodes of the tree. A single-line data center is often several times cheaper than a BGP one. So when choosing which nodes users connect to and which machines make up the whole topology, we consider not only the quality of user access and of the network between nodes, but also the cost. Only a product that balances quality and cost can last. With this architecture we get horizontal scalability and controlled latency, while making sure that no machine in the topology is a single point of failure.
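The mesh-plus-tree idea can be modeled in a few lines: core nodes are fully interconnected, each leaf hangs off one core node, and a leaf re-attaches to another core node when its root fails. This is a toy model under stated assumptions: the node names are invented, each leaf is assumed to attach directly to a mesh node, proxy nodes are left out, and the re-parenting rule (pick the first surviving mesh node) stands in for whatever policy the real system uses.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List, Set


def build_topology(mesh: List[str], leaves: Dict[str, str]) -> Dict[str, Set[str]]:
    """mesh: core nodes, every pair directly connected.
    leaves: leaf node -> the mesh node currently acting as its root."""
    graph: Dict[str, Set[str]] = {n: set() for n in list(mesh) + list(leaves)}
    for a, b in combinations(mesh, 2):
        graph[a].add(b)
        graph[b].add(a)
    for leaf, root in leaves.items():
        graph[leaf].add(root)
        graph[root].add(leaf)
    return graph


def hops(graph: Dict[str, Set[str]], src: str, dst: str) -> int:
    """Breadth-first search; returns the hop count, or -1 if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1


def reparent(leaves: Dict[str, str], failed_root: str, mesh: List[str]) -> Dict[str, str]:
    """Move every leaf of a failed root to a surviving mesh node."""
    alive = [m for m in mesh if m != failed_root]
    if not alive:
        raise ValueError("no mesh node left")
    return {leaf: (alive[0] if root == failed_root else root)
            for leaf, root in leaves.items()}


mesh = ["core-bj", "core-sh", "core-gz"]
leaves = {"leaf-a": "core-bj", "leaf-b": "core-gz"}
before = hops(build_topology(mesh, leaves), "leaf-a", "leaf-b")

# core-bj fails: leaf-a re-attaches and the path length is unchanged.
moved = reparent(leaves, "core-bj", mesh)
after = hops(build_topology(["core-sh", "core-gz"], moved), "leaf-a", "leaf-b")
```

In this model any two leaves are at most three hops apart (leaf, root, root, leaf) regardless of how many leaves are added, which is the property that keeps delay bounded while the edge scales out.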
The first mile: distribution and routing
The first mile is about how our scheduling center assigns access nodes to clients.
First, the proximity, carrier-matching, and load-balancing principles are applied to select several nodes that are best in theory for the user. The priorities given by the proximity and carrier-matching rules are then adjusted using connection data (such as packet loss and delay) reported by real clients.
Once the client has the allocation list, it runs a quick speed test against each node and connects to the one that performs best this time. The remaining nodes are kept for fast access during failover.
This strategy combines theoretical allocation, big-data analysis, and actual measurement, and lets users recover their connection quickly when a server becomes unavailable.
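The two-step selection described above can be sketched as follows. The Node fields, the ordering of the three rules, and the function names are assumptions made for illustration; the adjustment of priorities from client-reported data is not modeled, and the probe results are passed in rather than measured.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Node:
    name: str
    region: str
    carrier: str
    load: float  # 0.0 (idle) to 1.0 (full)


def allocate(nodes: List[Node], region: str, carrier: str, k: int = 3) -> List[Node]:
    """Scheduler side: rank by proximity, then carrier match, then load,
    and hand the client the k best candidates."""
    def cost(node: Node) -> Tuple[bool, bool, float]:
        return (node.region != region, node.carrier != carrier, node.load)
    return sorted(nodes, key=cost)[:k]


def pick_by_probe(candidates: List[Node],
                  rtt_ms: Dict[str, float]) -> Tuple[Node, List[Node]]:
    """Client side: connect to the node with the best measured RTT and
    keep the rest, in order, as failover backups."""
    ranked = sorted(candidates, key=lambda n: rtt_ms.get(n.name, float("inf")))
    return ranked[0], ranked[1:]


nodes = [
    Node("hz-ct", "hangzhou", "telecom", 0.7),
    Node("hz-cu", "hangzhou", "unicom", 0.2),
    Node("sh-ct", "shanghai", "telecom", 0.1),
    Node("gz-cm", "guangzhou", "mobile", 0.1),
]
candidates = allocate(nodes, region="hangzhou", carrier="telecom")
best, backups = pick_by_probe(candidates, {"hz-ct": 35, "hz-cu": 12, "sh-ct": 20})
```

In this example the theoretically best node (hz-ct) loses to hz-cu once real RTTs are measured, which is exactly the case the client-side speed test is meant to catch.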
That concludes this discussion of low-latency solutions for live streaming and RTC. Look out for the third article in this series, “Design of Ultra-HD 4K Video Low Delay Live Streaming and RTC Fusion Architecture ③: Design of Live Streaming and RTC Fusion Architecture”. You are welcome to leave comments and talk with Wu Tong one to one.