Author: Zhan Haibo and Zeng Xuhong

Background and Requirements

In the early days of live streaming there was little interaction: the anchor controlled the scene alone, so a delay of more than ten seconds had little impact on the user experience. Most common live broadcasts use the RTMP, HLS, and FLV protocols, which are technically mature, widely compatible, and support large-scale concurrency. However, their end-to-end delay can at best be kept within 4-6 seconds, which degrades the interactive experience, hinders the adoption of live streaming in some scenarios, and holds back the live streaming application ecosystem.

With the accelerated spread of the "live streaming +" model across industries, diversified interactive formats such as e-commerce live streaming, online classes, sports events, and interactive entertainment have raised users' expectations for real-time interaction, pushing end-to-end delay into the millisecond era.

  • E-commerce sales: the anchor counts down "3, 2, 1" to kick off a flash sale, but it takes about 5 seconds before the audience's feedback and orders appear in the comment area. The delay dampens the interaction between buyers and sellers and lowers the conversion rate of the goods.
  • Online class: by the time the teacher has moved on to the next point, a student asks a question about the previous point in the comment section, and the teacher has to go back to answer it.
  • Sports events: on a delayed stream, viewers may learn that a goal has been scored (from a neighbor's TV or a push notification) before they see it themselves.
  • Showroom live streaming: users send gifts to their favorite anchors. With a large delay, a user hears the host's verbal thanks only 5 seconds later, by which time the on-screen comment and gift animations have long since passed, hurting the interaction between the two sides.
  • Online quiz shows: (1) Some users can see the host's question screen ahead of the Douyin live streaming system via a satellite TV broadcast, so they answer first in the comment section. (2) With TCP pull streaming, viewers who enter a room late have a longer delay than earlier viewers, creating a large gap in when each viewer sees the questions, so the fairness of the quiz cannot be guaranteed.

Goals

  • End-to-end live streaming latency below 1200 ms.
  • Support for low-latency live streaming at large scale, with tens of millions of concurrent viewers.
  • Mature and stable, with core service metrics monitorable across the whole link.
  • Deliver the best possible viewing experience: low latency, low stall rate, and instant startup.
  • Simple and easy to use: the business side integrates the RTS SDK on top of the existing playback SDK, without changing the overall push/pull streaming architecture. Existing live services can migrate and upgrade smoothly to the low-latency system, with configurable switches for A/B testing and fallback.
  • Seamless switching between a plain live broadcast and a co-streaming (Lianmai) scene, in both directions, without interrupting the audio and video media streams.

Solution Features

The Real-Time Streaming (RTS) system is built on the RTC real-time audio/video engine and the traditional RTMP live streaming system, integrating the latest RTP push/pull streaming protocols and real-time media transmission strategies into the RTC module. It provides easy-to-integrate, millisecond-latency, high-concurrency, high-definition, smooth audio/video live streaming and co-streaming services. As network bandwidth and CPU computing costs continue to fall, and as cloud vendors at home and abroad deploy, develop, and mature the technology at scale, low-latency live streaming is clearly the direction the technology is heading.

Traditional live broadcast scheme

Based on the TCP/RTMP/HTTP protocols: large delay and weak resistance to poor networks.

  • HLS: 10-20 seconds
  • HTTP-FLV: 3-5 seconds
  • RTMP: 3-5 seconds

Low latency live broadcast scheme

  • N-to-N real-time audio/video conference: based on UDP/RTP, using an MCU or SFU conference architecture. With a delay within 800 ms, multiple participants in a room can hold a video call in which each participant sees and hears all the others. Every participant both pushes and pulls streams, data flows in both directions, and CPU and bandwidth consumption grow quadratically with the number of participants (see the sketch below), so rooms are kept small, generally no more than 20 people.
  • 1-to-N low-latency live streaming: based on UDP/RTP, still a traditional broadcast service architecture but with a relatively low delay of 400-1500 ms. The business model is simple: data flows one way, a single anchor pushes the stream, and there is no limit on the audience size; millions of viewers can play simultaneously.
  • 2-to-N low-latency co-streaming (Lianmai) live: based on UDP/RTP, combining the real-time conference architecture with the traditional broadcast architecture, with a delay of 400-1500 ms. The two co-streaming parties communicate in real time through an MCU node, while the audience pulls the composited picture and mixed audio of the co-streaming participants through the traditional CDN network. This scheme balances the real-time interaction required for co-streaming against the low cost of live distribution, allowing millions of viewers to pull the mixed stream at the same time.
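A quick arithmetic sketch of why the conference architecture caps its room size while the broadcast architecture scales; this is a hedged illustration, not production code:

```python
# Stream-count arithmetic behind the two architectures above, assuming an
# SFU-style conference where every participant publishes one stream and
# subscribes to every other participant.

def conference_streams(n: int) -> int:
    """N-to-N conference: each of n participants pulls n - 1 remote streams."""
    return n * (n - 1)            # quadratic growth -> room size is capped

def broadcast_streams(viewers: int) -> int:
    """1-to-N broadcast: one uplink plus one downlink per viewer."""
    return 1 + viewers            # linear growth -> millions of viewers are fine

print(conference_streams(20))     # 380 forwarded streams for only 20 people
print(broadcast_streams(1_000_000))
```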

System Architecture Diagram

Low latency live broadcast fusion architecture diagram

  • RTMP push streams and RTP push/co-streaming audio and video links coexist.
  • FLV pull streams and RTP pull-stream playback links coexist.
  • CDN nodes coexist with RTC nodes, improving the bandwidth reuse rate of servers and networks.

Nearby access and intelligent scheduling system

1. Traditional RTMP live streaming push/pull flow:

  • The SLB global load balancing system handles scheduling and nearest-node access for CDN edge nodes.
  • L1 edge nodes are generally single-line access servers of China Mobile, China Unicom, or China Telecom (low cost, deployed in large numbers), while L2 nodes are generally multi-line access servers (high cost, deployed in small numbers).
  • A BGP multi-line server often costs several times as much as a single-line access server. So when selecting the nodes users access, and which machines form the overall network topology, we must weigh users' access quality and the network conditions between nodes against cost. Only by balancing effect and cost can the product be sustainable.

2. Push/pull streaming with RTS live and RTMP live coexisting:

  • Except for the first mile (RTP edge push) and the last mile (RTP edge pull), all media streaming links remain RTMP.
  • Between the RTP push end and the MCU node, and between the RTP playback end and the L1 fusion node, the system implements paced sending, NACK retransmission, FEC packet-loss protection, bandwidth estimation with adaptive bit rate, a video JitterBuffer, and audio NetEQ against network jitter.

Push-stream architecture

  • Pure live streaming push end:

  • Anchor-to-anchor co-streaming (Lianmai) end:

  • Anchor-to-fan co-streaming end:

Playback-side architecture

Traditional player: receives data frames over RTMP/TCP and resists network jitter with the player's own buffer, which results in a large delay.

Low-latency player: receives packets over RTP/UDP and resists network jitter with the network-side jitter buffer and weak-network countermeasures; the delay is small and the player buffer is set to 0.

Traditional live broadcast delay distribution

Low latency live broadcast delay distribution

Key link transformation points

RTC audio codec should support AAC

  • At present, CDNs and the live broadcast center generally accept only AAC audio bitstreams, while the RTC client defaults to Opus encoding.
  • The audio codec of the RTC client must therefore be changed from Opus to AAC.
  • The audio codec processing on the MCU server must also be changed to AAC.
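As one illustration of what such a change touches, the sketch below reorders the audio payload types in an SDP so a given codec is preferred during negotiation. It is a hedged sketch: the rtpmap name and the assumption that both ends already ship an AAC codec are ours, not the original system's.

```python
import re

def prefer_audio_codec(sdp: str, codec_name: str) -> str:
    """Move the payload types whose rtpmap matches codec_name to the front
    of the m=audio line, so the peer negotiates that codec first."""
    lines = sdp.splitlines()
    # e.g. "a=rtpmap:96 MP4A-LATM/44100/2" -> payload type "96"
    pts = [m.group(1) for line in lines
           if (m := re.match(r"a=rtpmap:(\d+) " + re.escape(codec_name), line))]
    out = []
    for line in lines:
        if line.startswith("m=audio") and pts:
            head, port, proto, *payloads = line.split()
            reordered = pts + [p for p in payloads if p not in pts]
            line = " ".join([head, port, proto] + reordered)
        out.append(line)
    return "\r\n".join(out)
```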

RTC video codec should support B frame

  • Traditional RTMP live streaming generally supports B-frames, which achieve higher definition at the same bit rate.
  • RTC does not support B-frames by default. To keep the definition standard of the low-latency live streaming system at the same bit rate, the push end, the playback end, and the edge nodes must all be modified to support B-frames.
  • However, while a co-streaming session is active, the RTC real-time media link between anchors, or between an anchor and a fan, should disable B-frames, because B-frames increase the delay of real-time communication between the co-streaming participants.
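The latency cost of B-frames comes from decode-order versus presentation-order reordering. A back-of-the-envelope sketch, where the frame rate and B-frame run length are illustrative assumptions:

```python
# Why B-frames add latency: with B-frames the decoder receives frames in
# decode order (e.g. I0 P3 B1 B2 ...) but must emit them in presentation
# order, so frames are held back. A rough estimate of the extra delay for an
# assumed 30 fps stream with runs of 2 consecutive B-frames:

frame_interval_ms = 1000 / 30      # ~33 ms per frame at 30 fps
consecutive_b_frames = 2           # a typical "bframes=2" encoder setting

# Displaying a run of B-frames must wait for the following P frame, i.e.
# roughly (B + 1) frame intervals of extra end-to-end delay.
reorder_delay_ms = (consecutive_b_frames + 1) * frame_interval_ms
print(f"extra reorder delay: {reorder_delay_ms:.0f} ms")   # ~100 ms
```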

RTC supports external audio collection and playback

  • Design and implement an external PCM audio frame input interface.
  • Design and implement a PCM-format audio frame output interface.
  • Adapt and align the sample rate, frame duration, number of samples per encoded frame, and channel count.
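The alignment in that last bullet is mostly arithmetic. A minimal sketch; the function name and the 10 ms push interval are illustrative assumptions, not a real SDK API (AAC's 1024 samples per frame is standard):

```python
# Alignment arithmetic for the external PCM input interface.

def samples_per_push(sample_rate_hz: int, frame_ms: int) -> int:
    """Samples per channel delivered by one external PCM push."""
    return sample_rate_hz * frame_ms // 1000

# AAC consumes 1024 samples per channel per encoded frame, so a 10 ms push at
# 48 kHz (480 samples) must be re-buffered before reaching the encoder.
print(samples_per_push(48_000, 10))        # 480 samples per channel
print(1024 / 48_000 * 1000)                # one AAC frame lasts ~21.3 ms
```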

RTC supports external video capture and rendering

  • Design and implement an external YUV video frame input interface.
  • Design and implement a YUV-format video frame output interface.
  • Adapt and align the specific YUV format (I420/NV12), frame size, frame rate, and so on.
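A sketch of the format adaptation in that last bullet: both I420 and NV12 are 4:2:0 layouts of the same bytes, differing only in how chroma is arranged, so conversion is pure byte shuffling (numpy used for brevity):

```python
import numpy as np

def yuv420_frame_bytes(width: int, height: int) -> int:
    """Both I420 and NV12 are 4:2:0 formats: 1.5 bytes per pixel."""
    return width * height * 3 // 2

def i420_to_nv12(frame: np.ndarray, width: int, height: int) -> np.ndarray:
    """I420 stores planar Y, U, V; NV12 keeps Y planar but interleaves UV."""
    y_size = width * height
    c_size = y_size // 4
    y = frame[:y_size]
    u = frame[y_size:y_size + c_size]
    v = frame[y_size + c_size:]
    uv = np.empty(2 * c_size, dtype=frame.dtype)
    uv[0::2] = u      # NV12 chroma plane is UVUVUV...
    uv[1::2] = v
    return np.concatenate([y, uv])

w, h = 720, 1280
print(yuv420_frame_bytes(w, h))                     # 1382400 bytes per frame
nv12 = i420_to_nv12(np.zeros(yuv420_frame_bytes(w, h), np.uint8), w, h)
print(nv12.size)                                    # same byte count, new layout
```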

QoS policy for streaming media transmission

An RTCPeerConnection is established between the RTP push end and the MCU node, and between the RTP playback end and the L1 fusion node; on top of it, paced sending, NACK retransmission, FEC packet-loss protection, bandwidth estimation with adaptive bit rate, a video JitterBuffer, and an audio NetEQ jitter buffer are implemented.

  • Paced sending and frame-dropping policy:
  • NACK retransmission and FEC packet-loss protection: lost video packets are recovered by NACK retransmission, while audio packets are protected with RS-FEC forward error correction.
  • Bandwidth estimation and adaptive bit rate: a hybrid congestion control algorithm adjusts the bit rate based on both packet loss rate and delay; the moment playback stalls, the sender lowers the bit rate.
  • Video JitterBuffer anti-jitter mechanism:
  • Audio NetEQ anti-jitter mechanism:
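For the NACK item, a minimal receiver-side sketch of how lost sequence numbers might be tracked; it assumes nothing about the real implementation and ignores 16-bit sequence wraparound:

```python
class NackTracker:
    """Track RTP sequence-number gaps and list what needs retransmission."""

    def __init__(self):
        self.expected = None     # next sequence number we expect
        self.missing = set()     # gaps awaiting retransmission

    def on_packet(self, seq: int):
        if self.expected is None:
            self.expected = seq + 1
        elif seq >= self.expected:
            # A jump ahead means every number in between was lost.
            self.missing.update(range(self.expected, seq))
            self.expected = seq + 1
        else:
            self.missing.discard(seq)   # a retransmission arrived

    def nack_list(self):
        """Sequence numbers for the next RTCP NACK feedback message."""
        return sorted(self.missing)

t = NackTracker()
for seq in (100, 101, 104, 102):
    t.on_packet(seq)
print(t.nack_list())   # [103]
```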

TCP and UDP retransmission policy

  • When packets are lost under network jitter, TCP relies on its ACK mechanism: after sending a packet, the sender waits for the receiver's acknowledgment before sending more. If a packet is lost on the network, it is retransmitted only after a Retransmission Time Out (RTO). The RTO is derived from the Round Trip Time (RTT), with a minimum of 200 ms. To keep playback smooth, the player must therefore tolerate at least 200 ms of jitter, and such a large player-side JitterBuffer cannot meet low-latency requirements.

  • When packets are lost under network jitter, RTP relies on a NACK feedback mechanism: after sending a packet, the sender keeps sending without waiting for acknowledgments. If a packet is lost on the network, the receiver reports the sequence numbers of the lost packets in an RTCP NACK message, and the sender retransmits only those packets, greatly improving transmission efficiency and reducing delay.
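The difference in recovery time can be made concrete. A hedged sketch with a simplified RFC 6298-style RTO and a Linux-like 200 ms floor:

```python
# Back-of-the-envelope comparison of the two loss-recovery paths above.

def tcp_rto_ms(srtt_ms: float, rttvar_ms: float, floor_ms: float = 200) -> float:
    """Simplified RFC 6298 retransmission timeout with a 200 ms floor."""
    return max(srtt_ms + 4 * rttvar_ms, floor_ms)

def nack_recovery_ms(rtt_ms: float) -> float:
    """RTP: the receiver spots the gap immediately; repair costs ~1 RTT."""
    return rtt_ms

rtt = 50.0
print(tcp_rto_ms(srtt_ms=rtt, rttvar_ms=10))  # 250 -> TCP waits >= 250 ms
print(nack_recovery_ms(rtt))                  # ~50 ms to recover the packet
```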

Optimization of signaling and media connection process

  • Remove the STUN hole-punching process: in this commercial scenario, all edge nodes have public IP addresses, so STUN punching is unnecessary. The edge node's IP address and port are obtained directly from the signaling server and filled into the SDP to initiate the ICE media connection.
  • Remove the DTLS handshake: provided the signaling channel itself is secure, the DTLS encryption and decryption steps are removed, saving handshake round trips during connection setup.
  • Remove SRTP/SRTCP encryption: unlike audio/video conferencing, live media content is generally public, so SRTP/SRTCP encryption and decryption can be removed, reducing CPU consumption and saving negotiation round trips.
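With STUN removed, the client can write the edge node's address, as returned by the signaling server, straight into the remote SDP as a host candidate. A sketch; field values such as the priority are illustrative:

```python
def host_candidate(ip: str, port: int, component: int = 1,
                   foundation: str = "1", priority: int = 2130706431) -> str:
    """Build a host-type ICE candidate attribute (RFC 5245 grammar)."""
    return (f"a=candidate:{foundation} {component} udp {priority} "
            f"{ip} {port} typ host")

# The signaling server returns the edge node's public address, so this line
# goes straight into the remote SDP and ICE connects without any punching.
print(host_candidate("203.0.113.10", 8000))
```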

RTC video streams support SEI information

  • SEI information is generated at the push end and parsed and consumed at the playback end.
  • SEI JSON format:

    {
      "BizCode": "Live|Linkmic",
      "BaseWidth": 720,
      "BaseHeight": 1280,
      "VideoRect": [0, 80, 720, 640],
      "LinkmicType": "BBLink|BCLink|BBLink_PK",
      "CallerId": 26368040117,
      "CalleeId": 211997614285,
      "LinkmicStatus": 0|1|2,    // 0 - co-streaming started, 1 - co-streaming in progress, 2 - co-streaming ended
      "BCLinkLayout": 0|1|2,     // 0 - anchor large with small fan picture, 1 - anchor large with fan picture beside, 2 - equal side-by-side pictures
      "BBLinkPKSession": {
        "PKId": 123,
        "PKType": 0|1,           // 0 - PK by likes, 1 - PK by purchases
        "PKDuration": 5*60       // PK duration, in seconds
      }
    }

Playback-end modification

There are two ways to modify the player:

Option 1: integrate the RTC network downlink module into the existing player (as an FFmpeg demux plug-in).
  • Advantages: keeps the original player framework and process; reuses the existing instrumentation (tracking) points and metric monitoring, maintaining consistency.
  • Disadvantages: latency is not the lowest; poor cohesion after integration; the network buffer and audio/video frame playback are split apart and do not coordinate well; if an app contains both low-latency push streaming or co-streaming and low-latency playback, it ends up with two WebRTC components and symbol conflicts.

Option 2: use the new low-latency playback RTSSDK (based on the WebRTC downlink).
  • Advantages: lowest latency; good cohesion after integration; the JitterBuffer, decoding, and frame-fetch/playback stages coordinate well as a whole; only one underlying WebRTC component per app, covering all of low-latency push streaming, co-streaming, and playback.
  • Disadvantages: some interface adaptation is required; instrumentation points and metric monitoring need to be unified.

Implementation logic of the playback side:

  • Optimized connection setup on the playback side: the player first sends a play request to the media server, which establishes the mapping on the NAT gateway where the player resides; the media server can then send RTP data directly to the player without STUN hole punching.
  • Detecting a dropped playback connection: because UDP is connectionless, there is no TCP-style close handshake when the connection goes away. If the player stops watching, the media server would keep sending data to the IP and port the player registered, producing a lot of invalid traffic that shows up in traffic and billing logs. We use RTCP packets to signal the teardown of play and push connections, saving network resources and avoiding wasted bandwidth.
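A minimal sketch of that teardown notification: an RTCP BYE packet per RFC 3550 (packet type 203), which the player can send when leaving so the server stops pushing:

```python
import struct

def rtcp_bye(ssrc: int) -> bytes:
    """Build an RTCP BYE packet for a single SSRC (RFC 3550, PT = 203)."""
    version, padding, source_count = 2, 0, 1
    first_byte = (version << 6) | (padding << 5) | source_count
    packet_type = 203                       # BYE
    length = 1                              # packet length in words, minus one
    return struct.pack("!BBHI", first_byte, packet_type, length, ssrc)

print(rtcp_bye(0x12345678).hex())           # 81cb000112345678
```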

Push-end modification

Given the legacy systems already in place, there are likewise two ways to modify the push end:

Option 1: integrate the RTC network uplink module into the existing pusher (corresponding to the RTMP module of OBS).
  • Advantages: keeps the original OBS push streaming framework and process; maintains consistency of existing instrumentation points and monitoring metrics.
  • Disadvantages: poor cohesion after integration; the post-capture frame-dropping and resolution/bit-rate auto-adjustment logic is decoupled from congestion control and paced sending, so they cannot coordinate well; if an app contains both low-latency push streaming or co-streaming and low-latency playback, it ends up with two WebRTC components and symbol conflicts; switching between plain push streaming and co-streaming interrupts the stream.

Option 2: use the new low-latency push streaming RTSSDK (based on the full WebRTC uplink and downlink).
  • Advantages: good cohesion after integration; post-capture frame dropping, resolution and bit-rate auto-adjustment, encoding, congestion control, and paced sending coordinate as a whole; only one underlying WebRTC component per app, covering all of low-latency push streaming, co-streaming, and playback; seamless switching between push streaming and co-streaming with no interruption.
  • Disadvantages: some interface adaptation is required; instrumentation points and metric monitoring need to be unified.

Audio and video out of sync

  • Both the client and the server report the timestamps of audio frames and video frames to the backend big-data server, where they are displayed graphically so that developers and testers can discover and locate problems.
  • The client usually uses the original capture timestamps; when the timestamp difference between audio and video exceeds a threshold (say, more than 1 second), an alarm is raised.
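A sketch of that client-side alarm; the 1000 ms threshold matches the example in the text, everything else is illustrative:

```python
DRIFT_ALARM_MS = 1000   # alarm threshold from the text's example

class AvSyncMonitor:
    """Compare the capture timestamps of the latest audio and video frames."""

    def __init__(self):
        self.last_audio_ts = None
        self.last_video_ts = None

    def on_frame(self, kind: str, capture_ts_ms: int):
        if kind == "audio":
            self.last_audio_ts = capture_ts_ms
        else:
            self.last_video_ts = capture_ts_ms
        if self.last_audio_ts is not None and self.last_video_ts is not None:
            drift = abs(self.last_audio_ts - self.last_video_ts)
            if drift > DRIFT_ALARM_MS:
                print(f"A/V drift alarm: {drift} ms")  # report to backend

m = AvSyncMonitor()
m.on_frame("audio", 10_000)
m.on_frame("video", 11_500)   # triggers the alarm (1500 ms drift)
```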

Nearby access to edge nodes

  • Edge probing:
    • The signaling server periodically has node servers and clients exchange ping packets, measuring the media packet loss rate and RTT between clients and edge nodes to probe their network status, and reports the probe data to the server, providing a big-data basis for building out the SD-RTN node network.
    • At fixed intervals, the scheduling server runs big-data computation over the collected packet-loss and RTT results and assigns an optimal edge access node to each client.
  • Edge access:
    • First, the optimal nodes are selected for each user by the nearest-node, carrier-matching, and load-balancing principles. For the proximity and carrier-matching rules, the connection data periodically reported by real clients (packet loss, delay, and stall data) triggers priority adjustments.
    • After receiving the allocation list, the client quickly probes the speed of each node and picks the best one for access; the remaining nodes serve as fast fallbacks during failover.
    • This strategy combines theoretical allocation, big-data analysis, and live speed probing, and ensures that users quickly recover their connections when a server becomes unavailable.
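A hedged sketch of combining the probe data into a node choice; the scoring weights and node names are invented for illustration:

```python
def node_score(rtt_ms: float, loss_pct: float,
               same_isp: bool, load_pct: float) -> float:
    """Lower is better; all weights below are illustrative assumptions."""
    return (rtt_ms
            + 20.0 * loss_pct              # loss hurts more than raw RTT
            + (0.0 if same_isp else 50.0)  # carrier-matching principle
            + 2.0 * max(load_pct - 80, 0)) # penalize nearly-full nodes

candidates = [
    {"node": "bj-unicom-01",  "rtt_ms": 18, "loss_pct": 0.5, "same_isp": True,  "load_pct": 70},
    {"node": "sh-telecom-02", "rtt_ms": 12, "loss_pct": 2.0, "same_isp": False, "load_pct": 40},
]
best = min(candidates, key=lambda c: node_score(
    c["rtt_ms"], c["loss_pct"], c["same_isp"], c["load_pct"]))
print(best["node"])   # bj-unicom-01: slightly higher RTT, but same ISP and low loss
```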

Delivering push/pull streaming configuration parameters

  • Two types of policies can be delivered:

    1. Client pull: the client periodically fetches configuration from the server and updates its local cache (see the polling sketch after this list).
    2. Server push: the server notifies the client to update its configuration through control signaling.

  • On top of the existing live streaming service, a new RTS playback domain name is added; for a single push stream, RTMP and RTP pull streams coexist.

  • Through whitelist configuration, both the push end and the playback end can release versions in grayscale via the business server.
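A sketch of the client-pull strategy from the list above; the endpoint URL and field names are hypothetical:

```python
import json
import random
import time
import urllib.request

CONFIG_URL = "https://example.com/rts/config"   # hypothetical endpoint
_cached = {"rts_enabled": False}                # last known-good config

def refresh_config():
    global _cached
    try:
        with urllib.request.urlopen(CONFIG_URL, timeout=3) as resp:
            _cached = json.load(resp)
    except OSError:
        pass   # keep the cached copy, so the Fallback switch stays valid
    return _cached

def poll_loop(interval_s: int = 300):
    while True:
        cfg = refresh_config()
        # AB test / grayscale: the server decides per device whether RTS is on.
        print("pull via", "RTP" if cfg.get("rts_enabled") else "RTMP/FLV")
        time.sleep(interval_s + random.uniform(0, 30))  # jitter the polls
```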

Quality monitoring and evaluation metrics

  • End-to-end latency: the total time for one frame of audio or video, from capture at the push end, through intermediate processing and transmission, to rendering at the playback end. It governs the real-time interaction between viewers and anchors.
  • Stall duration per 100 seconds: if the interval between two frames exceeds XXX milliseconds, a stall period is counted; the stall time of each playback session is normalized to stall seconds per 100 seconds of viewing. For example, a 10-second session with 1 second of stalling scores 10 s. Because stalls are low-probability events, for a single session this metric is mostly a 0/1-style value.
  • Stall count per 100 seconds: if the interval between two frames exceeds XXX milliseconds, one stall is counted; stall counts are normalized to stalls per 100 seconds of viewing. For example, a 10-second session with one stall scores 10. Again, mostly a 0/1-style value for a single session.
  • First-frame duration: the time from DNS resolution, connection setup, packet reception, and decoding through rendering of the first frame. The related first-screen duration runs from the user's tap or swipe to the first rendered frame: with preloading in a swipe feed the first-screen duration is shorter than the first-frame duration, while for a direct search-and-click it is longer.
  • Instant-start (second-open) rate: the number of plays whose first frame renders within 1 second, divided by all successful plays. For a single user who plays only once the ratio can only be 0 or 1, so the metric is meaningful only in aggregate.
  • Pull success rate: the number of plays whose first frame renders successfully, divided by all play attempts. Likewise meaningful only in aggregate.
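A worked example of the two per-100-second stall metrics defined above; the stall threshold ("XXX milliseconds") is assumed to be 500 ms here:

```python
STALL_THRESHOLD_MS = 500   # the "XXX milliseconds" in the definitions; assumed

def stall_metrics(frame_gaps_ms: list[int], watch_s: float):
    """Normalize stall count and stall duration to a 100-second window."""
    stalls = [g for g in frame_gaps_ms if g > STALL_THRESHOLD_MS]
    per100_count = len(stalls) / watch_s * 100
    per100_duration_s = sum(stalls) / 1000 / watch_s * 100
    return per100_count, per100_duration_s

# A 10-second view with one 1-second inter-frame gap, as in the examples:
count, duration = stall_metrics([33, 33, 1000, 33], watch_s=10)
print(count, duration)   # 10.0 stalls / 100 s, 10.0 s stalled / 100 s
```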
  • Comparing data across different pull-stream links under the same network conditions and on the same devices:
    • RTS pull streaming reduces end-to-end latency by 75%+.
    • RTS pull streaming reduces the playback error rate by X%.
    • RTS pull streaming raises the instant-start rate by 0.X%.
    • The RTS pull-stream stall rate is basically flat.

Deployment Steps

Modify the push end first, or the playback end first?

1. Modify the push end first: the number of concurrent anchors is generally only in the tens of thousands, while players number in the millions, so far fewer RTS push node servers than pull node servers need modification and deployment. Moreover, uplink bandwidth is usually smaller than downlink bandwidth, and if the uplink stalls, every viewer of that stream stalls with it.

2. Modify the playback end first: this reduces latency most noticeably, but many pull edge nodes must be modified and deployed, so the cost is higher, and it cannot solve the problem of an impaired uplink stalling all viewers of a broadcast.

RTS live streaming and RTMP live streaming in parallel

Traditional RTMP live streaming is low-cost, already deployed at scale, and runs stably; it can be retained for live scenarios with loose latency requirements.

RTS live streaming mainly serves real-time interactive scenarios with strict latency requirements: roll it out in small-scale grayscale, iterate step by step, and advance steadily in parallel.

Network impairment instrument and audio quality instrument

An Ixia network emulator is used in the lab to simulate weak-network scenarios on the uplink and downlink: random packet loss, burst packet loss, bandwidth caps, added delay, and delay jitter.

MultiDSLA is used to compare the objective voice quality of the two solutions via MOS scores such as PESQ/POLQA under various packet-loss and delay-jitter conditions. (Speech quality test diagram)

Test results of Douyin (TikTok) live streaming integrated with RTS

1. Comparison of live broadcast cost:

  • A low-latency live system built directly on an SD-RTN with a bidirectional mesh topology is technically immature, has few large-scale global deployments, and is expensive: 2-3 times the cost of traditional CDN live streaming.
  • The one-way, broadcast, cascading RTS low-latency system built by transforming the CDN is mature, large-scale, and globally deployed. Its cost is higher than traditional CDN live streaming but much lower than an SD-RTN real-time audio/video network.

2. Comparison of end-to-end delay of live broadcast:

  • Douyin (TikTok) live streaming test: on the left, the push source (reference clock); in the middle, RTS pull streaming (delay about 800 ms); on the right, FLV pull streaming (delay about 5000 ms). On key metrics such as end-to-end latency and first-frame duration, RTS pull streaming shows a clear advantage over FLV pull streaming.

3. Comparison of core metrics and user experience of live streaming:

  • Verification shows the core metrics of low-latency live streaming perform excellently: (1) at the same stall rate, RTS low-latency live streaming cuts the delay from 6 seconds (traditional RTMP) to 1 second, a reduction of more than 80%; (2) under the same network delay and packet loss rate, metrics such as stall duration and stall count per 100 seconds, pull success rate, instant-start rate, and first-frame duration improve by 2-20%, greatly enhancing the real-time interactive experience of live streaming.