From July 29 to July 30, the CIC 2021 Cloud Computing Summit hosted by Qingyun Technology was held in Beijing. Shen Weifeng, a server-side expert at Paileyun, was invited to attend the summit and gave a talk titled “Practice and Evolution of Large-scale Real-time Audio and Video Technology Architecture” at the Audio and Video Technology Forum. The talk covered several common architectures and network topologies for real-time audio and video communication, the complexity and diversity of the real-world scenarios in which such systems are built, and some of Paileyun’s practices in super-large-scale real-time audio and video systems.

As the global epidemic keeps resurging, online interaction remains the norm for work, life and entertainment in the post-epidemic era, and demand for real-time audio and video keeps growing. After continuous evolution, Paileyun can support ten thousand people online in a single room, with video for one thousand people and audio for ten thousand people, and achieves 99.95% availability while serving users worldwide. This talk walks step by step through the technical details behind real-time audio and video communication systems.

Common architectures for real-time audio and video

Direct connection

Direct connection is also known as peer-to-peer (P2P) networking. In this structure, each client registers with a registration server at startup so that other clients can find it. Under normal circumstances, establishing audio and video communication requires no media server: the clients connect to each other and exchange audio and video directly. However, when one or more clients sit behind a NAT (or even behind multiple NATs), direct connection becomes difficult and sometimes impossible. In that case STUN is used to “punch a hole” through the NAT. If hole punching fails, the traffic has to be relayed through a server.
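The fallback order described above (direct, then STUN hole punching, then server relay) can be sketched as follows. This is only an illustration: `establish_path` and its two callback parameters are hypothetical names, and real connectivity checks are far more involved.

```python
from typing import Callable


def establish_path(try_direct: Callable[[], bool],
                   try_stun_punch: Callable[[], bool]) -> str:
    """Return which path two clients end up using: 'direct', 'stun', or 'relay'."""
    # 1. Cheapest option first: a plain peer-to-peer connection.
    if try_direct():
        return "direct"
    # 2. At least one side is behind a NAT: try STUN-assisted hole punching.
    if try_stun_punch():
        return "stun"
    # 3. Hole punching failed: fall back to relaying through a server.
    return "relay"
```

The callbacks stand in for the actual connection attempts, so the sketch only captures the ordering of the attempts, not how each one is carried out.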

MCU

The Multipoint Control Unit (MCU) scheme appeared relatively early, and the corresponding technology is very mature. It is a star structure made up of one server and multiple clients, and each client sends its audio and video data to the server. The server decodes, synchronizes, resamples, lays out, mixes and re-encodes the audio and video data from all clients, and finally pushes the mixed media to every client. In effect the server is an audio and video mixer, which places a heavy load on it. Before mixing audio, the server generally removes the target user’s own audio so that the client does not hear itself as an echo. Before mixing video, the server checks whether the target user has a customized layout; if not, it mixes according to the system’s default layout. In some complex network environments, the MCU can also adapt the video encoding bitrate to the target user’s bandwidth.
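The step of removing each listener’s own audio before mixing can be illustrated with a minimal sketch. It assumes every participant contributes one already decoded, time-aligned frame of 16-bit PCM samples of equal length; the function name `mix_minus` is hypothetical, and a real mixer would also handle resampling and gain control.

```python
from typing import Dict, List


def mix_minus(frames: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """For each listener, sum every participant's samples except the listener's own."""
    if not frames:
        return {}
    length = len(next(iter(frames.values())))
    # Sum all participants once, then subtract each listener's own contribution.
    total = [sum(frame[i] for frame in frames.values()) for i in range(length)]
    mixed = {}
    for user, own in frames.items():
        # Clip to the signed 16-bit PCM range to avoid overflow after summing.
        mixed[user] = [max(-32768, min(32767, t - o)) for t, o in zip(total, own)]
    return mixed
```

For example, with three participants `a`, `b` and `c`, the frame sent back to `a` contains only the audio of `b` and `c`.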

SFU

The Selective Forwarding Unit (SFU) is a newer architecture that has become popular in recent years. As with an MCU, each client sends audio and video data to the server, which then forwards it to the other clients. Unlike an MCU, an SFU does not mix audio or video. After receiving data from a client, the SFU forwards it to each target client on demand, that is, depending on whether the target client has subscribed to it. It is essentially an audio and video router and relay. In this scheme all mixing is done on the client side, so the computation required on the server side is greatly reduced. In some complex network environments, the sending side uses Simulcast or SVC to send multiple video layers at different resolutions, and the server forwards the most appropriate layer to each target client according to that client’s bandwidth and network conditions, so that every client gets the best experience it can.
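A minimal sketch of how a server might pick a Simulcast layer for one subscriber is shown below. The layer names, bitrates and the 0.8 headroom factor are illustrative assumptions, not values from the talk, and `pick_layer` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

# Hypothetical simulcast layers as (name, bitrate in kbps), lowest first.
LAYERS: List[Tuple[str, int]] = [("180p", 150), ("360p", 500), ("720p", 1500)]


def pick_layer(estimated_kbps: float,
               layers: List[Tuple[str, int]] = LAYERS,
               headroom: float = 0.8) -> Optional[str]:
    """Return the highest layer that fits in headroom * estimated bandwidth.

    Returns None when even the lowest layer does not fit.
    """
    budget = estimated_kbps * headroom
    best = None
    for name, kbps in layers:  # layers are sorted from lowest to highest
        if kbps <= budget:
            best = name
    return best
```

Leaving some headroom below the estimated bandwidth is a common way to keep audio and retransmissions from being starved; the exact factor is a tuning choice.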

Comparison and summary

From the comparison above, we can see that the direct connection scheme is basically unsuitable for conference scenarios, and its content cannot be audited on the network side. On the market today, direct connection is basically found only in free offerings. With the significant drop in computing and bandwidth costs and the demand for high concurrency, the advantages of the SFU scheme have become very clear, while the MCU scheme remains relatively common in traditional applications such as enterprise communication built on dedicated audio and video terminals.
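One way to make the comparison concrete is to count the media streams each client has to send and receive. The sketch below assumes every participant publishes exactly one stream and watches all the others, with no Simulcast; `streams_per_client` is a hypothetical helper.

```python
from typing import Tuple


def streams_per_client(arch: str, n: int) -> Tuple[int, int]:
    """Return (uplink, downlink) stream counts for one client in an n-party call."""
    if n < 2:
        return (0, 0)
    if arch == "p2p":   # sends to and receives from every other peer
        return (n - 1, n - 1)
    if arch == "mcu":   # one stream up, one mixed stream down
        return (1, 1)
    if arch == "sfu":   # one stream up, one forwarded stream per other peer
        return (1, n - 1)
    raise ValueError(f"unknown architecture: {arch}")
```

Under these assumptions, a 10-party call needs 9 uplink streams per client with direct connection but only 1 with an MCU or SFU, which is one reason direct connection scales poorly for conferences.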

The architectures above are only the most basic audio and video service architectures. On their own they cannot provide high availability or high concurrency: if the server goes down, the service becomes unavailable.

Building the network topology for real-time communication

Ring network structure

A ring structure consists of several nodes connected end to end by point-to-point links to form a closed loop. The nodes share this loop as a common transmission link, and data travels around it in one direction, passing from one node to the next.

This network is simple to implement, build and route, and has no central dependency. However, it is difficult to add or remove nodes, and if any node fails, transmission around the ring is interrupted and the whole network is paralyzed. The ring structure was originally used mainly for token ring networks and is rarely adopted today.
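The effect of a node failure on one-way transmission can be seen in a small simulation. This is a toy model with hypothetical names (`ring_deliver`, node labels), not a description of any real ring protocol.

```python
from typing import List, Optional, Set


def ring_deliver(nodes: List[str], failed: Set[str],
                 src: str, dst: str) -> Optional[List[str]]:
    """Walk a one-way ring from src to dst; return the hop list, or None if blocked."""
    if src not in nodes or dst not in nodes:
        raise ValueError("src and dst must both be on the ring")
    if src in failed or dst in failed:
        return None
    i = nodes.index(src)
    path = [src]
    while path[-1] != dst:
        i = (i + 1) % len(nodes)   # data only moves in one direction
        if nodes[i] in failed:     # a failed node cannot pass the data on
            return None
        path.append(nodes[i])
    return path
```

In this model, every transfer whose path crosses the failed node is cut off, and there is no alternative route to fall back on.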

Star/tree network structure

A star topology has one central node with multiple nodes attached to it. Its advantages are a simple structure, easy connection, relatively easy management and maintenance, good scalability and low network delay. The central node is the bottleneck: once it fails, the entire network is paralyzed. The leaf nodes are independent and do not affect each other.

A tree topology is shaped like an upside-down tree, with the root at the top and branches that can each branch again. The root receives data from each site and then broadcasts it to the whole network. A tree is easy to scale and easy to diagnose, but the root node is the bottleneck. In an audio and video service architecture, when the number of concurrent users has to grow substantially, multiple edge computing control nodes are generally deployed and connected to the central data center (DC) in a tree. Users connected this way see a slight increase in latency.

Mesh network structure

In a mesh topology, the devices in the network are connected to one another by point-to-point links, and it is the most widely used topology. It has no central node and offers high reliability, strong fault tolerance and low delay. However, its structure is complex: because there are multiple transmission paths, route selection and flow control are complicated, and the ordering of messages cannot be guaranteed.

In an audio and video service architecture, our design goals are very low latency, efficient transmission and high throughput, while there is no requirement for ordering or consistency among media data from different users. The mesh structure is therefore generally the preferred topology among the service control nodes of the main DCs.
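Route selection over multiple paths in a mesh can be illustrated with a standard shortest-path search (Dijkstra’s algorithm) that minimizes total latency. The node names and latencies in the usage note are invented for illustration, and this is not presented as Paileyun’s routing method.

```python
import heapq
from typing import Dict, List, Tuple


def lowest_latency_path(links: Dict[str, Dict[str, float]],
                        src: str, dst: str) -> Tuple[float, List[str]]:
    """links maps node -> {neighbor: latency_ms}. Returns (total_ms, path)."""
    queue = [(0.0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, ms in links.get(node, {}).items():
            if nxt not in visited:
                heapq.heappush(queue, (cost + ms, nxt, path + [nxt]))
    return float("inf"), []  # dst is unreachable
```

With hypothetical links `dc1-dc2` (30 ms), `dc1-dc3` (50 ms), `dc2-dc3` (25 ms), `dc2-dc4` (80 ms) and `dc3-dc4` (45 ms), the search routes `dc1` to `dc4` through `dc3`; if the `dc3-dc4` link disappears, it falls back to the path through `dc2`, which is the fault tolerance a mesh provides.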

Diversity of actual scenarios

Diversity of network access

Mobile network: 3G/4G/5G access bandwidth varies (3G: < 2 Mbps; 4G: 10 to 100 Mbps; 5G: ~10 Gbps), and the signal strength and serving base station change as the user moves.

Wired broadband: with LAN access, egress bandwidth is shared and users easily compete for it; with ADSL access, upstream and downstream bandwidth are asymmetric and upstream bandwidth is low; with PON and FTTH, fiber reaches the user directly and bandwidth is stable.

Wireless access: mainstream routers deliver < 150 Mbps (maximum 300 Mbps in the 2.4 GHz band, maximum 867 Mbps in the 5 GHz band).

Diversity of devices along the transmission path

Routers and switches along the transmission path differ in throughput, and when they hit a performance bottleneck they also differ in their packet-dropping and resource reservation policies.

Diversity of terminal devices

The devices we have to support include desktop devices, mobile devices, wearables and Internet of Things devices, and the quality and performance of their key modules vary widely. These key modules include the network module, the media capture modules (camera and microphone), the computing module (CPU) and the rendering module (GPU).

Diversity of server access

A Border Gateway Protocol (BGP) equipment room provides multi-line access through a single IP, with intelligent route selection, line backup and automatic switchover to an available line after a fault. With multi-carrier dedicated-line access, route selection and switchover on failure have to be handled at the application layer. Single-carrier dedicated-line access cannot solve the problem of interconnection between different carriers.

This diversity and complexity of real-world scenarios leads to:

  • Dynamic network changes: bandwidth, packet loss, jitter, delay, etc.

  • Dynamic changes in captured data: noise (including image noise), distortion, output data jitter, etc.
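Of the network metrics listed above, jitter is the least intuitive to measure. A common estimate is the interarrival jitter defined for RTP in RFC 3550, which smooths the change in transit time between consecutive packets with a gain of 1/16. The sketch below assumes send and receive timestamps are in the same unit (for example milliseconds); `interarrival_jitter` is a hypothetical helper name.

```python
from typing import Sequence


def interarrival_jitter(send_times: Sequence[float],
                        recv_times: Sequence[float]) -> float:
    """Running jitter estimate in the style of RFC 3550: J += (|D| - J) / 16."""
    jitter = 0.0
    for i in range(1, min(len(send_times), len(recv_times))):
        # D is the difference in transit time between packet i and packet i-1.
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        jitter += (abs(d) - jitter) / 16.0
    return jitter
```

A constant network delay yields a jitter of zero no matter how large the delay is; only variation in delay raises the estimate.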

Architecture evolution and cloud practice

High concurrency and availability of services

To achieve high concurrency and high availability of services, the following technologies are mainly involved:

  • High-concurrency service cluster

As mentioned for the basic audio and video service architectures, the concurrency of a single service instance is limited by the server’s computing resources. When very high concurrency is needed, we have to add server-side computing resources to form a service cluster, and put a load balancer in front of it to distribute client requests evenly across the computing resources in the cluster.

  • Automatic recovery and degradation on service failure

Any code can contain logic flaws, even code written by the best engineers in the world, so we need a mechanism that automatically catches errors when a failure occurs and restores service availability. When a physical bottleneck is hit, such as service computing resources or network bandwidth, continuing to serve in the normal way may trigger an avalanche effect that makes the entire service unavailable. In that case we degrade the service to keep the main functions available. For example, turning off video does not seriously affect communication, but turning off audio may make communication impossible, so we keep the audio and turn off the video to keep the overall service available.

  • Elastic scaling of service resources

With the support of virtualization, SDN and other technologies, computing and network resources can be allocated dynamically. When service resources become the bottleneck, it is therefore technically feasible to scale them dynamically.

  • Remote disaster recovery (DR) backup

In some extreme cases, such as a fire in the equipment room where a service cluster is deployed, how do we keep our services available? To handle this, we deploy multiple service clusters in different geographic locations. Under normal circumstances the clusters in all locations serve traffic at the same time. When one cluster becomes unavailable, service monitoring detects the event and promptly adjusts routing so that new service requests are scheduled to the other available clusters.
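The rerouting of new requests away from an unavailable cluster can be sketched as below. The `Cluster` fields and the choice of round-trip time as the tie-breaker are assumptions made for illustration; the talk does not specify the scheduling criteria.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    name: str
    healthy: bool   # as reported by service monitoring
    rtt_ms: float   # round-trip time measured from the requesting client


def schedule(clusters: List[Cluster]) -> Optional[str]:
    """Route a new request to the lowest-RTT healthy cluster; None if all are down."""
    candidates = [c for c in clusters if c.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c.rtt_ms).name
```

When monitoring marks the nearest cluster as unhealthy, the next call to `schedule` simply returns the next-best cluster, so existing clusters absorb the new requests without any client-side change.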

With the support of the above major technologies, Paileyun has been able to achieve 99.95% high availability and serve global users.
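The degradation idea described earlier (protect audio, sacrifice video first) can be written as a small policy function. The load thresholds and the intermediate low-resolution tier are assumptions added for illustration; only the rule that audio is kept while video is dropped comes from the talk.

```python
from typing import Dict, Union


def degradation_policy(load: float,
                       warn: float = 0.7,
                       critical: float = 0.9) -> Dict[str, Union[bool, str]]:
    """Map resource load (0.0 to 1.0) to the media that stays enabled."""
    if load >= critical:
        # Hard bottleneck: drop video entirely, keep audio so people can still talk.
        return {"audio": True, "video": "off"}
    if load >= warn:
        # Assumed intermediate tier: keep video but at a lower quality.
        return {"audio": True, "video": "low"}
    return {"audio": True, "video": "full"}
```

Audio is never disabled by this policy, which reflects the observation that a call without video is still usable while a call without audio is not.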

High quality of service

Facing a complex and changeable network environment, ensuring high-quality audio and video is a very important task for an audio and video service system. Paileyun mainly relies on the following technologies: bandwidth estimation, congestion control and smooth transmission, packet loss retransmission and forward error correction (FEC), error concealment and recovery, multi-layer distribution based on temporal and spatial layers (SVC and Simulcast AVC), image and voice denoising and enhancement, echo cancellation, adaptive volume adjustment, and network resource reservation. With these technologies, Paileyun can still provide very high-quality audio and video services even at a packet loss rate of 70%.
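To show the principle behind forward error correction, here is the simplest possible scheme: one XOR parity packet per group of equal-length packets, which lets the receiver rebuild any single lost packet without a retransmission. This is a textbook illustration with hypothetical function names, not the FEC scheme Paileyun uses, and production systems use stronger codes to survive heavier loss.

```python
from functools import reduce
from typing import List, Optional


def xor_parity(packets: List[bytes]) -> bytes:
    """Build one parity packet as the byte-wise XOR of equal-length packets."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*packets))


def recover_missing(received: List[Optional[bytes]], parity: bytes) -> bytes:
    """Rebuild the single lost packet (marked None) from the rest plus parity."""
    present = [p for p in received if p is not None]
    if len(present) != len(received) - 1:
        raise ValueError("XOR parity can repair exactly one lost packet per group")
    # XOR of all surviving packets and the parity equals the missing packet.
    return xor_parity(present + [parity])
```

The cost is one extra packet per group in exchange for avoiding a round trip, which is why FEC is attractive when latency matters more than bandwidth.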

Super scale, super concurrency

Selective data forwarding in SFU architecture

As is well known, a plain SFU forwards all audio and video data: in a meeting of 10 people, the server forwards each person’s data to the other 9 users (10 × 9 = 90 streams). With few users this is not a serious problem, but it gets worse as the number of users grows. Suppose there are 100 people in the meeting and each person’s video is 1 Mbps. The server then has to forward 100 × 99 × 1 Mbps = 9.9 Gbps, which is almost impossible in reality. In practice, screen size limits what can be shown, so nobody can watch the videos of the other 99 users at the same time. The most common layouts are 1 large + 6 small, or grids such as 2 × 2, 3 × 3, 4 × 4 and 5 × 5. Forwarding on demand therefore greatly reduces the amount of data.
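The arithmetic above can be checked with a short calculation. The function names are hypothetical, and the on-demand figure assumes every visible stream is forwarded at the full per-stream bitrate, which is an upper bound since small tiles could use lower-bitrate layers.

```python
def full_forwarding_mbps(users: int, per_stream_mbps: float = 1.0) -> float:
    """Server egress when every user receives every other user's stream."""
    return users * (users - 1) * per_stream_mbps


def on_demand_forwarding_mbps(users: int, visible: int,
                              per_stream_mbps: float = 1.0) -> float:
    """Server egress when each user receives only the streams shown on screen."""
    return users * min(visible, users - 1) * per_stream_mbps
```

For 100 users at 1 Mbps each, full forwarding needs 9900 Mbps (9.9 Gbps), while a 1 large + 6 small layout (7 visible streams) needs at most 700 Mbps under these assumptions.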

Edge computing and acceleration nodes

As mentioned in the network topology section above, the scale of a conference can be greatly expanded by deploying multiple edge computing nodes linked to the main DC in a tree structure. Users can also connect to the nearest edge computing node, which solves the last-mile access problem.

For one-way live streaming scenarios, we can also extend the scale of a meeting through a third-party CDN. However, this solution adds significant delay, typically 3 to 10 seconds, which makes interactive communication essentially impossible, so it is suitable only for one-way broadcast. When a viewer needs to interact, access must be switched to an edge computing node or the central DC.
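The access choice described in this section can be summarized in a few lines. The function name, the return labels and the use of measured round-trip time to pick the nearest edge node are illustrative assumptions.

```python
from typing import Dict


def choose_access(interactive: bool, edge_rtts_ms: Dict[str, float]) -> str:
    """Pick where a user connects: CDN for one-way viewers, low-latency path otherwise."""
    if not interactive:
        # One-way viewers can tolerate the multi-second delay of a CDN.
        return "cdn"
    if not edge_rtts_ms:
        # No edge node is reachable: connect to the central DC directly.
        return "central-dc"
    # Interactive users connect to the nearest (lowest-RTT) edge node.
    return min(edge_rtts_ms, key=edge_rtts_ms.get)
```

A viewer who starts interacting would be re-evaluated with `interactive=True` and moved from the CDN to an edge node or the central DC.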

Technical architecture of the Paileyun audio and video system

On the left side of the architecture diagram are service registration, authentication, configuration, discovery and scheduling. On the right side are mainly the big data analysis platform, service health monitoring and alerting, and elastic scaling of service resources. In the middle are Paileyun’s products and services: voice calls, video calls, interactive whiteboard, interactive live streaming, cloud recording and so on.

Industry trends and latest technology

Audio and video communication has developed very quickly in recent years, and a variety of cutting-edge technologies have emerged. Some have already been put into practice and some are still being researched in depth, and many have very good application prospects. Here are a few examples:

WebRTC

In May 2010, Google acquired Global IP Solutions’ GIPS engine, then open-sourced it and renamed it WebRTC. In July 2014, WebRTC became a W3C standard and the browser standard API 1.0 was released. Since then, the barrier to entry for real-time audio and video communication services has dropped sharply, and many WebRTC-based real-time audio and video services have sprung up. It is fair to say that the emergence of WebRTC changed the market landscape of audio and video communication.

WebRTC is still evolving, and those interested can get the latest information from the links below:

[1] webrtc.org/

[2] groups.google.com/forum/#! The for…

[3] twitter.com/webrtc

[4] webrtcweekly.com/

SDN

Software Defined Networking (SDN) is a new network architecture originally proposed by the Clean Slate research group at Stanford University. It allows the network to be defined and controlled through software programming; its control plane and forwarding plane are separated, open and programmable. It is regarded as a revolution in networking, providing a new experimental approach for research on new Internet architectures and greatly promoting the development of the next-generation Internet.

Its core concepts are the separation of control from forwarding and the separation of management from control. Its programmability and virtualization help define networks quickly and automate deployment, operation and maintenance. Centralized control and management make network path optimization easier to achieve.
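The separation of control and forwarding can be illustrated with a toy model in which switches only look up a flow table and a central controller decides what goes into it. The class and method names are hypothetical and do not correspond to any real SDN controller API.

```python
from typing import Dict, List


class Switch:
    """Forwarding plane: only looks up the flow table installed by the controller."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.flow_table: Dict[str, str] = {}

    def forward(self, dst: str) -> str:
        return self.flow_table.get(dst, "drop")


class Controller:
    """Control plane: holds the global view and programs every switch."""

    def __init__(self, switches: List[Switch]) -> None:
        self.switches = {s.name: s for s in switches}

    def install_route(self, dst: str, hops: List[str]) -> None:
        """hops is the ordered path; each switch learns the next hop toward dst."""
        for here, nxt in zip(hops, hops[1:]):
            if here in self.switches:
                self.switches[here].flow_table[dst] = nxt
```

Because path decisions live in one place, optimizing a route means calling `install_route` again with a better path; the switches themselves need no new logic.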

New algorithms based on machine learning

Machine learning has been widely applied in many fields, and many application directions have emerged in the field of real-time audio and video communication, such as:

  • Network transmission: intelligent congestion control, intelligent bandwidth estimation, intelligent routing, etc.

  • Video and images: virtual background, super-resolution, video fusion, Deepfake, etc.

  • Speech: speech recognition, speech enhancement, etc.

Virtual reality, augmented reality and 3D

Virtual reality (VR) combines the virtual with the real. It is a computer simulation system that lets users create and experience a virtual world: a computer generates a simulated environment and immerses the user in it. Augmented reality (AR) is a technology that combines virtual information with the real world, so that real environments and virtual objects overlap and coexist in the same picture and space. By combining VR, AR and 3D technology, we believe that in the near future real-time audio and video communication will be able to achieve an effect similar to meeting in a real-world meeting room.

This article is published by OpenWrite!