A note before we begin:

MCU: Multipoint Control Unit

SFU: Selective Forwarding Unit

1. MCU solution

The MCU approach is the traditional solution for multi-party video conferencing and has been used successfully for many years, largely because it demands little of the client. The architecture is based on a central point that maintains a one-to-one stream with each participant. This central element receives each incoming audio and video stream, mixes them, and synthesizes a single outgoing stream for each participant. The common industry term for this centralized component is the multipoint control unit (MCU). In practice, an MCU usually means a mixer.
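To make the mixing step concrete, here is a minimal sketch of the per-receiver mixing an MCU performs for audio. Everything in it is hypothetical (plain PCM samples, no real codec or RTP handling); a real MCU decodes each incoming stream before this step and re-encodes the result afterward:

```typescript
// Minimal MCU-style audio mixing sketch. Hypothetical: real MCUs first decode
// each incoming stream to PCM, run this kind of mix, then re-encode once per
// receiver. Here every frame is already decoded 16-bit PCM of equal length.

type ParticipantId = string;

function mixForReceiver(
  frames: Map<ParticipantId, Int16Array>, // one decoded frame per participant
  receiver: ParticipantId
): Int16Array {
  const frameLen = frames.values().next().value?.length ?? 0;
  const out = new Int16Array(frameLen);
  for (const [sender, pcm] of frames) {
    if (sender === receiver) continue; // "mix-minus": drop the receiver's own audio
    for (let i = 0; i < frameLen; i++) {
      // Sum and clamp to the 16-bit range to avoid wrap-around.
      out[i] = Math.max(-32768, Math.min(32767, out[i] + pcm[i]));
    }
  }
  return out; // a single frame, no matter how many senders were mixed
}
```

Each receiver gets back exactly one stream regardless of how many people are in the call, which is where the downlink-bandwidth saving described below comes from.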

The figure above is a diagram of the composite broadcast stream, illustrating the forwarding process. VMP is the picture synthesizer; MT1, MT2, and MT3 are the terminals participating in picture synthesis. MT1 and the VMP are mounted under forwarding board 1, while MT2 and MT3 are mounted under forwarding board 2.

The solid black lines show the first step: each terminal sends its stream to the forwarding board it is mounted under. The dotted black lines show the second step: each forwarding board forwards the received streams to the picture synthesizer. Note the dotted lines from MT2 and MT3: these streams are not broadcast but sent point-to-point to the picture synthesizer, so, as described above, they travel from the sender's forwarding board directly to the receiver. The solid red lines show the third step: once the composite stream leaves the picture synthesizer, it is broadcast to every receiving terminal in the conference. On forwarding board 2, the stream is duplicated and sent to MT2 and MT3 respectively. The dotted red line is the bridge exchange between the two forwarding boards mentioned above.
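For clarity, the same three-step flow can be modeled as a tiny routing sketch. The node names (VMP, MT1–MT3, the two boards) follow the figure; everything else is a hypothetical illustration:

```typescript
// Hypothetical model of the figure's topology: terminals and the VMP are each
// "mounted" under a forwarding board; streams cross the bridge only when the
// sender's and receiver's boards differ.

type Node = "VMP" | "MT1" | "MT2" | "MT3";

const board: Record<Node, 1 | 2> = { VMP: 1, MT1: 1, MT2: 2, MT3: 2 };

// Steps 1 and 2: a terminal's stream goes to its own board, then point-to-point
// to the VMP (crossing the board-1/board-2 bridge if needed).
function pathToSynthesizer(sender: Node): string[] {
  const hops: string[] = [sender, `board${board[sender]}`];
  if (board[sender] !== board.VMP) hops.push(`board${board.VMP}`); // bridge hop
  hops.push("VMP");
  return hops;
}

// Step 3: the composite stream is broadcast from the VMP; a board duplicates
// it once per terminal mounted under it.
function broadcastFanout(terminals: Node[]): Map<1 | 2, Node[]> {
  const fanout = new Map<1 | 2, Node[]>();
  for (const t of terminals) {
    fanout.set(board[t], [...(fanout.get(board[t]) ?? []), t]);
  }
  return fanout; // e.g. board 2 -> [MT2, MT3]: one stream in, two copies out
}

console.log(pathToSynthesizer("MT2")); // ["MT2", "board2", "board1", "VMP"]
console.log(broadcastFanout(["MT1", "MT2", "MT3"]));
```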

Advantages and disadvantages:

Pros: The MCU is a good solution for traditional equipment operators. The MCU transcodes and mixes the received multi-channel streams and outputs a single stream to each terminal, which saves end users' downlink bandwidth and allows output video streams at different bit rates to be produced for users under different network conditions, giving a better experience in multi-party scenarios. The typical application is multi-party audio and video calling. MCUs also allow full bit-rate adaptation: because the mixer can produce different output streams, each receiver can get a different quality. Another advantage of the mixer approach is that it can take advantage of hardware codecs.
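As a sketch of the bit-rate adaptation point, a mixer can pick a rendition per receiver from a ladder of output qualities. The ladder values and the headroom factor below are illustrative, not taken from any particular MCU:

```typescript
// Hypothetical bit-rate ladder for the mixer's per-receiver outputs.
const LADDER_KBPS = [1200, 600, 300] as const;

function pickOutputBitrate(estimatedDownlinkKbps: number): number {
  // Leave ~20% headroom for audio, retransmissions, and signaling.
  const budget = estimatedDownlinkKbps * 0.8;
  for (const rung of LADDER_KBPS) {
    if (budget >= rung) return rung;
  }
  return LADDER_KBPS[LADDER_KBPS.length - 1]; // floor: lowest rung
}

// A receiver on a 1 Mbps link gets the 600 kbps rendition:
console.log(pickOutputBitrate(1000)); // 600
```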

Cons: The main drawback is the high infrastructure cost of an MCU. In addition, since mixing requires decoding and re-encoding, it introduces extra latency and some quality loss. Finally, transcoding and composition can, in theory, leave the application with a less flexible user interface (although there are workarounds for this).

2. SFU solution

The architecture is based on a central point that receives a stream from each sender and sends streams to each participant. This central point only inspects and forwards packets; it does not perform the expensive encoding and decoding of the actual media. The common term for it is SFU (Selective Forwarding Unit), a method of routing and forwarding WebRTC clients' audio and video streams through a server.
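By way of contrast with the mixer, an SFU's media plane reduces to a routing loop. The sketch below is hypothetical (all types and names are made up, no real SFU library is used); the point is that the payload stays encoded end to end:

```typescript
// Minimal SFU forwarding sketch: packets are inspected only at the header
// level (who sent this? who subscribes?) and relayed as-is; the encoded
// payload is never decoded.

interface RtpPacket {
  senderId: string;
  payload: Uint8Array; // still encoded; the SFU never touches it
}

type SendFn = (pkt: RtpPacket) => void;

class Sfu {
  private subscribers = new Map<string, SendFn>(); // participantId -> transport

  join(participantId: string, send: SendFn): void {
    this.subscribers.set(participantId, send);
  }

  leave(participantId: string): void {
    this.subscribers.delete(participantId);
  }

  // The whole "media plane": fan each packet out to everyone but its sender.
  onPacket(pkt: RtpPacket): void {
    for (const [id, send] of this.subscribers) {
      if (id !== pkt.senderId) send(pkt);
    }
  }
}
```

Production SFUs such as mediasoup follow this pattern and add the "selective" part on top of the loop, for example choosing a simulcast layer per subscriber.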

Compared with the traditional mixer solution, it offers lower latency and no quality degradation. The SFU is an attractive answer to server performance problems because it avoids the computational cost of video decoding and encoding and forwards media streams with minimal overhead; this is especially true for one-to-many live-streaming services.

This scheme is well suited to large-scale concurrent real-time meetings and live broadcasts. At present, a mature service provider in this space is Agora.

The sending end captures the audio and video streams, pre-processes and encodes them, and then sends them into Agora's real-time transmission network. The receiving end obtains the audio and video streams from that network, then decodes, post-processes, and plays/renders them. In the Agora channel calls described above, SD-RTN only transmits the streams; it does not mix them.

Audio processing flow:

Generally speaking, a voice call between A and B proceeds as follows (a minimal code sketch follows the list):

1) Client A's device captures the audio (e.g., via the microphone)

2) The captured audio is pre-processed (e.g., noise reduction and echo cancellation)

3) The audio is encoded and sent to the Agora cloud server

4) The Agora server transmits the audio to receiver B

5) Client B decodes and post-processes the audio after receiving it

6) Client B plays the audio
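As a concrete illustration of the sending side (steps 1–3), here is a minimal sketch using the Agora Web SDK 4.x (agora-rtc-sdk-ng). The APP_ID, CHANNEL, and TOKEN values are placeholders; capture, pre-processing, and encoding all happen inside the SDK's track object:

```typescript
import AgoraRTC from "agora-rtc-sdk-ng";

// Placeholders: supply your own project credentials.
const APP_ID = "<your-app-id>";
const CHANNEL = "demo-channel";
const TOKEN = null; // or a token string for secured projects

async function startVoiceCall(): Promise<void> {
  const client = AgoraRTC.createClient({ mode: "rtc", codec: "vp8" });

  // Join the Agora channel; SD-RTN handles the transmission (step 4).
  await client.join(APP_ID, CHANNEL, TOKEN, null);

  // Steps 1-3: the track captures microphone audio, applies the SDK's
  // built-in pre-processing, encodes, and publishes into the channel.
  const micTrack = await AgoraRTC.createMicrophoneAudioTrack();
  await client.publish([micTrack]);
}
```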

Video processing flow:

Generally speaking, video communication between A and B proceeds as follows (see the sketch after this list):

A) Client A's device captures the video (e.g., via the camera)

B) The captured video is pre-processed (e.g., beautification and filters)

C) A local preview is displayed

D) The video is encoded and sent to the Agora cloud server

E) The Agora server transmits the video to receiver B

F) Client B decodes the video after receiving it

G) Client B post-processes the decoded video

H) Client B renders the video
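Continuing the same hypothetical sketch for video: steps A–D run on the sender, step E is SD-RTN, and steps F–H are triggered by subscribing on the receiver. The element ids are placeholders for container divs in the page:

```typescript
import AgoraRTC, { IAgoraRTCClient } from "agora-rtc-sdk-ng";

// Video side of the same hypothetical call.
async function startVideo(client: IAgoraRTCClient): Promise<void> {
  // Steps A-B: capture plus the SDK's built-in pre-processing.
  const camTrack = await AgoraRTC.createCameraVideoTrack();

  camTrack.play("local-preview");   // step C: local preview in a <div id="local-preview">
  await client.publish([camTrack]); // step D: encode and send

  // Steps F-H: subscribe to remote users; the SDK decodes, post-processes, renders.
  client.on("user-published", async (user, mediaType) => {
    await client.subscribe(user, mediaType);
    if (mediaType === "video") user.videoTrack?.play("remote-video");
    else user.audioTrack?.play();
  });
}
```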

3. Applicability comparison:

1. If you offer an enterprise-grade service with ample bandwidth and capable hardware (i.e., an in-house deployment) and a limited number of participants, the MCU solution is a good fit.

2. Generally speaking, if you are providing a large-scale service, you should consider the router (SFU) approach first. Routed transmission is an example of pushing work toward the edges of the network and into the end-user applications, which achieves better scalability and flexibility.

Question and answer

Q1: What are the similarities and differences between platform interaction and streaming interaction?

A1: In platform interaction, the MCU encodes/decodes the streams, performs picture composition and audio mixing, and then sends the result to the SVR for display. In streaming-media interaction, the raw data is transmitted to Agora's network, which relays it to the SVR, where picture composition is performed.

Q2: How is the bit stream forwarded, and on what principle?

A2: In platform interaction, the terminal sends its stream to the MCU, which encodes/decodes it and performs picture composition, then sends the composite stream to the SVR for display. In streaming-media interaction, the terminal's stream is sent to Agora's servers, which only transmit and forward it without mixing; the streams are then sent to the SVR for display, and the SVR performs the picture composition.

Q3: How do the MCU and SFU compare in real-time performance?

A3: The SFU has better real-time performance. First, Agora has in effect built its own transmission layer on top of the public network, so the transmission rate can approach private-network speeds. Second, the SFU only forwards packets, while encoding/decoding and picture composition happen on the SVR device. The MCU, by contrast, must encode and decode a large amount of media at the same time, which reduces efficiency.

Q4: What are the network pressures and bandwidth requirements in the two modes?

A4: The MCU puts less pressure on the network and has lower bandwidth requirements: the terminals send their streams to the MCU for processing, and only the composite-picture stream is sent on to the SVR. The SFU, however, sends the SVR as many streams as there are terminals, so the network pressure and bandwidth usage on the SVR (receiving) side are higher.
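The asymmetry is easy to put into numbers. A small sketch, assuming every terminal publishes a single stream at the same bit rate:

```typescript
// Back-of-the-envelope downlink comparison (uniform per-stream bit rate).
// With an MCU, a receiver always downloads exactly one composite stream;
// with an SFU, it downloads one stream per remote terminal.

function downlinkKbps(mode: "MCU" | "SFU", terminals: number, streamKbps: number): number {
  return mode === "MCU" ? streamKbps : (terminals - 1) * streamKbps;
}

// 8 terminals at 500 kbps each:
console.log(downlinkKbps("MCU", 8, 500)); // 500  (one mixed stream)
console.log(downlinkKbps("SFU", 8, 500)); // 3500 (7 forwarded streams)
```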