Real-time communication (RTC) services currently come in two common forms: Software as a Service (SaaS) and Platform as a Service (PaaS) [1].

SaaS offerings tend to be large bundles. An online education product, for example, includes not only the RTC service but also the basic services that the online class workflow requires, plus teaching aids that improve class efficiency and experience, such as instant messaging (IM), a course management system, interactive courseware, an electronic whiteboard, and VOD playback.

PaaS offerings focus on real-time audio and video communication itself. They rely on audio 3A preprocessing (acoustic echo cancellation, noise suppression, and automatic gain control), audio and video codecs, channel transmission, weak-network countermeasures, network scheduling, and related technologies to provide highly available, high-quality, low-latency audio and video communication. Users integrate the SDKs for their platforms to quickly build multi-terminal real-time applications with voice calls, video calls, interactive audio live streaming, and interactive video live streaming. Typical scenarios are online education, video conferencing, interactive entertainment, and audio and video social networking.

1. Problems and challenges

Let us start from the needs and problems. For a public cloud RTC PaaS service, setting aside the per-platform differences in audio and video capture, encoding, network sending and receiving, decoding, and rendering, the central problem is how to provide highly available, high-quality, low-latency audio and video calls in today’s complex and unstable public cloud environment. The main factors that affect audio and video quality of service (QoS) [2] are:

  • Network Latency/Delay
  • Network Packet Loss
  • Out-of-Order Delivery
  • Network Jitter
  • Bandwidth Overuse/network congestion
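As a rough illustration of how three of these factors can be quantified, the sketch below computes a loss rate, an out-of-order count, and a smoothed jitter estimate from packets listed in arrival order. The `Packet` fields and the `qos_summary` function are hypothetical names introduced only for this example. The jitter update borrows the 1/16 smoothing gain of the interarrival jitter estimator in RFC 3550; a real stack would derive these values from RTP and RTCP rather than from a plain list.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int        # sender sequence number
    send_ms: float  # sender timestamp in milliseconds
    recv_ms: float  # receiver arrival time in milliseconds


def qos_summary(packets, expected_count):
    """Summarize loss, reordering, and jitter for packets given in arrival order."""
    received = len(packets)
    loss_rate = 1.0 - received / expected_count if expected_count else 0.0

    out_of_order = 0
    highest_seq = None
    jitter = 0.0
    prev = None
    for pkt in packets:
        # A packet is out of order if a higher sequence number already arrived.
        if highest_seq is not None and pkt.seq < highest_seq:
            out_of_order += 1
        else:
            highest_seq = pkt.seq
        if prev is not None:
            # Change in transit time between consecutive arrivals.
            d = (pkt.recv_ms - pkt.send_ms) - (prev.recv_ms - prev.send_ms)
            # Exponentially smoothed estimate with gain 1/16.
            jitter += (abs(d) - jitter) / 16.0
        prev = pkt
    return {"loss_rate": loss_rate, "out_of_order": out_of_order, "jitter_ms": jitter}
```

Because only differences in transit time are used, the sender and receiver clocks do not need to be synchronized for the jitter figure. Absolute latency, by contrast, is usually measured as a round trip.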

It is therefore particularly important to design the service architecture so that it provides a high-quality, efficient network transmission path, together with a set of strategies that protect audio and video QoS on weak networks. At the network layer, the usual priority is to solve the user’s last-mile problem, an idea inherited from traditional edge nodes and CDNs.

2. RTC service architecture

Divided by function, a complete RTC PaaS service [3] needs the following components:

  • Scheduling server (Coordinator)

    No matter how well the compression algorithms and code are optimized, the network remains the largest contributor to delay. The service architecture needs a mechanism that directs each user to the nearest, best-performing media node with the lowest load, which reduces round-trip time (RTT) and protects the quality of the user’s last mile as far as possible. This is the job of the scheduling server.

    As shown in steps 1 and 2 of the figure above, a new user who accesses the RTC service first asks the scheduling server for a list of available media servers. The scheduling server uses the IP address carried in the request to look up the user’s location and carrier, and returns a media server list according to its strategy. In step 3, the client probes the real-time network status to each candidate media server with ping messages and selects the best one for pushing and pulling streams. The probe results, especially the latency figures, are later reported back to the scheduling server so that it can allocate the most appropriate media server to subsequent new users more accurately.
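The two halves of this flow can be sketched as follows. This is a minimal sketch under stated assumptions, not the scheduling strategy of any particular service: `MediaServer`, `candidate_servers`, and `pick_best` are hypothetical names, the region and carrier are assumed to have already been resolved from the client IP, and the probe results are passed in as a plain dictionary instead of being measured with real ping messages.

```python
from dataclasses import dataclass


@dataclass
class MediaServer:
    name: str
    region: str
    carrier: str
    load: float  # 0.0 (idle) to 1.0 (full)


def candidate_servers(servers, region, carrier, limit=3, max_load=0.8):
    """Scheduler side: prefer matching region, then matching carrier, then low load."""
    usable = [s for s in servers if s.load < max_load]

    def rank(s):
        # False sorts before True, so servers that match the user come first.
        return (s.region != region, s.carrier != carrier, s.load)

    return sorted(usable, key=rank)[:limit]


def pick_best(candidates, rtt_ms):
    """Client side: choose the candidate with the lowest probed RTT."""
    probed = [s for s in candidates if s.name in rtt_ms]
    return min(probed, key=lambda s: rtt_ms[s.name]) if probed else None
```

The client’s `rtt_ms` measurements are the data that would later be reported back to the scheduler, which could then fold them into `rank` alongside static location information.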

  • Signaling server

    Signaling negotiation and message exchange are an essential part of a video conferencing system. However, I have always advocated “light signaling, heavy media”: keep the number of message exchanges and the size of message bodies as small as possible while still covering the necessary interactions. The signaling server should be simple and just sufficient. As shown in the figure above, I recommend placing a “thin” signaling service alongside the media server. This is easy to maintain, and the media server can keep serving independently even when other services are abnormal. Testing and troubleshooting also become more efficient. The disadvantage is that the scheduling server receives only limited media information about users, so its prediction and control of load cannot be very fine-grained.

    Here I want to discuss one question: “Does an RTC PaaS service necessarily need the concept of a conference, channel, or room?”

    I think this question can be abstracted as “long connections vs. short connections”. The concept of a conference or room is most familiar from instant messaging (IM): a message sent in a room is received or consumed immediately by the other people in it, and each participant can perceive whether the others are online or offline. RTC PaaS services currently offer users two approaches:

    1. “No conference or room concept, no packaged long-connection service”. Users do not need to call JoinChannel or JoinRoom before pushing and pulling streams with the SDK. They only need to get the media server list from the scheduling server and select the optimal node. In this mode the RTC PaaS service itself cannot sense participants joining and leaving. Pulling a stream relies on periodic retries by the SDK, and the service cannot provide message publish and subscribe. The advantage is that no IM-like service has to be developed (building a stable, efficient long-connection IM service is not easy), and the development logic and call flow are very simple. The cost is that the SDK must keep retrying to pull streams, which is not elegant, and without a long connection the server cannot sense the client’s real online state. When a client crashes, for example, many messages are never sent. Destroying tasks, sessions, or records then depends heavily on messages arriving from the client, or on timeouts for cleanup.

      Suitable scenario: many business users already have their own IM service, because a long connection for in-app messages, likes, gifts, and similar features has basically become standard in apps. They can rely on their upper-layer business channel to publish and subscribe to messages and to control the timing of pushing and pulling streams.

    2. “Conference or room concept, with a packaged long-connection service”. In contrast to the first case, the service must provide users with a stable and efficient IM service, which can be built with WebSocket plus a message queue (such as RabbitMQ). Users must call JoinChannel or JoinRoom before pushing and pulling streams with the SDK. They can then subscribe to or consume messages and handle the callbacks that fire when others join, leave the room, or push a stream. Thanks to the long connection, the server can sense the client’s status accurately, and when a user goes offline it cleans up the task, session, or record promptly.

      Suitable scenario: some small companies are not able to develop their own IM service. From their point of view: “Just give me a complete solution. I don’t need to care about other technical details; I only need to know which calls to make in which event callbacks.”
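The bookkeeping behind the second approach can be sketched as below. This is a simplified in-memory model, assuming heartbeats arrive over the long connection; `RoomRegistry` and its methods are hypothetical names, and the `events` list merely stands in for whatever message queue would deliver join and leave notifications to subscribers.

```python
class RoomRegistry:
    """Tracks room membership over long connections and expires silent clients."""

    def __init__(self, timeout_s=15.0):
        self.timeout_s = timeout_s
        self.rooms = {}   # room id -> {user id: time of last heartbeat}
        self.events = []  # (room id, event, user id); stands in for a message queue

    def join(self, room, user, now):
        self.rooms.setdefault(room, {})[user] = now
        self.events.append((room, "joined", user))

    def heartbeat(self, room, user, now):
        # Heartbeats from users who are not in the room are ignored.
        if user in self.rooms.get(room, {}):
            self.rooms[room][user] = now

    def leave(self, room, user):
        if self.rooms.get(room, {}).pop(user, None) is not None:
            self.events.append((room, "left", user))

    def sweep(self, now):
        """Remove users whose connection went quiet, for example after a crash."""
        for room, members in self.rooms.items():
            stale = [u for u, seen in members.items() if now - seen > self.timeout_s]
            for user in stale:
                del members[user]
                self.events.append((room, "left", user))
```

In the first approach there is no equivalent of `heartbeat`, so cleanup can only rely on explicit messages from the client or on coarse timeouts.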

    One more point at the signaling level: TCP performs noticeably worse than UDP on weak networks, so it often happens that media could still be transmitted but the signaling connection cannot be established. It is therefore worth trying a reliable UDP-based transport, such as QUIC, for signaling.

  • Media server (streaming server)

    All media servers are completely independent. When a server joins or leaves the RTC network, the system adjusts allocation automatically, without users noticing or being affected. Under high load, “horizontal scaling” is achieved by deploying the service on additional machines.

    By network topology, media servers fall into two categories [4]: SFU (Selective Forwarding Unit) and MCU (Multipoint Control Unit).

As shown in the figure, an SFU only forwards media streams and does little complex media processing. Its bottleneck is I/O, and it supports high concurrency. The disadvantage is that it forwards many downstream streams, which consumes a lot of bandwidth and places high demands on client-side bandwidth and devices. An MCU, in addition to receiving and sending media streams, also transcodes and mixes them. This places high demands on server performance and makes deployment expensive; the bottleneck is the CPU, concurrency is poor, and the number of forwarded streams is low. In a real-time transmission system, transcoding also adds delay and hurts real-time performance. The advantage is that an MCU consumes little downstream bandwidth and places low demands on client-side bandwidth and devices.
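The trade-off can be made concrete with simple counting. The sketch below assumes a single room in which every participant publishes one stream and subscribes to all the others, and an MCU that produces one shared mix for everyone; both are simplifications chosen for this example, and `stream_counts` is a hypothetical helper.

```python
def stream_counts(participants, architecture):
    """Count streams for one room where everyone publishes and subscribes to all."""
    n = participants
    if architecture == "sfu":
        return {
            "downlink_per_client": n - 1,    # one forwarded stream per remote peer
            "server_outbound": n * (n - 1),  # each stream goes to n - 1 peers
            "server_decodes": 0,             # forwarding only, no transcoding
        }
    if architecture == "mcu":
        return {
            "downlink_per_client": 1,        # a single mixed stream
            "server_outbound": n,            # the mix is sent once per client
            "server_decodes": n,             # every input is decoded before mixing
        }
    raise ValueError(f"unknown architecture: {architecture}")
```

With six participants, the SFU case gives five downstream streams per client and thirty outbound streams on the server, while the MCU case gives one downstream stream per client at the price of decoding all six inputs on the server.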

SFU architecture is preferred for meeting scenarios with strong requirements on real-time performance and high concurrency. With the promotion of 5G technology in the future, it can be predicted that bandwidth will become less and less of a problem, and the advantages of SFU will become more obvious.

In real business scenarios, however, the service architecture is often an “SFU+MCU hybrid”. This raises a question: should the MCU and SFU be combined in one service that can both forward and mix streams, or should they be completely independent, with pure SFUs as the media nodes on the forwarding path?

In terms of media processing flow and function, an SFU can be regarded as a subset of an MCU, and an MCU supports some forwarding capability. For example, once mixing is complete, the MCU can offer different media encapsulation formats for output and be pulled from directly by the client (in this sense the MCU also has SFU functionality).

As shown in the figure, I suggest implementing different capabilities on different nodes [5]: SFU nodes in the forwarding service handle media forwarding and cross-border cascading, while MCU nodes in the mixing and transcoding service handle transcoding, audio mixing, and stream mixing. Because the services are loosely coupled, each design can be kept as simple as possible, and the scheduling strategy for each service type can also be much simpler, which reduces overall system complexity and improves efficiency.

A complete RTC PaaS service often has to round out its capabilities. In addition to the media service types above, it also needs:

A “cloud recording server (RTP to FLV or MP4)” and a “cloud forwarding server (RTP to RTMP)”. Cloud recording is mainly used for on-demand playback. Cloud forwarding is mainly used for bypass live streaming, which relieves the forwarding nodes and reduces cost in 1-to-N scenarios. As shown in the figure above, both SFU and MCU can serve as input to cloud recording and cloud forwarding, which likewise choose among different media encapsulation formats for output. The recording of a CDN RTMP live stream can also be used for on-demand playback, but each added link raises the probability of problems, so the choice should be made according to the business form and the stage of development.

If a private protocol is used, supporting the Web also requires designing and developing a “WebRTC gateway” to convert signaling and media.

3. Evolution of network architecture

History tends to repeat itself. The development and evolution of the RTMP live-streaming architecture is a useful reference, because RTMP live streaming and RTC essentially have to solve the same problems at the network level.

  • Phase 1: “single-region origin mode”. Every region pushes to and pulls from the origin server. As shown in the figure above, stream pushing and pulling clearly cannot be guaranteed across regions and carriers.

  • Phase 2: multi-region, multi-line BGP + private-line cascading back-to-origin mode. As shown in the preceding figure, this effectively improves the network between users and servers. With Ali Cloud ECS multi-line BGP data centers and private lines cascading the regions together, communication is much more stable than over the carriers’ backbone networks. The cost, however, is very high: multi-line BGP data centers charge by traffic, and private lines cost even more. In addition, Ali Cloud ECS does not have a VPC in every region, so some areas cannot be covered. Introducing a new node may require buying new private lines, and some private lines between nodes are still limited by physical factors; for example, traffic from mainland China to overseas can only leave through the East China Sea submarine cable out of Shanghai.

    At this stage it is necessary to do “reasonable zoning”, “careful selection of deployment points”, and “customized scheduling strategies” according to actual demand. For example, allocate a room within one region to the same server wherever possible to avoid cascading. Generate routes dynamically from the actual network topology: when a user pulls a stream, the scheduling server selects the optimal path and goes back to the origin dynamically. Avoid consuming private-line bandwidth where possible and monitor its consumption; when consumption exceeds a threshold, raise an alarm and switch to public-network communication to avoid an avalanche.
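One way to picture the dynamic routing and the private-line threshold is a shortest-path search that simply leaves saturated private lines out of the graph. The sketch below uses Dijkstra’s algorithm for this; it is an illustration under assumed inputs, not the scheduler of any real service. The link tuple layout, the `usage_limit` value, and the node names in the usage note are all hypothetical.

```python
import heapq


def best_path(links, src, dst, usage_limit=0.8):
    """Return (cost, path) for the cheapest route, or None if dst is unreachable.

    links: iterable of (a, b, cost_ms, kind, usage), where kind is "private"
    or "public" and usage is the fraction of link bandwidth already consumed.
    """
    graph = {}
    for a, b, cost, kind, usage in links:
        if kind == "private" and usage >= usage_limit:
            continue  # saturated private line: skip it so traffic uses public links
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    queue = [(0.0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, step in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(queue, (cost + step, nxt, path + [nxt]))
    return None
```

For example, if the private line between two nodes `sh` and `gz` is at 90% usage, a call such as `best_path(links, "bj", "gz")` routes around it over public links, and raising `usage_limit` brings the private line back into consideration.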

  • Phase 3: multi-region edge node (ENS) access + multi-line BGP (ECS) cross-region and cross-carrier relay + Cloud Enterprise Network (CEN) mode. As shown in the figure above (I am being a little lazy here 😊), traffic within the same region and carrier is relayed through edge nodes, while traffic across regions or carriers is relayed through cascaded ECS nodes. The ECS nodes communicate through the Cloud Enterprise Network and are connected to each other by private lines. Compared with ECS nodes, ENS edge nodes offer wider coverage, lower cost, and a more scalable network topology. The Cloud Enterprise Network spares users from pulling private lines between every pair of nodes when they introduce a new VPC node, and lets the nodes share bandwidth resources. Users no longer need to care about the actual network topology or the physical limits of the lines, because routing between VPCs is left to Ali Cloud, which makes the service more convenient to use and to scale. I have to say that Ali Cloud really does think about solving users’ pain points; helping customers succeed is also how it succeeds itself.

  • Phase 4: “edge node CDN + WebRTC standardization mode”. This also follows the historical pattern of RTMP live streaming, and it is still an ideal rather than a reality. As far as I know, the major cloud vendors are working toward it: they support the standard WebRTC protocol in CDN services on edge nodes, so users no longer maintain their own origin and servers and only need to access the nearest optimal CDN node to push and pull streams, with the network link guaranteed by the cloud vendor. It is still a long way off. In the future, small companies will only need to follow the standard for signaling negotiation and for pushing and pulling media to get real-time audio and video interaction. This will greatly reduce development and maintenance costs and make service stability more assured, but those with special customization needs and scenarios will have less flexibility.

4. Development trends and destination of the audio and video industry

The rapid development of the Internet has accelerated the spread of information, changed people’s lives, and driven the continuous iteration of technology. As people tirelessly pursue better audio and video experiences, all kinds of audio and video applications have emerged, affecting people’s lives and changing their habits. Value drives the development of technology: each technology goes from emerging to prevailing to commonplace, the barrier to entry falls from high to low, and the high costs of small workshops give way to uniform rock-bottom prices. The Internet keeps changing everyday life, and the development of technology keeps changing the lives of programmers. The era when one could coast on a single skill is long gone; what remains is the need to keep pushing forward through the storm, and the anxiety of turning 35. The audio and video industry is more specialized than many other fields, but it still cannot escape the law of mass production.

Over time, talent in the domestic audio and video industry has generally come from three eras:

  • “The player era”: the corresponding application forms are “on-demand video and short video”. Representative on-demand apps are Tencent Video, iQiyi, Youku, Mango TV, and the declining Sohu Video, LeTV, Tudou, and so on. Representative short-video apps are Douyin and Kuaishou. The required technology stack is a player plus FFmpeg. Hats off to FFmpeg, which really has fed a lot of people.

  • “The RTMP live-streaming era”: the corresponding application forms are “entertainment live streaming, sports live streaming, live shows, live commerce, large online classes”, and so on. Representative live-streaming apps include Huya, Douyu, Inke, and the now-defunct Panda. Since the outbreak this year, live commerce has been very popular, and it seems no e-commerce company can afford to go without it. In the early days of live streaming, companies built their own origin servers and CDNs. Then came specialized CDN vendors such as Wangsu, Lanxun, Dilian, and Baiyunshan. Gradually, giants such as Tencent Cloud, Ali Cloud, Jinshan Cloud, and Huawei Cloud entered the CDN market, and the staffing and maintenance costs of live-streaming development fell lower and lower.

    OBS is the most popular open-source project for the live-streaming client, and nginx-rtmp-module and SRS are commonly used on the server.

  • “The RTC era”: the corresponding application forms are “co-streaming (Lianmai), audio and video calling, video conferencing, 1-on-1 and small-group online classes”, and so on. Established video conferencing vendors include Cisco, Polycom, Huawei, Zoom, and the low-key Vidyo. Video calling apps include Apple FaceTime, Microsoft Skype, and Google Hangouts and Duo. Well-known domestic RTC PaaS vendors include Agora, Zego, Tencent Cloud, and Ali Cloud.

    Although VoIP and video conferencing had been developing abroad for many years, they remained a niche market before 2017, with a limited pool of technical talent. It must be mentioned that Google open-sourced WebRTC in 2011. Before WebRTC, real-time audio and video communication was an almost unreachable field that took a great deal of specialized experience to enter; now more and more developers gain a deep understanding of RTC technology through WebRTC. I was fortunate to first encounter it in 2014. I look forward to the day the WebRTC standards are unified, and to riding along on Google’s coattails.

    As described in the fourth phase of “Evolution of network architecture” above, RTC will gradually settle into just two forms, following the same pattern as live streaming:

    1. WebRTC CDNs from cloud vendors. Users only need the cloud vendor’s SDK, or their own customized SDK, to push to the nearest optimal CDN node, and the cloud vendor guarantees the network link. Both private cloud and public cloud deployment are supported, and users can add cloud recording, cloud forwarding, transcoding and mixing services, and so on.
    2. For those who do not touch the underlying layer: continuous innovation in the business forms of front-end apps built on RTC technology, plus self-built services where a small vendor has specialized or security requirements.

In this day and age, the only thing that never changes is change itself. As technology developers we cannot stop the tide of history; we can only adapt, and we need to be solid ourselves. What we can do is lay a good foundation, keep meeting challenges, and strive for mastery and professionalism.


References

[1] IaaS, PaaS, SaaS differences: http://www.ruanyifeng.com/blog/2017/07/iaas-paas-saas.html

[2] preparing-ip-network-video-conferencing: https://support.polycom.com/content/dam/polycom-support/products/uc-infrastructure-support/management-scheduling/dma/other-documents/en/preparing-ip-network-video-conferencing.pdf

[3] Open Source Cloud Gaming with WebRTC

[4] How to Build a Video conference based on WebRTC: https://zhuanlan.zhihu.com/p/130406233

[5] How to construct real-time network scheduling System to deal with transnational scenarios: https://zhuanlan.zhihu.com/p/47951276
