By Li Zhitao

Edited by LiveVideoStack

Hello, everyone. I am Li Zhitao from Beijing 263 Enterprise Communication Co., Ltd., where I am mainly responsible for developing the company's audio and video business line.

You may not be very familiar with 263 Network Communication Co., Ltd., so let me sum it up in one sentence: 263 has cultivated this industry for 20 years, striving to be the enterprise Internet communication service provider that understands its customers best. The company was founded in 1993 as Haicheng Paging. In 1999, it became the largest dial-up access service provider in China apart from the basic carriers. In 2001, 263's self-built IDC facility became the first four-star IDC room in China. In 2004, 263 became a telecom value-added service provider holding a national multi-party communication license. In 2005, 263 launched its enterprise mailbox and became the leading enterprise mailbox outsourcing brand in China. In 2010, 263 Network Communication was listed in Shenzhen (stock code: 002467); in the same year, the 263 teleconferencing system was launched, making 263 the fastest growing teleconferencing service provider in China. In 2015, 263 acquired Zhanshi Interactive and became the largest multimedia interactive live streaming technology provider in China at the time. In 2018, 263 obtained the first mobile resale license in China and launched customized mobile communication services for enterprises; in the same year, it announced its "video+" strategy. Since then, it has undergone a major strategic transformation from paging, dial-up access, messaging, mailbox, and teleconferencing to a real-time service provider centered on audio and video.

Next, I will introduce 263's audio and video business from four aspects, guided by the "video+" strategy and the iterative direction of our technology construction.

1. 263 Cloud Vision Product Introduction

After years of building on the 263 video cloud, we support multi-protocol and multi-terminal scenarios, centered on the 263 cloud terminal family. The cloud terminals include a variety of hardware terminals aimed mainly at enterprise office scenarios, with terminal solutions for individual participants and for small, medium, and large conference rooms. Beyond meeting-room devices, we also support joining a meeting for real-time audio and video communication from any scene, in any region, at any time, on mobile, PC, Windows, Mac, iOS, and Android clients.

The 263 video cloud is compatible with all kinds of audio and video communication methods across the whole protocol layer. It is built mainly on the WebRTC protocol, is compatible with VP8/VP9/H.264 and other codecs, adapts to multiple browsers, and supports access via the Microsoft Lync protocol. A considerable number of existing enterprise customers have large fleets of hardware terminals based on the SIP and H.323 protocols; the 263 video cloud supports these protocols as well, so those terminals can be accessed and used directly. Some customers have bought relatively expensive hardware MCUs from Cisco and Polycom, and we connect these through the video cloud so they can also be accessed and used.

The video cloud also integrates teleconferencing on our traditional PSTN network: mobile phones and landlines can join via the conference call platform, whose audio is merged with the video cloud's audio. Real-time content from the video cloud can be pushed to the cloud for live broadcast or on-demand playback using the standard RTMP protocol.

1.1 Capability Matrix

The overall capability matrix of the 263 video cloud system starts with the management system, which covers business and support management, user management, multi-service platform management, and user authentication and permission management. The 263 video cloud provides a variety of video service scenarios: conference services for enterprise telecommuting; educational services such as large classes, small classes, dual-teacher classrooms, and K12; and telemedicine services such as remote medical training and remote surgery.

The message system carries the following types of messages: IM messages, application messages, and message notifications. The signaling transfer system includes voice signaling, audio and video signaling, and dispatch signaling. The attendance (presence) system mainly handles message routing, positioning, and control for a user after login.

The real-time RTC system is our core, comprising the WebRTC Service and the Streaming Service. The WebRTC Service connects the Web and App services for real-time audio and video communication, while the Streaming Service handles user stream pushing and live broadcast. The Core system manages and schedules the entire cluster, including hardware availability management, upper- and lower-limit management of hardware servers, load balancing and failover, parallel system expansion, and room-level task scheduling according to system load. The MCU performs transcoding and screen mixing based on the audio and video configuration. The SIP Service connects to the SIP modules of external systems, including teleconferencing, third-party hardware, and third-party SIP-based systems. Recording provides on-demand recording for conferences, education, and telemedicine.
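As a rough illustration of the room-level scheduling just described, the sketch below picks the least-loaded healthy media server for a new room. It is a minimal sketch under assumed names (MediaServer, pickServerForRoom), not the actual 263 implementation.

```typescript
// Hypothetical sketch of room-level task scheduling: pick the least-loaded
// healthy server, or signal the caller to scale out / fail over.
interface MediaServer {
  id: string;
  healthy: boolean;  // availability reported by cluster heartbeats
  load: number;      // current load, 0.0 - 1.0
  maxLoad: number;   // upper limit before the server is excluded
}

function pickServerForRoom(servers: MediaServer[]): MediaServer | null {
  const candidates = servers.filter(s => s.healthy && s.load < s.maxLoad);
  if (candidates.length === 0) return null; // all full: trigger parallel expansion
  return candidates.reduce((best, s) => (s.load < best.load ? s : best));
}
```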

The 263 live broadcast network connects mainly to 263's existing live broadcast system, and can also push streams to Ali Cloud and Tencent Cloud. The directing station provides value-added functions such as broadcast directing and interstitial content. Live broadcast management handles the rights and venue control of live broadcasts. The cloud storage system is tied to recordings: after recordings are saved to cloud storage in object-storage form, they can be played back as VOD according to service requirements.

The public application system offers multiple application services, such as questionnaires, voting, and tipping. The teleconferencing system is a PSTN conference system with a hardware conference bridge at its core. The SIP MCU module can connect to an external SIP MCU or SIP terminal.

1.2 SaaS & PaaS

We provide both SaaS and PaaS interface capabilities. The top half is the SaaS layer: conferencing applications, education applications, and telemedicine applications. The whole system includes a message SDK, sharing SDK, annotation SDK, RTC SDK, on-demand SDK, and live SDK. For customers with in-depth development capabilities, we can also expose the PaaS-layer development interface. The interface works through method and function calls, with the underlying communication based on Socket or RPC.
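As a hypothetical sketch of what such a function-call interface might look like (the RtcClient interface and its method names are assumptions, not the actual 263 SDK):

```typescript
// Assumed shape of a PaaS-layer RTC interface; signaling would run over
// Socket or RPC underneath, as described above.
interface RtcClient {
  join(roomId: string, token: string): Promise<void>;
  publish(stream: MediaStream): Promise<void>;
  subscribe(userId: string): Promise<MediaStream>;
  leave(): Promise<void>;
}

async function startMeeting(client: RtcClient, roomId: string, token: string) {
  await client.join(roomId, token);   // authenticate and enter the room
  const local = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
  await client.publish(local);        // publish local audio and video
}
```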

2. Technical Architecture

What follows are the iterative steps of our overall technical architecture over recent years.

263 Cloud Vision technology is based on Google's open source WebRTC and Intel's OWT (Open WebRTC Toolkit) projects.

2.1 Architecture Topology V1.0

The entire technical framework was built up from the first generation. The first-generation system is divided into two layers by function and adopts clustered, distributed deployment across essentially four types of IDCs. The first layer is our core Beijing (BJ) DC; the second layer mainly solves the access problems caused by the north-south interconnection gap between carriers in China and by cross-carrier access. For overseas users, we provide overseas access points.

So far, cost considerations have made it impossible to deploy nodes in every major city across the country, so we have deployed a few nodes where users are concentrated and use Ali Cloud to supplement the rest, relying mainly on its ECS instances and bandwidth. At present, access through the Ali Cloud nodes into the 263 video cloud system is of good quality for users nationwide, which guarantees full domestic coverage. This is the 1.0 topology of our overall system architecture.

The biggest problem with the 1.0 system is that interaction between IDCs in different regions or on different carriers has to be relayed through the central node, which is costly. In addition, the data link is long, so user latency is relatively high. For RTC applications, a latency within 400 milliseconds is generally acceptable, but a physical detour of some 2,000 kilometers through the central node adds considerable delay. Based on these issues, we developed version 2.0.

2.2 Architecture Topology V2.0

To address the problems of topology 1.0, version 2.0 mainly adds a layer: a Relay pool. User access nodes in the same region and on the same carrier, or on the same carrier across regions, communicate directly; those that cannot reach each other communicate through the Relay pool, which can be load balanced and scaled out. The architecture thus changes from the two layers of 1.0 to the three layers of 2.0. The Beijing IDC and the Relay-layer IDCs are deployed in multi-line BGP machine rooms.

Overseas, Relay-layer nodes were added in the United States, Germany (for Europe), and Hong Kong. Users in the same locality interact directly through their local area without transiting the core machine room, so their latency experience is good. At the same time, the core machine room in Beijing has been upgraded for high availability: if one core machine room is attacked, the system can quickly switch over to a hot-backup machine room.

2.3 Media Signaling Logic

The media signaling logic in the system is embodied in three layers: the background core layer, the multi-line Relay layer in the middle, and the access layer where users connect nearby. OWT Core, the core logic and a secondary development based on Intel's OWT, is responsible for the computing, control, and scheduling of the entire system. SRC solves the intelligent routing problem: since users access from all over the world, SRC is responsible for finding the access network with the best quality. The MC system screens servers for availability and allocates low-load servers running the same service to users.
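One simple way to realize the SRC idea is to probe candidate access nodes and pick the fastest. The sketch below is an assumption about the approach (node URLs and an HTTP probe), not 263's actual routing logic.

```typescript
// Probe each candidate access node with a lightweight HTTP request and
// return the one with the lowest measured round-trip time.
async function probeRtt(url: string): Promise<number> {
  const start = performance.now();
  await fetch(url, { method: 'HEAD', cache: 'no-store' });
  return performance.now() - start;
}

async function pickAccessNode(nodeUrls: string[]): Promise<string> {
  const results = await Promise.all(
    nodeUrls.map(async url => ({ url, rtt: await probeRtt(url) })),
  );
  return results.reduce((best, r) => (r.rtt < best.rtt ? r : best)).url;
}
```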

3. Media Communication Modes

3.1 Basic Modes

The current 263 cloud video stack can already solve many access quality problems, but if a business scenario or business model uses the wrong basic communication model, the result is inflated traffic, constrained bandwidth, excessive server computing usage, and network jitter, packet loss, and delay that degrade real-time audio and video. The figure above describes the communication modes WebRTC uses today. The first is MESH mode, in which WebRTC clients connect to each other directly and media flows peer-to-peer (P2P).

The second, SFU, is similar to MESH except that all media is forwarded through the server: for each client, the upstream is 1 stream and the downstream is N-1 streams. The drawbacks of the two are broadly similar; the advantage of SFU is that media is relayed through the server, which makes it convenient to mix streams for live push streaming or recording.

The third is based on MCU. The advantage of MCU is that each client has one upstream channel and one downstream channel, so the client consumes little bandwidth. Because this mode requires the server to mix the audio and video, it consumes server CPU and GPU computing resources. Each communication mode therefore has its own advantages and disadvantages; the combination of these basic modes into a hybrid mode for different business scenarios is discussed later.
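The per-client stream counts of the three modes can be summarized in a few lines (a simple illustration of the trade-off, not production accounting):

```typescript
// Upstream/downstream streams per client in an N-party call.
function streamsPerClient(mode: 'mesh' | 'sfu' | 'mcu', n: number) {
  switch (mode) {
    case 'mesh': return { up: n - 1, down: n - 1 }; // one P2P link per peer
    case 'sfu':  return { up: 1,     down: n - 1 }; // server forwards the others
    case 'mcu':  return { up: 1,     down: 1     }; // server mixes into one stream
  }
}
// e.g. 6 parties: mesh = 5 up / 5 down, SFU = 1 up / 5 down, MCU = 1 up / 1 down
```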

3.2 Version 1.0

The figure above is the data flow diagram for 1.0. The client publishes and subscribes to audio and video according to the business logic; the server is on the right. In 1.0 we supported MCU and SFU at the same time. Four Access modules are illustrated, with dotted lines representing UDP-based user-plane access. In SFU mode, the computing power of the server background is not involved. When the endpoints use different codecs, the MCU module's transcoding capability in the server background is applied before delivery to the client; the MCU module also performs audio and video mixing and screen compositing. In this version the SFU lacked SVC and Simulcast support, so audio and video quality was not guaranteed.

3.3 Version 1.5

In between, we launched version 1.5, which combined server-side MCU screen mixing with client-side screen cropping to give users more flexibility: after cropping, each stream in the mixed screen can be displayed separately in a flexible layout.

3.4 Version 2.0

Version 2.0 builds on 1.0: the SFU adds Simulcast, and the MCU gains extended functions, including RTMP push streaming with user-defined encoding formats and customized layouts. Through the SIP gateway, it can also integrate with hardware MCU systems and PSTN teleconferencing.
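For reference, this is roughly how a publisher enables Simulcast with the standard browser WebRTC API; the three layers and the bitrates here are illustrative, not 263's actual configuration.

```typescript
// Publish one camera track as three Simulcast layers; the SFU then forwards
// whichever layer suits each subscriber's bandwidth.
async function publishSimulcast(pc: RTCPeerConnection) {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  pc.addTransceiver(stream.getVideoTracks()[0], {
    direction: 'sendonly',
    streams: [stream],
    sendEncodings: [
      { rid: 'h', maxBitrate: 1_500_000 },                         // full resolution
      { rid: 'm', maxBitrate: 500_000, scaleResolutionDownBy: 2 }, // half resolution
      { rid: 'l', maxBitrate: 150_000, scaleResolutionDownBy: 4 }, // quarter resolution
    ],
  });
}
```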

3.5 Hybrid Mode

As mentioned above, our whole system is based on SFU and MCU. The advantages of SFU are flexible distribution, high concurrency, and strong real-time performance; its disadvantages are the large number of downstream forwarded streams, high bandwidth consumption, degraded experience, and the cost of clients maintaining many connections. The advantage of MCU is that it uses little downstream bandwidth; its disadvantages are high server performance requirements and deployment cost, plus an extra server hop that makes real-time performance slightly worse.

The hybrid mode we adopt is MCU+SFU, with the business scenario deciding which is used. With five or fewer parties, SFU's advantages apply and the cost is low; with six or more parties and higher customer value, MCU computing resources are used. In short, the interaction pattern determines the communication mode.
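The selection rule reduces to something like the following; the five-party threshold comes from the text, while the high-value flag is a hypothetical stand-in for whatever customer-value logic applies.

```typescript
// Choose the communication mode for a room from party count and customer value.
function chooseMode(participants: number, highValue: boolean): 'sfu' | 'mcu' {
  if (participants <= 5) return 'sfu';  // small meeting: cheap, flexible forwarding
  return highValue ? 'mcu' : 'sfu';     // spend mixing resources on high-value rooms
}
```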

4. Operation-Level Technology Enhancements

We have talked about WebRTC and OWT. In practice, we optimized the NACK and FEC mechanisms of audio and video transmission against the weak-network quality problems we encountered, and solved audio/video lip-sync issues. Through stream switching, we solved the problem that, in MCU mode, users' TVs, computers, and mobile terminals want different resolutions: 1080P for the large screen, 720P for the PC, and 360P, which is enough, for mobile. It also lets different users obtain different bitstreams according to their own network quality.
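A sketch of that per-device, per-network stream selection might look like this; the resolution ladder matches the text, while the bitrate thresholds are illustrative assumptions.

```typescript
// Pick the highest resolution the device wants and the link can sustain.
type Device = 'tv' | 'pc' | 'mobile';

const preferred: Record<Device, number> = { tv: 1080, pc: 720, mobile: 360 };
const minKbps: Record<number, number> = { 1080: 3000, 720: 1500, 360: 400 };

function chooseResolution(device: Device, bandwidthKbps: number): number {
  const ladder = [1080, 720, 360].filter(r => r <= preferred[device]);
  for (const r of ladder) {
    if (bandwidthKbps >= minKbps[r]) return r; // highest sustainable layer
  }
  return 360; // floor: always deliver at least the lowest layer
}
```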

When the system is deployed across IDCs, occasional network disconnections can occur inside the cluster and cause exceptions, so a fault-tolerance mechanism was added to ensure robustness. The database uses a MongoDB cluster, and the RabbitMQ message bus achieves high availability with HAProxy in front of a three-node RabbitMQ cluster (HAProxy+3RMQ).
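On the client side, connecting through HAProxy with retry might look like the sketch below. It uses the real amqplib package; the virtual address and the backoff policy are assumptions, not the actual 263 configuration.

```typescript
import amqp from 'amqplib';

// Connect to RabbitMQ through the HAProxy virtual address; HAProxy routes the
// TCP connection to a healthy node of the three-node cluster behind it.
async function connectWithRetry(url: string, retries = 5) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const conn = await amqp.connect(url);
      conn.on('error', err => console.error('AMQP error:', err.message));
      return conn;
    } catch {
      await new Promise(r => setTimeout(r, 1000 * attempt)); // linear backoff
    }
  }
  throw new Error('RabbitMQ unreachable after retries');
}
```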

The above covers the operation-level transformations we made to make the system fit for production.