To cope with the challenges of increasingly diverse audio-video interaction scenarios, Agora has begun designing its own next-generation video processing engine. Along the way we have accumulated a great deal of experience in engine architecture, performance tuning, plug-in system design, and other areas, which we hope to share with audio-video enthusiasts and industry practitioners.

On the afternoon of May 26th, at QCon's "Real-Time Audio and Video" special session, Agora architect Li Yaqi gave a practice-oriented talk titled "The Next-Generation Video Processing Engine". This article is an edited transcript of that talk.

Today’s sharing will be mainly divided into three parts:

First, why we are building the next-generation video processing engine, and what its design principles and goals are;

Second, how we achieve those design principles and goals;

Third, how the next-generation video engine has performed in real deployments.

01 Design goals of the next-generation video processing engine

With the rapid development of audio and video technology, real-time audio-video interaction has been widely adopted in fields such as social entertainment, live streaming, and healthcare. Since the COVID-19 outbreak, more and more activities have moved online. Many readers have probably watched live streams or sat with their children through online classes.

So what are the pain points in these scenarios? In live video, the need to handle multiple video sources is becoming increasingly common. In e-commerce live streaming, for example, hosts usually shoot from multiple cameras at different angles to sell products more effectively. Such broadcasts may also use a directing console to combine several feeds and switch seamlessly among multiple video sources.

In online education, the traditional setup is a camera pointed at the teacher plus screen sharing by the teacher. To enrich online teaching, we can add another camera feed that captures the teacher writing on a tablet or blackboard, and even support playing local or online courseware.

On top of multiple video sources, there is also a need to edit and mix them in real time. In a live-streaming assistant application, hosts may need to composite and edit several captured sources in real time, then add local materials and animated stickers to enrich the broadcast while reducing uplink bandwidth pressure. In multi-person interactive scenarios, to reduce bandwidth pressure and performance cost on the receiving end, the hosts' video streams need to be composited in the cloud before being sent to each receiver.

With the rapid progress of AI in image processing, advanced video pre-processing features built on AI algorithms are increasingly used, such as advanced beautification, background segmentation, and background replacement. Taken together, these three scenarios place higher demands on the flexibility and extensibility of the next-generation video engine.

In addition, as our business and team continue to grow, the number of users of the next-generation video engine keeps increasing, and different users have different integration needs. Small teams and individual developers need an engine that is easy to integrate, low-code, and quick to launch. Enterprise developers need the engine to expose more basic video processing capabilities so they can customize video processing for their own business.

To serve these different developer groups, Agora's next-generation video processing engine needs a flexible design that meets differentiated integration requirements.

Beyond design flexibility, the live video experience is also a key metric. With the arrival of the 5G era, network infrastructure can support users' demands for clearer and smoother live streaming, so the next-generation SDK must push performance optimization to the limit and **support higher video resolutions and frame rates**. As real-time interactive scenarios keep expanding and users become more widely distributed, we also need stronger resistance to weak networks and lower performance and resource consumption, so that users in countries and regions with weak network infrastructure, or on low-end devices, still get a good live video experience.

Combining the scenario richness, user diversity, and live-video-experience requirements described above, the design principles and goals of the next-generation video processing engine can be summarized in four points:

1. Meet the differentiated integration needs of different users;

2. Be flexible and extensible, so that new business and technology scenarios can be supported quickly;

3. Provide a rich and powerful core system that reduces developers' mental burden and makes development fast and reliable;

4. Deliver excellent, monitorable performance: continuously optimize the engine's processing performance and improve monitoring so that optimization forms a closed loop and keeps iterating.

Next we move to the second part: the software design methods Agora uses to achieve the four design goals above.

02 Architecture design of the next-generation video processing engine

For the first design goal, meeting the differentiated integration needs of different users: the engine's users fall naturally into layers. Some pursue low code and fast launch, so they want the engine to provide functions as close to their business as possible. Others want us to expose more core video capabilities, on top of which they can customize video processing for their own needs.

Matching this user structure, our architecture adopts a layered design split into a High Level and a Low Level. The Low Level models the core functions of video processing, abstracting the video source unit, pre/post-processing units, renderer unit, codec unit, core infrastructure, and so on. By combining and building on these basic modules, we abstracted the concepts of Track, network video Stream, and scene on top of the Low Level, and wrapped them into a High Level API that is closer to the user's business.

Let's look at the difference between High Level and Low Level with a practical example. Suppose you want to implement a very simple scenario: open the local camera, start the preview, and publish the stream to the remote side.

With the High Level API, this real-time interactive scenario takes only two simple calls: start the local camera and preview with the StartPreview API, then join the channel and publish the video stream with the JoinChannel API. Users who want more custom behavior in this scenario can use the Low Level API instead: first create a local camera capture pipeline with CreateCameraTrack; this Track exposes a variety of interfaces for configuration and state control. At the same time, we decoupled local media processing from the network publishing node, so the video stream can be published either to Agora's RTC system or to a CDN.
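
The shape of the two API levels can be sketched roughly as follows. This is an illustrative C++ sketch only, not the real Agora SDK surface: apart from StartPreview, JoinChannel, and CreateCameraTrack, which were mentioned above, every interface and type name here (IRtcEngine, ICameraVideoTrack, ILocalUser, IMediaNodeFactory) is an assumption made for the example.

```cpp
// Illustrative sketch only -- not the actual Agora SDK API.
#include <memory>
#include <string>

// --- hypothetical High Level facade ---------------------------------------
struct IRtcEngine {
    virtual ~IRtcEngine() = default;
    virtual void StartPreview() = 0;                        // open camera + render preview
    virtual void JoinChannel(const std::string& name) = 0;  // join and publish
};

// --- hypothetical Low Level building blocks --------------------------------
struct ICameraVideoTrack {
    virtual ~ICameraVideoTrack() = default;
    virtual void SetCaptureFormat(int w, int h, int fps) = 0;  // fine-grained control
    virtual void SetPreviewEnabled(bool enabled) = 0;
};
struct ILocalUser {
    virtual ~ILocalUser() = default;
    // Publishing is decoupled from local processing: the same track could be
    // published to the RTC network or pushed to a CDN instead.
    virtual void PublishVideo(std::shared_ptr<ICameraVideoTrack> track) = 0;
};
struct IMediaNodeFactory {
    virtual ~IMediaNodeFactory() = default;
    virtual std::shared_ptr<ICameraVideoTrack> CreateCameraTrack() = 0;
};

// High Level usage: low code, fast to launch.
void HighLevelDemo(IRtcEngine& engine) {
    engine.StartPreview();
    engine.JoinChannel("class-101");
}

// Low Level usage: the same scenario, but every stage is configurable.
void LowLevelDemo(IMediaNodeFactory& factory, ILocalUser& localUser) {
    auto camera = factory.CreateCameraTrack();
    camera->SetCaptureFormat(1280, 720, 30);
    camera->SetPreviewEnabled(true);
    localUser.PublishVideo(camera);
}
```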

As the example shows, to meet users' differentiated needs we adopt a layered design: the High Level provides ease of use for business scenarios, while the Low Level provides core functionality and flexibility.

Now for the second goal: how do we achieve flexibility and extensibility? Before that, a quick primer on video processing. Video frames are the carrier of video data. Take the local sending path as an example: after video data is captured, it passes through a series of pre-processing units, is sent to the encoder for compression, and is finally packetized and sent to the network according to the relevant protocol. The receiving path is the reverse: the video stream is received from the network, depacketized, sent to the decoder, passed through a series of post-processing units, and finally displayed by the renderer.

This serial chain of processing stages, with video frames as the data carrier, is called a video processing pipeline, and each processing unit is called a module. Each unit can have different implementations: the video source module, for instance, can be a custom (pushed) video source, a camera capture source, or a screen-sharing source. Encoders can offer different extended features depending on the coding standard and the implementation. The network sending node can send to our own RTC network or to a CDN using different protocols. Different video services are, in essence, different arrangements of these basic processing units. We want to open this flexible orchestration capability to developers as the foundation of our video processing engine, so that they can build pipelines that fit their business through free combinations of APIs.
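
Conceptually, a pipeline is just an ordered chain of modules that pass video frames downstream. The minimal C++ sketch below illustrates that idea; all of the names (VideoFrame, IVideoModule, VideoPipeline) are hypothetical, not the engine's real interfaces.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// A minimal video frame: a real engine also carries pixel format, rotation,
// plane strides, hardware buffer handles, and so on.
struct VideoFrame {
    int width = 0;
    int height = 0;
    int64_t timestampMs = 0;
    std::vector<uint8_t> data;  // e.g. I420 pixels
};

// Every processing unit (source, pre-processor, encoder wrapper, renderer, ...)
// implements the same contract, so the pipeline can be arranged freely.
struct IVideoModule {
    virtual ~IVideoModule() = default;
    virtual void ProcessFrame(VideoFrame& frame) = 0;
};

// The pipeline is simply the ordered composition of modules.
class VideoPipeline {
public:
    void AddModule(std::shared_ptr<IVideoModule> m) { modules_.push_back(std::move(m)); }
    void PushFrame(VideoFrame frame) {
        // capture -> pre-process -> encode -> send, in the configured order
        for (auto& m : modules_) m->ProcessFrame(frame);
    }
private:
    std::vector<std::shared_ptr<IVideoModule>> modules_;
};
```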

To achieve this, the core of our video processing engine adopts a microkernel architecture, which separates the parts of the engine that change from the parts that do not. As shown in the figure, there are two parts: the Core System in the middle and Pluggable Modules around it. The middle (yellow) part is the core system, corresponding to the invariants of the engine. In the core system we abstract each basic video processing unit as a module and provide unified control-plane and data-plane interfaces, plus control interfaces for assembling and flexibly orchestrating these modules. The core system also provides a set of infrastructure: video format conversion, basic video processing algorithms, a memory pool optimized for video workloads, the threading model, the logging system, the message bus, and so on.

Building on the core system's capabilities, each module can easily extend its behavior. The video source module, for example, can be a push-mode source, a pull-mode source, or even a special source: during transcoding we can feed video frames decoded from a remote user into the local sending pipeline as a new source. Pre- and post-processing modules can likewise be extended into many implementations, such as basic scaling, beautification, and watermarking. The codec module is more complex: it has to support multiple coding standards, each with multiple implementations (hardware and software). Moreover, choosing a codec is a dynamic decision, so we built an encoder selection strategy into the base codec module that dynamically selects and switches encoders based on capability negotiation, device model, and real-time encoding quality.
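
The encoder selection logic can be thought of as a small decision function over negotiated capabilities, device class, and observed encoding quality. The following is a hypothetical sketch of such a policy, not the engine's actual strategy; the inputs, thresholds, and codec choices are all assumptions for illustration.

```cpp
// Inputs a codec module might consider when picking an encoder (illustrative).
struct EncoderContext {
    bool hwEncoderAvailable = false;  // from capability negotiation
    bool lowEndDevice = false;        // from device model / benchmark data
    double observedEncodeFps = 0.0;   // real-time encoding quality feedback
    double targetFps = 30.0;
};

enum class EncoderChoice { HardwareH264, SoftwareH264, SoftwareVP8 };

// Hypothetical policy: prefer hardware when negotiation reports it, but switch
// to a software path if the hardware encoder visibly fails to keep up.
EncoderChoice SelectEncoder(const EncoderContext& ctx, EncoderChoice current) {
    if (current == EncoderChoice::HardwareH264 &&
        ctx.observedEncodeFps < 0.5 * ctx.targetFps) {
        // Hardware path is struggling: fall back to software.
        return ctx.lowEndDevice ? EncoderChoice::SoftwareVP8
                                : EncoderChoice::SoftwareH264;
    }
    if (ctx.hwEncoderAvailable) return EncoderChoice::HardwareH264;
    return ctx.lowEndDevice ? EncoderChoice::SoftwareVP8
                            : EncoderChoice::SoftwareH264;
}
```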

Next, let's see how to flexibly build video processing pipelines for different business combinations, using real application scenarios. Back to online education: suppose a complex class where one camera shoots the teacher's writing on the blackboard and another shoots the teacher's portrait. At the same time, the teacher shares courseware via screen sharing, or plays local or online multimedia files through a media player. In more advanced setups, the teacher enables background segmentation and background replacement so that their portrait is overlaid on the courseware for a better effect. The teacher can also enable local recording and save the video of the live class to a local file.

Such complex combinations are achieved by constructing the pipeline. The figure above is a conceptual diagram of the local processing pipeline. The blackboard shot, teacher portrait, and courseware sharing mentioned above are handled by dynamically replacing the capture source module. Background segmentation is a special pre-processing module that analyzes the teacher's picture in real time and overlays it on the screen-sharing source. Local recording is a special form of the renderer module, which packages local video frames into a file format and stores them at a local path. The whole media processing chain is decoupled from the final network transmission, so we can dynamically choose whether to push to our RTC network or to a CDN.
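
Building on the hypothetical IVideoModule / VideoPipeline sketch shown earlier, assembling the online-education pipeline described above might look like the following. Every module here is a named placeholder; the real engine's module types and assembly interfaces are not shown in this article.

```cpp
#include <memory>
#include <string>

// Placeholder module that only marks its position in the chain; real
// implementations would do capture, segmentation, compositing, recording,
// and network sending.
class StubModule : public IVideoModule {
public:
    explicit StubModule(std::string name) : name_(std::move(name)) {}
    const std::string& name() const { return name_; }
    void ProcessFrame(VideoFrame& /*frame*/) override { /* real work goes here */ }
private:
    std::string name_;
};

VideoPipeline BuildOnlineClassPipeline() {
    VideoPipeline p;
    // The capture source can be swapped at runtime: blackboard camera,
    // portrait camera, screen-shared courseware, or a media player.
    p.AddModule(std::make_shared<StubModule>("screen_share_source"));
    p.AddModule(std::make_shared<StubModule>("background_segmentation"));  // special pre-processor
    p.AddModule(std::make_shared<StubModule>("overlay_compositor"));       // teacher portrait over courseware
    p.AddModule(std::make_shared<StubModule>("local_recorder"));           // recording = special renderer
    p.AddModule(std::make_shared<StubModule>("rtc_or_cdn_sender"));        // publishing decoupled from processing
    return p;
}
```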

Next, let's look at composite application scenarios on the receiving side. We have a back-end media processing center whose business processors handle real-time streams according to user needs, including cloud recording (dumping received video to storage), cloud content moderation, low-bitrate HD post-processing, composite-image transcoding services, and so on. There is also the Cloud Player, which pulls remote video and pushes it into RTC channels, and bypass streaming, which pushes video streams received in our RTC network out to a CDN.

So let’s look at how we can build the receiving pipeline to meet different application scenarios.

The first module is the network receiving source, which can dynamically switch between receiving video streams from the RTN network and from a CDN. After the decoder module, frames are sent to a series of post-processing modules, including the content moderation module, the low-bitrate HD post-processing module, and so on. The number and position of renderer modules can be flexibly customized; cloud recording, for example, is essentially a special renderer module.
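
A receive-side assembly along these lines might be sketched as follows, reusing the hypothetical IVideoModule / VideoFrame types from the earlier sketch. The transport switch, post-processor list, and renderer-like sinks are illustrative assumptions, not the engine's real classes.

```cpp
#include <memory>
#include <vector>

// The receiving source can switch transports at runtime (RTN <-> CDN).
enum class ReceiveTransport { RtnNetwork, Cdn };

class NetworkReceiveSource {
public:
    void SwitchTransport(ReceiveTransport t) { transport_ = t; }
    ReceiveTransport transport() const { return transport_; }
private:
    ReceiveTransport transport_ = ReceiveTransport::RtnNetwork;
};

class ReceivePipeline {
public:
    explicit ReceivePipeline(std::shared_ptr<NetworkReceiveSource> src)
        : source_(std::move(src)) {}
    // Post-processors run after decoding: content moderation, low-bitrate HD, etc.
    void AddPostProcessor(std::shared_ptr<IVideoModule> m) { postProcessors_.push_back(std::move(m)); }
    // Sinks are renderer variants: an on-screen renderer, a cloud recorder, ...
    void AddSink(std::shared_ptr<IVideoModule> m) { sinks_.push_back(std::move(m)); }
    void OnDecodedFrame(VideoFrame frame) {
        for (auto& m : postProcessors_) m->ProcessFrame(frame);
        for (auto& s : sinks_) s->ProcessFrame(frame);
    }
private:
    std::shared_ptr<NetworkReceiveSource> source_;
    std::vector<std::shared_ptr<IVideoModule>> postProcessors_;
    std::vector<std::shared_ptr<IVideoModule>> sinks_;
};
```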

So far we have seen how the microkernel architecture achieves the goal of flexible extension: each module's functionality can be expanded quickly, and video processing pipelines can be assembled like building blocks to orchestrate services flexibly. Next, let's look at the goal of being fast and reliable. By this we mean that the core system should provide rich and stable functionality, on top of which we can greatly reduce developers' mental burden and improve R&D efficiency.

Before that, consider what a developer would have to think about to build a beauty plug-in on our pipeline from scratch, without a solid core system.

First, of course, is the beautification logic itself. Beyond that, when integrating with the pipeline, the developer has to consider whether the module will be loaded at the correct position, how the modules before it affect it, and how it affects the modules after it.

Second is the data format problem. When the format flowing through the pipeline is not the one the beauty module needs, the data has to be converted, and that conversion logic would also have to be written by the module's developer.

Then comes integration with the pipeline itself, which requires understanding its threading model and memory management. To match the pipeline's state transitions, the beauty module has to implement its own state-control logic. If a downstream node sends feedback upstream based on video quality, for example asking the module to reduce its throughput, the module needs a mechanism to receive and handle messages from downstream. And when the beauty plug-in is running, it may need to send notifications to the user, which requires designing a message notification mechanism.

Because the core system of the SDK already provides all of these capabilities, plug-in development becomes very simple: the plug-in author only needs to implement the interfaces defined by the core system's interface protocol, and the core system automatically loads the plug-in into the correct position based on its declared function and on global performance considerations. Our SDK users can then use the plug-in directly.
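
The plug-in contract described here can be pictured as a small extension interface: the plug-in declares what it is and roughly where it belongs, and the core system handles placement, format conversion, threading, and messaging. The names below are hypothetical and only illustrate the shape of such a contract; they are not the SDK's real extension API.

```cpp
#include <string>

// Where the plug-in logically belongs in the pipeline.
enum class StageHint { PreProcess, PostProcess };

struct PluginDescriptor {
    std::string name;
    StageHint stage;
    std::string preferredFormat;  // e.g. "NV12"; the core system converts if needed
};

// Hypothetical extension contract: the plug-in only implements frame processing
// and declares its characteristics.
struct IVideoPlugin {
    virtual ~IVideoPlugin() = default;
    virtual PluginDescriptor Describe() const = 0;
    virtual void ProcessFrame(VideoFrame& frame) = 0;
    // Quality feedback from downstream nodes is delivered by the core system;
    // the plug-in just reacts to it.
    virtual void OnQualityFeedback(double /*loadHeadroom*/) {}
};

// A beauty filter only has to implement its own business logic.
class BeautyPlugin : public IVideoPlugin {
public:
    PluginDescriptor Describe() const override {
        return {"beauty", StageHint::PreProcess, "NV12"};
    }
    void ProcessFrame(VideoFrame& frame) override {
        (void)frame;  // ... smoothing / whitening on frame.data ...
    }
    void OnQualityFeedback(double loadHeadroom) override {
        // Degrade gracefully when the downstream nodes report little headroom.
        strength_ = loadHeadroom < 0.5 ? 0.3 : 0.8;
    }
private:
    double strength_ = 0.8;
};
```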

To sum up, for the "fast and reliable" goal, the rich and powerful functionality of the core system greatly reduces module developers' mental burden and thereby improves R&D efficiency.

Finally, let's look at the goal of excellent, monitorable performance. First, we optimized data transfer efficiency across the whole video processing pipeline on mobile: the full link supports the native data formats of mobile platforms, including the capture, rendering, and pre-processing modules, so that the entire processing link can be zero-copy when the hardware allows it. At the same time, based on negotiation of each module's processing characteristics, the position of modules on the pipeline can be optimized to reduce crossings between CPU and GPU and further improve data transfer efficiency.
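
One way to picture the "reduce CPU/GPU crossings" idea: if each module declares the memory domain it prefers to work in, the pipeline can count the transitions a given ordering would force and use that as a cost signal when placing a negotiable module. This is a toy sketch under that assumption, not the engine's actual placement algorithm.

```cpp
#include <cstddef>
#include <vector>

// Each module declares where it wants to touch pixels.
enum class MemoryDomain { CpuBuffer, GpuTexture };

// Count how many CPU<->GPU copies a given module ordering would force.
int CountDomainCrossings(const std::vector<MemoryDomain>& chain) {
    int crossings = 0;
    for (std::size_t i = 1; i < chain.size(); ++i)
        if (chain[i] != chain[i - 1]) ++crossings;
    return crossings;
}

// Example: camera(GPU) -> beauty(GPU) -> watermark(CPU) -> encoder(GPU)
// costs 2 crossings; moving the watermark to a GPU implementation, or next to
// other CPU-side stages, reduces that cost.
```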

In addition, as mentioned earlier, the basic video processing units separate the control plane from the data plane to a certain extent, which brings real benefits: module control gets a timely response. Operations on devices such as cameras are relatively heavy; when a user frequently switches between the front and rear cameras, such operations would otherwise block the UI and cause long delays. By separating the control plane from the data plane, we achieve fast, responsive camera operation while still guaranteeing the correctness of the final state, and the control path no longer blocks the flow of data. Because the control path does not block the data flow, we can also edit and publish composited sources in real time.
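
The camera-switch example can be sketched as a control queue serviced off the UI thread, so the UI call returns immediately and the data path is never blocked while the heavy device operation runs. This is a minimal, self-contained sketch of that pattern; the class name and interface are assumptions, not the engine's real control-plane code.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Control commands are queued and executed on a dedicated control thread, so
// UI calls (e.g. "switch camera") return immediately and never block frame flow.
class ControlPlane {
public:
    ControlPlane() : worker_([this] { Run(); }) {}
    ~ControlPlane() {
        { std::lock_guard<std::mutex> lk(mu_); stop_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    void Post(std::function<void()> cmd) {  // called from the UI thread
        { std::lock_guard<std::mutex> lk(mu_); queue_.push(std::move(cmd)); }
        cv_.notify_one();
    }
private:
    void Run() {
        for (;;) {
            std::function<void()> cmd;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stop_ || !queue_.empty(); });
                if (stop_ && queue_.empty()) return;
                cmd = std::move(queue_.front());
                queue_.pop();
            }
            cmd();  // heavy device operation runs here, off the UI and data paths
        }
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    bool stop_ = false;
    std::thread worker_;
};

// Usage: controlPlane.Post([] { /* close front camera, open rear camera */ });
```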

To reduce system resource consumption, we built a memory pool tailored to video data storage formats. It supports inter-frame memory reuse across multiple video formats and dynamically adjusts toward a balanced state according to system memory usage and pipeline load, reducing frequent memory allocation and release and thereby lowering CPU usage.
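
A frame pool of this kind can be pictured as a free list keyed by buffer size that recycles buffers instead of reallocating one per frame. This is a minimal sketch of the idea; the real pool also grows and shrinks with memory pressure and pipeline load, which is only hinted at by the fixed cap here.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

// Buffers of the same size are recycled instead of being reallocated for
// every frame, cutting allocation/free churn on the hot path.
class FramePool {
public:
    std::shared_ptr<std::vector<uint8_t>> Acquire(std::size_t bytes) {
        std::lock_guard<std::mutex> lk(mu_);
        auto& freeList = free_[bytes];
        if (!freeList.empty()) {
            auto buf = std::move(freeList.back());
            freeList.pop_back();
            return buf;
        }
        return std::make_shared<std::vector<uint8_t>>(bytes);
    }
    void Release(std::shared_ptr<std::vector<uint8_t>> buf) {
        std::lock_guard<std::mutex> lk(mu_);
        auto& freeList = free_[buf->size()];
        if (freeList.size() < kMaxPerSize) freeList.push_back(std::move(buf));
        // else: drop the buffer, letting the pool shrink under low demand
    }
private:
    static constexpr std::size_t kMaxPerSize = 16;
    std::mutex mu_;
    std::unordered_map<std::size_t,
                       std::vector<std::shared_ptr<std::vector<uint8_t>>>> free_;
};
```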

Finally, to close the feedback loop for performance optimization, we implemented a full-link performance and quality monitoring mechanism. For each basic video processing unit, the resolution and frame rate of incoming and outgoing frames, plus module-specific metrics, are computed and reported. At the system level, time-consuming operations are also monitored and reported. Depending on the investigation at hand, we write this data into the user's local logs as needed and report experience-quality data to the online quality monitoring system, enabling rapid problem localization and performance feedback.
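
Per-module monitoring can be as simple as counting the frames entering and leaving each unit and deriving resolution and frame rate over a reporting window, which a reporter then flushes to local logs or to the online quality system. The structure below is a hypothetical sketch of such counters, not the engine's real metrics schema.

```cpp
#include <chrono>
#include <cstdint>
#include <string>

// Per-module counters: frames in/out, last observed resolution, and a coarse
// output frame rate derived over the current reporting window.
struct ModuleStats {
    std::string moduleName;
    uint64_t framesIn = 0;
    uint64_t framesOut = 0;
    int lastWidth = 0;
    int lastHeight = 0;
    std::chrono::steady_clock::time_point windowStart =
        std::chrono::steady_clock::now();

    void OnFrameIn(int w, int h) { ++framesIn; lastWidth = w; lastHeight = h; }
    void OnFrameOut()            { ++framesOut; }

    // Average output fps since the window started; the reporter resets the
    // window after each flush.
    double OutputFps() const {
        auto secs = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - windowStart).count();
        return secs > 0 ? framesOut / secs : 0.0;
    }
};
```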

To sum up, for excellent and monitorable performance, we first optimized the mobile data processing link and separated the control plane from the data plane, improving overall video data transfer efficiency. We then built memory pools tailored to video processing to reduce system resource consumption. Finally, we implemented a full-link video quality monitoring mechanism to close the feedback loop on performance optimization.

We are now in the rollout and refinement phase of the next-generation video processing engine, and its architectural advantages have already shown up in practice. Let's look at a real application case.

The next-generation video engine is highly flexible and extensible. On top of it, by combining business modules, we built a general framework for composite-image transcoding, which lets us respond quickly to all kinds of compositing requirements on both the client and the server. Online video dating, for example, is a typical real-time multi-person interaction scenario. Traditionally, each guest has to subscribe to the matchmaker's stream and every other guest's stream, which puts great pressure on downlink bandwidth and device performance. To solve this, we quickly applied the general composite-transcoding framework to launch a cloud compositing service: the videos of the guests and the matchmaker are composited in the cloud and then pushed to the audience. The layout, the background image, and the display strategy when a guest's video is interrupted can all be customized to the business: show the last frame, a background image, a placeholder image, and so on.
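
The customizable layout and interruption-handling behavior described above can be represented as a small configuration structure consumed by the compositing service. The fields below are illustrative assumptions, not the actual service API.

```cpp
#include <string>
#include <vector>

// What to show for a participant whose video stream is interrupted.
enum class InterruptPolicy { LastFrame, BackgroundImage, Placeholder };

// One region of the composite canvas assigned to a participant.
struct LayoutRegion {
    std::string userId;
    int x = 0, y = 0, width = 0, height = 0;
    int zOrder = 0;
    InterruptPolicy onInterrupt = InterruptPolicy::LastFrame;
};

// Overall composite-transcoding job: canvas, background, and per-user regions.
struct CompositeLayout {
    int canvasWidth = 1280;
    int canvasHeight = 720;
    std::string backgroundImageUrl;     // optional background picture
    std::vector<LayoutRegion> regions;  // matchmaker + guests
};
```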

Applying the same composite-transcoding framework locally, we implemented real-time local video editing and mixing, which can be used in e-commerce live streaming and similar scenarios. The host can mix all kinds of local sources, such as multiple cameras, multiple screen shares, and a media player, together with remote users' real-time video and various image materials, and push the composited stream out.

That's all for today's sharing. Thank you!