On May 15, Gao Chun, a senior architect at Agora, attended the WebAssembly Meetup, the first offline event held by the WebAssembly community, and shared practical experience in applying real-time video portrait segmentation on the Web. The following is a summary of the talk.

The RTC industry has developed rapidly in recent years, with online education and video conferencing flourishing. These scenarios also place higher demands on the technology. As a result, machine learning is increasingly applied to real-time audio and video scenarios, such as super resolution, beauty filters, and real-time voice beautification. The same requirements exist on the Web, and they are a challenge for all audio and video developers. Fortunately, WebAssembly offers the potential for high-performance computing on the Web. We explored and practiced the application of portrait segmentation on the Web.

In what scenarios is video portrait segmentation used?

When it comes to portrait segmentation, the first application scenario that comes to mind is green-screen matting in film and television production. After the video is shot in a green-screen environment, the background is replaced with computer-generated scenes in post-production.

Another application scenario you have probably seen is on Bilibili: some bullet comments (danmaku) do not block the figures on screen, and the text passes behind the people. This is also based on portrait segmentation technology.

The portrait segmentation applications listed above are all implemented on the server side and are not real-time audio/video scenarios.

The portrait segmentation technology we built targets real-time audio and video scenarios such as video conferencing and online teaching. With it, we can blur the background of a video or replace the background entirely.

Why do these real-time scenarios need this technology?

A recent study found that out of an average 38-minute conference call, a full 13 minutes are wasted dealing with distractions and interruptions. Online interviews, presentations, employee training courses, brainstorming sessions, sales pitches, IT assistance, customer support, and webinars all face the same problem. Using background blur or one of many virtual background options, custom or preset, can greatly reduce these distractions.

In another survey, 23% of U.S. workers said video conferencing made them uncomfortable, and 75% said they still prefer voice conferencing to video conferencing. This is because people do not want to expose their living environment and privacy to public view. Replacing the video background solves this problem.

At present, portrait segmentation and virtual backgrounds in real-time audio and video scenarios mostly run on native clients. Only Google Meet has used WebAssembly for real-time portrait segmentation in Web video. Agora's implementation combines machine learning, WebAssembly, WebGL, and other technologies.

Implementing real-time virtual backgrounds for Web video

Technical components and real-time processing flow of portrait segmentation

To do portrait segmentation on the Web, we need the following components:

WebRTC: audio and video capture and transmission.
TensorFlow: the framework for the portrait segmentation model.
WebAssembly: implementation of the portrait segmentation algorithm.
WebGL: GLSL implementations of the image processing applied after segmentation.
Canvas: rendering the final video and image results.
Agora Web SDK: real-time audio and video transmission.

The real-time processing flow for portrait segmentation works like this. The first step is capture with the W3C MediaStream API. The data is then handed to the WebAssembly engine for inference. Because machine learning is computationally expensive, the input should not be too large, so the video frame is scaled and normalized before being fed to the machine learning framework. The WebAssembly output then undergoes some post-processing before being passed to WebGL, which uses this information together with the original video to perform filtering, compositing, and other processing and generate the final result. The result is drawn to a Canvas and then transmitted in real time via the Agora Web SDK.
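To make the flow above concrete, here is a minimal per-frame sketch. The segmentation and compositing helpers (runSegmentation, compositeWithWebGL) are hypothetical placeholders, and the inference resolution is only an example value; only the Canvas/MediaStream/Agora SDK calls are standard APIs.

```typescript
import AgoraRTC from "agora-rtc-sdk-ng"; // Agora Web SDK

declare function runSegmentation(pixels: ImageData): Promise<Uint8Array>;                        // hypothetical Wasm wrapper
declare function compositeWithWebGL(video: HTMLVideoElement, mask: Uint8Array, out: HTMLCanvasElement): void; // hypothetical

async function startVirtualBackground(video: HTMLVideoElement, canvas: HTMLCanvasElement) {
  // 1. Capture: the <video> element is already fed by getUserMedia (MediaStream API).
  const scratch = document.createElement("canvas");
  scratch.width = 160;  // downscaled inference resolution (example value)
  scratch.height = 96;
  const ctx = scratch.getContext("2d")!;

  const renderFrame = async () => {
    // 2. Preprocess: downscale the current frame before inference.
    ctx.drawImage(video, 0, 0, scratch.width, scratch.height);
    const small = ctx.getImageData(0, 0, scratch.width, scratch.height);

    // 3. Inference in WebAssembly, then 4. post-processing and 5. WebGL compositing.
    const mask = await runSegmentation(small);
    compositeWithWebGL(video, mask, canvas);

    requestAnimationFrame(renderFrame);
  };
  requestAnimationFrame(renderFrame);

  // 6. Publish the composited canvas through the Agora Web SDK.
  return AgoraRTC.createCustomVideoTrack({
    mediaStreamTrack: canvas.captureStream(30).getVideoTracks()[0],
  });
}
```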

Choice of machine learning framework

Before doing this kind of portrait segmentation, we had to consider whether a suitable machine learning framework already existed. Currently available options include onnx.js, TensorFlow.js, keras.js, MIL WebDNN, and more. They all adopt WebGL or WebAssembly as their computing backends. However, when trying these frameworks, I found some problems: 1. Lack of necessary protection for the model files. Typically, the browser loads the model from the server at runtime, so the model is exposed directly on the browser client. This is not conducive to intellectual property protection.

2. The I/O design of generic JS frameworks does not consider the actual scenario. For example, TensorFlow.js takes a generic array as input, wraps the content into an input tensor when the computation runs, and then hands it to WebAssembly or uploads it as a WebGL texture for processing (a sketch of this generic path follows the list below). The process is relatively complex, and performance cannot be guaranteed when processing video data with high real-time requirements.

3. Operator support is incomplete. Generic frameworks more or less lack the operators needed to process video data.
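For reference, this is roughly what the generic TensorFlow.js path looks like; the model URL and the input size are hypothetical, and the point is only to illustrate the extra array/tensor/texture hops mentioned in point 2 above.

```typescript
import * as tf from "@tensorflow/tfjs";

// Hypothetical model URL; the browser fetches the model file at runtime,
// which is also the IP-protection issue described in point 1.
const modelPromise = tf.loadGraphModel("https://example.com/segmentation/model.json");

async function segmentGeneric(video: HTMLVideoElement): Promise<Float32Array> {
  const model = await modelPromise;
  // fromPixels copies the frame into a tensor (backed by a CPU array or a WebGL
  // texture), which is then resized and normalized before inference.
  const output = tf.tidy(() => {
    const frame = tf.browser.fromPixels(video);               // HxWx3 uint8 tensor
    const small = tf.image.resizeBilinear(frame, [96, 160]);  // downscale for inference
    const input = small.toFloat().div(255).expandDims(0);     // normalize + batch dim
    return model.predict(input) as tf.Tensor;                 // segmentation output
  });
  const mask = (await output.data()) as Float32Array;         // read back to JS
  output.dispose();
  return mask;
}
```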

To solve these problems, our strategy is as follows: 1. Implement the Wasm port of the native machine learning framework.

2. For operators that are not implemented, we complement them through customization.

3. In terms of performance, we used SIMD (single instruction, multiple data) instructions and multithreading for optimization.

Video data preprocessing

Data preprocessing requires scaling the image. There are two ways to do this on the front end: one uses Canvas2D and the other uses WebGL. With Canvas2D, you call drawImage() to draw the contents of the video element onto the canvas at the target size, and then getImageData() to read back the pixel data of the scaled image. With WebGL, the video element can be uploaded directly as a texture, and the scaled pixel data can then be read back from the framebuffer. We tested the performance of both methods. As shown in the figure below, in an x86_64 Windows 10 environment and in two browsers, we measured the preprocessing time of video at three resolutions with Canvas2D and WebGL respectively. This gives a good idea of how to preprocess video at different resolutions.
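The Canvas2D path is already sketched in the pipeline example above; for the WebGL path, a minimal sketch looks like the following. The pass-through shader setup is omitted behind a hypothetical helper (setupPassthroughProgram), and the rest is standard WebGL.

```typescript
// Upload the video element as a texture, draw it into a small viewport,
// and read the scaled pixels back from the framebuffer.
declare function setupPassthroughProgram(gl: WebGLRenderingContext): void; // hypothetical helper

function preprocessWithWebGL(video: HTMLVideoElement, width: number, height: number): Uint8Array {
  const canvas = document.createElement("canvas");
  canvas.width = width;
  canvas.height = height;
  const gl = canvas.getContext("webgl")!;
  setupPassthroughProgram(gl);

  // The video element itself can be used as the texture source.
  const tex = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_2D, tex);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.LINEAR);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, video);

  // Draw the texture into the small viewport (this is the scaling step).
  gl.viewport(0, 0, width, height);
  gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);

  // Read the scaled pixels back from the framebuffer.
  const pixels = new Uint8Array(width * height * 4);
  gl.readPixels(0, 0, width, height, gl.RGBA, gl.UNSIGNED_BYTE, pixels);
  return pixels;
}
```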

Web Workers and multithreading issues

Because Wasm computation is expensive, it blocks the JS main thread. Moreover, in some special situations, for example, when you walk into a coffee shop with no power outlet nearby, the device enters low-power mode and the CPU slows down, which can cause dropped frames in video processing. So we need to improve performance. In this case, we use Web Workers: running the machine learning inference on a Web Worker effectively reduces blocking of the JS main thread. Usage is relatively simple: the main thread creates a Web Worker, which runs on another thread, and sends it messages via worker.postMessage() for the worker to consume. (See the following code example.)
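A minimal sketch of this pattern, assuming a hypothetical segment.worker.js that wraps the Wasm inference; the message shapes and helper names are invented for illustration.

```typescript
// main.ts — create the worker and hand it the preprocessed frame data.
declare function drawMask(mask: Uint8Array): void; // hypothetical: hands the mask to the WebGL step

const worker = new Worker("segment.worker.js");

worker.onmessage = (event: MessageEvent<{ mask: Uint8Array }>) => {
  // Receive the segmentation result computed on the worker thread.
  drawMask(event.data.mask);
};

function submitFrame(frame: ImageData) {
  // postMessage copies `frame` to the worker using the structured clone algorithm.
  worker.postMessage({ frame });
}
```

```typescript
// segment.worker.ts — runs on another thread and wraps the Wasm inference.
declare function runSegmentation(frame: ImageData): Uint8Array; // hypothetical Wasm wrapper

self.onmessage = (event: MessageEvent<{ frame: ImageData }>) => {
  const mask = runSegmentation(event.data.frame);
  // Send the result back to the main thread (also a structured-clone copy).
  self.postMessage({ mask });
};
```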

But it may also introduce some new problems:

Structured clone overhead from postMessage data transfers

Shared memory brings resource contention and Web engine compatibility issues

We analyzed both problems. When transferring data, if your data is a JS primitive type, ArrayBuffer, ArrayBufferView, ImageData, File/FileList/Blob, or Boolean/String/Object/Map/Set, then postMessage performs a deep copy using the structured clone algorithm. We ran performance tests for data transfers between the JS main thread and Web Workers, and between different pages. As shown in the figure below, the test environment is an x86_64 Windows 10 computer. The test results are as follows:

The preprocessed data is less than 200 KB, so as the comparison above shows, the time cost is less than 1.68 ms. This overhead is almost negligible. If you want to avoid the structured copy entirely, use SharedArrayBuffer. As the name implies, SharedArrayBuffer shares a memory region between the main thread and the Worker so that both can access the data at the same time.

However, as with all shared memory approaches (including native ones), SharedArrayBuffer is subject to resource contention, so JS needs an additional mechanism to handle it; Atomics in JavaScript was created to solve exactly this problem. We also tried SharedArrayBuffer for portrait segmentation and found that it causes some problems. The first is compatibility: currently, SharedArrayBuffer is available only in Chrome 67 and above. Before 2018, SharedArrayBuffer was supported in both Chrome and Firefox, but after Meltdown and Spectre, two critical CPU flaws that broke data isolation between processes, were disclosed, it was disabled in both browsers. Only after Chrome 67 introduced site isolation was SharedArrayBuffer allowed again. The other problem is development difficulty: introducing Atomics to handle resource contention makes front-end development as hard as multithreaded programming in native languages.
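A minimal sketch of the shared-memory pattern, assuming the worker writes the mask into a shared buffer; the buffer layout (a 4-byte flag followed by the mask) and all names are invented for illustration.

```typescript
// main.ts — allocate shared memory once; postMessage sends only the handle, not a copy.
declare function drawMask(mask: Uint8Array): void;     // hypothetical WebGL step

const MASK_BYTES = 160 * 96;                           // example mask size
const shared = new SharedArrayBuffer(4 + MASK_BYTES);  // 4 bytes reserved for a status flag
const flag = new Int32Array(shared, 0, 1);             // 0 = idle, 1 = result ready
const mask = new Uint8Array(shared, 4, MASK_BYTES);

const worker = new Worker("segment.worker.js");        // hypothetical worker script
worker.postMessage({ shared });                        // both sides now see the same memory

function pollMask() {
  // Atomics gives race-free reads and writes on the shared flag.
  if (Atomics.load(flag, 0) === 1) {
    drawMask(mask);
    Atomics.store(flag, 0, 0);                         // mark the result as consumed
  }
  requestAnimationFrame(pollMask);
}
requestAnimationFrame(pollMask);

// Worker side (sketch): write the result into new Uint8Array(shared, 4, MASK_BYTES),
// then Atomics.store(flag, 0, 1) and Atomics.notify(flag, 0) to signal readiness.
```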

Functions and implementation strategy of the WebAssembly module

WebAssembly is mainly responsible for the portrait segmentation itself. The main functions to be implemented and their implementation strategies are as follows:

Machine learning models can run on different vector and matrix computation frameworks. TensorFlow, for example, has three such backends: XNNPACK, Eigen, and Ruy, and they perform differently on different platforms. We tested this too; the results in an x86_64 Windows 10 environment are shown below. You can clearly see that XNNPACK performs best in our processing scenario, because it is a framework optimized specifically for floating-point operations.

We show only the results on x86 here, which do not represent the final results on all platforms. Because Ruy is TensorFlow's default computation backend on mobile platforms, it is better optimized for the ARM architecture, so we tested on different platforms as well; I won't share all of those results here.

WASM multithreading

Enabling Wasm multithreading maps pthreads onto Web Workers and pthread mutex calls onto Atomics methods. After multithreading is enabled, the portrait segmentation scenario peaks at 4 threads, with a performance improvement of 26%; using more threads is not always better because of scheduling overhead.

Finally, after the portrait segmentation step, we perform image filtering, jitter elimination, and picture compositing in WebGL, and get the effect shown below.
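As a rough illustration of the compositing step (not our production shader), a fragment shader can mix the camera frame and the virtual background per pixel, using the segmentation mask as the mixing weight:

```typescript
// Hypothetical fragment shader sketch for the WebGL compositing pass.
const compositeFragmentShader = `
  precision mediump float;
  varying vec2 vTexCoord;
  uniform sampler2D uFrame;       // original camera frame
  uniform sampler2D uBackground;  // virtual background (or blurred frame)
  uniform sampler2D uMask;        // segmentation mask from the Wasm module

  void main() {
    float person = texture2D(uMask, vTexCoord).r;  // 1.0 = person, 0.0 = background
    vec4 fg = texture2D(uFrame, vTexCoord);
    vec4 bg = texture2D(uBackground, vTexCoord);
    gl_FragColor = mix(bg, fg, person);            // per-pixel composite
  }
`;
```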

Conclusion

There are still some pain points in working with WebAssembly. First, as mentioned above, we optimize computation with SIMD instructions, but WebAssembly's SIMD support currently covers only 128-bit data widths. Many people in the community have suggested that support for 256-bit AVX2 and 512-bit AVX-512 instructions would further improve parallel computing performance. Second, WebAssembly currently has no direct access to the GPU; if it could call OpenGL ES more directly, the JSBridge overhead of going from OpenGL ES to WebGL could be avoided. Third, WebAssembly does not yet have direct access to audio and video data, so data captured from the camera and microphone has to go through extra processing steps before reaching Wasm.

Finally, for portrait segmentation on the Web, we summarize the following points:

WebAssembly is one of the right ways to use machine learning on the Web platform.
In certain cases, the performance gains from enabling SIMD and multithreading are significant.
When the underlying computation performance and algorithm design are poor, the gains from SIMD and multithreading are insignificant.
WebAssembly output data should stay compatible with WebGL texture sampling formats.
When using WebAssembly for real-time video processing, identify the key overheads in the overall Web processing flow and optimize them to improve end-to-end performance.

If you want to learn more about practical experience with Web-side portrait segmentation, please visit rtcdeveloper.com to post and communicate with us. For more technical practices, visit agora.io/cn/community.