WebRTC (Web Real-Time Communication) is an open-source technology that lets web browsers conduct real-time voice and video conversations. By removing the technical barriers to Internet audio and video communication, it is gradually becoming a global standard.
Over the past decade, thanks to the contributions of many developers, its application scenarios have grown ever wider and richer. Where will WebRTC go in the era of artificial intelligence? This article discusses directions for combining WebRTC with artificial intelligence, along with the innovation practice of Rongyun, a global Internet communication cloud service provider.
WebRTC + artificial intelligence: more natural sound, clearer video
Artificial intelligence is being applied ever more widely in audio and video. In audio, it is mainly used for noise suppression and echo cancellation; in video, it is used more for virtual backgrounds, video super-resolution, and the like.
AI speech noise reduction
Speech noise reduction has a history of many years; the earliest systems used analog circuits for noise reduction. With the development of digital circuits, noise-reduction algorithms replaced them and greatly improved speech quality. These classical algorithms estimate the noise based on statistical theory and can remove stationary noise cleanly. Against non-stationary noise, however, such as keyboard clicks, a bumped table, or cars passing on the road, they are powerless.
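To make the classical approach concrete, here is a minimal, purely illustrative sketch of spectral subtraction (not WebRTC's actual ANS code): a noise magnitude spectrum is estimated from noise-only frames and subtracted from every frame. Stationary noise is removed cleanly, but a transient burst such as a keyboard click is absent from the estimate and survives.

```python
def estimate_noise(frames):
    """Average magnitude per frequency bin over noise-only frames."""
    bins = len(frames[0])
    return [sum(f[b] for f in frames) / len(frames) for b in range(bins)]

def spectral_subtract(frame, noise, floor=0.0):
    """Subtract the noise estimate from one frame's magnitude spectrum."""
    return [max(m - n, floor) for m, n in zip(frame, noise)]

# Stationary noise: magnitude ~2.0 in every bin of every frame.
noise_frames = [[2.0, 2.0, 2.0, 2.0] for _ in range(10)]
noise_est = estimate_noise(noise_frames)

speech_frame = [5.0, 2.0, 2.0, 7.0]              # speech energy in bins 0 and 3
clean = spectral_subtract(speech_frame, noise_est)
print(clean)                                      # stationary noise removed

transient_frame = [2.0, 9.0, 2.0, 2.0]           # keyboard click in bin 1
print(spectral_subtract(transient_frame, noise_est))  # the burst survives
```

This is exactly the weakness the next paragraph addresses: a statistical estimate averaged over time cannot track noise that changes faster than the estimate does.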
AI speech noise reduction emerged to fill this gap. Built on large speech corpora, carefully designed algorithms, and continuous training, it eliminates the tedious and ambiguous manual parameter tuning of classical methods. AI noise reduction has a natural advantage against non-stationary noise: it can learn the characteristics of such noise and suppress it specifically.
Echo cancellation
Echo is the loudspeaker's output that, after attenuation and delay, is picked up again by the microphone. Before sending audio, the unwanted echo must be removed from the voice stream. WebRTC's linear filter uses frequency-domain partitioned-block adaptive processing, but multi-party calls are not handled very carefully, and a Wiener filter is used for nonlinear residual echo suppression.
Combined with artificial intelligence, linear and nonlinear echo can be removed directly with deep-learning methods, speech separation, and carefully designed neural network architectures.
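The linear part of echo cancellation rests on adaptive filtering: the canceller learns the loudspeaker-to-microphone path from the far-end signal and subtracts its echo estimate. The sketch below uses a normalized LMS (NLMS) filter to illustrate the principle; it is a simplified stand-in, not WebRTC's actual frequency-domain AEC3 implementation.

```python
import random

def nlms_echo_cancel(far, mic, taps=4, mu=0.5, eps=1e-8):
    """Normalized LMS adaptive filter: model the loudspeaker->microphone
    path from the far-end signal and subtract the echo estimate."""
    w = [0.0] * taps
    out = []
    for n in range(len(mic)):
        x = [far[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * xk for wk, xk in zip(w, x))   # echo estimate
        e = mic[n] - y                             # error = echo-free output
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        out.append(e)
    return out

random.seed(0)
far = [random.uniform(-1, 1) for _ in range(2000)]
# Simulated echo path: attenuated, delayed copy of the far-end signal.
mic = [0.6 * (far[n - 2] if n >= 2 else 0.0) for n in range(len(far))]
residual = nlms_echo_cancel(far, mic)
print(max(abs(e) for e in residual[-200:]))  # near zero once converged
```

An adaptive filter like this handles only the linear path; nonlinear distortion from the speaker hardware is what the Wiener filter, or the deep-learning methods above, must clean up afterwards.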
Virtual background
Virtual background relies on segmentation: the foreground of the picture is separated out and the background is replaced with another image. The main application scenarios include live streaming, real-time communication, and interactive entertainment; the core techniques are image segmentation and video segmentation. A typical example is shown in Figure 1.
(Figure 1: the black background in the upper image is replaced by a purple background in the lower image)
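Once a segmentation model has produced a per-pixel foreground mask, the replacement step itself is a simple composite. The sketch below shows that step on toy data; in practice the mask comes from a neural segmentation network and the compositing runs on real image buffers, often with a soft (alpha) mask to smooth edges.

```python
def apply_virtual_background(frame, mask, background):
    """Composite: keep frame pixels where mask == 1 (foreground/person),
    replace the rest with the virtual background image (same size assumed)."""
    return [
        [frame[y][x] if mask[y][x] else background[y][x]
         for x in range(len(frame[0]))]
        for y in range(len(frame))
    ]

# Toy 2x2 "image": strings stand in for pixel values.
frame  = [["fg", "bg"],
          ["bg", "fg"]]
mask   = [[1, 0],           # 1 = person pixel, 0 = background pixel
          [0, 1]]
purple = [["purple", "purple"],
          ["purple", "purple"]]
print(apply_virtual_background(frame, mask, purple))
# [['fg', 'purple'], ['purple', 'fg']]
```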
Video super-resolution
Video super-resolution makes blurry video clear: under limited bandwidth, a low-bitrate, low-quality video is transmitted and then restored to high-definition video through image super-resolution, which is of great significance in WebRTC. A typical image is shown in Figure 2. Even with limited bandwidth, high-resolution video can still be obtained by transmitting a low-resolution stream.
(Figure 2: original low-resolution image vs. processed high-resolution image)
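The pipeline can be sketched end to end: the sender downscales before encoding, the receiver upscales after decoding. In the sketch below, 2x2 average pooling stands in for low-resolution encoding and nearest-neighbour interpolation stands in for the receiver-side upscaler; a real super-resolution system replaces that last step with a trained network that recovers detail, not just size.

```python
def downsample_2x(img):
    """Sender side: 2x2 average pooling stands in for encoding at low resolution."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x+1] + img[y+1][x] + img[y+1][x+1]) / 4
             for x in range(0, w, 2)] for y in range(0, h, 2)]

def upsample_2x(img):
    """Receiver side: nearest-neighbour upsampling. A trained super-resolution
    network would replace this step to restore detail."""
    return [[img[y // 2][x // 2] for x in range(2 * len(img[0]))]
            for y in range(2 * len(img))]

img = [[10, 10, 50, 50],
       [10, 10, 50, 50],
       [90, 90, 30, 30],
       [90, 90, 30, 30]]
low = downsample_2x(img)                 # quarter the pixels to transmit
print(low)                               # [[10.0, 50.0], [90.0, 30.0]]
restored = upsample_2x(low)
print(len(restored), len(restored[0]))   # back to 4 x 4
```

The bandwidth saving comes from transmitting `low` (a quarter of the pixels); the quality of the final video then depends entirely on how well the upscaling model hallucinates plausible detail.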
The innovation practice of Rongyun
WebRTC is an open-source stack that needs substantial optimization before it performs well in real-world scenarios. Based on its own business requirements, Rongyun modified parts of WebRTC's audio-processing and video-compression source code to implement deep-learning-based audio noise suppression and more efficient video compression.
Audio processing
In addition to WebRTC's original AEC3, ANS, and AGC, Rongyun added an AI speech noise reduction module for pure-speech scenarios such as conferences and teaching, and optimized the AEC3 algorithm to greatly improve sound quality in music scenarios.
AI speech noise reduction: The industry mostly adopts mask-based methods in the time and frequency domains, combining traditional algorithms with deep neural networks. A deep neural network estimates the signal-to-noise ratio (SNR), from which gains for the different frequency bands are computed; after conversion back to the time domain, a time-domain gain is computed and applied as well. This removes as much noise as possible while preserving the speech.
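The step from an SNR estimate to a per-band gain is typically a Wiener-style mapping. The sketch below is an illustrative example of that mapping only (the thresholds and the network producing `snr_db` are assumed, not Rongyun's actual design): bands judged to be mostly speech pass through almost unchanged, while noisy bands are attenuated.

```python
def band_gains(snr_db):
    """Wiener-style gain per frequency band from an SNR estimate in dB:
    gain = SNR / (1 + SNR) with SNR as a linear power ratio, in [0, 1)."""
    gains = []
    for s in snr_db:
        snr = 10 ** (s / 10)           # dB -> linear power ratio
        gains.append(snr / (1 + snr))
    return gains

# Three example bands: strong speech / equal speech and noise / mostly noise.
snr_db = [20, 0, -10]
g = band_gains(snr_db)
print([round(x, 3) for x in g])  # [0.99, 0.5, 0.091]
```

Each gain is then multiplied onto the corresponding band's magnitude before the inverse transform, which is what "applying the mask" means in practice.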
Because deep-learning speech-denoising models often use RNNs (recurrent neural networks), the model continues to believe there is speech for a while after the speech has actually ended; the delay is long enough that the speech can no longer mask the residual noise, producing a transient noise tail after the speech ends. Rongyun added a prediction module on top of the existing model that anticipates the end of speech from the amplitude envelope and the decline in SNR, and eliminates the residual noise detected at the end of speech.
(Figure 3: noise tail before optimization)
(Figure 4: no noise tail after optimization)
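One way such a speech-end predictor could work is sketched below. It flags frames where the amplitude envelope has fallen well below its recent peak and the SNR estimate has dropped at the same time; flagged frames can then be gated hard to kill the residual tail. The peak tracker, thresholds, and decay factor here are all illustrative assumptions, not Rongyun's actual values.

```python
def predict_speech_end(envelope, snr_db, env_drop=0.5, snr_floor=3.0):
    """Flag frames where the amplitude envelope has fallen below a fraction
    of its recent (decaying) peak AND the SNR estimate is low: likely the
    trailing-noise region after speech has ended."""
    flags, peak = [], 1e-9
    for env, snr in zip(envelope, snr_db):
        peak = max(peak * 0.95, env)            # decaying peak tracker
        flags.append(env < env_drop * peak and snr < snr_floor)
    return flags

# A burst of speech followed by a decaying tail.
envelope = [0.2, 0.8, 0.9, 0.85, 0.3, 0.1, 0.05]
snr_db   = [5.0, 20.0, 22.0, 18.0, 2.0, 1.0, 0.5]
print(predict_speech_end(envelope, snr_db))
# [False, False, False, False, True, True, True]
```

Requiring both cues to agree is the point of the design: the envelope alone would also dip during quiet speech, and the SNR alone is noisy frame to frame.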
Video processing
In the WebRTC source code, video encoding mainly relies on the open-source OpenH264, VP8, and VP9 encoders, repackaged behind a unified interface. By modifying the OpenH264 source code, Rongyun implemented background modeling and region-of-interest coding.
Background modeling: To keep video encoding real-time, the background-modeling computation must run on the GPU, and the background-modeling algorithm in OpenCV supports GPU acceleration. In practice, the raw YUV image from the camera or other capture device is converted to RGB and uploaded to the GPU. The background frame is then computed on the GPU and transferred back to the CPU. Finally, the background frame is added to OpenH264's long-term reference frame list to improve compression efficiency. The flow chart is shown in Figure 5.
(Figure 5: background modeling flow chart)
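The idea of a background model can be shown with the simplest possible version, a per-pixel exponential running average (OpenCV's GPU-accelerated background subtractors are far more sophisticated; this pure-Python sketch only illustrates why a stable background frame emerges that is worth pinning as a long-term reference).

```python
def update_background(bg, frame, alpha=0.05):
    """Per-pixel exponential running average: new values pull the model
    slowly, so brief moving objects wash out and the static scene remains."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(brow, frow)]
            for brow, frow in zip(bg, frame)]

# Static 2x2 scene; a bright object (255) crosses pixel (0,1) for 5 frames.
bg = [[0.0, 0.0], [0.0, 0.0]]
for t in range(200):
    frame = [[10.0, 255.0 if t < 5 else 40.0],
             [10.0, 40.0]]
    bg = update_background(bg, frame)
print([[round(v) for v in row] for row in bg])  # converges to the static scene
```

Because the background frame changes very slowly, macroblocks in static regions predict almost perfectly from it, which is exactly what makes it valuable in the long-term reference list for surveillance-style content.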
Region-of-interest extraction: The region-of-interest coding stage uses a YOLOv4-tiny model to detect targets, and fuses the detections with the foreground regions extracted by background modeling. Part of the code is shown in Figure 6. After the network is loaded, CUDA acceleration is selected and the input image size is set to 416×416.
(Figure 6: part of the program that loads the network onto the GPU)
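The fusion step itself can be as simple as a per-pixel OR between the detector's boxes and the foreground mask. The sketch below assumes a hypothetical `(x0, y0, x1, y1)` box format standing in for YOLOv4-tiny output; the resulting map is what a rate-allocation scheme would use to spend more bits on ROI macroblocks.

```python
def fuse_roi(det_boxes, fg_mask):
    """Mark a pixel as region-of-interest if it lies inside any detection
    box OR the background model flagged it as foreground. Boxes are
    (x0, y0, x1, y1), exclusive on the upper bounds."""
    h, w = len(fg_mask), len(fg_mask[0])
    roi = [[fg_mask[y][x] for x in range(w)] for y in range(h)]
    for (x0, y0, x1, y1) in det_boxes:
        for y in range(y0, y1):
            for x in range(x0, x1):
                roi[y][x] = 1
    return roi

# 4x4 toy frame: one foreground pixel from background modeling,
# one detected object in the lower-right quadrant.
fg = [[0, 0, 0, 0],
      [0, 1, 0, 0],
      [0, 0, 0, 0],
      [0, 0, 0, 0]]
boxes = [(2, 2, 4, 4)]
print(fuse_roi(boxes, fg))
```

Taking the union is deliberately conservative: the detector catches known object classes, while the foreground mask catches motion the detector misses, so neither source alone decides what counts as "interesting".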
Effect of the video coding changes in WebRTC: To verify the effect, we tested the modified OpenH264 with WebRTC's video loopback test program. Figure 7 shows video captured live by the camera and the background-modeling result at 1920×1080 resolution; Figure 8 shows the encoder output. To guarantee real-time performance, WebRTC discards frames that, for whatever reason, cannot be encoded within the allotted time. Figure 8 shows that our algorithm does not add much encoding time and does not cause the encoder to drop frames.
(Figure 7: current frame and background frame)
(Figure 8: actual effect of the encoder)
In conclusion, on the audio side, AI-based noise reduction can significantly improve existing voice calls, although model predictions are not yet accurate enough and the computational cost is relatively high; as models keep improving and data sets grow, AI speech noise reduction will certainly bring an even better call experience. On the video side, adding the background frame produced by background modeling to the long-term reference frame list effectively improves coding efficiency in surveillance-style scenes, while the efficient bitrate-allocation scheme based on target detection and background modeling improves the quality of region-of-interest coding and noticeably improves the viewing experience under weak network conditions.
Technology keeps changing, and we have entered an era of pervasive intelligence in which artificial intelligence is deeply applied across all kinds of scenarios. In the audio and video industry, combining these advanced techniques with WebRTC is equally promising. Service optimization never ends: Rongyun will continue to follow technology trends, actively explore innovative techniques, and distill them into easy-to-use underlying capabilities that empower developers over the long term.