While a great deal of effort goes into improving the quality of video captured on smartphone cameras, the audio quality of those videos is often overlooked. For example, a video with multiple people talking, or one with heavy background noise, can sound confusing, distorted, or difficult to understand. To address this, two years ago we launched Looking to Listen, a machine learning (ML) technology that uses both visual and audio cues to isolate the speech of a video's subject. By training the model on a large collection of online videos, we were able to capture correlations between speech and visual signals such as mouth movements and facial expressions, which can then be used to separate one person's speech from another's in a video, or from background sounds. We showed that this technique not only achieves state-of-the-art results in speech separation and enhancement (a noticeable 1.5 dB improvement over audio-only models), but also has a particular advantage over audio-only processing when multiple people are speaking, since the visual cues in the video help determine who is saying what.

Now, we're excited to bring the Looking to Listen technology to users through our new audiovisual speech enhancement feature in YouTube Stories (on iOS), which allows creators to take better selfie videos by automatically enhancing their voices and reducing background noise. Getting this technology into users' hands was no easy feat. Over the past year, we worked closely with users to understand how they would want to use such a feature, in what scenarios, and what balance of speech and background sound they would like in their videos. We heavily optimized the Looking to Listen model so that it runs efficiently on mobile devices, reducing the overall running time from 10x real-time on a desktop when our paper came out, to 0.5x real-time performance on the phone. We also tested the technology extensively to verify that it performs consistently across different recording conditions and for people with different appearances and voices.

From research to product

Optimizing Looking to Listen for fast and robust operation on mobile devices required overcoming a number of challenges. First, all processing needs to be done on-device, within the client app, to minimize processing time and protect the user's privacy; no audio or video information is sent to a server for processing. Furthermore, the model needs to coexist with other ML algorithms used in the YouTube app, on top of the resource-intensive video recording itself. Finally, the algorithm needs to run quickly and efficiently on-device while minimizing battery consumption.

The first step in the Looking to Listen pipeline is to isolate thumbnail images containing the speaker's face from the video stream. By leveraging MediaPipe BlazeFace with GPU-accelerated inference, this step can be performed in just a few milliseconds. We then switched the part of the model that processes each thumbnail separately to the lighter MobileNetV2 architecture, which outputs the visual features learned for speech enhancement, extracted from the face thumbnails in 10 ms per frame. Because embedding the visual features is fast, it can be done while the video is still being recorded. This avoids the need to keep frames in memory for later processing, reducing the overall memory footprint. Then, after the recording has finished, the audio and the computed visual features are streamed to the audio-visual speech separation model, which produces the isolated and enhanced speech.
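To make that division of labor concrete, here is a minimal sketch of how such a streaming setup could be organized, assuming the face detector, the visual feature extractor, and the audio-visual separation model are already available as callables (for example, wrapped TFLite interpreters). The class and method names are hypothetical and not the actual YouTube implementation:

```python
# Sketch of a streaming audio-visual enhancement pipeline: per-frame visual
# embeddings are computed while recording, and the audio-visual separation
# model runs once after recording ends. All model callables are assumed.
from typing import Callable, List, Optional
import numpy as np


class StreamingEnhancementPipeline:
    def __init__(self,
                 detect_face: Callable[[np.ndarray], Optional[np.ndarray]],
                 visual_features: Callable[[np.ndarray], np.ndarray],
                 separate_speech: Callable[[np.ndarray, np.ndarray], np.ndarray]):
        self.detect_face = detect_face          # video frame -> face thumbnail (or None)
        self.visual_features = visual_features  # face thumbnail -> embedding
        self.separate_speech = separate_speech  # (audio, embeddings) -> enhanced speech
        self._embeddings: List[np.ndarray] = []

    def on_frame(self, frame: np.ndarray) -> None:
        """Called per video frame while recording: keep only the small visual
        embedding, so raw frames never need to be buffered in memory."""
        thumbnail = self.detect_face(frame)
        if thumbnail is not None:
            self._embeddings.append(self.visual_features(thumbnail))

    def finalize(self, audio: np.ndarray) -> np.ndarray:
        """Called once recording ends: run the audio-visual separation model
        on the full audio track plus the streamed visual embeddings."""
        embeddings = np.stack(self._embeddings) if self._embeddings else np.empty((0,))
        return self.separate_speech(audio, embeddings)
```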

We reduced the total number of parameters in the audio-visual model by replacing the "regular" 2D convolutions with separable ones (1D in the frequency dimension, then 1D in the time dimension) with fewer filters. We then further optimized the model using TensorFlow Lite, a set of tools that enables running TensorFlow models on mobile devices with low latency and a small binary size. Finally, we re-implemented the model within the Learn2Compress framework to take advantage of its built-in quantized training and QRNN support.
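As an illustration of the convolution change, here is a minimal Keras sketch (with assumed layer sizes, not the production architecture) contrasting a regular 2D convolution over a spectrogram-like feature map with the separable variant, followed by a plain TensorFlow Lite conversion of the resulting model:

```python
import tensorflow as tf

# Feature map from an earlier layer: (time_frames, freq_bins, 64 channels).
inputs = tf.keras.Input(shape=(100, 257, 64))

# "Regular" 2D convolution: 5x5x64x64, roughly 102k parameters.
regular = tf.keras.layers.Conv2D(64, kernel_size=(5, 5), padding="same")(inputs)

# Separable replacement with fewer filters: a 1D convolution along the
# frequency axis, then a 1D convolution along the time axis, roughly 15k
# parameters in total.
freq_conv = tf.keras.layers.Conv2D(32, kernel_size=(1, 5), padding="same")(inputs)
time_conv = tf.keras.layers.Conv2D(32, kernel_size=(5, 1), padding="same")(freq_conv)

model = tf.keras.Model(inputs, time_conv)

# Converting the optimized model to TensorFlow Lite for on-device inference.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
```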

These optimizations and improvements reduced the running time from 10x real-time on a desktop, using the original formulation of Looking to Listen, to 0.5x real-time performance using only an iPhone CPU. The model size was reduced from 120MB to 6MB, which makes deployment much easier. Because YouTube Stories videos are short (limited to 15 seconds), the result of the video processing is available within a couple of seconds after the recording is finished.

Finally, to avoid processing videos in which the speech is already clean (and so avoid unnecessary computation), we first run our model only on the first two seconds of the video and compare the speech-enhanced output to the original input audio. If the difference is large enough (meaning the model cleaned up the speech), we enhance the speech throughout the rest of the video.
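A rough sketch of this check is below; the probe length, difference metric, and decision threshold are illustrative assumptions rather than the actual product heuristic:

```python
# Decide whether enhancement is worth running on the full video by probing
# only its first couple of seconds. The enhancement model is assumed to be
# provided as a callable over waveform arrays.
from typing import Callable
import numpy as np


def should_enhance(audio: np.ndarray,
                   enhance: Callable[[np.ndarray], np.ndarray],
                   sample_rate: int = 16000,
                   probe_seconds: float = 2.0,
                   threshold: float = 0.05) -> bool:
    probe = audio[: int(probe_seconds * sample_rate)]
    enhanced_probe = enhance(probe)
    # Relative energy of the change the model made to the probe segment.
    diff_ratio = np.sum((enhanced_probe - probe) ** 2) / (np.sum(probe ** 2) + 1e-8)
    return diff_ratio > threshold


def enhance_if_needed(audio: np.ndarray,
                      enhance: Callable[[np.ndarray], np.ndarray]) -> np.ndarray:
    # Only pay the full cost when the probe shows the model is actually
    # cleaning up the speech; otherwise return the original audio untouched.
    return enhance(audio) if should_enhance(audio, enhance) else audio
```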

Researching user needs

An early version of Looking to Listen was designed to completely isolate speech from background noise. In a user study we conducted together with YouTube, we found that users prefer to keep some background sound to provide context and preserve some of the general ambience of the scene. Based on this study, we take a linear combination of the original audio and the clean speech channel we produce: output_audio = 0.1x original_audio + 0.9x speech. The video below presents clean speech combined with different levels of background sound in the scene (10% background is the balance we use in practice).
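In code, this blend is a one-line weighted sum over the waveforms; the 10%/90% weights are the values quoted above, and the function name is just for illustration:

```python
import numpy as np

def blend(original_audio: np.ndarray, enhanced_speech: np.ndarray,
          background_level: float = 0.1) -> np.ndarray:
    # output_audio = 0.1 * original_audio + 0.9 * speech
    return background_level * original_audio + (1.0 - background_level) * enhanced_speech
```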

Fairness analysis

Another important requirement is that the model be fair and inclusive. It must be able to handle different types of voices, languages and accents, as well as different visual appearances. To this end, we conducted a series of tests exploring the performance of our model across a variety of visual and speech/audio attributes: the speaker's age, skin tone, spoken language, voice pitch, visibility of the speaker's face (percentage of the video in which the speaker is in frame), head pose throughout the video, facial hair, presence of glasses, and the level of background noise in the (input) video.

For each of the above visual/auditory attributes, we ran our model on segments from our evaluation set (separate from the training set) and measured the speech enhancement accuracy, broken down according to the different attribute values. The results for some of the attributes are summarized in the figure below. Each data point in the figure represents hundreds (and in most cases thousands) of videos that fit the criteria.
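A minimal sketch of such a breakdown is shown below, assuming the per-video evaluation results are collected in a pandas DataFrame; the column names and the metric (e.g., an SDR-style enhancement quality score in dB) are illustrative assumptions, not the actual evaluation schema:

```python
import pandas as pd

def breakdown_by_attribute(results: pd.DataFrame, attribute: str,
                           metric: str = "enhancement_quality_db") -> pd.DataFrame:
    """Mean enhancement quality per attribute value, plus the number of
    videos behind each data point."""
    return results.groupby(attribute)[metric].agg(["mean", "count"])

# e.g. breakdown_by_attribute(eval_results, "spoken_language")
```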

Using the feature

YouTube creators who are eligible to create YouTube Stories can record a video on iOS and then select "Enhance speech" from the volume controls editing tool. This immediately applies speech enhancement to the audio track and plays back the enhanced speech in a loop. The feature can then be toggled on and off multiple times to compare the enhanced speech with the original audio.

As we roll out this new feature on YouTube, we’re exploring other places for the technology. More content is coming later this year — stay tuned!
