In real-time audio interaction, sound quality and experience are affected not only by the codec we discussed in the previous article, but also by the on-device noise suppression, echo cancellation, and automatic gain control modules. In this article we focus on the echo cancellation and noise reduction modules: the technical challenges they face in real-time interactive scenarios, and our solutions and practices.

Optimizing the three algorithm modules of echo cancellation

In voice communication systems, echo cancellation has always been the core algorithm. In general, echo cancellation is affected by a number of factors, including:

  • The acoustic environment, including reflections, reverberation, etc.;
  • The acoustic design of the device itself, including the design of the acoustic cavity and the device's nonlinear distortion, etc.;
  • System performance, including the processor's computing power and the operating system's ability to schedule threads.

When the Soundnet echo cancellation algorithm was first designed, performance, robustness, and universality were set as the ultimate optimization goals, which is crucial for a good audio and video SDK.

First of all, where do echoes come from? In simple terms, your voice is played out of the other party's loudspeaker, picked up by their microphone, and sent back to you, so you hear your own voice as an echo. To eliminate the echo, we need an algorithm that removes the loudspeaker's contribution from the microphone signal.

So how does AEC (Acoustic Echo Cancellation) cancel an echo? The specific steps are shown in the diagram below:

  • The first step is to find the delay between the reference/loudspeaker signal (blue curve) and the microphone signal (red curve), that is, delay = T in the figure.
  • The second step is to estimate the linear echo component in the microphone signal from the reference signal and subtract it from the microphone signal, obtaining the residual signal (black curve).
  • The third step is to fully suppress the residual echo in the residual signal through nonlinear processing.

Corresponding to the three steps above, echo cancellation consists of three major algorithm modules:

  • Delay Estimation
  • Linear Adaptive Filter
  • Nonlinear Processing

Among them, “delay estimation” determines the lower limit of AEC performance, the “linear adaptive filter” determines its upper limit, and “nonlinear processing” determines the final call experience, especially the balance between echo suppression and double-talk quality.

Note: double-talk refers to an interactive scenario in which two or more parties speak at the same time; one party's voice may be suppressed, causing intermittent audio. This happens when the echo cancellation algorithm “over-corrects” and removes parts of the signal that should be kept.
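To make the division of labor concrete, here is a minimal sketch, in Python, of how such a three-stage pipeline could be wired together. The frame handling, function names, and callables are illustrative assumptions, not the actual Soundnet implementation:

```python
import numpy as np

def cancel_echo(mic_frames, ref_frames, estimate_delay, linear_filter, nlp_suppress):
    """Hypothetical AEC pipeline: delay estimation -> linear filter -> NLP.

    mic_frames / ref_frames: sequences of equal-length NumPy arrays
    (e.g. 10 ms frames). The three callables stand in for the modules
    described above.
    """
    out = []
    ref_history = []
    for mic, ref in zip(mic_frames, ref_frames):
        ref_history.append(ref)
        # Step 1: align the reference with the microphone signal (delay = T).
        delay = estimate_delay(mic, ref_history)
        aligned_ref = ref_history[max(0, len(ref_history) - 1 - delay)]
        # Step 2: subtract the estimated linear echo to get the residual signal.
        residual = linear_filter(mic, aligned_ref)
        # Step 3: suppress the remaining nonlinear echo in the residual.
        out.append(nlp_suppress(residual, aligned_ref))
    return np.concatenate(out)
```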

Next, let's look at the technical challenges and optimization ideas for each of these three algorithm modules in turn.

1. Delay Estimation

Depending on the specific system implementation, when the reference signal and the microphone signal are fed into the AEC module for processing, there is a time offset between their data buffers: the “delay = T” we saw in the figure above. If the echo-producing device is a mobile phone, some of the sound from its loudspeaker reaches the microphone through the inside of the device, and possibly also through the external environment. This delay therefore includes the device's playback and capture buffer lengths, the time for sound to travel through the air, and the scheduling offset between the playback thread and the capture thread. Because so many factors affect it, the delay varies from system to system, from device to device, and from SDK to SDK. It may stay constant throughout a call, or it may change mid-call (the so-called overrun and underrun). This is why an AEC algorithm may work well on device A yet perform poorly on another device. Accurate delay estimation is the prerequisite for AEC to work at all: an excessive estimation error causes AEC performance to degrade sharply or fail entirely, and failing to track delay changes quickly is a major cause of occasional echo leakage.

Enhancing the robustness of the delay estimation algorithm

Traditional algorithms usually determine the delay by computing the correlation between the reference signal and the microphone signal. The correlation can be computed in the frequency domain, a typical example being the Binary Spectrum method: by testing whether the signal energy at each frequency bin exceeds a threshold, the reference and microphone signals are mapped into two-dimensional 0/1 arrays, and the delay is found by shifting one array against the other. The latest WebRTC AEC3 algorithm instead runs multiple NLMS linear filters in parallel to find the delay; this achieves good detection speed and robustness, but is computationally expensive. When the cross-correlation is computed in the time domain, an obvious problem arises: speech contains strong harmonic components and is time-varying, so its correlation often exhibits multiple peaks, some of which do not correspond to the real delay, making the algorithm prone to interference from noise.
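As a minimal sketch of the Binary Spectrum idea just described, the following Python code maps framed signals to 0/1 energy patterns and scans frame offsets for the best match. The threshold and scoring are illustrative assumptions:

```python
import numpy as np

def binary_spectrum(frames, threshold=1e-3):
    """Map each frame to a 0/1 vector: does each bin's energy exceed the threshold?"""
    power = np.abs(np.fft.rfft(frames, axis=-1)) ** 2
    return (power > threshold).astype(np.uint8)

def estimate_delay_frames(ref_frames, mic_frames, max_delay):
    """Find the frame offset at which the two binary patterns agree best."""
    b_ref = binary_spectrum(ref_frames)
    b_mic = binary_spectrum(mic_frames)
    best_offset, best_score = 0, -1.0
    for d in range(max_delay + 1):
        n = len(b_mic) - d
        # Fraction of matching bits when the mic lags the reference by d frames.
        score = np.mean(b_mic[d:] == b_ref[:n])
        if score > best_score:
            best_offset, best_score = d, score
    return best_offset  # delay in frames
```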

By de-correlating the signals beforehand, the Soundnet delay estimation algorithm effectively suppresses spurious local maxima, greatly enhancing the robustness of the algorithm. In the figure below, the left side shows the cross-correlation of the original signals and the right side shows the cross-correlation after the Soundnet SDK's preprocessing. As can be seen, preprocessing the signals greatly improves the robustness of delay estimation:
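One classic way to de-correlate the signals before the correlation search is the phase transform (PHAT) weighting, which whitens the cross-spectrum so the harmonic-induced side peaks collapse into one sharp peak. The sketch below illustrates the general idea only; it is not the specific preprocessing the Soundnet SDK uses:

```python
import numpy as np

def gcc_phat(ref, mic, fs, max_tau=0.25):
    """Delay estimate via PHAT-weighted (whitened) cross-correlation."""
    n = len(ref) + len(mic)                     # zero-pad to avoid circular wrap
    R = np.fft.rfft(mic, n) * np.conj(np.fft.rfft(ref, n))
    R /= np.abs(R) + 1e-12                      # keep phase, discard magnitude
    cc = np.fft.irfft(R, n)
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs  # positive: mic lags ref
```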

Adapting the algorithm to reduce the amount of computation

Generally, to reduce the computational cost of delay estimation, echo is assumed in advance to appear in a low frequency band; the signals can then be downsampled before being sent to the delay estimation module, reducing the computational complexity of the algorithm. However, with tens of thousands of devices and audio routes on the market, this assumption often does not hold. The figure below shows the spectrogram of a Vivo X20 microphone signal in headset mode: the echo is concentrated above 4 kHz, and the traditional approach would cause the echo cancellation module to fail in such cases. The Soundnet delay estimation algorithm instead searches the whole frequency band for the region where the echo actually appears and adaptively selects that region for the delay calculation, ensuring accurate delay estimates on any device and audio route.

Figure: the Vivo X20 microphone signal with a headset connected
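A simple way to implement such full-band echo localization is to score the reference-to-microphone coherence per band and feed only the strongest band to the delay estimator. This sketch (using SciPy, with an illustrative band split) shows the idea, not Soundnet's actual method:

```python
import numpy as np
from scipy.signal import coherence

def find_echo_band(ref, mic, fs, n_bands=8):
    """Return the (low_hz, high_hz) band with the strongest ref->mic coherence."""
    f, coh = coherence(ref, mic, fs=fs, nperseg=512)
    edges = np.linspace(0.0, fs / 2.0, n_bands + 1)
    scores = [coh[(f >= lo) & (f < hi)].mean()
              for lo, hi in zip(edges[:-1], edges[1:])]
    best = int(np.argmax(scores))
    return edges[best], edges[best + 1]  # restrict delay estimation to this band
```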

Updating the audio algorithm database dynamically to improve device coverage

To ensure continuous iterative improvement of the algorithm, Soundnet maintains an audio algorithm test database. Using a large number of different test devices, we collected various combinations of reference and microphone signals in a variety of acoustic environments, and calibrated the delay between them through offline processing. In addition to this real-world data, the database also contains a large amount of simulated data covering different loudspeakers, reverberation strengths, noise levels, and types of nonlinear distortion. To measure the performance of the delay estimation algorithm, the delay between the reference signal and the microphone signal can also be changed at random, so as to observe how the algorithm responds to sudden delay changes.
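Here is a minimal sketch of how such a simulated test pair with a known, abruptly changing delay might be generated; the room impulse response, delay values, and jump position are illustrative parameters, not Soundnet's actual test harness:

```python
import numpy as np

def make_test_pair(ref, rir, fs, delay_ms=80, jump_at_s=None, jump_ms=40):
    """Simulate a (reference, microphone) pair with a calibrated delay.

    rir: room impulse response modeling the echo path. If jump_at_s is set,
    the delay changes abruptly at that time, so a delay tracker's response
    to burst delay changes can be measured against the known ground truth.
    """
    echo = np.convolve(ref, rir)[:len(ref)]
    d = int(fs * delay_ms / 1000)
    mic = np.concatenate((np.zeros(d), echo))[:len(ref)]
    if jump_at_s is not None:
        j = int(fs * jump_at_s)
        d2 = int(fs * (delay_ms + jump_ms) / 1000)
        mic[j:] = np.concatenate((np.zeros(d2), echo))[j:len(ref)]
    return ref, mic
```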

Therefore, to judge the merits of a delay estimation algorithm, we also need to examine whether it can:

1. Adapt to as many devices and acoustic environments as possible, and match an appropriate algorithm to the device and environment in as short a time as possible;

2. Adjust its algorithm strategy promptly after a sudden, random delay change.

The following is a comparison of delay estimation performance between the Soundnet SDK and another vendor's SDK, using a total of 8,640 groups of test data from the database. As the figure shows, the Soundnet SDK finds the initial delay of most test data in a shorter time: it found the correct delay within 1 s for 96% of the test data, compared with 89% for the competitor's SDK.

The second test concerns random delay jitter during a call, where the delay estimation algorithm must find the new, accurate delay value as quickly as possible. As shown in the figure, the Soundnet SDK finds the correct post-change delay within 3 s for 71% of the test data, compared with 44% for the competitor.

2. Linear Adaptive Filter

There is extensive literature on the principles and practice of linear filters. When applied to echo cancellation, the main indicators to consider are convergence speed, steady-state misalignment, and tracking capability. These indicators often conflict: for example, a larger step size improves convergence speed but leads to larger steady-state misalignment. This is the "no free lunch" theorem of adaptive filters.

As for filter types, besides the most commonly used NLMS filter (model-independent), an RLS filter (least-squares model) or a Kalman filter (state-space model) can also be used. Setting aside the various assumptions, approximations, and optimizations in their theoretical derivations, the performance of these filters ultimately comes down to how the optimal step factor is computed (in the Kalman filter, the step factor is absorbed into the computation of the Kalman gain). When the filter has not converged, or the environment's transfer function changes abruptly, the step factor should be large enough to track the change; when the filter has converged and the transfer function changes slowly, the step factor should be reduced as much as possible to achieve the smallest steady-state misalignment. Computing the step factor requires the energy ratio between the residual echo and the residual signal after the adaptive filter, modeled as the system's leakage coefficient. Estimating this variable is often equivalent to finding the deviation between the filter coefficients and the real transfer function (in the Kalman filter, the state-vector error), which is the hardest part of the whole estimation problem. In addition, filter divergence during double-talk must also be considered; generally, this can be addressed by adjusting the filter structure and using two echo path models.
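For reference, here is a minimal NLMS canceller in Python. The fixed step size mu is a crude stand-in for the optimal step-factor estimation discussed above, and the whole class is an illustration rather than Soundnet's filter:

```python
import numpy as np

class NLMSFilter:
    """Minimal NLMS echo canceller (illustrative only)."""

    def __init__(self, taps=512, mu=0.5, eps=1e-8):
        self.w = np.zeros(taps)   # estimated echo path
        self.x = np.zeros(taps)   # reference history, newest sample first
        self.mu, self.eps = mu, eps

    def process(self, ref_sample, mic_sample):
        self.x = np.roll(self.x, 1)
        self.x[0] = ref_sample
        echo_hat = self.w @ self.x        # linear echo estimate
        e = mic_sample - echo_hat         # residual signal
        # Normalized update: the effective step scales with input power.
        self.w += self.mu * e * self.x / (self.x @ self.x + self.eps)
        return e
```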

Soundnet's adaptive filter algorithm does not use a single filter type; instead it combines the advantages of different filters and computes the optimal step factor adaptively. In addition, the algorithm estimates the real-time transfer function of the environment from the linear filter coefficients and automatically adjusts the filter length, covering high-reverberation, strong-echo scenarios such as communication devices connected to HDMI peripherals. Here is an example: in a medium-sized conference room in the Soundnet office (about 20 m², with three glass walls), a MacBook Pro was connected to a Xiaomi TV via HDMI. The figure shows how the linear filter's time-domain response evolved: the algorithm automatically computed and matched the length of the real environment transfer function (the strong-reverberation environment was detected automatically around frame 1400), optimizing the linear filter's performance.

Similarly, we used a large amount of test data from the database to compare the performance of the Soundnet SDK with other vendors', covering steady-state misalignment (the degree of echo suppression after the filter converges) and convergence speed (the time the filter needs to reach the converged state). The first figure shows the adaptive filter's steady-state misalignment: on 47% of the test data, the Soundnet SDK achieves more than 20 dB of echo suppression, compared with 39% for the competitor.

The figure below shows the adaptive filter's convergence speed: on 51% of the test samples, the Soundnet SDK converges to steady state within the first 3 s of the call, compared with 13% for the competitor.

3. Nonlinear Processing

Nonlinear processing aims to suppress the residual echo that the linear filter fails to predict. It usually works by computing the correlations among the reference signal, the microphone signal, the linear echo estimate, and the residual signal, and then either mapping the correlation directly to a suppression gain, or using the correlation to estimate the residual echo power spectrum and suppressing the residual echo with a traditional Wiener filter, much as in noise reduction.
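The following sketch shows the correlation-to-gain variant in its simplest form: per-bin coherence between the residual and the linear echo estimate, smoothed over frames, mapped to a suppression gain. The smoothing constants and gain mapping are illustrative assumptions:

```python
import numpy as np

class NLPSuppressor:
    """Coherence-based residual echo suppressor (illustrative sketch)."""

    def __init__(self, bins, alpha=0.9, floor=0.05):
        self.S_rr = np.full(bins, 1e-8)             # smoothed residual power
        self.S_ee = np.full(bins, 1e-8)             # smoothed echo-estimate power
        self.S_re = np.zeros(bins, dtype=complex)   # smoothed cross-spectrum
        self.alpha, self.floor = alpha, floor

    def gain(self, res_spec, echo_spec):
        """res_spec / echo_spec: one STFT frame each (complex arrays)."""
        a = self.alpha
        self.S_rr = a * self.S_rr + (1 - a) * np.abs(res_spec) ** 2
        self.S_ee = a * self.S_ee + (1 - a) * np.abs(echo_spec) ** 2
        self.S_re = a * self.S_re + (1 - a) * res_spec * np.conj(echo_spec)
        # Coherence ~ 1 where the residual is still echo, ~ 0 for near-end speech.
        coh = np.abs(self.S_re) ** 2 / (self.S_rr * self.S_ee)
        return np.clip(1.0 - coh, self.floor, 1.0)  # multiply into res_spec
```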

As the last module in the echo cancellation chain, the nonlinear processing unit is responsible not only for suppressing residual echo but also for monitoring whether the whole system is operating normally. For example: is the linear filter failing because of delay jitter? Is there residual echo that the device's hardware echo cancellation failed to remove before the SDK's echo cancellation?

Here is a simple example: internal statistics such as the echo energy estimated by the adaptive filter can detect a delay change more quickly and prompt the NLP module to take the corresponding action:

As the Soundnet SDK covers more and more scenarios, the transmission of music signals has become an important one, and the SDK includes many optimizations for the echo cancellation experience on music. A typical case is the improvement of the comfort noise estimation algorithm. The traditional algorithm estimates the noise floor in the signal based on Minimum Statistics. When applied to music, it overestimates the noise power because music is more stationary than speech. In echo cancellation this shows up as an unstable noise floor: after processing, the background noise differs between echo periods and echo-free periods, which sounds very poor. Through signal classification and module fusion, the Soundnet SDK completely eliminates this noise floor fluctuation caused by CNG estimation.
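A simplified minimum-statistics tracker makes the failure mode easy to see: because sustained music rarely exposes the true noise floor within the window, the tracked minimum creeps up toward the music level. The window length here is an illustrative assumption:

```python
import numpy as np

def track_noise_floor(power_frames, win=100):
    """Per-bin noise floor as the minimum of power over a sliding window.

    power_frames: 2-D array (frames x frequency bins) of smoothed power.
    Speech pauses expose the true floor; stationary music does not, so the
    estimate is biased upward on music - the overestimation described above.
    """
    noise = np.empty_like(power_frames)
    for t in range(len(power_frames)):
        lo = max(0, t - win + 1)
        noise[t] = power_frames[lo:t + 1].min(axis=0)
    return noise
```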

In addition, the Soundnet SDK includes many optimizations for extreme situations, including non-causal systems, device frequency offset (clock drift between playback and capture), capture signal overflow, and sound cards with built-in system signal processing, to ensure the algorithm works in all communication scenarios.

A sound-quality-first noise reduction strategy

Noise reduction affects signal sound quality even more than the echo cancellation module does. This is because noise reduction algorithms are designed on the assumption that background noise is stationary (at least over short periods), and under this assumption music is clearly harder to distinguish from background noise than speech is.
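To see where the stationarity assumption enters, consider the classic Wiener suppression gain built on a tracked noise floor. This is a sketch under those assumptions, not the Soundnet algorithm:

```python
import numpy as np

def wiener_gain(signal_power, noise_power, g_min=0.1):
    """Per-bin Wiener gain, assuming noise_power is a valid stationary floor.

    signal_power: smoothed power spectrum of the noisy input frame.
    noise_power:  tracked noise floor (e.g. from minimum statistics).
    Music violates the stationarity assumption behind noise_power, which is
    exactly why a signal classifier is needed in front of this stage.
    """
    snr = np.maximum(signal_power / (noise_power + 1e-12) - 1.0, 0.0)
    gain = snr / (snr + 1.0)          # Wiener rule: SNR / (SNR + 1)
    return np.maximum(gain, g_min)    # keep a gain floor to avoid musical noise
```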

The Soundnet SDK places a signal classification module in front of the noise reduction module; it accurately detects the signal type and adjusts the type and parameters of the noise reduction algorithm accordingly (a minimal dispatch sketch follows below). Common signal types include ordinary speech, a cappella singing, and music. The figure below shows signal fragments processed by two noise reduction algorithms on a mixed speech-and-music signal: the first 15 seconds are noisy speech, followed by 40 s of music, then another 10 s of noisy speech. With similar noise reduction performance on the speech segments, the competitor's processing seriously damages the music portion, while the Soundnet SDK's processing does not degrade the music's sound quality.
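In code, the classifier-driven routing could be as simple as the hypothetical dispatch below; the type labels and suppressor interfaces are assumptions for illustration:

```python
def denoise_frame(frame, classify, speech_ns, music_ns):
    """Route each frame to a noise suppressor tuned for its signal type."""
    kind = classify(frame)  # e.g. 'speech', 'music', or 'a_cappella'
    suppressor = speech_ns if kind == 'speech' else music_ns
    return suppressor(frame)
```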

In the second example, the audio is a singer's a cappella recording in which the singer repeatedly sings "ah". In the spectrograms below, from top to bottom, are the original signal, the competitor's result, and the Soundnet SDK's result. The competitor's noise reduction seriously damages the spectral components of the original vocal, while the Soundnet SDK fully preserves its harmonic components, ensuring the vocal quality singers expect when performing.

Conclusion

Ever since M. M. Sondhi of Bell Labs first proposed adaptive filtering for echo cancellation in 1967, countless research and engineering efforts have been devoted to this most fundamental problem in voice communication. Solving the echo problem well requires not only a strong algorithm as a foundation but also a great deal of engineering optimization. Soundnet will continue to improve the echo cancellation experience across different application scenarios.

In the next article in this series, we will follow the audio signal from the device into the real network environment, taking a walk around Shanghai along the way, and talk about delay, jitter, packet loss, and the optimization strategies behind them in audio interactive scenarios. (Here is a picture as a teaser; stay tuned.)