With the rapid development of real-time communication technology, the demand for noise reduction keeps growing, and deep learning is increasingly being applied to real-time noise suppression. At LiveVideoStackCon 2021 Shanghai, we invited Mr. Feng Jianyuan, head of audio algorithms at Agora, to share practical examples, problems, and future prospects of deploying deep learning on mobile devices.
By Feng Jianyuan
Edited by LiveVideoStack
Hello everyone, I'm Feng Jianyuan from Agora. Today I will introduce how we do real-time noise suppression based on deep learning, which is also an example of deploying deep learning on mobile devices.
Let's go in this order. First, there are different kinds of noise: how they are classified, how to choose algorithms for them, and how to suppress them algorithmically. Next, I will introduce how to design such networks with deep learning, that is, how to build the algorithm around an AI model. As we all know, deep networks demand significant computing power and the models are inevitably large, so when we deploy them in RTC scenarios we run into problems: what needs to be solved, and how do we handle the constraints on model size and computing power? Finally, I will show what today's noise reduction can achieve, some application scenarios, and how we can do noise suppression even better.
01. Noise classification and noise reduction algorithm selection
Let's first look at the kinds of noise we encounter in everyday life.
Noise inevitably follows the environment you are in: the objects around you all make sounds. Every sound has its own meaning, but in real-time communication, where only the human voice matters, you can treat all the other sounds as noise. A lot of noise is steady-state, or stationary, noise. For example, while I am recording there may be some background hiss that you cannot hear right now, or the steady whoosh of an air conditioner. Noise like this is stationary: it does not change much over time. If I know what the noise looked like before, I can estimate it, and as long as it keeps appearing I can remove it with a very simple subtraction. Stationary noise like this is common, but noise is not always smooth and easy to remove. A lot of noise is non-stationary: you cannot predict whether someone's phone will suddenly ring, whether someone will play a piece of music, or when cars will roar past on the road or in the subway. These sounds are random and cannot be handled by prediction. This is exactly why we turn to deep learning: traditional algorithms struggle to eliminate and suppress non-stationary noise.
In terms of usage scenarios, even in a very quiet conference room or at home, noise introduced by the equipment itself or a sudden sound can still have an impact. Noise suppression is therefore an unavoidable pre-processing step in real-time communication.
Setting aside that sensory understanding of noise, let's see what it looks like in terms of numbers and signals. Sound travels through the air, reaches your ear, is sensed by the hair cells, and finally forms a perception. Along the way we pick up a microphone signal, which is a waveform that oscillates up and down. For a clean voice, you see a clear waveform while the person is talking and essentially zero while they are not. Add noise and, as on the right, the waveform becomes aliased: the vibration of the noise mixes with the vibration of the voice, everything looks blurrier, and there are waves even when nobody is talking. That is at the level of the raw waveform. If we move to the frequency domain through the Fourier transform, the energy of the human voice lies roughly between 20 Hz and 2 kHz, with a fundamental frequency, formant peaks, and harmonics, so speech has a recognizable shape on the spectrogram. Add noise and the spectrogram becomes blurred, with energy appearing where it should not be.
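To make the time-domain versus frequency-domain view concrete, here is a minimal sketch of computing log-magnitude spectrograms for a clean and a noisy recording. The file names, the 16 kHz sampling rate, and the 20 ms window are illustrative assumptions, not values from the talk.

```python
# Inspect a signal in the time and frequency domains.
# Assumes 16 kHz mono WAV files named "clean.wav" and "noisy.wav".
import numpy as np
import scipy.io.wavfile as wavfile
import scipy.signal as signal

sr, clean = wavfile.read("clean.wav")   # time-domain waveform
_,  noisy = wavfile.read("noisy.wav")

# Short-time Fourier transform: 20 ms windows with 50% overlap.
f, t, clean_spec = signal.stft(clean, fs=sr, nperseg=320, noverlap=160)
_, _, noisy_spec = signal.stft(noisy, fs=sr, nperseg=320, noverlap=160)

# Log-magnitude spectrograms: clean speech shows harmonics and formants,
# while noise smears energy across frames and frequencies.
clean_db = 20 * np.log10(np.abs(clean_spec) + 1e-8)
noisy_db = 20 * np.log10(np.abs(noisy_spec) + 1e-8)
print(clean_db.shape, noisy_db.shape)   # (freq_bins, frames)
```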
Noise suppression is essentially the inverse process: take the time-domain signal and recover a clean signal through filtering, or remove the noisy components in the frequency domain to obtain a relatively pure result.
Noise reduction algorithms have been around for a long time. When Bell Labs developed the telephone, it was already clear that noise greatly affects communication: by Shannon's theorem, the signal-to-noise ratio limits the achievable channel capacity, while a clean signal can be transmitted with relatively little bandwidth. The algorithms from before 2000 can be collectively described as the "know what you know" era.
First, these algorithms mainly target stationary noise, the noise that is "known." Why known? When you stop talking and there is no human voice, only the noise remains, and since it is steady-state its characteristics change little over time. Later, even while speech is present, you can use that estimate of the noise to do spectral subtraction or Wiener filtering. Stationary noise was tackled first because early components themselves introduced a lot of it. The methods are essentially spectral subtraction and Wiener filtering, and later, somewhat more advanced techniques such as wavelet decomposition follow the same idea: estimate the noise from the silent segments, then remove it, for example by spectral subtraction.
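Here is a minimal spectral-subtraction sketch of that idea, not the production implementation: the noise spectrum is estimated from an assumed speech-free leading segment and subtracted from every frame. The frame sizes, the 0.5 s noise segment, and the spectral floor are illustrative parameters.

```python
# Classic spectral subtraction: estimate noise from a leading silent segment,
# subtract it from each frame's magnitude spectrum, reuse the noisy phase.
import numpy as np
import scipy.signal as signal

def spectral_subtraction(noisy, sr, noise_seconds=0.5, floor=0.02):
    f, t, spec = signal.stft(noisy, fs=sr, nperseg=320, noverlap=160)
    mag, phase = np.abs(spec), np.angle(spec)

    # Average magnitude over the assumed speech-free leading frames.
    noise_frames = int(noise_seconds * sr / 160)
    noise_est = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract the noise estimate; clamp to a spectral floor to limit
    # the "musical noise" caused by negative values.
    clean_mag = np.maximum(mag - noise_est, floor * mag)

    _, enhanced = signal.istft(clean_mag * np.exp(1j * phase),
                               fs=sr, nperseg=320, noverlap=160)
    return enhanced
```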
Gradually you find that, in addition to stationary noise, other noises also have to be handled if only the voice should remain during a call. This is what was tackled after 2000, because the distribution of the human voice differs from, say, the distribution of wind noise. When wind passes over the microphone, for example when I blow on it like this, the low frequencies may be stronger and the high frequencies decay faster. Voice and noise can therefore be separated by clustering: the main idea is to project the sound signal into a higher-dimensional space and cluster it there. Adaptive methods can gradually be used in the clustering, which makes this something of a predecessor of deep learning: it divides sounds into different types in a high-dimensional space, keeps the components consistent with the human voice, and removes the rest. Methods of this kind, such as subspace decomposition, were very successful in the image field, and non-negative matrix factorization works well for removing wind noise in audio. When there is more than one kind of noise, approaches such as dictionary learning can be used.
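The project-and-decompose idea can be illustrated with non-negative matrix factorization, which the talk mentions for wind-noise removal. This is a hedged sketch: which NMF components belong to speech is simply assumed via an argument here, whereas learning that assignment is the hard part in practice.

```python
# NMF-based separation sketch: factorize the magnitude spectrogram into a
# spectral dictionary W and activations H, then keep "speech" components.
import numpy as np
import scipy.signal as signal
from sklearn.decomposition import NMF

def nmf_denoise(noisy, sr, n_components=16, speech_components=None):
    f, t, spec = signal.stft(noisy, fs=sr, nperseg=320, noverlap=160)
    mag, phase = np.abs(spec), np.angle(spec)

    # mag ≈ W @ H, with W: (freq_bins, components), H: (components, frames)
    model = NMF(n_components=n_components, init="nndsvd", max_iter=400)
    W = model.fit_transform(mag)
    H = model.components_

    # Keep only the components assumed to belong to speech.
    keep = speech_components or list(range(n_components // 2))
    speech_mag = W[:, keep] @ H[keep, :]

    _, enhanced = signal.istft(speech_mag * np.exp(1j * phase),
                               fs=sr, nperseg=320, noverlap=160)
    return enhanced
```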
One common category we call non-stationary noise with simple patterns: noise that is not stationary, like the whir of wind, but that follows a fixed pattern. The wind comes and goes, yet its energy is always concentrated in the low frequencies, and so on. Such noises can be learned one by one: wind, thunder, hum, and so forth can each be handled by learning its pattern. This is the "birds of a feather flock together" approach. But there is an infinite variety of noise: every machine, every kind of friction, every gust of wind may create a different pattern, and we cannot exhaust all the possible mixtures of noise. That is when we thought of training a model on a large amount of data, so that whatever noise is collected, mixed with speech or not, the model can keep learning from it. We call this the "practice makes perfect" era, starting around 2020: with enough training data and samples, the model learns enough to be robust to noise without decomposing each type one by one.
Following this idea, many deep learning models can achieve this kind of noise suppression while remaining effective against many different noise types.
A lot of noise does not occur in isolation; much of it is composite. In a coffee shop, for example, the clinking of bottles and cups mixes with the chatter of people talking. We call this background babble noise, the kind of background you want to remove. When multiple sounds are mixed together, the spectrum floods and everything blends, which makes it hard to clean up. A traditional algorithm keeps the obvious speech, but the aliasing at higher frequencies is so severe and so hard to tell apart that it simply removes everything above about 4 kHz as noise. These are some of the drawbacks of traditional noise reduction.
For a deep learning method, as for any other, there are two main criteria for judging noise reduction quality:
First, how well is the original voice preserved? Is damage to the speech spectrum minimized?
Second, is the noise removed as cleanly as possible?
The deep learning result on the right satisfies both points: the speech spectrum is preserved even at high frequencies, and the noise is no longer mixed in.
02. Algorithm design based on deep learning
Now let's look at how to design a deep-learning-based method.
As with all deep learning, these steps are involved.
The first step is deciding what to feed the model. The acoustic signal can be given to it as a raw waveform, as a spectrum, as higher-level features such as MFCC, or even in a psychoacoustic representation such as the Bark scale. The structure of the model varies with the input: with a spectrum, image-like methods such as CNNs may be adopted; since sound is continuous in time, you can also work directly on the waveform. We choose different model structures accordingly, but on mobile devices we are limited by computing power and storage, so we may combine models rather than use a single one. Besides the choice of model, another important part is choosing suitable data to train it.
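As a rough illustration of these input choices, the sketch below extracts a waveform, a magnitude spectrogram, and MFCCs with librosa; the file name, sampling rate, and frame sizes are assumptions (a Bark-domain front end is omitted here).

```python
# Three of the input representations mentioned above, side by side.
import librosa
import numpy as np

wave, sr = librosa.load("speech.wav", sr=16000)   # 1) raw waveform

# 2) Magnitude spectrogram (frequency domain), 512-point FFT, 10 ms hop.
spec = librosa.stft(wave, n_fft=512, hop_length=160)
mag = np.abs(spec)                                # (257, frames)

# 3) MFCC: a much more compact, perceptually motivated representation.
mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=40,
                            n_fft=512, hop_length=160)   # (40, frames)

print(wave.shape, mag.shape, mfcc.shape)
```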
Training the model is relatively straightforward: mix a clean speech signal with a noise signal, feed the mixture to the model, and have it learn to output the clean speech. When choosing the data, I try to cover different languages; as mentioned at another session, the phoneme inventories differ across languages, for example Chinese has five or six more phonemes than Japanese, and English differs from Chinese as well, so multilingual data may be needed for coverage. Gender matters too: if the training corpus is unbalanced, the noise reduction quality may be biased toward male or female voices. There are also choices to make about noise types, because it is impossible to exhaust all noise, so we pick representative ones. In outline, then, the things to pay attention to are feature selection, model design, and data preparation.
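A minimal sketch of how one training pair can be synthesized: mix clean speech with a noise clip at a chosen signal-to-noise ratio and use the clean signal as the target. The helper name and SNR handling are illustrative, not the talk's pipeline.

```python
# Synthesize a (noisy input, clean target) training pair at a given SNR.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Loop or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, len(speech) // len(noise) + 1)
    noise = noise[:len(speech)]

    # Scale the noise to reach the requested signal-to-noise ratio.
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))

    noisy = speech + scale * noise
    return noisy, speech          # (model input, training target)
```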
Let's look at what kind of input we select for the model.
The first consideration is whether to take the raw waveform and process it end to end, producing another waveform. This idea was rejected at the very beginning, because the waveform is tied to the sampling rate: at 16 kHz a 10 ms frame already contains 160 samples, so the data volume is very large, and feeding it directly may require a large model. So we wondered early on whether we could convert to the frequency domain and work there to reduce the amount of input. Until around 2017 or 2018 this was indeed done in the frequency domain, but since 2018 models such as TasNet have been able to perform noise reduction end to end in the time domain.
Frequency-domain approaches came a little earlier: noise removal is done in the frequency domain through a mask, which strips out the energy of the noise and keeps only the energy of the human voice.
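The mask idea looks like the sketch below: a network predicts a per-bin gain in [0, 1] that is multiplied onto the noisy spectrum. The `model` object and its `predict` method are placeholders for whatever network is trained, not a specific API.

```python
# Mask-based enhancement: predicted per-bin mask applied to the noisy STFT.
import numpy as np
import scipy.signal as signal

def denoise_with_mask(noisy, sr, model):
    f, t, spec = signal.stft(noisy, fs=sr, nperseg=320, noverlap=160)
    mag, phase = np.abs(spec), np.angle(spec)

    # Assumed: model maps a magnitude spectrogram to a same-shaped mask.
    mask = np.clip(model.predict(mag), 0.0, 1.0)

    # Keep the speech energy, attenuate the rest; reuse the noisy phase.
    _, enhanced = signal.istft(mag * mask * np.exp(1j * phase),
                               fs=sr, nperseg=320, noverlap=160)
    return enhanced
```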
A comparison in a 2019 paper showed that good noise reduction can be achieved in both the time domain and the frequency domain, although the computational complexity of the models differs. In other words, the choice of input representation does not by itself determine the quality of the result, but it does affect the computing cost.
On this basis, if both the time domain and the frequency domain work, we may go further and choose compact representations such as MFCC to reduce the model's computing cost, which was also a consideration in the initial design. Under tight compute budgets, going from more than 200 frequency bins down to around 40 MFCC bins shrinks the input considerably. And because hearing has masking effects, the spectrum can also be grouped into sub-bands that are small enough for noise suppression, which is another effective way to reduce the model's computing cost.
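One simple way to picture the 200+ bins to ~40 bands reduction is a mel filterbank applied to the magnitude spectrogram. The talk mentions MFCC/Bark-style compression; mel bands are used here only as a readily available approximation, and the sizes are assumptions.

```python
# Compress ~257 STFT bins into 40 perceptual bands with a mel filterbank.
import librosa
import numpy as np

n_fft, sr = 512, 16000
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=40)  # (40, 257)

def compress_to_bands(mag_spec):
    # mag_spec: (257, frames) magnitude spectrogram -> (40, frames) bands
    return mel_fb @ mag_spec
```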
Beyond the input, there is a lot to consider about computing cost when choosing the model structure. You can draw an X-Y chart of computational complexity versus number of parameters. CNN-style methods, because of convolution, reuse their operators heavily: the same kernel slides over the whole spectrum. For a given parameter count they therefore have the highest computational complexity, precisely because the parameters are reused, and they need very few parameters. If a mobile app has a strict size limit, say the app cannot exceed 200 MB and the model is only given 1-2 MB of that, a CNN-style model is the natural choice.
If the parameter count is not the binding constraint but computing power is, for example on a weak chip running at only 1 GHz, a convolutional network may not be suitable; the model may instead be built mostly from linear layers, which are just matrix multiplications. Matrix multiplication runs well on DSP chips and conventional CPUs even when their raw compute is not high; the drawback is that the operators are not reused, so the parameter count is larger even though the compute can be smaller. But using only linear layers, a plain DNN, ends up with both a large parameter count and a large amount of computation.
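The parameter-versus-compute trade-off can be made concrete with a toy comparison; the layer sizes below are illustrative. A convolution reuses each small kernel weight at many positions (few parameters, high reuse), while a linear layer touches each weight exactly once per frame (many parameters, no reuse).

```python
# Rough parameter and multiply-accumulate comparison for one 257-bin frame.
import torch.nn as nn

conv = nn.Conv1d(1, 16, kernel_size=5)   # treat the frame as a 1-D signal
fc = nn.Linear(257, 257)

conv_params = sum(p.numel() for p in conv.parameters())   # 16*5 + 16 = 96
fc_params = sum(p.numel() for p in fc.parameters())       # 257*257 + 257 = 66306

# Weight-only MAC counts (biases ignored):
conv_macs = 16 * 5 * (257 - 5 + 1)   # each kernel weight reused at 253 positions
fc_macs = 257 * 257                  # each weight used exactly once

print(conv_params, fc_params)                        # tiny vs. large parameter count
print(conv_macs // (16 * 5), fc_macs // (257 * 257)) # ~253x vs. 1x reuse per weight
```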
As mentioned above, speech is continuous over time. We can use an RNN, with its short- or long-term memory, to track the current noise state through its recurrent state, which further reduces the required computation.
To sum up, use linear layers as sparingly as possible when choosing a model, since they inflate both the parameter count and the compute. You can combine the different structures, for example CNN followed by RNN in a CRN: the convolutions first compress the input dimensions, and the recurrent memory then reduces the model's compute further.
The choice also depends on the scenario: for offline processing a bidirectional network may be best, but in RTC scenarios you cannot add latency, so a unidirectional network such as an LSTM is more suitable. If you need to cut compute further and a three-gate LSTM is still too heavy, use a two-gate GRU, and refine the algorithm in details like these.
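Putting these pieces together, here is a hedged sketch of a CRN-style model: convolutions compress the frequency dimension, a unidirectional GRU tracks time causally, and a small linear head predicts a mask. The layer sizes are illustrative and this is not the model described in the talk.

```python
# A tiny causal CRN: frequency-axis convolutions + unidirectional GRU + mask head.
import torch
import torch.nn as nn

class TinyCRN(nn.Module):
    def __init__(self, freq_bins=257, hidden=128):
        super().__init__()
        # Small kernels slide over the frequency axis only (time kernel = 1),
        # so no future frames are used and the parameter count stays low.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(5, 1), stride=(4, 1), padding=(2, 0)), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=(5, 1), stride=(4, 1), padding=(2, 0)), nn.ReLU(),
        )
        with torch.no_grad():
            c, f = self.encoder(torch.zeros(1, 1, freq_bins, 1)).shape[1:3]
        # Unidirectional (causal) GRU; a bidirectional RNN would add latency.
        self.rnn = nn.GRU(c * f, hidden, batch_first=True)
        # Predict a per-bin mask in [0, 1] for every frame.
        self.decoder = nn.Sequential(nn.Linear(hidden, freq_bins), nn.Sigmoid())

    def forward(self, noisy_mag):                    # (batch, freq_bins, frames)
        x = self.encoder(noisy_mag.unsqueeze(1))     # (batch, C, F', frames)
        x = x.permute(0, 3, 1, 2).flatten(2)         # (batch, frames, C*F')
        x, _ = self.rnn(x)                           # (batch, frames, hidden)
        mask = self.decoder(x).transpose(1, 2)       # (batch, freq_bins, frames)
        return mask * noisy_mag                      # masked (denoised) magnitude

# Example: TinyCRN()(torch.rand(1, 257, 100)) returns a (1, 257, 100) tensor.
```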
So the model structure depends on the application scenario and the available compute. The other piece is how to choose the data fed to the model. One concern is damage to the speech spectrum: we should prepare a sufficiently large and clean corpus covering different languages and genders. Since a corpus may itself contain background noise, we should try to use pure speech recorded in an anechoic room or recording studio; the cleaner your reference target, the better the result.
Another question is whether you can cover the noise, which is endless. You can choose noises typical of your scenario as training material, for example office sounds and phone notification tones for a meeting scenario. Many noises are in fact combinations of simple noises, and when the set of simple noises is large enough, the model becomes robust even to noises it has never seen. Noise that cannot be collected can sometimes be synthesized, for example the hum caused by fluorescent tubes and their glow discharge: 50 Hz mains current constantly produces 50 Hz noise plus 100 Hz and higher harmonics. Such noise can be generated artificially and added to the training set to improve the model's robustness.
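Synthesizing that mains hum is straightforward; the sketch below generates a 50 Hz tone plus a few harmonics with arbitrary amplitudes, which could then be mixed into clean speech (for example with a helper like the mix_at_snr sketch above).

```python
# Synthetic mains hum (50 Hz plus harmonics) for data augmentation.
import numpy as np

def mains_hum(duration_s, sr=16000, base_hz=50.0, harmonics=(1, 2, 3)):
    t = np.arange(int(duration_s * sr)) / sr
    hum = np.zeros_like(t)
    for k in harmonics:                       # 50 Hz, 100 Hz, 150 Hz, ...
        hum += (0.5 / k) * np.sin(2 * np.pi * base_hz * k * t)
    return hum
```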
03. The dilemma on RTC mobile devices
Assuming we already have a good model, what are the difficulties in deploying it?
In real-time interaction, unlike offline processing, the requirements on latency are stricter: computation must proceed frame by frame, non-causal processing is not allowed, and future information is not available. In such a scenario, bidirectional neural networks cannot be used.
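The frame-by-frame constraint looks roughly like the loop below: each 10 ms block is processed as soon as it arrives, carrying only past state forward. `model.process_frame` is a placeholder for a causal model that consumes one frame plus its recurrent state, not an actual API.

```python
# Streaming (causal) processing: one 10 ms frame in, one frame out, no lookahead.
FRAME = 160          # 10 ms at 16 kHz

def process_stream(audio_stream, model, initial_state=None):
    state = initial_state
    for frame in audio_stream:               # each frame: 160 samples
        assert len(frame) == FRAME
        # Placeholder: a causal model that takes a frame and its past state.
        enhanced, state = model.process_frame(frame, state)
        yield enhanced                        # emitted with no added latency
```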
In addition, to fit different phones and mobile devices, the model is constrained by the computing power of a wide range of chips. If you want broad coverage, there is a ceiling on compute, and at the same time the parameter count cannot be too large: on many chips used for calls in particular, a model with many parameters but modest compute still suffers, because the I/O of reading those parameters affects the model's final performance.
The richness of scenarios was also mentioned earlier: how well different languages such as Chinese and Japanese are covered, and which noise types. In real-time interaction it is impossible for everyone to say the same thing in the same setting, so the richness of scenarios has to be taken into account.
04. How to deploy on mobile devices
Under such constraints, how do we make deep learning work? We can attack these problems from two directions.
The first direction is the algorithm itself. As mentioned, a fully convolutional model and a fully linear model differ in parameters and compute, and structures with different compute profiles can be combined according to the available budget. The results may differ somewhat, but by matching the model structure to the situation, the combined algorithm can be made to fit the chip's compute and storage constraints as closely as possible.
Second, different scenarios call for different models. If the scenario is known up front, say a meeting scenario where music and animal sounds will not occur, those noise types need no special attention, and this becomes a direction for trimming the model.
Even so, the algorithm may still come out at 5 or 6 megabytes of parameters, which may not be good enough. Or its compute may not be optimized for mobile: memory access and on-chip cache behavior can hurt it during inference and in actual use. It clearly runs fine in training, yet behaves differently when deployed on different chips.
So the second direction is engineering, mainly around model inference, where the processing differs from training. First, the model undergoes operator optimization: during training the model is built layer by layer, but many operators can be fused, including operator fusion and convex optimization. Pruning and quantization of the parameters further reduce both the compute and the size of the model.
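As a hedged illustration of pruning and quantization (not what Agora actually ships), here is a sketch using PyTorch's built-in utilities on a toy model: magnitude pruning zeroes the smallest weights, and dynamic quantization stores the linear weights in int8.

```python
# Pruning + dynamic quantization sketch on a toy model.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(257, 256), nn.ReLU(), nn.Linear(256, 257))

# Magnitude pruning: zero out the 30% smallest weights of each Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")      # make the pruning permanent

# Dynamic quantization: int8 Linear weights, smaller model, faster matmuls
# on many mobile CPUs.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)
```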
The first step, then, is pruning and quantizing the model to make it as lean as possible and best suited to the scenario. Beyond that, the chips differ across mobile devices: some phones only have a CPU, while better ones may have a GPU, an NPU, or even a DSP, whose computing power can also be tapped.
To adapt to the chip better, there are different inference frameworks; each vendor offers relatively open ones, such as Apple's Core ML and Google's TensorFlow Lite, which handle chip scheduling and compilation-level optimization. The difference between doing this work and not doing it is huge: how the algorithm works is one thing, but how memory access, matrix computation, and floating-point computation are carried out is another. With engineering optimization the performance can improve a hundredfold. You can rely on open-source frameworks or compile things yourself; if you are familiar with the chip, for example how to use its different caches and how big they are, a hand-tuned implementation can beat an open-source framework.
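For completeness, here is a sketch of handing a trained model to one such framework, TensorFlow Lite (mentioned above). The file paths and the Keras model are placeholders; this only shows the general conversion flow, not the pipeline used in the talk.

```python
# Convert a trained Keras model to a TensorFlow Lite model with default
# optimizations (quantization) enabled.
import tensorflow as tf

model = tf.keras.models.load_model("denoise_model.h5")   # assumed trained model

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]      # enable quantization
tflite_model = converter.convert()

with open("denoise_model.tflite", "wb") as f:
    f.write(tflite_model)
# The .tflite file is then run by the TFLite interpreter, which schedules
# operators and can use on-device accelerators through delegates.
```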
After integrating the model with the inference engine, we end up with a product that fits almost all devices, fully engineered for the various chips, so it can run in real time.
05. Noise reduction Demo
Now let's listen to what the noise reduction sounds like.
Here are some of the more common kinds of noise. (mp.weixin.qq.com/s/_eRqcotBv…)
Let's listen to the original recording with keyboard noise, and then to the denoised version. The keyboard sounds are largely eliminated.
This is what wind noise sounds like: a passage of German speech recorded in the wind. Now let's listen to the denoised version.
The subway is also a common scene. Let's listen to the original recording; this is actually me reading a poem on Shanghai Metro Line 10. Now let's listen to the denoised version.
Next, noise inside a car, such as in a taxi. Let's listen to it, and then to the denoised version; this is a clip we actually recorded in a taxi, with the engine noise removed.
Can we do it better?
After listening to these demos, let's see what we can do to make it better and cover more scenarios.
We still have many unsolved problems. One is music: if you run noise suppression in a music scene, the accompaniment is not voice, so more sophisticated approaches such as source separation may be needed to preserve the instruments, and some music-like background noise remains hard to handle. Another is voice-like noise such as babble, where the background chatter is hard to distinguish from the target speaker, especially in the cocktail-party situation where everyone is talking and it is hard for the AI to decide which voice is the one that matters. What we do now is single-channel noise suppression; with microphone arrays some directional noise reduction is possible, but that is also difficult: which sounds are worth keeping, and how to separate voice from background by direction. These are the clearer directions we will explore in the future.
That’s all for my sharing. Thank you.