In the earliest video-recording and live-streaming work, audio captured from the microphone was passed straight to the encoder, then packaged and multiplexed into different media formats. Later, in IM scenarios, short voice messages, voice-message-to-text, and voice input all still took audio directly from the microphone and handed it to speech recognition. Only after designing and building a complete speech SDK did I plan out technical requirements such as sound effects, speech separation and synthesis, and resampling.
At the beginning of this year, I took over a project for a voice interaction system similar to Xiao Ai (Xiaomi) and Xiao Du (Baidu), covering the whole interaction pipeline. That was when I realized that the things I had originally been planning fall under a professional term, "signal processing," and that the interaction pipeline contains a dedicated signal-processing module. Let's first look at what speech signal processing is and what it involves.
Signal processing functions
Speech signal processing is an interdisciplinary field combining speech linguistics and digital signal processing. It is closely related to cognitive science, psychology, linguistics, computer science, signal and information processing, acoustics, pattern recognition, artificial intelligence, and physiology. Its research topics include speech recognition, speaker recognition, language identification, speech synthesis, speech front-end processing (including but not limited to speech enhancement, speech separation, echo cancellation, and microphone-array signal processing), speech coding, and more.
That is the textbook definition; here we focus on speech front-end signal processing.
Echo cancellation (AEC)
In communication systems there are two main types of echo: line echo and acoustic echo. The Acoustic Echo Cancellation (AEC) discussed here targets acoustic echo: in a full-duplex system where each phone's microphone can pick up its own speaker, the sound loops endlessly, i.e. speaker A → microphone A → phone A → phone B → speaker B → microphone B → phone B → phone A → speaker A → ... An intuitive description: two people are on a voice call, one turns on the speakerphone so the microphone records the other person's voice coming out of it, and the other person ends up hearing their own voice echoed back.
If echo only degraded the user experience during a call, that would be bad enough, but for speech recognition it is disastrous. AEC, as one of the "3A" algorithms (AEC, AGC, ANS), is therefore something most signal-processing systems implement. On Android, most phones already provide echo cancellation in the framework layer. To implement it yourself you must be able to obtain the reference (far-end) signal, but many phones do not expose that permission, so a custom AEC is mostly usable only on devices whose system you control.
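As a rough illustration of how AEC can work once the reference signal is available, here is a minimal normalized LMS (NLMS) sketch in Python with NumPy. The echo path, delay, filter length, and step size are all invented demo values, not anything a real phone uses:

```python
import numpy as np

def nlms_aec(mic, ref, num_taps=64, mu=0.5, eps=1e-8):
    """Cancel the echo of `ref` present in `mic` with an NLMS adaptive filter.
    Returns the echo-reduced (near-end) signal."""
    w = np.zeros(num_taps)      # adaptive filter estimating the echo path
    buf = np.zeros(num_taps)    # most recent reference samples, newest first
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        echo_est = w @ buf                       # estimated echo at the mic
        e = mic[n] - echo_est                    # near-end speech + residual echo
        w += (mu / (eps + buf @ buf)) * e * buf  # normalized LMS weight update
        out[n] = e
    return out

# Demo: the "echo path" is just a delayed, attenuated copy of the far end.
rng = np.random.default_rng(0)
ref = rng.standard_normal(4000)                       # far-end signal
echo = 0.6 * np.concatenate([np.zeros(8), ref[:-8]])  # delayed echo at the mic
mic = echo                                            # no near-end talker here
clean = nlms_aec(mic, ref)
```

After the filter converges, the residual echo in `clean` is far below the raw echo level in `mic`.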
Noise suppression (NS)
Noise Suppression (NS) removes background noise to improve the signal-to-noise ratio and intelligibility of the speech signal, so that both people and machines can hear more clearly. In practice this means suppressing noises such as keyboard tapping, drinking water, or a door closing.
Noise suppression has two parts: noise estimation and gain-factor estimation.
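A classic way to combine those two steps is spectral subtraction: estimate the noise magnitude spectrum, then subtract it from each frame with a gain floor. The sketch below (Python/NumPy) is a toy version, with invented frame sizes and a noise-only lead-in for the estimate; production suppressors are far more sophisticated:

```python
import numpy as np

def spectral_subtraction(x, frame=256, noise_frames=8, floor=0.05):
    """Toy spectral subtraction: estimate the noise magnitude spectrum from the
    first few frames, then subtract it from every frame (with a gain floor)."""
    n_frames = len(x) // frame
    frames = x[:n_frames * frame].reshape(n_frames, frame)
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:noise_frames].mean(axis=0)           # noise estimate
    clean_mag = np.maximum(mag - noise_mag, floor * mag)  # subtract, keep a floor
    out = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame, axis=1)
    return out.reshape(-1)

# Demo: a tone buried in white noise; the lead-in is noise-only on purpose,
# so the first `noise_frames` frames give a fair noise estimate.
rng = np.random.default_rng(1)
t = np.arange(8192) / 8000.0
noise = 0.3 * rng.standard_normal(8192)
signal = np.sin(2 * np.pi * 440 * t)
signal[:2048] = 0.0
noisy = signal + noise
denoised = spectral_subtraction(noisy)
```

In the noise-only region, the energy of `denoised` drops well below that of `noisy`.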
Automatic gain control (AGC)
Audio AGC is an automatic gain control algorithm, or more precisely a peak automatic gain control algorithm, which automatically and dynamically adjusts the gain according to the input signal level. When the volume (captured or played back) exceeds a certain threshold, the signal is limited: the output no longer follows the input and essentially flattens into a horizontal line at the maximum level. When AGC detects that the gain is approaching that threshold, it automatically reduces the gain to avoid limiting; conversely, when the captured volume is too low, it automatically raises the gain. In either case, the adjustment never pushes the volume beyond the level the user has configured.
When we speak softly, we want to raise the volume by amplifying the signal, but speech amplitude is not uniform: a fixed decibel boost may clip in some places while other places remain too quiet. That is why an automatic gain function is needed.
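A minimal peak-tracking AGC along these lines might look as follows; the target level, frame size, smoothing factor, and limiter threshold are arbitrary demo choices, not values from any real implementation:

```python
import numpy as np

def agc(x, target=0.5, frame=160, max_gain=10.0, limit=0.95):
    """Peak-tracking AGC sketch: per frame, move the gain toward the value
    that brings the frame peak to `target`, then hard-limit the output."""
    out = np.empty_like(x)
    gain = 1.0
    for i in range(0, len(x), frame):
        seg = x[i:i + frame]
        peak = np.max(np.abs(seg)) + 1e-9
        desired = min(target / peak, max_gain)     # gain that would hit the target
        gain = 0.9 * gain + 0.1 * desired          # smooth gain changes over time
        out[i:i + frame] = np.clip(gain * seg, -limit, limit)  # limiter
    return out

# Demo: a quiet tone gets boosted toward the target peak level.
t = np.arange(8000) / 8000.0
quiet = 0.05 * np.sin(2 * np.pi * 200 * t)
louder = agc(quiet)
```

The smoothing keeps the gain from jumping frame to frame, and the final clip acts as the limiter described above.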
Blind source separation (BSS)
Microphone-array algorithms fall into two families: beamforming and blind source separation.
Blind Signal/Source Separation (BSS) estimates the source signals from the observed mixed signals alone, without knowing the source signals or the mixing parameters.
Advantages of blind source separation:
- Blind source separation does not require prior VAD (voice activity detection) information about the target speech. That prior is crucial to beamforming, and its accuracy directly affects performance. Blind source separation also does not require adaptive filtering.
- Blind source separation does not require DOA (direction of arrival) information for the target speech.
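To make the idea concrete, here is a tiny FastICA sketch, one standard BSS algorithm for instantaneous mixtures (real microphone arrays face convolutive mixtures, which are much harder). The sources, the mixing matrix, and the iteration count are invented purely for the demo:

```python
import numpy as np

def fast_ica(X, n_iter=200, seed=0):
    """Tiny symmetric FastICA (tanh nonlinearity) for instantaneous mixtures.
    X has shape (n_sources, n_samples). Returns estimated sources, recovered
    only up to permutation, sign, and scale."""
    X = X - X.mean(axis=1, keepdims=True)       # center
    d, E = np.linalg.eigh(np.cov(X))
    Z = (E @ np.diag(d ** -0.5) @ E.T) @ X      # whiten
    n, N = Z.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, n))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # Fixed-point update: E[z g(w^T z)] - E[g'(w^T z)] w, per row of W.
        W_new = (G @ Z.T) / N - np.diag((1 - G ** 2).mean(axis=1)) @ W
        U, _, Vt = np.linalg.svd(W_new)
        W = U @ Vt                               # symmetric decorrelation
    return W @ Z

# Demo: two known sources mixed by a matrix the algorithm never sees.
t = np.linspace(0, 1, 4000)
s1 = np.sin(2 * np.pi * 7 * t)                  # sine source
s2 = np.sign(np.sin(2 * np.pi * 11 * t))        # square-wave source
S = np.vstack([s1, s2])
A = np.array([[1.0, 0.6], [0.5, 1.0]])          # unknown mixing matrix
est = fast_ica(A @ S)
```

Each estimated component should correlate strongly with one of the original sources, though which one, and with what sign, is not determined.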
Beamforming (BF)
Beamforming is the other microphone-array algorithm; in a speech interaction system it is mainly used to combine the signals captured by a multichannel microphone array into a single output channel.
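The simplest beamformer is delay-and-sum: time-align each channel, then average, so the target signal adds coherently while uncorrelated noise partly cancels. The sketch below assumes the per-microphone delays are already known (in practice they come from array geometry or DOA estimation); the delays and noise level are demo values:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Delay-and-sum beamformer sketch: align each microphone channel by its
    (known) integer sample delay, then average into one output channel."""
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)

# Demo: the same source reaches 3 mics with different delays, plus noise.
rng = np.random.default_rng(2)
src = np.sin(2 * np.pi * 300 * np.arange(4000) / 8000.0)
delays = [0, 3, 7]
channels = [np.roll(src, d) + 0.3 * rng.standard_normal(4000) for d in delays]
out = delay_and_sum(channels, delays)
```

Averaging three aligned channels cuts the noise power by roughly a factor of three while leaving the source intact.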
Direction of arrival (DOA) estimation
The basic DOA problem is to determine, simultaneously, the spatial positions of multiple signals of interest within a region of space, i.e. the direction angles of those signals relative to the array's reference element.
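For a single microphone pair, DOA estimation often starts from the time difference of arrival, commonly estimated with GCC-PHAT. Here is a minimal sketch; the 5-sample delay in the demo is an invented value:

```python
import numpy as np

def gcc_phat_delay(a, b):
    """Estimate how many samples `b` lags behind `a` using GCC-PHAT,
    a common first step of DOA estimation for a microphone pair."""
    n = len(a) + len(b)
    A = np.fft.rfft(a, n=n)
    B = np.fft.rfft(b, n=n)
    R = np.conj(A) * B
    R /= np.abs(R) + 1e-12                 # PHAT weighting: keep only the phase
    cc = np.fft.irfft(R, n=n)
    shift = int(np.argmax(np.abs(cc)))
    return shift if shift < n // 2 else shift - n   # map wrap-around to negative lags

# Demo: mic2 hears the same signal 5 samples later than mic1.
rng = np.random.default_rng(3)
sig = rng.standard_normal(2048)
mic1 = sig
mic2 = np.concatenate([np.zeros(5), sig[:-5]])
d = gcc_phat_delay(mic1, mic2)
```

With the delay in hand, the far-field arrival angle follows from the usual geometry, theta = arcsin(c * tau / d_mic), where tau is the delay in seconds, c the speed of sound, and d_mic the microphone spacing.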
Dereverberation
When sound propagates indoors, it is reflected by walls, ceilings, floors, and other obstacles, and each reflection absorbs part of the energy. So when the source stops, the sound wave keeps being reflected and absorbed around the room before it dies out, and for a short period we hear several delayed copies of the sound mixed together (the sound persists after the source has stopped). This phenomenon is called reverberation, and the time it takes to die out is called the reverberation time.
Reverberant signals also hurt speech recognition results, so for certain scenes (a fixed living room or conference room, say) we apply dereverberation.
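Reverberation is usually modeled as convolution of the dry signal with a room impulse response. The sketch below fabricates a synthetic impulse response, white noise under an exponential decay sized so the energy falls 60 dB after a chosen RT60, purely to illustrate the model (real rooms are measured, not generated):

```python
import numpy as np

def simulate_reverb(dry, rt60=0.4, sr=8000, seed=4):
    """Simulate room reverberation by convolving with a synthetic impulse
    response: noise under an exponential decay reaching -60 dB at `rt60`."""
    n = int(rt60 * sr)
    t = np.arange(n) / sr
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(n) * 10 ** (-3 * t / rt60)  # -60 dB at t = rt60
    h[0] = 1.0                                          # direct-path sound
    return np.convolve(dry, h)

# Demo: an impulsive "clap" picks up a decaying tail.
dry = np.zeros(8000)
dry[0] = 1.0
wet = simulate_reverb(dry)
```

The tail of `wet` decays smoothly toward zero, which is exactly the persistence-after-the-source-stops effect described above.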
Basic knowledge of speech signal processing
That covers what signal processing does, but not how it works. There are open-source 3A implementations on the market, such as WebRTC, but they are generic solutions and the results are often not ideal; in recognition and processing scenarios, they are usually optimized for a specific microphone or acoustic scene to achieve good performance.
In this section we look at the algorithms and the basic body of knowledge involved in signal processing.
Speech signal processing theory and algorithms
- Digital signals and their basic operations;
- The sampling theorem;
- Time-frequency analysis and the Fourier transform;
- Adaptive filtering: the overlap-save method;
- Adaptive filtering: the RLS algorithm;
- Adaptive filtering: the AP (affine projection) algorithm;
- Adaptive filtering: the LMS algorithm;
- Echo cancellation and noise suppression algorithms;
- Microphone-array topics.
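As a taste of the adaptive-filtering items above, here is the plain LMS algorithm applied to system identification. The "unknown" 3-tap system, the filter length, and the step size are invented demo values:

```python
import numpy as np

def lms(x, d, num_taps=16, mu=0.01):
    """Basic LMS adaptive filter: adapt weights `w` so that w^T x approximates
    the desired signal `d`. Returns the final weights and the error signal."""
    w = np.zeros(num_taps)
    buf = np.zeros(num_taps)       # most recent input samples, newest first
    e = np.zeros(len(x))
    for n in range(len(x)):
        buf = np.roll(buf, 1)
        buf[0] = x[n]
        y = w @ buf                # filter output
        e[n] = d[n] - y            # estimation error
        w += 2 * mu * e[n] * buf   # steepest-descent weight update
    return w, e

# Demo: identify an unknown 3-tap FIR system from its input/output data.
rng = np.random.default_rng(5)
x = rng.standard_normal(5000)
h = np.array([0.5, -0.3, 0.2])               # the "unknown" system
d = np.convolve(x, h)[:len(x)]
w, e = lms(x, d)
```

After convergence, the first three weights match the unknown system and the error shrinks toward zero; RLS and AP trade more computation for faster convergence on the same problem.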
If these look a little bewildering, don't worry: universities teach a signals-and-systems course that covers them. You can learn, or review, that course first to master the basics.
Fundamentals of signal processing
- Signal and system concepts
  - The concept of a signal
  - Typical signals and their characteristics
  - Basic operations on signals
  - Signal decomposition
- Time-domain analysis of continuous-time systems
  - Description and response of continuous-time systems
  - Zero-input response and zero-state response
  - Impulse response and step response
  - The convolution integral
  - System characteristics expressed by the unit impulse response
- Time-domain analysis of discrete-time systems
- Frequency-domain analysis of continuous-time signals and systems
  - Orthogonal decomposition of signals
  - The spectrum of periodic signals: the Fourier series
  - Properties of the Fourier series
  - The spectrum of aperiodic signals: the Fourier transform
  - Properties of the Fourier transform
  - The power spectrum and energy spectrum of a signal
- Frequency-domain analysis of discrete-time signals and systems
- Complex-frequency-domain analysis of continuous-time signals and systems
  - The Laplace transform
  - Properties of the Laplace transform
  - The inverse Laplace transform
  - Complex-frequency-domain analysis of continuous-time LTI systems
- Z-domain analysis of discrete-time signals and systems
  - The Z transform
  - Properties of the Z transform
  - The inverse Z transform
- Applications of signals-and-systems theory
  - Distortion-free transmission of signals
  - Modulation and demodulation
  - Multiplexing
  - Signal filtering
- Analysis of system state variables
This involves some higher mathematics, which you can pick up alongside the course as you learn.
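As a small taste of the frequency-domain topics above, the Fourier transform of a pure tone concentrates its energy at the tone's frequency. With NumPy:

```python
import numpy as np

# A 1 kHz tone sampled at 8 kHz; its spectrum should peak at the 1 kHz bin.
sr = 8000
n = 1024
t = np.arange(n) / sr
x = np.sin(2 * np.pi * 1000 * t)
spectrum = np.abs(np.fft.rfft(x))          # magnitude spectrum
freqs = np.fft.rfftfreq(n, d=1.0 / sr)     # bin frequencies in Hz
peak_hz = freqs[np.argmax(spectrum)]       # frequency of the strongest bin
```

Here 1000 Hz falls exactly on a bin (bin 128 of 1024 at 8 kHz), so there is no spectral leakage; in general a window function is applied first.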
Signal processing related tools
We will mainly use Adobe Audition and MATLAB.
Conclusion
This article introduced each function of the signal-processing module and its role in mobile speech processing, along with the basic knowledge and common tools of signal processing.