Preface
For work reasons, Komatsu has recently been studying Spitz's speech recognition model. Having never touched this field before, he looked mainly at the voice wake-up content.
For local voice wake-up, once the configuration is done you only need to implement the wake-up callback interface and instantiate the wake-up engine. I think of the wake-up engine as a container of wake-up functionality: it exposes many interfaces, and when, for example, the engine detects a voice, a corresponding callback method fires; developers implement these callback methods to achieve their app's custom behavior.
Some callback methods carry agreed-upon return values, such as the detected sound change or an error code. The figure below shows the whole process: everything outside the blue box is application-side development, and inside the box, apart from the callback interfaces you implement, the rest is the SDK's work.
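To make the callback flow concrete, here is a minimal sketch of the listener pattern described above. This is my own illustration in Python; the class and method names (WakeupListener, on_wakeup, on_error, and so on) are hypothetical and are not Spitz's actual SDK API.

```python
# Hypothetical wake-up callback pattern: the app implements a listener,
# the engine invokes it when something happens. Not a real SDK API.

class WakeupListener:
    """Interface the app developer implements."""
    def on_wakeup(self, word: str, confidence: float):
        raise NotImplementedError

    def on_error(self, code: int, message: str):
        raise NotImplementedError


class MyAppListener(WakeupListener):
    def on_wakeup(self, word, confidence):
        print(f"woke up on '{word}' (confidence={confidence:.2f}) -> launch my feature")

    def on_error(self, code, message):
        print(f"wake-up error {code}: {message}")


class WakeupEngine:
    """Stand-in for the SDK engine: the app never sees the detection internals,
    it only receives callbacks."""
    def __init__(self, listener: WakeupListener):
        self.listener = listener

    def start(self):
        # A real engine would open the mic, run VAD + the wake-word model,
        # then call back; here we just pretend a detection happened.
        try:
            detected_word, confidence = "hey siri", 0.92
            self.listener.on_wakeup(detected_word, confidence)
        except Exception as exc:
            self.listener.on_error(code=-1, message=str(exc))


engine = WakeupEngine(MyAppListener())
engine.start()
```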
In the SDK documentation, local voice recognition, real-time short-speech recognition and local voice wake-up share essentially the same content; the three differ only in minor details, and even the interface documentation is reused across them.
Of course, as development engineers we cannot be satisfied with just calling packages and tuning parameters. The part inside the blue box in the figure above is what the SDK does, and that is what we want to understand in depth.
Wake-up recognition
As we all know, deep neural networks are now widely used in speech recognition. Wake-up recognition is the first step of speech recognition, somewhat like the first handshake of a TCP connection. Compared with ordinary speech recognition it has extra work to do, mainly voice activity detection (VAD, i.e. speech endpoint detection) and wake-word recognition.
VAD
You have an iPhone: you just say "Hey Siri" and Siri answers you, whatever your voice sounds like. Have you ever wondered what is going on behind this? VAD roughly runs in three stages:
- Feature extraction: time-domain or frequency-domain features are extracted from the raw audio, much as any message you send on WeChat is, in the computer's eyes, converted to binary. Sound features are rich: energy, entropy, fundamental frequency and so on.
- Preliminary decision: this is the analysis center, where each input audio segment is classified as speech or non-speech. This is where the noise-cancelling earphones we use every day separate the human voice from environmental noise before suppressing the latter. The algorithms here can be very complex: you can hand-write rules, for example treating sound at a steady decibel level as ambient noise, but that is not accurate; or you can apply big-data statistics and deep learning, which is more accurate but less interpretable.
- Post-processing: this mainly smooths the result of the preliminary decision. The audio arrives frame by frame: frame 1, 2, 3, and so on. The preliminary decision may label frames 1 and 3 as human voice and frame 2 as environment, which in theory is unlikely, since speech should be continuous; it is improbable that the voice suddenly vanishes for one frame and then instantly comes back. Post-processing corrects these likely errors and avoids the speech state toggling back and forth.
As can be seen from the above, the most critical part is the preliminary decision: which VAD algorithm do we choose to separate the human voice from environmental sound? We hope this VAD algorithm can still pick out the human voice in a noisy environment, that is, at low signal-to-noise ratio (SNR), and that it is robust.
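To make the three stages concrete, here is a toy sketch of such a pipeline (my own illustration, not any SDK's code): per-frame energy is the feature, a fixed threshold gives the preliminary decision, and a simple "hangover" pass smooths isolated drops so the speech state does not flicker. The frame length and threshold values are arbitrary assumptions.

```python
import numpy as np

def toy_vad(signal, frame_len=160, threshold=0.01, hangover=3):
    """Toy three-stage VAD: energy feature -> threshold decision -> smoothing."""
    # 1. Feature extraction: mean energy per frame (10 ms frames at 16 kHz).
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)

    # 2. Preliminary decision: speech if the frame energy exceeds the threshold.
    raw = energy > threshold

    # 3. Post-processing: keep the speech flag on for a few frames after the
    #    last detection, so a single dropped frame does not split an utterance.
    smoothed = raw.copy()
    since_speech = hangover + 1
    for i, is_speech in enumerate(raw):
        since_speech = 0 if is_speech else since_speech + 1
        if since_speech <= hangover:
            smoothed[i] = True
    return smoothed

# Quiet noise followed by a louder "voice" burst.
sig = np.concatenate([0.01 * np.random.randn(1600), 0.5 * np.random.randn(1600)])
print(toy_vad(sig).astype(int))
```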
The question is: what exactly is the difference between a human voice and noise? There must be one, otherwise how could the human ear tell them apart? Starting from this angle, the following algorithms emerged.
Short-time energy detection
The idea of this method is that energy is the biggest difference between voice and noise: the voice carries more energy. At the same time, because speech is essentially intermittent (even a chatterbox does not talk without pausing), we only need to collect statistics on the energy of these sound segments, take an average, and then treat future sounds with similar energy as human voice. The formula is as follows:
E_n = Σ_m [ x(m) · w(n - m) ]²
You do not need to dig into the formula for development. Simply put, x(n) is the original sound signal, w(n - m) is the window function that determines the sampling length, and the average is then computed over the window.
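For readers who do want to see the formula in action, here is a small numpy sketch of the short-time energy computation, E_n = Σ_m [ x(m) · w(n - m) ]², with a Hamming window as w. The window size, hop and the 0.5 comparison factor are my own assumptions, and the "known voice" segment is faked with random noise.

```python
import numpy as np

def short_time_energy(x, win_len=400, hop=160):
    """E_n = sum_m [x(m) * w(n - m)]^2, evaluated once per frame (hop)."""
    w = np.hamming(win_len)                     # the window function w(n - m)
    n_frames = 1 + (len(x) - win_len) // hop
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop: i * hop + win_len]   # the samples x(m) under the window
        energies[i] = np.sum((frame * w) ** 2)
    return energies

rng = np.random.default_rng(0)
quiet = 0.01 * rng.standard_normal(16000)       # 1 s of quiet background at 16 kHz
voice = 0.3 * rng.standard_normal(16000)        # stand-in for a louder human voice
signal = np.concatenate([quiet, voice])

# Average the energy of segments we know are voice, then treat future frames
# with comparable energy as voice.
voice_avg = short_time_energy(voice).mean()
is_voice = short_time_energy(signal) > 0.5 * voice_avg
print(is_voice.astype(int))
```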
This method works when the environmental sound is relatively fixed and relatively quiet. For example, programmers on an Intel MacBook could use it to filter out the Mac's fan noise fairly well (Intel, step aside; M1 forever).
Its limitation is also obvious: the scope of application is narrow. In more complex environments, such as offices, roads or vegetable markets, it is almost useless.
DNN-based detection
Besides what we can directly see and hear, images and sounds contain a great deal of feature information that our sense organs simply cannot parse. A deep neural network, on the other hand, can extract features and build acoustic feature-vector models through many convolution-activation-pooling layers. To some extent this method is omnipotent: as long as two things differ, a neural network can almost always extract the difference.
Generally speaking, short-time spectral features are extracted, such as MFCC (Mel-frequency cepstral coefficients), PLP (perceptual linear prediction) and FBANK (filter-bank) features. Why these? Very simple: experiments show that neural networks find them easier to extract and learn from.
The extraction process is as follows
Speech is usually a continuous waveform, so it is first converted into a sequence of discrete vectors, called feature vectors, by slicing it roughly every 10 ms. Each of these ~10 ms slices is called a frame.
Then a window is applied. Because sound is correlated with what comes before and after it, once the signal has been cut into many frames, smooth windows are needed to taper each frame and reduce boundary effects.
Next comes the FFT (Fast Fourier Transform), a mathematical transform that converts time-domain features into frequency-domain features.
Finally, filter-bank analysis is carried out; the features produced here are mainly FBANK, MFCC and PLP.
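Here is a rough sketch of these four steps (framing, windowing, FFT, filter-bank analysis) producing log-FBANK features; MFCC would be obtained by additionally applying a DCT to these. It is only an illustration: the 25 ms frame length, 10 ms hop and 40 mel bands are common choices I am assuming, not values prescribed by the SDK or this article.

```python
import numpy as np
import librosa

def fbank_features(y, sr=16000, frame_len=400, hop=160, n_mels=40):
    """Frame -> window -> FFT -> mel filter bank (log-FBANK features)."""
    window = np.hamming(frame_len)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_mels)

    feats = []
    for start in range(0, len(y) - frame_len + 1, hop):   # ~10 ms hop
        frame = y[start:start + frame_len] * window        # framing + windowing
        power = np.abs(np.fft.rfft(frame)) ** 2            # FFT -> power spectrum
        feats.append(np.log(mel_fb @ power + 1e-10))       # filter-bank analysis
    return np.array(feats)                                 # shape: (n_frames, n_mels)

y = 0.1 * np.random.randn(16000)        # 1 s of fake audio at 16 kHz
print(fbank_features(y).shape)          # (98, 40)
```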
Once these features are extracted, training naturally follows: build a multi-layer neural network, train it continuously, and post-process the final output.
The results show that DNN-based speech endpoint detection improves accuracy considerably, but the effect is still poor in noisy environments.
The point is that we did not really contribute anything ourselves; we left everything for the DNN to discover, and it cannot possibly exhaust the complexity of the acoustic world, so there are features it cannot learn, or can only learn with difficulty. We need to find these ourselves and feed them to the network, so as to further improve its accuracy and robustness.
The table below gives a brief summary.
Algorithm | Principle | Advantages | Disadvantages |
---|---|---|---|
Short-time energy detection | Compare frames against the average energy of human voice | Simple; works well when noise is low and stable | Narrow scope of application; limited capability |
DNN | Deep learning extracts the features | Better results; wider applicability | Poor interpretability; hard to optimize |
DNN + input features | DNN plus manually supplied input features | Probably the best solution we have | Difficult to implement |
Wake-word recognition
The above described the principle of voice endpoint detection and two existing algorithms. Let's go back to Siri. Now that the iPhone has used a DNN to pick your voice out of a noisy environment, the next step is to decode your voice into a command the system can recognize.
Siri does not respond when we say "Hello Siri", so we can infer that Siri's start-up command must be matched against the exact phrase "Hey Siri"; this is keyword detection.
Therefore, when the iPhone was trained with a DNN, this phrase must have gone through big-data training across many languages and speaking styles to build a keyword acoustic model. There are many keyword-detection algorithms; a commonly used one is keyword detection based on the hidden Markov model (HMM).
Hidden Markov model
It sounds abstract, but the idea is not hard: build one set of HMMs for the keywords, called the keyword model, and another set of HMMs for non-keywords, called the filler model. When an utterance arrives, if it turns out to match the filler model, it is discarded.
We say that all happy families are alike, while each unhappy family is unhappy in its own way. It is similar here: there is only one keyword model, whereas the filler model is an umbrella term with its own HMMs, and there can be many of them.
The core role of this model is to filter out the keywords and, through the HMMs, convert them into signals the system can recognize. This process is the decoding of the acoustic model.
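To illustrate the keyword-model / filler-model idea, here is a hedged sketch using the hmmlearn package: train one Gaussian HMM on feature frames from keyword utterances and one on everything else, then accept an input only if the keyword model explains it better. The features, model sizes and threshold are placeholder assumptions, not anything Siri or the SDK actually uses.

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

# Placeholder training data: rows are feature frames (e.g. the FBANK/MFCC
# vectors extracted earlier), one matrix for keyword audio, one for filler.
keyword_frames = rng.normal(loc=1.0, size=(500, 13))
filler_frames = rng.normal(loc=-1.0, size=(500, 13))

keyword_model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
filler_model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
keyword_model.fit(keyword_frames)
filler_model.fit(filler_frames)

def is_wake_word(frames, margin=0.0):
    """Accept only if the keyword HMM explains the frames better than the filler HMM."""
    return keyword_model.score(frames) - filler_model.score(frames) > margin

test = rng.normal(loc=1.0, size=(80, 13))   # frames that "sound like" the keyword
print(is_wake_word(test))                   # expected: True
```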
Decoding
You might say: that's easy, we just hard-code it and only respond to "Hey Siri".
Of course not! Wake-up recognition involves only one sentence, but afterwards the user will say many different things, and the decoder has to work throughout the whole voice interaction, not just at wake-up.
But there are tens of millions of different sentences in the world. Should the server exhaustively enumerate them? No; in essence it detects the keywords in a sentence and infers the sentence's meaning from them.
So the decoding process is, in essence, still keyword detection.
The figure above shows the decoder's position. Since it performs keyword detection, there must be a keyword lexicon for KW matching, together with a set of models that identify the keywords in each sentence. Of course, that goes beyond wake-up recognition, and this article only discusses wake-up recognition; interested readers can study it further, and I will keep studying too.
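As a toy illustration of the keyword-lexicon matching the decoder relies on (and of the keyword-to-callback idea that comes up again in the Apollo example below), here is a sketch where decoded text is scanned against a small lexicon and the registered callback fires. The lexicon contents and function names are made up for illustration; this is not any real decoder API.

```python
def open_map_and_search(query: str):
    print(f"(callback) opening the map and searching for: {query}")

# Keyword lexicon: decoded keyword -> callback registered by the app.
KEYWORD_LEXICON = {
    "eat": lambda: open_map_and_search("restaurants nearby"),
    "hey siri": lambda: print("(callback) wake up, start full recognition"),
}

def dispatch(decoded_text: str):
    """Scan the decoded text for lexicon keywords and fire their callbacks."""
    text = decoded_text.lower()
    for keyword, callback in KEYWORD_LEXICON.items():
        if keyword in text:
            callback()

dispatch("I want to eat something")   # -> opens the map and searches for restaurants
```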
Afterword
This is Komatsu's first systematic summary of speech-recognition-related knowledge, and it boils down to two steps:
- Voice activity detection: a VAD/DNN algorithm filters out the noise
- Wake-word recognition: a hidden Markov model recognizes the keywords, and the decoder decodes them by matching against the keyword lexicon
As an undergraduate I worked with Baidu's Apollo. Baidu's speech models are relatively advanced in China, and they are very simple to use: you fill in a keyword, and whenever a sentence contains that keyword the model detects it and returns the corresponding result.
For example, fill in "eat" in the backend, then implement a callback function in the app: open the map and search for restaurants.
Then, when you say "eat" to an app using the Apollo model, the app automatically opens the map and searches for places to eat.
It’s fun. If you’re interested, you can try it.
Komatsu is very glad to live in this era of knowledge explosion: so many questions I was curious about now have scientific answers, and even where there are no ready-made answers, there are scientific methods to help us find them. Keep learning for life and let's make progress together!
Reference Book: Artificial Intelligence: Speech Recognition Understanding and Practice
Personal profile
My name is Komatsu, class of 2020, an undergraduate from a lower-tier 985 university. I took part in developing mobile QQ, and I now keep posting problem-solving videos, hand-drawn diagrams and coding practice on Bilibili, focusing on Android and algorithms. Articles are first published on the official account "Komatsu walk".
There is also a study group with hundreds of friends who love learning; we hold mock interviews with group members every week and encourage each other. If you would like to join, add me on WeChat: CS183071301.
Every Saturday at 9:00 PM you can take part in the simulated remote interview online, live-streamed on Bilibili under "Komatsu Bumo". You are welcome to join us.