This article was published in the Cloud + Community column by the Tencent Cloud AI Center.
My talk today is divided into four parts. The first part is an overview of speech recognition, followed by the fundamentals of deep neural networks. Next comes the application of deep learning to the acoustic model in speech recognition, and finally the difficulties of speech recognition and its future directions.
First, a brief introduction to speech recognition. What is the basic process? Put simply, speech recognition turns speech into text. Ever since the computer was invented, people have wanted machines to recognize what we say, or to go further and understand what we say. Yet speech recognition was not commercially deployed at scale until about ten years ago. Why? Because ten years ago it was not good enough: accuracy was only 70 to 80 percent. Getting seven or eight words right out of ten looks decent, but for real applications that accuracy is far from sufficient.
Ten years ago we also did not use as many applications on mobile phones as we do now; we worked mostly on computers, where a mouse and keyboard are more accurate and convenient. Now, with mobile apps, and with many in-car products where a mouse or keyboard is impractical, voice control becomes necessary.
So why has accuracy improved so much in recent years? There are three reasons.
The first is the development of the Internet. Why would that improve speech recognition accuracy? Because the Internet lets speech be shared and stored on computers, speech recognition now has vastly more data to learn from. Around the year 2000 we might have had only a few hundred hours of speech, which felt like a lot at the time; today we need tens of thousands of hours to train a good system.
Another is the development of hardware, namely GPUs and CPUs. Computing speed is dozens of times higher than a few years ago, and the GPU stands out, especially for deep learning. A CPU performs one addition or multiplication per instruction, while a GPU can perform, say, a 1,000-dimensional addition in a single instruction. The combination of these factors, more data, faster hardware, and the deep learning algorithms discussed below, is what raised the accuracy of speech recognition.
So what is the basic process of speech recognition? Imagine how we ourselves would turn speech into text. Suppose we are a primary school pupil who has just started learning Chinese, and someone says "ma hua teng". What we hear first is a sound wave. We recognize it as pinyin: "ma", "hua", "teng". From the pinyin alone we still do not know which characters are meant, so we look them up in the Xinhua dictionary and map each syllable to candidate characters. Then how do characters become words and a sentence? We judge that "Ma Huateng" is the most probable word, because we see it often, rather than the similar-sounding but meaningless phrase behind it ("the hemp flower hurts").
After we get the sound, we extract features, use an acoustic model to turn them into sound units, and then use a language model to turn those units into words and sentences. In feature extraction we often see audio described as 8K or 16K, or 16K with 16-bit samples. What does that mean? 8K means the audio is sampled 8,000 times per second: the recording device measures the sound intensity 8,000 times every second. The samples are then grouped into frames, for example 25 ms per frame, with the window shifted forward continuously, typically every 10 ms. For 16K audio, one 25 ms frame contains 400 sample points. Each frame is transformed to show how its energy is distributed over different frequencies, then passed through a bank of filters and a logarithmic transform, which compresses it into a small number of values. With MFCC, for example, one frame of data yields 13 coefficients, so one second of audio becomes roughly a 13 x 100 matrix. That is the general process of feature extraction.
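To make those numbers concrete, here is a minimal sketch of the framing step in NumPy, assuming 16 kHz audio, 25 ms frames and a 10 ms shift as described above; a real MFCC pipeline would follow this with a mel filterbank and a DCT to obtain the 13 coefficients.

```python
# Minimal framing + log power spectrum sketch (assumed parameters from the talk).
import numpy as np

SAMPLE_RATE = 16000
FRAME_LEN = int(0.025 * SAMPLE_RATE)    # 400 samples per 25 ms frame
FRAME_SHIFT = int(0.010 * SAMPLE_RATE)  # 160 samples per 10 ms shift

def frame_signal(signal: np.ndarray) -> np.ndarray:
    """Split a 1-D waveform into overlapping frames of shape (num_frames, 400)."""
    num_frames = 1 + (len(signal) - FRAME_LEN) // FRAME_SHIFT
    idx = (np.arange(FRAME_LEN)[None, :]
           + FRAME_SHIFT * np.arange(num_frames)[:, None])
    return signal[idx]

def log_power_spectrum(frames: np.ndarray) -> np.ndarray:
    """Window each frame and take the log power spectrum.
    A full MFCC pipeline would add a mel filterbank and a DCT here."""
    windowed = frames * np.hamming(FRAME_LEN)
    spectrum = np.abs(np.fft.rfft(windowed, n=512)) ** 2
    return np.log(spectrum + 1e-10)

# One second of fake audio -> roughly 100 frames of features.
signal = np.random.randn(SAMPLE_RATE).astype(np.float32)
features = log_power_spectrum(frame_signal(signal))
print(features.shape)  # (98, 257)
```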
Once we have these features, we need to turn them into probabilities over sound units. The history of this research goes roughly like this. Many years ago, speech recognition was not the continuous recognition we do today but isolated word recognition with DTW (dynamic time warping). Suppose there are 400 people in this room and I collect recordings of each of them saying words such as "stand up" or "open the door". When a new utterance comes in, I compare it against the stored templates and ask which word it correlates with most strongly; if it is most similar to "open the door", it is recognized as "open the door". So at the beginning, speech recognition was relatively simple word-by-word matching. Later came HMM/GMM, the hidden Markov model with Gaussian mixtures, which moved speech recognition from isolated words to large-vocabulary continuous speech recognition; that was a big step forward, and it is still impressive today.
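As an illustration of that isolated-word matching, here is a minimal DTW sketch in NumPy; the template words and feature sizes are invented for the example, not taken from the talk.

```python
# Minimal DTW-based isolated word matching: the word whose stored template is
# closest under the warped distance wins. All names and sizes are illustrative.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic dynamic-time-warping distance between two feature sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

def recognize(query: np.ndarray, templates: dict) -> str:
    """Return the word whose template is closest to the query under DTW."""
    return min(templates, key=lambda w: dtw_distance(query, templates[w]))

# Toy usage with random "features" standing in for real recordings.
templates = {"open door": np.random.randn(80, 13),
             "stand up": np.random.randn(60, 13)}
print(recognize(np.random.randn(70, 13), templates))
```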
With HMM/GMM, the recognition rate improved greatly over what came before, but it still could not meet the level needed for everyday applications. Over the last ten years, deep learning has pushed speech recognition accuracy above 90 percent, where previously it was 70 to 80 percent, and above 90 percent it can be deployed commercially at scale.
The acoustic features mentioned above first become sound units, and the sound units then become a sentence. Producing the final words takes several components: a pronunciation dictionary and a language model trained from large amounts of text, with the pronunciation dictionary prepared in advance.
Now let us talk about the basics of neural networks. First, the biological neuron: it has a nucleus and dendrites that receive signals from outside, and based on those signals the cell decides whether or not to activate.
Scientists, inspired by biological neurons, invented the basic mathematical structure of the artificial neuron: multiply each input by a weight, add the results together, and pass the sum through a nonlinear function to get the output. It is a very simple unit.
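A minimal sketch of that artificial neuron in NumPy, using a sigmoid as the nonlinearity (any nonlinear function would do):

```python
# One artificial neuron: weighted sum of the inputs plus a bias, then a nonlinearity.
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """output = nonlinearity(w . x + b)"""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # one weight per input
print(neuron(x, w, b=0.2))
```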
What can a simple neuron, or a simple neural network, do? Combining these neurons gives a neural network. The structure is relatively simple and layered: an input layer, hidden layers, and an output layer, where each neuron is the simple unit described above. A single simple neuron by itself has only a limited ability to represent data.
Next, convolutional neural networks: what exactly is convolution? We have probably all heard a lot about convolution, and at first I wondered the same thing. As the name suggests, you slide a small window over the data and multiply and add; that is how I understand it, and you can see it in this picture. At the front is the original image, in the middle is the convolution kernel, and the result is the convolution. You place the kernel over a patch of the image, multiply each pair of corresponding points, add them up, and the sum becomes one point of the output. That is convolution.
What does this convolution actually do? What is the point of multiplying and adding? To give some intuition: these are different convolution kernels, and convolving the same image with them produces different results. One kernel sums along the main diagonal, so it picks out lines along the main diagonal. Another subtracts the center from its neighbours, which detects edges; another averages the surrounding pixels, which blurs the image. Looking at these results gives a feel for what convolution does to the data.
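To make the multiply-and-add concrete, here is a minimal 2-D convolution sketch in NumPy; the edge-detection kernel is a common textbook example, used here only for illustration.

```python
# Slide a kernel over the image, multiply corresponding points, and sum.
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Valid' convolution: the output is smaller than the input."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])   # compares the centre with its neighbourhood
image = np.random.rand(6, 6)
print(conv2d(image, edge_kernel).shape)  # (4, 4)
```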
The RNN unit is used to extract features from sequential data such as speech and NLP data, where neighbouring points are related: in speech recognition adjacent frames are related, and in text the words before and after each other are related. The RNN is a recurrent neural network: a neuron's output at one time step is fed back in as an input at the next time step, so at each step the network sees both the original input and the state carried over from earlier steps. Such a network can use information from the previous N time steps. RNNs, however, suffer from exploding and vanishing gradients. At every step the signal passes through the activation function and is multiplied by a factor that is either greater or smaller than 1. If the factor is less than 1, multiplying it several times pushes it toward 0, so the influence of earlier frames becomes tiny; if it is greater than 1, the product grows larger and larger. This matters for speech recognition: by frame 10 we can no longer feel the data from frame 1.
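Here is a minimal NumPy sketch of one RNN step, together with a toy calculation showing why the influence of frame 1 has nearly disappeared by frame 10 when the recurrent factor is below 1, and explodes when it is above 1; all sizes and values are illustrative.

```python
# One vanilla RNN step, plus the repeated-multiplication effect behind
# vanishing/exploding gradients. Sizes are illustrative only.
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# Run the same cell over 10 frames of 3-dimensional features.
W_xh = np.random.randn(4, 3) * 0.1
W_hh = np.random.randn(4, 4) * 0.1
b = np.zeros(4)
h = np.zeros(4)
for x_t in np.random.randn(10, 3):
    h = rnn_step(x_t, h, W_xh, W_hh, b)

# The vanishing/exploding effect in one line of arithmetic:
for factor in (0.5, 1.5):
    print(factor, "->", factor ** 10)  # 0.5 -> ~0.001 (frame 1 forgotten by frame 10)
                                       # 1.5 -> ~57.7  (the influence explodes instead)
```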
The LSTM unit goes further. LSTM stands for long short-term memory. This network structure adds two notable features to deal with the vanishing and exploding gradients just mentioned: a cell-state channel along which the gradient neither vanishes nor explodes, and gates that control how much information is let in at each time point.
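Here is a minimal NumPy sketch of an LSTM step showing the gates and the additive cell-state channel described above; the weight shapes and names are illustrative.

```python
# One LSTM step with input, forget and output gates and an additive cell state.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """W, U, b hold the parameters of the four gate blocks, stacked as 4H rows."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate: how much new information comes in
    f = sigmoid(z[H:2 * H])      # forget gate: how much of the old cell state to keep
    o = sigmoid(z[2 * H:3 * H])  # output gate
    g = np.tanh(z[3 * H:4 * H])  # candidate values
    c_t = f * c_prev + i * g     # the cell-state channel is additive, so the
                                 # gradient along it does not vanish or explode
    h_t = o * np.tanh(c_t)
    return h_t, c_t

H, D = 4, 3
W = np.random.randn(4 * H, D) * 0.1
U = np.random.randn(4 * H, H) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in np.random.randn(10, D):   # 10 frames of 3-dim features
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h)
```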
Now I will talk about the application of deep learning to the acoustic model. The most important work in speech recognition is acoustic modelling, which is mainly about determining which sound unit a segment of human speech corresponds to. The deep learning work in speech recognition mainly involves DNN, LSTM and CLDNN; these look like strings of English letters, but they are all deep learning neural networks.
First, the DNN: input one frame of data and get a classification over pronunciation units. This is the purest form: one frame in, a probability distribution over classes out, without using any other information.
The LSTM unit additionally uses data from the current time point and from previous time points to help judge which class the current frame belongs to.
A bidirectional LSTM works better than a one-way LSTM because it looks in both directions at once. Take the sentence "today let's have a meal": a one-way LSTM, when recognizing "today", can only use the data that came before it, but a bidirectional one also uses "let's have a meal". It uses both sides, gets more contextual information, and therefore works better. However, in many of our speech recognition products you can see results appearing as you speak; such a model cannot see the data that comes afterwards, so it generally recognizes in one direction only. A small code sketch of the difference follows.
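As a small illustration, in PyTorch the only difference is the `bidirectional` flag, which doubles the per-frame output because the forward and backward passes are concatenated; the sizes below are invented for the example.

```python
# One-way vs. bidirectional LSTM over a sequence of feature frames.
import torch
import torch.nn as nn

frames = torch.randn(1, 100, 40)               # 1 utterance, 100 frames of features

one_way = nn.LSTM(40, 128, batch_first=True)
two_way = nn.LSTM(40, 128, batch_first=True, bidirectional=True)

out_fwd, _ = one_way(frames)                   # (1, 100, 128): past context only
out_bi, _ = two_way(frames)                    # (1, 100, 256): past and future context
print(out_fwd.shape, out_bi.shape)
```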
CLDNN is currently one of the more mature and stable network structures, and it is also relatively easy to train: convolutional layers at the front, then several LSTM layers, with Dense layers at the back. Some companies with strong engineering resources add more layers, for example many more convolutional layers, perhaps ten or more, because with more convolutional layers the acoustic features are represented better and the final recognition result improves.
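Here is a minimal sketch of a CLDNN-style acoustic model in PyTorch, with convolutional layers at the front, LSTM layers in the middle and dense layers at the back as described above; all layer sizes are illustrative, not taken from the talk.

```python
# CLDNN-style acoustic model: Conv -> LSTM -> Dense, per-frame class scores.
import torch
import torch.nn as nn

class CLDNN(nn.Module):
    def __init__(self, feat_dim=40, num_classes=3000):
        super().__init__()
        self.conv = nn.Sequential(                 # C: convolution over time x frequency
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),      # pool along the frequency axis
        )
        self.lstm = nn.LSTM(32 * (feat_dim // 2), 256,   # L: recurrent layers over time
                            num_layers=2, batch_first=True)
        self.dense = nn.Sequential(                # DNN: classify each frame
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, feats):                      # feats: (batch, time, feat_dim)
        x = self.conv(feats.unsqueeze(1))          # (batch, 32, time, feat_dim//2)
        x = x.permute(0, 2, 1, 3).flatten(2)       # (batch, time, 32 * feat_dim//2)
        x, _ = self.lstm(x)
        return self.dense(x)                       # per-frame class scores

model = CLDNN()
scores = model(torch.randn(2, 100, 40))            # 2 utterances, 100 frames each
print(scores.shape)                                # torch.Size([2, 100, 3000])
```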
In recent years there have been some newer techniques in speech recognition, although they are not all that new; the original CTC algorithm was proposed around 2006. It is an end-to-end recognition method. Previously, speech recognition required preprocessing to align the audio with the corresponding pronunciation frame by frame. The end-to-end approach does not need that word-by-word alignment: during training we input the audio data and the target text and directly get the result.
By changing the granularity of the output unit, we can train directly from sound to the final Chinese characters, without caring about the internal details or doing manual alignment, which makes the model much easier to work with.
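As an illustration of the end-to-end training just described, here is a minimal sketch using PyTorch's CTC loss: per-frame scores from an acoustic model are trained against the target character sequence without any frame-level alignment. All sizes are illustrative.

```python
# CTC training sketch: audio frames in, character sequence as the target,
# no frame-level alignment required. Sizes are illustrative only.
import torch
import torch.nn as nn

num_classes = 3000                  # e.g. Chinese characters plus a blank at index 0
ctc = nn.CTCLoss(blank=0)

# Fake per-frame scores: (time, batch, classes), as log-probabilities.
log_probs = torch.randn(100, 2, num_classes, requires_grad=True).log_softmax(dim=-1)

# Target character sequences for the 2 utterances, plus frame and label lengths.
targets = torch.randint(1, num_classes, (2, 12))
input_lengths = torch.full((2,), 100, dtype=torch.long)   # frames per utterance
target_lengths = torch.full((2,), 12, dtype=torch.long)   # characters per utterance

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()     # gradients flow straight from the text targets to the audio frames
print(loss.item())
```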
Another end-to-end approach is the encoder-decoder with attention, which borrows the encoding/decoding model and the attention mechanism from machine translation. The first part is the listener, i.e. the encoder. The data it processes starts the same way as in the traditional pipeline: after feature extraction, the encoder network turns the features into higher-level representations, which are fed to the attender, and the decoder then turns those sounds into words and sentences.
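Here is a minimal sketch of the attention step in NumPy: at each output step the decoder state is compared with every encoded audio frame, and the softmax weights decide which part of the audio to "listen" to next. Shapes and names are invented for illustration.

```python
# Dot-product attention: weight encoder frames by similarity to the current
# decoder state and return their weighted sum (the context vector).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state: np.ndarray, encoder_frames: np.ndarray) -> np.ndarray:
    scores = encoder_frames @ decoder_state        # one score per audio frame
    weights = softmax(scores)                      # attention distribution over frames
    return weights @ encoder_frames                # context vector for the next character

encoder_frames = np.random.randn(50, 128)          # 50 encoded audio frames
decoder_state = np.random.randn(128)               # state while emitting one character
print(attend(decoder_state, encoder_frames).shape) # (128,)
```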
I believe Google mentioned in a post that this method achieves a lower error rate than the traditional algorithm, and it is a completely end-to-end method, whereas the earlier algorithms are still combined with parts of the traditional pipeline. But this algorithm also has a drawback: it cannot do real-time recognition. It cannot recognize while you are still talking, so for now real-time output is not possible.
As for the technical difficulties: if a vendor now claims 97 percent accuracy, that sounds impressive, but the accuracy every vendor quotes is achieved under quiet conditions with standard speech. With noise, or with non-standard or accented Mandarin, the recognition rate drops to around 80 percent, which is not as good as expected. As for far-field recognition: with a close microphone, simultaneous interpretation works quite well, but not in high-noise environments. Recognition of accented speech is also poor, as is speech where several voices overlap, or speech carrying strong emotion. All of these degrade the recognition result.
How do we solve these problems? Use better microphones and higher-quality microphone arrays, collect more far-field data, and add semantic understanding to assist recognition.
Our operations colleagues asked me to insert a short advertisement: these are our Tencent AI related mini programs, where you can try out many of the products.
Tencent Cloud's speech-related products currently include the following: offline speech recognition, real-time speech recognition, one-sentence recognition, simultaneous interpretation, and speech synthesis.
Offline speech recognition handles things like customer voice messages. Real-time speech recognition recognizes speech as it is spoken; if I develop an app, I can embed this capability in it. One-sentence recognition returns the result once a single sentence is finished. Simultaneous interpretation recognizes Chinese while translating it into English and displays the result on screen. And then there is speech synthesis.
Q&A
Q: Hello, I would like to ask whether Tencent Cloud's speech recognition service can achieve a response time of 20 or 30 milliseconds per call? Thank you.
A: If it runs on our own internal infrastructure, recognition itself can be done in 30 milliseconds. But if you call our cloud service, the request has to go out and the result has to come back, and the network latency means it cannot be that fast; it may take around 100 milliseconds.
Q: Can AI have a sense of smell?
A: I have thought about that before. I remember when I was in school, our teacher asked us to write an essay on something similar to what you are asking today: giving a machine a sense of smell, so that when someone walks in, it knows who is coming.