At present, mainstream speech recognition systems are roughly divided into several parts: feature extraction, the acoustic model, and the language model. The main end-to-end acoustic model training methods that combine neural networks are CTC and attention-based approaches.
This article introduces the basic concepts of the CTC algorithm, its possible application areas, and the details of how CTC is computed in combination with a neural network.
CTC algorithm concept
CTC stands for Connectionist Temporal Classification. As the name suggests, it is used to solve classification problems on sequence data.
In traditional acoustic model training for speech recognition, the label corresponding to each frame of data must be known before effective training can take place, so the speech must be aligned as a preprocessing step. The alignment process itself requires many iterations to be accurate and is therefore time-consuming.
Figure 1 shows the waveform of the utterance "nihao" ("hello"). Each red box represents one frame of data. The traditional method needs to know which phoneme the data of each frame corresponds to: for example, frames 1-4 correspond to the sound n, frames 5-7 to the sound i, frames 8-9 to the sound h, frames 10-11 to the sound a, and frame 12 to the sound o. (Consider each letter a phoneme for the moment.)
Compared with traditional acoustic model training, acoustic model training that uses CTC as the loss function is fully end to end: it does not require the data to be aligned in advance and only needs one input sequence and one output sequence. This eliminates the need for alignment and frame-level labeling, and CTC directly outputs the probabilities of the predicted sequence without external post-processing.
Since CTC is concerned only with the mapping from one input sequence to one output sequence, it only cares whether the predicted output sequence is close to (or identical to) the real sequence, not whether each element of the prediction is aligned in time with the input.
CTC introduces blank (meaning the frame has no predicted value). Each predicted class corresponds to a spike somewhere in the speech, and all other, non-spike positions are regarded as blank. For an utterance, CTC's final output is a sequence of spikes, regardless of how long each phoneme lasts. As shown in Figure 2 for the "nihao" pronunciation, the sequence predicted by CTC may be slightly delayed relative to the actual time of each sound, and the other time steps are marked as blank.

This neural network + CTC structure is not limited to acoustic model training for speech recognition; it can be used to train any mapping from an input sequence to an output sequence, with the requirement that the input sequence is longer than the output sequence. For example, OCR can also be done with an RNN + CTC model: the data of each column of an image containing text is fed to the RNN + CTC model as a sequence, and the output is the corresponding Chinese characters. Because many columns are needed to form one character, the input sequence is much longer than the output sequence. Moreover, OCR done this way does not need the exact position of the text to be detected in advance, as long as the sequence contains the text.
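To make the blank/spike idea concrete, here is a minimal Python sketch (the frame-level labels are made up for illustration, not taken from Figure 2) showing how a frame-by-frame output containing blanks is collapsed into the final phoneme sequence:

```python
# Minimal sketch of greedy (best-path) CTC decoding: take the most likely
# symbol per frame, merge consecutive repeats, then drop blanks.
# The frame labels below are made up for illustration.

BLANK = "-"  # symbol meaning "no prediction in this frame"

def collapse(frame_labels):
    """Merge consecutive duplicates, then remove blanks."""
    merged = []
    for label in frame_labels:
        if not merged or label != merged[-1]:
            merged.append(label)
    return [label for label in merged if label != BLANK]

# A possible frame-by-frame output for "nihao": spikes separated by blanks.
frames = ["-", "-", "n", "-", "-", "i", "i", "-", "h", "-", "-", "a", "-", "o", "-"]
print(collapse(frames))  # ['n', 'i', 'h', 'a', 'o']
```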
Training of RNN+CTC model
The following describes the detailed training process of an RNN + CTC model in speech recognition, and how RNN + CTC trains on sequence data without prior alignment. First of all, CTC is a loss function: it measures how far the output produced by the neural network from the input sequence is from the real output.
For example, suppose we input 200 frames of audio data whose real output is the 5 ordered phonemes of "nihao". After the neural network processes them, the output is still a sequence of length 200. Suppose two people both say "nihao". The real output for both is the five ordered phonemes n, i, h, a, o, but everyone's pronunciation is different: some people speak quickly and some speak slowly, so after the network processes the raw audio, the first person's result might be nnnniiiiii...hhhhhaaaaaooo (length 200), while the second person's might be nniiiii...hhhhhaaaaaooo (length 200). Both results are correct. One can imagine that a great many length-200 sequences correspond to the phoneme order of "nihao". CTC is the method used to compute the loss with respect to the real sequence when the predicted sequence has many such possibilities.
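In practice one rarely computes this loss by hand; most deep learning frameworks ship a CTC loss. The sketch below uses PyTorch's torch.nn.CTCLoss with random tensors standing in for the 200-frame network output described above (the 6-class inventory, blank index 0, and batch size 1 are illustrative assumptions, not details from this article):

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions): 200 frames, batch of 1, 6 classes
# (blank + the 5 phonemes n, i, h, a, o).
T, N, C = 200, 1, 6

# Random stand-in for the raw RNN output; a real acoustic model produces this.
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)       # CTCLoss expects log-probabilities

# Target phoneme sequence n i h a o encoded as class indices 1..5 (0 = blank).
targets = torch.tensor([[1, 2, 3, 4, 5]])
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([5])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                             # gradient w.r.t. the network output
print(loss.item(), logits.grad.shape)
```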
Detailed description is as follows:
The training set is $S = \{(x^1, z^1), (x^2, z^2), \dots, (x^N, z^N)\}$, containing $N$ training samples, where $x$ is the input sample and $z$ is the label corresponding to the real output. The input of a sample is a sequence, and its output label is also a sequence, with the input sequence longer than the output sequence.

For one sample $(x, z)$, $x = (x_1, x_2, \dots, x_T)$ represents a sequence of $T$ data frames, each frame being a vector of dimension $m$, i.e. $x_i \in \mathbb{R}^m$. $x_i$ can be understood as follows: a piece of speech is split into frames of 25 ms each, and $x_i$ is the result of the MFCC computation on the $i$-th frame.

$z = (z_1, z_2, \dots, z_U)$ represents the correct phoneme sequence of this sample. For example, for an utterance of "nihao" ("hello"), the MFCC computation yields the feature sequence $x$, and the corresponding phoneme sequence is $z = [n, i, h, a, o]$ (think of each pinyin letter as a phoneme for the moment).
After the feature sequence $x$ is processed by the RNN and passed through a softmax layer, the posterior phoneme probabilities $y$ are obtained.
$y^t_k$ denotes the probability of phoneme $k$ being pronounced at time $t$, where there are $n$ phoneme classes and $k$ denotes the $k$-th phoneme; the probabilities of all phonemes within one frame sum to 1. That is:

$$\sum_{k=1}^{n} y^t_k = 1, \qquad y^t_k \ge 0$$
This process can be viewed as a transformation $N_w$ of the input feature data $x$:

$$y = N_w(x)$$

where $N_w$ represents the transformation performed by the RNN (with parameters $w$).
The process is shown in the following figure:
Take a piece of "nihao" speech as an example: MFCC feature extraction produces 30 frames, each containing 12 features, i.e. $x \in \mathbb{R}^{30 \times 12}$. After the RNN and softmax, a posterior probability matrix $y$ is obtained, with one column per frame and one row per phoneme (using 14 phonemes as an example; in practice there are about 200 phonemes). The sum of each column of the matrix is 1. The subsequent CTC-loss-based training is computed from this posterior probability matrix $y$.
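The following NumPy sketch builds such a posterior probability matrix from made-up RNN scores (30 frames and 14 phoneme classes, as in the example above) and checks that every column sums to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_phonemes = 30, 14                 # 30 frames, 14 phoneme classes (illustrative)

scores = rng.normal(size=(n_phonemes, T))          # stand-in for raw RNN outputs

# Softmax over the phoneme dimension of every frame (every column).
y = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)

print(y.shape)                          # (14, 30): one column of posteriors per frame
print(np.allclose(y.sum(axis=0), 1.0))  # True: each column sums to 1
```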
Path $\pi$ and the B transform
In actual training the phoneme corresponding to each frame is not known, which makes training difficult. Consider first the simpler case where the training data have already been aligned, i.e. the frame-level label sequence $z'$ is known. In our example, $z' = [n, \dots, n, i, \dots, i, h, \dots, h, a, \dots, a, o, \dots, o]$ (length 30) contains the label of every frame. In this case we have:

$$p(z'|x) = \prod_{t=1}^{T} y^t_{z'_t}$$
This value is the product of the circled entries in the posterior probability matrix. We want this product to be as large as possible, so the objective can be written as:

$$\max_w \; p(z'|x) = \max_w \prod_{t=1}^{T} y^t_{z'_t}$$
The partial derivative of the objective function with respect to each element $y^t_k$ of the posterior probability matrix $y$ is:

$$\frac{\partial p(z'|x)}{\partial y^t_k} = \begin{cases} \displaystyle\prod_{t' \neq t} y^{t'}_{z'_{t'}}, & k = z'_t \\ 0, & \text{otherwise} \end{cases}$$
That is, at each time $t$ (corresponding to one column of the matrix), the objective only depends on $y^t_{z'_t}$, i.e. on the circled entry of that column.
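A small numeric sketch of this aligned case, using random posteriors and an assumed frame-level labeling z′ (both purely illustrative): the objective is the product of one circled entry per column, and its derivative with respect to $y^t_k$ is the product of all the other circled entries when $k = z'_t$, and zero otherwise.

```python
import numpy as np

rng = np.random.default_rng(1)
phonemes = ["n", "i", "h", "a", "o"]
T = 30

# Random posterior matrix y (5 phonemes x 30 frames), columns sum to 1.
scores = rng.normal(size=(len(phonemes), T))
y = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)

# An assumed frame-level labeling z' (6 frames per phoneme, for illustration only).
z_prime = [p for p in phonemes for _ in range(6)]          # length 30
idx = [phonemes.index(p) for p in z_prime]

# p(z'|x) = product over t of y[z'_t, t]
p_aligned = np.prod(y[idx, np.arange(T)])

# d p(z'|x) / d y[k, t] is the product of all the other circled entries
# when k == z'_t, and 0 otherwise.
t, k = 14, idx[14]
grad = p_aligned / y[k, t]
print(p_aligned, grad)
```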
$N_w$ can be regarded as the RNN model. If every frame of the training data were labeled with the correct phoneme, training would be this simple. In reality, however, such frame-labeled data is very rare, while there is plenty of data that is not labeled frame by frame; CTC makes it possible to train on this unaligned data.
First, define a few symbols:
- the phoneme alphabet: the set of all phonemes;
- a path $\pi$: a sequence of length $T$ ($T = 30$ here) in which every element is drawn from the phoneme alphabet. $\pi^1, \pi^2, \dots, \pi^6$ below denote six example paths of length 30.
Path $\pi^1$ reads as a different utterance, $\pi^2$ can be read as saying "nihao", $\pi^3$ can be read as saying "nihao", and $\pi^4$, $\pi^5$, and $\pi^6$ can all be read as saying "nihao" as well.
Define the B transform, which performs a simple compression of a path by merging consecutive repeated labels into one. For example:

$$B(a, a, a, b, b, b, c, c) = (a, b, c)$$
Take the six paths above as an example: each path that reads as "nihao" is compressed by B to $(n, i, h, a, o)$.
So, for a path $\pi$ with $B(\pi) = (n, i, h, a, o)$, we can consider that $\pi$ is saying "nihao", even if, as in one of the paths shown above, the "o" phoneme occupies many frames and the other phonemes only a few. The probability of a path $\pi$ is the product of the entries of the matrix $y$ that it passes through:

$$p(\pi|x) = \prod_{t=1}^{T} y^t_{\pi_t}$$
So, when no alignment is available, the objective should be the sum of the probabilities of all paths that compress to $z$. That is:

$$p(z|x) = \sum_{\pi \in B^{-1}(z)} p(\pi|x) \tag{4}$$

where $B^{-1}(z)$ denotes the set of all paths $\pi$ with $B(\pi) = z$.
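To make equation (4) concrete, here is a brute-force sketch for a toy case: with a small T and random posteriors every path can be enumerated, so p(z|x) can be computed directly from the definition. This only illustrates the definition; it is not how CTC is actually computed.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
phonemes = ["n", "i", "h", "a", "o"]
T = 8                                  # toy length so enumeration is feasible
z = ["n", "i", "h", "a", "o"]

# Random posterior matrix (5 x T), columns sum to 1.
scores = rng.normal(size=(len(phonemes), T))
y = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)

def B(path):
    """The article's B transform: merge consecutive repeated labels."""
    out = []
    for label in path:
        if not out or label != out[-1]:
            out.append(label)
    return out

# Equation (4): sum of the probabilities of every path that compresses to z.
p_z = 0.0
n_paths = 0
for path in itertools.product(phonemes, repeat=T):
    if B(list(path)) == z:
        n_paths += 1
        p_z += np.prod([y[phonemes.index(p), t] for t, p in enumerate(path)])

print(n_paths)        # C(T-1, 4) = C(7, 4) = 35 paths for T = 8
print(p_z)
```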
With $T = 30$ and the phoneme sequence $(n, i, h, a, o)$, there are in total $C_{29}^{4} \approx 2.4 \times 10^4$ paths that can be compressed to $(n, i, h, a, o)$. In general the number of such paths is $C_{T-1}^{U-1}$, where $U$ is the length of the label sequence, so it grows very quickly with $T$; for a 30-second utterance containing 50 characters the number of possible paths can reach $10^8$ or more.
Obviously, such a large number of paths cannot be enumerated directly. The CTC method therefore borrows the forward-backward algorithm from HMMs to do the computation.
Training implementation method
The CTC training process adjusts the value of $w$ to maximize the objective in (4). The gradient is computed as follows:

$$\frac{\partial p(z|x)}{\partial w} = \sum_{t}\sum_{k} \frac{\partial p(z|x)}{\partial y^t_k} \cdot \frac{\partial y^t_k}{\partial w}$$
So as long as $\frac{\partial p(z|x)}{\partial y^t_k}$ can be obtained, the rest follows from ordinary backpropagation (the term $\frac{\partial y^t_k}{\partial w}$ is just the usual RNN gradient). The "nihao" example is used below to describe how to compute this value.
First, based on the previous example, find all paths that can be compressed to $z = [n, i, h, a, o]$; denote this set by $B^{-1}(z)$. Every such path $\pi$ only passes through the phonemes n, i, h, a, o, so the objective function only involves the rows of the posterior probability matrix $y$ corresponding to n, i, h, a, o.
For simplicity, we extract these 5 rows, as shown in the figure below.
At each point, a path can only move to the right or downward. Draw two paths, denoted $q$ and $r$, which both pass through $y^{14}_h$, meaning that both paths pronounce "h" at frame 14. Among the summands of objective function (4), some terms have nothing to do with $y^{14}_h$ and can be set aside, leaving only the part related to $y^{14}_h$, denoted

$$\sum_{\substack{\pi \in B^{-1}(z) \\ \pi_{14} = h}} p(\pi|x)$$
Here $q$ and $r$ are two of the paths related to $y^{14}_h$. Let $q_{1:13}$ and $q_{15:30}$ denote the parts of $q$ before and after $y^{14}_h$, and likewise let $r_{1:13}$ and $r_{15:30}$ denote the parts of $r$ before and after $y^{14}_h$. It can be seen that $q_{1:13} + y^{14}_h + r_{15:30}$ and $r_{1:13} + y^{14}_h + q_{15:30}$ are also two valid paths. The sum of the probabilities of these four paths is:

$$p(q) + p(r) + p(q_{1:13} + y^{14}_h + r_{15:30}) + p(r_{1:13} + y^{14}_h + q_{15:30}) = \big(p(q_{1:13}) + p(r_{1:13})\big) \cdot y^{14}_h \cdot \big(p(q_{15:30}) + p(r_{15:30})\big)$$
It can be seen that this sum factors as (leading term) $\times\; y^{14}_h \;\times$ (trailing term). Thus, for all paths that pass through $y^{14}_h$:

$$\sum_{\substack{\pi \in B^{-1}(z) \\ \pi_{14} = h}} p(\pi|x) = \Big(\sum_{\text{prefixes } \pi_{1:13}} p(\pi_{1:13}|x)\Big)\; y^{14}_h\; \Big(\sum_{\text{suffixes } \pi_{15:30}} p(\pi_{15:30}|x)\Big)$$

Define the forward probability $\alpha_{14}(h) = \Big(\sum_{\text{prefixes } \pi_{1:13}} p(\pi_{1:13}|x)\Big)\, y^{14}_h$. This value can be interpreted as the sum of the probabilities of all forward paths from the start of the utterance up to and including $y^{14}_h$.
The recursion can then be written as $\alpha_{14}(h) = \big(\alpha_{13}(h) + \alpha_{13}(i)\big)\, y^{14}_h$. The meaning of this recursive formula is that a path can only be pronouncing "h" or "i" at $t = 13$ if it is to pronounce "h" at $t = 14$. So the sum of all forward path probabilities $\alpha_{14}(h)$ for pronouncing "h" at $t = 14$ equals the forward probability $\alpha_{13}(h)$ of pronouncing "h" at $t = 13$ plus the forward probability $\alpha_{13}(i)$ of pronouncing "i", multiplied by the probability $y^{14}_h$ that the current frame is judged to be "h". Accordingly, each $\alpha_t(s)$ can be obtained from the two values $\alpha_{t-1}(s)$ and $\alpha_{t-1}(s-1)$:

$$\alpha_t(s) = \big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\big)\, y^t_{z_s}$$

The recursive computation of $\alpha$ is shown in the figure below:
That is, each $\alpha$ value is derived from one or two values at the previous time step, so computing all of the $\alpha_t(s)$ takes roughly $2 \cdot T \cdot (\text{number of phonemes})$ operations. The backward probability $\beta_t(s)$ is defined symmetrically as the sum of the probabilities of all backward paths from $y^t_{z_s}$ to the end of the utterance, and is computed by an analogous recursion running backwards from $t = T$.
Once $\alpha$ and $\beta$ are available, $\frac{\partial p(z|x)}{\partial y^t_k}$ can be computed and training proceeds via backpropagation. The total amount of computation is very small: computing $\alpha$ and $\beta$ takes about $2 \cdot T \cdot (\text{number of phonemes})$ operations each (one addition and one multiplication per value), and, given $\alpha$ and $\beta$, computing the partial derivative with respect to each $y^t_k$ takes about $3 \cdot T \cdot (\text{number of phonemes})$ operations. The total is therefore roughly $7 \cdot T \cdot (\text{number of phonemes})$ operations, which is very cheap and easy to compute.
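Putting the recursion together, below is a minimal NumPy sketch of the forward (α) pass for this simplified, blank-free setting; with the same toy posteriors as the brute-force sketch above it reproduces the same p(z|x). A practical implementation would also handle blanks and work in log space.

```python
import numpy as np

def forward_prob(y, label_idx):
    """alpha recursion for the simplified no-blank case.
    y: (n_phonemes, T) posterior matrix; label_idx: indices of z in order.
    alpha[t, s] = sum of probabilities of prefixes of length t+1 that have
    emitted exactly the first s+1 labels of z (each includes y at step t)."""
    T = y.shape[1]
    U = len(label_idx)
    alpha = np.zeros((T, U))
    alpha[0, 0] = y[label_idx[0], 0]          # must start on the first label
    for t in range(1, T):
        for s in range(U):
            stay = alpha[t - 1, s]                           # keep emitting the same label
            move = alpha[t - 1, s - 1] if s > 0 else 0.0     # advance to the next label
            alpha[t, s] = (stay + move) * y[label_idx[s], t]
    return alpha[T - 1, U - 1]                # all labels emitted by the last frame

# Same toy setup as the brute-force example above.
rng = np.random.default_rng(2)
phonemes = ["n", "i", "h", "a", "o"]
T = 8
scores = rng.normal(size=(len(phonemes), T))
y = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
label_idx = [phonemes.index(p) for p in ["n", "i", "h", "a", "o"]]

print(forward_prob(y, label_idx))   # matches the brute-force p_z above
```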
At present, deep learning algorithms are widely used in Tencent Cloud's speech recognition products. Tencent Cloud has industry-leading speech recognition technology: based on massive speech data, it has accumulated hundreds of thousands of hours of labeled speech, adopts LSTM, CNN, LFMMI, CTC and other modeling techniques, and combines them with a language model built on an extremely large corpus; the recognition accuracy for standard Mandarin exceeds 97%. Tencent Cloud's speech technology covers a wide range of applications, with strong capabilities in speech recognition, speech synthesis, keyword spotting, silence detection, speaking-rate detection, emotion recognition and more. It also provides customized speech recognition solutions for dozens of vertical domains such as games, entertainment and government services, making recognition more accurate and efficient and fully meeting application scenarios such as call-center quality inspection, voice dictation, real-time speech recognition and live subtitling. Want to try it out? Visit: cloud.tencent.com/product/asr