This article is published by Tencent Education Cloud in the Cloud + Community column.
In general, without the Internet, speech recognition simply stops working.
Embedded speech recognition, on the other hand, keeps working offline.
Not only can it recognize while you speak and turn your words into polished text, it also has no trouble with personalized names.
That's the beauty of embedded speech recognition.
This article introduces the technology choices made in the implementation and optimization of the WeChat Smart Leaf embedded speech recognition engine.
01
How speech recognition came about
Speech recognition allows machines to "understand" human speech and turn what is said into the corresponding text.
It started in the 1950s,
growing from early small-vocabulary, isolated-word recognition systems
into today's large-vocabulary continuous recognition systems.
As speech recognition systems have developed, their performance has improved significantly, mainly thanks to the following:
The arrival of big data era
Application of deep neural network in speech recognition
GPU hardware evolution
As a result, speech recognition is gradually becoming practical and productized:
voice input methods, intelligent voice assistants, in-car voice interaction systems…
It can be said that speech recognition stands at the frontier of artificial intelligence and is a cornerstone of machine translation, natural language understanding, human-computer interaction, and more.
However, these performance gains rely on the high computing power and large memory of server CPUs/GPUs, so the convenience of speech recognition cannot be enjoyed when there is no network.
To solve this problem, WeChat Smart Leaf developed embedded speech recognition. Embedded speech recognition, also known as embedded LVCSR (or offline LVCSR, Large Vocabulary Continuous Speech Recognition), refers to speech recognition that runs entirely on the phone without relying on the computing power of a server.
In scenarios where the network is unstable (in-vehicle, overseas, and so on), embedded speech recognition can come to the rescue.
So, what are the difficulties in implementing embedded speech recognition?
The basic flow of speech recognition
Mainstream speech recognition algorithms include an acoustic model and a language model. Acoustic models have benefited from the development of deep learning over the last decade, moving from GMM (Gaussian mixture models) to DNN (deep neural networks), and on to LSTM-RNN (recurrent neural networks); the recognition rate keeps rising, but so does the amount of computation. For the n-gram algorithm commonly used in language models, higher performance means larger models, with commonly used models reaching tens of gigabytes of memory.
Therefore, embedded speech recognition faces the following difficulties:
1. Deep learning involves heavy computation, and the model can only fit after aggressive pruning, which costs a lot of accuracy, so we need to find ways to recover the lost performance;
2. Pruning the model is unavoidable, so how do we keep the training of a small model from falling into a local optimum?
3. How do we compute faster to suit an embedded CPU environment?
4. How do we organize the language model storage so that more linguistic information fits into limited memory?
Starting from the technical principles of speech recognition, this article discusses the implementation techniques behind the WeChat Smart Leaf embedded engine.
The content is divided into four parts:
1. Review the basic concepts of speech recognition;
2. Briefly introduce our work on speed and memory optimization, focusing on the engineering implementation;
3. Describe what we did for better performance, focusing on the algorithm research;
4. Present experiments and comparisons, and conclude.
02
The components of speech recognition
Speech recognition ‘black box’
Speech recognition takes a recording as input and outputs text; inside the black box it passes through feature extraction, the acoustic model, the pronunciation dictionary, the language model, and other stages. The author likes to compare speech recognition to a computer.
Feature extraction is like the router: it leads the way, supplying a continuous stream of data to the stages that follow.
The acoustic model is the CPU, the heart of speech recognition, and it directly determines recognition accuracy.
The language model is the hard disk of speech recognition, where a great deal of word-combination information is stored.
The pronunciation dictionary is the memory stick, efficiently organizing the relationship between the acoustic model and the language model.
In addition, speech recognition includes a decoder, which acts like the computer's operating system and organizes the whole process efficiently.
Next, we introduce the basic concepts of each of these "components", so that we can later describe how embedded ASR works on each of them.
1. Feature extraction
Feature extraction for speech recognition includes a series of steps such as pre-emphasis, framing, windowing, and FFT (Fast Fourier Transform). Common features include PLP, MFCC, FBANK, and so on. Generally speaking, speech recognition splits the audio into 100 frames per second (with overlap), and feature extraction converts each frame of speech data into a vector (39-dimensional MFCC features are common).
In order to capture contextual information, several frames are usually spliced together as the input to the acoustic model. Taking 39-dimensional features as an example, if 5 frames are taken on each side there are 11 frames in total, and the input vector dimension is 11*39 = 429. Generally speaking, recognition performance is positively correlated with the width of this context window.
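As an illustration of the splicing step (a minimal sketch with illustrative names, not the engine's actual code), the context expansion could look like this:

#include <algorithm>
#include <vector>

// Splice +/- `context` frames of per-frame features into one input vector for
// the acoustic model. feats[t] is one 39-dim MFCC vector; with context = 5 the
// result for frame t has (2*5+1)*39 = 429 dimensions.
std::vector<float> splice_frame(const std::vector<std::vector<float>>& feats,
                                int t, int context) {
    std::vector<float> input;
    const int dim = static_cast<int>(feats[0].size());
    input.reserve((2 * context + 1) * dim);
    for (int k = -context; k <= context; ++k) {
        // Clamp at the utterance boundaries by repeating the edge frames.
        int idx = std::min(std::max(t + k, 0), static_cast<int>(feats.size()) - 1);
        input.insert(input.end(), feats[idx].begin(), feats[idx].end());
    }
    return input;
}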
As the router of speech recognition, feature extraction itself does not involve much computation. However, because it determines the input size of the acoustic model's topology, it indirectly affects the amount of deep learning computation, which is something embedded ASR has to take into account.
2. Acoustic model
The importance of the acoustic model, the CPU of speech recognition, is self-evident.
Generally speaking, it accounts for most of the computational overhead of speech recognition and directly affects the performance of the whole system. Traditional speech recognition systems are generally based on the GMM-HMM acoustic model, in which the GMM models the distribution of the acoustic features and the HMM models the temporal structure of the speech signal.
Since the rise of Deep learning in 2006, Deep Neural Networks (DNN) have been applied to acoustic models.
Over the past ten years, deep learning for acoustic models has advanced rapidly, with topologies such as CNN, RNN, and TDNN emerging one after another. See the article for more introductions to deep learning in acoustic models.
For embedded LVCSR, the core requirement on the acoustic model is to choose a suitable DNN topology and to implement its computation on the phone with proper optimization.
3. Language model
Language models are relatively familiar to NLP practitioners. In speech recognition, the language model is used to evaluate the probability that a sentence (the word sequence shown in Figure 2) occurs.
Among language model algorithms, the most common is the n-gram model (N-Gram Models), which computes the probability of the current word from the few words immediately before it (n−1 of them for an n-gram); it is a context-dependent model. In recent years, neural language models, which use word embeddings for prediction, have also been widely developed and applied.
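As a quick illustration of the n-gram idea (the textbook formulation, not anything specific to this engine), a trigram model approximates a sentence probability from smoothed corpus counts:

P(w_1,\dots,w_T) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-2}, w_{i-1}),
\qquad
P(w_i \mid w_{i-2}, w_{i-1}) \approx \frac{\mathrm{count}(w_{i-2}\,w_{i-1}\,w_i)}{\mathrm{count}(w_{i-2}\,w_{i-1})},

with smoothing and backoff applied in practice when counts are sparse.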
In embedded ASR, since computing resources are reserved for the acoustic model, the language model still follows the n-gram approach. The problem embedded ASR has to solve is therefore how to fit as much of the language model as possible into limited memory.
4. Pronunciation dictionary
The pronunciation dictionary is the memory stick of speech recognition. Memory reads data from the hard disk and feeds it to the CPU for computation; similarly, the pronunciation dictionary converts the word sequences from the language model into phoneme sequences, which the acoustic model then scores.
The pronunciation dictionary is thus a bridge between the acoustic model and the language model, and its size directly affects the size of both.
In embedded ASR, the size of the pronunciation dictionary moves in step with the size of the language model, so the problem it poses can be handled together with the language model.
5. The decoder
"Decoder" is roughly a literal translation from English; the author thinks a more fitting name would be "recognizer". There is another, more visual reason it is called a decoder: taking 16-bit audio as an example, the computer stores a pile of short integers that we cannot read, like a cipher, and speech recognition cracks that cipher, presenting the content to us as plain text.
So, generally speaking, the decoder is the engineering code that strings the whole speech recognition process together. Cloud systems usually adopt a static decoder combined with WFST (weighted finite-state transducers), which makes it convenient to handle every aspect of speech recognition in a unified way. To save the memory occupied by the embedded language model, we use a purpose-built dynamic decoder instead.
03
Tuning these components: speed and memory optimization
To optimize the time and memory footprint of these "components", we did a number of things:
NEON computation optimization, singular value decomposition, and Huffman coding.
1. NEON optimization of acoustic model computation
NEON optimization is a familiar topic for engineers, especially those working on machine learning. In the embedded ASR engine, we optimized the core high-frequency functions with NEON and rewrote them in assembly, increasing computation speed by 25%.
Next, we use the multiplication (dot product) of two char-type vectors as an example and walk through the optimization in three versions:
A. Plain version before optimization
B. NEON C version
C. NEON assembly version
First, the function we will implement is:
/**
 * Multiply (dot-product) two char-type vectors.
 *   start_a: vector A
 *   start_b: vector B
 *   cnt:     number of vector elements
 *   result:  output variable for the accumulated product
 */
void vector_product_neon(const char * start_a, const char * start_b, int & result,
const int cnt);
A. Plain version before optimization
void vector_product_neon(const char * start_a, const char * start_b, int & result,
const int cnt) {
int res = 0;
for(int j = 0; j < cnt; j++) {
res += int(*start_a) * int(*start_b);
start_a++;
start_b++;
}
result = res;
}
B. NEON C version
A NEON register can operate on 128 bits in parallel. For char vector multiplication, each product fits in a short, so the elements can be processed in groups of 8; the code below handles two groups of 8 per loop iteration. In our deep learning computation the hidden-layer vector length is guaranteed to be a multiple of 16. The implementation is as follows:
#include <arm_neon.h>

void vector_product_neon(const char * start_a, const char * start_b, int & result,
                         const int cnt) {
    int res = 0;
    int32x4_t neon_sum = vdupq_n_s32(0);
    int8x8_t neon_vector1;
    int8x8_t neon_vector2;
    // Each iteration consumes 16 elements: two groups of 8.
    for(int j = 0; j < cnt / 16; j++) {
        neon_vector1 = vld1_s8((const int8_t *)start_a);
        neon_vector2 = vld1_s8((const int8_t *)start_b);
        int16x8_t neon_tmp = vmull_s8(neon_vector1, neon_vector2);
        start_a += 8;
        start_b += 8;
        neon_vector1 = vld1_s8((const int8_t *)start_a);
        neon_vector2 = vld1_s8((const int8_t *)start_b);
        neon_tmp = vmlal_s8(neon_tmp, neon_vector1, neon_vector2);
        neon_sum = vaddw_s16(neon_sum, vget_low_s16(neon_tmp));
        neon_sum = vaddw_s16(neon_sum, vget_high_s16(neon_tmp));
        start_a += 8;
        start_b += 8;
    }
    // The lane index of vgetq_lane_s32 must be a compile-time constant, so unroll.
    res += vgetq_lane_s32(neon_sum, 0);
    res += vgetq_lane_s32(neon_sum, 1);
    res += vgetq_lane_s32(neon_sum, 2);
    res += vgetq_lane_s32(neon_sum, 3);
    result = res;
}
C. NEON assembly version
The assembly version of the NEON code is more expensive to write and maintain, but runs faster than the C version. In pursuit of every last bit of performance, we also implemented it in assembly:
void vector_product_neon(const char * start_a, const char * start_b, int & result,
const int cnt) {
int res = 0;
asm volatile(
"vmov.s32 q2, #0" "\n\t"
"lsr %[cnt], %[cnt], #4" "\n\t"
".charloop:"
"vld1.s8 {d0}, [%[vec1]]!" "\n\t"
"vld1.s8 {d1}, [%[vec2]]!" "\n\t"
"vmull.s8 q1, d0, d1" "\n\t"
"vld1.s8 {d0}, [%[vec1]]!" "\n\t"
"vld1.s8 {d1}, [%[vec2]]!" "\n\t"
"vmlal.s8 q1, d0, d1" "\n\t"
"vaddw.s16 q2, q2, d2" "\n\t"
"vaddw.s16 q2, q2, d3" "\n\t"
"subs %[cnt], %[cnt], #1" "\n\t"
"bne .charloop" "\n\t"
"vadd.s32 d4, d4, d5" "\n\t"
"vmov.s32 r4, d4[0]" "\n\t"
"add %[sum], r4" "\n\t"
"vmov.s32 r4, d4[1]" "\n\t"
"add %[sum], r4" "\n\t"
: [sum]"+r"(res)
: [vec1]"r"(start_a),
[vec2]"r"(start_b),
[cnt]"r"(cnt)
: "r4"."cc"."memory"
);
result = res;
}
2. Singular value decomposition optimization of acoustic model computation
In order to reduce the number of multiply-add operations, we decided to use singular value decomposition (SVD) to reconstruct the DNN: by discarding the smallest singular values and their corresponding singular vectors, we reduce the number of multiply-adds. Singular value decomposition factorizes any matrix W_{m×n} (without loss of generality, assume m ≤ n) into the product of three matrices: W_{m×n} = U_{m×m} Σ_{m×m} V_{m×n}.
Σ_{m×m} is a diagonal matrix, Σ_{m×m} = diag(σ1, σ2, …, σm), whose diagonal elements are the singular values of W_{m×n}; U_{m×m} is an orthogonal matrix whose columns are the corresponding left singular vectors; the rows of V_{m×n} are mutually orthogonal and are the corresponding right singular vectors.
The figure below illustrates the model transformation in the DNN reconstruction, taking one layer of the DNN model as an example. The original DNN layer is shown in subfigure (a), and the reconstructed two-layer structure is shown in subfigure (b):
(a) One layer of the original DNN model
(b) The corresponding two-layer structure of the new DNN model
The optimization of acoustic model computation by SVD can be roughly divided into three steps
(1) Training the initial DNN neural network;
(2) Singular value decomposition of weight matrix;
(3) Retrain the reconstructed DNN model.
Through this SVD-based model compression, we reduced the acoustic model computation by 30% with only a slight loss in model performance.
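To make the saving concrete (the numbers below are purely illustrative, not the engine's actual configuration): keeping only the k largest singular values turns one m×n layer into two thinner layers,

W_{m \times n} \;\approx\; \left( U_{m \times k}\,\Sigma_{k \times k} \right) V_{k \times n},

so the per-layer multiply-add count drops from m·n to k·(m+n). For example, m = n = 1024 with k = 256 cuts roughly 1.05M multiply-adds down to about 0.52M, i.e. about half for that layer.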
3. Huffman coding to optimize language model memory
In general, an n-gram language model can be stored as a directed graph, with the word information stored on the edges, which saves storage space and allows fast queries. As we know, taking Chinese as an example, different words occur with very different frequencies; if the label IDs of all words are stored as plain int values, space utilization is rather poor.
Taking "I", "want", and "eat" as an example, and assuming the language model word frequencies satisfy I > want > eat, we can construct the Huffman tree in Figure 3; the resulting codes are I (0), want (10), and eat (110).
Binary Huffman tree
16-ary Huffman tree
However, the binary tree data structure in Figure 4 can only process 1 bit at a time, which is inefficient and inconvenient for an engineering implementation. In the actual implementation we therefore group the codes into 4-bit units when storing the vocabulary.
That is, we use a 16-ary Huffman tree, where each layer has 16 times as many nodes as the layer above. All child nodes numbered 0 in the tree are used to store words, and the higher a word's frequency, the shallower the depth at which it is stored.
Through Huffman optimization, we managed to reduce the memory footprint of the engine by 25% and shrink the engine's resource files by about 50%.
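As a sketch of how such a code can be built (a minimal binary version in the spirit of Figure 3; the engine uses the 16-ary variant described above, and all names here are illustrative):

#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

// Build binary Huffman codes for word IDs from word frequencies,
// so that frequent words receive shorter codes.
struct Node {
    long long freq;
    int word;        // vocabulary index for a leaf, -1 for an internal node
    Node* left;
    Node* right;
};

struct Cmp {
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

static void collect(const Node* n, const std::string& prefix,
                    std::unordered_map<int, std::string>& codes) {
    if (n->word >= 0) { codes[n->word] = prefix.empty() ? "0" : prefix; return; }
    collect(n->left,  prefix + "0", codes);
    collect(n->right, prefix + "1", codes);
}

std::unordered_map<int, std::string> build_huffman(const std::vector<long long>& freqs) {
    std::unordered_map<int, std::string> codes;
    std::priority_queue<Node*, std::vector<Node*>, Cmp> pq;   // min-heap on frequency
    for (int i = 0; i < static_cast<int>(freqs.size()); ++i)
        pq.push(new Node{freqs[i], i, nullptr, nullptr});
    if (pq.empty()) return codes;
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, -1, a, b});       // merge the two rarest
    }
    collect(pq.top(), "", codes);
    return codes;    // (tree nodes are leaked here for brevity)
}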
04
Optimization of recognition performance
1. Optimization of acoustic model based on TDNN
In recent years, the TDNN (Time-Delay Neural Network) [5] topology has been applied to speech recognition. The structure was in fact proposed back in 1989 and has been revived in recent years as the surrounding technology matured.
DNN structure
The DNN topology models only a single time point of the input features.
TDNN structure
The hidden layers of TDNN model speech features over multiple time points at once, giving them stronger modeling power. Moreover, the parameters that model the multiple time points are shared (in the figure, the red, green, and purple connections are propagated with the same weight matrix).
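As a sketch of this parameter sharing (the notation below is assumed, not taken from the original figure): one TDNN hidden layer applies the same transform at every time step over a window of lower-layer outputs,

h^{(l)}_t = f\!\big( W^{(l)} \,[\, h^{(l-1)}_{t-k};\ \dots;\ h^{(l-1)}_{t+k} \,] + b^{(l)} \big),
\qquad \text{with the same } W^{(l)}, b^{(l)} \text{ for every } t,

so each frame's activation h^{(l-1)}_t is computed once and reused by every window that covers it.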
As a result, TDNN requires more back-propagation operations than DNN during training. At recognition time, however, thanks to parameter sharing the hidden-layer results can be reused, and the parameters only need to be applied once per frame, which saves a great deal of computation. In the end, based on the TDNN structure, the engine's recognition accuracy improved by a relative 20% while the amount of computation stayed the same.
2. Performance optimization based on multi-task training
Multi-task joint training can effectively improve the robustness of acoustic training and keep it from falling into a local optimum. An embedded model has few output targets, and its training easily gets stuck in a local optimum; we therefore train it jointly with a large model with many targets, so that the shared hidden-layer structure becomes more robust.
Acoustic model multi-task training
During training, the network produces output 1 and output 2 at the same time. In multi-task training, the backward pass has to coordinate the residuals (error signals) of the two outputs. We allocate the residual with the following formula, where λ weighs the training weights of the two models:
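A common form consistent with this description (stated here as an assumption, not necessarily the exact formula used) is a weighted combination of the two output residuals propagated into the shared layers:

\delta_{\text{shared}} = \lambda\,\delta_{\text{output 1}} + (1-\lambda)\,\delta_{\text{output 2}}.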
Finally, multi-task training brought a further improvement in the speech recognition rate. All of the performance improvements are presented together in the next chapter.
3. Performance optimization based on Discriminative Training
Discriminative training (DT) of the acoustic model was proposed to address the shortcomings of MLE training. DT usually defines an Objective Function (or Criterion Function) that approximates a metric related to the classification cost. Through discriminative training we can, to some extent, weaken the influence of errors in the model's assumptions.
At the same time, since discriminative training directly optimizes a measure related to recognition accuracy, it offers a more direct way to improve recognizer performance. Figuratively speaking, MLE training tells the model "this is a chair and that is a table", whereas discriminative training tells it "this is a table, not a chair, and that is a chair, not a table". MLE training focuses on adjusting the model parameters to reflect the probability distribution of the training data, while discriminative training focuses on adjusting the classification boundaries between models so that the training data are classified better according to the chosen criterion.
The objective function of DT looks like this:
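The original image of the formula is not available; for the widely used MMI criterion, a standard form (given here as an assumption) is the sum of the log posteriors of the reference texts:

F_{\mathrm{DT}}(\theta) = \sum_{u} \log P_{\theta}\big(W_u \mid O_u\big),

where O_u is the audio of utterance u and W_u its reference transcription.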
Applying Bayes' rule once, the objective function can be rewritten as:
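Again in the standard MMI form (an assumption in place of the original image):

F_{\mathrm{DT}}(\theta) = \sum_{u} \log
\frac{ p_{\theta}(O_u \mid W_u)\,P(W_u) }
     { \sum_{W} p_{\theta}(O_u \mid W)\,P(W) }.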
The numerator is the ML objective; the denominator is the sum (weighted by the language model) of the probabilities of producing the training speech from all possible texts, including the training text and all of its competitors. Since enumerating every possible text in the denominator is unrealistic, in practice an existing ML-trained speech system is used to decode the training speech once, yielding an n-best list or lattice, and the texts in it are used to approximate the sum in the denominator; the n-best list or lattice contains competitors close enough to the training text.
4. New word discovery based on mutual information
For a speech recognition system, the language model matters a great deal to the result, and for the language model, its dictionary is the key. A good segmentation dictionary is crucial for obtaining a robust language model. To build a dictionary made up of reasonable and correct "words", the first and most crucial step is to mine new words from the existing corpus.
Because the resources of an embedded system are limited, choosing a word list of suitable size and pruning the language model reduce the installation package size, limit memory consumption, and improve recognition performance. Compressing the word list means keeping the high-frequency words while using some model to identify and filter out truncated fragments of frequent longer words, such as "xin Gong", "Jia Nian", "Gan Sheng", "oval", "Liu De", and "liya".
A simple and effective scheme for discovering and screening new words uses mutual information and left/right information entropy. The score of a two-fragment candidate consists of three corresponding parts: 1) the pointwise mutual information of the two fragments: the higher it is, the stronger the internal cohesion; 2) the minimum of the information entropies h_R_L and h_L_R of the two fragments: the larger it is, the smaller the possibility that the two fragments are simply two words appearing side by side; 3) the minimum of the candidate's left and right neighbor entropies: the larger it is, the more varied the contexts in which the candidate appears and the more likely it is to be a word. The higher the total score, therefore, the more likely the candidate is a real word.
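A simplified sketch of such scoring (pointwise mutual information plus the minimum of the left/right neighbor entropies; the second term described above is omitted, and the names and equal weighting here are illustrative assumptions):

#include <algorithm>
#include <cmath>
#include <map>
#include <string>

// Counts for a two-fragment candidate "ab", assumed to be collected from a raw corpus.
struct CandidateStats {
    long long pair_count;                               // count of the candidate "ab"
    long long left_count;                               // count of fragment "a"
    long long right_count;                              // count of fragment "b"
    std::map<std::string, long long> left_neighbors;    // tokens appearing before "ab"
    std::map<std::string, long long> right_neighbors;   // tokens appearing after "ab"
};

static double entropy(const std::map<std::string, long long>& neighbors) {
    long long total = 0;
    for (const auto& kv : neighbors) total += kv.second;
    if (total == 0) return 0.0;
    double h = 0.0;
    for (const auto& kv : neighbors) {
        double p = static_cast<double>(kv.second) / total;
        h -= p * std::log(p);
    }
    return h;
}

// Higher score -> "ab" is more likely to be a real word.
double new_word_score(const CandidateStats& s, long long corpus_tokens) {
    double p_ab = static_cast<double>(s.pair_count)  / corpus_tokens;
    double p_a  = static_cast<double>(s.left_count)  / corpus_tokens;
    double p_b  = static_cast<double>(s.right_count) / corpus_tokens;
    double pmi  = std::log(p_ab / (p_a * p_b));                    // internal cohesion
    double h_lr = std::min(entropy(s.left_neighbors),              // boundary freedom
                           entropy(s.right_neighbors));
    return pmi + h_lr;
}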
After the two-fragment information entropy scores are computed, ternary and quaternary scores can be computed in turn: for ternary new-word discovery, an existing binary candidate is treated as a single unit in place of the original two fragments, and binary candidates whose left or right information entropy is 0 can be taken as the candidate set, and so on. In addition, the language model directly determines the recognition output, so it is especially important to collect statistics from corpora that match the application scenario.
05
Experimental comparison
The two preceding chapters introduced some of the work we have done, and this chapter is divided into two parts. First, we verify the results of that work through experimental comparison; second, we compare the engine with competing products in the industry.
Validation of work results
At present we use six general-purpose test sets, containing 1220, 6917, 4069, 2977, 2946, and 2500 utterances respectively. Test set 1 consists of mobile phone recordings, set 2 of command recordings, and set 3 of microphone recordings covering general daily scenarios; sets 4, 5, and 6 are real online traffic, the difference being that sets 4 and 5 have clean backgrounds while set 6 has noisy backgrounds.
Test set | DNN | TDNN | TDNN optimized version
1 | 10.4 | 8 | 6.9
2 | 13.7 | 11.3 | 9.3
3 | 22.9 | 18.3 | 15.6
4 | 15.8 | 13.3 | 12
5 | 15.3 | 12.2 | 10.5
6 | 22.6 | 20.3 | 17.8
For model selection and comparison, we built three versions of the embedded speech recognition engine: DNN, TDNN, and TDNN-optimized (the optimizations are those summarized in sections 2, 3, and 4 of the previous chapter).
The results of the three versions on the six general test sets are shown in the table above. The numbers are word (character) error rates, i.e., the number of misrecognized characters per 100 characters. Overall, TDNN brought a relative improvement of about 20% in recognition accuracy, and the other efforts added roughly another 10%.
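For reference, this error rate follows the standard definition (given here as the usual convention, with S substitutions, D deletions, and I insertions against N reference characters):

\mathrm{CER} = \frac{S + D + I}{N} \times 100\%.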
From the basic concepts of speech recognition, through speed and memory optimization, to the algorithm research we have accumulated and the experimental verification of the results, this article has outlined the path of speech recognition from principle to practice. If you also work on speech recognition and AI, you are welcome to join us.