Abstract: This article starts from the paper Shallow-Fusion End-to-End Contextual Biasing to explore end-to-end ASR for proprietary (domain-specific) applications.

This article is shared by Xiaoye0829 from the Huawei Cloud community post "How to Solve Contextual Bias? The Road to End-to-End ASR for Proprietary Domains (1)".

Being able to bias recognition toward the current domain is an important feature for production-level Automatic Speech Recognition (ASR). For example, for ASR on a phone, the system should accurately recognize app names, contact names, and so on, rather than other words with the same pronunciation. To be more specific, "Yao Ming" may refer to the well-known basketball player in a sports context, but on a mobile phone it may refer to a friend named "Yao Min" in our address book. How to handle this bias as the application domain changes is the main question we explore in this series of articles.

Traditional ASR systems tend to have a separate acoustic model (AM), pronunciation model (PM), and language model (LM), and the LM can be adapted on its own when domain-specific biasing is required. In an end-to-end model, however, AM, PM, and LM are folded into a single neural network, which makes contextual biasing challenging for several reasons:

1. The end-to-end model only sees the limited text paired with its training audio, whereas the LM in a traditional ASR system can be trained on large amounts of text. As a result, the end-to-end model is more prone than traditional models to errors on rare, context-dependent words and phrases, such as noun phrases.

2. For decoding efficiency, the end-to-end model usually keeps only a small number of candidates (typically 4 to 10) at each step of beam-search decoding. Rare word sequences, such as context-dependent n-grams, are therefore very likely to fall out of the beam (see the sketch below).
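To make the pruning issue concrete, here is a minimal, self-contained sketch of one beam-search step over subword units (purely illustrative, not the paper's decoder): only the top-k hypotheses survive each step, so a rare name whose first subword scores poorly is dropped before later evidence could rescue it.

```python
from typing import Dict, List, Tuple

def beam_step(beams: List[Tuple[str, float]],
              next_log_probs: Dict[str, Dict[str, float]],
              beam_size: int = 4) -> List[Tuple[str, float]]:
    """One beam-search step: expand every hypothesis with every candidate
    subword, then keep only the top `beam_size` scoring hypotheses."""
    candidates = []
    for prefix, score in beams:
        for token, logp in next_log_probs.get(prefix, {}).items():
            candidates.append((prefix + token, score + logp))
    # Pruning: anything outside the top-k is gone for good, even if a later
    # subword would have made it the best overall hypothesis.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]
```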

Previous work has mainly tried to solve contextual modeling by incorporating an independently trained n-gram language model into the end-to-end model, a practice also known as shallow fusion. However, this approach does not handle proper nouns well: proper nouns are usually pruned out during beam search, so applying the language model afterwards, typically only once each word has been fully generated, is too late to bias them. Beam search here operates at the grapheme/wordpiece level (for Chinese, graphemes are sub-word units such as the 3,755 first-level Chinese characters, 3,008 second-level Chinese characters, and 16 punctuation marks).

In this blog post, we introduce a work that tries to address these issues: Shallow-Fusion End-to-End Contextual Biasing, published by Google at Interspeech 2019. In this work, we first explore biasing on sub-word units, so that proper nouns are not pruned away before the language model gets a chance to bias them. Second, we explore applying the contextual FST before beam pruning. Third, because contextual n-grams often occur together with a common set of prefixes ("call", "text"), we also explore incorporating these prefixes into shallow fusion. Finally, to aid the modeling of proper nouns, we explore a variety of techniques for exploiting large-scale text data.

Here, we first introduce shallow fusion. Given a speech sequence x = (x_1, ..., x_K), the end-to-end model outputs a sequence of subword-level posterior probability distributions y = (y_1, ..., y_L), i.e., it models P(y|x). Shallow fusion interpolates the end-to-end score with the score of an externally trained contextual LM during beam search:

y* = argmax_y [ log P(y|x) + λ log P_C(y) ]
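As a minimal illustration of this interpolation (a sketch under the definitions above, not the paper's implementation), the score used to rank a hypothesis in the beam is simply the end-to-end log probability plus a weighted contextual LM log score:

```python
def shallow_fusion_score(e2e_log_prob: float,
                         contextual_lm_log_prob: float,
                         lam: float = 0.3) -> float:
    """Shallow fusion: rank hypothesis y by log P(y|x) + lambda * log P_C(y).
    `lam` is the interpolation weight; 0.3 is just a placeholder value."""
    return e2e_log_prob + lam * contextual_lm_log_prob
```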

where λ is a parameter that adjusts the relative weight of the end-to-end model and the contextual LM. To construct the contextual LM for an end-to-end model, we assume that a set of word-level biasing phrases is known in advance and compile them into an n-gram weighted finite-state transducer (WFST). This word-level WFST is then combined with a speller FST that converts sequences of graphemes/wordpieces into the corresponding words, so that the contextual scores can be applied at the subword level.
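The real system compiles the phrases into a weighted FST composed with a speller, but the core idea of rewarding subword-level prefixes of the biasing phrases can be sketched with a toy lookup table (the tokenizer and bonus values below are assumptions for illustration only):

```python
from typing import Callable, Dict, List, Tuple

def build_biasing_table(phrases: List[str],
                        tokenize: Callable[[str], List[str]],
                        bonus_per_unit: float = 1.5) -> Dict[Tuple[str, ...], float]:
    """Toy stand-in for the contextual WFST: map every subword-level prefix
    of a biasing phrase to a score bonus, so partial matches can already be
    rewarded during beam search."""
    table: Dict[Tuple[str, ...], float] = {}
    for phrase in phrases:
        pieces = tokenize(phrase)              # e.g. ["ca", "t"] for "cat"
        for i in range(1, len(pieces) + 1):
            prefix = tuple(pieces[:i])
            table[prefix] = max(table.get(prefix, 0.0), i * bonus_per_unit)
    return table
```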

All previous biasing work, whether for traditional systems or end-to-end models, combines the scores of the contextual LM and the base model (the end-to-end model, or the acoustic model in conventional ASR) on a word or sub-word lattice. Since the end-to-end model uses a relatively small beam during decoding, it explores far fewer decoding paths than traditional methods. Therefore, this paper mainly explores applying the contextual information to the end-to-end model before beam pruning.

When we choose to bias at the grapheme level, one concern is that we may flood the beam with a large number of irrelevant words that happen to match the context FST.

For example, as shown in the figure above, if we want to bias toward the word "cat", the context FST is built to reward the letters "c", "a", and "t". But when we reward the letter "c", we may pull not only "cat" into the beam but also unrelated words such as "car". If we instead bias at the wordpiece level, each subword unit matches far fewer words, so more relevant candidates can be kept in the beam. Using the "cat" example again, if we bias by wordpiece, the word "car" will not enter the beam. Therefore, in this paper we use a wordpiece vocabulary of size 4,096.
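The following toy comparison (with a made-up wordpiece segmentation) shows why the coarser unit is more selective: after the first biased unit of "cat", graphemes still match every word starting with "c", while wordpieces already rule out "car".

```python
VOCAB = ["cat", "car", "cap", "call"]

def grapheme_units(word):
    return list(word)                       # "cat" -> ["c", "a", "t"]

def toy_wordpieces(word):
    # Hypothetical segmentation, standing in for a real 4096-piece model.
    pieces = {"cat": ["ca", "t"], "car": ["car"], "cap": ["ca", "p"], "call": ["call"]}
    return pieces[word]

def matches_after_first_unit(tokenizer, target="cat"):
    """Words whose first subword unit matches that of the biasing word."""
    first = tokenizer(target)[0]
    return [w for w in VOCAB if tokenizer(w)[0] == first]

print(matches_after_first_unit(grapheme_units))   # ['cat', 'car', 'cap', 'call']
print(matches_after_first_unit(toy_wordpieces))   # ['cat', 'cap']
```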

Further analysis shows that shallow fusion modifies the posterior probability of the output, so it can hurt utterances that contain no phrases to bias toward, i.e., utterances without context. We therefore explore biasing phrases only when they follow certain prefixes, such as "call" or "message" when looking up a contact on a phone, or "play" when trying to play music. In this paper we take these prefixes into account when constructing the context FST: we extract the prefixes that occur more than 50 times before the contextual biasing words. In the end, we obtain 292 common prefixes for finding contacts, 11 for playing songs, and 66 for finding apps. We build a non-repeating prefix FST and concatenate it with the context FST, and we also allow an empty-prefix option so that these prefixes can be skipped.
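A rough sketch of the prefix idea (the prefix list and function below are illustrative, not the paper's FST construction): biasing is activated when the previous word is one of the mined prefixes, and the empty-prefix arc corresponds to the optional skip.

```python
CONTACT_PREFIXES = {"call", "text", "message"}   # tiny illustrative subset of the 292 mined prefixes

def biasing_active(previous_word: str, allow_empty_prefix: bool = True) -> bool:
    """Decide whether contextual phrases should be scored at this position.
    In the FST this is the concatenation of a prefix FST with the context FST;
    the epsilon (empty-prefix) arc corresponds to `allow_empty_prefix`."""
    if previous_word.lower() in CONTACT_PREFIXES:
        return True
    return allow_empty_prefix
```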

One way to increase the coverage of proper nouns is to use large amounts of unsupervised data. Our unsupervised data comes from anonymized voice-search utterances. The audio is transcribed with a state-of-the-art model, and only utterances with high confidence are kept. Finally, to make sure the retained utterances are mainly about proper nouns, we run a proper-noun tagger (a CRF sequence tagger, as used in NER) and keep only utterances that contain proper nouns. With this procedure we obtain 100 million unsupervised utterances, which are combined with 35 million supervised utterances for training. During training, 80% of each batch is supervised data and 20% is unsupervised data.

One problem with unsupervised data is that the recognized transcripts can contain errors, especially for names with several valid spellings, such as Eric, Erik, or Erick. We can therefore also take a large set of proper nouns and use TTS to create a synthetic dataset. We mined a large number of contextual biasing words from the web for different categories, such as multimedia, social, and apps, and ended up with about 580,000 contact names, 42,000 song names, and 70,000 app names. Next, we mined frequent prefixes from logs, for example "call John mobile", whose prefix "call" corresponds to the social domain. We then generated transcripts from category-specific prefixes plus proper nouns, and used a speech synthesizer to produce about 1 million utterances per category. We went a step further and added noise to these utterances to simulate room acoustics. Finally, during training, 90% of each batch is supervised data and 10% is synthetic data.
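A possible way to realize the fixed per-batch mixing ratios mentioned above (80/20 for unsupervised data, 90/10 for synthetic data) is simple sampling; this sketch assumes in-memory lists of utterances and is not the paper's actual data pipeline:

```python
import random

def sample_mixed_batch(supervised, extra, batch_size=32, extra_fraction=0.2):
    """Draw one training batch with a fixed fraction of `extra` data
    (e.g. 0.2 for unsupervised utterances, 0.1 for synthetic TTS data)."""
    n_extra = int(round(batch_size * extra_fraction))
    batch = random.sample(supervised, batch_size - n_extra) + random.sample(extra, n_extra)
    random.shuffle(batch)
    return batch
```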

Finally, we explore whether more proper nouns can be added to the supervised training set itself. Specifically, we use the proper-noun tagger to find proper nouns in each utterance. For each proper noun we obtain its phoneme sequence; for example, "Caitlin" can be represented as the phonemes "K eI t l @ n". We then look up the pronunciation dictionary for other words with the same phoneme sequence, such as "Kaitlyn", and during training we randomly replace the original word with one of these alternatives. This lets the model observe more proper nouns. The intuition is that if the model learns to spell more names during training, it will be better at spelling those names during decoding when combined with the context FST.
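A minimal sketch of this phoneme-matched substitution (the pronunciation entries below are illustrative; a real system would use its full pronunciation dictionary):

```python
import random
from collections import defaultdict

# Toy pronunciation dictionary: word -> phoneme sequence (illustrative entries only).
PRON_DICT = {
    "caitlin": "k eI t l @ n",
    "kaitlyn": "k eI t l @ n",
    "eric":    "E r I k",
    "erik":    "E r I k",
}

# Invert it: phoneme sequence -> all spellings that share the pronunciation.
HOMOPHONES = defaultdict(list)
for w, phones in PRON_DICT.items():
    HOMOPHONES[phones].append(w)

def maybe_respell(word: str, prob: float = 0.5) -> str:
    """With some probability, replace a proper noun by another word with the
    same phoneme sequence, so the model sees more spellings during training."""
    phones = PRON_DICT.get(word.lower())
    if phones and random.random() < prob:
        return random.choice(HOMOPHONES[phones])
    return word
```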

Now let's look at the experiments. All experiments are based on an RNN-T model. The encoder contains a time-reduction layer and 8 LSTM layers, each with 2,000 hidden units. The decoder consists of 2 LSTM layers, each with 2,000 hidden units. The encoder and decoder outputs are fed into a joint network with 600 hidden units, which is followed by a softmax over either 96 graphemes or 4,096 wordpieces. At inference time, each utterance is accompanied by a set of biasing phrases used to construct a contextual FST. In this FST, every arc carries the same weight, and this weight is tuned separately for each category's test set (music, contacts, and so on).
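To make the model shape concrete, here is a rough PyTorch sketch of just the joint network and output layer with the dimensions quoted above (how the encoder and decoder outputs are combined is an assumption; this post does not specify it):

```python
import torch
import torch.nn as nn

class ToyJointNetwork(nn.Module):
    """Sketch of an RNN-T joint network: encoder and decoder (prediction
    network) outputs are combined, projected to 600 hidden units, and mapped
    to a softmax over 4096 wordpieces. Only the sizes follow the article."""
    def __init__(self, enc_dim=2000, dec_dim=2000, joint_dim=600, vocab_size=4096):
        super().__init__()
        self.project = nn.Linear(enc_dim + dec_dim, joint_dim)
        self.output = nn.Linear(joint_dim, vocab_size)

    def forward(self, enc_out, dec_out):
        # enc_out: [T, enc_dim], dec_out: [U, dec_dim] -> log-probs: [T, U, vocab_size]
        t, u = enc_out.size(0), dec_out.size(0)
        enc = enc_out.unsqueeze(1).expand(t, u, enc_out.size(-1))
        dec = dec_out.unsqueeze(0).expand(t, u, dec_out.size(-1))
        joint = torch.tanh(self.project(torch.cat([enc, dec], dim=-1)))
        return torch.log_softmax(self.output(joint), dim=-1)
```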

The figure above shows some shallow-fusion results. E0 and E1 are the unbiased grapheme and wordpiece baselines. E2 is grapheme-level biasing without any of the improvements proposed in this paper. E3 adds a subtractive cost to keep bad candidates from lingering in the beam, which brings improvements on almost all test sets. Moving from grapheme-level biasing to wordpiece-level biasing, i.e., biasing on longer units, helps keep relevant candidates in the beam and further improves performance. Finally, our E5 model applies the biasing FST before beam-search pruning, which we call early biasing; this helps good candidates stay in the beam earlier and brings an additional gain. In summary, our best shallow-fusion model biases at the wordpiece level with subtractive cost and early biasing.
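The difference between early and late biasing can be sketched as follows (illustrative only; `context_bonus` stands for the contextual FST score of a hypothesis):

```python
def prune_with_bias(candidates, context_bonus, beam_size=4, early=True):
    """`candidates` is a list of (hypothesis, e2e_score) pairs.
    Early biasing adds the contextual bonus *before* selecting the top-k,
    so a biased rare word can survive pruning; late biasing rescores only
    the hypotheses that already survived pruning."""
    if early:
        rescored = [(h, s + context_bonus(h)) for h, s in candidates]
        return sorted(rescored, key=lambda c: c[1], reverse=True)[:beam_size]
    survivors = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return [(h, s + context_bonus(h)) for h, s in survivors]
```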

Since biasing phrases will not be present in every utterance, we also need to make sure that performance does not degrade when no biasing context applies, i.e., that biasing does not hurt recognition of unbiased utterances. To test this, we run an experiment on the VS test set, in which a biasing FST is constructed from 200 biasing phrases randomly selected from the CNT-TTS test set. The results are shown below:

As the table shows, E1 is our baseline model, and once biasing is added, the E5 model degrades considerably on VS. To address this, traditional systems include prefixes in the biasing FST. If we apply biasing only after seeing a non-empty prefix (E6), results on the VS set improve compared with E5, but results on the other test sets, which do contain biasing phrases, get worse. Further, when we also allow an empty prefix (mainly for utterances where the biasing phrase appears without a prefix), we only get results similar to E5. To solve this, we give contextual phrases a smaller weight when they are preceded by an empty prefix (i.e., no prefix at all). With this approach, E8 shows only a small degradation on VS compared with E1 while keeping the improvements on the test sets with biasing phrases.

Following the analysis above, we further explore whether biasing can be improved when the model is exposed to more proper nouns. Our baseline is the E8 model, trained on 35 million supervised utterances. Combining the unsupervised and synthetic data described above, we ran the following experiments:

The results for E9 show that training with the unsupervised data improves performance on all test sets. Training with the synthetic data (E10) brings a larger improvement than E9 on the TTS test sets, but a larger degradation on the real-scene test set CNT-Real (7.1 vs. 5.8). This indicates that the improvement on the TTS biasing test sets comes mainly from the matched audio conditions between training and test, rather than from learning a richer vocabulary of proper nouns.
