• An Introduction to Speech Recognition using WFSTs
  • By Desh Raj
• The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: sisibeloved
  • Proofread by: xionglong58, JackEggie

My previous blog posts have been about deep learning methods and their application to NLP. A few weeks ago, I started working on automatic speech recognition (ASR), so I will be posting some speech-related articles now.

The logic behind ASR is very simple (Bayes' rule, just like many other machine learning methods). In essence, ASR is about taking a given speech waveform and identifying the text that corresponds to it. Let Y denote the feature vectors extracted from the waveform (note: this "feature extraction" is itself a fairly involved process, which I will cover in another article) and let W denote any word sequence. We then obtain the following formula:
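In standard notation, this is the usual Bayes decomposition (the denominator P(Y) is dropped because it does not depend on W):

W* = argmax_W P(W | Y) = argmax_W P(Y | W) · P(W)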

The two likelihoods in the formula are trained separately. The first component, P(Y | W), is called the acoustic model and is trained on parallel corpora containing speech waveforms and their transcriptions. The second component, P(W), is called the language model and is trained on large amounts of text in an unsupervised manner.

While ASR training looks simple at this level of abstraction, implementing it is far more involved. This is usually done using weighted finite-state transducers (WFSTs). In this article, I will introduce WFSTs and their underlying algorithms, and briefly explain how they are used for speech recognition.

Weighted Finite State Transducer (WFST)

If you've taken a theory of computation course before, you are probably already familiar with automata. Conceptually, a finite automaton accepts a language (a set of strings). It is represented as a directed graph, as shown below.

Each automaton consists of a start state, one or more final states, and labeled edges connecting the states. A string is accepted if, after tracing its path through the graph, it ends in a final state. For example, the DFA (deterministic finite automaton) above accepts the strings A, AC, and AE.

So an acceptor maps any input string to a binary class {0, 1}, depending on whether the string is accepted. A transducer, in contrast, carries two labels on each edge: an input label and an output label. A weighted finite-state transducer goes one step further and attaches a weight to each edge and each final state.

Thus, a WFST is a mapping from pairs of strings to weights. Each string pair is formed by reading the input and output labels along a path of the WFST. Pairs that are not produced by any path are assigned infinite weight.

In practice, most programming languages have libraries that implement WFSTs. OpenFST is a popular C++ library and is also used by the Kaldi speech recognition toolkit.
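As a toy illustration (not from the original post), the following sketch builds a two-state transducer with the OpenFST C++ API; the integer labels stand for arbitrary symbol-table IDs:

```cpp
#include <fst/fstlib.h>

int main() {
  // A two-state transducer over the tropical semiring (StdArc).
  fst::StdVectorFst f;
  const int s0 = f.AddState();
  const int s1 = f.AddState();
  f.SetStart(s0);                               // start state
  f.SetFinal(s1, fst::TropicalWeight::One());   // final state, weight 0

  // One arc: input label 1, output label 2, weight 0.5, destination s1.
  f.AddArc(s0, fst::StdArc(1, 2, 0.5, s1));

  f.Write("example.fst");                       // serialize to disk
  return 0;
}
```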

In principle, we could implement speech recognition without WFSTs. However, this data structure comes with a range of proven results and algorithms that can be applied directly to ASR without having to worry about correctness or complexity. These advantages make WFSTs hard to beat in speech recognition. Next, I will summarize some of the key WFST algorithms.

Basic algorithms on WFSTs

Composition

As the name implies, composition is the process of combining two WFSTs into a single WFST. If we have a transducer for pronunciations and another for the word-level grammar, this algorithm allows us to easily build a speech-to-text system.

Composition follows three principles:

  1. The initial states of the original WFSTs are paired to form the initial state of the new WFST.
  2. Similarly, final states are paired.
  3. Whenever an edge of the first WFST has an output label equal to the input label of an edge of the second WFST, an edge is added between the corresponding state pairs. The weight of the new edge is the ⊗-product of the original weights.

Here is an example of composition:

For edge weights, the definition of "sum" is important. This is where the concept of a semiring comes in, which lets a WFST accept "languages" in a generalized sense. Conceptually, a semiring is a set of elements together with two operators, ⊕ and ⊗, whose definitions depend on the particular semiring. For example, in the tropical semiring, ⊕ takes the minimum and ⊗ adds. In any WFST, the weight of a complete path is the ⊗-product of the weights of its edges (note: for the tropical semiring this "product" is a sum), and the weight of a string pair is the ⊕-sum of the weights of all paths carrying that label sequence.
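A small sketch (mine, not the post's) of how the tropical semiring behaves in OpenFST: Plus implements ⊕ (minimum) and Times implements ⊗ (addition):

```cpp
#include <fst/fstlib.h>
#include <iostream>

int main() {
  fst::TropicalWeight a(1.5), b(3.0);

  // In the tropical semiring, Plus (the ⊕ operator) takes the minimum ...
  std::cout << fst::Plus(a, b) << std::endl;   // prints 1.5

  // ... and Times (the ⊗ operator) adds the weights.
  std::cout << fst::Times(a, b) << std::endl;  // prints 4.5

  // Identities: Zero() is +infinity for ⊕, One() is 0 for ⊗.
  std::cout << fst::TropicalWeight::Zero() << " "
            << fst::TropicalWeight::One() << std::endl;
  return 0;
}
```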

Here is the implementation of composition in OpenFST.
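As a rough sketch of how it is invoked (assuming two transducers a and b whose output and input symbol tables match):

```cpp
#include <fst/fstlib.h>

// Compose two transducers a and b: a's output labels are matched
// against b's input labels. One of them must be arc-sorted first.
fst::StdVectorFst ComposeExample(fst::StdVectorFst a,
                                 const fst::StdVectorFst &b) {
  // Sort a's arcs by output label so composition can match them efficiently.
  fst::ArcSort(&a, fst::OLabelCompare<fst::StdArc>());

  fst::StdVectorFst result;
  fst::Compose(a, b, &result);
  return result;
}
```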

Determinization

Deterministic automata are automata with at most one transition per label in each state. Expressed this way, a deterministic WFST eliminates all redundancy and greatly reduces the complexity of working with the underlying grammar. So, is every WFST determinizable?

Twins property: suppose an automaton A contains two states p and q. We call p and q siblings if both can be reached by the same input string x and both have cycles labeled with the same string y. Conceptually, two sibling states are twins if the total weight of the path up to that state (including the cycle) is the same.

A WFST is determinizable when all of its sibling states are twins.

This is an example of what I mentioned earlier: WFSTs come with well-studied algorithms that ASR can use directly. There are several algorithms for determinizing a WFST. One of them is shown below:

The simplified steps of the algorithm are as follows:

  • In each state, for each label, if there is more than one outgoing edge with that label, those edges are replaced with a single edge whose weight is the ⊕-sum of the weights of all the edges carrying that label.

Since this is a local algorithm, it can be implemented efficiently in memory. To learn how to determinize in OpenFST, see here.
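A minimal sketch of the corresponding OpenFST call, assuming the input transducer nondet satisfies the twins property:

```cpp
#include <fst/fstlib.h>

// Determinize a WFST: the result has at most one transition per input
// label in every state and accepts the same weighted language.
fst::StdVectorFst DeterminizeExample(const fst::StdVectorFst &nondet) {
  fst::StdVectorFst det;
  fst::Determinize(nondet, &det);
  return det;
}
```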

Minimization

Although minimization is not as important as determinization, it is still a useful optimization technique. It reduces the number of states and transitions in a determinized WFST to the minimum.

Minimization is done in two steps:

  1. Weight pushing: all weights are pushed toward the start state. See the following example.

  2. After doing this, we merge states that have identical paths to the final state. For example, in the WFST above, states 1 and 2 become identical after weight pushing, so they are combined into one state.

In OpenFST, a concrete implementation of minimization can be found here.
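A rough sketch of the two optimization steps with the OpenFST API (the explicit Push call is shown purely for illustration; in practice Minimize is usually applied directly to the determinized transducer):

```cpp
#include <fst/fstlib.h>

// Sketch of the two-step optimization: push weights toward the start
// state, then minimize the (deterministic) transducer in place.
void MinimizeExample(fst::StdVectorFst *det) {
  // Step 1: weight pushing toward the initial state.
  fst::Push(det, fst::REWEIGHT_TO_INITIAL);

  // Step 2: merge states that have become equivalent.
  fst::Minimize(det);
}
```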

The following figure (from [3]) shows the complete flow of WFST optimization:

Application of WFST in speech recognition

In speech recognition, several WFSTs are composed in sequence, in the following order:

  1. Grammar (G): a language model trained on a large text corpus.
  2. Lexicon (L): encodes pronunciation information using context-independent phones.
  3. Context dependency (C): similar to an N-gram language model, except that it applies to phones rather than words.
  4. HMM structure (H): models the actual waveforms.

In general, the complete speech recognition process can be represented by the composed transducer H ∘ C ∘ L ∘ G. Each of these components can be improved individually to improve the ASR system as a whole.
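As a minimal sketch of building this cascade with OpenFST (the function and argument names are illustrative, and consistent symbol tables are assumed; real toolkits such as Kaldi interleave determinization and minimization between these compositions):

```cpp
#include <fst/fstlib.h>

// Build the recognition transducer H ∘ C ∘ L ∘ G.
// The left operand of each composition is arc-sorted on its output
// labels, as required for efficient matching.
fst::StdVectorFst BuildHCLG(fst::StdVectorFst h, fst::StdVectorFst c,
                            fst::StdVectorFst l, const fst::StdVectorFst &g) {
  fst::StdVectorFst lg, clg, hclg;

  fst::ArcSort(&l, fst::OLabelCompare<fst::StdArc>());
  fst::Compose(l, g, &lg);      // L ∘ G

  fst::ArcSort(&c, fst::OLabelCompare<fst::StdArc>());
  fst::Compose(c, lg, &clg);    // C ∘ (L ∘ G)

  fst::ArcSort(&h, fst::OLabelCompare<fst::StdArc>());
  fst::Compose(h, clg, &hclg);  // H ∘ (C ∘ L ∘ G)

  return hclg;
}
```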

WFSTs are an important part of ASR systems, and this article is only a brief introduction to them. In future speech-related posts, I will discuss topics such as feature extraction, the popular GMM-HMM model, and recent advances in deep learning. I am also reading up on these topics to better understand how ASR has developed over the years.

References

  • [1] Gales, Mark, and Steve Young. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends® in Signal Processing 1.3 (2008): 195–304.
  • [2] Mohri, Mehryar, Fernando Pereira, and Michael Riley. Weighted Finite-State Transducers in Speech Recognition. Computer Speech & Language 16.1 (2002): 69–88.
  • [3] Lecture notes of Prof. Hui Jiang (York University).

If you find any mistakes in the translation or other areas that could be improved, you are welcome to revise the translation and open a PR in the Nuggets Translation Project, for which you can also earn corresponding reward points. The permanent link at the beginning of this article is the Markdown link to this article on GitHub.


The Nuggets Translation Project is a community that translates high-quality technical articles from around the Internet and shares the English originals and translations on Juejin (Nuggets). The content covers Android, iOS, front-end, back-end, blockchain, product, design, artificial intelligence, and other fields. If you want to see more high-quality translations, please keep following the Nuggets Translation Project and its official Weibo and Zhihu column.