Here is a brief introduction to how speech is turned into text. I hope it serves as a good introduction for everyone.


First, we know that sound is actually a wave. Common formats such as MP3 and WMV are compressed formats and must be converted into uncompressed pure waveform files before processing, such as Windows PCM files, also known as WAV files. Apart from a file header, a WAV file stores nothing but the sample points of the sound wave. Below is an example of a waveform.
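To make the idea of a "pure waveform" concrete, here is a minimal sketch of reading a WAV file into its sample points, assuming a 16-bit mono PCM file; the file name is a placeholder.

```python
import wave

import numpy as np

# Open the WAV file and read the raw PCM bytes after the header.
with wave.open("speech.wav", "rb") as f:
    sample_rate = f.getframerate()           # e.g. 16000 samples per second
    raw_bytes = f.readframes(f.getnframes())

# For 16-bit PCM, every sample point is a signed 16-bit integer.
waveform = np.frombuffer(raw_bytes, dtype=np.int16)
print(sample_rate, waveform.shape)
```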



Before starting speech recognition, it is sometimes necessary to remove the silence at the beginning and end of the audio to reduce interference in later steps. This silence-removal operation, commonly known as VAD (Voice Activity Detection), requires some signal-processing techniques. To analyze the sound, it then needs to be framed, that is, cut into small segments, each of which is called a frame. Framing is usually not a simple cut but is implemented with a moving window function, which is not detailed here. Frames overlap each other, as shown in the following image:



In the figure, each frame is 25 milliseconds long, and adjacent frames overlap by 25 - 10 = 15 milliseconds. This is called framing with a frame length of 25 ms and a frame shift of 10 ms.
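A minimal framing sketch under those assumptions (16 kHz sampling, 25 ms frames, 10 ms shift); a Hamming window stands in here for the moving window function mentioned above:

```python
import numpy as np

def frame_signal(waveform, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Cut a waveform into overlapping, windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)    # 400 samples per frame
    frame_shift = int(sample_rate * shift_ms / 1000)  # 160 samples per shift
    n_frames = 1 + (len(waveform) - frame_len) // frame_shift
    frames = np.stack([
        waveform[i * frame_shift : i * frame_shift + frame_len]
        for i in range(n_frames)
    ])
    # Each frame is multiplied by a window function rather than cut with a
    # hard edge; the Hamming window is one common choice.
    return frames * np.hamming(frame_len)
```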


After framing, the speech is broken into many small segments. But a waveform has little descriptive power in the time domain, so it must be transformed. A common transformation is to extract MFCC features: based on the physiological characteristics of the human ear, each frame of waveform is turned into a multidimensional vector, which can be simply understood as containing the content information of that frame of speech. This process is called acoustic feature extraction. In practice there are many details to this step, and MFCC is not the only acoustic feature, but I won’t go into that here.
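As a hedged sketch of this step, the librosa library (assumed to be available here) can compute MFCCs directly; n_mfcc=12 matches the 12-dimensional example used below.

```python
import librosa

# Load the audio at 16 kHz; "speech.wav" is a placeholder file name.
waveform, sr = librosa.load("speech.wav", sr=16000)

mfcc = librosa.feature.mfcc(
    y=waveform, sr=sr, n_mfcc=12,
    n_fft=int(0.025 * sr),       # 25 ms frame length
    hop_length=int(0.010 * sr),  # 10 ms frame shift
)
print(mfcc.shape)  # (12, N): 12 rows, one column per frame
```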


At this point, the sound becomes a matrix with 12 rows (assuming the acoustic features are 12-dimensional) and N columns, called the observation sequence, where N is the total number of frames. The observation sequence is shown in the figure below: each frame is represented by a 12-dimensional vector, and the shade of each color block indicates the magnitude of the corresponding value.



Now I’m going to show you how to turn this matrix into text. Two concepts will be introduced first:


  1. Phoneme: the pronunciation of a word is made up of phonemes. A commonly used phoneme set for English is the set of 39 phonemes from Carnegie Mellon University; see The CMU Pronouncing Dictionary. Chinese generally uses initials and finals directly as phonemes, and recognition can be done with or without tones.

  2. State: here simply understood as a phonetic unit finer-grained than a phoneme. A phoneme is usually divided into three states (a toy illustration of both concepts follows this list).
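A toy illustration of the two concepts above; the pronunciations follow the CMU Pronouncing Dictionary style (without stress marks), and the three-way split per phoneme is just the convention described above.

```python
# Toy lexicon: word -> phonemes, in CMU Pronouncing Dictionary style.
lexicon = {
    "speech": ["S", "P", "IY", "CH"],
    "today":  ["T", "AH", "D", "EY"],
}

def phoneme_states(phoneme):
    # Each phoneme is modeled by three states, e.g. IY_1, IY_2, IY_3.
    return [f"{phoneme}_{i}" for i in (1, 2, 3)]

# The state sequence for the word "speech".
print([s for p in lexicon["speech"] for s in phoneme_states(p)])
```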


How does speech recognition work? In fact, there is no mystery to it at all; it is nothing more than:


Step one, recognize each frame as a state (the hard part);

Step two, combine the states into phonemes;

Step three, combine the phonemes into words.


As shown below:



In the figure, each small vertical bar represents a frame. Several frames of speech correspond to one state, every three states combine into one phoneme, and several phonemes combine into one word. In other words, as long as you know which state each frame of speech corresponds to, the result of the speech recognition follows.
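A toy sketch of that mapping; all state and phoneme names here are made up for illustration (the phoneme sequence IY T happens to spell the word "eat" in CMU notation).

```python
# Per-frame state labels for a short stretch of speech (made-up data).
frame_states = ["S1", "S1", "S1", "S2", "S2", "S3", "S3", "S3",
                "S4", "S4", "S5", "S5", "S6", "S6", "S6"]

# Collapse runs of identical states: S1 S2 S3 S4 S5 S6.
state_seq = [s for i, s in enumerate(frame_states)
             if i == 0 or s != frame_states[i - 1]]

# Every three consecutive states combine into one phoneme.
state_to_phoneme = {("S1", "S2", "S3"): "IY", ("S4", "S5", "S6"): "T"}
phonemes = [state_to_phoneme[tuple(state_seq[i:i + 3])]
            for i in range(0, len(state_seq), 3)]
print(phonemes)  # ['IY', 'T'] -> looked up in a lexicon to give a word
```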


So which state does each frame correspond to? An obvious idea is to look at which state a frame is most likely to belong to, and then assign that frame to that state. For example, in the diagram below, this frame has the highest probability of belonging to state S3, so we let this frame belong to state S3.
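A minimal sketch of this naive per-frame decision; the probability matrix here is random stand-in data rather than the output of a real acoustic model.

```python
import numpy as np

n_frames, n_states = 100, 3                 # e.g. states S1, S2, S3
frame_state_prob = np.random.rand(n_frames, n_states)
frame_state_prob /= frame_state_prob.sum(axis=1, keepdims=True)

# For each frame, pick the state with the highest probability.
best_state = frame_state_prob.argmax(axis=1)   # index 2 would mean S3
```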



So where do these probabilities come from? There is something called an acoustic model, which holds a large number of parameters; with these parameters, the probability of any frame corresponding to any state can be computed. The method of obtaining this large set of parameters is called “training”, and it requires a huge amount of speech data. The training procedure is rather involved and will not be described here.


But there is a problem with this: each frame is assigned a state number, so the whole utterance ends up as a jumble of state numbers, with adjacent frames mostly getting different states. Suppose a piece of speech has 1000 frames, each frame corresponds to one state, and every three states combine into a phoneme; then it would combine into roughly 300 phonemes, yet the utterance contains nowhere near that many. If this were done, the resulting state numbers might not be combinable into phonemes at all. In fact, since each frame is very short, the states of adjacent frames should usually be the same.


A common way to solve this problem is to use a Hidden Markov Model (HMM). It sounds like something profound, but it is actually quite simple to use:


Step one, build a state network.

Step two, search the state network for the path that best matches the sound.


This restricts the result to paths inside a preset network, which avoids the problem just described. Of course, it also brings a limitation: if, for example, the network contains only the state paths of the two sentences “it is sunny today” and “it is raining today”, then no matter what is said, the recognized result will be one of those two sentences.


What if you want to recognize arbitrary text? Build the network large enough to include paths with arbitrary text. But the larger the network, the harder it is to get good accuracy. Therefore, choose a proper network size and structure based on actual task requirements.


Building the state network means expanding a word-level network into a phoneme network, and then into a state network. The speech recognition process is then a search for the optimal path in this state network, i.e. the path whose probability of matching the speech is the greatest; this is called “decoding”. The path search algorithm is a dynamic-programming pruning algorithm called the Viterbi algorithm, which finds the globally optimal path.
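A compact sketch of Viterbi decoding in the log domain, written against a generic HMM; the transition and observation log probabilities are whatever the models provide, not real values.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Find the most probable state path.

    log_obs:   (T, S) log probability of each frame under each state
    log_trans: (S, S) log probability of moving from state i to state j
    log_init:  (S,)   log probability of starting in each state
    """
    T, S = log_obs.shape
    score = np.full((T, S), -np.inf)       # best cumulative log probability
    backptr = np.zeros((T, S), dtype=int)  # best predecessor state
    score[0] = log_init + log_obs[0]
    for t in range(1, T):
        for s in range(S):
            cand = score[t - 1] + log_trans[:, s]
            backptr[t, s] = cand.argmax()
            score[t, s] = cand.max() + log_obs[t, s]
    # Trace back the globally optimal path from the best final state.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```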



The cumulative probability mentioned here consists of three parts, namely:


  1. Observation probability: the probability of each frame corresponding to each state

  2. Transition probability: the probability of each state transitioning to itself or to the next state

  3. Language probability: the probability derived from the statistical regularities of the language


The first two probabilities are obtained from the acoustic model, and the last one from the language model. The language model is trained on a large amount of text and exploits the statistical regularities of a language itself to improve recognition accuracy. The language model is very important: without it, when the state network is large, the recognition result is basically a mess.
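A small illustrative sketch of how the three parts combine along a path: in the log domain, the cumulative score is simply the sum of the observation, transition, and language-model log probabilities (all numbers below are made up).

```python
import math

log_obs   = sum(math.log(p) for p in (0.7, 0.6, 0.8))  # frame-vs-state terms
log_trans = sum(math.log(p) for p in (0.5, 0.9))       # state transitions
log_lm    = math.log(0.01)                              # language model

path_score = log_obs + log_trans + log_lm  # higher means a better path
```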


The above describes traditional HMM-based speech recognition. In fact, there is much more to HMMs than the state network described above. The text above is meant to be easy to understand rather than rigorous.


End