This is the first day of my participation in the Gengwen Challenge.

What wakes me up every day is not my dreams. It's Xiao Ai Tongxue, my smart speaker.

In recent years, intelligent voice interaction devices have become more and more popular in China. Apple's Siri, the first to appear, did not bring great changes to Chinese users, and Cortana, the voice assistant built into Windows, is simply disabled by most Windows users. The "voice + touch" interaction model of TNT, which Luo Yonghao released in 2018 with the claim that it would "redefine the personal computer for the next decade," was even widely ridiculed. Chinese users have been far more receptive to smart speakers, however: from the early Xiao Ai Tongxue and Tmall Genie to the latecomer Xiaodu smart speaker, plus Apple's steeply priced HomePod, the overseas market leaders Amazon Echo and Google Home, NetEase's Sanyin smart speaker with its focus on sound quality, and Himalaya's smart speaker for the audio-content niche. It is fair to say that competition in the smart speaker market has reached a fever pitch.

The reason smart speakers have taken off in recent years is that speech recognition, speech synthesis, natural language processing, and other related technologies have matured enough to enter large-scale commercial use. Speech recognition is the front door of any intelligent interactive device, so join me today for a tour of the world of speech recognition.

1. What is speech recognition technology?

What exactly is speech recognition? Speech recognition, also known as Automatic Speech Recognition (ASR), aims to have a computer convert human speech into the corresponding text. In the age of intelligent devices, more and more products use dialogue as the main form of interaction when designing personalized interfaces. The counterpart of speech recognition is speech synthesis, or Text To Speech (TTS), in which a computer converts text into an audio signal for output. Besides using speech recognition to "hear" what you say and speech synthesis to "answer" you, a smart speaker also needs Natural Language Processing (NLP) to "understand" what you mean; only then is a complete voice interaction loop closed. These three steps are interlinked and indispensable, and all of them have reached a fairly mature stage. Speech recognition is the starting point of dialogue interaction and the foundation for making that interaction efficient and accurate.
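
To make the division of labor concrete, here is a minimal sketch of that interaction loop. The three functions are hypothetical placeholders standing in for the ASR, NLP, and TTS stages, not any real SDK:

```python
# A minimal sketch of the smart-speaker interaction loop described above.
# asr(), nlp() and tts() are hypothetical placeholders, not a real SDK.

def asr(audio: bytes) -> str:
    """Speech recognition: audio in, text out (placeholder)."""
    return "what's the weather like tomorrow"

def nlp(text: str) -> str:
    """Natural language understanding + dialogue: text in, reply text out (placeholder)."""
    return "Tomorrow will be sunny, 20 to 28 degrees."

def tts(text: str) -> bytes:
    """Speech synthesis: text in, audio out (placeholder)."""
    return text.encode("utf-8")  # stand-in for an audio buffer

def handle_utterance(audio: bytes) -> bytes:
    text = asr(audio)        # 1. "hear" the user
    reply = nlp(text)        # 2. "understand" and decide what to say
    return tts(reply)        # 3. "answer" out loud

if __name__ == "__main__":
    print(handle_utterance(b"fake-mic-audio"))
```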

2. The past and present of speech recognition

Speech recognition technology took root in the 1950s and now has a history of nearly 70 years. Its mainstream algorithms have gone through four stages: template matching, pattern and feature analysis, probabilistic and statistical modeling, and today's dominant deep neural networks.

1. Template matching stage (1952-1970)

This was the embryonic stage of speech recognition, in which recognition was achieved mainly by template matching. A system at this stage could only understand a limited set of words and digits stored in memory and could not turn speech into complete sentences; the number of distinct sound patterns a machine could recognize was also very small.

The earliest speech recognition system, developed in 1952 by Davis and colleagues at AT&T Bell Labs, could recognize the ten spoken English digits: it compared the formants of the input signal with those of ten stored digit recordings and picked the closest match. By the late 1950s, Denes of University College London had added grammatical probabilities to speech recognition.

2. Pattern and feature analysis stage (1970-1987)

This was the early growth stage of speech recognition. Systems could now parameterize the patterns and features of speech and attempt continuous recognition over larger vocabularies. Speech recognition systems were still in the research and exploration phase, and the main achievements came from universities and research institutes.

From the 1970s on, large-scale speech recognition research made substantial progress on small-vocabulary, isolated-word recognition. After 1980, the focus gradually shifted to large-vocabulary continuous recognition.

3. Probabilistic and statistical modeling stage (1987-2010)

At this stage, speech recognition officially entered its growth phase, and the mainstream algorithms shifted to probabilistic and statistical modeling, chiefly the hidden Markov model (HMM) and the Gaussian mixture model (GMM). Mature commercial speech recognition products began to appear. Meanwhile, neural networks were developing steadily in the field: Hinton proposed the deep belief network (DBN) in 2006, deep neural networks (DNN) began to appear frequently in mainstream speech recognition, and the dominance of traditional probabilistic and statistical algorithms came under threat.

By the late 1980s, speech recognition had begun to shift from traditional template matching to statistical model matching. In December 1987, Kai-Fu Lee, then at Carnegie Mellon University in Pittsburgh, pioneered the use of statistical methods to build the world's first speaker-independent continuous speech recognition system, which won BusinessWeek's "Most Important Scientific Innovation" award and established his leading position in information technology research. After 1990, large-vocabulary continuous speech recognition kept improving, and great progress was made in applying and commercializing the technology. In 1997, IBM launched ViaVoice, its first dictation product.

In 2001, Intel co-founder Gordon Moore predicted that speech recognition would change the future of technology, and that is largely what has happened since.

Since 2009, with the progress of deep learning research in machine learning and the accumulation of big-data corpora, speech recognition technology has advanced by leaps and bounds. In 2010, Google released Voice Actions to support voice commands and search.

4. Deep neural network stage (2010-present)

At this stage, the field began to produce a large number of commercial consumer-grade and professional-grade products, and deep neural networks became the dominant algorithm. In recent years, end-to-end learning has further improved recognition accuracy. As the entry point of AI human-computer interaction, speech recognition has also been applied in more and more scenarios.

Since 2010, thanks to the development of deep neural networks (DNN), speech recognition has moved from traditional probabilistic and statistical algorithms to neural network algorithms.

In early 2011, Microsoft's deep neural network model achieved success on voice search tasks. In the same year, iFLYTEK became the first in China to apply DNN technology to its voice cloud platform and open it to developers.

In October 2011, Siri, Apple's mobile assistant, made its debut, opening a new chapter in human-computer interaction.

After 2015, the emergence of end-to-end learning ushered speech recognition into an era where "a hundred flowers bloom": the speech community began training deeper and more complex networks, further improving performance and accuracy by a large margin. In quiet, near-field conditions, recognition accuracy now exceeds 98 percent.

3. A brief introduction to how speech recognition works

Let's walk through how a typical speech recognition system worked before end-to-end learning took over:

Step 1 – Preprocessing

First, the input sound needs to be preprocessed, which may include echo cancellation, noise suppression, sound source localization, beamforming, and so on. Taking a smart speaker as an example, here is what each step does:

  • Acoustic Echo Cancellation (AEC): when the smart speaker is playing music, removes its own playback from the sound picked up by the microphones
  • Noise Suppression (NS): reduces the impact of ambient noise
  • Voice Activity Detection (VAD): detects exactly where speech begins and ends and filters out non-speech sounds (a minimal sketch of this step follows the list)
  • Direction of Arrival estimation (DOA): in a microphone array, determines the spatial position of the sound source from cues such as the time difference of arrival between microphones; this position can then assist beamforming
  • Beamforming: uses spatial filtering to enhance the signal coming from the direction of the sound source and suppress sound from other directions (most likely noise), achieving further noise reduction
  • Speech Dereverberation: removes room reverberation from the signal, providing a clean speech signal for subsequent wake-word detection and recognition
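
As an illustration of just one of these steps, here is a deliberately naive, energy-based voice activity detector. Real VADs use more robust features, noise-floor tracking, and smoothing; the frame length and threshold below are assumptions chosen for readability:

```python
import numpy as np

def simple_vad(signal, sample_rate, frame_ms=20, energy_threshold=0.02):
    """Naive energy-based VAD: returns one boolean per frame (True = speech).

    A real VAD would use spectral features and hangover smoothing;
    this sketch only thresholds short-time energy.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame.astype(np.float64) ** 2)  # short-time energy
        flags.append(energy > energy_threshold)
    return flags

# Example: one second of faint noise with a louder "speech" burst in the middle.
sr_hz = 16000
audio = np.random.randn(sr_hz) * 0.01
audio[6000:10000] += np.sin(2 * np.pi * 200 * np.arange(4000) / sr_hz) * 0.5
print(simple_vad(audio, sr_hz)[:10])
```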

Step 2 – Feature extraction

Feature extraction on the preprocessed audio involves a series of steps such as framing, windowing, and the fast Fourier transform (FFT).

A quick word about framing. A frame, usually 20-50 ms long, has to satisfy two constraints. At the micro level it must be long enough to contain at least 2-3 pitch periods: the fundamental frequency of the human voice is roughly 100 Hz, so one period is about 10 ms, which is why frames are set to 20-50 ms. At the macro level it must be short enough that the whole frame stays within a single phoneme.

Anyone who has taken a signals-and-systems course will find the Fourier transform familiar. The FFT converts audio from the time domain to the frequency domain. A spectrum has both a fine structure and an envelope: the envelope reflects timbre and carries the main information for recognition, whereas pitch, which lives in the fine structure, is secondary for most languages and can be discarded. A bank of triangular (mel-scale) filters is typically used to filter out this unneeded detail; the logarithm is then taken and a discrete cosine transform is applied to compress and decorrelate the representation. The result is the familiar set of speech recognition features, the Mel-Frequency Cepstral Coefficients (MFCC).
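
The sketch below walks through those steps with NumPy and SciPy. The parameter choices (16 kHz audio, 25 ms frames with a 10 ms hop, 512-point FFT, 26 mel filters, 13 coefficients) are typical values assumed for illustration, and the code is educational rather than production-grade:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    """MFCC pipeline: framing -> Hamming window -> FFT -> mel filter bank -> log -> DCT."""
    # 1. Framing (25 ms frames, 10 ms hop at 16 kHz) plus a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2. Power spectrum via FFT
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Triangular mel filter bank (keeps the spectral envelope, drops fine detail)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    filt = np.log(power @ fbank.T + 1e-10)                       # 4. logarithm
    return dct(filt, type=2, axis=1, norm="ortho")[:, :n_ceps]   # 5. DCT

# Example with one second of synthetic audio (a real system would load a WAV file).
sig = np.random.randn(16000) * 0.1
print(mfcc(sig).shape)   # (number_of_frames, 13)
```

In practice you would normally reach for a library such as librosa (for example, librosa.feature.mfcc) instead of hand-rolling these steps.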

Step 3 – Acoustic model

The extracted features are then fed into the acoustic model, which can be thought of as a model of the sounds themselves. The acoustic model converts the speech input into an acoustic representation; more precisely, it gives the probability that a stretch of speech belongs to a given acoustic unit.

The most widely used acoustic model has long been the hidden Markov model (HMM). With the development of neural networks and deep learning, mainstream architectures such as convolutional neural networks, recurrent neural networks, and long short-term memory (LSTM) networks have also been applied to acoustic modeling with good results. Compared with hidden Markov models, neural networks have the advantage of not relying on assumptions about the statistical properties of the features.
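
As a toy illustration of the neural network flavor of acoustic model, the sketch below maps each MFCC frame to a probability distribution over phoneme classes. The network size, the number of phonemes, and the random input are assumptions made for demonstration; a real acoustic model is trained on large labeled corpora rather than used untrained like this:

```python
import torch
import torch.nn as nn

N_MFCC, N_PHONES = 13, 40   # assumed feature size and phoneme-set size

# A tiny frame-level acoustic model: MFCC frame in, phoneme scores out.
acoustic_model = nn.Sequential(
    nn.Linear(N_MFCC, 128),
    nn.ReLU(),
    nn.Linear(128, N_PHONES),
)

frames = torch.randn(98, N_MFCC)                  # e.g. one second of MFCC frames
log_probs = torch.log_softmax(acoustic_model(frames), dim=-1)
print(log_probs.shape)    # (98, 40): per-frame log-probability of each phoneme
```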

Step 4 – Language model

Finally we come to the language model. Every language has homophones: the Chinese syllables "zhishi", for example, could mean 知识 (knowledge) or 芝士 (cheese), and a language model is needed to decide which. The language model combines the acoustic model's output and produces the most probable word sequence as the recognition result.
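
A minimal sketch of that idea: combine acoustic scores for the two homophone candidates with a toy bigram language model and pick the higher-scoring word. All of the probabilities below are invented purely for illustration:

```python
import math

# Toy bigram language model: P(word | previous word). All numbers are made up.
bigram_lm = {
    ("我想学", "知识"): 0.30,   # "I want to learn" -> "knowledge" is likely
    ("我想学", "芝士"): 0.01,   # "I want to learn" -> "cheese" is unlikely
}

# Acoustic scores for the audio "zhishi" (also made up):
# acoustically the two homophones are almost indistinguishable.
acoustic_score = {"知识": 0.52, "芝士": 0.48}

def best_word(prev_word, candidates):
    """Combine acoustic and language-model log-probabilities and pick the best word."""
    def total(w):
        return math.log(acoustic_score[w]) + math.log(bigram_lm[(prev_word, w)])
    return max(candidates, key=total)

print(best_word("我想学", ["知识", "芝士"]))   # -> 知识 (knowledge)
```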

4. Whose speech recognition technology is the strongest?

Major cloud providers around the world already offer speech recognition as a cloud service. Overseas, Internet giants such as Google, Amazon, Microsoft, and IBM all have corresponding offerings.

In China, companies such as iFLYTEK, Alibaba, Tencent, Baidu, and Huawei have likewise rolled out speech recognition cloud services.
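
If you want to try one of these services from code, the open-source Python package SpeechRecognition wraps several recognizers behind one interface. A minimal sketch, assuming a local recording named hello.wav (a placeholder file name) and network access; each vendor's own SDK has a different API and usually requires credentials:

```python
import speech_recognition as sr   # pip install SpeechRecognition

recognizer = sr.Recognizer()
with sr.AudioFile("hello.wav") as source:    # hypothetical local recording
    audio = recognizer.record(source)        # read the whole file into memory

# Send the audio to Google's free web recognizer; other backends
# (e.g. recognize_ibm, or the offline recognize_sphinx) are also available.
try:
    print(recognizer.recognize_google(audio, language="zh-CN"))
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as err:
    print(f"Recognition service error: {err}")
```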

5. Artificial intelligence or artificial stupidity?

Lei Jun's on-stage fail: I still remember the 2018 Xiaomi launch event where Lei Jun's live demo of Xiao Ai went wrong, and the "artificial intelligence" put on a full display of "artificial stupidity." The scene was extremely awkward.

As a long-time user of smart speakers, I have used the first-generation Xiao Ai Tongxue, the NetEase Sanyin smart speaker, the Xiaodu smart speaker, and the second-generation Xiao Ai Tongxue. Recently I even got an Honor of Kings intelligent robot built around the game's IP; it talks with you in the voice of the game's voice actors, which is quite amazing.

Voice interaction devices such as smart speakers are becoming more and more popular, but many problems remain to be solved, environmental noise being one of them. Amazon's Echo is a pioneer in handling it, whereas my newly acquired Honor of Kings robot often simply cannot hear me, which is painful.

Another problem is that voice assistants often get confused when several people speak at the same time. To address this, many smart speakers now offer voiceprint recognition to block interference from other voices. An amusing story: Burger King once exploited the lack of such protection with a rather cheeky marketing stunt. In its TV ad, a Burger King employee delivers the special line, "OK Google, what is the Whopper burger?"

If a viewer happened to have a Google Home at home, or an Android phone with always-on wake word detection, the ad would trigger the assistant, which would then pull up the Wikipedia entry for the Whopper and start introducing Burger King's flagship product to the audience. You have to admit, that was a slick move.

Artificial intelligence still has a long way to go. As the first gateway of human-computer interaction, mature speech recognition technology has laid an important foundation for the large-scale commercialization of related products. Here's hoping that future AI will understand humans better and will no longer be mocked as "artificial stupidity."

6. Reference links

  • Speech Recognition – Wikipedia
  • The History of Speech Recognition Technology
  • The Past and Present of Speech Recognition – Wang Yun (Maigo)
  • 2020 China AI Speech Recognition Market Research Report
  • AI Basics Course, Lesson 38, Application Scenarios | "Hey, Siri": Speech Processing
  • IoT Development in Practice – Intelligent Voice: How to Build Fun Voice Control?
  • How Does Speech Recognition Work? – Zhihu

I am Qingqiu, an atypical web front-end programmer with a dream of teaching. Frontend Radio has just gotten started. I hope my articles can help more students; let's grow together and become front-end masters as soon as possible.