Brief comment: Speech recognition has reached near-human accuracy on some benchmarks in recent years. It largely just works, but there is still a lot of room for improvement, and some problems remain unsolved.

Speech recognition error rates have dropped dramatically since the arrival of deep learning, but we are still far from human-level speech recognition. Recognizers fail in many different ways, and acknowledging these failures and taking steps to fix them is critical to further progress.

Speech recognition error rates have declined year after year

The progress in speech recognition over the last few years has been remarkable, but several areas still clearly deserve improvement.

Accents and noise

The most obvious deficiencies in speech recognition are the handling of accents [1] and of background noise. The most immediate reason is that most of the training data consists of American-accented English recorded at high signal-to-noise ratios. But more training data will not solve this on its own: there are so many languages, dialects, and accents that it is impractical to annotate data for every situation. Building a high-quality recognizer for accented English alone can take on the order of 5,000 hours of transcribed audio.

Red is the error rate of human transcribers, and blue is the error rate of Baidu's Deep Speech 2 [2]. On some accents, the machine still lags well behind human listeners.

As for background noise, it is not uncommon for the SNR inside a moving car to be as low as -5 dB. People have little trouble understanding each other in that setting; speech recognizers, on the other hand, degrade much more quickly as noise increases. As the figure above shows, the machine is close to human performance at high SNR, but the gap widens sharply at low SNR.
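
To make these SNR numbers concrete, here is a minimal sketch (my own illustration, not from the original article) of how noisy test data is commonly simulated: scale a noise recording so that it sits at a chosen SNR relative to the clean speech, then add the two signals.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a noise signal into clean speech at a target SNR (in dB).

    Both inputs are 1-D float arrays at the same sample rate; the noise
    is tiled or truncated to match the speech length.
    """
    # Match lengths by tiling/truncating the noise.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    # Scale the noise so that 10 * log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: simulate in-car conditions at -5 dB SNR with white noise.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for 1 s of 16 kHz speech
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=-5.0)
```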

Semantic errors

Word error rate is usually not the metric we actually care about in a speech recognition system. What we care about is the semantic error rate: whether the meaning of the sentence was understood correctly.

For example, if we say “let’s meet up Tuesday” and the machine recognizes it as “let’s meet up today,” the meaning has changed — a semantic error. Sometimes a word is wrong but the sentence still means the same thing: if the machine drops the word “up” and predicts “let’s meet Tuesday,” the transcription is not exact, yet the meaning is preserved.

This is why we have to be careful with the rule of thumb that a WER (word error rate) above 5% is unacceptable. The average sentence is roughly 20 words long, so a 5% WER amounts to about one misrecognized word per sentence, which sounds like every sentence is wrong. But if the occasional wrong word leaves the meaning intact, that level of performance can still be acceptable.
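
For reference, word error rate is just the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. The short sketch below (standard Levenshtein dynamic programming, not tied to any particular toolkit) also shows why WER alone is a blunt instrument: both hypotheses from the example above score the same 25% WER, even though only the first one changes the meaning.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# Both hypotheses have the same 25% WER, but only the first changes the meaning.
print(word_error_rate("let's meet up tuesday", "let's meet up today"))  # 0.25
print(word_error_rate("let's meet up tuesday", "let's meet tuesday"))   # 0.25
```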

Researchers at Microsoft recently compared the mistakes made by human transcribers with those of a speech recognizer of comparable accuracy [3]. They found that the machine was more likely to confuse “uh” with “uh huh,” which serve completely different functions: “uh” is a filler word, while “uh huh” is an acknowledgement.

Single channel, multiple audio sources

A good conversational speech recognizer must be able to separate the audio according to who is speaking. It should also make sense of audio in which several sources overlap. And it should do all of this without each speaker having to talk directly into a microphone; a good recognizer should work wherever the conversation happens.
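
As a toy illustration only: the classical approach to separating overlapping sources is independent component analysis, which needs roughly as many microphones as speakers. The single-channel case described above is much harder and is usually tackled with neural models; the sketch below (synthetic signals, scikit-learn's FastICA) just shows what “separation” means in the easier multi-microphone setting.

```python
import numpy as np
from sklearn.decomposition import FastICA  # classical blind source separation

# Two synthetic "speakers": a 220 Hz tone and a 5 Hz sawtooth, 1 s at 8 kHz.
t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 220 * t)
s2 = 2 * (t * 5 - np.floor(t * 5 + 0.5))
sources = np.c_[s1, s2]                    # shape: (n_samples, 2)

# Two microphones each record a different mixture of the two sources.
mixing = np.array([[1.0, 0.6],
                   [0.4, 1.0]])
mixed = sources @ mixing.T                 # shape: (n_samples, 2)

# ICA recovers the sources up to permutation and scale.
ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(mixed)
```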

Domain changes

Accents and background noise are just two of the factors a recognizer must be robust to. There are many other sources of variation, including:

  • Reverberation from changes in the acoustic environment.
  • Artifacts introduced by the recording hardware.
  • The codec used for the audio and its compression artifacts.
  • Sampling rate.
  • The age of the speaker.

Most people cannot even hear the difference between an MP3 and an uncompressed WAV file, but speech recognizers can be sensitive to these differences.
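
As a small example of handling one of these mismatches, here is a sketch that downmixes and resamples a (hypothetical) 44.1 kHz WAV file to the 16 kHz mono format most recognizers are trained on; the file names are placeholders.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

# Hypothetical input file; most ASR models expect 16 kHz mono audio.
rate, audio = wavfile.read("recording_44100hz.wav")
audio = audio.astype(np.float32)

if audio.ndim > 1:                 # downmix stereo to mono
    audio = audio.mean(axis=1)

target_rate = 16000
if rate != target_rate:
    # resample_poly resamples by the rational factor target_rate / rate.
    audio = resample_poly(audio, target_rate, rate)

# Write back as 16-bit PCM, clipping to the valid range.
pcm = np.clip(audio, -32768, 32767).astype(np.int16)
wavfile.write("recording_16khz.wav", target_rate, pcm)
```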

Context

You may notice that the benchmark error rate for humans is actually quite high. If a friend misheard one word out of every twenty, you could still keep the conversation going; a machine making the same mistakes cannot recover the way a person does.

The reason is that speech is interpreted in context: the same sentence is far easier to understand within a specific situation. Some obvious sources of context that humans use and machines typically do not:

  • The history of the conversation and the topic being discussed.
  • Visual cues of the speaker, including facial expressions and lip movements.
  • Prior knowledge about the person who is speaking.

Currently, the speech recognizer built into Android can pull your contacts’ names from your contact list in order to recognize them correctly [4]. The voice search in Google Maps uses your geographic location to narrow down which destination you are likely asking for [5].

Speech recognition becomes more accurate when these kinds of signals are incorporated, but contextual speech recognition is only just getting started.
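
A heavily simplified way to picture this kind of contextual biasing: when re-ranking the recognizer's n-best hypotheses, add a bonus to any hypothesis that contains a phrase from a context list such as the user's contacts. The names, scores, and bonus below are made up for illustration; production systems such as the one described in [4] work quite differently.

```python
def rescore_with_context(nbest, context_phrases, bonus=2.0):
    """Re-rank (hypothesis, score) pairs, boosting hypotheses that
    mention a phrase from the contextual list (e.g. contact names)."""
    rescored = []
    for text, score in nbest:
        lowered = text.lower()
        boost = sum(bonus for phrase in context_phrases if phrase.lower() in lowered)
        rescored.append((text, score + boost))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Hypothetical n-best list with combined acoustic/language-model scores.
nbest = [
    ("call aisha at noon", -12.1),
    ("call a shark at noon", -11.8),  # scores slightly better, but is semantically wrong
]
contacts = ["Aisha", "Rahul"]
print(rescore_with_context(nbest, contacts)[0][0])  # -> "call aisha at noon"
```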

In the next five years

There are still many open and challenging problems in speech recognition. These include:

  • Expanding capabilities to new domains, new accents, and far-field, low-SNR speech.
  • Incorporating more context into the recognition process.
  • Audio source separation.
  • Semantic error rates and innovative evaluation methods.
  • Ultra-low-latency, efficient inference.

I look forward to seeing these speech recognition problems tackled over the next five years.

Notes:

  • [1] Just ask anyone with a Scottish accent.
  • [2] These results are from Amodei et al., 2016. The accented speech comes from VoxForge. The noise-free and noisy speech comes from the third CHiME challenge.
  • [3] Stolcke and Droppo, 2017.
  • [4] See Aleksic et al., 2015 for an example of how to improve contact name recognition.
  • [5] See Chelba et al., 2015 for an example of how to incorporate speaker location.

Original article:
Speech Recognition Is Not Solved
