Introduction to the speech synthesis model WaveNet


This article introduces WaveNet, a deep generative model of raw audio waveforms. We show that WaveNets are able to generate speech that mimics any human voice and sounds more natural than the best existing text-to-speech systems, narrowing the gap with human-level performance by more than 50%.

We also demonstrate that the same network can be used to synthesize other audio signals, such as music, and present some compelling samples of automatically generated piano pieces.

Talking machines

Allowing people to talk to machines has long been a dream of human-computer interaction. In the past few years, the ability of computers to understand natural speech has been revolutionized by the use of deep neural networks (for example, Google Voice Search). However, generating speech with computers, a process often called speech synthesis or text-to-speech (TTS), is still largely based on so-called concatenative TTS, in which a very large database of short speech fragments is recorded from a single speaker and then recombined to form complete utterances. This makes it difficult to modify the voice (for example, switching to a different speaker, or changing the emphasis or emotion of the speech) without recording a whole new database.

This has led to great demand for parametric TTS, where all the information needed to generate the data is stored in the parameters of the model, and the content and characteristics of the speech can be controlled through the inputs to the model. So far, however, parametric TTS has tended to sound less natural than concatenative systems. Existing parametric models typically generate audio signals by passing their outputs through a signal processing algorithm called a vocoder.

WaveNet changes this paradigm by directly modeling the raw waveform of the audio signal, one sample at a time. As well as yielding more natural-sounding speech, using raw waveforms means that WaveNet can model any kind of audio, including music.
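To make "one sample at a time" concrete: in the WaveNet paper, the raw 16-bit audio is first companded with a mu-law transform and quantized to 256 values, so that each sample can be predicted with a 256-way softmax. The sketch below shows that transform; it assumes NumPy, and the function names are illustrative rather than taken from any released code.

```python
# A sketch of mu-law companding: it turns a raw waveform (values in [-1, 1])
# into 256 discrete classes so that each audio sample can be predicted with a
# softmax. Function names are illustrative.
import numpy as np

def mu_law_encode(audio, mu=255):
    # Non-linear companding: more resolution near zero, where most speech energy lies.
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Map [-1, 1] onto integer classes 0..mu (256 classes for mu = 255).
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(classes, mu=255):
    # Invert the quantization and the companding to recover an approximate waveform.
    companded = 2 * (classes.astype(np.float64) / mu) - 1
    return np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(mu)) / mu
```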

WaveNets

Researchers generally avoid modeling raw audio because it ticks so quickly: typically 16,000 samples per second or more, with important structure at many time scales. Building a fully autoregressive model, in which the prediction for each sample is influenced by all previous samples (in statistical terms, each predictive distribution is conditioned on all previous observations), is clearly a challenging task.
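Concretely, the WaveNet paper factorizes the joint probability of a waveform $\mathbf{x} = (x_1, \ldots, x_T)$ as a product of conditional probabilities, each conditioned on the samples at all previous timesteps:

$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$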

However, our PixelRNN and PixelCNN models, published earlier this year, showed that it is possible to generate complex natural images not only one pixel at a time, but one color channel at a time, requiring thousands of predictions per image. This inspired us to adapt our two-dimensional PixelNets to a one-dimensional WaveNet.

The animation above shows how a WaveNet is structured. It is a fully convolutional neural network, in which the convolutional layers have various dilation factors, allowing its receptive field to grow exponentially with depth and cover thousands of timesteps.
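As an illustration only (not DeepMind's implementation), a stack of dilated causal convolutions might be sketched as follows, assuming PyTorch. The channel count and dilation schedule are made up, and the gated activations and residual/skip connections of the real architecture are omitted to keep the idea of the exponentially growing receptive field visible:

```python
# A minimal sketch of stacked dilated causal convolutions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, kernel_size=2, dilations=(1, 2, 4, 8, 16, 32)):
        super().__init__()
        self.kernel_size = kernel_size
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=d) for d in dilations
        )

    def forward(self, x):  # x: (batch, channels, time)
        for conv in self.convs:
            # Left-pad so each output only depends on past samples (causality).
            pad = (self.kernel_size - 1) * conv.dilation[0]
            x = F.relu(conv(F.pad(x, (pad, 0))))
        return x

# Receptive field = 1 + sum((kernel_size - 1) * d for d in dilations)
#                 = 1 + 63 = 64 samples for this toy schedule; doubling the
# number of layers keeps doubling it, which is the exponential growth
# described above.
```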

At training time, the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances. At each step during sampling, a value is drawn from the probability distribution computed by the network. This value is then fed back into the input, and a new prediction is made for the next step. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio.
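The loop below is a minimal sketch of that sampling procedure, assuming PyTorch and a hypothetical `model` that maps a window of past quantized samples to a categorical distribution over the next one (in WaveNet itself this would be the dilated network above). All names and shapes are illustrative:

```python
# Step-by-step autoregressive sampling: predict a distribution, draw a value,
# feed it back in, repeat.
import torch

@torch.no_grad()
def generate(model, num_samples, receptive_field, num_classes=256):
    # Start from a buffer of "silence" (the mid-point of the quantized range).
    buffer = torch.full((1, receptive_field), num_classes // 2, dtype=torch.long)
    output = []
    for _ in range(num_samples):
        logits = model(buffer)                  # (1, num_classes) for the next step
        probs = torch.softmax(logits, dim=-1)
        sample = torch.multinomial(probs, 1)    # draw one value from the distribution
        output.append(sample.item())
        # Feed the new sample back in as input for the next prediction.
        buffer = torch.cat([buffer[:, 1:], sample], dim=1)
    return output
```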

Improving the state of the art

We trained WaveNet using some of Google’s TTS datasets so that we could evaluate its performance. The figure below shows the quality of WaveNets on a scale from 1 to 5, compared with Google’s current best TTS systems (parametric and concatenative) and with human speech, using Mean Opinion Scores (MOS). MOS is a standard measure for subjective sound quality tests, and was obtained in blind tests with human subjects (over 500 ratings on 100 test sentences). As we can see, WaveNets reduce the gap between the current state of the art and human-level performance by more than 50% for both US English and Mandarin Chinese.

Google’s current TTS systems are considered among the best in the world for both Chinese and English, so improving on both with a single model is a major achievement.

[Figure: Mean Opinion Scores (scale 1 to 5) for parametric TTS, concatenative TTS, WaveNet, and human speech, in US English and Mandarin Chinese.]

In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, and so on) and feeding it into WaveNet. This means the network’s predictions are conditioned not only on the previous audio samples, but also on the text we want it to say.
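One simple way to picture this conditioning (a sketch under our own assumptions, not the published implementation, which uses gated units and per-layer projections) is to upsample the frame-rate linguistic features to the audio sample rate and mix them into a convolutional layer through a 1x1 projection:

```python
# Illustrative local conditioning: linguistic features at frame rate are
# upsampled to the audio rate and added to the layer activations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedLayer(nn.Module):
    def __init__(self, channels, feature_dim, kernel_size=2, dilation=1):
        super().__init__()
        self.kernel_size = kernel_size
        self.dilation = dilation
        self.audio_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.cond_proj = nn.Conv1d(feature_dim, channels, kernel_size=1)

    def forward(self, x, features):
        # x: (batch, channels, time), features: (batch, feature_dim, frames)
        pad = (self.kernel_size - 1) * self.dilation
        h = self.audio_conv(F.pad(x, (pad, 0)))
        # Upsample the slower linguistic features to the audio sample rate.
        cond = F.interpolate(features, size=h.size(-1), mode="linear", align_corners=False)
        return torch.tanh(h + self.cond_proj(cond))
```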

If we train the network without the text sequences, it still produces speech, but now it has to make up what to say. As you can hear in the example below, this results in a kind of babbling, in which real words are interspersed with made-up word-like sounds:

Note that WaveNet also sometimes produces non-speech sounds, such as breathing and mouth movements; this reflects the greater flexibility of a raw-audio model.

As you can hear from these samples, a single WaveNet is able to learn the characteristics of many different voices, male and female. To make sure it knew which voice to use for any given utterance, we conditioned the network on the identity of the speaker. Interestingly, we found that training on many speakers made it better at modeling a single speaker than training on that speaker alone, suggesting a form of transfer learning.
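Speaker identity can be pictured as an extra, time-independent conditioning input. The sketch below (illustrative, assuming PyTorch) adds a learned per-speaker embedding to a layer’s activations, broadcast over the time axis; this is a simplified stand-in for the global conditioning described in the paper:

```python
# Illustrative global (speaker) conditioning via a learned embedding.
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    def __init__(self, num_speakers, channels):
        super().__init__()
        self.embedding = nn.Embedding(num_speakers, channels)

    def forward(self, h, speaker_id):
        # h: (batch, channels, time); speaker_id: (batch,)
        e = self.embedding(speaker_id)   # (batch, channels)
        return h + e.unsqueeze(-1)       # broadcast the same vector over all timesteps
```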

By changing the speaker identity, we can use WaveNet to say the same thing in different voices:

Similarly, we can provide additional inputs to the model, such as emotion or accent, to make the speech more diverse and interesting.

Making music

Since WaveNets can be used to model any audio signal, we thought it would be interesting to try to generate music as well. Unlike the TTS experiments, we did not condition the network on an input sequence telling it what to play (such as a musical score); instead, we simply let it generate whatever it wanted. When we trained it on a dataset of classical piano music, it produced fascinating samples like the following:

WaveNets open up many possibilities for TTS, music generation and audio modeling in general. The fact that directly generating each timestep with a deep neural network works at all on 16kHz audio is genuinely surprising, let alone that it can outperform state-of-the-art TTS systems. We are excited to see what we can do with them next.

See our paper for details.