Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Abstract: This article describes the structure of the deep neural network model.

This article is shared from the Huawei Cloud community post "What? Can't get the open-source speech synthesis code to run? Let me teach you how to run Tacotron2", author: White Horse Guo Pingchuan.

Tacotron-2

TTS paper: github.com/lifefeel/Sp…

A TensorFlow implementation of DeepMind's Tacotron-2, the deep neural network model structure described in the paper "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions". GitHub address: github.com/Rookie-Chen…

There are some other versions of Tacotron2 open source projects:

  • Github.com/Rayhane-mam…

  • Github.com/NVIDIA/taco…

This repository contains further improvements and experiments beyond the paper, so the paper_hparams.py file holds the exact hyperparameters needed to reproduce the paper's results without any additional features. The recommended hparams.py file, used by default, contains hyperparameters with extra additions that give better results in most cases. Feel free to modify the parameters as you wish; the differences are highlighted in the file.
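As a rough illustration of how the two hyperparameter files relate, the small sketch below prints the parameters whose values differ between them. It is an assumption, not part of the repository: it presumes both files expose a tf.contrib.training.HParams object named hparams, so check the actual attribute names before using it.

# Hypothetical comparison of the default and paper hyperparameters.
# Assumption: hparams.py and paper_hparams.py each expose an HParams object called `hparams`.
from hparams import hparams as default_hparams
from paper_hparams import hparams as paper_hparams

default_values = default_hparams.values()   # HParams.values() returns a plain dict
paper_values = paper_hparams.values()

for name in sorted(set(default_values) | set(paper_values)):
    if default_values.get(name) != paper_values.get(name):
        print(name, default_values.get(name), paper_values.get(name))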

Repository Structure

Step (0): Download the dataset. The examples here use LJSpeech, as well as en_US and en_UK (from M-AILABS).

Step (1): Preprocess your data. This produces the training_data folder.

Step (2): Train your Tacotron model. This produces the logs-Tacotron folder.

Step (3): Synthesize/evaluate the Tacotron model. This produces the tacotron_output folder.

Step (4): Train your WaveNet model. This produces the logs-Wavenet folder.

Step (5): Synthesize audio with the WaveNet model. This produces the wavenet_output folder.
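Taken together, after running all five steps the working directory would look roughly like the sketch below. This layout is inferred only from the folder names listed above; the actual contents depend on your dataset and hyperparameters.

Tacotron-2/
  training_data/      # produced by step (1)
  logs-Tacotron/      # produced by step (2)
  tacotron_output/    # produced by step (3)
  logs-Wavenet/       # produced by step (4)
  wavenet_output/     # produced by step (5)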

Note:

  • Steps (2), (3), and (4) can be done with a single run that covers both Tacotron and WaveNet (the Tacotron-2 model, step (*)).

  • The original repository's preprocessing only supports LJSpeech and LJSpeech-like datasets (such as the M-AILABS speech data)! If your datasets are stored differently, you need to write your own preprocessing script.

  • If the two models are trained at the same time, the model parameter structure will be different.

Pre-trained models and demos

You can view some of the key insights on model performance (during the pre-training phase) here.

The proposed framework

Figure 1: Tacotron-2 model structure diagram

The model described by the authors can be divided into two parts:

  • Spectrogram prediction network

  • Wavenet vocoder

For an in-depth exploration of the model architecture, training procedure, and preprocessing logic, see the author's wiki.
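To make the two-part split concrete, here is a minimal runnable sketch of the data flow at synthesis time. The two stage functions are placeholders standing in for the real networks (they are not the repository's API); only the shapes of the intermediate data are meant to be illustrative.

import numpy as np

# Conceptual sketch of the two-stage Tacotron-2 pipeline with placeholder stages.
def spectrogram_prediction_network(text):
    # Stand-in for Tacotron: characters -> mel spectrogram frames (n_frames x n_mels).
    return np.zeros((len(text) * 5, 80), dtype=np.float32)

def wavenet_vocoder(mel_frames):
    # Stand-in for WaveNet: mel frames -> raw waveform samples,
    # conditioned on the predicted mel spectrogram.
    hop_length = 275  # illustrative frames-to-samples ratio
    return np.zeros(mel_frames.shape[0] * hop_length, dtype=np.float32)

def text_to_speech(text):
    mel_frames = spectrogram_prediction_network(text)   # stage 1: text -> mel spectrogram
    waveform = wavenet_vocoder(mel_frames)               # stage 2: mel spectrogram -> audio
    return waveform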

How to start

Environment setup: First, you need to install Python 3 together with TensorFlow.

Next, you need to install some Linux dependencies to make sure the audio library works:

apt-get install -y libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg libav-tools

Finally, you can install the packages in requirements.txt. If you are an Anaconda user: (you can use pip3 instead of pip and python3 instead of python)

pip install -r requirements.txt

Docker: Alternatively, you can build a Docker image to make sure everything is set up automatically, and then work inside the Docker container.

The Dockerfile is inside the docker folder.

Docker images can be built with the following:

docker build -t tacotron-2_image docker/

The container can then be run with:

docker run -i --name new_container tacotron-2_image

The dataset

The repository tested the code above on the LJSpeech dataset, which has nearly 24 hours of labeled recordings from a single actress. (More information about the dataset is available in the README file included with the download.)

The repository is also running current tests on the new M-AILABS speech dataset, which contains more than 700 hours of speech (over 80 GB of data) in more than 10 languages.

Once you have downloaded the dataset, extract the archive and place the resulting folder inside the cloned repository.

Hparams setting

Before proceeding, you must pick the hyperparameters that best fit your needs. While it is possible to change the hyperparameters from the command line during preprocessing/training, I still recommend making the changes directly in the hparams.py file once and for all.
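For reference, a command-line override would typically look like the line below. The --hparams flag and the name=value syntax are assumptions based on common Tacotron implementations and are not confirmed by this article, so check the argument parser in train.py and preprocess.py before relying on them; parameter1 and parameter2 are placeholders.

python train.py --model='Tacotron' --hparams='parameter1=value1,parameter2=value2'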

To select the best FFT parameters, I made a Griffin_lim_synthesis_tool notebook that you can use to invert the actual extracted mel/linear spectrograms and judge how good or bad the preprocessing is. All the other options are well explained in hparams.py and have meaningful names, so you can try them out.
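If you prefer to sanity-check FFT parameters outside the notebook, a minimal stand-alone sketch with librosa is shown below. It assumes librosa >= 0.7 and soundfile are installed, and the FFT values used here are only illustrative; the values the project actually uses live in hparams.py.

# Minimal Griffin-Lim sanity check for candidate FFT parameters (illustrative values).
import librosa
import soundfile as sf

wav_path = "LJ001-0001.wav"  # any wav file from your dataset
sr = 22050
n_fft, hop_length, win_length, n_mels = 2048, 275, 1100, 80

y, _ = librosa.load(wav_path, sr=sr)

# Extract a mel spectrogram with the candidate parameters...
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
                                     win_length=win_length, n_mels=n_mels)

# ...then invert it back to audio (Griffin-Lim reconstructs the missing phase).
y_inv = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, win_length=win_length)

sf.write("griffin_lim_check.wav", y_inv, sr)
# Listen to griffin_lim_check.wav: heavy artifacts suggest the FFT parameters need tuning.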

AWAIT DOCUMENTATION ON HPARAMS SHORTLY!!

Preprocessing

Make sure you are in the Tacotron-2 folder before running the following steps

cd Tacotron-2

You can then start preprocessing with the following command:

python preprocess.py

The dataset can be selected using the --dataset argument. If you use the M-AILABS dataset, you need to provide the language, voice, reader, merge_books, and book arguments to fit your custom needs. The default is LJSpeech.

Example for M-AILABS:

python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=False --book='northandsouth'

Or, if you want to use all of a speaker's books:

python preprocess.py --dataset='M-AILABS' --language='en_US' --voice='female' --reader='mary_ann' --merge_books=True

It shouldn’t take more than a few minutes.

Training

Train the two models in sequence:

python train.py --model='Tacotron-2'

The feature prediction model Tacotron can also be trained separately:

python train.py --model='Tacotron'

A checkpoint is saved every 5000 steps and stored in the logs-Tacotron folder.
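If the training run also writes TensorFlow summaries under that folder (an assumption; most TF1 Tacotron implementations do, but check the log directory after a few steps), you can likely monitor progress with TensorBoard:

tensorboard --logdir=logs-Tacotron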

Naturally, training WaveNet alone is done as follows:

python train.py --model='WaveNet'

Logs will be stored inside the logs-Wavenet folder.

Note:

  • If no model argument is provided, training defaults to the Tacotron-2 model. (Its structure differs from the standalone Tacotron model.)

  • The training arguments can be found in train.py; there are many options to choose from.

  • WaveNet preprocessing may need to be run separately using the wavenet_preprocess.py script (see the example after this list).
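For example, running it on its own would look like this (an assumption; check wavenet_preprocess.py for the actual arguments it expects):

python wavenet_preprocess.py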

Synthesis

To synthesize audio in an end-to-end (text-to-audio) manner (both models run together):

python synthesize.py --model='Tacotron-2'

For the spectrogram prediction network, there are three types of mel spectrogram prediction results:

  • Evaluation (synthesis on custom sentences). This is what we usually use once we have a full end-to-end model.

    python synthesize.py --model='Tacotron'

  • Natural synthesis (let the model make predictions on its own by feeding the last decoder output into the next time step).

    python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=False

  • Ground-truth-aligned synthesis (default: the model is guided by the true labels in a teacher-forcing manner). This synthesis method is used when predicting the mel spectrograms that will be used to train WaveNet (it yields better results, as stated in the paper). A conceptual sketch of the difference between the two feedback modes follows this list.

    python synthesize.py --model='Tacotron' --mode='synthesis' --GTA=True
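To illustrate the difference between natural (free-running) synthesis and ground-truth-aligned (teacher-forced) synthesis at the decoder level, here is a small runnable sketch. decoder_step is a placeholder, not the repository's actual API; only the feedback logic is the point.

import numpy as np

def decoder_step(previous_frame):
    # Placeholder for the real Tacotron decoder step: previous mel frame -> next mel frame.
    return previous_frame * 0.9 + 0.1

def decode(n_frames, ground_truth=None, n_mels=80):
    frames = []
    previous = np.zeros(n_mels)
    for t in range(n_frames):
        current = decoder_step(previous)
        frames.append(current)
        if ground_truth is not None:
            # GTA / teacher forcing: feed the *true* frame to the next step,
            # keeping predictions aligned with the real spectrogram.
            previous = ground_truth[t]
        else:
            # Natural synthesis: feed the model's own prediction to the next step.
            previous = current
    return np.stack(frames)

mels_natural = decode(n_frames=100)                                    # free-running
mels_gta = decode(n_frames=100, ground_truth=np.random.rand(100, 80))  # teacher-forced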

Waveform synthesis using the previously synthesized mel spectrograms:

python synthesize.py --model='WaveNet'

Note:

  • If no model argument is provided, Tacotron-2 synthesis is used by default. (End-to-end TTS)

  • For the available synthesis arguments, refer to synthesize.py.

References and source code

  • Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

  • Original tacotron paper

  • Attention-Based Models for Speech Recognition

  • Wavenet: A generative model for raw audio

  • Fast Wavenet

  • r9y9/wavenet_vocoder

  • keithito/tacotron

For more practical AI technology content, you are welcome to visit the Huawei Cloud AI zone, which currently offers six free hands-on camps, including AI programming with Python.

Click follow to be the first to learn about Huawei Cloud's latest technologies.