1 Overview
Tacotron2 is an end-to-end speech synthesis framework proposed by Google Brain in 2017. From bottom to top, the model can be viewed as consisting of two parts:
- Spectrogram prediction network: an encoder-attention-decoder network that predicts a sequence of mel-spectrogram frames from the input character sequence
- Vocoder: a modified version of WaveNet that generates a time-domain waveform from the predicted sequence of mel-spectrogram frames
2 Encoder
The input to the Encoder is a batch of sentences whose basic unit is the character. For example:
- In English, "Hello world" is split into "H E L L O W O R L D" as input
- In Chinese, "ni hao shi jie" ("hello world") is first annotated with pinyin and then split into initials and finals, "n i h ao sh i j ie", or split directly letter by letter as in English, "n i h a o s h i j i e"
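As a concrete illustration of how a sentence becomes a character-ID sequence before entering the Encoder, here is a hypothetical snippet; the symbol table is made up for the example, while real front ends use a fixed symbol set (letters, pinyin initials/finals, punctuation, etc.):

```python
# Hypothetical illustration: map a sentence to a character-ID sequence.
text = "hello world"
symbols = sorted(set(text))                       # e.g. [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
symbol_to_id = {s: i for i, s in enumerate(symbols)}
char_ids = [symbol_to_id[ch] for ch in text]      # length: char_seq_length
print(char_ids)
```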
The Encoder pipeline is as follows (a code sketch follows the list):
- The dimension of the input data is
[batch_size, char_seq_length]
- A 512-dimensional Character Embedding maps each character to a 512-dimensional vector, so the output dimension is
[batch_size, char_seq_length, 512]
- Next come three one-dimensional convolutional layers, each with 512 kernels of size 5×1 (i.e., each kernel spans 5 characters). Each convolution is followed by BatchNorm, ReLU, and Dropout. The output dimension is
[batch_size, char_seq_length, 512]
(padding is used so that the sequence length stays unchanged after each convolution)
- The output above is fed into a single-layer BiLSTM with a hidden size of 256; since the LSTM is bidirectional, the final output dimension is
[batch_size, char_seq_length, 512]
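A minimal PyTorch sketch of the encoder stack just described; the character embedding is assumed to have already produced the [batch_size, char_seq_length, 512] input, and sequence packing for variable-length batches is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Sketch: 3 x (Conv1d -> BatchNorm -> ReLU -> Dropout) followed by a BiLSTM."""
    def __init__(self, embed_dim=512, kernel_size=5, n_convs=3, lstm_hidden=256, p_dropout=0.5):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(embed_dim, embed_dim, kernel_size,
                          padding=(kernel_size - 1) // 2),   # keep sequence length unchanged
                nn.BatchNorm1d(embed_dim))
            for _ in range(n_convs)])
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, batch_first=True, bidirectional=True)
        self.p_dropout = p_dropout

    def forward(self, x):
        # x: [batch_size, char_seq_length, 512] from the character embedding
        x = x.transpose(1, 2)                     # Conv1d expects [B, C, T]
        for conv in self.convs:
            x = F.dropout(F.relu(conv(x)), self.p_dropout, self.training)
        x = x.transpose(1, 2)                     # back to [B, T, 512]
        outputs, _ = self.lstm(x)                 # [B, T, 2 * 256] = [B, T, 512]
        return outputs
```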
3 Attention mechanism
The figure above depicts the input and output of the attention module at the first step. Here $y_0$ is the encoded representation of PreNet's initial input, and $c_0$ is the current "attention context". At the initial step, both $y_0$ and $c_0$ are initialized as all-zero vectors; they are then concatenated into a 768-dimensional vector $y_{0,c}$. This vector is fed into the LSTMCell together with attention_hidden and attention_cell (attention_hidden is simply the hidden_state of the LSTMCell, and attention_cell is its cell_state). The result is $h_1$ and an updated attention_cell. attention_cell has no separate name mainly because it is not used anywhere other than in attention_rnn.
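A minimal sketch of this first attention_rnn step; the 768 = 256 + 512 split follows the text above, while the 1024-dimensional LSTMCell hidden size is an assumption for illustration:

```python
import torch
import torch.nn as nn

batch_size = 4
prenet_dim, context_dim, attn_rnn_dim = 256, 512, 1024    # 256 + 512 = 768

attention_rnn = nn.LSTMCell(prenet_dim + context_dim, attn_rnn_dim)

# Initial step: PreNet output y_0 and attention context c_0 are all-zero vectors
y_0 = torch.zeros(batch_size, prenet_dim)
c_0 = torch.zeros(batch_size, context_dim)
attention_hidden = torch.zeros(batch_size, attn_rnn_dim)   # hidden_state of the LSTMCell
attention_cell = torch.zeros(batch_size, attn_rnn_dim)     # cell_state of the LSTMCell

cell_input = torch.cat((y_0, c_0), dim=-1)                 # the 768-dim vector y_{0,c}
attention_hidden, attention_cell = attention_rnn(
    cell_input, (attention_hidden, attention_cell))
# attention_hidden is h_1, which is passed on to the attention layer;
# attention_cell is only fed back into attention_rnn at the next step
```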
The Attention layer takes five inputs:
- $h_1$, a variable related to the mel spectrum (the attention_hidden computed above)
- $M$, the features extracted from the source character sequence by the Encoder
- $M'$, obtained from $M$ via a Linear layer
- attention_weights_cat, the result of concatenating the previous attention_weights with the cumulative attention_weights_cum
- mask, all False here, so it is essentially unused
The calculation details are as follows:
The core of the energy computation is the get_alignment_energies function, which incorporates location features internally; the attention is therefore a hybrid attention mechanism
The hybrid attention mechanism is a combination of the content-based attention mechanism (ordinary attention) and the location-based attention mechanism:
$s_{i-1}$ is the previous decoder hidden state, $\alpha_{i-1}$ are the previous attention weights, and $h_j$ is the $j$-th encoder hidden state. Adding a bias $b$, the final score function is computed as

$$e_{i,j} = v_a^{\top} \tanh(W s_{i-1} + V h_j + U f_{i,j} + b)$$

where $v_a$, $W$, $V$, $U$ and $b$ are parameters to be trained, and $f_{i,j}$ is the location feature obtained by convolving the previous attention weights $\alpha_{i-1}$ with $F$: $f_i = F * \alpha_{i-1}$
The attention mechanism in Tacotron2 is essentially this hybrid attention mechanism, with slight differences:
Here $s_i$ is the hidden state of the current decoder step rather than the previous one, the bias $b$ is initialized to 0, and the location feature $f_i$ is computed by convolving the cumulative attention weights $c\alpha_i$ with $F$:

$$e_{i,j} = v_a^{\top} \tanh(W s_i + V h_j + U f_{i,j} + b), \qquad f_i = F * c\alpha_i, \qquad c\alpha_i = \sum_{j=1}^{i-1} \alpha_j$$
The get_alignment_energies function is illustrated as follows:
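The original diagram is not reproduced here. Below is a minimal PyTorch sketch of such a hybrid attention layer; the structure follows the description above, while the specific hyperparameters (attention_dim = 128, 32 location filters, kernel size 31) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class LocationLayer(nn.Module):
    """Computes location features f_i by convolving the stacked attention weights."""
    def __init__(self, n_filters=32, kernel_size=31, attention_dim=128):
        super().__init__()
        self.conv = nn.Conv1d(2, n_filters, kernel_size,
                              padding=(kernel_size - 1) // 2, bias=False)   # F
        self.proj = nn.Linear(n_filters, attention_dim, bias=False)         # U

    def forward(self, attention_weights_cat):
        # attention_weights_cat: [B, 2, T] (previous and cumulative weights)
        f = self.conv(attention_weights_cat)          # [B, n_filters, T]
        return self.proj(f.transpose(1, 2))           # [B, T, attention_dim]

class Attention(nn.Module):
    def __init__(self, attention_rnn_dim=1024, embedding_dim=512, attention_dim=128):
        super().__init__()
        self.query_layer = nn.Linear(attention_rnn_dim, attention_dim, bias=False)  # W
        self.memory_layer = nn.Linear(embedding_dim, attention_dim, bias=False)     # V
        self.location_layer = LocationLayer(attention_dim=attention_dim)
        self.v = nn.Linear(attention_dim, 1, bias=False)                             # v_a

    def get_alignment_energies(self, query, processed_memory, attention_weights_cat):
        # query: h_1 [B, 1024]; processed_memory: M' [B, T, attention_dim]
        processed_query = self.query_layer(query.unsqueeze(1))           # [B, 1, attention_dim]
        processed_location = self.location_layer(attention_weights_cat)  # [B, T, attention_dim]
        energies = self.v(torch.tanh(
            processed_query + processed_location + processed_memory))    # [B, T, 1]
        return energies.squeeze(-1)                                       # [B, T]

    def forward(self, query, memory, processed_memory, attention_weights_cat, mask=None):
        energies = self.get_alignment_energies(query, processed_memory, attention_weights_cat)
        if mask is not None:
            energies = energies.masked_fill(mask, -float("inf"))
        attention_weights = torch.softmax(energies, dim=1)                # alpha_i
        attention_context = torch.bmm(
            attention_weights.unsqueeze(1), memory).squeeze(1)            # c_i: [B, 512]
        return attention_context, attention_weights
```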
4 Decoder
The decoder is an autoregressive structure that predicts the spectrogram from the encoded input sequence, one frame at a time:
- The spectrogram frame predicted in the previous step is first passed through a PreNet consisting of two fully connected layers. The PreNet acts as a bottleneck layer and is essential for learning attention
- The PreNet output is concatenated with the attention context vector and fed into a two-layer LSTM with 1024 units. The LSTM output is concatenated with the attention context vector again and passed through a linear projection to predict the target spectrogram frame
- Finally, the target spectrogram frame is processed by a 5-layer convolutional PostNet (post-processing network), whose output is added to the linear projection output (a residual connection) to form the final output
- In parallel, the LSTM output concatenated with the attention context vector is projected to a scalar and passed through a sigmoid activation to predict whether the output sequence is finished (a simplified sketch of a single decode step follows this list)
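The sketch below condenses one decode step of the structure just described; it folds the two-layer LSTM into a single LSTMCell and omits the attention update for brevity, and the module names and dimensions (80 mel channels, 256-dim PreNet, 512-dim context, 1024-dim LSTM) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One autoregressive decode step (simplified: single LSTMCell, attention omitted)."""
    def __init__(self, n_mel=80, prenet_dim=256, context_dim=512, rnn_dim=1024):
        super().__init__()
        self.prenet = nn.Sequential(                        # bottleneck on the previous frame
            nn.Linear(n_mel, prenet_dim), nn.ReLU(),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU())
        self.decoder_rnn = nn.LSTMCell(prenet_dim + context_dim, rnn_dim)
        self.linear_projection = nn.Linear(rnn_dim + context_dim, n_mel)
        self.gate_layer = nn.Linear(rnn_dim + context_dim, 1)

    def forward(self, prev_frame, attention_context, state):
        h, c = state
        x = self.prenet(prev_frame)
        h, c = self.decoder_rnn(torch.cat((x, attention_context), dim=-1), (h, c))
        hidden_and_context = torch.cat((h, attention_context), dim=-1)
        mel_frame = self.linear_projection(hidden_and_context)           # predicted frame
        stop_prob = torch.sigmoid(self.gate_layer(hidden_and_context))   # "finished?" scalar
        return mel_frame, stop_prob, (h, c)

# One step with zero-initialized inputs, as at the start of decoding
step = DecoderStep()
prev_frame = torch.zeros(4, 80)
context = torch.zeros(4, 512)
state = (torch.zeros(4, 1024), torch.zeros(4, 1024))
mel_frame, stop_prob, state = step(prev_frame, context, state)
```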
The diagram and code for the PreNet layer are shown below:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearNorm(torch.nn.Module):
    """Linear layer with Xavier-uniform initialization."""
    def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'):
        super(LinearNorm, self).__init__()
        self.linear_layer = torch.nn.Linear(in_dim, out_dim, bias=bias)
        torch.nn.init.xavier_uniform_(
            self.linear_layer.weight,
            gain=torch.nn.init.calculate_gain(w_init_gain))

    def forward(self, x):
        return self.linear_layer(x)


class Prenet(nn.Module):
    """Stack of fully connected layers, each followed by ReLU and dropout."""
    def __init__(self, in_dim, sizes):
        super(Prenet, self).__init__()
        in_sizes = [in_dim] + sizes[:-1]
        self.layers = nn.ModuleList(
            [LinearNorm(in_size, out_size, bias=False)
             for (in_size, out_size) in zip(in_sizes, sizes)])

    def forward(self, x):
        for linear in self.layers:
            # training=True keeps dropout active even at inference time
            x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
        return x
```
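A quick usage check of the Prenet above; the 80-channel mel frame and the [256, 256] layer sizes are assumed for illustration:

```python
prenet = Prenet(in_dim=80, sizes=[256, 256])
frame = torch.randn(4, 80)     # [batch_size, n_mel_channels]
out = prenet(frame)
print(out.shape)               # torch.Size([4, 256])
```

Note that `F.dropout` is called with `training=True`, so the PreNet dropout stays active even at inference time, which introduces variation into the generated output.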
The PostNet layer is illustrated as follows:
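The original illustration is not reproduced here; below is a minimal sketch of a 5-layer convolutional PostNet consistent with the description above. The 80 mel channels, 512 filters, kernel size 5, tanh on all but the last layer, and 0.5 dropout are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Postnet(nn.Module):
    """Sketch of the 5-layer convolutional post-processing network."""
    def __init__(self, n_mel=80, channels=512, kernel_size=5, n_convs=5, p_dropout=0.5):
        super().__init__()
        pad = (kernel_size - 1) // 2
        dims = [n_mel] + [channels] * (n_convs - 1) + [n_mel]
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dims[i], dims[i + 1], kernel_size, padding=pad),
                nn.BatchNorm1d(dims[i + 1]))
            for i in range(n_convs)])
        self.p_dropout = p_dropout

    def forward(self, x):
        # x: [batch_size, n_mel, mel_seq_length]
        for i, conv in enumerate(self.convs):
            x = conv(x)
            if i < len(self.convs) - 1:
                x = torch.tanh(x)                 # tanh on all but the last layer
            x = F.dropout(x, self.p_dropout, self.training)
        return x  # residual to be added to the linear-projection output
```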
The Decoder as a whole is composed of prenet, attention_rnn, attention_layer, decoder_rnn, linear_projection, and gate_layer