1 Overview
Tacotron2 is an end-to-end speech synthesis framework proposed by Google Brain in 2017. From bottom to top, the model can be viewed as consisting of two parts:
- Spectrogram prediction network: an encoder-attention-decoder network that predicts a sequence of mel-spectrogram frames from the input character sequence
- Vocoder: a modified version of WaveNet that generates a time-domain waveform from the predicted sequence of mel-spectrogram frames
2 Encoder
The input to the Encoder is a batch of sentences whose basic unit is the character. For example:
- In English, "Hello world" is split into "H E L L O W O R L D" as input
- In Chinese, "ni hao shi jie" ("hello world") is first annotated with pinyin and then split into initials and finals, "n i h ao sh i j ie", or split directly letter by letter as in English, "n i h a o s h i j i e"
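As a concrete illustration of how a sentence becomes a character-ID sequence before entering the Encoder, here is a hypothetical snippet; the symbol table is made up for the example, while real front ends use a fixed symbol set (letters, pinyin initials/finals, punctuation, etc.):

```python
# Hypothetical illustration: map a sentence to a character-ID sequence.
text = "hello world"
symbols = sorted(set(text))                       # e.g. [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
symbol_to_id = {s: i for i, s in enumerate(symbols)}
char_ids = [symbol_to_id[ch] for ch in text]      # length: char_seq_length
print(char_ids)
```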
The Encoder pipeline is as follows (a code sketch follows the list):
- The dimension of the input data is
[batch_size, char_seq_length]
- A 512-dimensional Character Embedding maps each character to a 512-dimensional vector, so the output dimension is
[batch_size, char_seq_length, 512]
- Next come three one-dimensional convolutional layers, each with 512 kernels of size 5×1 (i.e., each kernel spans 5 characters). Each convolution is followed by BatchNorm, ReLU, and Dropout. The output dimension is
[batch_size, char_seq_length, 512]
(padding is used so that the sequence length stays unchanged after each convolution)
- The output above is fed into a single-layer BiLSTM with a hidden size of 256; since the LSTM is bidirectional, the final output dimension is
[batch_size, char_seq_length, 512]
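A minimal PyTorch sketch of the encoder stack just described; the character embedding is assumed to have already produced the [batch_size, char_seq_length, 512] input, and sequence packing for variable-length batches is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Sketch: 3 x (Conv1d -> BatchNorm -> ReLU -> Dropout) followed by a BiLSTM."""
    def __init__(self, embed_dim=512, kernel_size=5, n_convs=3, lstm_hidden=256, p_dropout=0.5):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(embed_dim, embed_dim, kernel_size,
                          padding=(kernel_size - 1) // 2),   # keep sequence length unchanged
                nn.BatchNorm1d(embed_dim))
            for _ in range(n_convs)])
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, batch_first=True, bidirectional=True)
        self.p_dropout = p_dropout

    def forward(self, x):
        # x: [batch_size, char_seq_length, 512] from the character embedding
        x = x.transpose(1, 2)                     # Conv1d expects [B, C, T]
        for conv in self.convs:
            x = F.dropout(F.relu(conv(x)), self.p_dropout, self.training)
        x = x.transpose(1, 2)                     # back to [B, T, 512]
        outputs, _ = self.lstm(x)                 # [B, T, 2 * 256] = [B, T, 512]
        return outputs
```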
3 Attention mechanism
The figure above depicts the input and output of the attention module at the first step. Here $y_0$ is the encoded representation of PreNet's initial input, and $c_0$ is the current "attention context". At the initial step, both $y_0$ and $c_0$ are initialized as all-zero vectors; they are then concatenated into a 768-dimensional vector $y_{0,c}$. This vector is fed into the LSTMCell together with attention_hidden and attention_cell (attention_hidden is simply the hidden_state of the LSTMCell, and attention_cell is its cell_state). The result is $h_1$ and an updated attention_cell. attention_cell has no separate name mainly because it is not used anywhere other than in attention_rnn.
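A minimal sketch of this first attention_rnn step; the 768 = 256 + 512 split follows the text above, while the 1024-dimensional LSTMCell hidden size is an assumption for illustration:

```python
import torch
import torch.nn as nn

batch_size = 4
prenet_dim, context_dim, attn_rnn_dim = 256, 512, 1024    # 256 + 512 = 768

attention_rnn = nn.LSTMCell(prenet_dim + context_dim, attn_rnn_dim)

# Initial step: PreNet output y_0 and attention context c_0 are all-zero vectors
y_0 = torch.zeros(batch_size, prenet_dim)
c_0 = torch.zeros(batch_size, context_dim)
attention_hidden = torch.zeros(batch_size, attn_rnn_dim)   # hidden_state of the LSTMCell
attention_cell = torch.zeros(batch_size, attn_rnn_dim)     # cell_state of the LSTMCell

cell_input = torch.cat((y_0, c_0), dim=-1)                 # the 768-dim vector y_{0,c}
attention_hidden, attention_cell = attention_rnn(
    cell_input, (attention_hidden, attention_cell))
# attention_hidden is h_1, which is passed on to the attention layer;
# attention_cell is only fed back into attention_rnn at the next step
```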
The Attention layer takes five inputs:
- $h_1$, a variable related to the mel spectrum (the attention_hidden computed above)
- $M$, the features extracted from the source character sequence by the Encoder
- $M'$, obtained from $M$ via a Linear layer
- attention_weights_cat, the result of concatenating the previous attention_weights with the cumulative attention_weights_cum
- mask, all False here, so it is essentially unused
The calculation details are as follows:
The core of the energy computation is the get_alignment_energies function, which incorporates location features internally; the attention is therefore a hybrid attention mechanism
The hybrid attention mechanism is a combination of the content-based attention mechanism (ordinary attention) and the location-based attention mechanism:
$s_{i-1}$ is the previous decoder hidden state, $\alpha_{i-1}$ are the previous attention weights, and $h_j$ is the $j$-th encoder hidden state. Adding a bias $b$, the final score function is computed as

$$e_{i,j} = v_a^{\top} \tanh(W s_{i-1} + V h_j + U f_{i,j} + b)$$

where $v_a$, $W$, $V$, $U$ and $b$ are parameters to be trained, and $f_{i,j}$ is the location feature obtained by convolving the previous attention weights $\alpha_{i-1}$ with $F$: $f_i = F * \alpha_{i-1}$
The attention mechanism in Tacotron2 is essentially this hybrid attention mechanism, with slight differences:
Here $s_i$ is the hidden state of the current decoder step rather than the previous one, the bias $b$ is initialized to 0, and the location feature $f_i$ is computed by convolving the cumulative attention weights $c\alpha_i$ with $F$:

$$e_{i,j} = v_a^{\top} \tanh(W s_i + V h_j + U f_{i,j} + b), \qquad f_i = F * c\alpha_i, \qquad c\alpha_i = \sum_{j=1}^{i-1} \alpha_j$$
The get_alignment_energies function is illustrated as follows:
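The original diagram is not reproduced here. Below is a minimal PyTorch sketch of such a hybrid attention layer; the structure follows the description above, while the specific hyperparameters (attention_dim = 128, 32 location filters, kernel size 31) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class LocationLayer(nn.Module):
    """Computes location features f_i by convolving the stacked attention weights."""
    def __init__(self, n_filters=32, kernel_size=31, attention_dim=128):
        super().__init__()
        self.conv = nn.Conv1d(2, n_filters, kernel_size,
                              padding=(kernel_size - 1) // 2, bias=False)   # F
        self.proj = nn.Linear(n_filters, attention_dim, bias=False)         # U

    def forward(self, attention_weights_cat):
        # attention_weights_cat: [B, 2, T] (previous and cumulative weights)
        f = self.conv(attention_weights_cat)          # [B, n_filters, T]
        return self.proj(f.transpose(1, 2))           # [B, T, attention_dim]

class Attention(nn.Module):
    def __init__(self, attention_rnn_dim=1024, embedding_dim=512, attention_dim=128):
        super().__init__()
        self.query_layer = nn.Linear(attention_rnn_dim, attention_dim, bias=False)  # W
        self.memory_layer = nn.Linear(embedding_dim, attention_dim, bias=False)     # V
        self.location_layer = LocationLayer(attention_dim=attention_dim)
        self.v = nn.Linear(attention_dim, 1, bias=False)                             # v_a

    def get_alignment_energies(self, query, processed_memory, attention_weights_cat):
        # query: h_1 [B, 1024]; processed_memory: M' [B, T, attention_dim]
        processed_query = self.query_layer(query.unsqueeze(1))           # [B, 1, attention_dim]
        processed_location = self.location_layer(attention_weights_cat)  # [B, T, attention_dim]
        energies = self.v(torch.tanh(
            processed_query + processed_location + processed_memory))    # [B, T, 1]
        return energies.squeeze(-1)                                       # [B, T]

    def forward(self, query, memory, processed_memory, attention_weights_cat, mask=None):
        energies = self.get_alignment_energies(query, processed_memory, attention_weights_cat)
        if mask is not None:
            energies = energies.masked_fill(mask, -float("inf"))
        attention_weights = torch.softmax(energies, dim=1)                # alpha_i
        attention_context = torch.bmm(
            attention_weights.unsqueeze(1), memory).squeeze(1)            # c_i: [B, 512]
        return attention_context, attention_weights
```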
4 Decoder
The decoder is an autoregressive structure that predicts the spectrogram from the encoded input sequence, one frame at a time:
- The spectrogram frame predicted in the previous step is first passed through a PreNet consisting of two fully connected layers. The PreNet acts as a bottleneck layer and is essential for learning attention
- The PreNet output is concatenated with the attention context vector and fed into a two-layer LSTM with 1024 units. The LSTM output is concatenated with the attention context vector again and passed through a linear projection to predict the target spectrogram frame
- Finally, the target spectrogram frame is processed by a 5-layer convolutional PostNet (post-processing network), whose output is added to the linear projection output (a residual connection) to form the final output
- In parallel, the LSTM output concatenated with the attention context vector is projected to a scalar and passed through a sigmoid activation to predict whether the output sequence is finished (a simplified sketch of a single decode step follows this list)
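The sketch below condenses one decode step of the structure just described; it folds the two-layer LSTM into a single LSTMCell and omits the attention update for brevity, and the module names and dimensions (80 mel channels, 256-dim PreNet, 512-dim context, 1024-dim LSTM) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One autoregressive decode step (simplified: single LSTMCell, attention omitted)."""
    def __init__(self, n_mel=80, prenet_dim=256, context_dim=512, rnn_dim=1024):
        super().__init__()
        self.prenet = nn.Sequential(                        # bottleneck on the previous frame
            nn.Linear(n_mel, prenet_dim), nn.ReLU(),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU())
        self.decoder_rnn = nn.LSTMCell(prenet_dim + context_dim, rnn_dim)
        self.linear_projection = nn.Linear(rnn_dim + context_dim, n_mel)
        self.gate_layer = nn.Linear(rnn_dim + context_dim, 1)

    def forward(self, prev_frame, attention_context, state):
        h, c = state
        x = self.prenet(prev_frame)
        h, c = self.decoder_rnn(torch.cat((x, attention_context), dim=-1), (h, c))
        hidden_and_context = torch.cat((h, attention_context), dim=-1)
        mel_frame = self.linear_projection(hidden_and_context)           # predicted frame
        stop_prob = torch.sigmoid(self.gate_layer(hidden_and_context))   # "finished?" scalar
        return mel_frame, stop_prob, (h, c)

# One step with zero-initialized inputs, as at the start of decoding
step = DecoderStep()
prev_frame = torch.zeros(4, 80)
context = torch.zeros(4, 512)
state = (torch.zeros(4, 1024), torch.zeros(4, 1024))
mel_frame, stop_prob, state = step(prev_frame, context, state)
```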
The diagram and code for the PreNet layer are shown below:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearNorm(torch.nn.Module):
    """Linear layer with Xavier-uniform initialization."""
    def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'):
        super(LinearNorm, self).__init__()
        self.linear_layer = torch.nn.Linear(in_dim, out_dim, bias=bias)
        torch.nn.init.xavier_uniform_(
            self.linear_layer.weight,
            gain=torch.nn.init.calculate_gain(w_init_gain))

    def forward(self, x):
        return self.linear_layer(x)


class Prenet(nn.Module):
    """Stack of fully connected layers, each followed by ReLU and dropout."""
    def __init__(self, in_dim, sizes):
        super(Prenet, self).__init__()
        in_sizes = [in_dim] + sizes[:-1]
        self.layers = nn.ModuleList(
            [LinearNorm(in_size, out_size, bias=False)
             for (in_size, out_size) in zip(in_sizes, sizes)])

    def forward(self, x):
        for linear in self.layers:
            # training=True keeps dropout active even at inference time
            x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
        return x
```
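A quick usage check of the Prenet above; the 80-channel mel frame and the [256, 256] layer sizes are assumed for illustration:

```python
prenet = Prenet(in_dim=80, sizes=[256, 256])
frame = torch.randn(4, 80)     # [batch_size, n_mel_channels]
out = prenet(frame)
print(out.shape)               # torch.Size([4, 256])
```

Note that `F.dropout` is called with `training=True`, so the PreNet dropout stays active even at inference time, which introduces variation into the generated output.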
The PostNet layer is illustrated as follows:
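The original illustration is not reproduced here; below is a minimal sketch of a 5-layer convolutional PostNet consistent with the description above. The 80 mel channels, 512 filters, kernel size 5, tanh on all but the last layer, and 0.5 dropout are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Postnet(nn.Module):
    """Sketch of the 5-layer convolutional post-processing network."""
    def __init__(self, n_mel=80, channels=512, kernel_size=5, n_convs=5, p_dropout=0.5):
        super().__init__()
        pad = (kernel_size - 1) // 2
        dims = [n_mel] + [channels] * (n_convs - 1) + [n_mel]
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dims[i], dims[i + 1], kernel_size, padding=pad),
                nn.BatchNorm1d(dims[i + 1]))
            for i in range(n_convs)])
        self.p_dropout = p_dropout

    def forward(self, x):
        # x: [batch_size, n_mel, mel_seq_length]
        for i, conv in enumerate(self.convs):
            x = conv(x)
            if i < len(self.convs) - 1:
                x = torch.tanh(x)                 # tanh on all but the last layer
            x = F.dropout(x, self.p_dropout, self.training)
        return x  # residual to be added to the linear-projection output
```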
The Decoder as a whole is composed of prenet, attention_rnn, attention_layer, decoder_rnn, linear_projection, and gate_layer