Bilibili video explanation

This post mainly describes how to reproduce Seq2Seq (with Attention) in PyTorch for a simple machine translation task. Please read the paper Neural Machine Translation by Jointly Learning to Align and Translate first, then spend fifteen minutes reading my two articles on Seq2Seq and the Attention mechanism, and finally come back to this post; that way everything will click and you will get twice the result with half the effort.

Data preprocessing

The data-preprocessing code is mostly just calls to various APIs, and I don't want readers to get distracted by the less important parts, so I won't post that code here; I'll just describe it in words.

As shown in the figure below, this post uses a German-English dataset. The input is German, and each input sentence starts and ends with a special identifier. The output is English, and each output sentence also starts and ends with a special identifier.

Whether English or German, sentence lengths are not fixed, so I make every sentence in a batch the same length by adding padding tokens. In other words, sentences within one batch all have the same length, while sentences in different batches may have different lengths. The resulting dimensions are [seq_len, batch_size].
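To make the padding concrete, here is a tiny illustration (not the actual preprocessing code; the token indices and the padding index are made up) of how two sentences of different lengths end up in a single [seq_len, batch_size] tensor:

import torch

PAD_IDX = 1                                        # assumed index of the padding token
sentences = [[2, 15, 27, 3], [2, 8, 3]]            # already numericalized, with <sos> ... <eos>
max_len = max(len(s) for s in sentences)
padded = [s + [PAD_IDX] * (max_len - len(s)) for s in sentences]
batch = torch.tensor(padded).t()                   # transpose to [seq_len, batch_size]
print(batch.shape)                                 # torch.Size([4, 2])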

Print a random batch of data to see how it is packaged

During data preprocessing, we need to build separate dictionaries for the source and target sentences, i.e., one vocabulary for German and one for English
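Since the real preprocessing code is not posted here, the following is only a hedged sketch of what it might look like with the legacy torchtext Field API, assuming the Multi30k German-English dataset and spaCy tokenizers (the dataset, tokenizer names, min_freq and batch size are my assumptions, not necessarily what the original code used; SRC and TRG match the variable names used later in this post):

import torch
import spacy
from torchtext.legacy.data import Field, BucketIterator   # in older torchtext versions: torchtext.data
from torchtext.legacy.datasets import Multi30k             # in older versions: torchtext.datasets

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

# init_token / eos_token add the special identifiers at the beginning and end of every sentence
SRC = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True)
TRG = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)

train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))

# one vocabulary for German, one for English
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

# BucketIterator pads each batch, so every batch comes out as [seq_len, batch_size]
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=128, device=device)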

Encoder

For the Encoder I use a single-layer bidirectional GRU

The hidden state output by the bidirectional GRU consists of two vectors, for example $h_1=[\overrightarrow{h_1};\overleftarrow{h_T}]$, $h_2=[\overrightarrow{h_2};\overleftarrow{h_{T-1}}]$, ... The hidden states of the last layer at all time steps make up the GRU's output


$$output=\{h_1,h_2,\dots,h_T\}$$

Assuming this is an $m$-layer GRU, the hidden states of all layers at the last time step make up the GRU's final hidden state


$$hidden=\{h^1_T,h^2_T,\dots,h^m_T\}$$

where


$$h^i_T=[\overrightarrow{h^i_T};\overleftarrow{h^i_1}]$$

so


$$hidden=\{[\overrightarrow{h^1_T};\overleftarrow{h^1_1}],[\overrightarrow{h^2_T};\overleftarrow{h^2_1}],\dots,[\overrightarrow{h^m_T};\overleftarrow{h^m_1}]\}$$

According to the paper (or my illustrated Attention article, if you have read it), what we need is the last layer's hidden output (both forward and backward), so we can pull out the last layer's hidden states with hidden[-2,:,:] and hidden[-1,:,:], concatenate them, and call the result $s_0$.

One last detail: the dimension of $s_0$ is [batch_size, en_hid_dim*2]. Even without the Attention mechanism, using $s_0$ directly as the Decoder's initial hidden state would be wrong, because the dimensions do not match: the Decoder's initial hidden state has to be 3-D, while our $s_0$ is 2-D. So we need to convert $s_0$ to 3-D and adjust the size of each dimension. First, I map $s_0$ to [batch_size, dec_hid_dim] with a fully connected layer (the extra dimension is added later, inside the Decoder).
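Before the full Encoder code, here is a quick sanity check (illustrative only, with made-up sizes) of how PyTorch lays out the hidden state of a multi-layer bidirectional GRU; it confirms that hidden[-2] and hidden[-1] are exactly the last layer's final forward and backward states:

import torch
import torch.nn as nn

src_len, batch_size, emb_dim, hid_dim, n_layers = 7, 4, 8, 16, 2
rnn = nn.GRU(emb_dim, hid_dim, num_layers=n_layers, bidirectional=True)
x = torch.randn(src_len, batch_size, emb_dim)
output, hidden = rnn(x)

print(output.shape)  # torch.Size([7, 4, 32]) = [src_len, batch_size, hid_dim * 2]
print(hidden.shape)  # torch.Size([4, 4, 16]) = [n_layers * 2, batch_size, hid_dim]

# last layer, forward direction: its final state is the forward half of output[-1]
print(torch.allclose(hidden[-2], output[-1, :, :hid_dim]))  # True
# last layer, backward direction: it finishes at time step 0, the backward half of output[0]
print(torch.allclose(hidden[-1], output[0, :, hid_dim:]))   # True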

That covers all the details of the Encoder, so let's go straight to the code. My code style is: comments above, code below.

# imports used by all the code blocks in this post
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src): 
        ''' src = [src_len, batch_size] '''
        src = src.transpose(0, 1) # src = [batch_size, src_len]
        embedded = self.dropout(self.embedding(src)).transpose(0, 1) # embedded = [src_len, batch_size, emb_dim]
        
        # enc_output = [src_len, batch_size, hid_dim * num_directions]
        # enc_hidden = [n_layers * num_directions, batch_size, hid_dim]
        enc_output, enc_hidden = self.rnn(embedded) # if h_0 is not given, it defaults to all zeros

        # enc_hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        # enc_output is always from the last layer
        
        # enc_hidden[-2, :, :] is the last hidden state of the forward RNN
        # enc_hidden[-1, :, :] is the last hidden state of the backward RNN
        
        # the initial decoder hidden state is the final hidden state of the forward and backward
        # encoder RNNs fed through a linear layer
        # s = [batch_size, dec_hid_dim]
        s = torch.tanh(self.fc(torch.cat((enc_hidden[-2,:,:], enc_hidden[-1,:,:]), dim = 1)))
        
        return enc_output, s
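A quick shape check of the Encoder above, using small made-up sizes (this is just an illustration, not part of the model):

import torch

enc = Encoder(input_dim=100, emb_dim=32, enc_hid_dim=64, dec_hid_dim=64, dropout=0.5)
src = torch.randint(0, 100, (10, 4))  # [src_len=10, batch_size=4]
enc_output, s = enc(src)
print(enc_output.shape)  # torch.Size([10, 4, 128]) = [src_len, batch_size, enc_hid_dim*2]
print(s.shape)           # torch.Size([4, 64])      = [batch_size, dec_hid_dim]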

Attention

Attention is just three formulas


$$E_t=\tanh(attn(s_{t-1},H))$$
$$\tilde{a_t}=vE_t$$
$$a_t=\mathrm{softmax}(\tilde{a_t})$$

$s_{t-1}$ refers to the variable s in the Encoder, $H$ refers to the variable enc_output in the Encoder, and $attn()$ is actually just a simple fully connected neural network

We can work backwards from the last formula to figure out what dimension each variable has, or rather what dimension it needs to have

$\tilde{a_t}$ should be [batch_size, src_len]. Alternatively, $\tilde{a_t}$ can be three-dimensional with one dimension equal to 1, which squeeze() can turn into two dimensions; in that case $\tilde{a_t}$ is [batch_size, src_len, 1]

Going one step up, the dimension of $v$ should be [?, 1], where ? means I don't yet know what it should be. The dimension of $E_t$ should be [batch_size, src_len, ?]

We already know that $H$ has dimension [batch_size, src_len, enc_hid_dim*2] and that $s_{t-1}$ is currently [batch_size, dec_hid_dim]. These two variables need to be concatenated and fed into the fully connected network, so we first need to expand $s_{t-1}$ to [batch_size, src_len, dec_hid_dim]; after concatenation the input is [batch_size, src_len, enc_hid_dim*2+dec_hid_dim]

attn = nn.Linear(enc_hid_dim*2+dec_hid_dim, ?)

That's it. Apart from the ? value, all the other dimensions have been derived. Now let's go back and think about what ? should be. There really is no constraint on it, so we can set ? to any value (in the code I set ? to dec_hid_dim)

So that’s all the Attention details. Here’s the code

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim, bias=False)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, s, enc_output):
        
        # s = [batch_size, dec_hid_dim]
        # enc_output = [src_len, batch_size, enc_hid_dim * 2]
        
        batch_size = enc_output.shape[1]
        src_len = enc_output.shape[0]
        
        # repeat decoder hidden state src_len times
        # s = [batch_size, src_len, dec_hid_dim]
        # enc_output = [batch_size, src_len, enc_hid_dim * 2]
        s = s.unsqueeze(1).repeat(1, src_len, 1)
        enc_output = enc_output.transpose(0, 1)
        
        # energy = [batch_size, src_len, dec_hid_dim]
        energy = torch.tanh(self.attn(torch.cat((s, enc_output), dim = 2)))
        
        # attention = [batch_size, src_len]
        attention = self.v(energy).squeeze(2)
        
        return F.softmax(attention, dim=1)
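Again a small illustrative check with made-up sizes: the attention weights come out as [batch_size, src_len], and each row sums to 1 over the source positions:

import torch

attn_layer = Attention(enc_hid_dim=64, dec_hid_dim=64)
s = torch.randn(4, 64)                # [batch_size, dec_hid_dim]
enc_output = torch.randn(10, 4, 128)  # [src_len, batch_size, enc_hid_dim*2]
a = attn_layer(s, enc_output)
print(a.shape)       # torch.Size([4, 10])
print(a.sum(dim=1))  # every element is (approximately) 1.0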

Seq2Seq(with Attention)

I will change the order here: first the Seq2Seq class, then the Decoder part

A traditional Seq2Seq implementation simply feeds each word of the sentence into the Decoder in sequence during training. After introducing the Attention mechanism, however, I need to control the word-by-word input manually (because each word fed into the Decoder requires some extra operations), so in the code you can see that I use a for loop that runs trg_len-1 times (the first <SOS> is fed in manually, so we loop one time fewer)

In addition, I use a mechanism called Teacher Forcing during training to keep training fast and improve robustness. If you are not familiar with Teacher Forcing, you can read this article
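To make the idea concrete, here is the Teacher Forcing decision in isolation (a toy sketch with made-up sizes; the same logic appears inside the Seq2Seq loop below):

import random
import torch

teacher_forcing_ratio = 0.5
dec_output = torch.randn(4, 1000)     # fake predictions: [batch_size, trg_vocab_size]
trg_t = torch.randint(0, 1000, (4,))  # ground-truth tokens at step t

teacher_force = random.random() < teacher_forcing_ratio
dec_input = trg_t if teacher_force else dec_output.argmax(1)  # next input fed to the Decoder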

Think about what needs to happen inside the for loop. First, I need to pass variables to the Decoder. Since the Attention computation is carried out inside the Decoder, I need to pass the three variables dec_input, s and enc_output into it. The Decoder returns dec_output and a new s. Then, with some probability, Teacher Forcing determines whether the next dec_input is the ground-truth token or the prediction taken from dec_output

So much for the details of Seq2Seq, here’s the code

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        # src = [src_len, batch_size]
        # trg = [trg_len, batch_size]
        # teacher_forcing_ratio is probability to use teacher forcing
        
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        # enc_output is all hidden states of the input sequence, back and forwards
        # s is the final forward and backward hidden states, passed through a linear layer
        enc_output, s = self.encoder(src)
                
        # first input to the decoder is the <sos> tokens
        dec_input = trg[0, :]
        for t in range(1, trg_len):
            
            # insert dec_input token embedding, previous hidden state and all encoder hidden states
            # receive output tensor (predictions) and new hidden state
            dec_output, s = self.decoder(dec_input, s, enc_output)
            
            # place predictions in a tensor holding predictions for each token
            outputs[t] = dec_output
            
            # decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            # get the highest predicted token from our predictions
            top1 = dec_output.argmax(1) 
            
            # if teacher forcing, use actual next token as next input
            # if not, use predicted token
            dec_input = trg[t] if teacher_force else top1

        return outputs

Decoder

For the Decoder I use a unidirectional single-layer GRU

The Decoder part is actually three formulas


$$c=a_tH$$
$$s_t=\mathrm{GRU}(emb(y_t),c,s_{t-1})$$
$$\hat{y_t}=f(emb(y_t),c,s_t)$$

$H$ refers to the variable enc_output in the Encoder, $emb(y_t)$ is the result of passing dec_input through the word embedding, and $f()$ just converts dimensions, because the required output size is TRG_VOCAB_SIZE. One detail: the GRU takes only two arguments, an input and a hidden state, but the formula above has three variables, so we have to pick one of them as the hidden-state input and "integrate" the other two into the input

Let's work out the dimensions of the variables, starting from the first formula

The weight $a_t$ is [batch_size, src_len], and $H$ is [src_len, batch_size, enc_hid_dim*2]. The two need to be multiplied while keeping the batch_size dimension, so we first add a dimension to $a_t$, then swap the first two dimensions of $H$, and then perform a batched matrix multiplication (i.e., matrix multiplication within each sample of the batch).

a = a.unsqueeze(1) # [batch_size, 1, src_len]
H = H.transpose(0, 1) # [batch_size, src_len, enc_hid_dim*2]
c = torch.bmm(a, H) # [batch_size, 1, enc_hid_dim*2]

As mentioned earlier, since the GRU does not take three variables, we need to integrate $emb(y_t)$ and $c$. $y_t$ is actually the dec_input variable in Seq2Seq, and its dimension is [batch_size], so we first extend $y_t$ by one dimension and then pass it through the word embedding so that it becomes [batch_size, 1, emb_dim]. Finally, we concatenate $c$ and $emb(y_t)$

y = y.unsqueeze(1) # [batch_size, 1]
emb_y = self.emb(y) # [batch_size, 1, emb_dim]
rnn_input = torch.cat((emb_y, c), dim=2) # [batch_size, 1, emb_dim+enc_hid_dim*2]

The dimension of $s_{t-1}$ is [batch_size, dec_hid_dim], so it also needs to be extended by one dimension first (as the GRU's hidden-state input it has to be [1, batch_size, dec_hid_dim])

rnn_input = rnn_input.transpose(0, 1) # [1, batch_size, emb_dim+enc_hid_dim*2]
s = s.unsqueeze(0) # [1, batch_size, dec_hid_dim]

# dec_output = [1, batch_size, dec_hid_dim]
# dec_hidden = [1, batch_size, dec_hid_dim] = the new s (not the previous s)
dec_output, dec_hidden = self.rnn(rnn_input, s)

The final formula concatenates all three variables and passes them through a fully connected network to obtain the final prediction. $emb(y_t)$ is [batch_size, 1, emb_dim], $c$ is [batch_size, 1, enc_hid_dim*2], and $s_t$ is [1, batch_size, dec_hid_dim], so we squeeze out the extra dimensions and join them all together as follows

emb_y = emb_y.squeeze(1) # [batch_size, emb_dim]
c = c.squeeze(1) # [batch_size, enc_hid_dim*2]
s = s.squeeze(0) # [batch_size, dec_hid_dim]

fc_input = torch.cat((emb_y, c, s), dim=1) # [batch_size, enc_hid_dim*2+dec_hid_dim+emb_dim]

That covers the Decoder in detail; the code is shown below (the snippets above were just sample code, so the variable names may differ from the ones below).

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, dec_input, s, enc_output):
             
        # dec_input = [batch_size]
        # s = [batch_size, dec_hid_dim]
        # enc_output = [src_len, batch_size, enc_hid_dim * 2]
        
        dec_input = dec_input.unsqueeze(1) # dec_input = [batch_size, 1]
        
        embedded = self.dropout(self.embedding(dec_input)).transpose(0, 1) # embedded = [1, batch_size, emb_dim]
        
        # a = [batch_size, 1, src_len]  
        a = self.attention(s, enc_output).unsqueeze(1)
        
        # enc_output = [batch_size, src_len, enc_hid_dim * 2]
        enc_output = enc_output.transpose(0, 1)

        # c = [1, batch_size, enc_hid_dim * 2]
        c = torch.bmm(a, enc_output).transpose(0, 1)

        # rnn_input = [1, batch_size, (enc_hid_dim * 2) + emb_dim]
        rnn_input = torch.cat((embedded, c), dim = 2)
            
        # dec_output = [src_len(=1), batch_size, dec_hid_dim]
        # dec_hidden = [n_layers * num_directions, batch_size, dec_hid_dim]
        dec_output, dec_hidden = self.rnn(rnn_input, s.unsqueeze(0))
        
        # embedded = [batch_size, emb_dim]
        # dec_output = [batch_size, dec_hid_dim]
        # c = [batch_size, enc_hid_dim * 2]
        embedded = embedded.squeeze(0)
        dec_output = dec_output.squeeze(0)
        c = c.squeeze(0)
        
        # pred = [batch_size, output_dim]
        pred = self.fc_out(torch.cat((dec_output, c, embedded), dim = 1))
        
        return pred, dec_hidden.squeeze(0)
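One more illustrative shape check, this time for a single Decoder step (the sizes are made up and match the Encoder/Attention examples above):

import torch

attn_layer = Attention(enc_hid_dim=64, dec_hid_dim=64)
dec = Decoder(output_dim=1000, emb_dim=32, enc_hid_dim=64, dec_hid_dim=64, dropout=0.5, attention=attn_layer)
dec_input = torch.randint(0, 1000, (4,))  # [batch_size]
s = torch.randn(4, 64)                    # [batch_size, dec_hid_dim]
enc_output = torch.randn(10, 4, 128)      # [src_len, batch_size, enc_hid_dim*2]
pred, s_new = dec(dec_input, s, enc_output)
print(pred.shape)   # torch.Size([4, 1000]) = [batch_size, output_dim]
print(s_new.shape)  # torch.Size([4, 64])   = [batch_size, dec_hid_dim]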

Define the model

INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

One small detail: I pass ignore_index=TRG_PAD_IDX to the loss function so that padding positions in the target do not contribute to the loss. Here is a simple example: the true labels are all 1, every prediction assigns its highest score to class 2 (class indices start from 0), and the loss function is told to ignore class 1, which happens to be the label at every position, so the printed loss is 0

label = torch.tensor([1, 1, 1])
pred = torch.tensor([[0.1, 0.2, 0.6], [0.2, 0.1, 0.8], [0.1, 0.1, 0.9]])
loss_fn = nn.CrossEntropyLoss(ignore_index=1)
print(loss_fn(pred, label).item()) # 0

If the loss function is instead told to ignore class 2, the loss will no longer be 0

label = torch.tensor([1, 1, 1])
pred = torch.tensor([[0.1, 0.2, 0.6], [0.2, 0.1, 0.8], [0.1, 0.1, 0.9]])
loss_fn = nn.CrossEntropyLoss(ignore_index=2)
print(loss_fn(pred, label).item()) # 1.359844
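The training loop itself is not shown in this post, so here is only a minimal hedged sketch of what it might look like, assuming train_iterator comes from the preprocessing step and yields batches whose .src and .trg tensors have shape [seq_len, batch_size] (the gradient clipping value is my own choice, not necessarily the original setting):

def train(model, iterator, optimizer, criterion, clip=1.0):
    model.train()
    epoch_loss = 0
    for batch in iterator:
        src, trg = batch.src, batch.trg
        optimizer.zero_grad()
        output = model(src, trg)                    # [trg_len, batch_size, trg_vocab_size]
        output_dim = output.shape[-1]
        output = output[1:].reshape(-1, output_dim) # drop the <sos> position
        trg = trg[1:].reshape(-1)
        loss = criterion(output, trg)               # padded positions are ignored via ignore_index
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)

# for epoch in range(10):
#     print(train(model, train_iterator, optimizer, criterion))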
