• RECURRENT NEURAL NETWORKS (RNN) — PART 4: ATTENTIONAL INTERFACES
  • GokuMohandas
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: TobiasLee
  • Proofreaders: Changkun, Brucexz

Articles in this series:

  1. RNN Recurrent Neural Networks Series 1: Basic RNN and char-RNN
  2. RNN Recurrent Neural Networks Series 2: Text Classification
  3. RNN Recurrent Neural Networks Series 3: Encoders and Decoders
  4. RNN Recurrent Neural Networks Series 4: Attention Mechanisms
  5. RNN Recurrent Neural Networks Series 5: Custom Units

RNN Recurrent Neural Networks Series 4: Attention Mechanisms

In this article, we will use an encoder-decoder model with an attention mechanism to solve a sequence-to-sequence (seq-to-seq) problem. The implementation is mainly based on this paper; please refer to it for details.

attention.png

First, let’s take a look at the entire model architecture and discuss some of its interesting parts. Then, starting from the encoder-decoder model without attention that we implemented in the previous article, we will add the attention mechanism step by step and walk through the implementation details, including inference. Note: this model is not the best model available, not to mention that the data is something I scribbled together in a few minutes. This article aims to help you understand models that use attention mechanisms, so that you can apply them to larger data sets with very good results.

Encoder-decoder model with attentional mechanism:

Screen Shot 2016-11-19 at 5.27.39 pm.png

This image is a more detailed version of the first one. Let’s start with the encoder and work our way to the output of the decoder. Our input data consists of vectors that have been padded and embedded. We feed these vectors to an RNN made up of a series of cells (the blue RNN cells in the figure above). The outputs of these cells are called hidden states (h0, h1, etc.); they are initialized to zero, but as data is fed in, these hidden states change and hold some very valuable information. If you are using an LSTM (a type of RNN cell), we also pass the cell state c along with the hidden state h to the next cell. For each input (x0, etc.), each cell produces a hidden state output, which also becomes part of the input to the next cell. We denote the outputs of the cells as h1 through hN, and these outputs become the inputs to our attention mechanism.
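As a rough illustration of this process, here is a minimal sketch of statically unrolling an RNN cell over the embedded inputs; this is essentially what tf.nn.dynamic_rnn does for us, and the names cell, embedded_inputs, batch_size and max_len are illustrative, not from the article's code:

with tf.variable_scope('encoder_rnn'):
    state = cell.zero_state(batch_size, tf.float32)             # h_0 initialized to zero
    encoder_hidden_states = []
    for t in range(max_len):
        if t > 0:
            tf.get_variable_scope().reuse_variables()           # reuse the cell weights across time steps
        output, state = cell(embedded_inputs[:, t, :], state)   # each cell consumes x_t and the previous state
        encoder_hidden_states.append(output)                    # h_1 ... h_N will feed the attention mechanism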

Before we dive into the attention mechanism, let’s take a look at how the decoder processes its inputs and generates its outputs. The target-language sentences go through the same word-embedding process and become the inputs to the decoder, starting with a GO token and ending with an EOS token followed by some padding. The decoder’s RNN cells also have hidden states that, as above, are initialized to zero and change as data is fed in. So far, the decoder looks no different from the encoder. The difference is that the decoder also receives, as an additional input, a context vector c_i produced by the attention mechanism. In the next section we will discuss in detail how the context vector is computed from all of the encoder’s hidden states and the state of the previous decoder cell. The important result is this: the context vector tells us how to distribute attention over the encoder’s inputs so that we can better predict the next output.

Each decoder cell is computed from its embedded input, the hidden state of the previous cell, and the context vector produced by the attention mechanism; the target output is then generated with a softmax function. It is worth noting that during training, each RNN cell uses only these three inputs to produce the target output. At inference time, however, we do not know the decoder’s next input, so we use the decoder’s previous prediction as the new input.
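A minimal sketch of that difference (the function and variable names here are illustrative, not part of the article’s code):

# Training: feed the ground-truth target token at every step (teacher forcing).
# Inference: feed the argmax of the previous step's prediction.
def next_decoder_input(step, target_inputs, prev_logits, is_training):
    if is_training or step == 0:
        return target_inputs[step]        # GO, y_1, y_2, ...
    return tf.argmax(prev_logits, 1)      # id of the previously predicted token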

Now, let’s take a closer look at how the attention mechanism produces context vectors.

Attention mechanism:

Screen Shot 2016-11-19 at 5.27.49 pm.png

The diagram above shows the attention mechanism. Let’s first focus on the inputs and outputs of the attention layer: we use all the hidden states generated by the encoder and the state of the previous decoder cell to generate a context vector for each decoder cell. First, these inputs pass through a tanh layer to produce a score matrix e, with one entry e_ij for each encoder hidden state j with respect to the i-th decoder cell. We then apply the softmax function to e to obtain a probability for each hidden state, which we call alpha. Alpha is multiplied with the original hidden-state matrix h, giving each hidden state a weight, and the weighted states are summed to obtain the context vector c_i of shape [N, H], which is really a weighted representation of the input produced by the encoder.

This context vector may be arbitrary at the beginning of the training, but as the training progresses, our model will continue to learn which parts of the input generated by the encoder are important, helping us to produce better results on the decoder side.
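To make this computation concrete, here is a tiny NumPy sketch of one decoder step of this kind of additive attention; all shapes and weight names (W_h, W_s, v) are invented for illustration, and the actual TensorFlow implementation used in this article appears later:

import numpy as np

N, max_len, H = 2, 5, 4                        # batch size, source length, hidden size
h_enc = np.random.randn(N, max_len, H)         # encoder hidden states h_1 ... h_max_len
s_prev = np.random.randn(N, H)                 # previous decoder state
W_h = np.random.randn(H, H)
W_s = np.random.randn(H, H)
v = np.random.randn(H)

# e_ij = v . tanh(W_h h_j + W_s s_{i-1}): one alignment score per encoder position
e = np.tanh(h_enc.dot(W_h) + s_prev.dot(W_s)[:, None, :]).dot(v)   # [N, max_len]
alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)           # softmax over positions
c = (alpha[:, :, None] * h_enc).sum(axis=1)                        # context vector, [N, H]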

Tensorflow implementation:

Now let’s implement this model; the most important part is the attention mechanism. We will use a unidirectional GRU encoder and decoder, very similar to the one used in the previous article, except that the decoder uses the additional context vector (representing the attention distribution) as input. We will also use the embedding_attention_decoder() interface in Tensorflow.

First, let’s look at the data set that will be processed and passed to the encoder/decoder.

Data:

I created a very small data set for this model: 20 sentences in English and their Spanish counterparts. The focus of this tutorial is understanding how to build an encoder-decoder model with a soft attention mechanism to solve sequence-to-sequence problems such as machine translation, so I simply wrote 20 English sentences about myself and translated them into Spanish, and that’s our data.

First, we turn these sentences into sequences of tokens, which are then converted into the corresponding vocabulary ids. During this process we build a vocabulary dictionary that lets us map from token to token id. For the target language (Spanish), we append an additional EOS token. Next we pad the token sequences of the source and target languages to the maximum length (the longest sentence length in each respective dataset), and this is the data we will feed to the model. We pass the padded source-language data straight to the encoder, but we do some extra work on the target-language data to get the decoder’s inputs and outputs.

Finally, the input looks like this:

Screen Shot 2016-11-19 at 4.20.54 pm.png

This is just a sample from the data set. The 0’s in the vectors are padding, 1 is the GO token, and 2 is the EOS token. Below is a more general representation of the data processing pipeline. You can ignore the target weights section because we won’t use it in our implementation.

screen-shot-2016-11-16-at-5-09-10-pm
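A minimal sketch of this preprocessing for one sentence pair; the id values and the helper name prepare_pair are illustrative, chosen to match the conventions shown above:

PAD_ID, GO_ID, EOS_ID = 0, 1, 2

def prepare_pair(source_ids, target_ids, max_len_src, max_len_tgt):
    # encoder input: the source ids padded to max_len_src
    encoder_input = source_ids + [PAD_ID] * (max_len_src - len(source_ids))
    # decoder input starts with GO; decoder target ends with EOS; both padded to max_len_tgt
    decoder_input = [GO_ID] + target_ids + [PAD_ID] * (max_len_tgt - len(target_ids) - 1)
    decoder_target = target_ids + [EOS_ID] + [PAD_ID] * (max_len_tgt - len(target_ids) - 1)
    return encoder_input, decoder_input, decoder_target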

The encoder

The input data encoder_inputs is a matrix of shape [N, max_len], which becomes [N, max_len, H] after word embedding. The encoder is a dynamic RNN; after processing, we get outputs of shape [N, max_len, H] and a state matrix of shape [N, H] (the state of the last relevant cell for each sentence in the batch). These are the outputs of our encoder.
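A minimal sketch of such an encoder, assuming the older Tensorflow API used throughout this article (names like seq_lens, num_hidden_units and vocab_size_src are illustrative):

with tf.variable_scope('encoder'):
    W_input = tf.get_variable('W_input', [vocab_size_src, num_hidden_units])  # embedding weights
    embedded_inputs = tf.nn.embedding_lookup(W_input, encoder_inputs)         # [N, max_len, H]
    cell = tf.nn.rnn_cell.GRUCell(num_hidden_units)
    encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
        cell, embedded_inputs, sequence_length=seq_lens, dtype=tf.float32)
    # encoder_outputs: [N, max_len, H]; encoder_state: [N, H]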

The decoder

Before discussing the attention mechanism, let’s look at the inputs and outputs of the decoder. The decoder’s initial state comes from the encoder: it is the state of the last relevant cell after each sentence has passed through the encoder RNN (shape [N, H]). Tensorflow’s embedding_attention_decoder() function requires the decoder inputs to be a time-ordered list (word order in a sentence), so we convert the [N, max_len] input into a list of max_len tensors, each of shape [N]. We also create softmax weight matrices to process the decoder’s outputs; these are our output projection weights. We pass the time-ordered list (the transformed [N, max_len] input), the initial state, the attention matrix, and the projection weights as arguments to embedding_attention_decoder() and get back the outputs (of shape [max_len, N, H]) and a state matrix ([N, H]). The outputs are also arranged in time order, so we flatten them and apply the output projection to get a matrix of shape [N*max_len, C]. Next we reshape the target output from [N, max_len] to [N*max_len] and use sparse_softmax_cross_entropy_with_logits() to compute the loss. Finally, we apply a mask to the loss so that the padded positions do not affect it.
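A rough sketch of these steps, assuming the older tf.nn.seq2seq API used throughout this article (names such as decoder_inputs, decoder_targets, target_weights and forward_only are illustrative, and newer Tensorflow versions require keyword arguments for the cross-entropy call):

decoder_inputs_list = tf.unpack(tf.transpose(decoder_inputs, [1, 0]))  # max_len tensors, each [N]
W_softmax = tf.get_variable('W_softmax', [num_hidden_units, vocab_size_tgt])
b_softmax = tf.get_variable('b_softmax', [vocab_size_tgt])

outputs, state = tf.nn.seq2seq.embedding_attention_decoder(
    decoder_inputs_list, encoder_state, encoder_outputs,   # inputs, initial state, attention states
    tf.nn.rnn_cell.GRUCell(num_hidden_units),
    num_symbols=vocab_size_tgt, embedding_size=num_hidden_units,
    output_projection=(W_softmax, b_softmax), feed_previous=forward_only)

# Project to vocabulary size, flatten in time-major order, and compute a masked loss
logits = tf.matmul(tf.reshape(tf.pack(outputs), [-1, num_hidden_units]),
                   W_softmax) + b_softmax                               # [max_len*N, C]
targets_flat = tf.reshape(tf.transpose(decoder_targets, [1, 0]), [-1])  # [max_len*N]
weights_flat = tf.reshape(tf.transpose(target_weights, [1, 0]), [-1])   # 0.0 at PAD positions
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits, targets_flat)
loss = tf.reduce_sum(losses * weights_flat) / tf.reduce_sum(weights_flat)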

Attention:

Finally, we get to the attention mechanism. Now that we know the inputs and outputs, let’s look at what actually happens when we give embedding_attention_decoder() its arguments (the time-ordered input list, the initial state, the attention matrix — the encoder’s outputs — and the output projection). First, a set of weights is created to embed the decoder inputs; we call it W_embedding. After the decoder output is generated from the input, a loop function decides which part of the output to pass to the next decoder cell as input. During training, we normally do not pass the output of the previous decoder cell to the next one, so the loop function is None. During inference we do, so the loop function is _extract_argmax_and_embed(), which does what its name suggests (extract the argmax and embed it). The decoder cell’s output is multiplied by output_projection, converting it from [N, H] to [N, C]; the most likely token is then embedded back to [N, H] with the same W_embedding. This processed output becomes the input to the next decoder cell.

# If we need to predict the next word, use the following loop function
loop_function = _extract_argmax_and_embed(
    W_embedding, output_projection,
    update_embedding_for_previous) if feed_previous else None

Screen Shot 2016-11-22 at 7.53.40 am.png

Another optional argument to the loop function is update_embedding_for_previous; if set to False, gradient updates to the W_embedding weights are stopped when we embed the decoder’s outputs (except for the GO token). So, although we use W_embedding in both places, its value is learned only from embedding the decoder’s inputs, not its outputs (except for the GO token). We can then pass the embedded, time-ordered decoder inputs, the initial state, the attention matrix, and the loop function to the attention_decoder() function.

The attention_decoder() function is the heart of the attention mechanism, and it includes some additional operations not mentioned in the paper cited at the beginning of this article. Recall that the attention mechanism takes our attention matrix (the encoder’s outputs) and the state of the previous decoder cell, passes them through a tanh layer, and produces e_ij (a score measuring how well the inputs align with a given output position). We then apply the softmax function to get alpha_ij, which is multiplied with the original attention matrix; summing this weighted matrix gives us our new context vector c_i. Eventually, this context vector is used to produce the output of the new decoder cell.

The main difference is that our attention matrix (the encoder’s outputs) and the state of the previous decoder cell are not simply passed through a _linear() function followed by the conventional tanh. A few additional steps are needed: first, we apply a 1×1 convolution to the attention matrix, which helps us extract important features from it rather than working on the raw data directly — recall the feature-extraction role of convolutional layers in pattern recognition. This gives us better features, but it means we need to represent the attention matrix as a 4-dimensional tensor.

'''
Shape transitions:
initial hidden state: [N, max_len, H]
reshaped to 4-D: [N, max_len, 1, H] = N "images", so we can apply a filter on top
filter: [1, 1, H, H] = [height, width, depth, num_filters]
using a convolution with stride 1 and padding 1:
    H = ((H - F + 2P) / S) + 1 = ((max_len - 1 + 2) / 1) + 1 = height'
    W = ((W - F + 2P) / S) + 1 = ((1 - 1 + 2) / 1) + 1 = 3
    K = H
hidden features: [N, height', 3, H]
'''

hidden = tf.reshape(attention_states,
    [-1, attn_length, 1, attn_size]) # [N, max_len, 1, H]
hidden_features = []
attention_softmax_weights = []
for a in xrange(num_heads):
    # filter
    k = tf.get_variable("AttnW_%d" % a,
        [1, 1, attn_size, attn_size]) # [1, 1, H, H]
    hidden_features.append(tf.nn.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
    attention_softmax_weights.append(tf.get_variable(
        "W_attention_softmax_%d" % a, [attn_size]))

This means that in order to combine the transformed 4-dimensional attention matrix with the previous decoder cell’s state, we need to convert the latter into a 4-dimensional representation as well. This is as simple as passing the previous decoder state through an MLP and reshaping it into a 4-dimensional tensor that matches the transformed attention matrix.

y = tf.nn.rnn_cell._linear(
    args=query, output_size=attn_size, bias=True)

# reshape into a 4-D tensor
y = tf.reshape(y, [-1, 1, 1, attn_size]) # [N, 1, 1, H]

# compute alpha
s = tf.reduce_sum(
    attention_softmax_weights[a] *
    tf.nn.tanh(hidden_features[a] + y), [2, 3])
a = tf.nn.softmax(s)

# compute the context vector c
c = tf.reduce_sum(
    tf.reshape(a, [-1, attn_length, 1, 1]) * hidden, [1, 2])
c = tf.reshape(c, [-1, attn_size])

After converting both the attention matrix and the previous decoder state, we can apply the tanh operation. We multiply the tanh result by the softmax weights, sum, and then apply the softmax function to get alpha_ij. Finally, we reshape alpha, multiply it with the initial attention matrix, and sum to get our context vector c_i.

The decoder inputs can then be processed one by one. Let’s talk about training first. During training we don’t feed the decoder’s own outputs back in, because the true inputs are available, so the loop function here is None. We combine the decoder input with the previous context vector (which starts off as zeros) via an MLP using the _linear() function, and then pass the result together with the previous decoder state to the RNN cell to get the output. We process a whole batch for one time step at a time, because each step needs the previous state from the same time index for every sample; working over a batch at each time step is more efficient, which is why we need the inputs to be a time-ordered list.

Once we have the output and state of the RNN cell, we can compute the new context vector from the new state. The cell’s output and the new context vector are then passed through an MLP to get the decoder output. These additional MLPs are not drawn in the decoder schematic, but they are the extra steps needed to produce the output. Note that both the cell outputs and the outputs of attention_decoder have shape [max_len, N, H].

At inference time, instead of None, the loop function is _extract_argmax_and_embed(). This function receives the output of the previous decoder cell: the previous output is projected with output_projection, the most likely token is taken, and it is re-embedded to become the input of the new decoder cell. After the attention (context) vector has been applied, prev is updated to the newly predicted output. A rough sketch of this loop function appears after the code block below.


# Process the decoder inputs one at a time
for i, inp in enumerate(decoder_inputs):

    if i > 0:
        tf.get_variable_scope().reuse_variables()

    if loop_function is not None and prev is not None:
        with tf.variable_scope("loop_function", reuse=True):
            inp = loop_function(prev, i)

    # Merge input and attention vectors
    input_size = inp.get_shape().with_rank(2)[1]
    x = tf.nn.rnn_cell._linear(
        args=[inp]+attns, output_size=input_size, bias=True)

    # decoder RNN
    cell_outputs, state = cell(x, state) # our stacked cell


    # Get the context vector through attention
    attns = attention(state)

    with tf.variable_scope('attention_output_projection'):
        output = tf.nn.rnn_cell._linear(
            args=[cell_outputs]+attns, output_size=output_size,
            bias=True)
    if loop_function is not None:
        prev = output
    outputs.append(output)

return outputs, state
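For reference, here is a rough sketch of what a loop function like _extract_argmax_and_embed() does, simplified from the Tensorflow seq2seq source; treat it as illustrative rather than the exact implementation:

def _extract_argmax_and_embed(embedding, output_projection=None, update_embedding=True):
    def loop_function(prev, _):
        if output_projection is not None:
            prev = tf.matmul(prev, output_projection[0]) + output_projection[1]  # [N, H] -> [N, C]
        prev_symbol = tf.argmax(prev, 1)                           # greedily pick the most likely id
        emb_prev = tf.nn.embedding_lookup(embedding, prev_symbol)  # re-embed back to [N, H]
        if not update_embedding:
            emb_prev = tf.stop_gradient(emb_prev)                  # don't update W_embedding via outputs
        return emb_prev
    return loop_function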

We then process the outputs from attention_decoder(): apply the output projection and softmax, flatten, and finally compare with the target outputs to compute the loss.

Details:

Sampled Softmax

Encoder-decoder models with attention work well for sequence-to-sequence tasks such as machine translation, but they often run into trouble because of the sheer size of the vocabulary. In particular, computing the softmax over the decoder outputs during training is very expensive; the solution is to use sampled softmax. You can learn more about why and how to do this in this article.

Here is the code for sampled softmax. Note that the weights are the same as those we used for output_projection in the decoder, because they serve the same purpose: converting the decoder’s output (a vector of length H) into a vector whose length is the number of classes.

def sampled_loss(inputs, labels):
    labels = tf.reshape(labels, [-1, 1])
    # We need to compute the sampled_softmax_loss using 32bit floats to
    # avoid numerical instabilities.
    local_w_t = tf.cast(w_t, tf.float32)
    local_b = tf.cast(b, tf.float32)
    local_inputs = tf.cast(inputs, tf.float32)
    return tf.cast(
            tf.nn.sampled_softmax_loss(local_w_t, local_b,
                local_inputs, labels,
                num_samples, self.target_vocab_size),
            dtype)
softmax_loss_function = sampled_loss

Next, we use the sequence_loss() function to compute the loss, where the weight vector is 1 except at positions whose target output is the PAD token, where it is 0 (a sketch of how these weights can be built follows the code below). It’s worth noting that we only use sampled softmax during training; at inference time we compute the regular softmax over the entire vocabulary, not just a sampled subset.

else:
    losses.append(sequence_loss(
      outputs, targets, weights,
      softmax_loss_function=softmax_loss_function))
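As a hedged sketch of how those weights could be built (PAD_ID and the time-major targets list are assumptions carried over from the data section, not code from the repo):

weights = []
for target in targets:                            # each target: [N] ids for one time step
    is_real = tf.not_equal(target, PAD_ID)        # True for real tokens, False for PAD
    weights.append(tf.cast(is_real, tf.float32))  # the 1.0 / 0.0 weights used by sequence_loss()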

A buckets model:

Another common addition is tf.nn.seq2seq.model_with_buckets(), which is also used in Tensorflow’s official NMT tutorial. The advantage of the buckets model is that it shortens the attention matrix: in the previous model we applied attention over max_len hidden states, whereas with buckets we only attend over the relevant portion, since the PAD tokens can then largely be ignored. We can also choose buckets so as to keep the number of PAD tokens as small as possible.

Personally, I find this a bit crude. If we really want to avoid dealing with PAD tokens, I would recommend using the seq_lens property to filter out PAD positions from the encoder’s outputs, or, when computing the context vectors, setting the hidden states at PAD positions in each sentence to zero. This approach is a bit more involved, so we won’t implement it here, but buckets are not an elegant solution to the noise introduced by PAD tokens.
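A minimal sketch of that idea, assuming a Tensorflow version that provides tf.sequence_mask (otherwise the mask can be built by comparing tf.range(max_len) against seq_lens); encoder_outputs and seq_lens follow the names used earlier:

mask = tf.sequence_mask(seq_lens, maxlen=max_len, dtype=tf.float32)  # [N, max_len], 1.0 at real tokens
masked_states = encoder_outputs * tf.expand_dims(mask, 2)            # PAD hidden states zeroed, [N, max_len, H]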

Conclusion:

Attention mechanisms are a hot research topic, and there are many variations. Whatever the setting, this model does a great job on sequence-to-sequence tasks, so I really enjoy using it. Be careful to separate the training set from the validation set, as this model can easily overfit and then perform very poorly on the validation set. In future articles, we will use attention mechanisms to solve more complex tasks involving memory and logical reasoning.

Code:

GitHub Repo

Matrix shape analysis:

The encoder’s input is of shape [N, max_len], which becomes [N, max_len, H] after the embedding operation and is then fed to the encoder RNN. The encoder’s outputs have shape [N, max_len, H], and the state matrix has shape [N, H]; it holds the state of the last relevant cell for each sample.

The encoder’s output and attention vectors are of the shape [N, max_len, H].

The decoder’s input, of shape [N, max_len], is converted into a time-ordered list of length max_len, with each vector of shape [N]. The decoder’s initial state is the encoder’s state matrix of shape [N, H]. Before entering the decoder RNN, the data is embedded into a time-ordered list of length max_len in which each vector has shape [N, H]. The input data may be the actual decoder input or, at prediction time, the output produced by the previous decoder cell. The previous decoder cell’s output has shape [N, H]; it is projected to [N, C] through the softmax layer (output projection) and then embedded back to [N, H] using the same weights we used on the input. These inputs are fed to the decoder RNN, giving decoder outputs of shape [max_len, N, H] and a state matrix of shape [N, H]. The outputs are flattened to [N*max_len, C] and compared with the flattened target outputs (of shape [N*max_len]). If there are PAD tokens in the target output, some masking is applied when computing the loss, followed by backprop.

Inside the decoder RNN there are also some shape transformations. First, the attention matrix (the encoder outputs) of shape [N, max_len, H] is reshaped into a 4-dimensional tensor [N, max_len, 1, H] (so that we can use convolution operations), and the convolution is used to extract useful features; these hidden features are also 4-dimensional ([N, height', 3, H] under the padding arithmetic shown earlier). The previous decoder hidden state, of shape [N, H], is the other input to the attention mechanism. It passes through an MLP and remains [N, H] (this prevents the second dimension of the previous hidden state from differing from attention_size, which here is also H), and is then reshaped into a 4-dimensional tensor [N, 1, 1, H] so that we can combine it with the hidden features. We apply the tanh function to the sum and then the softmax function to get alpha_ij, of shape [N, max_len, 1, 1] (the probability of each hidden state in each sample). Alpha is multiplied with the original hidden states to give a tensor of shape [N, max_len, 1, H], which is summed to obtain a context vector of shape [N, H].

The context vector is combined with the decoder’s input of shape [N, H]. Whether this input comes from the decoder’s input data (at training time) or from the previous cell’s prediction (at inference time), it is just one of the [N, H] vectors in the list of length max_len. Recall that the decoder input is a time-ordered list of length max_len whose embedded vectors have shape [N, H], which is why every input has shape [N, H]. First we combine it with the previous context vector (initialized to an all-zero [N, H] matrix); the result goes through an MLP, producing an output of shape [N, H], which, together with the state matrix (shape [N, H]), is given to our RNN cell. The resulting cell_outputs have shape [N, H], and the state matrix is also [N, H]. This new state matrix becomes the input to the next decoder step. We repeat this for all max_len inputs, giving a list of max_len vectors of shape [N, H]. After getting the cell output and new state, we pass the new state to the attention function to get a new context vector of shape [N, H] and combine it with the [N, H] output; again using an MLP, we convert it into a vector of shape [N, H]. Finally, if we are making predictions, this new output becomes prev (prev starts as None), and prev is fed to loop_function to produce the input for the next decoder cell.


The Nuggets Translation Project is a community that translates high-quality technical articles from the Internet into Chinese and shares them on Nuggets. The content covers Android, iOS, React, front-end, back-end, product, design and other fields. If you want to see more high-quality translations, please keep following the Nuggets Translation Project, its official Weibo, and its Zhihu column.