Attention is a mechanism used to improve the performance of RNN-based (LSTM or GRU) Encoder-Decoder models, commonly known as the Attention Mechanism. It is very popular right now and is used in many areas such as machine translation, speech recognition, and image captioning. The reason it is so popular is that Attention gives the model the ability to discriminate: in machine translation or speech recognition, for example, it assigns a different weight to each word in the sentence, which makes the learning of the neural network model more flexible (soft). At the same time, Attention itself acts as an alignment, explaining the alignment relationship between the input and output sentences and showing what knowledge the model has learned, which provides a window into the black box of deep learning.

The best articles I’ve collected for you on the mechanics of attention

  • Model Summary 24 – Attention Mechanism in deep learning: principle, classification and application (first published on Zhihu, 2017)
  • What are the mainstream attention methods? The attention mechanism is well understood (2017)
  • Attention_Network_With_Keras – attention model code implementation and analysis (code walkthrough, 2018-06-17)
  • Attention_Network_With_Keras – code implementation on GitHub (2018)
  • The influence of various attention mechanisms on deep learning in NLP (an overview from the Heart of the Machine, 2018-10-08)

If you don't feel like reading too much text, scroll straight to the end of the article to see my code and a working implementation.

We take the 2014 paper "Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation" as our starting point.

To introduce the structure and principle of the Attention Mechanism, we first need to introduce the structure of the Seq2Seq model. The RNN-based Seq2Seq model was mainly introduced by two papers, which adopted different RNN variants. Ilya Sutskever et al. used LSTMs to build a Seq2Seq model in their 2014 paper Sequence to Sequence Learning with Neural Networks, while Kyunghyun Cho et al., in their 2014 paper Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation, put forward a Seq2Seq model based on GRUs. The Seq2Seq model proposed in both papers aims to solve the central problem in machine translation of how to map a variable-length input X to a variable-length output Y. Its main structure is shown in the figure.

Traditional Seq2Seq structure

Here, the Encoder encodes the input sequence x1, x2, x3, ..., xT into a fixed-length hidden vector C (the background vector, or context vector). C has two functions: 1. it serves as the initial vector that initializes the Decoder, i.e. the vector from which the Decoder predicts y1; 2. it serves as the background vector that guides the output of each step y in the Y sequence. The Decoder mainly decodes the output yt at time t from the background vector C and the previous output yt-1, until it produces the end-of-sequence token.
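To make the structure concrete, here is a minimal Keras sketch of such an encoder-decoder. The vocabulary sizes, hidden size, and variable names are illustrative assumptions rather than values from either paper, and the sketch only shows the first role of C (initializing the Decoder):

from keras.layers import Input, LSTM, Dense
from keras.models import Model

x_vocab_size, y_vocab_size = 41, 11  # illustrative vocabulary sizes
latent_dim = 256                     # illustrative size of the background vector C

# Encoder: reads x1..xT and keeps only its final states as the background vector C
encoder_inputs = Input(shape=(None, x_vocab_size))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: initialized with C, predicts y_t from the previous output y_{t-1} (teacher forcing)
decoder_inputs = Input(shape=(None, y_vocab_size))
decoder_seq = LSTM(latent_dim, return_sequences=True)(decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(y_vocab_size, activation='softmax')(decoder_seq)

seq2seq = Model([encoder_inputs, decoder_inputs], decoder_outputs)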

As mentioned above, the traditional Seq2Seq model does not differentiate between the parts of the input sequence X. Therefore, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, in their paper Neural Machine Translation by Jointly Learning to Align and Translate (ICLR 2015), introduced the Attention Mechanism to solve this problem. The structure of the model they proposed is shown in the figure.

Diagram of the Attention Mechanism module

We use Attention_Network_With_Keras as the example Attention implementation.

Part of the code

Tx = 50 # Max x sequence length
Ty = 5 # y sequence length
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)

# Split data 80-20 between training and test
train_size = int(0.8*m)
Xoh_train = Xoh[:train_size]
Yoh_train = Yoh[:train_size]
Xoh_test = Xoh[train_size:]
Yoh_test = Yoh[train_size:]

To be careful, let’s check that the code works:

i = 5
print("Input data point " + str(i) + ".")
print("")
print("The data input is: " + str(dataset[i][0]))
print("The data output is: " + str(dataset[i][1]))
print("")
print("The tokenized input is:" + str(X[i]))
print("The tokenized output is: " + str(Y[i]))
print("")
print("The one-hot input is:", Xoh[i])
print("The one-hot output is:", Yoh[i])
Input data point 5.

The data input is: 23 min after 20 p.m.
The data output is: 20:23

The tokenized input is: [ 5  6  0 25 22 26  0 14 19 32 18 30  0  5  3  0 28  2 25  2 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40]
The tokenized output is: [ 2  0 10  2  3]

The one-hot input is: [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 1.]]
The one-hot output is: [[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]]

Model

Our next goal is to define our model. The important part will be defining the attention mechanism and then making sure to apply that correctly.

Define some model metadata:

layer1_size = 32
layer2_size = 128 # Attention layer

The next two code snippets define the attention mechanism. This is split into two parts:

  • Calculating context
  • Creating an attention layer

As a refresher, an attention network pays attention to certain parts of the input at each output time step. Attention denotes which inputs are most relevant to the current output step: an input step gets an attention weight close to 1 if it is relevant and close to 0 otherwise. The context is the resulting "summary of the input".

The requirements are thus: the attention weights for one output step should have shape (m, Tx, 1) and sum to 1 over the Tx axis, and the context should be calculated in the same manner for each time step. Beyond that, there is some flexibility. This notebook calculates both as follows, matching the code below:

    attention = softmax(Dense_1(Dense_8(concatenate(a, repeat(h_prev)))))
    context = sum over t of attention_t * a_t

For safety, softmax is defined as K.softmax(x, axis=1), i.e. normalized over the Tx axis.

# Define part of the attention layer gloablly so as to
# share the same layers for each attention step.
def softmax(x):
    return K.softmax(x, axis=1)

at_repeat = RepeatVector(Tx)
at_concatenate = Concatenate(axis=-1)
at_dense1 = Dense(8, activation="tanh")
at_dense2 = Dense(1, activation="relu")
at_softmax = Activation(softmax, name='attention_weights')
at_dot = Dot(axes=1)

def one_step_of_attention(h_prev, a):
    """ Get the context. Input: h_prev - Previous hidden state of a RNN layer (m, n_h) a - Input data, possibly processed (m, Tx, n_a) Output: context - Current context (m, Tx, n_a) """
    # Repeat vector to match a's dimensions
    h_repeat = at_repeat(h_prev)
    # Calculate attention weights
    i = at_concatenate([a, h_repeat])
    i = at_dense1(i)
    i = at_dense2(i)
    attention = at_softmax(i)
    # Calculate the context
    context = at_dot([attention, a])

    return context
def attention_layer(X, n_h, Ty):
    """ Creates an attention layer. Input: X - Layer input (m, Tx, x_vocab_size) n_h - Size of LSTM hidden layer Ty - Timesteps in output sequence Output: output - The output of the attention layer (m, Tx, n_h) """    
    # Define the default state for the LSTM layer
    h = Lambda(lambda X: K.zeros(shape=(K.shape(X)[0], n_h)), name='h_attention_layer')(X)
    c = Lambda(lambda X: K.zeros(shape=(K.shape(X)[0], n_h)), name='c_attention_layer')(X)
    # Messy, but the alternative is using more Input()

    at_LSTM = LSTM(n_h, return_state=True, name='at_LSTM_attention_layer')

    output = []

    # Run attention step and RNN for each output time step
    for _ in range(Ty):
        context = one_step_of_attention(h, X)

        h, _, c = at_LSTM(context, initial_state=[h, c])

        output.append(h)

    return output

The sample model is organized as follows:

  1. BiLSTM
  2. Attention Layer
    • Outputs Ty lists of activations.
  3. Dense
    • Necessary to convert attention layer’s output to the correct y dimensions

layer3 = Dense(machine_vocab_size, activation=softmax)

def get_model(Tx, Ty, layer1_size, layer2_size, x_vocab_size, y_vocab_size):
    """ Creates a model. input: Tx - Number of x timesteps Ty - Number of y timesteps size_layer1 - Number of neurons in BiLSTM size_layer2 - Number of neurons in attention LSTM hidden layer x_vocab_size - Number of possible token types for x y_vocab_size - Number of possible token types for y Output: model - A Keras Model. """

    # Create layers one by one
    X = Input(shape=(Tx, x_vocab_size), name='X_Input')

    a1 = Bidirectional(LSTM(layer1_size, return_sequences=True), merge_mode='concat', name='Bid_LSTM')(X)

    a2 = attention_layer(a1, layer2_size, Ty)

    a3 = [layer3(timestep) for timestep in a2]

    # Create Keras model
    model = Model(inputs=[X], outputs=a3)

    return model

The steps from here on out are for creating the model and training it. Simple as that.

# Obtain a model instance
model = get_model(Tx, Ty, layer1_size, layer2_size, human_vocab_size, machine_vocab_size)
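Before the evaluation below, the model still has to be compiled and trained. The notebook's exact training settings are not reproduced here, so the optimizer, learning rate, epochs, and batch size in this sketch are assumptions:

from keras.optimizers import Adam

# One categorical cross-entropy loss is applied to each of the Ty softmax outputs
opt = Adam(lr=0.05, decay=0.04, clipnorm=1.0)  # illustrative hyperparameters
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

# The targets must match the model's outputs: a list of Ty arrays of shape (train_size, machine_vocab_size)
outputs_train = list(Yoh_train.swapaxes(0, 1))
model.fit(Xoh_train, outputs_train, epochs=30, batch_size=100)  # illustrative settings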
plot_model(model, to_file='Attention_tutorial_model_copy.png', show_shapes=True)

Model structure and description (this is the key part)

The model to be evaluated

Evaluation

The final training loss should be in the range of 0.02 to 0.5

The test loss should be at a similar level.

# Evaluate the test performance
outputs_test = list(Yoh_test.swapaxes(0, 1))
score = model.evaluate(Xoh_test, outputs_test)
print('Test loss: ', score[0])
2000/2000 [==============================] - 2s 1ms/step
Test loss: 0.4966005325317383

Now that we’ve created this beautiful model, let’s see how it does in action.

The below code finds a random example and runs it through our model.

# Let's visually check model output.
import random

i = random.randint(0, m - 1)  # randint is inclusive at both ends

def get_prediction(model, x):
    prediction = model.predict(x)
    max_prediction = [y.argmax() for y in prediction]
    str_prediction = "".join(ids_to_keys(max_prediction, machine_vocab))
    return (max_prediction, str_prediction)

max_prediction, str_prediction = get_prediction(model, Xoh[i:i+1])

print("Input: " + str(dataset[i][0]))
print("Tokenized: " + str(X[i]))
print("Prediction: " + str(max_prediction))
print("Prediction text: " + str(str_prediction))
Input: 13.09
Tokenized: [ 4  6  2  3 12 40 40 40 ... 40]
Prediction: [1, 3, 10, 0, 9]
Prediction text: 13:09

Last but not least, no introduction to Attention networks is complete without a quick tour of what the network actually attends to.

The below graph shows what inputs the model was focusing on when writing each individual letter.

Attention mechanism diagram
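The notebook builds this plot from the activation named 'attention_weights' defined earlier. The helper below is not the notebook's own plotting code, just a minimal sketch of one way to recover those weights, assuming get_model was called once so that output node t of the shared at_softmax layer corresponds to output step t:

import numpy as np
import matplotlib.pyplot as plt
from keras.models import Model

# The shared at_softmax layer was called once per output step, so it has Ty output nodes
att_layer = model.get_layer('attention_weights')
att_outputs = [att_layer.get_output_at(t) for t in range(Ty)]
attention_model = Model(inputs=model.inputs, outputs=att_outputs)

# Each output has shape (1, Tx, 1); collect them into a (Ty, Tx) matrix for example i
weights = np.concatenate(attention_model.predict(Xoh[i:i+1]), axis=-1)[0].T

plt.imshow(weights, cmap='viridis')
plt.xlabel('Input position')
plt.ylabel('Output step')
plt.show()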

Essentials of the attention mechanism

Global attention mechanism

Local attention mechanism

Self-attention mechanism

The hidden vector h_t is first passed through a fully connected layer. The alignment coefficient α_t is then obtained by comparing the output u_t of that fully connected layer with a trainable context vector u (randomly initialized) and normalizing the scores with a softmax. The attention vector s is finally the weighted sum of all the hidden vectors. The context vector u can be interpreted as a representation of the "best word" on average: when the model is faced with a new sample, it uses this knowledge to decide which word deserves more attention. During training, the model updates the context vector through backpropagation, that is, it adjusts its internal representation to determine what the optimal word is.
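A compact numpy sketch of this computation (the names and the tanh projection follow the common formulation of this trainable-context-vector attention; the shapes are illustrative):

import numpy as np

def np_softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def context_vector_attention(H, W, b, u):
    # H: hidden vectors (T, n_h); W, b: the fully connected layer; u: trainable context vector (n_a,)
    U = np.tanh(H @ W + b)      # pass each hidden vector through the fully connected layer
    alpha = np_softmax(U @ u)   # alignment coefficients, normalized with a softmax
    s = alpha @ H               # attention vector: weighted sum of all hidden vectors
    return s, alpha

# Toy usage with random values standing in for learned parameters
rng = np.random.RandomState(0)
T, n_h, n_a = 6, 8, 8
H = rng.randn(T, n_h)
W, b, u = rng.randn(n_h, n_a), np.zeros(n_a), rng.randn(n_a)
s, alpha = context_vector_attention(H, W, b, u)
print(alpha.sum())  # the weights sum to 1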

Self-Attention is very different from the traditional Attention mechanism. Traditional Attention is computed from the hidden states of the source side and the target side, and the result is a dependency between each source word and each target word. Self-Attention, by contrast, is applied on the source side and the target side separately: it only uses the source input or the target input itself, and it captures the dependencies among the words within the source side or within the target side. The source-side and target-side Self-Attention are then combined with source-target Attention to capture the dependencies between the two sides. Self-Attention therefore tends to work better than the traditional Attention mechanism, mainly because traditional Attention ignores the word-to-word dependencies inside the source sentence and inside the target sentence, whereas Self-Attention captures the dependencies within each side as well as between them.
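For illustration only, here is the widely used scaled dot-product form of self-attention in numpy; the paragraph above does not commit to a particular scoring function, so treat this as one concrete instance:

import numpy as np

def self_attention(seq, Wq, Wk, Wv):
    # seq: one sequence (T, d); every position attends to every position of the same sequence
    q, k, v = seq @ Wq, seq @ Wk, seq @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])        # pairwise word-to-word compatibility
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)          # each row is a softmax over the sequence
    return w @ v                                   # each output mixes information from the whole sequence

rng = np.random.RandomState(1)
T, d = 5, 16
seq = rng.randn(T, d)
out = self_attention(seq, rng.randn(d, d), rng.randn(d, d), rng.randn(d, d))
print(out.shape)  # (5, 16)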

Hierarchical attention mechanism

In this framework, the self-attention mechanism is used twice: at the word level and at the sentence level. This design matters for two reasons. First, it matches the natural hierarchical structure of a document (words, sentences, document). Second, while computing the document encoding it allows the model to first decide which words are important in each sentence, and then which sentences are important in the document.
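A rough numpy sketch of the two-level idea (purely illustrative: a real hierarchical attention network also runs a bidirectional GRU and a learned projection at each level):

import numpy as np

def attend(H, u):
    # Pool a set of vectors H (T, d) into one vector, weighted by similarity to a context vector u (d,)
    scores = np.tanh(H) @ u
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    return alpha @ H

rng = np.random.RandomState(2)
n_sentences, n_words, d = 4, 7, 16               # a toy document: 4 sentences of 7 word vectors each
doc = rng.randn(n_sentences, n_words, d)
u_word, u_sentence = rng.randn(d), rng.randn(d)  # word-level and sentence-level context vectors

# Word level: one vector per sentence, emphasizing its important words
sentence_vectors = np.stack([attend(sentence, u_word) for sentence in doc])
# Sentence level: one document vector, emphasizing the important sentences
document_vector = attend(sentence_vectors, u_sentence)
print(document_vector.shape)  # (16,)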

If you are still confused or want to know more, please follow my blog at the Wangjiang Artificial Intelligence Think Tank or contact me on GitHub.

I’ll post the experiment on my GitHub when I’m done with the attention mechanism.