The Attention model is very popular in the NLP field, but for a long time I didn't really understand its basic principle. After reading many other people's articles I seemed to understand it, yet when I tried to explain it I couldn't, so I wrote this article in my own way of understanding.

The Encoder-Decoder Model

When it comes to the Attention model, we have to talk about the seq2seq model first. seq2seq solves the problem of mapping one sequence to another, for example in the following application scenarios:

Machine translation: Sequence of text to be translated –> Sequence of translated text

Speech recognition: Acoustic feature sequence –> Recognized text sequence

Question answering: Word sequence of the question –> Word sequence of the generated answer

Text summarization: Text sequence –> Summary sequence

The basic seq2seq model mainly consists of an Encoder, a Decoder, and a fixed-length semantic vector. The Encoder-Decoder model was proposed in Sequence to Sequence Learning with Neural Networks. Let's take machine translation as an example to show the basic principle of the Encoder-Decoder model.

The Encoder and Decoder are neural networks, which can be RNNs, LSTMs, etc. We take an RNN as an example to unroll the model; its process is shown in the figure below.

Encoder

$x = (x_1, x_2, \ldots, x_T)$ is the input sequence; if the task is machine translation, then $x_t$ is a word of the sentence to be translated.

Encoder is a neural network, which can be RNN, LSTM, etc.

In the case of an RNN, the current hidden state is determined by the previous hidden state and the current input:

$$h_t = f(h_{t-1}, x_t)$$

where $h_t$ denotes the current hidden state of the RNN, $h_{t-1}$ the hidden state of the previous step, and $x_t$ the current input.

At the end of the input, after the hidden states of all the inputs have been obtained, the final semantic vector $C$ is generated:

$$C = q(h_1, h_2, \ldots, h_T)$$

The semantic vector $C$ is a fixed-length vector that will be used as input to the Decoder.
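To make this concrete, here is a minimal NumPy sketch of such an encoder. The weight matrices `W_xh` and `W_hh`, the sizes, and the choice of taking the last hidden state as $C$ are all illustrative assumptions, not the only possible setup:

```python
import numpy as np

# Minimal vanilla-RNN encoder sketch; weights are random stand-ins for
# parameters that a real model would learn during training.
hidden_size, embed_size = 8, 4
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, embed_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden

def encode(inputs):
    """h_t = tanh(W_xh x_t + W_hh h_{t-1}), one step per source word."""
    h = np.zeros(hidden_size)
    states = []
    for x in inputs:
        h = np.tanh(W_xh @ x + W_hh @ h)  # current state from current input + previous state
        states.append(h)
    C = states[-1]  # simplest choice of q(h_1, ..., h_T): the last hidden state
    return states, C

# e.g. a 3-word source sentence, each word as a random embedding
states, C = encode(rng.normal(size=(3, embed_size)))
```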

Decoder

Similarly, the Decoder can be an RNN or an LSTM. The input to the Decoder is the Encoder's output $C$, and $y = (y_1, y_2, \ldots, y_{T'})$ is the result of decoding; if the task is machine translation, this is the translation returned by the machine.

In this phase, given the semantic vector $C$ and the already generated output sequence $y_1, y_2, \ldots, y_{t-1}$, the model predicts the next output word $y_t$:

$$y_t = \arg\max_{y} P(y \mid \{y_1, \ldots, y_{t-1}\}, C)$$

This can also be written as follows. Let $s_t$ be the hidden state of the decoder RNN; then the prediction can be abbreviated as

$$y_t = g(y_{t-1}, s_{t-1}, C)$$

That is, the current output $y_t$ is obtained from the previous output $y_{t-1}$, the hidden state of the previous step $s_{t-1}$, and the Encoder's semantic vector $C$ through the function $g$, where $g$ is a nonlinear neural network, in this case an RNN.
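Continuing the encoder sketch above, a matching decoder step might look like this; the matrices `W_ys`, `W_ss`, `W_cs`, `W_out` and the tiny vocabulary are again hypothetical stand-ins for learned parameters:

```python
# One decoder step: g(y_{t-1}, s_{t-1}, C) realized as a vanilla RNN cell
# followed by a softmax over the output vocabulary.
vocab_size = 10
W_ys = rng.normal(scale=0.1, size=(hidden_size, embed_size))   # prev output -> hidden
W_ss = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # prev state  -> hidden
W_cs = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # semantic C  -> hidden
W_out = rng.normal(scale=0.1, size=(vocab_size, hidden_size))  # hidden -> vocab scores

def decode_step(y_prev, s_prev, C):
    """Predict a distribution over the next word from y_{t-1}, s_{t-1} and C."""
    s = np.tanh(W_ys @ y_prev + W_ss @ s_prev + W_cs @ C)
    logits = W_out @ s
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return probs, s
```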

However, this approach has a drawback: because the semantic vector output by the Encoder has a fixed length, some information is lost for long inputs, which degrades the Decoder's results. To solve this problem, KyungHyun Cho et al. modified the Encoder-Decoder model, and an early Attention model was put forward in Neural Machine Translation by Jointly Learning to Align and Translate.

The Attention Model

In order to solve the problem of information loss caused by the fixed-length semantic vector, KyungHyun Cho et al. introduced the Attention model. The mechanism of the Attention model is similar to how a human translates an article: focus on the word currently being translated, and translate it in combination with the context. The Attention model looks for the few words of the source sentence that correspond to the current output and translates them in combination with the words already translated. For example, when we translate the phrase “machine learning”, while generating the word “machine” the Attention is focused on the corresponding source word. This way the Decoder can see the information of every word of the Encoder's input, instead of being limited to a fixed-length hidden vector and the information loss it causes.

The specific process is as follows:

For a sentence $x = (x_1, x_2, \ldots, x_T)$:

1. Encode the input sequence: obtain the word vectors by some method, then feed them into the Encoder to get the encoded vectors, denoted here as $h_1, h_2, \ldots, h_T$. An RNN or an LSTM can be used as the encoder; here we still take an RNN as an example.

2. Compute the attention weights. For ease of notation, the attention weight is written $a_{ij}$ and the Decoder's vectors are written $q_1, q_2, \ldots$, where $j$ is the index into the input sequence.

For the vector $q_i$ that is currently being decoded, its corresponding attention weights $a_{i1}, a_{i2}, \ldots, a_{iT}$ are given by a function $f$:

$$a_{ij} = f(q_i, h_j)$$

We can also view the Decoder's vector as the Query and the encoded vectors as the Keys; here we are using the Query to look up the corresponding weight of each Key.

3. Compute the semantic vector $c_i = \sum_{j} a_{ij} h_j$, where $a_{ij}$ is the attention weight and $h_j$ is the vector obtained from the Encoder.

4. Decode the next output: according to the semantic vector $c_i$, the previous Decoder output $y_{i-1}$, and the hidden state of the previous Decoder step, predict the next output $y_i$ (a code sketch of this step follows the list).
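Continuing the sketches above, one attention step could look like the following. The dot-product score and the softmax normalization are assumptions made for this sketch; the choice of scoring function $f$ is discussed further below:

```python
# One attention step: score the query (current decoder state) against every
# encoder state, normalize with a softmax, and build c_i as a weighted sum.
def attention(q, encoder_states):
    scores = np.array([q @ h for h in encoder_states])  # f(q_i, h_j) for every j
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # attention weights a_ij
    c = sum(w * h for w, h in zip(weights, encoder_states))  # c_i = sum_j a_ij h_j
    return weights, c
```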

Adding the Attention mechanism introduces two new differences:

1. Every time the Decoder emits a word, a new set of attention weights must be calculated

2. Every Decoder step must recalculate the semantic vector

To calculate the weight of attention, two pieces of information are needed:

  • The Encoder's vectors $h_1, h_2, \ldots, h_T$; these vectors contain the information of all the encoded words, and we can think of them as a dictionary
  • The vector $q_{i-1}$ from Decoder step $i-1$

To illustrate this with the earlier example:

First, feed the text “machine learning” through the RNN Encoder to get the hidden states $h_1, h_2, \ldots$

The second step is to calculate the attention weights. At the beginning there is no Decoder hidden state yet, so $q_0$ is used, usually a fixed start token.

After the attention weights are obtained, the semantic vector $c_1$ is computed from the Encoder hidden states and the corresponding attention weights, and finally the Decoder produces the result “machine”.

When the Decoder produces the second word, the steps are the same; the difference is that the attention weights are now calculated using the hidden state $q_1$ obtained from the Decoder's first step.
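Putting the pieces from the previous sketches together, the whole walk-through becomes a short greedy decoding loop. The zero start state, the stand-in embedding lookup, and the fixed output length are all simplifications for illustration:

```python
# Greedy decoding loop: recompute the attention weights and the semantic
# vector c_i before every output word, exactly as described above.
def translate(source_embeddings, max_len=5):
    encoder_states, _ = encode(source_embeddings)
    q = np.zeros(hidden_size)      # q0: no decoder state yet
    y_prev = np.zeros(embed_size)  # stand-in for a fixed start token
    outputs = []
    for _ in range(max_len):
        weights, c = attention(q, encoder_states)  # new weights + new c every step
        probs, q = decode_step(y_prev, q, c)
        token = int(np.argmax(probs))              # greedy choice of the next word
        outputs.append(token)
        y_prev = rng.normal(size=embed_size)       # stand-in for the token's embedding
    return outputs

print(translate(rng.normal(size=(2, embed_size))))  # e.g. a 2-word source sentence
```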

So the question here is: what is this function $f$? Generally, there are the following methods to calculate the attention weights:

  • Bilinear method

A weight matrix $W$ is used to directly establish the relation between $q_i$ and $h_j$: $f(q_i, h_j) = q_i^T W h_j$. This is quite direct, and the calculation is fast.

  • Dot Product

This method is even more direct: it eliminates the weight matrix and builds the relational mapping between $q_i$ and $h_j$ directly, $f(q_i, h_j) = q_i^T h_j$. Its advantages are faster calculation and having no parameters, which reduces the complexity of the model. But it requires $q_i$ and $h_j$ to have the same dimension.

  • Scaled Dot Product

There is a problem with the dot product method above: as the vector dimension increases, the resulting weights also increase. To keep the computation numerically stable and prevent data overflow, the score is scaled: $f(q_i, h_j) = \frac{q_i^T h_j}{\sqrt{d}}$, where $d$ is the dimension of the vectors.
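The three scoring functions written out side by side as a sketch, with `q` the decoder state, `h` an encoder state, and `W` a learned matrix in the bilinear case:

```python
import numpy as np

def bilinear_score(q, h, W):
    return q @ W @ h                  # f(q, h) = q^T W h

def dot_score(q, h):
    return q @ h                      # f(q, h) = q^T h; q and h must share a dimension

def scaled_dot_score(q, h):
    return (q @ h) / np.sqrt(len(h))  # divide by sqrt(d) to keep scores in range
```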

Talk is cheap, show me the code!! The next article will try to implement the model.