The Attention model is very popular in NLP, but for a long time I did not really understand its basic principle. After reading many other people's articles I seemed to understand it, yet when I actually tried to explain it I could not, so I am writing this article in my own way of understanding it.
The Encoder-Decoder Model
When it comes to the Attention model, we have to talk about the seq2seq model first. The problem seq2seq solves is mapping one sequence to another, for example in the following application scenarios:
Machine translation: Sequence of text to be translated –> Sequence of translated text
Speech recognition: Acoustic feature sequence –> Recognized text sequence
Question answering: Word sequence of the question –> Word sequence of the generated answer
Text summarization: Text sequence –> Summary sequence
The basic seq2seq model mainly consists of an Encoder, a Decoder, and a fixed-length semantic vector. The Encoder-Decoder model was proposed in Sequence to Sequence Learning with Neural Networks. We take machine translation as an example to show the basic principle of the Encoder-Decoder model.
The Encoder and Decoder are neural networks, which can be RNNs, LSTMs, etc. We take an RNN as an example to unroll the model; its process is shown in the figure below.
Encoder
$x_1, x_2, \dots, x_T$ is the input sequence; if it is a machine translation task, then $x_1, x_2, \dots, x_T$ are the words to be translated.
Encoder is a neural network, which can be RNN, LSTM, etc.
In the case of an RNN, the current hidden state is determined by the previous hidden state and the current input:

$$h_t = f(h_{t-1}, x_t)$$

where $h_t$ is the current hidden state of the RNN, $h_{t-1}$ is the hidden state of the previous step, and $x_t$ is the current input.
At the end of the input, after the hidden states of all the inputs have been obtained, the final semantic vector $C$ is generated:

$$C = q(h_1, h_2, \dots, h_T)$$

where $q$ is some function of the hidden states; a common choice is simply $C = h_T$.
The semantic vector C is a fixed length vector that will be used as input to the Decoder.
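The following is a minimal sketch of this Encoder, assuming a plain tanh RNN cell, NumPy, and randomly initialized weights purely for illustration; all names and shapes are mine and not from any particular library.

```python
import numpy as np

def rnn_encoder(inputs, W_x, W_h, b):
    """Minimal tanh-RNN encoder: h_t = tanh(W_x x_t + W_h h_{t-1} + b).

    inputs: list of input vectors x_1..x_T (each of shape [input_dim]).
    Returns all hidden states and the fixed-length semantic vector C
    (taken here as the last hidden state h_T).
    """
    hidden_dim = W_h.shape[0]
    h = np.zeros(hidden_dim)                 # h_0
    hidden_states = []
    for x in inputs:                         # iterate over the input sequence
        h = np.tanh(W_x @ x + W_h @ h + b)   # h_t = f(h_{t-1}, x_t)
        hidden_states.append(h)
    C = hidden_states[-1]                    # one common choice: C = h_T
    return hidden_states, C

# toy usage with random vectors standing in for the words to be translated
rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 8, 16, 4
W_x = rng.standard_normal((hidden_dim, input_dim)) * 0.1
W_h = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)
xs = [rng.standard_normal(input_dim) for _ in range(T)]
hs, C = rnn_encoder(xs, W_x, W_h, b)
print(len(hs), C.shape)   # 4 (16,)
```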
Decoder
Similarly, the Decoder can be an RNN or LSTM, and its input is the output of the Encoder. $y_1, y_2, \dots, y_t$ are the decoding results; for a machine translation task, these are the translated words returned by the model.

In this phase, given the semantic vector $C$ and the already generated output sequence $y_1, y_2, \dots, y_{t-1}$, the Decoder predicts the next output word $y_t$:

$$p(y_t) = p(y_t \mid \{y_1, \dots, y_{t-1}\}, C)$$

This can also be written as follows. Let $s_t$ be the hidden state of the Decoder RNN; then the probability above can be abbreviated as

$$p(y_t \mid \{y_1, \dots, y_{t-1}\}, C) = g(y_{t-1}, s_t, C)$$

That is, the current output $y_t$ is obtained from the previous output $y_{t-1}$, the hidden state $s_t$ (which is computed from the previous hidden state), and the semantic vector $C$ output by the Encoder through the function $g$, where $g$ is a nonlinear neural network, in this case an RNN.
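A matching sketch of one Decoder step under the same illustrative assumptions (NumPy, tanh RNN cell, made-up parameter names); $g$ is approximated here as an RNN update followed by a softmax over a toy vocabulary.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_step(y_prev, s_prev, C, params):
    """One step of a basic (attention-free) Decoder.

    y_prev : embedding of the previous output word y_{t-1}
    s_prev : previous decoder hidden state s_{t-1}
    C      : fixed-length semantic vector from the Encoder
    Returns the new hidden state s_t and p(y_t | y_{<t}, C).
    """
    U_y, U_s, U_c, b, W_out = params
    s_t = np.tanh(U_y @ y_prev + U_s @ s_prev + U_c @ C + b)  # s_t from y_{t-1}, s_{t-1}, C
    probs = softmax(W_out @ s_t)                              # g(y_{t-1}, s_t, C)
    return s_t, probs

# toy usage: dimensions chosen arbitrarily for illustration
rng = np.random.default_rng(0)
emb, hid, vocab = 8, 16, 10
params = (rng.standard_normal((hid, emb)) * 0.1,    # U_y
          rng.standard_normal((hid, hid)) * 0.1,    # U_s
          rng.standard_normal((hid, hid)) * 0.1,    # U_c
          np.zeros(hid),                            # b
          rng.standard_normal((vocab, hid)) * 0.1)  # W_out
s, p = decoder_step(rng.standard_normal(emb), np.zeros(hid), rng.standard_normal(hid), params)
print(p.sum())   # probabilities over the toy vocabulary, sums to 1
```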
However, this approach has a drawback: because the semantic vector output by the Encoder has a fixed length, some information is lost for long inputs, which degrades the Decoder's results. To solve this problem, Kyunghyun Cho et al. modified the Encoder-Decoder model and proposed an early Attention model in Neural Machine Translation by Jointly Learning to Align and Translate.
Attention model:
To solve the information loss caused by the fixed-length semantic vector, Kyunghyun Cho et al. introduced the Attention model. Its mechanism is similar to how a human translates an article: focus on the word currently being translated and translate it in combination with its context. The Attention model likewise looks for the corresponding words in the source sentence and translates them in combination with the words already translated. For example, when we translate "machine learning" and are translating the word "machine", the attention is focused on "machine", so that the Decoder can see the information of every word from the Encoder instead of being limited to a fixed-length hidden vector, which avoids the information loss.
The specific process is as follows:
For a sentence $x_1, x_2, \dots, x_n$:

1. Encode the input sequence. The word vectors are obtained by some method, then fed into the Encoder, and the encoded vectors are obtained, denoted here as $h_1, h_2, \dots, h_n$. The Encoder can be an RNN or LSTM; here we still take an RNN as an example.

2. Compute the attention weights. For ease of notation the attention weights are written as $a_{ij}$ and the Decoder hidden vectors as $q_1, q_2, \dots, q_m$, where $i$ indexes the Decoder step and $j$ indexes the input sequence. For the vector $q_i$ that currently needs to be decoded, its attention weights $a_{i1}, a_{i2}, \dots, a_{in}$ are obtained by a function $f$ of the previous Decoder state and the Encoder vectors, $a_{ij} = f(q_{i-1}, h_j)$, typically normalized with a softmax so they sum to 1. We can also view the Decoder vector as a Query and the encoded vectors as a dictionary of keys: the Query is used to look up the corresponding weight of each key. (A code sketch of steps 2-4 follows this list.)

3. Compute the semantic vector $c_i = \sum_{j=1}^{n} a_{ij} h_j$, where $a_{ij}$ are the attention weights and $h_j$ are the vectors obtained from the Encoder.

4. Decode the next output. From the semantic vector $c_i$, the previous Decoder output $y_{i-1}$, and the previous Decoder hidden state $q_{i-1}$, predict the next output $y_i$.
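Steps 2-4 can be sketched as follows for a single decoding position $i$, using a plain dot product as the scoring function $f$ (the choices for $f$ are discussed below); the names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(q_prev, encoder_states):
    """Compute attention weights a_{i1..in} and the semantic vector c_i.

    q_prev         : previous decoder hidden state q_{i-1}
    encoder_states : array of encoder vectors h_1..h_n, shape [n, dim]
    """
    scores = encoder_states @ q_prev   # f(q_{i-1}, h_j), here a dot product
    weights = softmax(scores)          # a_{ij}, sums to 1 over j
    c_i = weights @ encoder_states     # c_i = sum_j a_{ij} h_j
    return weights, c_i

# toy usage: 4 encoder states of dimension 16, as in the encoder sketch above
rng = np.random.default_rng(1)
H = rng.standard_normal((4, 16))
q0 = rng.standard_normal(16)           # e.g. a fixed start state for the first step
a, c = attention_context(q0, H)
print(a.round(3), c.shape)             # weights over the 4 inputs, (16,)
```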
The addition of the Attention mechanism introduces two new differences:
1. Every time the Decoder generates a word, a set of attention weights has to be computed
2. The semantic vector has to be recalculated for every Decoder step
To calculate the attention weights, two pieces of information are needed:
- The vectors output by the Encoder, $h_1, h_2, \dots, h_n$. These vectors contain the information of all the encoded words; we can think of them as a dictionary
- The hidden vector from the (i-1)-th Decoder step, $q_{i-1}$
To illustrate this with the machine translation example above:
First, the input text "machine learning" is passed through the RNN Encoder to obtain the hidden states $h_1, h_2, \dots, h_n$
The second step is to calculate the attention weights. At the beginning there is no previous Decoder hidden state, so the initial state $q_0$ is usually a fixed start token.
After the attention weights are obtained, the semantic vector $c_1$ is computed from the Encoder hidden states $h_1, h_2, \dots, h_n$ and the corresponding attention weights, and finally the Decoder produces the first result, "machine".
When the Decoder produces the second word, the steps are the same; the difference is that the attention weights are now calculated with the Decoder hidden state $q_1$ obtained in the previous step.
So the question here is: what is this function $f$? Generally, the attention weights are computed with one of the following methods:
- Bilinear method
A weight matrix $W$ is used to directly build the relational mapping between $q$ and $h$: $f(q, h) = q^\top W h$. This is fairly direct, and the calculation is fast.
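A minimal NumPy sketch of this score; the shapes and names are illustrative, and $W$ would be a learned parameter in practice.

```python
import numpy as np

def bilinear_score(q, h, W):
    """Bilinear score: f(q, h) = q^T W h.
    The matrix W maps between the decoder space and the encoder space,
    so q and h may have different dimensions."""
    return q @ W @ h

rng = np.random.default_rng(2)
q, h = rng.standard_normal(16), rng.standard_normal(32)
W = rng.standard_normal((16, 32)) * 0.1
print(bilinear_score(q, h, W))
```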
- Dot Product
$f(q, h) = q^\top h$. This method is even more direct: it eliminates the weight matrix used to build the relational mapping between $q$ and $h$. Its advantages are faster calculation and no parameters, which reduces the complexity of the model. However, $q$ and $h$ must have the same dimension.
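The corresponding sketch, again with illustrative names:

```python
import numpy as np

def dot_score(q, h):
    """Dot-product score: f(q, h) = q^T h.
    No parameters, but q and h must have the same dimension."""
    return q @ h

rng = np.random.default_rng(3)
q, h = rng.standard_normal(16), rng.standard_normal(16)
print(dot_score(q, h))
```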
- Scaled Dot Product
There is a problem with the dot product method above: as the vector dimension increases, the resulting scores also grow large. To keep the computation stable and prevent data overflow, the dot product is scaled by the dimension: $f(q, h) = \dfrac{q^\top h}{\sqrt{d}}$, where $d$ is the dimension of $q$ and $h$.
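A sketch of the scaled version, with a toy comparison against the raw dot product (dimension 256, so the scaling factor is 16); names are illustrative.

```python
import numpy as np

def scaled_dot_score(q, h):
    """Scaled dot-product score: f(q, h) = q^T h / sqrt(d).
    Dividing by sqrt(d) keeps the score magnitude stable as the
    dimension d grows, so the softmax over the scores does not saturate."""
    d = q.shape[-1]
    return (q @ h) / np.sqrt(d)

rng = np.random.default_rng(4)
q, h = rng.standard_normal(256), rng.standard_normal(256)
print(q @ h, scaled_dot_score(q, h))   # the scaled score is 1/16 of the raw one
```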
Talk is cheap, show me the code!! The next article will try to implement the model.