Overview of the main contents of the article:
1. Seq2Seq and the attention mechanism
A Seq2Seq task is one whose input and output are both sequences, for example translating an English sentence into Chinese.
1.1 What is the relationship between the Encoder-Decoder model and Seq2Seq?
A: The encoder-decoder model was first proposed by Cho et al. and applied to machine translation. Since machine translation is text-to-text conversion, such as French to English, Sutskever et al. also call the encoder-decoder model sequence-to-sequence learning (Seq2Seq).
Related papers are as follows:
【1】Cho K, Van Merrienboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. arXiv preprint arXiv:1406.1078, 2014.
【2】Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks[C]//Advances in neural information processing systems. 2014: 3104-3112.
1.2 The Encoder-Decoder model
Generally speaking, the most common way to handle a Seq2Seq task is the Encoder+Decoder pattern: first encode the input sequence into a context representation, then use the Decoder to decode it. In the basic model, only a single context vector is passed from the encoder to the decoder.
The encoder-decoder model is a classic, but it has many limitations. The biggest one is that the only connection between the encoder and the decoder is a fixed-length context vector; that is, the encoder compresses the information of the entire sequence into a single fixed-length vector. This has three disadvantages:
(1) For the encoder, the semantic context vector may not be able to fully represent the information of the whole sequence. (2) For the encoder, the information carried by the earlier inputs may be overwritten by the later inputs; the longer the input sequence, the more serious this becomes. (3) For the decoder, every input word is given the same weight during decoding, so the decoder cannot focus on different parts of the input at different steps.
Because of these three disadvantages, the decoder may not receive enough information about the input sequence when it starts decoding, so decoding accuracy is naturally limited. For this reason, the attention mechanism was added to NMT (neural machine translation) tasks.
1.3 The essence of attention mechanism
For an introduction to the attention mechanism, see my previous article: Attention Mechanisms in Deep Learning, Microstrong, mp.weixin.qq.com/s/3911D_FkT…
Once you are familiar with how the attention mechanism works, let's explore its essential idea.
One way to think about the Attention mechanism is this: imagine the elements of the Source as a series of <Key, Value> pairs. Given an element Query from the Target (the decoder hidden state), we compute the similarity or correlation between the Query and each Key (each word of the input sentence) to obtain a weight coefficient for the Value corresponding to that Key (the semantic encoding of each word of the input sentence, i.e. each hidden-state vector of the encoder), and then take the weighted sum of the Values to obtain the final Attention value. So essentially, the Attention mechanism is a weighted sum of the Values of the elements in the Source, where the Query and the Keys are used to compute the weight coefficients of the corresponding Values. That is, the essential idea can be written as the following formula:
$$Attention(Query, Source) = \sum_{i=1}^{L_x} Similarity(Query, Key_i) \cdot Value_i$$
where $L_x$ denotes the length of the Source; the meaning of the formula is as described above. In the machine-translation example it may not be easy to see the structure that reflects this essential idea, because during the Attention computation the Key and the Value in the Source coincide and point to the same thing, namely the semantic encoding of each word in the input sentence.
Conceptually, Attention can still be understood as selectively picking out a small amount of important information from a large amount of information and focusing on it while ignoring the mostly unimportant rest. The focusing process is reflected in the computation of the weight coefficients: the larger the weight, the more focus falls on the corresponding Value. In other words, the weight represents the importance of the information, and the Value is the information itself.
As for the concrete computation of the Attention mechanism, abstracting over most current methods, it can be summarized as two processes: the first computes the weight coefficients from the Query and the Keys, and the second takes the weighted sum of the Values according to those coefficients. The first process can be further subdivided into two stages: the first stage computes the similarity or correlation between the Query and each Key; the second stage normalizes the raw scores of the first stage. The Attention computation can thus be abstracted into the three stages shown in the figure below.
In the first stage, different functions and computation mechanisms can be used. The most common choices are the dot product of the Query and Key vectors, their cosine similarity, or an additional neural network that computes their similarity or correlation, namely:
The dot product: $Similarity(Query, Key_i) = Query \cdot Key_i$
Cosine similarity: $Similarity(Query, Key_i) = \dfrac{Query \cdot Key_i}{\lVert Query \rVert \cdot \lVert Key_i \rVert}$
MLP network: $Similarity(Query, Key_i) = MLP(Query, Key_i)$
In the second stage, a SoftMax-style computation converts the first-stage scores. On the one hand it normalizes them, turning the raw scores into a probability distribution in which the weights of all elements sum to 1; on the other hand, SoftMax's built-in mechanism highlights the weights of the important elements. The following formula is generally used:
$$a_i = Softmax(Sim_i) = \frac{e^{Sim_i}}{\sum_{j=1}^{L_x} e^{Sim_j}}$$
The result of the second stage is the weight coefficient $a_i$ for $Value_i$; the weighted sum then gives the Attention value:
$$Attention(Query, Source) = \sum_{i=1}^{L_x} a_i \cdot Value_i$$
Through these three stages, the Attention value for the Query is obtained. Most concrete implementations of the Attention mechanism conform to this three-stage abstract computation.
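To make the three-stage abstraction concrete, here is a minimal NumPy sketch (not from the original article; the dot product is used as the scoring function and all shapes are illustrative):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_three_stages(query, keys, values):
    """Three-stage attention: score -> normalize -> weighted sum.
    query:  (d,)      decoder hidden state
    keys:   (L_x, d)  one key per source word
    values: (L_x, d)  one value per source word
    """
    scores = keys @ query        # Stage 1: similarity scores (dot product here)
    weights = softmax(scores)    # Stage 2: normalize scores into weights summing to 1
    return weights @ values      # Stage 3: weighted sum of the values

# Toy usage with random vectors.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))
print(attention_three_stages(q, K, V).shape)   # (8,)
```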
2. Drawbacks of LSTM-based methods
Traditional RNNs suffer from vanishing and exploding gradients. LSTM and GRU both alleviate these problems but do not solve them completely. Therefore, to further improve model capability, the paper Attention Is All You Need proposed the Transformer model, which relies entirely on the attention mechanism and can completely replace RNNs, CNNs and other models. The popular BERT is also based on the Transformer, and it is widely used in NLP fields such as machine translation, question answering, text summarization and speech recognition.
For the vanishing and exploding gradient problems in RNNs and LSTMs, see:
【1】Recurrent Neural Networks (RNN), Microstrong, address: mp.weixin.qq.com/s/IPyI2Ee6K…
【2】Understanding LSTM Networks, Microstrong, address: mp.weixin.qq.com/s/0Q0aK4xmy…
3. Overall structure of Transformer
Like the attention-based Seq2Seq models above, the Transformer uses an encoder-decoder architecture, but its structure is more complex. In the paper, the Encoder side is a stack of 6 encoders, and the Decoder side is likewise a stack of 6 decoders.
The internal structure of each Encoder and Decoder is shown below:
- The Encoder consists of two layers: a self-attention layer and a feed-forward neural network layer. The self-attention layer helps the current position attend not only to the current word but also to the other words, so as to capture the semantics of the context.
- The Decoder also contains the two layers mentioned for the Encoder, but between them there is an encoder-decoder attention layer that helps the current position focus on the parts of the input it needs to pay attention to.
4. Transformer Encoder structure
First, the model performs an embedding operation on the input data, which can be understood as something similar to Word2vec. After embedding, the data is fed into the Encoder layer: it first goes through self-attention and is then sent to the feed-forward neural network. The feed-forward computation can be carried out in parallel across positions, and its output is passed to the next Encoder.
4.1 Positional Encoding
Unlike recurrent sequence models, the Transformer has no built-in way to account for the order of words in the input sequence. To solve this problem, the Transformer adds an additional vector, the Positional Encoding, to the inputs of the Encoder and Decoder layers. Its dimension is the same as that of the embedding. This vector follows a specific pattern that the model can learn to use; it encodes the position of the current word, or the distance between different words in the sentence. There are many ways to compute such a position vector; the paper uses the following:
$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
where $pos$ is the position of the current word in the sentence and $i$ indexes the dimensions of the vector. In other words, sine encoding is used for the even dimensions and cosine encoding for the odd dimensions.
Finally, this Positional Encoding is added to the embedding, and the sum is used as the input to the next layer.
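For reference, a minimal NumPy sketch of this sinusoidal positional encoding (assuming d_model is even; the function name is my own):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding, shape (max_len, d_model).
    Even dimensions use sine, odd dimensions use cosine."""
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angle = pos / np.power(10000, 2 * i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# The encoding is simply added to the word embeddings:
# x = word_embeddings + positional_encoding(seq_len, d_model)
```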
4.2 Self-Attention
Let’s take a macro look at the self-attention mechanism and refine how it works.
For example, the following sentences are the input sentences we want to translate:
The animal didn’t cross the street because it was too tired
What does "it" refer to in this sentence: the street or the animal? This is a simple question for a human but not for an algorithm.
When the model processed the word “it”, the self-attention mechanism allowed “it” to associate with “animal”.
As the model processes each word of the input sequence, self-attention lets it look at all the other words in the sequence, which helps it encode the current word better. If you are familiar with RNNs (recurrent neural networks), recall how the hidden state works: an RNN combines the representations of all the previously processed words/vectors with the current word/vector it is processing. Self-attention plays a similar role: it folds the model's understanding of all the relevant words into the word currently being processed.
As shown in the figure above, when we encode the word "it" in encoder #5 (the topmost encoder in the stack), part of the attention mechanism focuses on "The Animal" and bakes part of its representation into the encoding of "it". Now let's look at how self-attention works in detail.
(1) The first step is to generate three vectors from each encoder input vector (the embedding of each word, which in this paper has dimension 512). We call these three vectors the Query, Key and Value vectors. They are obtained by multiplying the embedding vector by three weight matrices ($W^Q$, $W^K$, $W^V$), which are randomly initialized; their dimension is (64, 512) (note that the second dimension must match the embedding dimension), and their values are updated during back-propagation. The three resulting vectors each have dimension 64.
As shown in the figure above, multiplying the embedding by the corresponding weight matrix produces the "query" vector for this word; in the end, a "query", a "key" and a "value" vector are created for every word of the input sequence.
What are the "query", "key" and "value" vectors? A: They are abstractions that are useful for computing and understanding attention. Read on to see exactly what role each of them plays in the attention computation.
(2) The second step is to compute the self-attention score, which determines how much attention we pay to other parts of the input sentence when encoding a word at a given position. The score is computed as the dot product of a Query and a Key. For example, suppose we are computing the self-attention for the first word, "Thinking". We need to score every word of the input sentence against "Thinking"; these scores determine how much attention is placed on other parts of the sentence when encoding "Thinking".
These scores are computed by taking the dot product of the query vector of "Thinking" with the key vector of each word being scored (every word of the input sentence). So when we process self-attention for the first position, the first score is the dot product of $q_1$ and $k_1$, and the second score is the dot product of $q_1$ and $k_2$.
(3) Next, divide each score by a constant, here 8, which is the square root of the dimension of the key vectors used in the paper ($\sqrt{64}=8$); this makes the gradients more stable. Other values are possible, 8 is simply the default. Then apply a Softmax to the results. Softmax normalizes the scores of all words so that they are positive and sum to 1; the result expresses how relevant each word is to the word at the current position.
This Softmax score determines how much each word contributes to the encoding of the current position ("Thinking"). Obviously, the word at this position itself will get the highest Softmax score, since a word is always highly relevant to itself, but sometimes it is useful to attend to another word that is related to the current one.
(4) The next step is to multiply each Value vector by its Softmax score (in preparation for summing them up). The intuition is to keep the values of the semantically relevant words intact and to drown out irrelevant words (for example, by multiplying them by a tiny number such as 0.001). The weighted values are then summed, and the result is the output of the self-attention layer at the current position (for the first word in our example).
This completes the calculation of self-attention. The resulting vector can then be passed on to the feedforward neural network. In practice, however, these calculations are done in matrix form to make the calculations faster. So let’s see how we do that with matrices.
(1) The first step is to compute the Query, Key and Value matrices. To do this, we pack the embeddings of the input sentence into a matrix and multiply it by the trained weight matrices ($W^Q$, $W^K$, $W^V$).
In the figure above, each row of the matrix corresponds to one word of the input sentence. Again, we can see the difference in size between the embedding vectors (512, i.e. 4 boxes in the figure) and the q/k/v vectors (64, i.e. 3 boxes in the figure).
(2) Finally, since we are working with matrices, we can condense steps 2 to 4 of the detailed self-attention computation into a single formula that computes the output of the self-attention layer:
$$Attention(Q, K, V) = softmax\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
In summary, in practical applications, to speed up the computation we work with matrices directly: multiply the embedding matrix by the three weight matrices to obtain the Query, Key and Value matrices, multiply Q by $K^T$, divide by a constant, apply softmax, and finally multiply by the V matrix. This way of determining the weight distribution over the Values from the similarity of the Queries and Keys is called scaled dot-product attention.
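A minimal NumPy sketch of scaled dot-product attention in matrix form (not the official implementation; here the projection matrices are written with shape (d_model, d_k) = (512, 64) so that the product X·W is well defined, i.e. the transpose of the (64, 512) shape quoted above):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Matrix form of self-attention for one sequence.
    X:  (seq_len, d_model)   stacked word embeddings
    Wq, Wk, Wv: (d_model, d_k) projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv        # (seq_len, d_k) each
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq_len, seq_len)
    weights = softmax(scores)                # each row sums to 1
    return weights @ V                       # (seq_len, d_k)

# Toy usage matching the dimensions in the text (d_model=512, d_k=64).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 512))
Wq, Wk, Wv = (rng.standard_normal((512, 64)) for _ in range(3))
print(scaled_dot_product_attention(X, Wq, Wk, Wv).shape)  # (5, 64)
```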
4.3 – Headed Attention
By adding a mechanism called "multi-headed" attention, the paper further refines the self-attention layer and improves its performance in two ways:
- It extends the model's ability to focus on different positions. In the example above, the encoding of each word contains a little of every other word's encoding, but it may be dominated by the word itself. When we translate a sentence such as "The animal didn't cross the street because it was too tired" and want to know which word "it" refers to, the model's "multi-headed" attention mechanism comes into play.
- It gives the attention layer multiple "representation subspaces". As we will see next, with multi-headed attention we have multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so each encoder/decoder has eight sets). Each set is randomly initialized, and after training each set is used to project the input word embeddings (or the vectors from lower encoders/decoders) into a different representation subspace.
As the figure above shows, under the "multi-headed" attention mechanism we maintain an independent set of Q/K/V weight matrices for each head, which yields different Q/K/V matrices. As before, we multiply X by the weight matrices to produce the Q/K/V matrices.
If we perform the same self-attention computation as above, just with eight different sets of weight matrices, we obtain eight different output matrices.
This leaves us with a small challenge: the feed-forward layer does not expect eight matrices, it expects a single matrix (one representation vector per word). So we need a way to compress these eight matrices into one. How? We simply concatenate the matrices and multiply the result by an additional weight matrix, as sketched below.
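A minimal sketch of multi-head attention under the same assumptions (eight heads, each with its own randomly initialized projection matrices, concatenated and projected by an additional output matrix):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=8):
    """Multi-head self-attention for one sequence.
    X:  (seq_len, d_model)
    Wq, Wk, Wv: lists of n_heads matrices, each (d_model, d_k)
    Wo: (n_heads * d_k, d_model)  output projection
    """
    heads = []
    for h in range(n_heads):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        d_k = Q.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(weights @ V)                 # (seq_len, d_k)
    # Concatenate the heads and project back to d_model.
    return np.concatenate(heads, axis=-1) @ Wo    # (seq_len, d_model)

# Toy usage: d_model=512, d_k=64, 8 heads.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 512))
Wq = [rng.standard_normal((512, 64)) for _ in range(8)]
Wk = [rng.standard_normal((512, 64)) for _ in range(8)]
Wv = [rng.standard_normal((512, 64)) for _ in range(8)]
Wo = rng.standard_normal((8 * 64, 512))
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)  # (5, 512)
```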
That is almost everything about multi-headed attention. There are many matrices involved, and the figure tries to put them all in one place so that they can be seen at a glance.
Now that we have covered multi-headed attention, let's revisit the earlier example and see where the different attention heads focus when we encode the word "it":
When we encode the word "it", one attention head focuses on "the animal" while another focuses on "tired". In a sense, the model's representation of the word "it" thus incorporates some of the representation of both "animal" and "tired".
However, things become harder to interpret if we draw all the attention heads into the picture:
4.4 The Residuals and Layer normalization
Before moving on, we need to mention one detail of the encoder architecture: each sub-layer (self-attention, feed-forward network) in each encoder has a residual connection around it and is followed by a layer-normalization step.
[1] Ba J L, Kiros J R, Hinton G E. Layer normalization[J]. arXiv preprint arXiv:1607.06450, 2016.
If we visualize these vectors and the layer-norm operation associated with self-attention, it will look something like this:
The sublayer of the decoder does the same thing. If we imagine a Transformer with a 2-layer encoder-decoder structure, it will look something like this:
4.5 Batch Normalization and Layer Normalization
There are many types of normalization, but they all share the same purpose: converting the input into data with zero mean and unit variance. We normalize the data before feeding it into the activation function because we do not want the input to fall into the saturated region of the activation function.
(1) Batch Normalization
The main idea of BN is to normalize each batch of data at every layer. We may normalize the input data, but after passing through a network layer the data is no longer normalized; as this develops, the deviation of the data grows larger and larger. Back-propagation then has to account for these large deviations, which forces us to use a smaller learning rate to prevent vanishing or exploding gradients. Concretely, BN normalizes each mini-batch of data along the batch dimension.
(2) Layer normalization
Layer normalization is also a way of normalizing the data, but LN computes the mean and variance over the features of each individual sample, rather than along the batch dimension as BN does. The formula is as follows:
$$\mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \qquad \sigma^2 = \frac{1}{H}\sum_{i=1}^{H}(x_i-\mu)^2, \qquad LN(x) = \gamma \odot \frac{x-\mu}{\sqrt{\sigma^2+\epsilon}} + \beta$$
where $H$ is the number of hidden units of the sample, and $\gamma$, $\beta$ are learned scale and shift parameters.
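A minimal sketch contrasting the two (my own illustration; the learnable scale and shift parameters γ and β are omitted for brevity):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize over the batch dimension: one mean/variance per feature.
    x: (batch_size, features)"""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    """Normalize over the feature dimension: one mean/variance per sample.
    x: (batch_size, features)"""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.randn(4, 512)
print(layer_norm(x).mean(axis=-1))  # roughly 0 for each sample
print(batch_norm(x).mean(axis=0))   # roughly 0 for each feature
```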
5. Transformer Decoder structure
As the overall structure diagram above shows, the Decoder is basically the same as the Encoder. At the beginning, a Positional Encoding is added, just as in Section 4.1, and the first sub-layer is a masked multi-head attention layer; the Mask is a key technique in the Transformer and is described below. The rest of the layer structure is the same as in the Encoder; see the Encoder description above for details.
5.1 Masked Multi-Head Attention
A mask hides certain values so that they have no effect when the parameters are updated. The Transformer model involves two kinds of mask: the padding mask and the sequence mask. The padding mask is used in every scaled dot-product attention, while the sequence mask is only used in the self-attention of the Decoder.
(1) Padding mask
What is a padding mask? The input sequences in a batch have different lengths; in other words, we need to align them. Specifically, short sequences are padded with zeros, and if an input sequence is too long, the content on the left is kept and the excess is discarded. Because the padded positions are meaningless, the attention mechanism should not focus on them, so we need some extra processing.
To do this, add a very large negative number (negative infinity) to the values at these positions, so that after SoftMax the probabilities of these positions approach 0.
Our padding mask is in fact a tensor in which each value is a Boolean; the positions whose value is false are the ones we need to process.
(2) Sequence mask
The sequence mask is designed to prevent the decoder from seeing future information. That is, for a sequence, at time step t the decoder's output should depend only on the outputs before time t, not on the outputs after t. So we need a way to hide the information after t.
How is this done? It is also simple: generate a triangular matrix whose upper-triangular part (the future positions) is all 0, and apply it to every sequence; this hides the information at future positions.
- For the scaled dot-product attention inside the decoder's self-attention, both the padding mask and the sequence mask are needed as attn_mask; concretely, the two masks are combined (added together) to form attn_mask.
- In all other cases, attn_mask is simply the padding mask. A sketch of both masks is given below.
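A minimal sketch of both masks (my own illustration; the padding id, the shapes, and the boolean-AND combination of the two masks are assumptions that play the same role as adding the masks together):

```python
import numpy as np

def padding_mask(seq, pad_id=0):
    """True where the token is real, False at padded positions.
    seq: (batch, seq_len) integer token ids, assumed padded with pad_id."""
    return seq != pad_id                      # (batch, seq_len) booleans

def sequence_mask(seq_len):
    """Lower-triangular matrix: position i may only attend to j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def apply_mask(scores, mask):
    """Put a very large negative number where mask is False, so that
    softmax pushes those positions' weights toward 0."""
    return np.where(mask, scores, -1e9)

# Toy usage: batch of 1, length 4, last position is padding (id 0).
seq = np.array([[5, 7, 9, 0]])
pad = padding_mask(seq)[:, None, :]           # broadcast over query positions
causal = sequence_mask(4)[None, :, :]         # (1, 4, 4)
attn_mask = pad & causal                      # combined mask for decoder self-attention
scores = np.random.randn(1, 4, 4)
masked_scores = apply_mask(scores, attn_mask)
```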
5.2 Final linear transformation and Softmax layer
When all the Decoder layers have finished, how do we map the resulting vector to the word we need? It is very simple: we only need to add a fully connected layer and a softmax layer at the end. If our vocabulary contains 10,000 words, the final softmax outputs the probabilities of those 10,000 words, and the word with the largest probability is the final result.
In the figure above, we start at the bottom with the output vector produced by the decoder stack, which is then converted into an output word.
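A minimal sketch of this final projection and greedy word choice (all names, such as W_vocab and id_to_word, are hypothetical placeholders for the learned output projection and the vocabulary lookup):

```python
import numpy as np

def decode_step(decoder_output, W_vocab, b_vocab, id_to_word):
    """Project the decoder output to vocabulary logits and pick the
    most probable word (a single greedy step).
    decoder_output: (d_model,)   W_vocab: (d_model, vocab_size)"""
    logits = decoder_output @ W_vocab + b_vocab        # (vocab_size,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                               # softmax
    return id_to_word[int(np.argmax(probs))], probs
```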
6. Transformer dynamic flow chart
The encoder starts by processing the input sequence (this can be done in parallel). The output of the top encoder is then transformed into a set of attention vectors K (the key vectors) and V (the value vectors). These vectors are used by each decoder in its own encoder-decoder attention layer, which helps the decoder focus on the appropriate positions in the input sequence:
After completing the coding phase, the decoding phase begins. Each step in the decoding phase outputs an element of the output sequence (in this case, the English translation of the sentence).
The following steps repeat this process until a special termination symbol is reached, indicating that the Transformer decoder has finished its output. The output of each step is fed to the bottom decoder at the next time step, and the decoders bubble their results upward just as the encoders did. And, just as with the encoder inputs, we embed the decoder inputs and add positional encodings to indicate the position of each word.
The self-attention layers in the decoders behave differently from those in the encoder: in the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. Before the SoftMax step, the future positions are masked out (set to -inf).
The encoder-decoder attention layer works much like multi-headed self-attention, except that it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack.
7. Summary of Transformer training
Now that we have seen the forward process of a trained Transformer, it is useful to take a look at the concept of training.
During training, the model will go through the above forward process, and when we train on the labeled training set, we can compare the predicted output with the actual output.
For visualization, assume that the output vocabulary contains only six words ("a", "am", "i", "thanks", "student", "<eos>").
Figure: The model’s vocabulary is generated during pre-processing prior to training
Once the vocabulary is defined, we can use a vector of the same dimension to represent each word, for example with one-hot encoding; below is the encoding of "am".
Figure: An example of one-hot encoding output word list
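A minimal sketch of one-hot encoding over this six-word example vocabulary:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def one_hot(word, vocab):
    """Return the one-hot vector for `word` over the given vocabulary."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("am", vocab))   # [0. 1. 0. 0. 0. 0.]
```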
Next, let's discuss the model's loss function: the metric optimized during training that guides learning toward a highly accurate model.
8. Loss function
Let's demonstrate training with a simple example: translating "merci" into "thanks". This means we want the output probability distribution to point to the word "thanks", but since the model is untrained and randomly initialized, this is unlikely at first.
Because the model parameters are randomly initialized, the untrained model outputs an essentially random distribution. We can compare it with the true output and then adjust the model weights through error back-propagation so that the output moves closer to the true output.
How do we compare two probability distributions? Simply use cross-entropy or the Kullback-Leibler divergence.
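A minimal sketch of the cross-entropy comparison (the vocabulary index of "thanks" and the predicted values are made up for illustration):

```python
import numpy as np

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution (e.g. a one-hot vector
    for the correct word) and the model's predicted distribution."""
    return -np.sum(target * np.log(predicted + eps))

target = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])      # "thanks" assumed at index 3
predicted = np.array([0.2, 0.2, 0.1, 0.3, 0.1, 0.1])   # untrained-model output
print(cross_entropy(target, predicted))                # lower is better
```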
Since this is an extremely simple example, a more realistic setting uses a whole sentence as input. For example, the input is "Je suis etudiant" and the expected output is "I am a student". In this example, we expect the model to output a sequence of probability distributions satisfying the following conditions:
- Each probability distribution has the same dimension as the word list.
- The first probability distribution has the highest predicted probability value for “I”.
- The second probability distribution has the highest predicted probability value for “am”.
- And so on, until the fifth output distribution points to the "<eos>" token.
Figure: The target probability distributions when training the model on one sample sentence.
After sufficient training time on a sufficiently large training set, we hope the generated probability distributions look like the following:
After training, the model's output is hopefully the translation we expect. Of course, this does not tell us whether the phrase came from the training set. Note that every position receives a little probability mass even when it is almost irrelevant to the output; this is a property of softmax that helps the training process.
Now, because the model produces one output at a time, we can assume that it selects the word with the highest probability from the distribution and throws away the rest; this way of producing predictions is called greedy decoding. Another method is beam search: at each step only the two most probable outputs are kept, the next step is then predicted from each of these two, the two most probable partial outputs are kept again, and this is repeated until decoding ends. The beam size (top_beams) is a hyperparameter that can be tuned experimentally.
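A minimal sketch of beam search with beam size 2 (step_fn is a hypothetical stand-in for running the decoder one step and returning (token, probability) candidates):

```python
import math

def beam_search(step_fn, start_token, end_token, beam_size=2, max_len=10):
    """Minimal beam search.
    step_fn(prefix) -> list of (token, prob) candidates for the next position.
    Keeps the `beam_size` most probable partial sequences at each step."""
    beams = [([start_token], 0.0)]            # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:          # already finished: carry it over unchanged
                candidates.append((seq, score))
                continue
            for token, prob in step_fn(seq):
                candidates.append((seq + [token], score + math.log(prob)))
        # Keep only the `beam_size` best candidates.
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return max(beams, key=lambda x: x[1])[0]  # best-scoring sequence
```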
9. Some questions about Transformer
9.1 Why does the Transformer need multi-head attention? What are its benefits? How is multi-head attention computed?
According to the original paper, the motivation for multi-head attention is to split the model into multiple heads, forming multiple subspaces, so that the model can attend to different aspects of the information and finally combine them. Intuitively, if you designed such a model yourself, you would not compute attention only once either: combining the results of several attention computations can at least strengthen the model, much like using multiple convolution kernels at the same time in a CNN. Put simply, multi-head attention helps the network capture richer features/information. The computation of multi-head attention has already been described in detail above.
9.2 What are the advantages of Transformer over RNN/LSTM? Why is that?
- The RNN family of models has poor parallel computing capability. The problem lies in the fact that the computation at time T depends on the hidden-state result at time T-1, which in turn depends on the result at time T-2, and so on, forming the so-called sequential dependency.
- The Transformer has stronger feature-extraction capability than the RNN family of models. For concrete experimental comparisons, see: Abandon your illusions and fully embrace the Transformer: comparing three feature extractors for NLP (CNN/RNN/TF), Zhang Junlin, address: zhuanlan.zhihu.com/p/54743941.
It is worth noting, however, that the Transformer cannot completely replace RNN models; every model has its own scope of application, and for many tasks RNN models are still a good choice. One should analyze which model fits the task at hand and use it well.
9.3 Why can the Transformer replace Seq2Seq?
- Drawbacks of Seq2Seq: The biggest problem with the Seq2Seq model is that it compresses all of the information on the Encoder side into a fixed-length vector and uses it as the first input hidden state of the Decoder side, to predict the first word (token) of the Decoder output. When the input sequence is rather long, this obviously loses a lot of the Encoder-side information; moreover, the fixed vector is handed to the Decoder all at once, so the Decoder cannot focus on the information it actually cares about. These two points are the shortcomings of the Seq2Seq model. Later papers proposed improvements, most famously "Neural Machine Translation by Jointly Learning to Align and Translate". Although this substantially improved the Seq2Seq model, its parallelism is still limited because the main building blocks are still RNN (LSTM) models.
- Advantages of the Transformer: The Transformer not only improves on these two weak points of the Seq2Seq model, but also introduces the self-attention module, which lets the source sequence and the target sequence first "relate to themselves". The embedding representations of the source and target sequences therefore carry richer information, and the subsequent FFN layers further enhance the model's expressive power. (See: How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures.) Moreover, the Transformer's parallel computing capability is far superior to that of the Seq2Seq model. These are, in my view, the respects in which the Transformer surpasses the Seq2Seq model.
9.4 How does the Transformer parallelize computation?
In my opinion, the parallelism of the Transformer is mainly reflected in the self-attention module. On the Encoder side, the Transformer can process the entire sequence in parallel and obtain the output for the whole input sequence at once. Within the self-attention module, the attention for all positions of a sequence can be computed directly with matrix multiplications, whereas RNN-family models must compute the hidden states one position after another, in order.
10. Reference
【1】The Transformer you should know, produced by Machine Learning Algorithms and Natural Language Processing, address: mp.weixin.qq.com/s/lwAPIdIt9…
【2】Attention? Attention!, Lil'Log, address: https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#whats-wrong-with-seq2seq-model
【3】The Illustrated Transformer, address: jalammar.github.io/illustrated…
【4】Detailed explanation of each layer of the Transformer network structure! Interview essentials! (with code implementation), address: https://mp.weixin.qq.com/s/NPkVJz7u0L4WWD_meZw3MQ
【5】The Illustrated Transformer (complete version), address: blog.csdn.net/longxinchen…
【6】[Translation] The Illustrated Transformer, Zewei Chu's article on Zhihu, address: https://zhuanlan.zhihu.com/p/75591049
【7】Notes on some questions about the Transformer, Adherer's article on Zhihu, address: zhuanlan.zhihu.com/p/82391768
【8】Attention mechanisms in deep learning, address: https://mp.weixin.qq.com/s/swLwla75RIQfyDDCPYynaw