Paper: Attention Is All You Need
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Code: github.com/jadore80112…
Preface
Encoder-Decoder models based on RNNs or CNNs dominate the field of NLP, but they are not perfect:
- RNN models such as LSTM and GRU cannot be parallelized because of their inherently sequential structure, so when sequences are long they are especially inefficient. Recent work such as factorization tricks [1] and conditional computation [2] has improved efficiency and performance to some extent, but the fundamental constraint of sequential computation remains.
- CNN models such as Extended Neural GPU [3], ByteNet [4] and ConvS2S [5] can compute in parallel, but it is still difficult for them to learn dependencies between two arbitrary positions: the number of operations required grows linearly or logarithmically with distance.
Google, however, chose to abandon the recurrent and convolutional structure of mainstream models and proposed the Transformer, based entirely on the attention mechanism, which has advantages the other models cannot match:
- The Transformer can be trained efficiently in parallel, so it is very fast: the paper's big model trained for 3.5 days on 8 GPUs.
- For learning long-distance dependencies, the Transformer reduces the number of operations between any two positions to a constant, and uses multi-head attention to offset the reduction in effective resolution caused by averaging position-weighted information.
- The Transformer encoder is an auto-encoding model that can exploit the context on both sides of a word simultaneously.
The overall structure
The overall structure of the Transformer is an Encoder-Decoder: the auto-encoding part (the encoder) is mainly used for semantic understanding, while the autoregressive part (the decoder) is better suited to generation tasks.
We can divide it into four parts: input, encoder block, decoder block and output.
Let's walk through the structure in order. I hope you look at this diagram carefully before reading the rest of the article and refer back to it as you read.
The input
nn.Embedding is used for word embedding, with $d_{model} = 512$.
The embedding layers on the left (encoder) and right (decoder) sides share the same weights.
The word embedding vector is multiplied by $\sqrt{d_{model}}$, possibly to reduce the relative influence of the positional encoding.
What is fed into the first layer of the network is the word embedding with the positional encoding added in.
The positional encoding is applied after the input embedding and gives the Transformer the ability to capture sequence order.
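To make the input pipeline concrete, here is a minimal sketch of these three steps: embedding lookup, scaling by $\sqrt{d_{model}}$, then adding the positional encoding. The class name TransformerInput and the hyperparameters are my own illustration, not the paper's code; the sinusoidal table is built inline (the full PositionalEncoding implementation appears in the code section at the end).

```python
import math
import torch
import torch.nn as nn

class TransformerInput(nn.Module):
    """Sketch: word embedding scaled by sqrt(d_model), plus sinusoidal positional encoding."""

    def __init__(self, vocab_size, d_model=512, max_len=200):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Fixed sinusoidal table of shape (1, max_len, d_model)
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        x = self.embedding(token_ids) * math.sqrt(self.d_model)  # scale the embeddings
        return x + self.pe[:, :token_ids.size(1)]                # add the positional encoding
```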
Encoder-Decoder
The overall structure is as shown in the figure.
The internal structure of Encoder-Decoder is shown below:
- Encoder: the encoder block is made up of six identical layers, each with two sub-layers.
The first sub-layer is Multi-Head Self-Attention, together with a residual connection and Layer Normalization.
The second sub-layer is a fully connected feed-forward network: $FFN(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$, whose inner layer has dimension 2048. It too has a residual connection and Layer Normalization.
- Decoder: the decoder block also consists of six identical layers, each with residual connections and Layer Normalization.
A third sub-layer is added, namely Masked Multi-Head Attention, which attends over the output of the previous decoder layer and will be explained in detail below.
In addition, the second attention sub-layer is modified (as shown in the figure above) from self-attention to Encoder-Decoder Attention.
Layer Normalization: NLP tasks mainly use Layer Norm rather than Batch Norm, because normalizing across a batch mixes information from different sentences, and what we need is to normalize within each sentence.
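To make the two sub-layers concrete, here is a minimal sketch of one encoder layer following the post-norm ordering described above. The class name EncoderLayer and the use of PyTorch's built-in nn.MultiheadAttention are my own shorthand for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Sketch of one encoder layer: self-attention + FFN, each with residual + LayerNorm."""

    def __init__(self, d_model=512, n_head=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(          # FFN(x) = max(0, xW1 + b1)W2 + b2
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, pad_mask=None):   # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + self.dropout(attn_out))      # sub-layer 1: residual + LayerNorm
        x = self.norm2(x + self.dropout(self.ffn(x)))   # sub-layer 2: residual + LayerNorm
        return x
```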
The output
The decoder output goes through an ordinary linear transformation followed by Softmax, which produces the probability distribution over the output vocabulary.
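A minimal sketch of this output step; vocab_size and the tensor shapes here are purely illustrative.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 37000              # illustrative sizes
generator = nn.Linear(d_model, vocab_size)    # the final linear projection

decoder_output = torch.randn(2, 10, d_model)  # (batch, seq_len, d_model)
logits = generator(decoder_output)            # (batch, seq_len, vocab_size)
probs = torch.softmax(logits, dim=-1)         # probability of each output token
```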
Attention mechanism
Self-Attention
See my other blog, Attention Mechanics, for more details
Scaled dot-product attention
Scaled dot-product attention is shown in the figure below.
The formula is:
$$Attention(Q, K, V) = softmax\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The "scaled dot product" refers to the scoring function.
Common attention scoring functions are the additive model and the dot-product model. The dot-product model is more efficient than the additive model, but when the input dimension is high the dot products tend to have large variance, which pushes the softmax into regions with extremely small gradients; the scaled dot-product model solves this problem well.
Also, Transformer uses residual connections in its implementation
Softmax gradient problem:
We know that the effect of Softmax is to widen the differences between values.
For a set of values $[x, x, 2x]$, let's assign different magnitudes to $x$ and observe how the spread and $S_3$ (the softmax probability of the largest element) change:
```python
import numpy as np

# rows correspond to x = 1, 10, 100
x = np.array([np.exp([i, i, 2 * i]) for i in [1, 10, 100]])

print(np.square(np.linalg.norm(x, axis=1, ord=2)))  # spread of each row (squared L2 norm)
print((x[:, 2] / x.sum(axis=1)).T)                  # S3: softmax probability of the largest element
```
Even when the inputs are simply proportional to one another, at large magnitudes Softmax assigns almost the entire probability mass to the largest element.
The gradient (Jacobian) of Softmax is
$$\frac{\partial S_i}{\partial x_j} = S_i(\delta_{ij} - S_j)$$
Under the conditions above, softmax outputs an approximately one-hot vector $[0, 1, \cdots, 0]$, and the gradient at this point is approximately zero everywhere, so the parameters effectively stop updating.
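A small numpy check of this claim (the helper names are mine, not from the original post): as the inputs grow, the softmax output approaches one-hot and every entry of its Jacobian shrinks toward zero.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_jacobian(s):
    # dS_i/dx_j = S_i * (delta_ij - S_j)
    return np.diag(s) - np.outer(s, s)

for scale in [1, 10, 100]:
    s = softmax(np.array([1.0, 1.0, 2.0]) * scale)
    print(scale, s.round(4), np.abs(softmax_jacobian(s)).max())
# as the scale grows, the output approaches one-hot and the largest Jacobian entry goes to 0
```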
Why does the scaled dot product work?
In a footnote, the paper gives the following reasoning:
If the components of the vectors $q$ and $k$ are independent random variables with mean 0 and variance 1, then their dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has mean 0 and variance $d_k$.
See sections 2.3.5 and 2.3.6 of my other blog on probability theory for details of the reasoning process
We learned a basic property of variance back in sophomore year: for a random variable $Y = aX + b$, $D(Y) = a^2 D(X)$.
So dividing by $\sqrt{d_k}$ brings the variance back down to 1, which effectively alleviates the vanishing gradient.
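A quick numpy sanity check of the footnote's claim, assuming $d_k = 64$ as in the paper:

```python
import numpy as np

d_k = 64
rng = np.random.default_rng(0)
q = rng.standard_normal((10000, d_k))   # components ~ N(0, 1)
k = rng.standard_normal((10000, d_k))

dots = (q * k).sum(axis=1)               # dot products q . k
print(dots.var())                        # ~ d_k = 64 before scaling
print((dots / np.sqrt(d_k)).var())       # ~ 1 after scaling by sqrt(d_k)
```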
Multi-Head Attention
Multi-head attention is shown in the figure below.
Rather than performing a single attention over $d_{model}$-dimensional (here 512-dimensional) Q, K and V, it is better to run several attention heads in parallel, for the following reasons:
- It enhances the model’s ability to focus on different pieces of information
- Provides multiple "representation subspaces" for the attention layer
Specific operation:
For each head, we use a separate set of weight matrices $W^Q$, $W^K$, $W^V$, which project the inputs down to dimension $d_{model}/h$.
The $h$ different attention outputs are then concatenated together.
Finally, a separate weight matrix $W^O$ is applied to obtain the final output of the multi-head attention.
Since the dimension of each head is reduced, the total cost of multi-head attention is similar to that of a single attention over the full dimension, as sketched below.
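A rough sketch of the shape bookkeeping for splitting into heads and concatenating them back (sizes are illustrative; the full implementation appears in the code section at the end):

```python
import torch

batch, seq_len, d_model, n_head = 2, 10, 512, 8
d_k = d_model // n_head                  # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)
w_q = torch.nn.Linear(d_model, n_head * d_k, bias=False)   # all heads projected at once

q = w_q(x)                                                   # (2, 10, 512)
q = q.view(batch, seq_len, n_head, d_k).transpose(1, 2)      # (2, 8, 10, 64): split into heads
# ... scaled dot-product attention runs independently per head here ...
q = q.transpose(1, 2).contiguous().view(batch, seq_len, -1)  # (2, 10, 512): heads concatenated
```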
Relation with convolution:
We can see that multi-head attention is actually quite similar to convolution.
Just as multiple heads can attend to different information, different convolution kernels extract different features from an image.
Similarly, just as there is information redundancy across the channels of a feature map, there is also redundancy across the attention heads.
Positional encoding
Main reference
Why do we need positional encoding?
As mentioned above, the Transformer computes everything in parallel and has no built-in notion of order. To let the model capture the order of the sequence, positional encoding is introduced so that the relative distance between words can be recovered.
Sine and cosine positional encoding
The odd dimensions are encoded with a cosine function:
$$PE_{(pos,\ 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
The even dimensions are encoded with a sine function:
$$PE_{(pos,\ 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Note: $2i$ and $2i+1$ index the dimension within the word vector, while $pos$ is the position of the word in the sentence.
For example, if a word is at position $pos = 5$ in the sentence and $d_{model} = 512$, its positional encoding vector is
$$PE_5 = \left[\sin\!\left(\tfrac{5}{10000^{0/512}}\right),\ \cos\!\left(\tfrac{5}{10000^{0/512}}\right),\ \sin\!\left(\tfrac{5}{10000^{2/512}}\right),\ \cos\!\left(\tfrac{5}{10000^{2/512}}\right),\ \cdots\right]$$
You can see that $2i$ versus $2i+1$ only determines whether sine or cosine is used; for the same $i$, the argument inside is the same.
Once the positional encoding is obtained, it is added to the word embedding as the final input.
The intuition is that, with positional encoding added to the word vectors, they will provide meaningful distance information once projected into Q/K/V and dotted together.
Why is positional encoding effective?
We learned the trigonometric sum formulas back in school:
$$\sin(\alpha + \beta) = \sin\alpha\cos\beta + \cos\alpha\sin\beta, \qquad \cos(\alpha + \beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta$$
Applying them to the definition above, we get:
$$PE_{(pos+k,\ 2i)} = PE_{(pos,\ 2i)}PE_{(k,\ 2i+1)} + PE_{(pos,\ 2i+1)}PE_{(k,\ 2i)}$$
$$PE_{(pos+k,\ 2i+1)} = PE_{(pos,\ 2i+1)}PE_{(k,\ 2i+1)} - PE_{(pos,\ 2i)}PE_{(k,\ 2i)}$$
Letting $u(k) = PE_{(k,\ 2i)}$ and $v(k) = PE_{(k,\ 2i+1)}$, this can be written as:
$$\begin{pmatrix} PE_{(pos+k,\ 2i)} \\ PE_{(pos+k,\ 2i+1)} \end{pmatrix} = \begin{pmatrix} v(k) & u(k) \\ -u(k) & v(k) \end{pmatrix} \begin{pmatrix} PE_{(pos,\ 2i)} \\ PE_{(pos,\ 2i+1)} \end{pmatrix}$$
That is, for a given relative distance $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$.
Therefore, by encoding the absolute position, the model can also capture the relative positions of words.
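A small numpy verification of this linear (rotation-like) relationship; the pe() helper and $d_{model} = 8$ are just for illustration.

```python
import numpy as np

d_model = 8

def pe(pos):
    # sinusoidal positional encoding for a single position
    out = np.zeros(d_model)
    for i in range(d_model // 2):
        c = 1.0 / 10000 ** (2 * i / d_model)
        out[2 * i] = np.sin(c * pos)
        out[2 * i + 1] = np.cos(c * pos)
    return out

pos, k = 5, 3
u, v = pe(k)[0::2], pe(k)[1::2]                     # u(k) = PE(k, 2i), v(k) = PE(k, 2i+1)
pred_sin = v * pe(pos)[0::2] + u * pe(pos)[1::2]    # predicted PE(pos+k, 2i)
pred_cos = -u * pe(pos)[0::2] + v * pe(pos)[1::2]   # predicted PE(pos+k, 2i+1)
print(np.allclose(pred_sin, pe(pos + k)[0::2]))     # True
print(np.allclose(pred_cos, pe(pos + k)[1::2]))     # True
```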
Going further
There is no doubt that positional encoding plays a huge role in the Transformer as a whole.
A Transformer without positional encoding is just a giant bag-of-words model.
Let's look at the limitations of the sine and cosine positional encoding.
Directionality of relative distance
As we know, the dot product can reflect relative distance; in the attention mechanism the dot product is used as the scoring function that measures the similarity of Q and K. So let's look at the dot product of the encodings of two positions that are a relative distance $k$ apart.
For $PE_{pos}$, let $c_i = \frac{1}{10000^{2i/d}}$.
The inner product is then:
$$PE_{pos} \cdot PE_{pos+k} = \sum_{i} \big[\sin(c_i\, pos)\sin(c_i (pos+k)) + \cos(c_i\, pos)\cos(c_i (pos+k))\big] = \sum_{i} \cos(c_i\, k)$$
The cosine is an even function, so this inner product depends only on $|k|$: the sine and cosine positional encoding can capture the distance between two words, but it cannot tell the direction, i.e. which of the two comes first.
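A quick check of this symmetry, reusing the same kind of pe() helper as in the previous snippet (illustrative only):

```python
import numpy as np

d_model = 8

def pe(pos):
    # same sinusoidal encoding as above
    out = np.zeros(d_model)
    for i in range(d_model // 2):
        c = 1.0 / 10000 ** (2 * i / d_model)
        out[2 * i], out[2 * i + 1] = np.sin(c * pos), np.cos(c * pos)
    return out

pos, k = 20, 7
print(pe(pos) @ pe(pos + k))   # same value ...
print(pe(pos) @ pe(pos - k))   # ... whether the other word is k positions after or before
```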
The effect of self-attention on positional encoding
In the Transformer, self-attention is computed after the positional encoding has been added, i.e. the dot product is taken between $(x_{pos} + PE_{pos})W^Q$ and $(x_{pos'} + PE_{pos'})W^K$.
Once the projection matrices $W^Q$ and $W^K$ (and the word embeddings themselves) enter the dot product, the clean relative-distance property derived above is no longer guaranteed, so in theory the model cannot fully retain the positional information between words after self-attention.
So why does the Transformer still work?
In BERT, a Learned Position Embedding is used instead of the sinusoidal positional encoding.
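This is not BERT's actual source code, just a sketch of the idea of a learned positional embedding: one trainable vector per position, added to the token embeddings. The sizes max_len=512 and d_model=768 follow BERT-base's published configuration.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """BERT-style positional embedding: one trainable vector per position."""

    def __init__(self, max_len=512, d_model=768):
        super().__init__()
        self.pos_embedding = nn.Embedding(max_len, d_model)   # learned, updated by the optimizer

    def forward(self, x):                       # x: (batch, seq_len, d_model) token embeddings
        positions = torch.arange(x.size(1), device=x.device)  # 0, 1, ..., seq_len - 1
        return x + self.pos_embedding(positions)               # add a learned vector per position
```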
The code
Positional encoding
```python
import numpy as np
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):

    def __init__(self, d_hid, n_position=200):
        super(PositionalEncoding, self).__init__()

        # Not a parameter: register a constant buffer named pos_table.
        # It can be accessed via self.pos_table and is saved/loaded with the model,
        # but it is not updated by the optimizer.
        self.register_buffer(
            'pos_table',
            self._get_sinusoid_encoding_table(n_position, d_hid))

    def _get_sinusoid_encoding_table(self, n_position, d_hid):
        ''' Sinusoid position encoding table '''
        # TODO:

        def get_position_angle_vec(position):
            # pos / 10000^(2i / d_hid) for each dimension
            return [position / np.power(10000, 2 * (hid_j // 2) / d_hid)
                    for hid_j in range(d_hid)]

        sinusoid_table = np.array(
            [get_position_angle_vec(pos_i) for pos_i in range(n_position)])
        sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])  # dim 2i
        sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])  # dim 2i+1

        return torch.FloatTensor(sinusoid_table).unsqueeze(0)

    def forward(self, x):
        # Add the fixed positional encoding to the word embeddings
        return x + self.pos_table[:, :x.size(1)].clone().detach()
```
Scaled dot-product attention and multi-head attention
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledDotProductAttention(nn.Module):
    ''' Scaled Dot-Product Attention '''

    def __init__(self, temperature, attn_dropout=0.1):
        super().__init__()
        self.temperature = temperature          # sqrt(d_k), the scaling factor
        self.dropout = nn.Dropout(attn_dropout)

    def forward(self, q, k, v, mask=None):
        # (b, n_head, len_q, d_k) x (b, n_head, d_k, len_k) -> (b, n_head, len_q, len_k)
        attn = torch.matmul(q / self.temperature, k.transpose(2, 3))

        if mask is not None:
            # Set masked positions to -1e9 so softmax gives them ~0 weight
            attn = attn.masked_fill(mask == 0, -1e9)

        attn = self.dropout(F.softmax(attn, dim=-1))
        output = torch.matmul(attn, v)

        return output, attn


class MultiHeadAttention(nn.Module):
    ''' Multi-Head Attention module '''

    def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1):
        # d_model = 512
        super().__init__()

        self.n_head = n_head
        self.d_k = d_k
        self.d_v = d_v

        # The per-head weight matrices are initialized as one big fully connected layer
        # and split into heads afterwards.
        self.w_qs = nn.Linear(d_model, n_head * d_k, bias=False)
        self.w_ks = nn.Linear(d_model, n_head * d_k, bias=False)
        self.w_vs = nn.Linear(d_model, n_head * d_v, bias=False)
        self.fc = nn.Linear(n_head * d_v, d_model, bias=False)   # W^O

        self.attention = ScaledDotProductAttention(temperature=d_k ** 0.5)

        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)

    def forward(self, q, k, v, mask=None):
        d_k, d_v, n_head = self.d_k, self.d_v, self.n_head
        sz_b, len_q, len_k, len_v = q.size(0), q.size(1), k.size(1), v.size(1)

        residual = q

        # Pass through the pre-attention projection: b x lq x (n*dv)
        # Separate different heads: b x lq x n x dv
        q = self.w_qs(q).view(sz_b, len_q, n_head, d_k)
        k = self.w_ks(k).view(sz_b, len_k, n_head, d_k)
        v = self.w_vs(v).view(sz_b, len_v, n_head, d_v)

        # Transpose for attention dot product: b x n x lq x dv
        # (the head dimension moves to the second axis, similar to a channel dimension)
        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)

        if mask is not None:
            mask = mask.unsqueeze(1)   # For head axis broadcasting.

        q, attn = self.attention(q, k, v, mask=mask)

        # Transpose to move the head dimension back: b x lq x n x dv
        # Combine the last two dimensions to concatenate all the heads: b x lq x (n*dv)
        q = q.transpose(1, 2).contiguous().view(sz_b, len_q, -1)
        q = self.dropout(self.fc(q))   # self.fc corresponds to the W^O projection above
        q += residual

        q = self.layer_norm(q)

        return q, attn
```
Formatting code blocks on Juejin is too laborious, so I'll skip the rest, haha.
Appendix
[1] Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks.
[2] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton,and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.
[3] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In Advances in Neural Information Processing Systems.
[4] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time.
[5] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning.