
Paper: Attention Is All You Need

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

Code: github.com/jadore80112…

Preface

Encoder-Decoder models based on RNNs or CNNs dominate the field of NLP, but they are not perfect:

  • RNN models such as LSTM and GRU cannot be parallelized because of their inherently sequential structure, so computation is especially inefficient when sequences are long. Recent work such as factorization tricks [1] and conditional computation [2] has improved efficiency and performance to some extent, but the limitation of sequential computation remains.
  • Although CNN models such as Extended Neural GPU [3], ByteNet [4] and ConvS2S [5] can compute in parallel, it is still difficult for them to learn dependencies between two arbitrary positions: the number of operations grows linearly or logarithmically with distance.

Google, however, chose to abandon the recurrent/convolutional structure of mainstream models and proposed the Transformer, based entirely on the attention mechanism, which has advantages other models cannot match:

  • Transformer can be trained efficiently in parallel, so training is very fast: it was trained for 3.5 days on 8 GPUs.
  • For learning long-distance dependencies, Transformer reduces the number of operations between any two positions to a constant, and uses multi-head attention to offset the reduced effective resolution caused by averaging over positions.
  • Transformer (on the encoder side) is an auto-encoding model that can exploit context from both directions simultaneously.

The overall structure

The overall structure of the Transformer is an Encoder-Decoder. Auto-encoding models are mainly used for semantic understanding, while auto-regressive models have the advantage for generation tasks.

We can divide it into four parts: input, encoder block, decoder block and output.

Let’s go through the structure in order. I hope you look at this diagram carefully before reading the rest of the article and refer back to it as you read.

The input

nn.Embedding is used for word embedding, with $d_{model}=512$.

The embedding layers on the encoder and decoder sides share the same weights.

The word embedding vector is multiplied by $\sqrt{d_{model}}$, possibly to reduce the relative influence of the positional encoding.

The sum of the word embedding and the positional encoding forms the input to the first layer of the network.

Positional encoding is applied after the input, giving the Transformer the ability to capture sequence order.
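As a minimal sketch of this input pipeline: the vocabulary size and maximum length below are illustrative, and the zero tensor is just a placeholder for the sinusoidal table discussed later.

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 512, 32000, 200        # vocab_size/max_len are illustrative

embedding = nn.Embedding(vocab_size, d_model)
pos_table = torch.zeros(1, max_len, d_model)           # placeholder for the sinusoid table

def encode_inputs(tokens):
    # tokens: (batch, seq_len) integer ids
    x = embedding(tokens) * math.sqrt(d_model)          # scale word embeddings by sqrt(d_model)
    return x + pos_table[:, :tokens.size(1)]            # add the positional encoding
```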

Encoder-Decoder

The overall structure is shown in the figure.

The internal structure of Encoder-Decoder is shown below:

  • Encoder: the encoder block is made up of six identical layers, each with two sub-layers.

    The first sub-layer is multi-head self-attention, followed by layer normalization and a residual connection.

    The second sub-layer is a fully connected position-wise feed-forward network: $FFN(x)=\max(0,\ xW_1+b_1)W_2+b_2$, with an inner dimension of 2048, again followed by layer normalization and a residual connection (a rough sketch of both sub-layers is given below).

  • Decoder: the decoder block also consists of six identical layers, each sub-layer again wrapped with a residual connection and layer normalization.

    A third sub-layer is added: masked multi-head attention, which operates over the decoder's own previous outputs and will be explained in detail below.

    In addition, the second attention sub-layer is modified (as shown in the figure above) from self-attention to encoder-decoder attention.

Layer Normalization: NLP tasks mainly use layer norm rather than batch norm, because normalizing over a batch mixes information across different sentences, while what we want is to normalize within each sentence.
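To make the two encoder sub-layers concrete, here is a rough sketch of one encoder layer using PyTorch's built-in multi-head attention (post-norm, as in the figure); it is an illustration under these assumptions, not the paper's original implementation.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Sketch of one encoder layer: self-attention + position-wise FFN,
    each followed by a residual connection and layer normalization."""

    def __init__(self, d_model=512, n_head=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_head,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(           # FFN(x) = max(0, x W1 + b1) W2 + b2
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                   # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)        # residual + layer norm
        x = self.norm2(x + self.ffn(x))     # residual + layer norm
        return x
```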

The output

The decoder output passes through an ordinary linear transformation followed by a softmax, producing the output probabilities.
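A minimal sketch of that output step (the vocabulary size is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 32000            # vocab_size is illustrative

generator = nn.Linear(d_model, vocab_size)

def output_probabilities(decoder_out):
    # decoder_out: (batch, seq_len, d_model)
    logits = generator(decoder_out)          # linear projection onto the vocabulary
    return F.softmax(logits, dim=-1)         # probabilities for the next token
```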

Attention mechanism

Self-Attention

See my other blog post on the attention mechanism for more details.

Scaled dot-product attention

Scaled dot-product attention is shown as follows:

The formula:

$$Attention(Q,K,V)=softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The "scaled dot product" refers to the scoring function

$$\frac{QK^T}{\sqrt{d}}$$

Common attention scoring functions are the additive model and the dot-product model. The dot-product model is more efficient than the additive model, but when the input dimension is high the dot products tend to have large variance, which pushes the softmax into regions with very small gradients; scaled dot-product attention solves this problem well.

Also, Transformer uses residual connections in its implementation

Softmax gradient problem:


$$S_i=\frac{e^{x_i}}{\sum_{j=1}^N e^{x_j}}$$

We know that softmax's role is to widen the gaps between values.

For a set of values $[x, x, 2x]$, let's assign different magnitudes to $x$ and observe how the spread and $S_3$ change.

```python
import numpy as np

x = np.array([np.exp([i, i, 2 * i]) for i in [1, 10, 100]])
print(np.square(np.linalg.norm(x, axis=1, ord=2)))   # S: a rough measure of the spread
print((x[:, 2] / x.sum(axis=1)).T)                    # S_3
```

$$\begin{cases} x=1 \quad S=6.938 \quad S_3=0.576 \\ x=10 \quad S=2.253\times10^{17} \quad S_3=0.999 \\ x=100 \quad S=5.221\times10^{173} \quad S_3=1.0 \end{cases}$$

Even though the values stay in the same proportion, at large magnitudes softmax assigns almost the entire probability mass to the largest one.

The gradient of Softmax is


$$\frac{\partial S(x)}{\partial x}= \left[\begin{array}{cccc} y_1 & 0 & \cdots & 0\\ 0 & y_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & y_d \end{array}\right]- \left[\begin{array}{cccc} y_1^2 & y_1y_2 & \cdots & y_1y_d\\ y_2y_1 & y_2^2 & \cdots & y_2y_d\\ \vdots & \vdots & \ddots & \vdots\\ y_dy_1 & y_dy_2 & \cdots & y_d^2 \end{array}\right]$$

In the situation above, softmax outputs an approximately one-hot vector $[1,0,\cdots,0]$, and the gradient becomes


$$\frac{\partial S(x)}{\partial x}= \left[\begin{array}{cccc} 1 & 0 & \cdots & 0\\ 0 & 0 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 0 \end{array}\right]- \left[\begin{array}{cccc} 1^2 & 0 & \cdots & 0\\ 0 & 0 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 0 \end{array}\right]=0$$
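A quick numerical illustration of this saturation (a sketch; `softmax_jacobian` is just the $diag(y)-yy^T$ expression above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_jacobian(x):
    y = softmax(x)
    return np.diag(y) - np.outer(y, y)       # diag(y) - y y^T, as above

print(np.abs(softmax_jacobian(np.array([1.0, 1.0, 2.0]))).max())        # ~0.24: a usable gradient
print(np.abs(softmax_jacobian(np.array([100.0, 100.0, 200.0]))).max())  # ~0: the gradient vanishes
```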

Why does the scaled dot product work?

In a footnote, the paper gives the following explanation:

If the components of the vectors $q$ and $k$ are independent random variables with mean 0 and variance 1, then their dot product $q\cdot k=\sum_{i=1}^{d_k}q_ik_i$ has mean 0 and variance $d_k$.

See sections 2.3.5 and 2.3.6 of my other blog on probability theory for details of the reasoning process
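In short, under the zero-mean, unit-variance, independence assumptions above, the variance adds up across the $d_k$ components:

$$\mathrm{Var}(q\cdot k)=\sum_{i=1}^{d_k}\mathrm{Var}(q_ik_i)=\sum_{i=1}^{d_k}\mathrm{Var}(q_i)\,\mathrm{Var}(k_i)=d_k$$

(the middle equality uses the fact that the components are independent with zero mean).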

We learned a basic property of variance in sophomore year: for a random variable $Y = aX + b$,


$$\sigma_Y^2=a^2\sigma_X^2$$

So dividing by $\sqrt{d_k}$ brings the variance back to 1, which effectively avoids the vanishing softmax gradient.
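A quick empirical check of this (a sketch; the components are simulated as i.i.d. standard normals):

```python
import numpy as np

# Check that q·k has variance ~ d_k, and that dividing by sqrt(d_k) brings it back to ~1.
rng = np.random.default_rng(0)
d_k, n_samples = 64, 100_000

q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))
dots = (q * k).sum(axis=1)

print(dots.var())                       # ~ 64
print((dots / np.sqrt(d_k)).var())      # ~ 1
```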

Multi-Head Attention

Multi-head attention is shown below.

Rather than performing a single attention over $d_{model}$-dimensional (512 here) Q, K and V, it is better to project them several times and perform attention in parallel, for the following reasons:

  • It enhances the model’s ability to focus on different pieces of information
  • It provides the attention layer with multiple "representation subspaces"

Specific operation:

For each head, we use a separate set of weight matrices $W_Q$, $W_K$, $W_V$ that project the inputs down to $d_{model}/h$ dimensions.

The $h$ different attention outputs are then concatenated.

Finally, a separate weight matrix $W^O$ projects the concatenation to obtain the final attention output.


$$MultiHead(Q,K,V)=Concat(head_1,head_2,\cdots,head_h)W^O\\ \text{where}\quad head_i=Attention(QW^Q_i,KW^K_i,VW^V_i)$$

Since each head's dimension is reduced, the total cost of multi-head attention is similar to that of single-head attention with the full dimensionality.
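A small shape check with the paper's sizes ($d_{model}=512$, $h=8$, so $d_k=d_v=64$); the variable names here are illustrative:

```python
import torch
import torch.nn as nn

batch, seq_len, d_model, n_head = 2, 10, 512, 8
d_k = d_model // n_head                              # 64

x = torch.randn(batch, seq_len, d_model)
w_q = nn.Linear(d_model, n_head * d_k, bias=False)   # one projection, split into heads later

q = w_q(x)                                           # (2, 10, 512)
q = q.view(batch, seq_len, n_head, d_k)              # (2, 10, 8, 64): split into heads
q = q.transpose(1, 2)                                # (2, 8, 10, 64): head axis forward
print(q.shape)
```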

Relation with convolution:

We can see that multi-head attention is actually similar to convolution.

Just as multiple heads can notice different information, different convolution kernels can extract different features in the image

Similarly, just as there is information redundancy across the channels of a feature map, there is also redundancy across attention heads.

Positional encoding

Main reference

Why do we need positional encoding?

As mentioned above, the Transformer processes all positions in parallel. For the model to capture the order of the sequence, positional encoding is introduced so that the relative distance between words can be obtained.

Sine and cosine positional encoding


$$PE(pos,2i) = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)\qquad PE(pos,2i+1) = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$

Odd dimensions are encoded with the cosine function.

Even dimensions are encoded with the sine function.

Note: $2i$ and $2i+1$ refer to positions within the encoding vector, while $pos$ is the position of the word in the sentence.

For example, if the position of a word in the sentence is $pos=5$ and $d_{model}=512$, then its positional encoding vector is


$$\left[\begin{array}{c}\sin\left(\frac{5}{10000^{\frac{0}{512}}}\right)\\ \cos\left(\frac{5}{10000^{\frac{0}{512}}}\right)\\ \sin\left(\frac{5}{10000^{\frac{2}{512}}}\right)\\ \cos\left(\frac{5}{10000^{\frac{2}{512}}}\right)\\ \vdots\\ \cos\left(\frac{5}{10000^{\frac{510}{512}}}\right)\end{array}\right]$$

You can see that $2i$ vs $2i+1$ only determines whether sine or cosine is used; for the same $i$, the argument inside is the same.

Once the positional encoding is obtained, it is added to the word embedding to form the final input.

The intuition is that adding positional encodings to the word vectors lets them carry meaningful distance information once they are projected into Q/K/V and dot products are taken.

Why is positional encoding effective?

We learned the trigonometric angle-addition formulas back in school:


$$\sin(\alpha+\beta)=\sin(\alpha)\cos(\beta)+\cos(\alpha)\sin(\beta)\\ \cos(\alpha+\beta)=\cos(\alpha)\cos(\beta)-\sin(\alpha)\sin(\beta)$$

We can get:


$$PE(pos+k,2i)=PE(pos,2i)PE(k,2i+1)+PE(pos,2i+1)PE(k,2i)\\ PE(pos+k,2i+1)=PE(pos,2i+1)PE(k,2i+1)-PE(pos,2i)PE(k,2i)$$

Let $u(k)=PE(k,2i)$ and $v(k)=PE(k,2i+1)$; then:


$$\left[\begin{array}{c}PE(pos+k,2i)\\ PE(pos+k,2i+1)\end{array}\right]= \left[\begin{array}{cc}v(k) & u(k)\\ -u(k) & v(k)\end{array}\right] \left[\begin{array}{c}PE(pos,2i)\\ PE(pos,2i+1)\end{array}\right]$$

Given a relative distance $k$, $PE(pos+k)$ has a linear relationship with $PE(pos)$.

Therefore, the model can better capture the relative position of words by encoding absolute position
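A small numerical check of this linear relationship for a single frequency (illustrative; the names are mine, not from the paper):

```python
import numpy as np

d_model, i = 512, 3
c_i = 1.0 / 10000 ** (2 * i / d_model)

def pe_pair(pos):
    # (PE(pos, 2i), PE(pos, 2i+1)) for this one frequency
    return np.array([np.sin(c_i * pos), np.cos(c_i * pos)])

pos, k = 5, 7
u, v = np.sin(c_i * k), np.cos(c_i * k)          # u(k) = PE(k,2i), v(k) = PE(k,2i+1)
rotation = np.array([[v, u], [-u, v]])            # depends only on k, not on pos

print(pe_pair(pos + k))                           # direct computation
print(rotation @ pe_pair(pos))                    # same values via the linear map
```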

Going further

There is no doubt that positional encoding plays a huge role in the overall Transformer.

A Transformer without positional encoding is just a giant bag-of-words model.

Let’s look at the limitations of sine and cosine positional encoding.

Directionality of relative distance

As we know, the dot product can reflect relative distance; in the attention mechanism it is used as the scoring function to measure the similarity of Q and K. Let's look at the dot product between the encodings of two positions whose relative distance is $k$.

For $PE_{pos}$, let $c_i=\frac{1}{10000^{\frac{2i}{d}}}$:


$$\begin{aligned} PE_{pos}&=\left[\begin{array}{c}PE(pos,0)\\ PE(pos,1)\\ PE(pos,2)\\ \vdots\\ PE(pos,d)\end{array}\right] =\left[\begin{array}{c}\sin(c_0\,pos)\\ \cos(c_0\,pos)\\ \sin(c_1\,pos)\\ \vdots\\ \cos(c_{\frac{d}{2}-1}\,pos)\end{array}\right] \end{aligned}$$

Taking the inner product gives:


$$\begin{aligned}PE_{pos}^T PE_{pos+k}&=\sum_{i=0}^{\frac{d}{2}-1}\left[\sin(c_i\,pos)\sin(c_i(pos+k))+\cos(c_i\,pos)\cos(c_i(pos+k))\right]\\ &=\sum_{i=0}^{\frac{d}{2}-1}\cos(c_i(pos+k-pos))\\ &=\sum_{i=0}^{\frac{d}{2}-1}\cos(c_ik)\end{aligned}$$

The cosine is an even function, so sine-cosine positional encoding can only capture how far apart two words are, but not the direction of that distance.
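A quick check of this symmetry (illustrative):

```python
import numpy as np

d_model = 512
c = 1.0 / 10000 ** (2 * np.arange(d_model // 2) / d_model)

def pe(pos):
    enc = np.empty(d_model)
    enc[0::2] = np.sin(c * pos)   # even dimensions
    enc[1::2] = np.cos(c * pos)   # odd dimensions
    return enc

pos, k = 50, 7
print(pe(pos) @ pe(pos + k))      # ~ sum_i cos(c_i * k)
print(pe(pos) @ pe(pos - k))      # exactly the same: the direction is lost
```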

The effect of self-attention on positional encoding

In the Transformer, self-attention is computed after the positional encoding is added, with the following formula:


$$score(x_i)=\frac{(x_iW_Q)(x_iW_K)^T}{\sqrt{d}}=\frac{\left((x_i^{position}+x_i^{word})W_Q\right)\left((x_i^{position}+x_i^{word})W_K\right)^T}{\sqrt{d}}$$

It can be seen that once the projections $W_Q$ and $W_K$ are applied in self-attention, the clean position-position dot product derived above is mixed with the word terms, so the model cannot really retain the positional information between words.

So how does Transformer work?

In BERT, learned position embeddings are used instead of sinusoidal positional encoding.
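For contrast, a minimal sketch of BERT-style learned position embeddings (a trainable nn.Embedding indexed by position; the names here are illustrative):

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Learned (BERT-style) position embeddings: trained like any other weight."""

    def __init__(self, max_len, d_model):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model) word embeddings
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_emb(positions)   # broadcasts over the batch dimension
```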

The code

Positional encoding

```python
import numpy as np
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):

    def __init__(self, d_hid, n_position=200):
        super(PositionalEncoding, self).__init__()

        # Not a parameter: register a constant buffer named pos_table.
        # It is accessible as self.pos_table and is saved/loaded with the model,
        # but it is never updated by the optimizer.
        self.register_buffer(
            'pos_table',
            self._get_sinusoid_encoding_table(n_position, d_hid))

    def _get_sinusoid_encoding_table(self, n_position, d_hid):
        ''' Sinusoid position encoding table '''

        def get_position_angle_vec(position):
            # pos / 10000^(2i / d_model) for every dimension of one position
            return [position / np.power(10000, 2 * (hid_j // 2) / d_hid)
                    for hid_j in range(d_hid)]

        sinusoid_table = np.array(
            [get_position_angle_vec(pos_i) for pos_i in range(n_position)])
        sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])  # dim 2i
        sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])  # dim 2i+1

        return torch.FloatTensor(sinusoid_table).unsqueeze(0)

    def forward(self, x):
        # Add the (non-trainable) positional encoding to the word embeddings
        return x + self.pos_table[:, :x.size(1)].clone().detach()
```

Scaled dot-product attention and multi-head attention

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledDotProductAttention(nn.Module):
    ''' Scaled Dot-Product Attention '''

    def __init__(self, temperature, attn_dropout=0.1):
        super().__init__()
        self.temperature = temperature          # sqrt(d_k), the scaling factor
        self.dropout = nn.Dropout(attn_dropout)

    def forward(self, q, k, v, mask=None):
        attn = torch.matmul(q / self.temperature, k.transpose(2, 3))

        if mask is not None:
            # Positions where mask == 0 get -1e9, so softmax gives them ~0 weight
            attn = attn.masked_fill(mask == 0, -1e9)

        attn = self.dropout(F.softmax(attn, dim=-1))
        output = torch.matmul(attn, v)

        return output, attn


class MultiHeadAttention(nn.Module):
    ''' Multi-Head Attention module '''

    def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1):
        # d_model = 512
        super().__init__()

        self.n_head = n_head
        self.d_k = d_k
        self.d_v = d_v

        # The per-head weight matrices are implemented as one big linear layer;
        # the result is split into heads afterwards
        self.w_qs = nn.Linear(d_model, n_head * d_k, bias=False)
        self.w_ks = nn.Linear(d_model, n_head * d_k, bias=False)
        self.w_vs = nn.Linear(d_model, n_head * d_v, bias=False)
        self.fc = nn.Linear(n_head * d_v, d_model, bias=False)   # this plays the role of W^O

        self.attention = ScaledDotProductAttention(temperature=d_k ** 0.5)

        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)

    def forward(self, q, k, v, mask=None):

        d_k, d_v, n_head = self.d_k, self.d_v, self.n_head
        sz_b, len_q, len_k, len_v = q.size(0), q.size(1), k.size(1), v.size(1)

        residual = q

        # Pass through the pre-attention projection: b x lq x (n*dv)
        # Separate different heads: b x lq x n x dv
        q = self.w_qs(q).view(sz_b, len_q, n_head, d_k)
        k = self.w_ks(k).view(sz_b, len_k, n_head, d_k)
        v = self.w_vs(v).view(sz_b, len_v, n_head, d_v)

        # Transpose for attention dot product: b x n x lq x dv
        # (the head axis is moved forward, similar to channels)
        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)

        if mask is not None:
            mask = mask.unsqueeze(1)   # For head axis broadcasting.

        q, attn = self.attention(q, k, v, mask=mask)

        # Transpose to move the head dimension back: b x lq x n x dv
        # Combine the last two dimensions to concatenate all the heads: b x lq x (n*dv)
        # contiguous() makes the memory layout contiguous after the transpose
        q = q.transpose(1, 2).contiguous().view(sz_b, len_q, -1)
        q = self.dropout(self.fc(q))   # the projection by W^O mentioned above
        q += residual

        q = self.layer_norm(q)

        return q, attn
```

The Juejin code blocks are too laborious to format, so I'll skip the rest of the code here.

The appendix

[1]  Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks.

[2]  Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton,and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.

[3]  Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In Advances in Neural Information Processing Systems

[4]  Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time

[5]  Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning