The Transformer is a classic NLP model proposed by Google's team in 2017, and the now-popular BERT is also built on the Transformer. The Transformer uses the self-attention mechanism instead of the sequential structure of an RNN, so the model can be trained in parallel and can make use of global information.

1. The structure of the Transformer

First, the overall structure of the Transformer is introduced. The following figure shows the overall structure of the Transformer used for Chinese-to-English translation.

The Transformer consists of two parts, an Encoder and a Decoder, each of which contains six blocks. The Transformer's workflow is as follows:

Step 1: Obtain the representation vector X of each word of the input sentence. X is obtained by adding the word Embedding and the position Embedding of the word.

Step 2: Feed the resulting word representation matrix (in the figure above, each row is the representation vector X of one word) into the Encoder. After the 6 Encoder blocks, the encoding information matrix C of all the words in the sentence is obtained, as shown in the figure below. The word vector matrix is denoted X (n×d), where n is the number of words in the sentence and d is the dimension of the vectors (d = 512 in the paper). The output matrix of each Encoder block has exactly the same dimensions as its input.

Step 3: Pass the encoding information matrix C output by the Encoder to the Decoder, which translates word i+1 in turn based on the already translated words 1 to i, as shown in the figure below. During this process, when translating word i+1, the words after position i must be hidden by the Mask operation.

In the figure above, the Decoder receives the Encoder's encoding matrix C, then first takes the translation start symbol "<Begin>" as input and predicts the first word "I"; it then takes "<Begin>" and the word "I" as input and predicts the word "have", and so on. This is how the Transformer works overall; the individual parts are described in detail below.

2. The input of the Transformer

The input representation x of a word in the Transformer is obtained by adding the word Embedding and the position Embedding.

2.1 Word Embedding

The word Embedding can be obtained in many ways; for example, it can be pretrained with algorithms such as Word2Vec or GloVe, or it can be learned inside the Transformer.
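As a rough sketch (assuming PyTorch and illustrative sizes; d = 512 as in the paper), a word Embedding trained inside the Transformer is simply a lookup table learned together with the rest of the model:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512            # illustrative vocabulary size; d = 512 as in the paper

# A learned word Embedding table, trained jointly with the Transformer.
word_embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[5, 17, 42, 8]])  # a batch with one 4-word sentence (hypothetical ids)
x_word = word_embedding(token_ids)          # shape (1, 4, 512): one Embedding row per word
```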

2.2 Position Embedding

In the Transformer, besides the word Embedding, we also need a position Embedding to indicate where each word appears in the sentence. Because the Transformer uses global information rather than the sequential structure of an RNN, it cannot exploit word order by itself, and word order information is very important for NLP. Therefore the Transformer uses a position Embedding to store the relative or absolute positions of words in the sequence.

The position Embedding is denoted PE, and its dimension is the same as that of the word Embedding. PE can either be learned by training or be computed with a formula; the Transformer uses the latter, calculated as follows:
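$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$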

where pos is the position of the word in the sentence, d is the dimension of PE (the same as the word Embedding), 2i denotes the even dimensions and 2i+1 the odd dimensions (i.e. 2i ≤ d, 2i+1 ≤ d). Computing PE with this formula has the following benefits:

  • PE can generalize to sentences longer than any sentence in the training set. For example, if the longest training sentence has 20 words and a sentence of length 21 suddenly appears, the formula can still compute the Embedding for the 21st position.
  • It lets the model easily learn relative positions: for a fixed offset k, PE(pos+k) can be expressed in terms of PE(pos), because sin(A+B) = sin(A)cos(B) + cos(A)sin(B) and cos(A+B) = cos(A)cos(B) − sin(A)sin(B).
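A minimal sketch of this positional encoding (assuming PyTorch; the function name and sizes are illustrative):

```python
import math
import torch

def positional_encoding(max_len: int, d: int) -> torch.Tensor:
    """Sinusoidal position Embedding table of shape (max_len, d)."""
    pe = torch.zeros(max_len, d)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float) * (-math.log(10000.0) / d))
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions 2i
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions 2i+1
    return pe

pe = positional_encoding(max_len=50, d=512)
# X = word_embedding_output + pe[:sentence_length]   # the Transformer input X
```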

Adding the word Embedding and the position Embedding gives the word representation vector X, which is the input of the Transformer.

3. Self-Attention

The figure above shows the internal structure of the Transformer from the paper, with the Encoder block on the left and the Decoder block on the right. The parts circled in red are the multi-head Attention layers, each composed of multiple self-attention layers. The Encoder block contains one multi-head Attention layer, while the Decoder block contains two (one of which is Masked). On top of each multi-head Attention layer there is an Add & Norm layer: Add denotes a residual connection, which prevents network degradation, and Norm denotes Layer Normalization, which normalizes the activations of each layer.

Since self-attention is the heart of the Transformer, we focus on multi-head Attention and self-attention, starting with a detailed look at the internal logic of self-attention.

3.1 Self-Attention structure

The figure above shows the structure of self-attention, which computes its output from the matrices Q (query), K (key), and V (value). In practice, self-attention receives either the input matrix X (formed by the word representation vectors) or the output of the previous Encoder block, and Q, K, and V are obtained by linear transformations of that input.

3.2 Calculation of Q, K and V

Let the input of self-attention be the matrix X; then Q, K, and V can be computed with the linear transformation matrices WQ, WK, and WV. The calculation is shown below; note that every row of X, Q, K, and V represents one word.
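A minimal sketch of this step (assuming PyTorch, with WQ, WK, WV as bias-free Linear layers and illustrative sizes):

```python
import torch
import torch.nn as nn

d_model = 512
W_Q = nn.Linear(d_model, d_model, bias=False)   # the linear matrices WQ, WK, WV
W_K = nn.Linear(d_model, d_model, bias=False)
W_V = nn.Linear(d_model, d_model, bias=False)

X = torch.randn(4, d_model)                     # n = 4 words; each row represents one word
Q, K, V = W_Q(X), W_K(X), W_V(X)                # each of shape (n, d_model); each row is one word
```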

3.3 Output of self-attention

After obtaining the matrices Q, K, and V, we can calculate the output of self-attention using the following formula.
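$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$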

The formula computes the inner products between the row vectors of Q and K, and divides by the square root of d_k to keep the inner products from becoming too large. After Q is multiplied by the transpose of K, the resulting matrix has n rows and n columns, where n is the number of words in the sentence; this matrix represents the attention intensity between words. The figure below illustrates Q times the transpose of K, where 1, 2, 3, 4 denote the words in the sentence.

After QK^T is obtained, a Softmax is applied to compute each word's attention coefficients over all the other words. In the formula, the Softmax is taken over each row of the matrix, so that every row sums to 1.

The Softmax matrix can be multiplied by V to get the final output Z.

The first row of the Softmax matrix in the figure above contains the attention coefficients of word 1 with respect to all the words. The final output Z1 of word 1 is the weighted sum of the value vectors Vi of all words i, using the attention coefficients as weights, as shown in the figure below:
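Putting these steps together, a minimal sketch of the self-attention output (assuming PyTorch; the random Q, K, V here stand in for the matrices computed from X):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (n, n) attention intensities
    attn = F.softmax(scores, dim=-1)                   # each row sums to 1
    return attn @ V                                    # output Z, one row per word

n, d_model = 4, 512
Q = K = V = torch.randn(n, d_model)   # illustrative; normally obtained from X with WQ, WK, WV
Z = self_attention(Q, K, V)           # shape (n, d_model)
```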

3.4 Multi-Head Attention

From the previous steps we already know how to compute the output matrix Z with self-attention. Multi-head Attention is built from multiple self-attention layers combined together. Below is the structure diagram of multi-head Attention from the paper.

As the figure shows, multi-head Attention contains multiple self-attention layers. The input X is first passed to h different self-attention layers, which compute h output matrices Z. The figure shows the case h = 8, which yields eight output matrices Z.

After obtaining the 8 output matrices Z1 to Z8, multi-head Attention concatenates them (Concat) and then passes the result through a Linear layer to obtain the final output Z of multi-head Attention.

For multi-head Attention, the output matrix Z has the same dimension as the input matrix X.
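A compact sketch of this split-compute-Concat-Linear pattern (assuming PyTorch; this simplified variant projects X once and splits the result into h heads, so names and details are illustrative rather than the paper's exact implementation):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)   # the final Linear layer

    def forward(self, x):                                     # x: (n, d_model)
        n = x.size(0)
        # Project, then split each projection into h heads of size d_k.
        Q = self.W_Q(x).view(n, self.h, self.d_k).transpose(0, 1)   # (h, n, d_k)
        K = self.W_K(x).view(n, self.h, self.d_k).transpose(0, 1)
        V = self.W_V(x).view(n, self.h, self.d_k).transpose(0, 1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)      # (h, n, n)
        Z_heads = F.softmax(scores, dim=-1) @ V                     # h output matrices Z
        Z = Z_heads.transpose(0, 1).reshape(n, self.h * self.d_k)   # Concat
        return self.W_O(Z)                                          # Linear; same shape as x

mha = MultiHeadAttention()
X = torch.randn(4, 512)
Z = mha(X)   # (4, 512): the output has the same dimension as the input
```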

4. The Encoder structure

The Encoder block structure of the Transformer (outlined in red in the figure) is composed of multi-head Attention, Add & Norm, Feed Forward, and Add & Norm layers. Now that you have seen how multi-head Attention works, let's look at the Add & Norm and Feed Forward parts.

4.1 Add & Norm

The Add & Norm layer consists of Add and Norm, and its calculation formula is as follows:
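$$\mathrm{LayerNorm}\big(X + \mathrm{MultiHeadAttention}(X)\big)$$

$$\mathrm{LayerNorm}\big(X + \mathrm{FeedForward}(X)\big)$$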

where X is the input of the multi-head Attention or Feed Forward layer, and MultiHeadAttention(X) and FeedForward(X) are the corresponding outputs (the output has the same dimension as the input X, so the two can be added).

Add refers to X + MultiHeadAttention(X), a residual connection, which is commonly used to make deep networks easier to train by letting each layer focus only on the current residual; it is widely used in ResNet.

Norm refers to Layer Normalization, which is commonly used in RNN structures. Layer Normalization normalizes the inputs of each layer of neurons to have the same mean and variance, which speeds up convergence.

4.2 Feed Forward

The Feed Forward layer is relatively simple: it is a two-layer fully connected network with a ReLU activation after the first layer and no activation after the second. The corresponding formula is as follows.
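$$\max(0,\; XW_1 + b_1)\,W_2 + b_2$$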

X is the input, and the resulting output matrix of the Feed Forward has the same dimension as X.

4.3 Composing the Encoder

Using the multi-head Attention, Feed Forward, and Add & Norm layers described above, we can construct an Encoder block. An Encoder block receives an input matrix X (n×d) and outputs a matrix O (n×d). The Encoder is formed by stacking multiple Encoder blocks.

The input of the first Encoder block is the representation vector matrix of the sentence's words, the input of each subsequent Encoder block is the output of the previous one, and the output of the last Encoder block is the encoding information matrix C, which will be used later in the Decoder.
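A minimal sketch of an Encoder block and the stacked Encoder (assuming PyTorch and using nn.MultiheadAttention as a stand-in for the multi-head Attention described above; d_ff = 2048 follows the paper, the other sizes are illustrative):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Multi-head Attention -> Add & Norm -> Feed Forward -> Add & Norm."""
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: (batch, n, d_model)
        a, _ = self.attn(x, x, x)          # self-attention: Q, K, V all come from x
        x = self.norm1(x + a)              # Add & Norm
        return self.norm2(x + self.ff(x))  # Feed Forward, then Add & Norm; same shape as input

encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])  # 6 stacked Encoder blocks
X = torch.randn(1, 4, 512)   # one sentence of n = 4 words
C = encoder(X)               # encoding information matrix C, shape (1, 4, 512)
```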

5. The Decoder structure

The Decoder block structure of the Transformer (outlined in red in the figure) is similar to the Encoder block, but with some differences:

  • Contains two multi-head Attention layers.
  • The first multi-head Attention layer uses the Masked operation.
  • The K and V matrices of the second multi-head Attention layer are computed from the Encoder's encoding information matrix C, while Q is computed from the output of the previous Decoder block.
  • Finally, a Softmax layer calculates the probability of the next translated word.

5.1 The first multi-head Attention

The first multi-head Attention of the Decoder block uses the Masked operation, because translation proceeds sequentially: word i+1 can only be translated after word i has been translated. The Masked operation prevents word i from seeing the information of the words after position i. We use the translation of a sentence into "I have a cat" as an example to explain the Masked operation.

In the following description, the concept of Teacher Forcing is used; readers unfamiliar with Teacher Forcing can refer to an explanation of the Seq2Seq model. In the Decoder, the most likely current translation must be produced from the previously translated words, as shown in the figure below: the first word "I" is predicted from the input "<Begin>", and the next word "have" is predicted from the input "<Begin> I".

During training, the Decoder can use Teacher Forcing and be trained in parallel by feeding it the correct word sequence "<Begin> I have a cat" as input, with the corresponding expected output "I have a cat" followed by the translation end symbol. When predicting the i-th output, the words after position i are Masked. Note that the Mask operation is applied before the Softmax inside self-attention. The Masked self-attention on the sequence "<Begin> I have a cat" proceeds in the following five steps.



Step 1: Prepare the Decoder's input matrix and the Mask matrix. The input matrix contains the vectors of the five words of "<Begin> I have a cat" (positions 0, 1, 2, 3, 4), and the Mask is a 5×5 matrix. From the Mask it can be seen that word 0 can only use the information of word 0, while word 1 can use the information of words 0 and 1; in other words, each word can only use the information of the words up to and including itself.

Step 2: The next step is the same as ordinary self-attention: the matrices Q, K, and V are computed from the input matrix X, and then the product of Q and K^T is taken.

Step 3: After QK^T is obtained, a Softmax is needed to compute the attention scores. Before the Softmax, the Mask matrix is used to block out the information after each word. The masking operation is as follows:

After Mask QK^T is obtained, the Softmax is applied to it so that each row sums to 1. However, word 0 now has zero attention score on words 1, 2, 3, and 4.

Step 4: Mask QK^T is multiplied by the matrix V to get the output Z; the output vector Z1 of word 1 then contains only the information of words 0 and 1.

Step 5: Through the above steps we obtain the output matrix Zi of one Masked self-attention; then, just as in the Encoder, multi-head Attention concatenates the outputs Zi of the multiple heads to produce the output Z of the first multi-head Attention layer. Z has the same dimension as the input X.
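A minimal sketch of Masked self-attention (assuming PyTorch; the lower-triangular Mask sets the scores of the words after each position to -inf before the Softmax, so their attention coefficients become 0):

```python
import math
import torch
import torch.nn.functional as F

def masked_self_attention(Q, K, V):
    n, d_k = Q.size(0), K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)       # (n, n)
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))   # lower-triangular Mask matrix
    scores = scores.masked_fill(~mask, float("-inf"))       # hide the words after each position
    attn = F.softmax(scores, dim=-1)   # row i has zero attention score on words i+1, i+2, ...
    return attn @ V                    # Z_i only mixes the information of words 0..i

n, d_model = 5, 512                    # "<Begin> I have a cat": 5 words
Q = K = V = torch.randn(n, d_model)    # illustrative; normally computed from the input matrix X
Z = masked_self_attention(Q, K, V)
```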

5.2 The second multi-head Attention

The second multi-head Attention of the Decoder block differs mainly in that the K and V matrices of its self-attention are not computed from the output of the previous Decoder block; they are computed from the Encoder's encoding information matrix C.

K and V are computed from the Encoder's output C, while Q is computed from the output Z of the previous Decoder block (for the first Decoder block, the input matrix X is used instead). The rest of the calculation is the same as described earlier.

The benefit of this is that, in the Decoder, every word can use the information of all the words in the Encoder (and this information does not need to be Masked).
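A minimal sketch of this second multi-head Attention (assuming PyTorch; the random C and Z stand in for the Encoder output and the output of the first Masked multi-head Attention):

```python
import torch
import torch.nn as nn

d_model = 512
W_Q = nn.Linear(d_model, d_model, bias=False)
W_K = nn.Linear(d_model, d_model, bias=False)
W_V = nn.Linear(d_model, d_model, bias=False)

C = torch.randn(4, d_model)   # Encoder's encoding information matrix C (4 source words)
Z = torch.randn(5, d_model)   # output of the previous layer on the Decoder side (5 target words)

Q = W_Q(Z)                    # Q comes from the Decoder side
K, V = W_K(C), W_V(C)         # K and V come from the Encoder matrix C
# followed by ordinary scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V, with no Mask
```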

5.3 Softmax predicts output words

The last part of the Decoder block uses Softmax to predict the next word. From the previous layers we obtain a final output Z. Because of the Mask, the output Z0 of word 0 contains only the information of word 0, as follows.

Softmax then predicts the next word from each row of the output matrix:
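A minimal sketch of this prediction step (assuming PyTorch; a Linear layer maps each row of Z to vocabulary logits and Softmax turns them into next-word probabilities; the sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 10000, 512
generator = nn.Linear(d_model, vocab_size)   # maps each row of Z to scores over the vocabulary

Z = torch.randn(5, d_model)                  # final Decoder output, one row per position
probs = F.softmax(generator(Z), dim=-1)      # (5, vocab_size); each row sums to 1
next_word = probs[0].argmax()                # row 0 predicts the first word ("I")
```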

This completes the definition of a Decoder block. Like the Encoder, the Decoder is composed of multiple Decoder blocks.

6. Transformer summary

Unlike an RNN, the Transformer can be trained in parallel more effectively.

The Transformer itself does not make use of word order information, so a position Embedding must be added to the input; otherwise the Transformer degenerates into a bag-of-words model.

The heart of the Transformer is the self-attention structure, in which the Q, K, and V matrices are obtained by linear transformations of the input.

Multi-head Attention in the Transformer contains multiple self-attention layers, which can capture the attention scores (correlation strengths) between words along multiple different dimensions.

7. References

  • Attention Is All You Need
  • Jay Alammar's blog: The Illustrated Transformer
  • PyTorch Transformer code: The Annotated Transformer