Bilibili video explanation

Transformer is a Seq2Seq model proposed by Google Brain in the paper Attention Is All You Need, published at the end of 2017. BERT is a pre-trained language model derived from Transformer

The article is divided into the following sections

  1. An intuitive understanding of Transformer
  2. Positional Encoding
  3. Self-Attention Mechanism
  4. Residual Connections and Layer Normalization
  5. Transformer Encoder overall structure
  6. Transformer Decoder overall structure
  7. Conclusion
  8. References

0. An intuitive understanding of Transformer

The biggest difference between Transformer and LSTM is that LSTM training is iterative and serial: the current word must be processed before the next word can be processed. Transformer trains in parallel, meaning all words are processed at the same time, which greatly improves computational efficiency. Transformer uses Positional Encoding to capture word order, and uses the Self-Attention Mechanism and fully connected layers for computation, which are discussed below

The Transformer model consists of two parts, Encoder and Decoder. The Encoder maps the input into a hidden representation (the part shown as the nine-cell grid in step 2 below), and the Decoder maps that hidden representation back into a natural-language sequence. See the machine-translation example below (when the Decoder produces output, a token is emitted only after passing through all N Decoder Layers, not after each individual Decoder Layer)

Most of this article is devoted to explaining the Encoder, i.e. the process of mapping a natural-language sequence into a hidden mathematical representation. Once the structure of the Encoder is understood, the Decoder is very easy to understand

The figure above shows the structure of a Transformer Encoder Block. Note: the section numbers below correspond to the numbered boxes 1, 2, 3 and 4 in the figure

1. Positional Encoding

Since the Transformer model has no iterative recurrent-network operations, we must provide Transformer with the position of each word so that it can recognize the order relationships in language

We now define a Positional Encoding (also called a positional embedding). Its dimension is [max_sequence_length, embedding_dimension], the same as that of the word vectors, i.e. embedding_dimension. max_sequence_length is a hyperparameter that limits how many words a sentence can contain

Note that the Transformer model here is normally trained at the character (word) level. The word-embedding matrix has dimension [vocab_size, embedding_dimension], where vocab_size is the number of characters in the vocabulary and embedding_dimension is the dimension of the word vectors, corresponding to PyTorch's nn.Embedding(vocab_size, embedding_dimension)

The paper uses a linear transformation of sine and cosine functions to provide the model with position information:


PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{\text{model}}})
PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{\text{model}}})

Here pos takes values in [0, max_sequence_length), i takes values in [0, embedding_dimension / 2), and d_model refers to the value of embedding_dimension

In the formula above, sin is used for the even indices and cos for the odd indices of embedding_dimension (0, 1, 2, 3, ...). These sin and cos functions have different periods: as the index along embedding_dimension increases, the period of the positional-encoding function becomes longer and longer, so each position receives a distinct "texture" of values that encodes its location. As stated on page 6 of the paper, the period of the positional-encoding functions ranges from 2π to 10000·2π. Along embedding_dimension, each position therefore gets a unique combination of sin and cos values of different periods, which produces unique positional information and lets the model learn the dependencies between positions and the temporal characteristics of natural language

If you don’t understand why this is done here, check out this article for Positional Encoding in Transformer

Let's plot the positional encoding. Looking along the embedding_dimension axis, you can see that as the dimension index increases, the positional-encoding function's period becomes longer and longer (it changes more and more gently)

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

def get_positional_encoding(max_seq_len, embed_dim):
    # Initialize the positional encoding
    # embed_dim: dimension of the word embedding
    # max_seq_len: maximum sequence length
    positional_encoding = np.array([
        [pos / np.power(10000, 2 * i / embed_dim) for i in range(embed_dim)]
        if pos != 0 else np.zeros(embed_dim) for pos in range(max_seq_len)])

    positional_encoding[1:, 0::2] = np.sin(positional_encoding[1:, 0::2])  # dim 2i (even)
    positional_encoding[1:, 1::2] = np.cos(positional_encoding[1:, 1::2])  # dim 2i+1 (odd)
    return positional_encoding

positional_encoding = get_positional_encoding(max_seq_len=100, embed_dim=16)
plt.figure(figsize=(10, 10))
sns.heatmap(positional_encoding)
plt.title("Sinusoidal Function")
plt.xlabel("hidden dimension")
plt.ylabel("sequence length")

plt.figure(figsize=(8, 5))
plt.plot(positional_encoding[1:, 1], label="dimension 1")
plt.plot(positional_encoding[1:, 2], label="dimension 2")
plt.plot(positional_encoding[1:, 3], label="dimension 3")
plt.legend()
plt.xlabel("Sequence length")
plt.ylabel("Period of Positional Encoding")

2. Self Attention Mechanism

For an input sentence X, we obtain the word vector of each word through WordEmbedding and the position vector of each word through Positional Encoding, and add them together (they have the same dimension, so they can be added directly) to get the final vector representation of each word. The vector of the t-th word is denoted x_t
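
As a minimal sketch of this step (assuming PyTorch is available and reusing the get_positional_encoding function defined above; the sizes and token ids below are made up for illustration), the word embedding and the positional encoding have the same dimension and are simply added:

import torch
import torch.nn as nn

vocab_size, d_model, max_len = 1000, 16, 100   # illustrative sizes (assumptions)
token_ids = torch.tensor([[5, 23, 7]])         # a toy batch with one 3-word sentence

word_emb = nn.Embedding(vocab_size, d_model)   # the [vocab_size, embedding_dimension] lookup table
pe = torch.tensor(get_positional_encoding(max_len, d_model),  # reuses the function defined above
                  dtype=torch.float32)

# same dimension along the last axis, so the two can be added directly
x = word_emb(token_ids) + pe[:token_ids.size(1)]
print(x.shape)                                 # torch.Size([1, 3, 16])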

Then we define three matrices W_Q, W_K, W_V and use them to apply three linear transformations to all the word vectors, so that each word vector x_t yields three new vectors q_t, k_t, v_t. All the q_t vectors are stacked into a large matrix called the query matrix Q, all the k_t vectors into the key matrix K, and all the v_t vectors into the value matrix V (see figure below).

To get the attention weights of the first word, we multiply its query vector q_1 by the key matrix K (see figure below).

            [0, 4, 2]
[1, 0, 2] x [1, 4, 3] = [2, 4, 4]
            [1, 0, 1]

You then need to pass the resulting values through SoftMax so that they add up to 1 (see figure below)

softmax([2, 4, 4]) ≈ [0.0, 0.5, 0.5]

Once you have the weights, multiply each weight by the corresponding word's value vector v_t (see figure below).

0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]

Finally, these weighted value vectors are summed up to get the output of the first word (see figure below).

  [0.0, 0.0, 0.0]
+ [1.0, 4.0, 0.0]
+ [1.0, 3.0, 1.5]
-----------------
= [2.0, 7.0, 1.5]

Perform the same operation on the other input vectors to get all the output after self-attention
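
A minimal numpy sketch reproducing the toy numbers above (the vectors and the 3×3 matrix are copied from the example; note that the exact softmax of [2, 4, 4] is about [0.06, 0.47, 0.47], which the example rounds to [0.0, 0.5, 0.5]):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

q1 = np.array([1, 0, 2])            # query vector of the first word (from the example)
K_T = np.array([[0, 4, 2],          # the 3x3 matrix from the example (keys laid out as columns)
                [1, 4, 3],
                [1, 0, 1]])
V = np.array([[1, 2, 3],            # value vectors of the three words
              [2, 8, 0],
              [2, 6, 3]])

scores = q1 @ K_T                   # -> [2, 4, 4]
weights = softmax(scores)           # ~[0.06, 0.47, 0.47]; the example rounds this to [0.0, 0.5, 0.5]
output = weights @ V                # ~[1.9, 6.7, 1.6]; with the rounded weights it is exactly [2.0, 7.0, 1.5]
print(scores, weights, output)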

Matrix calculation

The procedure described above requires looping over every word x_t. By converting the vector computation above into matrix form, we can compute the output for all positions at once

So in the first step we no longer compute q_t, k_t, v_t one time step at a time, but compute Q, K and V all at once. The calculation is shown in the figure below, where the input is a matrix X whose t-th row is the vector representation x_t of the t-th word

Next, multiply Q by K^T, divide by \sqrt{d_k} (a scaling trick mentioned in the paper), apply softmax, and finally multiply by V to get the output
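
A minimal numpy sketch of this matrix form, i.e. softmax(Q K^T / \sqrt{d_k}) V (all sizes and weight matrices below are random, illustrative assumptions):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # [seq_len, seq_len]
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # [seq_len, d_model]

seq_len, d_model = 4, 8                                        # toy sizes (assumptions)
X = np.random.randn(seq_len, d_model)                          # the input matrix X
W_Q, W_K, W_V = (np.random.randn(d_model, d_model) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V                            # Q, K, V computed all at once
print(scaled_dot_product_attention(Q, K, V).shape)             # (4, 8)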

Multi-Head Attention

The paper also proposes Multi-Head Attention. The single set of Q, K, V defined earlier lets one word attend to related words; we can define multiple sets of Q, K, V to attend to different contexts. The calculation process for each set of Q, K, V is the same; the only change is going from one set of linear-transformation matrices (W^Q, W^K, W^V) to multiple sets (W^Q_0, W^K_0, W^V_0), (W^Q_1, W^K_1, W^V_1), ... See the figure below

For the input matrix X, each set of Q, K and V produces an output matrix Z. See the figure below
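
A rough numpy sketch of the multi-head idea under the toy sizes below (2 heads; following the paper, the per-head outputs are concatenated and projected back with an output matrix W^O):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_model, n_heads = 4, 8, 2                 # toy sizes (assumptions)
d_head = d_model // n_heads
X = np.random.randn(seq_len, d_model)

heads = []
for _ in range(n_heads):                            # one (W_Q, W_K, W_V) set per head
    W_Q, W_K, W_V = (np.random.randn(d_model, d_head) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)   # each head produces its own Z

W_O = np.random.randn(n_heads * d_head, d_model)
Z = np.concatenate(heads, axis=-1) @ W_O            # concatenate the heads and project back to d_model
print(Z.shape)                                      # (4, 8)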

Padding Mask

In the Self-Attention computation above, we usually use mini-batches, i.e. several sentences are processed at once. In other words, the dimension of X is [batch_size, sequence_length], where sequence_length is the sentence length. Since the sentences in a mini-batch have unequal lengths, we pad the shorter ones up to the length of the longest sentence in the mini-batch, usually with 0. This process is called padding

But this causes a problem for softmax. Recall the softmax function \sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}; since e^0 = 1, the padded positions still have a value and take part in the softmax, which means invalid regions participate in the computation and can cause serious problems. We therefore need a mask operation so that these invalid regions do not take part in the computation, which is usually done by adding a very large negative bias to the invalid regions, i.e.


Z_{illegal} = Z_{illegal} + bias_{illegal}
bias_{illegal} \rightarrow -\infty
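
A minimal numpy sketch of this padding mask (a toy batch with one sentence of real length 2 padded to length 4; the shapes and the -1e9 constant are illustrative assumptions):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
lengths = [2]                                               # one toy sentence of real length 2, padded to 4
scores = np.random.randn(len(lengths), seq_len, seq_len)    # Q K^T / sqrt(d_k) for each sentence

pad_mask = np.zeros((len(lengths), 1, seq_len))
for b, L in enumerate(lengths):
    pad_mask[b, :, L:] = -1e9                               # bias_illegal: a very large negative number

weights = softmax(scores + pad_mask)                        # padded key positions get ~0 attention weight
print(weights[0].round(2))                                  # the last two columns are ~0 in every row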

3. Residual Connection and Layer Normalization

Residual connection

In the previous step we obtained the self-attention weighted output, i.e. Self-Attention(Q, K, V). We then add it to the original input to form the residual connection:


X_{embedding} + \text{Self-Attention}(Q, K, V)

Layer Normalization

Layer Normalization normalizes the hidden-layer activations in the neural network to a standard normal distribution (i.e. i.i.d.), which speeds up training and convergence


\mu_{j} = \frac{1}{m} \sum^{m}_{i=1} x_{ij}

In the above formula, the column of the matrix is taken as the unit to obtain the mean value.


\sigma^{2}_{j} = \frac{1}{m} \sum^{m}_{i=1} (x_{ij} - \mu_{j})^{2}

The above formula takes the column of the matrix as the unit to calculate the variance


\text{LayerNorm}(x) = \frac{x_{ij} - \mu_{j}}{\sqrt{\sigma^{2}_{j} + \epsilon}}

Finally, each element of every column subtracts that column's mean and is divided by that column's standard deviation, giving the normalized value; ε is added to prevent the denominator from being 0
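
A direct numpy transcription of the three formulas above (a minimal sketch with a toy input; the learnable scale and shift parameters of a full LayerNorm are omitted):

import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=0)                      # mean of each column
    sigma2 = ((x - mu) ** 2).mean(axis=0)    # variance of each column
    return (x - mu) / np.sqrt(sigma2 + eps)  # eps keeps the denominator away from 0

x = np.random.randn(4, 8)                    # toy input (assumption)
out = layer_norm(x)
print(out.mean(axis=0).round(6))             # ~0 for every column
print(out.std(axis=0).round(6))              # ~1 for every column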

The figure below shows more details: inputs x_1, x_2 pass through the self-attention layer to become z_1, z_2, which are then added to the inputs x_1, x_2 as a residual connection and passed through LayerNorm before being fed to the fully connected (FeedForward) layer. The fully connected layer also has a residual connection and a LayerNorm, whose output is passed to the next Encoder (the FeedForward weights are shared within each Encoder Block).

4. Overall structure of Transformer Encoder

After the above three steps, we have covered the main components of the Encoder. Let's use formulas to organize the computation of one Encoder block:

1). Word embedding and positional encoding


X = \text{Embedding-Lookup}(X) + \text{Positional-Encoding}

2). Self-attention mechanism


Q = \text{Linear}_q(X) = XW_{Q}
K = \text{Linear}_k(X) = XW_{K}
V = \text{Linear}_v(X) = XW_{V}
X_{attention} = \text{Self-Attention}(Q, K, V)

3). Self-attention residual connection and Layer Normalization


X_{attention} = X + X_{attention}
X_{attention} = \text{LayerNorm}(X_{attention})

4). The next step is the fourth part of the Encoder block structure diagram, FeedForward, which is simply two linear layers with an activation function such as ReLU in between


X_{hidden} = \text{Linear}(\text{ReLU}(\text{Linear}(X_{attention})))

5). FeedForward residual connection and Layer Normalization


X_{hidden} = X_{attention} + X_{hidden}
X_{hidden} = \text{LayerNorm}(X_{hidden})

where


X_{hidden} \in \mathbb{R}^{batch\_size \ \times \ seq\_len \ \times \ embed\_dim}
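
As a compact sketch, the five steps above can be strung together in numpy (single attention head, random weights, no dropout; layer_norm is a direct transcription of the formulas in section 3, and all sizes are illustrative assumptions):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

seq_len, d_model, d_ff = 4, 8, 32                    # toy sizes (assumptions)
X = np.random.randn(seq_len, d_model)                # 1). embedding lookup + positional encoding

W_Q, W_K, W_V = (np.random.randn(d_model, d_model) for _ in range(3))
W_1, W_2 = np.random.randn(d_model, d_ff), np.random.randn(d_ff, d_model)

# 2). self-attention
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
X_attention = softmax(Q @ K.T / np.sqrt(d_model)) @ V

# 3). residual connection + LayerNorm
X_attention = layer_norm(X + X_attention)

# 4). FeedForward: Linear -> ReLU -> Linear
X_hidden = np.maximum(X_attention @ W_1, 0) @ W_2

# 5). residual connection + LayerNorm
X_hidden = layer_norm(X_attention + X_hidden)
print(X_hidden.shape)                                # (4, 8)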

5. Overall structure of Transformer Decoder

Let's first look at the Decoder structure from a high level. From bottom to top, it consists of:

  • Masked Multi-Head Self-Attention
  • Multi-Head Encoder-Decoder Attention
  • FeedForward Network

As in the Encoder, each of the three parts above has a residual connection followed by Layer Normalization. The internal components of the Decoder are not complicated, and most of them were already introduced for the Encoder, but because of its special role the Decoder involves some extra details during training

Masked Self-Attention

Specifically, the Decoder in traditional Seq2Seq uses an RNN, so during training the word at time t can never see future words, because the recurrent network is time-driven: the word at time t+1 is only seen after the computation at time t has finished. The Transformer Decoder abandons the RNN in favor of Self-Attention, which creates a problem: during training, the entire ground truth is exposed to the Decoder, which is obviously wrong. We need to do some processing on the Decoder's input, and this process is called a Mask

For example, suppose the ground truth fed to the Decoder is "<start> I am fine". We input this sentence into the Decoder; after WordEmbedding and Positional Encoding, the resulting matrix undergoes three linear transformations (W_Q, W_K, W_V). Then Q × K^T / \sqrt{d_k} gives the Scaled Scores. The key point is this: when the model processes "I", it should only know the information of the words up to and including "I", i.e. the start token and "I", and must not know the information of the words after "I". The reason is simple: at prediction time we predict word by word, in order; how could we already know the information of a word that has not been predicted yet? The Mask is very simple: first generate a matrix whose lower triangle is all 0 and whose upper triangle is negative infinity, then add it to the Scaled Scores

After softmax, the -inf entries become 0, and the resulting matrix gives, for each word, its attention weights over the other words
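
A minimal numpy sketch of this look-ahead mask (toy 4×4 scores with random values; -1e9 stands in for negative infinity):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scaled_scores = np.random.randn(seq_len, seq_len)    # Q K^T / sqrt(d_k), values illustrative

# 0 on and below the diagonal, a very large negative number (standing in for -inf) above it
look_ahead_mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)

weights = softmax(scaled_scores + look_ahead_mask)
print(weights.round(2))   # upper triangle is 0: word t attends only to positions <= t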

Multi-head self-attention is nothing more than doing the above steps a few times in parallel, which is also described in the previous Encoder, so I won’t go into the details here

Masked Encoder-Decoder Attention

The computation in this part is actually similar to Masked Self-Attention, and the structure is the same; the only difference is that K and V come from the Encoder's output, while Q comes from the output of the Masked Self-Attention in the Decoder
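
A short numpy sketch of this Encoder-Decoder attention (toy shapes and random values, only to show where Q, K and V come from):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

src_len, tgt_len, d_model = 5, 4, 8                  # toy sizes (assumptions)
encoder_output = np.random.randn(src_len, d_model)   # output of the Encoder stack
decoder_state = np.random.randn(tgt_len, d_model)    # output of Masked Self-Attention in the Decoder

W_Q, W_K, W_V = (np.random.randn(d_model, d_model) for _ in range(3))
Q = decoder_state @ W_Q                              # Q comes from the Decoder
K, V = encoder_output @ W_K, encoder_output @ W_V    # K and V come from the Encoder output

out = softmax(Q @ K.T / np.sqrt(d_model)) @ V
print(out.shape)                                     # (4, 8) -- one vector per target position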

6. Summary

At this point, 95% of the content in Transformer has been covered, and we have a diagram showing the complete structure. I have to say, the Transformer is very well designed

The following questions are from the Internet. I think I can have a deeper understanding of Transformer after reading them

Why does Transformer need Multi-Head Attention?

The original paper mentions that the reason for Multi-Head Attention is that splitting the model into multiple heads forms multiple subspaces, allowing the model to attend to different aspects of the information and finally combine them. Intuitively, if we designed such a model ourselves, we would also not compute attention only once: combining the results of several attention computations can at least strengthen the model, analogous to using multiple convolution kernels in a CNN. Intuitively, multi-head attention helps the network capture richer features/information

What are the advantages of Transformer over RNN/LSTM? Why?

  1. RNN-family models cannot be computed in parallel, because the computation at time t depends on the hidden-state result at time t-1, which in turn depends on the result at time t-2, and so on
  2. The feature extraction capability of Transformer is better than that of RNN series models

Why can Transformer replace Seq2Seq?

The biggest problem with Seq2Seq is that it compresses all the information on the Encoder side into a fixed-length vector and uses it as the input to the first hidden state on the Decoder side, to predict the hidden state of the first word (token) on the Decoder side. When the input sequence is relatively long, this obviously loses a lot of Encoder-side information, and handing the fixed-length vector to the Decoder all at once means the Decoder cannot focus on the information it cares about. Transformer not only substantially improves on these two weaknesses of the Seq2Seq model (through the multi-head cross-attention module), but also introduces a Self-Attention module in which the source sequence and target sequence each "attend to themselves" first, so that their embedding representations carry richer information; the subsequent FFN layers also enhance the model's expressive power, and Transformer's parallel-computing capability far exceeds that of the Seq2Seq family

7. References

  • Transformer
  • The Illustrated Transformer
  • TRANSFORMERS FROM SCRATCH
  • Seq2seq pay Attention to Self Attention: Part 2