This is a comprehensive review of Transformers produced by Fudan University. It covers a wide range of topics and takes roughly 15-20 minutes to read.
Introduction
Transformer is now a widely used model in various fields, including NLP, CV, and speech. Over the years, Transformer variants have improved in the following areas:
1. Model efficiency
The self-attention module’s computation and storage complexity make Transformer inefficient when dealing with long sequences. The main solutions are lightweight attention variants and divide-and-conquer methods.
2. Model generalization
Transformer does not introduce inductive biases the way CNNs do, which makes it difficult to train on small datasets. Solutions include introducing structural bias or regularization terms, pre-training on large datasets, and so on.
3. Model adaptability
This line of work focuses on introducing Transformer into specific downstream tasks.
Background knowledge
The original Transformer
In this section, we mainly introduce several key modules in the original Transformer
Attention module
The prototype (scaled dot-product) attention formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{D_k}}\right)V$$

Transformer uses a multi-head attention mechanism: it first uses multiple sets of $W_q$, $W_k$, $W_v$ to compute attention separately, then concatenates the results and applies a linear transformation. The whole process is as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
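To make the two formulas above concrete, here is a minimal PyTorch sketch of scaled dot-product attention and the multi-head wrapper. The class and argument names (`MultiHeadAttention`, `d_model`, `num_heads`) are illustrative, not taken from the survey or any particular library.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (batch, heads, T, T)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                   # (batch, heads, T, d_k)

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        # W_q, W_k, W_v and the output projection W_o are each D x D
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        B = q.size(0)
        split = lambda x: x.view(B, -1, self.h, self.d_k).transpose(1, 2)
        out = scaled_dot_product_attention(
            split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v)), mask)
        out = out.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)  # concat heads
        return self.w_o(out)
```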
An overview of Transformer’s overall structure
In Transformer, there are three kinds of Attention:
- Self-attention: Q, K, and V in the formula above are all taken from the same input X.
- Masked self-attention: used in the decoder, where the mask guarantees that each position can only use information from positions up to and including itself when computing attention (see the causal-mask sketch after this list).
- Cross-attention: the Query comes from the decoder, while the Key and Value come from the encoder output.
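As a small illustration of the masked case, a causal mask is just a lower-triangular boolean matrix: position i may attend only to positions j <= i. The snippet below builds such a mask and feeds it to the `MultiHeadAttention` sketch above (the same illustrative names, not a library API).

```python
import torch

T = 5
# True = attention allowed; position i may attend only to positions j <= i.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
print(causal_mask.int())

attn = MultiHeadAttention(d_model=512, num_heads=8)   # sketch from above
x = torch.randn(2, T, 512)
y = attn(x, x, x, mask=causal_mask)                   # masked self-attention
```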
FFN module
The feed-forward layer is two fully connected layers with an activation in between, which can be written as

$$\mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)W_2 + b_2$$
Residual connections and LayerNorm
In each Transformer block, residual connections and LayerNorm are added around every sublayer. The whole process can be written as

$$x' = \mathrm{LayerNorm}(x + \mathrm{SelfAttention}(x)), \qquad y = \mathrm{LayerNorm}(x' + \mathrm{FFN}(x'))$$
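Putting the pieces together, here is a minimal sketch of one post-norm encoder block, with each sublayer wrapped as LayerNorm(x + Sublayer(x)). It reuses the `MultiHeadAttention` sketch from above, and the hyperparameter values are only placeholders.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)   # sketch defined earlier
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.drop(self.attn(x, x, x, mask)))  # attention sublayer
        x = self.norm2(x + self.drop(self.ffn(x)))               # FFN sublayer
        return x
```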
Positional encoding
Since Transformer contains no recurrence or convolution, the model itself has no notion of token position. We make the data carry position information by adding a positional encoding to the inputs of the encoder and decoder.
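For reference, a sketch of the sinusoidal positional encoding used by the original Transformer; the helper name is my own.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    position = torch.arange(max_len).unsqueeze(1)                 # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# The encoding is simply added to the token embeddings:
# x = token_embeddings + sinusoidal_positional_encoding(T, D)
```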
Model usage
There are usually three usage patterns:
- Encoder-Decoder is often used for sequence-to-sequence modeling (e.g., machine translation).
- Encoder-only uses only the encoder part and is often used for classification and sequence labeling problems.
- Decoder-only uses only the decoder part and is often used for sequence generation tasks.
Model analysis
Here we mainly analyze the complexity and number of parameters of two components, the FFN and self-attention. We assume that the sequence length is T, the number of channels (model width) is D, and the hidden size of the fully connected layer in the middle of the FFN is 4D. Then we have the following table:

| Module | Complexity | #Parameters |
| --- | --- | --- |
| Self-attention | $O(T^2 \cdot D)$ | $4D^2$ |
| Position-wise FFN | $O(T \cdot D^2)$ | $8D^2$ |
A simple derivation
Let’s take self-attention as an example. Self-attention uses three matrices to project the input X and obtain Q, K, and V. The sizes of these three matrices are all D x D (because the input dimension is D and the output dimension is D). After attention is computed, there is one more linear layer whose matrix is also D x D, so the number of parameters is $4D^2$.
If matrix A is (N, M) and matrix B is (M, P), then the resulting matrix C is (N, P), and computing each element requires M multiplications, so the complexity is O(NMP). For comparisons of other operations, see the Transformer/CNN/RNN comparison (time complexity, sequential operations, maximum path length) at zhuanlan.zhihu.com/p/264749298.
The matrix multiplication of Q and $K^\top$ multiplies a (T, D) matrix by a (D, T) matrix, so the complexity is $O(T^2 \cdot D)$. The parameter count and computational complexity of the FFN can be derived in the same way.
You can see that for long sequences, the cost of the self-attention operation grows quadratically with the sequence length.
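A quick numerical sanity check of the derivation, under the stated assumptions (sequence length T, width D, FFN hidden size 4D, biases ignored); the function is just an illustration.

```python
def transformer_layer_costs(T: int, D: int):
    # Self-attention: W_q, W_k, W_v, W_o are each D x D  ->  4 * D^2 parameters.
    attn_params = 4 * D * D
    # Q @ K^T is (T, D) x (D, T) and the weighted sum with V is (T, T) x (T, D):
    # both cost O(T^2 * D); the four projections add O(T * D^2).
    attn_mults = 2 * T * T * D + 4 * T * D * D
    # FFN: D x 4D and 4D x D matrices  ->  8 * D^2 parameters, O(T * D^2) compute.
    ffn_params = 8 * D * D
    ffn_mults = 8 * T * D * D
    return attn_params, attn_mults, ffn_params, ffn_mults

print(transformer_layer_costs(T=512, D=768))
```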
Transformer compared to other networks
Analysis of self-attention
We summarize the following three advantages:
- It has the same maximum path length as a fully connected layer and is suitable for long-distance modeling.
- Compared with convolutions, which have a finite receptive field (in CNNs, many layers need to be stacked to obtain a global receptive field), self-attention can model long-range dependencies with a constant number of layers.
- Self-attention has a higher degree of parallelism than recurrent layers (the RNN family).
Table: complexity, number of sequential operations, and maximum path length for different layer types
Inductive bias
CNNs use convolution kernels to introduce locality into images; RNNs introduce temporal invariance. Transformer does away with these inductive biases, making it versatile and flexible, but also easy to overfit on small datasets.
Related to this are graph neural networks (GNNs): Transformer can be thought of as a GNN defined on a complete directed graph (with self-loops), where each input token is a node in the graph.
Different types of Transformer
Subsequent Transformer variants have evolved from these components.
Summary diagram of related work
Attention
The self-attention mechanism is an important component, and it has two problems:
- Complexity: it is very expensive to compute when dealing with long sequences.
- Structural prior: abandoning all inductive bias makes it easy to overfit on small datasets.
The solutions are as follows:
- Sparse attention: introduce a sparsity bias into the attention computation.
- Linearized attention: decouple the attention matrix with kernel feature maps, reducing the cost to linear complexity.
- Memory compression: reduce the number of queries or key-value pairs to shrink the attention matrix.
- Low-rank self-attention: this line of work exploits the low-rank property of self-attention.
- Attention with prior: use prior attention distributions to supplement or replace the standard self-attention mechanism.
- Improved multi-head mechanisms.
Sparse Attention
In some trained Transformer models, it is observed that the attention matrix is often sparse, so the computational complexity can be reduced by limiting the number of Query-key pairs.
We can further divide the sparse methods into two categories: position-based methods and content-based methods.
Position-based sparse attention
Basic sparse attention patterns
There are five basic sparse attention patterns (a sketch of how they can be combined into a mask follows the figure below):
- Global attention: to strengthen the model's ability to capture long-range dependencies, a few global nodes are added that attend to, and are attended by, every position.
- Band attention: most interactions are local, so each query is restricted to interacting with only a few neighboring nodes.
- Dilated attention: similar to dilated convolution in CNNs, gaps are inserted between attended positions to obtain a larger receptive field.
- Random attention: non-local interaction is handled by randomly sampling a few keys for each query.
- Block local attention: the sequence is split into multiple non-overlapping blocks, and interaction is limited to within each block.
Figure: the five basic sparse attention patterns
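As a toy illustration of the position-based patterns, the sketch below builds a boolean query-key mask that combines a band window, a few global positions, and some random keys; the window size, global positions, and sampling rate are made-up values, and the resulting mask can be plugged into the attention sketch from earlier.

```python
import torch

def sparse_attention_mask(seq_len, window=2, global_positions=(0,), random_per_row=1):
    # Boolean (seq_len, seq_len) mask: True means the query may attend to the key.
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    idx = torch.arange(seq_len)
    # Band attention: each position attends to a local window of neighbours.
    mask |= (idx.unsqueeze(1) - idx.unsqueeze(0)).abs() <= window
    # Global attention: a few positions attend to, and are attended by, everyone.
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    # Random attention: each query additionally samples a few random keys.
    rand_keys = torch.randint(0, seq_len, (seq_len, random_per_row))
    mask.scatter_(1, rand_keys, True)
    return mask

print(sparse_attention_mask(8).int())
```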
Composite sparse attention patterns
These patterns are combinations of the basic sparse attention patterns above, so I won't elaborate here.
Different composite sparse attention patterns
Extended sparse attention patterns
In addition to the basic attention patterns above, there are also extended sparse attention patterns for specific data types.
BP-Transformer constructs a binary-tree-based attention pattern, with all tokens as leaf nodes and internal span nodes covering multiple tokens. Higher-level span nodes can cover tokens over longer distances.
For visual data, Image Transformer tries two sparse attention modes:
- Flatten the image and apply block local sparse attention.
- In 2D form, apply 2D block local attention.
Axial Transformer applies a separate attention module for each axis of the image
Extended sparse attention
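A minimal sketch of the axial idea for image data: attention is applied along the rows of an H x W grid, then along the columns, so each step only attends over one axis. The module and shapes are illustrative; it reuses PyTorch's built-in `nn.MultiheadAttention`.

```python
import torch
import torch.nn as nn

class AxialAttention2D(nn.Module):
    def __init__(self, d_model=64, num_heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, H, W, D) feature map
        B, H, W, D = x.shape
        rows = x.reshape(B * H, W, D)          # each row is a length-W sequence
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(B, H, W, D)
        cols = x.permute(0, 2, 1, 3).reshape(B * W, H, D)   # each column is a length-H sequence
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(B, W, H, D).permute(0, 2, 1, 3)

out = AxialAttention2D()(torch.randn(2, 16, 16, 64))
```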
Content-based sparse attention
Other work constructs sparse attention based on the input content. A straightforward way to do this is to select the keys that have high similarity to a given query.
Routing Transformer uses k-means clustering to cluster queries and keys over the same set of centroid vectors; each query only interacts with the keys that belong to the same cluster.
The centroid vectors are updated with an exponential moving average of the queries and keys assigned to them, normalized by the running count of vectors in each cluster.
Reformer uses locality-sensitive hashing (LSH) to select the key-value pairs for each query. Its main idea is to hash queries and keys into multiple buckets; only the queries and keys that fall into the same bucket interact with each other.
Let b be the number of buckets. Given a random matrix R of size $[D_k, b/2]$, the LSH function can be written as

$$h(x) = \arg\max([xR; -xR])$$
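A small sketch of the hashing step only (not the full Reformer attention): queries and keys share the same vectors, each is assigned a bucket by $h(x) = \arg\max([xR; -xR])$, and attention would then be restricted to pairs in the same bucket. Shapes and names are illustrative.

```python
import torch

def lsh_buckets(x, num_buckets, seed=0):
    # x: (T, d_k); returns a bucket id in [0, num_buckets) for each row.
    torch.manual_seed(seed)
    R = torch.randn(x.size(-1), num_buckets // 2)        # random projection matrix
    proj = x @ R                                          # (T, num_buckets / 2)
    return torch.argmax(torch.cat([proj, -proj], dim=-1), dim=-1)

qk = torch.randn(16, 64)                  # shared query/key vectors, as in Reformer
buckets = lsh_buckets(qk, num_buckets=8)
print(buckets)                            # only same-bucket positions attend to each other
```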
In addition, Sparse Adaptive Connection treats the sequence as a graph and adaptively learns sparse connections. Sparse Sinkhorn Attention uses a sorting network to divide queries and keys into multiple blocks; each query block is assigned a key block, and each query is only allowed to interact with the keys in its corresponding key block.
Linearized Attention
These methods approximate or replace the standard attention computation $\mathrm{softmax}(QK^\top)V$ with $\phi(Q)\,(\phi(K)^\top V)$, reducing the computational complexity to O(T), where $\phi$ is a feature map applied row-wise.
Standard Attention versus Linearized Attention
Let’s write the general form of attention:

$$z_i = \sum_j \frac{\mathrm{sim}(q_i, k_j)}{\sum_{j'} \mathrm{sim}(q_i, k_{j'})} \, v_j$$

where $\mathrm{sim}(\cdot,\cdot)$ is a function that measures the similarity of two vectors. In the original Transformer it is the scaled inner product followed by softmax, i.e. $\mathrm{sim}(q, k) = \exp(q^\top k / \sqrt{D_k})$; here we choose a kernel function $\mathrm{sim}(q, k) = \phi(q)^\top \phi(k)$ instead.

So we can rewrite attention as

$$z_i = \frac{\phi(q_i)^\top \sum_j \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_j \phi(k_j)}$$
There are two key components in this type of approach: the feature map and the aggregation rule.
Feature mapping
As mentioned earlier, Linear Transformer uses the feature map $\phi(x) = \mathrm{elu}(x) + 1$. It is not intended to approximate the dot-product attention, but to be comparable in performance to the standard Transformer.
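Here is a sketch of the resulting linearized attention with the $\mathrm{elu}(x)+1$ feature map, in the non-causal (encoder-style) case and ignoring the numerical-stability tricks of the actual implementations:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, T, d). Cost is O(T * d^2) instead of O(T^2 * d).
    q, k = F.elu(q) + 1, F.elu(k) + 1                     # feature map keeps scores positive
    kv = torch.einsum("btd,bte->bde", k, v)               # sum_t phi(k_t) v_t^T
    z = 1.0 / (torch.einsum("btd,bd->bt", q, k.sum(dim=1)) + eps)   # normalizer
    return torch.einsum("btd,bde,bt->bte", q, kv, z)

out = linear_attention(torch.randn(2, 128, 64),
                       torch.randn(2, 128, 64),
                       torch.randn(2, 128, 64))
```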
Performer uses random feature maps of the general form:

$$\phi(x) = \frac{h(x)}{\sqrt{m}} \left[ f_1(\omega_1^\top x), \dots, f_1(\omega_m^\top x), \dots, f_l(\omega_1^\top x), \dots, f_l(\omega_m^\top x) \right]$$

The first version of Performer is inspired by random Fourier feature maps, which are commonly used to approximate the Gaussian kernel. It uses:
- $l = 2$
- $f_1 = \sin$, $f_2 = \cos$

The first version cannot guarantee that the attention scores are non-negative, so the second version changes this to:
- $l = 1$
- $f_1 = \exp$
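A sketch of the second (positive) feature map, omitting the 1/sqrt(d) scaling and the orthogonal-random-feature trick used by the real Performer; the resulting features can be plugged into a linearized attention routine like the one above in place of elu(x)+1.

```python
import math
import torch

def positive_random_features(x, m=256, seed=0):
    # phi(x) = exp(-||x||^2 / 2) / sqrt(m) * exp(x @ Omega^T), Omega: (m, d) Gaussian.
    torch.manual_seed(seed)
    omega = torch.randn(m, x.size(-1))
    proj = x @ omega.T                                    # (..., m)
    return torch.exp(proj - x.pow(2).sum(-1, keepdim=True) / 2) / math.sqrt(m)

q_feat = positive_random_features(torch.randn(2, 128, 64))
k_feat = positive_random_features(torch.randn(2, 128, 64))
# phi(q) @ phi(k)^T gives a non-negative approximation of exp(q . k).
```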
Of course, there are other feature mapping methods, which I won’t go into too much here
Aggregation methods
In the previous formula, features are aggregated by simple summation.
RFA introduces a gating mechanism: when a new association is added to the memory matrix S, a learnable scalar g makes the historical associations decay exponentially.
Schlag et al. used the write/delete method to increase the memory matrix capacity.
Query prototyping and memory compression
In addition to sparse and linearized attention, another way to reduce attention complexity is to reduce the number of Queries or key-values.
Query prototyping and Memory Compression
Attention with query prototypes
Query prototypes serve as the main source for computing the attention distribution. The model either copies the computed distribution to the positions of the corresponding queries or fills those positions with a discrete uniform distribution.
Clustered Attention groups the queries into multiple clusters and computes the attention distribution for the centroid of each cluster.
Informer explicitly uses a query-sparsity measure to select query prototypes; the measure is derived from an approximation of the KL divergence between the query's attention distribution and the discrete uniform distribution.
Attention with compressed key-value memory
Such methods reduce the number of key-value pairs to reduce complexity.
Liu et al. propose MCA, which uses convolution layers to reduce the number of key-value pairs.
Set Transformer and Luna use a set of external trainable global nodes to compress the input.
Linformer projects the keys and values along the length dimension (i.e., multiplies them by a projection matrix), reducing their length.
Poolingformer uses two stages of attention: a sliding-window attention followed by a compressed-memory attention.
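To illustrate the key-value compression idea, here is a Linformer-style sketch that projects K and V along the sequence-length axis from T down to a fixed `proj_len` before running standard attention; all sizes and names are placeholders.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressedKVAttention(nn.Module):
    def __init__(self, d_model=512, seq_len=1024, proj_len=128):
        super().__init__()
        # Projections act on the length axis, shrinking T keys/values to proj_len.
        self.k_proj = nn.Linear(seq_len, proj_len, bias=False)
        self.v_proj = nn.Linear(seq_len, proj_len, bias=False)

    def forward(self, q, k, v):                               # all (B, T, D)
        k = self.k_proj(k.transpose(1, 2)).transpose(1, 2)    # (B, proj_len, D)
        v = self.v_proj(v.transpose(1, 2)).transpose(1, 2)    # (B, proj_len, D)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (B, T, proj_len)
        return F.softmax(scores, dim=-1) @ v                  # (B, T, D)

x = torch.randn(2, 1024, 512)
out = CompressedKVAttention()(x, x, x)
```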