This is a walkthrough of a comprehensive survey of Transformers produced by Fudan University. It covers a wide range of topics and takes about 15-20 minutes to read.

Introduction

Transformer is now a widely used model in many fields, including NLP, CV, and speech. Over the years, Transformer variants have improved on the original in the following areas:

1. Model efficiency

The computation and memory complexity of the self-attention module make Transformer inefficient when dealing with long sequences. The main solutions are lightweight attention variants and divide-and-conquer strategies.

2. Model generalization

Transformer does not build in inductive biases the way CNNs do, which makes it difficult to train on small datasets. Solutions include introducing structural bias or regularization terms, pre-training on large datasets, and so on.

3. Model adaptability

This line of work focuses on adapting Transformer to downstream tasks and domains.

Background knowledge

The original Transformer

In this section, we briefly introduce the key modules of the original Transformer.

Attention module

The prototype attention formula (scaled dot-product attention) is as follows:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{D_k}}\right)V$$

Transformer uses a multi-head attention mechanism: it first uses multiple sets of projection matrices $W_q$, $W_k$, $W_v$ (one set per head) to compute attention separately, then concatenates the per-head results and applies a final linear transformation. The whole process is as follows:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H)\,W^O, \quad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
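To make the two formulas concrete, here is a minimal PyTorch sketch of scaled dot-product attention and a multi-head wrapper. This is our own illustration, not code from the survey; the class and variable names are ours.

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, T, d_head); mask entries equal to 0 are disallowed
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (batch, heads, T, T)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v                   # (batch, heads, T, d_head)

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        # W_q, W_k, W_v for all heads, implemented as fused projections
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # final linear transformation W^O

    def forward(self, q, k, v, mask=None):
        B, T, D = q.shape
        split = lambda x: x.view(B, -1, self.h, self.d_head).transpose(1, 2)
        out = scaled_dot_product_attention(
            split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v)), mask)
        out = out.transpose(1, 2).reshape(B, T, D)   # concatenate the heads
        return self.w_o(out)
```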

(Figure: an overview of Transformer's overall structure)

In Transformer, there are three kinds of Attention:

  • Self-attention: Q, K, and V in the formula above all come from the same input X.

  • Masked self-attention: used in the decoder, where the mask ensures that each position can only attend to positions up to and including itself when computing attention.

  • Cross-attention: the Query comes from the decoder, while the Key and Value come from the encoder output.

FFN module

The feed-forward layer is two fully connected layers with an activation in between, which can be written as

$$\text{FFN}(x) = \text{ReLU}(xW^1 + b^1)W^2 + b^2$$

Residual connections and LayerNorm

In each Transformer block, residual connections and LayerNorm are added around both sub-layers. The whole process can be written as

$$H' = \text{LayerNorm}(\text{SelfAttention}(X) + X)$$
$$H = \text{LayerNorm}(\text{FFN}(H') + H')$$
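A minimal sketch of a post-LN encoder block following these two equations, using PyTorch's built-in `nn.MultiheadAttention` for the attention sub-layer; the hyperparameter names are illustrative.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """Post-LN encoder block matching the equations above."""
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(              # FFN(x) = ReLU(x W1 + b1) W2 + b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):
        a, _ = self.attn(x, x, x, attn_mask=attn_mask)
        h = self.norm1(x + a)                  # H' = LayerNorm(SelfAttention(X) + X)
        return self.norm2(h + self.ffn(h))     # H  = LayerNorm(FFN(H') + H')
```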

Positional encoding

Since Transformer contains no recurrence or convolution, the model by itself has no notion of position. We inject position information by adding positional encodings to the inputs of the encoder and decoder.
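A sketch of the sinusoidal positional encoding used in the original Transformer (assuming an even model width D); the resulting matrix is simply added to the token embeddings.

```python
import math
import torch

def sinusoidal_positional_encoding(T, D):
    # D is assumed even; pe[t, 2i] = sin(t / 10000^(2i/D)), pe[t, 2i+1] = cos(t / 10000^(2i/D))
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)                       # (T, 1)
    inv_freq = torch.exp(-math.log(10000.0) * torch.arange(0, D, 2).float() / D)  # (D/2,)
    pe = torch.zeros(T, D)
    pe[:, 0::2] = torch.sin(pos * inv_freq)
    pe[:, 1::2] = torch.cos(pos * inv_freq)
    return pe   # added to the token embeddings of the encoder/decoder input
```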

Model usage

There are usually three ways to use the architecture:

  • Encoder-Decoder is often used for sequence-to-sequence modeling (e.g., machine translation).

  • Encoder-only uses only the encoder and is often used for classification and sequence labeling problems.

  • Decoder-only uses only the decoder and is often used for sequence generation tasks.

Model analysis

Here we analyze the complexity and number of parameters of two components: self-attention and the FFN. Assume the sequence length is T, the number of channels (model width) is D, and the hidden size of the FFN is 4D. Then we have the following table:

| Module | Complexity | #Parameters |
| --- | --- | --- |
| Self-attention | O(T²·D) | 4D² |
| Position-wise FFN | O(T·D²) | 8D² |

A simple derivation

Take self-attention as an example. Self-attention uses three matrices to project the input X into Q, K, and V. Each of these matrices is of size D×D (the input dimension is D and the output dimension is D). After the attention computation there is one more linear layer whose matrix is also D×D, so the number of parameters is 4D².

If matrix A is (N, M) and matrix B is (M, P), the product C is (N, P), and each element requires M multiplications, so the complexity is O(NMP). For a comparison of other operations, see Transformer/CNN/RNN comparison (time complexity, number of sequential operations, maximum path length) (zhuanlan.zhihu.com/p/264749298).

The matrix multiplication of Q and Kᵀ multiplies a (T, D) matrix by a (D, T) matrix, so the complexity is O(T²·D). The parameter count and computational complexity of the FFN can be derived in the same way.

You can see that for long sequences, the cost of self-attention grows quadratically with the sequence length.
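As a quick sanity check of the table and the derivation above, here is a small Python snippet that counts parameters and multiplications for the two modules (biases ignored, FFN hidden size fixed at 4D); the function names are our own.

```python
def self_attention_cost(T, D):
    params = 4 * D * D            # W_q, W_k, W_v and the output projection
    mults = 3 * T * D * D         # projecting X to Q, K, V
    mults += T * T * D            # Q K^T            -> O(T^2 * D)
    mults += T * T * D            # softmax(.) V     -> O(T^2 * D)
    mults += T * D * D            # output projection
    return params, mults

def ffn_cost(T, D):
    params = D * 4 * D + 4 * D * D    # two weight matrices -> 8 D^2
    mults = 2 * T * D * 4 * D         # two matmuls         -> O(T * D^2)
    return params, mults

print(self_attention_cost(T=1024, D=512))   # parameters = 4 * D^2
print(ffn_cost(T=1024, D=512))              # parameters = 8 * D^2
```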

Transformer compared to other networks

Analysis of self-attention

We summarize the following three advantages:

  1. It has the same maximum path length as a fully connected layer and is suitable for long-distance modeling.

  2. Compared with convolutions, which have a finite receptive field (in CNNs many layers must be stacked to obtain a global receptive field), self-attention can model long-range dependencies with a constant number of layers.

  3. Self-attention is more parallelizable than recurrent layers (the RNN family).

(Table: per-layer complexity, number of sequential operations, and maximum path length for different layer types)

The inductive bias

CNNs use convolution kernels to introduce locality and translation invariance; RNNs introduce temporal invariance. Transformer does away with these inductive biases, which makes it versatile and flexible, but also prone to overfitting on small datasets.

A related structure is the graph neural network (GNN): Transformer can be viewed as a GNN defined over a complete directed graph (with self-loops), where each input token is a node.

Different types of Transformer

Subsequent Transformer variants have evolved from these components.

(Figure: summary diagram of related work)

Attention

The self-attention mechanism is the key component, and it has two problems:

  1. Complexity: computing attention over long sequences is very expensive.

  2. Structural prior: self-attention abandons all inductive bias, so it easily overfits on small datasets.

The solutions are as follows:

  1. Sparse Attention: introduces a sparsity bias into the attention computation.

  2. Linearized Attention: disentangles the attention matrix with kernel feature maps, reducing the computation to linear complexity.

  3. Memory compression: reduces the number of queries or key-value pairs to shrink the attention matrix.

  4. Low-rank self-attention: this line of work exploits the low-rank structure of the attention matrix.

  5. Attention with prior: uses prior attention distributions to complement or replace the standard self-attention mechanism.

  6. Improved multi-head mechanisms.

Sparse Attention

In trained Transformer models, the attention matrix is often observed to be sparse, so computational complexity can be reduced by limiting the number of query-key pairs that are actually computed.

We can further divide the sparse methods into two categories: position-based methods and content-based methods.

Position-based sparse attention

Basic sparse attention patterns

There are five basic sparse attention patterns:

  • Global Attention: to strengthen the model's ability to capture long-distance dependencies, a few global nodes are added that attend to (and are attended by) all positions.

  • Band Attention: most interactions are local, so each query is limited to interacting with a few neighboring nodes.

  • Dilated Attention: similar to dilated convolution in CNNs, gaps are inserted between attended positions to obtain a larger receptive field.

  • Random Attention: improves non-local interaction through random sampling.

  • Block Local Attention: the sequence is split into multiple non-overlapping blocks, and interaction is limited to within each block.

(Figure: the five basic sparse attention patterns)
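To make the five patterns concrete, here is an illustrative construction of each one as a boolean (T, T) mask, where a True entry means the query (row) may attend to the key (column). The exact parameterization varies from paper to paper; this sketch only visualizes the ideas.

```python
import torch

def band_mask(T, w):
    # band attention: each position attends to its +/- w neighbours
    idx = torch.arange(T)
    return (idx[:, None] - idx[None, :]).abs() <= w

def dilated_mask(T, w, gap):
    # dilated attention: strided neighbours, larger receptive field for the same w
    diff = (torch.arange(T)[:, None] - torch.arange(T)[None, :]).abs()
    return (diff <= w * gap) & (diff % gap == 0)

def global_mask(T, g):
    # global attention: the first g tokens attend to and are attended by everything
    # (usually combined with one of the other patterns)
    m = torch.zeros(T, T, dtype=torch.bool)
    m[:g, :] = True
    m[:, :g] = True
    return m

def random_mask(T, k):
    # random attention: each query samples k random keys
    m = torch.zeros(T, T, dtype=torch.bool)
    cols = torch.randint(0, T, (T, k))
    m[torch.arange(T).unsqueeze(1), cols] = True
    return m

def block_local_mask(T, block):
    # block local attention: interaction only within non-overlapping blocks
    blk = torch.arange(T) // block
    return blk[:, None] == blk[None, :]
```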

Composite sparse attention patterns

These combine the basic sparse attention patterns above; we won't elaborate on them here.

(Figure: different composite sparse attention patterns)

Extended sparse attention patterns

In addition to the basic patterns above, there are extended sparse attention patterns designed for specific data types.

BP-Transformer constructs a binary-tree-based attention structure, with all tokens as leaf nodes and internal span nodes that summarize multiple tokens. Higher-level span nodes cover tokens over longer distances.

For visual data, Image Transformer tries two sparse attention modes:

  1. Flatten the image and apply block local sparse attention.

  2. Keep the 2D form and apply 2D block local attention.

Axial Transformer applies a separate attention module along each axis of the image.

(Figure: extended sparse attention patterns)

Content-based sparse attention

Another line of work constructs sparse attention conditioned on the input content. A simple approach is to select the keys that have high similarity to a given query.

Routing Transformer uses k-means to cluster queries and keys against a shared set of centroid vectors; each query only interacts with the keys that fall into the same cluster.

The centroid vectors are updated with an exponential moving average:

$$\tilde{\mu} \leftarrow \lambda\tilde{\mu} + (1-\lambda)\Big(\sum_{i:\mu(q_i)=\mu} q_i + \sum_{j:\mu(k_j)=\mu} k_j\Big), \quad c_\mu \leftarrow \lambda c_\mu + (1-\lambda)|\mu|, \quad \mu \leftarrow \frac{\tilde{\mu}}{c_\mu}$$

where $\lambda$ is the moving-average decay and $|\mu|$ is the current number of vectors assigned to cluster $\mu$.

Reformer uses locality-sensitive hashing (LSH) to select the key-value pairs each query attends to. The main idea is to hash queries and keys into multiple buckets; only the queries and keys that fall into the same bucket interact with each other.

Let $b$ be the number of buckets. Given a random matrix $R$ of size $[D_k, b/2]$, the LSH function can be written as

$$h(x) = \arg\max([xR; -xR])$$
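A small sketch of this bucketing step, assuming the same random matrix R is shared by queries and keys so that matching bucket ids are comparable; only positions with equal bucket ids would then attend to each other.

```python
import torch

def lsh_buckets(x, R):
    # x: (T, D_k) queries or keys; R: (D_k, b // 2) shared random projection
    proj = x @ R
    return torch.cat([proj, -proj], dim=-1).argmax(dim=-1)   # bucket id in [0, b)

# usage sketch: hash queries and keys with the same R, then restrict attention
# to pairs whose bucket ids are equal
```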

In addition, Sparse Adaptive Connection treats the sequence as a graph and adaptively learns sparse connections. Sparse Sinkhorn Attention uses a sorting network to divide queries and keys into multiple blocks; each query block is assigned one key block, and each query is only allowed to interact with the keys in its assigned block.

Linearized Attention

These methods approximate or replace the standard attention computation $\text{softmax}(QK^\top)V$ with $\phi(Q)\big(\phi(K)^\top V\big)$, reducing the computational complexity to $O(T)$, where $\phi$ is a feature map applied row-wise.

(Figure: standard attention versus linearized attention)

Let's write the general form of attention for output position $i$:

$$z_i = \sum_j \frac{\text{sim}(q_i, k_j)}{\sum_{j'} \text{sim}(q_i, k_{j'})}\, v_j$$

where $\text{sim}(\cdot,\cdot)$ is a function that scores the similarity of two vectors. In the original Transformer it is the exponentiated (scaled) inner product $\exp(\langle q, k\rangle/\sqrt{D_k})$; here we instead choose a kernel $K(x, y) = \phi(x)\phi(y)^\top$.

So we can rewrite attention as

$$z_i = \frac{\phi(q_i)\big(\sum_j \phi(k_j)^\top v_j\big)}{\phi(q_i)\big(\sum_{j'} \phi(k_{j'})^\top\big)}$$

There are two key components in this type of approach: the feature map and the aggregation rule.
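A minimal sketch of this rewriting: because $\phi(Q)\big(\phi(K)^\top V\big)$ can be evaluated from right to left, both memory and compute become linear in T. The default feature map used here is the $\text{elu}(x)+1$ map discussed in the next subsection; everything else is our own illustration.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, phi=lambda x: F.elu(x) + 1):
    # q, k: (T, D_k), v: (T, D_v); phi is applied row-wise
    q, k = phi(q), phi(k)
    kv = k.T @ v                                   # (D_k, D_v): sum_j phi(k_j)^T v_j, computed once
    z = q @ k.sum(dim=0)                           # (T,): phi(q_i) . sum_j phi(k_j)
    return (q @ kv) / (z.unsqueeze(-1) + 1e-6)     # (T, D_v)
```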

Feature mapping

As mentioned earlier, Linear Transformer uses the feature map $\phi(x) = \text{elu}(x) + 1$. It is not intended to approximate the dot-product attention, but it performs comparably to the standard Transformer.

Performer uses random feature maps of the form

$$\phi(x) = \frac{h(x)}{\sqrt{m}}\big[f_1(\omega_1^\top x), \ldots, f_1(\omega_m^\top x), \ldots, f_l(\omega_1^\top x), \ldots, f_l(\omega_m^\top x)\big]$$

where $\omega_1, \ldots, \omega_m$ are random vectors drawn i.i.d. from some distribution.

The first version of Performer is inspired by random Fourier feature maps, which are commonly used to approximate Gaussian kernels. It uses:

  • l = 2

  • f1 = sin, f2 = cos

However, the first version does not guarantee that the attention scores are non-negative; the second version changes this to:

  • l = 1

  • f1 = exp

Of course, there are other feature mapping methods, which we won't go into here.
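A sketch of the second (positive) feature map, assuming Gaussian random projections $\omega_1, \ldots, \omega_m$ shared between queries and keys and $h(x) = \exp(-\lVert x\rVert^2/2)$; this illustrates the idea and is not the Performer reference implementation.

```python
import torch

def positive_random_features(x, omega):
    # x: (T, D); omega: (D, m) Gaussian random projections, shared by queries and keys
    m = omega.size(1)
    h = torch.exp(-(x ** 2).sum(dim=-1, keepdim=True) / 2)   # h(x) = exp(-||x||^2 / 2), our assumption
    return h * torch.exp(x @ omega) / m ** 0.5                # l = 1, f1 = exp
```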

Aggregation methods

In the formula above, the features are aggregated by simple summation.

RFA introduces a gating mechanism: when a new association is written into the memory matrix S, the historical associations are exponentially decayed through a learnable scalar gate g.
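An illustrative sketch of this gated aggregation in the recurrent view: at each step the memory S and the normalizer z decay by a gate before the new association is added. In RFA the gate is computed from the current input; here it is a plain scalar, so treat this only as a sketch of the mechanism, not the paper's exact formulation.

```python
import torch

def gated_linear_attention_step(S, z, q_t, k_t, v_t, g, phi):
    # S: (D_phi, D_v) memory of associations, z: (D_phi,) normalizer, g: scalar gate in (0, 1)
    S = g * S + (1 - g) * torch.outer(phi(k_t), v_t)   # decay history, write new association
    z = g * z + (1 - g) * phi(k_t)
    out = (phi(q_t) @ S) / (phi(q_t) @ z + 1e-6)
    return out, S, z
```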

Schlag et al. use a write-and-remove scheme to increase the effective capacity of the memory matrix.

Query prototyping and memory compression

In addition to sparse and linearized attention, another way to reduce attention complexity is to reduce the number of queries or key-value pairs.

(Figure: query prototyping and memory compression)

Attention with query prototypes

Query prototypes serve as the main source for computing attention distributions. The model either copies the computed distributions to the positions of the queries each prototype represents, or fills those positions with a discrete uniform distribution.

Clustered Attention groups queries into multiple clusters and computes the attention distribution only for the centroid of each cluster.

Informer selects query prototypes with an explicit query-sparsity measure, derived from an approximation of the KL divergence between a query's attention distribution and the discrete uniform distribution.
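An illustrative sketch of the query-prototyping idea: compute full attention only for a handful of prototype queries and copy each prototype's distribution to the queries it represents. Here simple strided subsampling stands in for the actual clustering or sparsity-measure selection used by Clustered Attention and Informer.

```python
import torch

def prototype_attention(q, k, v, num_prototypes):
    # q, k: (T, D), v: (T, D_v); T is assumed divisible by num_prototypes
    T = q.size(0)
    stride = T // num_prototypes
    proto_idx = torch.arange(0, T, stride)[:num_prototypes]                   # stand-in for centroids
    dist = torch.softmax(q[proto_idx] @ k.T / k.size(-1) ** 0.5, dim=-1)      # (P, T) distributions
    assign = (torch.arange(T) // stride).clamp(max=num_prototypes - 1)        # query -> prototype
    return dist[assign] @ v          # each query copies its prototype's attention distribution
```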

Attention with compressed key-value memory

Such methods reduce the number of key-value pairs to reduce complexity.

Liu et al. propose Memory Compressed Attention (MCA), which uses convolution layers to reduce the number of keys and values.

Set Transformer and Luna use a set of external trainable global nodes to summarize the input.

Linformer uses linear projections (i.e., multiplication by a projection matrix) to project keys and values along the length dimension, reducing their length and hence the size of the attention matrix.
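A sketch of this length-wise projection, assuming learned projection matrices E and F of shape (proj_len, seq_len) applied to the keys and values; the class and parameter names are ours.

```python
import torch
import torch.nn as nn

class LinformerAttentionSketch(nn.Module):
    def __init__(self, d_model, seq_len, proj_len):
        super().__init__()
        # E and F project keys/values along the length dimension: seq_len -> proj_len
        self.E = nn.Parameter(torch.randn(proj_len, seq_len) / seq_len ** 0.5)
        self.F = nn.Parameter(torch.randn(proj_len, seq_len) / seq_len ** 0.5)
        self.scale = d_model ** -0.5

    def forward(self, q, k, v):
        # q, k, v: (seq_len, d_model)
        k_c, v_c = self.E @ k, self.F @ v                          # (proj_len, d_model)
        attn = torch.softmax((q @ k_c.T) * self.scale, dim=-1)     # (seq_len, proj_len), not (T, T)
        return attn @ v_c
```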

Poolingformer uses two stages of attention: a sliding-window attention followed by a compressed-memory attention.

