Preface

In computer vision, the effectiveness of relative position encoding has not been well studied and even remains controversial. This paper analyzes several key factors of relative position encoding and proposes new relative position encoding methods dedicated to 2D images, called image RPE (iRPE).


Code: github.com/microsoft/C…

Background

Self-attention is at the heart of the Transformer: it models the relationships between the tokens in a sequence. However, self-attention has an inherent flaw: it does not capture the order of the input tokens. Explicitly incorporating position information is therefore especially important for the Transformer, because the model is otherwise entirely invariant to sequence ordering, which is undesirable when modeling structured data.

There are two main types of position representation in the Transformer: absolute and relative.

The absolute method encodes the absolute position of input tokens, from 1 up to the maximum sequence length; that is, each position has its own encoding vector. The encoding vector is then combined with the input tokens to expose position information to the model.

The relative position method encodes the relative distance between input tokens and learns their pairwise relationships. Relative position encoding (RPE) is typically computed via lookup tables with learnable parameters that interact with the queries and keys in the self-attention module. This allows the module to capture very long dependencies between tokens.
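As a rough illustration of where the two kinds of encoding enter the computation, here is a minimal NumPy sketch with toy sizes, identity projections, and a simple 1D offset table; none of these choices come from the paper or its code:

```python
import numpy as np

n, d = 4, 8                              # toy sequence length and channel count
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))              # input token embeddings

# Absolute encoding: one vector per position, combined with the inputs themselves.
abs_table = rng.normal(size=(n, d))      # stand-in for a learned/sinusoidal table
x_abs = x + abs_table                    # position information enters before attention

# Relative encoding: a learnable bias per offset (i - j), added to the attention logits.
q, k = x, x                              # identity projections for brevity
logits = q @ k.T / np.sqrt(d)            # (n, n) attention logits
rel_table = rng.normal(size=(2 * n - 1,))    # one entry per possible offset
offsets = np.arange(n)[:, None] - np.arange(n)[None, :] + (n - 1)
logits = logits + rel_table[offsets]     # position information enters inside attention
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
```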

Relative position encoding has been proven effective in natural language processing. In computer vision, however, its effect is still unclear. The recent literature on this question is scarce and has drawn controversial conclusions for Vision Transformers.

For example, Dosovitskiy et al. observed that relative position encoding brings no gain over absolute position encoding. In contrast, Srinivas et al. found that relative position encoding yields a significant gain and is superior to absolute position encoding. Furthermore, recent work has claimed that relative position encoding does not work as well as absolute position encoding. These works draw different conclusions about the effectiveness of relative position encoding, which prompted the authors to re-examine and rethink its use in Vision Transformers.

On the other hand, the original relative position encoding was designed for language modeling, where the input is a one-dimensional word sequence. For vision tasks, the input is usually a 2D image or a video sequence in which the pixels have a highly spatial structure. It is not yet clear whether the extension from one dimension to two is suitable for vision models, and whether directional information is important in vision tasks.

Contributions

This paper first reviews existing relative position encoding methods and then proposes new encoding methods for two-dimensional images. The contributions are as follows.

1. Several key factors of relative position encoding are analyzed, including the relative direction, the importance of context, the interactions between queries, keys, values and the relative position embeddings, and the computational cost. The analysis provides a comprehensive understanding of relative position encoding and empirical guidance for designing new methods.

2. An efficient relative encoding method is proposed. The computational cost is reduced from O(n^2 d) to O(nkd), where k is the number of relative-position buckets and k << n.

3. Considering efficiency and generality, four new relative position encoding methods for Vision Transformers are proposed, called image RPE (iRPE). The methods are simple and can easily be plugged into the self-attention layer. Experiments show that, without tuning any hyperparameters or settings, the proposed methods improve on their baselines DeiT-S and DETR-ResNet50 by 1.5% (top-1 accuracy) on ImageNet and 1.3% (mAP) on COCO, respectively.

4. Experiments show that relative position encoding can replace absolute encoding in image classification. At the same time, absolute encoding is necessary for object detection, where pixel positions matter for object localization.

Methods

First, to investigate whether the encoding can be independent of the input, two relative position encoding modes are introduced: bias mode and contextual mode. Then, different from the conventional clip function, a piecewise function is proposed to map relative positions to encodings. Finally, to study the importance of directionality, two non-directional methods and two directional methods are designed.

Bias Mode and Contextual Mode

Previous relative position encoding methods rely on the input embeddings. This raises the question: can the encoding be independent of the input? The paper introduces the bias mode and the contextual mode of relative position encoding to study this problem. The former is independent of the input embeddings, while the latter considers interactions with the queries, keys, or values.

Expressed as a unified formula:

e_ij = ((x_i W^Q)(x_j W^K)^T + b_ij) / √d_z,

where b_ij is the 2D relative position encoding, and its form defines the bias mode or the contextual mode.

For the bias mode, b_ij = r_ij, where r_ij is a learnable scalar representing the relative position weight between positions i and j.

For the contextual mode, b_ij = (x_i W^Q)(r_ij)^T,

where r_ij is a trainable vector that interacts with the query embedding. There are several variations of the contextual mode, which are not listed here; please refer to the paper for details.
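A minimal sketch of the difference between the two modes, with toy shapes and random stand-ins for the learnable parameters (q stands in for the projected queries x_i W^Q):

```python
import numpy as np

n, d = 4, 8                                     # toy sequence length and channels
rng = np.random.default_rng(0)
q = rng.normal(size=(n, d))                     # stand-in for x_i W^Q

# Bias mode: b_ij is a learnable scalar for each pair, independent of the input.
r_bias = rng.normal(size=(n, n))
b_bias = r_bias                                 # (n, n), static positional bias

# Contextual mode: b_ij = (x_i W^Q) . r_ij, so the same r_ij yields a different
# bias depending on the content of the query.
r_ctx = rng.normal(size=(n, n, d))
b_ctx = np.einsum('id,ijd->ij', q, r_ctx)       # (n, n), input-dependent bias
```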

Piecewise Index Function

Before describing the two-dimensional relative position weights, a many-to-one function is first introduced to map a relative distance to an integer in a finite set; the encoding is then shared among the different relative positions indexed by the same integer. Such an index function greatly reduces the computational cost and the number of parameters for long sequences (such as high-resolution images).

Although the clip function h(x) = max(−β, min(β, x)) used in [18] also reduces the cost, all positions whose relative distance is greater than β are assigned the same encoding. This inevitably discards contextual information about long-range relative positions.

The paper instead introduces a piecewise function g(x): ℝ → {y ∈ ℤ | −β ≤ y ≤ β} to map relative distances to the corresponding encodings. The function is built on the assumption that close neighbors are more important than distant ones and distributes attention according to relative distance. It is expressed as

g(x) = [x], if |x| ≤ α
g(x) = sign(x) × min(β, [α + ln(|x|/α) / ln(γ/α) × (β − α)]), otherwise,

where [·] is the rounding operation and sign(·) returns the sign of its argument: 1 for positive input, −1 for negative input, and 0 otherwise. α determines the piecewise point, β bounds the output within the range [−β, β], and γ adjusts the curvature of the logarithmic part.

Figure 2 compares the piecewise function g(x) with the clip function h(x) = max(−β, min(β, x)). The clip function distributes attention uniformly and ignores distant positions, whereas the piecewise function distributes different levels of attention according to the relative distance. The author believes that the latent information of distant positions should be preserved, especially for high-resolution images or tasks that require long-range feature dependencies, so g(x) is chosen to construct the mapping.
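The two index functions can be sketched as follows; the parameter values are only illustrative, not the paper's settings:

```python
import numpy as np

def clip_index(x, beta):
    # h(x): exact index inside [-beta, beta], everything farther collapses to +-beta.
    return np.maximum(-beta, np.minimum(beta, x))

def piecewise_index(x, alpha, beta, gamma):
    # g(x): exact index for near offsets, logarithmically compressed far offsets.
    x = np.asarray(x, dtype=float)
    near = np.abs(x) <= alpha
    far = np.sign(x) * np.minimum(
        beta,
        np.round(alpha + np.log(np.abs(x) / alpha + 1e-12)
                 / np.log(gamma / alpha) * (beta - alpha)))
    return np.where(near, np.round(x), far).astype(int)

offsets = np.arange(-20, 21)
print(clip_index(offsets, beta=8))                          # far offsets all clipped to +-8
print(piecewise_index(offsets, alpha=4, beta=8, gamma=16))  # far offsets compressed, not all identical
```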

2D relative position calculation

1. Euclidean method: computes the Euclidean distance between two relative positions and maps the distance to the corresponding encoding through a learnable bias scalar or contextual vector (a code sketch of all four methods follows this list).

2. Quantization method: in the Euclidean method above, close neighbors with different relative distances may be mapped to the same index. For example, the 2D relative positions (1, 0) and (1, 1) are both mapped to index 1, even though these neighbors should be distinguished. Therefore, the Euclidean distance is quantized so that different real-valued distances map to different integers.

The operation quant(·) maps the set of real-valued distances {0, 1, 1.41, 2, 2.24, ...} to the set of integers {0, 1, 2, 3, 4, ...}. This method is also non-directional.

3. Cross method: the direction of relative positions is also important for images, so a directional mapping method is proposed. This approach, called the cross method, computes the encodings along the horizontal and vertical directions separately and then sums them:

p_{I(i, j)} = p^x_{I^x(i, j)} + p^y_{I^y(i, j)}, with I^x(i, j) = g(x̃_i − x̃_j) and I^y(i, j) = g(ỹ_i − ỹ_j),

where p^x_{I^x(i, j)} and p^y_{I^y(i, j)} are learnable scalars in the bias mode and learnable vectors in the contextual mode. Similar to the encoding in SASA, the same offset along the x or y axis shares the same encoding; the main difference is that the piecewise function is used to allocate attention according to the relative distance.

4. Product method: if the offset along one direction, horizontal or vertical, is the same, the cross method assigns the same encoding on that direction to different relative positions. In addition, the cross method brings extra computational overhead. To improve efficiency and encode more directional information, the product method is designed as

p_{I(i, j)} = p_{I^x(i, j), I^y(i, j)},

i.e., a single encoding indexed jointly by the horizontal bucket I^x(i, j) and the vertical bucket I^y(i, j).
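As a minimal illustration, the four mapping methods can be sketched in bias mode on a toy 3×3 patch grid. The piecewise function g(·) is omitted and all table sizes and values are random stand-ins, so this shows only the indexing idea, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
coords = np.array([(r, c) for r in range(3) for c in range(3)])   # (n, 2) patch grid
diff = coords[:, None, :] - coords[None, :, :]                    # (n, n, 2) offsets
dx, dy = diff[..., 0], diff[..., 1]                               # per-axis offsets
max_off = 2                                                       # offsets lie in [-2, 2]

# 1) Euclidean method (non-directional): bucket by rounded Euclidean distance.
dist = np.linalg.norm(diff, axis=-1)
idx_euclid = np.round(dist).astype(int)          # (1, 0) and (1, 1) collide in bucket 1
b_euclid = rng.normal(size=(idx_euclid.max() + 1,))[idx_euclid]

# 2) Quantization method (non-directional): each distinct distance gets its own bucket.
uniq = np.unique(dist)                           # {0, 1, 1.41, 2, 2.24, 2.83}
idx_quant = np.searchsorted(uniq, dist)          # -> {0, 1, 2, 3, 4, 5}
b_quant = rng.normal(size=(len(uniq),))[idx_quant]

# 3) Cross method (directional): one table per axis, the two encodings are summed.
p_x = rng.normal(size=(2 * max_off + 1,))
p_y = rng.normal(size=(2 * max_off + 1,))
b_cross = p_x[dx + max_off] + p_y[dy + max_off]

# 4) Product method (directional): a single table indexed jointly by both offsets.
p_xy = rng.normal(size=(2 * max_off + 1, 2 * max_off + 1))
b_product = p_xy[dx + max_off, dy + max_off]     # (n, n) bias term
```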

An efficient implementation method

In the contextual mode, all of the above methods share a common term, (x_i W^Q) r_{I(i,j)}^T, which has to be computed for every pair (i, j).

Computing this term directly has time complexity O(n^2 d), where n and d denote the length of the input sequence and the number of feature channels, respectively. Thanks to the many-to-one property of I(i, j), the number of distinct buckets k is usually much smaller than the sequence length n in Vision Transformers. The paper therefore provides the following efficient implementation: precompute z_{i,t} = (x_i W^Q) r_t^T for all positions i and all buckets t, and then set b_ij = z_{i, I(i, j)}.

It takes O(nkd) time to precompute all z_{i,t}; the assignment b_ij = z_{i, I(i, j)} via the mapping t = I(i, j) then has time complexity O(n^2), which costs much less than the precomputation. Therefore, the computational cost of relative position encoding is reduced from O(n^2 d) to O(nkd).
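A NumPy sketch of this precompute-then-gather trick, with toy sizes; q stands in for x_i W^Q, r for the per-bucket trainable vectors, and I for the relative-position index map, all of which are random placeholders here:

```python
import numpy as np

n, d, k = 6, 8, 4                             # toy tokens, channels, buckets
rng = np.random.default_rng(0)
q = rng.normal(size=(n, d))                   # stand-in for x_i W^Q
r = rng.normal(size=(k, d))                   # one trainable vector per bucket
I = rng.integers(0, k, size=(n, n))           # relative-position index map I(i, j)

# Naive contextual term: one dot product per (i, j) pair -> O(n^2 d).
b_naive = np.einsum('id,ijd->ij', q, r[I])

# Efficient version: precompute z[i, t] = q_i . r_t for every bucket -> O(n k d),
# then gather with the index map, which only costs O(n^2).
z = q @ r.T                                   # (n, k)
b_fast = np.take_along_axis(z, I, axis=1)     # b_fast[i, j] = z[i, I[i, j]]

assert np.allclose(b_naive, b_fast)
```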

Conclusion

1. Comparison of the four methods under the two modes.

In Vision Transformers, the directional methods (cross and product) generally perform better than the non-directional methods (Euclidean distance and quantization). This shows that directionality is important for Vision Transformers, because image pixels are highly structured and semantically correlated.

Regardless of the method used, the contextual mode achieves better performance than the bias mode. The underlying reason may be that the contextual mode adapts the encoding to the input features, while the bias mode remains static.

2. Comparison between sharing and not sharing the relative position encoding across attention heads.

For the bias mode, accuracy drops significantly when the encoding is shared across heads. In the contextual mode, by contrast, the performance difference between the two schemes is negligible: both achieve an average top-1 accuracy of 80.9%.

The paper speculates that different heads require different relative position encodings (RPEs) to capture different information. In the contextual mode, each head can compute its own RPE from the formula. In the bias mode, sharing the RPE forces all heads to pay the same positional attention to patches.

3. Comparison between the piecewise function and the clip function.

In the image classification task, the performance difference between the two functions is very small, even negligible. However, in the object detection task, the clip function performs worse than the piecewise function. The underlying reason is that the two functions are very similar when the sequence is short, while the piecewise function becomes effective when the sequence length is much larger than the number of buckets.

Compared with classification, object detection uses much higher-resolution inputs, resulting in much longer input sequences. It is therefore speculated that the piecewise function should be used when the input sequence is long, because it can allocate different attention to large relative distances, whereas the clip function assigns the same encoding to all relative distances greater than β.

4. Comparison with other SOTA models on ImageNet
