
A new local attention mechanism that goes beyond the Vision Transformer, reaching 87.1% top-1 accuracy on ImageNet without any additional training data.

VOLO: Vision Outlooker for Visual Recognition

Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan

Code: github.com/sail-sg/vol…

Abstract

  • Visual recognition has been dominated by CNNs for years. Self-attention-based ViTs show great potential on ImageNet classification, but without additional data there is still a gap between Transformer-based models and the most advanced CNN models. In this work, we aim to close this performance gap and demonstrate that attention-based models can indeed outperform CNNs.
  • We find that the main factor limiting the performance of ViTs on ImageNet classification is their low efficiency in encoding fine-grained features into token representations. To solve this problem, we introduce a new kind of outlook attention and propose a simple and general architecture, called Vision Outlooker (VOLO). Outlook attention efficiently encodes fine-level features and context into token representations; these features are critical to recognition performance but are often overlooked by self-attention.
  • Experiments show that VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification without using any additional training data, the first model to exceed 87% accuracy on this benchmark. In addition, pre-trained VOLO models transfer well to downstream tasks such as semantic segmentation: we obtain 84.3% mIoU on the Cityscapes validation set and 54.3% mIoU on the ADE20K validation set, setting new records.

Conclusion: This paper proposes a new attention mechanism called Outlook Attention. Whereas Self Attention coarsely models global long-range relationships, Outlook Attention encodes neighborhood features more precisely, making up for Self Attention's weakness in fine-grained feature encoding.

OutLooker Attention

The OutLooker module can be thought of as having two separate stages. The first stage contains a stack of OutLookers used to generate fine-level token representations; the second stage deploys a sequence of Transformers to aggregate global information. Before each stage, a patch embedding module maps the input to the required shape.

Theory and formulation

OutLooker is based on the following two points:

  1. The features at each spatial location are representative enough to generate attention weights for aggregating its local neighborhood
  2. Dense local spatial aggregation can encode fine-level information efficiently

OutLooker consists of an outlook attention layer for spatial information encoding and an MLP for channel-wise information interaction. Given an input $\mathbf{X}\in\mathbb{R}^{H\times W\times C}$, it takes the following form:


$$
\tilde{\mathbf{X}} = OutlookAtt(LN(\mathbf{X})) + \mathbf{X}, \qquad (1)
$$

$$
\mathbf{Z} = MLP(LN(\tilde{\mathbf{X}})) + \tilde{\mathbf{X}}. \qquad (2)
$$

where $LN$ denotes LayerNorm.
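To make equations (1) and (2) concrete, here is a minimal sketch of an Outlooker block in PyTorch. It relies on the `OutlookAttention` module defined later in this post; the `mlp_ratio` value and the two-layer MLP here are illustrative choices, not necessarily the exact VOLO configuration.

```python
import torch.nn as nn

class Outlooker(nn.Module):
    """Pre-norm residual block: outlook attention (Eq. 1) followed by an MLP (Eq. 2)."""
    def __init__(self, dim, num_heads=6, kernel_size=3, mlp_ratio=3.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = OutlookAttention(dim, num_heads, kernel_size=kernel_size, padding=1)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                      # x: [B, H, W, C]
        x = x + self.attn(self.norm1(x))       # Eq. (1): X~ = OutlookAtt(LN(X)) + X
        x = x + self.mlp(self.norm2(x))        # Eq. (2): Z = MLP(LN(X~)) + X~
        return x
```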

Methods

As the figure in the paper shows, the whole process is divided into two branches. The first one is introduced below.

Outlook Attention Generation:

  1. The attention weights are obtained by mapping the input from [H, W, C] to [H, W, K^4] along the channel dimension with a fully connected layer
  2. The weights are then reshaped to [H*W, K*K, K*K], i.e. one attention matrix generated by each pixel
  3. Softmax is applied along the last dimension. This is why the channel count is changed to K^4: a relationship has to be established among all pixels inside a K×K window, so this can be seen as a kind of local self-attention, and it is also a major difference from Involution. Here, softmax computes a similarity between each pixel and all other pixels (including itself) within its window; see the shape sketch after this list
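A quick shape check of this generation path (single head, stride 1 so no pooling; the concrete sizes are only for illustration):

```python
import torch
import torch.nn as nn

B, H, W, C, K = 2, 14, 14, 192, 3
x = torch.randn(B, H, W, C)

attn_gen = nn.Linear(C, K ** 4)                  # channel mapping C -> K^4
attn = attn_gen(x)                               # [B, H, W, K^4]
attn = attn.reshape(B, H * W, K * K, K * K)      # one (K*K) x (K*K) weight matrix per pixel
attn = attn.softmax(dim=-1)                      # similarity over the K*K positions in each window
print(attn.shape)                                # torch.Size([2, 196, 9, 9])
```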

Dense aggregation (Value Generation):

  1. First, a linear transformation is applied with a fully connected layer; the number of channels stays the same
  2. nn.Unfold() is then used to expand the result into local windows, giving V with dimensions [H*W, K*K, C]

Calculate The Attention:

  1. The Outlook Attention weights act as the convolution kernel: each K×K weight matrix is multiplied with the corresponding K×K window of V
  2. F.fold() (or nn.Fold) is used to fold the aggregated windows back to the original size (see the shape sketch below)
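The unfold/fold round trip is the least intuitive part, so here is a small standalone shape sketch (illustrative sizes; with stride 1 and padding = kernel_size // 2 the spatial resolution is preserved):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, C, H, W, K = 2, 192, 14, 14, 3
v = torch.randn(B, C, H, W)

unfold = nn.Unfold(kernel_size=K, padding=K // 2, stride=1)
windows = unfold(v)                                    # [B, C*K*K, H*W]: one K x K window per pixel
windows = windows.reshape(B, C, K * K, H * W)

# ... the attention weights would be applied to `windows` here ...

out = F.fold(windows.reshape(B, C * K * K, H * W),
             output_size=(H, W), kernel_size=K, padding=K // 2, stride=1)
print(out.shape)                                       # torch.Size([2, 192, 14, 14]); overlapping windows are summed
```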

The code for the whole process is as follows:


import math

import torch.nn as nn
import torch.nn.functional as F


class OutlookAttention(nn.Module):
    """Implementation of outlook attention.

    --dim: hidden dim
    --num_heads: number of heads
    --kernel_size: kernel size in each window for outlook attention
    return: token features after outlook attention
    """

    def __init__(self, dim, num_heads, kernel_size=3, padding=1, stride=1,
                 qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        head_dim = dim // num_heads
        self.num_heads = num_heads
        self.kernel_size = kernel_size
        self.padding = padding
        self.stride = stride
        self.scale = qk_scale or head_dim ** -0.5

        self.v = nn.Linear(dim, dim, bias=qkv_bias)               # value projection
        self.attn = nn.Linear(dim, kernel_size**4 * num_heads)    # attention map: C -> N * K^4

        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

        self.unfold = nn.Unfold(kernel_size=kernel_size, padding=padding, stride=stride)
        # when stride > 1, the attention map is generated from a pooled feature map
        self.pool = nn.AvgPool2d(kernel_size=stride, stride=stride, ceil_mode=True)

    def forward(self, x):
        B, H, W, C = x.shape

        v = self.v(x).permute(0, 3, 1, 2)  # B, H, W, C -> B, C, H, W

        h, w = math.ceil(H / self.stride), math.ceil(W / self.stride)
        v = self.unfold(v).reshape(B, self.num_heads, C // self.num_heads,
                                   self.kernel_size * self.kernel_size,
                                   h * w).permute(0, 1, 4, 3, 2)  # B, N, C//N, K*K, H*W -> B, N, H*W, K*K, C//N

        attn = self.pool(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        attn = self.attn(attn).reshape(
            B, h * w, self.num_heads, self.kernel_size * self.kernel_size,
            self.kernel_size * self.kernel_size).permute(0, 2, 1, 3, 4)  # B, N, H*W, K*K, K*K
        attn = attn * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).permute(0, 1, 4, 3, 2).reshape(
            B, C * self.kernel_size * self.kernel_size, h * w)
        x = F.fold(x, output_size=(H, W), kernel_size=self.kernel_size,
                   padding=self.padding, stride=self.stride)

        x = self.proj(x.permute(0, 2, 3, 1))
        x = self.proj_drop(x)

        return x
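A quick smoke test of the module above (shape check only; note that the forward pass expects a channel-last [B, H, W, C] tensor):

```python
import torch

attn_layer = OutlookAttention(dim=192, num_heads=6, kernel_size=3, padding=1)
x = torch.randn(2, 14, 14, 192)    # B, H, W, C
y = attn_layer(x)
print(y.shape)                     # torch.Size([2, 14, 14, 192]) -- same shape as the input
```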

Realization of the multi-head mechanism

The implementation of the multi-head mechanism is very simple. Assuming the number of heads is set to $N$, just adjust $W_A\in\mathbb{R}^{C\times K^4}$ to $W_A\in\mathbb{R}^{C\times N\cdot K^4}$. This yields $N$ attention maps $A_n\in\mathbb{R}^{H\times W\times K^4}$ and values $V_n\in\mathbb{R}^{H\times W\times C_N}$, where $C_N\times N=C$; the head outputs are finally concatenated.
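To make this concrete: in the code above, only the output width of the attention projection scales with the number of heads (the sizes below are illustrative):

```python
import torch.nn as nn

C, K, N = 192, 3, 6
attn_proj = nn.Linear(C, N * K ** 4)     # W_A: R^{C x K^4}  ->  R^{C x N*K^4}
print(attn_proj.weight.shape)            # torch.Size([486, 192]), i.e. (N * K^4) x C
# Each head n then uses its own K^4-sized slice A_n together with a C // N channel slice V_n.
```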

Patch Embedding

Patch Embedding originates from ViT and is somewhat similar to pooling: it maps each patch of the feature map to a single token through a convolutional linear transformation, which ultimately has a down-sampling effect.

Unlike the coarseness of pooling, however, Patch Embedding retains information to a certain extent while enlarging the receptive field.

At the same time, the computation of subsequent modules can be reduced

It is implemented simply by controlling the kernel size and stride of a convolution. The VOLO code is as follows:

import torch.nn as nn


class PatchEmbed(nn.Module):
    """Image to Patch Embedding.

    Unlike ViT, which uses a single conv layer, we use 4 conv layers to do patch embedding.
    """

    def __init__(self, img_size=224, stem_conv=False, stem_stride=1,
                 patch_size=8, in_chans=3, hidden_dim=64, embed_dim=384):
        super().__init__()
        assert patch_size in [4, 8, 16]

        self.stem_conv = stem_conv
        if stem_conv:
            self.conv = nn.Sequential(
                nn.Conv2d(in_chans, hidden_dim, kernel_size=7, stride=stem_stride,
                          padding=3, bias=False),  # 112x112 for a 224x224 input with stem_stride=2
                nn.BatchNorm2d(hidden_dim),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=1,
                          padding=1, bias=False),  # 112x112
                nn.BatchNorm2d(hidden_dim),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=1,
                          padding=1, bias=False),  # 112x112
                nn.BatchNorm2d(hidden_dim),
                nn.ReLU(inplace=True),
            )

        self.proj = nn.Conv2d(hidden_dim,
                              embed_dim,
                              kernel_size=patch_size // stem_stride,
                              stride=patch_size // stem_stride)
        self.num_patches = (img_size // patch_size) * (img_size // patch_size)

    def forward(self, x):
        if self.stem_conv:
            x = self.conv(x)
        x = self.proj(x)  # B, C, H, W
        return x


Instead of the single convolution layer ViT uses for embedding, this paper uses four layers. The first three layers extract some features, and the last layer splits the whole feature map into patches of size patch_size × patch_size, each of which becomes one token.
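A quick shape check of the embedding (using stem_stride=2, which matches the 112x112 comments in the code; all numbers here are just for illustration):

```python
import torch

embed = PatchEmbed(img_size=224, stem_conv=True, stem_stride=2,
                   patch_size=8, in_chans=3, hidden_dim=64, embed_dim=384)
img = torch.randn(1, 3, 224, 224)
tokens = embed(img)
print(tokens.shape)   # torch.Size([1, 384, 28, 28]) -- spatial size reduced by patch_size = 8
```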

Patch Embedding plays a key role here. It not only reduces the amount of computation required by the attention module, but also aggregates neighboring information to a certain extent. Because the Outlook Attention map is generated using only a linear transformation along the channel dimension, its receptive field on the feature map is actually 1; Patch Embedding greatly increases the effective receptive field, so even though Outlook Attention only covers a few pixels adjacent to the center pixel, its receptive field on the original image is quite large.
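A rough back-of-the-envelope check of that claim (ignoring the extra receptive field contributed by the stem convolutions; the settings are illustrative VOLO-like values):

```python
patch_size, kernel_size = 8, 3          # illustrative VOLO-like settings
pixels_per_token = patch_size           # each token comes from a patch_size x patch_size image patch
window_on_image = kernel_size * pixels_per_token
print(window_on_image)                  # 24 -> a 3x3 outlook window spans roughly 24x24 input pixels
```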

Of course, the Value Generation step has already aggregated information over the local neighborhood as well.

The network structure

Attention

In Self Attention, Q, K and V are all linear transformations of the input itself.

In External Attention, Q is a linear transformation of the input itself, while K and V are introduced (learned) parameters.

In Outlook Attention, Q is the input itself, V is a linear transformation of the input, and K is an introduced (learned) parameter.
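A rough side-by-side sketch of the three variants (heads, scaling and normalization details are omitted, the external-attention part is simplified, and all weights here are random stand-ins for learned parameters):

```python
import torch
import torch.nn.functional as F

B, N, C, M, K = 2, 196, 192, 64, 3      # batch, tokens, channels, external memory slots, window size
x = torch.randn(B, N, C)

# Self attention: Q, K and V are all linear transformations of the input itself
Wq, Wk, Wv = torch.randn(C, C), torch.randn(C, C), torch.randn(C, C)
q, k, v = x @ Wq, x @ Wk, x @ Wv
self_out = F.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1) @ v      # [B, N, C]

# External attention (simplified): Q comes from x, K and V are learned external units
Mk, Mv = torch.randn(M, C), torch.randn(M, C)
ext_out = F.softmax(x @ Mk.t(), dim=-1) @ Mv                              # [B, N, C]

# Outlook attention: the attention map comes directly from x through a learned weight W_A
# ("Q is the input itself, K is the parameter"); V is a projection of x aggregated per window
Wa = torch.randn(C, K ** 4)
outlook_attn = F.softmax((x @ Wa).reshape(B, N, K * K, K * K), dim=-1)    # one 9x9 map per token

print(self_out.shape, ext_out.shape, outlook_attn.shape)
```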

Other

In fact, this paper is very similar to Involution: Inverting the Inherence of Convolution for Visual Recognition. Involution proposes a more general formulation, and Outlook Attention can be regarded as an instance of it.

The authors have responded to this comparison; their reply is not reproduced here, but the key differences are discussed below.

The two papers differ in framing: Involution regards the operation as a new kind of convolution, while this paper regards it as an attention module. In practice the two have much the same effect; for example, the multi-head mechanism in this paper roughly corresponds to the grouping in Involution (though not exactly the same).

Another difference is that this paper adopts ViT's patch embedding, which reduces the amount of computation in the attention module. Moreover, it improves on the situation where the generation of the "convolution kernel" depends only on the central pixel (perhaps, through training, the kernel can then model the relationship between the central pixel and its adjacent pixels).