A Detectron2-based implementation of PVT is open source. Stars are welcome: github.com/xiaohu2015/…

Since ViT, research on vision Transformers has exploded. Broadly, it has gone in two directions: one improves ViT itself for image classification; the other applies ViT to other vision tasks such as segmentation and detection, which is where the Pyramid Vision Transformer (PVT) introduced here belongs. Compared with ViT, PVT introduces a CNN-like pyramid structure, so that PVT can serve as a backbone for dense prediction tasks such as segmentation and detection.

CNNs commonly adopt a pyramid structure. As shown in the figure above, a CNN can be divided into several stages: at the beginning of each stage, the height and width of the feature map are halved while the channel dimension is doubled. There are two main reasons for this: downsampling the features with a stride-2 convolution or pooling layer enlarges the receptive field, and it also reduces computation, while the loss of spatial resolution is compensated by the increase in channel dimension.

ViT, by contrast, has a global receptive field from the start, so its design is simple and direct: tokenize the input image and stack identical Transformer encoders. This works well enough for image classification, but it runs into problems on dense tasks. First, segmentation and detection usually require high-resolution inputs, and ViT's computation rises sharply as the input image grows. Second, ViT tokenizes with relatively large patches (e.g., 16×16), and the resulting coarse-grained features are a serious handicap for dense prediction. This is exactly what PVT sets out to solve. PVT adopts a CNN-like design that divides the network into stages, where the feature map of each stage has half the height and width of the previous one, i.e., the number of tokens drops by a factor of four.
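To make the cost issue concrete, here is a rough back-of-the-envelope sketch (my own illustrative numbers, not taken from the paper): with 16×16 patches the token count grows linearly with the pixel count, but the self-attention cost grows roughly with the square of the token count.

def num_tokens(h, w, patch=16):
    # number of ViT tokens for an h x w input with non-overlapping patches
    return (h // patch) * (w // patch)

n_cls = num_tokens(224, 224)    # 196 tokens for a typical classification input
n_det = num_tokens(800, 1280)   # 4000 tokens for a typical detection-sized input
print(n_cls, n_det)             # 196 4000
print((n_det / n_cls) ** 2)     # self-attention cost ratio, roughly 416x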

The input of each stage is a 3-D feature map (for the first stage it is the H×W×3 image). At the beginning of each stage, the input is tokenized as in ViT, i.e., patch embedding with a patch size of 2×2 (4×4 for the first stage), so the spatial dimensions of the feature map produced by that stage are halved and the number of tokens drops by a factor of four. PVT has 4 stages in total, similar to ResNet; relative to the input image, the feature maps of the 4 stages have sizes of 1/4, 1/8, 1/16 and 1/32 respectively. Since different stages have different numbers of tokens, each stage uses its own position embedding, which is added right after the patch embedding. When the input image size changes, the position embeddings can be adapted by interpolation.
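As a sketch of what this looks like in code (simplified and loosely modeled on the whai362/PVT implementation; the helper for interpolating position embeddings is hypothetical), each stage performs patch embedding with a strided convolution, and the learned position embedding can be resized when the input resolution changes:

import torch.nn as nn
import torch.nn.functional as F


class PatchEmbed(nn.Module):
    # Simplified sketch: split the input into patch_size x patch_size patches and
    # project each patch to embed_dim; equivalent to a strided convolution.
    def __init__(self, patch_size=4, in_chans=3, embed_dim=64):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                   # (B, embed_dim, H/P, W/P)
        H, W = x.shape[2], x.shape[3]
        x = x.flatten(2).transpose(1, 2)   # (B, H*W/P^2, embed_dim)
        return self.norm(x), (H, W)


def resize_pos_embed(pos_embed, old_hw, new_hw):
    # Hypothetical helper: bilinearly interpolate a learned position embedding of
    # shape (1, old_h*old_w, C) to a new spatial resolution (new_h, new_w).
    _, _, C = pos_embed.shape
    pos = pos_embed.reshape(1, old_hw[0], old_hw[1], C).permute(0, 3, 1, 2)
    pos = F.interpolate(pos, size=new_hw, mode='bilinear', align_corners=False)
    return pos.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], C)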

Different stages have different numbers of tokens, and the earlier stages have more patches. Since the computation of self-attention scales with the square of the sequence length, using identical Transformer encoders for every stage would make the cost prohibitive. To reduce computation, PVT uses different network hyper-parameters for different stages. The settings of the PVT variants are listed in the paper, where P_i is the patch size, C_i the feature dimension, N_i the number of heads in multi-head attention (MHA), and E_i the expansion ratio of the FFN (the default in Transformer is 4).

For example, the feature dimension of stage 1 is only 64 while that of stage 4 is 512, similar to the settings of conventional CNNs. So although the earlier stages have many patches, their feature dimension is small and the computation stays manageable. The PVT variants of different sizes differ mainly in the number of Transformer encoder layers in each stage.
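For reference, the per-stage settings of PVT-Small in the whai362/PVT repository look roughly like the sketch below (quoted as an illustration; verify the exact numbers against the code):

# Approximate per-stage hyper-parameters of PVT-Small (illustrative; check the repo).
pvt_small = dict(
    patch_sizes=[4, 2, 2, 2],        # patch size P_i of each stage
    embed_dims=[64, 128, 320, 512],  # feature dimension C_i
    num_heads=[1, 2, 5, 8],          # MHA head number N_i
    mlp_ratios=[8, 8, 4, 4],         # FFN expansion ratio E_i
    sr_ratios=[8, 4, 2, 1],          # SRA reduction ratio R_i (see below)
    depths=[3, 4, 6, 3],             # number of Transformer encoder layers per stage
)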

To reduce computation further, PVT replaces the conventional multi-head attention (MHA) with spatial-reduction attention (SRA). The core idea of SRA is to reduce the number of keys and values in the attention layer: in standard MHA the number of key-value pairs equals the sequence length, whereas SRA reduces it to 1/R² of that, where R is the reduction ratio. The structure of SRA is as follows:

Concretely, the input sequence is first reshaped back to its spatial form and split into patches of size R×R, and each patch is projected by a linear transformation into an embedding of dimension C (the implementation is similar to patch embedding and is equivalent to a strided convolution), followed by a LayerNorm. This greatly reduces the number of K and V. The implementation is as follows:

import torch.nn as nn


class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None,
                 attn_drop=0., proj_drop=0., sr_ratio=1):
        super().__init__()
        assert dim % num_heads == 0, f"dim {dim} should be divided by num_heads {num_heads}."

        self.dim = dim
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5

        self.q = nn.Linear(dim, dim, bias=qkv_bias)
        self.kv = nn.Linear(dim, dim * 2, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

        self.sr_ratio = sr_ratio
        # spatial reduction of K/V, implemented as a strided convolution
        if sr_ratio > 1:
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)

        if self.sr_ratio > 1:
            x_ = x.permute(0, 2, 1).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).permute(0, 2, 1)  # shape: (B, N/R^2, C)
            x_ = self.norm(x_)
            kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        else:
            kv = self.kv(x).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)

        return x
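A quick shape check of the module above (illustrative usage, not from the repo): for stage 1 with sr_ratio=8 on a 224×224 input, the query length stays at 56×56 = 3136 tokens while K and V are built from only 7×7 = 49 tokens.

import torch

B, H, W, C = 2, 56, 56, 64                 # stage-1 resolution of a 224x224 input
x = torch.randn(B, H * W, C)               # 3136 tokens
attn = Attention(dim=C, num_heads=1, sr_ratio=8)
out = attn(x, H, W)
print(out.shape)                           # torch.Size([2, 3136, 64])
# Inside the module, K/V come from a 7x7 map: 3136 / 8^2 = 49 key-value pairs.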

From PVT's network settings, the reduction ratio is larger in the earlier stages; in stage 1, for example, the number of K and V is reduced to 1/64 of the original, which greatly cuts the computation.

For the image classification task, PVT, like ViT, introduces a class token for the final classification, but PVT only introduces it at the last stage:

def forward_features(self, x):
    B = x.shape[0]

    # stage 1
    x, (H, W) = self.patch_embed1(x)
    x = x + self.pos_embed1
    x = self.pos_drop1(x)
    for blk in self.block1:
        x = blk(x, H, W)
    x = x.reshape(B, H, W, -1).permute(0, 3, 1, 2).contiguous()

    # stage 2
    x, (H, W) = self.patch_embed2(x)
    x = x + self.pos_embed2
    x = self.pos_drop2(x)
    for blk in self.block2:
        x = blk(x, H, W)
    x = x.reshape(B, H, W, -1).permute(0, 3, 1, 2).contiguous()

    # stage 3
    x, (H, W) = self.patch_embed3(x)
    x = x + self.pos_embed3
    x = self.pos_drop3(x)
    for blk in self.block3:
        x = blk(x, H, W)
    x = x.reshape(B, H, W, -1).permute(0, 3, 1, 2).contiguous()

    # stage 4: the class token is prepended here, before the last group of encoders
    x, (H, W) = self.patch_embed4(x)
    cls_tokens = self.cls_token.expand(B, -1, -1)
    x = torch.cat((cls_tokens, x), dim=1)
    x = x + self.pos_embed4
    x = self.pos_drop4(x)
    for blk in self.block4:
        x = blk(x, H, W)

    x = self.norm(x)

    return x[:, 0]

For classification, PVT's top-1 accuracy on ImageNet is actually comparable to ViT's. The real value of PVT lies in serving as a backbone for dense tasks such as segmentation and detection. On the one hand, thanks to its design, PVT's computation does not blow up the way ViT's does for high-resolution inputs. The paper compares how the GFLOPs of ViT-Small, PVT-Small and ResNet50 grow with the input scale, and PVT grows far more slowly than ViT. At an input scale of 640, PVT-Small and ResNet50 have similar computation, but as the input grows further, PVT's cost rises much faster than ResNet50's.

Another advantage of PVT over ViT is that it outputs feature maps at different scales, which matters a great deal for segmentation and detection: most current segmentation and detection models use an FPN structure, so PVT can take the place of a CNN backbone and connect seamlessly to segmentation and detection heads. The paper runs extensive experiments on detection, semantic segmentation and instance segmentation that demonstrate PVT's advantages on dense tasks. For example, a RetinaNet with PVT-Small as the backbone reaches a higher AP on COCO than the ResNet50-based RetinaNet while using less inference time (38.7 vs. 36.3); increasing the input scale improves the results further, but costs additional inference time.
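As an illustration of how the multi-scale outputs can be plugged into an FPN-style neck, here is a minimal sketch using torchvision's FeaturePyramidNetwork; the feature shapes below are stand-ins for real PVT-Small outputs on a 224×224 image:

from collections import OrderedDict

import torch
from torchvision.ops import FeaturePyramidNetwork

# Minimal sketch: feed the four PVT stage outputs (strides 4/8/16/32, channels
# 64/128/320/512 for PVT-Small) into a standard FPN neck with 256 output channels.
fpn = FeaturePyramidNetwork(in_channels_list=[64, 128, 320, 512], out_channels=256)

feats = OrderedDict(
    c2=torch.randn(1, 64, 56, 56),
    c3=torch.randn(1, 128, 28, 28),
    c4=torch.randn(1, 320, 14, 14),
    c5=torch.randn(1, 512, 7, 7),
)
outs = fpn(feats)
for name, f in outs.items():
    print(name, f.shape)   # every level now has 256 channels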

Therefore, although PVT solves part of the problem, when the input resolution is particularly large, CNN-based schemes may still be the better choice. In addition, the recent YOLOF paper points out that using only ResNet's C5 feature plus a few modules that enlarge the receptive field can achieve comparable detection results, which makes one wonder whether multi-scale features are really necessary, especially since a Transformer encoder itself already has a global receptive field. Recently, Intel proposed DPT, which builds directly on ViT and obtains feature maps at different scales for dense tasks through a Reassemble operation, reaching a new SOTA on the ADE20K semantic segmentation dataset (49.02 mIoU). Microsoft's recently proposed Swin Transformer has a network architecture similar to PVT's, yet reaches SOTA performance on various detection and segmentation datasets (53.5 mIoU on ADE20K semantic segmentation). Its core idea is a shifted-window scheme that restricts self-attention to local windows and thereby reduces its cost.

I believe even better work will come in the future. Looking forward to it!

References

  1. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

  2. whai362/PVT

  3. Pyramid Vision Transformer in plain English

  4. You Only Look One-level Feature

  5. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

  6. Vision Transformers for Dense Prediction

