Preface:
Convolutional neural networks (CNNs) have achieved remarkable success in computer vision and have become a universal and dominant approach for almost all computer vision tasks.
Given the Transformer's success in natural language processing (NLP), many researchers are exploring how to apply it to vision. Some works model visual tasks as dictionary-lookup problems with learnable queries and use a Transformer decoder as a task-specific head on top of a CNN backbone such as VGG or ResNet, while other existing work incorporates attention modules into CNNs.
Paper: arxiv.org/pdf/2102.12…
Code: github.com/whai362/PVT
Introduction
This paper proposes a convolution-free backbone network built on the Transformer, called Pyramid Vision Transformer (PVT), which can serve as a universal backbone for many downstream tasks, including both image-level prediction and pixel-level dense prediction.
Comparison of different architectures, where “Conv” and “TF-E” denote convolution and Transformer encoders respectively.
(a) Many CNN backbones use a pyramid structure for dense prediction tasks such as object detection (DET) and semantic/instance segmentation (SEG).
(b) The recently proposed Vision Transformer (ViT) is a "columnar" structure designed specifically for image classification (CLS).
(c) By incorporating the pyramid structure of CNNs, we propose the Pyramid Vision Transformer (PVT), which can serve as a universal backbone for many computer vision tasks, expanding the scope and influence of ViT. In addition, our experiments show that PVT can easily be combined with DETR to build an end-to-end object detection system free of convolutions and hand-designed components such as dense anchors and non-maximum suppression (NMS).
Specifically, as shown in the figure above, PVT differs from ViT and overcomes the difficulties of applying the traditional Transformer to dense prediction in the following ways:
(1) It takes fine-grained image patches (4×4 pixels per patch) as input to learn high-resolution representations, which is critical for dense prediction tasks.
(2) It shrinks the sequence length of the Transformer as the network deepens, significantly reducing the computational cost.
(3) It adopts a spatial-reduction attention (SRA) layer to further reduce the resource cost of learning high-resolution feature maps (a rough cost comparison follows below).
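To make point (3) concrete, here is a back-of-the-envelope cost comparison (my own sketch, not taken from the paper): with n tokens of width d, standard multi-head attention forms an n×n score matrix, while SRA first shrinks K and V by a factor of Ri², so the score matrix becomes n × n/Ri².

```latex
% Back-of-the-envelope attention cost (illustrative sketch, not from the paper):
% n tokens of width d; SRA reduces K and V to n / R_i^2 tokens.
\Omega(\mathrm{MHA}) \approx 2\,n^{2}d
\qquad\longrightarrow\qquad
\Omega(\mathrm{SRA}) \approx \frac{2\,n^{2}d}{R_i^{2}}
```

With the reduction ratio R1 = 8 used in the first stage of the paper's settings, this cuts the attention cost roughly 64-fold exactly where the token sequence is longest.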
Method
Similar to CNN backbone networks, the proposed method has four stages that generate feature maps at different scales. All stages share a similar architecture consisting of a patch embedding layer and Li Transformer encoder layers.
The overall structure of the proposed Pyramid Vision Transformer (PVT).
The whole model is divided into four stages, each consisting of a patch embedding layer and an Li-layer Transformer encoder. Following the pyramid structure, the output resolution of the four stages progressively shrinks from stride 4 to stride 32.
In the first stage, given an input image of size H×W×3, they first divide it into HW/4² patches, each of size 4×4×3.
They then feed the flattened patches to a linear projection and obtain embedded patches of size HW/4²×C1. The embedded patches, together with a position embedding, are then passed through a Transformer encoder with L1 layers, and the output is reshaped into a feature map F1 of size H/4×W/4×C1. In the same way, using the feature map from the previous stage as input, they obtain the subsequent feature maps F2, F3, and F4, whose strides are 8, 16, and 32 pixels with respect to the input image.
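A minimal PyTorch sketch of the patch embedding step described above (my own illustration with hypothetical names, not the authors' code; see the linked repository for the real implementation):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image (or a previous stage's feature map) into non-overlapping
    patches and linearly project them, as in the description above."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=64):
        super().__init__()
        # A kernel-P, stride-P conv implements "divide into P x P patches +
        # linear projection" in a single operation.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, embed_dim, H/P, W/P)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)     # (B, HW/P^2, embed_dim) token sequence
        return x, (H, W)                     # keep spatial size for reshaping later

# Stage 1: a 224x224 image becomes 224*224/4^2 = 3136 tokens of width C1 = 64 (assumed).
tokens, (h, w) = PatchEmbed(4, 3, 64)(torch.randn(1, 3, 224, 224))
print(tokens.shape, h, w)                    # torch.Size([1, 3136, 64]) 56 56
```

After the L1 encoder layers, the token sequence is reshaped back to (B, C1, 56, 56) to form F1, which feeds the next stage's patch embedding; since the stride doubles at each stage (4→8→16→32), stages 2-4 use a patch size of 2.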
A spatial-reduction attention (SRA) layer is proposed to replace the traditional multi-head attention (MHA) layer in the encoder.
Like MHA, SRA takes a query Q, a key K, and a value V as input and outputs a refined feature. The difference is that SRA reduces the spatial scale of K and V before the attention operation, which greatly reduces the computation/memory overhead.
SRA in stage i is defined as follows:

SRA(Q, K, V) = Concat(head_0, …, head_{N_i}) W^O
head_j = Attention(Q W_j^Q, SR(K) W_j^K, SR(V) W_j^V)

where W_j^Q, W_j^K, W_j^V ∈ R^{C_i×d_head} and W^O ∈ R^{C_i×C_i} are the parameters of linear projections. N_i is the number of heads of the Transformer encoder in stage i, so the size of each head is d_head = C_i/N_i. SR(·) is the spatial-reduction operation, which is defined as:

SR(x) = Norm(Reshape(x, R_i) W^S)

where x ∈ R^{(H_i W_i)×C_i} is the input sequence and R_i is the reduction ratio of the attention layers in stage i. Reshape(x, R_i) reshapes x to size (H_i W_i / R_i²)×(R_i² C_i), W^S ∈ R^{(R_i² C_i)×C_i} is a linear projection, and Norm(·) is layer normalization.
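A minimal PyTorch sketch of SRA following the formulation above (my own reading with hypothetical names, not the authors' code; SR(·) is realized here as a strided convolution plus LayerNorm, a common equivalent of the reshape-plus-linear-projection definition):

```python
import torch
import torch.nn as nn

class SRA(nn.Module):
    """Spatial-reduction attention: shrink K and V by R_i^2 before attention."""
    def __init__(self, dim, num_heads=1, sr_ratio=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads              # d_head = C_i / N_i
        self.q = nn.Linear(dim, dim)                  # W^Q (all heads fused)
        self.kv = nn.Linear(dim, 2 * dim)             # W^K and W^V (fused)
        self.proj = nn.Linear(dim, dim)               # W^O
        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # SR(x): fold each R x R neighborhood into one token, project back
            # to dim, then layer-normalize.
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):                       # x: (B, N, C) with N = H*W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        if self.sr_ratio > 1:                         # spatially reduce the K/V source
            x = self.norm(self.sr(x.transpose(1, 2).reshape(B, C, H, W))
                          .reshape(B, C, -1).transpose(1, 2))   # (B, N/R^2, C)
        kv = self.kv(x).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)              # each: (B, heads, N/R^2, d_head)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Attention now scores N x N/R^2 pairs instead of N x N.
y = SRA(dim=64, num_heads=1, sr_ratio=8)(torch.randn(1, 56 * 56, 64), 56, 56)
print(y.shape)                                        # torch.Size([1, 3136, 64])
```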
Using the feature pyramid {F1, F2, F3, F4}, the method can be easily applied to most downstream tasks, including image classification, object detection, and semantic segmentation.
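For illustration, with a 224×224 input the pyramid has the same strides as a ResNet-style backbone, so FPN-like detection and segmentation heads can consume it unchanged. For classification, the paper attaches a learnable class token in the last stage; the sketch below instead uses plain global average pooling on F4 as a simpler hypothetical stand-in (the channel width C4 = 512 is also an assumption):

```python
import torch

# Assumed pyramid shapes for a 224x224 input at strides 4/8/16/32:
# F1: (B, C1, 56, 56), F2: (B, C2, 28, 28), F3: (B, C3, 14, 14), F4: (B, C4, 7, 7)
f4 = torch.randn(1, 512, 7, 7)               # hypothetical C4 = 512
head = torch.nn.Linear(512, 1000)            # 1000-way ImageNet-style classifier
logits = head(f4.mean(dim=(2, 3)))           # global average pool -> (1, 1000)
print(logits.shape)
```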
Conclusion
In general, the proposed PVT has the following advantages. First, unlike traditional CNN backbones, whose receptive field grows only gradually with depth, PVT always produces a global receptive field, which is more suitable for detection and segmentation than the local receptive field of CNNs. Second, compared with ViT, the proposed approach is easier to plug into many representative dense prediction pipelines thanks to its pyramid structure. Third, PVT can be combined with Transformer decoders designed for different tasks to build convolution-free pipelines.
PVT+DETR, the first completely convolution-free end-to-end object detection pipeline, achieves 34.7 AP on COCO val2017, outperforming the original ResNet50-based DETR.
Original link:
Medium.com/mlearning-a…