Preface

Recently, with ViT taking off, it feels like a new Transformer-based CV paper appears almost every day. Today I will introduce another Transformer-based object detection paper, published by Facebook at ECCV 2020. This paper is the baseline for quite a lot of follow-up Transformer detection/segmentation work, and through it we can understand the general recipe.

Related work

When it comes to object detection, let’s first briefly review the most fundamental baseline, Faster R-CNN.

The first step of Faster R-CNN is to use a CNN to extract features from the image; candidate boxes are then generated by a region proposal network and filtered with non-maximum suppression, and finally the position and category of each candidate box are predicted.
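As a rough mental model (the helper names below are hypothetical stand-ins, not the actual torchvision implementation), the two-stage flow looks like this:

```python
# Illustrative pseudocode of the two-stage Faster R-CNN pipeline described above.
# `backbone`, `rpn`, `roi_head` and `non_max_suppression` are hypothetical stand-ins.
def faster_rcnn(image, backbone, rpn, roi_head, non_max_suppression):
    features = backbone(image)                        # 1. CNN extracts features
    proposals = non_max_suppression(rpn(features))    # 2. RPN proposes boxes, NMS filters overlaps
    classes, boxes = roi_head(features, proposals)    # 3. classify and refine each candidate box
    return classes, boxes
```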

How DETR is implemented

The DETR paper greatly simplifies this pipeline by replacing the candidate-box extraction step with a standard Transformer encoder-decoder architecture, directly predicting the location and category of objects in the decoder.

The process is divided into three steps:

  1. CNN extracts features

  2. Transformer encoder-decoder fuses information

  3. FFN predicts class and box

CNN

Using a ResNet-50 network, the input image is turned into a stride-32 feature map with 2048 channels, and a 1×1 convolution reduces the channels from 2048 to a smaller dimension d (256 in the paper).
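A minimal sketch of this step (shapes assume a 512×512 input; the variable names are my own):

```python
import torch
from torch import nn
from torchvision.models import resnet50

# ResNet-50 without its average-pool and fc layers, i.e. only the convolutional stages
backbone = nn.Sequential(*list(resnet50(pretrained=True).children())[:-2])
reduce_dim = nn.Conv2d(2048, 256, kernel_size=1)   # 1x1 conv: 2048 channels -> d = 256

image = torch.randn(1, 3, 512, 512)
feat = reduce_dim(backbone(image))                 # (1, 256, 16, 16): a stride-32 feature map
```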

Transformers encoder-decoder

The Transformer encoder first flattens the input feature map into HW vectors of dimension d, each serving as an input token. Since self-attention is permutation-invariant, each token’s position in the original feature map must still be reflected, so a positional encoding is added to each token. The encoder output then serves as the keys (K) and values (V) for the decoder’s cross-attention.

For example, with a 512×512 input image, the feature map is 16×16, giving HW = 256 tokens of dimension d = 256.
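Roughly, the flattening into tokens looks like this (a learned `pos` parameter is used here purely as a stand-in for the positional encoding):

```python
import torch
from torch import nn

d = 256
feat = torch.randn(1, d, 16, 16)                 # backbone feature map (B, d, H, W)
B, _, H, W = feat.shape

tokens = feat.flatten(2).permute(2, 0, 1)        # (H*W, B, d): one token per feature-map pixel
pos = nn.Parameter(torch.rand(H * W, 1, d))      # positional encoding, one vector per token
encoder_in = tokens + pos                        # positions must be injected explicitly,
                                                 # since self-attention is permutation-invariant
```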

The decoder input is 100 object queries. Note that 100 is not the number of categories in the dataset but the maximum number of objects the model can predict per image; each object query passes through the Transformer decoder and is decoded into one object’s category and location.
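A rough sketch of the decoder side, using `nn.Transformer` purely for illustration:

```python
import torch
from torch import nn

d, num_queries = 256, 100
transformer = nn.Transformer(d_model=d, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6)

encoder_tokens = torch.randn(256, 1, d)                      # flattened feature map (H*W, B, d)
object_queries = nn.Parameter(torch.rand(num_queries, d))    # 100 learned object queries

# each query cross-attends to the encoder output and becomes one object "slot"
slots = transformer(encoder_tokens, object_queries.unsqueeze(1))   # (100, 1, d)
```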

During training, the authors found it helpful to add auxiliary losses at each decoder layer, in particular to help the model output the correct number of objects of each class.
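The idea, sketched with hypothetical helper names (`class_head`, `bbox_head`, `hungarian_loss` are not from the paper’s code), is to apply the same prediction heads and loss to every decoder layer’s output:

```python
def auxiliary_loss(intermediate_outputs, targets, class_head, bbox_head, hungarian_loss):
    # intermediate_outputs: list of per-decoder-layer outputs, each of shape (num_queries, B, d)
    total = 0.0
    for h in intermediate_outputs:
        logits = class_head(h)              # per-layer class predictions
        boxes = bbox_head(h).sigmoid()      # per-layer boxes, normalized to [0, 1]
        total = total + hungarian_loss(logits, boxes, targets)  # same matched loss at every layer
    return total
```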

FFN

DETR adds a prediction FFN and the Hungarian loss after each decoder layer, with all prediction FFNs sharing parameters. An additional shared layer norm normalizes the inputs to the prediction FFNs coming from different decoder layers. The FFN itself is a simple multilayer perceptron that, for each object query output by the Transformer decoder, predicts its category and location. During training, the Hungarian algorithm matches predictions to labels so that the total matching loss is minimal, and only the matched queries contribute to the loss and receive gradients.
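As I understand the paper, the heads themselves are tiny: a linear layer for the class (plus a “no object” class) and a small MLP for the box. A simplified sketch:

```python
from torch import nn

d, num_classes = 256, 91
class_head = nn.Linear(d, num_classes + 1)    # +1 for the "no object" class

# 3-layer MLP predicting normalized box coordinates (cx, cy, w, h)
bbox_head = nn.Sequential(
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4),
)
# applied to the decoder output h of shape (num_queries, B, d):
#   logits = class_head(h);  boxes = bbox_head(h).sigmoid()
```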

The loss includes a box loss and a class loss.

The box loss combines an IoU loss and an L1 loss. The principle is very simple:

$$\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{\text{iou}}\,\mathcal{L}_{\text{iou}}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{\text{L1}}\,\lVert b_i - \hat{b}_{\sigma(i)} \rVert_1$$

where $\lambda_{\text{iou}}, \lambda_{\text{L1}} \in \mathbb{R}$ are hyperparameters and $\mathcal{L}_{\text{iou}}$ is the generalized IoU [38].

Class Loss is the simplest cross entropy.
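A minimal sketch of both terms for already-matched prediction/target pairs (the λ values are the defaults I recall from the paper; torchvision’s `generalized_box_iou` expects boxes in (x1, y1, x2, y2) format):

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def box_loss(pred_boxes, tgt_boxes, lambda_iou=2.0, lambda_l1=5.0):
    # pred_boxes, tgt_boxes: matched pairs, shape (N, 4), (x1, y1, x2, y2) format
    l1 = F.l1_loss(pred_boxes, tgt_boxes, reduction="mean")
    giou = torch.diag(generalized_box_iou(pred_boxes, tgt_boxes))   # one GIoU per matched pair
    return lambda_iou * (1 - giou).mean() + lambda_l1 * l1

def class_loss(pred_logits, tgt_labels):
    # plain cross entropy over all classes, including "no object"
    return F.cross_entropy(pred_logits, tgt_labels)
```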

Hungarian matching

The Hungarian algorithm is a classical graph-theory algorithm from discrete mathematics that solves maximum matching in a bipartite graph. In other words, the bipartite graph is split into two sides: one side is our 100 object-query predictions, the other is the ground-truth labels. Since we do not know in advance which object each of the 100 object queries will predict (the first token might end up predicting object A while the second predicts object C; in short, the assignment is unordered), we find the matching between predictions and ground truth that minimizes the total loss. Predictions that are not matched to any ground-truth object do not contribute to the loss or receive gradients.
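A minimal sketch of the matching itself, using `scipy.optimize.linear_sum_assignment`; the cost here is a simplified stand-in (class probability plus L1 box distance), whereas DETR’s actual matching cost also includes a GIoU term:

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_prob, pred_boxes, tgt_labels, tgt_boxes):
    # pred_prob: (100, num_classes+1) softmax scores, pred_boxes: (100, 4)
    # tgt_labels: (M,) ground-truth classes, tgt_boxes: (M, 4)
    cost_class = -pred_prob[:, tgt_labels]                  # (100, M): higher probability -> lower cost
    cost_bbox = torch.cdist(pred_boxes, tgt_boxes, p=1)     # (100, M): L1 distance between boxes
    cost = (cost_class + cost_bbox).detach().cpu().numpy()
    query_idx, tgt_idx = linear_sum_assignment(cost)        # minimum-cost one-to-one assignment
    return query_idx, tgt_idx   # unmatched queries are assigned the "no object" class
```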

Think DETR in comparison to ViT

In fact, when I read this article, I was mostly paying attention to how its implementation details differ from ViT.

  1. First of all, ViT does not use a CNN, while DETR uses a CNN to extract image features first

  2. ViT uses only the Transformer encoder, adding an extra class token in the encoder to predict the image class, while DETR’s object tokens are learned through the decoder.

  3. Both DETR and ViT use position embeddings in the encoder, though they differ in form: DETR defaults to fixed sine/cosine encodings while ViT uses learnable position embeddings; the ViT position embedding was actually something that puzzled me when I first started reading the literature.

  4. DETR’s Transformer encoder takes each pixel of the CNN feature map as a token embedding, while ViT cuts the image directly into 16×16 patches and flattens each patch into a token embedding (see the sketch after this list)

  5. Compared with ViT, DETR is closer to the original Transformer architecture.
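To make point 4 concrete, here is a shapes-only contrast of the two tokenizations (numbers assume a 512×512 image and a stride-32 backbone):

```python
import torch

image = torch.randn(1, 3, 512, 512)

# ViT: cut the image into 16x16 patches and flatten each patch into one token
patches = image.unfold(2, 16, 16).unfold(3, 16, 16)                   # (1, 3, 32, 32, 16, 16)
vit_tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 1024, 768)  # 32*32 tokens of dim 3*16*16

# DETR: run a CNN first, then treat every feature-map pixel as one token
feat = torch.randn(1, 256, 16, 16)                                    # backbone output for the same image
detr_tokens = feat.flatten(2).permute(0, 2, 1)                        # (1, 256, 256): 16*16 tokens of dim 256
```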

DETR can also do segmentation

  1. First detect the boxes

  2. Predict a mask for each box

  3. Vote on each pixel’s category (sketched below)
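The paper does not spell step 3 out, but the “vote” amounts to a per-pixel argmax over the per-box mask logits, roughly:

```python
import torch

# hypothetical outputs: one mask logit map per detected box, plus each box's class
mask_logits = torch.randn(5, 128, 128)            # (num_boxes, H, W)
box_classes = torch.tensor([1, 3, 3, 7, 2])       # class id of each detected box

winning_box = mask_logits.argmax(dim=0)           # each pixel "votes" for the box with the highest logit
segmentation = box_classes[winning_box]           # (H, W): per-pixel category map
```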

The author does not elaborate on the implementation details in this paper, but SETR, published at CVPR 2021 this year, focuses on how to use Transformers for segmentation. We will discuss it in detail next time.

Analysis of experimental results

Comparison Study

Comparing against Faster R-CNN, the classic detector in this field, DETR detects large objects better with a comparable number of parameters. The reason, according to the authors, is that the Transformer can pay more attention to global information.

Ablation Study

  1. Decoder is more important than Encoder

  2. Decoder has hidden “anchors” that are critical for detection

  3. The encoder mainly helps aggregate pixels belonging to the same object, reducing the burden on the decoder

In the positional-encoding ablation, the author compares learnable positional encodings with sine/cosine-based encodings (i.e., the positional encoding of the original Transformer), and the sine/cosine variant turns out slightly better.
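For reference, a simplified sketch of the fixed 2D sine/cosine encoding idea (not the paper’s exact construction):

```python
import torch

def sine_position_encoding(H, W, d=256, temperature=10000):
    # half of the channels encode the row index, the other half the column index
    half = d // 2
    dim_t = temperature ** (2 * (torch.arange(half) // 2) / half)     # per-channel frequency
    y = torch.arange(H).float()[:, None, None] / dim_t                # (H, 1, half)
    x = torch.arange(W).float()[None, :, None] / dim_t                # (1, W, half)
    pos_y = torch.stack((y[..., 0::2].sin(), y[..., 1::2].cos()), dim=-1).flatten(-2)
    pos_x = torch.stack((x[..., 0::2].sin(), x[..., 1::2].cos()), dim=-1).flatten(-2)
    pos = torch.cat([pos_y.expand(H, W, half), pos_x.expand(H, W, half)], dim=-1)
    return pos.flatten(0, 1)    # (H*W, d): one fixed encoding per feature-map pixel
```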

There is also a simple ablation verifying the contribution of each part of the loss, which is fairly routine.

Simple code

The author ends with a brief code implementation in the appendix:

```python
import torch
from torch import nn
from torchvision.models import resnet50


class DETR(nn.Module):
    def __init__(self, num_classes, hidden_dim, nheads,
                 num_encoder_layers, num_decoder_layers):
        super().__init__()
        # We take only convolutional layers from ResNet-50 model
        self.backbone = nn.Sequential(
            *list(resnet50(pretrained=True).children())[:-2])
        # 1x1 conv reduces the 2048 backbone channels to hidden_dim
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        # prediction heads: classes (+1 for "no object") and box coordinates
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)
        # 100 learned object queries and learned 2D positional embeddings
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        x = self.backbone(inputs)
        h = self.conv(x)
        H, W = h.shape[-2:]
        # build a positional encoding for every feature-map pixel
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        # encoder input: flattened feature map + positions; decoder input: object queries
        h = self.transformer(pos + h.flatten(2).permute(2, 0, 1),
                             self.query_pos.unsqueeze(1))
        return self.linear_class(h), self.linear_bbox(h).sigmoid()


detr = DETR(num_classes=91, hidden_dim=256, nheads=8,
            num_encoder_layers=6, num_decoder_layers=6)
detr.eval()
inputs = torch.randn(1, 3, 800, 1200)
logits, bboxes = detr(inputs)
```

Conclusion

DETR is a very simple application of Transformers to object detection, and a paper that gets cited throughout the recently popular Transformer series. I think it and ViT represent two views of the Transformer architecture in CV: ViT uses only the encoder, which is also the most mainstream approach at present, while DETR combines a CNN with a Transformer encoder-decoder. In terms of motivation, I personally prefer DETR. Over this period I have basically read through the DETR-style Transformer papers, and I will explain several of the good ones in this series (there is also a lot of low-quality filler).

