DETR is built on the standard Transformer structure, and its performance is comparable to that of Faster RCNN. More notably, the overall idea of the paper is very simple, and hopefully, much as Faster RCNN did, it can provide a general framework for many subsequent studies
Source: Xiaofei algorithm engineering Notes public account
Paper: End-to-End Object Detection with Transformers
- Paper address: arxiv.org/abs/2005.12…
- Paper code: github.com/facebookres…
Introduction
I’ve previously seen some work on applying self-attention to visual tasks, for example Stand-Alone Self-Attention in Vision Models and On the Relationship between Self-Attention and Convolutional Layers, but most of these methods only achieve effects similar to convolution and have not stood out. DETR instead builds mainstream object detection on the Transformer, with three main highlights:
- Standard Transformer: DETR uses a standard Transformer plus feed-forward networks (FFNs) for feature processing and output, with carefully designed position encodings and object queries, and no anchors; bbox coordinates and classes are predicted directly.
- Set prediction: during training, DETR uses the Hungarian algorithm to put GT in one-to-one correspondence with the model's predictions, so that at inference the model's predictions are directly the final results, with no subsequent NMS operation.
- Object detection performance surpassing the classic Faster RCNN, opening up a new line of object detection research; DETR can also be adapted to the panoptic segmentation task with good performance.
The DETR model
DETR architecture
The overall architecture of DETR is simple, as shown in Figure 2, and consists of three main parts: a CNN backbone, an encoder-decoder Transformer, and a simple feed-forward network (FFN).
Backbone
Define the initial image as $x_{img} \in \mathbb{R}^{3\times H_0\times W_0}$. A conventional CNN backbone generates a low-resolution feature map $f\in \mathbb{R}^{C\times H\times W}$; the paper uses $C=2048$ and $H,W=\frac{H_0}{32},\frac{W_0}{32}$.
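To make the shapes concrete, here is a minimal sketch of such a backbone using torchvision's ResNet-50 (the paper's base configuration), truncated before its pooling and classification head; the 800×1066 input size is just an example:

```python
import torch
import torchvision

# ResNet-50 with the average-pool and fc layers dropped, leaving a
# 2048-channel feature map at 1/32 of the input resolution.
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50().children())[:-2]
)

x_img = torch.randn(1, 3, 800, 1066)  # a batch with one example image
f = backbone(x_img)
print(f.shape)  # torch.Size([1, 2048, 25, 34]), i.e. C x H0/32 x W0/32
```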
Transformer encoder
The input is first reduced to a smaller dimension $d$ by a $1\times 1$ convolution, giving a new feature map $z_0 \in \mathbb{R}^{d\times H\times W}$. The spatial dimensions of $z_0$ are then flattened, converting it into a $d\times HW$ sequence input. The encoder consists of several standard encoder layers, each containing a multi-head self-attention module and a feed-forward network (FFN). Since the Transformer is permutation-invariant, a fixed positional encoding is added to the input of each attention layer.
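Below is a minimal sketch of this input pipeline under assumed sizes ($d=256$, 6 layers, 8 heads, the paper's base configuration). Note two simplifications: the positional encoding is a zero stand-in, and it is added once at the input, whereas DETR adds its fixed sinusoidal encoding inside every attention layer:

```python
import torch
from torch import nn

d = 256
proj = nn.Conv2d(2048, d, kernel_size=1)    # 1x1 conv: C=2048 -> d
f = torch.randn(1, 2048, 25, 34)            # backbone feature map
z0 = proj(f)                                # (1, d, H, W)
src = z0.flatten(2).permute(2, 0, 1)        # flatten space: (HW, batch, d)

pos = torch.zeros_like(src)                 # stand-in positional encoding
layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(layer, num_layers=6)
memory = encoder(src + pos)                 # (HW, batch, d); here 850 x 1 x 256
```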
Transformer decoder
The decoder is also a standard Transformer structure, using multi-head self-attention and encoder-decoder attention to transform $N$ embeddings of size $d$. The only difference is that DETR decodes the $N$ targets in parallel, without an autoregressive mechanism. Because the decoder is also permutation-invariant, learned positional encodings (roughly equivalent to anchors), called object queries, are taken as input. As in the encoder, these positional encodings are fed into each attention layer, along with the spatial positional encodings, as shown in Figure 10. The decoder transforms the $N$ object queries into $N$ output embeddings, which are then independently decoded into box coordinates and class labels, giving the $N$ final predictions. Through self-attention and encoder-decoder attention, the model can reason about all targets globally.
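A minimal sketch of the decoder side with assumed values $N=100$ and $d=256$; real DETR starts the decoder input at zero and re-adds the query embeddings inside every attention layer, which is simplified here to passing the queries in directly:

```python
import torch
from torch import nn

d, N = 256, 100
query_embed = nn.Embedding(N, d)            # learned object queries
tgt = query_embed.weight.unsqueeze(1)       # (N, batch=1, d)

layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(layer, num_layers=6)
memory = torch.randn(850, 1, d)             # encoder output from the previous sketch
hs = decoder(tgt, memory)                   # (N, 1, d): one embedding per query, in parallel
```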
Prediction feed-forward networks (FFNs)
A 3-layer perceptron with ReLU activation (hidden dimension $d$) plus a linear projection layer decode the final predictions. The FFN predicts the $N$ normalized center coordinates, heights, and widths, and the linear layer predicts the class scores via softmax. Since $N$ is generally larger than the number of targets, a special class $\emptyset$ marks slots with no detected target, similar to a background class. Note that this output FFN is not the same as the FFNs inside the encoder and decoder layers.
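A minimal sketch of the two heads under the same assumed sizes (91 COCO classes plus the extra $\emptyset$ slot):

```python
import torch
from torch import nn

d, N, num_classes = 256, 100, 91
class_head = nn.Linear(d, num_classes + 1)  # linear projection; +1 for "no object"
bbox_head = nn.Sequential(                  # 3-layer perceptron with ReLU
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4),
)

hs = torch.randn(N, 1, d)                   # decoder output embeddings
logits = class_head(hs)                     # class scores (softmax applied in the loss)
boxes = bbox_head(hs).sigmoid()             # normalized (cx, cy, h, w) in [0, 1]
```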
Auxiliary decoding losses
Training the decoder with auxiliary losses is found to be very effective, especially for helping the model output the correct number of targets. Prediction FFNs and the Hungarian loss are therefore added after each decoder layer, with all prediction FFNs sharing parameters, and a shared layer-norm is used to normalize the inputs to these FFNs.
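A minimal sketch of the idea, with a hypothetical `set_loss` standing in for the Hungarian loss and the bbox head simplified to a single layer:

```python
import torch
from torch import nn

d, N, num_layers = 256, 100, 6
shared_norm = nn.LayerNorm(d)               # shared layer-norm before the FFNs
class_head = nn.Linear(d, 92)               # prediction heads shared across layers
bbox_head = nn.Linear(d, 4)

def set_loss(logits, boxes):                # hypothetical stand-in for the Hungarian loss
    return logits.square().mean() + boxes.mean()

# One output per decoder layer; the same heads and loss apply to each.
hs_per_layer = [torch.randn(N, 1, d) for _ in range(num_layers)]
total = sum(set_loss(class_head(shared_norm(h)), bbox_head(shared_norm(h)).sigmoid())
            for h in hs_per_layer)
```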
Object detection set prediction loss
DETR outputs a fixed set of $N$ predictions. The biggest difficulty is scoring the predictions against the GT, which first requires finding the correspondence between predictions and GT. Define $y$ as the GT set padded with $\emptyset$ (no object) and $\hat{y}=\{\hat{y}_i\}^N_{i=1}$ as the predictions. To best match GT with the predictions, the Hungarian algorithm (a bipartite matching method) is used to find the optimal permutation $\sigma$ that minimizes the matching loss:
$$\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})=-\mathbb{1}_{\{c_i \ne \emptyset\}}\hat{p}_{\sigma(i)}(c_i)+\mathbb{1}_{\{c_i \ne \emptyset\}}\mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)})$$

Here $\mathcal{L}_{match}$ is the GT-prediction matching loss, combining the class prediction term and the bbox matching term. $y_i=(c_i,b_i)$ is the GT, where $c_i$ is the class and $b_i\in[0,1]^4$ is the box vector (center x, center y, height, width) relative to the image size, while $\hat{p}_{\sigma(i)}(c_i)$ and $\hat{b}_{\sigma(i)}$ are the predicted class confidence and predicted bbox, respectively. This matching process is similar to the anchor-GT matching logic in current detection algorithms, except that here each prediction corresponds one-to-one with a GT. After finding the optimal permutation $\hat{\sigma}$, the Hungarian loss is calculated:

$$\mathcal{L}_{Hungarian}(y,\hat{y})=\sum_{i=1}^{N}\left[-\log\hat{p}_{\hat{\sigma}(i)}(c_i)+\mathbb{1}_{\{c_i\ne\emptyset\}}\mathcal{L}_{box}(b_i,\hat{b}_{\hat{\sigma}(i)})\right]$$
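The matching step itself can be sketched with scipy's implementation of the Hungarian algorithm; the cost below mirrors $\mathcal{L}_{match}$ with an illustrative weight on the L1 box term and the IoU term omitted for brevity:

```python
import torch
from scipy.optimize import linear_sum_assignment

N, num_gt = 100, 3
prob = torch.rand(N, 92).softmax(-1)                # predicted class probabilities
pred_boxes = torch.rand(N, 4)                       # predicted (cx, cy, h, w) boxes
gt_classes = torch.tensor([3, 17, 42])              # example GT classes
gt_boxes = torch.rand(num_gt, 4)

cost_class = -prob[:, gt_classes]                   # the -p(c_i) term, (N, num_gt)
cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # L1 part of the box term
cost = cost_class + 5.0 * cost_bbox                 # illustrative weighting
pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
# pred_idx[k] is the prediction matched one-to-one with GT gt_idx[k]
```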
In the implementation, to account for class imbalance, the weight of the class term with $c_i=\emptyset$ is reduced by a factor of 10. Unlike ordinary detection methods that predict bbox offsets, DETR predicts bbox coordinates directly. Although simple to implement, this makes the loss sensitive to target scale, so the paper combines a linear $\mathcal{L}_1$ loss with a scale-invariant generalized IoU loss. The bbox loss $\mathcal{L}_{box}(b_i,\hat{b}_{\sigma(i)})$ is

$$\lambda_{iou}\mathcal{L}_{iou}(b_i,\hat{b}_{\sigma(i)})+\lambda_{L1}||b_i-\hat{b}_{\sigma(i)}||_1$$

and is normalized by the number of positive samples.
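A minimal sketch of this bbox loss using torchvision's `generalized_box_iou`, with illustrative values for $\lambda_{iou}$ and $\lambda_{L1}$; boxes are written directly in (x1, y1, x2, y2) format here, so DETR's normalized (cx, cy, h, w) predictions would be converted first:

```python
import torch
from torchvision.ops import generalized_box_iou

# Two already-matched prediction/GT pairs.
pred = torch.tensor([[0.10, 0.10, 0.40, 0.50],
                     [0.30, 0.20, 0.90, 0.80]])
gt   = torch.tensor([[0.10, 0.20, 0.50, 0.50],
                     [0.20, 0.20, 0.80, 0.90]])

num_boxes = pred.shape[0]                   # number of positive samples
loss_l1 = (pred - gt).abs().sum() / num_boxes
loss_giou = (1 - torch.diag(generalized_box_iou(pred, gt))).sum() / num_boxes
loss_box = 2.0 * loss_giou + 5.0 * loss_l1  # illustrative lambda_iou, lambda_L1
```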
Experiments
The comparison shows DETR performing on par with or better than the classic Faster RCNN.
The influence of the number of encoder layers on performance is also explored.
The prediction accuracy of each decoder layer's output can be seen to increase layer by layer.
Regarding the influence of the positional-embedding scheme on performance: the spatial pos. here corresponds to the spatial positional encoding in Figure 10, while the output pos. corresponds to the object queries in Figure 10.
Impact of loss function on performance.
DETR for panoptic segmentation
DETR can also attach a mask head to the decoder outputs for the panoptic segmentation task, mainly leveraging the feature-extraction ability of the DETR model.
Comparison of panoptic segmentation performance with current mainstream models.
Conclusion
DETR is built on the standard Transformer structure, and its performance is comparable to that of Faster RCNN. More notably, the overall idea of the paper is very simple, and hopefully, much as Faster RCNN did, it can provide a general framework for many subsequent studies.
If this article was helpful to you, please give it a like or a share
For more content, please follow the WeChat official account [Algorithm Engineering Notes of Xiaofei]