Abstract: This article introduces two general object detection methods that support end-to-end training, DETR and DeFCN.

As a fundamental task in computer vision, general object detection plays an important role in image understanding and information extraction. Many methods have been developed for this task, including single-stage, fully convolutional detectors such as DenseBox, YOLO, SSD, RetinaNet, and CenterNet, as well as more complex multi-stage detectors such as R-CNN, Fast R-CNN, Faster R-CNN, and Cascade R-CNN. By the definition of the object detection task, these methods must both localize and classify the target objects in an image. To guarantee recall, they exploit the sliding-window nature of convolutional networks to densely extract candidate regions and predict on them, so each object in the input typically corresponds to multiple predictions from the network.

These methods therefore rely on non-maximum suppression (NMS) to filter out duplicate predictions and obtain the final result. Unlike convolution, NMS is not differentiable, so the model cannot learn the de-duplication step, and the presence of this operation keeps the detector from being fully end-to-end.
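To make the de-duplication step concrete, here is a minimal NumPy sketch of standard greedy NMS; the box format (x1, y1, x2, y2) and the IoU threshold are illustrative assumptions rather than settings from any particular detector.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of [x1, y1, x2, y2]
    scores: (N,) array of confidence scores
    Returns the indices of the boxes that are kept.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop boxes that overlap the current box more than the threshold
        order = order[1:][iou <= iou_thresh]
    return keep
```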

In recent progress on general object detection, some methods achieve end-to-end training and inference without any NMS operation, for example the Transformer-based DETR [1] and the fully convolutional DeFCN [2]. The two follow different implementation routes and each has its own advantages and potential. Both are introduced below.

DETR

The Transformer, having achieved great success in natural language processing, was first applied to object detection by DETR, which realizes end-to-end detection and matches the accuracy of a highly optimized Faster R-CNN on the COCO detection benchmark. DETR reasons about the relations between objects and the global image context and directly outputs the predicted set of objects in parallel, without NMS.

As shown in Figure 1, DETR combines a CNN with a Transformer and predicts the target set directly and in parallel, treating detection as a set prediction problem. Compared with earlier set-prediction-based methods, DETR differs in two respects: it uses a bipartite matching loss, and it uses a parallel (non-autoregressive) Transformer decoder. These choices make the prediction results permutation-invariant and allow them to be produced in parallel, which improves the efficiency of the model.
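To illustrate the bipartite matching idea, the sketch below pairs N predictions with M ground-truth objects using SciPy's Hungarian solver; the cost used here (negative class probability plus an L1 box distance) is a simplified placeholder rather than DETR's exact matching cost, which also includes a generalized IoU term.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_probs, pred_boxes, gt_labels, gt_boxes):
    """One-to-one bipartite matching between predictions and ground truth.

    pred_probs: (N, num_classes) class probabilities for N predictions
    pred_boxes: (N, 4) predicted boxes
    gt_labels:  (M,) ground-truth class indices
    gt_boxes:   (M, 4) ground-truth boxes
    Returns (pred_indices, gt_indices) of the optimal assignment.
    """
    # Classification term: prefer predictions confident in the GT class
    cost_cls = -pred_probs[:, gt_labels]                                  # (N, M)
    # Box term: L1 distance between predicted and GT boxes
    cost_box = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = cost_cls + cost_box
    # The Hungarian algorithm finds the globally optimal one-to-one matching
    return linear_sum_assignment(cost)
```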

Figure 1. The DETR detection pipeline

After a CNN extracts 2D features from the image, the features must be flattened into a one-dimensional sequence, since the Transformer can only process sequences. Flattening alone would lose the spatial layout of the features, so DETR adds a positional encoding to preserve the spatial information. Because the flattened features consume a lot of computation in the Transformer, this method does not use an FPN and only uses the high-level, low-resolution feature map.
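A minimal PyTorch-style sketch of this flatten-and-encode step is shown below; the learned row/column embeddings are one option (DETR also supports a fixed sinusoidal encoding), and the tensor shapes and size limits are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FlattenWithPosition(nn.Module):
    """Flatten a CNN feature map into a sequence and add positional embeddings."""

    def __init__(self, channels, max_h=50, max_w=50):
        super().__init__()
        # Learned row/column embeddings; a sinusoidal variant is also possible.
        self.row_embed = nn.Embedding(max_h, channels // 2)
        self.col_embed = nn.Embedding(max_w, channels // 2)

    def forward(self, feat):                                   # feat: (B, C, H, W)
        B, C, H, W = feat.shape
        rows = self.row_embed(torch.arange(H, device=feat.device))   # (H, C/2)
        cols = self.col_embed(torch.arange(W, device=feat.device))   # (W, C/2)
        pos = torch.cat(
            [rows[:, None, :].expand(H, W, -1),
             cols[None, :, :].expand(H, W, -1)], dim=-1)             # (H, W, C)
        seq = feat.flatten(2).permute(0, 2, 1)                       # (B, H*W, C)
        return seq + pos.reshape(H * W, C)                           # broadcast over batch
```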

As shown in Table 1, compared with Faster R-CNN, DETR performs better on large objects because global information is used during prediction. At the same time, because it does not adopt an FPN, its results on small objects are relatively poor.

Table 1. DETR experiment results on COCO

This method does not require NMS, but adding NMS still has some effect on the result. As shown in Figure 2, AP increases slightly after adding NMS, and the gain shrinks as the model becomes more complex, indicating that under suitable settings the method produces essentially no duplicate predictions and no NMS operation is needed.

Figure 2. The impact of NMS on the results

DETR breaks with the previous detection paradigm and casts detection as set prediction. However, DETR converges slowly during training, and due to its computational cost it is difficult to use high-resolution features, so its detection of small objects is poor. Subsequent methods such as Deformable DETR address these problems and achieve improved results.

DeFCN

Unlike DETR, which uses a Transformer, DeFCN achieves end-to-end detection with a fully convolutional network. DeFCN is built on FCOS and likewise makes dense predictions, but it does so without NMS. Previous approaches adopt a one-to-many strategy in both training and inference, i.e. each object corresponds to multiple predictions from the network, which forces the inference stage to use NMS for de-duplication. DeFCN revisits this assignment strategy, proposes a one-to-one sample matching scheme, and adds further design so that the final model makes one-to-one predictions while retaining comparable performance. Since no NMS is required, DeFCN can surpass the theoretical upper bound that NMS imposes on dense (crowded) datasets, which clearly demonstrates the advantage of this approach.

Figure 3. Structure diagram of DeFCN

The simplest one-to-one assignment strategy is to directly take the object center or the anchor box as the only positive sample of each object. However, compared with previous one-to-many designs such as FCOS, this leads to a large performance drop. DeFCN addresses the degradation caused by one-to-one assignment from two aspects, the loss function and the features; the overall structure is shown in Figure 3.

For the loss function, one problem to consider is how to define the positive sample. Because object shapes vary, the center of the ground-truth bounding box is not always a good choice, especially when each object has only one positive sample, which makes network optimization more sensitive to the assignment strategy. Inspired by set-based loss functions, this method formulates sample matching as a bipartite graph matching problem, optimizes a set loss to some extent, and assigns positive and negative samples according to the network's own outputs. The matching quality mainly considers three factors: the spatial prior on where positive samples may lie, the score of the classification branch, and the IoU between the regressed bounding box and the ground-truth box. The prediction with the highest product score is selected as the positive sample, as shown in Formula 2.
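Formula 2 is not reproduced above; a plausible reconstruction of the matching quality, combining the three factors as a weighted product (the spatial-prior indicator, the classification score, and the IoU, with a balancing exponent $\alpha$ assumed here), is:

$$
Q_{i,j} \;=\; \mathbb{1}\!\left[\, j \in \Omega_i \,\right] \cdot \bigl(p_j(c_i)\bigr)^{1-\alpha} \cdot \mathrm{IoU}\bigl(b_i, \hat{b}_j\bigr)^{\alpha},
$$

where $\Omega_i$ is the set of candidate locations allowed by the spatial prior for ground-truth object $i$, $p_j(c_i)$ is the classification score of prediction $j$ for the ground-truth class $c_i$, and $\hat{b}_j$ is the regressed box of prediction $j$; the prediction maximizing $Q_{i,j}$ becomes the single positive sample for object $i$.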

After adopting the one-to-one sample assignment strategy, the model still falls short of the earlier one-to-many methods, so an additional auxiliary loss is added during training; it does not affect inference. This auxiliary loss uses the traditional one-to-many sample assignment, and as shown in Table 2, the result improves significantly after the loss is added.

From the perspective of network design, this method is based on a convolutional network, and convolution is a linear operation, while the one-to-one strategy requires the network output to be sharply peaked, which is difficult for convolutions to produce. The method therefore applies a max-pooling filter to the features and fuses information across FPN scales. As shown in Table 2, adding this module (3D Max Filtering, 3DMF) brings a significant improvement.
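The max-filtering idea can be sketched roughly as below; the kernel sizes, the bilinear resizing used to align adjacent FPN levels, and the restriction to the filtering step alone (the full 3DMF module also uses the filtered response to modulate the original feature) are simplifying assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def max_filter_3d(fpn_feats, level, spatial_kernel=3, scale_range=1):
    """Suppress non-peak responses by max filtering over space and adjacent FPN levels.

    fpn_feats: list of tensors, one per FPN level, each (B, C, H_l, W_l)
    level:     index of the level being filtered
    Returns a tensor the same shape as fpn_feats[level] holding, at each
    position, the local maximum over the spatial neighbourhood and the
    neighbouring pyramid levels.
    """
    target = fpn_feats[level]
    B, C, H, W = target.shape
    # Resize neighbouring levels to the target resolution and stack along a "scale" axis
    lo, hi = max(0, level - scale_range), min(len(fpn_feats) - 1, level + scale_range)
    stack = torch.stack(
        [F.interpolate(fpn_feats[l], size=(H, W), mode="bilinear", align_corners=False)
         for l in range(lo, hi + 1)], dim=2)                   # (B, C, S, H, W)
    # 3D max pooling over (scale, height, width)
    pooled = F.max_pool3d(
        stack,
        kernel_size=(stack.shape[2], spatial_kernel, spatial_kernel),
        stride=1,
        padding=(0, spatial_kernel // 2, spatial_kernel // 2))
    return pooled.squeeze(2)                                    # back to (B, C, H, W)
```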

Table 2. Influence of different modules on final results (COCO)

Figure 4. DeFCN response visualization

As shown in Figure 4, in the object probability maps output by the network, FCOS produces multiple responses for each object and requires NMS for de-duplication (Figure 4(a)), while DeFCN, with all modules added, produces a single response per object (Figure 4(d)).

Table 3. CrowdHuman performance analysis

This method has a strong advantage on dense data and can exceed the theoretical upper bound of NMS, since densely packed objects are not mistakenly filtered out.

In general, the two end-to-end detection methods above follow different routes, but both remove NMS and achieve a fully end-to-end mapping from network input to predictions, and both show good potential. DETR's introduction of the Transformer has potential for object-relation modeling and global context understanding, while DeFCN, with its simple design and easy deployment, has good application value in dense scenes.

References

1. Carion N, Massa F, Synnaeve G, et al. End-to-End Object Detection with Transformers[C]. European Conference on Computer Vision (ECCV), 2020.

2. Wang J, Song L, Li Z, et al. End-to-End Object Detection with Fully Convolutional Network[J]. arXiv preprint arXiv:2012.03544, 2020.

This article is shared from the Huawei Cloud community post "Technology Review 8: An Introduction to End-to-End General Object Detection Methods", original author: "I want to be quiet".
