When reposting, please credit the source: blog.csdn.net/accepthjp/a…

The following is an excerpt from my research report.

1. Background

Target detection (object detection) is, put simply, the task of locating targets with bounding boxes and predicting their categories. It is a form of image segmentation based on the geometric and statistical characteristics of the target, combining segmentation and recognition in a single step. Its accuracy and real-time performance are key capabilities of the whole system. Automatic target extraction and recognition becomes particularly important in complex scenes where multiple targets must be processed in real time [1].

With the development of computer technology and the wide application of computer vision, real-time target tracking based on image processing has become increasingly popular. Dynamic real-time tracking and localization of targets has broad application value in intelligent traffic systems, intelligent surveillance systems, military target detection, and the positioning of surgical instruments in medical navigation surgery.

2. Hand-crafted features

The general method has two steps: feature extraction by sliding window -> classification by a classifier
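The two steps above can be sketched in a few lines of Python. This is only a minimal illustration: `toy_feature` and `toy_classifier` are hypothetical placeholders (a real system would use something like HOG features and a trained classifier), and the window size and stride are arbitrary assumptions.

```python
import numpy as np

def sliding_windows(image, window=(64, 64), stride=32):
    """Yield (x, y, patch) for every window position that fits inside the image."""
    win_h, win_w = window
    height, width = image.shape[:2]
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            yield x, y, image[y:y + win_h, x:x + win_w]

def toy_feature(patch):
    """Hypothetical placeholder feature: mean intensity of the patch."""
    return float(patch.mean())

def toy_classifier(feature, threshold=0.5):
    """Hypothetical placeholder classifier: accept if the feature exceeds a threshold."""
    return feature > threshold

def detect(image, window=(64, 64), stride=32):
    """Return (x, y, w, h) boxes for every window the classifier accepts."""
    boxes = []
    for x, y, patch in sliding_windows(image, window, stride):
        if toy_classifier(toy_feature(patch)):
            boxes.append((x, y, window[1], window[0]))
    return boxes
```

Every window is scored independently, which is why the cost grows quickly with image resolution and with the number of window sizes tried.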



Figure 2.2.1 Visualization of the DPM model

The DPM model proposed in literature [2,3] is probably the best hand-crafted feature method for target detection. It improves the HOG feature and introduces global and local models (as shown in Figure 2.2.1), which greatly improves the accuracy of hand-crafted features in target detection. The disadvantages of DPM are its relatively complex features, slow computation, and poor detection of rotated and stretched objects. Because UAV video generally has high resolution and a changing viewpoint, applying DPM directly to target detection in UAV video may guarantee neither real-time performance nor model generalization.

Before 2013, deep learning had not yet become the mainstream approach to target detection. Most methods were still refinements of the ideas behind DPM, and no fundamentally new method had been proposed, so research on target detection had hit a bottleneck.


3. Deep learning

While the DPM method was running into this bottleneck, some researchers were already studying how to use deep learning for target detection.

Literature [4] proposed using deep learning for target detection. The authors treat detection as a bounding-box regression problem. Compared with sliding-window feature extraction this is more efficient, but the detection accuracy was very poor, falling far behind the hand-crafted feature methods.

Since regression did not work well, how about treating detection as a classification problem? Girshick and others carried out a series of such studies in literature [5-8], forming the research line RCNN -> SPPNet -> Fast-RCNN -> Faster-RCNN. The general steps of this approach are: candidate region generation -> deep network feature extraction -> classification by a classifier and correction by regression.

RCNN was the first answer to the question “What if object detection is treated as a classification problem?”

RCNN: Candidate regions are generated by segmentation [9], and features are extracted from each region by a CNN. The features are fed into multiple SVM classifiers, and the bounding box is corrected by regression. Finally, NMS and edge detection are used to refine the results once more; the whole process is shown in Figure 2.3.1. Its contribution is that detection quality is greatly improved and a new framework for target detection based on deep learning is established. Its disadvantages are also obvious, however, chiefly the slow speed caused by extracting features separately for every candidate region.
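The NMS (non-maximum suppression) step mentioned above can be sketched as follows. This is the common greedy variant, written with NumPy as an illustration rather than the exact procedure used in [5]; the `(x1, y1, x2, y2)` box format and the 0.5 IoU threshold are assumptions.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep
```

The function returns the indices of the boxes that survive, so two heavily overlapping detections of the same object collapse into the higher-scoring one.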



Figure 2.3.1 RCNN target detection process

SPPNet: A spatial pyramid pooling layer is placed after the last convolutional layer, so the network input no longer needs a fixed size, and the information loss caused by stretching and cropping the image is largely avoided. SPPNet also establishes a mapping between regions of the original image and the extracted features, so for a given region the features can be computed directly from the shared feature map without repeating the convolution.
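To make the fixed-length property concrete, here is a rough NumPy sketch of spatial pyramid pooling over a single feature map. The pyramid levels `(1, 2, 4)` and the `(C, H, W)` layout are assumptions chosen for illustration, not necessarily the settings used in [8].

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map into a fixed-length vector.

    Level n splits the map into an n x n grid of bins and keeps the
    per-channel maximum of each bin, so the output length is
    C * sum(n * n for n in levels) regardless of H and W.
    """
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # floor for the start and ceil for the end, so bins are never empty
            y0 = (i * height) // n
            y1 = ((i + 1) * height + n - 1) // n
            for j in range(n):
                x0 = (j * width) // n
                x1 = ((j + 1) * width + n - 1) // n
                pooled.append(feature_map[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(pooled)
```

Feature maps of different spatial sizes produce vectors of the same length, which is what lets the fully connected layers that follow accept inputs of any size.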

To eliminate RCNN's repeated computation across candidate regions, Fast-RCNN was introduced, building on the idea of SPPNet.

Fast-RCNN: The whole process is shown in Figure 2.3.2. It differs from RCNN in three respects. First, an RoI pooling layer is added, which plays the same role as the pooling layer of SPPNet. Second, based on extensive experiments, the SVM is replaced with softmax. Third, classification and bounding-box regression are placed behind the same network, which significantly reduces computational overhead. Its advantages are that it avoids repeated convolution, integrates multiple tasks in one network, and further improves computational efficiency. With the architecture and optimization of the whole network essentially settled, the remaining speed bottleneck is candidate region generation.
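The idea of putting classification and bounding-box regression behind the same network can be illustrated with a small NumPy sketch. The two linear heads, the random weights, and all dimensions below are hypothetical stand-ins for trained layers; the point is only that both outputs are computed from the same per-region features.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def detection_heads(roi_features, w_cls, b_cls, w_reg, b_reg):
    """Apply both sibling heads to the same per-region features.

    roi_features: (num_rois, feature_dim), computed once per region
    returns: class probabilities (num_rois, num_classes) and
             box corrections (num_rois, num_classes, 4)
    """
    probs = softmax(roi_features @ w_cls + b_cls)
    deltas = (roi_features @ w_reg + b_reg).reshape(len(roi_features), -1, 4)
    return probs, deltas

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
num_rois, feature_dim, num_classes = 5, 16, 3
features = rng.normal(size=(num_rois, feature_dim))
probs, deltas = detection_heads(
    features,
    rng.normal(size=(feature_dim, num_classes)), np.zeros(num_classes),
    rng.normal(size=(feature_dim, num_classes * 4)), np.zeros(num_classes * 4),
)
```

Because the features are shared, adding the regression head costs only one extra matrix multiplication per region instead of a second feature-extraction pass.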



Figure 2.3.2 Fast-RCNN target detection process

To solve the slow speed of region proposal, Faster-RCNN appeared.

Faster-RCNN: Its core idea is to leave candidate region generation to the network. Because the target position is corrected anyway by the subsequent Fast-RCNN detection stage, candidate region generation does not need to be overly accurate. The candidate region generation network is in essence also a Fast-RCNN: its input is a set of pre-set regions in the image, and its output is whether each region belongs to the foreground or background, together with the corrected region. This method proposes only a small number of likely target regions, which is much faster than either sliding windows or over-segmentation.

Through this series of work, the role of the network evolved from simple feature extraction to a deep framework that completes the whole process of target detection, and both accuracy and speed improved. At present, work on the Faster-RCNN line has also run into difficulties: treating detection as a classification problem offers no obvious breakthrough point, so researchers are reconsidering the original idea of treating target detection purely as a regression problem.
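The "pre-set regions" fed to the candidate region generation network can be sketched as a grid of reference boxes laid over the feature map. The function below is a simplified illustration: the stride, scales, and aspect ratios are assumed values, and the name `generate_anchors` is my own.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return pre-set boxes as an (N, 4) array of (x1, y1, x2, y2).

    One box per (scale, ratio) pair is centered on every feature-map
    cell, so N = feat_h * feat_w * len(scales) * len(ratios).
    """
    anchors = []
    for gy in range(feat_h):
        for gx in range(feat_w):
            # center of this feature-map cell in image coordinates
            cx = (gx + 0.5) * stride
            cy = (gy + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # ratio = height / width, with the area fixed at scale ** 2
                    w = scale / np.sqrt(ratio)
                    h = scale * np.sqrt(ratio)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)
```

For each of these boxes the network then predicts a foreground/background score and a correction, and only the highest-scoring corrected boxes are passed on to the detection stage.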

The disadvantage of the RCNN series is that, after transforming detection into a classification problem over local image regions, it cannot make full use of the context of a local target within the whole image. Literature [10] therefore proposed YOLO, a method that treats target detection as a regression problem; the whole process is shown in Figure 2.3.3.



Figure 2.3.3 YOLO target detection process

YOLO: The image is divided into a grid, bounding boxes and confidence values are regressed for each grid cell, and finally low-scoring boxes are filtered out by NMS. The disadvantages of YOLO are its poor detection of objects that are close to each other and its weak generalization ability. Because of the way the loss function is defined, localization error is the main factor limiting detection quality. Although YOLO is not yet mature and does not match the accuracy of the well-established Faster-RCNN approach, its performance should improve significantly once these problems are solved.
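The grid formulation can be made concrete with a small sketch of how a ground-truth box might be turned into a regression target for one grid cell. The 7 x 7 grid, the `(cx, cy, w, h)` pixel format, and the function name `assign_to_grid` are assumptions for illustration, not a reproduction of the code behind [10].

```python
def assign_to_grid(box, image_size, grid=7):
    """Map a box (cx, cy, w, h in pixels) to the grid cell containing its center.

    Returns the (row, col) of that cell and a regression target made of the
    center offset inside the cell and the box size relative to the image,
    all in the range [0, 1].
    """
    img_w, img_h = image_size
    cx, cy, w, h = box
    # clamp so a center lying exactly on the right/bottom edge stays in the grid
    col = min(int(cx / img_w * grid), grid - 1)
    row = min(int(cy / img_h * grid), grid - 1)
    x_offset = cx / img_w * grid - col
    y_offset = cy / img_h * grid - row
    return (row, col), (x_offset, y_offset, w / img_w, h / img_h)
```

Since each target is assigned to the single cell that contains its center, two nearby objects whose centers fall into the same cell compete for it, which is one way to see why closely spaced objects are hard for this formulation.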


The latest target detection method is now SSD. This is a placeholder for the moment; I will write about SSD when I have time. There is also work devoted entirely to localization accuracy, such as LocNet.

References:

[1] https://en.wikipedia.org/wiki/Object_detection

[2] Felzenszwalb P, McAllester D, Ramanan D. A discriminatively trained, multiscale, deformable part model[C]//CVPR. 2008: 1-8.

[3] Felzenszwalb P F, Girshick R B, McAllester D, et al. Object detection with discriminatively trained part-based models[J]. PAMI, 2010, 32(9): 1627-1645.

[4] Szegedy C, Toshev A, Erhan D. Deep neural networks for object detection[C]//NIPS. 2013: 2553-2561.

[5] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//CVPR. 2014: 580-587.

[6] Girshick R. Fast R-CNN[C]//ICCV. 2015: 1440-1448.

[7] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//NIPS. 2015: 91-99.

[8] He K, Zhang X, Ren S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[C]//ECCV. 2014: 346-361.

[9] Uijlings J R, van de Sande K E A, Gevers T, et al. Selective search for object recognition[J]. IJCV, 2013, 104(2): 154-171.

[10] Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection[C]//CVPR. 2016.