A walkthrough of the R-CNN paper series in computer vision
Preface
The R-CNN series mainly includes **R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, and Cascade R-CNN**, which are representative two-stage object detection algorithms. These algorithms achieve high accuracy, work well in practice, and remain an important family of methods.
1. The object detection work R-CNN
Rich feature hierarchies for accurate object detection and semantic segmentation
R-CNN is an early object detection algorithm, published at CVPR 2014, and the first work in the R-CNN series. There are many related blog posts online. This article does not follow the order of the paper; instead it walks through the algorithm based on my own experience, hoping to give beginners an intuitive feel for it. You don't need to worry too much about the details, because many parts were improved in later algorithms.
As the first paper to combine region proposals with CNNs, R-CNN was continuously improved in the following years, producing a series of classic models such as Fast R-CNN, Faster R-CNN, and Mask R-CNN. R-CNN is therefore a must-read classic for getting started in CV.
01 Finding region proposals
In R-CNN, the first step of the architecture is to find region proposals, i.e. possible Regions of Interest (RoIs). There are three ways to obtain them: sliding windows, regular blocks, and selective search.
The first is sliding windows. A sliding window is essentially exhaustive search: using different scales and aspect ratios, it enumerates every possible block, large and small, and sends each one to a classifier; those classified with high probability are kept. Obviously, this is far too expensive and produces many redundant candidate regions, so it is not feasible in practice.
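To see why exhaustive sliding windows explode in number, here is a toy sketch (function names and parameters are my own, not from the paper) that enumerates boxes at a few scales and aspect ratios:

```python
def sliding_windows(img_w, img_h, scales=(64, 128, 256),
                    ratios=(0.5, 1.0, 2.0), stride=16):
    """Enumerate every window at the given scales/aspect ratios over the image."""
    boxes = []
    for s in scales:
        for r in ratios:
            # a box of roughly area s*s with width/height ratio r
            w, h = int(s * r ** 0.5), int(s / r ** 0.5)
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    boxes.append((x, y, x + w, y + h))
    return boxes

boxes = sliding_windows(640, 480)
print(len(boxes))  # thousands of candidates for a single small image
```

Even with a coarse stride and only nine scale/ratio combinations, one small image already yields thousands of candidates, each of which would have to be classified.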
The second is regular blocks. This prunes the exhaustive method by using only fixed sizes and aspect ratios. However, for general object detection, regular blocks still have to visit a great many locations, so the complexity remains high.
The third is selective search. From a machine learning perspective, the previous methods do cover the objects, but at the cost of many wasted candidates, so the core problem is how to effectively remove redundant candidate regions. In fact, most redundant candidates overlap each other; selective search exploits this by merging adjacent overlapping regions bottom-up to reduce redundancy.
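A much simplified sketch of the bottom-up merging idea (real selective search merges segmentation regions by color, texture, and size similarity; this toy version, with names of my own choosing, just fuses boxes that overlap heavily):

```python
def overlap(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda c: (c[2] - c[0]) * (c[3] - c[1])
    return inter / (area(a) + area(b) - inter)

def merge_overlapping(boxes, thresh=0.5):
    """Repeatedly fuse the first pair of boxes whose IoU exceeds thresh."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if overlap(boxes[i], boxes[j]) > thresh:
                    a, b = boxes[i], boxes[j]
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes

print(merge_overlapping([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]))
# → [(0, 0, 11, 11), (50, 50, 60, 60)]
```

The two heavily overlapping boxes collapse into one candidate, while the distant box survives untouched; this is the redundancy reduction the text describes.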
02 R-CNN structure
R-CNN is mainly composed of three stages:
Stage 1: region extraction. This mainly uses Selective Search, a previously proposed technique. We will not cover it in detail here; if you are interested, see the references at the end for the implementation details. All we need to know is that, given an input image, this stage returns about 2k region proposals.
Stage 2: after obtaining the region proposals, we feed each one into a CNN separately. Each region is input once and yields a feature vector (4096-dimensional) for that proposal. For the specific CNN structure, the author used AlexNet as the backbone.
Some readers may have spotted a problem: the region proposals extracted in the first stage all have different sizes, while a CNN requires a fixed input size, which in this paper is 227×227.
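R-CNN resolves this by warping each proposal, regardless of its aspect ratio, to the fixed 227×227 input. A minimal nearest-neighbor version of such a warp (the paper's actual preprocessing also pads the box with surrounding context; this sketch and its names are mine):

```python
import numpy as np

def warp_to_fixed(crop, size=227):
    """Anisotropically resize an HxWxC crop to size x size (nearest neighbor)."""
    h, w = crop.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return crop[rows][:, cols]

crop = np.zeros((150, 90, 3))     # a proposal of arbitrary shape
print(warp_to_fixed(crop).shape)  # (227, 227, 3)
```

Because width and height are scaled independently, a tall thin proposal gets visibly stretched, but every proposal ends up the exact size the CNN expects.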
Stage 3: after obtaining the feature vector of each region proposal, a binary SVM classifier per class predicts whether the proposal belongs to that class.
Some readers may ask again: why all this trouble? Wouldn't it be simpler to classify directly with a softmax layer, as in AlexNet, instead of running one SVM per class?
The main reason is that the positive and negative samples are unbalanced: negatives vastly outnumber positives. During fine-tuning, with a batch size of 128 for example, each batch contains only 32 positives and 96 randomly sampled negatives. An SVM, however, makes better use of hard samples than softmax does.
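The SVMs exploit hard samples through hard negative mining: after a training round, the highest-scoring negatives (the worst false positives) are fed back into the training set. An illustrative sketch, with function and parameter names of my own choosing:

```python
def mine_hard_negatives(scores, labels, keep=96):
    """Return indices of the top-scoring negatives, i.e. the worst false positives."""
    negatives = [(score, i) for i, (score, label) in enumerate(zip(scores, labels))
                 if label == -1]
    negatives.sort(reverse=True)          # highest classifier score first
    return [i for _, i in negatives[:keep]]

scores = [0.9, 0.2, 0.8, 0.1]
labels = [-1, -1, 1, -1]                  # -1 = negative, 1 = positive
print(mine_hard_negatives(scores, labels, keep=2))  # → [0, 1]
```

A randomly sampled softmax batch mostly sees easy negatives; this selection step is what lets the SVM concentrate on the confusing ones.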
- Positive samples: the ground-truth boxes of the class.
- Negative samples: candidate boxes whose overlap (IoU) with every ground-truth box of the class is less than 0.3.
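The 0.3 rule can be made concrete with a small IoU check (a sketch with names of my own; `gt_boxes` stands for the ground-truth boxes of one class):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def is_negative(proposal, gt_boxes, thresh=0.3):
    """Negative sample: IoU below thresh with every ground-truth box of the class."""
    return all(iou(proposal, gt) < thresh for gt in gt_boxes)

print(is_negative((0, 0, 10, 10), [(8, 8, 20, 20)]))  # → True (IoU ≈ 0.017)
```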
03 R-CNN details
R-CNN workflow
The R-CNN algorithm consists of four steps:
- Generate 1K~2K candidate regions from one image;
- Extract features from each candidate region with a deep network;
- Feed the features into per-class SVM classifiers to decide whether the region belongs to each class;
- Refine the positions of the candidate boxes with a regressor.
(Table caption) Models are trained on warped samples from VOC 2007; numbers are average detection precision (%) on the VOC 2010 test set. R-CNN is most directly comparable to UVA and Regionlets, since all these methods use selective search region proposals. BB denotes bounding-box regression. At publication time, SegDPM was the top performer on the PASCAL VOC leaderboard; DPM and SegDPM re-score using context not used by the other methods.
Results
In 2014, when the paper was published, DPM had hit a bottleneck: even with complex features and structures, improvements were limited. This paper introduced deep learning into the detection field and raised the detection rate on PASCAL VOC from 35.1% to 53.7% in one stroke.
The first two steps (candidate region extraction + feature extraction) are independent of the classes to be detected and can be shared across classes. These two steps take about 13 seconds per image on a GPU.
When detecting multiple classes at the same time, only the last two steps (classification + refinement) scale with the number of classes, and they are simple, fast linear operations. These two steps take about 10 seconds for 100K classes.
References: zhuanlan.zhihu.com/p/168742724, blog.csdn.net/qq_30091945…