1. The past and present of object detection

Object detection, one of the most fundamental and challenging problems in computer vision, has attracted extensive attention in recent years. Its development over the past 20 years is a microcosm of the history of computer vision. If we think of object detection today as a technical art powered by deep learning, then turning the clock back 20 years we would witness the wisdom of the "cold weapon" era.

The figure above vividly illustrates the development history of object detection. Before the advent of RCNN, object detection was in the "cold weapon" era: strong features had to be designed by hand using traditional computer vision methods and then fed to classical machine learning algorithms, as shown in the figure below.

With the emergence of CNNs, object detection entered the deep learning era. The field now leans toward the design of network architectures, loss functions, and optimization methods, relying on CNNs to extract image features automatically rather than designing them by hand. Object detection thus moved from the "cold weapon" era to the "hot weapon" era: from two-stage detection algorithms to single-stage detection algorithms, and on to the current anchor-free trend. With the continuous improvement of hardware computing power, the development of object detection has entered the fast lane. This article mainly introduces the RCNN algorithm.

2. RCNN principle

Paper link: arxiv.org/pdf/1311.25…

2.1 Pipeline

The overall idea is simple and clear. The author argues that it is unnecessary to obtain Region Proposals (candidate boxes) by sliding a window over the entire image; it suffices to select a subset of candidate boxes and run the CNN only on those windows. The steps are listed below and sketched in pseudocode after the list.

  • Input an image.
  • Use Selective Search to extract about 2000 Region Proposals; this yields far fewer candidate boxes than traditional exhaustive methods. Roughly speaking, the candidates are image blocks produced by an image segmentation algorithm; the method is introduced in more detail later.
  • Resize the image patch of each candidate box to 227×227 and feed it into an AlexNet CNN; each candidate patch yields a 4096-dimensional feature vector.
  • Build one SVM classifier per class; for example, to distinguish ten classes there are ten SVM classifiers. Feeding the CNN features from the previous step into these SVMs gives a score for each category and hence the classification result. (The NMS algorithm then deduplicates the boxes; more on this later.)
  • At the same time, the CNN feature vectors from the third step are fed into a regressor that corrects the position of the Bounding Box.
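To make the pipeline concrete, here is a minimal pseudocode sketch of the steps above. All the helper names (selective_search, crop_and_resize, alexnet_features, svm_classifiers, bbox_regressor, nms_per_class) are hypothetical placeholders rather than real library calls:

def rcnn_detect(image):
    # hypothetical helpers throughout; a sketch of the flow, not a runnable implementation
    proposals = selective_search(image)                        # ~2000 region proposals
    detections = []
    for box in proposals:
        patch = crop_and_resize(image, box, size=(227, 227))   # warp to fixed size
        feat = alexnet_features(patch)                         # 4096-d feature vector
        scores = [svm.score(feat) for svm in svm_classifiers]  # one SVM per class
        refined = bbox_regressor.refine(box, feat)             # bounding-box correction
        detections.append((refined, scores))
    return nms_per_class(detections)                           # deduplicate boxes per class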

2.2 Selective Search

Selective Search paper link: www.huppelen.nl/publication…

The object detection task is far more complex than image classification, because a single image may contain many objects to classify and localize; object detection generally implies both a classification task and a localization task. Before training the classifier, we usually need to obtain Region Proposals (candidate boxes) by some method; classification and regression are then carried out on these candidate boxes to find the targets we want.

Before RCNN, region proposal algorithms mainly relied on exhaustive search or segmentation, with a large search range and a huge number of candidate boxes, which put great pressure on the subsequent training. In RCNN, the author chose the Selective Search method, which is much faster than previous methods and has a high recall rate. It groups image regions by jointly considering color, size, shape, texture, and other features. The method consists of two parts: a Hierarchical Grouping Algorithm and Diversification Strategies.

2.2.1 Hierarchical Grouping Algorithm

The general idea is that the author uses Felzenszwalb and Huttenlocher's segmentation method to obtain the segmented regions of the image as initial regions, uses diversification strategies to describe the regions from multiple perspectives, and then greedily merges regions iteration by iteration according to their similarity. The algorithm proceeds as follows:

  • Use the Felzenszwalb and Huttenlocher method to obtain the initial region set R.
  • Build the region similarity set S: compute the similarity of every pair of neighbouring regions and put the results into S.
  • Find the two regions with the greatest similarity in S, merge them into a new region, and add it to R.
  • Remove from S every entry involving the two merged regions, compute the similarity between the new region and its neighbouring regions, and add those entries to S.
  • Go back to step 3 until S is empty.
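As a rough sketch of this loop (assuming hypothetical helpers felzenszwalb_segment, neighbouring_pairs, neighbours, and merge; the similarity function is described in the next subsection):

def hierarchical_grouping(image):
    R = felzenszwalb_segment(image)              # step 1: initial regions
    S = {(ri, rj): similarity(ri, rj)            # step 2: neighbouring-pair similarities
         for ri, rj in neighbouring_pairs(R)}
    while S:                                     # steps 3-5
        ri, rj = max(S, key=S.get)               # the most similar pair
        rt = merge(ri, rj)
        # drop every similarity entry that involves one of the merged regions
        S = {pair: s for pair, s in S.items()
             if ri not in pair and rj not in pair}
        for rn in neighbours(rt, R):             # similarities of the new region
            S[(rt, rn)] = similarity(rt, rn)
        R.append(rt)
    return R  # every region ever created can serve as a candidate box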

2.2.2 Diversification Strategies

The author measures the similarity of two image regions along several dimensions and combines the measures with weights (the exact combination is sketched after the list), mainly the following:

  • color
  • texture
  • size
  • fill (how well the two regions fit into each other)
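In the Selective Search paper the four measures are combined as a weighted sum s(r_i, r_j) = a_1·s_colour + a_2·s_texture + a_3·s_size + a_4·s_fill, where each weight a_i ∈ {0, 1} simply switches a measure on or off. A minimal sketch, with the per-measure functions s_colour, s_texture, s_size, and s_fill as hypothetical helpers:

def similarity(ri, rj, a=(1, 1, 1, 1)):
    # weighted combination of the four diversification measures
    return (a[0] * s_colour(ri, rj) + a[1] * s_texture(ri, rj)
            + a[2] * s_size(ri, rj) + a[3] * s_fill(ri, rj))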

Even with Selective Search and similar pre-processing steps to extract potential bounding boxes as input, there is still a serious speed bottleneck. The reason is obvious: features must be extracted for every region (a great deal of repeated computation), similarities must be computed between all regions, and the number of candidate boxes obtained is still large.

The detailed similarity measures can be found in the original paper; since Selective Search is hardly used any more, it is not described further here.

2.3 Training Process

  • Network input preprocessing: all candidate boxes are resized to 227×227, but before resizing a trick is applied: the Region Proposal is expanded so that after resizing there are 16 pixels of original-image context around it (see the sketch after this list).
  • CNN network: the AlexNet classification network is used as the feature extractor. The network input is a 227×227 image patch; after 5 convolutional layers and 2 fully connected layers, a fixed 4096-dimensional feature vector is obtained (the last convolutional layer outputs 6×6×256 features, and the receptive field of each of these units is 195×195 pixels). The AlexNet network is shown in the figure below:
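A minimal sketch of the warping-with-context trick, assuming the proposal comes as (x1, y1, x2, y2) pixel coordinates and that OpenCV is available for resizing; the padding arithmetic here is illustrative and ignores the paper's handling of crops that cross the image boundary:

import cv2

def warp_with_context(image, box, out_size=227, context=16):
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # enlarge the crop so that `context` pixels of the warped output
    # map back to original-image pixels surrounding the proposal
    pad_x = context * w / (out_size - 2 * context)
    pad_y = context * h / (out_size - 2 * context)
    x1, y1 = int(max(0, x1 - pad_x)), int(max(0, y1 - pad_y))
    x2 = int(min(image.shape[1], x2 + pad_x))
    y2 = int(min(image.shape[0], y2 + pad_y))
    return cv2.resize(image[y1:y2, x1:x2], (out_size, out_size))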

In this part, the author uses AlexNet to classify the image patches of the region proposals: proposals with IoU >= 0.5 against a ground-truth box are treated as positive examples, and the rest as negative examples.

  • Pre-training model: train on ImageNet (1000 classes), then fine-tune on PASCAL VOC (21 classes: 20 object classes plus background). Training this way improved mAP by about 8 points.
  • Batch setting: 32 positive samples + 96 negative samples; in practice negative samples far outnumber positive ones.
  • Category setting: C + 1 classes, i.e., C target categories plus 1 background class.
  • Loss: classification loss + bbox regression loss; for the regression, the original paper uses ridge regression with the regularization hyperparameter λ = 1000.
  • Classifiers: an SVM classifier is trained for each class. The positive/negative rules here differ from those used when fine-tuning the CNN: only the ground-truth boxes themselves count as positive samples, proposals whose IoU is below 0.3 are negative samples, and the remaining boxes are ignored.
  • Bounding Box regression: boxes that do not match a ground truth well enough are ignored for regression. After the SVM scores are obtained, the Region Proposal is fine-tuned by predicting a refined Bbox: the CNN feature vector is fed into a linear regression model that fits the position, width, and height of the box. In other words, a mapping is learned from the existing proposal P to the ground truth GT so as to obtain a more accurate box. Four offset parameters are learned: the x/y position offsets and the width/height offsets (the paper parameterizes them relative to the box center and size, as sketched after this list).
  • Learning rate schedule: SGD with an initial learning rate of 0.001.
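For reference, the regression targets in the RCNN paper are parameterized relative to the proposal's center and size. A minimal, self-contained version:

import numpy as np

def bbox_targets(P, G):
    # P: proposal box, G: ground-truth box, both as (x, y, w, h)
    # with (x, y) the box centre, following the paper's parameterization
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    tx = (gx - px) / pw       # centre offsets, normalized by proposal size
    ty = (gy - py) / ph
    tw = np.log(gw / pw)      # width/height scaling in log space
    th = np.log(gh / ph)
    return tx, ty, tw, th

At inference time the learned regressor predicts (tx, ty, tw, th) from the CNN feature, and the transform is inverted to obtain the refined box.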

2.4 Post-processing NMS

The inference process is basically the same as training: nearly 2000 Region Proposals are generated, then classified and box-regressed. However, proposals may match a target many-to-one: several different detection boxes can have a large IoU with the same ground truth, and at inference time all of them may be returned for the same object with high classification confidence. NMS post-processing is therefore necessary to remove duplicate detection boxes.

NMS (non-maximum suppression) proceeds as follows:

Input: all detection boxes, their scores, and the NMS threshold

Output: the detection boxes, and their corresponding scores, that survive the NMS

  • For a given class, find the detection box with the highest score, record it, and put it into the new set of kept detection boxes.
  • Compute the IoU between the recorded highest-scoring box and each of the other boxes. IoU is the intersection-over-union ratio, i.e., the intersection area of the two boxes divided by their union area, as shown in the figure below:
  • If the IoU is greater than the NMS threshold, discard that box (the two boxes very likely cover the same target) and remove its entry from the box set and the score set.
  • Among the remaining detection boxes, again find the one with the highest score, and repeat the cycle until the original detection box set is empty.

The Python implementation code is as follows:

import numpy as np

def nms(dets, thresh):
    """dets: (N, 5) array of [x1, y1, x2, y2, score]; thresh: IoU threshold."""
    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]
    scores = dets[:, 4]

    # box areas; the +1 assumes inclusive integer pixel coordinates
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    # indices of boxes sorted by score, highest first
    order = scores.argsort()[::-1]

    keep = []
    while order.size > 0:
        # keep the current highest-scoring box
        i = order[0]
        keep.append(i)

        # intersection of box i with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])

        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        # IoU of box i with every remaining box
        ovr = inter / (areas[i] + areas[order[1:]] - inter)

        # drop boxes whose overlap with box i exceeds the threshold;
        # the +1 compensates for ovr being computed against order[1:]
        inds = np.where(ovr <= thresh)[0]
        order = order[inds + 1]

    return keep

3. Drawbacks of RCNN

  • Multi-stage training: pre-training (ImageNet) + Selective Search + CNN feature extractor (fine-tuned on VOC) + classifiers (SVMs) + bounding-box regression (linear regression).
  • The feature vector (1×4096) extracted from every proposal is stored on disk, costing a great deal of time and space.
  • Every warped region is sent through the CNN separately, so the convolutions are recomputed many times over.
  • Inference is slow: about 13 s per image on a GPU and 53 s per image on a CPU.

References

  1. www.cnblogs.com/fariver/p/6…
  2. arxiv.org/pdf/1311.25…
  3. zhuanlan.zhihu.com/p/55856134
  4. zihuaweng.github.io/2017/12/17/…
  5. www.huppelen.nl/publication…