Recently, I was fortunate to work on deep learning object detection in a project. After several weeks of practice and reading up on the relevant background, I can now at least linger at the threshold of the field. This article adds my own understanding of the principles behind three classic object detection algorithms; I hope that after reading it you will no longer feel lost.

Object detection

What is object detection

In a word: detecting the class and position of objects in an image.

Three object detection algorithms

As we said above, object detection is a matter of classification plus localization. For classification we can feed image regions into a classifier, but localization requires boxing the object, so how do we determine the boxes? For this we may need a pile of “candidate boxes”.

Here we have roughly identified a two-step approach to the object detection problem:

  1. Select candidate boxes;
  2. Classify the objects in the candidate boxes (boxes with low scores are eliminated later).

In this way we arrive at the prototype of the two-stage algorithms, represented by RCNN; the later Fast-RCNN and Faster-RCNN iterate on it.

Is it possible to predict the category at the same time the candidate box is selected? Yes, and that is where SSD and YOLO (one-stage) come in.

Let’s go through the specific detection steps of each algorithm in “a little bit” of detail. (I am not a specialist in this field; if there are mistakes, please correct me ~)

RCNN and its iterations (2-stage)

1. Main ideas

RCNN stands for “Region” + “CNN”, which corresponds perfectly to the candidate-box + classification approach described above.

If you want to know more about what a CNN is, you can read this article: Understanding convolutional neural networks. Briefly summarized here: a CNN reduces the dimensionality of the input image through convolution, activation, pooling, and similar operations to obtain an image feature vector, feeds that vector into fully connected layers (that is, a stack of neural network layers), and finally outputs the result.
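
To make this concrete, here is a minimal PyTorch sketch of that pipeline; the layer sizes and class count are illustrative assumptions, not from any particular paper:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: convolution/activation/pooling reduce the image to a
    feature vector, then fully connected layers output class scores."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution
            nn.ReLU(),                                   # activation
            nn.MaxPool2d(2),                             # pooling: 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                    # the feature vector [x1, x2, ..., xn]
            nn.Linear(32 * 8 * 8, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

scores = TinyCNN()(torch.randn(1, 3, 32, 32))  # -> shape (1, 10)
```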

As shown in the figure below, [x1, x2, ..., xn] can be regarded as the feature vector after dimensionality reduction, which the neural network then processes to produce the output.

Region refers to the candidate boxes mentioned above. Early region-based detection determined candidate regions by window scanning. Think of the classic game of finding one particular bird among many similar birds: we might scan line by line with our eyes, treating each spot as a grid cell and deciding whether that cell contains the target. This process is a rough intuition for how candidate regions work. Early work mainly used the sliding window technique: fix a window at a starting position in the image, move it a fixed distance each step, and detect objects inside the window.
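
A minimal sketch of the sliding window idea, with assumed window size and stride:

```python
def sliding_windows(image_w, image_h, win_w=64, win_h=64, stride=16):
    """Yield (x, y, w, h) boxes by moving a fixed-size window a fixed
    stride at a time; each box would then be cropped and classified."""
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

# Even one small image at a single window size produces many windows,
# and a naive pipeline runs a full CNN forward pass on every one.
print(sum(1 for _ in sliding_windows(640, 480)))  # 999 windows
```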

2. RCNN

From the above, I believe everyone now has a clear picture of the sliding window technique. The trouble is that we cannot know in advance which initial window size will tightly wrap the object to be detected, so many window sizes have to be tried; if there are many objects to detect, the number of windows multiplies further. And every attempt involves a complete CNN forward pass, which is obviously very time-consuming.

Therefore, for candidate box selection RCNN moved from sliding windows to SS (Selective Search). Selective Search segments the image by texture and color, identifies about 2,000 regions as candidates, and then classifies them.
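
OpenCV’s contrib package exposes a Selective Search implementation, so a minimal sketch looks like this (“test.jpg” is a placeholder path):

```python
import cv2  # requires the opencv-contrib-python package for ximgproc

img = cv2.imread("test.jpg")  # placeholder path
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()  # groups pixels by color/texture cues
rects = ss.process()              # (x, y, w, h) region proposals
print(len(rects))                 # typically a few thousand candidates
```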

3. Fast-RCNN

The improvements of Fast-RCNN are as follows:

  1. First, the whole image is fed into the CNN once to obtain a feature map.
  2. The SS regions (computed on the original image) are mapped onto the feature map to obtain feature boxes, and each feature box is pooled to a uniform size by the ROI pooling layer.
  3. Finally, the pooled feature boxes are fed into the fully connected layers for classification and box regression.

The ROI pooling layer transforms feature boxes of different sizes into feature vectors of the same size, ready for the subsequent classification and regression-box outputs. Because the expensive CNN now runs only once per image, processing is much faster.
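
torchvision ships an ROI pooling operator, so the idea can be sketched directly; the 256-channel feature map, the box coordinates, and the 7 x 7 output size here are illustrative assumptions:

```python
import torch
from torchvision.ops import roi_pool

fmap = torch.randn(1, 256, 50, 50)  # feature map from the backbone CNN
# Two candidate boxes of different sizes, as (x1, y1, x2, y2) in
# feature-map coordinates.
boxes = [torch.tensor([[0.0, 0.0, 20.0, 30.0],
                       [10.0, 10.0, 45.0, 25.0]])]
pooled = roi_pool(fmap, boxes, output_size=(7, 7))  # uniform size
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```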

4. Faster-RCNN

The Faster-RCNN pipeline is similar to Fast-RCNN’s. The major breakthrough is replacing SS with an RPN (Region Proposal Network) for candidate box extraction. The RPN works briefly as follows:

  1. Generate 9 candidate boxes (anchors) at each position of the input feature map, as shown by the red boxes below;

  2. Refine the generated base candidate boxes, which amounts to deleting the boxes that do not contain a target:
     1. Clip candidate boxes that extend beyond the image boundary;
     2. Ignore candidate boxes that are too long or too wide;
     3. Rank all candidate boxes by score and select the top 12,000;
     4. Eliminate overlapping candidate boxes with non-maximum suppression (a sketch follows this list);
     5. Take the top 2,000 and make a second round of correction.
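
The overlap-elimination step above is typically done with non-maximum suppression (NMS): keep the highest-scoring box, drop boxes that overlap it too much, and repeat. A minimal NumPy sketch; the 0.7 IoU threshold is an assumed, commonly used value:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS. boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,)."""
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best box against all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # discard heavy overlaps
    return keep
```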

SSD (1-stage)

Compared with RCNN, SSD omits the candidate-region selection step. Instead it divides the image’s feature map into N * N cells (the maximum N in the figure below is 38) and presets 4-6 default candidate boxes at the center point of each cell.
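
As a rough illustration of this layout (the scale and aspect ratios below are made-up values for the sketch, not SSD’s exact configuration):

```python
def default_boxes(n, scale=0.2, ratios=(1.0, 2.0, 0.5)):
    """For each cell of an n x n grid, place one default box per aspect
    ratio at the cell center, in normalized (cx, cy, w, h) coordinates."""
    boxes = []
    for row in range(n):
        for col in range(n):
            cx, cy = (col + 0.5) / n, (row + 0.5) / n
            for r in ratios:
                boxes.append((cx, cy, scale * r ** 0.5, scale / r ** 0.5))
    return boxes

print(len(default_boxes(38)))  # 38 * 38 * 3 = 4332 boxes on one feature map
```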

As shown in the figure below, SSD first extracts features from the image and then adds several auxiliary convolutional layers, with default candidate boxes set up on each layer. Multiple convolutional layers are used so that objects of different sizes can be recognized better.

During training, each generated default candidate box is matched against the actual labeled region using IoU = area(candidate ∩ actual) / area(candidate ∪ actual). If the IoU value is greater than 0.5, the box matches the actual region, and from the matched boxes we can get the approximate shape of the target. The figure below shows an 8*8 feature map. When detecting the person, SSD sets three default candidate boxes (green) at the center point of cell (4, 5). Comparing them with the ground-truth box (blue), we can see that boxes 1 and 2 have high overlap.
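
For reference, the IoU calculation above is only a few lines of code. A minimal sketch, assuming corner-format boxes (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """IoU = intersection area / union area for two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 4, 4), (2, 2, 6, 6)))  # overlap 4, union 28 -> ~0.143
```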

Here is a brief illustration of why multiple convolutional layers are used. In the figure below, B is an 8 * 8 feature map and C is a 4 * 4 feature map. For detecting the cat, figure B determines the selection area well. For detecting the dog, however, the candidate boxes of figure B (right) clearly fail to wrap the target, while feature map C handles the dog well.

Therefore, high-resolution feature maps are well suited to detecting small objects, while low-resolution feature maps are well suited to detecting large objects.

YOLO (1-stage)

The general principle of YOLO is similar to that of SSD. However, YOLO uses only the topmost feature map for detection, unlike SSD, which uses feature maps of multiple sizes from multiple auxiliary convolutional layers (a pyramid structure). I won’t repeat the details here.