1. Motivation: making RCNN end-to-end trainable

Fast RCNN is Ross Girshick's follow-up to RCNN. The publication of RCNN is regarded as the pioneering work of deep-learning object detection. Later, Kaiming He's SPP Net solved RCNN's redundant convolution computation and greatly improved training and inference speed, but it still could not get rid of multi-stage training; see the previous SPP Net technical blog post. To achieve end-to-end training, Fast RCNN has to solve the problem that gradients cannot be propagated back through the SPP layer, and has to integrate the classification and bounding-box regression tasks. Compared with the previous two algorithms (RCNN and SPP Net), Fast RCNN mainly proposes two things to realize end-to-end training:

  • Multitask loss function
  • Region of Interest (RoI) pooling layer

Both are explained in detail below.

2. Principle of Fast RCNN

2.1 Pipeline

Overall, Fast RCNN is divided into four steps:

  • Use the Selective Search method to generate roughly 2000 candidate boxes from the original image.
  • Input the whole image into the CNN to obtain the feature map, and project the candidate boxes generated by Selective Search onto the feature map to obtain the corresponding feature matrices.
  • Input each feature matrix into the RoI pooling layer, scale it to 7 × 7, and then flatten it to obtain a feature vector of fixed dimension.
  • Feed the flattened features into the classification branch and the regression branch to predict the classification and regression results (a minimal sketch of these four steps follows this list).
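
To make the pipeline concrete, here is a minimal PyTorch sketch of the four steps, assuming a VGG16 backbone with total stride 16 and two hypothetical proposal boxes standing in for the ~2000 Selective Search outputs; torchvision's `roi_pool` plays the role of the RoI pooling layer described later:

```python
import torch
import torchvision
from torchvision.ops import roi_pool

# Backbone: VGG16 conv layers up to conv5_3 (drop the final max pool), total stride 16.
backbone = torchvision.models.vgg16(weights=None).features[:-1]

image = torch.randn(1, 3, 600, 800)                  # step 2: the whole image goes through the CNN once
proposals = torch.tensor([[ 48.,  64., 320., 480.],  # step 1: Selective Search boxes (x1, y1, x2, y2);
                          [100., 120., 400., 560.]]) # ~2000 per image in the actual pipeline

feat = backbone(image)                               # shared conv5 feature map, [1, 512, 37, 50]
# steps 2-3: project each box onto the feature map and max-pool it to a fixed 7x7 grid
rois = roi_pool(feat, [proposals], output_size=(7, 7), spatial_scale=1.0 / 16)
flat = rois.flatten(start_dim=1)                     # step 3: fixed-length vector per proposal
print(flat.shape)                                    # torch.Size([2, 25088]) = 512 * 7 * 7
```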

2.2 Network structure changes

Compared with SPP Net, the network structure has been adjusted:

  • The feature-extraction backbone is changed from AlexNet to VGG, which has stronger feature-extraction ability.
  • The SPP module is replaced with RoI pooling.
  • The classification and regression tasks are integrated through a multi-task loss function, so the object detection task no longer needs to be trained in stages.
  • After the RoI feature vector is extracted, two branches are connected in parallel. The SVM classifier is replaced by a Softmax over C + 1 classes (including background), and the per-class linear regression models are replaced by a fully connected bounding-box regression head; the new regressor outputs candidate-box regression parameters (dx, dy, dw, dh) for each of the C + 1 categories, for a total of (C + 1) × 4 output nodes, as sketched below. The meaning of the regression parameters is consistent with RCNN.
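
A hedged sketch of the two parallel branches (C = 20 as in Pascal VOC, so 21 classification outputs and 21 × 4 = 84 regression outputs; module and layer names are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class FastRCNNHeads(nn.Module):
    """Parallel classification and box-regression branches on top of flattened RoI features."""
    def __init__(self, in_features=512 * 7 * 7, num_classes=20):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_features, 4096), nn.ReLU(),
                                nn.Linear(4096, 4096), nn.ReLU())
        self.cls_score = nn.Linear(4096, num_classes + 1)        # C + 1 logits -> Softmax
        self.bbox_pred = nn.Linear(4096, (num_classes + 1) * 4)  # (dx, dy, dw, dh) per class

    def forward(self, flat_rois):
        x = self.fc(flat_rois)
        return self.cls_score(x), self.bbox_pred(x)
```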

2.3 Only one convolution computation per image

Here the author borrows SPP Net's idea: the whole image is fed into the network once, and candidate-region features are then extracted from the shared feature map. Thanks to the coordinate correspondence between the original image and the feature map, the features of these candidate regions do not need to be recomputed. See the previous SPP Net technical blog post for details.
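
Since VGG16's conv layers downsample by a total stride of 16, the projection amounts to dividing box coordinates by that stride; a minimal sketch (the exact rounding convention varies between implementations):

```python
def project_box_to_feature_map(box, stride=16):
    """Map an (x1, y1, x2, y2) box from image coordinates to feature-map
    coordinates by dividing by the backbone's cumulative stride."""
    x1, y1, x2, y2 = box
    return (int(x1 / stride), int(y1 / stride),
            int(x2 / stride), int(y2 / stride))

print(project_box_to_feature_map((48, 64, 320, 480)))  # -> (3, 4, 20, 30)
```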

2.4 Region of Interest

RoI pooling can be regarded as a simplified version of SPP. The original SPP concatenates the outputs of multi-scale pooling into a new feature, while RoI pooling uses only a single scale and can scale a feature matrix of any size to a fixed size. The specific method in the paper is to divide the height and width evenly into a 7 × 7 grid of bins and apply max pooling within each bin; the channel dimension is unchanged, so the output dimension is fixed. Meanwhile, because RoI pooling is not multi-scale, gradient backpropagation through it is very convenient, which provides the conditions for fine-tuning the convolutional layers. (SPP Net could not fine-tune the conv layers.)
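
A minimal sketch of RoI max pooling for a single box, assuming the box is already in feature-map coordinates; `adaptive_max_pool2d` stands in for the divide-into-7×7-bins-then-max operation (real implementations handle bin boundaries more carefully):

```python
import torch
import torch.nn.functional as F

def roi_pool_single(feat, box, output_size=7):
    """Naive RoI max pooling for one box on a [C, H, W] feature map."""
    x1, y1, x2, y2 = box
    region = feat[:, y1:y2 + 1, x1:x2 + 1]
    # Divide the region into a 7x7 grid of bins and take the max in each bin,
    # so a region of any size maps to a fixed 7x7 output.
    return F.adaptive_max_pool2d(region, output_size)

feat = torch.randn(512, 37, 50)              # conv5 feature map
out = roi_pool_single(feat, (3, 4, 20, 30))
print(out.shape)                             # torch.Size([512, 7, 7])
```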

2.5 Multi-task loss function design

The multi-task loss function integrates the classification loss and the regression loss, realizing an end-to-end training process. The loss function is:

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \, [u \ge 1] \, L_{loc}(t^u, v)$$

where $p = (p_0, p_1, \ldots, p_C)$ is the Softmax probability distribution predicted by the classifier, $u$ is the ground-truth class label of the target, $t^u$ is the regression parameter predicted by the bounding-box regressor for the true class $u$, and $v$ is the regression parameter of the ground-truth box.

The classification loss is the Negative Log Likelihood loss, $L_{cls}(p, u) = -\log p_u$. (Since $p$ is computed by a Softmax, this is equivalent to computing the classification loss with CrossEntropyLoss.)
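
Since $p$ comes from a Softmax over logits, $-\log p_u$ is exactly what PyTorch's `F.cross_entropy` computes; a minimal illustration with assumed shapes:

```python
import torch
import torch.nn.functional as F

cls_logits = torch.randn(128, 21)       # 128 RoIs, C + 1 = 21 classes (VOC + background)
labels = torch.randint(0, 21, (128,))   # u: ground-truth class per RoI
loss_cls = F.cross_entropy(cls_logits, labels)  # mean of -log p_u over the batch
```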

The bounding-box regression loss is changed from the L2 loss of RCNN and SPP Net to the Smooth L1 loss, fitted to the box center offsets and width/height deviations:

$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i^u - v_i), \qquad \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$

Here are a few points to note about the bounding box loss function:

  • In the bounding-box loss, $[u \ge 1]$ is the Iverson bracket: background-class boxes ($u = 0$) contribute no bounding-box loss, so the loss focuses on boxes that belong to an actual target, while $\lambda$ balances the two task losses.
  • Benefits of opting for Smooth L1:
    • The disadvantage of L1 loss is that it is not differentiable at 0. Late in training, when the difference between the prediction and the ground truth is small, the absolute value of the gradient of L1 loss with respect to the prediction is still 1; with an unchanged learning rate, the loss therefore fluctuates around a stable value, and it is hard to keep converging to higher accuracy.
    • The disadvantage of L2 loss is that when the error x is very large, the loss (and its gradient) is also very large, which easily makes training unstable.
    • The advantage of Smooth L1 is that when the difference between the predicted box and the ground truth is large, the gradient is bounded, which is more robust to outliers and avoids gradient explosion; when the difference is small, the gradient is small enough for stable convergence (a sketch of the full multi-task loss follows this list).
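
Putting the pieces together, a hedged sketch of the full multi-task loss (tensor shapes are assumptions, and the per-RoI averaging is a simplification of the paper's normalization):

```python
import torch
import torch.nn.functional as F

def smooth_l1(x):
    """smooth_L1(x) = 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

def multitask_loss(cls_logits, bbox_deltas, labels, reg_targets, lam=1.0):
    """L(p, u, t^u, v) = L_cls(p, u) + lam * [u >= 1] * L_loc(t^u, v).
    cls_logits: [N, C+1]; bbox_deltas: [N, (C+1)*4]; labels (u): [N];
    reg_targets (v): [N, 4]. Assumes at least one foreground RoI in the batch."""
    loss_cls = F.cross_entropy(cls_logits, labels)               # -log p_u
    fg = torch.nonzero(labels >= 1).squeeze(1)                   # Iverson bracket [u >= 1]
    t_u = bbox_deltas.view(len(labels), -1, 4)[fg, labels[fg]]   # t^u: deltas of the true class
    loss_loc = smooth_l1(t_u - reg_targets[fg]).sum(dim=1).mean()
    return loss_cls + lam * loss_loc
```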

2.6 Training and inference process

Training process: input the whole image into the CNN, extract candidate boxes with the Selective Search (SS) algorithm, map each candidate box onto the conv5 feature map to obtain its feature matrix, apply RoI pooling to pool it to a fixed size, and then pass it through the fully connected layers. The resulting features are fed into the Softmax classifier and the bounding-box regressor respectively (each adds an FC layer sized to its output dimension), and the multi-task joint loss is computed and backpropagated, realizing end-to-end network training.

Inference process: the same as the training process, with per-class NMS post-processing added on the predictions.
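
A small example of the NMS step using torchvision's `nms` (in the actual post-processing it is run separately for each class):

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[ 10.,  10., 100., 100.],
                      [ 12.,  12.,  98., 102.],   # heavily overlaps the first box
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.9, 0.8, 0.7])
keep = nms(boxes, scores, iou_threshold=0.3)
print(keep)                                       # tensor([0, 2]): the duplicate is suppressed
```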

A mini-batch sampling method is adopted to train the model:

  • batch_size=128
  • Each batch comes from two images, with 64 candidate regions taken per image at a positive-to-negative ratio of 1:3. A positive sample is one whose IoU with a ground-truth box is at least 0.5; a negative is one whose IoU lies in [0.1, 0.5), which acts as a hard-negative mining strategy (a sampling sketch follows this list).
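
A hedged sketch of this sampling rule for one image (`ious` holds each proposal's maximum IoU with any ground-truth box; thresholds follow the bullet above):

```python
import numpy as np

def sample_rois(ious, rois_per_image=64, fg_fraction=0.25):
    """Sample RoI indices for one image: positives have IoU >= 0.5,
    negatives (hard examples) have IoU in [0.1, 0.5), ratio ~1:3."""
    fg = np.where(ious >= 0.5)[0]
    bg = np.where((ious >= 0.1) & (ious < 0.5))[0]
    n_fg = min(int(rois_per_image * fg_fraction), len(fg))
    n_bg = min(rois_per_image - n_fg, len(bg))
    return np.concatenate([np.random.choice(fg, n_fg, replace=False),
                           np.random.choice(bg, n_bg, replace=False)])

ious = np.random.rand(2000)        # max IoU of each Selective Search box
print(sample_rois(ious).shape)     # up to (64,)
```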

3. Effect and disadvantages of Fast RCNN

3.1 Effect and contribution

To sum up, the effect of Fast RCNN is both higher speed and higher accuracy: its mAP is 0.9 points higher than RCNN, its training is 8.8× faster, and its testing is 146× faster. Its main contribution is realizing, for the first time, end-to-end training of a deep-learning object detection network, with a major breakthrough in speed.

3.2 Problems and disadvantages of Fast RCNN

Fast RCNN's network inference takes only 0.32 s per image on a GPU, but Selective Search takes about 2 s; in other words, Selective Search severely restricts Fast RCNN's speed and becomes the main bottleneck. (Later, Faster RCNN proposed the RPN network to solve this problem.)
