Faster R-CNN Explained

Preface

This article is mainly based on Bestrivern's blog. It works through the Faster R-CNN network to deepen understanding of it and to organize the insights gained along the way.

1. Overview

Let me first show you a diagram of the structure of Faster R-CNN.

After the groundwork laid by R-CNN and Fast R-CNN, Ross B. Girshick and colleagues proposed Faster R-CNN in 2015. Structurally, Faster R-CNN integrates feature extraction, proposal extraction, bounding box regression (rect refine), and classification into a single network, which greatly improves overall performance, especially detection speed.

[Figure 1: Faster R-CNN network structure]

Faster R-CNN can actually be divided into four main modules:

  • Conv layers (feature extraction). As a CNN-based object detection method, Faster R-CNN first uses a set of basic conv+relu+pooling layers to extract the image's feature maps. These feature maps are shared by the subsequent RPN and fully connected layers.
  • Region Proposal Networks (candidate box extraction). The RPN network generates region proposals. This layer uses softmax to judge whether anchors are positive or negative, then corrects the anchors via bounding box regression to obtain precise proposals.
  • RoI Pooling (gathering the valid candidate boxes). This layer collects the input feature maps and proposals, synthesizes this information to extract proposal feature maps, and sends them to the subsequent fully connected layer to determine the object category.
  • Classification (classify objects while refining the candidate boxes). The proposal category is computed from the proposal feature maps, and bounding box regression is applied once more to obtain the exact final position of each detection box.

Overall process:

The figure below shows the network structure of faster_rcnn_test.pt in the Python version's VGG16 model. It can be clearly seen that for an image of arbitrary size PxQ, the network first scales it to a fixed size MxN and then feeds the MxN image into the network. Conv layers contains 13 conv layers, 13 relu layers, and 4 pooling layers. The RPN network first goes through a 3×3 convolution, then generates positive anchors and the corresponding bounding box regression offsets, and computes proposals from them. The RoI Pooling layer uses the proposals to extract each proposal's features from the feature maps and sends them to the subsequent fully connected and softmax network for classification (i.e., what object each proposal is).

[Figure 2: faster_rcnn_test.pt network structure (VGG16)]

2. The Network in Detail

1. Conv layers

Conv layers includes three kinds of layers: conv, pooling, and relu. Taking the network structure of faster_rcnn_test.pt in the Python VGG16 model as an example, as shown in the figure above, the Conv layers section has 13 conv layers, 13 relu layers, and 4 pooling layers. Here is an easily overlooked but extremely important piece of information about Conv layers:

  • All conv layers are: kernel_size=3, pad=1, stride=1
  • All pooling layers are: kernel_size=2, pad=0, stride=2

Why is this important? In Faster R-CNN's Conv layers, every convolution is padded (pad=1, i.e., a one-pixel border of zeros), so the original MxN image becomes (M+2)x(N+2) before the 3×3 convolution, which then outputs MxN again. It is this setting that makes the conv layers in Conv layers preserve the input and output matrix sizes. As illustrated below:

[Figure: a 3×3 convolution with pad=1 preserves the MxN size]

Similarly, the pooling layers in Conv layers have stride=2, so each MxN matrix that passes through a pooling layer becomes (M/2)x(N/2). In summary, across the whole of Conv layers, the conv and relu layers do not change the input/output sizes; only the pooling layers halve the output width and height.

So, a matrix of size MxN becomes (M/16)x(N/16) after passing through Conv layers! This way, the feature map generated by Conv layers can be mapped back onto the original image.
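As a sanity check, here is a minimal Python sketch of this size arithmetic (the helper names are mine; note that Caffe pooling rounds up, which is why ceil() appears in the anchor count later):

import math

def conv_out(size, kernel=3, pad=1, stride=1):
    # conv output size: (size + 2*pad - kernel) // stride + 1
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    # Caffe pooling uses ceil mode
    return math.ceil((size - kernel) / stride) + 1

m, n = 800, 600                      # the M x N input
for _ in range(4):                   # the 4 pooling stages in VGG16's conv1-conv5
    m, n = conv_out(m), conv_out(n)  # 3x3 / pad 1 / stride 1 convs keep the size
    m, n = pool_out(m), pool_out(n)  # 2x2 / stride 2 pooling halves it
print(m, n)                          # 50 38, i.e. (M/16) x (N/16)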

My own understanding:

What this part accomplishes is feature extraction. A convolution essentially blends a patch of pixels into a single point, and that point carries different weights for the surrounding pixels. By using different channels (different filter parameters), we obtain different blended versions of the image, so each newly generated channel image reflects the original image's characteristics under different parameters. These channel images are then used to represent the whole image. That is the actual function of convolution: obtaining the feature map.

2. Region Proposal Networks (RPN)

Classic detection methods are very time-consuming at generating detection boxes: OpenCV AdaBoost, for example, uses a sliding window plus image pyramid, while R-CNN uses Selective Search (SS). Faster R-CNN abandons the traditional sliding window and SS methods and uses RPN directly to generate detection boxes, which is a huge advantage of Faster R-CNN and greatly speeds up detection box generation.

[Figure: RPN network structure]

The figure above shows the concrete structure of the RPN network. You can see that RPN actually splits into two lines: the upper one classifies anchors as positive or negative via softmax, and the lower one computes the bounding box regression offsets of the anchors to obtain precise proposals. The final Proposal layer is responsible for synthesizing positive anchors and the corresponding bounding box regression offsets into proposals, while eliminating proposals that are too small or exceed the image boundary. By the time the network reaches the Proposal layer, it has effectively completed target localization.

Conclusion:

  1. By turning the 128 candidate boxes into a binary softmax classification problem, we can tell which of the 128 boxes are real and valid and which are invalid.
  2. The 1×1 line below computes the box offsets for the candidate boxes, generating four values $(x, y, scale_h, scale_w)$: the center offset and the scaling ratios.
  3. The final Proposal step integrates the previous two parts: it keeps the valid anchors and calibrates them, adjusting the center point and the box scaling (and removing positive anchors that extend beyond the image itself).

2.1 anchors

Anchors are actually a set of rectangles generated by rpn/generate_anchors.py. Running generate_anchors.py directly from the author's demo yields the following output:

These are the coordinates of the nine rectangles generated for a single central point:

[[ -84.  -40.   99.   55.]
 [-176.  -88.  191.  103.]
 [-360. -184.  375.  199.]
 [ -56.  -56.   71.   71.]
 [-120. -120.  135.  135.]
 [-248. -248.  263.  263.]
 [ -36.  -80.   51.   95.]
 [ -80. -168.   95.  183.]
 [-168. -344.  183.  359.]]

The four values in each row $(x_1, y_1, x_2, y_2)$ are the coordinates of the rectangle's upper-left and lower-right corners. The 9 rectangles come in 3 shapes, with aspect ratios width:height $\in$ {1:1, 1:2, 2:1}, as shown below. Anchors in effect introduce the multi-scale approach that is often used in detection.

[Figure: the 9 anchors, 3 aspect ratios at 3 scales]
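A minimal numpy sketch of the idea behind generate_anchors.py (a simplified re-derivation, not the author's exact code): start from a 16×16 base box, enumerate 3 aspect ratios while keeping the area fixed, then blow each shape up by 3 scales:

import numpy as np

def make_anchors(base_size=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    # Returns 9 anchors as (x1, y1, x2, y2) rows around one center point.
    cx = cy = (base_size - 1) / 2.0            # center of the base box
    area = base_size * base_size
    anchors = []
    for r in ratios:                           # r = height / width
        w = np.round(np.sqrt(area / r))        # keep area, change aspect ratio
        h = np.round(w * r)
        for s in scales:                       # scale widths and heights up
            ws, hs = w * s, h * s
            anchors.append([cx - (ws - 1) / 2, cy - (hs - 1) / 2,
                            cx + (ws - 1) / 2, cy + (hs - 1) / 2])
    return np.array(anchors)

print(make_anchors())    # 9 rows matching the matrix above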

Note: the anchor sizes above are actually set according to the detection image. In the Python demo, input images of arbitrary size are reshaped to 800×600 (i.e., M=800, N=600 in Figure 2). Looking back at the anchor sizes, the largest 1:2 anchor is 352×704 and the largest 2:1 anchor is 736×384, so they basically cover every scale and shape of an 800×600 image.

So what are these nine anchors for? Using the figure from the Faster R-CNN paper below: the feature maps computed by Conv layers are traversed, and every point is assigned these 9 anchors as initial detection boxes. The resulting boxes are not very accurate, but don't worry: there are two rounds of bounding box regression later that can correct the detection box positions.

[Figure: 9 anchors centered at each feature map point]

Let's explain the numbers in the figure above:

  1. The original paper used the ZF model, where num_output of the last conv5 layer in Conv layers is 256, generating 256 feature maps correspondingly, so each point in the feature map is 256-dimensional. In other words, the resulting feature map is $\frac{N}{16} \times \frac{M}{16} \times 256$.
  2. After conv5, an rpn_conv 3×3 convolution is performed with num_output=256, so each point fuses the 3×3 spatial information around it (guess: this might make it more robust? I didn't test it), while remaining 256-d (the red box in the figure above).
  3. Assume there are k anchors at each point of the conv5 feature map (default k=9). Each anchor must be classified as positive or negative, so each point's 256-d feature is converted into cls=2k scores; and each anchor has 4 offsets (x, y, w, h), so reg=4k coordinates.
  4. In addition, all of those anchors are far too many for training, so during training 128 positive anchors + 128 negative anchors are randomly selected from the suitable anchors (what counts as a suitable anchor is explained below).

Note that this article uses VGG, whose conv5 has num_output=512, so the feature is 512-d; everything else follows analogously.

In fact, RPN ultimately lays out dense candidate anchors at the scale of the original image, and then uses a CNN to determine which anchors are positive anchors containing an object and which are negative anchors without one. So it is just a binary classification! (The core idea of RPN.)

So how many anchors are there in total? The original image is 800×600, VGG downsamples it by 16×, and each feature map point is assigned 9 anchors, so:

$\text{ceil}(800 / 16) \times \text{ceil}(600 / 16) \times 9 = 50 \times 38 \times 9 = 17100$

where ceil() means rounding up, because the feature map output by VGG is 50×38 in size.
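The same count in a couple of lines of Python, making the ceil() explicit:

import math

feat_w, feat_h = math.ceil(800 / 16), math.ceil(600 / 16)   # 50, 38
print(feat_w * feat_h * 9)                                  # 17100 anchors in total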

2.2 Softmax determines positive and negative

After an M×N image is fed into the Faster R-CNN network, it becomes (M/16)x(N/16) at the RPN. Let W=M/16 and H=N/16. Before entering softmax, a 1×1 convolution is applied, as shown below:

The 1×1 convolution's caffe prototxt definition is as follows:

layer {
  name: "rpn_cls_score"
  type: "Convolution"
  bottom: "rpn/output"
  top: "rpn_cls_score"
  convolution_param {
    num_output: 18   # 2(positive/negative) * 9(anchors)
    kernel_size: 1
    pad: 0
    stride: 1
  }
}

You can see that num_output=18, i.e., the output after this convolution is W x H x 18 in size. This matches the feature maps having 9 anchors per point, with each anchor potentially positive or negative: all these scores are stored in a W x H x (9*2) matrix. Why do it this way? Softmax then yields the positive anchors, which means the candidate region boxes have been initially extracted (candidates are generally taken to be on positive anchors).

Using __C.TRAIN.RPN_POSITIVE_OVERLAP = 0.7 and __C.TRAIN.RPN_NEGATIVE_OVERLAP = 0.3, 128 positive anchors and 128 negative anchors are selected from all the generated anchors (sampled with numpy's random choice). Labels are set according to the relationship between IoU and these thresholds (label 1 is positive, 0 is negative, -1 is don't care).
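A minimal sketch of this labeling-and-sampling logic (simplified from py-faster-rcnn's anchor target layer; the function name and the IoU-matrix input are illustrative assumptions):

import numpy as np

def label_anchors(ious, pos_thresh=0.7, neg_thresh=0.3, num_each=128):
    # ious: (num_anchors, num_gt) IoU matrix.
    # Returns labels: 1 = positive, 0 = negative, -1 = don't care.
    labels = np.full(ious.shape[0], -1, dtype=np.int32)
    max_iou = ious.max(axis=1)           # each anchor's best overlap with any GT
    labels[max_iou < neg_thresh] = 0
    labels[max_iou >= pos_thresh] = 1
    labels[ious.argmax(axis=0)] = 1      # the best anchor for each GT is also positive
    for value in (1, 0):                 # subsample to at most 128 of each
        idx = np.where(labels == value)[0]
        if len(idx) > num_each:
            drop = np.random.choice(idx, len(idx) - num_each, replace=False)
            labels[drop] = -1
    return labels

ious = np.random.rand(17100, 5)          # fake IoUs of 17100 anchors vs 5 GT boxes
labels = label_anchors(ious)
print((labels == 1).sum(), (labels == 0).sum())   # at most 128 of each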

So why is there a reshape layer both before and after softmax? It is simply for the convenience of softmax classification, and the concrete reason comes from the Caffe implementation. Caffe's basic data structure, the blob, stores data in the following form:

blob = [batch_size, channel, height, width]

The positive/negative anchors matrix described above is stored in a caffe blob as [1, 2×9, H, W]. Since softmax classification needs to decide between positive and negative, the reshape layer changes the blob to [1, 2, 9×H, W], i.e., a dimension of size 2 is "freed up" specifically for softmax to classify over, after which the shape is restored. This matches the check in caffe's softmax_loss_layer.cpp:

"Number of labels must match number of predictions; "
"e.g., if softmax axis == 1 and prediction shape is (N, C, H, W), "
"label count (number of labels) must be N*H*W, "
"with integer values in {0, 1, ... , C-1}.";

To sum up, the RPN network uses anchors and softmax to initially extract positive anchors as candidate regions (some implementations use sigmoid instead of softmax; the principle is similar).
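The reshape trick in numpy terms (illustrative shapes only; in Caffe these are the two reshape layers around the softmax):

import numpy as np

H, W, k = 38, 50, 9
scores = np.random.randn(1, 2 * k, H, W)       # rpn_cls_score: [1, 2*9, H, W]

x = scores.reshape(1, 2, k * H, W)             # "free up" a size-2 dim for softmax
e = np.exp(x - x.max(axis=1, keepdims=True))   # softmax over the positive/negative dim
probs = e / e.sum(axis=1, keepdims=True)

probs = probs.reshape(1, 2 * k, H, W)          # reshape back: rpn_cls_prob_reshape
print(probs.shape)                             # (1, 18, 38, 50)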

My own understanding:

The overall flow of this part: first a 3×3 convolution blends the feature maps once more; then a 1×1×18 convolution gives each spatial position 18 channels, i.e., a positive score and a negative score for each of the 9 anchors, in preparation for the subsequent softmax classification. The two shape changes exist only because the original author was constrained by how Caffe stores data.

2.3 Bounding box Regression Principle

As shown in the picture below, the green box is the Ground Truth (GT) of an airplane, and the red box is the extracted positive anchors. Even if the red box is recognized as an airplane by the classifier, this image has not properly detected the airplane because the red box is poorly localized. So we want a way to fine-tune the red box so that the positive anchors get closer to the GT.

A window is generally represented by a four-dimensional vector (x, y, w, h), denoting its center coordinates, width, and height. In the figure below, the red box A represents the original positive anchor and the green box G represents the target GT. Our goal is to find a relationship that maps the input anchor A to a regression window G' that is closer to the real window G, that is:

Given anchor $A = (A_x, A_y, A_w, A_h)$ and $GT = (GT_x, GT_y, GT_w, GT_h)$, we look for a transformation $F$ such that $F(A_x, A_y, A_w, A_h) = (GT'_x, GT'_y, GT'_w, GT'_h)$, where $(GT'_x, GT'_y, GT'_w, GT'_h) \approx (GT_x, GT_y, GT_w, GT_h)$.

So what transformation F takes us from anchor A in the figure to G'? A simple way to think about it is:

  • First do translation:

$G'_x = A_w \cdot d_x(A) + A_x$

$G'_y = A_h \cdot d_y(A) + A_y$

  • Then do scaling:

$G'_w = A_w \cdot \exp(d_w(A))$

$G'_h = A_h \cdot \exp(d_h(A))$

Observing the above four formulas, what needs to be learned are the four transformations $d_x(A)$, $d_y(A)$, $d_w(A)$, $d_h(A)$. When the input anchor A and the GT differ only slightly, the transformation can be treated as linear, so linear regression can be used to model the window fine-tuning (note that linear regression modeling is only valid when anchor A and GT are close; otherwise it is a complex nonlinear problem).

The next question is how to obtain $d_x(A)$, $d_y(A)$, $d_w(A)$, $d_h(A)$ through linear regression. Linear regression learns a set of parameters W such that, for an input feature vector X, the regressed value WX is very close to the true value Y. For this problem, the input X is the CNN feature map, denoted $\phi$; the supervision is the transformation between A and GT, namely $(t_x, t_y, t_w, t_h)$; and the output is the four transformations $d_x(A)$, $d_y(A)$, $d_w(A)$, $d_h(A)$. The objective function can then be expressed as:

$d_*(A) = W_*^T \cdot \phi(A)$

where $\phi(A)$ is the feature vector composed of the feature map region corresponding to anchor A, $W_*$ are the parameters to be learned, and $d_*(A)$ is the resulting prediction (* stands for x, y, w, h; each transformation has one objective function of the form above). To minimize the gap between the prediction $d_*(A)$ and the true value $t_*$, the loss function is designed as:

$\text{Loss} = \sum_{i}^{N} \left| t_*^i - W_*^T \phi(A^i) \right|$

The optimization objective of the function is:

$\hat{W}_* = \text{argmin}_{W_*} \sum_{i}^{N} \left| t_*^i - W_*^T \phi(A^i) \right| + \lambda \left\| W_* \right\|$

It should be noted that this linear transformation is only approximately valid when the GT is close to the box being regressed. Having covered the principle, the corresponding formulas in the Faster R-CNN paper for the translation $(t_x, t_y)$ and scale factors $(t_w, t_h)$ between a positive anchor and the ground truth are:

$t_x = (x - x_a) / w_a, \quad t_y = (y - y_a) / h_a$

$t_w = \log(w / w_a), \quad t_h = \log(h / h_a)$

where $(x, y, w, h)$ denotes the ground-truth (or predicted) box center, width, and height, and $(x_a, y_a, w_a, h_a)$ those of the anchor.

For training the regression branch of the bounding box regression network, the input is the CNN feature $\phi$ and the supervision signal is the gap $(t_x, t_y, t_w, t_h)$ between anchor and GT; i.e., the training objective is for the network output to be as close as possible to the supervision signal given input $\phi$. When bounding box regression then runs at inference time and $\phi$ is fed in again, the regression branch outputs each anchor's translation and scaling transformation $(t_x, t_y, t_w, t_h)$, which can obviously be used to correct the anchor's position.

$(t_x, t_y, t_w, t_h)$ is computed for all positive anchors, but in the end only 128 are selected for the loss computation and back-propagation.
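Both directions of the mapping in a short numpy sketch (function names are mine), following the formulas above: bbox_targets computes $t_*$ from an anchor and its GT, apply_deltas maps an anchor to G' given predicted offsets, and applying the exact targets recovers the GT:

import numpy as np

def bbox_targets(anchor, gt):
    # t_* between an anchor and a GT box, both given as (cx, cy, w, h)
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return np.array([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)])

def apply_deltas(anchor, d):
    # map anchor A to the refined box G' using predicted (dx, dy, dw, dh)
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = d
    return np.array([aw * dx + ax, ah * dy + ay,
                     aw * np.exp(dw), ah * np.exp(dh)])

A  = np.array([100.0, 100.0, 64.0, 128.0])
GT = np.array([110.0,  95.0, 70.0, 120.0])
print(apply_deltas(A, bbox_targets(A, GT)))   # [110. 95. 70. 120.], i.e. the GT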

My own understanding:

In this part we effectively train another small network: a simple linear regression. Using a variance-style loss function, the network is trained on the gap between our generated anchors and the real ground-truth labels, finally producing $(t_x, t_y, t_w, t_h)$.

2.4 Bounding box regression on the proposals

With bounding box regression understood, look back at the second line of the RPN network, as shown in the figure below.

[Figure: the bounding box regression branch of RPN]

First take a look at the caffe prototxt definition of the 1×1 convolution in the figure above:

layer {
  name: "rpn_bbox_pred"
  type: "Convolution"
  bottom: "rpn/output"
  top: "rpn_bbox_pred"
  convolution_param {
    num_output: 36   # 4 * 9(anchors)
    kernel_size: 1
    pad: 0
    stride: 1
  }
}

It can be seen that num_output=36, i.e., the output after this convolution is W x H x 36, stored in a caffe blob as [1, 4 x 9, H, W]. This corresponds to each point of the feature maps having 9 anchors, with each anchor having 4 transformation quantities $d_x(A)$, $d_y(A)$, $d_w(A)$, $d_h(A)$.

VGG outputs 50×38×512 features, corresponding to 50×38×k anchors, while the RPN outputs:

  • a positive/negative softmax classification feature matrix of size 50×38×2k
  • a regression coordinate feature matrix of size 50×38×4k

which is exactly what is needed to complete positive/negative classification plus bounding box regression.

2.5 Proposal Layer

The Proposal layer is responsible for synthesizing all the $d_x(A)$, $d_y(A)$, $d_w(A)$, $d_h(A)$ transformations and the positive anchors, computing accurate proposals, and feeding them into the subsequent RoI Pooling layer. Let's look at the Proposal layer's caffe prototxt definition:

layer {
  name: 'proposal'
  type: 'Python'
  bottom: 'rpn_cls_prob_reshape'
  bottom: 'rpn_bbox_pred'
  bottom: 'im_info'
  top: 'rois'
  python_param {
    module: 'rpn.proposal_layer'
    layer: 'ProposalLayer'
    param_str: "'feat_stride': 16"}}Copy the code

The Proposal layer has three inputs:

  • the positive/negative anchors classifier result rpn_cls_prob_reshape,
  • the corresponding bbox regression transformations $d_x(A)$, $d_y(A)$, $d_w(A)$, $d_h(A)$, i.e., rpn_bbox_pred,
  • and im_info.

In addition there is a parameter, feat_stride=16.

First, im_info. An arbitrary PxQ image is reshaped to a fixed MxN before being passed into Faster R-CNN, and im_info=[M, N, scale_factor] stores all the information about this scaling. The image then passes through Conv layers with 4 poolings, becoming WxH=(M/16)x(N/16) in size, where feat_stride=16 records that downsampling and is used to compute anchor offsets.

The Proposal layer's forward (the Caffe layer's forward function) processes things in the following order:

  1. Generate anchors and apply bbox regression to all of them (anchors here are generated exactly the same way as in training).
  2. Sort the anchors from large to small by the input positive softmax scores and take the top pre_nms_topN (e.g., 6000), i.e., extract the corrected positive anchors.
  3. Clip positive anchors that exceed the image boundary to the image boundary (this prevents proposals from going outside the image during later RoI pooling).
  4. Eliminate very small positive anchors (width < threshold or height < threshold).
  5. Apply non-maximum suppression (NMS).
  6. Sort the remaining anchors again by positive softmax score and take the top post_nms_topN (e.g., 300) results as the proposal output.

The output is then proposal=[x1, y1, x2, y2]. Note that because in step 3 we mapped the anchors back to the original image to check for boundary crossing, the proposals output here correspond to the MxN input image scale, which matters for the later network.
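A condensed, self-contained sketch of that forward order (thresholds and function names are illustrative; boxes are assumed to be already-regressed (x1, y1, x2, y2) anchors at the MxN scale):

import numpy as np

def nms(boxes, scores, thresh=0.7):
    # greedy non-maximum suppression, returns kept indices
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        order = order[1:][inter / (area_i + areas - inter) <= thresh]
    return keep

def proposal_forward(boxes, scores, im_w, im_h,
                     pre_n=6000, post_n=300, min_size=16):
    order = scores.argsort()[::-1][:pre_n]          # step 2: top pre_nms_topN by score
    boxes, scores = boxes[order], scores[order]
    boxes[:, 0::2] = boxes[:, 0::2].clip(0, im_w)   # step 3: clip x to the image
    boxes[:, 1::2] = boxes[:, 1::2].clip(0, im_h)   # clip y to the image
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    keep = (w >= min_size) & (h >= min_size)        # step 4: drop tiny boxes
    boxes, scores = boxes[keep], scores[keep]
    keep = nms(boxes, scores)[:post_n]              # steps 5-6: NMS, top post_nms_topN
    return boxes[keep]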

This concludes the introduction to the RPN network structure, summarized as:

Generate anchors -> softmax extracts positive anchors -> bbox reg regresses positive anchors -> Proposal layer generates proposals


3. RoI Pooling

The RoI Pooling layer is responsible for collecting proposals, computing proposal feature maps, and feeding them to the subsequent network. The RoI pooling layer has 2 inputs:

  • Original feature maps
  • RPN output proposal boxes (sizes vary)

Because RPN outputs proposal boxes of varying sizes, while the FC layers that follow require a fixed size, the RoI Pooling layer is used here to bring all proposal boxes to a fixed size. (Very important.)

3.1 RoI Pooling principle

The caffe prototxt of RoI Pooling from the original code:

layer {
  name: "roi_pool5"
  type: "ROIPooling"
  bottom: "conv5_3"
  bottom: "rois"
  top: "pool5"
  roi_pooling_param {
    pooled_w: 7
    pooled_h: 7
    spatial_scale: 0.0625   # 1/16
  }
}

There are two new parameters, pooled_w and pooled_h, plus another parameter, spatial_scale, whose purpose a careful reader can already guess. The RoI Pooling layer's forward process:

  • Since the proposals correspond to the MxN scale, the spatial_scale parameter first maps them back to the (M/16)x(N/16) feature map scale;
  • then each proposal's corresponding feature map region is divided into a pooled_w × pooled_h grid;
  • max pooling is applied to each cell of the grid.

After this processing, even proposals of different sizes all produce outputs of the fixed size pooled_w × pooled_h, achieving fixed-length outputs.
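A naive numpy sketch of this forward process (a simplification; the real ROIPooling layer's grid rounding differs in its details):

import numpy as np

def roi_pool(feat, roi, pooled_h=7, pooled_w=7, spatial_scale=1.0 / 16):
    # feat: (C, H, W) feature map; roi: (x1, y1, x2, y2) at the MxN image scale
    x1, y1, x2, y2 = [int(round(c * spatial_scale)) for c in roi]  # to feature scale
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)
    xs = np.linspace(x1, x2, pooled_w + 1).astype(int)   # grid cell boundaries
    ys = np.linspace(y1, y2, pooled_h + 1).astype(int)
    out = np.zeros((feat.shape[0], pooled_h, pooled_w), feat.dtype)
    for i in range(pooled_h):
        for j in range(pooled_w):
            cell = feat[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                           xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))         # max pooling per grid cell
    return out

feat = np.random.randn(512, 38, 50)
print(roi_pool(feat, (128, 96, 480, 320)).shape)   # (512, 7, 7) whatever the RoI size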

Conclusion: RoI Pooling

  1. The real purpose of this layer is to normalize the feature map and proposal inputs, keeping the input consistent for the subsequent fully connected layers.

4. Classification

The Classification part uses the obtained proposal feature maps to compute each proposal's specific category (person, car, TV, etc.) through the fully connected layers and softmax, outputting the cls_prob probability vector. At the same time, bounding box regression is applied once more to obtain each proposal's position offset bbox_pred, used to regress a more accurate detection box. The Classification part's network structure is shown in the figure below.

[Figure: Classification network structure]

After the 7×7=49-sized proposal feature maps are obtained from RoI Pooling and sent into the subsequent network, two things happen, as the figure shows:

  1. The proposals are classified through the fully connected layers and softmax; this is effectively already recognition territory.
  2. The proposals undergo bounding box regression once more, yielding higher-precision rect boxes.

Here is a brief look at the InnerProduct layers; a simple schematic looks like this:

[Figure: InnerProduct (fully connected) layer]

The calculation formula is as follows:

$y = Wx + b$

Here W and the bias b are pre-trained, i.e., of fixed size, so the input x and output y are necessarily of fixed size too. This again confirms the necessity of the earlier RoI Pooling.
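A tiny numpy illustration of why the FC layer forces a fixed input length: the shape of W is frozen after training, so x must always have exactly 512·7·7 entries, which is exactly what RoI Pooling guarantees:

import numpy as np

W = np.random.randn(4096, 512 * 7 * 7)   # weights are fixed once trained
b = np.random.randn(4096)

x = np.random.randn(512 * 7 * 7)         # RoI Pooling guarantees this exact length
y = W @ x + b                            # y = Wx + b only works for matching shapes
print(y.shape)                           # (4096,)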

5. Faster R-CNN training

Training Faster R-CNN continues from an already trained model (such as VGG_CNN_M_1024, VGG, or ZF). The actual training process is divided into 6 steps:

  1. On the pre-trained model, train the RPN network, corresponding to stage1_rpn_train.pt
  2. Collect proposals using the RPN network trained in step 1, corresponding to rpn_test.pt
  3. Train the Fast RCNN network for the first time, corresponding to stage1_fast_rcnn_train.pt
  4. Train the RPN network a second time, corresponding to stage2_rpn_train.pt
  5. Use the RPN network trained in step 4 to collect proposals again, corresponding to rpn_test.pt
  6. Train the Fast RCNN network a second time, corresponding to stage2_fast_rcnn_train.pt

You can see that the training process is like an "iterative" process that only loops twice. As to why only twice, the authors note: "A similar alternating training can be run for more iterations, but we have observed negligible improvements", i.e., more loops bring no further gains. The following sections describe the training process in the six steps above.

Here is a flowchart of the training process, which should be clearer

5.1 RPN Network Training

In this step, we first read the pre-trained model provided by RBG (VGG is used in this article) and begin iterative training. Take a look at the stage1_rpn_train.pt network structure, shown in Figure 19.

As with the detection network, Conv layers are still used to extract feature maps. The loss used by the whole network is as follows:

$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$

In the formula above, $i$ is the anchor index and $p_i$ is the positive softmax probability; $p_i^*$ is the corresponding GT label (when IoU > 0.7 between the i-th anchor and the GT, the anchor is considered positive and $p_i^* = 1$; when IoU < 0.3 it is considered negative and $p_i^* = 0$; anchors with 0.3 < IoU < 0.7 do not participate in training); $t$ is the predicted bounding box and $t^*$ is the GT box corresponding to a positive anchor. The whole loss thus splits into 2 parts:

  1. cls loss, i.e., the softmax loss computed by the rpn_cls_loss layer, used to train the network to classify anchors as positive/negative.
  2. reg loss, i.e., the smooth L1 loss computed by the rpn_loss_bbox layer, used to train the bounding box regression network. Note that this loss is multiplied by $p_i^*$, which means only the regression of positive anchors is attended to (indeed, there is no need to care about negatives in regression).

Because in practice the gap between $N_{cls}$ and $N_{reg}$ is too large, the parameter $\lambda$ is used to balance the two (e.g., with $N_{cls} = 256$ and $N_{reg} = 2400$, set $\lambda = 10$), so that both kinds of loss are weighted evenly in the total network loss. An important point here is that $L_{reg}$ uses the smooth L1 loss, computed as follows:

$L_{reg}(t_i, t_i^*) = \sum_{j \in \{x, y, w, h\}} \text{smooth}_{L1}(t_i^j - t_i^{*j}), \qquad \text{smooth}_{L1}(x) = \begin{cases} 0.5 x^2 & |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$
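A direct numpy transcription of smooth L1 (without the sigma scaling factor the actual py-faster-rcnn layer adds):

import numpy as np

def smooth_l1(x):
    # 0.5 * x^2 where |x| < 1, |x| - 0.5 elsewhere
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

t_pred = np.array([0.2, -1.5, 0.05, 3.0])
t_star = np.array([0.0, -0.5, 0.00, 0.0])
print(smooth_l1(t_pred - t_star).sum())   # reg loss for one anchor's 4 offsets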

  1. During the RPN training phase, the rpn-data layer (Python AnchorTargetLayer) generates anchors for training in exactly the same way as the Proposal layer does at test time.
  2. For rpn_loss_cls, the inputs rpn_cls_scores_reshape and rpn_labels correspond to $p$ and $p^*$ respectively, and the $N_{cls}$ parameter is implicit in the caffe blob sizes of $p$ and $p^*$.
  3. For rpn_loss_bbox, the inputs rpn_bbox_pred and rpn_bbox_targets correspond to $t$ and $t^*$ respectively; rpn_bbox_inside_weights corresponds to $p^*$ and rpn_bbox_outside_weights is unused (as can be seen in the smooth_L1_loss layer code); $N_{reg}$ is likewise implicit in the caffe blob size.

In this way the formula corresponds exactly to the code. It is important that the training and testing phases generate and store anchors in exactly the same order, so that the training results can be used at test time!

5.2 Collecting proposals through the trained RPN network

In this step, the proposal rois are obtained using the RPN network trained above, along with the positive softmax probabilities, as shown below; the results are saved in a Python pickle file. This network is essentially the same as the RPN used at test time.

[Figure: rpn_test.pt network for collecting proposals]

5.3 Training the Fast RCNN network

The previously saved pickle file is read to obtain the proposals and positive probabilities, which enter the network from the data layer. Then:

  1. The extracted proposals are passed into the network as rois, as shown in the blue box below.
  2. bbox_inside_weights + bbox_outside_weights are computed and passed into the smooth_L1_loss layer, as shown in the green box below.

This way, the final softmax recognition and the final bounding box regression can both be trained, as shown in Figure 21.

The stage 2 training that follows is much the same, so it won't be repeated here. Faster R-CNN also has an end-to-end training scheme that completes the training in one go.