Faster R-CNN Network
1. Introduce the principle of Faster R-CNN (ideally with a detailed architecture diagram)
Faster R-CNN is a two-stage method; its proposed RPN network replaces the Selective Search algorithm, so that the detection task can be completed end-to-end by a single neural network. Structurally, Faster R-CNN integrates feature extraction, region proposal generation, bounding box regression, and classification into one network, which greatly improves overall performance, especially detection speed.
2. Functions and implementation details of the RPN Network
**RPN network functions:** RPN is dedicated to extracting candidate boxes (region proposals). On the one hand, RPN takes little time; on the other hand, RPN can easily be combined with Fast R-CNN into a single whole.
Implementation details of the RPN network: the shared feature map of Faster R-CNN is processed by a sliding window to obtain a 256-dimensional feature vector at each position. Each feature vector then goes through two sibling fully connected operations: one produces 2 scores and the other produces 4 coordinates per anchor, i.e. 2k scores and 4k coordinates in total, where k refers to the k anchor boxes generated at each anchor point.
Why 2 scores? Because RPN only proposes candidate boxes, there is no need to judge the category; it only needs to distinguish object from non-object, hence two scores: a foreground (object) score and a background score. The 4 coordinates are offsets relative to the anchor box defined on the original image; keep in mind that they refer back to the original image.
The scale/aspect-ratio combinations are predetermined: there are 9 of them, so k = 9, and each sliding-window position produces a result for each of those 9 combinations. For an H × W feature map we therefore get H × W × 9 results, i.e. 18 scores and 36 coordinates per position.
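A minimal PyTorch sketch of this head (a sketch under assumptions: the 512-in/256-mid channel widths match the VGG16 backbone mentioned later; in the paper the sliding window is implemented as a 3×3 convolution):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sliding-window RPN head: a 3x3 conv followed by two sibling 1x1 convs."""
    def __init__(self, in_channels=512, mid_channels=256, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, 3, padding=1)  # sliding window
        self.cls = nn.Conv2d(mid_channels, 2 * k, 1)  # 2k objectness scores
        self.reg = nn.Conv2d(mid_channels, 4 * k, 1)  # 4k box offsets

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.cls(h), self.reg(h)

feat = torch.randn(1, 512, 38, 50)        # shared backbone feature map
scores, offsets = RPNHead()(feat)         # (1, 18, 38, 50), (1, 36, 38, 50)
```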
Write the loss function of RPN (a multi-task loss: binary classification loss + Smooth L1 regression loss; the formula is given after the labeling rules below)
When training the RPN, we assign a binary label (object or not) to each anchor.
An anchor is labeled as a positive sample in either of two situations:
1. The anchor has the highest IoU overlap with a ground-truth box
2. Or the anchor's IoU with some ground-truth box is > 0.7
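With these labels, the multi-task loss defined in the Faster R-CNN paper is

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

where $p_i$ is the predicted objectness probability of anchor $i$, $p_i^*$ is its binary label, $L_{cls}$ is log loss over the two classes, $L_{reg}$ is the Smooth L1 loss, and the factor $p_i^*$ activates the regression term only for positive anchors.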
How is the input of the regression loss in the RPN loss calculated? (Note that the regression targets are not raw coordinates and width/height, but offsets computed from them.)
$t_i$ and $t_i^*$ are the network's predicted value and the regression target, respectively.
The target $t^*$ must be prepared when training the RPN. It is computed from the ground-truth box (the true target box) and the anchor box (generated according to fixed rules), and represents the transformation between the ground-truth box and the anchor box. The RPN is trained with these targets so that it eventually learns to output a good transformation $t$, which is the transformation between the predicted box and the anchor box. From this $t$ and the anchor box, the real coordinates of the predicted box can be computed.
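Concretely, the paper parameterizes this transformation as

$$t_x^* = (x^* - x_a)/w_a,\quad t_y^* = (y^* - y_a)/h_a,\quad t_w^* = \log(w^*/w_a),\quad t_h^* = \log(h^*/h_a)$$

where $(x_a, y_a, w_a, h_a)$ are the anchor box's center coordinates, width, and height, and the starred quantities are those of the ground-truth box (the full set of formulas is repeated in question 6 below).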
How is the anchor box selected in RPN?
The point at the center of the sliding window, mapped back into the original pixel space, is called the anchor. Based on this anchor, k anchor boxes are generated (k = 9 by default in the paper: 3 scales × 3 aspect ratios, i.e. different sizes and different aspect ratios). There are three area sizes (128², 256², 512²), and for each area size three aspect ratios are taken (1:1, 1:2, 2:1), as sketched below.
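A minimal NumPy sketch of this generation (`generate_anchors` here is a hypothetical helper; the exact rounding used in the official implementation differs slightly):

```python
import numpy as np

def generate_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate the k = len(scales) * len(ratios) base anchors, centered
    at the origin, as (x1, y1, x2, y2)."""
    anchors = []
    for s in scales:             # target area is s**2
        for r in ratios:         # r = h / w
            w = s / np.sqrt(r)   # so that w * h = s**2 with h = r * w
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

base = generate_anchors()        # shape (9, 4); in the full pipeline these
print(np.round(base[:3]))        # are shifted to every sliding-window center
```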
Why is anchor Box proposed?
There are two main reasons: a single window can only detect one target, and it cannot solve the multi-scale problem. Anchors let each position predict multiple boxes of different sizes and shapes.
At present, there are three main ways to select anchor box sizes: manual selection based on experience, k-means clustering, and learning them as hyperparameters.
Why different sizes and different aspect ratios? To obtain a larger intersection-over-union (IoU) with objects of varying shapes and scales.
3. Explain how RoI Pooling works. What are its defects, and what is it for?
RoI Pooling maps rectangular boxes of different sizes into feature rectangles of a fixed size (w × h).
Specific operations: (1) map the RoI onto the corresponding position of the feature map according to the input image; (2) divide the mapped region into sections of equal size (the number of sections matches the output dimensions); (3) perform max pooling on each section.
In this way, fixed-size feature maps can be obtained from boxes of different sizes. Notably, the size of the output feature map depends on neither the RoI size nor the convolutional feature map size. The greatest benefit of RoI Pooling is that it greatly speeds up processing. (During pooling, one computes the range of the feature map covered by each pooled output element, then takes the max or average over that range.)
**Advantages:** 1. Reuses the feature map computed by the CNN; 2. significantly accelerates training and testing; 3. allows the target detection system to be trained end-to-end.
**Disadvantages:** Because RoI Pooling quantizes coordinates (effectively nearest-neighbor rounding, like INTER_NEAREST in a resize), coordinates that do not land exactly on integers after scaling are crudely rounded, which is equivalent to snapping to the nearest point and thus loses some spatial accuracy.
Two quantizations occur: 1. The xywh of the region proposal is usually fractional, but it is rounded to integers for convenience of computation. 2. The rounded region is evenly divided into k × k units, and each unit's boundary is rounded again. After these two roundings, the candidate box has drifted from its originally regressed position, which hurts detection or segmentation accuracy.
How is the mapping done? The rule is simple: divide each coordinate by the ratio of the input image size to the feature map size (the stride). For example, with a stride of 16, an RoI corner at x = 245 on the input image maps to 245 / 16 ≈ 15.3 on the feature map, which RoI Pooling rounds to 15.
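For reference, torchvision ships this operator; a quick usage sketch, assuming a stride-16 backbone feature map:

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 256, 38, 50)             # backbone feature map, stride 16
# boxes in (batch_index, x1, y1, x2, y2) format, in input-image coordinates
rois = torch.tensor([[0, 64., 32., 448., 288.]])
out = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(out.shape)                               # torch.Size([1, 256, 7, 7])
```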
Differences between RoI Pooling and RoI Align (Mask R-CNN)
RoI Align: the idea is very simple: cancel the quantization and use bilinear interpolation to obtain image values at floating-point pixel positions, turning the whole feature aggregation into a continuous operation. 1. Iterate over each candidate region, keeping the floating-point boundaries unquantized. 2. Divide the candidate region into k × k units without quantizing the unit boundaries. 3. Compute four fixed sampling positions in each unit, evaluate them by bilinear interpolation, and then apply max pooling.
Difference: RoI Align removes the integer-rounding quantization and uses bilinear interpolation to determine feature map values at the (fractional) positions corresponding to pixels in the original image. RoI Align thus solves the misalignment problem caused by the two quantizations in RoI Pooling.
For detecting large objects in the image, there is little difference between the two schemes. If many small objects need to be detected, RoI Align should be preferred for accuracy (a usage sketch follows).
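The corresponding torchvision operator (same assumed shapes as the RoI Pooling sketch above); note that torchvision averages the bilinear samples in each bin rather than max pooling them:

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 38, 50)
rois = torch.tensor([[0, 64., 32., 448., 288.]])
# sampling_ratio=2 -> 2x2 bilinear samples per output bin, then averaged;
# aligned=True applies a half-pixel offset correction to the box coordinates
out = roi_align(feat, rois, output_size=(7, 7), spatial_scale=1 / 16,
                sampling_ratio=2, aligned=True)
print(out.shape)   # torch.Size([1, 256, 7, 7])
```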
A concrete method for calculating pixel values by bilinear interpolation in RoI Align
In mathematics, bilinear interpolation is an extension of linear interpolation to functions of two variables; its core idea is to perform linear interpolation in each of the two directions in turn.
Suppose we want the value of an unknown function f at P = (x, y), and we know the values of f at Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), and Q22 = (x2, y2); in the most common case, f is the pixel value at a pixel. First, perform linear interpolation in the x direction:

$$f(x, y_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}), \qquad f(x, y_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22})$$

Then perform linear interpolation in the y direction:

$$f(x, y) \approx \frac{y_2 - y}{y_2 - y_1} f(x, y_1) + \frac{y - y_1}{y_2 - y_1} f(x, y_2)$$

Expanding gives the result of bilinear interpolation:

$$f(x, y) \approx \frac{f(Q_{11})(x_2 - x)(y_2 - y) + f(Q_{21})(x - x_1)(y_2 - y) + f(Q_{12})(x_2 - x)(y - y_1) + f(Q_{22})(x - x_1)(y - y_1)}{(x_2 - x_1)(y_2 - y_1)}$$
Since image bilinear interpolation uses only the 4 adjacent integer points, $x_2 - x_1 = y_2 - y_1 = 1$, so the denominator of the formula above is 1.
The feature value of each sampling point is computed from the pixel values of its four adjacent integer feature points by bilinear interpolation (as sketched below).
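A minimal NumPy sketch of this sampling (`bilinear_sample` is a hypothetical helper; edges are clamped):

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample a 2-D array at a float coordinate (x, y) by bilinear
    interpolation; assumes 0 <= x <= W-1 and 0 <= y <= H-1."""
    x1, y1 = int(np.floor(x)), int(np.floor(y))
    x2 = min(x1 + 1, img.shape[1] - 1)
    y2 = min(y1 + 1, img.shape[0] - 1)
    dx, dy = x - x1, y - y1
    return (img[y1, x1] * (1 - dx) * (1 - dy) +
            img[y1, x2] * dx * (1 - dy) +
            img[y2, x1] * (1 - dx) * dy +
            img[y2, x2] * dx * dy)

img = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(img, 1.5, 2.5))   # 11.5, the mean of the 4 neighbors
```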
Nearest neighbor interpolation: Assigning the value of the nearest pixel in the original image to the new pixel
4. Non-maximum suppression (NMS)
Purpose: essentially, to search for local maxima and suppress non-maximum elements.
Principle: NMS (non-maximum suppression) is used to suppress redundant boxes during detection.
The general algorithm is as follows: 1. Sort all predicted boxes by confidence in descending order; 2. Select the box with the highest confidence, mark it as a correct prediction, and compute its IoU with all other boxes; 3. Delete every box whose IoU computed in step 2 exceeds the threshold; 4. Return to step 1 with the remaining boxes until none remain.
(Note that non-maximum suppression is performed one category at a time; with N categories, NMS is run N times.)
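A compact NumPy sketch of exactly these steps (`nms` is a hypothetical helper, operating on one class at a time):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Pure-NumPy NMS. boxes: (N, 4) as x1, y1, x2, y2; returns kept indices."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # step 1: sort by confidence
    keep = []
    while order.size > 0:
        i = order[0]                          # step 2: highest-scoring box
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # step 3: drop overlapping boxes
    return keep                               # step 4: loop until empty
```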
Suppose two targets are very close to each other and are identified as one bbox. What is the problem and how can it be solved?
When two targets are very close, the box with lower confidence is suppressed by the box with higher confidence, so the two targets end up identified as one bbox. To solve this, one can use Soft-NMS (basic idea: instead of setting the score of an overlapping box to zero, decay it to a slightly lower score).
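A sketch of the Gaussian rescoring rule from the Soft-NMS paper; `sigma` is a tunable parameter, and in a full implementation this rescoring replaces the deletion step in the NMS loop above:

```python
import numpy as np

def soft_nms_rescore(scores, iou, sigma=0.5):
    """Gaussian Soft-NMS: decay the scores of boxes overlapping the selected
    box smoothly, instead of deleting them outright."""
    return scores * np.exp(-(iou ** 2) / sigma)

# e.g. a competing box with IoU 0.8 keeps exp(-0.64 / 0.5) ~ 28% of its score
print(soft_nms_rescore(np.array([0.9]), np.array([0.8])))
```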
5. How does Faster R-CNN solve the problem of unbalanced positive and negative samples?
Limit the ratio of positive to negative samples in a minibatch to 1:1; if there are not enough positive samples, pad with negative samples. This exact method is rarely used in later research. In general, class imbalance can be addressed by adjusting sample counts or re-weighting the loss; commonly used methods include OHEM, OHNM, class-balanced loss, and Focal Loss.
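For reference, Focal Loss reshapes the cross-entropy so that easy examples contribute little:

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

where $p_t$ is the predicted probability of the true class, $\gamma$ (typically 2) down-weights well-classified examples, and $\alpha_t$ balances the classes.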
How does Faster RCNN select positive and negative anchors?
We assign a positive label to two types of anchors: (i) anchors with the highest IoU overlap with a ground-truth box;
(ii) anchors whose IoU overlap with a ground-truth box exceeds 0.7. We assign a negative label to non-positive anchors whose IoU with every ground-truth box is below 0.3.
6. What formula is used for bbox regression in Faster R-CNN, and how does the network regress the bbox?
Let $x, y, w, h$ be the center coordinates, width, and height of a bbox, and let $(x, y, w, h)$, $(x_a, y_a, w_a, h_a)$, and $(x^*, y^*, w^*, h^*)$ refer to the predicted box, the anchor box, and the real (ground-truth) box respectively. The regression variables are

$$t_x = (x - x_a)/w_a,\quad t_y = (y - y_a)/h_a,\quad t_w = \log(w/w_a),\quad t_h = \log(h/h_a)$$

$$t_x^* = (x^* - x_a)/w_a,\quad t_y^* = (y^* - y_a)/h_a,\quad t_w^* = \log(w^*/w_a),\quad t_h^* = \log(h^*/h_a)$$

The first line gives the predicted box's offsets and scales relative to the anchor, and the second gives the real box's offsets and scales relative to the anchor. The purpose of the regression is clear: make $t$ as close to $t^*$ as possible. The regression loss uses the Smooth L1 function defined in Fast R-CNN, which is less sensitive to outliers:

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
The network weights $W$ are optimized under this loss so that, at test time, good offsets and scales can be computed from $W$; using these offsets and scales, the initially predicted bbox is fine-tuned into a better prediction.
Why Bounding-box Regression?
Bounding-box regression fine-tunes the candidate region/box so that the fine-tuned box is closer to the ground truth.
7. Summarize the forward computation process and the training steps of Faster RCNN
Input an image to be detected -> the conv layers of the VGG16 network extract features of the whole image and output a feature map, which is fed to the RPN and the Fast R-CNN network respectively -> the RPN produces region proposals and sends these candidate boxes to the front of the Fast R-CNN network -> using the previously computed feature map, the features inside each candidate box are pooled into a fixed-size feature map by the RoI Pooling layer -> these feature maps are sent through the Fast R-CNN network for classification and coordinate regression, finally yielding the coordinates of the objects to be detected.
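For comparison, this whole pipeline is available off the shelf in torchvision (with a ResNet50-FPN backbone instead of VGG16; the sketch assumes torchvision ≥ 0.13 for the `weights` argument):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # COCO-pretrained
model.eval()

img = torch.rand(3, 480, 640)          # dummy RGB image with values in [0, 1]
with torch.no_grad():
    pred = model([img])[0]             # backbone -> RPN -> RoI heads
print(pred["boxes"].shape, pred["labels"].shape, pred["scores"].shape)
```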
Describe the training steps for Faster RCNN
The first step is to train the RPN, initialized with an ImageNet pre-trained model and fine-tuned end-to-end for the region proposal task.
The second step is to train Fast R-CNN as a separate detection network, also initialized from an ImageNet pre-trained model, using the region proposals generated by the RPN in step 1 as its input. At this point the two networks do not share convolutional layers.
The third step is to tune the RPN: re-initialize it with the Fast R-CNN model from step 2 and train again, keeping the shared convolutional layers fixed and fine-tuning only the RPN-specific layers. The two networks now share convolutional layers.
The fourth step is to tune Fast R-CNN: initialize it from the step-3 model, feed it the proposals generated in step 3, keep the shared convolutional layers fixed, and fine-tune only Fast R-CNN's fully connected layers. The two networks then share the same convolutional layers and form a unified network.
8. Is there any disadvantage to Faster RCNN? How to improve?
Improvements: 1. a better feature network, e.g. ResNet; 2. a more accurate RPN, e.g. designing the RPN with an FPN architecture; 3. a better RoI classification method, e.g. performing RoI pooling on conv4 and conv5 separately and classifying after combining them, which adds essentially no computation while exploiting the higher-resolution conv4; 4. replacing NMS with Soft-NMS.
Also worth comparing: how region proposals are extracted across the R-CNN series versus Faster R-CNN's key improvement, the RPN.
Reference
Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
Reference links
zhuanlan.zhihu.com/p/137735486
zhuanlan.zhihu.com/p/137736076
Blog.csdn.net/wshdkf/arti…