Object detection, as I understand it, means accurately finding the location of an object in a given image and labeling its category. Object detection answers where an object is and what it is. This problem is not easy to solve: objects vary in size, angle, and pose, can appear anywhere in the image, and there can be many categories of objects.
Evolution of object detection technology:
R-CNN -> SPP-Net -> Fast R-CNN -> Faster R-CNN
Let's start with the task of image recognition.
Here is an image task:
Identify the object in the picture and box its position.
In professional terms, the above task is: image recognition + localization.
Image recognition:
Input: image
Output: category of the object
Evaluation metric: accuracy
Localization:
Input: image
Output: position of the box in the image, (x, y, w, h)
Evaluation metric: Intersection over Union (IoU) between the predicted and ground-truth boxes (a minimal sketch follows)
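For concreteness, here is a minimal sketch of the IoU metric, assuming boxes are given as (x, y, w, h) with (x, y) the top-left corner:

```python
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Coordinates of the intersection rectangle.
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 10, 10)))  # ~0.143
```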
A convolutional neural network (CNN) has already solved the image recognition task for us (deciding whether it is a cat or a dog); we just need to add some extra functionality to complete the localization task.
What are the solutions to the problem of positioning?
Idea 1: Think of it as a regression problem
As a regression problem, we need to predict the values of the four parameters (x, y, w, h) to determine the position of the box.
Step 1:
• Solve the simpler problem first: build a neural network to recognize images
• Fine-tune on AlexNet, VGG, or GoogLeNet
Step 2:
• Extend the tail of the network above: the convolutional front of the CNN stays unchanged, and we improve the end of the CNN by adding two heads, a "classification head" and a "regression head" (see the sketch after this list)
• The model becomes classification + regression
Step 3:
• Train the regression part with a Euclidean (L2) distance loss
• Train with SGD
Step 4:
• In the prediction phase, attach both heads
• Each head performs its own function
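Here is a minimal PyTorch sketch of this two-head design. The backbone layers and sizes are illustrative assumptions, not the exact architecture of any of the networks above:

```python
import torch
import torch.nn as nn

class ClassifyAndLocalize(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        # Shared convolutional front (stand-in for AlexNet/VGG features).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((6, 6)),
            nn.Flatten(),
        )
        self.cls_head = nn.Linear(128 * 6 * 6, num_classes)  # "classification head"
        self.reg_head = nn.Linear(128 * 6 * 6, 4)            # "regression head": (x, y, w, h)

    def forward(self, x):
        feats = self.backbone(x)
        return self.cls_head(feats), self.reg_head(feats)

model = ClassifyAndLocalize()
scores, box = model(torch.randn(1, 3, 224, 224))
# Train `scores` with cross-entropy and `box` with an L2 (Euclidean) loss.
```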
Two fine-tuning passes are needed here.
The first is done on AlexNet. The second keeps the front unchanged, swaps the head for a regression head, and fine-tunes again.
Where do we add the regression part?
There are two ways to deal with it:
• Add after the last convolutional layer (e.g. VGG)
• Add after the last fully connected layer (e.g. R-CNN)
Regression is hard to train well, so we try to convert it into a classification problem.
The regression parameters take much longer to converge, so the network above uses the classification network to learn the connection weights of the shared part first.
Idea 2: sliding image windows
• Still classification + regression
• Take "boxes" of different sizes
• Slide each box to different positions and get a score for the box at each position
• Keep the box with the highest score
Black box in the top-left corner: score 0.5
Black box in the top-right corner: score 0.75
Black box in the bottom-left corner: score 0.6
Black box in the bottom-right corner: score 0.8
Based on the scores, we choose the black box in the bottom-right corner as the prediction of the target position.
Note: sometimes the two highest-scoring boxes are both selected, and the intersection of the two boxes is taken as the final position prediction.
Question: how big should the box be?
Take boxes of different sizes and sweep each from the top left to the bottom right. Very crude.
To sum up the idea:
For an image, use windows of various sizes to crop out patches (traversing the whole image) and feed each patch into the CNN, which outputs the patch's classification plus the corresponding (x, y, w, h) of the box (regression). A toy sketch of this brute-force procedure follows.
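The sketch below assumes a hypothetical `score_crop` function standing in for the CNN's confidence on one crop:

```python
# Brute-force sliding windows: try several window sizes at a fixed stride,
# score every crop, and keep the best one.
def best_window(image_w, image_h, score_crop,
                sizes=((64, 64), (128, 128), (256, 256)), stride=32):
    best_box, best_score = None, float("-inf")
    for w, h in sizes:
        for y in range(0, image_h - h + 1, stride):
            for x in range(0, image_w - w + 1, stride):
                score = score_crop(x, y, w, h)  # CNN confidence for this crop
                if score > best_score:
                    best_box, best_score = (x, y, w, h), score
    return best_box, best_score
```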
This method takes too long, so let's optimize it.
The original network looked like this:
Optimize it like this: replace the fully connected layers with convolutional layers, which speeds things up.
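One way to see this trick: a fully connected layer over a fixed-size feature map is equivalent to a convolution whose kernel covers that whole map, and the same weights can then slide over a larger input in one pass. A sketch with assumed, illustrative shapes:

```python
import torch
import torch.nn as nn

fc = nn.Linear(256 * 6 * 6, 4096)           # original FC layer over a 6x6 feature map
conv = nn.Conv2d(256, 4096, kernel_size=6)  # equivalent convolution

# Copy the FC weights into the conv kernel: same parameters, new layout.
conv.weight.data = fc.weight.data.view(4096, 256, 6, 6)
conv.bias.data = fc.bias.data

x = torch.randn(1, 256, 10, 10)  # a feature map larger than 6x6
out = conv(x)                    # shape (1, 4096, 5, 5): one "FC output" per window position
```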
Object Detection
What happens when the image contains many objects? Things get much harder.
The task becomes: multi-object recognition + localization of multiple objects
So can we treat this task as a classification problem?
What's wrong with classification?
• You need to try lots of positions with lots of boxes of different sizes
• You still need to classify the image inside each box
• Of course, if your GPU is powerful enough, well, go for it…
Treating it as classification, is there any way to optimize? I don't want to try so many boxes at so many positions!
Someone came up with a good idea:
Find boxes that are likely to contain objects (i.e. candidate boxes, say 1000 of them). The candidate boxes may overlap and contain one another, so we avoid brute-force enumeration of all possible boxes.
There are many ways to select candidate boxes, such as EdgeBoxes and Selective Search.
Below is a performance comparison of various methods for selecting candidate boxes.
One big question remains: how exactly does the candidate-box algorithm Selective Search pick its boxes? For that you would have to take a good look at its paper, which I won't cover here.
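For the curious, candidate boxes can be generated in practice with OpenCV's Selective Search implementation (this requires the opencv-contrib-python package; the image path below is a placeholder):

```python
import cv2

img = cv2.imread("example.jpg")  # placeholder input image path
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()  # the "fast" mode trades recall for speed
rects = ss.process()              # proposals as (x, y, w, h)
print(len(rects), "candidate boxes; typically only the first ~2000 are kept")
```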
R-CNN bursts onto the scene
Based on the ideas above, R-CNN emerged.
Step 1: Train (or download) a classification model (such as AlexNet)
Step 2: Fine-tune the model
• Change the number of categories from 1000 to 20
• Remove the last fully connected layer
Step 3: Feature extraction
• Extract all candidate boxes of the image (Selective Search)
• For each region: warp the region to the CNN's input size, run a forward pass, and save the output of the fifth pooling layer (that is, the features extracted for that candidate box) to disk
Step 4: Train an SVM classifier (binary classification) to judge the category of the object in each candidate box
Each category gets its own SVM, which judges whether a region belongs to that category: positive or negative.
For example, the picture below shows the SVM for dog classification
Step 5: Use regressors to refine the positions of the candidate boxes: for each class, train a linear regression model to judge whether the box frames the object well.
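As a rough illustration of step 4, here is a sketch of per-class binary SVMs trained on the cached pool5 features. The feature dimension, regularization constant, and random data are all stand-in assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

num_classes, feat_dim = 20, 9216              # 9216 = 256 * 6 * 6 pool5 features (AlexNet)
features = np.random.randn(1000, feat_dim)    # stand-in for features cached on disk
labels = np.random.randint(0, num_classes + 1, size=1000)  # 0 = background, 1..20 = classes

# One binary SVM per class: regions of class c are positives, everything else negative.
svms = [LinearSVC(C=0.001).fit(features, (labels == c).astype(int))
        for c in range(1, num_classes + 1)]

# At test time, every region's feature vector is scored by all 20 SVMs.
scores = np.stack([svm.decision_function(features[:5]) for svm in svms], axis=1)
print(scores.shape)  # (5, 20)
```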
The ideas in SPP Net contributed a great deal to the evolution of R-CNN, so here is a brief introduction to SPP Net.
SPP Net
SPP: Spatial Pyramid Pooling
It has two characteristics:
1. It enables multi-scale input to CNNs by incorporating the spatial pyramid method.
Normally a CNN is followed by fully connected layers or a classifier, which require a fixed input size. Input images therefore have to be cropped or warped, causing information loss or geometric distortion. The first contribution of SPP Net is to bring the pyramid idea into CNNs, enabling multi-scale input.
As shown in the figure below, the SPP layer is inserted between the convolutional layers and the fully connected layers. The network input can then be of any scale: each pooling filter in the SPP layer adapts to the input size, while the output size of the SPP layer is always fixed.
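A minimal PyTorch sketch of such a layer, assuming the common {4x4, 2x2, 1x1} pyramid configuration:

```python
import torch
import torch.nn.functional as F

def spp(feature_map, levels=(4, 2, 1)):
    # Pool the map onto fixed grids of several sizes and concatenate,
    # so any input spatial size yields the same output length.
    n = feature_map.shape[0]
    pooled = [F.adaptive_max_pool2d(feature_map, output_size=k).view(n, -1)
              for k in levels]
    return torch.cat(pooled, dim=1)  # length = channels * (16 + 4 + 1)

# Different input sizes, identical output dimension:
print(spp(torch.randn(1, 256, 13, 13)).shape)  # torch.Size([1, 5376])
print(spp(torch.randn(1, 256, 20, 27)).shape)  # torch.Size([1, 5376])
```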
2. Convolutional features are extracted from the original image only once.
In R-CNN, each candidate box is first resized to a uniform size and then fed through the CNN separately, which is inefficient.
SPP Net optimizes this by convolving the original image only once to get the feature map of the whole image, then finding each candidate box's mapped patch on that feature map and feeding the patch into the SPP layer and subsequent layers as the box's convolutional feature. This saves a great deal of computation and is roughly a hundred times faster than R-CNN.
Fast R-CNN
SPP Net is a good method. Fast R-CNN, the upgraded version of R-CNN, adopts the SPP Net approach on top of R-CNN and improves it, pushing performance further.
What are the differences between R-CNN and Fast R-CNN?
First, the drawbacks of R-CNN: even with pre-processing steps like Selective Search to extract potential bounding boxes as input, R-CNN still has a serious speed bottleneck. The reason is obvious: features are computed repeatedly because every region is processed independently. Fast R-CNN was born to solve this problem.
The authors propose a new network layer called ROI Pooling, which can be viewed as a single-level SPP Net. This layer maps inputs of different sizes to a fixed-length feature vector. As we know, operations such as conv, pooling, and ReLU do not require a fixed input size; but after running them on images of different sizes, the resulting feature maps also differ in size and cannot be fed directly into a fully connected layer for classification. The ROI Pooling layer solves this by extracting a fixed-dimensional feature representation for each region, after which an ordinary softmax performs the category classification.
In addition, the earlier R-CNN pipeline was: generate proposals first, then extract features with a CNN, then classify with SVMs, and finally do bbox regression. In Fast R-CNN the authors cleverly moved the bbox regression inside the neural network and merged it with region classification into a multi-task model. Experiments confirm that the two tasks can share convolutional features and reinforce each other.
An important contribution of Fast R-CNN is showing that the Region Proposal + CNN framework offers real hope of real-time detection: multi-class detection can gain processing speed while keeping accuracy, which laid the groundwork for Faster R-CNN.
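A hedged sketch of ROI Pooling using torchvision's built-in op; the feature-map shape, the boxes, and the 1/16 stride are illustrative assumptions:

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 38, 50)  # conv features of one whole image
# Each row: (batch_index, x1, y1, x2, y2) in original image coordinates.
boxes = torch.tensor([[0., 64., 64., 320., 256.],
                      [0., 128., 32., 400., 300.]])
# spatial_scale maps image coordinates onto the feature map (stride 16 assumed).
pooled = roi_pool(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -> ready for the FC layers
```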
Key points:
R-CNN has some serious flaws (remove them all and you get Fast R-CNN).
Big drawback: each candidate box has to go through the CNN on its own, which takes a lot of time.
Solution: share the convolutional layers. Instead of feeding each candidate box into the CNN, feed in the complete image once and obtain the features of each candidate box at the fifth convolutional layer.
Original method: many candidate boxes (say two thousand) -> CNN -> feature of each candidate box -> classification + regression
New method: one complete image -> CNN -> features of each candidate box -> classification + regression
It is therefore easy to see why Fast R-CNN is faster than R-CNN: instead of feeding each candidate region through the deep network separately, it extracts features from the whole image once and maps the candidate boxes onto conv5. As with SPP, the features only need to be computed once, and everything afterwards operates on the conv5 layer.
The performance improvements are also significant:
Faster R-CNN
Problems with Fast R-CNN: there is still a bottleneck: Selective Search has to find all the candidate boxes, which is also very time-consuming. Can we find a more efficient way to obtain these candidate boxes?
Solution: add a neural network to do the extraction; in other words, the work of finding candidate boxes is also handed over to a neural network.
The neural network that does this is called the Region Proposal Network (RPN).
Specifically:
• Place the RPN after the last convolutional layer
• Train the RPN directly to produce candidate regions
A brief introduction to the RPN (an anchor-generation sketch follows this list):
• Slide a window over the feature map
• Build a small neural network for object classification + box position regression
• The position of the sliding window provides information about the rough position of the object
• The box regression provides a more accurate position of the box
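A toy sketch of how anchors are laid out at every sliding-window position; the stride, scales, and ratios follow the commonly cited 3 scales x 3 aspect ratios configuration:

```python
def generate_anchors(feat_w, feat_h, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for fy in range(feat_h):
        for fx in range(feat_w):
            # Center of this sliding-window position in image coordinates.
            cx, cy = fx * stride + stride // 2, fy * stride + stride // 2
            for s in scales:
                for r in ratios:
                    w, h = s * r ** 0.5, s / r ** 0.5  # area s^2, aspect ratio r
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors  # feat_w * feat_h * 9 boxes as (x1, y1, x2, y2)

print(len(generate_anchors(50, 38)))  # 50 * 38 * 9 = 17100
```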
One network, four loss functions:
• RPN classification (is an anchor foreground or background?)
• RPN regression (anchor -> proposal)
• Fast R-CNN classification (over the object classes)
• Fast R-CNN regression (proposal -> final box)
Speed comparison
The main contribution of Faster R-CNN is the design of the RPN for extracting candidate regions, which replaces the time-consuming Selective Search and greatly improves detection speed.
Finally, summarize the steps of each algorithm:
R-CNN
1. Find about 1000-2000 candidate boxes in the image (using Selective Search)
2. Scale the image patch inside each candidate box to the same size and feed it into the CNN for feature extraction
3. Use classifiers to decide whether the features extracted from a candidate box belong to a specific class
4. For a candidate box that belongs to some class, further refine its position with a regressor
Fast R-CNN
1. Find about 1000-2000 candidate boxes in the image (using Selective Search)
2. Feed the whole image into the CNN to obtain the feature map
3. Find each candidate box's mapped patch on the feature map and feed that patch into the SPP layer and subsequent layers as the box's convolutional feature
4. Use classifiers to decide whether the features extracted from a candidate box belong to a specific class
5. For a candidate box that belongs to some class, further refine its position with a regressor
Faster R-CNN
1. Feed the whole image into the CNN to obtain the feature map
2. Feed the convolutional features into the RPN to obtain the candidate boxes' feature information
3. Use classifiers to decide whether the features extracted from a candidate box belong to a specific class
4. For a candidate box that belongs to some class, further refine its position with a regressor
Overall, from R-CNN through SPP-Net and Fast R-CNN to Faster R-CNN, the deep-learning-based object detection pipeline has become increasingly streamlined, more accurate, and faster. It is fair to say that the region-proposal-based R-CNN family is the most important branch in the field of object detection today.