Analysis of the evolution of object detection technology in deep learning
Object detection refers to precisely locating an object in a given image and classifying it; in other words, it answers both where the object is and what it is. This problem is not so easy to solve: objects can vary in size, angle, and posture, can appear anywhere in the image, and there can be multiple objects of multiple categories.
Evolution of object detection technology: R-CNN -> SPP-Net -> Fast R-CNN -> Faster R-CNN
Let’s start with the task of image recognition. Here is the task: identify the object in the picture and also box its position.
In professional terms, the task above is: image recognition + localization. Image recognition (classification): input: image; output: object category; evaluation method: accuracy.
Localization: input: image; output: location of the box in the image, (x, y, w, h); evaluation method: the detection evaluation function intersection-over-union (IoU).
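IoU is simple enough to write out: the area of the intersection of the two boxes divided by the area of their union. A minimal sketch in Python, assuming boxes are given as (x, y, w, h) with (x, y) the top-left corner:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    # Convert to corner coordinates.
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    # Intersection rectangle (clamped to zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 10, 10)))  # 25 / 175 ≈ 0.143
```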
The convolutional neural network (CNN) has already helped us complete the image recognition task (deciding whether the image shows a cat or a dog); we just need to add some extra functionality to complete the localization task.
What are the possible solutions to the localization problem? Treat it as a regression problem: we predict the values of the four parameters (x, y, w, h) to find the position of the box.
Step 1:
• Solve the simple problem first: build a neural network to recognize the image
• Fine-tune on AlexNet, VGG, or GoogLeNet
Step 2: • At the end of the neural network described above (i.e., the front of the CNN stays unchanged; we modify the tail of the CNN by adding two heads: a “classification head” and a “regression head”) • The model becomes classification + regression mode
Step 3: • Train the regression head with a Euclidean-distance loss • Train with SGD
Step 4: • In the prediction stage, attach both heads • Each head completes its own function
Here, two fine-tuning operations are required. The first is performed on AlexNet; the second changes the head to a regression head, leaving the front unchanged.
Where do I add the Regression part?
There are two processing methods: • add it after the last convolutional layer (e.g. VGG) • add it after the last fully connected layer (e.g. R-CNN)
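To make the two-head design concrete, here is a minimal PyTorch sketch. The backbone is a toy stand-in for AlexNet/VGG and all layer sizes are illustrative, not the actual architecture; the joint loss mirrors Steps 3 and 4 above (cross-entropy for classification plus a Euclidean/MSE loss for regression, trained with SGD):

```python
import torch
import torch.nn as nn

class ClassifyAndLocate(nn.Module):
    """Shared backbone followed by a classification head and a regression head."""
    def __init__(self, num_classes=20, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for a pretrained CNN
            nn.Conv2d(3, feat_dim, 3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.cls_head = nn.Linear(feat_dim, num_classes)  # "classification head"
        self.reg_head = nn.Linear(feat_dim, 4)            # "regression head": (x, y, w, h)

    def forward(self, images):
        feats = self.backbone(images)
        return self.cls_head(feats), self.reg_head(feats)

model = ClassifyAndLocate()
scores, boxes = model(torch.randn(2, 3, 224, 224))
cls_loss = nn.CrossEntropyLoss()(scores, torch.tensor([3, 7]))
reg_loss = nn.MSELoss()(boxes, torch.rand(2, 4))  # Euclidean-distance loss
loss = cls_loss + reg_loss                        # optimize jointly with SGD
```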
Regression is too difficult to train directly, so we try to convert it into a classification problem. Regression parameters take much longer to converge, so the network above uses the classification network to compute the connection weights of the shared part of the network.
Take “boxes” of different sizes, let each box appear at different locations, and compute a decision score for the box at each location; keep the box with the highest score. (The score can be obtained by comparing the predicted box against the ground-truth annotation and computing the error.)
Black box in top left corner: score 0.5
Black box in the upper right corner: score 0.75
Black box in lower left corner: score 0.6
Black box in lower right corner: score 0.8
Based on the score, we choose the black box in the lower right corner as the prediction of the target position. Note: sometimes the two boxes with the highest scores are both selected, and the intersection of the two boxes is taken as the final position prediction.
Question: how big should the box be? Take boxes of different sizes and sweep them from the top left to the bottom right. Very crude.
To summarize the idea: for an image, use frames of various sizes to traverse the whole image, crop out each frame, and feed it to the CNN, which outputs the classification of this frame along with the corresponding (x, y, w, h) of the framed image (regression). Among all the crops, the one with the highest score for some class gives the final (x, y, w, h).
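The brute-force traversal can be written out directly. A sketch, assuming a hypothetical `model` like the two-head network above that takes a fixed 224x224 crop and returns (class scores, box regression):

```python
import torch
import torch.nn.functional as F

def sliding_window_detect(image, model, win_sizes=(64, 128), stride=32):
    """Crop windows of several sizes at every position, score each crop
    with the CNN, and keep the best one. `model` is assumed (hypothetical)
    to return (class_scores, box_offsets) for a 224x224 input."""
    _, H, W = image.shape                        # image: (3, H, W)
    best = None
    for win in win_sizes:
        for y in range(0, H - win + 1, stride):
            for x in range(0, W - win + 1, stride):
                crop = image[:, y:y + win, x:x + win].unsqueeze(0)
                crop = F.interpolate(crop, size=(224, 224))  # fit the CNN input
                scores, offsets = model(crop)
                score = scores.softmax(dim=1).max().item()
                if best is None or score > best[0]:
                    best = (score, (x, y, win, win), offsets)
    return best   # (best score, window (x, y, w, h), regressed offsets)
```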
This method takes too long; let’s make an optimization. The original network looked like this:
Optimize it like this: change the fully connected layers into convolutional layers, which speeds things up.
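Why does this help? A fully connected layer over a feature map is exactly a convolution whose kernel covers the whole map, so the trained weights can be reused as-is; on a larger input, the convolutional version then evaluates every window position in a single pass. A sketch with illustrative VGG-style sizes:

```python
import torch
import torch.nn as nn

# A fully connected layer that expects a flattened 512x7x7 feature map:
fc = nn.Linear(512 * 7 * 7, 4096)

# The equivalent convolution: 4096 filters of size 7x7 over 512 channels.
conv = nn.Conv2d(512, 4096, kernel_size=7)
conv.weight.data = fc.weight.data.view(4096, 512, 7, 7)
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 7, 7)
out_fc = fc(x.flatten(1))        # shape (1, 4096)
out_conv = conv(x)               # shape (1, 4096, 1, 1)
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-5))  # True

# On a larger input the conv version slides "for free", producing a grid of
# outputs instead of one; this is what makes dense window evaluation fast.
big = torch.randn(1, 512, 9, 9)
print(conv(big).shape)           # torch.Size([1, 4096, 3, 3])
```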
What happens when there are many objects in the image? It’s getting a whole lot harder.
The task becomes: multiple object recognition + locate multiple objects
So can we still treat this task as a classification problem?
What’s wrong with treating it as classification? • You need to try many positions and many boxes of different sizes • You also need to classify the images inside the boxes • Of course, if your GPU is powerful enough, well, go ahead and do it…
Treating it as classification, is there any way to optimize? We don’t want to try so many boxes at so many positions! Someone came up with a good idea: first find the boxes that might contain objects (i.e. candidate boxes, say 1000 of them), which may overlap and contain each other, so that we avoid brute-force enumeration of all boxes.
There are many ways to select candidate boxes, such as EdgeBoxes and Selective Search.
The following is a performance comparison of the various methods for selecting candidate boxes.
One big question remains: how exactly does Selective Search, the algorithm used to extract candidate boxes, choose them? For that you would have to take a good look at its paper; I won’t introduce it here.
With that idea in mind, R-CNN was born.
Step 1: Train (or download) a classification model (such as AlexNet)
Step 2: Fine-tune the model
• Change the number of categories from 1000 to 20
• Remove the last fully connected layer
Step 3: Feature extraction
• Extract all candidate boxes of the image (selective search)
• For each region: warp the region to fit the CNN’s input size, run one forward pass, and save the output of the fifth pooling layer (that is, the features extracted from the candidate box) to disk
Step 4: Train an SVM classifier (binary classification) to judge the category of the object in each candidate box. Each category corresponds to one SVM: a positive result means the box belongs to this category, negative means it does not. The picture below shows the SVM for the dog class.
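A sketch of this per-class SVM stage, with random stand-in data in place of the pool5 features saved to disk (9216 is AlexNet’s pool5 dimension of 6x6x256; the hyperparameters and data sizes here are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in for pool5 features extracted from candidate boxes (Step 3).
num_boxes, feat_dim, num_classes = 200, 9216, 20
features = np.random.randn(num_boxes, feat_dim).astype(np.float32)
labels = np.random.randint(-1, num_classes, size=num_boxes)  # -1 = background

# One binary one-vs-rest SVM per class.
svms = []
for cls in range(num_classes):
    y = (labels == cls).astype(int)   # positive = this class, negative = rest
    svms.append(LinearSVC(C=1e-3).fit(features, y))

# At test time, every candidate box gets one score per class.
scores = np.stack([svm.decision_function(features) for svm in svms], axis=1)
print(scores.shape)  # (200, 20)
```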
Step 5: Use regressors to refine the positions of the candidate boxes: for each class, train a linear regression model to judge whether the box frames the object well enough. (In other words: first classify the candidate boxes to obtain positives, then keep the higher-scoring candidate boxes, and finally fine-tune their positions with linear regression.)
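The regressor does not predict raw coordinates; in the R-CNN paper it predicts normalized offsets relative to the proposal box. The following sketch computes those targets and inverts them, with boxes in center format (cx, cy, w, h):

```python
import numpy as np

def bbox_regression_targets(proposal, gt):
    """Offsets the regressor learns, as parameterized in the R-CNN paper."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return np.array([(gx - px) / pw,    # x shift, normalized by proposal width
                     (gy - py) / ph,    # y shift, normalized by proposal height
                     np.log(gw / pw),   # log-scale change in width
                     np.log(gh / ph)])  # log-scale change in height

def apply_regression(proposal, t):
    """Invert the transform: refine a proposal with predicted offsets t."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph, pw * np.exp(tw), ph * np.exp(th))

t = bbox_regression_targets((50, 50, 100, 100), (55, 48, 110, 90))
print(apply_regression((50, 50, 100, 100), t))  # recovers (55, 48, 110, 90)
```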
The idea of SPP-Net contributed a lot to the evolution of R-CNN. Here is a brief introduction to SPP-Net.
SPP-Net: Spatial Pyramid Pooling Network
1. Multi-scale input to CNNs is realized by incorporating the spatial pyramid method. Generally, a CNN is followed by fully connected layers or a classifier, which require a fixed input size, so the input data has to be cropped or warped, causing data loss or geometric distortion. The first contribution of SPP-Net is to add the pyramid idea to CNNs, realizing multi-scale input of data.
As shown in the figure below, the SPP layer is added between the convolutional layers and the fully connected layers. The input of the network can then be of any scale: each pooling filter in the SPP layer adjusts itself according to the input, while the output size of the SPP layer is always fixed. (How is this achieved? See the sketch below.)
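One way to see how a fixed-length output can come from any input size: pool the feature map into fixed grids (e.g. 4x4, 2x2, 1x1) whose bin sizes adapt to the input, then concatenate the results. A minimal PyTorch sketch using adaptive pooling (the pyramid levels are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPPLayer(nn.Module):
    """Pool the feature map into fixed grids regardless of its spatial size,
    then concatenate, so the FC layer always sees the same vector length."""
    def __init__(self, levels=(4, 2, 1)):
        super().__init__()
        self.levels = levels

    def forward(self, x):                       # x: (N, C, H, W), any H, W
        pooled = [F.adaptive_max_pool2d(x, level).flatten(1)
                  for level in self.levels]     # each: (N, C * level**2)
        return torch.cat(pooled, dim=1)

spp = SPPLayer()
for size in (13, 24, 37):                       # different input scales...
    out = spp(torch.randn(1, 256, size, size))
    print(out.shape)                            # ...always (1, 256 * 21)
```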
In R-CNN, each candidate box is first resized to a uniform size and then fed through the CNN separately, which is inefficient. SPP-Net optimizes this shortcoming away: the feature map of the whole image is obtained by convolving the original image only once; then the mapped patch of each candidate box on the feature map is found, and this patch is fed into the SPP layer and the subsequent layers as the convolutional feature of that candidate box. This saves a great deal of computation time and is about a hundred times faster than R-CNN.
Fast R-CNN. SPP-Net is really a good method. Fast R-CNN, the advanced version of R-CNN, adopts the SPP-Net idea on top of R-CNN and improves it, pushing performance further.
What are the differences between R-CNN and Fast R-CNN? First, the disadvantage of R-CNN: even with pre-processing steps such as Selective Search to extract potential bounding boxes as input, R-CNN still has a serious speed bottleneck. The reason is obvious: features are computed repeatedly when they are extracted for all the regions. Fast R-CNN was born to solve this problem.
The authors propose a network layer that can be regarded as a single-layer SPP-Net, called the ROI Pooling layer. This layer maps inputs of different sizes to a fixed-length feature vector. As we know, operations such as conv, pooling, and ReLU do not require a fixed input size; but after running them on images of different sizes, the resulting feature maps also differ in size and cannot be fed directly into a fully connected layer for classification. The magic ROI Pooling layer extracts a fixed-dimension feature representation for each region, after which category recognition is done with an ordinary softmax (as in Word2Vec).

In addition, the previous R-CNN pipeline was: generate proposals first, then extract features with the CNN, then run an SVM classifier, and finally do bbox regression. In Fast R-CNN, the author cleverly puts the bbox regression inside the neural network and combines it with region classification into a multi-task model. Experiments also prove that these two tasks can share convolutional features and promote each other.

An important contribution of Fast R-CNN is that it gives hope for real-time detection within the region proposal + CNN framework: multi-class detection can improve processing speed while maintaining accuracy, which lays the foundation for the subsequent Faster R-CNN.
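ROI Pooling is available off the shelf. A small sketch using torchvision’s implementation, with an illustrative 1/16 feature stride (i.e. the conv5 map is 16 times smaller than the input image):

```python
import torch
from torchvision.ops import roi_pool

# Feature map for one image (as from conv5), 1/16 of the input resolution.
feat = torch.randn(1, 512, 38, 50)

# Candidate boxes in *image* coordinates: (batch_index, x1, y1, x2, y2).
rois = torch.tensor([[0,  50.,  40., 300., 250.],
                     [0, 120.,  80., 480., 400.]])

# Each box is pooled to a fixed 7x7 grid; spatial_scale converts image
# coordinates to feature-map coordinates (here 1/16).
pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7]), ready for the FC layers
```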
To picture it: R-CNN has some pretty big flaws (remove them all and you get Fast R-CNN). Big disadvantage: because each candidate box must pass through the CNN separately, it takes a great deal of time. Solution: share the convolutional layers. Now, instead of feeding each candidate box into the CNN, we feed in one complete image and obtain each candidate box’s features at the fifth convolutional layer.
Original method: many candidate boxes (say 2000) -> CNN -> features of each candidate box -> classification + regression
New method: one complete image -> CNN -> features of each candidate box -> classification + regression
It is therefore easy to see why Fast R-CNN is faster than R-CNN: unlike R-CNN, which feeds each candidate region into the deep network to extract features, it extracts features from the whole image once and then maps the candidate boxes onto conv5. Like SPP-Net, the features only need to be computed once; everything else operates on the conv5 layer.
The performance improvements are also significant:
Faster R-CNN. The problem with Fast R-CNN: there is still a bottleneck: Selective Search, used to find all the candidate boxes, is itself very time-consuming. Can we find a more efficient way to obtain these candidate boxes? Solution: add a neural network whose job is to propose candidate regions; in other words, the work of finding candidate boxes is also handed over to a neural network. The network that does this is called the Region Proposal Network (RPN).
• The RPN is placed after the last convolutional layer • The RPN is trained directly to produce candidate regions
Introduction to the RPN: • slide a window over the feature map • build a small neural network that does object classification + box-position regression • the sliding-window position provides the rough location of the object • the box regression provides a more precise box position
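At each sliding-window position, the RPN scores a fixed set of reference boxes (“anchors”). A sketch of anchor generation using the three scales and three aspect ratios from the Faster R-CNN paper (the exact width/height parameterization below is illustrative):

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """At every sliding-window position on the feature map, place
    k = len(scales) * len(ratios) reference boxes of area scale**2."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of this position, mapped back to image coordinates.
            cx, cy = x * stride + stride / 2, y * stride + stride / 2
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)  # (feat_h * feat_w * 9, 4) as (x1, y1, x2, y2)

print(generate_anchors(38, 50).shape)  # (17100, 4)
```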
A network with four loss functions: • RPN classification (anchor: object or not) • RPN regression (anchor -> proposal) • Fast R-CNN classification (over classes) • Fast R-CNN regression (proposal -> box)
Speed comparison
The main contribution of Faster R-CNN is the design of the RPN to extract candidate regions, replacing the time-consuming Selective Search and greatly improving detection speed.
Finally, the steps of each algorithm are summarized as follows:
R-CNN
1. Determine about 1000-2000 candidate boxes in the image (using Selective Search)
2. Scale the image patch inside each candidate box to the same size and feed it into the CNN for feature extraction
3. For the features extracted from each candidate box, use classifiers to judge whether it belongs to a specific class
4. For candidate boxes belonging to a certain class, further adjust their positions with regressors
Fast R-CNN
1. Determine about 1000-2000 candidate boxes in the image (using Selective Search)
2. Feed the whole image into the CNN to obtain the feature map
3. Find the mapped patch of each candidate box on the feature map, and feed this patch into the ROI Pooling layer and the subsequent layers as the convolutional feature of that candidate box
4. For the features extracted from each candidate box, use classifiers to judge whether it belongs to a specific class
5. For candidate boxes belonging to a certain class, further adjust their positions with regressors
Faster R-CNN
1. Feed the whole image into the CNN to obtain the feature map
2. Feed the convolutional features into the RPN to obtain the candidate boxes’ feature information
3. For the features extracted from each candidate box, use classifiers to judge whether it belongs to a specific class
4. For candidate boxes belonging to a certain class, further adjust their positions with regressors
In general, from R-CNN through SPP-Net and Fast R-CNN to Faster R-CNN, the pipeline of deep-learning-based object detection has become more and more streamlined, with higher precision and faster speed. It is fair to say that the region-proposal-based R-CNN series is the most important branch in the field of object detection technology today.