Hello everyone, I am Jizhi Vision. This article shares Klay's comeback along with the principles and practice of the YOLOv2 algorithm.
This is the second article in my series on implementing object detection algorithms. Interested readers can refer to the first one:
(1) "[Model Training] Object Detection Implementation Sharing 1: Detailed Explanation of the YOLOv1 Algorithm Implementation";
As the name suggests, YOLOv2 is an improved version of YOLOv1. The paper "YOLO9000: Better, Faster, Stronger" also proposes the YOLO9000 model, which shares YOLOv2's structure but innovates on the training scheme; this article focuses on YOLOv2, not YOLO9000. As before, we won't just talk about principles but also practice, so without further ado, let's start.
1. Principle of YOLOv2
Let’s start with the experimental data:
The figure above compares efficiency and accuracy between YOLOv2 and the mainstream detectors of the time (the dataset is VOC0712). Considering efficiency and accuracy together, the further up and to the right, the better, and YOLOv2 clearly does well on both. As for why YOLOv2 appears as several points: each point corresponds to a different input resolution. The data is as follows:
YOLOv1 has two defects:
(1) high localization error, with boxes often seriously off;
(2) low recall;
YOLOv2 improves on both of these defects. Let's see which improvements make YOLOv2 so much better; the new tricks are summarized as follows:
Let's go through them one by one.
1.1 Batch Normalization
The BN layer is a common operator in detection networks. BN normalizes the activations flowing through it (toward zero mean and unit variance), which greatly accelerates model convergence and improves accuracy. The author found that adding a BN layer after each convolution layer in YOLO lifts mAP by about 2 points. The structure looks roughly like this:
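The figure isn't reproduced here; as a minimal PyTorch sketch (my own illustration, not the author's Darknet code), the convolution + BN + leaky-ReLU building block might look like this:

```python
import torch
import torch.nn as nn

class ConvBN(nn.Module):
    """Convolution followed by BatchNorm and leaky ReLU -- the basic
    block YOLOv2 uses in place of plain convolution layers."""
    def __init__(self, in_ch, out_ch, ksize=3):
        super().__init__()
        # the conv bias is redundant when BN immediately follows
        self.conv = nn.Conv2d(in_ch, out_ch, ksize,
                              padding=ksize // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)  # Darknet's usual activation

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

x = torch.randn(1, 3, 416, 416)
print(ConvBN(3, 32)(x).shape)  # torch.Size([1, 32, 416, 416])
```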
1.2 High Resolution Classifier
While YOLOv1's backbone is pretrained on ImageNet at 224 x 224, YOLOv2 first fine-tunes the classification network on ImageNet at 448 x 448 for 10 epochs, and only then fine-tunes on the detection dataset. The author's tests show that this higher-resolution pretraining improves mAP by almost 4 points.
1.3 Convolutional With Anchor Boxes
YOLOv1 uses fully connected layers to regress box predictions directly. Since object aspect ratios vary widely and the regression is unconstrained, the predicted boxes struggle to adapt to different targets. YOLOv2 borrows the prior-box (anchor) idea from the RPN in Faster R-CNN and replaces YOLOv1's fully connected layers with convolution + anchor boxes. With this change, YOLOv2's mAP drops slightly (69.5 to 69.2 in the paper) but recall rises substantially (81% to 88%).
YOLOv2 still has two problems after adopting anchors:
(1) anchor sizes are hand-picked, so they may not match the actual scales of the objects;
(2) position prediction based on anchors is unstable.
The next two tricks address these two defects in turn.
1.4 Dimension Clusters
This trick tackles the first problem: hand-picked anchor sizes may not match the actual scales of the objects. YOLOv2 instead runs k-means over the ground-truth boxes of the training set to cluster out the anchor widths and heights automatically. How, specifically? To favor anchors that reach high IOU with the ground-truth boxes, the k-means distance metric is designed as:

d(box, centroid) = 1 − IOU(box, centroid)
Then, different numbers of cluster centers were selected, and the test data on VOC and COCO data sets were as follows:
The author found that k = 5 cluster centers strike a good balance between model complexity and recall, which completes the automatic selection of the anchor boxes.
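A minimal NumPy sketch of this clustering (my own illustration, not the paper's code; boxes are represented by width and height only, all anchored at the origin):

```python
import numpy as np

def wh_iou(boxes, centroids):
    """IOU between (w, h) boxes and centroids, all anchored at the origin."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs with d = 1 - IOU as the distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(1.0 - wh_iou(boxes, centroids), axis=1)
        for i in range(k):
            if np.any(assign == i):  # keep the old centroid if a cluster empties
                centroids[i] = boxes[assign == i].mean(axis=0)
    return centroids

# toy stand-in for real training-set box sizes
boxes = np.abs(np.random.randn(1000, 2)) * 100 + 10
print(kmeans_anchors(boxes, k=5))
```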
1.5 Direct Location Prediction
This trick tackles the second problem: the instability of anchor-based position prediction. Under the RPN-style formulation borrowed from Faster R-CNN, the predicted box center (x, y) is computed from the network's offset outputs (t_x, t_y), the anchor's width and height (w_a, h_a), and the anchor's center (x_a, y_a):

x = t_x · w_a + x_a
y = t_y · h_a + y_a
Computed this way, there is a problem: the formula is unconstrained, so the predicted center can end up anywhere in the image, which makes training very unstable; this kind of drift is also an important source of YOLOv1's high localization error. YOLOv2 improves this by passing the offsets through a sigmoid to constrain them to (0, 1), measured relative to the top-left corner of the grid cell, so the predicted center can never leave its cell. Concretely, if the top-left corner of the grid cell is (c_x, c_y), the bounding-box prior has width and height p_w and p_h, and the network predicts five values t_x, t_y, t_w, t_h, t_o per box, then the prediction decodes as:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
Pr(object) · IOU(b, object) = σ(t_o)
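A minimal NumPy sketch of this decoding for a single cell (my own illustration; the coordinate units and names are assumptions):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(t, cell_xy, prior_wh):
    """Decode one raw prediction t = (tx, ty, tw, th, to) into a box.

    cell_xy:  top-left corner (cx, cy) of the grid cell, in cell units.
    prior_wh: width/height (pw, ph) of the anchor prior, in cell units.
    """
    tx, ty, tw, th, to = t
    bx = sigmoid(tx) + cell_xy[0]   # center is confined to this cell
    by = sigmoid(ty) + cell_xy[1]
    bw = prior_wh[0] * np.exp(tw)   # size is scaled relative to the prior
    bh = prior_wh[1] * np.exp(th)
    conf = sigmoid(to)              # objectness-times-IOU estimate
    return (bx, by, bw, bh), conf

box, conf = decode_box(np.array([0.2, -0.5, 0.1, 0.3, 1.5]),
                       cell_xy=(6, 6), prior_wh=(3.5, 4.2))
print(box, conf)
```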
1.6 Fine-Grained Features
YOLOv2 makes its final predictions on a 13 x 13 feature map. This optimization adds a skip connection, similar in spirit to the identity mappings in residual networks, that reshapes the earlier 26 x 26 x 512 feature map into 13 x 13 x 2048 and merges it with the coarse features, reducing the loss of fine-grained detail. This is called the Passthrough layer, and its structure looks like this:
This optimization yields a modest 1-point performance improvement.
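The passthrough layer is essentially a space-to-depth rearrangement: every 2 x 2 spatial block becomes 4 extra channels, turning 26 x 26 x 512 into 13 x 13 x 2048, which is then concatenated with the 13 x 13 map. A minimal PyTorch sketch (my own illustration; Darknet's reorg orders the channels slightly differently):

```python
import torch

def passthrough(x, stride=2):
    """Space-to-depth: fold each stride x stride spatial block into channels."""
    n, c, h, w = x.shape
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

fine   = torch.randn(1, 512, 26, 26)    # high-resolution features
coarse = torch.randn(1, 1024, 13, 13)   # features before the detection head
print(passthrough(fine).shape)                          # [1, 2048, 13, 13]
print(torch.cat([passthrough(fine), coarse], 1).shape)  # [1, 3072, 13, 13]
```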
1.7 Multi-Scale Training
Because YOLOv2 replaces YOLOv1's fully connected layers with only convolution and pooling layers, the input resolution can be adjusted dynamically. The training trick: every 10 batches, a new input resolution is drawn from {320, 352, ..., 608}, from a minimum of 320 x 320 up to a maximum of 608 x 608. This dynamic adjustment lets the model adapt better to objects at different scales.
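A tiny sketch of that schedule (illustrative only; the candidate sizes are multiples of 32 because the network downsamples by a factor of 32):

```python
import random

def pick_resolution(batch_idx, current=416):
    """Every 10 batches, draw a new input size from {320, 352, ..., 608}."""
    if batch_idx % 10 == 0:
        return random.choice(range(320, 609, 32))
    return current

size = 416
for batch_idx in range(1, 101):
    size = pick_resolution(batch_idx, size)
    # ... resize the batch to (size, size) and run a training step ...
```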
1.8 Darknet-19
The paper also proposes YOLOv2's backbone: Darknet-19. At the time, VGG was the dominant backbone for detection frameworks; VGG-16 was the most popular thanks to its strong feature extraction, but it is expensive, requiring 30.69 billion floating-point operations for a single forward pass. YOLOv1's backbone, based on the GoogLeNet architecture, cut the forward cost to 8.52 billion operations. YOLOv2 optimizes further: borrowing the Network in Network (NIN) idea, it alternates 1 x 1 and 3 x 3 convolutions and pairs them with BN layers to stabilize training, forming the Darknet-19 backbone. The forward cost drops further to 5.58 billion operations.
Take a look at the model structure of Darknet-19:
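The structure table isn't reproduced here; as a rough PyTorch sketch of the 3 x 3 / 1 x 1 alternation (the layer list follows the paper's Table 6; this is illustrative, not the author's code):

```python
import torch
import torch.nn as nn

def conv_bn(in_ch, out_ch, k):
    # convolution + BN + leaky ReLU, as in section 1.1
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1))

# (out_channels, kernel_size) per conv layer; 'M' = 2x2 max pooling
DARKNET19 = [
    (32, 3), 'M',
    (64, 3), 'M',
    (128, 3), (64, 1), (128, 3), 'M',
    (256, 3), (128, 1), (256, 3), 'M',
    (512, 3), (256, 1), (512, 3), (256, 1), (512, 3), 'M',
    (1024, 3), (512, 1), (1024, 3), (512, 1), (1024, 3),
]

def darknet19(num_classes=1000):
    layers, in_ch = [], 3
    for spec in DARKNET19:
        if spec == 'M':
            layers.append(nn.MaxPool2d(2, 2))
        else:
            out_ch, k = spec
            layers.append(conv_bn(in_ch, out_ch, k))
            in_ch = out_ch
    # the 19th conv: 1x1 down to num_classes, then global average pooling
    layers += [nn.Conv2d(in_ch, num_classes, 1),
               nn.AdaptiveAvgPool2d(1), nn.Flatten()]
    return nn.Sequential(*layers)

print(darknet19()(torch.randn(1, 3, 224, 224)).shape)  # [1, 1000]
```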
That covers YOLOv2's improvements; now let's move on to practice.
2. YOLOv2 Practice
Here we train YOLOv2 with Darknet; the Darknet environment and the VOC0712 dataset are assumed to be ready (check out my previous article if they are not).
2.1 Training
The training command for YOLOv2 differs somewhat from YOLOv1's: there, the training-set path and the weight-save path were hard-coded, whereas YOLOv2 takes them from the command line and reads them from the voc.data configuration.
Create a yolov2 folder under the cfg directory and add voc.data, voc.names, yolov2_train.cfg (for training), and yolov2.cfg (for inference). The resulting directory tree looks something like this:
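The screenshot of the tree isn't reproduced here; based on the description above, it would look roughly like:

```
cfg/
└── yolov2/
    ├── voc.data
    ├── voc.names
    ├── yolov2_train.cfg
    ├── yolov2.cfg
    └── backup/          # weights are saved here during training
```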
voc.names holds the category names, and voc.data looks something like this:
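A typical Darknet voc.data for this layout (the train/valid list paths are placeholders for your own setup):

```
classes = 20
train   = /path/to/voc/train.txt
valid   = /path/to/voc/2007_test.txt
names   = cfg/yolov2/voc.names
backup  = cfg/yolov2/backup
```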
Then perform the training:
./darknet detector train cfg/yolov2/voc.data cfg/yolov2/yolov2_train.cfg -gpus 0
Darknet produces a corresponding loss curve, which gives an intuitive view of how convergence is going:
During training, YOLOv2 adjusts the input resolution on its own (the multi-scale training from section 1.7). Once training finishes, let's check the results.
2.2 Validation
Klay is back! Klay returned to the NBA today after 941 days away, so let's run the test on Klay's footage.
In today's top five plays, Klay's dunk announced the return of the king. Run detection (since the training set is VOC0712, which is quite different from an NBA scene, the detection results are only passable):
./darknet detector demo cfg/yolov2/voc.data cfg/yolov2/yolov2.cfg cfg/yolov2/backup/yolov2.weights data/nba_kelai.mp4
Does he still look like the old Klay?
Well, the above mainly shared Klay's comeback, and along the way the principles and practice of the YOLOv2 algorithm. I hope my sharing helps your learning.