The YOLO series is a classic family of architectures in object detection. Although many more accurate and more complex networks have since appeared, the YOLO papers still have a lot to teach algorithm engineers. These three papers read like a reference manual of tricks for improving the accuracy of the object detection network in your hands

YOLOv1


You Only Look Once: Unified, Real-time Object Detection

  • Address: Arxiv.org/abs/1506.02…

Introduction

YOLO's simplicity — a single network that simultaneously classifies and localizes multiple objects without the concept of a proposal — makes it a milestone one-stage real-time detector. The standard version achieves 45 FPS on a Titan X and the fast version 150 FPS, though it is not as accurate as the SOTA networks of the time

Unified Detection

The input image is divided into an S×S grid. If the center point of a GT box falls into a grid cell, that cell is responsible for predicting the GT:

  • Each cell predicts B bboxes, and each bbox predicts 5 values: x, y, w, h and confidence. The center coordinates (x, y) are relative to the cell boundaries, while the width and height are relative to the whole image. The confidence reflects whether the cell contains an object and how accurate the box is, defined as Pr(Object)·IOU: 0 if there is no object, the IOU with the GT otherwise
  • Each cell also predicts the conditional probability Pr(Classᵢ | Object) for each class. Note that this is predicted per cell, not per bbox

At test time, the class-specific confidence of each bbox is obtained by multiplying the bbox confidence by the conditional class probability, which integrates both classification and localization quality. For the PASCAL VOC setting, S=7, B=2 and there are 20 classes, so the final prediction is a 7×7×30 tensor
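As a rough sketch of this decoding (assuming the common S=7, B=2, C=20 memory layout with the B boxes' five values first, then the shared class probabilities per cell — the layout is an assumption, not stated here):

```python
import numpy as np

S, B, C = 7, 2, 20  # PASCAL VOC setting

def class_scores(pred):
    """pred: (S, S, B*5 + C) raw network output.
    Returns (S, S, B, C) class-specific confidences:
    box confidence * conditional class probability."""
    boxes = pred[..., :B * 5].reshape(S, S, B, 5)  # x, y, w, h, conf per box
    conf = boxes[..., 4]                           # (S, S, B)
    cls_prob = pred[..., B * 5:]                   # (S, S, C), shared by the B boxes
    return conf[..., None] * cls_prob[:, :, None, :]
```

Taking the argmax over the last axis of the result then gives each box's best class and score.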

Network Design

The backbone has 24 convolutional layers plus 2 fully connected layers. There is no Inception-style bypass module; instead, 1×1 convolutions followed by 3×3 convolutions reduce the dimensionality. Fast YOLO reduces the network to nine convolutional layers

Training

For ImageNet pre-training, the first 20 convolutional layers of the backbone are followed by an average-pooling layer and a fully connected layer; for detection training, the input resolution is increased from 224×224 to 448×448. The last layer uses a linear activation, and the other layers use Leaky ReLU

The loss function is given in Formula 3. One GT corresponds to only one bbox. Because most cells contain no target and localization has relatively few training samples, the weights λcoord = 5 and λnoobj = 0.5 are used to rebalance training. The loss has three parts:

  • The first part is coordinate regression with squared-error loss. To make the model pay more attention to small errors on small boxes than to equally small errors on large boxes, the square roots of the width and height are regressed instead of the raw values. The indicator 1ᵢⱼ(obj) marks whether bbox j of cell i is responsible for a GT, which requires two conditions: first, the GT center lies in cell i; second, among the B boxes of that cell, box j has the maximal IOU with the GT
  • The second part is the bbox confidence regression. Here the no-object term is given the low weight λnoobj because negative samples vastly outnumber positives. If there is an object, the target confidence is actually the IOU with the GT, although many implementations simply take 1
  • The third part is the classification confidence, defined per cell; 1ᵢ(obj) indicates whether a GT center lies in cell i
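The three parts can be sketched as follows — a simplified single-image version with hypothetical pre-computed responsibility masks `obj_ij` and `obj_i`, not the author's implementation:

```python
import numpy as np

def yolo_v1_loss(pred_box, gt_box, pred_conf, target_iou, pred_cls, gt_cls,
                 obj_ij, obj_i, lambda_coord=5.0, lambda_noobj=0.5):
    """Sketch of the three YOLOv1 loss terms.
    pred_box, gt_box: (S, S, B, 4) with x, y, w, h
    pred_conf, target_iou, obj_ij: (S, S, B)
    pred_cls, gt_cls: (S, S, C); obj_i: (S, S)"""
    noobj_ij = 1.0 - obj_ij
    # 1) coordinate loss: squared error on x, y and on sqrt(w), sqrt(h)
    xy = ((pred_box[..., :2] - gt_box[..., :2]) ** 2).sum(-1)
    wh = ((np.sqrt(pred_box[..., 2:]) - np.sqrt(gt_box[..., 2:])) ** 2).sum(-1)
    coord = lambda_coord * (obj_ij * (xy + wh)).sum()
    # 2) confidence loss: target is the IOU for responsible boxes, 0 otherwise
    conf = (obj_ij * (pred_conf - target_iou) ** 2).sum() \
         + lambda_noobj * (noobj_ij * pred_conf ** 2).sum()
    # 3) classification loss, computed per cell, not per box
    cls = (obj_i * ((pred_cls - gt_cls) ** 2).sum(-1)).sum()
    return coord + conf + cls
```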

Inference

For PASCAL VOC, 98 bboxes (7×7×2) are predicted per image, and the results are post-processed with non-maximum suppression
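Greedy non-maximum suppression can be sketched as follows (boxes in corner format (x1, y1, x2, y2); the box format and the 0.5 threshold are illustrative assumptions):

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes
    overlapping it by more than `thresh`, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep
```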

Experiments

Conclusion

As the pioneering one-stage detector, YOLO attaches two fully connected layers behind the convolutional network for localization and confidence prediction, and designs a new lightweight backbone network. Although the accuracy falls somewhat short of SOTA, the model is genuinely fast. The author mentions several limitations of YOLO:

  • Each cell predicts only one class, and two boxes per cell do not handle dense scenes well
  • It is highly dependent on the data and does not generalize to objects with unusual aspect ratios; heavy downsampling also makes the features too coarse
  • The loss function does not fully differentiate between large and small objects: the same absolute error matters far more for a small object's IOU, and localization error is the main source of the model's errors

YOLOv2


YOLO9000: Better, Faster, Stronger

  • Address: Arxiv.org/abs/1612.08…

Introduction

Building on YOLOv1, YOLOv2 adds a series of then-popular improvements, making a faster and more accurate one-stage object detector. In addition, the author proposes YOLO9000 with hierarchical softmax, a general network that detects over 9000 object categories. The paper is organized into "Better", "Faster" and "Stronger", introducing the accuracy tricks, the network acceleration, and the large-scale classification respectively

Better

YOLOv1 is still a rather naive design, so the author adds many accuracy-improving methods in YOLOv2, which can be regarded as a carefully thought-out network. The specific methods are shown in Table 2

  • Batch Normalization

The BN layer accelerates the convergence of the network. Adding BN to YOLO improves mAP by 2%, and dropout can then be dropped during training

  • High Resolution Classifier

The original YOLO backbone is pre-trained with 224×224 input and then used directly for detection at 448×448, which requires the network to adapt to the new resolution and to detection simultaneously. To make the transition smoother, the paper first fine-tunes the backbone on 448×448 input for 10 epochs before detection training, which yields a 4% mAP boost

  • Convolutional With Anchor Boxes

YOLOv1 predicts bboxes directly, while Faster R-CNN achieves good results with preset anchors. YOLOv2 therefore removes the fully connected layers and adopts anchors

First, the final pooling layer is removed to keep the feature resolution high, and the input resolution is changed to 416 so that the feature map is odd-sized: this leaves a single central cell, which is convenient for predicting large objects. The final feature map is 1/32 of the input, i.e. 13×13. After anchors are added, the prediction mechanism changes from being bound to cells to being bound to anchors: each anchor predicts an objectness confidence (the IOU) and class confidences (the conditional class probabilities). With anchors the accuracy drops slightly: more boxes are output, so recall improves while precision decreases

  • Dimension Clusters

Hand-set anchors may not be optimal, so k-means is used to cluster the training-set boxes to obtain better priors. The clustering uses IOU as the distance metric, i.e. d(box, centroid) = 1 − IOU(box, centroid). As Figure 2 shows, five clusters give the best trade-off, which is the setting used by YOLOv2
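A minimal sketch of this clustering, assuming boxes are reduced to (w, h) pairs aligned at a common center (the helper names are hypothetical, and ties/empty clusters are handled naively):

```python
import random

def iou_wh(a, b):
    """IOU of two boxes (w, h) assumed to share the same center."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with distance d = 1 - IOU, as in YOLOv2."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, centroids[i]))
            clusters[best].append(b)
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids
```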

  • Direct location prediction

After adopting anchors, the early training of YOLOv2 was very unstable, mainly because of the center-point prediction. The region-proposal approach predicts offsets of the center relative to the anchor's width and height; since these are unconstrained, the center can end up anywhere in the image, which destabilizes early training

Therefore, YOLOv2 keeps YOLO's strategy of predicting the center relative to the grid cell, using a logistic function to constrain the offset to (0, 1), while the width and height become ratios relative to the anchor's width and height. Each cell predicts five bboxes, each with five values; the center is the predicted offset plus the coordinates of the cell's top-left corner. Constraining the center location improves mAP by 5%
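The constrained decoding can be sketched as follows, using the tx, ty, tw, th notation of the paper (objectness is squashed by the same logistic function):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLOv2 direct location prediction:
    the center offset is squashed into (0, 1) so the box stays in its cell,
    and w/h scale the anchor priors pw, ph exponentially."""
    bx = sigmoid(tx) + cx          # cx, cy: top-left corner of the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)         # pw, ph: anchor width and height
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```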

  • Fine-Grained Features

The final 13×13 feature map is sufficient for large targets, but finer features are needed to localize small ones. Faster R-CNN and SSD predict from feature maps of different layers, while YOLOv2 proposes a passthrough layer: the earlier 26×26×512 feature map is sampled at interval 2 into a 13×13×2048 one (the feature map is divided into 2×2 blocks, and the values at positions 1, 2, 3 and 4 of all blocks form new feature maps), which is then concatenated with the final feature map for prediction. This gives a 1% mAP improvement
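In effect the passthrough layer is a space-to-depth rearrangement, which can be sketched as:

```python
import numpy as np

def passthrough(x, stride=2):
    """Space-to-depth reorg: (H, W, C) -> (H/s, W/s, C*s*s).
    Each s x s spatial block is moved into the channel dimension."""
    h, w, c = x.shape
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // stride, w // stride, c * stride * stride)
```

Applied to a 26×26×512 map this yields 13×13×2048, matching the shapes above.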

  • Multi-Scale Training

Since YOLOv2 is a fully convolutional network, the input size can be changed freely. During training, the input resolution is switched every 10 batches among multiples of 32 from 320 to 608. In deployment, different resolutions can be chosen to trade off accuracy against speed; the results are shown in Table 3
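Resolution sampling then amounts to picking a random multiple of 32, assuming the {320, 352, …, 608} range from the paper:

```python
import random

def sample_input_size(rng, low=320, high=608, stride=32):
    """Pick a random training resolution from {320, 352, ..., 608}."""
    return rng.choice(range(low, high + 1, stride))
```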

  • Main Result

Faster

To speed things up, YOLOv2 uses a new backbone, Darknet-19, with 19 convolutional layers and 5 pooling layers. It uses 1×1 convolutions to compress the outputs of the 3×3 convolutions, BN layers to stabilize training, accelerate convergence and regularize the model, and global average pooling for prediction

Stronger

YOLOv2 proposes combining classification data and detection data for training, so as to obtain a classification model covering very many classes

  • Hierarchical classification

The label granularity of ImageNet and COCO differs, so the data needs multiple labels per image; a WordTree is constructed, similar to the biological taxonomy of kingdom, phylum, class, order, family, genus and species

Dogs such as the Norfolk terrier sit as leaf sub-classes under the terrier/hunting-dog branch, and the probability of the Norfolk terrier class is the product of the conditional probabilities of every node on the path from the root to that node
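Sketching the path-product rule on a toy tree (node names and probabilities are made up for illustration; each conditional probability comes from the softmax over that node's siblings):

```python
def wordtree_prob(node, cond_prob, parent):
    """Probability of `node` = product of the conditional probabilities
    along the path from the root. `parent` maps each node to its parent
    (None for the root); `cond_prob` holds Pr(node | parent)."""
    p = 1.0
    while node is not None:
        p *= cond_prob[node]
        node = parent[node]
    return p
```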

ImageNet-1k re-annotated with WordTree has 1369 nodes in total, and each level of the tree uses its own softmax. Darknet-19 retrained on this WordTree reaches 71.9% top-1 accuracy, only slightly lower than before. From the results, most errors occur at fine-grained levels: for example, a wrong prediction still thinks the object is a dog but misclassifies the breed. This hierarchical classification should therefore help guide feature extraction

  • Dataset combination with WordTree

COCO and ImageNet are combined to obtain the WordTree in Figure 6, with 9418 classes in total

  • Joint classification and detection

Because ImageNet data vastly outnumbers COCO, the COCO dataset is oversampled by a factor of 4. When the input image is detection data, the full loss is back-propagated, except that the classification part is only propagated at the GT's label level and above. When the input is classification data, only the classification part of the loss is back-propagated, using the bbox with the highest confidence

Training

YOLOv2's training is similar to YOLOv1's. First, each GT is assigned, within the cell containing its center, to the bbox with the maximal IOU (some online implementations use the anchor with the maximal IOU here; the author's implementation uses the predicted bbox, to be verified). The loss has three parts:

  • Bboxes whose maximal IOU with any GT is below a threshold only regress objectness, with target 0
  • Bboxes matched to a GT regress the full loss
  • For all boxes, the coordinates are additionally regressed towards the preset anchors during the first 12,800 iterations. Because the coordinate loss is intrinsically small, fitting the predictions to the anchors early on stabilizes training

Conclusion

YOLOv2 retains the core approach of YOLO while adding a number of improvements:

  • Adding Batch Normalization
  • High-resolution fine-tuning of the backbone before detection training
  • An anchor box mechanism
  • K-means clustering to help set the anchors
  • Anchor centers constrained using YOLO's cell-relative method
  • A passthrough layer to fuse fine-grained features
  • Multi-scale training to improve accuracy
  • Darknet-19 for acceleration
  • Hierarchical classification for very large numbers of classes

YOLOv3


YOLOv3: An Incremental Improvement

  • Address: Arxiv.org/abs/1804.02…

Introduction

YOLOv3 is not a complete paper; the author collected some small pieces of work at hand, mainly adding a few effective tricks

Bounding Box Prediction

YOLOv3's coordinate regression is similar to YOLOv2's, and a logistic function is still used to predict the objectness of each anchor. Each GT is assigned only to the single anchor with the maximal IOU, which generates the full loss (the paper says bounding box prior, i.e. the anchor, rather than the predicted bounding box, but the author's implementation uses the predicted bounding box, to be verified). Other anchors whose IOU with a GT exceeds 0.5 generate no loss at all, while anchors whose IOU with every GT is below 0.5 only generate objectness loss

Class Prediction

To support multi-label classification, class prediction uses independent logistic classifiers trained with a binary cross-entropy loss
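Per class this is just an independent sigmoid with binary cross-entropy, rather than a softmax over all classes:

```python
import math

def bce(logit, y):
    """Binary cross-entropy on a single class logit, as used
    independently per class in YOLOv3 (y is 0 or 1)."""
    p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid, not softmax
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```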

Predictions Across Scales

YOLOv3 predicts bboxes on three feature maps of different scales. Like FPN, high-level features are upsampled and then concatenated with lower-level features. Each feature map uses three dedicated anchors. The prediction at each scale is a 3-D tensor containing location, objectness and class information. For COCO, for example, the tensor is N×N×[3·(4+1+80)], i.e. 255 channels
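As a sanity check on the shapes, assuming a 416×416 input and the standard strides 32, 16 and 8:

```python
def yolov3_output_shapes(input_size=416, num_anchors=3, num_classes=80):
    """Output tensor shapes at the three detection scales.
    Per anchor: 4 box offsets + 1 objectness + class scores."""
    ch = num_anchors * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, ch) for s in (32, 16, 8)]
```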

Feature Extractor

YOLOv3 proposes a new backbone, Darknet-53, which fuses Darknet-19 with residual networks: a shortcut connection is added across each 1×1 convolution plus 3×3 convolution pair

Darknet-53's accuracy is similar to the SOTA classification networks of the time, but it is much faster

Main Result

Conclusion

YOLOv3 is an informal report, and the author made only a few improvements, mainly incorporating some methods to improve accuracy:

  • Class confidence prediction changed to independent logistic classifiers
  • Multi-scale prediction with an FPN-like structure
  • Darknet-53 with shortcut connections as the backbone

Conclusion


The YOLO series is a classic structure in object detection. Although many more accurate and more complex networks have since appeared, the YOLO structure still has a lot to teach algorithm engineers. Reading these three papers feels like reading a tuning manual: they teach you how to improve the accuracy of the object detection network in your hands, and all of the tricks are worth studying

If this article helped you, please give it a like or a share. For more content, follow the WeChat official account [Xiaofei's algorithm engineering notes]