Previously in this series:

  1. Object Detection (1): R-CNN in detail — the pioneering work of deep-learning object detection
  2. Object Detection (2): SPP-Net — sharing convolutional computation
  3. Object Detection (3): Fast R-CNN — making R-CNN trainable end-to-end
  4. Object Detection (4): Faster R-CNN — replacing Selective Search with the RPN
  5. Object Detection (5): YOLOv1 — opening the chapter of one-stage object detection

1. Motivation: turn the two-stage object detection network into a single-stage network

Looking back at the development of object detection: R-CNN pioneered deep-learning object detection, followed by SPP-Net, Fast R-CNN, and Faster R-CNN, all of which require region proposals. The first three generate nearly 2,000 region proposals via Selective Search, while Faster R-CNN generates candidate regions with an RPN. In every case, candidate regions must first be produced, and only then are they classified and refined by regression. The R-CNN family of algorithms is therefore collectively called two-stage object detection.

YOLO's idea: since the RPN already generates proposal boxes, classifies them (a binary decision, object or not), and refines them by regression, why not let the network directly predict the category and the box position? YOLO merges the two stages, completing both object classification and bounding-box regression in a single stage, so the YOLO family is collectively called one-stage algorithms. This article introduces YOLOv1 in detail.

2. YOLOv1 principle

2.1 Ideas of YOLO

  • Divide the image into an S×S grid. If the center of an object falls inside a grid cell, that cell is responsible for predicting the object. For example, if the center of the dog in the image below falls in the fifth row, second column, then that cell is responsible for predicting the dog's class and box.
  • Each grid cell predicts B bounding boxes (S=7, B=2 in the paper), and each bounding box predicts a confidence value in addition to its position. Each cell also predicts scores over C classes (C=20 for VOC). The output dimension is therefore S × S × (B × (4 + 1) + C). With S=7, B=2, C=20, the output tensor is as shown below:

  • Each bounding box contains 5 predicted values: x, y, w, h, and confidence. x and y are the coordinates of the box center relative to the grid cell, i.e. x, y ∈ [0, 1]; w and h are the width and height of the box relative to the whole image, i.e. w, h ∈ [0, 1]. Confidence represents the IOU between the predicted box and the ground-truth box; formally, confidence = Pr(object) · IOU^truth_pred, where Pr(object) = 1 when the grid cell contains an object center and Pr(object) = 0 otherwise. Details are shown in the figure below.

    1. At test time, the predicted class probability is multiplied by the box confidence to give the class-specific prediction score.
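The grid layout and test-time scoring above can be sketched in a few lines of NumPy. The shapes follow the paper's VOC setting (S=7, B=2, C=20); the channel layout (B boxes of 5 values first, then C class probabilities) is an assumption for illustration, not prescribed by the paper.

```python
import numpy as np

S, B, C = 7, 2, 20                         # grid size, boxes per cell, classes (VOC)
rng = np.random.default_rng(0)
pred = rng.random((S, S, B * 5 + C))       # dummy network output, shape (7, 7, 30)

# Assumed layout: B boxes of (x, y, w, h, confidence), then C class probabilities.
boxes = pred[..., :B * 5].reshape(S, S, B, 5)
class_probs = pred[..., B * 5:]            # Pr(class_i | object), shape (7, 7, 20)

# Test-time score per box and class:
# Pr(class_i | object) * Pr(object) * IOU = class probability * box confidence
conf = boxes[..., 4]                                   # (7, 7, 2)
scores = class_probs[:, :, None, :] * conf[..., None]  # (7, 7, 2, 20)
```

Each of the 7×7×2 boxes thus carries 20 class-specific scores, which are then thresholded and filtered by non-maximum suppression.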

2.2 YOLOv1 network structure

The design of the network architecture is inspired by GoogLeNet, with each Inception module replaced by a 1×1 followed by a 3×3 convolution. As shown in the figure below, a 448×448×3 image passes through 24 convolutional layers to produce a 7×7×1024 feature map. This is flattened into a 1×50176 vector, followed by a 4096-dim fully connected layer and then a 1470-dim fully connected layer, whose output is reshaped into the 7×7×30 prediction tensor (30 = 5 + 5 + 20).
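The dimension bookkeeping in this paragraph can be checked with simple arithmetic, no framework needed:

```python
# Shape arithmetic for the tail of the YOLOv1 network described above.
flat = 7 * 7 * 1024            # flattened final conv feature map
assert flat == 50176

fc1_out = 4096                 # first fully connected layer
fc2_out = 1470                 # second fully connected layer
assert fc2_out == 7 * 7 * 30   # reshaped into the 7x7x30 prediction tensor

# 30 channels per cell = B boxes * (4 coords + 1 confidence) + C classes
assert 2 * (4 + 1) + 20 == 30
```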

2.3 Design of Loss function

The loss consists of three parts: bbox loss, confidence loss, and classification loss.

  • Bbox loss: the indicator function 𝟙^obj_ij in front of the regression term marks whether the j-th predicted bounding box of the i-th grid cell is responsible for this object (i.e. it has the largest IOU with the GT box among the cell's boxes). The loss itself is simple: the squared (L2) distance between the predicted and GT center point, width, and height. Because the loss of a large object could otherwise dwarf that of a small one, the square root is taken on width and height; without it, the loss would be heavily biased toward large objects.

  • Confidence loss: consists of two parts. First, when the j-th predicted bounding box of the i-th grid cell is responsible for an object, the confidence GT value is 1; second, when the grid cell contains no object (is not responsible for one), the confidence GT value is 0.

  • Classification loss: the indicator marks whether the i-th grid cell is responsible for predicting an object (i.e. whether an object's center falls in the i-th grid cell); the class loss is computed only for such cells.

  • Balancing coefficients: most grid cells in an image are not responsible for any object, i.e. most loss terms are in the no-object state. We therefore raise the weight of the bbox regression loss and lower the no-object confidence loss by introducing the parameters λcoord = 5 and λnoobj = 0.5.
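Putting the four points above together, here is a minimal single-cell sketch of the loss terms. Variable names and the sample numbers are illustrative, not from any official implementation, and the GT confidence for the responsible box is taken as 1 for brevity.

```python
import math

lambda_coord, lambda_noobj = 5.0, 0.5   # balancing coefficients from the paper

def bbox_loss(pred, gt):
    """pred, gt = (x, y, w, h); the square root on w, h damps the large-object bias."""
    (px, py, pw, ph), (gx, gy, gw, gh) = pred, gt
    return ((px - gx) ** 2 + (py - gy) ** 2
            + (math.sqrt(pw) - math.sqrt(gw)) ** 2
            + (math.sqrt(ph) - math.sqrt(gh)) ** 2)

# Cell responsible for an object: weighted bbox term + confidence term (GT conf = 1)
pred_box, gt_box = (0.5, 0.5, 0.4, 0.6), (0.45, 0.55, 0.5, 0.5)
obj_loss = lambda_coord * bbox_loss(pred_box, gt_box) + (0.8 - 1.0) ** 2

# Cell with no object: only the down-weighted confidence term (GT conf = 0)
noobj_loss = lambda_noobj * (0.3 - 0.0) ** 2
```

Note how λnoobj shrinks the contribution of the many empty cells, while λcoord amplifies the few coordinate terms.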

3. Evaluation of YOLO and analysis of its pros and cons

3.1 Accuracy and speed of YOLO

Let’s focus on the comparison in the red box:

  • With the same VGG16 backbone, YOLO is about 3× faster than Faster RCNN, but its mAP is about 7 points lower.
  • The real-time YOLO network reaches 45 fps, achieving true real-time detection, but mAP drops by nearly 10 points.

3.2 Advantages and disadvantages of YOLO

  • YOLO advantages:

    • Real-time object detection is achieved, with fast inference, which kicked off the large-scale industrial adoption of CV object detection.
    • No region proposals need to be extracted: the whole image is fed into the network, so the model can use more context information and features, reducing errors where background is detected as an object.
  • YOLO disadvantages:

    • Overlapping objects cannot be detected well: YOLO divides the image into grid cells, and each grid cell can only predict one class, so detection degrades for overlapping objects, especially small overlapping objects such as flocks of birds.
    • Small objects are detected poorly, for two reasons. First, the loss design is relatively crude: although the square root damps the dominance of large objects, the effect is still limited. Second, after multiple downsamplings the final feature map has low resolution, i.e. coarse features, which hurts object localization.
    • YOLO predicts boxes based on the training data, so it generalizes poorly when objects in the test data have aspect ratios unseen in training.
    • The network regresses coordinates directly against GT rather than regressing offsets, which increases the difficulty of training.
    • BN is not used.
