Abstract: The YOLO series of detection algorithms is a landmark in the history of object detection. This article walks through the YOLOv3 algorithm in detail.

Basic idea of the algorithm

First, a feature extraction network extracts features from the input image and produces feature maps of specific sizes. The input image is divided into 13×13 grid cells; if the center of a ground-truth object falls inside a grid cell, that cell is responsible for predicting the object. Each grid cell predicts a fixed number of bounding boxes (three in YOLOv3), and logistic regression determines which of them is used for the prediction.
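A minimal sketch of this assignment, assuming a 416×416 input and the 13×13 grid (the helper name is illustrative):

```python
# Minimal sketch: find the 13x13 grid cell responsible for a ground-truth box.
# Assumes a 416x416 input image; the helper name is illustrative.

def responsible_cell(center_x, center_y, img_size=416, grid_size=13):
    stride = img_size / grid_size     # 416 / 13 = 32 pixels per cell
    col = int(center_x // stride)     # cell column index (cx)
    row = int(center_y // stride)     # cell row index (cy)
    return row, col

# A box centered at (200, 300) falls into cell (row 9, col 6), so that
# cell's three bounding boxes are the candidates for predicting it.
print(responsible_cell(200, 300))     # -> (9, 6)
```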

The network structure

The DBL shown above is the basic component of YOLOv3: a Darknet convolutional layer followed by batch normalization (BN) and LeakyReLU. Except for the final convolutional layer of each output branch, BN and LeakyReLU are inseparable from every convolution in YOLOv3, and together they make up this smallest building block.
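As a sketch, a DBL block could be written in PyTorch like this (the helper name and defaults are illustrative, not the official implementation):

```python
import torch.nn as nn

# DBL: Darknet convolution + BatchNormalization + LeakyReLU.
def dbl(in_ch, out_ch, kernel_size, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                  padding=kernel_size // 2, bias=False),  # BN makes a bias redundant
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )
```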

Five res_n structures are used in the backbone. Here n is a number (res1, res2, ..., res8), indicating that the res_block contains n res_units; these are the major components of YOLOv3. YOLOv3 borrows the residual structure from ResNet, which allows the network to be made much deeper. The internal structure of a res_block, which is itself built from DBLs, can be seen in the lower right corner of the network diagram above.
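Continuing the sketch with the dbl() helper above, a res_unit and a res_n block might look as follows (the downsampling details are simplified relative to the real Darknet):

```python
import torch.nn as nn

# res_unit: two DBL blocks plus a shortcut add, mirroring ResNet.
class ResUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = dbl(channels, channels // 2, kernel_size=1)  # 1x1 bottleneck
        self.conv2 = dbl(channels // 2, channels, kernel_size=3)  # 3x3 restore
    def forward(self, x):
        return x + self.conv2(self.conv1(x))  # residual add keeps the shape

# res_n: one stride-2 (downsampling) DBL followed by n res_units,
# e.g. res8 stacks eight of them.
def res_block(in_ch, out_ch, n):
    layers = [dbl(in_ch, out_ch, kernel_size=3, stride=2)]
    layers += [ResUnit(out_ch) for _ in range(n)]
    return nn.Sequential(*layers)
```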

There is a concat operation on the prediction branches: a middle layer of Darknet is concatenated with an upsampled version of a later layer. Note that tensor concatenation works differently from the add in a res_unit: concatenation expands the channel dimension of the tensor, while add sums the tensors directly and does not change their dimensions.
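A quick PyTorch illustration of the difference:

```python
import torch

# Concatenation grows the channel dimension; the residual add keeps it fixed.
a = torch.randn(1, 256, 26, 26)        # e.g. an upsampled deep feature map
b = torch.randn(1, 512, 26, 26)        # e.g. a mid-level Darknet feature map

cat = torch.cat([a, b], dim=1)         # channels: 256 + 512 = 768
print(cat.shape)                       # torch.Size([1, 768, 26, 26])

add = a + torch.randn(1, 256, 26, 26)  # add requires identical shapes
print(add.shape)                       # torch.Size([1, 256, 26, 26])
```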

yolo_body has 252 layers in total. The 23 res_units correspond to 23 add layers. There are 72 BN layers and 72 LeakyReLU layers; in the network structure, every BN layer is followed by a LeakyReLU layer. There are two upsampling operations and two tensor concatenations, and five zero-padding layers corresponding to the five res_blocks. Of the 75 convolutional layers, 72 are followed by BN and LeakyReLU to form DBLs. The outputs at three different scales correspond to three convolutional layers, and the number of kernels in each of these final convolutional layers is 255: for the 80 classes of the COCO dataset, 3×(80+4+1)=255, where 3 means each grid cell contains three bounding boxes, 4 is the four coordinates of a box, and 1 is the confidence.

The following figure shows the detailed network structure.

Mapping the input to the output

Setting aside the details of the network structure, the general picture is this: for an input image, YOLOv3 maps it to output tensors at three scales, which represent the probabilities of various objects appearing at each position in the image.

Let's count how many predictions YOLOv3 makes. For a 416×416 input image, 3 prior boxes are set in each grid cell of the feature map at each scale, for a total of 13×13×3 + 26×26×3 + 52×52×3 = 10647 predictions. Each prediction is a (4+1+80)=85-dimensional vector containing the box coordinates (4 values), the box confidence (1 value), and the object class probabilities (80 classes for the COCO dataset).
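Counting them directly:

```python
# Total number of predictions for a 416x416 input, over the three scales.
total = sum(n * n * 3 for n in (13, 26, 52))
print(total)        # 10647

# Each prediction is an 85-dimensional vector for COCO:
print(4 + 1 + 80)   # 85 = 4 box coords + 1 confidence + 80 class probs
```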

Bounding Box Prediction

YOLOv3 keeps the k-means clustering used in YOLOv2 to set the initial sizes of the bounding boxes, and this prior knowledge is very helpful for bounding-box initialization; after all, having many well-chosen prior boxes safeguards the detection quality. However, the impact on the algorithm's speed is still relatively large.

On the COCO dataset, the 9 clusters are shown in the following table. Note: the larger the feature map, the smaller its receptive field and the more sensitive it is to small targets, so the smaller anchor boxes are assigned to it; the smaller the feature map, the larger its receptive field and the more sensitive it is to large targets, so the larger anchor boxes are assigned to it.
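For reference, the nine clustered anchors reported in the YOLOv3 paper, grouped by the scale they are assigned to:

  • 13×13 feature map (largest receptive field): (116×90), (156×198), (373×326)
  • 26×26 feature map: (30×61), (62×45), (59×119)
  • 52×52 feature map (smallest receptive field): (10×13), (16×30), (33×23)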

YOLOv3 predicts relative positions directly: the coordinates of the box center are predicted relative to the upper-left corner of its grid cell. The network outputs (tx, ty, tw, th, t0) directly, and the position, size, and confidence of the box are then computed by the following coordinate offset formulas.
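The offset formulas, as given in the YOLOv3 paper:

$$
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h} \\
\text{confidence} &= \sigma(t_0)
\end{aligned}
$$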

tx, ty, tw, and th are the raw outputs of the model. cx and cy are the coordinates of the grid cell: for example, if a layer's feature map is 13×13, there are 13×13 grid cells, and the grid cell in row 0, column 1 has cx = 1 and cy = 0. pw and ph are the width and height of the prior box before prediction. bx, by, bw, and bh are the center coordinates and size of the predicted bounding box. A sum-of-squared-error loss is used when training these coordinate values, because this kind of error can be computed quickly.

Confidence = Pr(Object) × IoU. That is, if the box corresponds to the background, this value should be 0; if the box corresponds to the foreground, this value should be the IoU with the corresponding foreground ground truth.

YOLOv3 uses logistic regression to predict a score for each bounding box. If a bounding box overlaps the ground-truth box more than any other bounding box does, its target value is 1. If a bounding box is not the best one but its overlap with the ground-truth object still exceeds a threshold (set to 0.5 in YOLOv3), the prediction is ignored. YOLOv3 assigns only one bounding box to each ground-truth object. A bounding box not assigned to any ground-truth object incurs no coordinate or class loss, only an objectness loss.

Multi-scale prediction

As can be seen from the network structure diagram above, YOLOv3 sets three boxes for each grid cell, so each box needs five basic parameters (x, y, w, h, confidence). YOLOv3 outputs feature maps at three different scales, shown in the figure above as y1, y2, and y3. All three have a depth of 255, and their side lengths are 13, 26, and 52 respectively.

The feature tensor produced by each prediction head has shape N×N×[3×(4+1+80)], where N is the grid size, 3 is the number of bounding boxes per grid cell, 4 is the number of box coordinates, 1 is the objectness prediction, and 80 is the number of categories. For the COCO categories there are 80 class probabilities, so each box outputs one probability per class. Hence 3×(5+80) = 255; that is where the 255 comes from.
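A sketch of slicing one head's raw output, assuming the 255 channels are laid out as 3 anchors × 85 attributes (that layout and the tensor names are assumptions for illustration):

```python
import torch

# Reshape one raw head output into per-box predictions (13x13 scale shown).
raw = torch.randn(1, 255, 13, 13)                  # (batch, 3*(4+1+80), N, N)
pred = raw.view(1, 3, 85, 13, 13).permute(0, 1, 3, 4, 2)
print(pred.shape)                                  # (1, 3, 13, 13, 85)

box_txtytwth = pred[..., 0:4]   # tx, ty, tw, th
objectness   = pred[..., 4:5]   # t0, squashed by a sigmoid at inference
class_logits = pred[..., 5:]    # 80 independent logistic scores for COCO
```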

YOLOv3 realizes this multi-scale feature map by upsampling. Starting from the feature map produced by Darknet-53, the first feature map is obtained through six DBL structures and a final convolutional layer, and the first prediction (y1) is made on it. For the y2 branch, the output of the convolutional layer three layers back from y1 is passed through a DBL structure and a (2, 2) upsampling; the upsampled features are concatenated with the output of the second res8 block, then passed through six DBL structures and a final convolutional layer to obtain the second feature map, on which the second prediction is made. For the y3 branch, the output of the convolutional layer three layers back from y2 is likewise passed through a DBL structure and a (2, 2) upsampling, concatenated with the output of the first res8 block, and then passed through six DBL structures and a final convolutional layer to obtain the third feature map, on which the third prediction is made.

Over the whole network, the output feature map sizes of YOLOv3's multi-scale prediction are y1 (13×13), y2 (26×26), and y3 (52×52). The network takes a 416×416 image and downsamples it through five stride-2 convolutions (416 / 2^5 = 13), so y1 outputs 13×13. A ×2 upsampling taken from the penultimate convolutional layer of the y1 branch is then concatenated with the last 26×26 feature map, and y2 outputs 26×26. Likewise, a ×2 upsampling taken from the penultimate convolutional layer of the y2 branch is concatenated with the last 52×52 feature map, and y3 outputs 52×52. The figure below gives a feel for the sizes of the nine prior boxes: the blue boxes are the prior boxes obtained by clustering, and the yellow box is the ground truth, drawn in the grid cell that contains the object's center point.

Three cases for prediction boxes

Prediction boxes fall into three cases: positive examples, negative examples, and ignored examples (see the sketch after this list).

  • Positive example: take any ground truth and compute its IoU with all of the 10647 boxes counted above; the prediction box with the largest IoU is the positive example. A prediction box can be assigned to only one ground truth: if the first ground truth has already claimed a box as its positive example, the next ground truth picks the box with the largest IoU among the remaining 10646. The order in which the ground truths are processed does not matter. A positive example generates confidence loss, box loss, and class loss: its box label is the corresponding ground-truth box (computed from the real x, y, w, and h); the label of the corresponding class is 1 and of all other classes 0; and its confidence label is 1.
  • Ignored example: apart from the positive examples, any box whose IoU with some ground truth exceeds the threshold (0.5 is used in the paper) is ignored. Ignored examples produce no loss at all.

Why are some examples ignored?

Because YOLOv3 performs multi-scale detection, the same object can be detected repeatedly across scales. For example, suppose a real object is assigned during training to the third box of feature map 1, with an IoU of 0.98, while at the same time the first box of feature map 2 has an IoU of 0.95 with the ground truth, so it also detects the object. If that box's objectness were forced to a label of 0, the network would not learn well.

  • Negative example: a prediction box that is not a positive example (note: the box with the largest IoU with a ground truth is still a positive example even when that IoU is below the threshold) and whose IoU with every ground truth is below the threshold (0.5) is a negative example. Negative examples produce only confidence loss, with a confidence label of 0.
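A minimal NumPy sketch of this assignment rule, assuming an IoU matrix between the ground truths and the 10647 prediction boxes has already been computed (names are illustrative):

```python
import numpy as np

def assign_labels(iou_matrix, ignore_thresh=0.5):
    """Label each prediction box: 1 = positive, -1 = ignored, 0 = negative."""
    num_gt, num_pred = iou_matrix.shape
    label = np.zeros(num_pred, dtype=int)   # everything starts as negative
    iou = iou_matrix.copy()
    for g in range(num_gt):                 # each ground truth claims one box
        p = int(iou[g].argmax())
        label[p] = 1                        # positive even if its IoU < threshold
        iou[:, p] = -1.0                    # a box serves at most one ground truth
    # non-positive boxes overlapping some ground truth above the threshold: ignore
    overlapped = (iou_matrix.max(axis=0) > ignore_thresh) & (label != 1)
    label[overlapped] = -1
    return label
```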

The loss is then computed as shown below:

  • λ is a weight parameter used to balance the box loss, the objectness (confidence) loss, and the class loss (the obj and noobj terms).
  • For positive examples, the indicator 1_ij^obj is 1; for negative examples, 1_ij^noobj is 1; for ignored examples, both are 0.
  • The class loss uses cross entropy as its loss function.
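The YOLOv3 paper never writes the loss out explicitly; one commonly used form, consistent with the symbols described above, is the following sketch:

$$
\begin{aligned}
L ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B} 1_{ij}^{obj}\Big[(x_{ij}-\hat{x}_{ij})^2+(y_{ij}-\hat{y}_{ij})^2+(w_{ij}-\hat{w}_{ij})^2+(h_{ij}-\hat{h}_{ij})^2\Big] \\
&-\sum_{i=0}^{S^2}\sum_{j=0}^{B} 1_{ij}^{obj}\log C_{ij} \;-\; \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B} 1_{ij}^{noobj}\log\big(1-C_{ij}\big) \\
&-\sum_{i=0}^{S^2}\sum_{j=0}^{B} 1_{ij}^{obj}\sum_{c \in classes}\Big[\hat{p}_{ij}(c)\log p_{ij}(c)+\big(1-\hat{p}_{ij}(c)\big)\log\big(1-p_{ij}(c)\big)\Big]
\end{aligned}
$$

where C is the predicted confidence, p(c) the predicted class probability, hatted symbols are labels, and ignored examples appear in neither indicator.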

Class prediction

For class prediction, the softmax classifier in the YOLOv2 network assumes that a target belongs to exactly one class, and each box is assigned to the class with the largest score. However, in some complex scenes a target may belong to multiple classes (its class labels overlap), so YOLOv3 replaces the softmax layer with multiple independent logistic classifiers, solving the multi-label classification problem without any loss of accuracy.

For example, the softmax layer in the original classification network assumes that an image or object belongs to only one class. But in some complex scenes an object may belong to several classes: if your label set contains both woman and person and there is a woman in an image, the detection result should carry both the woman and the person labels. This is multi-label classification, and a logistic classifier performs a binary classification for each class. The logistic classifier is based on the sigmoid function, which constrains its input to the range 0 to 1; if the sigmoid output for some class is greater than 0.5, the target inside the bounding box is considered to belong to that class.
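A small illustration of independent logistic classifiers replacing softmax (the scores and class names are made up):

```python
import torch

# Each class gets its own sigmoid; scores above 0.5 all count as detected.
logits = torch.tensor([3.2, 1.8, -2.5])   # raw scores for person, woman, car
probs = torch.sigmoid(logits)             # each squashed to (0, 1) independently
print(probs)                              # ~[0.961, 0.858, 0.076]

classes = ["person", "woman", "car"]
detected = [c for c, p in zip(classes, probs) if p > 0.5]
print(detected)                           # ['person', 'woman']: multiple labels allowed
```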

Object score and class confidence

Object score: the probability that a bounding box contains an object. In the figure, it is close to 1 for the red box and the boxes around it, but close to 0 for, say, a box in a corner of the image. The object score is also passed through a sigmoid function so that it can be interpreted as a probability.

Class confidence: the probability that the detected object belongs to a specific class. Previous YOLO versions used a softmax to turn class scores into class probabilities, but in YOLOv3 the author uses sigmoid functions instead, because softmax assumes the classes are mutually exclusive: belonging to “Person” would rule out belonging to “Woman”, whereas in many cases an object is both “Person” and “Woman”.

Output processing

Our network generates 10647 anchor boxes, yet there is only one dog in the image. How do we reduce 10647 boxes to one? First, we filter anchor boxes by object score: those below a threshold (say 0.5) are discarded directly. Then NMS (non-maximum suppression) handles the case of multiple anchor boxes detecting the same object (for example, three anchor boxes around the red box, or adjacent cells, all detecting the same object, producing redundancy) and removes the duplicate detection boxes.

The steps are: discard boxes with low scores (the box has little confidence that it detected any class); then, when multiple boxes overlap and all detect the same object, keep only one of them via NMS.

To make this easier to understand, let's use the car image above. First, we use a threshold to filter out part of the anchor boxes. The model outputs 19×19×3×85 numbers, with each box described by 85 numbers. Split the (19, 19, 3, 85) tensor into the following shapes:

box_confidence: (19, 19, 3, 1). 19×19 cells, 3 boxes per cell, each box with one objectness probability;

boxes: (19, 19, 3, 4). The 4 coordinates of each of the 3 boxes per cell;

box_class_probs: (19, 19, 3, 80). The 80 class-detection probabilities for each of the 3 boxes per cell.

For each anchor box, we compute the following element-wise multiplication to get the probability that the box contains each object class:
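A sketch of this step with random stand-in tensors:

```python
import numpy as np

# Stand-ins for the tensors described above.
box_confidence  = np.random.rand(19, 19, 3, 1)    # objectness per box
box_class_probs = np.random.rand(19, 19, 3, 80)   # per-class probabilities

box_scores = box_confidence * box_class_probs     # element-wise: (19, 19, 3, 80)
box_class  = box_scores.argmax(axis=-1)           # best class for each box
box_score  = box_scores.max(axis=-1)              # that class's score

keep = box_score > 0.5                            # drop low-confidence boxes
print(keep.sum(), "boxes survive the threshold")
```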

Even after some anchor boxes are filtered out by the class-score threshold, many overlapping boxes remain. The second step, NMS, relies on IoU, as shown in the following figure.

The key steps of non-maximum suppression: select the box with the highest score; compute its overlap with the other boxes and remove any box whose overlap exceeds the IoU threshold; then go back to step 1 and iterate until no boxes remain to be processed.
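A minimal single-class NMS sketch in NumPy, assuming boxes are given as (x1, y1, x2, y2):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Return the indices of the boxes kept after non-maximum suppression."""
    order = scores.argsort()[::-1]        # indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # intersection of the best box with every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        order = rest[iou <= iou_thresh]   # drop boxes that overlap too much
    return keep
```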

Loss Function

The loss function is not explicitly given in the YOLOv3 paper; to be exact, among the YOLO series papers only the YOLOv1 paper explicitly writes out a loss formula. YOLOv1 uses a sum-squared-error loss, which is simply an addition of squared differences. We know that in an object detection task several key pieces of information have to be determined: (x, y), (w, h), class, and confidence. The loss can be split into these four parts according to their characteristics, with each part's loss function chosen to match, and the parts are finally added together to form the final loss function, a single loss that makes end-to-end training possible.

YOLOv3 network core explained (video)

Video Address:

www.bilibili.com/video/BV12y…

The video covers:

  • How ground-truth values are encoded
  • The design of the prediction anchor boxes
  • Computing IoU between the anchor boxes and the target boxes

This article is shared from the Huawei Cloud community post “Principle Analysis of YOLOV3”, originally written by Lutianfei.
