
Welcome to my public account [Jizhi Vision]

Hello everyone, I am Jizhi Vision. This article introduces the design and practice of the YOLOv4 algorithm in detail; the practice part covers both Darknet and PyTorch.

This is the fourth article in my series on implementing object detection algorithms. Three articles have been written before; interested readers can refer to them:

(1) “[Model Training] Object Detection Implementation Sharing 1: Detailed Explanation of the YOLOv1 Algorithm Implementation”;

(2) “[Model Training] Object Detection Implementation Sharing 2: Heard Clay Is Making a Comeback? Detailed Explanation of the YOLOv2 Algorithm and Clay Detection”;

(3) “[Model Training] Object Detection Implementation Sharing 3: Detailed Explanation of the YOLOv3 Algorithm Implementation”;

YOLOv4 is the fourth version of the YOLO series, proposed in the paper “YOLOv4: Optimal Speed and Accuracy of Object Detection”, which gathers all kinds of state-of-the-art tricks from the field of object detection. Reading the YOLOv4 paper and going through the training practice gives you a macro view of the many great tricks that have emerged in object detection in recent years. The authors experimented with these tricks and combined many of them into one model, resulting in the excellent YOLOv4 network.

Without further ado, let’s see.

Again, we’re going to talk about practice as well as principles.

1. Principle of YOLOv4

As usual, performance data first:

The test dataset above is MS COCO, and the inference hardware is an Nvidia V100. The horizontal axis is the frame rate (FPS) and the vertical axis is the accuracy (AP), so the further a point sits towards the upper right, the better. YOLOv4 runs about twice as fast as EfficientDet with comparable accuracy, and improves YOLOv3's AP and FPS by 10% and 12%, respectively.

Take a look at the AP & FPS performance graphs on Maxwell, Pascal and Volta GPUs respectively:

You can see that the authors compare many advanced object detection networks, but YOLOv4 is always the brightest point in the upper right corner, whether on the Maxwell, Pascal, or Volta architecture.

Let’s look at a more detailed set of performance data, measured on a Volta architecture GPU:

The performance data in the figure above is measured with batch = 1 and without TensorRT acceleration. A blue bar indicates FPS > 30, i.e. real-time detection. As you can see, all three input resolutions of YOLOv4 easily reach real time, as do CenterMask-Lite, EFGRNet-VGG16-320, HSD-VGG16-320, and DAFS-VGG16-512. In terms of accuracy, AP, AP50, AP75, APs and APm are all dominated by YOLOv4-CSPDarknet53-608; only CenterMask-Lite-VoVNet39-FPN-600x is 0.2 points better on APl.

From the above experimental data, it can be seen that the performance of YOLOv4 is very strong. Its proposal makes the following two contributions:

(1) An efficient and powerful detection model is proposed; anyone can train a fast and accurate detector with just a single 2080Ti GPU card;

(2) Many state-of-the-art detection tricks were verified and integrated into YOLOv4, making the detector more efficient and powerful;

Let’s take a closer look.

The network structure of YOLOv4 can be divided into four modules: Input, Backbone, Neck and Head, as shown in the following figure:

On top of this, tricks known as “Bag of freebies” and “Bag of specials” are applied to the training strategy and model structure respectively. To explain these two terms:

  • Bag of freebies (BoF): methods that improve accuracy by changing the training strategy or increasing the training cost, without increasing the inference cost, e.g. data augmentation;
  • Bag of specials (BoS): plug-in modules and post-processing methods that slightly increase the inference cost but can greatly improve detection accuracy;

YOLOv4 incorporates a lot of BoF and BoS tricks.

With these tricks incorporated, the overall YOLOv4 network structure (Input, Backbone, Neck, Head) looks roughly like the following figure:

Let’s look at tricks in each module.

1.1 Input

During training, YOLOv4 makes many innovative improvements on the input side, including Mosaic data augmentation, CmBN, SAT self-adversarial training, etc., which are described in detail below.

1.1.1 Mosaic

Mosaic evolved from CutMix. CutMix stitches two images together for data augmentation; Mosaic extends this to four images that are randomly scaled, cropped, and arranged, dramatically enriching the dataset at once. Mosaic data augmentation looks like this:
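
As a rough illustration (not the official implementation), a simplified Mosaic sketch in Python could look like the following; the function name mosaic4 and the fixed canvas size are illustrative, and a real pipeline would also apply random scaling/cropping and remap the bounding-box labels of each source image onto the new canvas:

import random
import cv2
import numpy as np

def mosaic4(images, out_size=608):
    # Paste four images into one canvas around a random split point.
    assert len(images) == 4
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        # Crudely fit each source image into its quadrant.
        canvas[y1:y2, x1:x2] = cv2.resize(img, (x2 - x1, y2 - y1))
    return canvas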

1.1.2 CmBN

CmBN is an improved version of CBN, and CBN is in turn an improved version of BN. BN normalizes only the current mini-batch. CBN (Cross Batch Normalization) normalizes over the current mini-batch and the previous three. CmBN (Cross mini-Batch Normalization) instead collects statistics only from the four mini-batches inside a single large batch, keeping them isolated from other batches. The flow diagrams of BN, CBN, and CmBN are as follows:

1.1.3 SAT

SAT (Self-Adversarial Training) is another form of data augmentation, consisting of two stages:

(1) 1st stage: the neural network modifies the image data instead of updating its own weights, which can be understood as letting the network generate harder (adversarial) training images;

(2) 2nd stage: the neural network is then trained on the modified images in the normal way.
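
The article does not spell out the exact procedure, but a minimal PyTorch sketch of the idea is given below; the FGSM-style perturbation and the names model, loss_fn and epsilon are illustrative assumptions rather than the actual YOLOv4 training code:

import torch

def self_adversarial_step(model, images, targets, loss_fn, epsilon=0.01):
    # 1st stage: keep the weights fixed and ascend the loss w.r.t. the input
    # image, i.e. let the network "attack" the image instead of itself.
    images = images.clone().detach().requires_grad_(True)
    loss = loss_fn(model(images), targets)
    grad = torch.autograd.grad(loss, images)[0]
    adv_images = (images + epsilon * grad.sign()).detach()
    # 2nd stage: train on adv_images (with the original labels) as usual.
    return adv_images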

1.2 Backbone

1.2.1 CSPDarknet53

As we know, the backbone is Darknet19 in YOLOv2 and Darknet53 in YOLOv3. In YOLOv4, the backbone is upgraded again and is now called CSPDarknet53, which applies the CSP structure from the paper “CSPNet: A New Backbone that can Enhance Learning Capability of CNN” to Darknet53.

A comparison of CSPDarknet53 with some other excellent backbones in terms of parameter count and performance is as follows:

You can see that CSPDarknet53 has a higher FPS at the same input resolution, which indicates higher efficiency; it also has more parameters available for learning features, which usually means stronger feature-learning ability.
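
To make the CSP idea more concrete, here is a simplified PyTorch sketch; it shows the general CSP pattern (split, process one path, concatenate), not the exact CSPDarknet53 block, and nn.Mish requires PyTorch 1.9 or newer:

import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    # Split the feature map into two halves, send only one half through the
    # heavier residual-style blocks, then merge the paths by concatenation.
    def __init__(self, channels, num_blocks=2):
        super().__init__()
        half = channels // 2
        self.part1 = nn.Conv2d(channels, half, 1, bias=False)  # shortcut path
        self.part2 = nn.Conv2d(channels, half, 1, bias=False)  # processed path
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.Mish(),
                nn.Conv2d(half, half, 3, padding=1, bias=False), nn.BatchNorm2d(half), nn.Mish(),
            )
            for _ in range(num_blocks)
        ])
        self.merge = nn.Conv2d(2 * half, channels, 1, bias=False)

    def forward(self, x):
        a = self.part1(x)
        b = self.part2(x)
        for block in self.blocks:
            b = b + block(b)  # residual connection inside the CSP branch
        return self.merge(torch.cat([a, b], dim=1))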

1.2.2 Mish

The backbone is composed of many CBM blocks (Conv + BN + Mish) and residual blocks. A CBM block looks like this:

The Mish Activation Function is proposed in the paper Mish: A Self Regularized Non-monotonic Activation Function. Its mathematical expression is as follows:
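
Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))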

The function curve is shown below, where:

  • the blue curve is Mish;
  • the orange curve is softplus, i.e. ln(1 + e^x).

Of course, when implementing Mish you can split it into tanh and softplus, which is how it is often composed in TensorRT, for example:
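
A minimal PyTorch sketch of that decomposition (equivalent to the built-in torch.nn.Mish in recent PyTorch versions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # Mish(x) = x * tanh(softplus(x)); composing it from softplus and tanh
    # mirrors how it is often assembled from existing ops, e.g. in TensorRT.
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))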

Unlike ReLU, Mish has no hard break point at zero, so its gradient is smoother than ReLU's.

1.3 Neck

1.3.1 SPP

SPP (Spatial Pyramid Pooling) looks like this:

This structure actually already exists in yolov3-spp.cfg, but it was not made part of the standard model in the YOLOv3 era; in YOLOv4 it finally is. The SPP structure used here follows the paper “DC-SPP-YOLO: Dense Connection and Spatial Pyramid Pooling Based YOLO for Object Detection”, and its main purpose is to increase the receptive field.
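
A minimal PyTorch sketch of such an SPP block; the 5 / 9 / 13 kernel sizes follow the maxpool layers in yolov4.cfg, and stride 1 with matching padding keeps the spatial size unchanged, so the receptive field grows without shrinking the feature map:

import torch
import torch.nn as nn

class SPP(nn.Module):
    # Max-pool the same feature map at several kernel sizes, then concatenate
    # the results (plus the input itself) along the channel dimension.
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)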

1.3.2 FPN + PAN

The FPN structure also exists in YOLOv3. Taking an input resolution of 608 x 608 as an example, after the downsampling convolutions and upsampling in YOLOv3, a three-branch structure of 19 x 19, 38 x 38 and 76 x 76 feature maps is finally formed. The schematic diagram is as follows:

The difference between the YOLOv4 and YOLOv3 FPN is that a bottom-up structure is appended after the YOLOv3-style FPN, and this bottom-up path comes from PAN. The structure of the whole YOLOv4 FPN + PAN is shown as follows:

The PAN in YOLOv4 changes the addition used in the original PAN to concatenation. There is a real difference here: addition does not change the channel dimension, while concatenation does, as shown below:
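
A tiny sketch of the difference, assuming two feature maps of the same shape:

import torch

a = torch.randn(1, 256, 38, 38)
b = torch.randn(1, 256, 38, 38)

added = a + b                       # addition: channels stay at 256
concat = torch.cat([a, b], dim=1)   # concatenation: channels become 512
print(added.shape, concat.shape)    # (1, 256, 38, 38) and (1, 512, 38, 38)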

1.4 Dense Prediction

1.4.1 YOLO layers

The final YOLO prediction layers in YOLOv4 follow those of YOLOv3, but note that after passing through the FPN + PAN Neck described above, YOLOv4 has a detail that easily confuses people:

The last three YOLO layers of YOLOv3 are:

(1) The first YOLO layer: Feature map 19 x 19 ==> Mask = 6, 7, 8 ==> corresponding maximum anchor;

(2) The second YOLO layer: Feature map 38 x 38 ==> Mask = 3, 4, 5 ==> corresponding moderate anchor;

(3) The third YOLO layer: Feature map 76 x 76 ==> Mask = 0, 1, 2 ==> corresponding minimum anchor;

In YOLOv4, the order is reversed:

(1) The first YOLO layer: feature map 76 x 76 ==> mask = 0, 1, 2 ==> corresponds to the smallest anchors;

(2) The second YOLO layer: feature map 38 x 38 ==> mask = 3, 4, 5 ==> corresponds to the medium anchors;

(3) The third YOLO layer: feature map 19 x 19 ==> mask = 6, 7, 8 ==> corresponds to the largest anchors;

This difference requires special attention during development, and it’s easy to get the order wrong.

1.4.2 IOU Loss

YOLOv4 also innovates on the bounding box regression loss, adopting CIOU_Loss for regression, which makes the predicted boxes converge faster and more accurately.

Speaking of CIOU_Loss: box regression evolved from directly regressing the box coordinates to using IOU_Loss, followed by a series of refinements. The progression looks roughly like this: Smooth L1_Loss -> IOU_Loss -> GIOU_Loss -> DIOU_Loss -> CIOU_Loss.

  • Smooth L1_Loss: uses Smooth L1 to regress the center point or corner coordinates of the predicted box against the ground-truth box. The coordinates are regressed independently with no joint constraint, so the loss does not directly reflect how well the boxes overlap;
  • IOU_Loss: mainly considers the intersection over union between the predicted box and the ground-truth box. Problems remain: when IOU = 0 (the boxes do not overlap) there is no gradient to learn from, and boxes with the same IOU can still correspond to very different relative positions;
  • GIOU_Loss: on top of IOU_Loss, adds a term based on the smallest enclosing box, which solves the non-overlapping case, but the same loss value can still correspond to different configurations;
  • DIOU_Loss: further considers the overlap area and the distance between box centers on top of GIOU, covering more cases, but it is still not comprehensive enough: when the centers of several predicted boxes lie on the same circle around the ground-truth center, they cannot be distinguished;
  • CIOU_Loss: adds one more influence factor on top of DIOU_Loss, the aspect-ratio consistency between the predicted and ground-truth boxes, which makes the coverage of cases quite comprehensive (see the sketch below).
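
To make the final step concrete, below is a minimal, self-contained CIOU_Loss sketch in PyTorch; boxes are assumed to be (x1, y1, x2, y2) tensors, and this illustrates the formula rather than code taken from any particular YOLOv4 repository:

import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # pred, target: (N, 4) boxes in (x1, y1, x2, y2) format.
    # Plain IoU term.
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # DIoU term: squared center distance over the squared diagonal of the
    # smallest enclosing box.
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    center_dist = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    enc_w = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    enc_h = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    diag = enc_w ** 2 + enc_h ** 2 + eps

    # CIoU term: aspect-ratio consistency factor.
    w_p = (pred[:, 2] - pred[:, 0]).clamp(min=eps)
    h_p = (pred[:, 3] - pred[:, 1]).clamp(min=eps)
    w_t = (target[:, 2] - target[:, 0]).clamp(min=eps)
    h_t = (target[:, 3] - target[:, 1]).clamp(min=eps)
    v = (4 / math.pi ** 2) * (torch.atan(w_t / h_t) - torch.atan(w_p / h_p)) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)

    return 1 - iou + center_dist / diag + alpha * v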

OK, the above mainly introduced the principles and improvements of YOLOv4; now let's move on to the practice part.

2. YOLOv4 Implementation

Both the Darknet and the PyTorch implementations of YOLOv4 are practiced here.

2.1 Training

2.1.1 Darknet training

The Darknet training dataset is COCO. The process of preparing the COCO dataset is not covered here; it was described in detail in the previous YOLOv3 article, so readers who need it can refer back to that. Let's get started.

Prepare yolov4.cfg, coco.data, and coco.names, and create a backup folder in the yolov4 directory to store intermediate weights. The resulting directory layout looks like this:

Execute training instructions:

./darknet detector train cfg/yolov4/coco.data cfg/yolov4/yolov4.cfg

Of course, pre-trained weights can also be added:

./darknet detector train cfg/yolov4/coco.data cfg/yolov4/yolov4.cfg cfg/yolov4/yolov4.conv.137

The yolov4.conv.137 pre-trained weights can be found at: github.com/AlexeyAB/da…

Then you can see that the training has started:

After a long training period, you should see the network slowly converging:

2.1.2 PyTorch training

PyTorch is a widely used training framework, known for its dynamic graphs and flexibility. Naturally, there are PyTorch implementations of YOLOv4 as well.

I have organized the related YOLOv4-PyTorch project code; follow my public account [Jizhi Vision] and reply "YOLOv4" to get it.

The project I provide here also includes the VOC dataset format; you just need to prepare your training data in the provided format and you can run it easily.

To start training, simply execute train.sh:

./train.sh

This should be very friendly. You may want to modify some parameters in the train.py script, such as whether to use Mosaic data augmentation:
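
Purely as an illustration (the names below are hypothetical and may not match the actual train.py), such switches usually look something like this:

# Hypothetical training options -- check the real train.py for the actual names.
use_mosaic = True        # whether to apply Mosaic data augmentation
input_size = 608         # network input resolution
batch_size = 8
learning_rate = 1e-3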

Look at the training process:

2.2 Validation

All right, so let’s verify that the model we’ve trained works.

Here, the Darknet model is used for verification, and the test scene is a Spring Festival travel rush (chunyun) video. Execute the following command for detection:

./darknet detector demo cfg/yolov4/coco.data cfg/yolov4/yolov4.cfg cfg/yolov4/backup/yolov4.weights data/chunyun.mp4

The detection results are as follows:

You can see that the detection effect is still good.


I also wish the epidemic could dissipate as soon as possible and compatriots could go home safe and happy for the Spring Festival.


That's my share of the YOLOv4 algorithm design and practice. I hope it can be of some help to your learning.

