YOLO series principles


This part mainly describes the principles of each version of the YOLO series, elaborating on them in detail from YOLOv1 to YOLOv5. First of all, let's look at the two classical detection paradigms in deep learning:

  • Two-stage: represented by the Faster-RCNN and Mask-RCNN series

  • One-stage: represented by the YOLO series

What is the difference between two-stage and one-stage? Looking at the whole picture: one-stage is done in a single step. We input an image and, after one pass through the network, we directly get the output result. Two-stage adds some intermediate steps compared with one-stage: we input an original image, first obtain some intermediate results, and then produce the final output. To put it more vividly, suppose we want to select one person to represent Anhui province. The two-stage approach is like first finding some promising candidates in each city of Anhui, and then choosing the most outstanding one from among those candidates. For details, please refer to the following figure:

Since the two methods are different, it is natural to discuss their advantages and disadvantages:

  • One-stage

    • Advantages: Very fast speed, suitable for real-time detection tasks
    • Disadvantages: Accuracy is usually lower
  • Two-stage

    • Advantages: Accuracy is usually higher
    • Disadvantages: Slow speed, not suitable for real-time detection tasks

In fact, their advantages and disadvantages are easy to understand: one-stage detection has no intermediate steps, so it is certainly very fast, but in terms of accuracy it is relatively weaker. We can take a look at the comparison (one-stage represented by YOLO and two-stage represented by Fast-RCNN):

As can be seen from the figure above, YOLO's mAP is lower than Fast-RCNN's, but its FPS is much higher than Fast-RCNN's. FPS refers to the detection speed of a network: the larger the FPS, the faster the detection. mAP refers to the comprehensive detection effect of the model: the larger the mAP, the better the effect.

🎈 🎈 🎈 🎈 🎈 🎈 🎈 🎈

There is a term mentioned above: mAP. It represents the comprehensive detection effect. There are several metrics that describe model performance, such as IOU, precision and recall (well, I did not know them either at first 🤐🤐🤐). These three metrics are introduced first:

  • IOU

    In fact, IOU is easy to understand. It represents the ratio of (the intersection of the ground truth and the predicted box) to (the union of the ground truth and the predicted box). The calculation formula of IOU is as follows:

Hmm, still not very easy to understand? Generally speaking, IOU measures the amount of overlap between the ground truth and the prediction: the more they overlap, the larger the IOU, and the better the detection effect!! You can refer to the following figure again for understanding, and to the small code sketch below:
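As a minimal sketch only (assuming the boxes are given as (x1, y1, x2, y2) corner coordinates; the function and example values below are just an illustration, not any particular library's API):

```python
def iou(box_a, box_b):
    """Compute IOU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two boxes of area 4 that overlap by 2 -> IOU = 2 / (4 + 4 - 2) = 1/3.
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))
```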

  • Precision && Recall

Let's start with their formulas (ok, I admit I didn't understand them at first 😜😜😜):
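Written out explicitly (TP, FP and FN are the counts explained with the example below):

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}
$$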

We will use an example to explain the meaning of TP, FP and FN in the above formulas, and then compute precision and recall.

Known: there are 100 students in the class, including 80 boys and 20 girls

Goal: Find all the girls

The result: 50 people were selected; all 20 girls were chosen, and 30 boys were mistakenly selected

Let's first look at the meaning of the four letters T, F, P and N in TP, FP, FN and TN, which may make things easier to understand

  • T — True: the judgment is correct
  • F — False: the judgment is wrong
  • P — Positives: the positive class (the thing you want to detect; in this example, a girl)
  • N — Negatives: the negative class (in this example, a boy)

Knowing what these letters mean, it's easy to explain the four combinations.

  • TP — True Positives: a positive sample correctly judged as positive; in this example, a girl judged as a girl.
  • FP — False Positives: a negative sample wrongly judged as positive; in this example, a boy judged as a girl.
  • FN — False Negatives: a positive sample wrongly judged as negative; in this example, a girl judged as a boy.
  • TN — True Negatives: a negative sample correctly judged as negative; in this example, a boy judged as a boy.

The above may be a little convoluted, but if you take a moment to study it, it’s easy. We can calculate the values of TP, FP, FN, TN in our example.

  • TP=20
  • FP=30
  • FN=0
  • TN=50

Ok, now that TP, FP, and FN have been calculated, precision and recall can be calculated.


Reading this far, I think you can see how precision and recall are calculated, but there may still be some confusion as to why precision and recall are defined by such formulas. Let's describe the two formulas one by one.

  • First, precision refers to the accuracy of the positive predictions: it is the ratio of (samples correctly classified as positive) to (all samples classified as positive, whether correctly or not). For our example, precision = 20/(20+30) = 2/5.

  • Recall (also called recall rate)

    The recall rate is the ratio of (samples correctly classified as positive) to (samples correctly classified as positive plus positive samples wrongly judged as negative). For our example, recall = 20/(20+0) = 1. Informally, recall reflects how many of the objects that should be detected are actually detected. For example, suppose an image has 10 objects to be detected: if one method detects all 10 objects, its recall is good; if another method detects only 8 of them, its recall is worse.


Knowing precision and recall, both of these indicators can represent the detection effect. In order to represent the detection effect more comprehensively, mAP was introduced. First, what is AP? By taking different confidence thresholds we obtain different precision and recall values; when the confidence thresholds are taken densely enough, we obtain many (precision, recall) points. Plotting these points as a curve, the area under the curve is the AP. As shown below:
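Besides the figure, here is a rough numerical sketch of the "area under the precision-recall curve" idea. It is a simplified trapezoidal integration, not the exact interpolation used by the VOC/COCO benchmarks, and the sample points are made up for illustration:

```python
import numpy as np

# Hypothetical (recall, precision) points collected by sweeping the confidence threshold.
recall    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 0.9, 0.8, 0.7, 0.5, 0.3])

# AP approximated as the area under the precision-recall curve.
ap = np.trapz(precision, recall)
print(f"AP is roughly {ap:.2f}")

# mAP is then the mean of the per-class AP values, e.g. (0.6 + 0.4) / 2 = 0.5.
```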

AP measures how well a class is detected, and mAP measures how well multiple classes are detected. It is calculated by averaging the AP values of all classes. For example, if there are two classes, the AP value of class A is 0.6 and that of class B is 0.4, then mAP= (0.6+0.4) /2=0.5.



yolo-v1

The overall architecture

YOLO, You Only Look Once, sounds good 🤙🏼. The name also reflects YOLO's high detection speed, which makes it suitable for real-time detection tasks. Let's first look at the overall network architecture of YOLO-V1, as shown in the figure below:

As can be seen from the figure above, in YOLO-V1:

  • Network input: a 448×448×3 color image.
  • Convolution layers: composed of several convolution layers and max-pooling layers (the pooling layers are not drawn), used to extract abstract features of the image.
  • Fully connected layers: composed of two fully connected layers, used to predict the location and class probability values of the target.
  • Network output: a 7×7×30 tensor of prediction results.


The specific implementation

As mentioned above, the input of YOLO-V1 is a 448×448×3 color image. Each image is divided evenly into a 7×7 grid, and each grid cell is responsible for predicting the targets whose center points fall inside that cell.

The specific implementation process is as follows:

  1. An image is divided into S×S grid cells. If the center of an object falls in a cell, that cell is responsible for predicting the object. 【 YOLO-V1 uses a 7×7 grid 】
  2. Each grid cell predicts B bounding boxes, and each bounding box is predicted with 5 values: (x, y, w, h) and a confidence. 【 YOLO-V1 predicts 2 bboxes per cell 】
  3. Each grid cell also predicts class information, denoted as C classes. 【 YOLO-V1 predicts 20 categories, like cats, dogs, cars, etc. 】
  4. In general, there are S×S grid cells, and each cell predicts B bounding boxes and C class probabilities, so the network output is a tensor of shape S×S×(5B + C). For YOLO-V1, the network output is a 7×7×30 tensor.

If you already know the YOLO algorithm, you probably already understand the core idea of YOLO-V1 from the above. However, if you are seeing YOLO for the first time, or if you are not familiar with its principle, it may still be a little confusing. The following describes the details of YOLO-V1 based on the network architecture.

Let's post this figure again and analyze it from the network architecture diagram.

  • The input layer

    The input layer is a 448×448×3 color image. YOLO-V1 requires the image size to be 448×448 because the network ends with two fully connected layers, and a fully connected layer requires a fixed-size vector as input [the dimensions of the weight matrix W and the bias vector b in a fully connected layer are fixed]. So the original images must all be resized to a consistent size.

  • Convolution layer

    The convolution part is just convolution layer after convolution layer; there is not much to say about it. If you are not familiar with CNNs, you can read this article. By the way, it is from the founder of July, and I think his machine learning articles are really good: he explains abstract algorithms in plain language that gives you that sudden "oh, so that's what it is!!" moment.

  • The fully connected layers

    There are two fully connected layers. Emmm… there is actually not much to say here either; if you don't understand them, see the article recommended above. But take a look at the output dimension of the last fully connected layer: 1470×1. Does it seem a little strange? We have hardly seen an output of this dimension before. It is explained in the description of the network output below.

  • Output layer

    As mentioned above, the last fully connected layer has an output of 1470×1. What is the use of such a dimension? Actually, it is for the network output: we require the output to be a 7×7×30 tensor, and 7×7×30 is exactly 1470, so the 1470×1 output of the fully connected layer can simply be reshaped to obtain the 7×7×30 network output. Now let's look at the 7×7×30 output itself: why this dimension? Here's how:

    The input image of YOLO-V1 is divided into a 7×7 grid, and the 7×7 in the output tensor corresponds to that 7×7 grid of the input image. In other words, we can view the 7×7×30 tensor as 49 thirty-dimensional vectors, one for each grid cell of the input image. What information does this 30-dimensional vector contain? As shown below (a small parsing sketch follows this list):

    • The probability of 20 objects

      The 20 object probabilities indicate that YOLO-V1 supports 20 different object classes (such as cats, dogs, cars, etc.); they give the probability that the object in the corresponding grid cell belongs to each of the 20 classes.

    • Confidence of 2 bboxes

      The confidence of a bbox indicates both whether it contains an object and how accurate its location is. A high confidence means there is an object and its position is relatively accurate; a low confidence means there may be no object, or even if there is one, the position deviation is large.

    • 2 bbox locations

      A bbox position needs 4 values: (center_x, center_y, width, height), that is, the x and y coordinates of the center point of the bounding box plus its width and height. A total of 8 values are therefore needed to represent the positions of the 2 bounding boxes.
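To make this layout concrete, here is a minimal sketch of reshaping the 1470-dimensional fully connected output and splitting one grid cell's 30-dimensional vector into its three parts. The ordering of the groups (classes first, then confidences, then boxes) is only an assumption for illustration; the actual layout depends on the implementation:

```python
import numpy as np

S, B, C = 7, 2, 20                             # grid size, boxes per cell, classes
fc_out = np.random.rand(S * S * (B * 5 + C))   # stand-in for the 1470-dim FC output

pred = fc_out.reshape(S, S, B * 5 + C)         # 7 x 7 x 30

cell = pred[3, 4]                              # the 30-dim vector of one grid cell
class_probs = cell[:C]                         # 20 class probabilities
confidences = cell[C:C + B]                    # 2 bbox confidences
boxes = cell[C + B:].reshape(B, 4)             # 2 bboxes, each (cx, cy, w, h)

print(class_probs.shape, confidences.shape, boxes.shape)   # (20,) (2,) (2, 4)
```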



Loss function

The loss function is mainly composed of three parts, namely coordinate prediction loss, confidence prediction loss and class prediction loss.
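The loss image is not shown here; for reference, the sum-of-squares loss in the YOLO-V1 paper has roughly the following form (coordinate terms, confidence terms for cells with and without objects, and a class term), where $\mathbb{1}_{ij}^{obj}$ indicates that box $j$ of cell $i$ is responsible for an object, and the paper uses $\lambda_{coord}=5$, $\lambda_{noobj}=0.5$:

$$
\begin{aligned}
\mathcal{L} ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 & + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
 & + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2
   + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\
 & + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$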


Advantages and limitations of YOLO-V1

  • advantages

    • Fast detection speed
    • Strong transferability (generalizes well to new domains)
  • disadvantages

    • Input size is fixed: since the output layer is a fully connected layer, the trained YOLO model only supports, at detection time, the same input resolution as the training images; images of other resolutions need to be scaled to this fixed resolution;

    • Detection of small, densely packed objects is poor: although each grid cell can predict 2 bounding boxes, only the bbox with the highest IOU is finally selected as the detection output, so each cell can predict at most one object. When objects take up a small proportion of the picture, such as herds of animals or flocks of birds, a single cell may contain multiple objects, but only one of them can be detected.

    • YOLO-V1 has difficulty handling multi-label tasks

    • YOLO-V1 has a relatively large localization error


yolo-v2

The overall idea of YOLO-V2 is basically the same as that of YOLO-V1, but many improvements have been made, as shown in the figure below. It can be seen that with these improvements the mAP of the network basically keeps increasing; in the end, the mAP of YOLO-V2 reaches 78.6, while that of YOLO-V1 only reaches 63.4. These changes are described below.


Batch Normalization

V2 discards dropout (it was used in the fully connected layers, and V2 no longer uses fully connected layers) and instead adds Batch Normalization (BN) after the convolutions. What is BN? It is essentially normalization: the input of each layer of the network is normalized, which makes convergence easier. Nowadays almost all convolutional networks include a batch normalization step.
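Concretely, for a mini-batch with mean $\mu_B$ and variance $\sigma_B^2$, BN normalizes each input and then rescales it with two learnable parameters $\gamma$ and $\beta$:

$$
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta
$$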

Why is BN so helpful? Let me give another everyday example. Suppose we make a three-year plan for our classmate Xiao Wang 🤖🤖🤖, whose final goal is to become one of the city's top football players within three years. But measuring him only against that distant three-year goal is not very effective, so instead we give Xiao Wang a comprehensive check-up every year, see where he is falling short, and make adjustments, so that he reaches as high a level as possible. BN is similar to this yearly check on Xiao Wang: in the network, we apply a BN operation after each convolution, which makes the network train better and converge more easily.


Hi-res Classifier (High Resolution Classifier)

When discussing V1, we said that V1 takes 448*448 images as input, but that is the image size used at test time; during training V1 actually used 224*224 images, which can make the model inconsistent and hurt performance. V2 therefore adds 10 extra epochs of fine-tuning at 448*448 during training, which improves V2's mAP by about 4 percentage points.


new network

The network structure in V2 is changed to the DarkNet-19 network (with 19 convolutional layers). It can be seen that there is no fully connected layer in the network, and downsampling is performed 5 times. The actual input of the network is 416*416. The network also uses many convolutions with 1*1 kernels, which saves a lot of parameters.


Anchor boxes

In V1, we said that each grid cell predicts 2 bboxes. However, this often causes a problem: when there are several objects in one cell, not all of them can be detected, i.e. some objects are missed, which leads to a low recall. In V2, each grid cell instead predicts 5 bboxes, which is adopted to reduce this low-recall situation.

The following figure shows the effect after the prior boxes are added. It can be seen that mAP actually decreases slightly [the change is so small that it can be considered almost unchanged], but the recall of the network increases by 7 percentage points.


Dimension Priors

As mentioned above, in V2 each grid cell predicts 5 bboxes, but the sizes of these 5 prior boxes are not given at random; they are obtained by clustering the boxes in the training images. The object boxes in the original images are grouped into 5 clusters with the k-means algorithm, and the mean of each cluster is taken as a prior-box size. The prior-box sizes obtained this way match the actual data better, so the detection effect is better.
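A minimal sketch of this clustering idea is below. Note that the YOLO-V2 paper clusters with a 1 - IOU distance rather than plain Euclidean distance; the sketch uses ordinary k-means on (width, height) pairs only to illustrate the idea, and the box data is made up:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (width, height) pairs of ground-truth boxes from a training set.
wh = np.array([[30, 60], [35, 70], [120, 80], [110, 90],
               [300, 250], [280, 260], [60, 30], [55, 35],
               [200, 180], [210, 170]], dtype=float)

# Cluster the box dimensions into 5 groups; the cluster centers
# (mean width/height of each group) become the 5 prior-box sizes.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(wh)
print(np.round(kmeans.cluster_centers_, 1))
```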


location prediction

In YOLO-V1, the position of the bounding box is obtained indirectly by predicting offset values t_x, t_y between the bounding box and the ground truth. The formula is as follows:

This formula is unconstrained, so the predicted bounding box can easily drift in any direction: the bounding box predicted at any position can end up anywhere in the image, which makes the model unstable.

So YOLOv2 makes a small change to this approach: it predicts the offset of the bounding box center relative to the top-left corner coordinates (c_x, c_y) of the grid cell. Meanwhile, in order to constrain the center of the bounding box to stay within the current cell, the sigmoid function is used to normalize t_x, t_y, constraining their values to 0-1, which makes model training more stable.
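For reference, the constrained prediction formulas from the YOLOv2 paper are as follows, where $(c_x, c_y)$ is the top-left corner of the grid cell, $(p_w, p_h)$ is the prior-box size, and $\sigma$ is the sigmoid function:

$$
b_x = \sigma(t_x) + c_x, \qquad b_y = \sigma(t_y) + c_y, \qquad b_w = p_w e^{t_w}, \qquad b_h = p_h e^{t_h}
$$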


passthrough

This part involves the concept of the receptive field, which is not explained here; if you are not familiar with it, please refer to the relevant material. Still, the relevant properties and conclusions about the receptive field should be stated: in convolution, we often prefer to replace a large convolution kernel with several small ones (their receptive fields are the same, but small kernels need fewer network parameters). The deeper a convolution layer is in the network, the larger its receptive field, so it sees more global information about the original image, but that makes it harder to detect small objects in the image. At this point we would also like to keep some feature maps with a smaller receptive field (from earlier convolution layers, where the receptive field is small), so that small targets can still be detected. The specific approach is as follows:

It can be seen that V2 splits the feature map from an earlier convolution layer into 4 pieces and stacks them onto the feature map of the last layer to obtain the final output. In this way, the result contains feature maps with both large and small receptive fields, achieving a good detection effect for both large and small targets in the image.
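A minimal sketch of this "split into 4 pieces and stack on the channels" operation (often called space-to-depth) is below. The 26×26×512 to 13×13×2048 shapes match the commonly cited YOLO-V2 passthrough, but the NumPy implementation here is only an illustration, not the original Darknet code:

```python
import numpy as np

def passthrough(x, stride=2):
    """Split each stride x stride spatial block into channels (space-to-depth)."""
    h, w, c = x.shape
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)                  # (h/2, w/2, 2, 2, c)
    return x.reshape(h // stride, w // stride, stride * stride * c)

fine = np.random.rand(26, 26, 512)     # earlier feature map, smaller receptive field
coarse = np.random.rand(13, 13, 1024)  # last feature map, larger receptive field

reorg = passthrough(fine)                          # (13, 13, 2048)
fused = np.concatenate([reorg, coarse], axis=-1)   # (13, 13, 3072)
print(reorg.shape, fused.shape)
```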


multi-scale

Compared with V1, V2 has no fully connected layers, only convolution layers, which enables the network to adapt to inputs of various sizes. Unlike V1 training, where the input image size is fixed, V2 fine-tunes (changes) the network input size after every few iterations during training. This gives the model a more comprehensive detection capability across scales. Usually the minimum image size is 320×320 and the maximum image size is 608×608.
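A rough sketch of the multi-scale training schedule described in the YOLO-V2 paper: every 10 batches, a new input size is drawn from the multiples of 32 between 320 and 608 (the training-step body here is only a placeholder):

```python
import random

scales = list(range(320, 608 + 1, 32))    # 320, 352, ..., 608 (all multiples of 32)

input_size = 416
for batch_idx in range(100):
    if batch_idx % 10 == 0:
        input_size = random.choice(scales)    # pick a new network input size
    # ... resize the batch to (input_size, input_size) and run one training step ...
```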


yolo-v3

Let's take a look at how V3 compares with other network models!! I can't help laughing when I see this picture; the author really is funny. V3 is drawn into the second quadrant (the origin is at 50) [crushing the competition across quadrants 😬😬😬], which is enough to show how strong V3 is!!


Multiple scale

When explaining the improvements of YOLO-V2, we talked about using passthrough to detect small target objects more effectively, but in practice the effect was still not very good. YOLO-V3 improves on this: it detects with prior boxes at multiple scales. As shown in the figure, different prior boxes are used for the network outputs with different receptive fields; prior boxes of three sizes are designed, and each size has three kinds of bboxes, that is, nine prior boxes in total.
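As a concrete illustration of "three sizes, three bboxes each", the sketch below groups nine prior boxes over the three output scales. The anchor values are the approximate COCO anchors reported for YOLOv3 with a 416×416 input, and the 13/26/52 grid sizes follow from dividing 416 by the strides 32/16/8:

```python
# Nine (width, height) priors, roughly as reported for YOLOv3 on COCO (416x416 input).
anchors = [(10, 13), (16, 30), (33, 23),        # small objects
           (30, 61), (62, 45), (59, 119),       # medium objects
           (116, 90), (156, 198), (373, 326)]   # large objects

# Three detection scales: the finer the grid, the smaller the anchors it uses.
scales = {52: anchors[0:3],   # 52x52 output, stride 8
          26: anchors[3:6],   # 26x26 output, stride 16
          13: anchors[6:9]}   # 13x13 output, stride 32

for grid, priors in scales.items():
    print(f"{grid}x{grid} grid uses priors: {priors}")
```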


Resnet (Residual Network)

ResNet (residual networks) should be very familiar to us, since it was first proposed by Chinese researchers. In deep neural network training, experience suggests that as the network gets deeper, the model should in theory achieve better results. But experiments show that deep neural networks suffer from a degradation problem, so people thought neural networks could only go so deep. Then a new network was proposed: ResNet. Its principle is actually easy to understand; it is a bit like writing an if statement: after each layer is added, a judgment is made, and if the result is good it is kept, while if the result is bad it is effectively discarded. Residual connections are now basically standard in network models.
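The core of a residual block is simply adding the input back onto the output of a couple of convolutions, so a layer that learns nothing useful degenerates to roughly an identity mapping instead of hurting the result. A minimal PyTorch-style sketch (an illustration in the spirit of the DarkNet-53 block, not its exact code):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): if F(x) learns nothing useful, the block still passes x through."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels // 2)
        self.conv2 = nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.act(self.bn2(self.conv2(out)))
        return x + out                      # the skip connection

x = torch.randn(1, 64, 52, 52)
print(ResidualBlock(64)(x).shape)           # torch.Size([1, 64, 52, 52])
```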


Multilabel classification

YOLO-V3 improves the single-label classification of YOLO-V2 into multi-label classification for class prediction, and in the network structure the softmax layer used for classification in YOLO-V2 is replaced with independent logistic classifiers. In YOLO-V2, the algorithm assumes that a target belongs to only one category and assigns it to the category with the maximum score in the network output. However, in some complex scenes a single target may belong to multiple categories. For example, in a traffic scene, the type of a certain target may belong to both "car" and "truck". If softmax is used to classify it, softmax assumes the target belongs to only one category, so the target will be identified only as a car or only as a truck; this classification style is called single-label classification. If the network output can determine that the target is both a car and a truck, this is called multi-label classification. To achieve multi-label classification, a logistic classifier is used to perform binary classification for each category. The logistic classifier mainly uses the sigmoid function, which constrains the output to between 0 and 1; if the output value for a class, after passing through this function, is greater than the set threshold, the target in the corresponding box is judged to belong to that class.
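A tiny numerical sketch of the difference: softmax forces the scores to compete (one winner), while independent sigmoids let several classes exceed the threshold at once. The class names, scores and the 0.5 threshold are made up for illustration:

```python
import numpy as np

classes = ["car", "truck", "person"]
logits = np.array([2.0, 1.8, -1.0])        # raw scores for the three classes

# Softmax: single-label, only the argmax class is kept.
softmax = np.exp(logits) / np.exp(logits).sum()
print("softmax:", dict(zip(classes, softmax.round(2))), "->", classes[softmax.argmax()])

# Independent sigmoids: multi-label, every class above the threshold is kept.
sigmoid = 1.0 / (1.0 + np.exp(-logits))
kept = [c for c, p in zip(classes, sigmoid) if p > 0.5]
print("sigmoid:", dict(zip(classes, sigmoid.round(2))), "->", kept)
```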


The network architecture

Compared with the backbone network of YOLOv2, the backbone of YOLOv3 is greatly improved. Using the idea of residual networks, YOLOv3 upgrades the original DarkNet-19 to DarkNet-53. DarkNet-53 consists primarily of 1×1 and 3×3 convolution layers, each followed by a batch normalization layer and a Leaky ReLU, which are added to prevent overfitting. A convolution layer, a batch normalization layer and a Leaky ReLU together make up the basic convolution unit DBL in DarkNet-53. DarkNet-53 gets its name because it contains 53 such DBL units.
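A minimal sketch of the DBL unit (again PyTorch-style, as an illustration rather than the original Darknet code):

```python
import torch.nn as nn

def dbl(in_ch, out_ch, k):
    """The basic DarkNet-53 unit: Conv2d + BatchNorm2d + LeakyReLU ('DBL')."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
    )

# DarkNet-53 stacks many such 1x1 and 3x3 units (plus residual connections),
# e.g. two hypothetical consecutive units:
stem = nn.Sequential(dbl(3, 32, 3), dbl(32, 64, 3))
```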


yolo-v4

Between V3 and V4 the author changed. At the time, Redmon, the author of the first three versions, issued a statement on Twitter: basically, because YOLOv3 had been used for some military purposes that he did not want to see, he was quitting CV. This also reflects, from the side, how powerful YOLO-V3 is. YOLO-V4 was born in 2020, when Alexey Bochkovskiy and others took over the YOLO series.

YOLO-V4 ran extensive tests on many tricks commonly used in deep learning and finally chose the useful ones: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss. V4 adopts a great many techniques, almost all of which first appeared in various state-of-the-art algorithms. I have not read those papers and am not particularly clear on the details, so I will not summarize YOLO-V4 here (my summary would probably contain many inaccurate descriptions 🤡). However, I have read many articles online, and here is one I think explains it very clearly; it also lists the sources of these algorithms, and you can read them yourself if you are interested. The link is: yolo- V4

The original YOLO-V4 paper is not listed in the link above, so its link is attached here: Paper

yolo-v5

Whew, finally we come to V5. The content above doesn't look like much, but it still took me more than 2 days 😭😭😭. I can finally see the finish line 🚀

Sure enough, I can't be lazy. I didn't write the YOLO-V4 part above myself, and now it seems I don't want to write V5 myself either. However, the saying "do your own things yourself" came to mind, yet I still decided to just put the link 🙈🙈🙈 (I'm really not being lazy; this person simply wrote it too well, with plenty of pictures. I don't think I could write it that well with my current knowledge reserve, so I would rather borrow from others!! It also means I can find the good article again when I need to re-read it.)

The link is as follows: yolo- V5

Swish swish duang give it a thumbs up