Series navigation:

  • Object detection (1) R-CNN: the first deep-learning object detection work
  • Object detection (2) SPP-Net: sharing the convolution computation
  • Object detection (3) Fast R-CNN: making the R-CNN model end-to-end trainable
  • Object detection (4) Faster R-CNN: replacing Selective Search with an RPN
  • Object detection (5) YOLOv1: opening the chapter of one-stage object detection
  • Object detection (6) YOLOv2: introducing anchors, Better, Faster
  • Understanding the box regression loss of object detection: IoU, GIoU, DIoU, CIoU principles and Python code
  • Focal Loss: pushing the one-stage algorithm to the peak

1. Motivation

Before RetinaNet, one-stage algorithms, led by the YOLO series and SSD, generally had lower accuracy in object detection than two-stage algorithms represented by Faster R-CNN. There are two main reasons for this:

  • Two-stage algorithms first use an RPN to generate a set of region proposals and then use Fast R-CNN to refine predictions on top of those proposals, so the results are naturally more precise.
  • Sample imbalance: the ratio of positive to negative samples is extremely unbalanced. Faster R-CNN keeps this ratio at 1:3 during training, while in one-stage algorithms it can be as extreme as 1:1000. Since easy samples make up the overwhelming majority, the loss, and therefore the gradient, is dominated by easy samples, and hard samples contribute only a small part.

The appearance of RetinaNet alleviates this problem to some extent and gives the one-stage approach better accuracy. This article mainly introduces the principle of RetinaNet.

Focal Loss for Dense Object Detection

2. RetinaNet principle

2.1 RetinaNet network structure and Anchor

RetinaNet is a one-stage algorithm: it simply omits the second stage and completes the whole object detection task with an RPN-style dense prediction network. Its structure is an FPN that extracts multi-scale features, with detection heads attached to those features to predict object classification and box regression, as shown in the figure above. The diagram in the paper still does not fully explain the network structure; if it is not clear enough, you can look at the following diagram.

The RetinaNet design modifies the FPN: P2 is removed because the P2 feature map is very large and consumes a lot of memory and computation, while P6 and P7 are obtained by extending upward from P5 with two additional downsampling layers. P3 to P7 correspond to features of different scales: P3 is better suited to detecting small objects, while P7 is better suited to detecting large objects. The authors design a corresponding anchor scheme for these scales; each feature point at each scale corresponds to 3 anchor sizes and 3 aspect ratios, 9 anchors in total, as shown in the following table:

| Base anchor size | Scale multipliers | Aspect ratios |
| --- | --- | --- |
| 32 (P3) | $\{2^0, 2^{1/3}, 2^{2/3}\}$ | {1:2, 1:1, 2:1} |
| 64 (P4) | $\{2^0, 2^{1/3}, 2^{2/3}\}$ | {1:2, 1:1, 2:1} |
| 128 (P5) | $\{2^0, 2^{1/3}, 2^{2/3}\}$ | {1:2, 1:1, 2:1} |
| 256 (P6) | $\{2^0, 2^{1/3}, 2^{2/3}\}$ | {1:2, 1:1, 2:1} |
| 512 (P7) | $\{2^0, 2^{1/3}, 2^{2/3}\}$ | {1:2, 1:1, 2:1} |
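
As a rough illustration, the 9 anchor shapes per level can be generated from the table like this (a minimal sketch; the function and variable names and the width/height convention are illustrative, not the paper's code):

```python
import itertools

# Per-level anchor generation following the table above; the
# level-to-size mapping P3..P7 -> 32..512 follows the paper.
base_sizes = {"P3": 32, "P4": 64, "P5": 128, "P6": 256, "P7": 512}
scales = [2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)]
ratios = [0.5, 1.0, 2.0]  # width:height ratios 1:2, 1:1, 2:1

def anchors_for_level(base_size):
    """Return the 9 (width, height) anchor shapes for one pyramid level."""
    shapes = []
    for scale, ratio in itertools.product(scales, ratios):
        area = (base_size * scale) ** 2      # anchor area at this scale
        height = (area / ratio) ** 0.5       # since width = height * ratio
        width = height * ratio
        shapes.append((round(width, 1), round(height, 1)))
    return shapes

print(anchors_for_level(base_sizes["P3"]))   # the 9 anchors attached to each P3 location
```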

Each scale's feature map is then followed by a detection head, consisting of a class subnet and a box subnet, which are responsible for classification and box prediction respectively. The detection heads attached to the five feature maps all share weights. The structures of the class subnet and box subnet are shown in the following figure:

The final output dimension of the class subnet is K·A per spatial position, where K is the number of categories to be classified (excluding background) and A = 9 is the number of anchors. The output dimension of the box subnet is 4·A, where the 4 values encode the center coordinates and the width and height, and A is again the number of anchors. This yields the final category and box position predictions.
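
A minimal PyTorch sketch of the two subnets described above (the 4 conv+ReLU blocks followed by a final conv follow the paper's description; the helper name, the 256-channel assumption for FPN features, and K = 80 are illustrative choices):

```python
import torch.nn as nn

def make_subnet(out_channels, in_channels=256):
    """Build one subnet: 4 conv+ReLU blocks, then a final 3x3 conv."""
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(in_channels, out_channels, 3, padding=1))
    return nn.Sequential(*layers)

K, A = 80, 9                          # e.g. 80 classes, 9 anchors per location
class_subnet = make_subnet(K * A)     # K*A classification outputs per spatial position
box_subnet = make_subnet(4 * A)       # 4*A box regression outputs per spatial position
# The same two subnets are applied to all of P3..P7 (shared weights).
```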

2.2 Division of positive and negative samples

  • IoU ≥ 0.5: positive sample
  • IoU < 0.4: negative sample (background)
  • IoU ∈ [0.4, 0.5): ignored during training (see the sketch below)
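
A minimal sketch of this assignment rule (the function name and 1/0/-1 labeling convention are illustrative; a full implementation would also record which ground-truth box each positive anchor matches, for the regression targets):

```python
import torch

def assign_anchors(ious, pos_thr=0.5, neg_thr=0.4):
    """Label anchors from an (num_anchors, num_gt) IoU matrix.
    Returns 1 for positives, 0 for negatives, -1 for ignored anchors."""
    max_iou, _ = ious.max(dim=1)              # best ground-truth IoU per anchor
    labels = torch.full_like(max_iou, -1)     # default: ignored
    labels[max_iou < neg_thr] = 0             # IoU < 0.4  -> negative
    labels[max_iou >= pos_thr] = 1            # IoU >= 0.5 -> positive
    return labels
```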

2.3 Focal Loss

As mentioned above, the reason the one-stage algorithm's accuracy falls short of the two-stage algorithm is sample imbalance, both the imbalance between positive and negative samples and the imbalance between easy and hard samples, and this is exactly what Focal Loss addresses. First, let's look at the standard cross-entropy loss function:


$$CE(p, y) = -[y \log(p) + (1 - y)\log(1 - p)]$$

It can be rewritten in a more compact form. We define $p_t$:

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$$

Then we have:

$$CE(p, y) = CE(p_t) = -\log(p_t)$$

2.3.1 Balanced cross entropy

Since the imbalance of positive and negative samples has been mentioned above, we introduce a weighting factor $\alpha \in [0, 1]$ for the positive class and $1 - \alpha$ for the negative class, applied to the loss as a weight to balance the difference in the number of positive and negative samples (where $\alpha_t = \alpha$ for $y = 1$ and $1 - \alpha$ otherwise, analogous to $p_t$):

$$CE(p_t) = -\alpha_t \log(p_t)$$

Expanding this formula:

$$CE(p, y) = -[\alpha\, y \log(p) + (1 - \alpha)(1 - y)\log(1 - p)]$$

In practice, $\alpha$ can be set according to the inverse class frequency, but the paper reports that the best value found is $\alpha = 0.25$.

2.3.2 Introducing weights for hard samples

To address the severe imbalance between easy and hard samples, which causes the loss to be dominated by easy samples, the paper introduces a modulating factor $(1 - p_t)^\gamma$, where $\gamma$ is a hyperparameter, to reduce the loss contribution of easy samples. Focal Loss is finally defined as:

$$FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$

After expansion, it takes the form:

$$FL(p, y) = -[\alpha\, y (1 - p)^\gamma \log(p) + (1 - \alpha)(1 - y)\, p^\gamma \log(1 - p)]$$
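
A minimal PyTorch sketch of this loss computed from raw class-subnet logits (the function name and the final reduction are my own choices; the paper normalizes the summed loss by the number of positive anchors):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # logits:  raw class-subnet outputs (any shape)
    # targets: 0/1 labels of the same shape, as float tensor
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
    p_t = p * targets + (1 - p) * (1 - targets)                 # p_t as defined above
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)     # alpha_t as defined above
    loss = alpha_t * (1 - p_t) ** gamma * ce                    # FL(p_t)
    return loss.sum()  # the paper divides this sum by the number of positive anchors
```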

The best hyperparameters found by the authors are α = 0.25 and γ = 2. Let's see how this hard-sample weighting works:

  • When the true label is 1 and the predicted probability is 0.9, the prediction is already good. The loss produced by cross entropy is about 400 times larger than that produced by Focal Loss, showing that Focal Loss sharply reduces the loss contribution of easy samples (see the check below).
  • When the true label is 1 and the predicted probability is 0.1, the prediction is poor, and the cross-entropy loss is only about 5 times larger than the Focal Loss, so the relative loss contribution of hard samples is increased.
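
A quick numeric check of these two cases (for y = 1 we have p_t = p and α_t = α = 0.25):

```python
import math

def ce(p_t):
    # standard cross entropy: -log(p_t)
    return -math.log(p_t)

def fl(p_t, alpha=0.25, gamma=2.0):
    # focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)
    return -alpha * (1 - p_t) ** gamma * math.log(p_t)

for p in (0.9, 0.1):  # y = 1, so p_t = p
    print(f"p_t = {p}: CE = {ce(p):.4f}, FL = {fl(p):.5f}, CE/FL ≈ {ce(p) / fl(p):.1f}")
# p_t = 0.9 -> CE/FL ≈ 400  (easy sample, heavily down-weighted)
# p_t = 0.1 -> CE/FL ≈ 4.9  (hard sample keeps most of its loss)
```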

3. Analysis of effects and advantages and disadvantages of RetinaNet

3.1 Effects of RetinaNet

As can be seen from the results reported in the paper, RetinaNet improves accuracy considerably; this is the combined effect of Focal Loss and the FPN.

3.2 Analysis of advantages and disadvantages

Advantages:

  • The accuracy has improved dramatically.
  • Focal Loss was introduced to reduce the impact of sample imbalance.
  • RetinaNet retains very good inference performance when the input resolution is reduced.

Disadvantages:

  • The Focal Loss introduced by RetinaNet is sensitive to label noise, so it places high demands on annotation quality. Mislabeled samples are treated as hard samples by Focal Loss, and these noisy samples then contribute heavily to the loss, harming the learning result.

Reference:

  1. Arxiv.org/pdf/1708.02…
  2. Zhangxu.blog.csdn.net/article/det…