In order to efficiently learn accurate bounding boxes and their distributions, the paper extends Focal Loss and proposes Generalized Focal Loss (GFL), which can optimize continuous-value targets. It comprises Quality Focal Loss (QFL) and Distribution Focal Loss (DFL): QFL learns a better joint representation of classification score and localization quality, while DFL provides more informative and accurate predictions by modeling bounding-box locations with a general distribution. Experiments show that GFL can improve the performance of all one-stage detectors.
Source: Xiao Fei’s algorithm engineering notes public account
Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection
- Paper address: arxiv.org/abs/2006.04…
- Paper code: github.com/implus/GFoc…
Introduction
At present, dense detectors (one-stage detectors) are the mainstream direction in object detection. This paper mainly discusses two aspects:
- Bounding-box representation: this can be regarded as the network's output for the position of the predicted box. Conventional methods model it as a simple Dirac delta distribution, i.e. they directly output the position result. Some methods instead model it as a Gaussian distribution, outputting a mean and a variance that represent the position result and its uncertainty respectively, which provides additional information.
- Localization quality estimation: some recent studies add an extra localization-quality prediction, such as the IoU score in IoU-Net and the centerness score in FCOS. The localization quality and the classification score are then combined into the final score.
After analysis, the paper finds that the two methods mentioned above have the following problems:
- Localization quality estimation and classification score are actually incompatible: first, the two are usually trained independently but used together at inference time. Second, localization quality is trained only on positive sample points, so negative sample points may be assigned high localization quality; this gap between training and testing degrades detection performance.
- The bounding-box representation is inflexible: most algorithms model it as a Dirac delta distribution, which ignores the ambiguity and uncertainty in the dataset: the network only outputs a result without indicating how reliable that result is. Although some methods use a Gaussian distribution instead, a Gaussian is too simple and crude to reflect the true distribution of the bounding box.
In order to solve the above two problems, the paper proposes solutions respectively:
- For localization quality estimation, the paper merges it directly into the classification score: the category vector is retained, but the score of each category now means the IoU between the prediction and the GT. In addition, this formulation can be trained on both positive and negative samples, so there is no gap between training and testing.
- For the bounding-box representation, a general distribution is used for modeling, with no constraints imposed. This not only yields reliable and accurate predictions, but also reveals the underlying true distribution. As shown in the figure above, when there is ambiguity or uncertainty the distribution appears as a flat, smooth curve; otherwise it appears as a sharp peak.
In fact, using both of the above strategies raises an optimization problem. In conventional one-stage detectors, the classification branch is optimized with Focal Loss, but Focal Loss targets discrete classification labels. After merging localization quality into the classification score, the output becomes a continuous, category-dependent IoU score, and Focal Loss no longer applies directly. The paper therefore extends Focal Loss to GFL (Generalized Focal Loss), which can handle the global optimization of continuous-value targets. GFL has two concrete forms, QFL (Quality Focal Loss) and DFL (Distribution Focal Loss): QFL focuses on hard samples while predicting the continuous score of the corresponding category, and DFL provides more informative and accurate position predictions by modeling bounding-box locations with a general distribution. Overall, GFL has the following advantages:
- To eliminate the differences between training and testing of additional quality estimation branches, a simple and efficient joint prediction strategy is proposed.
- It can flexibly model the true distribution of the bounding box and provide more informative and accurate position predictions.
- It improves the performance of all one-stage detection algorithms without introducing extra overhead.
Method
Focal Loss (FL)
FL is mainly used to address the imbalance between positive and negative samples in one-stage object detection:

$$\mathbf{FL}(p) = -(1-p_t)^{\gamma}\log(p_t),\quad p_t=\begin{cases}p, & \text{when } y=1 \\ 1-p, & \text{when } y=0\end{cases}$$

It contains the standard cross-entropy part $-\log(p_t)$ and the scaling-factor part $(1-p_t)^{\gamma}$, which automatically down-weights easy samples so that training focuses on hard samples.
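As a concrete illustration, FL can be sketched for a single prediction in plain Python (a minimal scalar sketch; the function name and interface are illustrative, not from the paper):

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Focal loss for a single binary prediction.

    p: predicted probability of the positive class, 0 < p < 1
    y: binary ground-truth label, 1 or 0
    gamma: focusing parameter that down-weights easy samples
    """
    # p_t is the probability assigned to the ground-truth class
    p_t = p if y == 1 else 1.0 - p
    # (1 - p_t)^gamma shrinks the loss of easy (high p_t) samples
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

With `gamma=0` the scaling factor disappears and the expression reduces to standard cross entropy.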
Quality Focal Loss (QFL)
Since FL only supports discrete labels, its idea is extended to the continuous labels that merge classification and localization quality. First, the cross-entropy part $-\log(p_t)$ is expanded to its complete form $-((1-y)\log(1-\sigma)+y\log(\sigma))$. Second, the scaling factor $(1-p_t)^{\gamma}$ is generalized to the absolute difference $|y-\sigma|^{\beta}$ between the prediction $\sigma$ and the continuous label $y$. Combining the two gives QFL:

$$\mathbf{QFL}(\sigma) = -|y-\sigma|^{\beta}\big((1-y)\log(1-\sigma)+y\log(\sigma)\big)$$
$\sigma=y$ is the global minimum of QFL.
The hyperparameter $\beta$ in the scaling factor controls how fast the weight decays, as shown in the figure above. Assuming the continuous target label is $y=0.5$, the farther a prediction is from the label, the larger its weight; as the prediction approaches the label, the weight tends to 0, similar to FL.
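Under the same scalar assumptions as the FL sketch above, QFL can be written as (a minimal illustration; the names are hypothetical):

```python
import math

def quality_focal_loss(sigma, y, beta=2.0):
    """QFL for a single prediction.

    sigma: predicted joint classification-quality score in (0, 1)
    y: continuous IoU label in [0, 1] (0 for negative samples)
    beta: controls how fast the weight of easy samples decays
    """
    eps = 1e-12  # guards log(0)
    # complete cross-entropy form, valid for continuous labels
    ce = -((1.0 - y) * math.log(1.0 - sigma + eps) + y * math.log(sigma + eps))
    # |y - sigma|^beta down-weights predictions already close to the label
    return (abs(y - sigma) ** beta) * ce
```

The loss vanishes at $\sigma=y$, matching the stated global minimum.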
Distribution Focal Loss (DFL)
Like other one-stage detectors, the paper takes the distances from the current position to the target's boundaries as the regression targets. Conventional methods model the regression target $y$ as a Dirac delta distribution, which satisfies $\int_{-\infty}^{+\infty}\delta(x-y)dx=1$, so the label $y$ can be recovered in integral form:

$$y=\int_{-\infty}^{+\infty}\delta(x-y)\,x\,dx$$
As mentioned above, this representation does not reflect the true distribution of the bounding box and provides no extra information, so the paper instead expresses it as a general distribution $P(x)$. Given the range $[y_0, y_n]$ of the label $y$, the predicted value $\hat{y}$ can be obtained from the modeled general distribution in the same way as with the Dirac delta distribution:

$$\hat{y}=\int_{-\infty}^{+\infty}P(x)\,x\,dx=\int_{y_0}^{y_n}P(x)\,x\,dx$$
To be compatible with neural networks, the continuous range $[y_0, y_n]$ is discretized into the set $\{y_0, y_1, \cdots, y_i, y_{i+1}, \cdots, y_{n-1}, y_n\}$ with interval $\Delta=1$, so the prediction $\hat{y}$ can be expressed as:

$$\hat{y}=\sum_{i=0}^{n}P(y_i)\,y_i$$
$P(x)$ can be implemented simply with a softmax layer of $n+1$ outputs, and the prediction $\hat{y}$ can then be trained end-to-end with conventional regression losses such as Smooth L1, IoU Loss, and GIoU Loss.
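The discretized prediction can be sketched as a softmax expectation (plain-Python sketch under the assumption $y_0=0$ and $\Delta=1$; the names are illustrative):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predicted_distance(logits, y0=0.0, delta=1.0):
    """Expectation of the discrete distribution: y_hat = sum_i P(y_i) * y_i."""
    probs = softmax(logits)
    return sum((y0 + i * delta) * p for i, p in enumerate(probs))
```

A uniform distribution over bins 0..4 yields the midpoint 2.0, while a distribution sharply peaked at one bin reduces to that bin's value, mimicking the Dirac delta case.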
In fact, however, the same integral result $y$ can be obtained from many different distributions, which reduces the efficiency of network learning. Considering that more of the distribution mass should concentrate near the regression target $y$, the paper proposes DFL to force the network to increase the probabilities of $y_i$ and $y_{i+1}$, the two values closest to $y$. Since regression prediction does not involve positive/negative sample imbalance, DFL only needs the cross-entropy part:

$$\mathbf{DFL}(\mathcal{S}_i, \mathcal{S}_{i+1}) = -\big((y_{i+1}-y)\log(\mathcal{S}_i)+(y-y_i)\log(\mathcal{S}_{i+1})\big)$$
The global optimum of DFL is $\mathcal{S}_i=\frac{y_{i+1}-y}{y_{i+1}-y_i}$, $\mathcal{S}_{i+1}=\frac{y-y_i}{y_{i+1}-y_i}$, which makes $\hat{y}$ infinitely close to the label $y$.
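DFL can likewise be sketched over integer bins (a minimal sketch assuming $y_0=0$ and $\Delta=1$; the interface is illustrative):

```python
import math

def distribution_focal_loss(probs, y):
    """DFL: raise the probabilities of the two bins surrounding y.

    probs: discrete probabilities over integer bins 0..n (softmax output)
    y: continuous regression target with 0 <= y <= n
    """
    eps = 1e-12  # guards log(0)
    y_i = int(math.floor(y))   # left integer neighbour of y
    y_i1 = y_i + 1             # right integer neighbour of y
    # cross entropy weighted by the distance of y to each neighbour
    return -((y_i1 - y) * math.log(probs[y_i] + eps)
             + (y - y_i) * math.log(probs[y_i1] + eps))
```

For a target like y = 3.3, placing 0.7 of the mass on bin 3 and 0.3 on bin 4 gives a lower loss than spreading the mass evenly over those bins, which is exactly the global optimum stated above.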
Generalized Focal Loss (GFL)
QFL and DFL can be unified as GFL. Let the predicted probabilities of $y_l$ and $y_r$ be $p_{y_l}$ and $p_{y_r}$ respectively, so the final prediction is $\hat{y}=y_l p_{y_l}+y_r p_{y_r}$. The GT label $y$ satisfies $y_l\le y\le y_r$. Taking $|y-\hat{y}|^{\beta}$ as the scaling factor, the GFL formula is:

$$\mathbf{GFL}(p_{y_l}, p_{y_r}) = -|y-(y_l p_{y_l}+y_r p_{y_r})|^{\beta}\big((y_r-y)\log(p_{y_l})+(y-y_l)\log(p_{y_r})\big)$$
The global optimum of GFL is $p^{*}_{y_l}=\frac{y_r-y}{y_r-y_l}$, $p^{*}_{y_r}=\frac{y-y_l}{y_r-y_l}$.
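Putting the pieces together, the general form can be sketched for one pair of neighbouring values (a plain-Python illustration; argument names are hypothetical):

```python
import math

def generalized_focal_loss(p_l, p_r, y_l, y_r, y, beta=2.0):
    """GFL for one pair of neighbouring values with y_l <= y <= y_r.

    p_l, p_r: predicted probabilities of y_l and y_r
    beta: scaling-factor exponent, as in QFL
    """
    eps = 1e-12  # guards log(0)
    y_hat = y_l * p_l + y_r * p_r  # expectation of the two-point distribution
    ce = -((y_r - y) * math.log(p_l + eps) + (y - y_l) * math.log(p_r + eps))
    # |y - y_hat|^beta down-weights predictions whose expectation is near y
    return (abs(y - y_hat) ** beta) * ce
```

At the optimum $p_{y_l}=\frac{y_r-y}{y_r-y_l}$, $p_{y_r}=\frac{y-y_l}{y_r-y_l}$, the expectation equals $y$ and the loss vanishes.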
FL, QFL and DFL can all be considered special cases of GFL. With GFL, there are the following differences compared to the original method:
- The output of the classification branch is used directly for NMS; there is no need to merge the outputs of the two branches.
- The regression branch's prediction for each boundary of the bounding box changes from a single value to $n+1$ values.
After adopting GFL, the network loss $\mathcal{L}$ becomes:

$$\mathcal{L}=\frac{1}{N_{pos}}\sum_{z}\mathcal{L}_{\mathcal{Q}}+\frac{1}{N_{pos}}\sum_{z}\mathbf{1}_{\{c^{*}_{z}>0\}}\big(\lambda_0\mathcal{L}_{\mathcal{B}}+\lambda_1\mathcal{L}_{\mathcal{D}}\big)$$

where $\mathcal{L}_{\mathcal{Q}}$ is QFL, $\mathcal{L}_{\mathcal{D}}$ is DFL, and $\mathcal{L}_{\mathcal{B}}$ is the GIoU loss.
Experiment
Performance comparison.
Comparison experiments.
The ATSS-based version is compared with SOTA algorithms.
Conclusion
In order to efficiently learn accurate bounding boxes and their distributions, the paper extends Focal Loss and proposes Generalized Focal Loss, which can optimize continuous-value targets and comprises Quality Focal Loss and Distribution Focal Loss. QFL learns a better joint representation of classification score and localization quality, while DFL provides more informative and accurate predictions by modeling bounding-box locations with a general distribution. Experimental results show that GFL can improve the performance of all one-stage algorithms.