This paper proposes DDBNet to address the weaknesses of current anchor-free object detection algorithms. The algorithm evaluates prediction boxes more accurately, both in how positive and negative samples are assigned and in how IoU is judged. Its main innovations are the box decomposition and recombination (D&R) module and the semantic consistency module, which address the inaccurate regression from central keypoints and the semantic inconsistency between central keypoints and the target, respectively. Experimentally, DDBNet achieves state-of-the-art (SOTA) results. The paper as a whole is commendable, but some of its details will only become clear once the code is open-sourced.
Source: Xiao Fei’s algorithm engineering notes public account
Dive Deeper Into Box for Object Detection
- Paper address: arxiv.org/abs/2007.14…
Introduction
More and more object detection algorithms now adopt the anchor-free strategy. Although performance has improved to some extent, the accuracy of anchor-free methods is still constrained, mainly by the current bbox regression scheme. The paper lists two problems with current anchor-free methods:
- The central keypoint is semantically inconsistent with the target. The central keypoint is very important in current anchor-free methods, but as shown in Figure 1, the central region of a target may consist largely of irrelevant background, so noisy pixels are inevitably taken as positive samples. Defining positive-sample pixels with such a simple strategy leads to significant semantic inconsistency and lowers regression accuracy.
- Regression from local features is limited. Because the convolution kernel has a limited size, the effective receptive field of each central keypoint may cover only part of the target. Using only these keypoints for bbox regression degrades performance. As shown in Figure 2, the dashed boxes are predicted from center points, and none of them is perfectly aligned with the target.
To solve these two problems, the paper proposes a new object detection algorithm, DDBNet, which contains a box decomposition and recombination module and a semantic consistency module. They address the inaccurate regression from central keypoints and the semantic inconsistency between central keypoints and the target, respectively. The result is shown as the solid box in Figure 2. The main contributions of the paper are as follows:
- A new object detection algorithm, DDBNet, is proposed on top of the anchor-free architecture. It handles both the regression problem of central keypoints and their semantic consistency well.
- The semantic consistency between central keypoints and the GT is verified, which helps improve the convergence of the detection network.
- DDBNet achieves SOTA accuracy (45.5%) and can be efficiently extended to other anchor-free detectors.
Our Approach
DDBNet is built on FCOS, as shown in Figure 3. Its innovations lie mainly in the box decomposition and recombination module (D&R, Decomposition and Recombination) and the semantic consistency module:
- The D&R module decomposes the prediction boxes into individual boundaries and recombines them into new prediction boxes, which are used together with the original prediction boxes for more accurate training. This module is removed at inference.
- The semantic consistency module adaptively classifies pixels into positive-sample and negative-sample pixels according to their classification scores and intrinsic importance.
Box Decomposition and Recombination
Given a target instance $I$, each pixel $i$ in $I$ regresses a prediction box $p_i=\{l_i, t_i, r_i, b_i\}$, and the set of prediction boxes is $B_I = \{p_0, p_1, \cdots, p_n\}$. The four elements are the distances from the pixel to the left, top, right and bottom boundaries, respectively. Normally, the IoU regression loss (Formula 1) is defined as:
$N_{pos}$ is the number of pixels in all target regions and $p^{*}_{I}$ is the regression target. The purpose of the D&R module is to predict a more accurate $p_i$ through optimization with the IoU loss.
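To make the box representation concrete, here is a minimal sketch of the IoU between a predicted and a target box that are both expressed as (l, t, r, b) distances from the same pixel. The function names `ltrb_iou` and `iou_loss` are hypothetical, and the negative-log form of the loss is an assumption borrowed from common FCOS-style implementations, since the formula itself is not reproduced in this post.

```python
import math


def ltrb_iou(pred, target):
    """IoU of two boxes given as (l, t, r, b) distances from the same pixel.

    Assumes all distances are non-negative (the pixel lies inside both boxes)
    and that at least one box has positive area.
    """
    pl, pt, pr, pb = pred
    tl, tt, tr, tb = target
    area_pred = (pl + pr) * (pt + pb)
    area_target = (tl + tr) * (tt + tb)
    # Both boxes contain the pixel, so the overlap extends min(...) on each side.
    inter_w = min(pl, tl) + min(pr, tr)
    inter_h = min(pt, tt) + min(pb, tb)
    inter = inter_w * inter_h
    union = area_pred + area_target - inter
    return inter / union


def iou_loss(preds, targets, eps=1e-7):
    """Mean of -log(IoU) over the given positive pixels (assumed loss form)."""
    losses = [-math.log(ltrb_iou(p, t) + eps) for p, t in zip(preds, targets)]
    return sum(losses) / len(losses)
```

For example, `ltrb_iou((1, 1, 1, 1), (2, 2, 2, 2))` compares a 2×2 box nested inside a 4×4 box around the same pixel, so the IoU is the ratio of their areas.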
As shown in Figure 4, the D&R module is based on IoU and consists of four steps:
- Decomposition: Each prediction box $p_i$ is decomposed into its boundaries $l_i, t_i, r_i, b_i$, and the IoU $s_i$ between $p_i$ and $p^{*}_{I}$ is assigned to each boundary as its confidence. For target $I$, the boundary confidences form an $N \times 4$ matrix $S_{I}$. The boundaries are then grouped by type into four sets: $left_{I} = \{l_0, l_1, \cdots, l_n\}$, $right_{I} = \{r_0, r_1, \cdots, r_n\}$, $bottom_{I} = \{b_0, b_1, \cdots, b_n\}$ and $top_{I} = \{t_0, t_1, \cdots, t_n\}$.
- Ranking: The optimal prediction box should have the minimum IoU loss. Traversing all combinations of the predicted boundaries of target $I$ would yield the optimal box set $B^{'}_{I}$, but direct traversal has a prohibitive complexity of $\mathcal{O}(n^4)$. To avoid this cost, the paper proposes to rank the boundaries first. For each boundary set of $I$, the deviations $\delta^{l}_{I}$, $\delta^{r}_{I}$, $\delta^{b}_{I}$, $\delta^{t}_{I}$ from the corresponding boundaries of the GT $p^{*}_{I} = \{l_I, r_I, b_I, t_I\}$ are computed, and the boundaries are sorted by deviation so that those closer to the GT boundary receive a higher rank.
- Recombination: Boundaries with the same rank in the four sets are combined into a new box, giving a new set of prediction boxes $B^{'}_{I} = \{p^{'}_0, p^{'}_1, \cdots, p^{'}_n\}$. The IoU between each new box $p^{'}_i$ and the GT $p^{*}_{I}$ is then assigned to the corresponding boundaries, yielding a new $N \times 4$ matrix $S^{'}_{I}$.
- Assignment: Two sets of boundary scores, $S_{I}$ and $S^{'}_{I}$, are now available, and the final score of each boundary is the larger of the two. This assignment strategy is used instead of taking $S^{'}_{I}$ directly, mainly because a new box composed of low-ranked boundaries generally differs greatly from the GT, so its new score $s^{'}_i$ is much lower than the original score $s_i$. Such a severe score deviation would make the back-propagated gradient unstable during training.
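The four steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: boxes are given as absolute corners (x1, y1, x2, y2) so that boundaries from different pixels are directly comparable, and the helper names `box_iou` and `decompose_and_recombine` are hypothetical. Sorting each boundary set costs O(n log n), compared with O(n^4) for enumerating every boundary combination.

```python
def box_iou(a, b):
    """IoU of two boxes given as absolute corners (x1, y1, x2, y2)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # no overlap, or a degenerate recombined box
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def decompose_and_recombine(boxes, gt):
    """Return an n x 4 list of final boundary scores (left, top, right, bottom)."""
    n = len(boxes)

    # Decomposition: every boundary inherits the IoU of its parent box (S_I).
    scores = [[box_iou(p, gt)] * 4 for p in boxes]

    # Ranking: for each side k, order box indices by distance to the GT boundary.
    order = [
        sorted(range(n), key=lambda i, k=k: abs(boxes[i][k] - gt[k]))
        for k in range(4)
    ]

    # Recombination: boundaries with the same rank form a new box, and the
    # IoU of that box is written back to each contributing boundary (S'_I).
    new_scores = [[0.0] * 4 for _ in range(n)]
    for rank in range(n):
        owners = [order[k][rank] for k in range(4)]
        new_box = tuple(boxes[owners[k]][k] for k in range(4))
        s_new = box_iou(new_box, gt)
        for k in range(4):
            new_scores[owners[k]][k] = s_new

    # Assignment: each boundary keeps the larger of its two scores.
    return [
        [max(scores[i][k], new_scores[i][k]) for k in range(4)]
        for i in range(n)
    ]
```

As a usage example, with `gt = (0, 0, 10, 10)` and `boxes = [(0, 0, 8, 8), (2, 2, 10, 10)]`, each original box has IoU 0.64 with the GT. The rank-0 recombination takes the left and top boundaries of the first box and the right and bottom boundaries of the second, which reproduces the GT exactly, so those four boundaries end up with score 1.0 while the remaining ones keep 0.64.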
During model training, the IoU loss is used to optimize the boundary predictions. The loss function (Formula 2) consists of two parts:
For target $I$, each boundary uses its higher score to compute the back-propagated gradient. One point remains somewhat unclear here, namely how the comparison $S^{'}_{I} > S_{I}$ is made, since the boundaries of one original prediction box may be recombined into different new prediction boxes. Compared with Formula 1, Formula 2 optimizes from the perspective of the target (instance-wise fashion): all boxes related to the target are considered jointly, so the context information of the target is taken into account. Formula 1 optimizes from the perspective of the individual box (local-wise fashion) and considers only the local information of each box.
Semantic Consistency Module
The performance of the D&R module depends on which pixels of the target are used as positive samples. Most current methods directly select the pixels in a fixed central region as positive samples. This paper instead proposes an adaptive semantic consistency criterion, which helps the network learn an accurate pixel label space and can be formulated as follows (Formula 3):
$R_I$ is the set of IoU scores between the GT and the prediction boxes of the pixels in target $I$, and $\overline{R_I}$ is the mean IoU score of $R_I$. $\overline{R_{I\downarrow}}$ denotes the pixels whose IoU score is below the mean, and $\overline{R_{I\uparrow}}$ the pixels whose IoU score is above the mean. $c_i \in C_I$ is the highest category score of pixel $i$ in $I$, and $G$ is the total number of categories. $\overline{C_{I\downarrow}}$ denotes the pixels whose classification score is below the mean, and $\overline{C_{I\uparrow}}$ the pixels whose classification score is above the mean. The judgment here is class-agnostic.
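A rough sketch of how these sets could be turned into pixel labels is given below. The combination rule is an assumption, since the formula is not reproduced in this post: a pixel is treated as negative only when both its IoU score and its best classification score fall below the respective means, and as positive otherwise. The function name `label_pixels` is hypothetical.

```python
def label_pixels(ious, cls_scores):
    """Label the pixels of one target as positive (True) or negative (False).

    ious:       IoU between each pixel's prediction box and the GT (R_I).
    cls_scores: per-pixel list of category scores; only the maximum is used,
                so the decision is class-agnostic.
    """
    best_cls = [max(scores) for scores in cls_scores]  # c_i for every pixel
    mean_iou = sum(ious) / len(ious)
    mean_cls = sum(best_cls) / len(best_cls)

    labels = []
    for iou, cls in zip(ious, best_cls):
        # Assumed rule: negative only if below the mean on both criteria.
        is_negative = iou < mean_iou and cls < mean_cls
        labels.append(not is_negative)
    return labels
```

Under this assumed rule, a pixel with a poor box but a confident classification score is still kept as a positive sample, which is one way to read the "adaptive" selection described above.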
According to Formula 3, pixels are classified as positive or negative samples, as shown in Figure 5. If a pixel could be assigned to multiple targets, the smallest target is generally selected. After the pixels have been labeled automatically according to semantic consistency, the intrinsic importance of each positive-sample pixel is added to network training to improve the learning of semantic consistency, similar to the centerness of FCOS. Intrinsic importance is measured by the IoU between the pixel's prediction box and the GT. An additional semantic consistency branch is added to the network to predict and learn it. The loss function is defined as:
$r_i$ is the predicted result. The complete loss function of DDBNet is then defined as:
Experiments
Comparison with other methods on the COCO dataset.
Comparative experiments on the two modules.
Conclusion
This paper proposes DDBNet to address the weaknesses of current anchor-free object detection algorithms. The algorithm evaluates prediction boxes more accurately, both in how positive and negative samples are assigned and in how IoU is judged. Its main innovations are the box decomposition and recombination (D&R) module and the semantic consistency module, which address the inaccurate regression from central keypoints and the semantic inconsistency between central keypoints and the target, respectively. Experimentally, DDBNet achieves SOTA results. The paper as a whole is commendable, but some of its details will only become clear once the code is open-sourced.