  • [Target detection (1)] RCNN — the opening work of deep-learning target detection
  • [Target detection (2)] SPP Net — letting the convolution computation be shared
  • [Target detection (3)] Fast RCNN — making the RCNN model trainable end-to-end
  • [Target detection (4)] Faster RCNN — the RPN network replaces Selective Search
  • [Target detection (5)] YOLOv1 — opening the chapter of one-stage target detection
  • [Target detection (6)] YOLOv2 — introducing Anchor, Better, Faster
  • Understanding the regression-box loss functions of target detection — IoU, GIoU, DIoU, CIoU principles and Python code
  • Focal Loss — pushing the one-stage algorithm to its peak
  • NMS Free
  • [Target detection (12)] FCOS — Anchor-Free target detection based on the idea of instance segmentation

1. Motivation

Throughout the development history of target detection, both the two-stage Faster RCNN and the single-stage YOLO algorithms introduced Anchors. Anchors provide a prior so that boxes do not have to be regressed from scratch, which helps the detector achieve better results. However, as the number of Anchor Boxes grows, the Anchor mechanism has become an important bottleneck restricting detection performance. CornerNet abandoned the Anchor and is regarded as the pioneer of Anchor-Free target detection (although YOLOv1 was also Anchor Free). CornerNet outputs two Heatmaps and an Embedding Vector to predict the top-left and bottom-right vertices of the target box.

CornerNet has two disadvantages. One is that the corners of the bounding box often fall outside the object, in regions with little semantic information, which makes the vertices hard to locate. The other is that it requires a corner-grouping stage (pairing top-left and bottom-right vertices), which greatly reduces the performance and efficiency of the algorithm. So CenterNet came into being; when we say CenterNet here we mean the paper Objects as Points, not the other paper with CenterNet in its name. CenterNet, as its name suggests, is very intuitive: the target detection problem is recast as detecting the center point of the target box. It is an Anchor-Free algorithm improved on the basis of CornerNet, reusing CornerNet's keypoint-estimation idea while improving the accuracy and inference performance of the model. Compared with the single-stage detector YOLOv3, its accuracy is about 4 percentage points higher at comparable speed. Compared with other single-stage or two-stage target detection algorithms, it has the following advantages:

  • The Anchor mechanism is omitted and the speed of detection algorithm is improved.
  • A max-pooling operation on the Heatmap replaces NMS, theoretically eliminating the time-consuming NMS post-processing.
  • Key point estimation is used for target detection, which has good versatility and can be used in key point detection and 3D target detection.

2. CenterNet principle

2.1 Pipeline

  • The image is scaled to 512 * 512 (the long side is scaled to 512 and the short side is padded), and the scaled image is then fed into the network.
  • The backbone (taking ResNet50 as an example) extracts Feature1 with size 2048 * 16 * 16.
  • Feature1 passes through the transposed-convolution module and is up-sampled three times to obtain Feature2. The Heatmap head then predicts a feature map of size 80 * 128 * 128 (80 categories here), the head predicting width and height outputs 2 * 128 * 128, and the head predicting the center-point offset outputs 2 * 128 * 128 (2 for the x and y dimensions).

Notes: the transposed-convolution module contains 3 deconvolution groups, each consisting of one 3*3 convolution followed by one deconvolution. Each deconvolution doubles the size of the feature map. In many implementations, the 3*3 convolution before the deconvolution is replaced with DCNv2 (Deformable ConvNets v2) to improve the fitting capability of the model.
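As a rough illustration of this module, here is a minimal PyTorch sketch of one deconvolution group (a 3*3 convolution followed by a stride-2 transposed convolution that doubles the spatial size). The class name and channel numbers are illustrative assumptions, not taken from any particular repository:

```python
import torch
import torch.nn as nn

class DeconvGroup(nn.Module):
    """One group of the upsampling module: 3x3 conv + stride-2 transposed conv.

    Channel counts are illustrative; implementations often go
    2048 -> 256 -> 128 -> 64 over the three groups, and may swap the plain
    3x3 convolution for DCNv2.
    """
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        # kernel_size=4, stride=2, padding=1 exactly doubles H and W
        self.up = nn.Sequential(
            nn.ConvTranspose2d(out_channels, out_channels, kernel_size=4,
                               stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.up(self.conv(x))

# Three groups take the 2048 x 16 x 16 backbone feature up to 64 x 128 x 128
neck = nn.Sequential(DeconvGroup(2048, 256), DeconvGroup(256, 128), DeconvGroup(128, 64))
feat2 = neck(torch.randn(1, 2048, 16, 16))   # -> (1, 64, 128, 128)
```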

2.2 Model output feature graph

After the backbone and upsampling module, the network outputs three branches: Heatmap, Center Offset, and Box Size, as follows:

  • Heatmap: C * H * W, heat map prediction of target center.
  • Center Offset: 2 * H * W, predict the deviation value of the center point.
  • Box Size: 2 * H * W, each target center corresponds to the width and height of the bbox.

The following figure is used for illustration in the paper:

  • Meaning of the Heatmap: the Heatmap is the core output of CenterNet. It has C channels (the number of categories, excluding background). Each element on a channel corresponds to whether the center point of an object of that category falls at that location.
  • Meaning of the Offset feature map: since pixel coordinates on the feature map must be integers, mapping the feature map back to the original image introduces an integer-quantization effect; truncating the fractional part causes errors, and the effect is particularly obvious for small targets. The same problem exists in ROI Pooling, hence the later improvements of ROI Align and Precise ROI Pooling. CenterNet therefore sets up a dedicated Offset branch to predict the deviation of the center-point coordinates, with its 2 channels corresponding to the x and y offsets.
  • Meaning of the Size feature map: each element corresponds to the width and height of the predicted box at that center. A small decoding sketch combining the three branches follows this list.
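To make the roles of the three branches concrete, here is a simplified decoding sketch in PyTorch. The function name, the top-k strategy, and the assumption that the size head is regressed in feature-map units are illustrative; real implementations differ in these details:

```python
import torch

def decode_centers(heatmap, offset, size, k=100, stride=4):
    """Turn the three CenterNet head outputs into boxes (simplified sketch).

    heatmap: (C, H, W) sigmoid scores, offset: (2, H, W), size: (2, H, W).
    Returns (x1, y1, x2, y2, score, class) rows in input-image coordinates.
    """
    C, H, W = heatmap.shape
    scores, inds = heatmap.reshape(-1).topk(k)              # top-k peaks over all classes
    cls = torch.div(inds, H * W, rounding_mode='floor')
    pos = inds % (H * W)
    ys = torch.div(pos, W, rounding_mode='floor')
    xs = pos % W

    # add the predicted sub-pixel offset to the integer peak location,
    # then map back to input-image coordinates via the stride
    cx = (xs.float() + offset[0, ys, xs]) * stride
    cy = (ys.float() + offset[1, ys, xs]) * stride

    # width/height from the size head; here we assume it is regressed in
    # feature-map units and scale by the stride (implementations differ)
    w = size[0, ys, xs] * stride
    h = size[1, ys, xs] * stride

    return torch.stack([cx - w / 2, cy - h / 2,
                        cx + w / 2, cy + h / 2,
                        scores, cls.float()], dim=1)
```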

2.3 Backbone network structure

In fact, the backbone can be chosen freely; ResNet18, ResNet50, ResNet101 and so on are all usable. These backbones downsample 5 times, so the feature map shrinks to 1/32 of the input size. CenterNet appends three layers of transposed convolution after the backbone, each up-sampling by a factor of 2, so the feature map rises back to 1/4 of the original image size. With a network input of 512*512, the output feature map is therefore 128*128.

2.4 Generating the Heatmap

If only the real center of the object is given a Heatmap value of 1 and all other locations are set to 0, positive and negative samples will be severely unbalanced and learning will be poor. Following CornerNet, a Gaussian kernel is generated around the center point, and the Heatmap label is decayed with distance inside the kernel: the farther from the center point, the heavier the penalty and the closer the label is to 0; the closer to the center point, the lighter the penalty and the closer the label is to 1. Each category has its own Heatmap (one channel). If a coordinate contains the center point of an object, a Keypoint (represented by a Gaussian circle) is generated at that coordinate, as shown in the following figure:

The procedure for generating a HeatMap is as follows:

  • Scale the target's bbox to the 128 * 128 feature map, then take the coordinates of the center point of the box and call it point.
  • Calculate the radius of the Gaussian circle from the size of the target box and call it r. The radius is taken as the minimum over the three overlap cases; for a derivation, refer to this blog: www.cnblogs.com/silence-cho…
  • On the Heatmap, draw a circle centered at point with radius r, and assign each point (i, j) inside it a value using a Gaussian distribution whose standard deviation is 1/3 of the radius:

$$Y_{c_{ij}} = e^{-\frac{x^2 + y^2}{2\sigma^2}}$$

Note: if two Gaussian kernels overlap on the feature map, the paper takes the element-wise maximum.
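Below is a small NumPy sketch of how such a Gaussian label can be splatted onto one Heatmap channel, modeled loosely on common open-source implementations; the function name and the exact sigma choice (diameter / 6, i.e. roughly radius / 3) are assumptions for illustration:

```python
import numpy as np

def draw_gaussian(heatmap, center, radius):
    """Splat a 2-D Gaussian of the given radius onto one heatmap channel.

    heatmap: (H, W) array for one class; center: (x, y) integer feature-map
    coordinates; radius: Gaussian-circle radius computed from the box size.
    Overlapping Gaussians are merged with an element-wise maximum, as in the paper.
    """
    sigma = (2 * radius + 1) / 6.0                       # roughly radius / 3
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    gaussian = np.exp(-(x * x + y * y) / (2 * sigma * sigma))

    cx, cy = center
    H, W = heatmap.shape
    # clip the Gaussian patch so it stays inside the heatmap
    left, right = min(cx, radius), min(W - cx, radius + 1)
    top, bottom = min(cy, radius), min(H - cy, radius + 1)

    masked_hm = heatmap[cy - top:cy + bottom, cx - left:cx + right]
    masked_g = gaussian[radius - top:radius + bottom, radius - left:radius + right]
    np.maximum(masked_hm, masked_g, out=masked_hm)       # element-wise max for overlaps
    return heatmap
```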

2.5 Loss design

  • Heatmap Loss: first of all, let's clarify what the ground-truth label on the feature map is:

$$Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)$$

As described above, for a position (x, y, c) on the heat map, the point corresponding to the center of the GT box on the feature map is obtained by dividing by the stride and rounding down, and the Loss is then calculated:

In the formula, N denotes the number of object center points in the image. Considering the imbalance between positive and negative samples, the paper adapts the Focal Loss formulation as the heat-map loss of CenterNet, to handle both the positive/negative imbalance and the imbalance between easy and hard samples.
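For reference, the modified focal loss used in the Objects as Points paper has the form:

$$L_k = \frac{-1}{N}\sum_{xyc}\begin{cases}(1-\hat{Y}_{xyc})^{\alpha}\log(\hat{Y}_{xyc}) & \text{if } Y_{xyc}=1\\(1-Y_{xyc})^{\beta}(\hat{Y}_{xyc})^{\alpha}\log(1-\hat{Y}_{xyc}) & \text{otherwise}\end{cases}$$

where $\hat{Y}$ is the predicted heat map, and the paper sets $\alpha = 2$ and $\beta = 4$.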

  • Offset Loss: fits the error caused by the integer quantization of the heat-map center point. The ground truth is the fractional part of the coordinates obtained after scaling the center point of the GT Box to the feature map. The Loss can be calculated as follows:
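With $p$ the GT center in the input image, $R$ the output stride, $\tilde{p} = \lfloor p / R \rfloor$, and $\hat{O}$ the predicted offset map, the paper's L1 offset loss is:

$$L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}} - \left(\frac{p}{R} - \tilde{p}\right)\right|$$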

Note: only the loss of the center point of the object is calculated here, other points are ignored.

  • Size Loss: measures the error in the width and height of the predicted box, using an L1 Loss (see the formulas after this list).
  • Total loss: Heatmap Loss + Offset Loss + Size Loss
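With $s_k$ the ground-truth size of the $k$-th object and $\hat{S}$ the predicted size map, the size loss and the total detection loss from the paper are:

$$L_{size} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_k} - s_k\right|$$

$$L_{det} = L_k + \lambda_{size}L_{size} + \lambda_{off}L_{off}$$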

In the paper, $\lambda_{size} = 0.1$ and $\lambda_{off} = 1$.

2.6 NMS-Free Design

  • The paper states that a 3 * 3 max-pooling on the Heatmap can replace NMS post-processing, removing NMS entirely (a small sketch follows this list). CenterNet has no concept of anchors and only predicts the center point of each object; without a large number of Anchor prior boxes, the need for NMS is not so great.
  • The paper runs ablation experiments with and without NMS, and using NMS brings only a weak improvement:
  • In other words, the paper argues in theory that NMS can be eliminated. In my opinion, in industrial applications it is still better to keep NMS post-processing as insurance to improve accuracy.
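A minimal PyTorch sketch of this pooling-based pseudo-NMS is shown below; the function name is illustrative, but the 3 * 3 max-pooling trick itself follows the paper's description:

```python
import torch
import torch.nn.functional as F

def heatmap_peaks(heatmap, kernel=3):
    """Keep only local maxima of the heatmap via 3x3 max pooling (pseudo-NMS).

    heatmap: (B, C, H, W) sigmoid scores. A point survives only if it equals
    the maximum in its 3x3 neighbourhood; everything else is zeroed out,
    playing the role that NMS plays in anchor-based detectors.
    """
    pad = (kernel - 1) // 2
    local_max = F.max_pool2d(heatmap, kernel_size=kernel, stride=1, padding=pad)
    keep = (local_max == heatmap).float()
    return heatmap * keep
```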

3. CenterNet results and analysis of advantages and disadvantages

3.1 Main Effects

CenterNet achieves a significant improvement in accuracy while keeping real-time inference speed. Without the complex Anchor mechanism, the whole algorithm is also very fast.

3.2 Analysis of advantages and disadvantages

Advantages:

  • Compared with CornerNet, the center-point Heatmap greatly simplifies the algorithm and avoids the problem of deciding which two corner points belong to the same object.
  • The model structure is simple and easy to implement and understand.
  • It can be used not only for target detection but also for 3D detection and human pose estimation.
  • The model is fast and, with a small backbone network, can be deployed on edge devices.

Disadvantages:

  • For dense objects, if center points overlap after downsampling, or their Gaussian kernels overlap heavily, CenterNet may be unable to separate them and may train two objects as one.

Reference:

  1. Arxiv.org/pdf/1904.07…
  2. zhuanlan.zhihu.com/p/96856635
  3. www.cnblogs.com/silence-cho…