  • I haven’t written an article for a long time (sorry, I’ve been slacking off); recently I was looking at rents in Beijing (really expensive).
  • Also, with some free time recently, I built a simple assistant system as a WeChat mini program, based on my years of personal stock-trading strategy, my own shallow knowledge of AI time-series algorithms, and the JavaScript I have picked up. I’ll try it out first; if it works well I’ll write an article later to introduce it to you.

1 Overview

FPN is short for Feature Pyramid Network.

In an object detection task, such as YOLOv1, features are extracted from an image with convolutions. After passing through multiple pooling layers or convolution layers with stride 2, a small-scale feature map is output, and object detection is then performed on this feature map.

In other words, the final detection result depends entirely on this one feature map; this is what is called a single-stage object detection algorithm here.

As you can imagine, such an approach has a hard time recognizing objects of different sizes effectively, so multi-stage detection algorithms arose, and these in fact use the feature pyramid network (FPN).

To put it simply, the image is still passed through a convolutional network for feature extraction. Originally only one feature map was output after going through all the pooling layers; now a feature map is output after each pooling layer, so in effect multiple feature maps of different scales are extracted.

These feature maps of different scales are then fed into the feature pyramid network (FPN) for object detection.
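To make this concrete, here is a minimal PyTorch sketch of a toy backbone that keeps the intermediate feature maps instead of only the last one. The module names, channel counts and strides are my own illustrative choices, not taken from any particular detector.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Toy backbone: each stage halves the spatial size with a stride-2 conv,
    and the intermediate feature maps are kept instead of only the last one."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())     # C1: 1/2 size
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())   # C2: 1/4
        self.stage3 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())  # C3: 1/8
        self.stage4 = nn.Sequential(nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU())  # C4: 1/16
        self.stage5 = nn.Sequential(nn.Conv2d(512, 1024, 3, stride=2, padding=1), nn.ReLU()) # C5: 1/32

    def forward(self, x):
        c1 = self.stage1(x)
        c2 = self.stage2(c1)
        c3 = self.stage3(c2)
        c4 = self.stage4(c3)
        c5 = self.stage5(c4)
        return c3, c4, c5  # multi-scale feature maps handed to the FPN / detection heads

c3, c4, c5 = TinyBackbone()(torch.randn(1, 3, 256, 256))
print(c3.shape, c4.shape, c5.shape)  # 32x32, 16x16 and 8x8 for a 256x256 input
```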

(If you still don’t understand, read on to find out.)

2 Structure overview of FPN

As can be seen from the figure:

  • C1, C2, and so on on the left represent feature maps of different scales. After one pooling layer or one convolution layer with stride 2, the spatial size of the input image is halved, giving the C1 feature map. Going through another pooling layer produces the C2 feature map.
  • The five feature maps of different scales, C3, C4, C5, C6 and C7, are fed into the FPN feature pyramid network for feature fusion, and then the detection head predicts the candidate boxes.
  • Here is some personal understanding (if there is any mistake, please correct me): it is worth distinguishing between the multi-stage detection algorithm and the feature pyramid network.
    • Multi-stage detection algorithm: from the figure above, the feature maps P3, P4, P5, P6 and P7 of different scales go into the detection head, which predicts the candidate boxes. The detection head is essentially the detection algorithm itself; its input is multiple feature maps of different scales and its output is the candidate boxes, hence a multi-stage detection algorithm.
    • Feature pyramid network: this is really a means of fusing feature maps of different scales to strengthen their representational power. This process does not predict candidate boxes and should be counted as part of feature extraction. The input of the FPN is also a set of feature maps of different scales, and the output is a set of feature maps at the same scales as the input.

Therefore, a multi-stage detection algorithm can also do without the FPN structure and directly feed the C3, C4, C5, C6 and C7 output by the convolutional network into the detection head to produce the candidate boxes.

3 The simplest FPN structure

In fact, top-down unidirectional fusion is still the mainstream fusion mode in current object detection models, such as Faster RCNN, Mask RCNN, YOLOv3, RetinaNet and Cascade RCNN. The top-down, unidirectional FPN structure is shown in the figure below.

The essence of this structure is: the C5 feature map is up-sampled and then spliced with the C4 feature map; the spliced feature map then passes through a convolution layer and a BN layer to obtain the P4 feature map. P4 has the same shape as C4.

With this structure, P4 can learn deeper semantics from C5, and P3 can learn deeper semantics from C4. My personal reading of why this structure works: for prediction accuracy, the deeper the features the better they are extracted and the more accurate the prediction, but deeper feature maps are also smaller in size; by up-sampling them and fusing them with the shallow feature maps, the feature representation of the shallow maps is strengthened.
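Below is a minimal PyTorch sketch of this top-down fusion, following the description above (up-sample, splice, then conv + BN). Note that the original FPN paper instead uses a 1×1 lateral conv plus element-wise addition; the channel counts here are just illustrative assumptions matching the toy backbone above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleTopDownFPN(nn.Module):
    """Top-down FPN as described above: up-sample the deeper map, splice (concatenate)
    it with the shallower one, then a 3x3 conv + BN restores the channel count, so
    each P_i has the same shape as the corresponding C_i."""
    def __init__(self, c3_ch=256, c4_ch=512, c5_ch=1024):
        super().__init__()
        self.fuse4 = nn.Sequential(
            nn.Conv2d(c4_ch + c5_ch, c4_ch, 3, padding=1), nn.BatchNorm2d(c4_ch), nn.ReLU())
        self.fuse3 = nn.Sequential(
            nn.Conv2d(c3_ch + c4_ch, c3_ch, 3, padding=1), nn.BatchNorm2d(c3_ch), nn.ReLU())

    def forward(self, c3, c4, c5):
        p5 = c5                                                                      # deepest map used as-is
        p4 = self.fuse4(torch.cat([c4, F.interpolate(p5, scale_factor=2)], dim=1))   # same shape as C4
        p3 = self.fuse3(torch.cat([c3, F.interpolate(p4, scale_factor=2)], dim=1))   # same shape as C3
        return p3, p4, p5
```

Combined with the toy backbone above, `SimpleTopDownFPN()(c3, c4, c5)` returns P3, P4 and P5 with the same shapes as C3, C4 and C5.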

4 Multi-stage structure without FPN

This is a structure diagram without the FPN structure. A typical representative of unfused multi-scale features is the famous SSD, which appeared in 2016; it directly uses feature maps from different stages to detect objects of different scales.

As can be seen, the feature maps output by the convolutional network are fed directly into the detection head, which outputs the candidate boxes.
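As a hedged sketch of this idea in PyTorch: each backbone feature map gets its own small convolutional head that predicts boxes and class scores directly, with no fusion in between. The anchor and class counts are placeholder values, not SSD's actual configuration.

```python
import torch.nn as nn

class MultiScaleHeads(nn.Module):
    """SSD-style use of multi-scale features: no fusion, each backbone feature map gets
    its own head that predicts box offsets + class scores for every spatial cell."""
    def __init__(self, in_channels=(256, 512, 1024), num_anchors=3, num_classes=20):
        super().__init__()
        out_ch = num_anchors * (4 + num_classes)  # 4 box offsets + class scores per anchor
        self.heads = nn.ModuleList(nn.Conv2d(c, out_ch, 3, padding=1) for c in in_channels)

    def forward(self, feats):  # feats = (C3, C4, C5) straight from the backbone
        return [head(f) for head, f in zip(self.heads, feats)]
```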

5 Simple bidirectional fusion

The original FPN is a one-way fusion from deep to shallow; now there is also two-way fusion, from deep to shallow and then from shallow to deep again. PANet was the first to propose this bottom-up second fusion pass:

  • PANet: Path Aggregation Network. 2018 CVPR paper.
  • Address: arxiv.org/abs/1803.01…
  • Path Aggregation Network for Instance Segmentation

As can be seen from the figure, the up-sampling pass is the same as in FPN; a down-sampling pass from shallow to deep is then completed with stride-2 convolutions. The shallow P3 feature map is down-sampled by a convolution layer with stride 2 so that its size matches C4; the two are spliced, and the P4 feature map is then generated by a 3×3 convolution layer.
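Here is a minimal PyTorch sketch of this second, bottom-up pass, following the description above (stride-2 conv to down-sample, splice, 3×3 conv to fuse). The channel counts are illustrative assumptions, and PANet itself names the resulting maps N4 and N5.

```python
import torch
import torch.nn as nn

class BottomUpPath(nn.Module):
    """Second, bottom-up half of a PANet-style bidirectional FPN: the shallow map is
    down-sampled with a stride-2 conv to match the next level, spliced with that
    level's top-down map, and fused by a 3x3 conv."""
    def __init__(self, ch=(256, 512, 1024)):
        super().__init__()
        self.down3 = nn.Conv2d(ch[0], ch[0], 3, stride=2, padding=1)  # P3 -> size of P4
        self.fuse4 = nn.Conv2d(ch[0] + ch[1], ch[1], 3, padding=1)
        self.down4 = nn.Conv2d(ch[1], ch[1], 3, stride=2, padding=1)  # N4 -> size of P5
        self.fuse5 = nn.Conv2d(ch[1] + ch[2], ch[2], 3, padding=1)

    def forward(self, p3, p4, p5):  # p3..p5 are the outputs of the top-down FPN pass
        n3 = p3
        n4 = self.fuse4(torch.cat([self.down3(n3), p4], dim=1))
        n5 = self.fuse5(torch.cat([self.down4(n4), p5], dim=1))
        return n3, n4, n5
```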

In addition, there are many complex two-way fusion operations, which are not covered in detail here.

6 BiFPN

The PANet above is the simplest bidirectional FPN, but BiFPN proper comes from a different paper.

  • BiFPN: Proposed by the Google team in 2019.
  • Address: arxiv.org/abs/1911.09…
  • EfficientDet: Scalable and Efficient Object Detection

The structure is not hard to understand; it essentially makes some minor improvements on top of the PANet structure. The main contribution of that paper, however, is EfficientDet, so BiFPN is only a small part of it.
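One of those improvements described in the EfficientDet paper is weighted feature fusion: instead of treating every input to a fused node equally, each input gets a learnable, fast-normalized weight. A small sketch of just that fusion step (not the full BiFPN topology):

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion: inputs of identical shape are combined with
    learnable non-negative weights that are normalized to sum to (almost) 1."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):  # inputs: list of feature maps with identical shapes
        w = torch.relu(self.w)
        w = w / (w.sum() + self.eps)
        return sum(wi * x for wi, x in zip(w, inputs))
```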

7 Recursive-FPN: recursive feature pyramid network

  • Recursive FPN: the recursive FPN is surprisingly effective; it is the SOTA for object detection tasks. (2020 paper)
  • Links to papers: arxiv.org/abs/2006.02…
  • DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution

I have also used the RFP structure in my own object detection tasks. Although the required computation roughly doubled, the effect improved significantly, by about 3 to 5 points.

Here is the structure: as you can see, a dashed line and a solid line form a loop between the feature maps and the FPN network. The figure shows a 2-step RFP structure, that is, an FPN structure that loops twice (a 1-step version is just an ordinary FPN).

It can be seen that P3, P4 and P5 from the previous FPN pass are spliced back into the corresponding feature-extraction stages of the convolutional network. After splicing, a 3×3 convolution layer and a BN layer restore the number of channels to the required value.
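A hedged PyTorch sketch of this looping idea, under simplifying assumptions of mine: the backbone is assumed to expose a stem plus per-stage modules, the feedback is spliced with each stage's output (rather than injected inside the stage through an extra module, as DetectoRS actually does), and all channel counts are illustrative.

```python
import torch
import torch.nn as nn

class RecursiveFPN(nn.Module):
    """2-step recursive FPN sketch: run the backbone stages + FPN once, splice each P_i
    back into the matching stage output C_i (3x3 conv + BN restores the channel count),
    then run the stages + FPN a second time on the enriched features."""
    def __init__(self, stem, stages, fpn, stage_ch=(256, 512, 1024), fpn_ch=(256, 512, 1024)):
        super().__init__()
        self.stem, self.stages, self.fpn = stem, nn.ModuleList(stages), fpn
        self.connect = nn.ModuleList(
            nn.Sequential(nn.Conv2d(s + p, s, 3, padding=1), nn.BatchNorm2d(s), nn.ReLU())
            for s, p in zip(stage_ch, fpn_ch))

    def run_stages(self, x, feedback=None):
        feats, h = [], self.stem(x)              # stem covers everything before C3
        for i, stage in enumerate(self.stages):  # stages produce C3, C4, C5
            h = stage(h)
            if feedback is not None:             # splice P_i back into C_i
                h = self.connect[i](torch.cat([h, feedback[i]], dim=1))
            feats.append(h)
        return feats

    def forward(self, x, steps=2):
        feedback = None
        for _ in range(steps):                   # steps=1 is an ordinary FPN, steps=2 loops once
            feats = self.run_stages(x, feedback)
            feedback = self.fpn(*feats)          # (P3, P4, P5)
        return feedback
```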