0 Principle

In 2017, Tsung-Yi Lin et al. at Facebook AI Research proposed FPN, the feature pyramid network architecture. It can be flexibly applied to different tasks, including object detection and instance segmentation, and is trained end-to-end.

In earlier multi-scale feature fusion methods, prediction is generally made on the fused features. FPN differs in that prediction is carried out independently at each feature level, with deep features upsampled and fused with shallow features. It is widely used and effectively improves small-object detection and mAP.

In earlier object detection with Faster R-CNN, both the RPN and Fast R-CNN take RoIs from the last feature map. This is fine for large objects but problematic for small ones: after repeated convolution and pooling, almost none of a small object's information survives on the last feature map, because an RoI is mapped onto a feature map simply by dividing its image coordinates by the stride. Obviously, the deeper the layer, the smaller the mapped region becomes, and it may even disappear entirely. The feature pyramid network is introduced to solve this multi-scale detection problem.
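To make the stride argument concrete, here is a toy sketch (the box size and the stride values are chosen only for illustration, not taken from the paper): a 32 x 32 object that covers 64 feature cells at stride 4 shrinks to roughly a single cell at stride 32, so almost no spatial detail is left for the RoI.

# Toy illustration: project an RoI onto feature maps of different strides
# by dividing its image coordinates by the stride (as described above).
def projected_area(box, stride):
    x1, y1, x2, y2 = (c / stride for c in box)   # box in input-image pixels
    return (x2 - x1) * (y2 - y1)                 # area in feature-map cells

small_box = (100, 100, 132, 132)                 # a 32 x 32 object
for stride in (4, 8, 16, 32):
    print(stride, projected_area(small_box, stride))
# stride 4 -> 64.0 cells, stride 8 -> 16.0, stride 16 -> 4.0, stride 32 -> 1.0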

Next we introduce the feature pyramid network. The discussion below is quoted from [1].


(Figure: (a) featurized image pyramid, (b) single feature map, (c) pyramidal feature hierarchy, (d) feature pyramid network.)

  • Figure (a) is the classic multi-scale approach, the featurized image pyramid, which was widely used with hand-crafted features (e.g. DPM) and has also been applied to CNNs: the input image is rescaled to several sizes and features are computed at each scale (see the sketch after this list). This handles multiple scales, but it is roughly equivalent to training multiple models (if the input size must be fixed), and even when the input size may vary it increases the memory needed to store images at different scales.
  • Figure (b) is a single CNN feature map. Compared with hand-crafted features, a CNN learns higher-level semantic features and is also more robust to scale variation, so features computed from a single input scale can already be used for recognition. However, when objects vary widely in scale, a pyramid structure is still needed to push accuracy further. The current leading methods on the ImageNet and COCO benchmarks use the featurized image pyramid at test time, i.e. they combine (a) and (b). The advantage of featurizing each level of the image pyramid is that it produces a multi-scale feature representation in which every level is semantically strong (every level is produced by the CNN), including the high-resolution levels (the features of the largest input scale). The drawbacks are significant, however: inference takes roughly four times longer than the single-scale method, which makes it hard to use in real-time applications, and storage costs also grow, which is why the image pyramid is used only at test time. But using it only at test time makes training and inference inconsistent. As a result, some recent approaches drop the image pyramid altogether.

But the image pyramid is not the only way to compute a multi-scale feature representation. A deep CNN computes a feature hierarchy layer by layer, and because of the subsampling (pooling) layers this hierarchy has an inherent, pyramid-shaped multi-scale structure. The problem is that the high-resolution maps (shallow layers) contain only low-level features, so their ability to represent objects is weak. This is exactly why different levels are fused.

  • As shown in Figure (c), SSD was one of the first attempts to use a CNN's pyramidal feature hierarchy. Ideally, an SSD-style pyramid would reuse the multi-scale feature maps already computed in the forward pass, adding no extra cost. But to avoid using low-level features, SSD builds its pyramid starting from conv4_3 and adds several new layers on top, giving up the chance to reuse the higher-resolution feature maps of earlier layers, which are very important for detecting small objects. This is the difference between SSD and FPN.
  • Figure (d) is the structure of FPN. FPN naturally exploits the pyramidal form of the CNN feature hierarchy while producing a feature pyramid that has strong semantic information at all scales. To achieve this, FPN uses a top-down pathway and lateral connections to fuse shallow, high-resolution features with deep, semantically rich features. The result is a feature pyramid with strong semantics at every scale that can be built quickly from a single input image at a single scale, without significant extra cost.
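As a concrete illustration of approach (a), the sketch below runs one backbone over several rescaled copies of an input; the backbone (a torchvision ResNet-18 trunk) and the scale set are arbitrary choices for illustration, not the configuration used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

# Featurized image pyramid sketch: the same backbone is run once per scale.
backbone = nn.Sequential(*list(torchvision.models.resnet18().children())[:-2])

image = torch.randn(1, 3, 600, 900)
with torch.no_grad():
    for scale in (0.5, 1.0, 2.0):
        resized = F.interpolate(image, scale_factor=scale,
                                mode='bilinear', align_corners=False)
        feat = backbone(resized)   # one full forward pass per scale
        print(scale, feat.shape)
# Each scale costs a full forward pass, which is why the image pyramid is
# usually applied only at test time.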

Let’s look at similar networks:

(Figure: a top-down architecture with skip connections that predicts only at the finest level, compared with FPN, which predicts independently at every level.)


The skip-connection architecture shown above makes predictions only at the finest level (the last layer of the top-down path): it repeatedly upsamples and fuses features, and only the feature map produced in the final step is used for prediction. The FPN structure is similar, except that prediction is performed independently at every level. Subsequent experiments show that the finest-level variant does not perform as well as FPN: because the head is a sliding-window detector with a fixed window size, sliding it over different pyramid levels increases robustness to scale changes. In addition, although the finest level alone has more anchors, its accuracy is still worse, which indicates that simply increasing the number of anchors does not effectively improve accuracy.
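The point about predicting independently at every level with one shared, fixed-size head can be sketched as follows; the head here is a single hypothetical 3x3 convolution, standing in for the RPN/Fast R-CNN heads that the paper actually attaches.

import torch
import torch.nn as nn

# One shared head, applied independently to every pyramid level.
shared_head = nn.Conv2d(256, 2, kernel_size=3, padding=1)   # placeholder head

pyramid = {   # dummy P2..P5 maps with the 256-channel layout used by FPN
    'p2': torch.randn(1, 256, 150, 225),
    'p3': torch.randn(1, 256, 75, 113),
    'p4': torch.randn(1, 256, 38, 57),
    'p5': torch.randn(1, 256, 19, 29),
}
for name, feat in pyramid.items():
    print(name, shared_head(feat).shape)   # same weights slide over each level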

 

Bottom-up pathway

The bottom-up pathway is the feed-forward computation of the CNN. The feature maps generally shrink as the convolutions proceed, and layers that produce output maps of the same size are said to belong to the same network stage. For the feature pyramid in this paper, the authors define one pyramid level per stage and select the output of the last layer of each stage as the reference set of feature maps. This choice is natural, since the deepest layer of each stage should have the strongest features. Specifically, for ResNets the authors use the feature activations output by the last residual block of each stage, denoted {C2, C3, C4, C5} and corresponding to the outputs of conv2, conv3, conv4 and conv5; they have strides of {4, 8, 16, 32} pixels with respect to the input image. conv1 is not included in the pyramid because of its memory footprint.
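As a sketch of how {C2, C3, C4, C5} can be collected from a standard ResNet, the snippet below uses torchvision's ResNet-50 and records the output of the last block of each stage (layer1 to layer4); this is an illustration of the idea, not the paper's exact implementation.

import torch
import torchvision

resnet = torchvision.models.resnet50()

def bottom_up(x):
    # Stem (conv1 is not used as a pyramid level because of its memory footprint)
    x = resnet.maxpool(resnet.relu(resnet.bn1(resnet.conv1(x))))
    c2 = resnet.layer1(x)    # stride 4,  256 channels
    c3 = resnet.layer2(c2)   # stride 8,  512 channels
    c4 = resnet.layer3(c3)   # stride 16, 1024 channels
    c5 = resnet.layer4(c4)   # stride 32, 2048 channels
    return c2, c3, c4, c5

with torch.no_grad():
    for c in bottom_up(torch.randn(1, 3, 224, 224)):
        print(c.shape)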

Top-down pathway and lateral connections

How does the top-down pathway incorporate the low-level, high-resolution features? The method is to upsample the spatially coarser but semantically stronger high-level feature maps and then merge them, through lateral connections, with the features of the corresponding bottom-up layer, so that the high-level features are enhanced with detail. Note that the two feature maps joined by a lateral connection have the same spatial size. This is done mainly to take advantage of the localization detail of the lower layers.

The figure below shows the connection details. The higher-level feature map is upsampled by a factor of 2 (nearest-neighbor upsampling; compare with deconvolution) and then merged with the corresponding bottom-up feature map, which first passes through a 1x1 convolution so that its channel count matches that of the top-down map. The merge is element-wise addition. The process is repeated until the finest feature map is generated. To start the iteration, a 1x1 convolution is attached after C5 to produce the coarsest feature map. Finally, a 3x3 convolution is applied to each merged map (to reduce the aliasing effect of upsampling) to produce the final feature maps. So that all levels can later share classification layers, the output channel dimension of the 3x3 convolutions is fixed to d, set to 256 here; thus all the extra convolutional layers (such as those producing P2) have 256-channel outputs. There are no non-linearities in these extra layers.

The fused feature maps corresponding to {C2, C3, C4, C5} are {P2, P3, P4, P5}, and corresponding levels have the same spatial size.
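A minimal sketch of a single merge step, with channel counts taken from the ResNet case above (C4 has 1024 channels, the pyramid uses d = 256); the full network version follows in the code section.

import torch
import torch.nn as nn
import torch.nn.functional as F

# One merge step: P4 = smooth(lateral(C4) + upsample(P5)).
lateral = nn.Conv2d(1024, 256, kernel_size=1)             # match channels
smooth = nn.Conv2d(256, 256, kernel_size=3, padding=1)    # reduce aliasing

c4 = torch.randn(1, 1024, 38, 57)   # bottom-up feature
p5 = torch.randn(1, 256, 19, 29)    # coarser top-down feature

p5_up = F.interpolate(p5, size=c4.shape[-2:], mode='nearest')   # 2x upsample
p4 = smooth(lateral(c4) + p5_up)    # element-wise addition, then 3x3 smoothing
print(p4.shape)                     # torch.Size([1, 256, 38, 57])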

 


 

Diagram 1: connection details of the top-down pathway and lateral connection.

2 Code

'''FPN in PyTorch.
See the paper "Feature Pyramid Networks for Object Detection" for more details.
'''
import torch
import torch.nn as nn
import torch.nn.functional as F



class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, in_planes, planes, stride=1):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, self.expansion*planes, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(self.expansion*planes)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion*planes)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        out = F.relu(out)
        return out


class FPN(nn.Module):
    def __init__(self, block, num_blocks):
        super(FPN, self).__init__()
        self.in_planes = 64

        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)

        # Bottom-up layers
        self.layer1 = self._make_layer(block,  64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)

        # Top layer
        self.toplayer = nn.Conv2d(2048, 256, kernel_size=1, stride=1, padding=0)  # Reduce channels

        # Smooth layers
        self.smooth1 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
        self.smooth2 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
        self.smooth3 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)

        # Lateral layers
        self.latlayer1 = nn.Conv2d(1024, 256, kernel_size=1, stride=1, padding=0)
        self.latlayer2 = nn.Conv2d( 512, 256, kernel_size=1, stride=1, padding=0)
        self.latlayer3 = nn.Conv2d( 256, 256, kernel_size=1, stride=1, padding=0)

    def _make_layer(self, block, planes, num_blocks, stride):
        strides = [stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes * block.expansion
        return nn.Sequential(*layers)

    def _upsample_add(self, x, y):
        '''Upsample x and add it to the lateral feature map y.

        Note: in PyTorch, when the input size is odd, upsampling with
        `scale_factor=2, mode='nearest'` may not match the lateral feature
        map size (e.g. input [N,_,15,15] -> conv feature map [N,_,8,8] ->
        upsampled map [N,_,16,16]), so we interpolate bilinearly to the
        exact target size instead.
        '''
        _, _, H, W = y.size()
        return F.interpolate(x, size=(H, W), mode='bilinear', align_corners=False) + y

    def forward(self, x):
        # Bottom-up
        c1 = F.relu(self.bn1(self.conv1(x)))
        c1 = F.max_pool2d(c1, kernel_size=3, stride=2, padding=1)
        print(f'c1:{c1.shape}')
        c2 = self.layer1(c1)
        print(f'c2:{c2.shape}')  

        c3 = self.layer2(c2)
        print(f'c3:{c3.shape}') 
        c4 = self.layer3(c3)
        print(f'c4:{c4.shape}') 
        c5 = self.layer4(c4)
        print(f'c5:{c5.shape}') 

        # Top-down
        p5 = self.toplayer(c5)
        print(f'p5:{p5.shape}') 
        p4 = self._upsample_add(p5, self.latlayer1(c4))
        print(f'latlayer1(c4):{self.latlayer1(c4).shape}, p4:{p4.shape}')

        p3 = self._upsample_add(p4, self.latlayer2(c3))
        print(f'latlayer2(c3):{self.latlayer2(c3).shape}, p3:{p3.shape}')

        p2 = self._upsample_add(p3, self.latlayer3(c2))
        print(f'latlayer3(c2):{self.latlayer3(c2).shape}, p2:{p2.shape}')

        # Smooth
        p4 = self.smooth1(p4)
        p3 = self.smooth2(p3)
        p2 = self.smooth3(p2)
        return p2, p3, p4, p5

def FPN101():
    # return FPN(Bottleneck, [2,4,23,3])
    return FPN(Bottleneck, [2, 2, 2, 2])

def test():
    net = FPN101()
    fms = net(torch.randn(1, 3, 600, 900))
    for fm in fms:
        print(fm.size())

test()

Output:

c1:torch.Size([1, 64, 150, 225])
c2:torch.Size([1, 256, 150, 225])
c3:torch.Size([1, 512, 75, 113])
c4:torch.Size([1, 1024, 38, 57])
c5:torch.Size([1, 2048, 19, 29])
p5:torch.Size([1, 256, 19, 29])
latlayer1(c4):torch.Size([1, 256, 38, 57]), p4:torch.Size([1, 256, 38, 57])
latlayer2(c3):torch.Size([1, 256, 75, 113]), p3:torch.Size([1, 256, 75, 113])
latlayer3(c2):torch.Size([1, 256, 150, 225]), p2:torch.Size([1, 256, 150, 225])

# p2, p3, p4, p5
torch.Size([1, 256, 150, 225])
torch.Size([1, 256, 75, 113])
torch.Size([1, 256, 38, 57])
torch.Size([1, 256, 19, 29])

 

References

  1. blog.csdn.net/xiamentingt…
  2. blog.csdn.net/shiwanghual…