Next we introduce the Feature Pyramid Network (FPN). The discussion below is adapted from [1].
- Figure (a) shows a fairly common multi-scale approach, the featurized image pyramid, which was widely used with earlier hand-crafted features (e.g., DPM) and has also been applied to CNNs. The input image is rescaled to several different scales. This handles multiple scales, but it is essentially equivalent to training multiple models (assuming the input size must be fixed). Even if the input size is allowed to vary, it still increases the memory needed to store images at different scales.
- Figure (b) shows a single-scale CNN. Compared with hand-crafted features, a CNN can learn higher-level semantic features and is somewhat robust to scale change, so features computed from a single input scale can already be used for recognition. When clearly multi-scale targets must be detected, however, a pyramid structure is still needed to further improve accuracy. The leading methods on the ImageNet and COCO datasets used a featurized image pyramid at test time, combining (a) and (b). The advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which every level has strong semantics (because every level is produced by the CNN), including the high-resolution levels (from the largest input scale). This pattern has significant disadvantages, though: it takes up to four times longer than the single-scale approach, which makes it difficult to use in real-time applications and also increases storage cost. That is why the image pyramid was used only in the test phase; but if it is used only at test time, training and inference become inconsistent. As a result, some recent approaches do away with the image pyramid altogether.
The image pyramid is not the only way to compute a multi-scale feature representation. A deep CNN computes a feature hierarchy layer by layer, and because of pooling it naturally produces pyramid-shaped, inherently multi-scale feature maps. The problem is that the high-resolution maps (shallow layers) contain only low-level features, so the recognition performance of the shallow layers is weak. This is exactly the motivation for fusing the different levels.
- As shown in Figure (c), SSD was an early attempt to use the CNN's pyramidal feature hierarchy. Ideally, an SSD-style pyramid reuses the multi-scale feature maps already computed in the forward pass, so it adds no extra cost. To avoid using low-level features, however, SSD forgoes the shallow feature maps and instead builds its pyramid starting from conv4_3, adding several new layers. SSD therefore gives up reusing the higher-resolution feature maps, which are very important for detecting small targets. This is the key difference between SSD and FPN.
- Figure (d) shows the structure of FPN, which is designed to make natural use of the pyramidal shape of the CNN's hierarchical features while producing a feature pyramid with strong semantic information at all scales. To this end, FPN uses a top-down pathway and lateral connections to fuse the shallow, high-resolution layers with the deep, semantically rich layers. In this way, a feature pyramid with strong semantics at every scale can be built quickly from a single input image at a single scale, without significant cost.
Let’s look at similar networks:
The skip-connection network shown above makes its prediction only at the finest level (the last layer of the top-down path): it repeatedly upsamples and fuses features, and only the feature map produced at the last step is used for prediction. The FPN structure is similar, except that a prediction is made independently at each level. Subsequent experiments showed that the finest-level variant does not perform as well as FPN: because the detection head is a sliding-window detector with a fixed window size, sliding it over different pyramid levels increases robustness to scale changes (see the sketch below). Moreover, although the finest-level variant has more anchors, it still performs worse than FPN, which indicates that simply increasing the number of anchors does not effectively improve accuracy.
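To make this concrete, here is a minimal sketch (not the paper's code; the head, anchor count, class count, and shapes are illustrative) of the key idea: one shared prediction head is applied independently to every pyramid level, rather than only to the finest map.

```python
import torch
import torch.nn as nn

# Illustrative shared head: same weights applied at every pyramid level.
num_anchors, num_classes = 3, 80
head = nn.Conv2d(256, num_anchors * num_classes, kernel_size=3, padding=1)

pyramid = {                                  # P2..P5 for a 600x900 input, 256 channels each
    'p2': torch.randn(1, 256, 150, 225),
    'p3': torch.randn(1, 256, 75, 113),
    'p4': torch.randn(1, 256, 38, 57),
    'p5': torch.randn(1, 256, 19, 29),
}
# One independent prediction per level, with the same fixed-size sliding window.
scores = {name: head(p) for name, p in pyramid.items()}
for name, s in scores.items():
    print(name, s.shape)
```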
Bottom-up pathway
The feedforward computation of the backbone CNN forms the bottom-up pathway. Feature maps usually become smaller and smaller as they pass through the convolutional layers; layers whose outputs have the same size are said to belong to the same network stage. For the feature pyramid in this paper, the authors define one pyramid level per stage and select the output of the last layer of each stage as the reference set of feature maps. This choice is natural, since the deepest layer of each stage should have the strongest features. Specifically, for ResNets the authors use the feature activations output by the last residual block of each stage, denoted {C2, C3, C4, C5} and corresponding to the outputs of conv2, conv3, conv4, and conv5; they have strides of {4, 8, 16, 32} pixels with respect to the input image. conv1 is not included in the pyramid because of its large memory footprint.
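As a minimal sketch (assuming a recent torchvision; the helper below is illustrative, not the paper's code), the reference maps {C2, C3, C4, C5} can be collected from a standard ResNet-50 like this:

```python
import torch
import torchvision

resnet = torchvision.models.resnet50(weights=None)

def bottom_up(x):
    # Stem (conv1 + pooling): stride 4 relative to the input, not used as a pyramid level.
    x = resnet.maxpool(resnet.relu(resnet.bn1(resnet.conv1(x))))
    c2 = resnet.layer1(x)    # stride 4,   256 channels
    c3 = resnet.layer2(c2)   # stride 8,   512 channels
    c4 = resnet.layer3(c3)   # stride 16, 1024 channels
    c5 = resnet.layer4(c4)   # stride 32, 2048 channels
    return c2, c3, c4, c5

c2, c3, c4, c5 = bottom_up(torch.randn(1, 3, 224, 224))
print([t.shape for t in (c2, c3, c4, c5)])
```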
Top-down pathway and lateral connections
How does the top-down pathway incorporate the low-level, high-resolution features? It upsamples the more abstract, semantically stronger high-level feature maps and then merges them, through lateral connections, with the features of the corresponding bottom-up layer, so that the high-level features are enhanced. Note that the two feature maps joined by a lateral connection have the same spatial size; this is mainly so that the lower layer's localization detail can be exploited.
The following figure shows the connection details. The higher-level feature map is upsampled by a factor of 2 (nearest-neighbor upsampling; compare with deconvolution) and then merged with the corresponding bottom-up feature map, which first passes through a 1 x 1 convolution so that its channel count matches that of the upsampled map. The merge is element-wise addition. This process is repeated until the finest feature map is generated. To start the iteration, a 1 x 1 convolution is attached to C5 to produce the coarsest feature map. Finally, a 3 x 3 convolution is applied to each merged feature map (to reduce the aliasing effect of upsampling) to produce the final feature maps. So that the later stages can share classifiers across all levels, the output channel dimension of these 3 x 3 convolutions is fixed at d = 256; thus all extra convolutional layers (such as the one producing P2) have 256-channel outputs. There are no non-linearities in these extra layers.
The fused feature levels corresponding to {C2, C3, C4, C5} are {P2, P3, P4, P5}, and corresponding levels have the same spatial size.
Figure 1
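Before the full implementation in the next section, here is a minimal sketch of a single merge step as described above (layer names, channel counts, and shapes are illustrative): a 1 x 1 lateral convolution, upsampling of the coarser level, element-wise addition, and a 3 x 3 smoothing convolution with d = 256 output channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 256
lateral = nn.Conv2d(1024, d, kernel_size=1)          # bring C4 (1024 channels) down to d
smooth = nn.Conv2d(d, d, kernel_size=3, padding=1)   # reduce upsampling aliasing

def merge(p_coarse, c_fine):
    # Upsample the coarser pyramid level to the spatial size of the lateral map,
    # add element-wise, then smooth with a 3x3 convolution.
    up = F.interpolate(p_coarse, size=c_fine.shape[-2:], mode='nearest')
    return smooth(up + lateral(c_fine))

p5 = torch.randn(1, d, 19, 29)       # e.g. P5
c4 = torch.randn(1, 1024, 38, 57)    # e.g. C4
p4 = merge(p5, c4)
print(p4.shape)  # torch.Size([1, 256, 38, 57])
```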
2 Code
See the paper "Feature Pyramid Networks for Object Detection" for more details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, in_planes, planes, stride=1):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, self.expansion*planes, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(self.expansion*planes)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion*planes))

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        return F.relu(out)


class FPN(nn.Module):
    def __init__(self, block, num_blocks):
        super(FPN, self).__init__()
        self.in_planes = 64
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        # Bottom-up layers
        self.layer1 = self._make_layer(block,  64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
        # Top layer: reduce C5 to 256 channels
        self.toplayer = nn.Conv2d(2048, 256, kernel_size=1, stride=1, padding=0)
        # Smooth layers (3x3 convs applied after each merge)
        self.smooth1 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
        self.smooth2 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
        self.smooth3 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
        # Lateral layers (1x1 convs on C4, C3, C2)
        self.latlayer1 = nn.Conv2d(1024, 256, kernel_size=1, stride=1, padding=0)
        self.latlayer2 = nn.Conv2d( 512, 256, kernel_size=1, stride=1, padding=0)
        self.latlayer3 = nn.Conv2d( 256, 256, kernel_size=1, stride=1, padding=0)

    def _make_layer(self, block, planes, num_blocks, stride):
        strides = [stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes * block.expansion
        return nn.Sequential(*layers)

    def _upsample_add(self, x, y):
        '''Upsample the top feature map x and add the lateral feature map y.

        Note: when the input size is odd, upsampling with
        `scale_factor=2, mode='nearest'` may not match the lateral map size
        (e.g. input [N,_,15,15] -> conv [N,_,8,8] -> up [N,_,16,16]),
        so we upsample bilinearly to an explicit target size instead.
        '''
        _, _, H, W = y.size()
        return F.interpolate(x, size=(H, W), mode='bilinear', align_corners=False) + y

    def forward(self, x):
        # Bottom-up pathway
        c1 = F.relu(self.bn1(self.conv1(x)))
        c1 = F.max_pool2d(c1, kernel_size=3, stride=2, padding=1)
        print(f'c1:{c1.shape}')
        c2 = self.layer1(c1); print(f'c2:{c2.shape}')
        c3 = self.layer2(c2); print(f'c3:{c3.shape}')
        c4 = self.layer3(c3); print(f'c4:{c4.shape}')
        c5 = self.layer4(c4); print(f'c5:{c5.shape}')
        # Top-down pathway and lateral connections
        p5 = self.toplayer(c5); print(f'p5:{p5.shape}')
        p4 = self._upsample_add(p5, self.latlayer1(c4))
        print(f'latlayer1(c4):{self.latlayer1(c4).shape}, p4:{p4.shape}')
        p3 = self._upsample_add(p4, self.latlayer2(c3))
        print(f'latlayer2(c3):{self.latlayer2(c3).shape}, p3:{p3.shape}')
        p2 = self._upsample_add(p3, self.latlayer3(c2))
        print(f'latlayer3(c2):{self.latlayer3(c2).shape}, p2:{p2.shape}')
        # Smooth (reduce the aliasing effect of upsampling)
        p4 = self.smooth1(p4)
        p3 = self.smooth2(p3)
        p2 = self.smooth3(p2)
        return p2, p3, p4, p5


def FPN101():
    # [3,4,23,3] would give a true ResNet-101 backbone; [2,2,2,2] keeps the demo small.
    return FPN(Bottleneck, [2, 2, 2, 2])


net = FPN101()
fms = net(torch.randn(1, 3, 600, 900))
for fm in fms:
    print(fm.size())
```
Output:
```
c1:torch.Size([1, 64, 150, 225])
c2:torch.Size([1, 256, 150, 225])
c3:torch.Size([1, 512, 75, 113])
c4:torch.Size([1, 1024, 38, 57])
c5:torch.Size([1, 2048, 19, 29])
p5:torch.Size([1, 256, 19, 29])
latlayer1(c4):torch.Size([1, 256, 38, 57]), p4:torch.Size([1, 256, 38, 57])
latlayer2(c3):torch.Size([1, 256, 75, 113]), p3:torch.Size([1, 256, 75, 113])
latlayer3(c2):torch.Size([1, 256, 150, 225]), p2:torch.Size([1, 256, 150, 225])
torch.Size([1, 256, 150, 225])
torch.Size([1, 256, 75, 113])
torch.Size([1, 256, 38, 57])
torch.Size([1, 256, 19, 29])
```
References
- Blog.csdn.net/xiamentingt…
- Blog.csdn.net/shiwanghual…