Contents
Abstract
I. Overview of SENet
II. Detailed explanation of the SENet structure
III. Detailed calculation process
IV. Application of SENet in a specific network (code implementation of SE-ResNet)
SE module
The first residual module
The second residual module
Complete code for the SEResNet18 and SEResNet34 models
Complete code for the SEResNet50, SEResNet101, and SEResNet152 models
Abstract
I. Overview of SENet
Squeeze-and-Excitation Networks (SENet for short) is a network architecture proposed by Momenta and WMW. Using SENet, they won the image classification task of the final ImageNet 2017 competition, reducing the top-5 error on the ImageNet dataset to 2.251%; the previous best result was 2.991%.
In the paper, SE blocks are inserted into a variety of existing classification networks and achieve good results. The authors' motivation is to explicitly model the interdependencies between feature channels. Rather than introducing a new spatial dimension for fusing feature channels, they adopt a "feature recalibration" strategy: the importance of each feature channel is learned automatically, and this importance is then used to promote useful features and suppress features that are of little use for the current task.
Generally speaking, the core idea of SENet is to use the loss to learn feature weights through the network, so that effective feature maps receive large weights while ineffective or less useful feature maps receive small weights, leading to better results. Embedding SE blocks into existing classification networks inevitably adds some parameters and computation, but the cost is acceptable given the gains. The Squeeze-and-Excitation (SE) block is not a complete network structure, but a substructure that can be nested into other classification or detection models.
II. Detailed explanation of the SENet structure
Squeeze and Excitation are the two key operations in the structure above, and they are explained in detail below.
The diagram above shows the SE module. Given an input x with C' feature channels, a feature map U with C channels is obtained after a series of general transformations such as convolution. The previously obtained features are then recalibrated through the following three operations:
1. The Squeeze operation compresses the features along the spatial dimensions, turning each two-dimensional feature channel into a single real number. This number has, to some extent, a global receptive field, and the output dimension matches the number of input feature channels. It characterizes the global distribution of responses over the feature channels and allows layers close to the input to obtain a global receptive field, which is very useful in many tasks.
2. The Excitation operation is a gating mechanism similar to the gates in recurrent neural networks. A weight is generated for each feature channel by a parameter W, which is learned to explicitly model the correlation between feature channels.
3. The Reweight operation treats the output weights of the Excitation as the importance of each feature channel after feature selection, and then multiplies them channel-wise onto the previous features, completing the recalibration of the original features along the channel dimension.
III. Detailed calculation process
First comes the transformation F_tr. This step is a conversion operation (strictly speaking it does not belong to SENet but to the original network, as can be seen from the combinations of SENet with the Inception and ResNet networks). In the paper it is just a standard convolution operation. The input and output are defined as F_tr: X -> U, with X of size H'*W'*C' and U of size H*W*C.

The formula for u_c is Equation 1 below (* denotes the convolution operation, v_c denotes the c-th convolution kernel, v_c^s its s-th channel, and x^s the s-th input channel):

$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s \tag{1}$$
The resulting U is the second three-dimensional matrix (also called a tensor) from the left in Figure 1, i.e. C feature maps of size H*W. Here u_c denotes the c-th two-dimensional matrix in U, with the subscript c indexing the channel.
Next comes the Squeeze operation. Its formula is very simple, just a global average pooling:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j) \tag{2}$$

Equation 2 converts an H*W*C input into a 1*1*C output, corresponding to the F_sq operation in Figure 1. Why do this? The result of this step is equivalent to describing the numerical distribution of the C feature maps at this layer, i.e. their global information.
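As a quick sanity check of the shapes involved, here is a minimal sketch (not from the paper's code; the tensor sizes are arbitrary example values) showing that global average pooling collapses each H*W feature map into a single number per channel:

import torch
import torch.nn as nn

# Example tensor U with C = 64 channels and H = W = 32 (sizes chosen only for illustration)
u = torch.randn(1, 64, 32, 32)

# Squeeze: global average pooling over the spatial dimensions (Equation 2)
squeeze = nn.AdaptiveAvgPool2d(1)
z = squeeze(u)                        # shape: (1, 64, 1, 1), i.e. 1*1*C
print(z.shape)

# Equivalent to averaging each channel's H*W values directly
z_manual = u.mean(dim=(2, 3), keepdim=True)
print(torch.allclose(z, z_manual))    # True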
The next operation is the Excitation of Equation 3. Here we first multiply z by W1, which is a fully connected layer; the dimension of W1 is C/r * C, where r is a scaling (reduction) ratio, set to 16 in the paper. Its purpose is to reduce the number of channels and thus the amount of computation. Since z is 1*1*C, W1z is 1*1*C/r. This then passes through a ReLU layer, which leaves the dimension unchanged, and is multiplied by W2, another fully connected layer with dimension C * C/r, so the output dimension is 1*1*C. Finally, a sigmoid function yields s (here delta is the ReLU and sigma the sigmoid):

$$s = F_{ex}(z, W) = \sigma\big(W_2\,\delta(W_1 z)\big) \tag{3}$$
In other words, the resulting s has dimension 1*1*C, where C is the number of channels. This s is actually the core of the paper: it describes the weights of the C feature maps in the tensor U, and these weights are learned through the preceding fully connected and nonlinear layers, so they can be trained end to end. The role of the two fully connected layers is to fuse the feature-map information of all channels, because the preceding Squeeze operates only within a single channel's feature map.
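The dimension flow described above can be traced with a small sketch (C = 64 is an arbitrary example; r = 16 follows the text, and the weights here are randomly initialised, purely for illustration):

import torch
import torch.nn as nn

C, r = 64, 16                             # example channel count; reduction ratio r = 16 as in the text
z = torch.randn(1, C)                     # squeezed descriptor, 1*1*C flattened to (batch, C)

W1 = nn.Linear(C, C // r, bias=False)     # weight matrix of size C/r x C
W2 = nn.Linear(C // r, C, bias=False)     # weight matrix of size C x C/r

s = torch.sigmoid(W2(torch.relu(W1(z))))  # Equation 3: sigmoid(W2 * ReLU(W1 * z))
print(W1(z).shape, s.shape)               # (1, C/r) and (1, C)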
After obtaining s, we can operate on the original tensor U, which is Equation 4 below. The multiplication is channel-wise: u_c is a two-dimensional matrix (one H*W feature map) and s_c is a single number, a weight, so every value in u_c is multiplied by s_c. This corresponds to F_scale in Figure 1:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c \tag{4}$$
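The channel-wise multiplication of Equation 4 is simply broadcasting; a minimal sketch with the same example shapes:

import torch

u = torch.randn(1, 64, 32, 32)        # tensor U: C = 64 feature maps of size 32*32 (example sizes)
s = torch.rand(1, 64)                 # per-channel weights s from the Excitation step

# Reshape s to (1, C, 1, 1) so each scalar s_c multiplies its entire H*W feature map u_c
x_tilde = u * s.view(1, 64, 1, 1)
print(x_tilde.shape)                  # torch.Size([1, 64, 32, 32])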
IV. Application of SENet in a specific network (code implementation of SE-ResNet)
Having introduced the concrete formulas, we now look at how the SE block is applied to specific networks.

The figure above is an example of embedding an SE module into an Inception structure. The dimension information next to each box is the output size of that layer.
Here global average pooling is used as the Squeeze operation. Two fully connected layers then form a bottleneck structure to model the correlation between channels, outputting the same number of weights as there are input feature channels. We first reduce the feature dimension to 1/16 of the input, and after a ReLU activation raise it back to the original dimension through another fully connected layer. The advantages of doing this instead of using a single fully connected layer are:
1) It has more nonlinearity and can better fit the complex correlation between channels;
2) It greatly reduces the number of parameters and the amount of computation (see the rough calculation below). A normalized weight between 0 and 1 is then obtained through a sigmoid gate, and finally the normalized weight is applied to the features of each channel through a Scale operation.
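As a rough calculation for point 2) (C = 256 is only an example channel count, not a figure from the paper): a single C-to-C fully connected layer would need C*C weights, while the bottleneck with reduction ratio r = 16 needs only C*(C/r) + (C/r)*C:

C, r = 256, 16                             # example channel count and the reduction ratio from the text
single_fc = C * C                          # one direct C -> C fully connected layer
bottleneck = C * (C // r) + (C // r) * C   # two layers: C -> C/r -> C
print(single_fc, bottleneck)               # 65536 vs 8192, roughly an 8x reduction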
In addition, SE modules can be embedded in modules that contain skip-connections. The figure above right is an example of SE embedded in a ResNet module. The procedure is basically the same as for SE-Inception, except that the residual features on the branch are recalibrated before the Addition. If instead the features on the main branch after the Addition were recalibrated, then, because of the 0~1 scale operation on the trunk, vanishing gradients would easily occur near the input layer during backpropagation when the network is deep, making the model difficult to optimize.
Most current mainstream networks are built by repeatedly stacking these two kinds of similar units, so SE modules can be embedded in almost all existing network structures. By embedding SE modules in the building blocks of the original network structure, we obtain different kinds of SENet, such as SE-BN-Inception, SE-ResNet, SE-ResNeXt, SE-Inception-ResNet-v2, and so on.
This example shows how to embed the SE module into a ResNet by implementing SE-ResNet. The SE-ResNet model is implemented as follows:
SE module
from torch import nn


class SELayer(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer, self).__init__()
        # Squeeze: global average pooling to 1x1 per channel
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # Excitation: bottleneck of two fully connected layers
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)
        y = self.fc(y).view(b, c, 1, 1)
        # Reweight: channel-wise multiplication
        return x * y.expand_as(x)
The first residual module
The first residual module is used to implement the ResNet18 and ResNet34 models; the SE module is embedded after the second convolution.
from torch import nn
from torch.nn import functional as F


class ResidualBlock(nn.Module):
    """Sub-module: residual block with an SE layer after the second convolution"""
    def __init__(self, inchannel, outchannel, stride=1, shortcut=None):
        super(ResidualBlock, self).__init__()
        self.left = nn.Sequential(
            nn.Conv2d(inchannel, outchannel, 3, stride, 1, bias=False),
            nn.BatchNorm2d(outchannel),
            nn.ReLU(inplace=True),
            nn.Conv2d(outchannel, outchannel, 3, 1, 1, bias=False),
            nn.BatchNorm2d(outchannel)
        )
        # SE module applied after the second convolution
        self.se = SELayer(outchannel, 16)
        self.right = shortcut

    def forward(self, x):
        out = self.left(x)
        out = self.se(out)
        residual = x if self.right is None else self.right(x)
        out += residual
        return F.relu(out)
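A brief usage sketch of the first residual module (the channel numbers, input size, and shortcut here are illustrative assumptions, built the same way _make_layer does further below):

import torch
from torch import nn

# Downsampling stage: 64 -> 128 channels with stride 2, so a 1x1-conv shortcut is required
shortcut = nn.Sequential(
    nn.Conv2d(64, 128, 1, 2, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU()
)
block = ResidualBlock(64, 128, stride=2, shortcut=shortcut)

x = torch.randn(1, 64, 56, 56)        # example input size
print(block(x).shape)                 # torch.Size([1, 128, 28, 28])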
The second residual module
The second residual module is used to implement the ResNet50, ResNet101, and ResNet152 models; the SE module is embedded after the third convolution.
import torch.nn as nn


class Bottleneck(nn.Module):
    def __init__(self, in_places, places, stride=1, downsampling=False, expansion=4):
        super(Bottleneck, self).__init__()
        self.expansion = expansion
        self.downsampling = downsampling
        self.bottleneck = nn.Sequential(
            nn.Conv2d(in_channels=in_places, out_channels=places, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(places),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=places, out_channels=places, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(places),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=places, out_channels=places * self.expansion, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(places * self.expansion),
        )
        # SE module applied after the third convolution
        self.se = SELayer(places * self.expansion, 16)
        if self.downsampling:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels=in_places, out_channels=places * self.expansion, kernel_size=1, stride=stride,
                          bias=False),
                nn.BatchNorm2d(places * self.expansion)
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x
        out = self.bottleneck(x)
        out = self.se(out)
        if self.downsampling:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out
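A brief usage sketch of the Bottleneck above (channel numbers and input size are illustrative assumptions):

import torch

# First block of a stage: 64 input channels expand to 64 * 4 = 256, so downsampling is enabled
block = Bottleneck(in_places=64, places=64, stride=1, downsampling=True)

x = torch.randn(1, 64, 56, 56)        # example input size
print(block(x).shape)                 # torch.Size([1, 256, 56, 56])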
Complete code for the SEResNet18 and SEResNet34 models
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from torchsummary import summary


class SELayer(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)
        y = self.fc(y).view(b, c, 1, 1)
        return x * y.expand_as(x)


class ResidualBlock(nn.Module):
    """Sub-module: residual block with an SE layer after the second convolution"""
    def __init__(self, inchannel, outchannel, stride=1, shortcut=None):
        super(ResidualBlock, self).__init__()
        self.left = nn.Sequential(
            nn.Conv2d(inchannel, outchannel, 3, stride, 1, bias=False),
            nn.BatchNorm2d(outchannel),
            nn.ReLU(inplace=True),
            nn.Conv2d(outchannel, outchannel, 3, 1, 1, bias=False),
            nn.BatchNorm2d(outchannel)
        )
        self.se = SELayer(outchannel, 16)
        self.right = shortcut

    def forward(self, x):
        out = self.left(x)
        out = self.se(out)
        residual = x if self.right is None else self.right(x)
        out += residual
        return F.relu(out)


class ResNet(nn.Module):
    """Main module: ResNet18/ResNet34. Each layer contains several residual blocks,
    built with the _make_layer function."""
    def __init__(self, blocks, num_classes=1000):
        super(ResNet, self).__init__()
        self.model_name = 'resnet34'
        # First few layers: image transformation
        self.pre = nn.Sequential(
            nn.Conv2d(3, 64, 7, 2, 3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, 2, 1))
        # Four layers with blocks[0..3] residual blocks each (3, 4, 6, 3 for ResNet34)
        self.layer1 = self._make_layer(64, 64, blocks[0])
        self.layer2 = self._make_layer(64, 128, blocks[1], stride=2)
        self.layer3 = self._make_layer(128, 256, blocks[2], stride=2)
        self.layer4 = self._make_layer(256, 512, blocks[3], stride=2)
        # Fully connected layer for classification
        self.fc = nn.Linear(512, num_classes)

    def _make_layer(self, inchannel, outchannel, block_num, stride=1):
        """Build a layer containing multiple residual blocks"""
        shortcut = nn.Sequential(
            nn.Conv2d(inchannel, outchannel, 1, stride, bias=False),
            nn.BatchNorm2d(outchannel),
            nn.ReLU()
        )
        layers = []
        layers.append(ResidualBlock(inchannel, outchannel, stride, shortcut))
        for i in range(1, block_num):
            layers.append(ResidualBlock(outchannel, outchannel))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.pre(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = F.avg_pool2d(x, 7)
        x = x.view(x.size(0), -1)
        return self.fc(x)


def Se_ResNet18():
    return ResNet([2, 2, 2, 2])


def Se_ResNet34():
    return ResNet([3, 4, 6, 3])


if __name__ == '__main__':
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = Se_ResNet34()
    model.to(device)
    summary(model, (3, 224, 224))
Complete code for the SEResNet50, SEResNet101, and SEResNet152 models
import torch
import torch.nn as nn
import torchvision
import numpy as np
from torchsummary import summary

print("PyTorch Version: ", torch.__version__)
print("Torchvision Version: ", torchvision.__version__)

__all__ = ['SEResNet50', 'SEResNet101', 'SEResNet152']


class SELayer(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)
        y = self.fc(y).view(b, c, 1, 1)
        return x * y.expand_as(x)


def Conv1(in_planes, places, stride=2):
    return nn.Sequential(
        nn.Conv2d(in_channels=in_planes, out_channels=places, kernel_size=7, stride=stride, padding=3, bias=False),
        nn.BatchNorm2d(places),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))


class Bottleneck(nn.Module):
    def __init__(self, in_places, places, stride=1, downsampling=False, expansion=4):
        super(Bottleneck, self).__init__()
        self.expansion = expansion
        self.downsampling = downsampling
        self.bottleneck = nn.Sequential(
            nn.Conv2d(in_channels=in_places, out_channels=places, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(places),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=places, out_channels=places, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(places),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=places, out_channels=places * self.expansion, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(places * self.expansion),
        )
        # SE module applied after the third convolution
        self.se = SELayer(places * self.expansion, 16)
        if self.downsampling:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels=in_places, out_channels=places * self.expansion, kernel_size=1, stride=stride,
                          bias=False),
                nn.BatchNorm2d(places * self.expansion)
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x
        out = self.bottleneck(x)
        out = self.se(out)
        if self.downsampling:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out


class ResNet(nn.Module):
    def __init__(self, blocks, num_classes=1000, expansion=4):
        super(ResNet, self).__init__()
        self.expansion = expansion
        self.conv1 = Conv1(in_planes=3, places=64)
        self.layer1 = self.make_layer(in_places=64, places=64, block=blocks[0], stride=1)
        self.layer2 = self.make_layer(in_places=256, places=128, block=blocks[1], stride=2)
        self.layer3 = self.make_layer(in_places=512, places=256, block=blocks[2], stride=2)
        self.layer4 = self.make_layer(in_places=1024, places=512, block=blocks[3], stride=2)
        self.avgpool = nn.AvgPool2d(7, stride=1)
        self.fc = nn.Linear(2048, num_classes)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def make_layer(self, in_places, places, block, stride):
        layers = []
        layers.append(Bottleneck(in_places, places, stride, downsampling=True))
        for i in range(1, block):
            layers.append(Bottleneck(places * self.expansion, places))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x


def SEResNet50():
    return ResNet([3, 4, 6, 3])


def SEResNet101():
    return ResNet([3, 4, 23, 3])


def SEResNet152():
    return ResNet([3, 8, 36, 3])


if __name__ == '__main__':
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = SEResNet50()
    model.to(device)
    summary(model, (3, 224, 224))