Convolutional Block Attention Module (CBAM) is a plug-and-play attention module that combines channel and spatial attention. Compared with SENet's attention mechanism, which only attends over channels, it achieves better results.
Paper title: CBAM: Convolutional Block Attention Module
Authors: Sanghyun Woo, Jongchan Park, Joon-Young Lee and In So Kweon, Korea Advanced Institute of Science and Technology, Daejeon, Korea
Abstract
- Convolutional Block Attention Module (CBAM) is a simple yet effective attention module for feed-forward convolutional neural networks.
- The module sequentially infers attention maps along the channel and spatial dimensions, forming a mixed-domain attention mechanism.
- CBAM is a lightweight, general-purpose module that can be seamlessly integrated into any CNN.
Key words: Object recognition, attention mechanism, gated convolution
Introduction
- Convolutional neural networks (CNNs) have significantly improved performance on visual tasks thanks to their rich representational power. Current research on network architecture focuses mainly on three factors: depth, width, and cardinality.
- From LeNet to residual networks, networks have become deeper and more expressive. GoogLeNet showed that width is another important factor for improving model performance. Xception and ResNeXt increase the cardinality of the network, achieving greater expressive power than simply increasing depth or width while saving parameters (as argued in the ResNeXt paper).
- In addition to these factors, this paper examines a different aspect of network architecture design — attention.
Attention mechanism
- Attention plays an important role in human perception. An important property of the human visual system is that it does not attempt to process a whole scene at once; instead, it exploits a sequence of partial glimpses and selectively focuses on salient parts.
- In recent years there have been several attempts to incorporate attention into CNNs, for example the Residual Attention Network, which uses an encoder-decoder style attention module, and SENet, which uses the Squeeze-and-Excitation block.
- The specific form of attention used in CBAM is described in the following sections.
CBAM
The overall structure
- CBAM introduces an attention mechanism in the mixed domain (channel domain and spatial domain) and therefore has stronger expressive power.
- The whole process can be summarized as:

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F'$$

where $F$ is the input feature map of the module, $M_c$ and $M_s$ denote the channel attention map and the spatial attention map respectively, and $\otimes$ denotes element-wise multiplication, which is broadcast accordingly in the actual implementation.
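As a concrete illustration of the broadcasting, here is a minimal sketch with random tensors standing in for the attention maps (the shapes follow the definitions above: $M_c$ carries one weight per channel, $M_s$ one weight per spatial location):

```python
import torch

B, C, H, W = 2, 64, 32, 32
F = torch.randn(B, C, H, W)      # input feature map

Mc = torch.rand(B, C, 1, 1)      # channel attention map: one weight per channel
F1 = Mc * F                      # broadcast over H and W -> (B, C, H, W)

Ms = torch.rand(B, 1, H, W)      # spatial attention map: one weight per spatial location
F2 = Ms * F1                     # broadcast over C -> (B, C, H, W)

print(F2.shape)                  # torch.Size([2, 64, 32, 32])
```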
Channel attention module
The inter-channel relationships of features are used to generate a channel attention map.
Channel attention focuses on "what" is meaningful in the input image.
Implementation process:
- Global average pooling and global max pooling are applied to the input to aggregate spatial information;
- Both pooled descriptors are passed through a shared MLP whose hidden layer has $\frac{C}{r}$ channels to reduce the number of parameters, with a ReLU after the first layer to introduce nonlinearity (similar to SENet; this bottleneck structure appears in many networks, both to reduce parameters and computation and to gain extra nonlinearity);
- The two outputs are summed element-wise and passed through a sigmoid to obtain the final channel attention map;
- The attention map is multiplied with the input (broadcast automatically).
Code reproduction:
```python
import torch
import torch.nn as nn

class Channel_module(nn.Module):
    def __init__(self, in_ch, ratio=16):
        super(Channel_module, self).__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.gmp = nn.AdaptiveMaxPool2d(1)   # global max pooling
        # shared MLP implemented with 1x1 convolutions; hidden layer has in_ch // ratio channels
        self.fc1 = nn.Conv2d(in_ch, in_ch // ratio, 1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Conv2d(in_ch // ratio, in_ch, 1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        a = self.fc2(self.relu(self.fc1(self.gap(x))))  # avg-pooled branch
        m = self.fc2(self.relu(self.fc1(self.gmp(x))))  # max-pooled branch
        attention = self.sigmoid(a + m)                 # channel attention map, shape (B, C, 1, 1)
        return x * attention                            # broadcast multiply over H and W
```
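A quick shape check, assuming the Channel_module definition above (the tensor sizes here are arbitrary):

```python
x = torch.randn(2, 64, 32, 32)           # (B, C, H, W) feature map
ca = Channel_module(in_ch=64, ratio=16)
print(ca(x).shape)                       # torch.Size([2, 64, 32, 32]); channels are reweighted, shape is preserved
```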
Spatial attention module
The spatial relationships of features are used to generate a spatial attention map.
Spatial attention focuses on “where” the important information is, complementing channel attention.
Implementation process:
- Average pooling and max pooling are applied along the channel dimension, and the two resulting maps are concatenated;
- A 7×7 convolution layer reduces the number of channels to 1;
- A sigmoid produces the spatial attention map;
- The input to this module is the channel-refined feature, which is then multiplied by the attention map.
Code reproduction:
```python
import torch
import torch.nn as nn

class Spatial_module(nn.Module):
    def __init__(self, kernel_size=7):
        super(Spatial_module, self).__init__()
        # 2 input channels (avg- and max-pooled maps), 1 output channel;
        # padding = kernel_size // 2 keeps the spatial size unchanged
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        a = torch.mean(x, dim=1, keepdim=True)           # average pooling along channels -> (B, 1, H, W)
        m, _ = torch.max(x, dim=1, keepdim=True)         # max pooling along channels -> (B, 1, H, W)
        attention = torch.cat([a, m], dim=1)             # concatenate -> (B, 2, H, W)
        attention = self.sigmoid(self.conv(attention))   # spatial attention map, shape (B, 1, H, W)
        return x * attention                             # broadcast multiply over channels
```
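A quick shape check for the spatial module, assuming the class above:

```python
x = torch.randn(2, 64, 32, 32)       # channel-refined feature
sa = Spatial_module(kernel_size=7)
print(sa(x).shape)                   # torch.Size([2, 64, 32, 32]); the internal attention map has shape (2, 1, 32, 32)
```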
Arrangement of attention modules.
The two attention modules compute complementary attention, so they can be arranged either in parallel or sequentially. Experimental results show that the sequential arrangement is better than the parallel one, and that applying channel attention first is slightly better than applying spatial attention first.
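A minimal sketch of the channel-first sequential arrangement, reusing the Channel_module and Spatial_module classes defined above:

```python
class CBAM(nn.Module):
    """Sequential, channel-first arrangement: F' = Mc(F) * F, then F'' = Ms(F') * F'."""
    def __init__(self, in_ch, ratio=16, kernel_size=7):
        super(CBAM, self).__init__()
        self.channel = Channel_module(in_ch, ratio)
        self.spatial = Spatial_module(kernel_size)

    def forward(self, x):
        x = self.channel(x)   # channel attention first
        x = self.spatial(x)   # then spatial attention on the channel-refined feature
        return x

# Usage: out = CBAM(in_ch=64)(torch.randn(2, 64, 32, 32))  -> same shape as the input
```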
Ablation studies
The authors first searched for an effective way to compute channel attention, then spatial attention, and finally considered how to combine the channel and spatial attention modules.
Channel attention
The authors compared three variants of channel attention: average pooling only, max pooling only, and the combination of both poolings.
The results show that max pooling is just as important as average pooling, whereas SENet ignores max pooling.
Max-pooled features, which encode the most salient parts, can compensate for average-pooled features, which softly encode global statistics.
In the following study of spatial attention, both the max-pooled and average-pooled features are used, and the reduction ratio r is set to 16.
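To mimic this ablation, the channel module can be parameterized by pooling mode; the following is a hypothetical sketch (the class name and the pool argument are my own, not from the paper's code):

```python
class Channel_module_ablation(nn.Module):
    """Channel attention with a configurable pooling mode: 'avg', 'max', or 'both'."""
    def __init__(self, in_ch, ratio=16, pool='both'):
        super(Channel_module_ablation, self).__init__()
        self.pool = pool
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.gmp = nn.AdaptiveMaxPool2d(1)
        # shared MLP with a C/r bottleneck, as in the module above
        self.mlp = nn.Sequential(
            nn.Conv2d(in_ch, in_ch // ratio, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch // ratio, in_ch, 1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        s = 0
        if self.pool in ('avg', 'both'):
            s = s + self.mlp(self.gap(x))   # average-pooling branch
        if self.pool in ('max', 'both'):
            s = s + self.mlp(self.gmp(x))   # max-pooling branch
        return x * self.sigmoid(s)
```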
Spatial attention
The authors considered two schemes for spatial attention: one applies average pooling and max pooling along the channel dimension, the other uses a 1×1 convolution for dimensionality reduction. The effect of 3×3 versus 7×7 convolution kernels was also studied. In these experiments, the spatial attention module was placed after the channel attention module.
The results show that channel pooling works better, and that a larger kernel yields better accuracy, which suggests that a large receptive field is needed to decide which spatial regions are important.
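For comparison, the 1×1-convolution alternative mentioned above could look roughly like the sketch below; the class name and the exact reduction ratio are assumptions, and the configuration used in the paper's ablation may differ:

```python
class Spatial_module_conv1x1(nn.Module):
    """Spatial attention variant: channel reduction via 1x1 convolution instead of channel pooling (sketch)."""
    def __init__(self, in_ch, ratio=16, kernel_size=7):
        super(Spatial_module_conv1x1, self).__init__()
        self.reduce = nn.Conv2d(in_ch, in_ch // ratio, 1, bias=False)  # 1x1 conv for dimensionality reduction
        self.conv = nn.Conv2d(in_ch // ratio, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        attention = self.sigmoid(self.conv(self.reduce(x)))  # (B, 1, H, W)
        return x * attention
```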
Arrangement of the channel and spatial attention.
The authors considered three different arrangements of the two modules: channel-first, spatial-first, and parallel.
The results show that the sequential channel-first arrangement works best.
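The three arrangements differ only in how the two modules are composed; here is a minimal sketch reusing Channel_module and Spatial_module (how the parallel branches are merged is an assumption of this sketch, summation is used purely for illustration):

```python
class CBAM_variants(nn.Module):
    """Compare arrangements: 'channel_first', 'spatial_first', or 'parallel' (illustrative sketch)."""
    def __init__(self, in_ch, ratio=16, kernel_size=7, order='channel_first'):
        super(CBAM_variants, self).__init__()
        self.order = order
        self.channel = Channel_module(in_ch, ratio)
        self.spatial = Spatial_module(kernel_size)

    def forward(self, x):
        if self.order == 'channel_first':
            return self.spatial(self.channel(x))    # the arrangement the paper found best
        if self.order == 'spatial_first':
            return self.channel(self.spatial(x))
        # 'parallel': apply both branches to the same input and merge them;
        # merging by summation is an assumption made for this sketch only
        return self.channel(x) + self.spatial(x)
```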