Preface

In recent years, attention mechanisms have been widely used to improve model performance and have become a familiar tool. This paper introduces a new coordinate attention mechanism that addresses some shortcomings of SE and CBAM, produces better results, and is just as simple to use.

Address: arxiv.org/pdf/2103.02…

Code address: github.com/AndrewQibin…


Introduction

Most attention mechanisms used in deep neural networks bring solid performance gains, but on mobile networks (smaller models) they clearly lag behind what they achieve on large networks, mainly because the computational overhead of most attention mechanisms, such as self-attention, is unaffordable for mobile networks.

Therefore, the attention modules most commonly used in mobile networks are Squeeze-and-Excitation (SE), BAM, and CBAM. However, SE only considers inter-channel information and ignores the importance of positional information, even though the spatial structure of targets is very important in vision. BAM and CBAM try to introduce positional information through global pooling over the channels, but this approach can only capture local information, not long-range dependencies.

To explain a little here: after several convolutional layers, each position of the feature map contains information from only a local region of the original image. CBAM takes the maximum and the average over the channels at each position as the weighting coefficients, so this weighting only reflects local-range information.

This paper proposes a novel and efficient attention mechanism that embeds positional information into channel attention, so that mobile networks can attend over larger regions without introducing significant overhead. To avoid the loss of positional information caused by 2D global pooling, channel attention is decomposed into two parallel 1D feature-encoding processes, which efficiently integrate spatial coordinate information into the generated attention maps.

Specifically, two 1D global pooling operations aggregate the input features along the vertical and horizontal directions into two separate direction-aware feature maps. The two feature maps with embedded directional information are then encoded into two attention maps, each of which captures long-range dependencies of the input feature map along one spatial direction. Positional information is thus preserved in the generated attention maps. Finally, both attention maps are applied to the input feature map by multiplication to emphasize the regions of interest.
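
As a shape-level illustration of the two directional pooling operations, here is a minimal PyTorch sketch (the tensor names and sizes are hypothetical, not taken from the authors' code):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 32, 32)            # hypothetical B x C x H x W feature map

pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average over the width  -> B x C x H x 1
pool_w = nn.AdaptiveAvgPool2d((1, None))  # average over the height -> B x C x 1 x W

x_h = pool_h(x)                           # direction-aware features along the vertical axis
x_w = pool_w(x)                           # direction-aware features along the horizontal axis
print(x_h.shape, x_w.shape)               # torch.Size([2, 64, 32, 1]) torch.Size([2, 64, 1, 32])
```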

Since the operation distinguishes spatial directions (i.e., coordinates) and generates coordinate-aware attention maps, the paper calls the proposed method “coordinate attention”.

This coordinate attention has three advantages:

  1. It captures not only cross-channel information, but also direction-aware and position-sensitive information, which enables the model to locate and identify the target region more accurately.

  2. This approach is flexible and lightweight, and can easily be inserted into the building blocks of classic mobile networks, such as the inverted residual block in MobileNetV2 and the sandglass block in MobileNeXt, to strengthen feature representations.

  3. As a pre-trained model, a mobile network equipped with coordinate attention brings significant performance gains to downstream tasks, especially dense-prediction tasks such as semantic segmentation.

Coordinate Attention

Before introducing coordinate attention, let’s briefly review SE and CBAM.

SE is relatively simple, as shown in figure (a); the structure diagram largely speaks for itself: a global average pooling squeezes each channel to a scalar, and a small fully connected bottleneck with a sigmoid produces per-channel weights.
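
For reference, a minimal PyTorch sketch of an SE block (assuming the usual two-layer bottleneck with a reduction ratio of 16; this is an illustration, not the paper's code):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block (minimal sketch)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global average pooling
        self.fc = nn.Sequential(                     # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # per-channel weights
        return x * w                                           # reweight the channels
```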

CBAM, shown in figure (b), contains two parts: channel attention and spatial attention.

Channel attention: global average pooling and global max pooling are applied to the input feature map, yielding two 1D vectors; each is passed through a shared MLP (1×1 conv, ReLU, 1×1 conv), the two outputs are summed and normalized by a sigmoid, and the result is used to weight the input feature map.
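
A minimal sketch of this channel-attention step (the shared 1×1-conv MLP follows the original CBAM design; the reduction ratio of 16 is an assumption):

```python
import torch
import torch.nn as nn

class CBAMChannelAttention(nn.Module):
    """CBAM channel attention (minimal sketch)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.mlp = nn.Sequential(                    # shared MLP built from 1x1 convs
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        # both pooled descriptors go through the same MLP and are fused by a sigmoid
        w = torch.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
        return x * w
```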

Spatial attention: max pooling and average pooling are taken over the channels at each position of the feature map, yielding two single-channel maps; these are concatenated and passed through a 7×7 conv, BN, and a sigmoid to produce the spatial weights.
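
And a sketch of the spatial-attention step (the 7×7 kernel and BN follow the description above, though some implementations omit the BN):

```python
import torch
import torch.nn as nn

class CBAMSpatialAttention(nn.Module):
    """CBAM spatial attention (minimal sketch)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        max_map, _ = x.max(dim=1, keepdim=True)   # per-position max over channels: B x 1 x H x W
        avg_map = x.mean(dim=1, keepdim=True)     # per-position mean over channels: B x 1 x H x W
        w = torch.sigmoid(self.bn(self.conv(torch.cat([max_map, avg_map], dim=1))))
        return x * w                              # reweight every spatial position
```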

The details are shown in the figure below:

Returning to Coordinate Attention, as shown in the figure below: average pooling is carried out along the horizontal and vertical directions respectively to obtain two 1D feature maps. These are concatenated along the spatial dimension, a 1×1 conv compresses the channels, and BN plus a non-linearity encode the spatial information of the vertical and horizontal directions. The result is then split back into two parts, each restored to the input channel number by a 1×1 conv, normalized by a sigmoid, and used to weight the input feature map.

To put it simply, Coordinate Attention performs average pooling along the horizontal and vertical directions, applies a transform to encode the spatial information, and finally fuses the spatial information back into the channels by weighting.
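
A minimal PyTorch sketch of the whole block follows (the reduction ratio of 32 is an assumption, and ReLU stands in for the paper's non-linearity; see the official repository linked above for the reference implementation):

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate Attention block (minimal sketch of the structure described above)."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool along the width  -> B x C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool along the height -> B x C x 1 x W
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)                # stand-in for the paper's non-linearity
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                            # B x C x H x 1
        x_w = self.pool_w(x).permute(0, 1, 3, 2)        # B x C x W x 1
        y = torch.cat([x_h, x_w], dim=2)                # concat along the spatial axis
        y = self.act(self.bn1(self.conv1(y)))           # 1x1 conv compresses channels, then BN + non-linear
        y_h, y_w = torch.split(y, [h, w], dim=2)        # split back into the two directions
        a_h = torch.sigmoid(self.conv_h(y_h))                       # B x C x H x 1
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # B x C x 1 x W
        return x * a_h * a_w                            # weight the input along both directions
```

Because the input and output shapes are identical, such a block can simply be dropped into an existing building block, for example inside a MobileNetV2 inverted residual.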

Conclusion

This approach brings significant improvements over SE and CBAM.

The next article will summarize attention mechanisms.

This article comes from the technical summary series of the public account CV Technical Guide.

Welcome to follow the public account CV Technical Guide, which focuses on computer vision technical summaries, tracking of the latest techniques, and interpretation of classic papers.

A summary PDF of the following articles can be obtained by replying to the keyword “Technical Summary” in the public account.

Other articles

Past, present and possibility of visual object detection and recognition

Siamese network summary

Summary of computer vision terms (a) to build the knowledge system of computer vision

Summary of under-fitting and over-fitting techniques

Summary of normalization methods

Summary of common ideas of paper innovation

Summary of efficient reading methods for English literature in the CV field

A review of few-shot learning in computer vision

A brief overview of knowledge distillation

Optimize the read speed of OpenCV video

NMS summary

Loss function technology summary

Summary of attention mechanism technology

Summary of feature pyramid technology

Summary of pooling techniques

Summary of data augmentation methods

Summary of CNN structure evolution (I): Classical models

Summary of CNN structure evolution (II): Lightweight models

Summary of CNN structure evolution (III): Design principles

How to view the future trend of computer vision

Summary of CNN visualization techniques (I): Feature map visualization

Summary of CNN visualization techniques (II): Convolutional kernel visualization

Summary of CNN visualization techniques (III): Class visualization

Summary of CNN visualization techniques (IV): Visualization tools and projects