GCT (Gated Channel Transformation) is a gated channel attention mechanism that uses global context embedding, channel normalization, and adaptive gating to explicitly model the relationships between channels and to promote competition or cooperation among them.

Gated Channel Transformation for Visual Recognition

Zongxin Yang, Linchao Zhu, Yu Wu, and Yi Yang

Code: github.com/z-x-yang/GC…

Abstract

  • The GCT module is a general-purpose gated transformation unit that can be optimized jointly with the network weights.
  • Unlike SENet, which learns channel relationships implicitly through fully connected layers, GCT uses interpretable variables to model the relationships between channels explicitly, determining whether they compete or cooperate.

Key words: interpretability, explicit relation, gating

Introduction

  • A single convolution layer only operates on a local neighborhood around each spatial position of the feature map, which can lead to local ambiguity. There are usually two ways to address this: one is to increase the depth of the network, as in VGG and ResNet; the other is to increase the width of the network to capture more global information. For example, GENet makes extensive use of neighborhood embedding, and SENet uses globally embedded information to model channel relationships.

  • However, there are several problems with the FC layers used in SENet:

    1. Because FC layers are parameter-heavy, SE blocks cannot be applied after every layer if the parameter count is to stay reasonable
    2. The FC parameters are hard to interpret, so the correlations between channels are difficult to analyze; the relationship is learned only implicitly
    3. SE can fail when placed after certain layers (see the discussion of GAP below)

Related work

Gating mechanism

Gating mechanisms have been successfully applied to some recurrent neural network structures. LSTM introduces an input gate, an output gate, and a forget gate to regulate the flow of information into and out of the module. Building on gating mechanisms, some attention methods focus computational resources on the most informative parts of the features.

Normalization layers

In recent years, normalization layers have been widely used in deep networks. Local response normalization (LRN) computes statistics in a small neighborhood across channels for each pixel; batch normalization (BN) uses global spatial information along the batch dimension; layer normalization (LN) computes along the channel dimension rather than the batch dimension; group normalization (GN) divides channels into groups and computes the mean and variance within each group to perform normalization.

GCT

Design idea:

  1. Embed global context information with a p-norm
  2. Normalize the embedded information across channels (a parameter-free operation)
  3. Realize the gating attention mechanism with a gating weight and a gating bias

The overall structure

The GCT module consists of three parts: global context embedding, channel normalization, and gating adaptation. The channel normalization operation has no parameters.

To make GCT learnable, three sets of trainable weights are introduced: \alpha, \gamma, and \beta. The embedding weight \alpha adaptively scales the embedding output, while the gating weight \gamma and gating bias \beta control the activation of the gate.

In addition, the parameter complexity of GCT is O(C), while that of SEnet is O(C^2).
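
As a quick sanity check on these complexities (a minimal sketch, not from the paper; the SE block is assumed to take its usual two-FC-layer form with reduction ratio r = 16, and bias terms are ignored):

    # Rough per-block parameter counts.
    def gct_params(C):
        # GCT learns alpha, gamma, beta, each with one scalar per channel.
        return 3 * C

    def se_params(C, r=16):
        # Assumed SE form: FC C -> C/r followed by FC C/r -> C.
        return 2 * C * C // r

    for C in (64, 256, 1024):
        print(C, gct_params(C), se_params(C))
    # e.g. C = 1024: GCT uses 3072 parameters, SE about 131072.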

The transformation performed by the GCT module is then:


\hat{x} = F(x \mid \alpha, \gamma, \beta), \quad \alpha, \gamma, \beta \in \mathbb{R}^C

Global context embedding

Large receptive fields can avoid local semantic ambiguity, so a global context embedding module is designed to aggregate global context information in each channel.

GAP (global average pooling) fails in some cases. For example, when the SE module is deployed after an instance normalization (IN) layer, the output of GAP is constant for any input, because IN fixes the mean of each channel.
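
A quick check of this failure mode (a sketch of my own; InstanceNorm2d without affine parameters stands in for the normalization layer):

    import torch
    import torch.nn as nn

    x = torch.randn(2, 8, 4, 4)
    y = nn.InstanceNorm2d(8)(x)      # normalizes each channel of each sample to zero mean
    print(y.mean(dim=(2, 3)))        # ~0 everywhere: GAP gives the same output for any input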

Here, a p-norm is used for global context embedding. The 2-norm works best, and the 1-norm comes very close. Note that when p = 1, for non-negative inputs (e.g., when deployed after ReLU), the 1-norm embedding is equivalent to GAP up to a constant factor.
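
A quick check of this equivalence (a sketch; for non-negative inputs the 1-norm is just the spatial sum, i.e. GAP scaled by the constant H x W):

    import torch

    x = torch.relu(torch.randn(2, 8, 4, 4))     # non-negative input, e.g. after ReLU
    l1 = x.abs().sum(dim=(2, 3))                 # 1-norm over the spatial dimensions
    gap = x.mean(dim=(2, 3)) * (4 * 4)           # GAP rescaled by H * W
    print(torch.allclose(l1, gap))               # True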

The trainable embedding parameter is defined as \alpha = [\alpha_1, \ldots, \alpha_C]; when \alpha_c is close to zero, channel c does not participate in the channel normalization.

The module is defined as:


s_c = \alpha_c ||x_c||_p = \alpha_c \Big\{ \Big[ \sum_{i=1}^{H} \sum_{j=1}^{W} (x_c^{i,j})^p \Big] + \varepsilon \Big\}^{\frac{1}{p}}

Here, \varepsilon is a small constant that avoids problems when differentiating at zero; this is a common trick and will not be mentioned again.
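
A minimal sketch of this embedding step in PyTorch (assuming p = 2 and the [B, C, H, W] layout used in the code later in this post):

    import torch

    B, C, H, W = 2, 64, 32, 32
    x = torch.randn(B, C, H, W)
    alpha = torch.ones(1, C, 1, 1)   # embedding weight alpha_c, one scalar per channel
    eps = 1e-5

    # s_c = alpha_c * ||x_c||_2, computed over the spatial dimensions of each channel
    s = alpha * (x.pow(2).sum(dim=(2, 3), keepdim=True) + eps).pow(0.5)
    print(s.shape)                   # torch.Size([2, 64, 1, 1])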

Channel normalization

Normalization can create competition between neurons (or channels): channels with larger responses become relatively larger, while channels with smaller responses are suppressed. (This idea may have first appeared in the LRN paper, but the GCT paper does not explain it. My guess is that the competition effect appears when \frac{\sqrt{C}}{||s||_2} > 1.) Here, l_2 normalization across channels is used.

Similar to LRN, it is defined as follows:


\hat{s}_c = \frac{\sqrt{C} s_c}{||s||_2} = \frac{\sqrt{C} s_c}{\big[ \big( \sum_{c=1}^{C} s_c^2 \big) + \varepsilon \big]^{\frac{1}{2}}}
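
Continuing the sketch above, the channel normalization fits in one line (note that it has no trainable parameters):

    # s_hat_c = sqrt(C) * s_c / ||s||_2, normalized across the channel dimension
    s_hat = (C ** 0.5) * s / (s.pow(2).sum(dim=1, keepdim=True) + eps).pow(0.5)
    print(s_hat.shape)               # torch.Size([2, 64, 1, 1])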

Gating adaptation

The definition is as follows:


\hat{x}_c = x_c \big[ 1 + \tanh(\gamma_c \hat{s}_c + \beta_c) \big]

When the gating weight of a channel is positively activated, GCT promotes competition between that channel and the others; when the gating weight is negatively activated, GCT encourages that channel to cooperate with the others.

In addition, when the gating weight and gating bias are both 0, the original feature passes unchanged to the next layer, i.e., the module reduces to an identity (residual) mapping:


\hat{x} = F(x \mid \alpha, 0, 0) = x

ResNet also benefits from this idea: such identity mappings effectively alleviate the degradation problem of deep networks.

Therefore, it is recommended to initialize γ and β to 0 during GCT layer initialization. In this way, the initial steps of the training process will be more stable and the final performance of the GCT will be better.
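
Continuing the earlier sketch (and assuming its tensors x, s_hat, C are still in scope), zero-initializing \gamma and \beta makes the gate exactly 1, so the module starts as an identity mapping:

    gamma = torch.zeros(1, C, 1, 1)  # gating weight, initialized to 0
    beta = torch.zeros(1, C, 1, 1)   # gating bias, initialized to 0

    x_hat = x * (1.0 + torch.tanh(gamma * s_hat + beta))
    print(torch.equal(x_hat, x))     # True: 1 + tanh(0) = 1, the input passes through unchanged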

Code

import torch
import torch.nn as nn


class GCT(nn.Module):
    def __init__(self, num_channels, epsilon=1e-5, mode='l2', after_relu=False):
        super(GCT, self).__init__()
        # Trainable parameters: embedding weight alpha, gating weight gamma, gating bias beta.
        self.alpha = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.gamma = nn.Parameter(torch.zeros(1, num_channels, 1, 1))  # zero init -> identity at start
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.epsilon = epsilon
        self.mode = mode
        self.after_relu = after_relu

    def forward(self, x):
        if self.mode == 'l2':
            # Global context embedding: s_c = alpha_c * ||x_c||_2, shape [B, C, 1, 1]
            embedding = (x.pow(2).sum((2, 3), keepdim=True) + self.epsilon).pow(0.5) * self.alpha
            # Channel normalization; the mean over channels has shape [B, 1, 1, 1]
            norm = self.gamma / (embedding.pow(2).mean(dim=1, keepdim=True) + self.epsilon).pow(0.5)
        elif self.mode == 'l1':
            _x = x if self.after_relu else torch.abs(x)
            embedding = _x.sum((2, 3), keepdim=True) * self.alpha
            norm = self.gamma / (torch.abs(embedding).mean(dim=1, keepdim=True) + self.epsilon)
        else:
            raise ValueError('Unknown mode!')
        # Gating adaptation: x_hat = x * (1 + tanh(gamma * s_hat + beta))
        gate = 1. + torch.tanh(embedding * norm + self.beta)
        return x * gate
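
A usage illustration of my own (not from the original post): the paper places GCT in front of convolutional layers, so a minimal way to try it is to prepend it to an existing conv:

    # Illustrative placement: GCT before a 3x3 convolution.
    block = nn.Sequential(
        GCT(num_channels=64),
        nn.Conv2d(64, 128, kernel_size=3, padding=1),
    )
    y = block(torch.randn(2, 64, 56, 56))
    print(y.shape)                   # torch.Size([2, 128, 56, 56])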

Interpretability

Gating weights

Visualizing the distribution of gating weights on ResNet-50:

  • Value: the value of the gating weight
  • Index of layers: the position (depth) of the layer the weight belongs to; a larger index means the layer is closer to the output
  • Density of params: plotted as \log(1+z)

Calculate the mean and variance of gating weights at different positions:

You can see:

  • In the shallow layers of the network, the mean gating weight is less than 0, so the GCT module tends to reduce the differences between channels and encourage cooperation among them.
  • In the deep layers of the network, the mean gating weight is greater than 0, so the GCT module tends to increase the differences between channels and encourage competition among them.

An explanation of cooperation and competition

  • In the shallow layers, the network mainly learns low-level features such as textures and edges. For these basic features, cooperation between channels helps extract features more broadly.
  • In the deep layers, the network mainly learns high-level features, which tend to differ greatly from one another and are directly related to the task. Competition between channels helps obtain more task-relevant feature information.

Ablation study

The paper does not ablate the effectiveness of each part of GCT individually; instead, it compares different p-norms and activation functions within each part.

The paper only demonstrates the effect of the gating weight; the roles of the gating bias and the embedding weight are not specifically analyzed.

Supplement

Training

When adding the GCT module to an existing model, the other parameters of the network can be frozen first so that only the GCT parameters are trained; the whole network can then be unfrozen and fine-tuned together (a sketch of this two-stage scheme follows below).
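
A minimal sketch of this two-stage scheme (the toy model and the 'gct' naming convention are my own assumptions, purely for illustration):

    import torch
    import torch.nn as nn

    # Toy model with a GCT module in front of a conv layer.
    model = nn.Sequential()
    model.add_module('gct1', GCT(num_channels=3))
    model.add_module('conv1', nn.Conv2d(3, 8, kernel_size=3, padding=1))

    # Stage 1: freeze everything except the GCT parameters and train only those.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith('gct')
    optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.01)
    # ... train for a few epochs ...

    # Stage 2: unfreeze the whole network and fine-tune jointly.
    for param in model.parameters():
        param.requires_grad = True
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)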

GCT can also be added to the network from the beginning, training from scratch.

Thoughts

Looking at the weight distribution plot in Section 5.1, a considerable portion of the gating weights is concentrated around 0. Does this indicate that GCT has a certain amount of redundancy?

Related studies on SENet have also explored how the placement of the attention module affects overall network performance. The attention module can be placed only in the shallow or deep layers of the network rather than after every layer, because the accuracy gain and the added parameters/computation may not scale linearly, and adding it everywhere may not be worth the cost.

More methods of global information embedding and normalization could be explored.