Channel-wise convolution slides along the channel dimension, neatly breaking the dense, fully connected pattern between input and output channels in ordinary convolution, yet without the rigid partitioning of group convolution. It is a very good idea.

Source: Xiaofei algorithm engineering Notes public account

Paper: ChannelNets: Compact and Efficient Convolutional Neural Networks via Channel-Wise Convolutions

  • Paper address: Arxiv.org/abs/1809.01…
  • Paper code: Github.com/HongyangGao…

Introduction


Depthwise separable convolution reduces the computation and parameter count of a network, but the point-wise convolution still accounts for most of the parameters. The paper argues that the next step in network lightweighting is to change the dense connection pattern between input and output channels. It therefore proposes channel-wise convolution, which makes the channel connections between input and output sparse rather than fully connected. Unlike the strict partitioning of group convolution, the kernel slides along the channel dimension, which better preserves information flow between channels. Based on this idea, the paper further proposes depthwise separable channel-wise convolution, builds ChannelNets from it, and uses it to replace the network's final global pooling + fully connected layer.

Channel-Wise Convolutions and ChannelNets


Figure a shows the depthwise separable convolution structure, while Figure b shows the grouped depthwise separable convolution structure, where each dot represents a one-dimensional feature.

Channel-Wise Convolutions

The core of channel-wise convolution is the sparse connection between input and output: each output is connected to only part of the input. Unlike group convolution, it does not strictly partition the input; instead, it samples several adjacent inputs for each output with a certain stride (sliding along the channel dimension). This reduces the parameter count while preserving a degree of information flow between channels. Assume the kernel size is $d_k$, the number of output channels is $n$, the number of input channels is $m$, and the input feature map size is $d_f\times d_f$. An ordinary convolution then has $m\times d_k\times d_k\times n$ parameters and $m\times d_k\times d_k\times d_f\times d_f\times n$ multiply-adds. A channel-wise convolution has $d_c\times d_k\times d_k$ parameters, where $d_c$, usually far smaller than $m$, is the number of input channels sampled for each output, and its computation is $d_c\times d_k\times d_k\times d_f\times d_f\times n$. Both the parameter count and the computation are thus decoupled from the input channel dimension $m$.
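As a rough illustration, here is a minimal PyTorch sketch of channel-wise convolution (my own re-implementation, not the authors' TensorFlow code): the channel axis is treated as the depth axis of a 3D convolution, so a single $d_c\times d_k\times d_k$ kernel slides over channels and space, and the parameter count no longer depends on the number of input channels $m$. All concrete sizes are made-up examples.

```python
import torch
import torch.nn as nn

class ChannelWiseConv(nn.Module):
    """Channel-wise convolution sketch: one small kernel slides over the
    channel dimension (plus the spatial dims), so the parameter count is
    d_c * d_k * d_k regardless of the number of input channels m.
    Implemented by viewing the channel axis as the depth axis of a 3D conv
    (hypothetical re-implementation, not the authors' code)."""

    def __init__(self, d_c=8, d_k=3, channel_stride=1):
        super().__init__()
        self.conv = nn.Conv3d(
            in_channels=1, out_channels=1,
            kernel_size=(d_c, d_k, d_k),
            stride=(channel_stride, 1, 1),
            padding=(0, d_k // 2, d_k // 2),
            bias=False)

    def forward(self, x):                      # x: (N, m, d_f, d_f)
        x = x.unsqueeze(1)                     # (N, 1, m, d_f, d_f): channels -> depth
        y = self.conv(x)                       # (N, 1, n, d_f, d_f), n = (m - d_c)//stride + 1
        return y.squeeze(1)                    # (N, n, d_f, d_f)

x = torch.randn(2, 64, 14, 14)                 # m = 64 input channels
print(ChannelWiseConv(d_c=8, d_k=3)(x).shape)  # torch.Size([2, 57, 14, 14]), 8*3*3 = 72 weights
```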

Group Channel-Wise Convolutions

The strict partitioning of group convolution creates an information barrier between channels. To restore cross-group communication, a fusion layer is usually appended that keeps the grouping while mixing features from all groups. The paper uses a group channel-wise convolution layer as this fusion layer, which contains $g$ channel-wise convolutions. Let the input feature dimension be $n$ and the number of groups be $g$; each channel-wise convolution uses a stride of $g$ (along the channel dimension) and outputs $n/g$ feature maps (sliding $n/g$ times). To guarantee that the output of each group covers all inputs, $d_c \ge g$ must hold; finally all outputs are concatenated, as shown in Figure c.
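A possible sketch of this fusion layer, under the same assumptions as above: $g$ channel-wise convolutions each slide over the channel axis with stride $g$ and emit $n/g$ feature maps, which are then concatenated back to $n$ channels. The right-padding of the channel axis is my own choice to keep the output channel count at exactly $n$.

```python
import torch
import torch.nn as nn

class GroupChannelWiseConv(nn.Module):
    """Group channel-wise fusion layer sketch (Figure c): g channel-wise
    convolutions, each sliding over the channel axis with stride g and
    emitting n/g feature maps; with d_c >= g every output touches all groups.
    Outputs are concatenated back to n channels.
    (Hypothetical PyTorch re-implementation, not the authors' code.)"""

    def __init__(self, n, g, d_c):
        super().__init__()
        assert n % g == 0 and d_c >= g
        # Pad the channel axis so sliding with stride g yields exactly n/g outputs.
        self.pad = (g * (n // g - 1) + d_c) - n
        self.convs = nn.ModuleList(
            nn.Conv3d(1, 1, kernel_size=(d_c, 1, 1), stride=(g, 1, 1), bias=False)
            for _ in range(g))

    def forward(self, x):                                     # x: (N, n, H, W)
        x = x.unsqueeze(1)                                    # (N, 1, n, H, W)
        x = nn.functional.pad(x, (0, 0, 0, 0, 0, self.pad))   # pad channel (depth) axis
        outs = [conv(x) for conv in self.convs]               # each: (N, 1, n/g, H, W)
        return torch.cat(outs, dim=2).squeeze(1)              # (N, n, H, W)

x = torch.randn(2, 64, 14, 14)
print(GroupChannelWiseConv(n=64, g=2, d_c=8)(x).shape)        # torch.Size([2, 64, 14, 14])
```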

Depth-Wise Separable Channel-Wise Convolutions

Depthwise separable channel-wise convolution follows the depthwise convolution with a channel-wise convolution to fuse features, reducing the parameter count and computation, as shown in Figure d. In the figure, the channel-wise convolution has a stride of 1 and $d_c = 3$, so feature fusion is achieved with very few parameters.
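A minimal sketch of Figure d, assuming the kernel sizes shown there ($d_k=3$ for the depthwise part, $d_c=3$ and stride 1 for the channel-wise part); the channel-wise convolution takes the place of the dense 1×1 point-wise convolution.

```python
import torch
import torch.nn as nn

class DWSeparableChannelWiseConv(nn.Module):
    """Depthwise separable channel-wise convolution sketch (Figure d): a
    depth-wise 3x3 convolution, then a channel-wise convolution (stride 1,
    d_c = 3 here) instead of the usual dense 1x1 point-wise convolution.
    (Hypothetical re-implementation; kernel sizes follow the figure.)"""

    def __init__(self, channels, d_k=3, d_c=3):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=d_k,
                                   padding=d_k // 2, groups=channels, bias=False)
        # Channel-wise fusion: 1x1 spatially, kernel d_c along the channel axis,
        # padded so the number of channels is preserved.
        self.channelwise = nn.Conv3d(1, 1, kernel_size=(d_c, 1, 1),
                                     padding=(d_c // 2, 0, 0), bias=False)

    def forward(self, x):                       # x: (N, C, H, W)
        x = self.depthwise(x)
        x = self.channelwise(x.unsqueeze(1))    # slide over channels
        return x.squeeze(1)

x = torch.randn(2, 64, 14, 14)
print(DWSeparableChannelWiseConv(64)(x).shape)  # torch.Size([2, 64, 14, 14])
```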

Convolutional Classification Layer

Networks typically end with global pooling and a fully connected layer for the final classification, but this combination carries a huge number of parameters. Global pooling plus a fully connected layer can in fact be rewritten as a depthwise separable convolution: global pooling becomes a depthwise convolution with fixed weights, and the fully connected layer becomes a point-wise convolution. The depthwise separable channel-wise convolution above can therefore be used to optimize it further. Moreover, since there is no activation function or BN between the pooling and the fully connected layer, the two can be merged into a single, more efficient three-dimensional convolution.

Assume the input feature map is $m\times d_f\times d_f$ and the number of classes is $n$. The depthwise convolution that replaces global pooling can be viewed as a three-dimensional convolution with kernel size $d_f\times d_f\times 1$ and fixed weights $1/d_f^2$, while the channel-wise convolution can be viewed as a three-dimensional convolution with kernel size $1\times 1\times d_c$. The two can be merged into a single three-dimensional convolution with kernel size $d_f\times d_f\times d_c$. To match the number of classes, $d_c = m - n + 1$, i.e. each class prediction uses only $m - n + 1$ of the input feature maps.
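The merged layer could be sketched as a single 3D convolution with kernel $d_f\times d_f\times d_c$ and $d_c = m - n + 1$: sliding over the $m$ channels then produces exactly $n$ class scores. The concrete numbers below (1024 channels, 1000 classes, 7×7 maps) are the values quoted later for ChannelNet-v3; the code itself is an illustrative assumption, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ConvClassificationLayer(nn.Module):
    """Convolutional classification layer sketch: global pooling + fully
    connected is replaced by one 3D convolution whose kernel covers the whole
    d_f x d_f spatial map and d_c = m - n + 1 consecutive channels, so sliding
    over the m channels yields exactly n class scores.
    (Hypothetical re-implementation, not the authors' code.)"""

    def __init__(self, m=1024, n_classes=1000, d_f=7):
        super().__init__()
        d_c = m - n_classes + 1                        # 1024 - 1000 + 1 = 25
        self.conv = nn.Conv3d(1, 1, kernel_size=(d_c, d_f, d_f), bias=False)

    def forward(self, x):                              # x: (N, m, d_f, d_f)
        y = self.conv(x.unsqueeze(1))                  # (N, 1, n, 1, 1)
        return y.flatten(1)                            # (N, n) class logits

layer = ConvClassificationLayer()
print(sum(p.numel() for p in layer.parameters()))      # 25*7*7 = 1225, vs 1024*1000 for the FC layer
print(layer(torch.randn(2, 1024, 7, 7)).shape)         # torch.Size([2, 1000])
```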

The paper visualizes the weights of the fully connected classification layer, where blue indicates weights that are zero or close to zero. The weights turn out to be very sparse, i.e. each class effectively uses only part of the input, so predicting each class from a subset of the input feature maps is reasonable.

ChannelNets

ChannelNet is built on the basic architecture of MobileNet, with the group module (GM) and group channel-wise module (GCWM) shown in Figure 3. Since the GM module suffers from the information-barrier problem, a GCWM is placed in front of it to produce grouped features that already contain global information.
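To show how the pieces might fit together, here is a hypothetical composition of a GCWM module: grouped depthwise separable convolutions (the GM building block) followed by the group channel-wise fusion layer sketched earlier. The number of separable convolutions, the BN/ReLU placement and the skip connection are all assumptions based on Figure 3, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

def grouped_dw_separable(channels, g):
    """Grouped depthwise separable convolution: depth-wise 3x3 conv followed by
    a grouped 1x1 point-wise conv (BN/ReLU placement assumed as in MobileNet)."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
        nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, 1, groups=g, bias=False),
        nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

class GCWM(nn.Module):
    """Group channel-wise module sketch: two grouped depthwise separable
    convolutions with a skip connection (assumed from Figure 3), followed by
    the GroupChannelWiseConv fusion layer sketched earlier so the grouped
    features still exchange information. The GM module would be the same
    block without the fusion layer. Hypothetical composition only."""

    def __init__(self, channels, g=2, d_c=8):
        super().__init__()
        self.body = nn.Sequential(grouped_dw_separable(channels, g),
                                  grouped_dw_separable(channels, g))
        self.fuse = GroupChannelWiseConv(n=channels, g=g, d_c=d_c)  # defined above

    def forward(self, x):
        return self.fuse(self.body(x) + x)   # skip connection, then cross-group fusion

print(GCWM(64)(torch.randn(2, 64, 14, 14)).shape)  # torch.Size([2, 64, 14, 14])
```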

ChannelNet contains three versions:

  • ChannelNet-v1 replaces part of the depthwise separable convolutions with GM and GCWM modules, using 2 groups and about 3.7 million parameters in total.
  • ChannelNet-v2 replaces the last depthwise separable convolution with a depthwise separable channel-wise convolution, saving about 1 million parameters, roughly 25% of ChannelNet-v1's parameters.
  • ChannelNet-v3 replaces the final pooling layer plus fully connected layer with the convolutional classification layer described above, saving roughly 1 million (1024×1000 − 7×7×25) parameters; see the arithmetic sketch below.
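For reference, the parameter saving quoted for ChannelNet-v3 can be checked with a couple of lines of arithmetic (numbers taken from the list above):

```python
# Parameter saving of the convolutional classification layer (ChannelNet-v3),
# using the quoted numbers: 1024 input channels, 1000 classes, 7x7 feature maps.
fc_params   = 1024 * 1000          # global pooling + fully connected layer
conv_params = 7 * 7 * 25           # d_f * d_f * d_c, with d_c = 1024 - 1000 + 1
print(fc_params - conv_params)     # 1022775, i.e. ~1 million parameters saved
```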

Experimental Studies


Network performance is compared on ILSVRC 2012.

To compare performance at lighter settings, MobileNet's width multiplier idea is used to scale the dimensions of each layer.

To measure the effect of group channel-wise convolution on ChannelNet, the GCWM modules are replaced with GM modules for comparison. Considering that a GCWM module adds only 32 extra parameters, the resulting performance gain is very cost-effective.

Conclusion


Channel-wise convolution slides along the channel dimension, neatly breaking the dense, fully connected pattern between input and output channels in ordinary convolution without the rigidity of group convolution, which is a very good idea. However, the reported performance does not feel fully competitive: the paper only compares against MobileNetV1 and falls short of MobileNetV2.






For more content, please follow the WeChat official account [Algorithm Engineering Notes of Xiaofei].