Author: Redflashing

In this article, the various convolutions encountered in deep learning are sorted out, illustrated, and summarized to help readers better understand and build convolutional neural networks.

The following concepts of convolution will be introduced in detail in this paper:

  • 2D Convolution
  • 3D Convolution
  • 1×1 Convolution
  • Transposed Convolution
  • Dilated Convolution/Atrous Convolution
  • Spatially Separable Convolution
  • Depthwise Separable Convolution
  • Flattened Convolution
  • Grouped Convolution
  • Shuffled Grouped Convolution
  • Pointwise Grouped Convolution

1. First, what is convolution?

Convolution is an important operation in mathematics (especially in mathematical analysis) and is widely used in signal processing, image processing, and other engineering fields. The convolutional neural network (CNN) in deep learning is named after the concept of convolution, but the "convolution" in deep learning is essentially the cross-correlation of signal/image processing. In signal/image processing, convolution is defined as follows:

Continuous case:


$$(f * g)(t) \;\stackrel{\mathrm{def}}{=}\; \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

In discrete case:


$$(f * g)[n] \;\stackrel{\mathrm{def}}{=}\; \sum_{m=-\infty}^{\infty} f[m]\, g[n-m] = \sum_{m=-\infty}^{\infty} f[n-m]\, g[m]$$

1.1. The relationship between convolution and cross-correlation

Convolution is a mathematical operator that generates a third function from two functions f and g, representing the area of overlap between f and g after g has been reversed and translated.

The figure above (source: Wikipedia) shows the convolution of the square pulse wave g (the filter) with the exponentially decaying pulse wave f of an RC circuit; at each instant, the area of the overlap is the value of the convolution at time t.

Cross-correlation, by contrast, is the sliding dot product (sliding inner product) of two functions. The filter in cross-correlation is not reversed; it slides directly across the function f, and the overlap region between f and g is the cross-correlation. The following figure (source: Wikipedia) shows the difference between convolution and cross-correlation.

Thus, the filter in cross-correlation is not reversed. Strictly speaking, the "convolution" in deep learning is a cross-correlation operation that essentially performs element-wise multiplication and addition. It is conventionally called convolution because the filter weights are learned during training: if the reversed function g in the example above were the correct filter, then after training the learned filter would simply look like the reversed g. So there is no need to reverse the filter before training, as there is in true convolution.
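To make the distinction concrete, here is a minimal NumPy sketch (the signal and filter values are made up for illustration) showing that cross-correlating with a reversed filter reproduces the convolution:

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])   # signal
g = np.array([1.0, 0.0, -1.0])       # filter (deliberately asymmetric)

conv = np.convolve(f, g, mode="valid")    # reverses g before sliding
corr = np.correlate(f, g, mode="valid")   # slides g directly (dot products)

print(conv)   # [2. 2.]
print(corr)   # [-2. -2.]

# Cross-correlating with the reversed filter equals the convolution,
# which is why CNNs can skip the reversal: the weights are learned anyway.
assert np.allclose(conv, np.correlate(f, g[::-1], mode="valid"))
```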

1.2. Convolution in deep learning

The purpose of performing convolution is to extract useful features from the input. In image processing, there are many filters to choose from, and different filters extract different features, such as horizontal, vertical, and diagonal edges. In a convolutional neural network (CNN), different features are extracted by convolution, with the filter weights learned during training, and the extracted features are then composed to obtain the final result. Convolution is adopted because the operation shares weights, is translation invariant, and takes the spatial relationships between pixels into account. These characteristics give it excellent performance on computer vision tasks. The figure below shows the calculation of convolution (also known as standard convolution) on a single-channel input.

In the figure above, the input is a 5×5 matrix, the filter is the 3×3 matrix [[0,1,2],[2,2,0],[0,1,2]], stride = 1, padding = 0, and the output is a 3×3 matrix.
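As a from-scratch sketch, this computation can be reproduced in a few lines of NumPy. The 3×3 filter is the one given above; the 5×5 input values are hypothetical, since the figure itself is not reproduced here:

```python
import numpy as np

x = np.arange(25, dtype=float).reshape(5, 5)    # hypothetical 5x5 input
w = np.array([[0, 1, 2],
              [2, 2, 0],
              [0, 1, 2]], dtype=float)          # the 3x3 filter from the text

out_h = x.shape[0] - w.shape[0] + 1             # (5 + 2*0 - 3)/1 + 1 = 3
out_w = x.shape[1] - w.shape[1] + 1
y = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        # element-wise multiply-and-add over the current 3x3 window
        y[i, j] = np.sum(x[i:i + 3, j:j + 3] * w)

print(y.shape)   # (3, 3)
```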

In most applications, we need to deal with multi-channel images, the most typical being RGB images. The following figure (source: Andre Mouton) shows an RGB image decomposed into its single channels.

Another example of multi-channel data is the layers in a CNN. A convolutional layer usually consists of multiple or even hundreds of channels, each describing a different feature of the previous layer. How do we convert between layers with different numbers of channels? That is, how do we convert a layer with N channels into a layer with M channels?

To describe this problem, we need to introduce some terms: layers, channels, feature maps, filters, and kernels. From a hierarchical point of view, layers and filters are concepts at the same level, while channels and convolution kernels are one level below. Channels and feature maps are the same concept. A layer can have multiple channels (or feature maps): if the input is an RGB image, there are 3 channels. Channels are usually used to describe the structure of a layer; similarly, the convolution kernel is used to describe the structure of a filter. The following figure shows the relationship between layers and channels.

The difference between a filter and a convolution kernel is a little tricky. The two terms are interchangeable in some cases, which can cause confusion. What is the difference? A kernel usually refers to a 2D matrix of weights, while a filter usually refers to a 3D structure of multiple kernels stacked together. For a 2D filter, the two are the same thing; but a 3D filter, as used in most deep learning convolutions, contains one kernel per input channel, and each kernel is unique, emphasizing different aspects of its input channel.

With that in mind, let's move on to multichannel convolution. Each kernel in a filter is applied to one input channel of the previous layer, and the results are combined to generate one output channel; the output channels of all the filters together form the output layer. As shown below:

The input layer is a 5×5×3 matrix (i.e., three channels). The filter is a 3×3×3 matrix (i.e., it contains three convolution kernels). First, each kernel in the filter is applied to one of the three channels of the input layer; the convolutions are computed, producing three 3×3 channels.

Then, the three channels are added (i.e., matrix addition) to obtain a single 3×3×1 channel. This channel is the result of applying a single filter (3×3×3) to the input layer (5×5×3).

Equivalently, the above process can be viewed more uniformly as a 3D filter processing the input layer. The input layer and the filter have the same depth, that is, the number of channels in the input layer equals the number of kernels in the filter. The 3D filter only slides along two dimensions of the input layer (such as an image), height and width (this is why the operation is called 2D convolution even though a 3D filter is used to process 3D data). At each sliding position, an element-wise multiplication and addition is performed, producing a single number. In the example below, sliding over a 5×5×X input layer (where X is any value) produces an output layer with only one channel.

Further, we can easily see how to convert between layers of different depths. Suppose the input layer has D_in channels and the output layer needs D_out channels: simply apply D_out filters to the input layer, each filter containing D_in kernels. Each filter provides one output channel; after this process is complete, the D_out channels form the output layer.
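As a minimal PyTorch sketch of this conversion (the channel counts and spatial sizes are illustrative), a convolution layer with D_in = 3 and D_out = 4 stores 4 filters, each containing 3 kernels:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3)

# 4 filters, each holding 3 kernels of size 3x3 (one kernel per input channel)
print(conv.weight.shape)   # torch.Size([4, 3, 3, 3])
print(conv.bias.shape)     # torch.Size([4])

x = torch.randn(1, 3, 5, 5)   # one 5x5x3 input
y = conv(x)
print(y.shape)                # torch.Size([1, 4, 3, 3]): 4 output channels
```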

1.3. Calculation of 2D convolution

Input layer: W_in × H_in × D_in

Hyperparameters:

  • Number of filters: k
  • Kernel dimensions in each filter: w × h
  • Stride: s
  • Padding: p

Output layer: W_out × H_out × D_out

The parameter relationship between the output layer and the input layer is,


$$\begin{cases} W_{out} = (W_{in} + 2p - w)/s + 1, \\ H_{out} = (H_{in} + 2p - h)/s + 1, \\ D_{out} = k \end{cases}$$

The number of parameters is: (w × h × D_in + 1) × k
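These formulas can be checked against a framework implementation. The following PyTorch sketch uses arbitrary illustrative sizes:

```python
import torch
import torch.nn as nn

w_in, h_in, d_in = 32, 32, 3        # input W_in x H_in x D_in
k, w, h, s, p = 8, 3, 3, 2, 1       # filters, kernel size, stride, padding

conv = nn.Conv2d(d_in, k, kernel_size=(h, w), stride=s, padding=p)

w_out = (w_in + 2 * p - w) // s + 1
h_out = (h_in + 2 * p - h) // s + 1
n_params = (w * h * d_in + 1) * k   # the +1 is the bias of each filter

y = conv(torch.randn(1, d_in, h_in, w_in))
assert y.shape == (1, k, h_out, w_out)
assert sum(t.numel() for t in conv.parameters()) == n_params
print(w_out, h_out, n_params)       # 16 16 224
```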

2. 3D convolution

In the previous illustrations, you can see that we are in fact performing 3D convolution operations. But in deep learning, it is still generally called 2D convolution, because the depth of the filter equals the depth of the input layer, so the 3D filter moves in only two dimensions (such as the height and width of the image) and produces a single channel.

Extending 2D convolution, 3D convolution is defined by making the depth of the filter smaller than the depth of the input layer (that is, the depth of the kernel is smaller than the number of channels in the input layer), so the 3D filter must slide in all three dimensions (the length, width, and height of the input layer). A convolution operation is performed at each sliding position to obtain one value, and as the filter slides through the entire 3D space, the output structure is also 3D.

The main difference between 2D convolution and 3D convolution is the spatial dimension of filter sliding. The advantage of 3D convolution is to describe the relationship of objects in 3D space. 3D relationship is very important in some applications, such as 3D object segmentation and medical image reconstruction.

2.1. Calculation of 3D convolution

Input layer: W_in × H_in × D_in × C_in

Hyperparameters:

  • Number of filters: k
  • Kernel dimensions in each filter: w × h × d
  • Stride: s
  • Padding: p

Output layer: W_out × H_out × D_out × C_out

The parameter relationship between the output layer and the input layer is,


$$\begin{cases} W_{out} = (W_{in} + 2p - w)/s + 1, \\ H_{out} = (H_{in} + 2p - h)/s + 1, \\ D_{out} = (D_{in} + 2p - d)/s + 1, \\ C_{out} = k \end{cases}$$

The number of parameters is: (w × h × d × C_in + 1) × k
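The same check works for 3D convolution (again with illustrative sizes). Note that in frameworks such as PyTorch, each 3D filter also spans the C_in input channels, which is where the C_in factor in the parameter count comes from:

```python
import torch
import torch.nn as nn

w_in, h_in, d_in, c_in = 16, 16, 16, 3
k, w, h, d, s, p = 8, 3, 3, 3, 1, 0

conv = nn.Conv3d(c_in, k, kernel_size=(d, h, w), stride=s, padding=p)

out = (w_in + 2 * p - w) // s + 1   # same formula along each spatial axis
n_params = (w * h * d * c_in + 1) * k

y = conv(torch.randn(1, c_in, d_in, h_in, w_in))
assert y.shape == (1, k, out, out, out)
assert sum(t.numel() for t in conv.parameters()) == n_params
print(out, n_params)                # 14 656
```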

3. 1×1 convolution

1×1 convolution is interesting. At first glance, a 1×1 convolution on a single channel simply multiplies each element by a number, but when the input layer has multiple channels the situation becomes more interesting.

The figure above explains how 1×1 convolution applies to an input layer of dimension H×W×D: the filter size is 1×1×D, and each output channel has size H×W×1. If n such filters are applied and the results combined, the output layer obtained has size H×W×n.
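A short PyTorch sketch of this shape behavior (H, W, D, and n are illustrative values): each output pixel is a linear combination of the D input channels at that location, so H and W are unchanged while the channel count becomes n.

```python
import torch
import torch.nn as nn

H, W, D, n = 8, 8, 64, 28
conv1x1 = nn.Conv2d(D, n, kernel_size=1)   # n filters, each of size 1x1xD

x = torch.randn(1, D, H, W)
y = conv1x1(x)
print(y.shape)   # torch.Size([1, 28, 8, 8]): H and W unchanged, D -> n
```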

3.1. What 1×1 convolution does

  • Adjusting the number of channels

    Since a 1×1 convolution does not change the height and width, the first and most intuitive effect of changing the number of channels is that the volume of data can be increased or decreased. Other articles and blogs call this raising or lowering the dimension, but the actual number of dimensions does not change; the only thing that changes is the size of channels in height × width × channels.

  • Adding nonlinearity

    A 1×1 convolution kernel can greatly increase the nonlinearity of the network (via the subsequent nonlinear activation function such as ReLU) while keeping the scale of the feature maps unchanged (i.e., without losing resolution). The added nonlinearity allows the network to learn more complex functions and lets the whole network be deepened further.

  • Cross-channel information interaction

    With a 1×1 convolution kernel, reducing or increasing the dimension is really a linear recombination of information across channels. For example, a 3×3 convolution with 64 filters followed by a 1×1 convolution with 28 filters produces an output of the same size as a single 3×3 convolution with 28 filters: the original 64 channels are linearly combined into 28 channels, which is exactly the information interaction between channels.

  • Reducing the number of parameters

    The dimensionality reduction mentioned above also reduces the number of parameters, since fewer feature maps naturally mean fewer weights. It is equivalent to convolving over the channel dimension of the feature maps, compressing them and extracting features a second time, so that the new feature maps express the features better (see the sketch after this list).
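As a concrete sketch of the dimensionality-reduction and parameter-saving points above (the channel counts are hypothetical, chosen in the spirit of Inception), inserting a cheap 1×1 convolution before an expensive 5×5 convolution preserves the output shape while cutting the parameter count by roughly 10×:

```python
import torch
import torch.nn as nn

count = lambda m: sum(t.numel() for t in m.parameters())

# 5x5 convolution applied directly to 192 channels
direct = nn.Conv2d(192, 32, kernel_size=5, padding=2)

# 1x1 convolution first reduces 192 channels to 16, so the 5x5
# convolution only has to operate on 16 channels
reduced = nn.Sequential(
    nn.Conv2d(192, 16, kernel_size=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=5, padding=2),
)

x = torch.randn(1, 192, 28, 28)
assert direct(x).shape == reduced(x).shape      # same output size
print(count(direct), count(reduced))            # 153632 vs 15920
```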

An interesting point about 1×1 convolution comes from Yann LeCun: "In convolutional nets, there is no such thing as 'fully connected layers'. There are only convolution layers with 1×1 convolution kernels and a full connection table."

3.2. Applications of 1×1 convolution

1×1 convolution plays an important role in many classical networks; several important applications of 1×1 convolution are briefly introduced below.

  • Network in Network (NIN)

    NIN proposed the MLP convolution layer, which improves nonlinear expressiveness by stacking a "micro network" whose basic building block is the 1×1 convolution. This paper was the first to propose 1×1 convolution, which was of epoch-making significance; the later GoogLeNet borrowed the 1×1 convolution and specially thanked this paper.

  • Inception

    GoogLeNet first proposed the Inception module. Inception has four versions, V1, V2, V3, and V4 (not elaborated here); the structure of Inception V1 is shown in the two figures below.

    After introducing 1×1 convolutions for dimensionality reduction, as shown in Figure (b), the overall number of convolution parameters is reduced by nearly 4 times compared with Figure (a).

    In the Inception structure, 1×1 convolution is widely used for two purposes: (a) dimensionality reduction of the data; (b) introducing more nonlinearity to improve generalization, since each convolution is followed by a ReLU activation.

  • ResNet

    ResNet also uses 1×1 convolution, placed before and after the 3×3 convolution layer, so that the dimension is first reduced and then restored, further cutting the number of parameters. The right figure is also known as the Bottleneck Design, and its purpose is clear: to reduce the number of parameters, the first 1×1 convolution reduces the channel count from 256 to 64, and the final 1×1 convolution restores it. Overall, the number of parameters differs by a factor of about 16.94 (see the sketch below).

    Regular ResNet blocks are used in networks of 34 layers or fewer, while the Bottleneck Design is often used in deeper networks such as ResNet-101 to reduce computation and the number of parameters.
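A minimal PyTorch sketch of such a bottleneck block, assuming the 256 → 64 → 256 channel counts described above (bias-free convolutions; batch norm and the skip connection are omitted for brevity):

```python
import torch
import torch.nn as nn

bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1, bias=False),            # 1x1: 256 -> 64
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),  # 3x3 at width 64
    nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=1, bias=False),            # 1x1: 64 -> 256
)

x = torch.randn(1, 256, 14, 14)
print(bottleneck(x).shape)                                # [1, 256, 14, 14]
print(sum(t.numel() for t in bottleneck.parameters()))    # 69632 weights
```

For comparison, two plain 3×3 convolutions at width 256 would need 2 × 256 × 256 × 9 = 1,179,648 weights, and 1,179,648 / 69,632 ≈ 16.94, which is where the factor quoted above comes from.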

3.3. Calculation of 1×1 convolution

1×1 convolution is actually a special case of 2D convolution, and its calculation follows the 2D convolution procedure described above.

4. Convolution Arithmetic

Now that we understand convolution along the depth dimension, let's move on to convolution arithmetic in the other two directions (height & width), which is equally important. Some terms:

  • Kernel size: the convolution kernel was discussed in the sections above; the kernel size defines the field of view of the convolution.

  • Stride: defines the step size of the kernel as it moves across the image. For example, with stride = 1 the kernel moves one pixel at a time; with stride = 2 it moves two pixels at a time (i.e., one pixel is skipped). We can use stride ≥ 2 to downsample the image.

  • Padding: padding can be understood as adding pixels around the image. Padding can keep the spatial output dimensions equal to those of the input image by adding zeros around the input where necessary, while an unpadded convolution slides only over the actual input pixels, so the output ends up smaller than the input (see the sketch below).
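A short sketch of how stride and padding change the output size, following W_out = (W_in + 2p - w)/s + 1 (sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)

same  = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)  # "same" padding
valid = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)  # no padding
down  = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)  # stride-2 downsampling

print(same(x).shape)    # [1, 1, 8, 8]: (8 + 2 - 3)/1 + 1 = 8
print(valid(x).shape)   # [1, 1, 6, 6]: (8 + 0 - 3)/1 + 1 = 6
print(down(x).shape)    # [1, 1, 4, 4]: (8 + 2 - 3)/2 + 1 = 4
```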

References

  • A Comprehensive Introduction to Different Types of Convolutions in Deep Learning | by Kunlun Bai | Towards Data Science

  • Convolutional neural network – Wikipedia

  • Convolution – Wikipedia

  • Understanding the 1×1 Convolution Kernel in Convolutional Neural Networks – Zhihu (zhihu.com)

  • [1312.4400] Network In Network (arxiv.org)

  • Inception Network Model – Ahshun – Blog Park (cnblogs.com)

  • ResNet Parsing – Lanran2's Blog – CSDN Blog

  • An article taking you through convolutions in deep learning (part 1) | Synced (jiqizhixin.com)

  • Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science

  • An Introduction to different Types of Convolutions in Deep Learning
  • Review: DilatedNet — Dilated Convolution
  • ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
  • A Simple Guide to the Versions of the Inception Network
  • A Tutorial on Filter Groups (Grouped Convolution)
  • Convolution arithmetic animation
  • Up-sampling with Transposed Convolution