Author: Redflashing
In this article, the various convolutions encountered in deep learning are sorted out, illustrated with examples, and summarized, to help readers better understand and build convolutional neural networks.
The following concepts of convolution will be introduced in detail in this paper:
- 2D Convolution
- 3D Convolution
- 1×1 Convolution
- Transposed Convolution
- Dilated Convolution/Atrous Convolution
- Spatially Separable Convolution
- Depthwise Separable Convolution
- Flattened Convolution
- Grouped Convolution
- Shuffled Grouped Convolution
- Pointwise Grouped Convolution
5. Transposed Convolution (Deconvolution)
For many generative models (such as the generator in a GAN), and for tasks such as autoencoding and semantic segmentation, we usually want to perform the opposite of normal convolution, that is, upsampling. (For semantic segmentation, an encoder first extracts feature maps, and a decoder then restores the original image size, so that every pixel of the original image can be classified.)
The traditional way to implement upsampling is to apply an interpolation scheme or manually created rules. Modern architectures such as neural networks, by contrast, tend to let the network itself learn the appropriate transformation automatically, without human intervention. To do that, we can use transposed convolution.
Transposed convolution is also known in the literature as deconvolution or fractionally strided convolution. However, it should be pointed out that the name "deconvolution" is not very appropriate, because transposed convolution is not the true deconvolution defined in signal/image processing. Technically, deconvolution in signal processing is the inverse of convolution, which is not the case here. We will see later why it is more natural and appropriate to call this operation transposed convolution.
We can implement transposed convolution using ordinary convolution. Here is a simple example: the input layer is 2×2; it is padded with a border of zeros of width 2 (with unit stride); then an ordinary convolution with a 3×3 kernel and stride 1 realizes the upsampling. The upsampled output has size 4×4.
It is worth mentioning that, with various paddings and strides, we can map the same 2×2 input to different output sizes. In the figure below, a transposed convolution with a 3×3 kernel is applied to the same 2×2 input (one zero is inserted between the input elements, which are then surrounded by a border of zero padding of width 2), and the result (i.e., the upsampled output) is 5×5.
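The two output shapes above can be checked with a minimal PyTorch sketch (the random tensors and the single channel below are illustrative assumptions, not values from the figures):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 2, 2)   # N, C, H, W: the 2x2 input
w = torch.randn(1, 1, 3, 3)   # a 3x3 transposed-convolution kernel

# Stride 1: output size (2 - 1) * 1 + 3 = 4 -> 4x4
y1 = F.conv_transpose2d(x, w, stride=1)
# Stride 2: output size (2 - 1) * 2 + 3 = 5 -> 5x5
y2 = F.conv_transpose2d(x, w, stride=2)
print(y1.shape, y2.shape)  # torch.Size([1, 1, 4, 4]) torch.Size([1, 1, 5, 5])
```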
Looking at the transposed convolutions in the examples above helps build some intuition. But to apply transposed convolution more generally, we also need to understand how convolution is implemented as matrix multiplication on a computer. From the implementation point of view, we can see why "transposed convolution" is the most appropriate name.
In convolution, we define $C$ as the convolution kernel rearranged into a matrix, $input$ as the input image, and $output$ as the output image. After the convolution (a matrix multiplication), we downsample the large image $input$ to the small image $output$. This matrix multiplication implements $C \cdot input = output$.
The following example shows how this works on a computer: the input is flattened into a 16×1 vector, and the convolution kernel is rearranged into a sparse matrix (4×16). Matrix multiplication is then performed between the sparse matrix and the flattened input. Afterwards, the resulting 4×1 vector is reshaped into a 2×2 output.
At this point, if we instead multiply the transpose $C^T$ of the sparse matrix (16×4) by a flattened output (4×1), the result (16×1) has the same shape as the original input (16×1).
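The following is a minimal sketch of this matrix view, assuming a 4×4 input and a 3×3 kernel with stride 1 and no padding (the index bookkeeping and variable names are hypothetical, for illustration only):

```python
import torch

x = torch.randn(4, 4)   # 4x4 input image
k = torch.randn(3, 3)   # 3x3 kernel, stride 1, no padding -> 2x2 output

# Build the 4x16 convolution matrix C: one row per output pixel,
# holding the kernel weights at the input positions that pixel touches.
C = torch.zeros(4, 16)
for row, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    for a in range(3):
        for b in range(3):
            C[row, (i + a) * 4 + (j + b)] = k[a, b]

y = C @ x.reshape(16)    # convolution: (4x16)(16x1) -> 4x1, i.e. a 2x2 output
x_up = C.t() @ y         # transposed convolution: (16x4)(4x1) -> 16x1, i.e. 4x4
print(y.reshape(2, 2).shape, x_up.reshape(4, 4).shape)
```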
However, it is worth noting that these two operations are not inverses of each other. For the same convolution kernel (since the sparse matrix is not an orthogonal matrix), the transpose operation cannot recover the original values, only the original shape; hence the name transposed convolution. This also answers the earlier question: "transposed convolution" is more accurate than "deconvolution".
An arithmetic explanation of transposed convolution can be found at arxiv.org/abs/1603.07…
5.1. Checkerboard Artifacts
One troublesome phenomenon observed when using transposed convolution is the appearance of "checkerboard artifacts", especially in dark parts of the image. The figure below visually illustrates the checkerboard effect (source: distill.pub/2016/deconv…
This article gives only a brief introduction; for details, see the paper Deconvolution and Checkerboard Artifacts.
The checkerboard effect results from the uneven overlap of transposed convolution, which makes one part of the image darker than another. In particular, when the kernel size is not divisible by the stride, the deconvolution overlaps unevenly. Although in principle the network could learn weights that avoid this, in practice it is difficult for neural networks to avoid uneven overlap completely.
A detailed example shows the checkerboard effect more visually. The top part of the figure is the input layer, and the bottom part is the transposed-convolution output. The transposed convolution maps a small-sized input to a larger-sized output (shown along the length and width dimensions).
In (a), the stride is 1 and the convolution kernel is 2×2. As shown in red, the first input pixel maps to the first and second output pixels. As shown in green, the second input pixel maps to the second and third output pixels. So the second output pixel receives information from both the first and second input pixels. In general, the pixels in the middle of the output receive overlapping information from the input. As the kernel size increases to 3 in example (b), the central portion of the output that receives the most information shrinks. But this is not the biggest problem, because the overlap is still even.
If the stride is changed to 2, then in the example with kernel size 2, all output pixels receive the same amount of information from the input: figure (a) below shows that the transposed convolution overlaps evenly. If the kernel size is changed to 4 (figure (b) below), the evenly overlapping region shrinks, but since the overlap is still even, the output remains valid. However, if the kernel size is changed to 3 with stride 2 (figure (c) below), or to 5 with stride 2 (figure (d) below), problems arise: in these two cases, different output pixels receive different amounts of information from their neighboring input pixels, and no continuous, even overlap region can be found on the output.
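This uneven overlap is easy to reproduce: transpose-convolving an all-ones input with an all-ones kernel counts how many contributions land on each output pixel. A minimal sketch under assumed sizes:

```python
import torch
import torch.nn.functional as F

x = torch.ones(1, 1, 4, 4)
k3 = torch.ones(1, 1, 3, 3)  # kernel 3, stride 2: 3 is not divisible by 2
k4 = torch.ones(1, 1, 4, 4)  # kernel 4, stride 2: evenly overlapping

# Each output value is the number of overlapping contributions it receives.
print(F.conv_transpose2d(x, k3, stride=2)[0, 0])  # alternating counts -> checkerboard
print(F.conv_transpose2d(x, k4, stride=2)[0, 0])  # uniform counts in the interior
```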
The checkerboard effect is even more serious in two dimensions; the following figure shows it visually.
5.1.1. How to avoid the checkerboard effect
- Make the kernel size divisible by the stride. This handles the checkerboard effect fairly well, but it is still not fully satisfactory: once the learned kernel weights are uneven, the checkerboard effect can still appear (as shown in the figure below).
- Interpolate first, then convolve. Simply resize the image by interpolation and then apply an ordinary convolution. This method is common in super-resolution papers; for example, we can upsample with bilinear interpolation, nearest-neighbor interpolation, or the spline interpolation commonly used in graphics, as sketched below.
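A minimal sketch of this "resize, then convolve" upsampling in PyTorch (the channel counts and scale factor are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResizeConvUpsample(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Kernel 3 with padding 1 preserves the resized spatial size.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        # Double H and W by bilinear interpolation, then convolve:
        # no transposed convolution, hence no uneven overlap.
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return self.conv(x)

y = ResizeConvUpsample(16, 8)(torch.randn(1, 16, 7, 7))
print(y.shape)  # torch.Size([1, 8, 14, 14])
```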
5.2. Calculation of transposed convolution
Input layer: $W_{in} \times H_{in} \times D_{in}$
Hyperparameters:
- Number of filters: $K$
- Kernel size in each filter: $w \times h$
- Stride: $S$
- Padding: $P$
Output layer: $W_{out} \times H_{out} \times D_{out}$
Assume $W_{in} = H_{in} = in$ and $W_{out} = H_{out} = out$.
The relationship between the parameters of the output layer and the input layer can be divided into two cases:

Case 1: $(in - 1) \times s - 2p + k = out$

Case 2: $(in - 1) \times s - 2p + k \neq out$
Here, we take FCN-32s, a classical image semantic segmentation model, as an example. The input to the upsampling transposed convolution is 7×7, and we want to restore the original image size of 224×224. Substituting into the formula gives $\begin{cases} out = s \times (in - 1) - 2p + k \\ out = 224,\ in = 7 \end{cases}$, i.e. $6s - 2p + k = 224$. Experiments finally yield the most suitable set of values: $s = 32$, $k = 64$, $p = 16$.
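The arithmetic can be verified directly in PyTorch (the 21 channels below correspond to the PASCAL VOC class count used by FCN; treat the exact channel count as an assumption of this sketch):

```python
import torch
import torch.nn as nn

# out = (in - 1) * s - 2p + k = (7 - 1) * 32 - 2 * 16 + 64 = 224
up = nn.ConvTranspose2d(in_channels=21, out_channels=21,
                        kernel_size=64, stride=32, padding=16)
x = torch.randn(1, 21, 7, 7)
print(up(x).shape)  # torch.Size([1, 21, 224, 224])
```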
6. Dilated Convolution (Atrous Convolution)
Dilated convolution was introduced in these two papers:
- arxiv.org/abs/1412.70…
- arxiv.org/abs/1511.07…
This is the standard discrete convolution:

$(F * k)(\mathbf{p}) = \sum_{\mathbf{s} + \mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t})$

The dilated convolution is as follows:

$(F *_{l} k)(\mathbf{p}) = \sum_{\mathbf{s} + l\mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t})$

When $l = 1$, the dilated convolution becomes the standard convolution.
Intuitively, dilated convolution "inflates" the convolution kernel by inserting spaces between the kernel elements. The newly added parameter $l$, the dilation rate, indicates how far we wish to inflate the kernel. Implementations vary, but usually $l - 1$ spaces are inserted between adjacent kernel elements, so a 3×3 kernel dilates to an effective size of $3 + 2(l - 1)$. In the figure below, the red dots mark the original 3×3 kernel for dilation rates $l = 1, 2, 4$. Although the kernels of all three dilated convolutions have the same size, the receptive fields of the models are quite different: when $l = 1$ the receptive field is 3×3, when $l = 2$ it is 7×7, and when $l = 4$ it is 15×15 (stacking the three layers in sequence, as in the original paper). It is worth noting that all of the above operations have the same number of parameters. Dilated convolution thus gives the model a larger receptive field without increasing the computational cost (since the kernel size is unchanged), which is especially effective when multiple dilated convolutions are stacked on top of each other.
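A quick PyTorch sketch of this point, under assumed tensor sizes: three 3×3 convolutions with dilation rates 1, 2, and 4 all keep the same parameter count while reaching further.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)
for l in (1, 2, 4):
    # padding=l keeps the spatial size for a 3x3 kernel with dilation l
    conv = nn.Conv2d(1, 1, kernel_size=3, dilation=l, padding=l)
    n_params = sum(p.numel() for p in conv.parameters())
    print(l, conv(x).shape, n_params)  # shape stays 32x32; 9 weights + 1 bias each
```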
The authors of the paper Multi-Scale Context Aggregation by Dilated Convolutions build a network out of multiple dilated convolutional layers, where the dilation rate $l$ increases exponentially at each layer. As a result, **the size of the effective receptive field grows exponentially with the number of layers, while the number of parameters grows only linearly.** The role of dilated convolution in that paper is to systematically aggregate multi-scale context information without losing resolution. The paper shows that the proposed module improved the accuracy of the best semantic segmentation systems of the time (2016). See the paper for more information.
6.1. Applications of dilated convolution
This section mainly discusses the application of dilated convolution to semantic segmentation.
In terms of receptive field, one 7×7 convolutional layer is equivalent to a stack of three 3×3 convolutional layers. Such a design not only greatly reduces the number of parameters, but the regular structure of small kernels also makes it easier to learn a feature space that generalizes and expresses well. These properties are why most modern convolution-based deep networks use small convolution kernels.
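The parameter saving is easy to quantify; a small sketch with an assumed channel count:

```python
import torch.nn as nn

# One 7x7 convolution vs. three stacked 3x3 convolutions: the same 7x7
# receptive field, but 49*C*C weights vs. 3*9*C*C weights.
C = 64
big = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
small = nn.Sequential(*[nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False)
                        for _ in range(3)])
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(big), count(small))  # 200704 vs 110592
```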
However, deep convolutional neural networks still have serious drawbacks for some tasks, notably those caused by the upsampling and pooling layers:
- Upsampling and pooling layers are not invertible
- Internal data structure is lost; spatial hierarchical information is lost
- Fine details are lost (assuming four pooling layers, any detail smaller than $2^4 = 16$ pixels will be discarded and cannot be reconstructed)
Faced with such problems, semantic segmentation had long been stuck at a bottleneck, unable to improve accuracy significantly; the concept of dilated convolution neatly avoids the problems above.
6.3. Problems with dilated convolution
- **Problem 1: the gridding effect.** If we naively stack multiple 3×3 dilated convolutions on top of each other, the following problem occurs: the kernel taps are not continuous, i.e., not all pixels are used in the computation. Since image information is continuous by nature, this is fatal for pixel-level dense prediction tasks.
- **Problem 2: long-ranged information might not be relevant.** From the design motivation of dilated convolution, it can be inferred that the design is meant to capture long-ranged information. However, information obtained only at large dilation rates may help segment only large objects, and may even hurt small ones. How to handle objects of different sizes at the same time (the granularity of the receptive field) is the key to designing a good dilated convolutional network.
How these problems of dilated convolution are handled is not detailed here.
6.4. Calculation of dilated convolution
With dilation rate $l$, $l - 1$ spaces are inserted between convolution kernel elements.
Input layer: $W_{in} \times H_{in} \times D_{in}$
Hyperparameters:
- Dilation rate: $l$
- Number of filters: $K$
- Kernel size in each filter: $w \times h$
- Stride: $S$
- Padding: $P$
Output layer: $W_{out} \times H_{out} \times D_{out}$
The parameter relationship between the output layer and the input layer is:

$W_{out} = \dfrac{W_{in} - w - (w - 1)(l - 1) + 2P}{S} + 1$, $H_{out} = \dfrac{H_{in} - h - (h - 1)(l - 1) + 2P}{S} + 1$, $D_{out} = K$

Number of parameters: $(w \times h \times D_{in} + 1) \times K$
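These formulas can be checked against PyTorch's dilation support (the sizes below are illustrative assumptions):

```python
import torch
import torch.nn as nn

W_in, w, l, S, P, K, D_in = 32, 3, 2, 1, 2, 8, 3
conv = nn.Conv2d(D_in, K, kernel_size=w, stride=S, padding=P, dilation=l)
x = torch.randn(1, D_in, W_in, W_in)
print(conv(x).shape)  # torch.Size([1, 8, 32, 32])
print((W_in - w - (w - 1) * (l - 1) + 2 * P) // S + 1)  # 32, matches W_out
print(sum(p.numel() for p in conv.parameters()))  # (3*3*3 + 1) * 8 = 224
```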
References
- A Comprehensive Introduction to Different Types of Convolutions in Deep Learning | by Kunlun Bai | Towards Data Science
- Convolutional neural network – Wikipedia
- Convolution – Wikipedia
- Understanding the 1×1 Convolution Kernel in Convolutional Neural Networks – Zhihu (zhihu.com)
- [1312.4400] Network In Network (arxiv.org)
- Inception Network Model – Ahshun – Blog Park (cnblogs.com)
- ResNet Explained – Lanran2's Blog – CSDN Blog
- An article to walk you through convolutions in deep learning (part 1) | Heart of Machine (jiqizhixin.com)
- An article to walk you through convolutions in deep learning (part 2)
- Computer Vision | Checkerboard Effect
- www.zhihu.com/question/54…
- Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science
- An Introduction to different Types of Convolutions in Deep Learning
- Review: DilatedNet — Dilated Convolution
- ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
- A Simple Guide to the Versions of the Inception Network
- A Tutorial on Filter Groups (Grouped Convolution)
- Convolution arithmetic animation
- Up-sampling with Transposed Convolution
- Intuitively Understanding Convolutions for Deep Learning