Abstract: In convolutional neural networks, different features are extracted using filters whose weights are learned automatically during training, and all of these extracted features are then “combined” to make decisions.

This article is shared from the Huawei Cloud community post “Summary of Common Convolutions in Neural Networks”, originally written by FDafad.

The purpose of convolution is to extract useful features from the input. In image processing, you can choose from a wide variety of filters, and each type of filter helps extract different features from the input image, such as horizontal, vertical, or diagonal edges. In a convolutional neural network, different features are extracted using filters whose weights are learned automatically during training, and all of these extracted features are then “combined” to make decisions.

Directory:

1. 2D convolution

2. 3D convolution

3. 1 x 1 convolution

4. Spatially separable convolution

5. Depthwise separable convolution

6. Grouped convolution

7. Dilated convolution

8. Deconvolution

9. Involution

2D Convolution

Single channel: in deep learning, convolution is essentially element-wise multiplication and accumulation between the signal and the filter, producing the convolution value. For an image with a single channel, the figure below demonstrates how the convolution operates:

The filter here is a 3 x 3 matrix. The filter slides over the input data; at each position it performs element-wise multiplication and summation, so every sliding position yields a single number, and the final output is a 3 x 3 matrix.
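To make the multiply-and-accumulate concrete, here is a minimal NumPy sketch of single-channel 2D convolution (strictly speaking cross-correlation, which is what deep-learning frameworks compute); the 5 x 5 input and the kernel values are illustrative examples, not the exact ones from the figure.

```python
import numpy as np

def conv2d_single_channel(x, k):
    """Slide kernel k over x; at each position, multiply element-wise and sum."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)   # a 5 x 5 single-channel input
k = np.array([[1., 0., -1.]] * 3)              # an example 3 x 3 filter
print(conv2d_single_channel(x, k).shape)       # (3, 3)
```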

Multi-channel: since images generally have three channels (RGB), convolution is usually applied to multi-channel inputs. The figure below illustrates the operation for a multi-channel input:

The input layer is a 5 x 5 x 3 matrix with 3 channels, and the filter is a 3 x 3 x 3 matrix. First, each kernel in the filter is applied to its corresponding channel of the input layer; the convolution is performed three times, producing three channels of size 3 x 3:

Then the three channels are added element by element to form a single channel (3 x 3 x 1). This is the result of convolving the input layer (a 5 x 5 x 3 matrix) with the filter (a 3 x 3 x 3 matrix):
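A sketch of the multi-channel case just described (a 5 x 5 x 3 input and a 3 x 3 x 3 filter with random values, for illustration only): each channel is convolved with its own kernel and the three per-channel results are summed into one 3 x 3 map.

```python
import numpy as np

x = np.random.rand(5, 5, 3)   # 5 x 5 x 3 input (H x W x C)
f = np.random.rand(3, 3, 3)   # 3 x 3 x 3 filter: one 3 x 3 kernel per input channel

out = np.zeros((3, 3))
for c in range(3):            # convolve each channel with its kernel, then sum over channels
    for i in range(3):
        for j in range(3):
            out[i, j] += np.sum(x[i:i + 3, j:j + 3, c] * f[:, :, c])

print(out.shape)              # (3, 3) -- a single 3 x 3 x 1 result
```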

3D Convolution

In the last illustration, you can see that this is actually performing a convolution in 3D. In deep learning, however, it is still generally called 2D convolution: because the depth of the filter equals the depth of the input layer, the 3D filter moves only in two dimensions (the height and width of the image), and the result is a single channel.

By extending 2D convolution, 3D convolution is defined as a convolution in which the depth of the filter is smaller than the depth of the input layer (that is, the kernel depth is smaller than the number of channels in the input layer). Therefore the 3D filter has to slide in three dimensions (the length, width and height of the input layer). At each sliding position a convolution operation is performed to produce one value, and as the filter slides across the entire 3D space, the output is also 3D.

The main difference between 2D convolution and 3D convolution is thus the spatial dimensionality along which the filter slides. The advantage of 3D convolution is that it describes the relationships between objects in 3D space, which is important in applications such as 3D object segmentation and medical image reconstruction.
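A short PyTorch sketch of the difference: with `nn.Conv3d` the kernel also slides along the depth axis, so the output keeps three spatial dimensions. The 16 x 16 x 16 volume and the channel counts below are arbitrary example values.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 16, 16, 16)   # batch x channels x depth x height x width

# A 3 x 3 x 3 kernel whose depth (3) is smaller than the input depth (16),
# so it slides along depth, height and width, and the output stays 3D.
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, bias=False)
print(conv3d(x).shape)              # torch.Size([1, 8, 14, 14, 14])
```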

1 x 1 Convolution

A 1 x 1 convolution may look like it simply multiplies each value in the feature maps by a number, but it does more than that. First, because it is followed by an activation layer, it actually performs a non-linear mapping; second, it can change the number of channels of the feature maps.

The figure above illustrates the operation on an input layer with dimensions H x W x D. After convolution with a filter of size 1 x 1 x D, the output channel has dimensions H x W x 1. If we perform this 1 x 1 convolution N times and combine the results, we get an output layer of dimensions H x W x N.
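A minimal PyTorch sketch of the idea (D = 64 and N = 16 are arbitrary example values): a 1 x 1 convolution followed by an activation changes the channel count from 64 to 16 while leaving H and W untouched, and applies a non-linear mapping.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)   # input with D = 64 channels, H = W = 32

# N = 16 filters of size 1 x 1 x 64, followed by a non-linearity
conv1x1 = nn.Sequential(
    nn.Conv2d(64, 16, kernel_size=1, bias=False),
    nn.ReLU(inplace=True),
)
print(conv1x1(x).shape)          # torch.Size([1, 16, 32, 32])
```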

Spatially Separable Convolution

In a separable convolution, we can split the kernel operation into multiple steps. Denote a convolution by y = conv(x, k), where y is the output image, x is the input image, and k is the kernel. Now assume k can be computed as k = k1.dot(k2). This makes it a separable convolution, because instead of performing one 2D convolution with k, we can get the same result by performing two 1D convolutions with k1 and k2.

Take the Sobel kernel, which is commonly used in image processing. The same kernel can be obtained as the outer product of the column vector [1, 2, 1].T and the row vector [1, 0, -1]. Performing the same operation then requires only six parameters instead of nine.
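A quick PyTorch check of this factorization (on a random single-channel input, used purely for illustration): the 3 x 3 Sobel kernel is the outer product of [1, 2, 1].T and [1, 0, -1], and applying the two 1D kernels in sequence gives the same result as the full 3 x 3 kernel while using 6 weights instead of 9.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)                          # a random single-channel input

col = torch.tensor([1., 2., 1.]).view(1, 1, 3, 1)    # vertical factor  [1, 2, 1].T
row = torch.tensor([1., 0., -1.]).view(1, 1, 1, 3)   # horizontal factor [1, 0, -1]
sobel = (col.view(3, 1) @ row.view(1, 3)).view(1, 1, 3, 3)  # outer product -> 3 x 3 Sobel kernel

full = F.conv2d(x, sobel)                            # one 3 x 3 convolution (9 weights)
separable = F.conv2d(F.conv2d(x, col), row)          # 3 x 1 then 1 x 3 (3 + 3 = 6 weights)

print(torch.allclose(full, separable, atol=1e-5))    # True
```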

Depthwise Separable Convolution

Unlike the spatially separable convolution of the previous section, depthwise separable convolution in deep learning performs a spatial convolution while keeping the channels independent, and then performs a pointwise convolution across channels.

Suppose we have a 3 x 3 convolutional layer with 16 input channels and 32 output channels. In a standard convolution, each of the 16 channels is traversed by 32 kernels of size 3 x 3, producing 512 (16 x 32) feature maps; then one large feature map is synthesized by adding up the maps across the input channels. Doing this 32 times yields the desired 32 output channels.

How does depthwise separable convolution handle the same example? We traverse the 16 channels, each with its own 3 x 3 kernel, giving 16 feature maps. Then, before any merging, we traverse these 16 feature maps with 32 1 x 1 convolutions and add the results channel by channel. This requires 656 (16 x 3 x 3 + 16 x 32 x 1 x 1) parameters instead of the 4608 (16 x 32 x 3 x 3) parameters above. More on this below; the 1 x 1 convolution used here was covered in an earlier section.

Let's quickly go through standard 2D convolution with a concrete example. Suppose the input layer is 7 x 7 x 3 (height x width x channels) and the filter size is 3 x 3 x 3; after 2D convolution with one filter, the output layer is 5 x 5 x 1 (with only one channel). As shown below:

In general, multiple filters are applied between two layers of a neural network; suppose the number of filters is 128. Applying 128 such 2D convolutions produces 128 output maps of size 5 x 5 x 1. These maps are then stacked into a single layer of size 5 x 5 x 128: the spatial dimensions (height and width) shrink while the depth expands. As shown below:

Let’s see how the same transformation can be achieved with depthwise separable convolution. First, we apply a depthwise convolution to the input layer. Instead of using a single filter of size 3 x 3 x 3, we use 3 kernels (each of size 3 x 3 x 1) in the 2D convolution. Each kernel convolves only one channel of the input layer, and each such convolution produces a map of size 5 x 5 x 1. Stacking these maps together gives an output image of size 5 x 5 x 3, so the depth of the image remains the same as the input.

Step 1 of depthwise separable convolution is exactly this depthwise convolution: 3 kernels of size 3 x 3 x 1 instead of a single 3 x 3 x 3 filter, producing an intermediate image of size 5 x 5 x 3. Step 2 is the pointwise convolution, which expands the depth: we perform 1 x 1 convolutions with kernels of size 1 x 1 x 3, and each 1 x 1 x 3 kernel, convolved with the 5 x 5 x 3 intermediate image, produces a map of size 5 x 5 x 1.

In this case, 128 such 1 x 1 convolutions give a layer of size 5 x 5 x 128.
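This walkthrough can be reproduced in a few lines of PyTorch (the tensors are random; only the shapes and parameter counts matter). As a sketch, the depthwise step is expressed as a grouped 3 x 3 convolution with groups equal to the channel count, and the pointwise step as a 1 x 1 convolution; the parameter counts for both the 7 x 7 x 3 example here and the earlier 16-to-32-channel example come out as stated above.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 7, 7)                                 # input: 7 x 7 x 3 (channels first)

dw = nn.Conv2d(3, 3, kernel_size=3, groups=3, bias=False)   # step 1: three depthwise 3 x 3 x 1 kernels
pw = nn.Conv2d(3, 128, kernel_size=1, bias=False)           # step 2: 128 pointwise 1 x 1 x 3 kernels

print(dw(x).shape)            # torch.Size([1, 3, 5, 5])
print(pw(dw(x)).shape)        # torch.Size([1, 128, 5, 5])

count = lambda m: sum(p.numel() for p in m.parameters())
standard = nn.Conv2d(3, 128, kernel_size=3, bias=False)     # standard conv: 128 filters of 3 x 3 x 3
print(count(standard), count(dw) + count(pw))               # 3456 vs 411

# The 16-in / 32-out example from the text:
std16 = nn.Conv2d(16, 32, kernel_size=3, bias=False)
dw16 = nn.Conv2d(16, 16, kernel_size=3, groups=16, bias=False)
pw16 = nn.Conv2d(16, 32, kernel_size=1, bias=False)
print(count(std16), count(dw16) + count(pw16))              # 4608 vs 656
```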

Grouped Convolution

Grouped convolution first appeared in AlexNet. Because hardware resources were limited at the time, the convolution operations during AlexNet training could not all be handled on a single GPU, so the authors distributed the feature maps across multiple GPUs for separate processing and finally fused the results from the GPUs.

The following describes how grouped convolution is implemented. First, the traditional 2D convolution is shown in the figure below: an input layer of size 7 x 7 x 3 is converted into an output layer of size 5 x 5 x 128 by applying 128 filters, each of size 3 x 3 x 3. In general, an input layer of size Hin x Win x Din is converted into an output layer of size Hout x Wout x Dout by applying Dout convolution kernels, each of size h x w x Din. In grouped convolution, the filters are split into different groups, and each group is responsible for a traditional 2D convolution over a fraction of the input depth. The example below makes this clearer.
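A small PyTorch sketch of the idea (the channel counts are arbitrary example values): with `groups=2`, the 16 input channels are split into two groups of 8, each group gets its own 16 filters of size 3 x 3 x 8, and the two partial outputs are concatenated. The output shape matches the ordinary convolution while the parameter count is halved.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

normal = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False)             # filters of size 3 x 3 x 16
grouped = nn.Conv2d(16, 32, kernel_size=3, padding=1, groups=2, bias=False)  # 2 groups, filters of size 3 x 3 x 8

print(normal(x).shape, grouped(x).shape)   # both torch.Size([1, 32, 32, 32])

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(normal), count(grouped))       # 4608 vs 2304
```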

Dilated Convolution

Dilated convolution introduces another parameter into the convolutional layer: the dilation rate, which defines the spacing between the values in the kernel. A 3 x 3 kernel with a dilation rate of 2 has the same receptive field as a 5 x 5 kernel while using only nine parameters. Imagine taking a 5 x 5 kernel and removing every second row and column (as shown in the figure below); the result is a larger receptive field at the same computational cost. Dilated convolution is particularly popular in real-time segmentation, and it is a good choice whenever a larger receptive field is needed but multiple convolutions or larger kernels cannot be afforded.

Intuitively, dilated (atrous) convolution "inflates" the convolution kernel by inserting spaces between its elements. The additional parameter l (the dilation rate) indicates how much we want to expand the kernel. The figure below shows the kernel when l = 1, 2, 4; when l = 1, dilated convolution reduces to standard convolution.
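A brief PyTorch check of the receptive-field claim (the input size is arbitrary): a 3 x 3 kernel with dilation 2 covers the same 5 x 5 region as an ordinary 5 x 5 kernel, so both shrink the input by the same amount, yet the dilated kernel has 9 parameters instead of 25.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 16, 16)

dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, bias=False)  # 3 x 3 kernel, dilation rate 2
plain5 = nn.Conv2d(1, 1, kernel_size=5, bias=False)               # ordinary 5 x 5 kernel

print(dilated(x).shape, plain5(x).shape)   # both torch.Size([1, 1, 12, 12])

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dilated), count(plain5))       # 9 vs 25
```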

Deconvolution

The deconvolution discussed here is quite different from deconvolution in 1D signal processing. The authors of FCN call it backwards convolution, and some people argue that "deconvolution layer" is a very unfortunate name and that it should rather be called a transposed convolutional layer.

A CNN contains convolutional layers and pooling layers: convolutional layers extract features from the image, while pooling layers filter out the important features by shrinking the image (typically by half). In classic image-recognition CNNs, such as those trained on ImageNet, the final output is 1 x 1 x 1000, where 1000 is the number of classes and 1 x 1 is the spatial size. The FCN authors (and later end-to-end researchers) deconvolve this final result (in fact the FCN authors' final output is not 1 x 1 but 1/32 of the image size, which does not affect the use of deconvolution). The principle of deconvolving the image here is the same as that of the full convolution described below; deconvolution is used to make the image larger. The method used by the FCN authors is a variant of the deconvolution described here, so that corresponding pixel values can be recovered and the network can work end to end.

Currently, the two most commonly used types of deconvolution are:

Method 1: full convolution, which can enlarge the original input domain;

Method 2: record the pooling indices, expand the space, and then fill it in with convolution. The deconvolution process for an image is as follows:

Input: 2×2, convolution kernel: 4×4, stride: 3, output: 7×7

That is, the 2×2 input image goes through a deconvolution with a 4×4 kernel and a stride of 3:

1. Perform full convolution on each pixel of the input image. From the full-convolution output-size calculation, each pixel becomes a feature map of size 1 + 4 - 1 = 4, i.e. a 4×4 feature map.

2. Fuse (add) the four feature maps with a stride of 3. For example, the red feature map stays at its original input position (upper left) and the green one at its original position (upper right). A stride of 3 means the maps are placed 3 pixels apart before being fused, and the overlapping parts are added. For instance, the element in the first row and fourth column of the output is the sum of the element in the first row and fourth column of the red feature map and the element in the first row and first column of the green feature map, and so on.

It can be seen that the output size of a deconvolution is determined by the kernel size and the stride. With in the input size, k the kernel size, s the stride, and out the output size:

out = (in - 1) × s + k

For the process in the figure above: (2 - 1) × 3 + 4 = 7.
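The size formula can be verified directly with PyTorch's transposed convolution (random input and untrained weights; only the output shape matters here): a 2×2 input with a 4×4 kernel and stride 3 gives a 7×7 output.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 2, 2)   # 2 x 2 input

# out = (in - 1) * s + k = (2 - 1) * 3 + 4 = 7
deconv = nn.ConvTranspose2d(1, 1, kernel_size=4, stride=3, bias=False)
print(deconv(x).shape)        # torch.Size([1, 1, 7, 7])
```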

Involution

Involution: Inverting the Inherence of Convolution for Visual Recognition (CVPR’21)

Source code address: github.com/d-li14/invo…

Despite the rapid development of neural network architectures, convolution is still the main building block of deep neural networks. Inspired by classical image filtering, the convolution kernel has two notable properties: it is spatial-agnostic and channel-specific. In the spatial domain, the former property ensures that convolution kernels are shared across different positions and achieves translation invariance. In the channel domain, a spectrum of convolution kernels is responsible for collecting the different information encoded in different channels, which satisfies the latter property. In addition, ever since VGGNet, modern neural networks have kept convolution kernels compact by limiting their spatial span to no more than 3×3.

On the one hand, although the spatial-agnostic and spatially compact properties are meaningful for improving efficiency and guaranteeing translation invariance, they deprive the convolution kernel of the ability to adapt to different visual patterns at different spatial positions. In addition, locality limits the receptive field of convolution, which poses challenges for small objects and blurry images. On the other hand, it is well known that inter-channel redundancy within the convolution kernel is prominent in many classical deep neural networks, which limits the flexibility of the kernel across different channels.

To overcome these limitations, the authors propose an operation called involution. Compared with standard convolution, involution has the symmetrically inverse properties: it is spatial-specific and channel-agnostic. Specifically, the involution kernel differs across spatial positions but is shared across channels. Because of this spatial specificity, if the involution kernel were parameterized as a fixed-size matrix like a convolution kernel and updated with back-propagation, the learned kernel could not be transferred between input images of different resolutions. To handle variable feature resolutions, the involution kernel belonging to a particular spatial position is instead generated as an instance conditioned only on the input feature vector at that position. In addition, redundancy in the involution kernel is reduced by sharing it along the channel dimension.

Combining these two factors, the computational complexity of involution grows linearly with the number of feature channels, and the dynamically parameterized involution kernel has broad coverage in the spatial dimension. Through this inverse design, the involution proposed in the paper has two advantages over convolution:

1: Involution can aggregate context over a wider spatial extent, thereby overcoming the difficulty of modeling long-range interactions;

2: Involution can adaptively allocate weights at different positions, prioritizing the most informative visual elements in the spatial domain.

It is also well known that recent work based on self-attention has gone further: many tasks are modeled with Transformers in order to capture long-range dependencies between features, and in these studies pure self-attention can be used to build standalone models with good performance. This paper reveals that self-attention, which models the relationship between neighboring pixels through a complicated formulation of the kernel structure, is actually a special case of involution. In contrast, the kernel used in this paper is generated from a single pixel rather than from its relationship to neighboring pixels. Furthermore, the authors show experimentally that even this simple form can match the accuracy of self-attention.

The calculation process of involution is shown in the figure below:

For the feature vector at a given coordinate of the input feature map, the kernel-generation function φ (FC-BN-ReLU-FC) followed by a reshape (channel-to-space rearrangement) expands it into the kernel shape, giving the involution kernel for that coordinate. This kernel is then multiplied and accumulated with the feature vectors in the neighborhood of that coordinate on the input feature map to produce the output feature map. The detailed operation process and tensor shape changes are as follows:
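Below is a minimal PyTorch sketch of that process, not the paper's exact code (the official implementation is at the repository above): a single group with stride 1, φ implemented as two 1 x 1 convolutions, and the neighborhood gathering done with `unfold`. The channel count, kernel size and reduction ratio are example values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Involution2d(nn.Module):
    """Minimal involution sketch: spatial-specific, channel-agnostic kernel (single group, stride 1)."""
    def __init__(self, channels, kernel_size=3, reduction=4):
        super().__init__()
        self.k = kernel_size
        # phi (FC-BN-ReLU-FC), realized here with 1 x 1 convolutions
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
        )
        self.span = nn.Conv2d(channels // reduction, kernel_size * kernel_size, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        # one K*K kernel per spatial position, generated from the feature vector at that position
        kernel = self.span(self.reduce(x)).view(b, 1, self.k * self.k, h, w)
        # gather the K*K neighborhood of every position for every channel
        patches = F.unfold(x, self.k, padding=self.k // 2).view(b, c, self.k * self.k, h, w)
        # multiply-add; the kernel is shared across channels
        return (kernel * patches).sum(dim=2)

x = torch.randn(2, 16, 14, 14)
print(Involution2d(16)(x).shape)   # torch.Size([2, 16, 14, 14])
```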

In addition, the author provides implementations of several models based on the MM-series codebases: MMClassification, MMSegmentation and MMDetection.
