The 1×1 convolution kernel appears in most modern network modules. Popularized by the Network in Network paper, it is an almost indispensable building block in current architectures: 1×1 convolutions are commonly found in ResNet modules, depthwise separable convolution modules, Transformer modules, and so on. It is also a question that often comes up in interviews.
Before diving into the 1×1 convolution, let's first review a basic concept of convolutional networks: what is a convolution kernel in a neural network? In this article we will cover two points:
- What is a convolution kernel? What is a 1×1 convolution kernel?
- Why is the 1×1 convolution kernel so often added to basic modules, and what does it do?
It takes about 2 minutes to read the full text.
Convolution kernels
The convolution kernel can be thought of as a weighted sum over a local region; this is local perception. We humans, for example, do not recognize an object from a single pixel, nor can we take in the whole thing at once. Instead, we understand the parts first, and the combination of the parts eventually forms the overall picture of the observed object. Convolution corresponds to this kind of local-region observation.
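As a minimal sketch of this idea (using NumPy, with a made-up 6×6 input and a 3×3 averaging kernel), a single-channel convolution is just a weighted sum slid over local regions:

```python
import numpy as np

def conv2d_single_channel(x, k):
    # Slide the kernel over the input and take a weighted sum
    # of each local region (valid padding, stride 1).
    kh, kw = k.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "image"
k = np.ones((3, 3)) / 9.0                     # 3x3 averaging kernel
print(conv2d_single_channel(x, k).shape)      # (4, 4)
```

Each output value depends only on a 3×3 neighborhood of the input, which is exactly the local perception described above.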
The spatial (single-channel) size of a convolution kernel is generally 1×1, 3×3, 5×5, or 7×7 (usually of the form odd × odd). Along the channel axis, however, it can be any size, such as 3x3x32 or 5x5x32. Its shape is shown in the figure below:
By row, the figure above can be divided into two cases:
- The input is a 6x6x1 matrix; here the 1×1 convolution has the form 1x1x1 with the single element 2, and the output is also a 6x6x1 matrix, where each element of the output is the corresponding input element multiplied by 2.
- When the input is 6x6x32, the 1×1 convolution has the form 1x1x32. With only one 1×1 kernel, the output is 6x6x1: the channel count goes from 32 at the input to 1 at the output, as the sketch below shows.
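A quick way to verify these shapes is with PyTorch's `nn.Conv2d` (the random input below is purely illustrative):

```python
import torch
import torch.nn as nn

# One 6x6 feature map with 32 channels (PyTorch layout: N x C x H x W).
x = torch.randn(1, 32, 6, 6)

# A single 1x1 kernel spans all 32 input channels (weight shape 1x32x1x1),
# so one kernel collapses the 32 channels into 1.
conv = nn.Conv2d(in_channels=32, out_channels=1, kernel_size=1, bias=False)
print(conv.weight.shape)  # torch.Size([1, 32, 1, 1])
print(conv(x).shape)      # torch.Size([1, 1, 6, 6]) -- spatial size unchanged
```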
As shown above, the number of convolution kernels corresponds to the number of output channels. What needs to be stressed is that, for each output channel, the kernel applied to each input channel is different. For example, say the input is 28x28x192 (W x H x C, where C is the number of channels) and a 3×3 convolution with 128 output channels is applied. The convolution then has 3x3x192x128 parameters: the first two factors, 3×3, are the spatial size of each single-channel kernel, while the last two, 192×128, count the input-output channel pairs. (Intuitively, the weight sharing of a convolution kernel holds only spatially within each single channel; the kernels for different channel pairs are independent and not shared, hence the factor 192×128.)
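This count is easy to confirm in PyTorch (bias disabled so only the kernel weights are counted):

```python
import torch.nn as nn

# 192 input channels, 128 output channels, 3x3 kernels, as in the text.
conv = nn.Conv2d(192, 128, kernel_size=3, bias=False)
print(conv.weight.shape)    # torch.Size([128, 192, 3, 3])
print(conv.weight.numel())  # 221184 == 3 * 3 * 192 * 128
```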
Thus the definition and structure of the 1×1 convolution kernel are clear: it is a kernel whose spatial size is 1×1 and whose channel dimension can be any size.
Functions:
- Reducing or increasing dimension
- Adding nonlinearity
- Cross-channel information interaction (channel transformation)
- Reducing network parameters
1. Reduce or increase dimension
Since a 1×1 convolution changes neither height nor width, the first and most intuitive effect of changing the channel count is that the amount of data can be increased or decreased; other articles and blogs call this dimension increase or dimension reduction. Note that what changes here is only the channels axis of height × width × channels, as the short sketch below shows.
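A minimal sketch of both directions (the channel counts 64 and 16 are arbitrary, chosen just for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 28, 28)

reduce = nn.Conv2d(64, 16, kernel_size=1)  # 64 -> 16 channels: dimension reduction
expand = nn.Conv2d(16, 64, kernel_size=1)  # 16 -> 64 channels: dimension increase

print(reduce(x).shape)          # torch.Size([1, 16, 28, 28])
print(expand(reduce(x)).shape)  # torch.Size([1, 64, 28, 28]) -- H and W never change
```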
2. Add nonlinearity
A 1×1 convolution kernel, together with the nonlinear activation function that follows it, can greatly increase the nonlinearity of the network while keeping the scale of the feature map unchanged (that is, without losing resolution), which allows the network to become very deep.
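For instance, a sketch of stacked 1×1 convolutions, each followed by a ReLU (the channel count 64 is an arbitrary choice):

```python
import torch.nn as nn

# Each 1x1 convolution is followed by a ReLU: extra nonlinearity
# without touching the feature map's height or width.
block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=1),
    nn.ReLU(inplace=True),
)
```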
3. Cross-channel information interaction
Using a 1×1 convolution kernel to reduce or raise the channel dimension is, in effect, a linear recombination of information across channels. For example, following a 3×3 convolution with 64 channels by a 1×1 convolution with 28 channels is equivalent to a 3×3 convolution with 28 channels: the original 64 channels are linearly combined, across channels, into 28 channels. That is the information interaction between channels, demonstrated in the sketch below.
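One way to see this is that a 1×1 convolution is exactly a per-pixel linear combination over channels. A sketch, with shapes chosen only for illustration:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 8, 8)
conv = nn.Conv2d(64, 28, kernel_size=1, bias=False)

# The same computation as a per-pixel linear layer over channels:
# each of the 28 output channels is a linear combination of the 64 inputs.
w = conv.weight.view(28, 64)                  # (out_channels, in_channels)
manual = torch.einsum('oc,nchw->nohw', w, x)

print(torch.allclose(conv(x), manual, atol=1e-6))  # True
```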
4. Reduce network parameters
If the output of the previous layer is 100x100x128 and it passes through a 5×5 convolution layer with 256 outputs (stride=1, pad=2), the output is 100x100x256, and the convolution layer has 128x5x5x256 = 819,200 parameters.
Consider another arrangement: the 100x100x128 output first passes through a 1×1 convolution layer with 32 outputs, then through a 5×5 convolution layer with 256 outputs. The final output is still 100x100x256, but the parameter count becomes the sum of the two convolution layers: 128x1x1x32 + 32x5x5x256 = 4096 + 204800 = 208,896. The parameters are reduced by roughly a factor of four.
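The arithmetic, spelled out:

```python
# Direct 5x5 convolution: 128 -> 256 channels.
direct = 128 * 5 * 5 * 256                        # 819200

# Bottleneck: 1x1 convolution down to 32 channels, then 5x5 up to 256.
bottleneck = 128 * 1 * 1 * 32 + 32 * 5 * 5 * 256  # 4096 + 204800 = 208896

print(direct, bottleneck, round(direct / bottleneck, 2))  # 819200 208896 3.92
```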
Finally, to expand on the NIN (Network In Network) mentioned at the beginning; paper address: arxiv.org/abs/1312.44… The Network In Network model is only 29M.
This article is innovative in two ways:
- MLPConv (Multilayer Perceptron Convolution) layers: after a conventional convolution (receptive field greater than 1), several 1×1 convolutions follow. If each feature map is regarded as a neuron, the 1×1 convolutions act like a linear combination of multiple neurons' feature maps.
- Global Average Pooling: use global average pooling to replace the fully connected layer. Specifically, perform average pooling over each feature map of the last layer and feed the resulting vector directly into the SoftMax layer. One advantage of this approach is that each feature map becomes directly tied to the classification task. Another is that global average pooling introduces no additional model parameters to optimize, so model size and computation are greatly reduced compared with full connection, and overfitting can be avoided. Both ideas are sketched below.
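A hedged sketch of both ideas in PyTorch. `TinyNIN`, its layer widths, and the input shape are made up for illustration and are not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TinyNIN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # MLPConv: a conventional conv (receptive field > 1)
        # followed by 1x1 convs acting as an "MLP" across channels.
        self.mlpconv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, kernel_size=1), nn.ReLU(inplace=True),
        )
        # Global average pooling: one value per feature map, no extra parameters.
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        x = self.mlpconv(x)
        x = self.gap(x).flatten(1)   # (N, num_classes)
        return torch.softmax(x, dim=1)

print(TinyNIN()(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 10])
```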
This is the first article in the AI interview series. I have tried to describe things as simply as possible; the important thing is to understand them. If you have any questions, feel free to comment and I will reply as soon as I can. In the following articles, we will continue to put together practical and interesting pieces to share with you. If you like this one, remember to follow.
This was the first time I spent the weekend reading and writing articles in the café of the book city, and it felt great. I will come again next time.
Reference: zhuanlan.zhihu.com/p/40050371