WeChat official account: Ilulaoshi; personal website: lulaoshi.info/machine-lea…

LeNet, AlexNet, and VGG share a common pattern: convolutional layers first extract spatial features from the image, fully connected layers follow, and the classification result is produced at the end. Shuicheng Yan and his co-authors proposed Network in Network (NiN), which constructs convolutional and fully connected layers from a different perspective.

1×1 convolution layer

As we know, a convolutional layer normally has a kernel with some height and width, and it recognizes image features within that convolution window. If the kernel's height and width are exactly 1, the computation works as follows:

In the figure, the convolution kernel has three input channels and two output channels. $(K_{0,0}, K_{0,1}, K_{0,2})$ are the parameters corresponding to the first output channel, and $(K_{1,0}, K_{1,1}, K_{1,2})$ are the parameters corresponding to the second output channel. Each output element is obtained by multiplying the light-colored part of the input with the light-colored part of the convolution kernel:

$$I_{0,i,j} \times K_{0,0} + I_{1,i,j} \times K_{0,1} + I_{2,i,j} \times K_{0,2}$$

We can view $(I_{0,i,j}, I_{1,i,j}, I_{2,i,j})$, the input values at position $(i, j)$ across the different channels, as the feature vector of an MLP, and $(K_{0,0}, K_{0,1}, K_{0,2})$ as the weights of that MLP. Multiplying the features by the weights one by one is essentially the same operation as a fully connected layer. Therefore, a 1×1 convolutional layer can do the work of a fully connected layer.

After a 1×1 convolutional layer, the height and width stay the same; only the number of channels changes. In the example above, the input has 3 channels and the output has 2: the output keeps the input's height and width, while its number of channels becomes 2.
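The following is a minimal sketch (the variable names are illustrative, not from the original code) showing that a 1×1 convolution keeps the spatial size while changing the channel count, and that it computes the same result as applying a linear layer to each pixel's channel vector:

import torch
import torch.nn as nn

# A 1x1 convolution mapping 3 input channels to 2 output channels (no bias, for clarity)
conv1x1 = nn.Conv2d(3, 2, kernel_size=1, bias=False)

x = torch.randn(1, 3, 4, 4)            # input: 3 channels, 4x4 spatial size
y = conv1x1(x)
print(y.shape)                          # torch.Size([1, 2, 4, 4]) -- height and width unchanged

# Reusing the same weights as a fully connected layer over the channel dimension
# gives an identical result at every pixel position.
fc = nn.Linear(3, 2, bias=False)
fc.weight.data = conv1x1.weight.data.view(2, 3)
y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
print(torch.allclose(y, y_fc, atol=1e-6))   # True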

NiN

NiN mainly uses 1×1 convolutional layers in place of fully connected layers. The figure below shows how it differs from LeNet/AlexNet/VGG:

Like VGG, NiN also introduces the concept of a basic block. An NiN block consists of one ordinary convolutional layer followed by two 1×1 convolutional layers.

import torch.nn as nn

def nin_block(in_channels, out_channels, kernel_size, strides, padding):
    ''' Returns an NiN block: an ordinary convolution followed by two 1x1 convolutions '''
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, strides, padding),
        nn.ReLU(),
        # the two 1x1 convolutions act as per-pixel fully connected layers
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU())
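As a quick check (the shapes here are only illustrative; the full network below uses a 224×224 input), the first block of the network maps a 1-channel 224×224 image to 96 channels of size 54×54:

import torch

blk = nin_block(1, 96, kernel_size=11, strides=4, padding=0)
x = torch.rand(1, 1, 224, 224)
print(blk(x).shape)   # torch.Size([1, 96, 54, 54])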

The structure of the network is as follows:

We construct a NiN network for the Fashion-MNIST dataset, so the number of input channels is 1. The network contains four NiN blocks:

  • The first block consists of an 11×11 convolutional layer and two 1×1 convolutional layers
  • The second block consists of a 5×5 convolutional layer and two 1×1 convolutional layers
  • The third block consists of a 3×3 convolutional layer and two 1×1 convolutional layers
  • The fourth block consists of a 3×3 convolutional layer and two 1×1 convolutional layers

NiN removes the final fully connected layers used by LeNet/AlexNet/VGG. Instead, the last NiN block has a number of output channels equal to the number of label classes, and a global average pooling layer then averages all elements within each channel to produce the classification scores directly. After the last block, the data has shape 10 × 5 × 5, where 10 is the number of channels and also the number of classes. The global average pooling layer reduces each 5×5 feature map to a single 1×1 value (a scalar).
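A minimal sketch of this last step (the tensor here is random and just follows the shapes described above): global average pooling turns each 5×5 feature map into one number, and flattening yields the 10 class scores.

import torch
import torch.nn as nn

x = torch.randn(1, 10, 5, 5)           # output of the last NiN block: 10 channels of size 5x5
gap = nn.AdaptiveAvgPool2d((1, 1))     # global average pooling: one scalar per channel
y = nn.Flatten()(gap(x))
print(gap(x).shape)                    # torch.Size([1, 10, 1, 1])
print(y.shape)                         # torch.Size([1, 10]) -- one score per class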

This design significantly reduces the number of model parameters, which helps alleviate overfitting. However, it sometimes increases the training time needed to obtain an effective model.

def nin():
    ''' Returns the NiN network '''
    # Fashion-MNIST images are 1 * 28 * 28 and are resized to 1 * 224 * 224
    # input shape: 1 * 224 * 224
    net = nn.Sequential(
        # 1 * 224 * 224 -> 96 * 54 * 54
        nin_block(1, 96, kernel_size=11, strides=4, padding=0),
        # 96 * 54 * 54 -> 96 * 26 * 26
        nn.MaxPool2d(kernel_size=3, stride=2),
        # 96 * 26 * 26 -> 256 * 26 * 26
        nin_block(96, 256, kernel_size=5, strides=1, padding=2),
        # 256 * 26 * 26 -> 256 * 12 * 12
        nn.MaxPool2d(3, stride=2),
        # 256 * 12 * 12 -> 384 * 12 * 12
        nin_block(256, 384, kernel_size=3, strides=1, padding=1),
        # 384 * 12 * 12 -> 384 * 5 * 5
        nn.MaxPool2d(3, stride=2),
        nn.Dropout(0.5),
        # 384 * 5 * 5 -> 10 * 5 * 5
        nin_block(384, 10, kernel_size=3, strides=1, padding=1),
        # 10 * 5 * 5 -> 10 * 1 * 1
        nn.AdaptiveAvgPool2d((1, 1)),
        # flatten to get the final classification scores
        nn.Flatten())

    return net
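To verify the shape comments above, we can feed a dummy input through the network layer by layer and also count the parameters (a quick sanity check, not part of the original code):

import torch

net = nin()
x = torch.rand(1, 1, 224, 224)
for layer in net:
    x = layer(x)
    print(layer.__class__.__name__, 'output shape:', x.shape)

# total number of parameters
print(sum(p.numel() for p in net.parameters()))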

The training process is essentially the same as for the LeNet, AlexNet, and VGG models shared earlier; the full code is available on GitHub.
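For reference, here is a minimal training sketch using torchvision's Fashion-MNIST loader and a plain SGD loop, with images resized to 224×224 as the network expects (the hyperparameters are illustrative and the code on GitHub may be organized differently):

import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# resize 28x28 Fashion-MNIST images to 224x224
transform = transforms.Compose([transforms.Resize(224), transforms.ToTensor()])
train_set = torchvision.datasets.FashionMNIST(
    root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

net = nin().to(device)
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(5):
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(net(X), y)
        loss.backward()
        optimizer.step()
    print(f'epoch {epoch + 1}, loss {loss.item():.3f}')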

Summary

  • NiN builds the network by repeatedly using NiN blocks, each composed of a convolutional layer and 1×1 convolutional layers that take the place of fully connected layers.
  • NiN removes the fully connected layers, which are prone to overfitting. In the final output part, the number of output channels equals the number of label classes, and a global average pooling layer produces the final classification result. Removing the fully connected layers also significantly reduces the number of model parameters.
  • The above design ideas of NiN influenced the subsequent design of a series of convolutional neural networks.

References

  1. Lin, M., Chen, Q., & Yan, S. (2013). Network in network.
  2. d2l.ai/chapter_con…
  3. tangshusen.me/Dive-into-D…