Classic networks

Classic networks include LeNet, AlexNet and VGG.

LeNet: 1998, Gradient-Based Learning Applied to Document Recognition

Designed for handwritten digit recognition, LeNet already establishes the basic framework of the convolutional neural network: the core components of convolution, activation, pooling and fully connected layers are all in place.

However, deep learning saw few breakthroughs after 1998. It stayed quiet until 2012, when AlexNet appeared out of nowhere, brought deep learning back into everyone's attention and ushered in its golden age.

Why 2012? First, data: there had been no large-scale dataset on which a wide range of tasks could be trained adequately, and now there was ImageNet. Second, compute: earlier hardware could not support large-scale training, and now powerful GPUs were available. Third, AlexNet itself was very well designed; it laid the foundation for later networks and showed everyone that things could be done this way.

AlexNet: 2012, ImageNet Classification with Deep Convolutional Neural Networks

ImageNet Top5 error rate: 16.4%, compared with 28.2% for the non-deep-learning approach two years earlier

The overall structure of AlexNet is similar to LeNet, but with significant improvements:

AlexNet uses the ReLU activation function and Dropout, which acts as a regularizer to prevent overfitting and improve the robustness of the model, along with several good training tricks, including data augmentation, learning rate schedules, weight decay, etc.

AlexNet was trained on GTX 580 graphics cards with 3 GB of memory (very old by now); one card was not enough, so the model was split into two parts and placed on two cards for parallel computation, as shown in the figure above. Although this was done only because a single card's resources were limited, many later networks took this idea of grouped convolution further (albeit with different motivations).

VGG: 2014, Very Deep Convolutional Networks for Large-Scale Image Recognition

After AlexNet, the next big improvement came from VGG, which reduced the Top5 error rate on ImageNet to 7.3%.

The main improvement: deeper, deeper! The number of layers grew from AlexNet's 8 to 16 and 19. A deeper network means more representational power, but it also demands more compute. Fortunately, hardware was developing rapidly too; the computing power of graphics cards grew quickly and boosted the rapid development of deep learning.

At the same time, only 3×3 convolution kernels are used, because two stacked 3×3 convolutions cover the same receptive field as one 5×5 convolution with fewer parameters, and most subsequent networks follow this paradigm.
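
As a quick back-of-the-envelope check of the parameter saving (a minimal sketch; the channel count is an arbitrary illustration, not taken from VGG):

```python
# Parameter count of one 5x5 conv vs. two stacked 3x3 convs with the same
# receptive field, assuming C input and C output channels and ignoring biases.
C = 256  # illustrative channel count

params_5x5 = 5 * 5 * C * C            # single 5x5 layer
params_two_3x3 = 2 * (3 * 3 * C * C)  # two stacked 3x3 layers

print(params_5x5)      # 1638400
print(params_two_3x3)  # 1179648  -> about 28% fewer parameters
```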

GoogLeNet and ResNet

Stacking convolutions layer by layer reached its peak with VGG, but going further that way is difficult: simply adding more layers runs into problems, since deeper networks are harder to train and the parameter count keeps growing.

Inception V1 (GoogLeNet): 2015, Going Deeper with Convolutions

ImageNet Top5 error rate: 6.7%

GoogLeNet increases network capacity along another dimension: each unit contains several parallel branches, making the network wider. The basic unit is as follows:

The overall structure of the network is shown below: it stacks multiple of the Inception modules above and adds two auxiliary classification branches that supply extra gradients for better training:

By arranging the network horizontally, a relatively shallow network can achieve good model capacity and fuse features at multiple scales, and it is also easier to train. In addition, to reduce computation, 1×1 convolutions are used to reduce the feature-channel dimension first. A stack of Inception modules is called an Inception network, and GoogLeNet is a well-designed, well-performing example of one (Inception V1).
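
To make the structure concrete, here is a minimal PyTorch-style sketch of an Inception-V1-style module; the branch channel sizes passed in below are illustrative choices, not a faithful reproduction of GoogLeNet's configuration:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Sketch of an Inception-V1-style module: parallel 1x1, 3x3, 5x5 and
    pooling branches, with 1x1 convolutions reducing the channel dimension;
    the branch outputs are concatenated along the channel axis."""

    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, kernel_size=1),        # 1x1 dimension reduction
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, kernel_size=1),
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
y = InceptionBlock(192, 64, 96, 128, 16, 32, 32)(x)
print(y.shape)  # torch.Size([1, 256, 28, 28])
```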

However, the problem that very deep networks are hard to train well was not solved until ResNet introduced the residual connection.

ResNet: 2016, Deep Residual Learning for Image Recognition

ImageNet Top5 error rate: 3.57%

ResNet solved this problem by introducing the shortcut connection:

With the shortcut connection, the network no longer has to learn the complete mapping and create the output from scratch; it only needs to learn the difference between the output and the input. Turning an absolute quantity into a relative one makes learning much easier, hence the name residual network. Moreover, the identity mapping introduced alongside the residual acts like a gradient highway, so vanishing gradients are avoided and training becomes easy. As a result, much deeper networks became possible, with the layer count jumping from GoogLeNet's 22 to ResNet's 152.
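
A minimal sketch of a basic residual block in PyTorch, assuming equal input and output channels so the identity shortcut applies directly (real ResNets also use strided and projection shortcuts):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions produce F(x), which is
    added to the shortcut x, so the block only learns the residual
    F(x) = H(x) - x instead of the full mapping H(x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # shortcut: identity added to the residual
```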

The network structure of ResNet-34 is as follows:

If LeNet, AlexNet, and VGG laid the foundation of classical neural networks, Inception and ResNet represent a new paradigm. Later networks build on these two, innovating and borrowing from each other: the Inception genre includes Inception V2 to V4 and Inception-ResNet V1 and V2, while the ResNet genre includes ResNeXt, DenseNet, Xception, and so on.

Inception genre

The core of the Inception genre is the Inception module, which comes in many variants, including Inception V2 to V4 and Inception-ResNet V1 and V2.

Inception V2 (BN-Inception): 2015, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

ImageNet Top5 error rate: 4.8%

(PS: By the naming in the third paper of the Inception series, a network similar to Inception V3 is called V2, while the fourth paper calls BN-Inception V2. The fourth paper's naming is used here, so Inception V2 refers to BN-Inception.)

The key addition is Batch Normalization. Previously, neural networks depended heavily on good initialization, and gradients dispersed when the network was too deep; both problems stem from poorly behaved activation distributions in the middle of the network. Since we want a well-behaved distribution, we simply transform the activations into one: a normalization step is added after each layer's output, subtracting the mean of each feature over the training batch and dividing by its standard deviation, yielding outputs with zero mean and unit standard deviation. Training then proceeds much more smoothly and gradients are far less prone to vanishing.
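
The training-time transform can be sketched in a few lines of PyTorch (the real layer also tracks running statistics for inference and fuses into frameworks differently; this is only the core idea):

```python
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each channel of an (N, C, H, W) tensor over the batch to
    zero mean and unit variance, then apply learnable scale/shift."""
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

x = torch.randn(8, 16, 32, 32)
gamma, beta = torch.ones(16), torch.zeros(16)
y = batch_norm_train(x, gamma, beta)
print(round(y.mean().item(), 3), round(y.std().item(), 3))  # ~0.0 and ~1.0
```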

Inception V3: 2015, Rethinking the Inception Architecture for Computer Vision

ImageNet Top5 error rate: 3.5%

Convolutions are decomposed further: a 5×5 convolution is replaced by two 3×3 convolutions, and a 7×7 by three 3×3 convolutions. A 3×3 convolution can in turn be replaced by a 1×3 convolution followed by a 3×1 convolution, reducing the amount of computation even more:
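
A sketch of this asymmetric factorization in PyTorch terms; the helper name and channel arguments are only for illustration:

```python
import torch.nn as nn

# A 3x3 convolution is replaced by a 1x3 conv followed by a 3x1 conv, which
# covers the same 3x3 receptive field with 6 weights per channel pair
# instead of 9.
def factorized_3x3(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=(1, 3), padding=(0, 1)),
        nn.Conv2d(out_ch, out_ch, kernel_size=(3, 1), padding=(1, 0)),
    )
```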

The overall network structure is as follows:

Inception V4, Inception-ResNet V1 and V2: 2016, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

ImageNet Top5 error rate: 3.08%

In Inception V1 to V3, the traces of manual design are obvious: the arrangement of the different convolution kernels and the network structure is quite particular, it is not clear why things are arranged that way, and the choices were justified experimentally. The authors state that this was historical baggage from hardware and software limitations; now, with TensorFlow (a bit of advertising in the paper), networks can be implemented as desired, so Inception V4 is given a more uniform, standardized design, similar to Inception V3 but without the many inconsistent special cases.

Meanwhile, the success of ResNet demonstrated the effectiveness of residual connections, so residual connections were introduced into the Inception module. The results are Inception-ResNet V1 and Inception-ResNet V2: the former is smaller, roughly comparable to Inception V3, and the latter is larger, roughly comparable to Inception V4. The Inception module looks like this:

ResNet genre

The ResNet genre is the other mainstream branch, including WRN, DenseNet, ResNeXt, and Xception.

DenseNet: 2016, Densely Connected Convolutional Networks

DenseNet takes the residual connection to the extreme: the output of each layer is connected directly to every later layer, so features can be reused much more effectively. Each layer can stay relatively narrow, since it aggregates the features of all previous layers, and the network is easy to train. The disadvantages are a larger GPU memory footprint and a somewhat more complicated backpropagation. The network structure is as follows:
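
A simplified PyTorch sketch of the dense connectivity pattern (the real DenseNet layers also use BN-ReLU ordering and 1×1 bottlenecks, omitted here):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all previous feature maps
    and contributes `growth_rate` new channels, so features are reused
    instead of re-learned."""

    def __init__(self, in_ch, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```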

ResNeXt: 2017, Aggregated Residual Transformations for Deep Neural Networks

ImageNet Top5 error rate: 3.03%

Inception borrowed from ResNet to get Inception-ResNet, and ResNet borrowed from Inception to get ResNeXt: each basic ResNet unit is widened horizontally by splitting the input into several groups and applying the same transformation (convolution) to each:

In the figure above, ResNet is on the left and ResNeXt on the right. By splitting the input along the channel dimension and using grouped convolution, each convolution kernel no longer needs to span all channels, so more and lighter kernels can be used. Coupling between kernels is also reduced, and higher accuracy is obtained for the same amount of computation.
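
A sketch of a ResNeXt-style bottleneck using PyTorch's grouped convolution; the channel widths and cardinality below are illustrative, and the shortcut addition around the block is omitted:

```python
import torch.nn as nn

# The 3x3 convolution is split into `groups` (cardinality) parallel paths
# via grouped convolution.
def resnext_bottleneck(in_ch=256, width=128, groups=32):
    return nn.Sequential(
        nn.Conv2d(in_ch, width, kernel_size=1, bias=False),
        nn.BatchNorm2d(width),
        nn.ReLU(inplace=True),
        nn.Conv2d(width, width, kernel_size=3, padding=1,
                  groups=groups, bias=False),  # 32 groups of 4 channels each
        nn.BatchNorm2d(width),
        nn.ReLU(inplace=True),
        nn.Conv2d(width, in_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(in_ch),
    )
```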

Xception: 2016, Xception: Deep Learning with Depthwise Separable Convolutions

Xception takes the idea of grouped convolution to its extreme, putting every channel into its own group. It uses depthwise separable convolution, as shown in the figure below: each of the J input channels is convolved with its own spatial kernel (e.g. 3×3), giving J output channels; then K ordinary 1×1 convolution kernels are applied to those J channels to produce the K final outputs:
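
In PyTorch terms this two-step operation can be sketched as follows (a depthwise 3×3 followed by a pointwise 1×1; the helper name is only for illustration):

```python
import torch.nn as nn

# Depthwise separable convolution: a 3x3 depthwise conv (one spatial kernel
# per input channel, i.e. groups == in_ch) followed by a 1x1 pointwise conv
# that mixes the J channels into K output channels.
def depthwise_separable(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depthwise
        nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # pointwise 1x1
    )
```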

Xception rests on the assumption that spatial convolution (e.g. the 3×3 in step 1) and cross-channel convolution (e.g. the 1×1 in step 2) can be performed completely independently, which reduces coupling between the two operations and uses compute more efficiently. Experiments show a clear accuracy improvement at the same computational cost. (However, low-level library support for grouped convolution is still imperfect, so the actual speed falls short of the theoretical figure; better library support is needed.)

Mobile networks

In addition to the mainstream ResNet and Inception genres, networks for mobile applications are another major direction, such as SqueezeNet, MobileNet V1 and V2, ShuffleNet, etc.

MobileNet V1: 2017, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Similar to Xception, it uses depthwise separable convolution to reduce computation, producing a network suited to mobile devices that strikes a good balance between accuracy and efficiency.

MobileNet V2: 2018, Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation

First, ReLU6 (ReLU with its output clipped at 6) is used, which suits mobile devices and quantizes better. Then a new structure, the inverted residual with linear bottleneck, is proposed: depthwise convolution is used in the middle of the ResNet-style basic block (one kernel per channel, reducing computation), the number of channels in the middle is larger than at the two ends (ResNet is shaped like a funnel, MobileNet V2 like a willow leaf), and the ReLU on the final output is removed. The basic structure is shown on the right side of the figure below:
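
A minimal sketch of such a block in PyTorch, assuming stride 1 and equal input/output channels so the skip connection applies; the expansion factor below is an illustrative default:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """A 1x1 conv expands the channels, a 3x3 depthwise conv works in the
    expanded space, and a 1x1 linear bottleneck (no ReLU) projects back
    down; ReLU6 is used in between."""

    def __init__(self, ch, expand=6):
        super().__init__()
        hidden = ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, ch, 1, bias=False),
            nn.BatchNorm2d(ch),  # linear bottleneck: no activation on the output
        )

    def forward(self, x):
        return x + self.block(x)
```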

ShuffleNet: 2017, ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

Xception works well, but its 1×1 convolutions take too much time and become a bottleneck. Making them grouped reduces computation, but the groups then become isolated from one another, so a shuffle is added to force information to flow between groups. The specific network structure is shown on the left of the figure above. Channel shuffle rearranges the channels so that the output of each group of one convolution is distributed across the different groups of the next convolution:

In the figure above, A has no shuffle and performs poorly, while B and C are equivalent forms with shuffle. ShuffleNet reaches the same accuracy as AlexNet while being 13 times faster in practice (18 times faster in theory).
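
The channel shuffle itself is just a reshape-transpose-reshape; a minimal sketch in PyTorch:

```python
import torch

def channel_shuffle(x, groups):
    """Reshape the channel dimension into (groups, channels_per_group),
    transpose, and flatten back, so the next grouped convolution sees
    channels from every previous group."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

x = torch.arange(8).float().view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]
```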

SENet

In addition to the proven networks described above, there are a variety of newer networks, such as NASNet, SENet, MSDNet, and more. In SENet (2017, Squeeze-and-Excitation Networks), after an ordinary convolution (single or composite) maps the input X to the output U, each channel of U is squeezed by global aggregation, and two FC layers then compute a weight for each channel; this process is called feature recalibration. This attention-style re-weighting suppresses uninformative features, boosts useful ones, combines easily with existing networks, and improves their performance with almost no extra computation.

The SE module is a general-purpose building block that can be plugged into existing networks to improve their performance.
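
A minimal PyTorch sketch of the SE module; the reduction ratio of 16 is a commonly used default, and the rest is simplified:

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-Excitation: global average pooling squeezes each channel
    to one value, two FC layers produce a weight per channel, and the input
    is rescaled channel-wise (feature recalibration)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: global average pooling -> (N, C)
        w = self.fc(w).view(n, c, 1, 1)  # excitation: per-channel weights in (0, 1)
        return x * w                     # recalibrate the original features
```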

Conclusion

Finally, a summary of Top5 results on ImageNet is shown in the figure below. The classification error rate on ImageNet has decreased year by year and is now lower than the human error rate (5.1%).