In an earlier post on my personal blog I analyzed the structure and basic algorithms of convolutional neural networks. This post walks through some of the classic network models in the development of convolutional neural networks.
LeNet-5
LeCun et al. applied the backpropagation algorithm to a multi-layer neural network and proposed the LeNet-5 model [1], applying it to handwritten digit recognition; this marks the formal debut of the convolutional neural network. The network model of LeNet-5 is shown in Figure 1, and its specific parameters in Figure 2.
- Input: a 32×32 grayscale image.
- Output: scores for the 10 digit classes.
- C1: the parameters of the convolution kernels are shown in the table. The spatial size after a convolution is given by output = (input − kernel + 2 × padding) / stride + 1. With a 5×5 kernel, stride 1 and no padding, each feature map after C1 is 32 − 5 + 1 = 28 on a side, so each map has 28 × 28 = 784 neurons and the layer outputs 28 × 28 × 6 = 4704 neurons in total. The convolution has 5 × 5 × 1 × 6 + 6 = 156 parameters; the parameter count is independent of the number of neurons and depends only on the kernel size (5×5 here), the input depth (1 here, since the previous layer is a single-channel image) and the number of kernels (6 here);
- S2: input 28×28×6. The network downsamples with 2×2 max pooling, so the output is 14×14×6 neurons;
- C3: after the C3 layer the output is 10×10×16, with 5×5×6×16 + 16 = 2416 parameters;
- S4: input 10×10×16. The parameters match the S2 layer, and after pooling the output is 5×5×16;
- C5: after the C5 layer the output is 1×1×120, with 5×5×16×120 + 120 = 48120 parameters (the kernels here are 5×5 and the input is also 5×5, so this layer is equivalent to a fully connected layer);
- F6: the output is 1×1×84, with 1×1×120×84 + 84 = 10164 parameters. Total parameters: 60856.
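The per-layer arithmetic above is easy to verify with a short script (the helper names `conv_out` and `conv_params` are my own, not from the paper):

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution: (size - kernel + 2*pad) // stride + 1."""
    return (size - kernel + 2 * pad) // stride + 1

def conv_params(kernel, in_ch, out_ch):
    """Weights plus one bias per output channel."""
    return kernel * kernel * in_ch * out_ch + out_ch

# C1: 32x32x1 input, six 5x5 kernels
assert conv_out(32, 5) == 28
assert conv_params(5, 1, 6) == 156

# C3: 14x14x6 input, sixteen 5x5 kernels
assert conv_out(14, 5) == 10
assert conv_params(5, 6, 16) == 2416

# C5: 5x5x16 input, 120 kernels of 5x5 -> 1x1x120 (acts like a fully connected layer)
assert conv_out(5, 5) == 1
assert conv_params(5, 16, 120) == 48120

# F6: 120 -> 84 fully connected
assert 120 * 84 + 84 == 10164

print(156 + 2416 + 48120 + 10164)  # → 60856, the total counted in this post
```

The same two helpers reproduce every count in the list, which makes it easy to experiment with other kernel sizes and strides.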
As can be seen from the specific parameters in Table 1, LeNet's network structure is simple and uniform: the convolutional layers C1, C3 and C5 share the same hyperparameters apart from their output dimensions, and the pooling layers S2 and S4 are likewise identical.
AlexNet
In 2012, Krizhevsky et al. won the ILSVRC 2012 image classification contest with a convolutional neural network, proposing the AlexNet model [2]. Its many innovations drove the subsequent wave of neural network research, and AlexNet is of milestone significance for convolutional neural networks. Compared with LeNet-5, it introduced the following improvements:
- Data augmentation:
  - horizontal flipping;
  - random cropping and translation;
  - color and illumination jittering.
- Dropout: like data augmentation, dropout is designed to prevent overfitting. Simply put, it drops neurons from the network at a certain rate during training. Figure 3 illustrates this well: the network before dropout is on the left and after dropout on the right. Dropout prevents overfitting to some extent and speeds up training.
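As a sketch of the idea (this is inverted dropout in NumPy; the function name and signature are my own, not from the paper):

```python
import numpy as np

def dropout(x, p=0.5, training=True, seed=None):
    """Inverted dropout: zero each activation with probability p and
    rescale survivors by 1/(1-p), so inference needs no rescaling."""
    if not training or p == 0.0:
        return x
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= p        # keep each neuron with probability 1-p
    return x * mask / (1.0 - p)

x = np.ones((4, 8))
y = dropout(x, p=0.5, seed=0)
assert set(np.unique(y)) <= {0.0, 2.0}               # survivors scaled up, the rest zeroed
assert np.array_equal(dropout(x, training=False), x)  # identity at test time
```

Because the surviving activations are rescaled during training, the network can be used unchanged at test time with all neurons active.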
- Input: a 224×224×3 image (for the arithmetic below the effective input is 227×227, since (224 − 11)/4 + 1 is not an integer under the first layer's 11×11 kernel with stride 4).
- Output: scores for the 1000 ImageNet classes.
- conv1: the output is 55×55×96, with 11×11×3×96 + 96 = 34944 parameters;
- pool1: the output is 27×27×96;
- conv2: the output is 27×27×256, with 5×5×96×256 + 256 = 614656 parameters;
- pool2: the output is 13×13×256;
- conv3: the output is 13×13×384, with 3×3×256×384 + 384 = 885120 parameters;
- conv4: the output is 13×13×384, with 3×3×384×384 + 384 = 1327488 parameters;
- conv5: the output is 13×13×256, with 3×3×384×256 + 256 = 884992 parameters;
- pool3: the output is 6×6×256;
- fc6: the output is 1×1×4096. This layer is fully connected to the flattened 6×6×256 pool3 output, so it has 6×6×256×4096 + 4096 = 37752832 parameters;
- fc7: the output is 1×1×4096, with 4096×4096 + 4096 = 16781312 parameters. Total parameters (excluding the final 1000-way output layer, which is not listed here): 58281344.
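The AlexNet counts can be reproduced the same way; the layer specs below are transcribed from the list above (the final 1000-way classifier is omitted, as in the list):

```python
# (kernel, in_channels, out_channels) for each convolutional layer listed above
layers = {
    "conv1": (11, 3, 96),
    "conv2": (5, 96, 256),
    "conv3": (3, 256, 384),
    "conv4": (3, 384, 384),
    "conv5": (3, 384, 256),
}

def n_params(k, cin, cout):
    return k * k * cin * cout + cout   # weights + biases

total = sum(n_params(*spec) for spec in layers.values())
total += 6 * 6 * 256 * 4096 + 4096     # fc6 on the flattened 6x6x256 pool3 output
total += 4096 * 4096 + 4096            # fc7
print(total)                           # → 58281344
```

Note how the two fully connected layers alone account for the bulk of the parameters, which is why later architectures work to shrink or remove them.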
Comparing the network structures of LeNet-5 and AlexNet shows that AlexNet is both deeper and has far more parameters.
ZFNet
ZFNet [3] was designed by Matthew Zeiler and Rob Fergus of New York University. The network is only a slight improvement on AlexNet; the paper's main contribution is to explain, to some extent, why convolutional neural networks are effective and how to improve their performance. Its contributions are:
- A deconvolutional network is used to visualize the feature maps. The visualizations show that shallow layers learn edge, color and texture features of the image, while deeper layers learn more abstract features.
- Based on the feature visualizations, the paper argues that the convolution kernels of AlexNet's first layer are too large, producing blurred features, and accordingly shrinks the first-layer kernels from 11×11 with stride 4 to 7×7 with stride 2.
- Several groups of occlusion experiments identify, by comparative analysis, which parts of the image are critical for classification.
- Deeper network models are shown to perform better.
The network model of ZFNet is shown in Figure 4, and the specific parameters are shown in Table 3.
VGG16
VGGNet [4] is a convolutional neural network model developed jointly by the Visual Geometry Group at the University of Oxford and researchers at Google DeepMind; it includes the VGG16 and VGG19 models. The network model is shown in Figure 5.
- All convolution kernels are 3×3 with stride 1;
- All max-pooling layers are 2×2 with stride 2;
- VGG16 has 16 weight layers, while AlexNet has only 8;
- Multiple scales are used for data augmentation during both training and testing.
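A key design point behind the all-3×3 choice: two stacked 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 convolution, but with fewer weights and an extra nonlinearity in between. A quick check (the channel width `C` here is an arbitrary example, not a value from the paper):

```python
def conv_weights(k, cin, cout):
    return k * k * cin * cout   # ignoring biases

C = 256  # example channel width, kept constant through the stack
stacked_3x3 = 2 * conv_weights(3, C, C)   # two 3x3 layers: receptive field 5x5
single_5x5 = conv_weights(5, C, C)        # one 5x5 layer: same receptive field

print(stacked_3x3, single_5x5)  # → 1179648 1638400, i.e. 18*C*C vs 25*C*C
assert stacked_3x3 < single_5x5
```

The saving grows with depth: three stacked 3×3 layers (27C² weights) replace a 7×7 layer (49C²).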
GoogLeNet
GoogLeNet [5] further increases the depth and width of the network model. However, simply widening and deepening a network on the pattern of VGG16 brings the following drawbacks:
- Too many parameters can easily lead to overfitting.
- As the network gets deeper, gradients tend to vanish.
The design of GoogLeNet was inspired by the paper Network in Network (NIN). NIN made two contributions:
- A multi-layer perceptron convolution layer (mlpconv) is proposed to strengthen the network's feature extraction. Figure 6 compares an ordinary convolution layer with an mlpconv layer; an mlpconv layer is equivalent to appending 1×1 convolution layers after an ordinary convolution layer.
- Global average pooling replaces the fully connected layers at the end of the network, greatly reducing the parameter count.
Building on mlpconv, GoogLeNet proposed the Inception structure, which has two versions. Figure 7 shows the naive version; the structure cleverly combines 1×1, 3×3 and 5×5 convolution kernels with a max-pooling layer into a single parallel block.
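At the shape level it is easy to see why the naive version is problematic: the branch outputs are concatenated along the channel axis, and the pooling branch keeps all input channels, so the output depth can only grow from block to block. A small sketch (the branch widths are illustrative, not the paper's):

```python
def naive_inception_channels(in_ch, n1x1, n3x3, n5x5):
    """Output depth of a naive Inception block: the four branches
    (1x1, 3x3, 5x5 convolutions and 3x3 max pooling) run in parallel
    on the same input and are concatenated along the channel axis.
    The pooling branch contributes all in_ch channels unchanged."""
    return n1x1 + n3x3 + n5x5 + in_ch

ch = 192  # example input depth
for _ in range(3):
    ch = naive_inception_channels(ch, 64, 128, 32)
    print(ch)  # → 416, 640, 864: depth balloons block after block
```

This runaway growth in depth (and hence in the cost of the 3×3 and 5×5 convolutions) is what the second version fixes by inserting 1×1 convolutions to reduce channels before the expensive branches.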
ResNet
The development of convolutional neural network models had repeatedly shown that deepening and widening the network achieves better results. However, later studies found that beyond a certain depth a deeper model performs worse than a shallower one, a phenomenon called "degradation", as shown in Figure 10. Notably, the deeper model's training error is also higher, so this is not simple overfitting.
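The remedy proposed in [6] is the residual connection: a block outputs relu(x + F(x)), where F is the stacked layers, so if the extra layers are not useful, F can be driven toward zero and the block falls back to the identity mapping. A minimal NumPy sketch (shapes and names here are illustrative, not the paper's):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(x + F(x)) with F(x) = w2 @ relu(w1 @ x), the identity
    skip carrying x straight to the output."""
    return relu(x + w2 @ relu(w1 @ x))

d = 4
x = np.array([0.5, 1.0, 2.0, 0.1])   # non-negative activations

# If the residual branch is driven to zero, the block degenerates to
# the identity, so adding such blocks cannot make the network worse.
w1 = np.zeros((d, d))
w2 = np.zeros((d, d))
assert np.allclose(residual_block(x, w1, w2), x)
```

Learning the residual F(x) rather than the full mapping is what lets networks with over a hundred layers train without degradation.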
References
[1] Lecun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11):2278-2324.
[2] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]// International Conference on Neural Information Processing Systems. Curran Associates Inc. 2012:1097-1105.
[3] Zeiler M D, Fergus R. Visualizing and Understanding Convolutional Networks[C]// European Conference on Computer Vision. Springer, 2014:818-833.
[4] Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[5] Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2015:1-9.
[6] He K, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition[C]// Computer Vision and Pattern Recognition. IEEE, 2016:770-778.