GoogLeNet, the champion of the ILSVRC-2014 classification and detection tasks, introduced the creative Inception module, in contrast to VGGNet's simple stacking of small (3×3) convolutional layers. Although its structure is more complex, GoogLeNet has far fewer parameters, only about 1/12 as many as AlexNet (VGGNet has roughly three times as many parameters as AlexNet), while its accuracy is higher than VGG's.

Because of GoogLeNet's strong performance, the Inception V1 design was optimized further, producing Inception V2, Inception V3, Inception V4, Inception-ResNet V1, Inception-ResNet V2, and other models.

1. Inception V1

First, we need to clarify that the most direct way to improve the performance of deep neural networks is to increase the depth and width, but this brings two problems:

1. A larger size usually means more parameters, which makes the enlarged network easier to overfit, especially when labeled training samples are limited.

2. It consumes a lot of computing resources.


The design philosophy of GoogLeNet is:

1. Salient parts of an image can vary greatly in size.

2. Because the location of the relevant information varies so much, it is difficult to choose the right kernel size for the convolution operation.

3. For globally distributed information, a larger kernel is preferred; for locally distributed information, a smaller kernel is preferred.

4. Very deep networks are prone to overfitting, and it is also difficult to propagate gradient updates through the whole network.

5. Naively stacking large convolution operations is computationally expensive.

With these considerations in mind, an initial (naive) Inception module is proposed, as shown in the figure below.



Parallel convolutions with several receptive-field sizes (1×1, 3×3, 5×5) are applied to the output of the previous layer, and their results are concatenated.

However, this initial Inception module has too many parameters, resulting in too much computation.

Inspired by Network in Network, the authors use 1×1 convolutions to reduce the number of channels in the feature maps, yielding the final Inception V1 module structure:
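A minimal PyTorch sketch of such a module (the class name and channel counts are our own illustrative choices, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class InceptionV1Block(nn.Module):
    """Inception V1-style module: four parallel branches concatenated
    along the channel dimension, with 1x1 convolutions reducing the
    channel count before the expensive 3x3 and 5x5 convolutions."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)      # plain 1x1 branch
        self.b2 = nn.Sequential(                            # 1x1 reduce -> 3x3
            nn.Conv2d(in_ch, c3r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3r, c3, 3, padding=1))
        self.b3 = nn.Sequential(                            # 1x1 reduce -> 5x5
            nn.Conv2d(in_ch, c5r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5r, c5, 5, padding=2))
        self.b4 = nn.Sequential(                            # 3x3 pool -> 1x1 projection
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```

Each branch preserves the spatial size, so the four outputs can be concatenated along the channel dimension.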

Major contributions of the Inception architecture:

One is the use of 1×1 convolutions for dimensionality reduction; the other is performing convolutions at multiple scales in parallel and then aggregating the results.
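The effect of the 1×1 dimensionality reduction is easy to verify with weight-count arithmetic (the channel counts below are example values in the spirit of the paper, not exact figures):

```python
# Illustrative weight count (biases ignored) for one 5x5 branch of an
# Inception module: 192 input channels, 32 output channels, and a
# 16-channel 1x1 "bottleneck" in the reduced version.
def conv_params(k, c_in, c_out):
    # Weights of a k x k convolution mapping c_in -> c_out channels.
    return k * k * c_in * c_out

direct = conv_params(5, 192, 32)                             # 5x5 applied directly
reduced = conv_params(1, 192, 16) + conv_params(5, 16, 32)   # 1x1 reduce, then 5x5
print(direct, reduced)  # 153600 vs 15872 -- roughly a 10x saving
```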

Finally, the GoogLeNet network structure based on Inception V1 is shown below:


The details of each layer are shown below:


GoogLeNet V1 has 22 layers in total and only about 1/12 as many parameters as AlexNet. As the network deepens, the vanishing-gradient problem persists, so the authors add two auxiliary softmax classifiers at intermediate layers to strengthen the back-propagated gradient and provide some regularization. When computing the network loss, each intermediate auxiliary softmax loss is multiplied by a weight of 0.3 and added to the loss of the final layer; at inference time, the outputs of the auxiliary softmax layers are ignored.
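The loss combination described above can be sketched in PyTorch as follows (the function and argument names are hypothetical):

```python
import torch
import torch.nn.functional as F

def googlenet_loss(main_logits, aux1_logits, aux2_logits, targets, aux_weight=0.3):
    # Each auxiliary softmax loss is scaled by 0.3 and added to the
    # final classifier's loss; at inference only main_logits is used.
    loss = F.cross_entropy(main_logits, targets)
    loss = loss + aux_weight * F.cross_entropy(aux1_logits, targets)
    loss = loss + aux_weight * F.cross_entropy(aux2_logits, targets)
    return loss
```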


2. Inception V2

The main improvements in Inception V2 are as follows:

1. A Batch Normalization layer is added, giving the standard convolution–BN–ReLU structure

2. Following VGG, two 3×3 convolutions in series replace the 5×5 convolution in the Inception module: two stacked 3×3 convolutions have exactly the same receptive field as one 5×5 convolution but fewer parameters. In addition, the extra convolution layer adds another ReLU, i.e. one more nonlinear mapping, which makes the features more discriminative

3. Using asymmetric convolution, the 3×3 convolution is further decomposed into a 3×1 and a 1×3 convolution, as shown below:

Regarding asymmetric convolution, the paper notes:


1) An n×1 convolution followed by a 1×n convolution is equivalent to applying an n×n convolution directly.

2) Asymmetric convolution reduces the amount of computation: what used to be n×n multiplications becomes 2×n after decomposition, and the larger n is, the greater the saving.

3) Although it reduces computation, this method is not applicable everywhere. Asymmetric convolution should not be used in layers close to the input, where it hurts accuracy, but rather in higher layers; it works best when the feature map size is between 12×12 and 20×20.
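Both factorizations, two 3×3 convolutions replacing a 5×5 and the 1×n/n×1 asymmetric pair replacing an n×n, can be checked with simple weight-count arithmetic (channel counts here are arbitrary examples):

```python
def conv_weights(kh, kw, c_in, c_out):
    # Weight count of a kh x kw convolution, biases ignored.
    return kh * kw * c_in * c_out

C = 256
# One 5x5 conv vs two stacked 3x3 convs (same 5x5 receptive field):
p_5x5 = conv_weights(5, 5, C, C)            # 1,638,400
p_3x3_twice = 2 * conv_weights(3, 3, C, C)  # 1,179,648 (~28% fewer)

# One n x n conv vs a 1 x n followed by an n x 1 (asymmetric) conv,
# i.e. 2n weights per filter position instead of n*n:
n = 7
p_nxn = conv_weights(n, n, C, C)
p_asym = conv_weights(1, n, C, C) + conv_weights(n, 1, C, C)
print(p_5x5, p_3x3_twice, p_nxn, p_asym)
```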


There are three different Inception modules in Inception V2, as shown below.

The first type of module (Figure 5) is used on 35×35 feature maps; it mainly replaces one 5×5 convolution with two 3×3 convolutions.

The second module (Figure 6) further decomposes the 3×3 convolution into n×1 and 1×n convolutions and is applied to 17×17 feature maps. The structure is shown below:

The third type of module (Figure 7) is mainly used for high-dimensional features, on the network's 8×8 feature maps. The structure is shown below:

The complete network architecture of GoogLeNet Inception V2 is shown below:


3. Inception V3

Inception V3 keeps the overall network structure of Inception V2, with improvements in the optimization algorithm, regularization, and other aspects:

1. RMSProp replaces SGD as the optimization algorithm

2. BN is also added to the auxiliary classifiers

3. Label Smoothing Regularization (LSR) is used to avoid overfitting: it prevents the network from becoming over-confident in one class and makes it pay more attention to low-probability classes. (For more on LSR, see https://blog.csdn.net/lqfarmer/article/details/74276680.)

4. The first 7×7 convolution is factorized into three 3×3 convolutions
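LSR itself is simple: with K classes and smoothing factor ε (the paper uses ε = 0.1), the one-hot target q(k) = δ(k = y) becomes q'(k) = (1 − ε)·δ(k = y) + ε/K. A small sketch (the helper name is ours):

```python
def smooth_labels(y, num_classes, eps=0.1):
    # Mix the one-hot target with a uniform distribution: the correct
    # class gets 1 - eps + eps/K, every other class gets eps/K.
    uniform = eps / num_classes
    return [(1.0 - eps) + uniform if k == y else uniform
            for k in range(num_classes)]

q = smooth_labels(y=2, num_classes=5)
# q is approximately [0.02, 0.02, 0.92, 0.02, 0.02] and sums to 1
```

Because no class's target is exactly 1, the cross-entropy loss can never be driven to zero by making one logit arbitrarily large, which is what discourages over-confidence.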


4. Inception V4

The three Inception modules in Inception V4 are as follows:

1. Inception-A block:

Two 3×3 convolutions are used instead of a 5×5 convolution, and average pooling is used. This module mainly processes 35×35 feature maps. The structure is as follows:

2. Inception-B block:

1×n and n×1 convolutions are used instead of an n×n convolution, and average pooling is also used. This module mainly processes 17×17 feature maps. The structure is as follows:

3. Inception-C block:

This module mainly processes 8×8 feature maps. The structure is as follows:

When the feature map is reduced from 35×35 to 17×17 and then to 8×8, two dedicated Reduction structures are used instead of simple pooling layers. Their structures are shown below:

                 

                

Finally, the complete structure of Inception V4 is shown below:

5. Inception-ResNet V1 and V2

As the name implies, Inception-ResNet introduces the residual idea of ResNet into the Inception architecture: a shortcut branch adds shallow features to higher-level features, enabling feature reuse and avoiding the vanishing-gradient problem in deep networks.
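The core of this idea fits in a few lines of PyTorch (the class name is ours, and the single small branch stands in for a full Inception block; the paper also suggests scaling the residual branch by roughly 0.1–0.3 to stabilize training of very deep variants):

```python
import torch
import torch.nn as nn

class InceptionResidualSketch(nn.Module):
    """Residual Inception sketch: the branch output, scaled by a small
    factor, is added back to the input via the shortcut connection."""
    def __init__(self, ch, scale=0.17):
        super().__init__()
        self.scale = scale
        self.branch = nn.Sequential(   # stand-in for a real Inception branch
            nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Shortcut keeps shallow features flowing to deeper layers.
        return self.relu(x + self.scale * self.branch(x))
```

Because the shortcut requires the sum to be well defined, the branch must preserve the input's channel count and spatial size (the real blocks use a 1×1 filter-expansion convolution for this).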

Inception-ResNet V1:

The three basic modules are shown below:

The Reduction-A module of Inception-ResNet V1 is the same as that of Inception V4.

The Reduction-B module is as follows:



The complete architecture is as follows:

Inception-ResNet V2:

The three basic modules are shown below:

The Reduction-A module of Inception-ResNet V2 is also the same as that of Inception V4, and its Reduction-B module is as follows:

The overall architecture of Inception-ResNet V2 is the same as that of V1, but the Stem structures differ: the Stem of Inception-ResNet V2 is the same as that of Inception V4, as shown below:
