“This is my 27th day participating in the First Challenge 2022. For details, see: First Challenge 2022.”
MobileNet V1
To cut to the chase: to obtain a lightweight network with a good balance of size and speed, V1 does one thing, namely replacing every standard convolution in VGG with a depthwise separable convolution.
Separable convolution
Separable convolutions come in two main types: spatially separable and depthwise separable.
Spatially separable convolution
As the name implies, spatially separable convolution splits a standard convolution kernel into two smaller kernels, one along the height direction and one along the width direction. A famous example is the Sobel edge-detection operator:

$$
\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}
\begin{bmatrix} -1 & 0 & 1 \end{bmatrix}
$$
In actual computation, the two small kernels are applied to the input one after the other. For a $3\times 3$ kernel, each output position then costs $3 + 3 = 6$ multiplications instead of $9$, which reduces the time complexity and speeds up the network.
However, not every kernel can be factored into two smaller ones, which is a serious problem during training: the network would be restricted to the subset of kernels that happen to be separable. For this reason, spatially separable convolution has not been widely adopted in deep learning.
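The factorization above can be checked with a quick NumPy sketch (naive loops for clarity, not an efficient implementation): correlating with the column kernel and then the row kernel gives the same result as correlating with the full Sobel kernel.

```python
import numpy as np

def correlate2d_valid(img, kernel):
    """Naive 2-D cross-correlation, 'valid' mode (illustrative, not fast)."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

sobel = np.array([[-1., 0., 1.],
                  [-2., 0., 2.],
                  [-1., 0., 1.]])
col = np.array([[1.], [2.], [1.]])   # vertical smoothing kernel (3x1)
row = np.array([[-1., 0., 1.]])      # horizontal difference kernel (1x3)

# The outer product of the two small kernels reconstructs the Sobel kernel
assert np.allclose(col @ row, sobel)

img = np.random.default_rng(0).random((8, 8))
full = correlate2d_valid(img, sobel)                             # 9 multiplies per pixel
separable = correlate2d_valid(correlate2d_valid(img, col), row)  # 3 + 3 multiplies
assert np.allclose(full, separable)
```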
Depthwise separable convolution
A feature map has not only height and width but also depth, i.e. its number of channels.
Standard convolution:
A standard convolution kernel has the same number of channels as the input feature map; the number of output channels is controlled by the number of kernels, as shown in the figure below.
Depth separable convolution:
In simple terms, depthwise separable convolution = depthwise convolution + pointwise convolution.
Depthwise convolution uses single-channel kernels, one for each channel of the input feature map, as shown in the figure below:
Pointwise convolution is the familiar $1\times 1$ convolution; its job is to mix channels and raise the depth of the depthwise output, as shown in the figure below:
Combining the three figures above, we can see that depthwise separable convolution takes the same input and produces the same output size as standard convolution. So what is its advantage?
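The two-step structure can be sketched in pure NumPy (naive loops for clarity; the input shape, channel counts, and kernel size below are example values I chose, not from the paper):

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """x: (M, H, W) input; dw_kernels: (M, k, k), one 2-D kernel per channel;
    pw_weights: (N, M), a 1x1 convolution mixing M channels into N."""
    M, H, W = x.shape
    k = dw_kernels.shape[1]
    oh, ow = H - k + 1, W - k + 1
    # Depthwise step: each input channel is filtered by its own single-channel kernel
    dw_out = np.zeros((M, oh, ow))
    for c in range(M):
        for i in range(oh):
            for j in range(ow):
                dw_out[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * dw_kernels[c])
    # Pointwise step: a 1x1 conv is a linear mix across channels at every position
    return np.tensordot(pw_weights, dw_out, axes=([1], [0]))  # shape (N, oh, ow)

rng = np.random.default_rng(0)
x = rng.random((3, 8, 8))    # M = 3 input channels
dw = rng.random((3, 3, 3))   # one 3x3 kernel per input channel
pw = rng.random((16, 3))     # pointwise weights raise the depth to N = 16
out = depthwise_separable_conv(x, dw, pw)
assert out.shape == (16, 6, 6)
```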
As we know, when the number of input and output channels is large, the parameter count and computation of a standard convolution are staggering. With a $D_K\times D_K$ kernel, $M$ input channels, $N$ output channels, and a $D_F\times D_F$ output feature map, the parameter count (1) and computation (2) are:

$$D_K\cdot D_K\cdot M\cdot N \tag{1}$$

$$D_K\cdot D_K\cdot M\cdot N\cdot D_F\cdot D_F \tag{2}$$
For depthwise separable convolution, the corresponding parameter count (3) and computation (4) are:

$$D_K\cdot D_K\cdot M + M\cdot N \tag{3}$$

$$D_K\cdot D_K\cdot M\cdot D_F\cdot D_F + M\cdot N\cdot D_F\cdot D_F \tag{4}$$
Dividing (3) by (1) and (4) by (2) gives the same ratio for both the parameter count and the computation:

$$\frac{D_K\cdot D_K\cdot M + M\cdot N}{D_K\cdot D_K\cdot M\cdot N} = \frac{1}{N} + \frac{1}{D_K^2}$$
For the commonly used $3\times 3$ kernel, depthwise separable convolution cuts the parameter count and computation to roughly $\frac{1}{9}$ to $\frac{1}{8}$ of the original.
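Plugging concrete numbers into the counts makes the saving tangible (the layer sizes below are illustrative values I picked, not from the paper):

```python
# MobileNet-style notation: D_K x D_K kernel, M input channels,
# N output channels, D_F x D_F output feature map.
# These layer sizes are illustrative example values.
D_K, M, N, D_F = 3, 64, 128, 56

std_params = D_K * D_K * M * N
std_flops = D_K * D_K * M * N * D_F * D_F

dws_params = D_K * D_K * M + M * N                 # depthwise + pointwise
dws_flops = (D_K * D_K * M + M * N) * D_F * D_F

ratio = dws_params / std_params
# Both ratios collapse to 1/N + 1/D_K^2 -- close to 1/9 for a 3x3 kernel
assert abs(ratio - (1 / N + 1 / D_K ** 2)) < 1e-12
assert abs(dws_flops / std_flops - ratio) < 1e-12
```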
The network structure
Convolution layer
The left side of the figure above is the common ConvX block: Conv + BN + ReLU. For depthwise separable convolution, this post-processing is inserted after both the depthwise convolution and the pointwise convolution.
Note that $ReLU6 = \min(\max(0, x), 6)$; compared with standard ReLU, its activation value is capped at 6. The authors argue this makes the computation more robust at low precision.
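The definition is a one-liner; a small NumPy sketch:

```python
import numpy as np

def relu6(x):
    # Standard ReLU, but clipped from above at 6
    return np.minimum(np.maximum(0.0, x), 6.0)

x = np.array([-3.0, 0.5, 6.0, 10.0])
assert np.array_equal(relu6(x), np.array([0.0, 0.5, 6.0, 6.0]))
```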
The main structure
Only the first layer uses a standard $3\times 3$ convolution; the rest of the network stacks depthwise separable convolutions, downsampling repeatedly along the way. Finally, average pooling + a fully connected layer + Softmax performs the classification.
Note that a depthwise separable convolution is written here as two layers: $Conv\ dw$ + $Conv\ 1\times 1$.
Smaller models
Although the basic MobileNet is already very small, a specific use case or application may demand an even smaller and faster model. To construct these smaller, cheaper models, a width-multiplier hyperparameter $\alpha\in(0,1]$ is introduced, which scales down the number of input and output channels at every layer. The parameter count and computation of each layer then become:

$$D_K\cdot D_K\cdot \alpha M + \alpha M\cdot \alpha N$$

$$D_K\cdot D_K\cdot \alpha M\cdot D_F\cdot D_F + \alpha M\cdot \alpha N\cdot D_F\cdot D_F$$
While $\alpha$ limits the width of the model, a second hyperparameter $\rho\in(0,1]$ is introduced to reduce the resolution of the input image, and hence of every layer's feature map. Since $\rho$ only shrinks the spatial size, it leaves the parameter count unchanged; the computation of each layer becomes:

$$D_K\cdot D_K\cdot \alpha M\cdot \rho D_F\cdot \rho D_F + \alpha M\cdot \alpha N\cdot \rho D_F\cdot \rho D_F$$
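To see how the two multipliers trade accuracy for cost, here is a small sketch (the layer dimensions are illustrative values I chose, same notation as above):

```python
# Cost of one depthwise separable layer under the width multiplier alpha
# and the resolution multiplier rho. Layer sizes are illustrative examples.
D_K, M, N, D_F = 3, 64, 128, 56

def dws_flops(alpha=1.0, rho=1.0):
    m, n, f = alpha * M, alpha * N, rho * D_F
    return D_K * D_K * m * f * f + m * n * f * f

def dws_params(alpha=1.0):
    m, n = alpha * M, alpha * N
    return D_K * D_K * m + m * n   # rho never appears: params are unchanged

base = dws_flops()
assert dws_flops(rho=0.5) == base * 0.25    # resolution scales cost by rho^2
assert dws_flops(alpha=0.5) < base * 0.5    # width scales cost roughly as alpha^2
```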
The effect
Very good