
MobileNet V1

To cut to the chase: to obtain a lightweight network with a good trade-off between size and speed, V1 does one thing. It replaces all the standard convolutions in VGG with depthwise separable convolutions.

Separable convolution

Separable convolution comes in two main types: spatially separable and depthwise separable.


Spatially separable convolution

As the name implies, a spatially separable convolution splits the standard convolution kernel into two smaller kernels, one along the height direction and one along the width direction. The formula below shows a very famous edge-detection operator, the Sobel kernel:


$$\left[\begin{array}{ccc} -1&0&1\\ -2&0&2\\ -1&0&1 \end{array}\right]= \left[\begin{array}{c} 1\\ 2\\ 1 \end{array}\right]\times \left[\begin{array}{ccc} -1&0&1 \end{array}\right]$$

In actual computation, the two small kernels are applied to the input one after the other. Each output pixel then costs 3 + 3 = 6 multiplications instead of the 9 required by the standard 3×3 kernel, which reduces the computational cost and speeds up the network.
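The decomposition above is easy to verify numerically. Below is a minimal numpy sketch (variable names are my own) showing that the outer product of the 3×1 column and the 1×3 row rebuilds the full Sobel kernel:

```python
import numpy as np

# The 3x3 Sobel kernel factors into a 3x1 column and a 1x3 row.
col = np.array([[1], [2], [1]])   # 3x1 smoothing kernel
row = np.array([[-1, 0, 1]])      # 1x3 differencing kernel
sobel = col @ row                 # outer product rebuilds the 3x3 kernel

print(sobel)
# Convolving with `col` and then `row` gives the same result as one
# pass with the full 3x3 kernel, at 6 multiplications per output
# pixel instead of 9.
```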

However, not every convolution kernel can be factored into two smaller kernels. This is troublesome in training, because the network could only search over the subset of kernels that happen to be separable, so spatially separable convolution has not been widely used in deep learning.

Depthwise separable convolution

A feature map has not only a height and a width, but also a depth, i.e. the number of channels.

Standard convolution:

For a standard convolution kernel, the number of channels is the same as the number of channels of the input feature map. We control the number of output channels by controlling the number of convolution kernels, as shown in the figure below.

Depth separable convolution:

In simple terms, depthwise separable convolution = depthwise convolution + pointwise convolution.

Depthwise convolution uses single-channel kernels, one per input channel, so the number of kernels equals the number of channels of the input feature map, as shown in the diagram below:

Pointwise convolution is the familiar $1\times 1$ convolution; its role is to recombine the channels of the depthwise output and raise the channel dimension, as shown in the diagram below:
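The two steps can be sketched in a few lines of numpy. This is a naive, loop-based illustration (function and variable names are my own, valid padding and stride 1 assumed), not an efficient implementation:

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Illustrative sketch: depthwise conv followed by 1x1 pointwise conv.

    x:          input feature map, shape (C_in, H, W)
    dw_kernels: one k x k kernel per input channel, shape (C_in, k, k)
    pw_weights: 1x1 conv weights mixing channels, shape (C_out, C_in)
    """
    c_in, h, w = x.shape
    k = dw_kernels.shape[1]
    ho, wo = h - k + 1, w - k + 1  # valid convolution, stride 1

    # Depthwise step: each channel is filtered by its own single-channel kernel.
    dw = np.zeros((c_in, ho, wo))
    for c in range(c_in):
        for i in range(ho):
            for j in range(wo):
                dw[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * dw_kernels[c])

    # Pointwise step: a 1x1 convolution that recombines the channels.
    return np.tensordot(pw_weights, dw, axes=([1], [0]))

x = np.random.rand(3, 8, 8)    # C_in = 3
dw = np.random.rand(3, 3, 3)   # one 3x3 kernel per input channel
pw = np.random.rand(16, 3)     # C_out = 16
out = depthwise_separable_conv(x, dw, pw)
print(out.shape)  # (16, 6, 6)
```

In a real framework the same thing is done with a grouped convolution (groups = C_in) followed by a 1×1 convolution.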

Combining the three figures above, we can see that a depthwise separable convolution takes the same input and produces an output of the same size as a standard convolution. So what are its advantages?

As we know, when the numbers of input and output channels are large, the parameter count and computational cost of standard convolution are enormous:


$$Params: k_w\times k_h\times C_{in}\times C_{out}\tag{1}$$

$$Flops: k_w\times k_h\times C_{in}\times C_{out}\times W\times H\tag{2}$$

For depthwise separable convolution:


$$Params: k_w\times k_h\times C_{in}+C_{in}\times C_{out}\tag{3}$$

$$Flops: k_w\times k_h\times C_{in}\times W\times H+C_{in}\times C_{out}\times W\times H\tag{4}$$

Comparing (3) with (1) and (4) with (2), both the parameter count and the computational cost shrink by the same ratio:


$$\frac{k_w\cdot k_h+C_{out}}{k_w\cdot k_h\cdot C_{out}}=\frac{1}{C_{out}}+\frac{1}{k_w\cdot k_h}$$

For the commonly used $3\times 3$ kernel, depthwise separable convolution reduces the parameter count and computation to roughly $\frac{1}{9}$ to $\frac{1}{8}$ of the original.
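A quick sanity check of formulas (1)-(4) and the ratio, using an arbitrary layer shape of my choosing:

```python
# Plug an example layer shape into formulas (1)-(4) above.
kw = kh = 3
c_in, c_out, w, h = 64, 128, 56, 56

std_params = kw * kh * c_in * c_out                  # eq. (1)
std_flops = std_params * w * h                       # eq. (2)
dws_params = kw * kh * c_in + c_in * c_out           # eq. (3)
dws_flops = dws_params * w * h                       # eq. (4)

ratio = dws_params / std_params
# The ratio equals 1/C_out + 1/(kw*kh); for a 3x3 kernel and a
# reasonably large C_out it sits between 1/9 and 1/8.
print(ratio, 1 / c_out + 1 / (kw * kh))
```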

The network structure

Convolution layer

The left side of the figure above is the common ConvX block: Conv + BN + ReLU. For depthwise separable convolution, this post-processing is inserted after both the depthwise convolution and the pointwise convolution.

Note that $ReLU6 = min(max(0, x), 6)$; that is, compared with the standard ReLU, its activation value is capped at 6. The authors consider this to make the model more robust under low-precision computation.
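ReLU6 is a one-liner; a small numpy sketch:

```python
import numpy as np

def relu6(x):
    # ReLU6 = min(max(0, x), 6): standard ReLU with activations capped at 6.
    return np.minimum(np.maximum(0.0, x), 6.0)

x = np.array([-3.0, 0.5, 6.0, 10.0])
print(relu6(x))  # [0.  0.5 6.  6. ]
```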

The main structure

Only the first layer uses a standard $3\times 3$ convolution; the subsequent layers are a stack of depthwise separable convolutions. Downsampling is carried out repeatedly in between, and finally average pooling + a fully connected layer + Softmax performs the classification.

Note that the depthwise separable convolution is written here as two layers: $Conv\ dw + Conv\ 1\times 1$.

Smaller models

Although the basic MobileNet is already very small, specific cases or applications often require an even smaller and faster model. To construct these smaller, less computation-intensive models, a hyperparameter $\alpha\in(0,1]$ (the width multiplier) is introduced, which scales down the number of input and output channels at every layer. The per-layer parameter count and computational cost then become:


$$Params: k_w\times k_h\times \alpha C_{in}+\alpha C_{in}\times \alpha C_{out}$$
$$Flops: k_w\times k_h\times \alpha C_{in}\times W\times H+\alpha C_{in}\times \alpha C_{out}\times W\times H$$

$\alpha$ only limits the width of the model. To also shrink the resolution of the input image, a second hyperparameter $\rho\in(0,1]$ (the resolution multiplier) is introduced, which is applied to the feature map of every layer. The parameter count and computational cost can then be expressed as:


$$Params: k_w\times k_h\times \alpha C_{in}+\alpha C_{in}\times \alpha C_{out}$$
$$Flops: k_w\times k_h\times \alpha C_{in}\times \rho W\times \rho H+\alpha C_{in}\times \alpha C_{out}\times \rho W\times \rho H$$
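Note that $\rho$ leaves the parameter count unchanged; it only reduces FLOPs. The combined effect of the two multipliers on one layer can be illustrated with a small calculation (layer shape chosen arbitrarily):

```python
# Illustrate how the width multiplier alpha and the resolution
# multiplier rho scale the FLOPs of one depthwise separable layer.
kw = kh = 3
c_in, c_out, w, h = 64, 128, 56, 56

def dws_flops(alpha, rho):
    dw = kw * kh * (alpha * c_in) * (rho * w) * (rho * h)
    pw = (alpha * c_in) * (alpha * c_out) * (rho * w) * (rho * h)
    return dw + pw

base = dws_flops(1.0, 1.0)
small = dws_flops(0.5, 0.5)
# The pointwise term, which dominates, shrinks roughly by alpha^2 * rho^2.
print(small / base)
```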

The effect

Very good