MobileNet series: MobileNet_v1
MobileNet series: MobileNet_v2
Introduction:
Following MobileNet_v1 and v2, MobileNet_v3 came out in 2019. The paper proposes two models, MobileNetV3-Large and MobileNetV3-Small, whose main difference is the number of layers (the rest of the design is the same), as well as a MobileNetV3-Large LR-ASPP model for semantic segmentation.
MobileNet_v3 achievements
On ImageNet classification, MobileNetV3-Large achieves 20% lower latency than MobileNet_v2 while being 3.2% more accurate. MobileNetV3-Small is 6.6% more accurate than MobileNet_v2 at comparable latency.
On COCO detection, MobileNetV3-Large is roughly 25% faster than v2 at almost the same accuracy.
On Cityscapes, MobileNetV3-Large LR-ASPP is 34% faster than MobileNet_v2 R-ASPP at similar accuracy. I think it is fair to say the network lived up to expectations.
MobileNet_v3 main content
MobileNet_v3 aims to let the device itself run inference efficiently, with high accuracy and low latency, instead of sending data to a server for inference and returning the result to the device. To this end, the paper proposes the following:
1) Complementary search techniques
2) New efficient versions of nonlinearities that are practical in the mobile setting
3) A new efficient network design
4) A new efficient segmentation decoder (for semantic segmentation)
01 NAS and NetAdapt
There is a well-known joke in the field of artificial intelligence: the more "artificial" (manual) effort goes in, the more "intelligence" comes out. The efficient performance of many existing networks is inseparable from careful manual design and tuning of the model. In recent years, getting computers to generate networks automatically, according to the task requirements and the available hardware, has become an important research direction.
NAS stands for Neural Architecture Search. NAS defines a search space, uses a search strategy to explore that space, evaluates each candidate with an evaluation strategy, and finally selects the structure with the best evaluation result.
Early NAS was mainly operation-level (cell-level) search: a predefined set of operations (convolution, pooling, and so on), together with per-layer choices such as kernel size, stride, and number of filters, makes up the search space. The search strategy selects operations from this set, the resulting network is trained on the training set and evaluated on the validation set, and the search stops once a certain threshold is reached.
More recently, block-level search has appeared: blocks of different structures, sizes, and counts (for example, Inception-style blocks) make up the search space, and the search strategy assembles these blocks directly into a network for evaluation.
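To make this loop concrete, here is a minimal, hypothetical sketch of the generic search procedure just described, using random search over a toy operation-level space. The space, `sample_architecture`, `train_and_evaluate`, and the candidate budget are illustrative assumptions, not the search actually used by MobileNet_v3.

```python
import random

# Toy operation-level search space: per-layer choices of operation,
# kernel size, and filter count (an assumption for illustration only).
SEARCH_SPACE = {
    "op": ["conv", "depthwise_conv", "max_pool"],
    "kernel_size": [3, 5, 7],
    "num_filters": [16, 32, 64],
}

def sample_architecture(num_layers=5):
    """Search strategy (here: plain random sampling) picks one option per layer."""
    return [
        {key: random.choice(options) for key, options in SEARCH_SPACE.items()}
        for _ in range(num_layers)
    ]

def train_and_evaluate(arch):
    """Evaluation strategy placeholder: build the network described by `arch`,
    train it on the training set, and return validation accuracy.
    A real trainer would go here; a random score keeps the sketch runnable."""
    return random.random()

best_arch, best_acc = None, 0.0
for _ in range(20):                      # evaluation budget
    arch = sample_architecture()
    acc = train_and_evaluate(arch)
    if acc > best_acc:                   # keep the best-scoring structure
        best_arch, best_acc = arch, acc
```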
MobileNet_v3 uses platform-aware NAS (block-wise search) to construct the global network structure, and then uses NetAdapt to fine-tune the structure layer by layer. The search objective is ACC(m) × [LAT(m)/TAR]^w, where ACC(m) is the model's accuracy, LAT(m) is its latency, TAR is the target latency, and w is a hyperparameter.
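As a reading aid, the search objective above can be written out directly; this is a minimal sketch, and the default w = -0.07 is borrowed from the MnasNet-style setup the search builds on (treat it as an assumption).

```python
def nas_reward(acc, latency_ms, target_ms, w=-0.07):
    """Search objective ACC(m) * [LAT(m) / TAR] ** w.

    acc        -- ACC(m), measured model accuracy
    latency_ms -- LAT(m), measured latency on the target device
    target_ms  -- TAR, the target latency
    w          -- hyperparameter; negative, so exceeding the target is penalized
                  (w = -0.07 follows the MnasNet setting, assumed here)
    """
    return acc * (latency_ms / target_ms) ** w

# Example: 75.0% top-1 accuracy at 80 ms against a 75 ms target.
print(nas_reward(0.750, 80.0, 75.0))
```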
The reason NAS is not covered in more depth here is that it can require hundreds or even thousands of GPUs training for months; NASNet, for example, needed 800 GPUs for 28 days, which is out of reach for most people. Some recent algorithms can search with only a few GPUs, but the cost still does not meet the needs of ordinary users. If you need it, there is a more complete review of NAS on the public account; you can scan the QR code at the end of the article to follow it.
Network improvements
Starting from the network found by the search, the paper puts forward several improvements. The computationally expensive layers at the beginning and end of the network are redesigned, and a new nonlinearity, h-swish, is introduced; it is an improved version of the recent swish that is faster to compute and easier to quantize.
02 Redesigning Expensive Layers
In MobileNet_v2's inverted bottleneck structure, a 1×1 convolution is used as the final layer to expand to a higher-dimensional feature space. This layer is extremely important for providing rich features for prediction, but it also adds latency.
To reduce latency while preserving the high-dimensional features, MobileNet_v3 moves this layer behind the final average pooling layer, so the final feature set is computed at 1×1 resolution instead of 7×7. With this design choice, computing these features becomes almost free in terms of computation and latency.
Once the cost of this feature-generation layer is reduced, the projection layer of the previous bottleneck block is no longer needed to cut computation. This observation allows the projection and filtering layers of the preceding bottleneck block to be removed, further reducing computational complexity. The original and optimized last stages are shown in the figure above.
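A minimal PyTorch sketch of the rearrangement described above: the final 1×1 expansion now runs after global average pooling, i.e. on a 1×1 map. The channel sizes (160 → 960 → 1280 → 1000) follow the commonly published MobileNetV3-Large configuration and should be read as assumptions.

```python
import torch
from torch import nn

class EfficientLastStage(nn.Module):
    """Redesigned last stage: the 1x1 expansion to the final feature
    dimension is applied after global average pooling."""
    def __init__(self, in_ch=160, exp_ch=960, last_ch=1280, num_classes=1000):
        super().__init__()
        self.expand = nn.Sequential(          # still on the 7x7 map
            nn.Conv2d(in_ch, exp_ch, 1, bias=False),
            nn.BatchNorm2d(exp_ch),
            nn.Hardswish(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # 7x7 -> 1x1
        self.head = nn.Sequential(            # nearly free: runs at 1x1 resolution
            nn.Conv2d(exp_ch, last_ch, 1),
            nn.Hardswish(),
            nn.Conv2d(last_ch, num_classes, 1),
        )

    def forward(self, x):
        return self.head(self.pool(self.expand(x))).flatten(1)

# Example: feature map from the last bottleneck block for a 224x224 input.
print(EfficientLastStage()(torch.randn(1, 160, 7, 7)).shape)  # torch.Size([1, 1000])
```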
Another expensive layer is the first convolution. Current mobile models tend to use 32 filters in a full 3×3 convolution to build an initial filter bank for edge detection, and these filters are often mirror images of each other. MobileNet_v3 applies the h-swish nonlinearity to this layer; with h-swish, the number of filters can be reduced to 16 while keeping the same accuracy as 32 filters with ReLU or swish. This saves an additional 3 milliseconds and 10 million MAdds.
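The corresponding change at the input end is small; a sketch of the redesigned stem, assuming the usual stride-2 3×3 convolution:

```python
import torch
from torch import nn

# Stem as described above: 16 filters with h-swish instead of 32 with ReLU.
stem = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.Hardswish(),
)
print(stem(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 16, 112, 112])
```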
03 Nonlinearities
An earlier paper proposed the swish nonlinearity as a drop-in replacement for ReLU that can significantly improve the accuracy of neural networks. It is defined as swish(x) = x · σ(x), where σ(x) is the sigmoid function. Although swish improves accuracy, the sigmoid is expensive to compute, which makes it a poor fit for embedded and mobile devices. MobileNet_v3 therefore proposes the cheaper h-swish function, defined as h-swish(x) = x · ReLU6(x + 3) / 6.
As you can see in the picture below, the "h" stands for hard, which simply means the curve is not as smooth. In practice, the hard versions of sigmoid and swish show no significant difference in accuracy from the originals, but they are much simpler to compute.
Generally speaking, the sigmoid function is implemented approximately on different devices (roughly like approximating an integral by the area of partition rectangles in calculus; here, the sigmoid is approximated by whatever numerical implementation the platform provides), so different implementations can cause a potential loss of accuracy. ReLU6, on the other hand, can be implemented directly on almost all software and hardware, and h-swish contains no sigmoid at all. In addition, h-swish can be implemented as a piecewise function (this is an optimization of h-swish, which the paper compares against the unoptimized version), which reduces the number of memory accesses and thus greatly cuts latency.
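Since ReLU6 is available essentially everywhere, h-swish is trivial to write; a minimal sketch follows (recent PyTorch also ships it built in as nn.Hardswish):

```python
import torch
import torch.nn.functional as F

def h_swish(x):
    """h-swish(x) = x * ReLU6(x + 3) / 6 -- a piecewise, sigmoid-free swish."""
    return x * F.relu6(x + 3.0) / 6.0

def h_sigmoid(x):
    """The matching hard approximation of the sigmoid gate."""
    return F.relu6(x + 3.0) / 6.0

x = torch.linspace(-6, 6, 7)
print(h_swish(x))  # 0 for x <= -3, equal to x for x >= 3, a ramp in between
```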
The paper notes that most of the benefit of swish comes from using it in the deeper layers, which is why h-swish is only used in the middle and later stages of the MobileNet_v3 architecture. Even with these optimizations, h-swish still introduces some latency. As shown in the figure below, the piecewise-optimized h-swish clearly beats the unoptimized version, saving 6 ms (about 10%). Compared with ReLU, the optimized h-swish adds only about 1 ms of latency, but it is more accurate.
04 Large squeeze-and-excite
The paper does not introduce this module, so here is a brief introduction.
SE is a plug-and-play attention module that can be added to existing networks. The squeeze part compresses the feature map to 1×1 with AdaptiveAvgPool2d. The second part is the excitation part: Linear, ReLU, Linear (whose output channel count must match the channel count of the feature map being gated), and Sigmoid. The resulting 1×1×C channel weights are then multiplied element-wise with the feature map produced by the original network's convolution. The benefit is that some features are emphasized and others suppressed, giving the model a form of self-attention.
Back to this paper: MobileNet_v3 adds this module, but with a couple of tweaks. One is to replace the sigmoid gate with its hard (ReLU6-based) version. The other is to reduce the channel dimension inside the SE block to 1/4 of the channel count of the layer it is attached to. At first glance the channel counts then no longer match for the element-wise multiplication, but looking at a MobileNet_v3 implementation shows that, after squeezing to 1/4, the weights are expanded back with expand_as to the channel count of the following layer before the multiplication.
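Below is a minimal sketch of an SE block in this style, following common MobileNet_v3 implementations: squeeze ratio 1/4 and a hard (ReLU6-based) gate. The class name and exact layer arrangement are assumptions, not the reference code.

```python
import torch
from torch import nn

class SqueezeExcite(nn.Module):
    """SE block in the MobileNet_v3 style: squeeze to C/4, gate with a hard
    sigmoid, then reweight the input feature map channel by channel."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        squeezed = channels // reduction
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: HxW -> 1x1
        self.fc = nn.Sequential(                 # excitation
            nn.Linear(channels, squeezed),
            nn.ReLU(inplace=True),
            nn.Linear(squeezed, channels),       # back to the block's channel count
            nn.Hardsigmoid(),                    # hard gate instead of sigmoid
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w.expand_as(x)                # channel-wise reweighting

# Example: reweight a 40-channel feature map.
print(SqueezeExcite(40)(torch.randn(2, 40, 28, 28)).shape)  # torch.Size([2, 40, 28, 28])
```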
MobileNet_v3 structure:
Experimental conclusions
The achievements of MobileNet_v3 were already described at the beginning of this article, so just remember those. Beyond that, the paper mainly provides graphs of how MobileNet_v3's accuracy and latency vary with different hyperparameters on the benchmark datasets; a quick look at them is enough.
This article is from the Model Interpretation series of the public account CV Technical Guide.