TorchVision v0.9 adds a new set of mobile-friendly models that can handle tasks such as classification, object detection and semantic segmentation.
This article explores the code of these models in depth, shares noteworthy implementation details, explains how the models were configured and trained, and highlights the important trade-offs the TorchVision team made during model tuning.
The goal is to present technical details that are not documented in the original papers or in the model repositories.
The network architecture
The implementation of the MobileNetV3 architecture follows the settings of the original paper closely. It is customizable and offers different configurations for building classification, object detection and semantic segmentation backbones. Its design is similar to that of MobileNetV2, and the two share common building blocks.
Out of the box, two official variants are available: Large and Small. Both are built with the same code; the only difference is their configuration (number of blocks, their sizes, activation functions, etc.).
Configuration parameters
Although you can write custom InvertedResidual settings and pass them directly to the MobileNetV3 class, for most applications it is enough to adapt the existing configurations by passing parameters to the model builder methods. Some key configuration parameters are as follows:
The width_mult parameter is a multiplier that scales the number of channels of the model. The default value is 1. Increasing or decreasing it changes the number of filters of every convolution, including those of the first and last layers. The implementation ensures that the number of filters is always a multiple of 8, a hardware optimization trick that allows faster vectorization of operations.
The reduced_tail parameter is mainly a speed optimization: it halves the number of channels in the last blocks of the network. This version is often used by object detection and semantic segmentation models. According to the MobileNetV3 paper, using reduced_tail cuts latency by 15% without a significant negative effect on accuracy.
The dilated parameter affects the last three InvertedResidual blocks of the model. Their depthwise convolutions can be turned into atrous (dilated) convolutions, which controls the output stride of the block and improves the accuracy of semantic segmentation models.
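The channel arithmetic behind width_mult and reduced_tail can be sketched in a few lines. This is an illustration only: the helper names make_divisible and tail_channels are hypothetical, not TorchVision's API, and the rule of never rounding down by more than 10% is an assumption borrowed from common MobileNet implementations.

```python
def make_divisible(value, divisor=8, min_value=None):
    """Round `value` to the nearest multiple of `divisor` (assumed rule)."""
    if min_value is None:
        min_value = divisor
    rounded = max(min_value, int(value + divisor / 2) // divisor * divisor)
    # Assumption: never shrink the requested width by more than 10%.
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


def tail_channels(base_channels, width_mult=1.0, reduced_tail=False):
    """Channels of a tail block after applying reduced_tail and width_mult."""
    reduce_divider = 2 if reduced_tail else 1  # reduced_tail halves the width
    return make_divisible(base_channels // reduce_divider * width_mult)


if __name__ == "__main__":
    for mult in (0.5, 0.75, 1.0):
        print(mult, tail_channels(160, mult), tail_channels(160, mult, True))
```

For example, scaling 24 channels by 0.75 gives 18, which this sketch rounds up to 24 rather than down to 16, because 16 would fall more than 10% below the requested value.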
Implementation details
The MobileNetV3 class is responsible for building a network from the provided configuration, with the following implementation details:
The last convolution block expands the output of the last InvertedResidual block by a factor of 6. The implementation adapts to different width multiplier values.
Similar to the MobileNetV2 model, a Dropout layer is placed before the final Linear layer of the classifier.
The InvertedResidual class is the main building block of the network, with the following implementation details:
If the number of input channels equals the number of expanded channels, the expansion step is skipped. This happens in the first convolution block of the network.
The projection step is always applied, even when the number of expanded channels equals the number of output channels.
The activation of the depthwise block is placed before the Squeeze-and-Excite layer, which improves accuracy marginally.
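The three rules above can be read as a small decision procedure. The sketch below returns the ordered steps a block would apply for a given configuration. The function name and step labels are hypothetical, and the residual-connection rule (stride 1 with matching input and output channels) is an assumption carried over from the usual inverted-residual design, not something stated above.

```python
def inverted_residual_plan(input_channels, expanded_channels, out_channels,
                           use_se, stride=1):
    """Return the ordered list of steps an InvertedResidual block applies."""
    steps = []
    # Expansion is skipped when it would not change the channel count.
    if expanded_channels != input_channels:
        steps.append("expand_1x1_conv")
    # The depthwise convolution carries its own activation ...
    steps.append("depthwise_conv_with_activation")
    # ... so Squeeze-and-Excite, when enabled, runs after that activation.
    if use_se:
        steps.append("squeeze_excite")
    # Projection is unconditional.
    steps.append("project_1x1_conv")
    # Assumed rule: add the skip connection only when shapes are preserved.
    if stride == 1 and input_channels == out_channels:
        steps.append("residual_add")
    return steps


if __name__ == "__main__":
    print(inverted_residual_plan(16, 16, 16, use_se=False))   # no expansion
    print(inverted_residual_plan(40, 120, 40, use_se=True))   # full block
```

The channel counts in the usage lines are illustrative values rather than a statement of the official configuration.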
MobileNetV3 module architecture diagram
The benchmarks, configuration, training and quantization details of the pretrained models are explained here.
Initialize the pretrained models:
import torchvision

large = torchvision.models.mobilenet_v3_large(pretrained=True, width_mult=1.0, reduced_tail=False, dilated=False)
small = torchvision.models.mobilenet_v3_small(pretrained=True)
quantized = torchvision.models.quantization.mobilenet_v3_large(pretrained=True)
Detailed benchmark comparison between old and new models
As the comparison shows, MobileNetV3-Large is a viable replacement for ResNet50 for users who are willing to sacrifice a little accuracy for a roughly 6x speed-up.
Note that the inference times here are measured on CPU.
The training process
All pretrained models are configured as non-dilated models with a width multiplier of 1 and full tails, and were trained on ImageNet. Both the Large and Small variants were trained with the same hyperparameters and scripts.
Fast and stable model training
Configuring RMSProp correctly is critical for fast training and numerical stability. The authors of the paper used TensorFlow in their experiments and reported an rmsprop_epsilon value that is quite high compared to the default.
Normally this hyperparameter only serves to avoid division by zero, so its value is small, but for this model choosing the right value is important to avoid numerical instabilities in the loss.
Another important detail is that although the RMSProp implementations of PyTorch and TensorFlow generally behave similarly, the two frameworks handle the epsilon hyperparameter differently, and the difference matters in this setup.
Specifically, PyTorch adds epsilon outside the square root, while TensorFlow adds it inside. Users therefore need to adjust the epsilon value when porting the hyperparameters from the paper. The formula PyTorch_eps = sqrt(TF_eps) gives a reasonable approximation.
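To see why the square-root conversion is a reasonable approximation, the two denominators can be compared directly. The sketch below models only the denominator of the RMSProp update as described above. The function names are hypothetical, and the epsilon value is an illustrative placeholder rather than the value used in the paper.

```python
import math


def tf_style_denominator(mean_square, tf_eps):
    """Epsilon added inside the square root (TensorFlow-style, as described)."""
    return math.sqrt(mean_square + tf_eps)


def pytorch_style_denominator(mean_square, pytorch_eps):
    """Epsilon added outside the square root (PyTorch-style, as described)."""
    return math.sqrt(mean_square) + pytorch_eps


def port_epsilon(tf_eps):
    """Convert a TensorFlow-style epsilon with PyTorch_eps = sqrt(TF_eps)."""
    return math.sqrt(tf_eps)


if __name__ == "__main__":
    tf_eps = 1e-3  # hypothetical value, for illustration only
    pt_eps = port_epsilon(tf_eps)
    for mean_square in (0.0, 1e-4, 1e-2, 1.0):
        print(mean_square,
              tf_style_denominator(mean_square, tf_eps),
              pytorch_style_denominator(mean_square, pt_eps))
```

The two denominators coincide exactly when the running average of squared gradients is zero, which is the regime epsilon is meant to guard. Elsewhere the PyTorch-style value is slightly larger, which is why the formula is only an approximation.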
Increasing accuracy by tuning hyperparameters and improving the training recipe
With the optimizer configured for fast and stable training, the next step is to optimize the model for accuracy. Several techniques help achieve this goal.
First, to avoid overfitting, the data can be augmented with AutoAugment and RandomErasing. In addition, tuning parameters such as weight decay with cross-validation and averaging the weights of checkpoints from different epochs after training also make a significant difference. Finally, methods such as label smoothing, stochastic depth and LR noise injection can improve overall accuracy by at least 1.5%.
Key iterations to improve MobileNetV3-Large accuracy
Baseline with MobileNetV2-style hyperparameters
Note that once the target accuracy was reached, model performance was verified on the validation set. This step helps detect overfitting.
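The checkpoint averaging mentioned above can be sketched with plain Python containers. This is a minimal illustration, assuming every checkpoint maps parameter names to equally sized lists of floats. Real checkpoints hold tensors, and non-float entries such as step counters would need separate handling. The function name is hypothetical.

```python
def average_checkpoints(checkpoints):
    """Element-wise average of parameters across checkpoints with equal keys."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for key in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[key] for ckpt in checkpoints))
        averaged[key] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


if __name__ == "__main__":
    epoch_a = {"conv.weight": [1.0, 2.0], "fc.bias": [0.0]}
    epoch_b = {"conv.weight": [3.0, 4.0], "fc.bias": [1.0]}
    print(average_checkpoints([epoch_a, epoch_b]))
```

Averaging smooths out the epoch-to-epoch noise of the individual checkpoints, which is the motivation for applying it after training.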
Quantized weights are provided for the QNNPACK backend of the MobileNetV3-Large variant, giving a 2.5x speed-up. To quantize the model, Quantization Aware Training (QAT) is used.
Note that QAT makes it possible to model the effects of quantization and adjust the weights to improve model accuracy. Compared with simple post-training quantization, accuracy improves by 1.8%.
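What "modeling the effects of quantization" means can be illustrated with a quantize-dequantize round trip, the kind of operation QAT schemes insert into the forward pass so the weights learn to tolerate rounding. The sketch assumes an 8-bit unsigned affine scheme. The function name and the scale value are hypothetical.

```python
def fake_quantize(x, scale, zero_point=0, qmin=0, qmax=255):
    """Quantize x to an integer grid, clamp it, and map it back to float."""
    q = round(x / scale) + zero_point      # snap to the integer grid
    q = max(qmin, min(qmax, q))            # clamp to the representable range
    return (q - zero_point) * scale        # dequantize


if __name__ == "__main__":
    scale = 0.1  # illustrative step size
    for value in (-1.0, 0.53, 0.5, 100.0):
        print(value, fake_quantize(value, scale))
```

Values between grid points are rounded, and values outside the range saturate. Training with this error present lets the weights compensate for it.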
Object detection
This section first provides benchmarks of the released models and then discusses how the MobileNetV3-Large backbone is combined with a Feature Pyramid Network in a Faster R-CNN detector to perform object detection.
It also explains how the networks were trained and tuned, and where trade-offs had to be made (this section does not cover the details of using the backbone with SSDlite).
Initialize the pretrained models:
high_res = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)
low_res = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn(pretrained=True)
Benchmark comparison between old and new models
It can be seen that the high-resolution Faster R-CNN with a MobileNetV3-Large FPN backbone can replace the equivalent ResNet50 model for users who are willing to sacrifice a little accuracy for a roughly 5x speed-up.
Implementation details
The detector uses an FPN-style backbone that extracts features from different convolutions of the MobileNetV3 model. By default, the pretrained model uses the output of the 13th InvertedResidual block and the output of the convolution before the pooling layer. The implementation also supports using more output stages.
All feature maps extracted from the network are projected down to 256 channels by the FPN block, which greatly improves network speed. The feature maps provided by the FPN backbone are then used by the Faster R-CNN detector to produce box and class predictions at different scales.
Training and tuning processes
Currently, two pretrained models are available, which detect objects at different resolutions. Both were trained on the COCO dataset with the same hyperparameters and scripts.
The high-resolution detector was trained on images of 800-1333px, while the mobile-friendly low-resolution detector was trained on images of 320-640px.
Two separate sets of pretrained weights are provided because training a detector directly on smaller images yields an increase of 5 mAP compared to passing small images to the pretrained high-resolution model.
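The resolution ranges above can be read as a pair (min_size, max_size). The sketch below assumes the common detector convention that an image is rescaled so its shorter side reaches min_size, unless that would push the longer side beyond max_size. The function name is hypothetical, and the rounding may differ by a pixel from an actual implementation.

```python
def detector_input_size(height, width, min_size, max_size):
    """Resized (height, width) under the assumed min/max side convention."""
    short_side, long_side = min(height, width), max(height, width)
    # Use the smaller factor so that both constraints are respected.
    scale = min(min_size / short_side, max_size / long_side)
    return round(height * scale), round(width * scale)


if __name__ == "__main__":
    image = (480, 640)  # illustrative input size
    print("high-res:", detector_input_size(*image, min_size=800, max_size=1333))
    print("low-res: ", detector_input_size(*image, min_size=320, max_size=640))
```

Under this convention the low-resolution detector sees roughly six times fewer pixels for the same image, which is where most of its speed advantage comes from.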
Both backbones were initialized with ImageNet weights, and the last three stages of their weights were fine-tuned during training.
Additional speed optimizations can be made for the mobile-friendly model by tuning the RPN NMS thresholds. Sacrificing only 0.2 mAP increases the CPU speed of the model by roughly 45%. The optimization details are as follows:
Example predictions of the Faster R-CNN MobileNetV3-Large FPN model
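To make the role of the RPN NMS thresholds in the tuning discussion more concrete, here is a minimal greedy non-maximum suppression sketch in plain Python. It is a generic illustration of the technique, not the detector's actual implementation. The function names, the score threshold parameter and the sample boxes are all hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh, score_thresh=0.0):
    """Greedy NMS: keep the best box, drop overlapping lower-scored ones."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores, iou_thresh=0.5))
    print(nms(boxes, scores, iou_thresh=0.7))
```

Lowering the IoU threshold or raising the score threshold leaves fewer proposals for the later stages of the detector, which is the source of the speed and accuracy trade-off.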
Semantic segmentation
This section first provides benchmarks of the released pretrained models and then discusses how the MobileNetV3-Large backbone is combined with segmentation heads such as LR-ASPP, DeepLabV3 and FCN to perform semantic segmentation.
It also explains how the networks were trained and suggests some optional optimization techniques for speed-critical applications.
Initialize the pretrained models:
lraspp = torchvision.models.segmentation.lraspp_mobilenet_v3_large(pretrained=True)
deeplabv3 = torchvision.models.segmentation.deeplabv3_mobilenet_v3_large(pretrained=True)
Detailed baseline comparison between the old and new models
In most applications, DeepLabV3 with a MobileNetV3-Large backbone is a viable replacement for FCN with ResNet50: it runs 8.5x faster with similar accuracy. In addition, the LR-ASPP network outperforms the equivalent FCN on all metrics.
Implementation details
This section discusses the important implementation details of the tested segmentation heads. Note that all models described in this section use a dilated MobileNetV3-Large backbone.
LR-ASPP
LR-ASPP is the Lite variant of the Reduced Atrous Spatial Pyramid Pooling model proposed by the authors of the MobileNetV3 paper. Unlike other segmentation models in TorchVision, it does not use an auxiliary loss; instead it uses low-level and high-level features with output strides of 8 and 16, respectively.
Instead of the 49×49 AveragePooling layer with variable strides used in the paper, an AdaptiveAvgPool2d layer processes the global features.
This gives users a generic implementation that works across multiple datasets. Finally, a bilinear interpolation is always applied before returning the output, so that the sizes of the input and output images match exactly.
DeepLabV3 & FCN
The combination of MobileNetV3 with DeepLabV3 and FCN closely follows that of other models, and the stage estimation for these methods is the same as for LR-ASPP.
The only notable difference is that high-level and low-level features are not used; instead, the normal loss is attached to the feature map with output stride 16 and an auxiliary loss to the feature map with output stride 8.
FCN is not as fast or as accurate as LR-ASPP, so it is not covered further here. Its pretrained weights are still available and can be used with a few modifications to the code.
Training and tuning processes
Two pretrained MobileNetV3 models for semantic segmentation are provided: LR-ASPP and DeepLabV3. The backbones of these models were initialized with ImageNet weights and trained end-to-end.
Both architectures were trained on the COCO dataset using the same scripts and similar hyperparameters.
Normally, images are resized to 520 pixels during inference. An optional speed optimization is to build a low-resolution configuration of the model using the high-resolution pretrained weights and reduce the inference size to 320 pixels. This improves CPU execution time by about 60% while sacrificing a few mIoU points.
Detailed numbers after the optimization
Example predictions of the LR-ASPP MobileNetV3-Large model
This concludes the summary of the MobileNetV3 implementation details, which hopefully gives you a better understanding of the models.
MobileNetV3 paper
PyTorch Blog