Introduction

PyTorch 1.9 updated several of its libraries at the same time; among the additions are the new SSD and SSDlite models in TorchVision, with SSDlite being better suited than SSD to mobile app development.

SSD (Single Shot MultiBox Detector) is a single-stage object detection algorithm that can detect multiple objects in a single forward pass. This article focuses on SSDlite, a mobile-friendly variant of SSD.

The article is organized as follows:

First, it walks through the main components of the algorithm, highlighting the differences from the original SSD.

Then it discusses how the published model was trained.

Finally, it provides detailed benchmarks for all the newly released object detection models.

SSDlite network architecture

SSDlite is an adaptation of SSD, first introduced in the MobileNetV2 paper and later reused in the MobileNetV3 paper. Because the focus of both papers was on introducing new CNN architectures, most of SSDlite's implementation details were not covered.

Our code follows all the details provided in the two papers and, where necessary, fills in the gaps using the official implementation.

As mentioned above, SSD is a family of models that users can configure with different backbones (VGG, MobileNetV3, etc.) and different heads (regular convolutions, separable convolutions, etc.). Many SSD components therefore remain the same in SSDlite; below we discuss only the parts that differ.

Classification and regression heads

According to the MobileNetV2 paper, SSDlite replaces the regular convolutions used in the original head with separable convolutions. Our implementation therefore introduces a new head that uses 3×3 depthwise convolutions and 1×1 projections.

Since all other components of the SSD method remain unchanged, creating an SSDlite model is a matter of initializing the SSDlite head and passing it directly to the SSD constructor.
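To make this concrete, here is a minimal sketch of such a depthwise-separable prediction branch. It is illustrative only, not TorchVision's actual SSDLiteHead code, and the function name depthwise_separable_pred is our own:

import torch
from torch import nn

def depthwise_separable_pred(in_channels, out_channels):
    # Illustrative SSDlite-style prediction branch: a 3x3 depthwise
    # convolution (groups == in_channels) followed by a 1x1 projection,
    # replacing the single regular 3x3 convolution of the original SSD head.
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1,
                  groups=in_channels, bias=False),
        nn.BatchNorm2d(in_channels),
        nn.ReLU6(inplace=True),
        nn.Conv2d(in_channels, out_channels, kernel_size=1),
    )

# E.g., a classification branch for 91 COCO classes with 6 anchors per location:
cls_branch = depthwise_separable_pred(672, 6 * 91)
print(cls_branch(torch.rand(1, 672, 20, 20)).shape)  # torch.Size([1, 546, 20, 20])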

Backbone feature extractor

Our implementation introduces a new class that builds the MobileNet feature extractor. Following the MobileNetV3 paper, the backbone returns the output of the expansion layer of an Inverted Bottleneck block; this expansion layer has an output stride of 16 and sits before the stride-32 stage of the network.

In addition, all of the backbone's extra blocks are replaced with lightweight equivalents that use a 1×1 compression, a separable 3×3 convolution with stride 2, and a 1×1 expansion.

Finally, to ensure the head has sufficient predictive power even when small width multipliers are used, the minimum channel depth of all convolutions is controlled by a min_depth hyperparameter.
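The following sketch shows what such a lightweight extra block might look like. The exact layer layout and the mid-channel rule are assumptions for illustration; only the 1×1 compression / depthwise 3×3 stride-2 / 1×1 expansion pattern and the min_depth floor come from the description above:

import torch
from torch import nn

def lightweight_extra_block(in_channels, out_channels, min_depth=16):
    # Illustrative SSDlite extra block; min_depth floors the intermediate
    # width so the block keeps capacity under small width multipliers.
    mid_channels = max(out_channels // 2, min_depth)
    return nn.Sequential(
        nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),   # 1x1 compression
        nn.BatchNorm2d(mid_channels),
        nn.ReLU6(inplace=True),
        nn.Conv2d(mid_channels, mid_channels, kernel_size=3, stride=2,
                  padding=1, groups=mid_channels, bias=False),             # depthwise 3x3, stride 2
        nn.BatchNorm2d(mid_channels),
        nn.ReLU6(inplace=True),
        nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False),  # 1x1 expansion
        nn.BatchNorm2d(out_channels),
        nn.ReLU6(inplace=True),
    )

block = lightweight_extra_block(672, 512)
print(block(torch.rand(1, 672, 20, 20)).shape)  # torch.Size([1, 512, 10, 10])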

SSDlite320 MobileNetV3-Large model

This section discusses the configuration of the pre-trained SSDlite model and the training process used to replicate the paper's results as closely as possible.

The training process

The most notable details of the training process are discussed here.

1. Tuned hyperparameters

The paper does not provide the hyperparameters used for model training (such as regularization, learning rate, and batch size), so we tuned them to optimal values via cross-validation, starting from the parameters listed in the official repo's configuration files. This yielded a significant improvement over the baseline SSD configuration.

2. Data augmentation

A key difference between SSD and SSDlite is that the SSDlite backbone has only a fraction of SSD's weights. Therefore, the data augmentation in SSDlite focuses more on making the model robust to objects of varying sizes than on fighting overfitting.

SSDlite uses only a subset of the SSD transformations to avoid over-regularizing the model.
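As a rough illustration, the contrast between the two recipes might look like the sketch below. The exact transform sets are our own assumption; the transform classes shown come from torchvision.transforms.v2, which appeared in later TorchVision releases (at the time of this model, equivalents lived in the detection reference scripts):

from torchvision.transforms import v2

# Illustrative SSD-style recipe: heavy photometric and geometric augmentation.
ssd_augmentation = v2.Compose([
    v2.RandomPhotometricDistort(),
    v2.RandomZoomOut(),
    v2.RandomIoUCrop(),
    v2.RandomHorizontalFlip(p=0.5),
])

# Illustrative SSDlite-style subset: keep the size-robustness transforms,
# drop the rest to avoid over-regularizing the lower-capacity backbone.
ssdlite_augmentation = v2.Compose([
    v2.RandomIoUCrop(),
    v2.RandomHorizontalFlip(p=0.5),
])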

3. LR scheme

Since we rely on data augmentation to make the model robust to small and medium objects, we found that increasing the number of epochs is very beneficial for training. Specifically, using roughly 3× the epochs of SSD improves accuracy by 4.2 mAP points, and a 6× multiplier improves it by 4.9 mAP.

Increasing the epochs further has the opposite effect, slowing training and hurting accuracy. Still, based on the model configuration in the paper, the authors appear to have used the equivalent of a 16× multiplier.
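A minimal sketch of what scaling a schedule by an epoch multiplier could look like; the baseline epoch and milestone values below are placeholders, not the published recipe:

import torch
from torch import nn
from torch.optim.lr_scheduler import MultiStepLR

base_epochs, base_milestones = 26, [16, 22]  # placeholder baseline values
multiplier = 6  # 3x gave ~+4.2 mAP and 6x ~+4.9 mAP, per the discussion above

epochs = base_epochs * multiplier
milestones = [m * multiplier for m in base_milestones]

params = [nn.Parameter(torch.zeros(1))]  # stand-in for model.parameters()
optimizer = torch.optim.SGD(params, lr=0.015, momentum=0.9, weight_decay=4e-5)
scheduler = MultiStepLR(optimizer, milestones=milestones, gamma=0.1)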

4. Weight initialization, input scaling & ReLU6

A series of final optimizations brought our implementation very close to the official one and closed the accuracy gap. These optimizations were: training the backbone from scratch instead of initializing from ImageNet, adapting our weight initialization scheme, changing the input scaling, and replacing all standard ReLUs added to the SSDlite head with ReLU6.

Note that since we trained the model from random weights, we also applied the speed optimization described in the paper, namely using a reduced tail in the backbone.
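For instance, swapping the activations in a head can be done with a small recursive helper like the hypothetical one below (relu_to_relu6 is our own name, not a TorchVision API):

from torch import nn

def relu_to_relu6(module):
    # Recursively replace every nn.ReLU with nn.ReLU6, which clamps
    # activations to [0, 6] and behaves better for mobile/low-precision use.
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.ReLU6(inplace=True))
        else:
            relu_to_relu6(child)

head = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
relu_to_relu6(head)
print(head)  # the ReLU has been replaced by ReLU6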

5. Implementation differences

Comparing our implementation with the one in the official repo, we found a few differences.

Most of the differences are very small and relate to how weights are initialized (e.g., Gaussian vs. truncated normal) and how the LR schedule is parameterized (e.g., smaller vs. larger warmup rate, shorter vs. longer training duration).

The most significant known difference is in the way the classification loss is calculated. The official repo's implementation of SSDlite with a MobileNetV3 backbone does not use SSD's Multibox loss but RetinaNet's focal loss.

Since TorchVision already provides a complete RetinaNet implementation, we decided to implement SSDlite with the normal SSD Multibox loss instead.
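To make the distinction concrete, here is a hedged sketch of an SSD-style Multibox classification loss with 3:1 hard negative mining; the function below is a simplified illustration, not TorchVision's actual SSD loss code. The focal loss alternative is available in TorchVision as torchvision.ops.sigmoid_focal_loss:

import torch
import torch.nn.functional as F

def multibox_cls_loss(cls_logits, targets, neg_pos_ratio=3):
    # cls_logits: (N, num_anchors, num_classes); targets: (N, num_anchors),
    # where class 0 is background. Keep all positive anchors plus the
    # hardest negatives at a neg_pos_ratio : 1 ratio.
    num_classes = cls_logits.shape[-1]
    loss = F.cross_entropy(cls_logits.reshape(-1, num_classes),
                           targets.reshape(-1), reduction="none").view_as(targets)

    positive = targets > 0
    num_pos = positive.sum(dim=1, keepdim=True)

    neg_loss = loss.clone()
    neg_loss[positive] = -float("inf")  # exclude positives from negative ranking
    rank = neg_loss.argsort(dim=1, descending=True).argsort(dim=1)
    hard_negative = rank < neg_pos_ratio * num_pos

    return loss[positive | hard_negative].sum() / num_pos.sum().clamp(min=1)

# The focal loss used by the official repo down-weights easy negatives
# automatically, so it needs no hard negative mining; TorchVision exposes
# it as torchvision.ops.sigmoid_focal_loss (expects one-hot targets).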

Key accuracy improvements

Reproducing the code from a paper does not guarantee the same accuracy, especially when the full training process and implementation details are unknown. Typically, the process involves a lot of backtracking, as we need to identify the implementation details and parameters that have a significant impact on accuracy.

Below, we try to visualize the important iterations that improved accuracy over the baseline:

The sequence of optimizations described above is accurate, although in some cases a little idealized. For example, although different schedulers were tested during the hyperparameter tuning phase, none of them brought significant accuracy improvements, so we kept the MultiStepLR used in the baseline.

When we later experimented with different LR schemes, however, we found that switching to CosineAnnealingLR required less configuration and achieved better results.
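The difference in configuration burden is easy to see in a sketch (the values are placeholders, not the published recipe):

import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR, MultiStepLR

params = [nn.Parameter(torch.zeros(1))]  # stand-in for model.parameters()
optimizer = torch.optim.SGD(params, lr=0.015, momentum=0.9)

# MultiStepLR needs hand-picked milestones on top of the epoch count...
multistep = MultiStepLR(optimizer, milestones=[400, 550], gamma=0.1)

# ...while CosineAnnealingLR only needs the total number of epochs.
cosine = CosineAnnealingLR(optimizer, T_max=660)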

To sum up, even when starting from a correct implementation and a set of optimal hyperparameters from a related model, accuracy can always be improved to some extent by refining the training recipe and adjusting the implementation.

Admittedly, the above is a rather extreme case in which accuracy doubled, but in many cases there is still plenty of room for optimizations that significantly improve accuracy.

Benchmark

Initialize the two pre-trained models:

import torchvision

ssdlite = torchvision.models.detection.ssdlite320_mobilenet_v3_large(pretrained=True)
ssd = torchvision.models.detection.ssd300_vgg16(pretrained=True)
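A quick inference example with the SSDlite model follows; "image.jpg" is a placeholder path:

import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor

ssdlite.eval()
img = to_tensor(Image.open("image.jpg").convert("RGB"))
with torch.no_grad():
    predictions = ssdlite([img])  # one dict per input image
print(predictions[0]["boxes"], predictions[0]["labels"], predictions[0]["scores"])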

Benchmark comparison between old and new models:

The SSDlite320 MobileNetV3-Large model is by far the fastest and smallest of the models, making it well suited to mobile app development.

While it is not as accurate as the pre-trained low-resolution Faster R-CNN model, the SSDlite framework is highly tunable, and users can improve accuracy by introducing heavier heads with more convolutions.

On the other hand, the SSD300 VGG16 model runs rather slowly and is less accurate, mainly because of its VGG16 backbone. While the VGG architecture was highly influential, it is now somewhat outdated.

Because this particular model has historical significance and research value, it is included in TorchVision. If you need a high-resolution detector, we recommend either combining SSD with other backbones or using one of the Faster R-CNN pre-trained models.

Reference: PyTorch Blog