With the increase in on-device computing power and the maturity of model miniaturization techniques, it has become possible for high-precision deep learning models to run smoothly on mobile, embedded, and other edge devices. However, integrating deep learning into such devices still faces the challenge of balancing the accuracy of complex neural network structures against the performance constraints of the device. Model developers therefore often need a deep understanding of the model structure, careful parameter tuning, and detailed, comprehensive optimization to achieve the desired results.
Recently, PaddleDetection released a series of compact and efficient models for edge devices, covering the mainstream one-stage and two-stage network architectures and achieving strong speed and accuracy. This article summarizes the design ideas and techniques used during model iteration, for readers who are interested.
To help you get hands-on quickly, we provide a model optimization project based on MobileNetV3-YOLOv3 that applies pruning and distillation strategies. All the code runs on AI Studio, where you can also debug online:
Aistudio.baidu.com/aistudio/pr…
Object detection model structures
The on-device models released by PaddleDetection cover three mainstream model structures, which by design philosophy fall into one-stage and two-stage categories:
1. One-stage: These models have a relatively simple structure, typically attaching a detection head directly to a fully convolutional network (FCN) to output category and location information. Typical representatives are the SSD[1] and YOLO[2] series. With a short pipeline and relatively low latency, this structure is friendly to mobile applications. The SSD series in particular is widely deployed on mobile and embedded devices and can be seen in all kinds of face and gesture detection applications.
Figure 1 YOLOv3
2. Two-stage: As the name implies, detection in this kind of structure proceeds in two stages. The first stage outputs rough proposals; the second stage extracts the features corresponding to those candidate boxes, predicts the category on top of them, and refines the output location information. A representative structure is Faster R-CNN[3]. The advantage of this design is more accurate localization and better detection of small objects. However, with more components and a longer overall pipeline, the latency tends to be high, which makes on-device deployment difficult (as shown below, such models can still achieve good on-device results with our optimization methods).
Figure 2 Faster R-CNN
Optimization approach
The inference pipelines of the two model types are as follows:
- One-stage: feature extraction -> detection head -> post-processing
- Two-stage: feature extraction -> RPN (Region Proposal Network) -> RoI pooling -> detection head -> post-processing
Among these, RoI pooling and post-processing are mostly hand-designed modules whose cost depends on the deployment framework's implementation. The RPN has only two convolution layers, with output channel counts of just 4 and 1, so its runtime cost is low. Optimizing the feature extractor and the detection head is therefore the key to reducing on-device computation.
Practical tricks
Optimizing the feature extractor
The feature extractor of a general object detection model usually uses a CNN (ResNet, ResNeXt, MobileNet, ShuffleNet) pre-trained on ImageNet as the backbone network. The accuracy and speed of the backbone's pre-trained model greatly affect the final performance of the detection model. PaddleClas provides a rich choice of pre-trained models for the backbone; for example, there are several versions of the MobileNet series commonly used in on-device general-purpose detection models (see the table below for a partial comparison of accuracy and latency).
We uniformly adopted MobileNetV3 pre-trained with semi-supervised knowledge distillation, which improved each detection model's COCO mAP by 0.7%~1.5%. Taking YOLOv3 as an example, replacing MobileNetV1 with MobileNetV3 as the backbone raised COCO mAP from 29.3 to 31.4 and reduced inference latency on a Snapdragon 855 chip from 187.13 ms to 155.95 ms.
Note that because the features learned by the distilled pre-trained model are very fine-grained, the relative learning rate and L2 decay of the backbone should be appropriately reduced to prevent these features from being destroyed during training.
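As an illustration, here is a minimal sketch of how a reduced learning rate and L2 decay could be attached to backbone parameters via Paddle's ParamAttr; the 0.1x factor and the decay value are assumptions for demonstration, not the exact values used in PaddleDetection:

```python
import paddle
import paddle.nn as nn

# Hypothetical backbone layer: give its weights a lower relative learning
# rate and a smaller L2 decay so the fine-grained distilled pre-trained
# features are not destroyed during detection training.
backbone_attr = paddle.ParamAttr(
    learning_rate=0.1,                             # 0.1x the global LR (assumed factor)
    regularizer=paddle.regularizer.L2Decay(1e-5),  # reduced decay (assumed value)
)
conv = nn.Conv2D(3, 16, kernel_size=3, padding=1, weight_attr=backbone_attr)
```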
Optimizing feature fusion (FPN)
To further fuse and refine multi-level features, the Faster R-CNN model uses an FPN (Feature Pyramid Network) module [4]. The choice of feature levels fed into the FPN and their subsequent processing is critical. For a MobileNet backbone, the C2~C5 feature levels are usually used for fusion.
After fusion, five feature levels (P2~P6) are output, with resolutions ranging from 1/4 to 1/64 of the input. To improve prediction speed, we first tried dropping one level, feeding only C2~C4 to generate P2~P4. This adjustment reduced latency by 21%, while COCO mAP dropped by only 0.9%. Further analysis showed that under this feature combination, the RPN's recall for large objects was very low. We therefore added down-sampling convolutions to the FPN module to generate additional P5~P6 levels. This modification improved COCO mAP by 1.3% while increasing prediction time by only about 9%.
Figure 3 FPN adjustment
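The sketch below shows one way the extra P5/P6 levels could be generated from P4 with stride-2 convolutions (Paddle 2.x-style API; the 48-channel width follows the channel reduction mentioned later, and the layer names are hypothetical):

```python
import paddle
import paddle.nn as nn

class ExtraFPNLevels(nn.Layer):
    """Generate extra P5/P6 levels from P4 via stride-2 3x3 convolutions."""
    def __init__(self, channels=48):
        super().__init__()
        self.down_p5 = nn.Conv2D(channels, channels, 3, stride=2, padding=1)
        self.down_p6 = nn.Conv2D(channels, channels, 3, stride=2, padding=1)

    def forward(self, p4):
        p5 = self.down_p5(p4)  # 1/16 -> 1/32 resolution
        p6 = self.down_p6(p5)  # 1/32 -> 1/64 resolution
        return p5, p6

p4 = paddle.randn([1, 48, 40, 40])
p5, p6 = ExtraFPNLevels()(p4)  # shapes: [1, 48, 20, 20], [1, 48, 10, 10]
```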
Optimizing the detection head
The detection head usually consists of multiple convolution layers. If the standard detection head configuration is kept, it becomes the main bottleneck once the backbone's computation is reduced. For example, in YOLOv3-MobileNetV3, the detection head accounts for about 50% of the inference time. Reducing the detection head's computation budget is therefore a very important optimization step. In general, its convolution layers are structurally pruned, either by manual design or by a model-compression pruning strategy.
For example, for the two-stage Faster R-CNN model, we significantly reduced the number of FPN convolution channels (256->48) and the number of fully connected channels in the detection head (1024->128). For the YOLOv3 model, we pruned the detection head with model-compression methods, described in detail below.
Tuning tricks roundup
For on-device models, besides modifying the structure for better latency, it is often necessary to improve accuracy by optimizing the training process. Model training generally involves data preprocessing, loss function design, learning rate schedules, and so on. Such training tricks can substantially improve model accuracy without increasing computation.
1) AutoAugment data augmentation
Data augmentation is an effective technique for improving neural network accuracy, but most augmentation pipelines are designed by hand. AutoAugment combines operations such as image translation, rotation, and histogram equalization into a set of sub-policies; for each input image, it randomly selects one sub-policy (which may itself chain several operations) and applies it. Experiments show this strategy improves the accuracy of the two-stage on-device detection model by 0.5%.
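To make the mechanism concrete, here is a minimal sketch of the random sub-policy selection, with a hypothetical hand-picked policy set (real AutoAugment policies are searched automatically and include per-operation magnitudes and probabilities):

```python
import random
from PIL import Image, ImageOps

# Hypothetical sub-policies; each may chain several operations.
SUB_POLICIES = [
    [lambda im: im.rotate(10), ImageOps.equalize],           # rotate, then equalize
    [lambda im: im.transform(im.size, Image.AFFINE,
                             (1, 0, 10, 0, 1, 0))],          # translate by 10 px
    [ImageOps.equalize],                                     # equalize only
]

def auto_augment(img):
    # Randomly pick one sub-policy per image and apply its ops in order.
    for op in random.choice(SUB_POLICIES):
        img = op(img)
    return img
```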
It is worth noting that for small models, whose learning capacity is limited, excessive data augmentation rarely improves final accuracy and may even have the opposite effect. For example, we found that the accuracy of the two-stage on-device detection model decreased rather than increased after applying GridMask.
2) Learning rate schedule
Deep learning training is an optimization process during which model weights change continuously, at a speed determined by the learning rate. As training progresses, the learning rate itself should also be adjusted, to keep the model from getting stuck in local optima or oscillating around saddle points and optima. An appropriate learning rate schedule makes training smoother and improves final accuracy. Experiments show that a cosine learning rate schedule achieves higher accuracy than the usual piecewise three-step schedule, introduces no extra hyperparameters, and is more robust. The cosine schedule is computed as follows:
decayed_lr = learning_rate * 0.5 * (cos(epoch * math.pi / epochs) + 1)
A comparison of the cosine schedule and the three-step piecewise schedule is shown in the curves below:
As can be seen, the cosine schedule maintains a relatively large learning rate throughout training, so convergence is slower, but the final result is better. In the two-stage on-device model, the cosine learning rate schedule improved accuracy by 0.8%.
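A small sketch of the two schedules (the piecewise milestones and decay factor are assumed for illustration):

```python
import math

def cosine_lr(base_lr, epoch, total_epochs):
    # decayed_lr = base_lr * 0.5 * (cos(epoch * pi / total_epochs) + 1)
    return base_lr * 0.5 * (math.cos(epoch * math.pi / total_epochs) + 1)

def piecewise_lr(base_lr, epoch, milestones=(160, 180), gamma=0.1):
    # Drops the LR by 10x at each milestone (assumed schedule).
    return base_lr * gamma ** sum(epoch >= m for m in milestones)

for epoch in (0, 100, 170, 199):
    print(epoch, cosine_lr(0.01, epoch, 200), piecewise_lr(0.01, epoch))
```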
3) Balanced L1 loss
Libra R-CNN[5] proposed the Balanced L1 loss. The central idea is to balance the two tasks of object detection (localization and classification) and the influence of sample difficulty on training, so as to achieve better accuracy. Concretely, it improves the smooth L1 loss of the localization task by clipping the gradient of hard outlier samples to avoid unbalanced model weight updates. Its gradient is defined as:
∂Lb/∂x = α * ln(b|x| + 1), if |x| < 1; γ, otherwise
In the two-stage detection model, replacing the traditional smooth L1 loss with Balanced L1 loss in the detection head's bounding-box regression yielded a 0.4% mAP improvement.
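A minimal NumPy sketch of the Balanced L1 loss, obtained by integrating the gradient above (α = 0.5 and γ = 1.5 are the defaults reported in the Libra R-CNN paper; b is chosen so the gradient is continuous at |x| = 1):

```python
import numpy as np

def balanced_l1_loss(x, alpha=0.5, gamma=1.5):
    """x: regression residuals (pred - target). Returns per-element loss."""
    b = np.expm1(gamma / alpha)                # ensures alpha * ln(b + 1) == gamma
    ax = np.abs(x)
    inlier = alpha / b * (b * ax + 1) * np.log(b * ax + 1) - alpha * ax
    outlier = gamma * ax + gamma / b - alpha   # constant keeps the loss continuous
    return np.where(ax < 1, inlier, outlier)
```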
Model pruning
Convolution channel pruning is an effective way to reduce model size while maintaining accuracy. A pruning criterion identifies and removes convolution channels of low importance, thereby reducing computation. Compared with manual design, compression-based pruning algorithms are more general and transferable.
PaddleDetection integrates PaddleSlim, the PaddlePaddle model compression toolkit, to provide compression solutions for detection models. For MobileNetV3-YOLOv3, model acceleration is achieved through this scheme. If you are interested, you can try the YOLOv3 pruning experiments hands-on.
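As a sketch of the underlying idea, the snippet below ranks the output channels of a convolution kernel by L1 norm and drops the least important ones (a common pruning criterion; PaddleSlim's actual sensitivity-based strategy is more elaborate, and the 30% ratio here is just an example):

```python
import numpy as np

def channel_l1_scores(conv_weight):
    """conv_weight: [out_c, in_c, kh, kw]; one L1 score per output channel."""
    return np.abs(conv_weight).sum(axis=(1, 2, 3))

w = np.random.randn(96, 48, 3, 3)              # hypothetical detection-head kernel
order = np.argsort(channel_l1_scores(w))       # channels, least important first
keep = np.sort(order[int(0.3 * len(order)):])  # prune the lowest-scoring 30%
w_pruned = w[keep]                             # shape: [68, 48, 3, 3]
```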
Knowledge distillation
Knowledge distillation [6] should be familiar to deep learning enthusiasts, as the technique is widely used in CV, NLP, and other fields. In CV, its effectiveness has been verified on a large number of classification tasks, but applications in detection are relatively rare. While optimizing the YOLOv3 detection head, we used distillation to fine-tune the pruned model: the higher-accuracy YOLOv3-ResNet34 model served as the teacher to distill the YOLOv3-MobileNetV3 model. The final experiments show a 2~3 point accuracy gain on the COCO dataset. If you are interested, you can try the experiments yourself.
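For intuition, here is a minimal NumPy sketch of the classic soft-target distillation loss from [6], a temperature-softened KL divergence between teacher and student outputs; note that PaddleDetection's YOLOv3 distillation operates on detection outputs rather than plain class logits, so this only illustrates the principle:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) over temperature-softened distributions.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return (p_t * (np.log(p_t + 1e-8) - np.log(p_s + 1e-8))).sum(axis=-1).mean()
```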
Model quantization
Since on-device models are typically deployed on CPUs, which generally support int8 vectorized operations well, quantizing an FP32 model to int8 precision not only reduces storage but also speeds up common convolution and fully connected computations. The main quantization methods are post-training quantization and quantization-aware training. The former calibrates the model with a small amount of sample data to obtain the dynamic ranges of its weights and activations, and then quantizes the weights accordingly. The latter applies fake quantization to model weights during training so the model adapts to quantization noise, reducing the final accuracy loss. We used quantization-aware training for SSDLite[7] and achieved roughly 22% acceleration with a small accuracy loss (mAP -0.4%).
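A minimal sketch of the fake-quantization step used in quantization-aware training (symmetric per-tensor int8 scaling is assumed here; real toolchains such as PaddleSlim also use moving-average ranges and per-channel scales):

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Quantize-dequantize: injects int8 rounding noise while staying in FP32."""
    qmax = 2 ** (num_bits - 1) - 1   # 127 for int8
    scale = np.abs(w).max() / qmax   # symmetric per-tensor scale (assumed)
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                 # weights carrying quantization noise

w = np.random.randn(16, 3, 3, 3).astype(np.float32)
w_q = fake_quantize(w)               # used in the forward pass during training
```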
Released models
This release of PaddleDetection includes three models, each applying the techniques above according to its own design characteristics, with significant results.
- Faster R-CNN: On top of the FPN structural adjustment, the channel counts of the FPN and detection head were manually reduced, and training used AutoAugment data augmentation, the cosine learning rate schedule, and Balanced L1 loss.
- YOLOv3: Pruning shrinks the detection head for speed, and distillation training improves model accuracy.
- SSDLite: Trained with the cosine learning rate schedule, and further accelerated with quantization-aware training.
The evaluation results of the final released models are shown in the table below:
Conclusion and outlook
The above summarizes PaddleDetection's experience in on-device model development; we hope it helps readers interested in building on-device models. Beyond the models mentioned above, PaddleDetection will continue to release smaller, faster, and stronger on-device detection models. Among them, anchor-free models are currently under intensive development; stay tuned!
References
- [1] Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." arXiv:1512.02325
- [2] Redmon, Joseph, and Ali Farhadi. "YOLOv3: An Incremental Improvement." arXiv:1804.02767
- [3] Ren, Shaoqing, et al. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." arXiv:1506.01497
- [4] Lin, Tsung-Yi, et al. "Feature Pyramid Networks for Object Detection." arXiv:1612.03144
- [5] Pang, Jiangmiao, et al. "Libra R-CNN: Towards Balanced Learning for Object Detection." arXiv:1904.02701
- [6] Hinton, Geoffrey, et al. "Distilling the Knowledge in a Neural Network." arXiv:1503.02531
- [7] Sandler, Mark, et al. "MobileNetV2: Inverted Residuals and Linear Bottlenecks." arXiv:1801.04381
The code and documentation for this project are available on AI Studio, Baidu's one-stop online development platform, at the link below:
Aistudio.baidu.com/aistudio/pr…