In recent years, deep learning has achieved remarkable results in many fields and is therefore widely favored by both academia and industry. As deep learning has developed, neural network architectures have become increasingly complex. Although complex models deliver better performance, their high storage and compute requirements make them difficult to deploy effectively across hardware platforms. As a result, the on-device deployment and acceleration of deep learning models has become a focus of both academia and industry.

On the one hand, developers and researchers can design models with many different deep learning frameworks, each with its own network structure definition and model serialization format. AI engineers and researchers want their models to move freely between frameworks, but the gaps between frameworks keep the models from interoperating. On the other hand, because deep learning models contain a large number of parameters, deploying a model directly at the edge results in high latency.

The Baidu EasyEdge on-device and edge AI service platform solves these problems well. EasyEdge accepts model input from a variety of mainstream deep learning frameworks and provides convenient deployment functions. It adapts to all kinds of mainstream chips and operating systems in the industry, eliminating complex coding work, so models can easily be deployed to terminal devices. EasyEdge integrates a variety of acceleration technologies and provides multiple levels of acceleration service to balance inference latency against accuracy: on the one hand it minimizes the latency of on-device deployment, and on the other hand it covers a wider range of usage scenarios.

EasyEdge supports the deployment of many different types of deep learning models, including common model types such as image classification, detection, and segmentation, as well as face detection and pose estimation. At present, EasyEdge supports more than 60 classic networks and a variety of custom network types.

At the same time, EasyEdge supports multiple deep learning frameworks, including PaddlePaddle, PyTorch, TensorFlow, MXNet, and others. To make deployment even more convenient, EasyEdge currently supports converting between some of these frameworks' model formats, as shown in Figure 1. For example, if a user wants to deploy a PyTorch model with OpenVINO on an Intel CPU, EasyEdge can convert the PyTorch model to the OpenVINO IR format through several model transformations and finally complete the deployment on the OpenVINO framework.

Figure 1 EasyEdge supports multiple model framework transformations
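
As an illustration of this conversion chain (not the EasyEdge implementation itself), the sketch below covers the first hop: exporting a PyTorch model to ONNX with `torch.onnx.export`. The model choice and input shape are assumptions; the ONNX-to-IR step would then be handled by OpenVINO's own conversion tooling rather than this script.

```python
import torch
import torchvision

# A stand-in model; any torch.nn.Module would be exported the same way.
model = torchvision.models.resnet18()
model.eval()

# Dummy input defining the expected input shape (assumed 1x3x224x224 here).
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX, the usual intermediate format on the way to OpenVINO IR.
torch.onnx.export(
    model,
    dummy_input,
    "resnet18.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=11,
)
# The resulting resnet18.onnx can then be converted to OpenVINO IR with
# OpenVINO's model conversion tool and deployed with the OpenVINO runtime --
# the steps EasyEdge automates behind the scenes.
```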

EasyEdge also supports a wide range of terminal devices. It supports not only common general-purpose CPUs, GPUs, and ARM devices, but also mainstream specialized chips on the market, such as the Intel Movidius series and HiSilicon NNIE, as shown in Figure 2. EasyEdge has been built into the most widely adapted on-device and edge service platform in the industry.

Figure 2 EasyEdge supports multiple hardware device deployments

An analysis of the model compression techniques in EasyEdge

To deploy a variety of networks efficiently on different chips, EasyEdge provides a variety of optimizations, such as model format conversion, graph optimization, chip-specific optimization, low-precision computation, model pruning, and distillation. Among them, model compression is a crucial link. The model compression techniques used in EasyEdge include the common ones: low-precision computation, structured pruning, and model distillation. As shown in Figure 3, to better adapt to terminal devices, EasyEdge integrates a variety of model compression libraries, which can be applied flexibly according to the actual deployment situation.

Figure 3 Model compression technology in EasyEdge

Low-precision computation represents the original 32-bit floating-point data with fewer bits. On the one hand, it compresses the model size, so on-device hardware can load larger models into memory faster and I/O latency is reduced. On the other hand, processors usually have stronger fixed-point than floating-point compute capability, so a quantized model can often run inference faster. As shown in Figure 4, irregularly distributed floating-point data is quantized to a small set of fixed-point values. EasyEdge supports common low-precision types including FP16 and INT8, where INT8 quantization provides the greatest compression with near-lossless accuracy.

Fig. 4 Model quantization [1]
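
As a minimal sketch of the idea (not EasyEdge's internal implementation), the snippet below maps a floating-point tensor to INT8 with a simple symmetric, max-abs scale; the function names and tensor are illustrative.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric max-abs quantization of FP32 data to INT8."""
    scale = np.abs(x).max() / 127.0            # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation from the INT8 codes."""
    return q.astype(np.float32) * scale

x = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
print("max abs error:", np.abs(x - x_hat).max())   # bounded by roughly scale/2
```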

INT8 quantization techniques can be roughly divided into two kinds: post-training quantization and quantization-aware (in-training) quantization. Post-training quantization inserts quantization nodes into an already-trained FP32 model and uses statistical methods to approximate the original floating-point data with a small set of fixed-point values as closely as possible. Quantization-aware training inserts simulated quantization nodes before training, and during training the output of each node is computed after simulated quantization, so the model ultimately fits and converges to an optimum under quantization. This is shown in Figure 5. Compared with post-training quantization, quantization-aware training achieves better accuracy, but it takes longer.

Fig. 5 Principle of quantization-aware training [2]
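
A hedged sketch of the "simulated quantization node" idea behind quantization-aware training: the forward pass rounds values to the INT8 grid while the backward pass lets gradients flow straight through, so the model learns to converge under quantization. This is an illustrative PyTorch fragment, not EasyEdge's or PaddleSlim's actual code.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Simulated INT8 quantization with a straight-through gradient."""

    @staticmethod
    def forward(ctx, x, scale):
        # Quantize to the INT8 grid, then immediately dequantize, so
        # downstream ops see the quantization error during training.
        q = torch.clamp(torch.round(x / scale), -127, 127)
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat the rounding as identity.
        return grad_output, None

def fake_quant(x):
    scale = x.detach().abs().max() / 127.0
    return FakeQuant.apply(x, scale)

# Usage inside a model's forward pass (illustrative):
x = torch.randn(4, 8, requires_grad=True)
y = fake_quant(x).sum()
y.backward()          # gradients flow as if quantization were not there
print(x.grad.shape)   # torch.Size([4, 8])
```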

EasyEdge supports both quantization-aware training and offline (post-training) quantization, and chooses between them according to the actual situation. In deep learning, the final output of a classification model is usually the top-K of the last layer. This property means the model cares more about the ranking of its final outputs than about the values themselves, so compared with detection models based on numerical regression, classification models are more robust to quantization. Based on this, EasyEdge adjusts its quantization strategy flexibly according to the model type: for classification tasks, offline quantization is preferred to shorten publishing time, while for the series of detection models based on anchor regression, quantization-aware retraining is preferred to preserve accuracy. On the other hand, the quantization strategy EasyEdge adopts also differs across end devices and deployment frameworks. For example, when a model is deployed to a CPU with the PaddleFluid framework, the more sensitive ops would heavily affect the final accuracy after quantization, so in EasyEdge the input and output data types of these ops remain FP32 while the other ops compute in INT8. This layer-level mixed-precision quantization strategy balances inference speed and accuracy well.
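
The layer-level mixed-precision idea can be expressed as a simple configuration pass: ops known to be quantization-sensitive keep FP32 inputs and outputs, everything else runs in INT8. The sketch below is purely illustrative; the op names and dictionary format are assumptions, not EasyEdge's actual configuration schema.

```python
# Hypothetical per-op precision assignment for a layer-level mixed-precision
# plan; the op types listed as "sensitive" are assumptions for illustration.
SENSITIVE_OPS = {"softmax", "sigmoid", "prior_box"}

def build_precision_plan(ops):
    """ops: list of (op_name, op_type) tuples from a parsed graph."""
    plan = {}
    for name, op_type in ops:
        # Sensitive ops keep FP32 inputs/outputs; the rest run in INT8.
        plan[name] = "fp32" if op_type in SENSITIVE_OPS else "int8"
    return plan

graph = [("conv1", "conv2d"), ("relu1", "relu"),
         ("cls_prob", "softmax"), ("conv2", "conv2d")]
print(build_precision_plan(graph))
# {'conv1': 'int8', 'relu1': 'int8', 'cls_prob': 'fp32', 'conv2': 'int8'}
```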

During offline quantization, some outlier data points lie too far from the central distribution, which causes the traditional quantization strategy to overestimate the range and yields low final quantization accuracy, as shown in Figure 13. To handle this, EasyEdge integrates post-training calibration: over multiple iterations it searches for a more appropriate threshold, so that the KL divergence between the post-quantization INT8 data distribution and the pre-quantization FP32 distribution is minimized, thereby reducing quantization error.
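
This calibration step is essentially the entropy-calibration idea popularized by TensorRT: sweep candidate clipping thresholds over the activation histogram and keep the one whose quantized distribution has the smallest KL divergence from the FP32 distribution. The sketch below is a simplified illustration of that search, not EasyEdge's exact procedure.

```python
import numpy as np
from scipy.stats import entropy   # KL divergence between two distributions

def find_calibration_threshold(activations, num_bins=2048, num_quant_bins=128):
    """Pick a clipping threshold that minimizes KL(P_fp32 || P_int8)."""
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    best_kl, best_threshold = np.inf, edges[-1]

    # Sweep candidate thresholds from a minimum width up to the full range.
    for i in range(num_quant_bins, num_bins + 1):
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()               # fold clipped outliers into the last bin
        # Simulate INT8 by merging the i fine bins into num_quant_bins coarse bins.
        chunks = np.array_split(p, num_quant_bins)
        q = np.concatenate([np.full(len(c), c.sum() / len(c)) for c in chunks])
        kl = entropy(p / p.sum() + 1e-10, q / q.sum() + 1e-10)
        if kl < best_kl:
            best_kl, best_threshold = kl, edges[i]
    return best_threshold

# Activations with a few outliers: the chosen threshold clips the long tail.
acts = np.concatenate([np.random.randn(100000), np.random.randn(50) * 20])
print("calibration threshold:", find_calibration_threshold(acts))
```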

Model pruning usually refers to structured pruning. Structured pruning is channel-level pruning, as shown in Figure 6, designed to remove redundant compute channels.

Fig. 6 Structured model pruning [3]

For the pruning of a convolution kernel, as shown in Figure 7, when an input and an output channel of the middle kernel are pruned at the same time, the corresponding channels of its input and output tensors are reduced as well, which brings two benefits. On the one hand, shrinking the convolution kernel reduces the model size and the I/O time during inference. On the other hand, the tensors themselves are compressed, so they require less memory than before compression.

EasyEdge currently adopts this channel pruning technique. For selecting which channels to prune, it encapsulates several methods based on L1-norm, L2-norm, and FPGM[8], which can be chosen flexibly according to the actual situation. On the other hand, because pruning changes the shape of some layers, the pruned model may break the consistency of the network topology. The EasyEdge platform therefore integrates a channel adjustment method that runs a breadth-first search, corrects channel counts one by one, and skips the configuration of certain special blocks that are difficult to adjust, ensuring that the pruning result remains valid.

Fig. 7 Structured pruning of a convolution kernel [4]
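
To make this filter-level pruning concrete, the sketch below ranks the output channels of a convolution by the L1-norm of their filters and removes the weakest ones, also dropping the matching input channels of the following convolution. It is an illustrative PyTorch fragment, not EasyEdge's pruning pass.

```python
import torch
import torch.nn as nn

def prune_conv_pair(conv, next_conv, prune_ratio=0.5):
    """Remove the lowest-L1-norm output channels of `conv` and the
    corresponding input channels of `next_conv` (illustrative only)."""
    # Rank output channels of `conv` by the L1 norm of their filters.
    l1 = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    n_keep = max(1, int(conv.out_channels * (1 - prune_ratio)))
    keep = torch.argsort(l1, descending=True)[:n_keep].sort().values

    new_conv = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                         conv.stride, conv.padding, bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep].clone()

    new_next = nn.Conv2d(n_keep, next_conv.out_channels, next_conv.kernel_size,
                         next_conv.stride, next_conv.padding,
                         bias=next_conv.bias is not None)
    new_next.weight.data = next_conv.weight.data[:, keep].clone()
    if next_conv.bias is not None:
        new_next.bias.data = next_conv.bias.data.clone()
    return new_conv, new_next

conv1, conv2 = nn.Conv2d(3, 64, 3, padding=1), nn.Conv2d(64, 128, 3, padding=1)
p1, p2 = prune_conv_pair(conv1, conv2, prune_ratio=0.5)
print(p1.weight.shape, p2.weight.shape)  # 32 output / 32 input channels remain
```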

For pruning some models, EasyEdge applies channel sensitivity analysis: each layer is pruned several times and the resulting accuracy loss is measured at inference time, which reveals how sensitive each layer is to channel pruning. EasyEdge also integrates a layer-level pruning configuration strategy: by threshold filtering, the more sensitive layers are preserved as much as possible under the same compression-rate target, minimizing the impact on accuracy. For example, as shown in Figure 8, sensitivity analysis on a ResNet50 network found that the first and last layers are more sensitive to pruning, so a lower pruning rate was applied to them, while the middle layers contain more redundancy and were pruned at a higher rate.

Moreover, EasyEdge integrates some simple hyperparameter search techniques at the upper level: on the one hand it preserves the parameters of sensitive layers as much as possible, and on the other hand it finds the model that best matches the configured compression rate. For example, a 120 MB model can be pruned precisely to about 60 MB when the configured pruning rate is 50%. This enables the EasyEdge platform to provide more differentiated services at the model compression level.

Fig. 8 Sensitivity-based pruning with precise pruning-rate control [5]
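
A hedged sketch of per-layer sensitivity analysis: each layer is pruned at several ratios in turn, the pruned model is evaluated, and the accuracy drop is recorded; layers with steep drops then get lower pruning rates. The `prune_layer` and `evaluate` helpers are hypothetical placeholders for whatever pruning and evaluation routines are available.

```python
import copy

def sensitivity_analysis(model, layer_names, ratios, prune_layer, evaluate):
    """Return {layer: {ratio: accuracy_drop}} by pruning one layer at a time.

    `prune_layer(model, name, ratio)` and `evaluate(model)` are assumed,
    user-supplied callables; this loop only illustrates the procedure.
    """
    baseline = evaluate(model)
    sensitivity = {}
    for name in layer_names:
        sensitivity[name] = {}
        for ratio in ratios:
            pruned = prune_layer(copy.deepcopy(model), name, ratio)
            sensitivity[name][ratio] = baseline - evaluate(pruned)
    return sensitivity

def build_prune_plan(sensitivity, target_ratio, max_drop=0.01):
    """Give each layer the largest ratio whose accuracy drop stays acceptable,
    capped at the overall target; sensitive layers end up pruned less."""
    plan = {}
    for name, drops in sensitivity.items():
        ok = [r for r, d in sorted(drops.items()) if d <= max_drop]
        plan[name] = min(max(ok, default=0.0), target_ratio)
    return plan
```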

To accelerate some models, EasyEdge uses distillation based on Hinton[9]. The purpose of model distillation is to use the knowledge learned by a large model to guide a smaller model, so that the small model's accuracy approaches that of the large model. As shown in Figure 9, the general distillation method associates some layer outputs of the large model with corresponding outputs of the small model, in some form, within the same training session. In this way, while the small model is being trained, the knowledge learned by the large model acts on the small model's gradient backpropagation and promotes its convergence.

Fig. 9 Knowledge distillation [6]
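
The distillation objective described above is, in essence, Hinton's soft-target loss: the student is trained on a temperature-softened KL term against the teacher's logits plus the usual cross-entropy on the labels. Below is a standard PyTorch rendering of that loss, shown as a generic illustration rather than EasyEdge's exact training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Hinton-style knowledge distillation loss.

    alpha weights the soft (teacher) term against the hard (label) term;
    the T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Illustrative usage with random tensors standing in for real model outputs.
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
print(float(loss))
```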

The core of this new capability is built on the model compression framework PaddleSlim; the EasyEdge platform further wraps and optimizes these compression functions. To learn more about PaddleSlim, visit GitHub.

To show the acceleration on CPU, GPU, and ARM and summarize the compression effect, we released ultra-high-precision detection models on the three most commonly used end devices, namely CPU, GPU, and ARM. The specific device models are as follows:

CPU: Intel® Xeon® Processor E5-2630 v4

GPU: NVIDIA Tesla V100

ARM: Firefly-RK3399

As shown in Fig. 10, ACC1-ACC3 in the histogram represent different acceleration levels; the higher the level, the more channels are pruned from the model. It can be observed that EasyEdge's model compression capability brings clear speed gains on all three devices. Intuitively, the acceleration effect is best on the general-purpose CPU, where the speed more than doubles. This is also related to the acceleration methods EasyEdge adopts on different devices: the GPU itself has stronger compute capability, so the effect of reducing FLOPs is slightly weaker on GPU than on CPU and general ARM.

Figure 10 Acceleration on different end devices

Next, compare the acceleration of different model types on the same hardware. We measured the inference performance of several models of different accuracy on Jetson (Jetson 4.4-Xavier), including MobileNetV1-SSD, MobileNetV1-YOLOv3, and YOLOv3. As shown in Figure 11, ACC1-ACC3 have the same meaning as above. In general, the new model compression function achieves a speed gain of up to around 40% while sacrificing only a small amount of accuracy, a clear benefit. High-performance models, on the other hand, gain slightly less, because they already have speed-friendly characteristics such as a smaller model size and fewer FLOPs, leaving less room for further improvement.

Fig. 11 Inference latency of different detection models on Jetson

In actual use, the specific speed improvement varies with the end device and model type. The model compression capability of the EasyEdge platform will also be continuously optimized and updated in subsequent iterations.

You are now ready to try the new functionality: when publishing a model, you can choose the acceleration scheme that fits your needs, as shown in Figure 12.

Figure 12 EasyEdge provides a variety of acceleration solutions

After the model is released, the inference performance of the SDK on the device can be viewed on the evaluation page. As shown in Figure 13, the fastest acceleration scheme comes with only a small loss of accuracy and increases model speed by about 30%.

Figure 13 EasyEdge provides model evaluation

The capabilities of EasyEdge are also fully integrated into EasyDL and BML. With these two platforms, you can complete the whole process of data processing, model training, and service deployment in one stop, achieving efficient development and deployment of AI models.

Recently, PaddlePaddle Enterprise Edition launched the "2021 Gravity Plan" campaign, offering a free voucher worth 10,000 yuan that can be used directly to purchase online services of PaddlePaddle Enterprise Edition EasyDL and BML on the public cloud. At most, it can be redeemed for:

  • 6,000+ hours of custom model training
  • 590+ hours of script-based parameter tuning
  • 400+ hours of public cloud deployment quota
  • Or 50 device-side SDKs

Places are limited; claim yours now: https://ai.baidu.com/easydl/u…

[1] Fang J, Shafiee A, Abdel-Aziz H, et al. Near-lossless post-training quantization of deep neural networks via a piecewise linear approximation[J]. arXiv preprint arXiv:2002.00104, 2020.

[2] Jacob B, Kligys S, Chen B, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 2704-2713.

[3] Han S, Pool J, Tran J, et al. Learning both weights and connections for efficient neural networks[J]. arXiv preprint arXiv:1506.02626, 2015.

[4] Li H, Kadav A, Durdanovic I, et al. Pruning filters for efficient convnets[J]. arXiv preprint arXiv:1608.08710, 2016.

[5] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.

[6] Gou J, Yu B, Maybank S J, et al. Knowledge distillation: A survey[J]. International Journal of Computer Vision, 2021, 129(6): 1789-1819.

[7] Wu H, Judd P, Zhang X, et al. Integer quantization for deep learning inference: principles and empirical evaluation[J]. arXiv preprint arXiv:2004.09602, 2020.

[8] He Y, Liu P, Wang Z, et al. Filter pruning via geometric median for deep convolutional neural networks acceleration[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 4340-4349.

[9] Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015.