Abstract: Taking the popular YOLOX[8] object detection model as an example, this paper introduces the principle and workflow of quantization-aware training, discusses practical experience on how to achieve lossless accuracy, and shows that the quantized model can reach accuracy no lower than the original floating-point model while achieving 4x model compression and up to 2.3x inference acceleration.

1. Overview

Low-bit quantization of deep learning models can effectively reduce the storage, computation and communication costs of model deployment, and is a common technique for model compression and inference optimization. However, many challenges remain in the practical application of model quantization, the most common being a drop in model accuracy (in this paper, “model accuracy” refers to the effect metrics, such as accuracy, of the model on its specific task, unless otherwise specified). Taking computer vision as an example, in complex tasks such as object detection and image segmentation, the accuracy degradation caused by quantization is more pronounced.

Quantization-aware training (QAT) can better address the accuracy problem of model quantization. Taking the popular YOLOX[8] object detection model as an example, this paper introduces the principle and workflow of quantization-aware training, discusses practical experience on how to achieve lossless accuracy, and shows that the quantized model can reach accuracy no lower than the original floating-point model, with 4x model compression and up to 2.3x inference acceleration.

2. Quantization principle

In the field of digital signal processing, quantization refers to the process of approximating a continuous set of values of a signal (or a large number of possible discrete values) with a finite number of discrete values (or fewer discrete values). In deep learning specifically, model quantization refers to approximating floating-point activations or weights (usually represented as 32-bit floating-point numbers) with low-bit integers (16-bit or 8-bit), and then performing the computation in the low-bit representation. Generally speaking, model quantization compresses model parameters and reduces model storage overhead. In addition, inference speed can be improved by reducing memory access and effectively utilizing low-bit computing instructions, which is especially important when deploying models on resource-constrained devices.
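To make the mapping concrete, the snippet below is a minimal sketch of affine INT8 quantization and dequantization; it is our own illustration (the helper names quantize/dequantize and the simple scale choice are assumptions, not code from the article):

import torch

# Affine quantization: q = clamp(round(x / scale) + zero_point, qmin, qmax)
# Dequantization:      x_hat = (q - zero_point) * scale
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return q.to(torch.int8)

def dequantize(q, scale, zero_point):
    return (q.float() - zero_point) * scale

x = torch.randn(8)
scale, zero_point = x.abs().max().item() / 127.0, 0   # simple symmetric scale choice
x_hat = dequantize(quantize(x, scale, zero_point), scale, zero_point)
# x_hat approximates x; the difference is the quantization error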

Furthermore, after quantizing the input data and weights, we can transform the common operations of a neural network into quantized operations. Taking the convolution operation as an example, a typical computation flow of the quantized version is shown in Figure 1:

FIG. 1 Typical computation flow of the quantized convolution operator

  • The weights and the input are first quantized to 8-bit, the convolution is then computed with these 8-bit operands, and the intermediate results are accumulated in 32-bit;
  • The bias is quantized to 32-bit, using the product of the weight scale and the input scale as its quantization scale, and is added to the 32-bit accumulated result;
  • The 32-bit result is requantized to 8-bit by rescaling with the factor (weight scale × input scale) / output scale;
  • If the next layer is also a quantized OP, the 8-bit result can be output to it directly; if the next OP is not quantized, the result is dequantized to a floating-point value and then passed to the next layer.
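As a hedged illustration of this flow, the sketch below simulates the requantization arithmetic under the simplifying assumptions of a single per-tensor weight scale, zero weight zero-point and default convolution stride/padding; it is our own example, not the article's code:

import torch
import torch.nn.functional as F

# s_x, s_w, s_y: input, weight and output quantization scales
# zp_x, zp_y: input and output zero points
def quantized_conv2d(x_q, w_q, bias_fp, s_x, s_w, s_y, zp_x=0, zp_y=0):
    # int8 convolution with 32-bit accumulation (simulated here in float)
    acc = F.conv2d(x_q.float() - zp_x, w_q.float())
    # bias is quantized to 32-bit with scale s_x * s_w and added to the accumulator
    acc = acc + torch.round(bias_fp / (s_x * s_w)).view(1, -1, 1, 1)
    # requantize the 32-bit result to 8-bit with the rescaling factor (s_x * s_w) / s_y
    y_q = torch.clamp(torch.round(acc * (s_x * s_w) / s_y) + zp_y, -128, 127)
    return y_q.to(torch.int8)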

From the quantized computation principle above, it is easy to see that converting floating-point numbers into low-bit integers for computation inevitably introduces errors, and the quantization errors of each layer in a neural network accumulate into an error in the overall accuracy of the model. In the common Post Training Quantization (PTQ) scheme, appropriate quantization parameters (scale, zero point, etc.) are selected by collecting statistics of the numerical distribution of the variables to be quantized under typical input data, so as to minimize the information loss caused by quantization.

However, PTQ schemes often fail to achieve quantization that is lossless in accuracy. To further reduce the accuracy loss caused by quantization, we can adopt quantization-aware training: fake-quantization operations are inserted into the training computation graph, and the model weights “adapt” to the error introduced by quantization through finetuning, so as to achieve better, or even lossless, quantized model accuracy.

3. YOLOX quantization-aware training

Taking the YOLOX-S object detection model (GitHub repo[1]) as an example, we conducted quantization training experiments on the COCO2017 dataset using the public pre-trained model parameters. LSQ[2,3] was selected as the quantization training algorithm. This family of algorithms uses gradients to update the quantization scale and zero_point, and can achieve good performance without elaborate hyperparameter tuning. In order to obtain better quantized model accuracy through quantization training, we need to pay attention to the following settings:

3.1 Quantization method matching the deployment backend

Different deployment backends may adopt different quantization schemes, and the quantization scheme used in training must match the one used in deployment to avoid introducing extra error. Taking the default CPU backend of PyTorch[7] as an example, its basic quantization configuration is as follows (a sketch of a matching QConfig is given after the two lists below):

  • Weight: per-channel, INT8, symmetric quantization
  • Activation: per-tensor, UINT8, asymmetric quantization
  • Quantized objects: all Conv2d

Taking the mobile inference framework MNN as an example, the basic quantization configuration is as follows:

  • Weight: per-channel, INT8, symmetric quantization
  • Activation: per-tensor, INT8, symmetric quantization
  • Quantized objects: all Conv2d
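As referenced above, the following is a minimal sketch of how the PyTorch CPU-backend configuration (per-channel symmetric INT8 weights, per-tensor asymmetric UINT8 activations) could be expressed as a QConfig for QAT; this is our illustration of the setting, not the exact configuration used in the experiments:

import torch
from torch.ao.quantization import (QConfig, FakeQuantize,
                                   MovingAverageMinMaxObserver,
                                   MovingAveragePerChannelMinMaxObserver)
# (use torch.quantization instead of torch.ao.quantization on older PyTorch versions)

qconfig = QConfig(
    # per-tensor, UINT8, asymmetric activation quantization
    activation=FakeQuantize.with_args(
        observer=MovingAverageMinMaxObserver,
        quant_min=0, quant_max=255,
        dtype=torch.quint8, qscheme=torch.per_tensor_affine),
    # per-channel, INT8, symmetric weight quantization
    weight=FakeQuantize.with_args(
        observer=MovingAveragePerChannelMinMaxObserver,
        quant_min=-128, quant_max=127,
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric),
)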

Generally, when a quantized model is deployed on a concrete inference framework, operator fusion is also performed for convolution patterns such as Conv-BN, Conv-ReLU/ReLU6 and Conv-BN-ReLU/ReLU6. Therefore, the activation quantization should be placed after the fused operator. The SiLU activation function used in YOLOX is normally not fused, so the output activation quantization only needs to be placed at the output of BN, not at the output of SiLU. A typical quantization position of a convolution layer during training is shown in Figure 2.

FIG. 2 Quantization position diagram

At the same time, all BN layers are folded into the corresponding convolutions during QAT; the folding strategy from literature [6] is adopted in our experiments. Since the introduction of simulated (fake) quantization may make the running_mean and running_var of the BN layers unstable and prevent the quantization training from converging, we fix running_mean and running_var of BN during quantization training.
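A minimal sketch of how the BN statistics can be frozen during QAT (our illustration; the helper name freeze_bn_stats and its use of eval() are assumptions, not the authors' exact code):

import torch

def freeze_bn_stats(model: torch.nn.Module):
    for m in model.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            # eval() stops the updates of running_mean / running_var;
            # the affine weight and bias of BN still receive gradients.
            m.eval()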

In addition, a specific deployment backend implementation may require activations to be quantized into a 7-bit numeric range to prevent overflow during computation. For CPU models with the AVX512_VNNI instruction set, there is no such requirement.
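Where such a backend is targeted, the restriction can be expressed by shrinking the fake-quantization range; the snippet below is a sketch under that assumption, not a requirement of the backends discussed above:

import torch
from torch.ao.quantization import FakeQuantize, MovingAverageMinMaxObserver

# Restrict UINT8 activations to the 7-bit range [0, 127] instead of [0, 255]
act_fake_quant_7bit = FakeQuantize.with_args(
    observer=MovingAverageMinMaxObserver,
    quant_min=0, quant_max=127,
    dtype=torch.quint8, qscheme=torch.per_tensor_affine)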

3.2 Initialization of quantization parameters

In order to avoid unstable training loss or even divergence, quantization parameters obtained from post-training quantization (PTQ) are used to initialize the activation scale parameters in LSQ quantization training. The basic steps are:

  • Select dozens of typical images, run inference with the pre-trained model, and collect statistics of each activation to be quantized.
  • Use metrics such as MSE or KL divergence to compute the optimal scale for each activation.

The basic implementation can use PyTorch’s HistogramObserver to compute the activation scale & zero point, and PerChannelMinMaxObserver to compute the weight scales.
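The following sketch shows how such a calibration pass could look; it is our illustration, and calibration_images, preprocess and conv_layer are hypothetical placeholders:

import torch
from torch.ao.quantization import HistogramObserver, PerChannelMinMaxObserver

act_observer = HistogramObserver(dtype=torch.quint8, qscheme=torch.per_tensor_affine)
w_observer = PerChannelMinMaxObserver(dtype=torch.qint8, qscheme=torch.per_channel_symmetric)

with torch.no_grad():
    for img in calibration_images:            # hypothetical list of a few dozen images
        act = conv_layer(preprocess(img))     # hypothetical layer and preprocessing
        act_observer(act)                     # accumulate the activation histogram
    w_observer(conv_layer.weight)             # observe the weight tensor

act_scale, act_zero_point = act_observer.calculate_qparams()   # initializes the LSQ activation scale
w_scale, w_zero_point = w_observer.calculate_qparams()         # per-channel weight scales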

3.3 Training hyperparameters

In the QAT phase, we start from converged pre-trained model weights and use the PTQ quantization parameters for initialization. Since the model parameters are already close to convergence, we keep the overall training hyperparameters consistent with the original YOLOX training, and set the learning rate to that of the convergence stage of the original training, i.e., 5e-4.

3.4 Computation errors of specific backend operator implementations

Because of numerical precision in floating-point representation, the result of the round operation may differ between the training framework and the inference backend, for example:

import torch

torch.tensor(2.5).cuda().round()  # output: tensor(2., device='cuda:0')
torch.tensor(3.5).cuda().round()  # output: tensor(4., device='cuda:0')

The rounding behavior for 2.5 and 3.5 is not the same in PyTorch (which rounds half to even), and this behavior difference may not exist on a common deployment backend framework. The problem can lead to a significant gap between the simulated QAT accuracy and the actual accuracy when running on the backend, and therefore needs to be corrected during the quantization training phase. For example, for the MNN backend, we can do the following to avoid this difference.

def round_pass(x):
    """
    A simple implementation of rounding with a straight-through estimator (STE).
    """
    # y = torch.round(x)  # for the PyTorch backend
    y = (x + torch.sign(x) * 1e-6).round()  # for the MNN backend: add a small offset so that x.5 rounds away from zero
    y_grad = x
    return (y - y_grad).detach() + y_grad

4. Experimental results and analysis

4.1 Accuracy and acceleration effect

We performed quantization training on the YOLOX-S model as described above and verified the accuracy on the COCO2017 validation set. The results are shown in Table 1. The accuracy of the real quantized models on both backends is on par with that of the floating-point model.

Table 1 Accuracy comparison of the floating-point model, the QAT model and the real quantized model on the backend

For the speed experiments, we used the PyTorch[7] backend on an x86 test platform, with a test image resolution of 1x3x640x640. The results under different resource configurations are shown in Table 2. Under the premise of lossless accuracy, the inference speed of the quantized model is improved by up to 2.35x, and the model size is 1/4 of the original.

Table 2 Speed comparison of the floating-point model, the QAT model and the real quantized model on the backend

4.2 Impact of quantization parameter initialization

The LSQ and LSQ+ papers propose their own initialization methods for the quantization parameters. In practice, we found these initializations to be rather sensitive to the learning rate setting. On the one hand, if the initial values of the quantization parameters are far from their converged values, a larger learning rate is needed to train them. On the other hand, a large learning rate will make the model parameters jump out of their converged state and put the model into a “retraining” state, increasing uncertainty. Initializing the quantization parameters with PTQ puts the parameters in a better initial state, and finetuning with the learning rate of the converged floating-point model achieves better performance. With a fixed learning rate of 4e-5, the training results and curves of PTQ and LSQ initialization are shown in Figure 4: the LSQ-initialized training curve trends downward once mosaic data augmentation is turned on, while the PTQ-initialized curve trends upward.

FIG. 4 Training curves of different initialization methods under fixed Finetune learning rate

4.3 Influence of training hyperparameters

The learning rate setting directly affects the performance of the final QAT model. If the QAT stage uses the same learning rate as the convergence stage of the original model training (5e-4), the model accuracy will first decline gradually in the early phase of QAT, as in the early part of the red PTQ-initialized curve in Fig. 4, but the final accuracy recovers to the level of the original unquantized model. If the QAT stage instead uses a much lower learning rate, the model accuracy will gradually increase from the PTQ-initialized state, but the final accuracy does not improve by much.

One possible explanation for the above phenomenon is that, at a small learning rate, the model weights and the quantization scales barely change: the model essentially keeps approaching the local optimum near the PTQ-initialized solution, and is unlikely to jump out of that local region of the solution space. Therefore, with PTQ initialization, the learning rate can simply be set to the value used in the convergence stage of floating-point model training.

4.4 Selection of the number of training epochs

The QAT results above are from a model trained for 300 epochs; the results of training with fewer epochs are shown below:

It can be seen that the QAT results improve as the number of training epochs increases. With a small number of QAT epochs, the result is inferior to directly finetuning with a small learning rate. This gives an empirical trade-off: if computing resources are plentiful, training for longer yields better performance; if computing resources are limited, a shorter training run with a small learning rate still achieves higher performance than PTQ.

4.5 Correcting the computation errors of specific operator implementations

The behavior of a specific operator may differ slightly between the training framework and the backend framework, which can lead to a large gap between the accuracy measured during quantization training and the accuracy of the actual quantized model. Taking MNN[5] as an example, the following two types of OP in the YOLOX model lead to this phenomenon.

Effect of correcting the round operation

The results with and without the round correction are shown in Table 3. From the training point of view, the corrected round performs slightly better than the uncorrected one. Moreover, if the round behavior during training is inconsistent with that of the backend, the accuracy of the real quantized model can change significantly.

Table 3 Influence of the round correction on quantization training

Errors introduced by the fast Sigmoid implementation

To speed up exponential computation, backend frameworks often use fast approximations. This usually does not introduce much error in floating-point models. For quantized models, however, this error may be magnified by the quantization operation.

As shown in Figure 3, for the main pattern of YOLOX (Conv -> SiLU -> Conv), after the output of the previous Conv passes through the SiLU function, the error introduced by the fast approximation is amplified by the quantization operation at the input of the next convolution (division by the scale): the smaller the scale, the larger the amplified error. Taking YOLOX-S as an example, the accuracy of the quantized model with and without the approximate exponential computation is shown in Table 4.

FIG. 3 YOLOX quantization pattern

Table 4 Model accuracy error caused by the approximate exponential computation
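A toy numeric illustration of this amplification effect, using made-up values of our own (both numbers are assumptions, not measurements from the article):

# Suppose the fast exponential approximation perturbs the SiLU output by eps,
# and the next conv quantizes its input with scale s.
eps = 1e-3          # hypothetical approximation error on the activation
s = 2e-4            # hypothetical activation quantization scale of the next conv
print(eps / s)      # 5.0 -> the error spans about 5 integer quantization levels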

5. Summary

In this paper, we carried out quantization practice on the YOLOX object detection model and verified that quantization-aware training (QAT) can achieve significant model compression and inference acceleration without loss of accuracy. We analyzed the factors that cause quantization accuracy error in detail, pointed out a series of practical means to solve the accuracy problem, and verified the results experimentally, which can serve as a reference for practical applications of model quantization. Although quantization methods have been studied extensively, many challenges remain in applying quantization to complex real-world tasks; truly deploying quantization compression requires coordination between the model and the system. In addition, there are further quantization training techniques (such as mixed-precision quantization and lower-bit quantization) worth exploring and improving.

About us

The quantization training practice in this paper was completed by the Aliyun PAI model compression team and the Microsoft NNI team; we also appreciate the technical support from the MNN team. For more model compression algorithm implementations, please refer to NNI (GitHub repo [4]). For more model inference optimization solutions, please see Aliyun PAI-Blade.

References & code repository

[1] github.com/Megvii-Base…

[2] Esser S K, McKinstry J L, Bablani D, et al. Learned step size quantization[J]. arXiv preprint arXiv:1902.08530, 2019.

[3] Bhalgat Y, Lee J, Nagel M, et al. Lsq+: Improving low-bit quantization through learnable offsets and better initialization[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020: 696-697.

[4] github.com/microsoft/n…

[5] Jiang X, Wang H, Chen Y, et al. MNN: A universal and efficient inference engine[J]. arXiv preprint arXiv:2002.12418, 2020.

[6] Jacob B, Kligys S, Chen B, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 2704-2713.

[7] Paszke A, Gross S, Massa F, et al. Pytorch: An imperative style, high-performance deep learning library[J]. Advances in neural information processing systems, 2019, 32:8026-8037.

[8] Ge Z, Liu S, Wang F, et al. YOLOX: Exceeding YOLO series in 2021[J]. arXiv preprint arXiv:2107.08430, 2021.


This article is the original content of Aliyun and shall not be reproduced without permission.