Abstract: This paper introduces the main model optimization methods used in industry today, then focuses on model quantization: its basic principles, the classification of quantization methods, future directions, and interpretations of several cutting-edge papers.
This article is shared by Alan_wen from the Huawei Cloud community post "Model Quantization Review and Application".
Preface
With the continuous development of deep learning, neural networks are widely applied in different fields and achieve performance far beyond that of earlier methods. However, the parameters of deep network models are becoming larger and larger, which severely restricts their application in industry. This article therefore introduces the main model optimization methods used in industry, with an emphasis on model quantization: its basic principles, the classification of quantization methods, future directions, and interpretations of cutting-edge papers.
1. Methods of model optimization
1.1 Design efficient network architecture
Model optimization can be achieved by designing compact network structures, such as the MobileNet series, which replaces standard convolutions with simple Depthwise Convolution and Pointwise Convolution. However, hand-designed neural networks are gradually being replaced by AutoML and network architecture search, which can produce networks that are both accurate and compact. A minimal sketch of a depthwise separable block is shown below.
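As an illustration, here is a minimal PyTorch sketch of a MobileNet-style depthwise separable block; the class name and layer choices are illustrative, not the exact MobileNet definition:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A minimal MobileNet-style block: depthwise conv followed by pointwise (1x1) conv."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        # Pointwise: 1x1 conv mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))
```

Compared with a standard 3x3 convolution, this factorization greatly reduces both parameters and multiply-accumulate operations.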
1.2 Model pruning
Manually designed general-purpose network structures can achieve high accuracy, but their huge number of parameters makes them difficult to apply directly in industrial products, so pruning is generally applied. Pruning is divided into structured pruning and unstructured pruning; unstructured pruning is generally difficult to accelerate on real hardware, and model pruning as a whole is gradually being replaced by network architecture search.
1.3 Knowledge distillation
Besides pruning, knowledge distillation can also turn a large model into a small one. Knowledge distillation takes the original large model as the Teacher model and designs a small Student model; the soft targets produced by the Teacher are used to guide the training of the Student, realizing knowledge transfer from Teacher to Student.
1.4 Sparsification
Sparsification mainly makes network weights or features sparse, which can be achieved through regularization during training. Sparse network weights can then be combined with model pruning: inactive weights are pruned to compress the network structure.
1.5 Model quantization
Model quantization is one of the most effective model optimization methods in industry today. For example, FP32 -> INT8 gives 4x parameter compression and enables faster computation while reducing memory. Extreme binary quantization can theoretically achieve 32x compression, but excessive compression causes a rapid drop in model accuracy. Model quantization is described in more detail below.
2. A review of model quantization
2.1 What is quantization?
Quantization in information systems is the process of approximating the continuous value of a signal to a finite number of discrete values (it can be considered as a method of information compression).
In computer systems, quantization establishes a data mapping between fixed-point and floating-point data, so that better efficiency can be obtained at the cost of a small loss in precision. It can be understood simply as representing FP32 and similar values with "low-bit" numbers.
Here are three questions to ask before we begin with quantization:
Why does quantization work?
- Because convolutional neural networks are insensitive to noise, and quantization can be viewed as adding a large amount of noise to the original input.
Why quantize?
- Models are too large; for example, the weights of VGG19 exceed 500 MB, which puts heavy pressure on storage;
- The weight range of each layer is basically fixed and fluctuates little, which makes it suitable for quantization and compression;
- In addition, quantization reduces both storage and computation.
Why not just train low-precision models?
- Because training requires back-propagation and gradient descent, while INT8 is a discrete type. For example, learning rates are generally small decimal values that INT8 cannot represent, so the weights cannot be updated by back-propagation.
2.2 Quantization Principle
Quantization is the process of approximating the continuous values of a signal with a finite number of discrete values; it can be understood as a method of information compression. On a computer system, it is usually described as "low bit".
Model quantization establishes a data mapping between fixed-point and floating-point data, so that better efficiency can be obtained with a small loss in precision. The details are as follows:
R denotes the real floating-point value, Q the quantized fixed-point value, Z the fixed-point value corresponding to floating-point 0 (the zero point), and S the smallest scale that can be represented after quantization. The quantization from floating point to fixed point, and the corresponding de-quantization back to floating point, are:

Q = round(R / S) + Z    (1)

R = S * (Q - Z)

A minimal sketch of this mapping is shown below.
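The following NumPy sketch implements the mapping above, assuming asymmetric 8-bit quantization; the function names and the clipping-range helper are illustrative:

```python
import numpy as np

def quantize(r, scale, zero_point, num_bits=8):
    """Map float values r to fixed point: Q = round(R / S) + Z, clipped to the INT range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    q = np.round(r / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)   # stored as 8-bit unsigned integers

def dequantize(q, scale, zero_point):
    """Map fixed-point values back to float: R = S * (Q - Z)."""
    return scale * (q.astype(np.float32) - zero_point)

def qparams(alpha, beta, num_bits=8):
    """Derive scale S and zero point Z from a clipping range [alpha, beta]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (beta - alpha) / (qmax - qmin)
    zero_point = int(round(qmin - alpha / scale))
    return scale, zero_point
```

For example, `scale, zp = qparams(-1.0, 1.0)` followed by `dequantize(quantize(x, scale, zp), scale, zp)` returns an approximation of `x` with at most half a quantization step of error inside the clipping range.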
2.3 Basic concepts of quantization
Uniform and non-uniform quantization:
As shown in the figure above, quantization can be divided into uniform quantization and non-uniform quantization. Uniform quantization, on the left of the figure, is the linear quantization of Formula (1). However, the distribution of network weights or features is not necessarily uniform, and simple linear quantization may cause an obvious loss of information from the original network, so non-uniform quantization can also be used. For example, K-means can be used to cluster the network weights into cluster centers, and each cluster center then serves as the quantized representative of all weights in that cluster, as sketched below.
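A rough sketch of such codebook-based non-uniform quantization, assuming scikit-learn's KMeans is available; the 4-bit setting and function name are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights, num_bits=4):
    """Non-uniform quantization: cluster the weights and store, for each weight,
    the index of its nearest cluster center (a codebook lookup at inference time)."""
    n_clusters = 2 ** num_bits
    w = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(w)
    codebook = km.cluster_centers_.flatten()          # quantized representatives
    indices = km.labels_.astype(np.uint8)             # low-bit codes to store
    return codebook, indices.reshape(weights.shape)

# De-quantization is a table lookup: w_hat = codebook[indices]
```

Because the cluster centers adapt to the weight distribution, regions with many weights get finer resolution than uniform quantization would give them.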
Symmetric and asymmetric quantization:
In the ideal case, shown on the left of the figure above, the feature distribution is balanced around zero, so the model can be quantized symmetrically, i.e., with clipping bounds of equal absolute value on both sides of zero. In many cases, however, the weights or features are distributed unevenly and are not balanced around zero, as shown on the right; applying symmetric quantization directly would severely compress the features on one side and lose a lot of network information. To preserve as much of the original information as possible, asymmetric quantization can be used, as sketched below.
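To make the difference concrete, here is a minimal sketch of how the two clipping ranges could be chosen; the helper names are illustrative and pair with the `qparams` sketch above:

```python
import numpy as np

def symmetric_range(x):
    """Clip to [-max|x|, +max|x|]; the zero point is fixed at the middle of the grid."""
    a = float(np.max(np.abs(x)))
    return -a, a

def asymmetric_range(x):
    """Use the actual min/max, so a skewed distribution keeps its full resolution."""
    return float(np.min(x)), float(np.max(x))
```

For a tensor whose values lie mostly in, say, [0, 6] (typical after ReLU), the symmetric range wastes half of the quantization grid on negative values that never occur, while the asymmetric range uses all of it.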
Dynamic and static quantization:
Different calibration methods can be used to determine the clipping range [α, β] shown in the figure above, and the determination of the clipping range is another important distinguishing factor among quantization methods. The range of the weights can be computed statically because, in most cases, the parameters are fixed during inference. However, the activation maps differ for each input sample, so there are two ways to quantize activations: dynamic quantization and static quantization.
In dynamic quantization, this range is computed dynamically for each activation map at run time. This approach requires real-time calculation of signal statistics (min, max, percentile, etc.), which can be very expensive. However, dynamic quantization usually achieves higher accuracy because the signal range is computed exactly for each input.
The other approach is static quantization, in which the clipping range is pre-computed and kept fixed during inference. This method adds no computational overhead at inference time, but it generally results in lower accuracy than dynamic quantization. A popular way to pre-compute the range is to run a series of calibration inputs and compute typical activation ranges.
In general, dynamic quantization, which computes the clipping range dynamically for each activation, usually achieves the highest accuracy. However, computing the clipping range dynamically is computationally expensive, so industry most often uses static quantization, in which the clipping ranges of all inputs are fixed. A minimal calibration sketch is shown below.
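A minimal sketch of static calibration with a min/max observer; the class and method names are illustrative, and real toolkits often use percentile or entropy (KL) calibrators instead of raw min/max:

```python
import numpy as np

class MinMaxObserver:
    """Collect activation statistics over calibration data to fix a static clipping range."""
    def __init__(self):
        self.min_val, self.max_val = np.inf, -np.inf

    def update(self, activation):
        # Called once per calibration batch with the layer's output activations.
        self.min_val = min(self.min_val, float(activation.min()))
        self.max_val = max(self.max_val, float(activation.max()))

    def clipping_range(self):
        # A more robust calibrator would clip outliers via percentiles or KL divergence.
        return self.min_val, self.max_val

# Usage sketch: for each calibration batch, call observer.update(layer_output);
# then alpha, beta = observer.clipping_range() and scale, zp = qparams(alpha, beta).
```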
Different quantization granularity:
In computer vision tasks, the activation input of each layer is convolved with many different convolution filters, as shown in the figure above, and each convolution filter can have a different range of values. One difference among quantization methods is therefore the granularity at which the clipping range [α, β] is computed for the weights. They can be classified into layer-wise quantization, group-wise quantization, and channel-wise quantization; the difference between the layer-wise and channel-wise scales is sketched after this list.
a) Layer-wise quantization: the clipping range is determined by considering all the weights of the convolution filters in a layer, as shown in the third column of the figure above. Statistics over all parameters in the layer (such as minimum, maximum, percentile, etc.) are computed, and the same clipping range is then used for all the convolution filters of the layer. Although this method is very simple to implement, it usually leads to suboptimal results, because the ranges of individual convolution filters may vary greatly, causing kernels with relatively narrow parameter ranges to lose quantization resolution.
b) Group-wise quantization: multiple channels within a layer can be grouped to compute the clipping range (for activations or convolution kernels). This is helpful when the distribution of parameters varies greatly within a single convolution/activation, and group-wise quantization offers a good compromise between quantization resolution and computational overhead.
c) Channel-wise quantization: a common choice is to use a separate clipping range for each convolution filter, independent of the other channels, as shown in the last column of the figure above. That is, each channel is assigned a dedicated scaling factor. This ensures better quantization resolution and generally results in higher accuracy. Channel-wise quantization is currently the standard method for quantizing convolution kernels.
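A small sketch of how per-layer versus per-channel symmetric scales could be computed for a convolution weight tensor; the function name and assumed tensor layout are illustrative:

```python
import numpy as np

def conv_weight_scales(W, granularity="per_channel", num_bits=8):
    """Symmetric scales for a conv weight tensor W of shape (out_ch, in_ch, kH, kW)."""
    qmax = 2 ** (num_bits - 1) - 1
    if granularity == "per_layer":
        # One scale shared by every filter in the layer.
        return np.max(np.abs(W)) / qmax
    # Per-channel: one scale per convolution filter (output channel).
    return np.max(np.abs(W).reshape(W.shape[0], -1), axis=1) / qmax
```

A filter whose weights span only a small range keeps its resolution under the per-channel scheme, whereas under the per-layer scheme it is quantized with the scale dictated by the widest filter.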
Stochastic quantization
During inference, the quantization scheme is usually deterministic. However, this is not the only possibility, and some work has explored stochastic quantization for quantization-aware training as well as reduced-precision training. The high-level intuition is that stochastic quantization may allow the NN to explore more than deterministic quantization: small weight updates may not produce any change in the weights, since deterministic rounding may always return the same value, whereas stochastic rounding gives the NN a chance to update its parameters. Concretely, stochastic rounding in INT quantization rounds x down with probability ⌈x⌉ − x and up with probability x − ⌊x⌋, while in binary quantization the value is set to +1 with a probability given by a (hard) sigmoid of x and to −1 otherwise, as sketched below.
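A small NumPy sketch of both stochastic rounding rules; the function names are illustrative, and the binary case follows the BinaryConnect-style hard sigmoid:

```python
import numpy as np

def stochastic_round(x):
    """Round x down with probability ceil(x) - x, up with probability x - floor(x)."""
    floor = np.floor(x)
    prob_up = x - floor
    return floor + (np.random.rand(*x.shape) < prob_up)

def stochastic_binarize(x):
    """Binarize to {-1, +1}, taking +1 with probability p = clip((x + 1) / 2, 0, 1)."""
    p = np.clip((x + 1.0) / 2.0, 0.0, 1.0)   # "hard sigmoid" of x
    return np.where(np.random.rand(*x.shape) < p, 1.0, -1.0)
```

Averaged over many draws, the expected quantized value equals the original value, which is what lets small gradient updates accumulate instead of being rounded away.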
Fine-tuning method
After quantization, the parameters of the neural network (NN) usually need to be adjusted. This can be done by retraining the model, a process called Quantization-Aware Training (QAT), or without retraining, which is usually called Post-Training Quantization (PTQ). The figure above shows a schematic comparison of the two methods (QAT on the left, PTQ on the right); they are discussed further below.
- Quantization-Aware Training (fake quantization)
Given a trained model, quantization may perturb the trained parameters and push the model away from the point it converged to when trained with floating-point precision. This can be addressed by retraining the NN model with quantized parameters so that it converges to a point with lower loss. A popular approach is Quantization-Aware Training (QAT), in which the usual forward and backward passes are performed on the quantized model in floating point, but the model parameters are quantized after each gradient update. In particular, it is important to perform this projection after the weight update has been carried out in floating-point precision. Performing the backward pass in floating point matters because accumulating gradients in quantized precision can lead to zero gradients or gradients with large errors, especially at low precision. An important subtlety in back-propagation is how to handle the non-differentiable quantization operator of Formula (1): without any approximation, its gradient is zero almost everywhere, because rounding is a piecewise flat operator. A popular solution is to approximate the gradient of this operator with the Straight-Through Estimator (STE), which essentially ignores the rounding operation and approximates it with an identity function, as shown in the figure below and sketched in code after it.
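A minimal PyTorch sketch of fake quantization with an STE backward pass; the class name and quantization parameters are illustrative, not the API of any specific framework:

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Fake-quantize in the forward pass; pass gradients straight through in the backward pass."""
    @staticmethod
    def forward(ctx, x, scale, zero_point, qmin=0, qmax=255):
        q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
        return scale * (q - zero_point)          # de-quantize so downstream ops stay in float

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat round() as the identity, so the gradient flows through unchanged.
        return grad_output, None, None, None, None

# Usage sketch inside a training loop:
# w_q = FakeQuantSTE.apply(w, scale, zero_point)   # use w_q in the forward pass
# loss.backward()                                   # gradients update the float master weights w
```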
Although STE is a coarse approximation, QAT has been shown to work well. However, the main disadvantage of QAT is the computational cost of retraining the NN model. Retraining may require hundreds of epochs to recover accuracy, especially for low-precision quantization. If a quantized model is to be deployed for a long period and efficiency and accuracy are particularly important, this investment in retraining is likely to be worthwhile. However, this is not always the case, as some models have relatively short lifespans.
- Post-training quantization
The alternative to the expensive QAT method is Post-Training Quantization (PTQ), which performs quantization and weight adjustment without any fine-tuning. The overhead of PTQ is therefore very low and often negligible. Unlike QAT, which requires a sufficient amount of training data for retraining, PTQ has the additional advantage that it can be applied in situations where data is limited or unlabeled. However, this generally comes at the cost of lower accuracy than QAT, especially for low-precision quantization.
- Zero Shot (data-free)
As discussed so far, achieving minimal accuracy loss after quantization requires access to all or part of the training data. First, we need to know the range of the activations so that we can clip the values and determine the appropriate scaling factors (usually referred to in the literature as calibration). Second, quantized models usually require fine-tuning to adjust the model parameters and recover the lost accuracy. In many cases, however, the original training data cannot be accessed during quantization: the training set is either too large to distribute, proprietary (such as Google's JFT-300M), or sensitive due to security or privacy concerns (such as medical data). Several approaches have been proposed to address this challenge, known as Zero-Shot Quantization (ZSQ). Following Qualcomm's work [2], two levels of zero-shot quantization can be distinguished: Level 1: no data and no fine-tuning (ZSQ + PTQ). Level 2: no data, but fine-tuning is required (ZSQ + QAT). Level 1 allows faster and easier quantization without any fine-tuning, which is often time-consuming and usually requires additional hyperparameter search; it can be achieved without fine-tuning by weight equalization or by using BatchNorm statistics to correct the parameters. Level 2 usually yields higher accuracy, since fine-tuning helps the quantized model recover the accuracy drop, especially in ultra-low-precision settings. The input data for Level 2 fine-tuning is mainly generated by a GAN, which can produce approximately distributed data from the pre-quantization model without access to external data.
Zero-shot (aka data-free) quantization performs the entire quantization without access to training/validation data. This is especially important for providers who want to speed up the deployment of customer workloads without accessing their data sets, and in cases where security or privacy concerns may limit access to the training data.
2.4 Advanced quantization concepts
- FP32, fake quantization, and fixed-point quantization
There are two common approaches to deploying quantized NN models: simulated quantization (aka fake quantization) and integer-only quantization (aka fixed-point quantization). In simulated quantization, the quantized model parameters are stored in low precision, but the operations (such as matrix multiplication and convolution) are performed with floating-point arithmetic, so the quantized parameters need to be de-quantized before the floating-point operations, as shown in the middle of the figure above. As a result, simulated quantization cannot fully benefit from fast and efficient low-precision logic. In integer-only quantization, by contrast, all operations are performed with low-precision integer arithmetic, as shown on the right of the figure above. This allows the entire inference to be carried out with efficient integer arithmetic, without any floating-point de-quantization of parameters or activations.
In general, performing inference in full precision with floating-point arithmetic may help the final quantized accuracy, but at the cost of not benefiting from low-precision logic, which has multiple advantages over full-precision logic in latency, power consumption, and chip-area efficiency. Integer-only quantization and binary quantization are therefore preferable to simulated/fake quantization, because integer-only arithmetic uses lower-precision logic whereas fake quantization performs its operations with floating-point logic. This does not mean, however, that fake quantization is never useful: it is beneficial for problems that are bandwidth-bound rather than compute-bound, such as recommendation systems, where the bottleneck is the memory footprint and the cost of loading parameters from memory. In such cases fake quantization is acceptable.
- Mixed-precision quantization
It is easy to see that hardware performance improves as we use lower-precision quantization. However, uniformly quantizing a model to ultra-low precision may cause a significant drop in accuracy. This problem can be solved with mixed-precision quantization, in which each layer is quantized with a different bit precision, as shown above. One challenge with this approach is that the search space for the bit settings is exponential in the number of layers, and different approaches have been proposed to handle this huge search space. a) Choosing the mixed precision for each layer is essentially a search problem, and many different search methods have been proposed.
b) Another class of mixed-precision methods trains the mixed-precision model with periodic-function regularization, automatically distinguishing the different layers and their varying importance with respect to accuracy, and learning their respective bit widths.
c) HAWQ introduces an automatic way of finding the mixed-precision settings based on the second-order sensitivity of the model. Mixed-precision quantization has proved to be an effective and hardware-efficient method for low-precision quantization of different neural network models. In this approach, the layers of the NN are grouped into quantization-sensitive and quantization-insensitive ones, and high or low bit widths are used for each group accordingly. As a result, one can minimize the accuracy degradation while still benefiting from the reduced memory footprint and the faster acceleration of lower-precision quantization.
- Hardware aware quantization
One of the goals of quantization is to reduce inference latency. However, not all hardware provides the same speedup after a layer/operation is quantized. In reality, the benefit of quantization depends on the hardware, with many factors affecting the achieved speed, such as on-chip memory, bandwidth, and the cache hierarchy.
Taking this into account through hardware-aware quantization is important for obtaining the best benefit. Hardware latency therefore needs to be simulated: when deploying quantized operations on hardware, the actual deployment latency of each layer at different quantization bit precisions needs to be measured.
- Distillation-assisted quantization
An interesting line of work in quantization combines model distillation to improve quantization accuracy. Model distillation uses a large, highly accurate model as the teacher to help train a compact student model. During student training, instead of using only the ground-truth class labels, model distillation uses the soft probabilities produced by the teacher, which may contain more information about the input. In other words, the total loss function includes both the student loss and the distillation loss, as sketched below.
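A minimal PyTorch sketch of such a combined loss; the temperature T and weight alpha are illustrative hyperparameters:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Total loss = student cross-entropy + KL between softened teacher/student distributions."""
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                         F.softmax(teacher_logits / T, dim=1),
                         reduction="batchmean") * (T * T)   # T^2 keeps gradient magnitudes comparable
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```

In distillation-assisted quantization, the student here would be the quantized model and the teacher the full-precision model.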
- Extreme quantization
Binarization is the most extreme quantization method: the quantized values are limited to a 1-bit representation, which gives a dramatic 32x reduction in memory requirements. Besides the memory advantage, binary (1-bit) and ternary (2-bit) operations can often be computed efficiently with bitwise arithmetic and can achieve significant acceleration over higher precisions such as FP32 and INT8. However, naive binarization leads to a significant drop in accuracy, so a great deal of work has proposed different solutions to this problem, which fall into the following three categories (a minimal binarization sketch follows the list):
- Quantization error minimization (using combinations of multiple binary matrices to approximate the full-precision weights, or using wider networks)
- Improved loss function (e.g., adding distillation loss)
- Improved training methods (e.g., replacing sign with a smoother function in the backward pass)
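Here is a minimal PyTorch sketch of XNOR-Net-style binarization with a clipped STE backward pass; this illustrates the general idea rather than any single paper's exact method:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        alpha = w.abs().mean()                 # per-tensor scaling factor (XNOR-Net style)
        return alpha * torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Clipped STE: block gradients where |w| > 1, where sign() is saturated anyway.
        return grad_output * (w.abs() <= 1).float()

# Usage sketch: w_bin = BinarizeSTE.apply(w) in the forward pass of a binary layer.
```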
Extremely low-precision quantization is a promising research direction. However, existing methods generally lose accuracy compared with the baseline unless very extensive tuning and hyperparameter search are performed. This loss of accuracy may be acceptable for less critical applications.
2.5 Future directions of quantization
Here we briefly discuss several challenges and opportunities for future quantization research, including quantization tools, joint design of hardware and NN architectures, combining multiple compression methods, and quantized training.
- Quantization tools: with current methods, it is straightforward to quantize different NN models to INT8 and deploy them without a significant loss of accuracy. Several packages are available for deploying INT8 quantized models (for example, Nvidia's TensorRT and TVM), each with good documentation. However, software for lower-precision quantization is not widely available and sometimes does not exist; for example, Nvidia's TensorRT does not currently support quantization below INT8, and INT4 support was only recently added to TVM. Low-precision and mixed-precision quantization with INT4/INT8 is effective and necessary in practice, so developing efficient software APIs for low-precision quantization will have an important impact.
- Co-design of hardware and NN architecture: as mentioned above, an important difference between classical work on low-precision quantization and more recent work in machine learning is that neural network parameters can take very different quantized values and still generalize well. For example, with quantization-aware training we may converge to a solution far from the original single-precision solution and still obtain good accuracy. One can take advantage of this degree of freedom and adjust the NN architecture while it is being quantized; for example, changing the width of the NN architecture can reduce or eliminate the generalization gap after quantization. One direction for future work is to jointly tune other architectural parameters, such as depth or individual kernels, while the model is quantized. Another is to extend this co-design to the hardware architecture.
- Combining multiple compression methods: as mentioned above, quantization is only one way to deploy NNs efficiently. Other methods include efficient NN architecture design, co-design of hardware and NN architecture, pruning, knowledge distillation, and network architecture search. Quantization can be combined with these other methods, but very little work has explored their best combinations. For example, NAS and quantization can be searched jointly, pruning and quantization can be applied to a model together to reduce its overhead, and it is important to understand the best combination of structured/unstructured pruning with quantization. Likewise, another future direction is to study how these methods combine with the others mentioned above.
- Quantized training: perhaps the most important use of quantization is to speed up NN training with half precision, which allows training with faster and more energy-efficient low-precision logic. However, pushing this further to INT8 training is very difficult. Although there is interesting work in this area, the proposed methods usually require extensive hyperparameter tuning, or they only work for NN models on relatively easy learning tasks. The fundamental problem is that at INT8 precision, training can become unstable and diverge. Addressing this challenge could have a large impact on many applications, especially training at the edge.
3. Interpretation of quantization papers
3.1 Data-free quantization
Data-Free Quantization Through Weight Equalization and Bias Correction[2]. (ICCV2019)
This paper from Qualcomm proposes a data-free post-training quantization method. Its main innovations are weight equalization and bias correction.
As shown on the left of the figure below, the value ranges of the convolution kernels in the same layer are extremely unbalanced, while per-channel quantization is relatively complicated and costs extra time; weight equalization is therefore proposed. After equalization, shown on the right of the figure below, the overall value ranges are much more balanced.
Weight equalization mainly exploits the scale equivariance of piecewise-linear activations: for ReLU, f(s · x) = s · f(x) for any s > 0, so two consecutive layers satisfy W(2) f(W(1) x) = W(2) S f(S⁻¹ W(1) x), where S = diag(s₁, ..., sₙ).
For convolutional and fully connected layers, a per-channel scaling of the weights of one layer can thus be compensated by the inverse scaling in the next layer, which allows the channel ranges of the two layers to be equalized, as sketched below.
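A rough NumPy sketch of cross-layer equalization for two fully connected layers with a ReLU in between, following the idea of [2]; the per-channel scaling rule here is a simplified version, and the paper handles convolutions and the full derivation in more detail:

```python
import numpy as np

def equalize_pair(W1, b1, W2):
    """Cross-layer weight equalization for y = W2 @ relu(W1 @ x + b1).
    W1 has shape (out1, in1), W2 has shape (out2, out1)."""
    r1 = np.max(np.abs(W1), axis=1)     # per-output-channel range of layer 1
    r2 = np.max(np.abs(W2), axis=0)     # per-input-channel range of layer 2
    s = np.sqrt(r1 * r2) / r2           # scaling that equalizes both ranges to sqrt(r1 * r2)
    W1_eq = W1 / s[:, None]             # scale rows of W1 (and the bias) by 1/s
    b1_eq = b1 / s
    W2_eq = W2 * s[None, :]             # compensate by scaling the columns of W2 by s
    return W1_eq, b1_eq, W2_eq
```

Because ReLU commutes with positive per-channel scaling, the equalized network computes exactly the same function as the original one, but its per-channel ranges are far easier to quantize with a single per-layer scale.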
Bias correction is used because the data acquires an overall offset during model quantization; computing this offset error and then cancelling it yields more accurate quantization. As shown in the table below, the two methods proposed in the paper achieve results close to per-channel quantization while using only per-layer quantization.
Post training 4-bit quantization of convolutional networks for rapid-deployment[3]. (NIPS2019)
This paper achieves post-training quantization at 4-bit precision and proposes an analytical optimal clipping range (ACIQ) together with per-channel bit-width settings. ACIQ takes minimizing the quantization error as the optimization objective and derives the optimal clipping value for different quantization precisions; it also proposes, under a fixed total bit budget, assigning different quantization precisions to different channels, which gives more suitable and accurate quantization, although assigning different bit widths to different channels is difficult to accelerate in industrial hardware. The table below shows the experimental results: under 4-bit quantization, the model accuracy drops by about 2%.
Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks[4].(ICCV2019)
Low-bit quantization, such as binary quantization, can effectively speed up inference while reducing the memory consumption of deep neural networks, which is crucial for deploying models on resource-limited devices such as embedded devices. However, because of the discreteness of low-bit quantization, existing methods often suffer from an unstable training process and severe performance degradation. To solve this problem, this paper proposes Differentiable Soft Quantization (DSQ) to bridge the gap between full-precision networks and low-bit networks. DSQ evolves automatically during training and gradually approaches standard quantization, as shown in the figure below. Thanks to its differentiability, DSQ can track accurate gradients in the backward pass and reduce the quantization loss in the forward pass within an appropriate clipping range.
DSQ uses a series of hyperbolic tangent functions to gradually approximate the step function used in low-bit quantization (such as the sign function in the 1-bit case), while remaining smooth and easy to differentiate. The paper introduces a characteristic variable that controls how closely the DSQ function approximates standard quantization and develops an evolutionary training strategy to gradually learn the differentiable quantization function. During training, the gap between DSQ and standard quantization is controlled by this characteristic variable, and both the characteristic variable and the clipping values can be determined automatically by the network. DSQ reduces the bias caused by extremely low-bit quantization, making the forward and backward passes during training more consistent and stable. The core idea is sketched below.
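A rough sketch of the underlying idea for the 1-bit case, using tanh as a soft, temperature-controlled surrogate for sign; this is a simplified illustration, not the exact DSQ formulation, which applies a piecewise tanh within each quantization interval:

```python
import torch

def soft_sign(x, k=1.0):
    """Differentiable surrogate for sign(x): tanh(k * x) approaches sign(x) as k grows."""
    return torch.tanh(k * x)

# During training, the temperature k is gradually increased so that the soft quantizer
# evolves toward the hard sign() function while the gradients remain informative.
```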
As the following table shows, with low-bit quantization this method has essentially no accuracy drop compared with FP32.
Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search[5].(ICLR2019)
As shown in the figure above, the paper proposes Differentiable NAS (DNAS), a NAS approach that randomly activates candidate operations with a probability θ that is optimized during the search, rather than weighting them directly with a Softmax. Because DNAS activates candidates with roughly equal probability early in training, while the probabilities θ become clearly separated late in training, DNAS resembles DARTS early in training and ENAS late in training.
To achieve mixed-precision quantization, the paper treats quantization at different bit widths as different candidate structures on top of DNAS (as shown in the figure below), and obtains different quantization precisions for different layers through search optimization.
The mixed-precision quantization obtained by the differentiable search is shown in the following table: on the CIFAR-10 data set it achieves a speedup of more than 10x, with quantized accuracy slightly higher than FP32.
APQ: Joint Search for Network Architecture, Pruning and Quantization Policy[6]. (CVPR2020)
This paper proposes APQ for efficient deep learning inference on resource-constrained hardware. Unlike previous approaches that search the neural architecture, pruning policy, and quantization policy separately, this paper optimizes them jointly. To cope with the larger design space this brings, the authors propose training a quantization-aware accuracy predictor to quickly estimate the accuracy of a quantized model and feed it back to the search engine to select the best candidate.
However, training such a quantization-aware accuracy predictor requires collecting a large number of (quantized model, accuracy) pairs, which involves quantization-aware fine-tuning and is therefore very time-consuming. To address this challenge, the paper proposes transferring knowledge from a full-precision (FP32) accuracy predictor to the quantization-aware (INT8) accuracy predictor, which greatly improves sample efficiency. Moreover, collecting the data set for the FP32 accuracy predictor is cheap, since candidate networks can simply be sampled from a pre-trained once-for-all [7] network and evaluated without any training cost. The following figure shows the process of transferring from the full-precision accuracy predictor to the quantization-aware accuracy predictor.
The following table shows the results on the ImageNet dataset:
References:
[1] A Survey of Quantization Methods for Efficient Neural Network Inference.
[2] Data-Free Quantization Through Weight Equalization and Bias Correction, ICCV 2019.
[3] Post training 4-bit quantization of convolutional networks for rapid-deployment, NeurIPS 2019.
[4] Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks, ICCV 2019.
[5] Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search, ICLR 2019.
[6] APQ: Joint Search for Network Architecture, Pruning and Quantization Policy, CVPR 2020.
[7] Once for All: Train One Network and Specialize It for Efficient Deployment, ICLR 2020.
[8] Binary Neural Networks: A Survey, Pattern Recognition, 2020.
[9] Forward and Backward Information Retention for Accurate Binary Neural Networks, CVPR 2020.