
Please follow my official account [Jizhi Vision] for more shared notes.

Hi, I'm Jizhi Vision. This article mainly discusses quantization strategies for deep learning models.

Model miniaturization is a key technology for algorithm deployment, and it is usually achieved through model quantization. Quantization is a mapping from high-bit to low-bit representations, and the objects being quantized can be weights or activation values. Quantization comes in many forms, such as mixed-precision quantization or full integer quantization; and whether it is applied per layer, per group, or to the whole network, there is always a mapping from floating-point numbers to integers. This mapping inevitably loses accuracy, and the goal is to keep that loss within an acceptable range.

Quantization can be divided into post-training quantization and quantization-aware training (quantization during training). Compared with training-time quantization, post-training quantization is a more efficient and non-intrusive acceleration technique. Here we mainly discuss post-training quantization, covering the main quantization types and strategies:

  • Quantization types: asymmetric vs. symmetric quantization, nonlinear vs. linear quantization;
  • Quantization strategies: MinMax, KLD, ADMM, EQ.

1. Quantization types

1.1 Asymmetric quantization and symmetric quantization

Asymmetric quantization is the more general concept of the two; symmetric quantization is a special case of asymmetric quantization.

Asymmetric quantization is a linear mapping with an offset, and it can be written as:

Q = E(R / S) + Z,    R = S * (Q - Z)

Here Z is not necessarily 0: the zero of the floating-point range does not have to map to the integer zero. E() denotes the rounding function that truncates the value to the representation of the target integer type. Asymmetric quantization satisfies the following constraint on the floating-point values:

T1 <= R <= T2

In this constraint, [T1, T2] is the truncation range of the floating-point values, i.e. the thresholds of asymmetric quantization. The relationship between the thresholds and the linear-mapping parameters S and Z follows directly:

S = (T2 - T1) / (Qmax - Qmin),    Z = Qmin - E(T1 / S)

where [Qmin, Qmax] is the target integer range. Once the thresholds are determined, the parameters of the linear mapping are determined.

Symmetric quantization can be regarded as the special case of asymmetric quantization with Z = 0, which requires T1 = -T2. The relationship between the threshold and the linear-mapping parameters then reduces to:

S = T2 / Qmax  (for int8, S = |T| / 127),    Z = 0
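As a concrete illustration, here is a minimal NumPy sketch of both mappings under the conventions above (using int8 with [Qmin, Qmax] = [-128, 127] for the asymmetric case is an assumption; the symmetric case uses [-127, 127] as in the rest of this post):

```python
import numpy as np

def asymmetric_params(t1, t2, qmin=-128, qmax=127):
    """Derive scale S and zero point Z from the float thresholds [T1, T2]."""
    s = (t2 - t1) / (qmax - qmin)           # one integer step covers S float units
    z = int(round(qmin - t1 / s))           # so that T1 maps to qmin
    return s, z

def symmetric_params(t, qmax=127):
    """Symmetric special case: Z = 0 and the threshold |T| maps to qmax."""
    return t / qmax, 0

def quantize(r, s, z, qmin=-128, qmax=127):
    """Q = clip(E(R / S) + Z); dequantize with R ~= S * (Q - Z)."""
    return np.clip(np.round(r / s) + z, qmin, qmax).astype(np.int8)

# example: asymmetric quantization with thresholds T1 = -0.3, T2 = 1.2
r = np.random.uniform(-0.5, 1.5, size=8).astype(np.float32)
s, z = asymmetric_params(-0.3, 1.2)
q = quantize(r, s, z)
r_hat = s * (q.astype(np.float32) - z)
```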

1.2 Nonlinear quantization and linear quantization

Linear quantization is often called uniform quantization and is the most common choice in current implementations; the asymmetric and symmetric quantization discussed above are both linear. Nonlinear quantization is used much less in practice, with LOG (logarithmic) quantization as a representative example.

Different quantization methods suit different data distributions. Uniform quantization implicitly assumes the data is spread evenly over the representable range, and under that assumption linear quantization works well. LOG quantization instead keeps the relative error roughly constant across the numeric range, which is the goal of most nonlinear quantization schemes today: by analysing the data distribution, the representation capacity of high-density regions can be improved.

For example, suppose we want to apply int8 symmetric quantization to a set of data X in the range (-a, a). Linear quantization can be written as:

Q = E(127 * X / a)

Similarly, LOG nonlinear quantization spreads the integer codes uniformly over the logarithm of |X| rather than over X itself; a sketch contrasting the two mappings is given below.
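To make the contrast concrete, here is a small NumPy sketch of both mappings. The exact LOG formula is not reproduced here, so the log-domain mapping below is just one common way to write it (an assumption), chosen so that the int8 codes are spread uniformly over log2(|X|):

```python
import numpy as np

a = 4.0                                           # data assumed to lie in (-a, a)
x = np.random.uniform(-a, a, size=1000).astype(np.float32)

# linear (uniform) symmetric int8 quantization: equal step size everywhere
q_linear = np.clip(np.round(127 * x / a), -127, 127).astype(np.int8)

# one possible LOG-style mapping (an assumption): spread the int8 codes uniformly
# over log2(|x|) instead of |x|, so the relative error stays roughly constant
x_min = a / 2 ** 7                                # smallest representable magnitude
mag = np.clip(np.abs(x), x_min, a)
code = np.round(127 * np.log2(mag / x_min) / np.log2(a / x_min))
q_log = (np.sign(x) * code).astype(np.int8)
# decode: x_hat = sign(q) * x_min * (a / x_min) ** (|q| / 127)
```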

2. Quantization strategies

From the mapping above, once the threshold is known, the corresponding linear-mapping parameters are known and the whole quantization process is determined. So how is the threshold chosen? For weights, the data distribution is static, so the MIN/MAX values are generally used directly for the linear mapping. For inference-time activation values, the distribution is dynamic; to estimate it, a so-called calibration set is fed through the network to sample the distribution, and a quantization algorithm then selects the threshold from the sampled statistics, as sketched below.
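A minimal sketch of that sampling step (calibration_batches and forward_to_layer are hypothetical stand-ins for a real calibration loader and a partial forward pass): run the calibration data through the network, collect one layer's absolute activation values into a histogram, and let a threshold-selection strategy consume that histogram.

```python
import numpy as np

# `calibration_batches` and `forward_to_layer` are hypothetical stand-ins for a
# real calibration data loader and a partial forward pass of the network.
def collect_histogram(calibration_batches, forward_to_layer, layer, num_bins=2048):
    """Sample the dynamic activation distribution of one layer with a calibration set."""
    samples = []
    for x in calibration_batches:
        act = forward_to_layer(x, layer)          # this layer's fp32 activations
        samples.append(np.abs(act).ravel())
    samples = np.concatenate(samples)
    hist, edges = np.histogram(samples, bins=num_bins, range=(0.0, float(samples.max())))
    return hist, edges   # later consumed by a MinMax / KLD / ADMM / EQ style strategy
```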

2.1 MinMax quantization

MinMax is the simplest quantization method, and the quantization diagram is as follows:

In effect, MinMax simply maps the observed floating-point range directly onto the int8 range. The MinMax mapping from floating point to fixed point is:

Q = E(R / S) + Z,    S = (Rmax - Rmin) / (Qmax - Qmin),    Z = Qmax - E(Rmax / S)

Here R is the true floating-point value (fp32); Q is the quantized fixed-point value (int8, Q in [-127, 127]); Z is the quantized fixed-point value corresponding to the floating-point value 0; and S is the smallest scale step representable after quantization.
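Using the R / Q / S / Z notation above, a minimal NumPy sketch of the MinMax mapping (int8 with Q in [-127, 127]) could look like this:

```python
import numpy as np

def minmax_quantize(r, qmin=-127, qmax=127):
    """MinMax: map [r.min(), r.max()] linearly onto [qmin, qmax]."""
    rmin, rmax = float(r.min()), float(r.max())
    s = (rmax - rmin) / (qmax - qmin)             # scale: float range per integer step
    z = int(round(qmax - rmax / s))               # zero point: where float 0.0 lands
    q = np.clip(np.round(r / s) + z, qmin, qmax).astype(np.int8)
    return q, s, z

def minmax_dequantize(q, s, z):
    return s * (q.astype(np.float32) - z)

w = np.random.randn(64, 32).astype(np.float32)
q, s, z = minmax_quantize(w)
print("max abs error:", np.abs(w - minmax_dequantize(q, s, z)).max())
```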

This quantization method looks only at the minimum and maximum of the floating-point range and then applies the linear mapping with scale S. It can reduce accuracy considerably, so it is usually followed by a requantize fine-tuning step. In practice, however, when MinMax quantization is applied to statically distributed data such as network weights, it has little impact on the final inference accuracy, while its cost is low and the quantization process is efficient; this is the conclusion Nvidia reached after extensive experiments.

2.2 KLD quantization

KLD quantization uses the KL divergence to measure the similarity between two distributions; it is the method Nvidia TensorRT uses to quantize activation values. The diagram of KLD quantization is as follows:

  • Instead of mapping [min, max] directly to [-127, 127], this method searches for a threshold |T| < max(|max|, |min|) and maps [-|T|, |T|] to [-127, 127]. The idea is that, as long as the threshold is chosen properly, the values beyond it can be discarded without a large loss of accuracy.
  • Values beyond the threshold ±|T| are mapped directly onto the threshold. For example, the three red points in the figure above map directly to -127; the mapping is saturated.

The KLD method abstracts the fp32 values and the int8 values into two distributions, uses the threshold |T| to update the two distributions, and measures their similarity with the KL divergence. The smaller the KL divergence, the more similar the two distributions, and hence the better the chosen threshold |T|.

The following figure shows the pseudo-code of the KL-divergence calibration in TensorRT, which lays out the whole KLD quantization process; a simplified sketch follows.
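Since the pseudo-code figure may not reproduce well here, below is a simplified Python sketch of that calibration loop under commonly cited settings (2048 histogram bins, 128 quantization levels); it is a hedged approximation, not Nvidia's exact implementation:

```python
import numpy as np
from scipy.stats import entropy     # entropy(p, q) computes KL(p || q)

def kld_threshold(values, num_bins=2048, num_quant_levels=128):
    """Simplified sketch of TensorRT-style KL calibration for one tensor."""
    hist, edges = np.histogram(np.abs(values), bins=num_bins)
    hist = hist.astype(np.float64)
    best_kl, best_i = np.inf, num_quant_levels
    for i in range(num_quant_levels, num_bins + 1):
        # reference distribution P: merge everything beyond bin i into the last kept bin
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()
        # candidate distribution Q: collapse the i bins into 128 levels, then expand back
        q = np.zeros(i)
        for idx in np.array_split(np.arange(i), num_quant_levels):
            total = hist[idx].sum()
            nonzero = np.count_nonzero(hist[idx])
            if nonzero:
                q[idx] = (hist[idx] > 0) * total / nonzero
        # smooth so no bin is exactly zero, then compare with the KL divergence
        kl = entropy(p + 1e-10, q + 1e-10)
        if kl < best_kl:
            best_kl, best_i = kl, i
    return edges[best_i]            # calibrated threshold |T|
```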

2.3 ADMM quantization

ADMM (the alternating direction method of multipliers) is a method for optimizing functions, generally used for constrained optimization problems. Gradient descent, Newton's method, and the Lagrange multiplier method are similar optimization tools.

The general form of the optimization problem is:

minimize f(x) + g(z)    subject to    Ax + Bz = c

In ADMM, the equivalent objective, the augmented Lagrangian, is:

L_rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + (rho / 2) * ||Ax + Bz - c||^2

Unlike the plain Lagrange multiplier method, ADMM reaches the final solution through distributed, alternating iterations:

x(k+1) = argmin_x  L_rho(x, z(k), y(k))
z(k+1) = argmin_z  L_rho(x(k+1), z, y(k))
y(k+1) = y(k) + rho * (A x(k+1) + B z(k+1) - c)

Next, the threshold-selection problem is recast in the ADMM framework. Using the symmetric quantization described above, quantization is treated as an encoding problem: encoding followed by decoding should reconstruct data that is as close as possible to the original. Under this logic, the optimization objective is designed as:

minimize over S:   ||W - S * Q||^2,    with  Q = clip(E(W / S))

The squared 2-norm is easy to differentiate, so the objective above can be transformed: with the integer codes Q held fixed, setting the derivative with respect to S to zero gives the closed-form update

S = (W · Q) / (Q · Q)    (dot products over the weight tensor)

Personally, I think ADMM is not really needed for symmetric quantization: there is only one variable, and the final S can be obtained with gradient descent (or with the simple alternating update sketched below).
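In that spirit, here is a minimal NumPy sketch. It is an assumption about the exact formulation: it alternates the rounding step with the closed-form scale update derived above, rather than running full ADMM with a dual variable:

```python
import numpy as np

def fit_symmetric_scale(w, num_bits=8, num_iters=30):
    """Alternating minimization for  min_S ||w - S * Q||^2  with Q = clip(E(w / S)).
    Not full ADMM (no dual variable): alternate the rounding step and the
    closed-form scale update S = (w . Q) / (Q . Q)."""
    qmax = 2 ** (num_bits - 1) - 1
    s = float(np.abs(w).max()) / qmax                 # MinMax initialization
    for _ in range(num_iters):
        q = np.clip(np.round(w / s), -qmax, qmax)     # fix S, update the integer codes
        s = float(np.dot(w, q) / np.dot(q, q))        # fix the codes, update S
    return s

w = np.random.randn(4096).astype(np.float32)
s = fit_symmetric_scale(w)
print("scale:", s, "threshold |T|:", s * 127)
```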

2.4 EQ quantization

EQ quantization (EasyQuant) is an open-source quantization algorithm from DeepGlint, introduced in the paper EasyQuant: Post-training Quantization via Scale Optimization. Its main ideas are: account for error accumulation, decompose the whole-network decision into per-layer decisions, use cosine similarity as the optimization objective, and alternately optimize the weight scaling coefficients and the activation scaling coefficients.

Suppose the quantization formula is as follows (for convenience, symmetric quantization is used, with the scaling coefficient S multiplying the data):

Q(X, S) = clip(E(X * S))

Assume the target quantization precision is IntN; the clip function clamps the quantized value to the representable integer range, e.g. for the symmetric case:

clip(x) = min( max(x, -(2^(N-1) - 1)), 2^(N-1) - 1 )

The EQ algorithm treats the choice of the scaling coefficients S as a mathematical optimization problem. Assume the calibration set contains N samples and the quantized model has L layers. Let Q_i^l denote the output of layer l for the i-th sample computed with ordinary fp32 inference (no quantization), and let Q̂_i^l denote the corresponding output computed with quantized inference. By accounting for error accumulation and decomposing the problem layer by layer, and using cosine similarity as the evaluation criterion, the final optimization objective for each layer is:

maximize over S:   (1 / N) * sum over i = 1..N of  cos( Q_i^l , Q̂_i^l )

Here, when layer l is a convolution layer, Q_i^l is the fp32 convolution output, and Q̂_i^l is the output obtained by quantizing the activations and weights with their respective scaling coefficients, running the convolution, and dequantizing the result.

The biggest highlight of the EQ algorithm is that the scaling coefficients of the weights are optimized as well. Other threshold-selection strategies basically do not optimize the weight scaling coefficients and simply obtain them with min-max. Adding the weight scaling coefficients to the optimization variables turns this into a multivariable optimization problem, and optimizing several variables jointly in the conventional way is often analytically intractable. The solution EQ gives is to optimize the weight scaling coefficients and the activation scaling coefficients alternately.

The EQ algorithm flow is as follows; a simplified sketch is given below:
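Since the flow figure may not reproduce here, below is a simplified, hedged Python sketch of the idea. The helper names (acts, conv, eq_layer, search_scale) are illustrative; the search is a plain 1-D sweep rather than the paper's exact search; fake quantization (quantize then dequantize in float) stands in for true integer convolution; and, for consistency with the earlier sections, the scale here is used as a step size X / S:

```python
import numpy as np

def cos_sim(a, b):
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def fake_quant(x, s, qmax=127):
    """Quantize with step s, then dequantize again (symmetric intN, simulated in float)."""
    return np.clip(np.round(x / s), -qmax, qmax) * s

def search_scale(objective, s_init, num_steps=100, radius=0.5):
    """Plain 1-D sweep around an initial scale, keeping the best objective value."""
    best_s, best_obj = s_init, objective(s_init)
    for s in np.linspace(s_init * (1 - radius), s_init * (1 + radius), num_steps):
        obj = objective(s)
        if obj > best_obj:
            best_s, best_obj = s, obj
    return best_s

def eq_layer(acts, weight, conv, num_rounds=3):
    """Alternately refine the activation scale s_a and weight scale s_w of one layer
    so that the quantized output stays close (in cosine similarity) to the fp32
    output, averaged over the calibration samples."""
    ref = [conv(a, weight) for a in acts]                    # fp32 reference outputs
    s_a = max(float(np.abs(a).max()) for a in acts) / 127    # MinMax initialization
    s_w = float(np.abs(weight).max()) / 127

    def layer_obj(sa, sw):
        outs = [conv(fake_quant(a, sa), fake_quant(weight, sw)) for a in acts]
        return float(np.mean([cos_sim(o, r) for o, r in zip(outs, ref)]))

    for _ in range(num_rounds):
        s_w = search_scale(lambda s: layer_obj(s_a, s), s_w)  # fix s_a, tune s_w
        s_a = search_scale(lambda s: layer_obj(s, s_w), s_a)  # fix s_w, tune s_a
    return s_a, s_w
```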

3. Quantization experiments

KLD and ADMM quantization algorithms are selected to carry out some experimental simulations.

Python NumPy is used to create random data distributions; all of the tests below use a normal distribution. The KLD and ADMM algorithms are used to compute thresholds for the same distribution.
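A minimal sketch of that setup, reusing the kld_threshold and fit_symmetric_scale sketches from above (the mean and standard deviation below are illustrative):

```python
import numpy as np

np.random.seed(0)
mean, std = 0.0, 1.0                              # illustrative parameters
data = np.random.normal(mean, std, size=100_000).astype(np.float32)

t_kld = kld_threshold(data)                       # KLD-calibrated threshold |T|
t_admm = fit_symmetric_scale(data) * 127          # alternating-update scale -> |T|

print("KLD  threshold :", t_kld)
print("ADMM threshold :", t_admm)
print("max(|mean-3*std|, |mean+3*std|):", max(abs(mean - 3 * std), abs(mean + 3 * std)))
```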

Here are the results of the three trials:

The thresholds obtained by KLD and ADMM algorithms are as follows:

Summary: judging purely from these test results, the quantization threshold obtained by ADMM is always higher than the one obtained by KLD, and it stays basically stable around max(|mean - 3 * std|, |mean + 3 * std|). By the properties of the normal distribution, ADMM's threshold therefore covers roughly 99.7% of the data, because its optimization objective cannot tolerate a threshold that is too small: values beyond the threshold would incur large errors after being quantized and then dequantized.

This article has shared some quantization-related material, including quantization types and quantization strategies, together with a small experimental analysis. Quantization is a practical field that is well worth digging into. I hope this sharing helps you a little.

