Welcome to follow my public account [Jizhi Vision]; reply "001" to get the Google coding style guide.

O_o >_< O_o O_o ~_~ O_o

Hello everyone, I am Jizhi Vision. This article analyzes the implementation of the ACIQ symmetric quantization algorithm, taking Tengine as an example.

This is the third article in the quantization implementation series; the first two are listed below for interested readers:

(1) "[Model Inference] Quantization Implementation Sharing 1: Detailed Explanation of the min-max Symmetric Quantization Algorithm Implementation";

(2) "[Model Inference] Quantization Implementation Sharing 2: Detailed Explanation of the KL Symmetric Quantization Algorithm";

ACIQ is similar to the previous quantization strategies: it also chooses a truncation threshold T and maps [-T, T] onto the quantization range. The difference lies in how T is found. So let's start.

1. Principle of the ACIQ quantization strategy

The ACIQ quantization strategy was proposed in the paper "Post-training 4-bit quantization of convolutional networks for rapid-deployment".

The comparison in the paper uses 8-bit weight quantization and 4-bit activation quantization. ACIQ is about 4000 times faster than KL quantization, and except for ResNet-101, the quantized accuracy of the other tested networks is better than with KL quantization, so it offers both efficiency and accuracy.

The paper states right at the beginning: "Unlike traditional approaches that focus on the quantization at the network level, in this work we propose to minimize the quantization effect at the tensor level." So ACIQ is a quantization strategy that works at the tensor level. The overall derivation logic is as follows:

(1) First, derive a generic expression for the expected MSE as a function of the clipping value, for any given distribution;

(2) Then use this expression to develop a specific expression for each distribution;

(3) Finally, establish the optimal clipping values by solving the equations for which the derivative with respect to the clipping value is set to zero.

Usually, clipping is required during quantization to deal with the long-tail problem of the original data. Assuming α is the truncation value, the clipping operation can be expressed as:

    clip(x, α) = x if |x| ≤ α,  clip(x, α) = sign(x)·α if |x| > α    (equivalently, clip(x, α) = max(−α, min(α, x)))

ACIQ requires a strong prior assumption: the tensor (feature map) follows a Laplace distribution or a Gaussian distribution. Optimization is then used to find the truncation value that minimizes the quantization loss. The whole quantization process maps the values of the original distribution onto 2^M discrete quantization levels, where M is the number of quantization bits; in other words, [-α, α] is divided uniformly into 2^M segments.

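To make the mapping concrete, here is a minimal sketch of my own (not Tengine code; the function names aciq_quantize / aciq_dequantize are made up for illustration) of symmetric int8 quantization with a truncation value alpha: values are first clipped to [-alpha, alpha] and then mapped to the 2^8 = 256 discrete levels. Tengine itself uses scale = alpha / 127 for int8, as shown in Section 2.

#include <algorithm>
#include <cmath>
#include <cstdint>

/* minimal sketch, not Tengine code: clip to [-alpha, alpha], then map to int8 */
static inline int8_t aciq_quantize(float x, float alpha)
{
    float scale = alpha / 127.f;                            /* symmetric int8, zero point = 0 */
    float clipped = std::max(-alpha, std::min(alpha, x));   /* clip(x, alpha) */
    return (int8_t)std::round(clipped / scale);
}

static inline float aciq_dequantize(int8_t q, float alpha)
{
    return q * (alpha / 127.f);
}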

Assuming that the probability density function of the original distribution is f(x), the truncation value is α, and the quantization function is Q(x), the L2 loss before and after quantization can be written as:

    E[(X − Q(X))^2] = ∫_{−∞}^{−α} f(x)·(x + α)^2 dx + ∫_{−α}^{α} f(x)·(x − Q(x))^2 dx + ∫_{α}^{+∞} f(x)·(x − α)^2 dx

The above formula can obviously be divided into three parts:

(1) (−∞, −α];

(2) [−α, α];

(3) [α, +∞);

For a distribution symmetric about 0, such as the Gaussian N(0, σ^2) or the Laplace(0, b), terms (1) and (3) are equal: they are the mean squared error introduced by clipping values beyond |α| back to |α|. After [-α, α] is divided evenly into 2^M segments, each quantized value takes the midpoint of its segment, q_1, q_2, ..., q_{2^M}, i.e. q_i = −α + (2i − 1)·Δ/2 with step Δ = 2α / 2^M; term (2) is then the accumulated rounding error inside the clipping range. The whole quantization problem is thus transformed into finding the truncation value α that minimizes E[(X − Q(X))^2] (deep learning always comes down to a math problem in the end). Combining the prior distribution and applying some equivalent transformations to the formula, the overall quantization-loss objective function is obtained:

    E[(X − Q(X))^2] ≈ 2·∫_{α}^{+∞} f(x)·(x − α)^2 dx + α^2 / (3·2^(2M))

Mathematically, finding the minimum of the objective function means taking its partial derivative with respect to α and setting it to zero.

For the Laplace distribution, the partial derivative can be expressed as:

    ∂E/∂α = 2α / (3·2^(2M)) − 2b·e^(−α/b) = 0

For the Gaussian distribution, the partial derivative can be expressed as:

    ∂E/∂α = 2α·erfc(α / (√2·σ)) − 2·√(2/π)·σ·e^(−α^2 / (2σ^2)) + 2α / (3·2^(2M)) = 0

Finally, for both the Laplace and the Gaussian distribution, M is the number of bits you want to quantize to, and the distribution parameters (b for Laplace, σ for Gaussian) can be estimated from the data, so the optimal truncation value α can be solved for; with α in hand, the symmetric quantization scale follows directly. A quick numerical check of the Gaussian case is sketched below.
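The following is a minimal sketch of my own (not Tengine code) that solves the Gaussian condition above numerically by bisection, with σ = 1. For M = 8 bits the root comes out at roughly 3.92, which matches the last entry of the alpha_gaussian table that Tengine uses in Section 2; changing M reproduces the other entries (e.g. M = 4 gives roughly 2.56).

#include <cmath>
#include <cstdio>

/* derivative of the approximate objective above, Gaussian case */
static double dE_dalpha(double alpha, double sigma, int M)
{
    double levels = std::pow(4.0, M);  /* 2^(2M) */
    double tail = 2.0 * alpha * std::erfc(alpha / (std::sqrt(2.0) * sigma))
                  - 2.0 * std::sqrt(2.0 / 3.14159265358979323846) * sigma
                        * std::exp(-alpha * alpha / (2.0 * sigma * sigma));
    double rounding = 2.0 * alpha / (3.0 * levels);
    return tail + rounding;
}

int main()
{
    const double sigma = 1.0;
    const int M = 8;
    double lo = 0.5, hi = 8.0;  /* dE/dalpha < 0 at lo and > 0 at hi, so the root is bracketed */
    for (int i = 0; i < 100; i++)
    {
        double mid = 0.5 * (lo + hi);
        if (dE_dalpha(mid, sigma, M) < 0.0)
            lo = mid;
        else
            hi = mid;
    }
    std::printf("optimal alpha ~= %.5f * sigma\n", 0.5 * (lo + hi));  /* ~= 3.92 for M = 8 */
    return 0;
}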


2. Implementation of the ACIQ quantization strategy

Let’s look at the implementation of ACIQ in Tengine.

The main code of the quantization implementation:

case ALGORITHM_ACIQ:{
    if (quant_tool.scale_file.empty()){
        quant_tool.scale_file = "table_aciq.scale";
        quant_tool.activation_quant_tool();
    }
    save_graph_i8_perchannel(quant_tool.model_file.c_str(), quant_tool.scale_file.c_str(), quant_tool.output_file, quant_tool.inplace, false);
    /* evaluate the quantization loss */
    if (quant_tool.evaluate){
        fprintf(stderr, "[Quant Tools Info]: Step Evaluate, evaluate quantitative losses\n");
        quant_tool.assess_quant_loss(0);
    }
    break;
}

2.1 Quantization of activation values

Activation value quantization entry:

quant_tool.activation_quant_tool();

The first step is to calculate the min and max values of the activations. This follows the same logic as the quantization strategies written about previously, so I will not repeat it here and will go straight to the ACIQ-specific part:

for (int i = 0; i < ir_graph->tensor_num; i++){
    struct tensor* t = ir_graph->tensor_list[i];
    if (t->tensor_type == TENSOR_TYPE_VAR || t->tensor_type == TENSOR_TYPE_INPUT){
        float absmax = 0.f;
        float act_scale = 1.f;
        int act_zero_point = 0;
        int emlement_num = t->elem_num;

        /* take the larger of |max| and |min| as the symmetric range */
        absmax = std::max(std::abs(max_activation[i]), std::abs(min_activation[i]));
        /* ACIQ: compute the clipping threshold under the Gaussian prior, int8 */
        float threshold = compute_aciq_gaussian_clip(absmax, emlement_num, 8);
        act_scale = threshold / 127.f;

        /* the scale of softmax is always scale = 1 / 127.f */
        for (int j = 0; j < ir_graph->node_num; j++){
            struct node* noden = ir_graph->node_list[j];
            struct tensor* tensor_tmp = get_ir_graph_tensor(ir_graph, noden->output_tensors[0]);

            if (!(tensor_tmp->tensor_type == TENSOR_TYPE_INPUT || tensor_tmp->tensor_type == TENSOR_TYPE_VAR))
                continue;

            std::string tmp_op_name = get_op_name_from_type(noden->op.type);
            std::string cur_name = t->name;
            std::string tmp_name = tensor_tmp->name;

            if ((cur_name == tmp_name) && tmp_op_name == "Softmax"){
                act_scale = 1 / 127.f;
                break;
            }
        }
        /* write one "tensor_name scale zero_point" line to the scale table */
        fprintf(fp_aciq, "%s %f %d\n", ir_graph->tensor_list[i]->name, act_scale, act_zero_point);
    }
}
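From the fprintf above you can see that each activation tensor ends up as one line in table_aciq.scale of the form "tensor_name scale zero_point" (the zero point stays 0 in this symmetric scheme). A hypothetical line (the tensor name and value are made up for illustration) would look like:

conv1_output 0.063509 0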

The key is this function; by default Tengine assumes the prior follows a Gaussian distribution and quantizes to int8:

float threshold = compute_aciq_gaussian_clip(absmax, emlement_num, 8);

Take a look at its implementation:

static float compute_aciq_gaussian_clip(float absmax, int N, int num_bits)
{
    // optimal α/σ for 1- to 8-bit quantization; for 8-bit, α = 3.92403714 σ
    const float alpha_gaussian[8] = {0, 1.71063519, 2.15159277, 2.55913646, 2.93620062, 3.28691474, 3.6151146, 3.92403714};

    const double gaussian_const = (0.5 * 0.35) * (1 + sqrt(3.14159265358979323846 * log(4)));

    // estimate the Gaussian sigma from the observed absmax and the number of elements N
    double std = (absmax * 2 * gaussian_const) / sqrt(2 * log(N));

    return (float)(alpha_gaussian[num_bits - 1] * std);
}
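To get a feel for the numbers, here is a hypothetical call (absmax and the element count are made up for illustration): with absmax = 10.0 and 1,000,000 elements, gaussian_const ≈ 0.540 and std ≈ 2.06, so the returned threshold is about 3.924 × 2.06 ≈ 8.07 and the activation scale becomes 8.07 / 127 ≈ 0.064.

/* hypothetical numbers, for illustration only */
float threshold = compute_aciq_gaussian_clip(10.0f, 1000000, 8);  /* ~= 8.07 */
float act_scale = threshold / 127.f;                              /* ~= 0.064 */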

That gives us the truncation value, from which we can compute the scale:

act_scale = threshold / 127.f;

This completes the quantization of the activation values.

2.2 Weight & bias quantization

The quantization of weights & biases follows the same logic as the min-max and KL quantization introduced previously, so it will not be described again here.

Finally, in practice you will find that the ACIQ quantization process is very fast; being 4000 times faster than KL quantization is no exaggeration, mainly because with the Gaussian prior the values alpha_gaussian, gaussian_const and std are computed in closed form and no search is needed.

The quantization principle and implementation of ACIQ have been shared above. I hope my sharing can be of some help to your study.


[Public account post] [Model Inference] Quantization Implementation Sharing 3: Detailed Explanation of the ACIQ Symmetric Quantization Algorithm Implementation