Welcome to my public account [Jizhi Vision]

Hi everyone, I am Jizhi Vision. This article analyzes the implementation of Qualcomm's DFQ (Data-Free Quantization) algorithm, taking Tengine as an example.

This is the fourth article in my series on model quantization implementations. There are three previous ones, which interested readers can refer to:

(1) “[Model Inference] Quantization Implementation Sharing 1: Detailed Explanation of the min-max Symmetric Quantization Algorithm Implementation”

(2) “[Model Inference] Quantization Implementation Sharing 2: Detailed Explanation of the KL Symmetric Quantization Algorithm Implementation”

(3) “[Model Inference] Quantization Implementation Sharing 3: Detailed Explanation of the ACIQ Symmetric Quantization Algorithm Implementation”

The Qualcomm DFQ quantization algorithm was proposed in the paper Data-Free Quantization Through Weight Equalization and Bias Correction. At the beginning of the paper, quantization methods are divided into four levels:

Level 1: No data and No backpropagation required. Method works for any model;

Level 2: Requires data but no backpropagation. Works for any model;

Level 3: Requires data and backpropagation. Works for any model;

Level 4: Requires data and backpropagation. Only works for specific models;

I consider Level 1 to be the highest level, but I don't think DFQ truly belongs to Level 1; whether it works for any model at least deserves a question mark. Reading on, you will find that the paper mainly demonstrates the method on MobileNetV2's Conv -> BN -> ReLU sequential blocks, so the network structures demonstrated are still fairly limited, but the method itself is quite novel.

The principle and implementation will be introduced in detail below.

1. DFQ quantization principle

First, the quantization effect of DFQ:

On MobileNetV2, MobileNetV1 and ResNet18, DFQ does better than the straightforward per-layer and per-channel quantization strategies for fp32 -> int8. The author also extends the test to fp32 -> int6 on ResNet18, where the result is not as good as per-channel quantization.

The core logic of the DFQ algorithm consists of (1) cross-layer equalization, (2) high-bias absorption, (3) standard quantization, and (4) quantization bias correction.


Among them, (1) and (2) take place before quantization starts and are preparation work, while (4) is correction work after quantization. Let's go through them one by one.

1.1 Cross-layer equalization

First, consider per-layer and per-channel quantization of weights. Assuming the weight W has shape [N, C, H, W], per-layer quantization quantizes the whole tensor W[:, :, :, :] in one go and ends up with a single Scale and a single Bias, while per-channel quantization quantizes each output channel W[n, :, :, :] separately and ends up with N Scales and N Biases. The paper illustrates the issue with the per-output-channel weight ranges of the first depthwise-separable layer in MobileNetV2:

It can be seen that the weight ranges of different channels vary greatly, so it is hard for per-layer quantization to cover them with only one Scale + Bias: if per-layer is used, the weights of channels with a small range will all be rounded to 0 after quantization, which is clearly unreasonable. Per-channel quantization solves this defect perfectly, because it gives each channel its own Scale and Bias. So why not use per-channel directly? The author argues that per-channel is less hardware-friendly and more expensive than per-layer, and therefore tries to convert per-channel into per-layer in the paper. This is where the trick is born: narrow the weight range differences between channels and replace the original weights with new, more balanced fp32 weights, so that per-layer quantization achieves the effect of per-channel. Concretely, the mathematical properties of ReLU are used to carry out the following derivation:

Let’s take a look at the RELU function:

f(x) = ReLU(x) = max(0, x)

For a non-negative scaling factor s, we have:

f(s · x) = max(0, s · x) = s · max(0, x) = s · f(x),   for s >= 0

For two adjacent layers in the network:

h = f(W1 · x + b1)

y = f(W2 · h + b2)

Combined with this property of ReLU, the following derivation can be made:

y = f(W2 · f(W1 · x + b1) + b2) = f(W2 S · f(S^(-1) W1 · x + S^(-1) b1) + b2) = f(W2_hat · f(W1_hat · x + b1_hat) + b2)

Among them:

W1_hat = S^(-1) · W1,   b1_hat = S^(-1) · b1,   W2_hat = W2 · S

S = diag(s_1, ..., s_C) is a diagonal matrix, and each value s_i on the diagonal is a per-channel scaling factor: whatever is divided out of Layer 1 has to be multiplied back into Layer 2.

For symmetric quantization, assume the weight range of channel i is r1_i in Layer 1 and r2_i in Layer 2; the paper shows that channel i is best equalized to r_i = √(r1_i · r2_i). S can then be calculated as follows:

s_i = √(r1_i · r2_i) / r2_i = r1_i / r_i

This describes balancing the channel weight ranges between a single pair of layers, Layer 1 and Layer 2. A real network has many more adjacent layer pairs, so the procedure is iterated while checking whether the quantization-error criterion has been reached; once it has, the cross-layer balancing can stop, completing the equalization of the whole network.
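To make the two formulas above concrete, here is a minimal, self-contained C++ sketch (toy matrices and values of my own, not Tengine data structures or the paper's reference code) that equalizes the channels shared by two small layers: each row of W1 and b1 is divided by s_i, the matching column of W2 is multiplied by s_i, and thanks to the ReLU scaling property the network output stays the same.

#include <algorithm>
#include <cmath>
#include <cstdio>

static float relu(float v) { return v > 0.f ? v : 0.f; }

int main()
{
    // Two toy layers: h = relu(W1*x + b1), y = W2*h + b2 (sizes 2 -> 2 -> 1).
    float W1[2][2] = { { 4.0f, -8.0f }, { 0.2f, 0.1f } };   // very unbalanced channel ranges
    float b1[2]    = { 0.5f, 0.1f };
    float W2[2]    = { 0.05f, 2.0f };                        // column i of W2 matches channel i of layer 1
    float b2       = 0.3f;
    float x[2]     = { 1.0f, -0.5f };

    auto forward = [&](float w1[2][2], float bb1[2], float w2[2]) {
        float h[2], y = b2;
        for (int i = 0; i < 2; i++)
            h[i] = relu(w1[i][0] * x[0] + w1[i][1] * x[1] + bb1[i]);
        for (int i = 0; i < 2; i++)
            y += w2[i] * h[i];
        return y;
    };
    float y_before = forward(W1, b1, W2);

    // Cross-layer equalization: r1_i = range of W1 row i, r2_i = range of W2 column i,
    // s_i = sqrt(r1_i * r2_i) / r2_i, then W1_hat = W1 / s, b1_hat = b1 / s, W2_hat = W2 * s.
    float W1h[2][2], b1h[2], W2h[2];
    for (int i = 0; i < 2; i++)
    {
        float r1 = std::max(std::fabs(W1[i][0]), std::fabs(W1[i][1]));  // abs-max as the range proxy
        float r2 = std::fabs(W2[i]);
        float s  = std::sqrt(r1 * r2) / r2;
        W1h[i][0] = W1[i][0] / s;
        W1h[i][1] = W1[i][1] / s;
        b1h[i]    = b1[i] / s;
        W2h[i]    = W2[i] * s;
    }
    float y_after = forward(W1h, b1h, W2h);
    printf("output before = %f, after equalization = %f\n", y_before, y_after);
    return 0;
}

Here the per-channel "range" is taken as the absolute maximum; for a symmetric range the factor of 2 would cancel out in s_i anyway.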

1.2 Absorbing high biases

Cross-layer equalization mainly deals with the weights, but in the whole quantization process the quantization of activations also has a large influence on the overall result. In particular, after the weights are rescaled for equalization, the range of the corresponding activations changes as well, and the activation range of some channels can become much larger. To avoid large differences in activation ranges across channels, the high biases are absorbed into the next layer. Once again the mathematical properties of ReLU are used; let's do another round of derivation.

Assume the standard Conv + BN + ReLU structure, so the output of Layer 1 is:

h = f(W1 · x + b1)

For a vector c with c <= (W1 · x + b1), or c = 0 (the BN output follows a Gaussian distribution, so by the 3-sigma rule, taking c_i = max(0, beta_i - 3 · gamma_i), with beta and gamma the fused BN shift and scale, satisfies c <= (W1 · x + b1) with about 99.865% probability), we have:

f(W1 · x + b1 - c) = f(W1 · x + b1) - c

Applying this to the two layers Conv + BN + ReLU -> Conv, we can deduce:

y = W2 · h + b2 = W2 · (f(W1 · x + b1 - c) + c) + b2 = W2 · h_hat + b2_hat

Among them:

h_hat = f(W1 · x + b1_hat),   b1_hat = b1 - c,   b2_hat = b2 + W2 · c

The trick here is quite clever: subtracting c from the output activation h of Layer 1 makes its range smaller and reduces the quantization error, while the subtracted part is completely absorbed into the bias of the next layer, Layer 2, so the overall computation stays balanced. This completes bias absorption.
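A minimal C++ sketch of this step under the assumptions above (made-up per-channel beta/gamma values and tiny layers, not Tengine code): c_i = max(0, beta_i - 3 · gamma_i) is subtracted from b1 and absorbed into b2 through W2 · c.

#include <algorithm>
#include <cstdio>

int main()
{
    // Toy setup: layer 1 has 2 output channels, layer 2 has 1 output channel.
    // beta/gamma are the (assumed) fused BN shift/scale of layer 1's channels.
    float beta[2]  = { 3.0f, -1.0f };
    float gamma[2] = { 0.5f, 0.2f };
    float b1[2]    = { 3.0f, -1.0f };     // layer 1 bias after Conv+BN fusion
    float W2[1][2] = { { 0.4f, -0.7f } };
    float b2[1]    = { 0.1f };

    for (int i = 0; i < 2; i++)
    {
        // 3-sigma rule: the pre-activation exceeds beta - 3*gamma with ~99.865% probability,
        // so this much of the activation can safely be shifted into the next layer.
        float c = std::max(0.0f, beta[i] - 3.0f * gamma[i]);
        b1[i] -= c;                        // b1_hat = b1 - c (shrinks layer 1's activation range)
        for (int o = 0; o < 1; o++)
            b2[o] += W2[o][i] * c;         // b2_hat = b2 + W2 * c (absorbs the shift)
    }
    printf("b1_hat = [%f, %f], b2_hat = [%f]\n", b1[0], b1[1], b2[0]);
    return 0;
}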

1.3 Quantization

Not much to say here: it is just ordinary quantization, and the paper does not spend much space on it.

1.4 Quantization bias correction

Take a simple example: quantize fp32 data in the range [-1.0, 1.0] to int8 with Scale = 255 / 2.0 = 127.5. After scaling, the range is [-127.5, 127.5], and the usual round() operation then gives a final quantized range of [-127, 128]. Considering only symmetric quantization, there is no +zero_point offset, and this completes the fp32 -> int8 process. Dequantizing int8 -> fp32, the final fp32 range becomes [-127/127.5, 128/127.5], which is slightly different from the original [-1.0, 1.0]. It is easy to see that this deviation comes from round(), which shows that the quantization process is not fully reversible.
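The round trip above can be reproduced with a few lines of C++ (a toy illustration with made-up sample values, not quantization-tool code): scale by 127.5, round, dequantize, and inspect the residual error introduced by round().

#include <cmath>
#include <cstdio>

int main()
{
    const float scale = 255.0f / 2.0f;   // 127.5, maps [-1.0, 1.0] onto [-127.5, 127.5]
    const float samples[] = { -0.9f, -0.37f, 0.0f, 0.42f, 0.73f };

    for (float x : samples)
    {
        float q  = std::round(x * scale);   // the irreversible step: round()
        float xr = q / scale;               // dequantize back to fp32
        printf("fp32 % .4f -> int %4.0f -> fp32 % .6f   (error % .6f)\n", x, q, xr, xr - x);
    }
    return 0;
}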

Coming back to the point: quantization bias correction deals with the bias caused by the irreversible quantization error described above. The key observation in the paper is that this quantization error is biased, i.e. its mean is not 0, so it shifts the output distribution and produces an output deviation. The remedy is similar in spirit to the compensation used in (1) cross-layer equalization and (2) high-bias absorption: divide then multiply back, subtract then add back, add then subtract back. The same idea is used here to compensate for the error. Let's go through the derivation.

Assuming that the weight of a certain layer is W and the weight after quantization is W’, then:

y_hat = W' · x,   E[y_hat] = E[W · x] + E[eps · x]

Among them:

eps = W' - W   (the quantization error of the weights)

Then the output bias can be derived, and correcting it simply means subtracting it back out:

E[y_hat] - E[y] = E[eps · x] = eps · E[x],   so the corrected output is y_hat - eps · E[x] (in practice, -eps · E[x] is folded into the layer bias)

At this point the problem reduces to computing E[x]. Considering that the network has a BN -> ReLU sequential structure, and ReLU suppresses the negative half-axis so that only the positive part of the BN output survives, E[x] after ReLU() can be computed in two ways:

(1) When a calibration set is available, E[x] can be obtained directly from the data statistics (but this violates the data-free idea);

(2) When no calibration set is available (the data-free case), the activations output by BN follow a Gaussian distribution, and passing them through ReLU is equivalent to truncating the Gaussian and keeping only the positive half-axis. The problem then becomes computing the mean of this rectified Gaussian, which can be calculated as follows:

E[x] = E[ReLU(x_pre)] = gamma · N(-beta / gamma) + beta · [1 - Phi(-beta / gamma)],   where x_pre ~ N(beta, gamma^2), N(·) is the standard normal density, Phi(·) is its cumulative distribution function, and beta, gamma come from the BN parameters
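A minimal C++ sketch of this data-free branch under stated assumptions (made-up beta/gamma and quantization-error values, and the matrix-vector product eps · E[x] collapsed to one effective scalar per output channel for brevity; not Tengine code): compute the rectified-Gaussian mean with erfc, then fold -eps · E[x] into the bias.

#include <cmath>
#include <cstdio>

// Standard normal pdf and cdf.
static double norm_pdf(double z) { return std::exp(-0.5 * z * z) / std::sqrt(2.0 * 3.14159265358979323846); }
static double norm_cdf(double z) { return 0.5 * std::erfc(-z / std::sqrt(2.0)); }

// E[ReLU(X)] for X ~ N(beta, gamma^2): gamma * N(-beta/gamma) + beta * (1 - Phi(-beta/gamma)).
static double relu_gaussian_mean(double beta, double gamma)
{
    double z = -beta / gamma;
    return gamma * norm_pdf(z) + beta * (1.0 - norm_cdf(z));
}

int main()
{
    // Made-up fused-BN parameters of the previous layer's output, and made-up
    // per-output-channel effective weight quantization errors eps = W' - W.
    double beta = 0.8, gamma = 0.5;
    double eps[2]  = { 0.013, -0.021 };
    double bias[2] = { 0.10, -0.05 };

    double ex = relu_gaussian_mean(beta, gamma);   // E[x] of the rectified Gaussian input
    for (int o = 0; o < 2; o++)
        bias[o] -= eps[o] * ex;                    // corrected bias: b_hat = b - eps * E[x]

    printf("E[x] = %f, corrected bias = [%f, %f]\n", ex, bias[0], bias[1]);
    return 0;
}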

All right, that is how DFQ works; there is quite a bit in it. Next comes the implementation.

2. DFQ quantization implementation

Let’s take the implementation of DFQ in Tengine as an example.

Let's start with a small bug.

And then let's get started.

The main implementation of DFQ quantization is here:

case ALGORITHM_DFQ:
{
    quant_tool.data_free_quant();
    quant_tool.model_file = "test_dfq_fp32.tmfile";
    if (quant_tool.scale_file.empty()){
        quant_tool.scale_file = "table_minmax.scale";
        quant_tool.activation_quant_tool();
    }
    save_graph_i8_perchannel(quant_tool.model_file.c_str(), quant_tool.scale_file.c_str(), quant_tool.output_file, quant_tool.inplace, false);
    /* Evaluate quantitative losses */
    if (quant_tool.evaluate){
        fprintf(stderr, "[Quant Tools Info]: Step Evaluate, evaluate quantitative losses\n");
        quant_tool.assess_quant_loss(0);
    }
    break;
}

Compared with the other quantization algorithms, DFQ only adds one extra call, quant_tool.data_free_quant(), as follows:

quant_tool.data_free_quant() is responsible for the DFQ pre-processing, namely cross-layer equalization and high-bias absorption. Going into the data_free_quant() interface to read the code, the various conditionals plus nested loops make it not very readable, so let's take it slowly.

The beginning is mainly some initialization work, which I won't go into.

Tengine applies the DFQ equalization pre-processing mainly to DW Conv and direct Conv operators. You may have a question here: the principle section kept talking about the mathematical properties of BN and ReLU, so how did it become Conv? This is mainly due to operator fusion. When converting a model to tmfile (Tengine) or bin/param (ncnn), Tengine and ncnn do some graph optimization, and Conv + BN + ReLU is the most basic pattern that gets fused into a single larger operator. Therefore, in Tengine's DFQ implementation you will not see explicit handling of BN and ReLU, but the cross-layer equalization idea it uses is consistent with DFQ.
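The fusion itself is easy to sketch. The snippet below (illustrative values for one output channel, not Tengine's converter code) folds a BN channel into the preceding convolution's weights and bias, which is why only fused Conv weights are left for DFQ to equalize.

#include <cmath>
#include <cstdio>

int main()
{
    // Toy per-channel Conv+BN fusion for one output channel with 3 weights.
    float W[3] = { 0.2f, -0.5f, 0.1f };
    float b = 0.05f;
    // BN parameters for that channel (made-up values).
    float gamma = 1.2f, beta = -0.3f, mean = 0.4f, var = 0.25f, eps = 1e-5f;

    float inv_std = gamma / std::sqrt(var + eps);
    for (float& w : W) w *= inv_std;    // W_fused = W * gamma / sqrt(var + eps)
    b = (b - mean) * inv_std + beta;    // b_fused = (b - mean) * gamma / sqrt(var + eps) + beta

    printf("fused W = [%f, %f, %f], fused b = %f\n", W[0], W[1], W[2], b);
    return 0;
}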

For Direct Conv:

/// Direct Conv
auto op_name0 = graphn->node_list[node_input_id]->op.type;      

// Identify OP_CONV
if (node_proto[node_input_id].output_node_list.size() == 1 && op_name0 == OP_CONV){
    struct conv_param* conv_param0 = (struct conv_param*)graphn->node_list[node_input_id]->op.param_mem;
    if (conv_param0->group != conv_param0->output_channel || conv_param0->group == 1){
        node_proto[i].pass = 1;               // layer1 // Two adjacent layers to be balanced
        node_proto[node_input_id].pass = 1;   // layer0

        // layer0 min/max range    
        struct node* nodeP = graphn->node_list[node_input_id];
        struct tensor* input_tensor = get_ir_graph_tensor(graphn, nodeP->input_tensors[1]);
        uint16_t dims0 = input_tensor->dims[0];
        uint16_t dims123 = input_tensor->dims[1] * input_tensor->dims[2] * input_tensor->dims[3];

        std::vector<float> layer0_max(dims0, 0.0f);
        std::vector<float> layer0_min(dims0, 0.0f);
        std::vector<float> layer0_range(dims0, 0.0f);

        float* data_layer0 = (float*)input_tensor->data;
        for (int d0 = 0; d0 < dims0; d0++){
            for (int d1 = 0; d1 < dims123; d1++){
                if (data_layer0[dims123 * d0 + d1] >= layer0_max[d0])
                    layer0_max[d0] = data_layer0[dims123 * d0 + d1];
                if (data_layer0[dims123 * d0 + d1] < layer0_min[d0])
                    layer0_min[d0] = data_layer0[dims123 * d0 + d1];
            }
        }
        
        for (int d0 = 0; d0 < dims0; d0++){
            layer0_range[d0] = layer0_max[d0] - layer0_min[d0];
        }

        // layer1 min/max range
        nodeP = graphn->node_list[i];
        input_tensor = get_ir_graph_tensor(graphn, nodeP->input_tensors[1]);
        dims0 = input_tensor->dims[0];
        uint16_t dims1 = input_tensor->dims[1];
        uint16_t dims23 = input_tensor->dims[2] * input_tensor->dims[3];

        std::vector<float> layer1_max(dims1, 0.0f);
        std::vector<float> layer1_min(dims1, 0.0f);
        std::vector<float> layer1_range(dims1, 0.0f);

        float* data_layer1 = (float*)input_tensor->data;
        for (int d0 = 0; d0 < dims0; d0++){
            for (int d1 = 0; d1 < dims1; d1++){
                for (int d2 = 0; d2 < dims23; d2++){
                    if (data_layer1[dims1 * dims23 * d0 + dims23 * d1 + d2] >= layer1_max[d1]){
                        layer1_max[d1] = data_layer1[dims1 * dims23 * d0 + dims23 * d1 + d2];
                    }
                    if (data_layer1[dims1 * dims23 * d0 + dims23 * d1 + d2] < layer1_min[d1]){
                        layer1_min[d1] = data_layer1[dims1 * dims23 * d0 + dims23 * d1 + d2];
                    }
                }
            }
        }

        for (int d0 = 0; d0 < dims1; d0++){
            layer1_range[d0] = layer1_max[d0] - layer1_min[d0];
        }

        //////////////////////////////////////////////////////////////////////////////////
        // layer ops sqrt
        float* ops_range = new float[dims1];
        for (int ops = 0; ops < dims1; ops++){
            ops_range[ops] = sqrt(layer0_range[ops] * layer1_range[ops]);    // Calculate the appropriate scale Range r = √ (r1r2)
        }

        // Calculate the Scale
        float* S01 = new float[dims1];
        float* S01_F = new float[dims1];  
        for (int ops = 0; ops < dims1; ops++){
            if (ops_range[ops] == 0){
                S01[ops] = 0.0;
            }
            else{
                S01[ops] = layer0_range[ops] / ops_range[ops];
            }
            if (layer0_range[ops] == 0)
                S01_F[ops] = 0.0;
            else
                S01_F[ops] = ops_range[ops] / layer0_range[ops];
        }
        
        //////////////////////////////////////////////////////////////////////////////////
        // layer0 output Scale balancing
        nodeP = graphn->node_list[node_input_id];
        input_tensor = get_ir_graph_tensor(graphn, nodeP->input_tensors[1]);
        dims0 = input_tensor->dims[0];
        dims123 = input_tensor->dims[1] * input_tensor->dims[2] * input_tensor->dims[3];
        for (int d0 = 0; d0 < dims0; d0++){
            for (int d1 = 0; d1 < dims123; d1++){
                data_layer0[dims123 * d0 + d1] = data_layer0[dims123 * d0 + d1] * S01_F[d0];
            }
        }

        input_tensor = get_ir_graph_tensor(graphn, nodeP->input_tensors[2]);
        dims0 = input_tensor->dims[0];
        float* data_layer0_bias = (float*)sys_malloc(sizeof(float) * dims0);
        data_layer0_bias = (float*)input_tensor->data;
        for (int d0 = 0; d0 < dims0; d0++){
            data_layer0_bias[d0] = data_layer0_bias[d0] * S01_F[d0];
        }

        // layer1 output Scale balancing
        nodeP = graphn->node_list[i];
        input_tensor = get_ir_graph_tensor(graphn, nodeP->input_tensors[1]);
        dims0 = input_tensor->dims[0];
        dims1 = input_tensor->dims[1];
        dims23 = input_tensor->dims[2] * input_tensor->dims[3];
        for (int d0 = 0; d0 < dims0; d0++){
            for (int d1 = 0; d1 < dims1; d1++){
                for (int d2 = 0; d2 < dims23; d2++){
                    data_layer1[dims1 * dims23 * d0 + dims23 * d1 + d2] = data_layer1[dims1 * dims23 * d0 + dims23 * d1 + d2] * S01[d1];
                }
            }
        }
        delete[] S01;    // free the memory
        S01 = NULL;
        delete[] S01_F;
        S01_F = NULL;
        delete[] ops_range;
        ops_range = NULL;
    }
}

And then... the loop continues until the whole network is equalized and the dfq_fp32_tmfile is generated, and then... it is over almost before it started. If I read it correctly, the only difference between the current Tengine DFQ implementation and min-max quantization is that the weight data of the input fp32 tmfile is first equalized across layers. In other words, only trick (1) from the DFQ paper is implemented, and the other tricks are not included, which is a bit of a shame.


Of course, this is also a place that could be improved. That wraps up the introduction to the principle and the implementation for now.

The above is a detailed share of the principle and implementation of Qualcomm DFQ quantization. I hope it can be of some help to your learning.


[Original article] [Model Inference] Quantization Implementation Sharing 4: Is Data-free Quantization Worth It? A Detailed Explanation of the Qualcomm DFQ Quantization Algorithm Implementation