Follow my official account [Jizhi Vision] for more shared notes.
O_o >_< O_o O_o ~_~ O_o
This article mainly discusses why quantization can accelerate model inference.
I have written several earlier articles on model quantization: "[Model Inference] Several quantization strategies: MinMax, KLD, ADMM, EQ", "[Model Inference] Talk about the organization of model quantization", and "[Model Inference] Talk about the quantization of nonlinear activation functions". You can refer to them for the related background; here the focus is on why quantization can make inference faster.
Quantization usually involves Quantize and Dequantize steps. In other words, compared with the unquantized pipeline, quantization actually adds extra operators, so the reason quantization brings acceleration may not be as simple as one imagines. Below, a Conv layer is used to illustrate the speed difference with and without quantization.
Assume the number of input channels is C1, the output feature map is C2 x H x W, and the convolution kernel size is K x K. The clock cycles of the different types of arithmetic instructions involved are as follows:
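The original table of instruction cycle counts is not reproduced here, so for the formulas below I use my own placeholder symbols (an assumption, not the exact cycle numbers from the source):

T_{mul}^{fp16}, T_{add}^{fp16}: clock cycles of an fp16 multiply / add
T_{mul}^{int8}: clock cycles of an int8 x int8 multiply (result stored in int16)
T_{add}^{int16}, T_{add}^{int32}, T_{add}^{int64}: clock cycles of an int16 / int32 / int64 add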
Assume the unquantized Conv operator runs inference in fp16; its time cost is as follows:
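The original formula image is not shown here, so here is a rough reconstruction from the definitions above: each of the C2·H·W output elements needs C1·K² multiply–add pairs (a sketch with my symbol names, not necessarily the exact original expression):

$$ T_{fp16} = C_1 \cdot K^2 \cdot C_2 \cdot H \cdot W \cdot \left( T_{mul}^{fp16} + T_{add}^{fp16} \right) $$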
Setting aside the Quantize and Dequantize layers for now, let us first work out the time cost of the INT8 convolution itself after quantization. Note that, to keep the computation from overflowing, int8 cannot be used throughout the convolution: intermediate results sometimes have to be stored as int16 or even int32. Let n1 denote how many int8 multiplications can be accumulated before the int16 partial sum would overflow, and n2 denote how many of those int16 partial sums can be added into int32 before the int32 would overflow:
We can see that every n1 int8 multiplications, the partial sum held in int16 has to be added into an int32 register; and every n2 such int16-to-int32 additions, the data held in the int32 register has to be added into an int64 register. The time of the quantized convolution can therefore be written as follows:
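Again reconstructing the missing formula as a sketch (my own arrangement of the terms, based on the description above): each multiply–accumulate costs one int8 multiply plus one int16 add, every n1 of them adds one int32 add, and every n1·n2 of them adds one int64 add:

$$ T_{int8} = C_1 K^2 C_2 H W \left( T_{mul}^{int8} + T_{add}^{int16} \right) + \frac{C_1 K^2 C_2 H W}{n_1} \, T_{add}^{int32} + \frac{C_1 K^2 C_2 H W}{n_1 n_2} \, T_{add}^{int64} $$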
The Requantize operation has not yet been counted in the convolution above; its time cost is as follows:
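The original expression is not reproduced, so here is only a rough sketch of its form, assuming the Requantize step rescales each of the C2·H·W output elements with one integer multiply and one shift (the symbols T_{mul}^{int32} and T_{shift} are my own placeholders):

$$ T_{requant} \approx C_2 \cdot H \cdot W \cdot \left( T_{mul}^{int32} + T_{shift} \right) $$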
The final quantized convolution operation time is as follows:
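Putting the two parts together (a sketch with the symbols above):

$$ T_{quantconv} = T_{int8} + T_{requant} $$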
Integer arithmetic instructions often take only half, or even a quarter, of the clock cycles of the corresponding floating-point instructions. Looking purely at the convolution's cycle count before and after quantization, there is indeed a speedup.
Next, consider adding the Quantize and Dequantize layers; the following additional instruction clock cycles need to be introduced:
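As before, the original table is not reproduced; as an assumption I introduce placeholder symbols for the extra instructions: T_{f2i} for a float-to-int conversion (round and cast), T_{i2f} for an int-to-float conversion, and the fp16 multiply by the scale reuses T_{mul}^{fp16}. Assuming for simplicity that the input feature map also has spatial size H x W, the two layers cost roughly:

$$ T_{quantize} \approx C_1 \cdot H \cdot W \cdot \left( T_{mul}^{fp16} + T_{f2i} \right), \qquad T_{dequantize} \approx C_2 \cdot H \cdot W \cdot \left( T_{i2f} + T_{mul}^{fp16} \right) $$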
We can then easily obtain the total time after adding the Quantize and Dequantize layers as follows:
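That is, as a sketch with the symbols above:

$$ T_{total} = T_{quantize} + T_{int8} + T_{requant} + T_{dequantize} $$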
From the formula above, once Quantize and Dequantize are added, the clock cycles of several other instructions also come into play, so it is no longer possible to judge directly whether the result will be faster than the unquantized convolution. This question has to be considered separately for each inference deployment environment.
In fact, in most cases a deployed quantized model can be accelerated not only because integer arithmetic instructions have shorter cycles than floating-point ones, but also because many chips designed specifically for neural-network deployment have acceleration units built specifically for integer arithmetic. In addition, some hardware has no floating-point unit at all and can only be deployed with quantized models.
The above analyzes the reasons for quantization speedup from the perspective of shorter instruction cycles. Quantization acceleration is really a piece of systems engineering, with many factors to be considered.