Welcome to follow my public account [Jizhi Vision]; reply 001 to get the Google programming specification.
Hello everyone, I am Jizhi Vision. This article analyzes the implementation of the KL symmetric quantization algorithm, taking the Tengine implementation as an example.
I previously wrote "[Model Inference] Quantization Implementation Sharing (1): Detailed Explanation of the min-max Symmetric Quantization Algorithm Implementation", which interested readers can refer to. This article is its sequel and the second in the series on quantization implementation.
I will not go over the background of quantization again, since it has already been covered in the previous article, so let's get started.
1. KL quantization principle
KL quantization is a quantization method that uses the KL divergence to measure the similarity between the real data distribution and the quantized data distribution. It is the quantization strategy that Nvidia TensorRT uses for activation values. The main logic of KL quantization is as follows:
- Unlike min-max, KL does not directly map [min, max] to [-127, 127]. Instead, it searches for a threshold |T| < max(|max|, |min|) and maps [-T, T] to [-127, 127]. The idea is that as long as the threshold is chosen properly, the values beyond it can be discarded without affecting the precision much.
- Values beyond the threshold ±|T| are mapped directly to the threshold; for example, the three red dots in the figure above are mapped directly to -127. This kind of mapping is called saturated.
KL quantization abstracts the float32 and int8 values into two distributions, uses the threshold |T| to update the two distributions, and uses the KL divergence to measure their similarity. The smaller the KL divergence, the more similar the two distributions, which also means the threshold |T| is the best choice. For symmetric quantization, the Scale is then computed from this threshold, and Zero_point is always zero.
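To make the saturated mapping concrete, here is a minimal sketch of symmetric int8 quantization once a threshold T has been chosen (my own illustration, not Tengine or TensorRT code): Scale = T / 127, and anything outside [-T, T] saturates to ±127.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Saturated symmetric quantization: Scale = T / 127, Zero_point = 0.
int8_t quantize_sym(float x, float T){
    float scale = T / 127.f;
    int q = (int)std::round(x / scale);
    return (int8_t)std::max(-127, std::min(127, q)); // values beyond ±T saturate
}

int main(){
    float T = 4.0f; // assumed threshold, for illustration only
    std::printf("%d %d %d\n", quantize_sym(1.0f, T), quantize_sym(-10.0f, T), quantize_sym(5.0f, T));
    // prints: 32 -127 127
    return 0;
}
```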
The figure below shows the pseudo-code of KL divergence calibration in TensorRT, which explains the whole KLD quantization process well. (Call it Figure 2; it will be referenced later.)
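Since that figure is an image and may not survive reposting, the following is my own simplified C++ reconstruction of the calibration loop it describes (integer bin boundaries, simplified zero-bin handling; it is neither the TensorRT nor the Tengine source):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <numeric>
#include <vector>

// KL divergence between two equal-length histograms (bins where either side is
// zero are simply skipped here, a simplification of the real handling).
static float kl_div(const std::vector<float>& p, const std::vector<float>& q){
    float r = 0.f;
    for (size_t i = 0; i < p.size(); i++)
        if (p[i] > 0.f && q[i] > 0.f)
            r += p[i] * std::log(p[i] / q[i]);
    return r;
}

static void normalize(std::vector<float>& v){
    float s = std::accumulate(v.begin(), v.end(), 0.f);
    if (s > 0.f)
        for (float& x : v) x /= s;
}

// Search the best clipping bin over a 2048-bin histogram of |activation| values.
int kld_search_threshold(const std::vector<float>& hist){ // hist.size() == 2048
    int best_bin = 128;
    float best_div = std::numeric_limits<float>::max();
    for (int t = 128; t < (int)hist.size(); t++){
        // Reference distribution P: first t bins, outliers folded into the last bin.
        std::vector<float> P(hist.begin(), hist.begin() + t);
        P.back() += std::accumulate(hist.begin() + t, hist.end(), 0.f);
        normalize(P);

        // Candidate distribution Q: merge the first t bins into 128 levels, then
        // expand back to t bins, spreading each level over its non-empty bins.
        std::vector<float> Q(t, 0.f);
        const float per = (float)t / 128.f;
        for (int l = 0; l < 128; l++){
            const int begin = (int)(l * per);
            const int end = std::min((int)((l + 1) * per), t);
            float sum = 0.f;
            int nonzero = 0;
            for (int j = begin; j < end; j++){
                sum += hist[j];
                nonzero += hist[j] > 0.f ? 1 : 0;
            }
            if (nonzero == 0)
                continue;
            for (int j = begin; j < end; j++)
                if (hist[j] > 0.f)
                    Q[j] = sum / nonzero;
        }
        normalize(Q);

        const float d = kl_div(P, Q);
        if (d < best_div){
            best_div = d;
            best_bin = t;
        }
    }
    return best_bin; // threshold T ≈ (best_bin + 0.5) * bin_width
}
```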
2. KL quantization
As before, the implementation of KL quantization in Tengine is used as the example here. The main steps are as follows:
(1) Activation value quantization: compute min and max first, then use the KL strategy to search for the quantization threshold and generate the activation calibration table. fp32 to int8;
(2) Weight quantization: use the min-max quantization strategy. fp32 to int8;
(3) Bias quantization: reuse the activation quantization scale, extended for int32 quantization. fp32 to int32;
Weight and bias quantization go one step further than activation quantization: in addition to computing the Scale, the Scale is applied to quantize the values directly in order to generate the int8 tmfile. A small sketch of the bias step follows below.
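As an illustration of step (3), a common convention for int32 bias quantization (consistent with the description above) is bias_scale = activation_scale * weight_scale. The sketch below uses made-up scale values and is not taken from Tengine's source:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

int main(){
    float act_scale    = 0.031f;   // assumed activation scale from the KL search
    float weight_scale = 0.0042f;  // assumed (per-channel) weight scale from min-max
    float bias_fp32    = 0.37f;    // assumed fp32 bias value

    float   bias_scale = act_scale * weight_scale;               // int32 bias scale
    int32_t bias_i32   = (int32_t)std::round(bias_fp32 / bias_scale);
    std::printf("bias_scale = %g, bias_i32 = %d\n", bias_scale, bias_i32);
    return 0;
}
```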
The main code for KL quantization in Tengine is as follows:
```cpp
case ALGORITHM_KL: {
    if (quant_tool.scale_file.empty()) {
        quant_tool.scale_file = "table_kl.scale";
        quant_tool.activation_quant_tool();
    }
    save_graph_i8_perchannel(quant_tool.model_file.c_str(), quant_tool.scale_file.c_str(), quant_tool.output_file, quant_tool.inplace, false);
    /* Evaluate quantitative losses */
    if (quant_tool.evaluate) {
        fprintf(stderr, "[Quant Tools Info]: Step Evaluate, evaluate quantitative losses\n");
        quant_tool.assess_quant_loss(0);
    }
    break;
}
```
The most important interfaces of the quantization search strategy are quant_tool.activation_quant_tool() and save_graph_i8_perchannel. For KL quantization, these two interfaces do two things respectively:
(1) activation value quantization, generating table_kl.scale;
(2) weight & bias quantization, generating scale_weight.txt, scale_bias.txt and the int8 tmfile;
The way min and max are calculated in activation quantization, and the weight & bias quantization process, follow the same logic as min-max quantization and share the same code, so they are not covered again here. Interested readers can refer to "[Model Inference] Quantization Implementation Sharing (1): Detailed Explanation of the min-max Symmetric Quantization Algorithm Implementation". Here we mainly introduce the KL search strategy in activation value quantization.
The entry point of the KL quantization search strategy is here:
```cpp
quant_tool.activation_quant_tool();
```
First, min and max are found by a comparison search, mainly using the std::max_element and std::min_element interfaces, which need no further explanation (a minimal sketch is shown below). After obtaining the min and max values, the KL search strategy starts.
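A minimal sketch (not the Tengine source) of getting min and max from an activation buffer with those STL algorithms:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main(){
    std::vector<float> activation = {-1.5f, 0.2f, 3.7f, -0.8f, 2.1f}; // made-up data
    float min_val = *std::min_element(activation.begin(), activation.end());
    float max_val = *std::max_element(activation.begin(), activation.end());
    std::printf("min = %f, max = %f\n", min_val, max_val);
    return 0;
}
```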
2.1 Build the probability histogram
The first round builds the probability histogram and runs the first round of KL calculation. From the second round on, the histogram is not rebuilt; it is accumulated on top of the histogram built in the first round. Therefore, the more calibration images there are, the closer the final probability histogram is to the real distribution.
```cpp
/* calculate hist */
uint32_t inum = 0;
for (int i = 0; i < ir_graph->tensor_num; i++){
    struct tensor* ir_tensor = ir_graph->tensor_list[i];
    if (ir_tensor->tensor_type == TENSOR_TYPE_VAR || ir_tensor->tensor_type == TENSOR_TYPE_INPUT){
        float step_max = std::abs(max_activation[i]);
        if (std::abs(min_activation[i]) > step_max)
            step_max = std::abs(min_activation[i]);
        float step_bin = step_max / 2048.0f;

        std::vector<float> every_edge;
        if (nums == imgs_list.size() - 1){
            for (int j = 0; j < 2048; j++){
                float edge_float = (step_bin * (j + 0.5f));
                every_edge.push_back(edge_float);
            }
            hist_edge.push_back(every_edge);
            hist_gram.push_back(histCount((float*)ir_tensor->data, ir_tensor->elem_num, step_max));
        }
        else{
            std::vector<uint32_t> hist_tmp;
            hist_tmp = histCount((float*)ir_tensor->data, ir_tensor->elem_num, step_max);
            for (int j = 0; j < 2048; j++){
                hist_gram[inum][j] += hist_tmp[j];
            }
        }

        tensor_hist[i] = inum;
        hist_tensor[inum] = i;
        inum++;
    }
}
```
Look at the following histCount interface:
```cpp
std::vector<uint32_t> histCount(float* data, uint32_t elem_num, float abs_max){
    float bin_scale = abs_max / 2047.f;
    int bin_zp = 0;
    std::vector<uint32_t> hist(2048);
    for (int i = 0; i < elem_num; i++){
        if (data[i] != 0){
            uint32_t hist_idx = round(std::abs(data[i]) / bin_scale);
            hist[hist_idx]++;
        }
    }
    return hist;
}
```
Finally, the probability histogram obtained is normalized as follows:
```cpp
distribution = normalize_histogram(distribution_in);
```
The implementation interface for histogram normalization is also simple:
```cpp
std::vector<float> normalize_histogram(std::vector<uint32_t>& histogram){
    std::vector<float> histogram_out(histogram.size());
    const size_t length = histogram.size();
    float sum = 0;

    for (size_t i = 1; i < length; i++)
        sum += histogram[i];

    for (size_t i = 1; i < length; i++)
        histogram_out[i] = float(histogram[i] / sum);

    return histogram_out;
}
```
2.2 Calculate P
The next step is to go back to Figure 2: compute P, then Q, and finally the KL divergence.
First, the simulated quantized distribution P is computed, searching incrementally over target_bin = 128 --> 2048, with the overflow part folded into the edge bin. P can be regarded as the fp32 data distribution before quantization, i.e., the real distribution:
```cpp
// get P
fill(quantize_distribution.begin(), quantize_distribution.end(), 0.0f);
const float num_per_bin = static_cast<float>(threshold) / static_cast<float>(target_bin);
for (int i = 0; i < target_bin; i++){
    const float start = static_cast<float>(i) * num_per_bin;
    const float end = start + num_per_bin;

    const int left_upper = static_cast<int>(ceil(start));
    if (static_cast<float>(left_upper) > start){
        const float left_scale = static_cast<float>(left_upper) - start;
        quantize_distribution[i] += left_scale * distribution[left_upper - 1];
    }

    const int right_lower = static_cast<int>(floor(end));
    if (static_cast<float>(right_lower) < end){
        const float right_scale = end - static_cast<float>(right_lower);
        quantize_distribution[i] += right_scale * distribution[right_lower];
    }

    for (int j = left_upper; j < right_lower; j++){
        quantize_distribution[i] += distribution[j];
    }
}
```
2.3 Calculate Q
Then the real quantized distribution Q is computed. As with P, the search runs incrementally over target_bin = 128 --> 2048. Q can be regarded as the int8 data distribution after quantization, i.e., the quantized distribution:
```cpp
// get Q
std::vector<float> expand_distribution(threshold, 0);
for (int i = 0; i < target_bin; i++){
    const float start = static_cast<float>(i) * num_per_bin;
    const float end = start + num_per_bin;
    float count = 0;

    const int left_upper = static_cast<int>(ceil(start));
    float left_scale = 0;
    if (static_cast<float>(left_upper) > start){
        left_scale = static_cast<float>(left_upper) - start;
        if (distribution[left_upper - 1] != 0){
            count += left_scale;
        }
    }

    const int right_lower = static_cast<int>(floor(end));
    float right_scale = 0;
    if (static_cast<float>(right_lower) < end){
        right_scale = end - static_cast<float>(right_lower);
        if (distribution[right_lower] != 0){
            count += right_scale;
        }
    }

    for (int j = left_upper; j < right_lower; j++){
        if (distribution[j] != 0){
            count++;
        }
    }

    const float expand_value = quantize_distribution[i] / count;

    if (static_cast<float>(left_upper) > start){
        if (distribution[left_upper - 1] != 0){
            expand_distribution[left_upper - 1] += expand_value * left_scale;
        }
    }
    if (static_cast<float>(right_lower) < end){
        if (distribution[right_lower] != 0){
            expand_distribution[right_lower] += expand_value * right_scale;
        }
    }
    for (int j = left_upper; j < right_lower; j++){
        if (distribution[j] != 0){
            expand_distribution[j] += expand_value;
        }
    }
}
```
2.4 Calculate the KL divergence
Next, calculate the KL divergence between the real distribution P and the quantized distribution Q:
```cpp
const float kl_divergence = compute_kl_divergence(t_distribution, expand_distribution);
```
The interface to implement KL divergence calculation is also simple:
```cpp
float compute_kl_divergence(std::vector<float>& dist_a, std::vector<float>& dist_b){
    const size_t length = dist_a.size();
    float result = 0;
    for (size_t i = 0; i < length; i++){
        if (dist_a[i] != 0){
            if (dist_b[i] == 0){
                result += 1;
            }
            else{
                result += dist_a[i] * log(dist_a[i] / dist_b[i]);
            }
        }
    }
    return result;
}
```
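For reference, what the loop above accumulates is the standard KL divergence (plus the special case of result += 1 when dist_b[i] is zero while dist_a[i] is not):

$$
D_{KL}(P \parallel Q) = \sum_{i} P(i)\,\log\frac{P(i)}{Q(i)}
$$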
Finally, we want to find the target_bin that minimizes the KL divergence. Since the search runs in a loop from 128 --> 2048, the implementation can be written as follows:
```cpp
// the best num of bin
if (kl_divergence < min_kl_divergence)
{
    min_kl_divergence = kl_divergence;
    target_threshold = threshold;
}
```
This yields the desired target_bin, which is the target_threshold here.
2.5 Calculate the Scale
Once target_threshold has been calculated, computing the Scale is straightforward, like this:
```cpp
float act_scale = hist_edge[i][threshold_bin] / fake_quant_set; // fake_quant_set = 127
int act_zero_point = 0;
```
Again, since this is symmetric quantization, only the Scale needs to be computed, and Zero_point is always zero.
Then the activation quantization calibration table table_kl.scale can be saved. Again, the subsequent weight & bias quantization is consistent with min-max, which was covered in the previous article, so it is not repeated here.
This completes the walkthrough of a practical KL divergence quantization implementation. I hope my sharing is of some help to your learning.
[Public Account Repost] [Model Inference] Quantization Implementation Sharing (2): Detailed Explanation of the KL Symmetric Quantization Algorithm Implementation