Binarization network (BNN)
-
Boss: So what if we quantize to INT8! It's still not small enough! I want to put the AI model inside earphones and watches!! \
-
Employee: Then let's use a binary network!! Everything becomes zeros and ones!! \
Binarization networks, like low-bit quantization, aim to make models smaller, offering the most extreme compression ratio with very low computational overhead. So what does binary mean? It means that only two values, +1 and -1 (or 0 and 1), are used to represent the weights and activations of the neural network.
Compared with a full-precision (FP32) neural network, binarization can replace the complex FP32 multiply-accumulate operations of convolution with an extremely simple combination of XNOR (the XNOR gate in logic circuits) and population count (popcount), saving a large amount of memory and computation. This greatly facilitates deploying models on resource-constrained devices.
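To make this concrete, here is a minimal Python sketch (not any specific library's API; encoding -1 as bit 0 and +1 as bit 1 is just an illustrative convention, and the helper names are mine) showing that the dot product of two {-1, +1} vectors of length n can be computed as 2 * popcount(XNOR(a, b)) - n:

```python
# Dot product of two {-1, +1} vectors via XNOR + popcount.
# Encode -1 as bit 0 and +1 as bit 1; a_i * b_i = +1 exactly when the bits agree,
# i.e. when XNOR(a_i, b_i) = 1, so the dot product is (#matches) - (#mismatches).

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """a_bits / b_bits hold n binarized values packed as bits; returns sum(a_i * b_i)."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # XNOR, masked to the low n bits
    matches = bin(xnor).count("1")              # popcount
    return 2 * matches - n

# Sanity check against the ordinary multiply-accumulate.
a = [+1, -1, -1, +1, +1]
b = [+1, +1, -1, -1, +1]
pack = lambda v: sum(1 << i for i, x in enumerate(v) if x == +1)
assert binary_dot(pack(a), pack(b), len(a)) == sum(x * y for x, y in zip(a, b))
```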
With its high compression ratio and acceleration effect, BNN is a promising technology for bringing AI models, represented by deep neural networks, to resource- and power-constrained mobile devices.
However, BNN still has a number of shortcomings: model accuracy remains lower than full precision, it does not generalize well to more complex tasks, and it depends on specific hardware architectures and software frameworks. On the other hand, progress has been rapid: when BNN was first proposed in 2016 it reached only 27% Top-1 accuracy on ImageNet, while ReActNet-C, proposed in 2020, has pushed that to 71.4%!
Let's see where BNN sits in the full AI system / AI framework stack; the orange label in the figure marks its location. As the figure shows, the expression layer needs to provide API interfaces for binary network models, the middle layers and the runtime are not much different from the usual case, and most importantly the bottom layer needs to provide binary operators or dedicated hardware circuits for binary inference.
1. Basic introduction to BNN
BNN was first proposed by Courbariaux and Bengio et al. [1] in 2016. In that paper, stochastic gradient descent is used to train a neural network whose weights and activations are both binarized.
1.1 Forward calculation
To deal with gradient propagation when computing with binarized weights, a real-valued weight (FP32) is kept during training, and a sign function is applied to it to obtain the binarized weight, where $w$ is the FP32 value and $w_b$ is the binarized value:

$$ w_b = \mathrm{sign}(w) $$
where the sign function returns +1 whenever its input is greater than or equal to 0, and -1 otherwise:

$$ \mathrm{sign}(x) = \begin{cases} +1, & x \ge 0 \\ -1, & x < 0 \end{cases} $$
The figure below shows a 3x3 convolution between binarized weights and a binarized input. The convolution kernel and the corresponding window of the input data are tiled and flattened, an XNOR operation is applied element-wise, and a bit count (popcount) then yields the convolution result.
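As a rough illustration of this flow, here is a minimal NumPy sketch of a 3x3 binary convolution (the function and variable names are mine, not from the paper): binarize with sign, flatten each window and the kernel, XNOR, then popcount.

```python
import numpy as np

def binarize(x):
    """sign() with the convention sign(0) = +1, producing {-1, +1}."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def binary_conv2d_3x3(inp, weight):
    """Valid 3x3 convolution of a binarized input with a binarized kernel.

    Each output element is 2 * popcount(XNOR(patch, kernel)) - n, which equals
    the ordinary dot product of the two {-1, +1} vectors of length n = 9.
    """
    h, w = inp.shape
    w_bits = binarize(weight).ravel() > 0            # +1 -> True, -1 -> False
    n = w_bits.size
    bin_inp = binarize(inp)
    out = np.empty((h - 2, w - 2), dtype=np.int32)
    for i in range(h - 2):
        for j in range(w - 2):
            patch_bits = bin_inp[i:i + 3, j:j + 3].ravel() > 0
            xnor = ~(patch_bits ^ w_bits)            # element-wise XNOR
            out[i, j] = 2 * np.count_nonzero(xnor) - n
    return out

# Sanity check against a full-precision convolution of the binarized tensors.
rng = np.random.default_rng(0)
x, k = rng.standard_normal((5, 5)), rng.standard_normal((3, 3))
ref = np.array([[np.sum(binarize(x)[i:i + 3, j:j + 3] * binarize(k))
                 for j in range(3)] for i in range(3)])
assert np.array_equal(binary_conv2d_3x3(x, k), ref)
```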
1.2 Back Propagation
As in quantization-aware training, the sign function is not differentiable at 0, and its derivative is 0 everywhere else, so the gradient cannot be propagated through it directly. The paper therefore proposes the Straight-Through Estimator (STE): when the gradient propagating backwards reaches the sign function, the function is simply skipped, i.e. the gradient with respect to the binarized weight is passed straight to the real-valued weight:

$$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial w_b}, \qquad w_b = \mathrm{sign}(w) $$
With the straight-through estimator (STE), a binary neural network can be trained with the same gradient-descent procedure as a full-precision network, and the weight parameters can be updated with common optimizers. However, since the real-valued weights are not otherwise constrained during training, they may grow to large values, and the quantization error between them and the binarized weights would then keep increasing. The paper therefore clips the real-valued weights to the range [-1, +1] during training, so that the deviation between the real-valued weights and the binarized weights never becomes too large:

$$ w \leftarrow \mathrm{clip}(w, -1, +1) = \max\bigl(-1, \min(+1, w)\bigr) $$
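To show how STE and weight clipping fit into an ordinary training loop, here is a minimal PyTorch-style sketch (PyTorch is my choice of framework here, and the class name and the toy loss are only placeholders, not the paper's code):

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Forward: w_b = sign(w) with sign(0) = +1. Backward: pass the gradient straight through."""

    @staticmethod
    def forward(ctx, w):
        return torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat sign() as the identity in the backward pass.
        return grad_output

# One illustrative training step: binarize the latent FP32 weight, compute a loss,
# update the FP32 weight with an ordinary optimizer, then clip it to [-1, +1].
w = torch.randn(16, requires_grad=True)      # latent real-valued weight kept during training
opt = torch.optim.SGD([w], lr=0.1)

w_b = BinarizeSTE.apply(w)                   # binarized weight used in the forward pass
loss = (w_b * torch.randn(16)).sum()         # stand-in for the real network loss
loss.backward()
opt.step()
with torch.no_grad():
    w.clamp_(-1.0, 1.0)                      # keep |w| <= 1 so the quantization error stays bounded
```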
Since extra operations are inserted into the FP32 training process, training certainly takes longer, and the final experimental accuracy is still below FP32, so what is the point?
In fact, the biggest payoff is in the forward pass: as shown in the figure, 1-bit XNOR and popcount operations replace the FP32 multiply-accumulate operations of convolution. In actual deployment and inference, this not only cuts parameter storage by a factor of 32, but also makes the model run dramatically faster!
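The 32x storage figure simply comes from packing one bit per weight instead of 32 bits; a quick NumPy sketch (the layer size of 4096 weights is an arbitrary example):

```python
import numpy as np

# 4096 FP32 weights occupy 4096 * 4 bytes = 16 KiB.
fp32_weights = np.random.randn(4096).astype(np.float32)

# Binarized to {-1, +1}, each weight needs only one bit: +1 -> 1, -1 -> 0.
bits = fp32_weights >= 0
packed = np.packbits(bits)                   # 8 weights per byte

print(fp32_weights.nbytes)                   # 16384 bytes
print(packed.nbytes)                         # 512 bytes, i.e. 32x smaller
```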
3. BNN network structure
In recent years, a variety of binary neural network methods have been proposed, ranging from naive binarization, which uses a predefined function to directly quantize the weights and inputs, to optimization-based binarization, which attacks the problem from several angles: approximating the full-precision values by minimizing the quantization error, constraining the weights by modifying the network loss function, and learning the discrete parameters by reducing the gradient error.
Among them, "Binary Neural Networks: A Survey" [2], a recent review article from Beihang University, gives a good overview of many binarization network models. Below, ZOMI briefly introduces the BNN network structures that I find most exciting.
4. Hardware implementation
Looking at the computation flow of binary networks, the main source of BNN acceleration is replacing the expensive multiply-accumulate (MAC) operations used in traditional convolution algorithms with XNOR and popcount operations.
General-purpose x86 computing architectures are essentially optimized for FP32 full-precision data, so the benefit of deploying a BNN directly on a general-purpose x86 platform is not obvious; it may see no acceleration at all, and can even run slower than the equivalent FP32 network model.
Below is a brief look at the ARM CPU and the FPGA platforms respectively.
ARM CPU
BNN deployment currently focuses mainly on mobile. BMXNet (2017) [3] is an open-source binary framework based on MXNet, developed by researcher Haojin Yang at the Hasso Plattner Institute in Germany. It supports training with cuDNN and inference with the binary operators XNOR and popcount. The downside is that the binary kernels are not specially tuned, so inference on an ARM CPU is not particularly fast.
daBNN (2019) [4] is a BNN inference tool with assembly-level tuning released by the JD AI Research Institute. It improves BNN inference speed on ARM, but the tool cannot be used for model training; you need other tools to train the model.
BMXNet-v2 (2019) [9] is the second version, open-sourced by Bethge, Yang, et al., with support for the Gluon API. A series of improvements in the framework greatly reduce the difficulty of model training and the cost of staying in sync with MXNet. The second release not only improves efficiency, but also continues to support model compression and binary inference so that models can be deployed on a variety of edge devices.
FPGA and ASIC
Compared with traditional CPUs, FPGAs have flexible hardware architectures and can support efficient bit-wise computation with low power consumption; ASICs take this to the extreme and can be even more efficient and energy-saving than FPGAs.
At present, FPGA-based AI accelerators are mainly designed with Xilinx devices and development tools, and Xilinx has also designed FINN, an architecture dedicated to binary neural networks. Developers can use High-Level Synthesis (HLS) tools to develop in C and deploy the binarized model directly on the FPGA.
Conclusion
Although BNN has made great progress over the last five years, the big problem is that the loss of accuracy is still a headache, especially for large networks and datasets. The main reasons may include:
1) There is as yet no SOTA binarization network model, so it remains unclear what kind of network structure is well suited to binarization;
2) Even with gradient estimators or approximate functions for binarization, optimizing a binary network in a discrete space remains a challenge.
In addition, as mobile devices become ever more widely used, more work will target these applications so that different tasks and models can be deployed on different hardware. For example, tap and touch analysis in earphones is mostly a signal-classification problem that does not actually need a large model; a binary network can classify the signal data with high accuracy to decide, for instance, whether a double tap should stop the music or a drag should turn up the volume.
Finally, research on interpretable machine learning shows that there are critical paths in neural network inference, and that different network structures follow different patterns. It is therefore of great significance to design mixed-precision strategies according to the importance of each layer, and to design new network structures that are friendly to the information flow of binary neural networks.
References
- [1] Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint arXiv:1602.02830 (2016).
- [2] Qin, Haotong, et al. "Binary neural networks: A survey." Pattern Recognition 105 (2020): 107281.
- [3] Yang, Haojin, et al. "BMXNet: An open-source binary neural network implementation based on MXNet." Proceedings of the 25th ACM International Conference on Multimedia. 2017.
- [4] Zhang, Jianhao, et al. "daBNN: A super fast inference framework for binary neural networks on ARM devices." Proceedings of the 27th ACM International Conference on Multimedia. 2019.
- [5] zhuanlan.zhihu.com/p/27