In an era of fragmented reading, fewer and fewer people pay attention to the exploration and thinking behind each paper.
In this column, you will quickly get the highlights and pain points of each selected paper and keep up with the cutting-edge AI achievements.
The recommended paper notes for this issue come from PaperWeekly community user @yanjoy. The paper gives a comprehensive overview of compression methods for deep neural networks, which can be divided into parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. The performance, applications, advantages, and disadvantages of each class of methods are analyzed.
About the author: Xiao Yi, master's candidate at Peking University. His research interest is deep model compression and acceleration. Personal homepage: http://yanjoy.win
■ Paper | A Survey of Model Compression and Acceleration for Deep Neural Networks
■ Link | https://www.paperweekly.site/papers/1675
■ Authors | Yu Cheng / Duo Wang / Pan Zhou / Tao Zhang
Research background
Neural networks are not new: as early as the end of the last century, Yann LeCun and others successfully used them to recognize handwritten zip codes on mail. The concept of deep learning was first proposed by Geoffrey Hinton et al., and in 2012 Krizhevsky et al. used a deep learning model to win the ImageNet image classification competition, beating the second-place entry, which was based on traditional hand-crafted features, by about 10% in accuracy.
Since then, computer vision competitions have been dominated by deep learning models. These models rely on deep networks with millions or even billions of parameters, which traditional CPUs cannot handle; only GPUs with high computing power can train such networks reasonably quickly.
As mentioned above, the winning model used a network with 60 million parameters, comprising five convolutional layers and three fully connected layers. Even on the then top-performing NVIDIA K40 GPU, training the entire model typically took two to three days. For large networks dominated by fully connected layers, the parameter count can even reach the billions.
Of course, to tame the parameter count of fully connected layers, people add more convolutional layers and shrink the fully connected part. The side effect is that computation time and energy consumption increase considerably.
For larger neural networks with more layers and nodes, reducing their storage and computing costs becomes critical, especially for real-time applications such as online learning, incremental learning, and autonomous driving.
At the other end of deep learning, the mobile end that is closer to everyday life, getting deep models to run on mobile devices is another important goal of model compression and acceleration.
Krizhevsky made two observations in his 2014 article: convolutional layers account for about 90-95% of the computation time but only a small share of the parameters, and those parameter values are large; fully connected layers account for about 5-10% of the computation time and roughly 95% of the parameters, and their values are small. This provides a statistical basis for compressing and accelerating deep models.
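The parameter side of this observation is easy to check empirically. As a rough illustration (not from the paper), the snippet below counts convolutional versus fully connected parameters in torchvision's AlexNet definition:

```python
import torch.nn as nn
import torchvision

# Build the AlexNet architecture only (no pretrained weights needed for counting).
model = torchvision.models.alexnet()

conv_params = sum(p.numel() for m in model.modules()
                  if isinstance(m, nn.Conv2d) for p in m.parameters())
fc_params = sum(p.numel() for m in model.modules()
                if isinstance(m, nn.Linear) for p in m.parameters())

# The fully connected layers hold the vast majority of AlexNet's ~61M parameters.
print(f"conv params: {conv_params:,}")
print(f"fc params:   {fc_params:,}")
```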
A typical example is ResNet-50: with its 50 convolutional layers, it requires over 95 MB of memory and 3.8 billion floating-point operations. After discarding some redundant weights, the network still works as before, yet saves more than 75% of the parameters and 50% of the computation time.
Of course, truly compressing and accelerating network models requires a joint, multidisciplinary effort: besides compression algorithms, data structures, computer architecture, and hardware design all play a significant role. This article introduces and compares different deep model compression methods.
Research status
Existing deep model compression methods can be divided into four categories:
- Parameter pruning and sharing
- Low-rank factorization
- Transferred/compact convolutional filters
- Knowledge distillation
Methods based on parameter pruning and sharing exploit the redundancy in model parameters and try to remove redundant, unimportant ones. Low-rank factorization-based techniques use matrix/tensor decomposition to estimate the informative parameters of deep models. Approaches based on transferred/compact convolutional filters design special structured convolutional filters to reduce storage and computational complexity. Knowledge distillation methods learn a distilled model, training a more compact neural network to reproduce the output of a larger network.
In general, parameter pruning and sharing, low-rank factorization, and knowledge distillation can be applied to both the fully connected and convolutional layers of a CNN, whereas methods based on transferred/compact convolutional filters only support convolutional layers.
Low-rank factorization and transferred/compact convolutional filter-based approaches provide an end-to-end pipeline that can be easily implemented in CPU/GPU environments.
In contrast, parameter pruning and sharing relies on techniques such as vector quantization, binary coding, and sparsity constraints, and usually takes several steps to reach the goal.
▲ Table 1: A brief comparison of the different methods
With respect to training protocols, models based on parameter pruning/sharing and low-rank factorization can be trained either from pre-trained models or from scratch, so they are flexible and efficient. Transferred/compact convolutional filters and knowledge distillation, however, only support training from scratch.
These approaches are designed independently but complement each other. For example, transferred layers can be combined with parameter pruning and sharing, and model quantization and binarization can be combined with low-rank approximation to achieve further acceleration.
A brief comparison of the different models is shown in Table 1. These methods are briefly introduced and discussed below.
Parameter pruning and sharing
Depending on how redundancy (information redundancy or parameter-space redundancy) is reduced, parameter pruning and sharing methods can be further divided into three categories: quantization and binarization, pruning and sharing, and structural matrices.
Quantization and binarization
▲ Figure 2: The three-stage compression method: pruning, quantization, and Huffman coding
Network quantization compresses the original network by reducing the number of bits needed to represent each weight. Gong et al. applied k-means quantization to the parameter values, and Vanhoucke et al. used 8-bit quantization of the parameters to achieve significant speedup with minimal loss of accuracy.
Han et al. proposed a complete deep-network compression pipeline: first prune the unimportant connections and retrain the sparsely connected network; then quantize the weights of the remaining connections via weight sharing; finally, apply Huffman coding to the quantized weights and the codebook to further compress the model. Figure 2 shows this three-stage method: pruning, quantization, and Huffman coding.
Pruning reduces the number of weights that need to be encoded, while quantization and Huffman coding reduce the number of bits used to encode each weight. A sparse representation can be used for the mostly zero matrices to further reduce storage, and this compression mechanism causes no loss of accuracy. The paper won the Best Paper award at ICLR 2016.
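To make the quantization/weight-sharing step more concrete, here is a minimal sketch of k-means weight sharing on a single weight matrix. The layer size, number of clusters, and use of scikit-learn's KMeans are illustrative assumptions, not the authors' actual implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_weights(weights, n_clusters=256):
    """Cluster a layer's weights into n_clusters shared values (k-means weight sharing).

    Storing small integer indices (8 bits for 256 clusters) plus the tiny codebook
    replaces the original 32-bit floats.
    """
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(flat)
    codebook = km.cluster_centers_.flatten()        # shared weight values
    indices = km.labels_.reshape(weights.shape)     # per-weight codebook index
    return codebook, indices

def dequantize(codebook, indices):
    """Rebuild the (approximate) weight matrix from codebook + indices."""
    return codebook[indices]

# Toy usage on a random "layer"
W = np.random.randn(64, 64).astype(np.float32)
codebook, idx = quantize_weights(W, n_clusters=16)
W_hat = dequantize(codebook, idx)
print("max abs error:", np.abs(W - W_hat).max())
```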
With more quantization levels, accuracy is maintained well, but binary quantization loses significant accuracy on large CNNs such as GoogLeNet. Another drawback is that existing binarization methods rely on simple matrix approximations and ignore the effect of binarization on the accuracy loss.
Pruning and sharing
Network pruning and sharing were originally used to address overfitting, but are now more commonly used to reduce network complexity.
Early pruning methods include Biased Weight Decay, Optimal Brain Damage, and Optimal Brain Surgeon; the latter two reduce the number of connections based on the Hessian of the loss function. Their studies showed that such pruning is more accurate than magnitude-based pruning such as Weight Decay. A recent trend in this direction is to prune redundant, non-informative weights in pre-trained CNN models.
Training compact CNNs under sparsity constraints is also becoming more popular; the constraints are usually introduced into the optimization problem as L0- or L1-norm regularizers.
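As a rough sketch of this sparsity-constraint idea (with a toy model, penalty weight, and threshold that are arbitrary assumptions, not taken from the survey), one can add an L1 penalty on the weights during training and then prune weights whose magnitude falls below a threshold:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
l1_lambda = 1e-4  # strength of the sparsity-inducing L1 penalty (assumed)

def train_step(x, y):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    # L1 norm of all weights pushes many of them toward zero
    l1 = sum(p.abs().sum() for p in model.parameters())
    (loss + l1_lambda * l1).backward()
    optimizer.step()

def prune_by_magnitude(threshold=1e-2):
    """Zero out (prune) weights whose magnitude is below the threshold."""
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(p.abs() >= threshold)
```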
There are some potential problems with the pruning and sharing approach. First, if L0 or L1 regularization is used, the pruning method requires more iterations to converge. In addition, all pruning methods require manually setting per-layer hyperparameters, which can be cumbersome in some applications.
▲ Figure 3: Pruning and sharing
Designing structural matrices
The principle of this method is simple: if an m×n matrix can be described with fewer than m×n parameters, it is a structured matrix. Such structure typically not only reduces memory consumption but also significantly speeds up inference and training through fast matrix-vector multiplication and gradient computation.
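A commonly cited concrete instance of a structured matrix is the circulant matrix: an m×m circulant layer is fully determined by its first column (m parameters instead of m²), and its matrix-vector product can be computed with FFTs in O(m log m) time. The sketch below only illustrates this arithmetic and is not taken from the paper:

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix defined by first column c with vector x.

    Uses the circular-convolution identity: C @ x = ifft(fft(c) * fft(x)).
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# Check against the explicit dense circulant matrix
m = 8
c = np.random.randn(m)
x = np.random.randn(m)
C = np.stack([np.roll(c, k) for k in range(m)], axis=1)  # dense m*m circulant
assert np.allclose(C @ x, circulant_matvec(c, x))
```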
A potential problem with this approach is that the structural constraint can hurt accuracy, since it may bias the model. Moreover, finding an appropriate structured matrix is difficult and there is no principled way to derive one, so the method has not been widely adopted.
Low-rank factorization and sparsity
A typical CNN convolution kernel is a 4D tensor, and a fully connected layer can be viewed as a 2D matrix, so low-rank decomposition applies to both; these tensors often contain a great deal of redundancy. The approximation is done layer by layer: once a layer has been approximated by low-rank filters, its parameters are fixed, and the layers above are fine-tuned according to a reconstruction error criterion. This is the typical low-rank method for compressing 2D convolutional layers, as shown in Figure 4.
▲ Figure 4: Typical low-rank method for compressing 2D convolution layers
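For the 2D (fully connected) case, the basic building block is a truncated SVD: an m×n weight matrix W is replaced by two factors of rank r, cutting the parameter count from m·n to r·(m+n). A minimal sketch, with sizes and rank chosen only for illustration:

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) by U_r @ V_r with U_r (m x rank) and V_r (rank x n),
    via truncated SVD, reducing parameters from m*n to rank*(m+n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # absorb singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

# Note: a trained layer is usually far more redundant than this random matrix,
# so in practice a modest rank already gives a small reconstruction error.
W = np.random.randn(1024, 1024)
U_r, V_r = low_rank_factorize(W, rank=64)
print("original params:", W.size, " factorized params:", U_r.size + V_r.size)
print("relative error:", np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W))
```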
Using low-rank filters to accelerate convolution has a long history; for example, high-dimensional DCT (discrete cosine transform) and wavelet systems are built from tensor products of 1D DCTs and 1D wavelets, respectively.
Rigamonti et al. proposed learning separable 1D filters, following the idea of dictionary learning. Jaderberg's work proposes a different tensor decomposition scheme that achieves a 4.5× speedup with only a 1% drop in text-recognition accuracy.
A flattened structure transforms the original 3D convolution into three 1D convolutions, reducing the parameter complexity from O(XYC) to O(X+Y+C) and the computational complexity from O(mnCXY) to O(mn(X+Y+C)).
As noted above, the low-rank approximation proceeds layer by layer: after the parameters of one layer are determined, the layers above are fine-tuned according to the reconstruction error criterion (Figure 4).
Along this line, Lebedev proposed a canonical polyadic (CP) decomposition of the kernel tensor, computed with nonlinear least squares. Tai proposed a new low-rank tensor decomposition algorithm for training low-rank-constrained CNNs from scratch, using batch normalization (BN) to transform the activations of the internal hidden units. In general, both the CP and the BN decomposition schemes can be used to train a CNN from scratch.
Low-rank methods work well for model compression and acceleration, but they are not easy to implement because they involve computationally expensive decomposition operations. Another issue is that current methods perform the low-rank approximation layer by layer and cannot compress parameters globally, since different layers carry different information. Finally, the decomposition requires substantial retraining to reach convergence.
Transferred/compact convolutional filters
Although strong theory is still lacking, a large amount of empirical evidence supports the importance of translation invariance and convolutional weight sharing for good prediction performance.
Compressing CNN models with transferred convolutional layers is inspired by Cohen's equivariant group theory. Let x be the input, Φ(·) the network or layer, and T(·) a transformation matrix. Then equivariance can be defined as:

T'(Φ(x)) = Φ(T(x))
That is, transforming the input x with T(·) and then passing it through the network or layer Φ(·) gives the same result as first mapping x through the network and then transforming the resulting representation. Note that T and T' may be different when applied to different objects. According to this theory, it is reasonable to apply a transformation to the layers or filters Φ(·) in order to compress the whole network model.
Using compact convolutional filters directly reduces computational cost. The decomposition of a 3×3 convolution into two 1×1 convolutions is used in the Inception structure; SqueezeNet proposes replacing 3×3 convolutions with 1×1 convolutions. Compared with AlexNet, SqueezeNet builds a compact neural network with roughly 50× fewer parameters and comparable accuracy.
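The arithmetic behind the "compact filter" argument is simple: a k×k convolution has k² times as many weights per input-output channel pair as a 1×1 convolution. A quick check (the channel sizes are arbitrary assumptions, not the exact Inception or SqueezeNet configuration):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

conv3x3 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
conv1x1 = nn.Conv2d(256, 256, kernel_size=1)

print("3x3 conv params:", n_params(conv3x3))  # 256*256*9 + 256 biases
print("1x1 conv params:", n_params(conv1x1))  # 256*256*1 + 256 biases (~9x fewer)
```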
This approach still has some issues. First, these methods are good at handling wide/flat architectures (such as VGGNet) but not narrow/special ones (such as GoogLeNet and residual networks). Second, the transfer assumption is sometimes too strong to guide the algorithm, making the results unstable on some datasets.
Knowledge distillation
Using knowledge transfer to compress models was first proposed by Caruana et al. They trained a compressed/ensemble model of strong classifiers on pseudo-labeled data to reproduce the output of the original large network, but the work was limited to shallow models.
The idea was later developed into knowledge distillation, which compresses deep and wide networks into shallower ones whose compressed model mimics the function learned by the complex model. The main idea is to transfer knowledge from the large teacher model to a small student model by learning the class distributions produced by softmax.
Hinton's work introduces a knowledge distillation compression framework that eases the training of deep networks by following a student-teacher paradigm, in which the student is penalized according to a softened version of the teacher's output. The student is trained to predict the teacher's output as well as the true classification labels. The method is very simple, yet it shows good results on various image classification tasks.
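A minimal sketch of the student-teacher loss: soften both teacher and student logits with a temperature T, and train the student to match the teacher's soft distribution in addition to the true labels. The temperature, weighting factor, and function below are illustrative assumptions rather than Hinton's exact setup:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted sum of (a) KL divergence between softened teacher/student
    distributions and (b) ordinary cross-entropy on the true labels."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # T*T compensates for the 1/T^2 scaling of gradients from the softened targets
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```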
Knowledge-distillation-based methods can make deep models shallower and significantly reduce computational cost. However, they have drawbacks: distillation can only be applied to classification tasks with a softmax loss function, which limits its use; moreover, the model assumptions are sometimes too strict, and performance is sometimes inferior to other methods.
Discussion and challenges
Compression and acceleration techniques for deep models are still at an early stage and face the following challenges:
- Most current methods depend on a pre-trained model, which limits the freedom to change the network configuration and is not reliable for complex tasks.
- Pruning that reduces the number of connections or channels between neurons is more effective for compression and acceleration, but it can seriously affect the input of the next layer.
- Methods based on structural matrices and transferred convolutional filters impose strong human priors on the model, which significantly affects its performance and stability; how to control the influence of such imposed priors deserves study.
- The knowledge distillation approach has many advantages, such as accelerating models directly without requiring specific hardware or implementations. Personally, I think it is related to transfer learning.
- The hardware limitations of various small platforms (e.g. mobile devices, robots, autonomous vehicles) remain a major obstacle to deploying deep CNNs. Model acceleration may matter even more than compression; dedicated chips are effective, but turning multiplications and additions into logical and shift operations is also a promising idea.
Problems and discussion from the quantization experiments
Selection of experimental framework
TensorFlow uses static graphs: once the model is defined, its structure cannot be modified, which makes stage-by-stage and layer-by-layer training difficult. PyTorch, by contrast, uses dynamic graphs that allow the model to be modified during training after it is defined, giving it great flexibility; this is where deep learning is heading.
The current nightly developer builds of TensorFlow also support defining dynamic graphs, but the feature is not yet generally usable. For research and experiments, PyTorch is therefore the first choice, while TensorFlow may be preferable for engineering and deployment.
The inefficiency of the clustering algorithm
Clustering is a prerequisite for quantization, and scikit-learn provides usable methods such as k-means, mean-shift, and Gaussian mixture models.
Among them, k-means is simple, effective, and relatively fast. On a personal computer the training networks have few parameters, so the running time is not noticeable. But when testing on a large network such as VGG, clustering the 4096×4096 fully connected layer into 256 cluster centers took more than 20 minutes (it was stopped before finishing), which seriously hurts the practicality of the algorithm.
The current workaround is sampling, for example clustering 1% of the parameters and using the resulting centers as the cluster centers for all parameters. With this trick, the clustering time drops below 5 minutes.
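The sampling workaround is straightforward to express with scikit-learn: fit k-means on a small random subset of the weights, then assign every weight to its nearest center. The code below is only a sketch of this idea (not the author's actual script), with the 1% ratio and 256 centers taken from the experiment described above:

```python
import numpy as np
from sklearn.cluster import KMeans

def sampled_kmeans_quantize(weights, n_clusters=256, sample_ratio=0.01, seed=0):
    """Fit k-means on a random sample of the weights, then use the resulting
    centers as the shared values for all weights."""
    flat = weights.reshape(-1, 1)
    rng = np.random.default_rng(seed)
    n_sample = max(n_clusters, int(len(flat) * sample_ratio))
    sample = flat[rng.choice(len(flat), size=n_sample, replace=False)]
    km = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed).fit(sample)
    indices = km.predict(flat).reshape(weights.shape)  # nearest center per weight
    return km.cluster_centers_.flatten(), indices

# e.g. a 4096 x 4096 fully connected layer
W = np.random.randn(4096, 4096).astype(np.float32)
codebook, idx = sampled_kmeans_quantize(W)
```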
Is there an easier way? For a well-trained model, especially one trained with L2 regularization, the parameter distribution can basically be regarded as Gaussian, and the distribution parameters can be obtained by fitting.
The question then becomes: for a specific Gaussian distribution, can the cluster centers be derived directly from its parameters?
If so, the clustering time would shrink dramatically: first bin the weights into a histogram to obtain the distribution over intervals, then fit a Gaussian to it, and finally derive the cluster centers from that function. There is some math involved here, so this is only a guess.
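Purely as a sketch of this guess (and not an established method), one could fit a Gaussian to the weights and place the k "cluster centers" at evenly spaced quantiles of the fitted distribution; whether such centers approximate the k-means solution well enough would still have to be verified experimentally:

```python
import numpy as np
from scipy.stats import norm

def gaussian_quantile_centers(weights, n_clusters=256):
    """Hypothetical shortcut: fit a Gaussian to the weights and take evenly
    spaced quantiles of the fitted distribution as the 'cluster centers'."""
    mu, sigma = norm.fit(weights.ravel())            # fit N(mu, sigma) to the weights
    qs = (np.arange(n_clusters) + 0.5) / n_clusters  # midpoints of equal-probability bins
    return norm.ppf(qs, loc=mu, scale=sigma)         # inverse CDF gives the centers

centers = gaussian_quantile_centers(np.random.randn(4096, 4096), n_clusters=16)
```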
The time cost of retraining
Everyday data compression works according to the distribution and probabilities of the source, in particular by building dictionaries to greatly reduce information redundancy.
Applied directly to a model, this approach is usually ineffective. One key reason is that although the parameters are nicely distributed, almost no two parameters are exactly equal. Compressing AlexNet with gzip only shrinks it from 233 MB to 216 MB, far below the compression ratios achieved on text and images. Traditional methods therefore struggle with deep learning models.
Pruning and quantization work well, but they have a cost that traditional compression does not: retraining time. The process artificially constrains the parameters and then lets the network learn to reduce the loss again; the original network cannot be recovered, but that hardly matters, because the result is what counts, not the parameters.
Like training itself, this process is painful and lengthy, and because it is done layer by layer, the retraining time grows with the number of layers.
For a simple task such as MNIST, quantizing and retraining a 512×512 weight matrix increases the time from the initial 8 s to 36 s, roughly a 4× increase (poor optimization of my own code cannot be ruled out). On a GPU server, the cost for AlexNet would be bearable, but for GoogLeNet or ResNet it would take a very long time.
During implementation, many operations cannot be expressed directly, or no convenient primitive can be found, so detours are needed that can seriously degrade performance.