Overparameterization mainly refers to the fact that, during training, the network mathematically requires a very large number of parameters (solved for through gradient-based optimization) to capture the subtle variations in the data; once iterative training is complete, far fewer parameters are needed for model inference. Pruning algorithms are built on this theory of over-parameterization.

The core idea of pruning is to reduce the number of parameters and the amount of computation in the network model while keeping the model's performance as unaffected as possible.

In an AI framework, pruning mainly serves the on-device model inference scenario (the lower-right corner of the framework diagram): making the model small enough that tablets, mobile phones, watches, earphones, and other small IoT devices can run the AI model easily. In practice, however, the pruning algorithms and pruning APIs provided by the framework are mostly applied during the training process.

Pruning algorithm classification

At a macro level, most pruning algorithms derive from two classical approaches, Drop Out and Drop Connect, as shown in the figure below.

1) Drop Out: randomly zero the outputs of some neurons; this is known as neuron pruning.

2) Drop Connect: randomly zero the connections between some neurons, so that the weight matrix becomes sparse.
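To make the distinction concrete, here is a minimal PyTorch-style sketch (the tensor sizes and the roughly 50% keep ratio are illustrative assumptions, not part of any specific algorithm): Drop Out zeros whole neuron outputs, while Drop Connect zeros individual weight connections.

```python
import torch

torch.manual_seed(0)

x = torch.randn(4, 8)    # a batch of 4 inputs with 8 features
w = torch.randn(8, 8)    # weight matrix of a fully connected layer

# Drop Out style: randomly zero whole neuron outputs.
out = x @ w
neuron_mask = (torch.rand(out.shape[1]) > 0.5).float()   # keep roughly half the neurons
out_dropout = out * neuron_mask                           # entire output columns become 0

# Drop Connect style: randomly zero individual weight connections,
# which makes the weight matrix itself sparse.
weight_mask = (torch.rand_like(w) > 0.5).float()
out_dropconnect = x @ (w * weight_mask)

print(out_dropout[0])     # masked neuron outputs
print(weight_mask[:2])    # element-wise mask over connections
```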

Structured pruning vs. unstructured pruning

There are more ways of classifying pruning, which may be a little more involved. By granularity, pruning can be divided into structured pruning and unstructured pruning. Here are four specific pruning granularities:

1) Fine-grained pruning: pruning of individual connections or neurons. Drop Out and Drop Connect above are both forms of fine-grained pruning.

2) Vector-level pruning: slightly coarser than fine-grained pruning; it prunes vectors inside a convolution kernel (intra-kernel pruning).

3) Kernel-level pruning: removes an entire 2-D convolution kernel, discarding the computation of that kernel on its corresponding input channel.

4) Filter-level pruning: prunes an entire filter (the full set of kernels for one output channel), so the number of output feature channels changes during inference (see the sketch after this list).
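The four granularities can be visualized as masks of different shapes over a single convolution weight tensor. The following is a minimal sketch (the layer sizes and the specific indices being pruned are arbitrary assumptions for illustration):

```python
import torch

# Weight tensor of one convolution layer: [out_channels, in_channels, kH, kW].
weight = torch.randn(32, 16, 3, 3)
mask = torch.ones_like(weight)

mask[4, 7, 1, 2] = 0     # fine-grained: a single connection is removed
mask[4, 7, 1, :] = 0     # vector-level: one vector (row) inside a kernel
mask[4, 7, :, :] = 0     # kernel-level: the whole 3x3 kernel for one input channel
mask[4, :, :, :] = 0     # filter-level: the entire filter, so output channel 4 disappears

pruned_weight = weight * mask
```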

Fine-grained, vector-level, and kernel-level pruning strike a balance between parameter count and model performance, but they change the connectivity of individual layers in irregular ways, so special algorithms or hardware structures are required to support the resulting sparse operations; this is known as unstructured pruning.

Unstructured pruning can achieve a higher compression ratio while keeping the model's accuracy high, but it produces a sparse network model. Such sparse structures are unfriendly to hardware-accelerated computation; unless the underlying hardware and compute libraries provide good support for sparse computation, it is difficult to obtain a substantial speed-up after pruning.

Filter-level pruning mainly changes the number of filters and feature channels in the network, and the resulting model can run without special algorithms or hardware; this is known as structured pruning. Structured pruning can be further subdivided into channel-wise, filter-wise, and shape-wise pruning.
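As a concrete illustration of filter-level structured pruning, the sketch below removes half of the filters of a convolution layer and builds a physically smaller dense layer (the layer sizes and the L1-norm ranking criterion are illustrative assumptions):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)

# Rank filters by the L1 norm of their weights and keep the strongest half.
scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))        # one score per filter
keep = torch.argsort(scores, descending=True)[:16]

# Build a physically smaller layer: no sparse kernels or special hardware needed,
# but the next layer must be rebuilt with 16 input channels instead of 32.
pruned = nn.Conv2d(16, len(keep), kernel_size=3, padding=1)
with torch.no_grad():
    pruned.weight.copy_(conv.weight[keep])
    pruned.bias.copy_(conv.bias[keep])

x = torch.randn(1, 16, 8, 8)
print(conv(x).shape)     # torch.Size([1, 32, 8, 8])  -- original output channels
print(pruned(x).shape)   # torch.Size([1, 16, 8, 8])  -- fewer output channels
```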

Structured pruning, in contrast to unstructured pruning, directly changes the structure of the network model to achieve compression. For example, the student network in knowledge distillation, models found by NAS, or slimming a model such as VGG19 down to VGG16 can also be regarded as a disguised form of structured pruning.

Pruning algorithm flow

Although there appear to be many categories of pruning algorithms, the core idea is the same: prune the neural network model. The general pruning flow in use today is largely the same and can be divided into three types: standard pruning, pruning based on sub-model sampling, and search-based pruning, as shown in the figure below.

Standard pruning algorithm flow

Standard pruning is currently the most popular pruning flow, with standard interfaces available in TensorFlow and PyTorch. It has three main parts: training, pruning, and fine-tuning.
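For example, PyTorch exposes such interfaces in its torch.nn.utils.prune module; the snippet below is a minimal sketch of its use (the layer sizes and pruning ratios are arbitrary assumptions):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

conv1 = nn.Conv2d(16, 32, kernel_size=3)
conv2 = nn.Conv2d(32, 64, kernel_size=3)

# Unstructured magnitude pruning: zero the 50% smallest weights of conv1.
# This attaches a weight_mask buffer and reparameterizes conv1.weight.
prune.l1_unstructured(conv1, name="weight", amount=0.5)

# Structured pruning: zero 25% of conv2's filters (output channels), ranked by L2 norm.
prune.ln_structured(conv2, name="weight", amount=0.25, n=2, dim=0)

# After fine-tuning, fold the masks into the weight tensors permanently.
prune.remove(conv1, "weight")
prune.remove(conv2, "weight")
```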

1) Training: first train the network model. In the pruning pipeline, training mainly refers to pre-training, whose purpose is to provide the pruning algorithm with an original model trained to SOTA on the specific target task.

2) Pruning: here any of the algorithms above can be applied, such as fine-grained, vector-level, kernel-level, or filter-level pruning. The most important step is to evaluate the structure of the pruned network model: identify the layers that need pruning and set a pruning threshold or ratio. In implementation, the code adds a Mask matrix of the same size as the parameter matrix; the Mask contains only 0s and 1s and is actually used during fine-tuning (a minimal sketch of this Mask mechanism follows after this list).

3) Fine-tuning: fine-tuning is a necessary step to restore the expressive power of a model affected by pruning. Structured model pruning adjusts the original model structure; the pruned model keeps the original parameter values, but because the structure has changed, its expressive power is affected to some extent. In implementation, the network is fine-tuned with the parameters multiplied by the Mask during computation. Parameters whose Mask value is 1 continue to be trained, with gradients adjusted through back-propagation, while parameters whose Mask value is 0 always output 0 and therefore have no influence on the subsequent layers.

4) Pruning again: the fine-tuned network model is sent back into the pruning module for another round of model structure evaluation and pruning. The goal is that each round of pruning operates on a better-performing model, iteratively optimizing the pruned model until it meets the pruning target.
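The Mask mechanism described in steps 2) and 3) can be sketched as follows. This is a minimal illustration only; the MaskedLinear class, the magnitude criterion, and all sizes are assumptions for the sketch, not a framework API.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """A linear layer whose weights are multiplied by a fixed 0/1 Mask.

    Weights with Mask value 0 always contribute 0 to the output, so their
    gradients through this product are also 0 and they stay pruned while the
    remaining weights are fine-tuned."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.register_buffer("mask", torch.ones_like(self.linear.weight))

    def prune_by_magnitude(self, ratio):
        # Set the Mask to 0 for the `ratio` fraction of smallest-magnitude weights.
        w = self.linear.weight.detach().abs()
        threshold = torch.quantile(w.flatten(), ratio)
        self.mask.copy_((w > threshold).float())

    def forward(self, x):
        return nn.functional.linear(x, self.linear.weight * self.mask, self.linear.bias)

# Fine-tuning sketch: only weights with Mask value 1 receive useful gradients.
layer = MaskedLinear(64, 10)
layer.prune_by_magnitude(0.7)                      # prune 70% of the connections
opt = torch.optim.SGD(layer.parameters(), lr=0.01)
x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(layer(x), y)
loss.backward()
opt.step()
```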

Finally, when the pruned model parameters are stored, because they are highly sparse, a custom data structure can be used to store only the non-zero values and their positions in the matrix; when the parameters are read back, the full matrix can be reconstructed.
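For instance, a sparse parameter matrix can be kept in COO form, storing only the non-zero values and their positions, and reconstructed on load. A minimal sketch (the example matrix is arbitrary):

```python
import torch

dense = torch.tensor([[0.0, 0.7, 0.0],
                      [0.0, 0.0, 0.0],
                      [0.3, 0.0, 0.9]])

# Store only the non-zero values and their (row, col) positions.
indices = dense.nonzero().t()          # 2 x nnz index matrix
values = dense[dense != 0]             # the nnz non-zero values

# Restore the dense matrix when the parameters are loaded again.
restored = torch.sparse_coo_tensor(indices, values, dense.shape).to_dense()
assert torch.equal(dense, restored)
```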

Pruning flow based on sub-model sampling

In addition to standard pruning, pruning based on sub-model sampling, for example EagleEye ("EagleEye: Fast Sub-net Evaluation for Efficient Neural Network Pruning"), has recently also shown good pruning results. After the trained model is obtained, sub-model sampling is carried out. The sampling process for a sub-model is as follows:

1) The prunable structures in the trained original model are sampled according to the pruning target. Sampling can be random, or probabilistic according to the importance of each network structure, for example computed via KL divergence.

2) The sampled network structures are pruned to obtain a candidate sub-model. Sub-model sampling is usually performed several times to obtain multiple sub-models (≥ 1), and the performance of each sub-model is then evaluated. After evaluation, the best sub-model is selected and fine-tuned to obtain the final pruned model.
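A minimal sketch of the sampling-and-evaluation loop is shown below. All function names are hypothetical, and the evaluation step is only a stand-in for pruning the trained model with the sampled configuration and measuring validation accuracy (EagleEye additionally recalibrates BatchNorm statistics before measuring).

```python
import random

random.seed(0)

def sample_submodel_config(num_layers, min_keep=0.3, max_keep=1.0):
    """Randomly sample a per-layer keep ratio, i.e. one candidate sub-model."""
    return [random.uniform(min_keep, max_keep) for _ in range(num_layers)]

def evaluate(config):
    """Stand-in for the real evaluation: prune the trained model according to
    `config` and return validation accuracy. A dummy score keeps the sketch runnable."""
    return random.random()

# Sample several candidate sub-models, evaluate each, keep the best for fine-tuning.
candidates = [sample_submodel_config(num_layers=20) for _ in range(50)]
best_config = max(candidates, key=evaluate)
print(best_config[:5])
```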

Search-based pruning process

Search-based pruning mainly relies on a series of learning algorithms such as reinforcement learning, as well as theory related to neural architecture search.

Given a pruning target, search-based pruning searches for the optimal substructure within the network. The search process is often accompanied by learning of the network parameters, so some search-based pruning algorithms do not need fine-tuning after pruning.

The development of pruning

In recent years, neural network pruning, as one of the four pillars of model compression, has attracted more and more attention, and better methods for selecting what and how to prune will keep emerging. In addition, from the perspective of trends, the following directions are worth watching:

Breaking fixed assumptions: challenging existing, fixed assumptions, such as the ICLR 2019 best paper "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks". It is also worth reflecting on whether over-parameterization, and the reuse of pre-trained parameters, is really as useful as assumed earlier. This kind of work gives pruning algorithms great inspiration and may fundamentally change how the problem is solved.

Automated pruning: with the wave of AutoML, more and more algorithms are becoming automated, and model compression is no exception. As introduced earlier, ADC, RNP, and N2N Learning are all attempts to automate parts of the pruning process. Similarly, in quantization, "HAQ: Hardware-Aware Automated Quantization" observes that different layers in a network have different degrees of information redundancy, and therefore automatically assigns mixed quantization bit-widths for compression.

Fusion with NAS: as mentioned in the pruning flows above, the boundary between pruning algorithms and neural architecture search (NAS) has been blurring. NAS already has search methods oriented toward structured pruning, such as one-shot architecture search, which starts from a large super-network and then does subtraction. NAS and model compression, two branches that at first seemed unrelated, have been converging in recent years driven by downstream tasks and deployment scenarios; the two intersect more and more today and will surely strike more sparks.

Fusion with GANs: GANs, one of the most popular branches of machine learning in recent years, keep penetrating existing fields and have begun to show up in pruning. For example, "Towards Optimal Structured CNN Pruning via Generative Adversarial Learning" (2019) lets a generator produce a pruned network while a discriminator judges whether a network is the original or the pruned one, enabling more effective structured pruning.

Hardware support for sparsity: pruning makes the neural network model sparse, and sparse parameters require a large amount of index bookkeeping during computation, so the pruned model cannot be accelerated directly. Although there are libraries such as cuSPARSE, the underlying AI accelerator chips themselves are generally not designed specifically for sparse data processing. If sparse computation and processing capability were built into the chips, computational efficiency would improve greatly. In 2021 alone, more than ten ASIC-based AI accelerator chips were launched in China; it is reasonable to expect that support for sparse scenarios will see breakthroughs in the future.
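The trade-off can be illustrated with a small sketch: a dense kernel multiplies the full matrix regardless of the zeros, while a sparse kernel (on GPU typically backed by libraries such as cuSPARSE) operates only on the non-zero entries at the cost of extra index bookkeeping. The matrix sizes and sparsity level below are arbitrary assumptions.

```python
import torch

# A pruned (sparse) weight matrix and a dense activation matrix.
weight = torch.randn(512, 512)
weight[weight.abs() < 1.0] = 0.0            # roughly 70% of entries become zero
activations = torch.randn(512, 256)

# A dense kernel ignores the zeros and performs the full multiplication.
dense_out = weight @ activations

# A sparse kernel works only on the non-zero entries plus their indices,
# so a speed-up is not guaranteed at moderate sparsity.
sparse_weight = weight.to_sparse()
sparse_out = torch.sparse.mm(sparse_weight, activations)

print(torch.allclose(dense_out, sparse_out, atol=1e-4))
```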

Conclusion

Model compression algorithms for existing models include tensor decomposition, model pruning, and model quantization. For newly constructed networks, methods include knowledge distillation, compact network design, and others.

Pruning is only one model compression method, and it does not conflict with the others; it will gradually be combined with quantization, distillation, NAS, reinforcement learning, and other methods, all of which are directions well worth studying. In addition, judging from the developments above, breaking inherent assumptions and fusing with NAS, GAN, AutoML, RL, and other techniques may eventually blur the boundaries of pruning itself, and the emergence of new paradigms or compression modes is also very attractive.