2017 | Network Slimming – PyTorch (pruning)

Official code (Torch): github.com/liuzhuang13… Paper: arxiv.org/abs/1708.06…

Third-party code (PyTorch): github.com/mengrang/Sl… github.com/foolwood/py…

Shock!! The proposed method can significantly reduce computational cost (by up to 20 times) on state-of-the-art networks without loss of accuracy, and can even improve accuracy.

1. Motivation:

As networks become deeper and wider, neural network models grow larger and larger. The goal is to obtain a relatively compact, small network by pruning, while introducing as little overhead as possible into the training process and producing a model that needs no special software/hardware accelerators. The authors call this method network slimming.

2. Main implementation process:

A scaling factor γ is introduced in the BN layer, one per channel, associated with the corresponding channel of the convolution layer. During training, L1 regularization is applied to these scaling factors to push them toward sparsity, so the network gradually learns to identify unimportant channels (the larger the scaling factor, the more important the channel; the smaller, the less important). Finally, the channels with small scaling factors are pruned away, as illustrated in the sketch below.
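A minimal sketch (my own, not from the paper or the linked repositories) of where this scaling factor lives in PyTorch: nn.BatchNorm2d already carries a learnable per-channel weight, which network slimming reuses as γ. The layer sizes are made up for illustration.

```python
import torch
import torch.nn as nn

# A typical Conv-BN-ReLU block; the BN layer's learnable `weight` is the
# per-channel scaling factor gamma that network slimming sparsifies and prunes.
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),   # .weight holds 64 gammas, one per channel
    nn.ReLU(inplace=True),
)

bn = block[1]
print(bn.weight.shape)    # torch.Size([64]) -- one scaling factor per channel
# After sparsity training, channels whose |gamma| falls below a global
# threshold are removed from both this conv and the layer that consumes it.
```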

3. Other related work:

Low-rank decomposition:

Techniques such as SVD are used to approximate the weight matrices in neural networks with low-rank matrices. This approach works particularly well on fully connected layers, compressing the model by about 3×, but it yields no significant speedup, since most of a CNN's computation comes from the convolution layers.

Weight quantization:

HashedNets proposes quantizing network weights: before training, the weights are hashed into different groups, and the weights within each group share a single value. Storing only the shared weights and the hash indices saves a lot of storage space. Improved quantization techniques are used in the deep compression pipeline, achieving compression rates of 35× to 49× on AlexNet and VGGNet. However, these techniques save neither runtime memory nor inference time, since during inference the shared weights need to be restored to their original positions. Real-valued weights can also be quantized to binary/ternary weights (values restricted to {−1, 1} or {−1, 0, 1}). This yields a much smaller model, and significant acceleration can be obtained with bitwise operation libraries. However, such extremely low-bit approximation usually comes with a moderate loss of accuracy.

Weight pruning/sparsification:

Pruning small-magnitude weights was proposed to remove unimportant connections in trained neural networks. The resulting network weights are mostly zero, so storage can be reduced by storing the model in a sparse format. However, these methods only achieve speedups with dedicated sparse matrix operation libraries and/or hardware, and the runtime memory savings are also very limited, since most of the memory is consumed by activation maps rather than weights. In [12] there is no guidance on sparsity during training. [32] overcomes this limitation by explicitly attaching a sparse gate variable to each weight and achieves high compression by pruning connections whose gate value is zero. This method achieves a better compression ratio than [12], but it shares the same drawbacks.

4. Principle:

Advantages of channel sparsity:

The coarsest, layer-level sparsity does not require special packages to obtain inference speedups, but it is less flexible because entire layers must be removed. In fact, removing layers is only effective when the network is deep enough, for example more than 50 layers. Channel-level sparsity, by contrast, offers a good compromise between flexibility and ease of implementation. It can be applied to any typical CNN or fully connected network (treating each neuron as a channel), and the pruned network is essentially a "thinned" version of the original that runs efficiently on conventional CNN inference platforms.

Pruning a channel essentially removes all connections into and out of that channel, so a narrow network is obtained without resorting to any special sparse computation packages, as illustrated in the sketch below. The scaling factors act as agents for channel selection: because they are jointly optimized with the network weights, the network can automatically identify unimportant channels and safely remove them without much impact on generalization performance.
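A hypothetical illustration (layer sizes and kept-channel indices are made up) of what removing a channel's incoming and outgoing connections looks like at the weight level: keeping a subset of one convolution's output channels also removes the matching input channels of the next convolution and the corresponding BN parameters.

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(16, 32, 3, padding=1, bias=False)
bn1   = nn.BatchNorm2d(32)
conv2 = nn.Conv2d(32, 64, 3, padding=1, bias=False)

keep = torch.tensor([0, 3, 7, 12, 20, 31])   # indices of channels we keep

# Outgoing side: keep only the selected output channels of conv1 and their BN stats.
new_conv1 = nn.Conv2d(16, len(keep), 3, padding=1, bias=False)
new_conv1.weight.data = conv1.weight.data[keep].clone()

new_bn1 = nn.BatchNorm2d(len(keep))
new_bn1.weight.data   = bn1.weight.data[keep].clone()
new_bn1.bias.data     = bn1.bias.data[keep].clone()
new_bn1.running_mean  = bn1.running_mean[keep].clone()
new_bn1.running_var   = bn1.running_var[keep].clone()

# Incoming side: the next conv loses the matching input channels.
new_conv2 = nn.Conv2d(len(keep), 64, 3, padding=1, bias=False)
new_conv2.weight.data = conv2.weight.data[:, keep].clone()
```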

Channel sparse relative challenges:

To achieve channel-level sparsity, all incoming and outgoing connections associated with a channel need to be pruned. This makes direct weight pruning on a pre-trained model ineffective, because it is unlikely that all the weights at both the input and output of a channel happen to be near zero. Pruning channels of a pre-trained ResNet can only remove about 10% of the parameters without loss of accuracy. To do better, the paper enforces sparsity regularization in the training objective.

Scaling factor and sparsity induced penalty:

To achieve the desired effect, a scaling factor γ is introduced for each channel and multiplied with that channel's output. The network weights and the scaling factors are then trained jointly, with a sparsity regularizer applied to the scaling factors. Here g(s) = |s| is chosen, i.e. the L1 norm widely used to induce sparsity, and the non-smooth L1 penalty term is optimized with (sub)gradient descent. Finally, the channels with small scaling factors are pruned and the pruned network is fine-tuned.

The training objective is

$$L = \sum_{(x,y)} l(f(x, W), y) + \lambda \sum_{\gamma \in \Gamma} g(\gamma)$$

where (x, y) denotes a training input and target, W the trainable weights, the first sum the normal training loss of the CNN, g(·) the sparsity-induced penalty on the scaling factors, and λ balances the two terms.
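A hedged sketch, assuming a standard PyTorch training loop, of how this penalty can be applied in practice: since g(γ) = |γ| is non-smooth, a common trick is to add the L1 subgradient λ·sign(γ) to the gradient of every BN scaling factor right after backward(). The function name and the λ value are illustrative.

```python
import torch
import torch.nn as nn

def add_l1_subgradient(model: nn.Module, lam: float = 1e-4):
    """Add the subgradient of lambda * |gamma| to every BN scaling factor."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.grad.data.add_(lam * torch.sign(m.weight.data))

# Inside the training loop (model, criterion, optimizer, x, y assumed to exist):
#   loss = criterion(model(x), y)        # normal CNN training loss
#   loss.backward()
#   add_l1_subgradient(model, lam=1e-4)  # sparsity-induced penalty on the gammas
#   optimizer.step()
```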

Why insert scaling factor γ in BN layer:

One big advantage of reusing the γ parameters of the BN layers is that this introduces no extra overhead into the network; it is arguably also the most effective way to learn meaningful scaling factors for channel pruning. 1) If we add a scaling layer to a CNN without BN, the values of the scaling factors say nothing about channel importance, because both the convolution layer and the scaling layer are linear transformations: the same output can be obtained by shrinking the scaling factor while amplifying the weights. 2) If we insert a scaling layer before the BN layer, the normalization in BN completely cancels the effect of the scaling. 3) If a scaling layer is inserted after the BN layer, each channel ends up with two consecutive scaling factors.

The BN layer transforms its input over a mini-batch B as

$$\hat{z} = \frac{z_{\text{in}} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad z_{\text{out}} = \gamma \hat{z} + \beta$$

where μ_B and σ_B are the mean and standard deviation of the input activations over B, and γ and β are trainable affine transformation parameters (scale and shift) that make it possible to linearly transform the normalized activations back to any scale.

Channel pruning and fine-tuning:

When the pruning ratio is high, pruning may temporarily cause some loss of accuracy. However, this can be largely compensated for by fine-tuning the pruned network. In the experiments, the fine-tuned narrow network even achieves higher accuracy than the original unpruned network in many cases.

5. Implementation:

For the CIFAR and SVHN datasets, when training with channel sparsity regularization, the hyperparameter λ that controls the trade-off between the empirical loss and sparsity is determined by a grid search on a CIFAR-10 validation set. For VGGNet, λ=10−4 is chosen; for ResNet, VGG-A, and DenseNet, λ=10−5. All other settings are kept the same as in normal training.

Pruning:

When pruning the channels of a model trained with sparsity, we need to determine a pruning threshold for the scaling factors. Unlike other channel pruning methods, where different layers are pruned at different rates, a single global pruning threshold is used for simplicity. The threshold is determined as a percentile of all scaling factors, e.g. so that 40% or 60% of the channels are pruned. The pruning itself is carried out by building a new, narrower model and copying the corresponding weights from the sparsity-trained model, as in the sketch below.
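A minimal sketch (my own, assuming the model is any sparsity-trained CNN built from Conv/BatchNorm2d layers) of the global-threshold step: gather every BN scaling factor, take the prune_ratio percentile as the threshold, and derive a keep/prune mask per BN layer. Copying weights into the narrower model then follows the channel-slicing pattern shown earlier.

```python
import torch
import torch.nn as nn

def compute_channel_masks(model: nn.Module, prune_ratio: float = 0.6):
    """Return a global threshold and, per BN layer, a boolean mask of channels to keep."""
    # Collect the absolute scaling factors of every BN channel in the network.
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    # Global threshold: the prune_ratio quantile of all scaling factors,
    # so that roughly prune_ratio of the channels fall below it.
    threshold = torch.quantile(gammas, prune_ratio)

    masks = [m.weight.data.abs() > threshold          # True = keep this channel
             for m in model.modules()
             if isinstance(m, nn.BatchNorm2d)]
    return threshold, masks
```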

Fine-tuning:

After pruning, we obtain a narrower, more compact model, which is then fine-tuned. On the CIFAR, SVHN, and MNIST datasets, fine-tuning uses the same optimization settings as in training. For the ImageNet dataset, due to time constraints, the pruned VGG-A is fine-tuned with a learning rate of 10−3 for only 5 epochs.
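A hedged sketch of the fine-tuning step, assuming the pruned model and a data loader already exist; the optimizer type, momentum, and weight decay are typical choices rather than values stated in this write-up, and the default learning rate and epoch count mirror the ImageNet setting quoted above.

```python
import torch.nn as nn
import torch.optim as optim

def finetune(pruned_model, train_loader, epochs=5, lr=1e-3, device="cuda"):
    """Retrain the pruned (narrower) model; no sparsity penalty is applied here."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(pruned_model.parameters(), lr=lr,
                          momentum=0.9, weight_decay=1e-4)  # assumed settings
    pruned_model.to(device).train()
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(pruned_model(x), y)
            loss.backward()
            optimizer.step()
```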

6. Analysis:

The only two hyperparameters of this method, the pruning ratio and the coefficient λ of the sparsity regularization term, are analyzed.

Influence of pruning ratio:

Figure 5 was drawn from the experiments. It shows that the classification performance of the pruned or fine-tuned model declines only when the pruning ratio exceeds a certain threshold, and that the fine-tuning process can usually compensate for the accuracy loss that pruning may cause. Only when the pruning ratio exceeds 80% does the test error of the fine-tuned model fall behind the baseline model. It is worth noting that when trained with sparsity, the model performs better than the original model even without fine-tuning. This may be due to the regularization effect of the L1 sparsity on the channel scaling factors.

Setting of the hyperparameter λ:

The purpose of the L1 sparsity term is to force many scaling factors close to zero. The λ parameter in Equation 1 controls its weight relative to the normal training loss. In Figure 4, the distribution of scaling factors across the whole network is plotted for different values of λ. In this experiment, a VGGNet trained on the CIFAR-10 dataset is used.

The results show that as λ increases, the scaling factors become more and more concentrated near zero. When λ=0, there is no sparsity regularization and the distribution is relatively flat. When λ=10−4, almost all the scaling factors fall into a small region close to zero. This process can be regarded as feature selection in the intermediate layers of a deep network, where only channels with non-negligible scaling factors are selected. The process is further visualized with heat maps: Figure 6 shows the scaling factors of one layer in VGGNet over the course of training. At the start, every channel has the same scaling factor; as training progresses, some channels' scaling factors become larger (brighter) while others become smaller (darker).
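A small illustrative snippet (not from the paper or the linked repositories) for reproducing this kind of inspection on one's own models: it histograms the absolute BN scaling factors of a network, so models trained with different λ values can be overlaid in the spirit of Figure 4.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

def plot_gamma_histogram(model: nn.Module, label: str):
    """Overlay a histogram of |gamma| over every BN channel in the model."""
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules()
                        if isinstance(m, nn.BatchNorm2d)])
    plt.hist(gammas.cpu().numpy(), bins=100, alpha=0.5, label=label)

# Example usage, assuming two models trained with different lambda values:
#   plot_gamma_histogram(model_lambda0, "lambda = 0")
#   plot_gamma_histogram(model_lambda1e4, "lambda = 1e-4")
#   plt.xlabel("|gamma|"); plt.ylabel("channel count"); plt.legend(); plt.show()
```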

7. Conclusion:

In order to obtain a smaller model, sparsity-inducing regularization is imposed on the scaling factors of the batch normalization layers, so that unimportant channels can be automatically identified during training and then pruned. On multiple datasets, the proposed method is shown to significantly reduce the computational cost of state-of-the-art networks (by up to 20 times) without loss of accuracy, and even to improve accuracy. More importantly, the approach simultaneously reduces model size, runtime memory, and compute, while adding minimal overhead to the training process, and the resulting model needs no special libraries/hardware for efficient inference.

This article is a superficial understanding and summary of this paper and its code. If you have different opinions, you are welcome to point them out and discuss.