Compiled by Heart of the Machine from the arXiv paper by Aidan N. Gomez, Ivan Zhang, Kevin Swersky, Yarin Gal, and Geoffrey E. Hinton.
Dropout, already standard in many models, randomly removes neurons or weights during training to obtain different "architectures." So could we "randomly" remove weak connections or less important neurons during training, and thereby build more "important" architectures? Hinton and his co-authors suggest that this can be done. Their Targeted Dropout method effectively embeds pruning into the learning process, so that pruning after training incurs far less damage.
The Targeted Dropout paper was accepted to the NIPS/NeurIPS 2018 workshop on compact neural network representations, which focuses on building compact and efficient neural networks. Its topics include neural network compression methods such as pruning, quantization, and low-rank approximation; neural network representation and exchange formats; and ways of compressing video and media using DNNs.
The workshop's best paper was Rethinking the Value of Network Pruning, which Heart of the Machine has previously introduced. That paper reconsiders the role of over-parameterization in neural networks, suggesting that the value of pruning algorithms may lie in identifying efficient structures and performing an implicit architecture search, rather than in selecting the "significant" weights of an over-parameterized model.
Workshop address: nips.cc/Conferences…
This article, of course, focuses on Targeted Dropout, which builds pruning into Dropout implicitly. Could it likewise be seen as an implicit search for an efficient neural network architecture?
A great deal of current research focuses on training sparse neural networks, which means setting individual weights or the activations of whole neurons to zero while requiring that prediction accuracy does not drop significantly. During learning, we can push the network toward sparse weights through regularization terms such as L1 or L0 penalties. Sparsity can also be achieved by pruning afterwards: the complete model is used during training, and some strategy prunes it after training to obtain a sparse network.
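To make the regularization route concrete, here is a minimal TensorFlow sketch (our own illustration, not code from the paper) of adding an L1 penalty to a task loss; model, inputs, labels, and l1_coeff are assumed names standing in for your own training setup.

import tensorflow as tf

def sparse_loss(model, inputs, labels, l1_coeff=1e-4):
    # Task loss plus an L1 penalty that pushes many weights toward zero,
    # which makes post-training pruning cheaper.
    logits = model(inputs)
    task_loss = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True))
    l1_term = tf.add_n([tf.reduce_sum(tf.abs(v)) for v in model.trainable_variables])
    return task_loss + l1_coeff * l1_term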
Ideally, given some measure of task performance, pruning removes the weights or neurons that help the model least. In practice this is difficult, because it is infeasible to determine which subset of millions of parameters matters most for the task. Common pruning strategies therefore rely on fast approximations of a good subset, such as removing a small number of parameters at a time, or ranking weights by the task's sensitivity to them and removing the insensitive ones.
Targeted Dropout is based on the observation that Dropout regularization keeps only a subset of neurons active in each forward pass, and therefore encourages sparsity during training. This pushes the network to learn a representation that is robust to sparsification, namely to the random deletion of groups of neurons. The authors hypothesize that if we intend to prune a particular set of elements, we would do better to apply Dropout specifically to that set, for example to the neurons whose values are close to zero.
The authors call this approach Targeted Dropout. The main idea is to rank weights or neurons by some fast approximation of their importance and apply Dropout to the less important elements. As with the observation about standard Dropout regularization, the authors show that this encourages the network to rely on its more important weights or neurons; in other words, the network learns to be robust to the chosen pruning strategy.
The advantage of Targeted Dropout over other methods is that it makes the converged network extremely robust to pruning. It is also very easy to implement, requiring a change of only a couple of lines of code in mainstream frameworks such as TensorFlow or PyTorch. In addition, it gives explicit control over the degree of sparsity we want.
Reviewers were generally positive about the method but questioned its convergence: since the importance of neurons (weights) is estimated before Dropout (pruning) is applied rather than after, estimation errors could accumulate over optimization iterations and lead to divergence. The authors promise a more detailed analysis in an appendix to the final version.
Finally, Hinton and his co-authors have open-sourced the experimental code for those interested:
Project address: github.com/for-ai/TD/t…
In the project we find that the key modification to Dropout is the following code. The model computes the absolute values of the weight matrix, then uses targ_rate to decide how many "unimportant" weights should receive Dropout. It only needs to sort the absolute values of all weights and mask out the specified number of "unimportant" ones.
norm = tf.abs(w)                                               # weight magnitudes
idx = tf.to_int32(targ_rate * tf.to_float(tf.shape(w)[0]))     # how many low-magnitude rows to target per column
threshold = tf.contrib.framework.sort(norm, axis=0)[idx]       # per-column magnitude cutoff
mask = norm < threshold[None, :]                               # True for the "unimportant" candidate weights
Once you’ve determined which weights aren’t important, the Dropout operation is just like any other. Now, let’s look at this paper in detail.
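The remaining step can be sketched as follows (our own illustration, not the repository's exact code): each candidate weight is dropped independently with some probability, exactly like ordinary Dropout, and the rest are left untouched.

import tensorflow as tf

def drop_candidates(w, candidate_mask, drop_rate):
    # candidate_mask is the boolean "unimportant" mask computed above;
    # each candidate is zeroed independently with probability drop_rate.
    coin_flips = tf.random.uniform(tf.shape(w)) < drop_rate
    dropped = tf.logical_and(candidate_mask, coin_flips)
    return tf.where(dropped, tf.zeros_like(w), w)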
Paper: Targeted Dropout
Address: openreview.net/pdf?id=Hkgh…
Neural networks are extremely flexible because of their large number of parameters, which benefits learning but also means the model is highly redundant. This redundancy makes it possible to compress neural networks without significant performance loss. We introduce Targeted Dropout, a strategy for post hoc pruning of neural network weights and units that builds the pruning mechanism directly into learning.
At each weight update, Targeted Dropout selects a candidate set of weights using a simple selection criterion and then applies stochastic pruning to that candidate set. The resulting network explicitly learns to be robust to pruning, and the method is easy to implement and tune compared with more complicated regularization schemes.
2 Targeted Dropout
2.1 Dropout
Our work uses the two most popular Bernoulli Dropout techniques: the unit Dropout proposed by Hinton et al. [8, 17] and the weight Dropout proposed by Wan et al. [20]. For a fully connected layer with input tensor X, weight matrix W, output tensor Y, and mask M_io drawn from Bernoulli(α), the two methods can be written as unit Dropout, Y = (X ∘ M) W, and weight Dropout, Y = X (W ∘ M), where ∘ denotes element-wise multiplication.
Unit Dropout randomly removes units (neurons) at each update, reducing interdependence between units and preventing overfitting.
Weight Dropout randomly removes individual weights from the weight matrix at each update. Intuitively, deleting a weight removes a connection between layers and forces the network to adapt to a different connectivity pattern at each training update.
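A minimal sketch of the two variants for a fully connected layer, under the notation above (keep_prob is our own parameter name for the probability of keeping an element, not the paper's):

import tensorflow as tf

def unit_dropout(x, w, keep_prob):
    # Y = (X * M) W: mask entries of the input activations (units).
    m = tf.cast(tf.random.uniform(tf.shape(x)) < keep_prob, x.dtype)
    return tf.matmul(x * m, w)

def weight_dropout(x, w, keep_prob):
    # Y = X (W * M): mask individual entries of the weight matrix.
    m = tf.cast(tf.random.uniform(tf.shape(w)) < keep_prob, w.dtype)
    return tf.matmul(x, w * m)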
2.2 Magnitude-based pruning
A popular class of pruning strategies is magnitude-based pruning, which treats the k weights with the largest magnitudes as the most important connections. We can use argmax-k to return the largest k elements (weights or units) among all elements.
Unit pruning [6] scores each unit by the L2 norm of the corresponding column vector of the weight matrix and keeps the top-k units.
Weight pruning [10] scores each entry of the weight matrix by its magnitude (L1 norm of the individual weight), where top-k is taken over the weights within the same filter (matrix column).
Weight pruning tends to preserve more of the model's accuracy, while unit pruning yields greater computational savings.
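Both criteria can be sketched in a few lines of TensorFlow (our own illustration with assumed names; w is a 2-D weight matrix of shape [inputs, units] and k is the number of elements to keep):

import tensorflow as tf

def unit_pruning_mask(w, k):
    # Score each unit (column) by the L2 norm of its weight vector and keep the top k columns.
    scores = tf.norm(w, axis=0)
    kth = tf.sort(scores, direction='DESCENDING')[k - 1]
    return tf.cast(scores >= kth, w.dtype)        # shape [units], 1 = keep the column

def weight_pruning_mask(w, k):
    # Score each weight by its magnitude and keep the top k weights within each column.
    scores = tf.abs(w)
    kth = tf.sort(scores, axis=0, direction='DESCENDING')[k - 1]
    return tf.cast(scores >= kth, w.dtype)        # shape [inputs, units], 1 = keep the weight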
2.3 Method
Suppose we have a neural network parameterized by θ, and we want to prune W(θ) in the manner defined by Equations (1) and (2). We therefore hope to find optimal parameters θ* that make the loss E(W(θ*)) as small as possible while keeping |W(θ*)| ≤ k, that is, we want to retain only the k highest-magnitude weights in the network. A deterministic implementation would simply select the |θ| − k smallest elements and delete them.
But if some of these smaller weights become important during training, their magnitudes should be allowed to grow. The researchers therefore introduce stochasticity into the process through a targeting proportion γ and a drop probability α: the targeting proportion means we select the γ|θ| lowest-magnitude weights as Dropout candidates, and then drop weights in this candidate set independently with rate α.
This means that the expected number of weights Targeted Dropout keeps at each weight update is (1 − γα)|θ|. As we will see below, Targeted Dropout reduces the dependence of the important subnetwork on the unimportant one, and thus reduces the performance penalty of pruning the trained network.
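As a quick illustrative calculation (the numbers are arbitrary, not taken from the paper): with γ = 0.5 and α = 0.5, half of the weights are marked as candidates and each candidate is zeroed with probability 0.5, so in expectation 0.25|θ| weights are dropped and (1 − 0.5 × 0.5)|θ| = 0.75|θ| are kept at each update.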
As shown in Tables 1 and 2 below, the researchers' weight-pruning experiments indicate that the regularization-scheme baselines perform worse than Targeted Dropout. Moreover, the Targeted Dropout model outperforms the unregularized model while using only half of its parameters. By gradually ramping the targeting proportion from zero up to 99% over the course of training, the researchers say extremely high pruning rates can be achieved.
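One simple way to implement such a ramp is a linear schedule (our own sketch; the authors' exact schedule may differ):

def ramped_targeting_proportion(step, total_steps, final_gamma=0.99):
    # Linearly increase the targeting proportion from 0 to final_gamma over training.
    fraction = min(step / float(total_steps), 1.0)
    return fraction * final_gamma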
Table 2: Comparison of Targeted Dropout and Ramping Targeted Dropout against Smallify, using ResNet-32 on CIFAR-10. The left part compares the best three Targeted Dropout runs with the best six Smallify runs, the middle examines higher pruning rates, and the right shows Ramping Targeted Dropout at even higher pruning rates.
3 Conclusion
We propose Targeted Dropout, a simple and efficient regularization method that incorporates a post hoc pruning strategy into the training process of a neural network without significantly affecting the performance of a given architecture on the underlying task. The main advantages of Targeted Dropout are its simple, intuitive implementation and its flexible hyperparameters.