
Is Dropout metaphysics?

After skimming the famous paper Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors, I was surprised that Hinton would write a paper like this. It mostly presents a large number of experimental results for Dropout under different settings, while the underlying principle is offered only as three conjectures, which makes Dropout genuinely hard to pin down [dog-head emoji, i.e. joking]. It is a somewhat "metaphysical" technique: you can feel what it does, but it is hard to explain precisely. In practice, Dropout can work wonders when used well and hurt performance when used badly, so it is best treated as an optional training trick. Still, in most cases it does bring a real improvement in performance. Joking aside, let's get started.

What is Dropout

To introduce Dropout, we need to start with overfitting, a common problem in machine learning and deep learning. When the training set is small and the model has many parameters, overfitting often occurs and test performance suffers. Dropout is a common way to combat overfitting. As for the claim that Dropout also reduces training time, I am not convinced, for reasons explained below.

Dropout is a commonly used technique against overfitting. During training, Dropout randomly drops a portion of the neurons with a preset probability. This greatly reduces overfitting and improves the generalization ability of the model. The motivation for Dropout and its definition are given in the abstract of the paper:

When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case.

In fact, this one sentence already covers the main content of the paper; the rest of the long paper is mostly a comparison of test results with and without Dropout.

Dropout Process

[Training stage]

A) Set the probability of keeping a neuron to P in advance, so the probability of dropping a neuron is 1-P; a higher P means more neurons are kept. In code, this means setting the output of each neuron to zero with probability 1-P.

B) Suppose a layer has 10 neurons whose activation outputs are y1, y2, ..., y10, and we set P to 0.6. After applying Dropout to this layer, the outputs of about 4 of the neurons are set to 0.

C) Even though about 4 neurons have been dropped, the remaining outputs y1, ..., y10 still need to be rescaled by multiplying by 1/P. The reason is that if a neuron's normal output is x, then after adding Dropout the expected output becomes E = P*x + (1-P)*0 = P*x, and multiplying by 1/P gives E*(1/P) = x, so the expected output is unchanged (see the sketch after this list).
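To make steps A) to C) concrete, here is a minimal NumPy sketch of this "inverted Dropout" training step. The function name dropout_train and the use of NumPy are my own choices for illustration, not code from the paper.

```python
import numpy as np

def dropout_train(y, keep_prob=0.6):
    """Inverted Dropout applied to one layer's activations during training.

    keep_prob is P, the probability of KEEPING a neuron, so each
    output is set to zero with probability 1 - P.
    """
    # Steps A/B: sample a 0/1 mask; each neuron survives with probability P
    mask = (np.random.rand(*y.shape) < keep_prob).astype(y.dtype)
    # Step C: rescale the survivors by 1/P so the expected output is unchanged:
    # E = P*x + (1-P)*0 = P*x, and (P*x) * (1/P) = x
    return y * mask / keep_prob

# Step B example: 10 activations with P = 0.6 -> about 4 of them become 0
y = np.ones(10)
print(dropout_train(y, keep_prob=0.6))
```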

[Test phase]

Since some neurons are dropped during training, the neuron outputs would have to be scaled by multiplying by P at test time. But this adds computation to the test phase. To keep test-time computation low, the scaling is usually moved into the training phase, as in step C) above, so that no rescaling of the weights is needed at test time. Why multiply by P at test time in the first place? If a neuron's normal output is x, then under Dropout its expected output is E = P*x + (1-P)*0 = P*x. At test time the neuron is always present and would normally output x, so x has to be adjusted to P*x to keep the expected output consistent with training.

In a nutshell, there are two equivalent conventions: either keep each neuron with probability P during training and divide its output by P, so nothing needs to change at test time; or train without rescaling and then, at test time, where every unit is present, multiply each weight W by P to get P*W. A short sketch of the two conventions follows.
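As a rough illustration of the two test-time forward passes (the function names and the use of NumPy are mine, not the paper's):

```python
import numpy as np

def test_forward_standard(x, W, keep_prob=0.6):
    # Standard Dropout: no rescaling was done during training,
    # so at test time every weight is scaled: W -> P*W.
    return x @ (keep_prob * W)

def test_forward_inverted(x, W):
    # Inverted Dropout: the 1/P rescaling already happened during training,
    # so the test-time forward pass uses W unchanged.
    return x @ W
```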

Why Dropout solves overfitting

The paper does not give a rigorous explanation either; perhaps the authors themselves could not state it precisely. However, three interpretations are offered that help build intuition about the underlying principle. Whichever one you prefer, chew on it yourself and judge how convincing it is.

[First argument] Bagging randomly selects different subsets of the training data, trains a different decision tree on each, and combines them into a random forest that makes decisions by voting. Dropout can be seen as an extreme form of Bagging: with a drop probability set for the neurons, each training case effectively trains a new model, and every parameter of that model is shared with the corresponding parameter of all the other models, which acts as a strong regularizer.

[Second argument] Dropout is an extreme case of "naive Bayes", in which each input feature is trained separately to predict the class label, and the predictive distributions of all features are multiplied together at test time. When training data is scarce, this often works much better than logistic classification, which trains each input feature to work well in the context of all the other features.

[Third argument] To survive, a species tends to adapt to its current environment, and a sudden environmental change can make it hard for the species to respond in time. Sexual reproduction can produce offspring adapted to the new environment, which effectively prevents "overfitting", that is, it avoids the risk of extinction when the environment changes.

Original text:

A popular alternative to Bayesian model averaging is "bagging" in which different models are trained on different random selections of cases from the training set and all models are given equal weight in the combination. Bagging is most often used with models such as decision trees because these are very quick to fit to data and very quick at test time. Dropout allows a similar approach to be applied to feedforward neural networks which are much more powerful models. Dropout can be seen as an extreme form of bagging in which each model is trained on a single case and each parameter of the model is very strongly regularized by sharing it with the corresponding parameter in all the other models. This is a much better regularizer than the standard method of shrinking parameters towards zero.

A familiar and extreme case of dropout is “naive Bayes” in which each input feature is trained separately to predict the class label and then the predictive distributions of all the features are multiplied together at test time. When there is very little training data, this often works much better than logistic classification which trains each input feature to work well in the context of all the other features.

Finally, there is an intriguing similarity between dropout and a recent theory of the role of sex in evolution. One possible interpretation of the theory of mixability articulated in is that sex breaks up sets of co-adapted genes and this means that achieving a function by using a large set of co-adapted genes is not nearly as robust as achieving the same function, perhaps less than optimally, in multiple alternative ways, each of which only uses a small number of co-adapted genes. This allows evolution to avoid dead-ends in which improvements in fitness require co-ordinated changes to a large number of co-adapted genes. It also reduces the probability that small changes in the environment will cause large decreases in fitness, a phenomenon which is known as "overfitting" in the field of machine learning.

Where to use Dropout

Dropout can be applied both to the input data and to the hidden layers. Here is the original description:

Without using any of these tricks, the best published result for a standard feedforward neural network is 160 errors on the test set. This can be reduced to about 130 errors by using 50% dropout with separate L2 constraints on the incoming weights of each hidden unit and further reduced to about 110 errors by also dropping out a random 20% of the pixels.

Here we can see that in the experiments the authors not only applied Dropout in the hidden layers, but also randomly dropped 20% of the input image pixels. There are many similar cases in the paper.
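As a sketch of this kind of setup (assuming PyTorch; the 800-unit hidden layers and the exact layer count are illustrative choices, not necessarily the paper's architecture):

```python
import torch.nn as nn

# MNIST-style classifier with 20% Dropout on the input pixels
# and 50% Dropout after each hidden layer, mirroring the quoted setup.
# Note: in PyTorch, Dropout(p) drops a unit with probability p
# (p here is the DROP probability, not the keep probability P used above).
model = nn.Sequential(
    nn.Flatten(),
    nn.Dropout(p=0.2),
    nn.Linear(28 * 28, 800),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(800, 800),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(800, 10),
)

model.train()  # Dropout active during training
model.eval()   # Dropout disabled at test time (PyTorch uses inverted Dropout,
               # so no weight rescaling is needed here)
```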

Dropout test results

Here is the figure from the paper comparing test performance on the MNIST data set. It shows that applying 20% Dropout to the input together with 50% Dropout to the hidden layers performs better than 50% Dropout on the hidden layers alone. However, the figure also shows that training runs for a very large number of epochs; under normal circumstances one would not train for that many epochs on this data set. For this reason I am skeptical of the claim that Dropout reduces training time. Moreover, a network whose neurons are being dropped needs a lot of training to learn appropriate weights.

References

  • Hinton G E, Srivastava N, Krizhevsky A, et al. Improving neural networks by preventing co-adaptation of feature detectors[J]. arXiv preprint arXiv:1207.0580, 2012.
  • Blog.csdn.net/program_dev…
  • www.cnblogs.com/makefile/p/…