This is the ninth day of my November challenge

The fitting power of a neural network comes from its large number of adjustable parameters (state-of-the-art deep networks contain millions or even billions of them), so overfitting is a real risk. To improve a network, we therefore need techniques for detecting overfitting and for alleviating overtraining.

Early termination of training

A simple way to mitigate overfitting is to track the accuracy on the test data set as training proceeds, and to stop training once that accuracy is no longer improving significantly.
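A rough sketch of this idea in Python is shown below; the helpers train_one_epoch and evaluate are hypothetical placeholders (with a dummy accuracy value), not code from this series:

```python
import numpy as np

# Hypothetical stand-ins for the real training and evaluation routines,
# so the sketch runs on its own; replace them with your actual network code.
def train_one_epoch(network, data):
    pass

def evaluate(network, data):
    return np.random.rand()          # dummy accuracy, for illustration only

network, training_data, test_data = None, None, None
max_epochs, patience = 100, 10       # stop after `patience` epochs with no improvement

best_acc, bad_epochs = 0.0, 0
for epoch in range(max_epochs):
    train_one_epoch(network, training_data)
    acc = evaluate(network, test_data)        # accuracy on held-out data
    if acc > best_acc:
        best_acc, bad_epochs = acc, 0         # still improving, keep training
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"stopping early at epoch {epoch}, best accuracy {best_acc:.4f}")
            break
```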

Looking at the MNIST dataset, you can see that it is split into three parts: training_data, test_data, and validation_data. In the previous posts we only used training_data and test_data; one purpose of validation_data is precisely to help mitigate overfitting.

Why validation_data instead of test_data?

The training process is also a process of tuning hyperparameters. If test_data were used to monitor for overfitting on training_data, the learned hyperparameters could end up overfitting to test_data itself. Using a third dataset, validation_data, is therefore the better choice. This approach is sometimes called the hold-out method, because validation_data is "held out" (pulled aside) from the data set.
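As a minimal sketch of such a hold-out split (the function name and the 10,000-example validation size are my own choices; MNIST is commonly split this way, and `dataset` is assumed to be a list of (input, label) pairs):

```python
import random

def hold_out_split(dataset, n_validation=10_000, seed=0):
    """Carve a validation set out of the labeled data ("hold it out")."""
    data = list(dataset)
    random.Random(seed).shuffle(data)                  # reproducible shuffle
    return data[n_validation:], data[:n_validation]    # (training_data, validation_data)

# Usage: training_data, validation_data = hold_out_split(all_labeled_pairs)
```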

Increase the training sample size

One of the best ways to reduce overfitting is to increase the number of training samples. With enough training data, even a very large network is hard to overfit. The difficulty of this approach is that such large amounts of labeled data rarely exist in practice, and the cost of labeling data by hand is daunting.

Regularization

One of the most commonly used regularization methods is weight decay, also known as L2 regularization.

The core idea of L2 regularization is to add an extra term, called the regularization term, to the cost function. The cross-entropy cost function with the regularization term added looks like this:


$$C = -\frac{1}{n}\sum_{xj}\left[y_j\ln a^L_j + (1-y_j)\ln\left(1-a^L_j\right)\right] + \frac{\lambda}{2n}\sum_w w^2$$

Clearly $\frac{\lambda}{2n}\sum_w w^2$ is the added regularization term, where $\lambda > 0$ is called the regularization parameter and $n$ is the size of the training set.
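As a concrete illustration, here is a small numpy sketch of this cost function; the argument names are mine, with `a` and `y` standing for the output activations and targets over the training set and `weights` for the network's weight matrices:

```python
import numpy as np

def l2_regularized_cross_entropy(a, y, weights, lam, n):
    """Cross-entropy cost plus the L2 term (lam / 2n) * sum of squared weights.

    a, y    : arrays of output activations and target values over the training set
    weights : list of the network's weight matrices
    lam, n  : regularization parameter and training-set size
    """
    cross_entropy = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / n
    l2_term = (lam / (2 * n)) * sum(np.sum(w ** 2) for w in weights)
    return cross_entropy + l2_term
```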

The general form of adding a regularization term to a cost function can be written as follows ($C_0$ denotes the original, unregularized cost function):


$$C = C_0 + \frac{\lambda}{2n}\sum_w w^2$$

Notice that the regularization term contains no bias terms, so it does not affect how the biases are updated. Taking the partial derivatives of $C$ with respect to the two kinds of parameters:


$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}w$$

$$\frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}$$

Applying the gradient descent rule gives the update formulas for the two kinds of parameters:


$$b \rightarrow b - \eta\frac{\partial C_0}{\partial b}$$

Clearly the update of $b$ is unchanged.


$$w \rightarrow w - \eta\left(\frac{\partial C_0}{\partial w} + \frac{\lambda}{n}w\right) = \left(1 - \frac{\eta\lambda}{n}\right)w - \eta\frac{\partial C_0}{\partial w}$$

Since $\frac{\eta\lambda}{n}$ is positive, the weight $w$ is rescaled by the factor $1-\frac{\eta\lambda}{n}$ at every step, and the strength of this shrinkage is controlled by $\lambda$ (this is also why the method is called weight decay). The formula also shows that a larger training-set size $n$ weakens the effect of the regularization term, so $\lambda$ should be increased appropriately when the training set is larger.
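A minimal sketch of this update rule, assuming the gradients of $C_0$ have already been computed by backpropagation (all names here are illustrative):

```python
def sgd_step_with_weight_decay(weights, biases, grad_w, grad_b, eta, lam, n):
    """One gradient-descent step with L2 weight decay.

    w -> (1 - eta*lam/n) * w - eta * dC0/dw
    b -> b - eta * dC0/db              (biases are not decayed)
    """
    new_weights = [(1 - eta * lam / n) * w - eta * gw
                   for w, gw in zip(weights, grad_w)]
    new_biases = [b - eta * gb for b, gb in zip(biases, grad_b)]
    return new_weights, new_biases
```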

Why regularization helps reduce overfitting

Let us explain the idea with a simple example, using the following data set:

We want to build a model that fits this data. Such simple data obviously does not call for a weapon as powerful as a neural network, so we choose a polynomial. There are 10 points in the figure, so a polynomial that fits them all exactly has degree 9:

$$y = a_0 x^9 + \dots + a_9$$

The graph of this polynomial is shown below:

But a simple linear model

$$y = 2x$$

also gives good results, as shown below:

Now we need to ask: which of these two models is the one we actually want?
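To make the contrast concrete, here is a small numpy sketch of the two fits; the ten points below are synthetic stand-ins (roughly linear data with noise), not the actual data from the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the ten points in the figure (illustration only):
# roughly linear data, y = 2x plus a little noise.
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(scale=0.1, size=x.shape)

p9 = np.polyfit(x, y, deg=9)   # degree-9 polynomial: passes through all ten points
p1 = np.polyfit(x, y, deg=1)   # linear model: captures the overall trend

x_new = 1.5                    # a point outside the range of the training data
print(np.polyval(p9, x_new))   # the high-degree fit typically gives a wild prediction here
print(np.polyval(p1, x_new))   # the linear fit stays close to 2 * 1.5 = 3
```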

If we use the two models to predict a new sample, the predictions will obviously differ greatly. Here it is worth introducing an intuitive principle, Occam's razor: roughly, if several theories explain the same phenomenon equally well, prefer the one that makes fewer assumptions, until it can no longer explain the phenomenon.

But this principle clearly does not always work, and deciding which of two explanations is "simpler" is a rather delicate matter. The real test of a model is not its simplicity but its ability to predict correctly in new situations, on data it has never seen.

In short, regularized neural networks usually generalize better than unregularized ones; this is, however, only an empirical observation.

Other regularization techniques

L1 regularization

Formula:


$$C = C_0 + \frac{\lambda}{n}\sum_w |w|$$

To see the effect of L1 regularization, take the partial derivative of $C$ with respect to the weights:


$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\,\mathrm{sgn}(w)$$

Here $\mathrm{sgn}$ is the sign function: $\mathrm{sgn}(w) = 1$ for $w > 0$, $\mathrm{sgn}(w) = 0$ for $w = 0$, and $\mathrm{sgn}(w) = -1$ for $w < 0$.

The update rule for $w$ is:


$$w \rightarrow w' = w - \frac{\eta\lambda}{n}\,\mathrm{sgn}(w) - \eta\frac{\partial C_0}{\partial w}$$

Notice that $\frac{\eta\lambda}{n}\,\mathrm{sgn}(w)$ has a constant magnitude: L1 regularization shrinks $w$ by subtracting a fixed amount, whereas L2 regularization shrinks $w$ by multiplying it by the factor $1-\frac{\eta\lambda}{n}$. The difference is that when $|w|$ is large, L2 regularization shrinks the weight much more strongly, and when $|w|$ is very small, L1 regularization shrinks it more strongly. The net effect is that L1 regularization tends to concentrate the network's weight in a relatively small number of important connections, while the other weights are driven toward zero. (The special case $w = 0$ is not shrunk at all, since $\mathrm{sgn}(0) = 0$.)
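A tiny numpy sketch of the two shrinkage rules side by side (the gradient of $C_0$ is omitted so that only the regularization effect is visible; the numbers are illustrative):

```python
import numpy as np

def l1_shrink(w, eta, lam, n):
    # L1: subtract the constant eta*lam/n in the direction of sign(w)
    return w - (eta * lam / n) * np.sign(w)

def l2_shrink(w, eta, lam, n):
    # L2: rescale w by the factor (1 - eta*lam/n)
    return (1 - eta * lam / n) * w

w = np.array([5.0, 0.01, -0.01, -5.0])
print(l1_shrink(w, eta=0.5, lam=0.1, n=100))   # small weights lose a larger fraction
print(l2_shrink(w, eta=0.5, lam=0.1, n=100))   # large weights lose a larger absolute amount
```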

Dropout

Unlike L1 and L2 regularization, the Dropout technique does not modify the cost function. Its principle is to hide (drop) each neuron with a certain probability (for example $p = 0.5$) during each forward pass of training. At the code level, when a neuron is dropped with probability $p$, its activation is simply set to 0 with that probability. This makes the model generalize better.
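A minimal sketch of this masking step; the $1/(1-p)$ rescaling ("inverted dropout") is a common implementation convention not mentioned above, and the function name is my own:

```python
import numpy as np

def dropout_forward(activations, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Zero each activation with probability p during training.

    The surviving activations are rescaled by 1/(1-p) ("inverted dropout"),
    a common convention that keeps the expected activation unchanged,
    so nothing special is needed at test time.
    """
    if not training:
        return activations                       # no dropout when evaluating
    mask = rng.random(activations.shape) >= p    # keep each unit with probability 1-p
    return activations * mask / (1.0 - p)

# Usage: hidden = dropout_forward(hidden, p=0.5, training=True)
```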

Dropout is roughly equivalent to training many networks of different structures on different data and then averaging their outputs. Different networks overfit in different ways, so averaging lets the individual overfitting effects cancel each other out. In addition, the weight updates no longer rely on fixed combinations of hidden units acting together, which forces the network to learn more broadly useful features.