In machine learning, when a model has a relatively large number of parameters, over-fitting is likely to occur: the model fits the training set well, but its performance on the test set is far less pleasant.
When this happens, the trained model generalizes poorly and cannot cope with the complexity of the real world. We naturally need ways to prevent or mitigate overfitting. In this article, CoorChice will introduce several methods for doing so.
Regularization
In model training, the most important step is to approach the global optimum by continually reducing the value of the loss function, which simply means making the feature weights describe the real situation correctly.
But the trained weights often contain a lot of noise, that is, many useless feature weights, and on data that does not share that noise the model is likely to misjudge. It is also possible that the weights of a few features become so large that whenever those strong features appear, the model ignores the useful features with smaller weights. In that case, too, the model is prone to misjudgment.
How do we get out of this dilemma? The answer is regularization. Regularization is originally a concept from algebraic geometry; in machine learning, adding a regularization term means adding a prior constraint. To put it bluntly, we have certain expectations about the result and can't just let the weights do whatever they like.
A norm is usually used as the regularization term of the loss function.
The formula below is the general Lp norm. Depending on the value of p, it yields the 0 norm, 1 norm and 2 norm, that is, p equal to 0, 1 and 2. Different norms suit different scenarios.
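For a weight vector w, the standard Lp norm is:

$$\|w\|_p = \left( \sum_{i} |w_i|^p \right)^{\frac{1}{p}}$$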
Zero norm, one norm
0 norm
Starting from the Lp norm formula above, when p is 0 we get the 0 norm, and the formula takes the following form:
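Taking p = 0 literally in the Lp formula above would give something like:

$$\|w\|_0 = \left( \sum_{i} |w_i|^0 \right)^{\frac{1}{0}}$$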
As you can see, the 0 norm seems to involve taking the 0th root of a number, which is a very strange expression; it is hard to tell what it means or how to compute it.
So the 0 norm is usually given an alternative definition with a clear meaning:
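A common way to write it (the notation varies between texts) is:

$$\|w\|_0 = \#\{\, i : w_i \neq 0 \,\}$$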
The meaning of this formula is the number of non-zero elements in the weight vector w.
Why does it prevent overfitting? Adding it to the loss function gives the following:
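With λ denoting the regularization strength (λ is a conventional symbol, not one fixed by this article), the loss roughly takes the form:

$$Loss = C_0 + \lambda \|w\|_0$$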
A quick note here. The Loss formula above is composed of two parts. The first part, C0, is the traditional loss function, known as the "empirical risk". The second part, the regularization term, is called the "structural risk"; it imposes a structural constraint on the "empirical risk" in the first half.
We can look at the image of the norm as p approaches 0:
It's not hard to see the trend: as p approaches 0, the shape gets closer and closer to the coordinate axes. So you can imagine that when p equals 0, everything falls onto the axes. When the contour of the loss part is tangent to it, the tangent point naturally tends to land on a coordinate axis, that is, only one coordinate has a value and all the others are 0.
This means that during gradient descent, in order to keep reducing Loss, the optimizer must consider setting more of the weights w to 0 so that the 0 norm term keeps shrinking. Its purpose now emerges: set unimportant weights to zero as much as possible, reduce the number of parameters, and keep only the useful features. In other words, it makes the weight set sparse (containing many zeros) and acts as a feature selector.
Doesn't that look great? It filters features automatically! However, in machine learning with a huge number of feature weights w, the optimizer would have to learn dynamically which w to set to 0 so that both the Loss and the 0 norm term keep decreasing. There are far too many combinations of which weights to shrink and which to zero out directly, so many that it is practically impossible to find an optimal path. A problem like this, which has a solution but is practically unsolvable, is called an NP-hard problem: any candidate answer can be checked for correctness, but there are so many candidates that it is impossible to check them all.
1 norm
In mathematics, when a problem cannot be solved directly, an approximate algorithm is usually used to get ever closer to the real answer, and an approximate solution is acceptable.
The 1 norm is such an approximation: it keeps the advantages of the 0 norm while remaining easy to solve.
Similarly, the 1 norm is simply the norm with p equal to 1:
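Substituting p = 1 gives the standard form:

$$\|w\|_1 = \sum_{i} |w_i|$$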
The 1 norm is the sum of the absolute values of the weights w. The loss function with a 1 norm term looks like this:
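A common form (λ is the regularization strength and n the number of training samples; the exact scaling is a convention, not something fixed by this article) is:

$$Loss = C_0 + \frac{\lambda}{n} \sum_{i} |w_i|$$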
As can be seen from the Loss, minimizing the whole expression also requires minimizing the 1 norm term. Now take a look at the image of the 1 norm:
The figure above shows the 1 norm in three-dimensional space. It is an octahedron whose vertices lie on the coordinate axes. It follows that in higher-dimensional space there are likewise many vertices sitting on the coordinate axes. In other words, during gradient descent it is quite possible for the loss function to end up tangent to one of these vertices, so the 1 norm, like the 0 norm, has a chance to keep useful features and eliminate noise, improving the generalization ability of the model.
In terms of computation, the image above also shows that the 1 norm has non-differentiable points, namely its vertices, which is not so convenient.
2 norm
Avoid overfitting
Since the calculation of the 1 norm is not convenient, can we raise the order once more and see whether a higher norm is easier to work with?
Look at the norm with p equal to 2:
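Substituting p = 2 gives the standard form:

$$\|w\|_2 = \sqrt{\sum_{i} w_i^2}$$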
Now look at what the image looks like:
In three-dimensional space the image is a sphere! This means the 2 norm is continuous and differentiable everywhere in space. It does touch the coordinate axes, so the tangent point with the loss function could still land on an axis, but as the picture suggests this is much less likely than with the 1 norm.
The loss function with a 2 norm term looks like this:
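A common form, with the square root dropped as explained just below (λ and n as before; the extra factor of 2 in the denominator is a common convention that simplifies the derivative), is:

$$Loss = C_0 + \frac{\lambda}{2n} \sum_{i} w_i^2$$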
That is just the sum of the squares of w. The square root has been dropped here; it doesn't really matter, the effect is about the same and it saves a lot of computation.
Since the tangent point between the 2 norm and the loss function rarely falls on a coordinate axis, meaning it can hardly screen features or remove useless noise features, why can it still keep the model from overfitting?
Let CoorChice explain step by step.
Consider the example of recognizing a puppy 🐶 with three features, which in simple terms means three feature weights, namely [w1, w2, w3].
```
w1 = 1, w2 = 1, w3 = 1  ->  w1 + w2 + w3 = 3
w1 = 3, w2 = 0, w3 = 0  ->  w1 + w2 + w3 = 3
```
These are two possible assignments of w. Although the sum of the weights is 3 in both cases, in the second case w2 and w3 are completely overwhelmed by the huge w1 = 3 during recognition. Suppose w1 represents the "four legs" feature; this model then treats anything with four legs as a puppy, which is obviously unscientific. That is, the model can be highly accurate on the training set but easily misjudges in complex real-world situations.
If we use the 2 norm, the sum of squares in the first case is 3, while in the second case it is 9! So in order to keep reducing Loss, the optimizer has to keep weakening overly strong weights like w1, so that several useful weak features added together can outweigh a strong one. The model then no longer ignores many weak features as soon as it sees one strong feature, which is what leads to misjudgment.
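A quick sketch (plain numpy, with illustrative variable names) that reproduces the arithmetic above:

```python
import numpy as np

# The two hypothetical weight assignments from the puppy example above.
w_balanced = np.array([1.0, 1.0, 1.0])  # every feature contributes
w_dominant = np.array([3.0, 0.0, 0.0])  # one overwhelming feature

# Same 1 norm (sum of absolute values): 3.0 and 3.0
print(np.abs(w_balanced).sum(), np.abs(w_dominant).sum())

# Very different 2 norm penalty (sum of squares): 3.0 and 9.0
print(np.square(w_balanced).sum(), np.square(w_dominant).sum())
```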
The 2 norm evens out the weights, weakening overly strong features so that the model does not depend on a few strong features but weighs all features more comprehensively. In this way the generalization ability of the model naturally improves, and overfitting is less likely to occur.
Avoid gradient explosions 💥
In “[Get] Recognizing Handwritten Numbers with Deep Learning”, CoorChice mentioned that training initially hit a bottleneck because the gradients went bang! bang! and exploded, and the training failed completely. But after CoorChice added a regularization term, the training successfully crossed the minefield of gradient explosions and finished.
The reason is that during training, some extreme weight values become exaggeratedly large, which produces huge numbers in the chain of multiplications, so the gradients explode 💥. The 2 norm regularization term weakens exactly this kind of overly strong weight, so to a certain extent it can effectively reduce the chance of a gradient explosion.
How do I add a regularization term in TensorFlow?
1. Collect all the weights
```python
tf.add_to_collection(tf.GraphKeys.WEIGHTS, w)
```
This function is called every time a weight is created, so that the weight is added to the collection.
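For example, a minimal sketch of a weight-creation helper (the name create_weight and the initialization are only illustrative, not from the original code) might look like this:

```python
import tensorflow as tf  # TensorFlow 1.x API, matching the snippets in this article

def create_weight(shape, name):
    # Create a weight variable and register it in the WEIGHTS collection
    # so the regularizer in the next step can pick it up.
    w = tf.Variable(tf.truncated_normal(shape, stddev=0.1), name=name)
    tf.add_to_collection(tf.GraphKeys.WEIGHTS, w)
    return w

# Usage sketch:
# w_fc = create_weight([784, 1024], "w_fc")
```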
2. Construct the regularization term and add it to the Loss
```python
# Build the l2 regularizer and set its importance to 5.0/50000
regularization = tf.contrib.layers.l2_regularizer(scale=(5.0/50000))
# Let the regularization take effect
reg_term = tf.contrib.layers.apply_regularization(regularization)
# Add the regularization term to the loss function
loss = (-tf.reduce_sum(y_data * tf.log(y_conv)) + reg_term)
```
Dropout
Dropout is a technique often used in deep learning to reduce overfitting. Its core idea is that during training, some neurons in a layer are disabled at a specified rate while the rest stay active, and which neurons are disabled is random.
The disabled neurons are not removed from the network; they simply sit out that iteration. In the next round of training, a different random set of neurons is disabled.
In this way, fewer neurons take part in each training step, and because a different subset may be drawn each time, the learning path becomes random. Unlike a network without Dropout, which always updates all neurons together, a network with Dropout weakens the co-adaptation between neurons and forces the network to learn more robust features while still driving the loss down by gradient descent. This makes the model more adaptable instead of fitting only one pattern.
That is a lot of explanation and it may still sound abstract, so here is a picture!
Doesn't that feel much clearer? On the left is the network without Dropout, with full connections between neurons. On the right is the network with Dropout, and you can see that during one training pass some of the neurons are disconnected, that is, disabled. Which neurons get disconnected is random each time, which increases the possibilities open to the network, a bit like the ability to mutate.
Typically, Dropout requires a hyperparameter that specifies what fraction of neurons is disabled each time. This value is usually set to 0.5, because that allows for the largest variety of combinations.
In the famous VGG-16 network model, Dropout is added to the last three fully connected layers.
Of course, the drawback of Dropout also comes from its randomness, which greatly increases the number of training iterations and makes training take longer.
How do I add Dropout to TensorFlow?
```python
tf.nn.dropout(input, keep_prob)
```
Simply pass in the previous output, such as the output of a fully connected layer, and specify in the second argument the proportion of neurons to keep.
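A minimal sketch of how it is typically wired in, assuming a TensorFlow 1.x graph (the layer shapes and variable names here are only illustrative):

```python
import tensorflow as tf  # TensorFlow 1.x API, as used elsewhere in this article

# A hypothetical fully connected layer; the shapes are only for illustration.
x = tf.placeholder(tf.float32, [None, 784])
w_fc = tf.Variable(tf.truncated_normal([784, 1024], stddev=0.1))
b_fc = tf.Variable(tf.constant(0.1, shape=[1024]))
h_fc = tf.nn.relu(tf.matmul(x, w_fc) + b_fc)

# keep_prob is fed as a placeholder so it can be 0.5 during training
# and 1.0 during evaluation (no neurons dropped at test time).
keep_prob = tf.placeholder(tf.float32)
h_fc_drop = tf.nn.dropout(h_fc, keep_prob)

# Training:   sess.run(train_step, feed_dict={x: batch_x, keep_prob: 0.5})
# Evaluation: sess.run(accuracy,   feed_dict={x: test_x,  keep_prob: 1.0})
```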
Note that Dropout is not yet supported in current versions of TensorFlow Lite, so students planning to run their model on TensorFlow Lite won't be able to use it for now. 😳
Conclusion
In this article, CoorChice introduced two ways to reduce the risk of overfitting: regularization and Dropout.
- L0 and L1 regularization can sparsify the weights, reduce the number of parameters and pick out the main features. However, L0 is hard to solve over a complex parameter set, so L1 regularization is usually used instead when sparsity is needed.
- L2 regularization weakens strong features and lets the model consider features more comprehensively. At the same time, L2 regularization can prevent gradient explosions 💥 to a certain extent.
- Dropout is a general way to prevent overfitting, but it increases training time and is not currently supported by TensorFlow Lite.
- Hey, leave a thumbs-up ❤️! Lok'tar 🛡
- CoorChice shares useful content from time to time. Head to CoorChice's [personal page] and follow.