What is overfitting:

Overfitting means that a model performs well on the training set but poorly on the cross-validation set or the test set. In other words, its predictions on unseen samples are mediocre: its generalization ability is poor. How do we measure performance on the training set? Machine learning offers standard metrics, such as ROC curves. Example 1: as shown in the figure below, Figure 1 is under-fitting, where the model cannot fit the data well; Figure 2 is the best fit; Figure 3 is over-fitting, where every feature is over-learned and a very complex model is adopted, so the fitted curve fluctuates wildly.
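To make these three regimes concrete, here is a minimal sketch (not taken from the original figures) that fits polynomials of increasing degree to a small noisy data set; the data, noise level, and degrees are illustrative assumptions. The low degree under-fits, a moderate degree fits well, and the high degree typically drives the training error down while the test error grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression problem: a smooth function plus noise (all values here are illustrative).
def true_fn(x):
    return np.sin(2 * np.pi * x)

x_train = np.sort(rng.uniform(0, 1, 30))
y_train = true_fn(x_train) + rng.normal(0, 0.2, x_train.shape)
x_test = np.sort(rng.uniform(0, 1, 300))
y_test = true_fn(x_test) + rng.normal(0, 0.2, x_test.shape)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Fit polynomials of increasing degree: low degree under-fits, moderate degree fits well,
# high degree tends to chase the noise (over-fitting).
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    print(f"degree {degree}: train MSE {mse(coeffs, x_train, y_train):.3f}, "
          f"test MSE {mse(coeffs, x_test, y_test):.3f}")
```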

Example 2: In the image below, the black line on the left fits the overall arrangement of the data reasonably well, while the blue-purple curve chases every sample point, twisting and turning, which is over-fitting. On the right, the black curve classifies the red and blue data points well. Although the over-fitted green curve separates the two classes perfectly on this data set, it adapts far less well to a new data set than the black curve does.

Causes of overfitting

<1> The training data set is too small to represent the distribution of the whole data and cannot cover all cases, so when new (unknown) data arrive, the model performs poorly. <2> Training the model for too long on a fixed training set often leads to over-fitting.

How to prevent overfitting?

The general methods are as follows: early stopping, data set amplification, regularization, and Dropout. One concept needs explaining first: in machine learning we often divide the raw data set into three parts: training data, validation data, and testing data. What is the validation data for? <1> It is used to decide the epoch at which to stop early, based on the accuracy on the validation data. <2> It is used to tune the learning rate and other hyperparameters. Why not do this directly on the testing data? Because if we did, the network would gradually overfit the testing data as training goes on, and the final testing accuracy would lose its reference value.
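As a concrete illustration of the three-way split, here is a minimal NumPy sketch; the function name split_dataset and the 70/15/15 fractions are illustrative choices, not values from the article:

```python
import numpy as np

def split_dataset(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle and split a data set into training, validation, and test parts.
    The fractions are illustrative defaults, not values prescribed by the article."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])

# Example usage with random placeholder data
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)
train, val, test = split_dataset(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # e.g. 700 150 150
```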

1/ Early stopping:

Set a maximum number of iterations in advance to prevent excessive training from causing over-fitting; that is, stop iterating before the model has fully converged to the training data set. Training a model is a process of repeatedly learning and updating its parameters, usually with an iterative method such as gradient descent. Early stopping can effectively prevent over-fitting, because over-fitting is in essence excessive learning of the training data's own idiosyncrasies. As shown in the figure below, during training the model's error on the training set keeps decreasing over time, while the error on the validation data first decreases and then gradually rises, forming an asymmetric U-shaped curve.

Training a model is a process of updating its parameters with some optimization algorithm. To obtain the parameters with the lowest validation error, the Early Stopping method runs the optimizer until the validation error has failed to decrease for several consecutive evaluations on the validation set. Although we may not stop exactly at the lowest point of the U, we can adjust "several times" to get a relatively good result. The usual practice is to record the best validation accuracy seen so far during training; when the validation accuracy has failed to beat that best value for, say, 10 consecutive epochs, we consider that accuracy is no longer improving and stop the iteration there.
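The patience rule described above can be sketched as follows; train_one_epoch and evaluate_on_validation are hypothetical placeholders (here simulated with a synthetic accuracy curve), and patience = 10 corresponds to the "10 consecutive epochs" mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_one_epoch():
    """Hypothetical placeholder: one epoch of training would go here."""
    pass

def evaluate_on_validation(epoch):
    """Hypothetical placeholder: a simulated validation accuracy that improves
    at first and then degrades, just to drive the loop in this sketch."""
    return 0.9 - 0.5 * np.exp(-epoch / 10) - 0.002 * max(0, epoch - 30) + rng.normal(0, 0.005)

patience = 10          # stop after 10 epochs without improvement (the "several times")
best_acc = -np.inf
epochs_without_improvement = 0

for epoch in range(200):
    train_one_epoch()
    val_acc = evaluate_on_validation(epoch)
    if val_acc > best_acc:
        best_acc = val_acc
        epochs_without_improvement = 0   # in practice, also save a model checkpoint here
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        print(f"Early stopping at epoch {epoch}, best validation accuracy {best_acc:.3f}")
        break
```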

2/ Data set amplification

There is a popular saying in data mining: "Sometimes it's better to have more data than a good model." In practice, however, more data often cannot be collected because of limited conditions, such as insufficient manpower, material, and financial resources.

For example, in a classification task the data need to be labeled, and in many cases the labeling is done by hand, so if too much data must be labeled, efficiency drops and errors creep in. Therefore, some computational methods and strategies are needed to work on the existing data set and obtain more data from it. Data set amplification means obtaining more data that meet the requirements; such data must be independent and identically distributed with the existing data, or at least approximately so. In general, the following methods are used: <1> collect more data from the data source (if conditions permit); <2> copy the original data and add random noise; <3> resampling; <4> estimate the parameters of the data distribution from the current data set and use that distribution to generate more data (a sketch of <2> and <3> follows below).
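As a rough illustration of methods <2> and <3>, here is a minimal NumPy sketch; the data shapes and the noise scale are assumptions made for the example, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative original data set (shapes and values are assumptions, not from the article).
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# <2> Copy the original data and add small Gaussian noise to the copies.
noise_scale = 0.05 * X.std(axis=0)            # noise proportional to each feature's spread
X_noisy = X + rng.normal(0, noise_scale, size=X.shape)
X_aug = np.concatenate([X, X_noisy])
y_aug = np.concatenate([y, y])

# <3> Resampling: draw bootstrap samples (sampling with replacement) from the data.
idx = rng.integers(0, len(X), size=200)
X_boot, y_boot = X[idx], y[idx]

print(X_aug.shape, y_aug.shape, X_boot.shape)  # (200, 5) (200,) (200, 5)
```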

3/ Regularization

Regularization adds a penalty term on the model's weights to the loss function, so that the learned parameters stay small and the model stays relatively simple. The most common forms are L1 regularization, which penalizes the sum of the absolute values of the weights, and L2 regularization (weight decay), which penalizes the sum of their squares; the strength of the penalty is controlled by a coefficient λ.
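Below is a minimal sketch of L2 regularization applied to a linear-regression loss; the coefficient lam = 0.01, the learning rate, and the toy data are illustrative assumptions:

```python
import numpy as np

def l2_regularized_loss(w, X, y, lam=0.01):
    """Mean squared error plus an L2 penalty lam * ||w||^2 (lam is illustrative)."""
    mse = np.mean((X @ w - y) ** 2)
    return mse + lam * np.sum(w ** 2)

def l2_gradient(w, X, y, lam=0.01):
    """Gradient of the loss above; the penalty contributes 2 * lam * w."""
    n = len(y)
    return 2 * X.T @ (X @ w - y) / n + 2 * lam * w

# Tiny gradient-descent loop on random placeholder data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 50)
w = np.zeros(3)
for _ in range(500):
    w -= 0.05 * l2_gradient(w, X, y)
print(w)   # close to [1, -2, 0.5], shrunk slightly toward zero by the penalty
```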

4/ Dropout

In neural networks, Dropout prevents over-fitting by modifying the number of active neurons in the hidden layers, that is, by modifying the network itself. During training, some hidden-layer neurons are randomly dropped with a given probability, while the neurons of the input and output layers are kept unchanged; this yields the simplified network shown on the left, which reduces the complexity of the original network. For each batch of training data, Dropout randomly removes neurons with a given probability p, and only the parameters of the neurons that are kept are updated. Because different neurons are removed for each batch, the network becomes sparse to a certain extent, which reduces the co-adaptation between different features; and because only part of the network's parameters are updated each time, the joint adaptability between neurons is weakened, which strengthens the generalization ability and robustness of the network. The drop probability is a hyperparameter, and Dropout is applied only during training, not on the test set.
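Below is a minimal sketch of the common "inverted dropout" implementation, in which surviving activations are rescaled by 1/(1-p) during training so that nothing needs to change at test time; the drop probability p = 0.5 and the toy activations are illustrative assumptions, and this activation-masking view is the standard implementation rather than something spelled out in the article:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: during training, zero each unit with probability p and
    scale the survivors by 1/(1-p) so the expected activation is unchanged;
    at test time, return the activations untouched. p = 0.5 is illustrative."""
    if not training or p == 0.0:
        return activations
    mask = rng.random(activations.shape) >= p        # keep with probability 1 - p
    return activations * mask / (1.0 - p)

# Example: a batch of hidden-layer activations (placeholder values)
h = rng.normal(size=(4, 8))
h_train = dropout(h, p=0.5, training=True)    # roughly half the units zeroed out
h_test = dropout(h, p=0.5, training=False)    # unchanged at test time
print((h_train == 0).mean(), np.allclose(h_test, h))
```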