The machine learning series will be published on the public account "Yue Lai Inn" — you are welcome to scan the QR code at the end of the article to follow it!

1 Fitting

In the last article, we explained why feature dimensions need to be standardized, what the consequences of skipping standardization are, and a common standardization method. At the same time, from another angle (feature mapping), we introduced how to map the original low-dimensional features to high-dimensional features through polynomials, so that a linear model can fit nonlinear data. In this article, the author will continue to introduce other methods and strategies for improving models.

Since the concept of fitting has not been formally introduced yet, we add it here. Solving a model is actually the process of fitting the model's parameters in some way (such as gradient descent). Fitting refers to this dynamic process of finding the model parameters; once the process finishes, the model ends up in one of several post-fitting states, such as overfitting or underfitting.

1.1 A motivating example

In the article on linear regression, we introduced several commonly used metrics for evaluating models, but now a question arises: does a smaller MAE or RMSE always mean a better model? Or is smaller only better under certain conditions? A quick reader might answer at a glance that the smaller the error, the better the model. But is that really so?

Suppose we have a bunch of sample points generated by some underlying function (which, in reality, you would not know). Due to noise and other factors, the sample points we obtained do not fall exactly on the curve; instead, they are scattered around it.

The red dots in the figure above are the data set we obtained, and the blue curve is the true underlying function. Now we need to model the data and solve for the corresponding prediction function. Suppose we model the 12 sample points with degree = 1, 5 and 10 respectively (degree denotes the highest degree of the polynomial); the fitted results are visualized as follows:

As you can see from the figure, as the degree of the polynomial increases, the value of the evaluation metric on the training data also gets higher and higher (seemingly, the higher the metric, the better the model), and when degree is set to 10, the metric even reaches 1.0. But should we therefore choose the degree = 10 model?

Some time later, a customer suddenly offers to buy your model for commercial use; at the same time, the customer brings a new set of labeled data to evaluate your model (you have evaluated it yourself before, but the customer will not simply take your word for it, in case you cheated). So you re-test your model on the customer's new data and visualize the results as follows:

To your surprise, the degree = 5 model now performs better than the degree = 10 model. What went wrong? The reason is that when we first modeled these 12 sample points, we tried to make the training metric as good as possible (as large as possible) and used a very complex model. As a result, the learned model deviated severely from the true underlying model, even though in the end every sample point fell "exactly" on the prediction curve. And that, of course, is not what we want.
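The experiment above can be sketched in a few lines of NumPy. The generating function, noise level and metric (R²) used here are assumptions for illustration, not the article's own figures; what the sketch reproduces is the pattern: the training-set score rises with the polynomial degree, while the score on fresh data need not.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Assumed stand-in for the unknown generating function
    return np.sin(2 * np.pi * x)

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# 12 noisy training samples, plus fresh "customer" data from the same source
x_train = np.linspace(0, 1, 12)
y_train = f(x_train) + rng.normal(0, 0.15, size=x_train.shape)
x_test = np.sort(rng.uniform(0, 1, 50))
y_test = f(x_test) + rng.normal(0, 0.15, size=x_test.shape)

scores = {}
for degree in (1, 5, 10):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    scores[degree] = (r2(y_train, np.polyval(coeffs, x_train)),
                      r2(y_test, np.polyval(coeffs, x_test)))
    print(f"degree={degree:2d}  "
          f"train R2={scores[degree][0]:.3f}  test R2={scores[degree][1]:.3f}")
```

Because the degree-10 hypothesis space contains every lower-degree polynomial, its training R² can never be lower than that of degree 5 or 1; that is exactly why the training score alone cannot be trusted when choosing a model.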

1.2 Overfitting and underfitting

In machine learning, we call the degree = 10 case overfitting, the degree = 1 case underfitting, and the degree = 5 case a good fit. The data used during modeling are called the training data, and the error produced on the training set is called the training error. The data used during testing are called the test data or validation data, and the error the model produces on the test set is called the generalization error. The whole modeling-and-solving process is called training.

It should be noted that the example above uses linear regression only to give an intuitive introduction to overfitting and underfitting. This does not mean these phenomena occur only in linear regression; in fact, every machine learning model faces this problem. In general, overfitting means good performance on the training set but poor performance on the test set; underfitting means poor performance on both; and a good fit means good performance on the training set (though not as good as an overfit model) together with good performance on the test set.

1.3 How to solve underfitting?

From the description above, we now have an intuitive understanding of underfitting: the trained model cannot even fit the existing training data well. The methods for solving underfitting are relatively simple and fall into the following two categories:

  • Redesign a more complex model

    For example, in linear regression the degree of the polynomial feature mapping can be increased;

  • Add more feature dimensions as input

    Collect or design more feature dimensions as inputs to the model.
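The first remedy can be shown with a toy sketch (the quadratic data set here is an assumption for illustration): a degree-1 model underfits data with curvature, and raising the polynomial degree of the feature mapping removes the underfit.

```python
import numpy as np

# Assumed toy data with curvature that a straight line cannot capture
x = np.linspace(-3, 3, 30)
y = x ** 2

def train_mse(degree):
    # Fit a polynomial of the given degree and measure the training error
    coeffs = np.polyfit(x, y, degree)
    return np.mean((y - np.polyval(coeffs, x)) ** 2)

print(train_mse(1))  # large: the linear model underfits
print(train_mse(2))  # near zero: adding the x^2 feature fixes the underfit
```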

1.4 How to solve overfitting?

There are two common methods to solve model overfitting:

  • Collect more data

    This is one of the most effective methods, but also one of the hardest to apply. The more training data there is, the more the influence of noisy data on the model can be corrected during training, which makes the model harder to overfit. However, collecting new data is often difficult.

  • Regularization

    This is an effective and easy-to-use way to mitigate model overfitting, which will be covered in the next article due to space constraints.

2 How to avoid overfitting

To detect whether the trained model is overfitting, we generally divide the available data set into two parts before training: a training set and a test set, usually in a 7:3 ratio. The training set is used to train the model (that is, to reduce the model's error on the training set), and the test set is then used to estimate the model's generalization error on unseen data and to observe whether overfitting occurs. However, a complete training workflow usually uses the data as "training set -> test set -> training set -> test set -> …", because you rarely choose the right model on the first attempt; as a result, the test set unconsciously ends up being used as part of training. Hence there is another division: training set, development (dev) set and test set, usually in a 7:2:1 ratio. Why are there two ways of splitting? It generally depends on how demanding the requirements on the model are. With the three-way split, the model is selected using the training and development sets, and the test set is reserved for one final evaluation. There is, however, no hard and fast standard for how to split.
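The 7:2:1 three-way split can be sketched as follows (the data-set size of 100 is an assumed example); shuffling the indices first ensures each subset is a random sample:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 100
indices = rng.permutation(n_samples)  # shuffle before splitting

# 7:2:1 split into training, development (dev) and test indices
train_idx = indices[:int(0.7 * n_samples)]
dev_idx = indices[int(0.7 * n_samples):int(0.9 * n_samples)]
test_idx = indices[int(0.9 * n_samples):]

print(len(train_idx), len(dev_idx), len(test_idx))
```

For the simpler 7:3 split, drop the dev slice; the key design point is that the final test indices are never touched during model selection.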

3 Summary

In this article, the author first introduced what fitting is, and then the three states it can produce: underfitting, a good fit and overfitting, of which the well-fitted model is the final result we want. The author then introduced methods for dealing with underfitting and overfitting; the concrete solution to overfitting will be explained in the next article. Finally, the author introduced how to detect and avoid overfitting by splitting the data set. This is the end of this article; thank you for reading!

If you have any questions or suggestions, please send an email to [email protected] with a link to this article attached.

Reference:

  • [1] Illustrations and sample code: reply "sample code" on the public account to obtain them.
  • [2] Model selection
  • [3] Bias and variance