Selection and evaluation of machine learning models

Error: the difference between the model's predicted output value and the true value.

Training: The process of learning from known sample data to produce a model.

Training Error: the error of the model when applied to the training set.

Generalization: reasoning from the specific to the general. For machine learning models, generalization refers to how the model behaves on new sample data (data outside the training set).

Generalization Error: the error of the model when applied to new sample data.

Underfitting and overfitting

Model Capacity: a model's ability to fit a wide variety of functions.

Overfitting: the model performs well on the training set but poorly on new samples. It has learned the characteristics of the training set too well, so that patterns which do not hold in general are absorbed and reflected by the model; as a result it performs well on the training set but poorly on new samples. The opposite case is called Underfitting: the model fails to learn even the general properties of the training set and performs poorly on the training set itself.
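As a rough illustration (a self-contained sketch with synthetic data, not taken from the original), fitting polynomials of different degrees to noisy samples of a cubic curve shows both failure modes: too little capacity underfits, too much overfits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noisy samples of a cubic curve.
x = rng.uniform(-1, 1, 60)
y = x**3 - x + rng.normal(scale=0.1, size=x.shape)

# Simple split: first 40 points for training, last 20 for testing.
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

for degree in (1, 3, 10):
    # Fit a polynomial of the given degree (the model's capacity).
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_err:.4f}  test MSE={test_err:.4f}")

# Typically: degree 1 underfits (both errors high), degree 10 overfits
# (training error very low, test error noticeably higher), degree 3 balances both.
```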

Model selection

Model Selection: for a given task there are usually several candidate models, and the same model can be configured with several sets of parameters. By analyzing and evaluating the generalization error of each candidate, the model with the smallest generalization error can be selected.

Refer to the figure above:

The blue dotted line represents the training error, which drops at first and then gradually approaches 0 as model capacity grows.

The solid red line represents the generalization error, which first decreases and then rises again.

We therefore choose the model at the position of the black dotted line, where the generalization error is minimal, the training error is still fairly small, and overfitting is limited. This is the optimal choice of the model.
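A minimal model-selection sketch under the same synthetic-data assumption: sweep the capacity (polynomial degree here) and keep the model whose held-out error, our estimate of the generalization error, is smallest.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = np.sin(2 * x) + rng.normal(scale=0.2, size=x.shape)

# Hold out part of the data to estimate the generalization error.
x_train, y_train = x[:150], y[:150]
x_val, y_val = x[150:], y[150:]

best_degree, best_err = None, np.inf
for degree in range(1, 13):
    coeffs = np.polyfit(x_train, y_train, degree)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    # Keep the capacity whose estimated generalization error is smallest
    # (the position of the black dotted line in the figure).
    if val_err < best_err:
        best_degree, best_err = degree, val_err

print(f"selected degree={best_degree}, validation MSE={best_err:.4f}")
```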

The approach to model evaluation

The generalization error of a model is estimated by experimental testing, and the model with the smallest estimated generalization error is selected. Since the full population behind the data set is unknown, a test set is used to approximate it: the test error serves as an approximation of the generalization error.

Note:

  • The test set and training set should be mutually exclusive as far as possible
  • The test set and training set should be independently and identically distributed

Hold-out method

Hold-out method: divide the known data set into two mutually exclusive parts; one part is used to train the model, the other to test it and evaluate its error as an estimate of the generalization error.

  • The split should preserve the consistency of the data distribution as far as possible, so that the division itself does not introduce artificial bias
  • Many different splits exist, each producing different training and test sets; the result of a single hold-out split is often accidental and unstable, so the random split is usually repeated several times and the average of the repeated evaluations is taken as the result
  • The relative sizes of the two parts affect the evaluation; common training-to-test ratios are 7:3, 8:2, etc.

The class proportions of the samples should be kept similar across the two parts, i.e. stratified sampling should be used.
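A minimal hold-out sketch, assuming scikit-learn and its bundled iris data (the original names no library): a stratified 7:3 split, repeated over several random seeds and averaged.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(10):  # repeat the random split to reduce its contingency
    # stratify=y keeps the class proportions similar in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

# Average over the splits as the final evaluation result.
print(f"mean accuracy over 10 hold-out splits: {np.mean(scores):.3f}")
```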

Cross validation

Cross Validation: divide the data set into k mutually exclusive subsets of similar size, keeping the data distribution as consistent as possible (stratified sampling). In each round, one subset serves as the test set and the remaining k-1 subsets as the training set, so training and testing are performed k times and the mean of the k evaluations is taken. This is known as k-fold cross-validation. Repeating the whole procedure with p different random partitions gives p-times k-fold cross-validation.
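A p-times k-fold sketch under the same scikit-learn assumption: k=10 stratified folds, repeated p=5 times, averaging all 50 train/test rounds.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# p-times k-fold cross-validation: k=10 folds, repeated p=5 times,
# with stratified folds to keep the data distribution consistent.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"mean accuracy over {len(scores)} train/test rounds: {scores.mean():.3f}")
```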

Leave-one-out method

Leave-one-out (LOO): a special case of k-fold cross-validation in which k equals the number of records. In each round a single record is held out as the test set and all remaining records are used to train the model. Each trained model is very close to the model that would be obtained from the entire data set, so the evaluation is quite accurate. The drawback is that when the data set is large, the number of training runs and the computational cost become very large.
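The same evaluation with leave-one-out, still assuming scikit-learn: one fold per record, so the model is trained as many times as there are records.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One fold per record: each round trains on all data except a single
# held-out record, so len(X) models are fit in total.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(f"{len(scores)} rounds, mean accuracy: {scores.mean():.3f}")
```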

Bootstrap method

Bootstrapping: a method of generating sample sets by random sampling with replacement. A record is drawn at random from a known data set of m records and copied into the bootstrap sample, then put back into the original data set before the next draw; this is repeated m times. The bootstrap sample (which may contain duplicates) is used as the training set, and the records that were never drawn are used as the test set.
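A plain-NumPy sketch of the sampling itself (sizes are illustrative): draw m records with replacement for the training set, and use the never-drawn, out-of-bag records as the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000                     # size of the known data set (illustrative)
indices = np.arange(m)

# Sample m records with replacement: the bootstrap sample is the training set.
boot = rng.choice(indices, size=m, replace=True)

# Records never drawn ("out-of-bag") form the test set.
test_idx = np.setdiff1d(indices, boot)

# The out-of-bag fraction approaches (1 - 1/m)^m -> 1/e ~ 36.8%.
print(f"distinct training records: {len(np.unique(boot))}")
print(f"out-of-bag (test) fraction: {len(test_idx) / m:.3f}")
```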

Usage scenarios for the model evaluation methods

Hold-out method:

  • It is simple and convenient to implement, and gives a usable estimate of the generalization error
  • Separating the test set from the training set mitigates overfitting
  • A single split makes the evaluation result highly contingent on that particular division
  • After the split, less data is available for both training and testing

Cross-validation (including leave-one-out):

  • k can be set according to the situation, and every sample is eventually used for both training and testing
  • Multiple splits make the evaluation results relatively stable
  • The computation is costly, requiring k rounds of training and evaluation

Bootstrap method:

  • When the sample size is small, the bootstrap can generate multiple bootstrap sample sets; roughly 36.8% of the records never appear in a given bootstrap sample and serve as the test set, since (1 - 1/m)^m approaches 1/e ≈ 0.368 as m grows
  • It makes no assumptions about the theoretical distribution of the population
  • Sampling with replacement changes the data distribution, which introduces an additional estimation bias

Choosing among these methods:

  • When the known data set is large enough, the hold-out method or k-fold cross-validation is usually adopted
  • When the known data set is small and it is difficult to split it effectively into training and test sets, the bootstrap method is adopted
  • When the known data set is small but can be split effectively, the leave-one-out method is adopted