Understanding and application of K-fold cross-validation

Personal home page: www.yansongsong.cn/

1. K-fold cross-validation concept

In machine learning modeling, it is common practice to divide the data into a training set and a test set. The test set is independent of training: it takes no part in training and is used only to evaluate the final model. During training, overfitting is a frequent problem, meaning the model fits the training data well but cannot predict data outside the training set well. If the test data were used to tune model parameters at this point, it would amount to knowing part of the test information during training, which would distort the accuracy of the final evaluation. The usual practice is therefore to set aside a portion of the training data as validation data, which is used to assess how well the model is training.

The validation data comes from the training data but does not participate in training, so it gives a relatively objective estimate of how well the model handles data outside the training set. Cross-validation, also called rotation estimation, is commonly used to evaluate a model on validation data. It divides the original data into K folds, uses each fold in turn as the validation set, and uses the remaining K-1 folds as the training set; this yields K models. Each of the K models is evaluated on its validation set, and the resulting errors, for example the MSE (mean squared error), are summed and averaged to obtain the cross-validation error. Cross-validation makes effective use of limited data, and the resulting estimate is as close as possible to the model's performance on the test set, so it can serve as an indicator for model selection and tuning.
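As a minimal sketch of this procedure in Python, assuming scikit-learn is available (the synthetic data and the linear-regression model below are placeholders, not anything prescribed by this article), the per-fold MSE can be computed and averaged like this:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data; in practice X, y are your training set.
X, y = make_regression(n_samples=120, n_features=5, noise=0.3, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)    # K = 5 folds
model = LinearRegression()

# One MSE per validation fold; scikit-learn returns negated MSE scores.
mse_per_fold = -cross_val_score(model, X, y, cv=kf,
                                scoring="neg_mean_squared_error")
cv_error = mse_per_fold.mean()                          # the cross-validation error
print(mse_per_fold, cv_error)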

2. A worked example

Here is a concrete example of the K-fold process. Suppose we have the following data:

[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

Divide it into K = 3 folds:

Fold1: [0.5, 0.2]
Fold2: [0.1, 0.3]
Fold3: [0.4, 0.6]

Cross-validation then trains and tests the following three models in turn; the overall cross-validation score is obtained by summing and averaging the MSE of each validation fold:

Model1: Trained on Fold1 + Fold2, Tested on Fold3
Model2: Trained on Fold2 + Fold3, Tested on Fold1
Model3: Trained on Fold1 + Fold3, Tested on Fold2
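A small Python sketch of how these combinations look in code; the fold indices below are hard-coded to match the example above, whereas in practice a routine such as sklearn.model_selection.KFold would assign the folds:

import numpy as np

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# Hard-coded fold indices matching the example:
# Fold1 = [0.5, 0.2], Fold2 = [0.1, 0.3], Fold3 = [0.4, 0.6]
folds = [np.array([4, 1]), np.array([0, 2]), np.array([3, 5])]

for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"Validate on Fold{i + 1}: {data[val_idx]}, "
          f"train on the rest: {data[train_idx]}")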

3. Applying k-fold to model selection

 

1. Divide the whole training set S into k disjoint subsets. If S contains m training samples in total, each subset has m/k of them; denote the subsets by S_1, S_2, ..., S_k.

2. Take one model M_i out of the model set M at a time. Select k-1 of the subsets {S_1, ..., S_k} (leaving out a different subset S_j each time), train M_i on those k-1 subsets to obtain a hypothesis h_ij, and finally test h_ij on the remaining subset S_j to obtain an empirical error.

3. Because each subset S_j (j = 1, ..., k) is left out in turn, we obtain k empirical errors for each model M_i, and the empirical error of M_i is the average of those k values.

4. Select the model M_i with the lowest average empirical error, then train it again on all of S to obtain the final hypothesis.
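A hedged Python sketch of steps 1-4, with scikit-learn estimators standing in for the model set M (the candidate models and the synthetic data are illustrative assumptions, not part of the original procedure):

import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)
candidates = {"ridge": Ridge(alpha=1.0), "lasso": Lasso(alpha=0.1)}  # model set M
kf = KFold(n_splits=5, shuffle=True, random_state=0)                 # step 1

avg_errors = {}
for name, model in candidates.items():                   # step 2: each model M_i
    errors = []
    for train_idx, val_idx in kf.split(X):               # leave out one subset S_j at a time
        h = clone(model).fit(X[train_idx], y[train_idx])              # hypothesis h_ij
        errors.append(mean_squared_error(y[val_idx], h.predict(X[val_idx])))
    avg_errors[name] = np.mean(errors)                   # step 3: average empirical error

best = min(avg_errors, key=avg_errors.get)               # step 4: lowest average error
final_model = clone(candidates[best]).fit(X, y)          # retrain on all of S
print(avg_errors, "->", best)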

 

Core content:

Steps 1, 2 and 3 above measure the model's performance, and the average value is taken as the performance indicator of that model.

Method 1: fuse (ensemble) the k models obtained from the k folds of training.

Method 2: select the optimal model according to the performance indicator, then retrain it on all the data as in step 4 above to obtain the final model.

Questions and answers:

1. Why not simply split the data once into a training set and a validation set to evaluate the model's performance? Isn't this repeated partitioning too much trouble?

To prevent overfitting during training, the usual practice is to split the data into a training set and a test set, where the test set is independent of training, does not participate in it at all, and is used only to evaluate the final model. A single direct split, however, has a problem: the held-out portion never participates in training, which wastes part of a small dataset and leaves no way to improve the model further (the data determines the upper bound of the achievable performance; the model and algorithm only approach that bound). Yet we cannot skip holding out data, because we need to verify the network's generalization performance. With k-fold repeated partitioning, the entire dataset can be utilized, and averaging the results gives a reasonable representation of the model's performance.

2. Why retrain on the whole dataset at the end? Isn't that too time-consuming?

Training repeatedly in the k-fold manner is how we obtain a performance indicator for a model; a model trained on a single fold cannot represent the overall performance. Through the k-fold runs, however, we can record which hyperparameters train best, and then retrain the optimal model with those optimal hyperparameters on all the data, which achieves better results.

Alternatively, you can adopt the first approach: skip the retraining and use model fusion instead.
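One possible sketch of this workflow using scikit-learn's GridSearchCV, which scores every hyperparameter setting by k-fold cross-validation and, with refit=True, retrains the best setting on the whole training set (the estimator and parameter grid here are placeholder assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}  # illustrative grid
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5,                  # 5-fold cross-validation
                      scoring="accuracy", refit=True)    # refit the best setting on all data
search.fit(X, y)

print(search.best_params_, search.best_score_)
final_model = search.best_estimator_   # already retrained on the full training set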

3. When should k-fold be used?

In my opinion, if the total amount of data is small and other methods cannot improve performance any further, you can try k-fold. In other cases it is not recommended: for example, when the amount of data is large, there is no need to train on more of it, while the training cost grows by roughly a factor of k (mainly in training time).

 

4. A practical example

 

After training the five fold combinations of a standard 5-fold split (each model trained on four folds and validated on the remaining one), we have five models at test time. Each model predicts the test set once, giving five probability matrices, each of shape (number of test samples × 17). We can average the five probability matrices directly and then threshold them into binary predictions, or make the binary predictions separately for each model and then vote to obtain the final multi-label result. This result effectively uses the training data of all five folds, so it is more accurate and more stable.
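A minimal NumPy sketch of the two ensembling options described above; the test-set size, the random stand-in probabilities, and the 0.5 threshold are assumptions for illustration only:

import numpy as np

n_test, n_labels = 1000, 17
rng = np.random.default_rng(0)

# Stand-ins for the five probability matrices of shape (n_test, 17)
# produced by the five fold models on the test set.
probs = [rng.random((n_test, n_labels)) for _ in range(5)]

# Option 1: average the probability matrices, then threshold once.
avg_pred = (np.mean(probs, axis=0) > 0.5).astype(int)

# Option 2: threshold each model's output separately, then take a majority vote.
votes = np.stack([(p > 0.5).astype(int) for p in probs], axis=0)
vote_pred = (votes.sum(axis=0) >= 3).astype(int)   # at least 3 of the 5 models agree

print(avg_pred.shape, vote_pred.shape)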

Of course, if you only want to use all the data, it is simpler to train the model once on the entire training set and then use that trained model to predict the test set. We did not adopt this second approach, however. First, with all the training samples "seen" by the model and no separate validation set left, it is hard to evaluate its generalization performance. Second, we believe that making the predictions of the five models in the first approach into a simple ensemble is more stable.

5. References

1. K-fold cross-validation

2. Regularization and Model selection

3. Kaggle's Survival: Amazon Rainforest