Hold-out (set aside)
The original data is split by a ratio n : m with n + m = 1, for example train : test = 7 : 3 or train : test = 6.5 : 3.5. This relatively primitive approach is not ideal; its disadvantages are as follows:
- Disadvantage 1: it wastes data, since the held-out portion is never used for training
- Disadvantage 2: it overfits easily, and correcting for this is inconvenient
In such cases we need other splitting methods: cross validation, or Leave-P-Out.
LOO or LPO
- LOO: each time, one sample from the whole data set is used as the validation set and all remaining samples as the training set.
- LPO: each time, P samples from the whole data set are used as the validation set and all remaining samples as the training set.
LOO has the advantage of wasting no data, but at a higher computational cost. Generally, LOO has higher variance than K-fold; however, when variance dominates the error, LOO may perform better than K-fold cross validation.
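To make the cost difference concrete, here is a minimal sketch (not from the original article) that scores the same linear SVC with both LOO and 5-fold CV on iris; LOO fits the model n = 150 times, while 5-fold fits it only 5 times:

# A minimal sketch comparing LOO and 5-fold CV on the same model and data.
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear')

# LOO: one split per sample (150 fits here) -> expensive but wastes no data
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
# K-fold: only k splits (5 fits here)
kf_scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5))

print(loo_scores.mean(), kf_scores.mean())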
K-Fold
KFold divides all samples into k groups of equal size (if possible), called folds (if k = n, this is equivalent to the Leave One Out policy). The prediction function is learned on the data in k-1 folds, and the remaining fold is used for testing. This scheme is used in the ensemble algorithm Stacking (Bagging, by contrast, uses subsampling, which is also interesting, as described earlier).
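As an aside, here is a minimal sketch of how K-fold feeds Stacking: the out-of-fold predictions of a base model, collected across folds, become an input feature for the second-level model. The names (base_model, oof_pred) are illustrative assumptions, not from the article:

# A minimal sketch of K-fold out-of-fold predictions for Stacking.
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

X, y = datasets.load_iris(return_X_y=True)
base_model = svm.SVC(kernel='linear')

oof_pred = np.zeros_like(y)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    base_model.fit(X[train_idx], y[train_idx])            # learn on k-1 folds
    oof_pred[test_idx] = base_model.predict(X[test_idx])  # predict the held-out fold

# oof_pred now covers every sample exactly once and can be stacked
# as an input feature for a second-level model.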
Note
While i.i.d. data is a common assumption in machine learning theory, it rarely holds in practice. If you know the samples were generated by a time-dependent process, it is safer to use a time-series-aware cross-validation scheme. Similarly, if you know the generation process has a group structure (samples collected from different subjects, experiments, or measurement devices), it is safer to use group-wise cross-validation.
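A minimal sketch of the two safer schemes just mentioned, using sklearn's TimeSeriesSplit and GroupKFold on toy data:

# Toy demonstration of time-series-aware and group-wise splitters.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])

# Time-series aware: training folds always precede the test fold in time
for train, test in TimeSeriesSplit(n_splits=3).split(X):
    print("TS  train:", train, "test:", test)

# Group-wise: samples sharing a group never appear in both train and test
groups = np.array([0, 0, 1, 1, 2, 2])
for train, test in GroupKFold(n_splits=3).split(X, y, groups=groups):
    print("GKF train:", train, "test:", test)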
On stratification and repetition
- Stratification: for a K-fold in sklearn, this means the class proportions within each fold are kept roughly equal to the class proportions in the original data set.
- Repetition: sampling with replacement, as in Bagging; some samples appear in the training set multiple times while others never appear at all (see the sketch after this list).
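A minimal sketch of repetition (bootstrap sampling with replacement, as in Bagging); the index names are illustrative:

# Bootstrap resampling: duplicates appear, and some samples are left out-of-bag.
import numpy as np

rng = np.random.RandomState(0)
n = 10
bootstrap_idx = rng.choice(n, size=n, replace=True)  # resample with replacement
oob_idx = np.setdiff1d(np.arange(n), bootstrap_idx)  # indices never drawn this round

print("bootstrap sample:", np.sort(bootstrap_idx))
print("out-of-bag      :", oob_idx)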
Cross validation
With LOO and LPO, cross validation means that each sample (or each set of P samples) serves as the validation set exactly once; the scores are then averaged to obtain the final Score. K-fold is similar, except that the data is first divided into K folds.
Sklearn provides convenient CV implementations
Quick and easy to use
Load the data
from sklearn.model_selection import train_test_split, LeaveOneOut, LeavePOut
from sklearn import datasets
from sklearn import svm
from sklearn.metrics import accuracy_score
import numpy as np

iris = datasets.load_iris()         # classic 150-sample, 3-class data set
clf_svc = svm.SVC(kernel='linear')  # linear SVM classifier
iris.data.shape, iris.target.shape
((150, 4), (150,))
Hold out
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
clf_svc.fit(X_train, y_train)
accuracy_score(clf_svc.predict(X_test), y_test)
0.9666666666666667
Leave One Out
loo = LeaveOneOut()
loo.get_n_splits(iris.data)  # one split per sample: 150

mean_accuracy_score_list = []
for train_index, test_index in loo.split(iris.data):
    clf_svc.fit(iris.data[train_index], iris.target[train_index])
    prediction = clf_svc.predict(iris.data[test_index])
    mean_accuracy_score_list.append(accuracy_score(iris.target[test_index], prediction))
print(np.average(mean_accuracy_score_list))
0.98
Leave P Out
LeavePOut is very similar to LeaveOneOut: it creates all possible training/test sets by removing P samples from the complete set. For n samples, this produces C(n, p) train-test pairs (the number of ways to choose p samples out of n). It is worth noting that the computational overhead grows combinatorially, on the order of O(n^p) splits, so the example below takes far longer than the one above. Even when p is small, exercise caution when the data volume is large.
lpo = LeavePOut(p=2)
mean_accuracy_score_list = []
for train_index, test_index in lpo.split(iris.data):
    clf_svc.fit(iris.data[train_index], iris.target[train_index])
    prediction = clf_svc.predict(iris.data[test_index])
    mean_accuracy_score_list.append(accuracy_score(iris.target[test_index], prediction))
print(np.average(mean_accuracy_score_list))
0.9793627184231215
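A quick sanity check of the combinatorial growth described above, as a sketch reusing the iris data loaded earlier (math.comb assumes Python 3.8+): for n = 150 and p = 2 there are already C(150, 2) = 11175 splits.

# Verify that LeavePOut produces exactly C(n, p) splits.
from math import comb
from sklearn.model_selection import LeavePOut

lpo_check = LeavePOut(p=2)
print(lpo_check.get_n_splits(iris.data))  # 11175 splits for n=150, p=2
print(comb(150, 2))                       # C(150, 2) = 150*149/2 = 11175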
The following toy example shows the splits more clearly:
X = np.ones(4)
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print("%s %s" % (train, test))
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]
K-Fold
The regular K-fold is KFold; there is also a stratified variant, StratifiedKFold.
from sklearn.model_selection import KFold, StratifiedKFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=4)
for train, test in kf.split(X):
    print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
X = np.array([[ 1,  2,  3,  4],
              [11, 12, 13, 14],
              [21, 22, 23, 24],
              [31, 32, 33, 34],
              [41, 42, 43, 44],
              [51, 52, 53, 54],
              [61, 62, 63, 64],
              [71, 72, 73, 74]])
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])
# note: random_state only takes effect when shuffle=True, so it is omitted here
stratified_folder = StratifiedKFold(n_splits=4, shuffle=False)
for train_index, test_index in stratified_folder.split(X, y):
    print("Stratified Train Index:", train_index)
    print("Stratified Test Index:", test_index)
    print("Stratified y_train:", y[train_index])
    print("Stratified y_test:", y[test_index], '\n')
Stratified Train Index: [1 3 4 5 6 7]
Stratified Test Index: [0 2]
Stratified y_train: [1 0 1 1 0 0]
Stratified y_test: [1 0]
Stratified Train Index: [0 2 4 5 6 7]
Stratified Test Index: [1 3]
Stratified y_train: [1 0 1 1 0 0]
Stratified y_test: [1 0]
Stratified Train Index: [0 1 2 3 5 7]
Stratified Test Index: [4 6]
Stratified y_train: [1 1 0 0 1 0]
Stratified y_test: [1 0]
Stratified Train Index: [0 1 2 3 4 6]
Stratified Test Index: [5 7]
Stratified y_train: [1 1 0 0 1 0]
Stratified y_test: [1 0]
In practice, however, the wrapper cross_val_score is more commonly used for model selection; its default strategy is K-fold. In addition, cross_val_predict can be used to obtain out-of-fold predictions, although these do not necessarily give the best estimate of generalization performance.
from sklearn.model_selection import cross_val_score

scores_clf_svc_cv = cross_val_score(clf_svc, iris.data, iris.target, cv=5)
print(scores_clf_svc_cv)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_clf_svc_cv.mean(), scores_clf_svc_cv.std() * 2))
[0.96666667 1.         0.96666667 0.96666667 1.        ]
Accuracy: 0.98 (+/- 0.03)
from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(clf_svc, iris.data, iris.target, cv=10)
accuracy_score(iris.target, predicted)
0.9733333333333334
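One more note on the cv parameter used above, as a sketch reusing clf_svc and iris: an integer cv is shorthand, and for classifiers sklearn stratifies the folds by default; passing an explicit splitter object makes the strategy visible and tunable.

from sklearn.model_selection import cross_val_score, StratifiedKFold

# equivalent in spirit to cv=5, but the splitting strategy is now explicit
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(clf_svc, iris.data, iris.target, cv=skf).mean())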
References
- Sklearn documentation (English): Cross-validation
- Sklearn data splitters
For more information, please see my blog, which I will keep updated.