5-fold cross-validation: divide the data into 5 equal parts (folds). In each experiment one fold is held out as the test set and the remaining folds are used for training, and the final result is the average over the 5 experiments. As shown in the figure above, in the first experiment the first fold is the test set and the rest is the training set; in the second experiment the second fold is the test set and the rest is the training set; and so on.

This all sounds simple, but writing the code raises questions: how do I split the data into five equal parts? How do I make sure the split is the same every time I run an experiment? In ordinary training, the data is split into training, validation and test sets at a 6:2:2 ratio: the model is trained on the training set, the best model is saved based on the validation set, and the test set is used for the final evaluation. With cross-validation there is no validation set, so how do I save the best model? Here are the answers.

1. Divide the data into K equal parts

Use sklearn's KFold class: KFold(n_splits=5, *, shuffle=False, random_state=None). It provides train/test indexes to split the data into k consecutive folds (by default the data is not shuffled).

Parameters

  • n_splits: int, default=5. Number of folds to split the data into; must be at least 2.
  • shuffle: bool, default=False. Whether to shuffle the data before splitting it into folds. True shuffles the data, False does not.
  • random_state: int, default=None. Only used when shuffle=True. If random_state is None, the folds are different every time the code is run; if random_state is set to an integer, the same folds are produced on every run, which keeps the experiment reproducible. Any integer works; random_state=42 is a common choice. Once set, it should not be changed.

Using the KFold class requires initialization and then calling its methods for data partitioning. Its two methods are:

  • get_n_splits(X=None, y=None, groups=None)

Returns the number of split iterations in the cross validator

  • split(X, y=None, groups=None)

Generates indexes to split the data into training and test sets. X: array of shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features. y: array of shape (n_samples,), default=None. Yields: the train and test indexes for each split. Note that it returns the indexes of each set, not the data itself.
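Since get_n_splits is not used in the examples below, here is a minimal sketch of it (the toy array X is only an illustration):

from sklearn.model_selection import KFold
import numpy as np

X = np.arange(24).reshape(12, 2)   # toy data: 12 samples, 2 features
kf = KFold(n_splits=5)
print(kf.get_n_splits(X))          # -> 5, the number of splitting iterations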

Example 1: Setting shuffle=False, the result is the same every time

from sklearn.model_selection import KFold
import numpy as np

X = np.arange(24).reshape(12, 2)
y = np.random.choice([1, 2], 12, p=[0.4, 0.6])
kf = KFold(n_splits=5, shuffle=False)  # Initialize KFold
for train_index, test_index in kf.split(X):  # Call split to generate the indexes
    print('train_index:%s , test_index: %s ' % (train_index, test_index))

Result: the indexes of each of the 5 folds

train_index:[ 3  4  5  6  7  8  9 10 11] , test_index: [0 1 2] 
train_index:[ 0  1  2  6  7  8  9 10 11] , test_index: [3 4 5] 
train_index:[ 0  1  2  3  4  5  8  9 10 11] , test_index: [6 7] 
train_index:[ 0  1  2  3  4  5  6  7 10 11] , test_index: [8 9] 
train_index:[0 1 2 3 4 5 6 7 8 9] , test_index: [10 11] 

To obtain the data and the corresponding labels from the indexes:

fold1_train_data, fold1_train_label = X[train_index], y[train_index]

Example 2: Setting shuffle=True, the results are different on each run
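A minimal sketch of this case, using the same toy data as in Example 1:

from sklearn.model_selection import KFold
import numpy as np

X = np.arange(24).reshape(12, 2)
kf = KFold(n_splits=5, shuffle=True)   # no random_state: the folds change on every run
for train_index, test_index in kf.split(X):
    print('train_index:%s , test_index: %s ' % (train_index, test_index))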

Example 3: Setting shuffle=True and random_state=<integer>, the result is the same on each run
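A minimal sketch of this case, again with the toy data from Example 1:

from sklearn.model_selection import KFold
import numpy as np

X = np.arange(24).reshape(12, 2)
kf = KFold(n_splits=5, shuffle=True, random_state=42)   # fixed seed: shuffled but reproducible
for train_index, test_index in kf.split(X):
    print('train_index:%s , test_index: %s ' % (train_index, test_index))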

Therefore, Example 3 is recommended in practice: it adds randomness to the data while keeping the experiment reproducible.

Example 4: Real case data partitioning

I have some 3D .nii.gz data for a segmentation task. The images and labels are placed in different folders, for example:

└── root directory
    ├── image
    │   ├── 1.nii.gz
    │   ├── 2.nii.gz
    │   └── 3.nii.gz
    └── label
        ├── 1.nii.gz
        ├── 2.nii.gz
        └── 3.nii.gz
import os
import glob

# Build a dictionary of image and label paths for each sample and collect them in a list
images1 = sorted(glob.glob(os.path.join(data_root, 'ImagePatch', 'l*.nii.gz')))
labels1 = sorted(glob.glob(os.path.join(data_root, 'Mask01Patch', 'l*.nii.gz')))
images2 = sorted(glob.glob(os.path.join(data_root, 'ImagePatch', 'r*.nii.gz')))
labels2 = sorted(glob.glob(os.path.join(data_root, 'Mask01Patch', 'r*.nii.gz')))
data_dicts1 = [{'image': image_name, 'label': label_name}
               for image_name, label_name in zip(images1, labels1)]
data_dicts2 = [{'image': image_name, 'label': label_name}
               for image_name, label_name in zip(images2, labels2)]
all_files = data_dicts1 + data_dicts2

all_files is a list of all the data, where each element is a dictionary holding the image path and the label path for one sample. We now split all_files into 5 folds for cross-validation:

    folder = KFold(n_splits=5, random_state=42, shuffle=True)
    train_files = []   # Stores the training-set partition of each fold
    test_files = []    # Stores the test-set partition of each fold
    for k, (Trindex, Tsindex) in enumerate(folder.split(all_files)):
        train_files.append(np.array(all_files)[Trindex].tolist())
        test_files.append(np.array(all_files)[Tsindex].tolist())

    # Write the partition to CSV so we can check that it is the same each time
    df = pd.DataFrame(data=train_files, index=['0', '1', '2', '3', '4'])
    df.to_csv('./data/Kfold/train_patch.csv')
    df1 = pd.DataFrame(data=test_files, index=['0', '1', '2', '3', '4'])
    df1.to_csv('./data/Kfold/test_patch.csv')

We saved the dataset partition to CSV so that the original split is not lost if the code later changes.
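If you ever need to recover the split from those CSVs, a minimal sketch could look like this (it assumes the CSVs were written exactly as above; the dictionaries are stringified when saved, so ast.literal_eval is used to parse them back, and NaN padding from unequal fold sizes is skipped):

import ast
import pandas as pd

df = pd.read_csv('./data/Kfold/train_patch.csv', index_col=0)
train_files = [
    [ast.literal_eval(cell) for cell in row if isinstance(cell, str)]  # skip NaN cells
    for row in df.values.tolist()
]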

Once the data set is partitioned, it’s ready for training and testing. Just take one fold of data at a time.

    # Train each fold separately: take one fold's train/test split at a time
    train(train_files[0], test_files[0])
    test(test_files[0])

In the train and test functions we still have to write the corresponding dataloader, because what we partitioned above are only the file paths, not the loaded data.
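For illustration, here is a minimal sketch of such a dataloader, assuming a plain PyTorch Dataset and nibabel for reading the .nii.gz files (the original post does not specify which loading library is used):

import nibabel as nib
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NiiDataset(Dataset):
    """Loads (image, label) volumes from the path dictionaries of one fold."""
    def __init__(self, file_dicts):
        self.file_dicts = file_dicts   # e.g. train_files[0]

    def __len__(self):
        return len(self.file_dicts)

    def __getitem__(self, idx):
        item = self.file_dicts[idx]
        image = nib.load(item['image']).get_fdata().astype(np.float32)
        label = nib.load(item['label']).get_fdata().astype(np.float32)
        return torch.from_numpy(image), torch.from_numpy(label)

train_loader = DataLoader(NiiDataset(train_files[0]), batch_size=1, shuffle=True)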

The usual approach is to loop five times and get the results of all five folds in a single run. The nice thing about writing it this way is that if you only want to train a particular fold, you just change the index; you do not have to train all folds at once. As long as the code is untouched, the dataset partition will be the same even if you train again a year from now. And even if it does change, we have already saved the partition as CSV.
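For completeness, a minimal sketch of looping over all five folds in one run (it assumes the test function returns the fold's test metric, which is not part of the original code):

fold_metrics = []
for k in range(5):                          # one experiment per fold
    train(train_files[k], test_files[k])
    fold_metrics.append(test(test_files[k]))
print('mean metric over 5 folds:', sum(fold_metrics) / len(fold_metrics))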

Of course, this is just one way of writing it; if you have a better scheme, feel free to discuss ~~

2. How to save the best model when there is no validation set

This is a question I have always wondered about. Before using cross-validation, I saved the best model based on the metric on the validation set. For example, the following code is driven by a validation-set metric.

if metric > best_metric:   # metric computed on the validation set
     best_metric = metric
     best_metric_epoch = epoch + 1
     save_dir = 'checkpoints/checkpoint_04264/'
     if not os.path.exists(save_dir):
         os.makedirs(save_dir)
     save_path = save_dir + str(epoch + 1) + "best_metric_model.pth"
     torch.save(model.state_dict(), save_path)
     print('saved new best metric model')

However, now that there is no validation set, should I save the model based on the metric on the training set, or on the metric on the test set? There is no single agreed answer; both practices exist. Since there is no agreed answer, we can pick whichever is most favorable to us 😜. For example, when writing a paper, saving the model based on the test-set results will give better numbers.
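If you go with the test-set option, a minimal per-fold sketch could look like this (the fold index k and the per-fold checkpoint directory are assumptions for illustration, not part of the original code):

# Inside the training loop of fold k: metric is now computed on the fold's test set
if metric > best_metric:
    best_metric = metric
    save_dir = 'checkpoints/fold_' + str(k) + '/'   # one checkpoint directory per fold
    os.makedirs(save_dir, exist_ok=True)
    torch.save(model.state_dict(), save_dir + 'best_metric_model.pth')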

And, as a tip, cross-validation usually gives better results on the test set than a fixed 6:2:2 train/validation/test split. Think about why 😉