5-fold cross-validation: divide the data into 5 equal parts; in each experiment one part is used as the test set and the remaining parts for training, and the final result is the average over the 5 experiments. As shown in the figure above, in the first experiment the first part is taken as the test set and the rest as the training set; in the second experiment the second part is used as the test set and the rest as the training set; and so on.
That all sounds simple, but how do you actually write the code? How do I split the data into five equal parts? How do I make sure the data is split the same way every time I run an experiment? In an ordinary training setup, the data is divided into training, validation and test sets at a 6:2:2 ratio: the model is trained on the training set, the best model is saved according to the validation set, and the test set is used for the final evaluation. With cross-validation there is no validation set, so how do I save the best model? Here are the answers.
1. Divide the data into K equal parts
Use the KFold class: KFold(n_splits=5, *, shuffle=False, random_state=None) provides train/test indices to split the data, dividing the dataset into k folds (by default, the data is not shuffled).
Parameters:
- n_splits: int, default=5. The number of folds to split into.
- shuffle: bool, default=False. Whether to shuffle the data before splitting the dataset: True shuffles the data, False does not.
- random_state: int, default=None. When shuffle is True and random_state is None, the splits are different each time the code is run; when random_state is specified, the same splits are obtained on every run, which makes the experiment reproducible. You can set random_state to any integer you like; random_state=42 is commonly used. Once chosen, keep it fixed.
Using the KFold class requires initialization and then calling its methods for data partitioning. Its two methods are:
get_n_splits(X=None, y=None, groups=None)
Returns the number of split iterations in the cross validator
split(X, y=None, groups=None)
Generates indices to split the data into training and test sets. X: array of shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features. y: array of shape (n_samples,), default=None. Yields: the train and test indices for each split. Note that it returns the indices of each set, not the data itself.
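As a quick illustration, get_n_splits simply returns the number of folds the splitter was configured with; a minimal sketch:
from sklearn.model_selection import KFold
import numpy as np

X = np.arange(10).reshape(5, 2)
kf = KFold(n_splits=5)
print(kf.get_n_splits(X))  # prints 5, the number of split iterations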
Example 1: Setting shuffle=False, the result is the same every time.
from sklearn.model_selection import KFold
import numpy as np

X = np.arange(24).reshape(12, 2)
y = np.random.choice([1, 2], 12, p=[0.4, 0.6])
kf = KFold(n_splits=5, shuffle=False)  # initialize KFold
for train_index, test_index in kf.split(X):  # call split to generate the indices
    print('train_index:%s , test_index: %s ' % (train_index, test_index))
Result: the indices of the data for each of the 5 folds
train_index:[ 3 4 5 6 7 8 9 10 11] , test_index: [0 1 2]
train_index:[ 0 1 2 6 7 8 9 10 11] , test_index: [3 4 5]
train_index:[ 0 1 2 3 4 5 8 9 10 11] , test_index: [6 7]
train_index:[ 0 1 2 3 4 5 6 7 10 11] , test_index: [8 9]
train_index:[0 1 2 3 4 5 6 7 8 9] , test_index: [10 11]
To obtain the data and the corresponding labels from the indices:
fold1_train_data, fold1_train_label = X[train_index], y[train_index]
Example 2: Setting shuffle=True, the result is different on every run.
Example 3: Setting shuffle=True and random_state to a fixed integer, the result is the same on every run.
Therefore, Example 3 is recommended in practice: it adds randomness to the data while keeping the experiment reproducible.
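A minimal sketch of Examples 2 and 3, reusing the same toy X as in Example 1:
from sklearn.model_selection import KFold
import numpy as np

X = np.arange(24).reshape(12, 2)

# Example 2: shuffle=True without random_state -> different folds on every run
kf2 = KFold(n_splits=5, shuffle=True)

# Example 3: shuffle=True with a fixed random_state -> the same folds on every run
kf3 = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in kf3.split(X):
    print('train_index:%s , test_index: %s' % (train_index, test_index))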
Example 4: Real case data partitioning
I have some 3D nii.gz data for a segmentation task. Images and labels are placed in separate folders, for example:
root
├── image
│   ├── 1.nii.gz
│   ├── 2.nii.gz
│   └── 3.nii.gz
└── label
    ├── 1.nii.gz
    ├── 2.nii.gz
    └── 3.nii.gz
import os
import glob

images1 = sorted(glob.glob(os.path.join(data_root, 'ImagePatch', 'l*.nii.gz')))
labels1 = sorted(glob.glob(os.path.join(data_root, 'Mask01Patch', 'l*.nii.gz')))
images2 = sorted(glob.glob(os.path.join(data_root, 'ImagePatch', 'r*.nii.gz')))
labels2 = sorted(glob.glob(os.path.join(data_root, 'Mask01Patch', 'r*.nii.gz')))

# build a dict of image and label for each case and collect them in one list
data_dicts1 = [{'image': image_name, 'label': label_name}
               for image_name, label_name in zip(images1, labels1)]
data_dicts2 = [{'image': image_name, 'label': label_name}
               for image_name, label_name in zip(images2, labels2)]
all_files = data_dicts1 + data_dicts2
all_files is a list of all the data, and each element of the list is a dictionary holding the file paths of one image and its label. We now split all_files for 5-fold cross-validation:
from sklearn.model_selection import KFold
import numpy as np
import pandas as pd

kfold = KFold(n_splits=5, random_state=42, shuffle=True)
train_files = []  # stores the training set of each fold
test_files = []   # stores the test set of each fold
for k, (Trindex, Tsindex) in enumerate(kfold.split(all_files)):
    train_files.append(np.array(all_files)[Trindex].tolist())
    test_files.append(np.array(all_files)[Tsindex].tolist())

# write the partition to CSV so you can check that it is the same every time
df = pd.DataFrame(data=train_files, index=['0', '1', '2', '3', '4'])
df.to_csv('./data/Kfold/train_patch.csv')
df1 = pd.DataFrame(data=test_files, index=['0', '1', '2', '3', '4'])
df1.to_csv('./data/Kfold/test_patch.csv')
We save the dataset partition to CSV so that if the code changes later, the original partition is not lost.
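If you later need to restore the split from disk, the CSV written above can be read back; a rough sketch (each cell holds the string form of one {'image': ..., 'label': ...} dict, so ast.literal_eval converts it back to a dict):
import ast
import pandas as pd

df = pd.read_csv('./data/Kfold/train_patch.csv', index_col=0)
# row k is fold k; drop the NaN padding caused by folds of unequal length
fold0_train_files = [ast.literal_eval(cell) for cell in df.iloc[0].dropna()]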
Once the data set is partitioned, it’s ready for training and testing. Just take one fold of data at a time.
# train each fold separately: take one fold's train/test split at a time
train(train_files[0], test_files[0])
test(test_files[0])
In the train and test methods we still have to write the corresponding dataloader, because so far we have only partitioned the file names, not loaded the data itself.
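As a minimal sketch of what such a dataset could look like, assuming nibabel is used to read the nii.gz volumes (NiiDataset and the channel layout are illustrative, not the author's exact code):
import nibabel as nib
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NiiDataset(Dataset):
    # wraps the list of {'image': path, 'label': path} dicts produced by the split
    def __init__(self, files):
        self.files = files

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        item = self.files[idx]
        image = nib.load(item['image']).get_fdata().astype(np.float32)
        label = nib.load(item['label']).get_fdata().astype(np.float32)
        # add a channel dimension so each sample is (1, D, H, W)
        return torch.from_numpy(image)[None], torch.from_numpy(label)[None]

# inside train()/test(), something like:
# loader = DataLoader(NiiDataset(train_files[0]), batch_size=1, shuffle=True)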
The usual way to do this is to loop five times, so that a single run of the code produces the results of all five folds (a short sketch follows). The nice thing about the way it is written here is that if you only want to train a particular fold, you just change the index instead of training everything at once. As long as you don't touch the code, the partition of the dataset will not change even if you train again a year from now. And even if it does change, we have already saved the partition to CSV.
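For completeness, the loop-over-all-folds version is simply:
# run all five folds in a single execution of the script
for fold in range(5):
    train(train_files[fold], test_files[fold])
    test(test_files[fold])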
Of course, this is just one way of writing it; if there is a better scheme, feel free to discuss ~~
2. How to save the best model when there is no validation set
This is a question I had always wondered about, because without cross-validation I save the best model based on the metric on the validation set. For example, the following code is run against a validation set.
if metric > best_metric:  # `metric` is computed on the validation set each epoch
    best_metric = metric
    best_metric_epoch = epoch + 1
    save_dir = 'checkpoints/checkpoint_04264/'
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    save_path = save_dir + str(epoch + 1) + "best_metric_model.pth"
    torch.save(model.state_dict(), save_path)
    print('saved new best metric model')
However, now that there is no validation set, do I save the model based on the metric on the training set, or based on the metric on the test set? There is no unified answer to this question; both practices exist. And because there is no unified answer, we can choose the one that is most favorable to us 😜. For example, when writing a paper, saving the model based on the results on the test set will give you better numbers.
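A sketch of what this looks like in the cross-validation setting, with metric now computed on the held-out test fold each epoch (train_one_epoch and evaluate are hypothetical placeholders for your own training and evaluation code):
import torch

max_epochs = 100
best_metric = -1.0
for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)   # hypothetical: one pass over the training fold
    metric = evaluate(model, test_loader)  # hypothetical: e.g. mean Dice on the test fold
    if metric > best_metric:               # same saving logic as above, different metric source
        best_metric = metric
        torch.save(model.state_dict(), 'fold0_best_metric_model.pth')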
Also, as a tip, cross-validation usually gives better results on the test set than dividing the data into training, validation and test sets at 6:2:2. Think about why 😉