Preface

At the end of the previous post's introduction to the KNN algorithm, we pointed out that it is unreasonable to use the same data that trained a model to estimate that model's performance. Building on the content and data of that post, this one introduces cross-validation for evaluating model performance and shows how to select the parameter k to optimize the model.

1. Cross-validation

Typically, we divide the available data into two parts: a training set and a test set. The training set is used to train the model, and the test set is used to evaluate the model's performance. This process is called cross-validation. There are three common cross-validation methods:

  • Hold-out cross-validation.
  • k-fold cross-validation.
  • leave-one-out cross-validation.

Next, we demonstrate each of these three cross-validation methods using the task and learner created in the previous post.

diabetesTask <- makeClassifTask(data = diabetesTib, target = "class")
knn <- makeLearner("classif.knn", par.vals = list("k" = 2))

1.1 Hold-out cross-validation

Hold-out cross-validation is the easiest to understand: we simply “reserve” a random percentage of the data as a test set, train the model on the rest of the data, and then use the test set to evaluate model performance.

When taking this approach, you need to decide what percentage of the data will be used as the test set. If the test set is too small, the performance estimate will have high variance; if the training set is too small, the performance estimate will have high bias. Typically, two-thirds of the data is used for the training set and one-third for the test set, but this also depends on the number of instances in the data.

1.1.1 Make a holdout resampling description

To use cross-validation in the mlr package, the first step is to make a resampling description, which is simply a set of instructions for splitting the data into training and test sets.

The first argument to the makeResampleDesc() function is the cross-validation method to use, in this case holdout. The second argument, split, sets what proportion of the data will be used as the training set. Setting stratify = TRUE ensures that the proportions of the diabetes classes are kept as similar as possible in the training and test sets when the data is split.

holdout <- makeResampleDesc(method = "Holdout", split = 2/3,
                            stratify = TRUE)  # resampling description

1.1.2 Run hold-out cross-validation

holdoutCV <- resample(learner = knn, task = diabetesTask, resampling = holdout,
                      measures = list(mmce, acc))  # run hold-out cross-validation

We feed the task, the learner, and the resampling description we just defined to the resample() function, and ask it to calculate two performance measures: the mean misclassification error (mmce) and accuracy (acc).

The result can be obtained directly by running the above code, or by using holdoutCV$aggr, as shown below:

holdoutCV$aggr
# mmce.test.mean  acc.test.mean
#      0.1632653      0.8367347

The accuracy estimated by hold-out cross-validation is lower than the accuracy we obtained when evaluating on the data used to train the full model. This supports the earlier point that models perform better on the data they were trained on than on unseen data.

1.1.3 Calculation of confusion matrix

To better understand which instances are correctly classified and which are incorrectly classified, we can construct a confusion matrix. The confusion matrix is a tabular representation of the true and predicted classes for each instance in the test set.

In the mlr package, the confusion matrix is computed with the calculateConfusionMatrix() function. Its first argument is the pred component of holdoutCV, which contains the true and predicted classes of the test set; the optional relative argument asks the function to also show each value as a proportion of the true (row) and predicted (column) class labels.

calculateConfusionMatrix(holdoutCV$pred, relative = TRUE)
#Relative confusion matrix (normalized by row/column):
#          predicted
#true       Chemical  Normal    Overt     -err.-
#  Chemical 0.83/0.62 0.08/0.04 0.08/0.12 0.17
#  Normal   0.08/0.12 0.92/0.96 0.00/0.00 0.08
#  Overt    0.36/0.25 0.00/0.00 0.64/0.88 0.36
#  -err.-        0.38      0.04      0.12 0.16
#
#Absolute confusion matrix:
#          predicted
#true       Chemical Normal Overt -err.-
#  Chemical       10      1     1      2
#  Normal          2     24     0      2
#  Overt           4      0     7      4
#  -err.-          6      1     1      8

The absolute confusion matrix is easier to interpret. The rows show the true class labels and the columns show the predicted class labels; each number is the count of cases for that combination of true and predicted class. For example, in this matrix, 24 patients were correctly classified as non-diabetic, but 2 non-diabetic patients were incorrectly classified as chemically diabetic. The correctly classified patients lie on the diagonal of the matrix.

The relative confusion matrix shows proportions rather than counts. In each cell, the number before the / is the proportion of the true class (the row) that received that predicted label, and the number after the / is the proportion of that predicted label (the column) that came from that true class. For example, in this matrix, 92% of non-diabetic patients were correctly classified (24 of the 26 cases in the Normal row), while 8% were misclassified as chemically diabetic.
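As a quick sanity check, the row-wise proportions for the Normal class can be recomputed from the absolute counts shown above:

# Row-wise proportions for the Normal class, from the absolute counts above
c(Chemical = 2, Normal = 24, Overt = 0) / (2 + 24 + 0)
# approximately 0.08, 0.92, and 0.00, matching the Normal row of the relative matrix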

The confusion matrix helps us understand which classes our model classifies well and which it classifies poorly. For example, based on this cross-validation, our model seems to have difficulty distinguishing between non-diabetic and chemically diabetic patients.

The only real benefit of this cross-validation approach is that it is less computationally intensive than other forms of cross-validation. This makes it the only feasible cross-validation method for computationally intensive algorithms.

1.2 k-fold cross-validation

In k-fold cross-validation, the data is randomly divided into roughly equal-sized chunks called folds. One fold is kept as the test set and the remaining data is used as the training set; the model is tested on that fold and the performance metrics are recorded. The same procedure is then repeated with a different fold as the test set, until every fold has served as the test set. Finally, the average of all the performance metrics is taken as the estimate of model performance. The process is shown in Fig 2.

In general, repeated k-fold cross-validation is preferred over ordinary k-fold cross-validation. The choice of the number of folds depends on the size of the data, but 10 is reasonable for many data sets: the data is divided into 10 folds of similar size and cross-validation is performed. If this process is then repeated five times, giving 10-fold cross-validation repeated five times (which is not the same as 50-fold cross-validation), the model performance estimate is the average of the 50 results.

1.2.1 Perform k-fold cross-validation

kFold <- makeResampleDesc(method = "RepCV", folds = 10, reps = 50,
                          stratify = TRUE)  # resampling description
kFoldCV <- resample(learner = knn, task = diabetesTask, resampling = kFold,
                    measures = list(mmce, acc))  # run cross-validation

Here, 10-fold cross-validation is repeated 50 times (reps = 50), so the performance estimate is averaged over 500 results.
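If you want to confirm the number of iterations, the resampling result stores one row of performance measures per iteration, so a quick check is:

nrow(kFoldCV$measures.test)  # one row per iteration: 10 folds x 50 reps = 500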

Extract average performance metrics:

kFoldCV$aggr
# mmce.test.mean  acc.test.mean
#      0.1030395      0.8969605

Thus, the model correctly classified 89.7% of instances on average, lower than the accuracy obtained on the data used to train the model.

1.2.2 How to choose the number of repetitions

A reasonable approach is to choose a computationally affordable number of repetitions, run the process a few times, and check whether the average performance estimates vary much between runs; if they do, increase the number of repetitions. In general, the more repetitions, the more accurate and stable the estimates become, although beyond a certain point more repetitions no longer improve their accuracy or stability.
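As a minimal sketch of this idea (the loop and the object name repEstimates are our own, not from the original post), you can rerun the same resampling description a few times and compare the aggregated accuracies:

repEstimates <- sapply(1:3, function(i) {
  resample(learner = knn, task = diabetesTask, resampling = kFold,
           measures = acc, show.info = FALSE)$aggr
})
repEstimates  # if these values differ noticeably, increase reps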

1.2.3 Calculation of confusion matrix

Same as calculating the confusion matrix in hold-out cross-validation:

calculateConfusionMatrix(kFoldCV$pred, relative = TRUE)

1.3 Leave-one-out cross-validation

Leave-one-out cross-validation can be thought of as the extreme case of k-fold cross-validation: instead of splitting the data into folds, a single observation is held out as the test set and the model is trained on all the remaining data. The model is tested on that observation and the performance metrics are recorded; the procedure is repeated with a different observation held out each time, until every observation has served as the test set. Finally, the average of all the performance metrics is taken as the estimate of model performance. The process is shown in Fig 3.

For small data sets, splitting the data into k folds leaves a very small training set, and models trained on small data sets tend to have higher variance because they are more affected by sampling error and unusual cases. Leave-one-out cross-validation is therefore useful for small data sets, and it is also computationally cheaper than repeated k-fold cross-validation.

1.3.1 Run leave-one-out cross-validation

The resampling description for this method is simple: just specify method = "LOO". Since the test set contains only a single instance, stratify = TRUE is not needed, and because every instance takes a turn as the test set while all other data is used for training, the procedure does not need to be repeated.

LOO <- makeResampleDesc(method = "LOO")  # resampling description

Run cross validation and get average performance metrics:

LOOCV <- resample(learner = knn, task = diabetesTask, resampling = LOO,
                  measures = list(mmce, acc))  # run cross-validation
LOOCV$aggr
# mmce.test.mean  acc.test.mean
#     0.08965517     0.91034483

1.3.2 Calculation of confusion matrix

calculateConfusionMatrix(LOOCV$pred, relative = TRUE)

We now know how to apply three common cross-validation methods. If we have cross-validated our model and it performs well enough on unseen data, we can train the model on all available data and use it to make future predictions.
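As a minimal sketch of that final step (the object name finalKnnModel is our own; the tuned version of this appears later in this post):

finalKnnModel <- train(knn, diabetesTask)  # train on all available data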

2. How to select the parameter k to optimize the KNN model

In the KNN algorithm, k is a hyperparameter, that is, a variable or option that controls how the model makes predictions but cannot be estimated from the data itself. There are usually three ways to select hyperparameters:

  • Choose a "sensible" or default value that has worked on similar problems before.
  • Manually try several different values and see which one gives the best performance (a minimal sketch of this follows the list).
  • Use an automated selection procedure called hyperparameter tuning.
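For illustration, here is a minimal, hedged sketch of the second option, reusing the knn learner setup and diabetesTask from earlier (the candidate values and object names are our own):

for (k in c(3, 5, 7)) {
  manualKnn <- makeLearner("classif.knn", par.vals = list("k" = k))
  manualCV <- resample(manualKnn, diabetesTask,
                       resampling = makeResampleDesc("CV", iters = 10),
                       measures = acc, show.info = FALSE)
  print(c(k = k, acc = unname(manualCV$aggr)))  # accuracy for this candidate k
}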

Of these, the third method is the best, and it is the one we focus on below:

  • Step 1. Define the hyperparameter and its range (the hyperparameter space).
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:10))

The makeParamSet() function specifies that the parameter to be tuned is k, with values ranging from 1 to 10. The makeDiscreteParam() function is used to define discrete hyperparameters. If you want to tune multiple hyperparameters, simply separate their definitions with commas inside makeParamSet().
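For example, a hypothetical parameter set with two hyperparameters might look like the sketch below (the "distance" parameter is purely illustrative and is not a classif.knn hyperparameter):

multiParamSpace <- makeParamSet(
  makeDiscreteParam("k", values = 1:10),              # discrete hyperparameter
  makeNumericParam("distance", lower = 1, upper = 3)  # hypothetical numeric hyperparameter
)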

  • Step 2. Search the hyperparameter space.

There are many ways to search the space; here we use grid search, probably the simplest method, which just tries every value in the hyperparameter space and looks for the best-performing one. Random search is usually preferred for continuous hyperparameters or when there are many hyperparameters (a hedged sketch of this alternative follows the grid-search code below).

gridSearch <- makeTuneControlGrid()
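For comparison, a hedged sketch of the random-search alternative mentioned above; the cap of 20 iterations is our own choice, not from the original post:

randomSearch <- makeTuneControlRandom(maxit = 20L)  # try 20 random combinations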
  • Step 3. Cross-validate the tuning process.
cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 20)

Repeated k-fold cross-validation is used here (10 folds repeated 20 times, so 200 iterations per candidate value). For each value of k, the performance measures are averaged over all these iterations and compared with the averages for all the other values of k.

  • Step 4. Call the tuneParams() function to perform the tuning.
tunedK <- tuneParams("classif.knn", task = diabetesTask,
 resampling = cvForTuning,
 par.set = knnParamSpace, control = gridSearch)

Here, the first argument is the name of the algorithm, the second is the task defined earlier, the third is the cross-validation strategy for tuning, the fourth is the hyperparameter space we defined, and the last is the search method.

Printing tunedK shows the optimal k value:

tunedK
#Tune result:
#Op. pars: k=7
#mmce.test.mean=0.0750476

The best-performing value of k can also be obtained directly from the $x component:

tunedK$x

#$k
#[1] 7


In addition, you can visualize the tuning process:

knnTuningData <- generateHyperParsEffectData(tunedK)
plotHyperParsEffect(knnTuningData, x = "k", y = "mmce.test.mean",
 plot.type = "line") +
 theme_bw()

Finally, we can use the tuned k value to train our final model:

tunedKnn <- setHyperPars(makeLearner("classif.knn"),
 par.vals = tunedK$x)
tunedKnnModel <- train(tunedKnn, diabetesTask)

The setHyperPars() function creates a new learner, similar to makeLearner(); train() is then used to train the final model.

3. Nested cross-validation

3.1 Nested cross-validation

When we perform some kind of preprocessing on the data or the model, such as tuning hyperparameters, it is important to include this preprocessing inside the cross-validation so that the entire model-training process is cross-validated.

This takes the form of nested cross-validation, in which an inner loop cross-validates the different hyperparameter values (as we did above) and passes the winning value to an outer cross-validation loop. In the outer loop, each fold uses the optimal hyperparameter chosen by its own inner loop.

In Figure 5, the outer loop is 3-fold cross-validation, and for each outer fold only its training set is used for an inner 4-fold cross-validation. Each inner loop tries the different k values, and the optimal k is passed back to the outer loop, which trains the model with it and evaluates performance on the outer test set.

The nested cross-validation procedure is easy to implement with functions in the mlr package.

  • Step 1. Define the outer and inner cross-validation.
inner <- makeResampleDesc("CV")
outer <- makeResampleDesc("RepCV", folds = 10, reps = 5)

Ordinary k-fold cross-validation is used for the inner loop (10 folds is the default), and 10-fold cross-validation repeated 5 times is used for the outer loop.

  • Step 2. Define the wrapper.

A wrapper is basically a learner bound to some preprocessing step, in this case hyperparameter tuning, so we use the makeTuneWrapper() function:

knnWrapper <- makeTuneWrapper("classif.knn", resampling = inner,
 par.set = knnParamSpace,
 control = gridSearch)

The first argument of makeTuneWrapper() is the algorithm; the second, resampling, is the inner cross-validation procedure; the third, par.set, is the hyperparameter search space; and the fourth, control, is the gridSearch method.

  • Step 3. Run the nested cross-validation procedure.
cvWithTuning <- resample(knnWrapper, diabetesTask, resampling = outer)

The first argument is the wrapper we just created, the second is the task, and the third sets the resampling to the outer cross-validation.

The cvWithTuning result:

cvWithTuning
#Resample Result
#Task: diabetesTib
#Learner: classif.knn.tuned
#Aggr perf: mmce.test.mean=0.0857143
#Runtime: 57.1177

For unseen data, the model is estimated to correctly classify about 91.4% of cases (1 - 0.0857143 ≈ 0.914).

3.2 Use the model for prediction

Suppose some new patients come into the clinic:

newDiabetesPatients <- tibble(glucose = c(82, 108, 300),
 insulin = c(361, 288, 1052),
 sspg = c(200, 186, 135))

We pass these patients to the model to obtain their predicted diabetes status:

newPatientsPred <- predict(tunedKnnModel, newdata = newDiabetesPatients)
getPredictionResponse(newPatientsPred)
#[1] Normal Normal Overt 
#Levels: Chemical Normal Overt

Editor's note

Besides the mlr functions introduced above for the KNN algorithm, the knn() and kknn() functions in R can also perform k-nearest-neighbor classification and weighted k-nearest-neighbor classification; readers can consult the help pages in R. Next time, we will introduce a probabilistic classification algorithm in machine learning: logistic regression.
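As a minimal, hedged sketch of the knn() alternative from the class package (the 2/3 split, the seed, and the object names are our own; we reuse the glucose, insulin, and sspg predictors and the class target from the data above):

library(class)
predictors <- as.data.frame(diabetesTib[, c("glucose", "insulin", "sspg")])
set.seed(123)
trainIdx <- sample(nrow(diabetesTib), round(2/3 * nrow(diabetesTib)))
basePred <- knn(train = predictors[trainIdx, ],
                test  = predictors[-trainIdx, ],
                cl    = diabetesTib$class[trainIdx], k = 7)
mean(basePred == diabetesTib$class[-trainIdx])  # hold-out accuracy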