Original link: tecdat.cn/?p=19518

Introduction

What are the possible reasons for the wide variation in model performance? In other words, why do our models lose stability when others evaluate them?

In this article, we will explore possible causes. We’ll also examine the concept of cross-validation and some common ways to perform it.

 

Contents

  1. Why does the model lose stability?

  2. What is cross-validation?

  3. Several common methods of cross-validation

    • Validation set method
    • Leave-one-out cross-validation (LOOCV)
    • K-fold cross-validation
    • Stratified k-fold cross-validation
    • Adversarial validation
    • Time series cross-validation
    • Custom cross-validation techniques
  4. How to measure the bias and variance of the model?

 

Why does the model lose stability?

Let’s use the following snapshot of different model fits to understand this:

 

Here we try to find the relationship between quantity and price. To this end, we have taken the following steps:

  1. We established the relationship using a linear equation and plotted it. Judging by the training data points, the first plot has a high error: the model fails to capture the underlying trend of the data. This is an example of “underfitting”
  2. In the second plot, we have found the right relationship between price and quantity, with a lower training error
  3. In the third plot, we find a relationship whose training error is almost zero. This happens because the model is overly sensitive: by accounting for every deviation (including noise) in the data points, it captures random patterns that exist only in the current data set. This is an example of “overfitting”. A minimal sketch contrasting the three fits follows this list
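To make this concrete, here is a minimal sketch (my own illustration, not from the original article) that fits polynomials of increasing degree to hypothetical price/quantity data; the training error keeps shrinking as the model becomes more flexible, which is exactly the overfitting trap described in the third plot. All data and parameter values below are assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
price = np.sort(rng.uniform(1, 10, 20)).reshape(-1, 1)      # hypothetical prices
quantity = 100 / price.ravel() + rng.normal(0, 1, 20)       # hypothetical noisy demand

for degree in (1, 2, 10):                                   # under-fit, reasonable fit, over-fit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(price, quantity)
    err = mean_squared_error(quantity, model.predict(price))
    print(f"degree={degree:2d}  training MSE={err:.3f}")    # training error keeps shrinking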

A common practice in data science competitions is to iterate over various models to find the one that performs best. To decide which model is really better, we use validation techniques.

 

What is cross-validation?

Given a set of modeling samples, we take out most of them to build the model and set a small portion aside. We then use the freshly built model to predict this held-out portion, calculate the prediction errors on these samples, and record their sum of squares.

Here are the steps involved in cross validation:

  1. Reserve a sample (validation) data set
  2. Train the model with the rest of the data set
  3. Use the reserved sample to test (validate) the model. This helps you gauge how well the model actually performs (a minimal sketch of these steps follows this list)
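Put into code, these three steps might look like the following minimal sketch (my own illustration; it assumes a pandas DataFrame named data with a numeric column named target):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 1. reserve a validation sample and 2. train on the rest ("target" column name is assumed)
train, validation = train_test_split(data, test_size=0.3, random_state=5)
model = LinearRegression().fit(train.drop(columns=["target"]), train["target"])

# 3. evaluate the model on the reserved sample
preds = model.predict(validation.drop(columns=["target"]))
print("validation MSE:", mean_squared_error(validation["target"], preds))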

 

Several common methods of cross-validation

There are several ways to perform cross-validation. I’ve discussed some of them in this section.

 

1. Validation set method

In this approach, we reserve 50% of the dataset for validation and the remaining 50% for model training. However, the main disadvantage of this approach is that since we are only training the model on 50% of the data set, we are likely to miss some information about the data, leading to higher bias.

Python code:

from sklearn.model_selection import train_test_split

train, validation = train_test_split(data, test_size=0.50, random_state=5)

R code:

set.seed(101)
index <- sample.int(n = nrow(data), size = floor(0.50 * nrow(data)), replace = FALSE)
train <- data[index, ]
validation <- data[-index, ]

 

2. Leave-one-out cross-validation (LOOCV)

In this approach, we hold out only one data point from the available data set and train the model on the rest. The process is repeated for every data point. This has its advantages and disadvantages. Let’s take a look at them:

  • We use all the data points, so the bias is low
  • We repeat the cross-validation process n times (where n is the number of data points), which results in a longer execution time
  • Since we test on a single data point at a time, this approach produces a high-variance estimate of model performance. The estimate depends heavily on the particular data point: if that point is an outlier, the variation can be even larger

 

R code:

score <- list()
for(i in 1:nrow(x)){
  training <- x[-i, ]                    # all rows except the i-th
  validation <- x[i, , drop = FALSE]     # the single held-out row
  model <- lm(as.formula(paste(label, "~ .")), data = training)  # example model; the original leaves the model unspecified
  pred <- predict(model, validation)
  score[[i]] <- (pred - validation[[label]])^2   # squared prediction error
}
unlist(score)  # return a vector of errors

LOOCV leaves out a single data point. Similarly, you can leave out p training examples so that the validation set has size p in each iteration. This is called LPOCV (leave-p-out cross-validation). A scikit-learn sketch of both is shown below.
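For reference, here is a minimal Python sketch of LOOCV and LPOCV using scikit-learn’s built-in splitters; the toy data and the choice of a linear model are my own assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, LeavePOut, cross_val_score

X = np.arange(20).reshape(-1, 1)                                  # toy feature
y = 3 * X.ravel() + np.random.RandomState(0).normal(0, 1, 20)     # toy target

# LOOCV: one observation is held out per iteration
loo_mse = -cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error").mean()
print("LOOCV MSE:", loo_mse)

# LPOCV: p observations are held out per iteration (here p = 2)
lpo_mse = -cross_val_score(LinearRegression(), X, y, cv=LeavePOut(p=2),
                           scoring="neg_mean_squared_error").mean()
print("LPOCV MSE:", lpo_mse)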

 

3. K-fold cross-validation

Through the above two verification methods, we learned:

  1. We should train the model on a large portion of the data set. Otherwise, we will fail to identify the underlying trends in the data, which eventually leads to higher bias
  2. We also need a reasonable proportion of test data points. As mentioned above, testing on too few data points gives an unreliable estimate of model performance
  3. We should repeat the training and testing process several times, changing how the data is split into training and test sets. This helps validate the model properly

Is there a way we can meet all three requirements?

This method is called “k-fold cross validation”. Here are its steps:

  1. Randomly split the entire dataset into k “folds”
  2. For each fold, build the model on the remaining k-1 folds of the data. Then test the model on the held-out k-th fold to check its effectiveness
  3. Record the error seen on each prediction
  4. Repeat this process until each of the k folds has served as the test set
  5. The average of the k recorded errors is called the cross-validation error and serves as the model’s performance metric (see the Python sketch after the code examples below)

Below is a visualization of k-fold cross-validation with k = 10.

 

Now, one of the most common questions is: “How do I choose the right k value?” .

A lower value of k yields a more biased estimate, while a higher value of k yields a less biased but potentially more variable one.

To be precise, LOOCV is equivalent to n-fold cross-validation, where n is the number of training observations.

 

Python code:

from sklearn.model_selection import RepeatedKFold

kf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=None)

R code:

library(caret)
train_control <- trainControl(method = "cv", number = 10)   # define 10-fold cross-validation
model <- train(Species ~ ., data = iris, trControl = train_control, method = "nb")
print(model)  # summarize the results
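To tie this back to step 5 above (averaging the per-fold errors into a single score), here is a minimal Python sketch using scikit-learn’s cross_val_score; the iris data and logistic regression estimator are illustrative assumptions:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=1))
print("per-fold accuracy:", np.round(scores, 3))
print("cross-validation estimate:", scores.mean())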

 

4. Stratified k-fold cross-validation

Stratification is the process of rearranging the data to ensure that each fold is a good representative of the whole. For example, in a binary classification problem where each class makes up 50% of the data, it is best to arrange the data so that each class accounts for about half of the instances in every fold.


This is usually a better approach when dealing with both bias and variance.

Python code snippet for stratified k-fold cross-validation:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10)
for train_index, val_index in skf.split(X, y):
    print("Train:", train_index, "Validation:", val_index)

R code:

library(caret)
folds <- createFolds(factor(data$target), k = 10, list = FALSE)
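To see the stratification at work, here is a small check (my own illustration, using the iris data as an assumption) that prints the class counts inside each validation fold; they stay balanced across folds:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
for _, val_index in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    print(np.bincount(y[val_index]))   # roughly equal counts of every class in each fold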

Having said that, if the training set is not an adequate representation of the entire data, stratified k-fold may not be the best approach. In that case, simple repeated k-fold cross-validation should be used.

In repeated cross-validation, the cross-validation procedure is repeated n times, yielding n random partitions of the original sample. The n results are then averaged (or otherwise combined) to produce a single estimate.

Python code for repeated k-fold cross-validation:

# X is the feature set and y is the target; kf is the RepeatedKFold object defined above
for train_index, val_index in kf.split(X):
    print("Train:", train_index, "Validation:", val_index)
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

 

5. Adversarial validation

When working with real data sets, you often encounter test sets that differ markedly from the training set. As a result, internal cross-validation scores can be far from the actual test scores. In such cases, adversarial validation offers a solution.

The general idea is to check how similar the training and test sets are in terms of feature distribution. If a classifier can tell them apart, we can suspect they are quite different. You can quantify this by combining the training and test sets, assigning 0/1 labels (0 for training rows, 1 for test rows), and evaluating the resulting binary classification task.

Let’s take a look at how to do this:

  1. Remove the dependent variable from the training set
train.drop(['target'], axis = 1, inplace = True)
  2. Create a new dependent variable that is 1 for each row in the training set and 0 for each row in the test set
train['is_train'] = 1
test['is_train'] = 0
  3. Combine the training and test data sets
df = pd.concat([train, test], axis = 0)   # `df` is the combined frame
  4. Using the newly created dependent variable, fit a classification model and predict the probability of each row belonging to the test set
model = xgb.XGBClassifier(**xgb_params, seed = 10)   # XGBoost classifier
  5. Sort the training set by the probabilities calculated in step 4 and select the top n% of rows as the validation set (n% is the fraction of the training set to be retained for validation). A consolidated sketch of steps 4 and 5 follows this list
train = train.sort_values(by = 'probs', ascending = False)   # e.g. keep the top 30% as the validation set
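Putting steps 4 and 5 together, a minimal sketch might look like the following; it assumes xgb_params is a dictionary of XGBoost hyperparameters, that train and test share the same feature columns, and that 30% of the training rows are kept for validation (none of these specifics come from the original article):

import pandas as pd
import xgboost as xgb

features = [c for c in df.columns if c != 'is_train']

# step 4: fit the train-vs-test classifier and score the training rows
model = xgb.XGBClassifier(**xgb_params, seed = 10)
model.fit(df[features], df['is_train'])
train['probs'] = model.predict_proba(train[features])[:, 1]   # probability of "looking like test"

# step 5: keep the 30% of training rows most similar to the test set as the validation set
train = train.sort_values(by = 'probs', ascending = False)
val_set_ids = train.head(int(0.30 * len(train))).index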

val_set_ids now holds the IDs of the training-set rows that most resemble the test set; these rows make up the validation set. This makes the validation strategy more reliable when the training and test sets are very different.

However, you must be careful when using this type of validation technique. If the distribution of the test set changes, the validation set may no longer be a good subset for evaluating the model.

 

6. Time series cross-validation

Randomly splitting a time series data set does not work, because it scrambles the temporal order of the data. For time series forecasting problems, we perform cross-validation as follows.

  1. Folds for time series cross-validation are created in a forward-chaining fashion
  2. Suppose we have a time series of annual consumer demand for a product over n years. The folds are created as follows:
fold 1: training [1], test [2]
fold 2: training [1 2], test [3]
fold 3: training [1 2 3], test [4]
fold 4: training [1 2 3 4], test [5]
fold 5: training [1 2 3 4 5], test [6]
...

We progressively select new training and test sets. We start with a training set containing the minimum number of observations needed to fit the model, and then extend the training set and shift the test set forward with each fold. In most cases the very first predictions may not matter much; in that case, the forecast origin can be shifted forward so that a multi-step-ahead error is used. For example, in a regression problem the following code can be used to perform cross-validation.

Python code:


from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=3)            # assuming X and y are ordered in time
for train_index, val_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", val_index)
    X_train, X_test = X[train_index], X[val_index]
    y_train, y_test = y[train_index], y[val_index]

TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]

R code:

library(forecast)
farima <- function(x, h) forecast(Arima(x, order = c(2, 0, 0)), h = h)
e <- tsCV(ts, farima, h = 1)      # ARIMA cross-validation errors
sqrt(mean(e^2, na.rm = TRUE))     # RMSE

h = 1 means that we compute the error only for one-step-ahead forecasts.

The figure below illustrates the four-step-ahead (h = 4) forecast errors. You can use this approach if you want to evaluate the model for multi-step forecasts.

 

7. Custom cross-validation techniques

There is no single method that solves every problem most effectively, so you can create a custom cross-validation technique based on a function or a combination of functions, as in the sketch below.
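As one illustration (my own, not from the original article), here is a minimal custom splitter that scikit-learn’s cross_val_score can consume; it validates on contiguous blocks of rows instead of random folds, and the class name, data set, and estimator are all hypothetical choices:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

class BlockedCV:
    """Hypothetical custom splitter: each contiguous block of rows serves once as the validation fold."""
    def __init__(self, n_splits=5):
        self.n_splits = n_splits

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        indices = np.arange(len(X))
        for block in np.array_split(indices, self.n_splits):
            yield np.setdiff1d(indices, block), block   # (train indices, validation indices)

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=BlockedCV(n_splits=5))
print(scores.mean())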

 

How to measure the bias and variance of the model?

After k-fold cross-validation, we obtain k different estimates of the model error (e1, e2, ..., ek). Ideally, these errors should sum to zero. To get the bias of the model, we take the average of all the errors; the lower the average error, the better the model.

Similarly, to calculate the variance of the model, we take the standard deviation of all the errors. A low standard deviation indicates that the model changes little across different subsets of the training data. A small sketch follows below.
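As a minimal sketch (the error values below are hypothetical placeholders for the k per-fold errors collected during cross-validation):

import numpy as np

errors = np.array([0.12, 0.10, 0.15, 0.11, 0.13])      # hypothetical per-fold errors
print("bias estimate (mean error):      ", errors.mean())
print("variance estimate (std of error):", errors.std())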

We should focus on striking a balance between bias and variance. This can be achieved by reducing the variance while keeping the bias under control, which yields better predictive models. The trade-off usually also results in less complex models.

 

Endnotes

In this article, we discussed overfitting and how methods such as cross-validation help avoid it. We also looked at different cross-validation methods, including the validation set method, LOOCV, k-fold cross-validation, and stratified k-fold, and presented Python implementations of each as well as R implementations run on the iris data set.

 

