Hill climbing the test set is a way to achieve good or even perfect predictions in a machine learning competition without touching the training set or developing a predictive model at all. As an approach to machine learning competitions it is, of course, against the spirit of the contest, and most competition platforms impose restrictions to prevent it. Yet machine learning practitioners accidentally do something very similar when they overuse the test set during a competition. By developing an explicit implementation of hill climbing the test set, we can better understand how easy it is to overfit a test dataset by using it too often to evaluate a modeling pipeline.

In this tutorial, you will discover how to hill climb the test set for a machine learning problem. After completing this tutorial, you will know:

  • How to make perfect predictions for a test set by hill climbing it, without ever looking at the training dataset.
  • How to hill climb a test set for classification and regression tasks.
  • How we effectively hill climb the test set whenever we overuse it to evaluate our modeling pipelines.

Tutorial overview

This tutorial is divided into five parts. They are:

  • Hill climbing the test set
  • Hill climbing algorithm
  • How to hill climb
  • Hill climbing the diabetes classification dataset
  • Hill climbing the housing regression dataset

Hill climbing the test set

Machine learning competitions, like those on Kaggle, provide a complete training dataset as well as the input portion of a test set. The objective of a given competition is to predict target values, such as labels or numerical values, for the test set. Solutions are scored against the hidden test set target values, and the submission with the best score on the test set wins.

This challenge can be framed as an optimization problem. Traditionally, the competition participant acts as the optimization algorithm, exploring different modeling pipelines that lead to different sets of predictions, scoring those predictions, and then modifying the pipelines in the hope of a better score. The process can also be modeled directly with an optimization algorithm, where candidate predictions are generated and evaluated without ever looking at the training set. Often this is referred to as hill climbing the test set, and one of the simplest optimization algorithms for the problem is the hill climbing algorithm.

Although hill climbing the test set is not something you should do in a real machine learning competition, implementing the approach is an interesting exercise for understanding its limitations and the danger of overfitting the test set. Moreover, the fact that a test set can be predicted perfectly without ever accessing the training dataset often surprises novice machine learning practitioners.

Most importantly, we are effectively hill climbing the test set whenever we repeatedly evaluate different modeling pipelines against it. The risk is that the test set score improves at the cost of increased generalization error, that is, worse performance on the broader problem. Organizers of machine learning competitions are well aware of this problem and counter it by restricting predictive evaluation, for example by limiting submissions to one or a few per day and by reporting scores on a hidden subset of the test set rather than the whole test set, as sketched below. For more information, see the papers listed in the Further Reading section.

Next, let's look at how to implement the hill climbing algorithm to optimize predictions for a test set.
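For example, the common mitigation of reporting scores on only a hidden subset of the test set can be sketched in a few lines. The sketch below is purely illustrative and makes its own assumptions (the leaderboard_scores() name and the 30 percent public fraction are mine, not any platform's actual API): only the public score is visible while the competition runs, and the private score on the remaining rows decides the final ranking.

# sketch of leaderboard scoring on a hidden subset of the test set
# (illustrative assumptions only, not a real competition platform API)
from random import Random
from sklearn.metrics import accuracy_score

def leaderboard_scores(y_true, y_pred, public_fraction=0.3, seed=1):
 # choose a fixed, hidden subset of test rows for the public leaderboard
 n = len(y_true)
 public_ix = set(Random(seed).sample(range(n), int(n * public_fraction)))
 # score the visible (public) and hidden (private) portions separately
 public = accuracy_score([y_true[i] for i in range(n) if i in public_ix],
                         [y_pred[i] for i in range(n) if i in public_ix])
 private = accuracy_score([y_true[i] for i in range(n) if i not in public_ix],
                          [y_pred[i] for i in range(n) if i not in public_ix])
 # only the public score is reported during the competition;
 # the private score decides the final ranking
 return public, private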

Hill climbing algorithm

The hill climbing algorithm is a very simple optimization algorithm. It involves generating a candidate solution and evaluating it; this becomes the starting point, which is then improved incrementally until no further improvement can be found or we run out of time, resources, or interest.

New candidate solutions are generated from the existing candidate solution. Typically, this involves making a single change to the candidate, evaluating it, and accepting it as the new "current" solution if it is as good as or better than the previous current solution; otherwise, it is discarded.

It might seem sensible to accept only candidates with strictly better scores. This is a reasonable approach for many simple problems, but on more complex problems it is desirable to also accept candidates with the same score so that the search can traverse flat regions (plateaus) of the search space.

When hill climbing a test set, the candidate solution is a list of predictions. For a binary classification task, this is a list of 0 and 1 values for the two classes. For a regression task, this is a list of numbers in the range of the target variable.

A modification to a candidate solution for classification is to select one prediction and flip it from 0 to 1 or from 1 to 0. A modification for regression is to add Gaussian noise to one value in the list or to replace one value with a new random value.

Scoring a solution involves calculating a standard metric, such as classification accuracy for classification tasks or mean absolute error for regression tasks. Now that we are familiar with the algorithm, let's implement it.
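Before applying it to a test set, it can help to see the loop in its generic form. The sketch below is a minimal framing of my own (hill_climb(), objective, mutate and flip_one() are placeholder names, not part of the tutorial code): start from a candidate, repeatedly make a small change, and keep the change whenever it scores as well as or better than the current solution.

# minimal generic hill climbing loop (placeholder names, not the tutorial's code)
from random import randint

def hill_climb(initial, objective, mutate, n_iterations):
 # start from the initial candidate and its score
 solution = initial
 score = objective(solution)
 for i in range(n_iterations):
  # make a small change and evaluate it
  candidate = mutate(solution)
  value = objective(candidate)
  # accept candidates that are as good or better (>= helps cross plateaus)
  if value >= score:
   solution, score = candidate, value
 return solution, score

# toy usage: maximize the number of 1s in a random bit string
def flip_one(bits):
 changed = bits.copy()
 ix = randint(0, len(changed)-1)
 changed[ix] = 1 - changed[ix]
 return changed

start = [randint(0, 1) for _ in range(20)]
best, best_score = hill_climb(start, sum, flip_one, 1000)
print(best_score)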

How to hill climb

We will develop the hill climbing algorithm on a synthetic classification task. First, we create a binary classification task with 20 input variables and 5,000 rows of examples. We can then split the dataset into train and test sets. The complete example is listed below.

# example of a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# define dataset
X, y = make_classification(n_samples=5000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
print(X.shape, y.shape)
# split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Running the example first reports the shape of the created dataset, showing 5,000 rows and 20 input variables. The dataset is then split into train and test sets, with 3,350 examples used for training and 1,650 for testing.

(5000, 20) (5000,)
(3350, 20) (1650, 20) (3350,) (1650,)

Now we can develop the hill climber. First, we create a function to load, or in this case define, the dataset. We can update this function later if we want to change the dataset.

# load or prepare the classification dataset
def load_dataset():
 return make_classification(n_samples=5000, n_features=20, n_informative=15, n_redundant=5, random_state=1)

Next, we need a function to evaluate a candidate solution, that is, a list of predictions. We will use classification accuracy, where scores range from 0 (the worst possible solution) to 1 (a perfect set of predictions).

# evaluate a set of predictions
def evaluate_predictions(y_test, yhat):
 return accuracy_score(y_test, yhat)

Next, we need a function to create an initial candidate solution. This is a list of 0 and 1 class label predictions, long enough to match the number of examples in the test set, in this case 1,650. We can use the randint() function to generate random values of 0 and 1.

# create a random set of predictions
def random_predictions(n_examples):
 return [randint(0, 1) for _ in range(n_examples)]

Next, we need a function to create a modified version of a candidate solution. In this case, this involves selecting one value in the solution and flipping it from 0 to 1 or from 1 to 0. Normally, we make a single change to each new candidate solution during the hill climb, but I have parameterized the function so you can explore making more changes if you like.

# modify the current set of predictions
def modify_predictions(current, n_changes=1):
 # copy current solution
 updated = current.copy()
 for i in range(n_changes):
  # select a point to change
  ix = randint(0, len(updated)-1)
  # flip the class label
  updated[ix] = 1 - updated[ix]
 return updated

So far, so good. Next, we can develop the function that performs the search. First, the initial solution is created and evaluated by calling random_predictions() followed by evaluate_predictions(). We then loop for a fixed number of iterations, generating a new candidate with modify_predictions(), evaluating it, and replacing the current solution if the new score is as good as or better than the current one. The loop ends when we exhaust the preset number of iterations (chosen arbitrarily) or when the ideal score is reached, which in this case we know is an accuracy of 1.0 (100 percent). The hill_climb_testset() function below implements this, taking the test set as input and returning the best set of predictions found during the hill climb.

# run a hill climb for a set of predictions
def hill_climb_testset(X_test, y_test, max_iterations):
 scores = list()
 # generate the initial solution
 solution = random_predictions(X_test.shape[0])
 # evaluate the initial solution
 score = evaluate_predictions(y_test, solution)
 scores.append(score)
 # hill climb to a solution
 for i in range(max_iterations):
  # record scores
  scores.append(score)
  # stop once we achieve the best score
  if score == 1.0:
   break
  # generate new candidate
  candidate = modify_predictions(solution)
  # evaluate candidate
  value = evaluate_predictions(y_test, candidate)
  # check if it is as good or better
  if value >= score:
   solution, score = candidate, value
   print('>%d, score=%.3f' % (i, score))
 return solution, scores

Tying this all together, the complete example of hill climbing the test set for a classification task is listed below.

# example of hill climbing the test set for a classification task
from random import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
 
# load or prepare the classification dataset
def load_dataset():
 return make_classification(n_samples=5000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
 
# evaluate a set of predictions
def evaluate_predictions(y_test, yhat):
 return accuracy_score(y_test, yhat)
 
# create a random set of predictions
def random_predictions(n_examples):
 return [randint(0, 1) for _ in range(n_examples)]
 
# modify the current set of predictions
def modify_predictions(current, n_changes=1):
 # copy current solution
 updated = current.copy()
 for i in range(n_changes):
  # select a point to change
  ix = randint(0, len(updated)-1)
  # flip the class label
  updated[ix] = 1 - updated[ix]
 return updated
 
# run a hill climb for a set of predictions
def hill_climb_testset(X_test, y_test, max_iterations):
 scores = list()
 # generate the initial solution
 solution = random_predictions(X_test.shape[0])
 # evaluate the initial solution
 score = evaluate_predictions(y_test, solution)
 scores.append(score)
 # hill climb to a solution
 for i in range(max_iterations):
  # record scores
  scores.append(score)
  # stop once we achieve the best score
  if score == 1.0:
   break
  # generate new candidate
  candidate = modify_predictions(solution)
  # evaluate candidate
  value = evaluate_predictions(y_test, candidate)
  # check if it is as good or better
  if value >= score:
   solution, score = candidate, value
   print('>%d, score=%.3f' % (i, score))
 return solution, scores
 
# load the dataset
X, y = load_dataset()
print(X.shape, y.shape)
# split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# run hill climb
yhat, scores = hill_climb_testset(X_test, y_test, 20000)
# plot the scores vs iterations
pyplot.plot(scores)
pyplot.show()

Running the example will run the search for up to 20,000 iterations, stopping early if perfect accuracy is achieved.

Note: Your results may differ due to randomness of the algorithm or evaluation program, or due to differences in numerical accuracy. Consider running the example a few times and comparing the average results.

In this case, we found a perfect set of test set predictions in approximately 12,900 iterations. Recall that this was done without touching the training dataset and without cheating by looking at the test set target values; we simply optimized a list of numbers. The lesson here is that repeatedly evaluating a modeling pipeline against the test set does the same thing, with the pipeline acting as the hill climbing optimization algorithm. The solution will overfit the test set.

...
>8092, score=0.996
>8886, score=0.997
>9202, score=0.998
>9322, score=0.998
>9521, score=0.999
>11046, score=0.999
>12932, score=1.000

A plot of the optimization progress is also created. This can be helpful for seeing how changes to the optimization algorithm, such as the choice of what to modify and how it is modified during the hill climb, affect the convergence of the search.
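The same creep happens, more slowly, when the "optimizer" is a practitioner who repeatedly keeps whichever pipeline scores best on the test set. The sketch below is my own illustration of that lesson rather than part of the tutorial's code, and its assumptions (a k-nearest neighbors sweep and a 1,000/500/500 train/test/holdout split) are arbitrary: the configuration selected by test score will typically look slightly better on the test set than on a truly unseen holdout.

# sketch of test-set creep from repeated model selection (illustrative only;
# the k-NN sweep and split sizes are assumptions, not the tutorial's pipeline)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# create train, test and truly unseen holdout sets
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=1)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=1)
X_test, X_hold, y_test, y_hold = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)
best_score, best_model = 0.0, None
# repeatedly evaluate pipelines on the test set and keep the best test score
for k in range(1, 200, 2):
 model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
 score = accuracy_score(y_test, model.predict(X_test))
 if score > best_score:
  best_score, best_model = score, model
print('selected test accuracy: %.3f' % best_score)
print('holdout accuracy: %.3f' % accuracy_score(y_hold, best_model.predict(X_hold)))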

Hill climbing the diabetes classification dataset

We will use the diabetes dataset as the basis for exploring hill climbing the test set for a classification problem. Each record describes medical details for a female patient, and the prediction is the onset of diabetes within the next five years.

Dataset details: pima-indians-diabetes.names
Dataset: pima-indians-diabetes.csv

The dataset has eight input variables and 768 rows of data; the input variables are all numeric and the target has two class labels, meaning it is a binary classification task. The first five rows of the dataset are listed below.

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
...

We can load the dataset directly using Pandas, as shown below.

# load or prepare the classification dataset
def load_dataset():
 url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
 df = read_csv(url, header=None)
 data = df.values
 return data[:, :-1], data[:, -1]

The rest of the code remains the same. The example is structured so that you can drop in your own binary classification dataset and try it out. The complete example is listed below.

# example of hill climbing the test set for the diabetes dataset
from random import randint
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
 
# load or prepare the classification dataset
def load_dataset():
 url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
 df = read_csv(url, header=None)
 data = df.values
 return data[:, :-1], data[:, -1]
 
# evaluate a set of predictions
def evaluate_predictions(y_test, yhat):
 return accuracy_score(y_test, yhat)
 
# create a random set of predictions
def random_predictions(n_examples):
 return [randint(0, 1) for _ in range(n_examples)]
 
# modify the current set of predictions
def modify_predictions(current, n_changes=1):
 # copy current solution
 updated = current.copy()
 for i in range(n_changes):
  # select a point to change
  ix = randint(0, len(updated)-1)
  # flip the class label
  updated[ix] = 1 - updated[ix]
 return updated
 
# run a hill climb for a set of predictions
def hill_climb_testset(X_test, y_test, max_iterations):
 scores = list()
 # generate the initial solution
 solution = random_predictions(X_test.shape[0])
 # evaluate the initial solution
 score = evaluate_predictions(y_test, solution)
 scores.append(score)
 # hill climb to a solution
 for i in range(max_iterations):
  # record scores
  scores.append(score)
  # stop once we achieve the best score
  if score == 1.0:
   break
  # generate new candidate
  candidate = modify_predictions(solution)
  # evaluate candidate
  value = evaluate_predictions(y_test, candidate)
  # check if it is as good or better
  if value >= score:
   solution, score = candidate, value
   print('>%d, score=%.3f' % (i, score))
 return solution, scores
 
# load the dataset
X, y = load_dataset()
print(X.shape, y.shape)
# split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# run hill climb
yhat, scores = hill_climb_testset(X_test, y_test, 5000)
# plot the scores vs iterations
pyplot.plot(scores)
pyplot.show()

Running the example reports the iteration number and the classification accuracy each time an improvement is seen during the search.

In this case, we use fewer iterations because there are fewer predictions to make, making optimization easier.

Note: Your results may differ due to randomness of the algorithm or evaluation program, or due to differences in numerical accuracy. Consider running the example a few times and comparing the average results.

In this case, we can see perfect accuracy achieved in about 1,500 iterations.

...
>617, score=0.961
>627, score=0.965
>650, score=0.969
>683, score=0.972
>743, score=0.976
>803, score=0.980
>817, score=0.984
>945, score=0.988
>1350, score=0.992
>1387, score=0.996
>1565, score=1.000

A line chart of search progress is also created, indicating rapid convergence.

Hill climbing the housing regression dataset

We will use the housing dataset as the basis for exploring hill climbing the test set for a regression problem. The housing dataset involves predicting house prices, in thousands of dollars, given details of a house and its neighborhood.

Dataset details: housing.names
Dataset: housing.csv

This is a regression problem, which means we are predicting a number. There are 506 observations, including 13 input variables and one output variable. Examples of the first five lines are listed below.

0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98,24.00
0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14,21.60
0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03,34.70
0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94,33.40
0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33,36.20
...

First, we can update the load_dataset() function to load the housing dataset. As part of loading the dataset, we normalize the target values. This makes hill climbing the predictions simpler, because we can restrict the candidate floating point values to the range 0 to 1. This is not generally required; it just simplifies the search algorithm used here.

# load or prepare the regression dataset
def load_dataset():
 url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
 df = read_csv(url, header=None)
 data = df.values
 X, y = data[:, :-1], data[:, -1]
 # normalize the target
 scaler = MinMaxScaler()
 y = y.reshape((len(y), 1))
 y = scaler.fit_transform(y)
 return X, y

Next, we can update the scoring function to use the mean absolute error between the expected and predicted values.

# evaluate a set of predictions
def evaluate_predictions(y_test, yhat):
 return mean_absolute_error(y_test, yhat)

We must also update the representation of a solution from 0 and 1 class labels to floating point values between 0 and 1. The generation of the initial candidate solution must be changed to create a list of random floats.

# create a random set of predictions
def random_predictions(n_examples):
 return [random() for _ in range(n_examples)]

In this case, the single change made to a solution to create a new candidate involves replacing a randomly selected prediction in the list with a new random floating point value. I chose this because it is simple.

# modify the current set of predictions
def modify_predictions(current, n_changes=1):
 # copy current solution
 updated = current.copy()
 for i in range(n_changes):
  # select a point to change
  ix = randint(0, len(updated)-1)
  # replace the prediction with a new random value
  updated[ix] = random()
 return updated

A better approach would be to add Gaussian noise to the existing value, which I leave as an extension for you; a possible completion of the fragment is sketched after it below. If you try it, let me know in the comments. For example:

# add gaussian noise
updated[ix] += gauss(0, 0.1)
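One way to complete that extension is sketched below. It is my own completion of the fragment above rather than code from the tutorial: gauss() must be imported from the random module, the 0.1 standard deviation is an arbitrary choice, and the perturbed value is clipped back into the 0 to 1 range to match the normalized target. Wiring it into the full example and tuning the noise is still left to you.

# sketch of a gaussian-noise version of modify_predictions()
# (reader extension; the 0.1 standard deviation is an arbitrary assumption)
from random import gauss, randint

def modify_predictions(current, n_changes=1):
 # copy current solution
 updated = current.copy()
 for i in range(n_changes):
  # select a point to change
  ix = randint(0, len(updated)-1)
  # perturb the prediction with gaussian noise and clip to the valid range
  noisy = updated[ix] + gauss(0, 0.1)
  updated[ix] = max(0.0, min(1.0, noisy))
 return updated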

Finally, the search itself must be updated. The best possible value is now an error of 0.0, which is used to stop the search if it is achieved.

# stop once we achieve the best score
if score == 0.0:
 break

We also need to change the search to minimize the score rather than maximize it, accepting a candidate if its error is less than or equal to the current best.

# check if it is as good or better
if value <= score:
 solution, score = candidate, value
 print('>%d, score=%.3f' % (i, score))

The updated search function with both changes is listed below.

# run a hill climb for a set of predictions
def hill_climb_testset(X_test, y_test, max_iterations):
 scores = list()
 # generate the initial solution
 solution = random_predictions(X_test.shape[0])
 # evaluate the initial solution
 score = evaluate_predictions(y_test, solution)
 print('>%.3f' % score)
 # hill climb to a solution
 for i in range(max_iterations):
  # record scores
  scores.append(score)
  # stop once we achieve the best score
  if score == 0.0:
   break
  # generate new candidate
  candidate = modify_predictions(solution)
  # evaluate candidate
  value = evaluate_predictions(y_test, candidate)
  # check if it is as good or better
  if value <= score:
   solution, score = candidate, value
   print('>%d, score=%.3f' % (i, score))
 return solution, scores

Tying this together, the complete example of hill climbing the test set for a regression task is listed below.

# example of hill climbing the test set for the housing dataset
from random import random
from random import randint
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot
 
# load or prepare the regression dataset
def load_dataset():
 url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
 df = read_csv(url, header=None)
 data = df.values
 X, y = data[:, :-1], data[:, -1]
 # normalize the target
 scaler = MinMaxScaler()
 y = y.reshape((len(y), 1))
 y = scaler.fit_transform(y)
 return X, y
 
# evaluate a set of predictions
def evaluate_predictions(y_test, yhat):
 return mean_absolute_error(y_test, yhat)
 
# create a random set of predictions
def random_predictions(n_examples):
 return [random() for _ in range(n_examples)]
 
# modify the current set of predictions
def modify_predictions(current, n_changes=1):
 # copy current solution
 updated = current.copy()
 for i in range(n_changes):
  # select a point to change
  ix = randint(0, len(updated)-1)
  # replace the prediction with a new random value
  updated[ix] = random()
 return updated
 
# run a hill climb for a set of predictions
def hill_climb_testset(X_test, y_test, max_iterations):
 scores = list()
 # generate the initial solution
 solution = random_predictions(X_test.shape[0])
 # evaluate the initial solution
 score = evaluate_predictions(y_test, solution)
 print('>%.3f' % score)
 # hill climb to a solution
 for i in range(max_iterations):
  # record scores
  scores.append(score)
  # stop once we achieve the best score
  if score == 0.0:
   break
  # generate new candidate
  candidate = modify_predictions(solution)
  # evaluate candidate
  value = evaluate_predictions(y_test, candidate)
  # check if it is as good or better
  if value <= score:
   solution, score = candidate, value
   print('>%d, score=%.3f' % (i, score))
 return solution, scores
 
# load the dataset
X, y = load_dataset()
print(X.shape, y.shape)
# split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# run hill climb
yhat, scores = hill_climb_testset(X_test, y_test, 100000)
# plot the scores vs iterations
pyplot.plot(scores)
pyplot.show()

Running the example reports the iteration number and the MAE each time an improvement is seen during the search.

In this case, we use more iterations because the optimization problem is more challenging. The method chosen for creating candidate solutions also makes the search slower and less likely to achieve a perfect error. In fact, we do not achieve a perfect error; instead, it would be better to stop the run once the error drops below an acceptable threshold (for example, 1e-7) or reaches a value that is meaningful for the target domain. This, too, is left as an exercise for the reader. For example:

# stop once we achieve a good enough
if score <= 1e-7:
 break

Note: Your results may differ due to randomness of the algorithm or evaluation program, or due to differences in numerical accuracy. Consider running the example a few times and comparing the average results.

In this case, we can see that a good error is achieved by the end of the run.

>95991, score=0.001
>96011, score=0.001
>96295, score=0.001
>96366, score=0.001
>96585, score=0.001
>97575, score=0.001
>98828, score=0.001
>98947, score=0.001
>99712, score=0.001
>99913, score=0.001

A line plot of search progress is also created, showing rapid initial convergence followed by only small improvements for most of the remaining iterations.

Author: Yishui Hancheng, CSDN blogger expert. Personal research interests: machine learning, deep learning, NLP, CV

Blog: yishuihancheng.blog.csdn.net
