This article is a translation of, and supplement to, material from the book “Deep Learning with Python, Second Edition” by François Chollet, creator of Keras and deep learning engineer at Google.
Since the goal is to develop models that generalize successfully to new data, it is important to be able to reliably evaluate a model's generalization ability. In this article, we formally introduce the different approaches to evaluating machine learning models.
Training, validation and test sets
Evaluating a model always boils down to splitting the available data into three sets: training, validation, and test. You train on the training data and evaluate the model on the validation data. Once the model is ready, you test it one final time on the test data, which should be as similar as possible to production data. After that, you can deploy the model to production.
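As a rough sketch (illustrative, not from the book), a three-way split might look like the following, assuming data is a NumPy array of samples and using an arbitrary 70/15/15 ratio; in practice you would split the corresponding labels the same way.

import numpy as np

np.random.shuffle(data)                               # shuffle before splitting (see the considerations at the end)
num_test = int(0.15 * len(data))                      # arbitrary 15% for the test set
num_val = int(0.15 * len(data))                       # arbitrary 15% for the validation set
test_data = data[:num_test]                           # held out until the very end
validation_data = data[num_test:num_test + num_val]   # used to tune hyperparameters
training_data = data[num_test + num_val:]             # used to fit the model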
Why, you may ask, not just have two sets, a training set and a test set? You would train on the training data and evaluate on the test data. Much simpler!
The reason is that developing a model always involves tuning its configuration: for example, choosing the number of layers or the size of the layers (these are called the hyperparameters of the model, to distinguish them from the parameters, which are the network's weights). You do this tuning by using the model's performance on the validation data as a feedback signal. In essence, this tuning is a form of learning: a search for a good configuration in some parameter space. As a result, even though the model is never directly trained on the validation set, tuning its configuration based on its performance on that set can quickly lead to overfitting to the validation set.
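To make this concrete, here is a schematic sketch (not from the book) of such a tuning loop; get_model(num_layers=...) is a hypothetical model factory, model.fit and model.evaluate stand in for your framework's training and evaluation calls, and training_data and validation_data are assumed to be already defined.

best_score, best_num_layers = None, None
for num_layers in [1, 2, 3]:                      # candidate hyperparameter values
    model = get_model(num_layers=num_layers)      # hypothetical model factory
    model.fit(training_data)                      # training never touches the validation data...
    score = model.evaluate(validation_data)       # ...but the validation score drives every decision
    if best_score is None or score > best_score:
        best_score, best_num_layers = score, num_layers
# Each such decision leaks a little information about the validation set into the model.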
At the heart of this phenomenon is the notion of information leaks. Every time you tune a hyperparameter of your model based on the model's performance on the validation set, some information about the validation data leaks into the model. If you do this only once, for one parameter, then very few bits of information leak, and your validation set remains reliable for evaluating the model. But if you repeat this many times (running one experiment, evaluating on the validation set, and modifying the model as a result), then you leak an increasingly significant amount of information about the validation set into the model.
In the end, you get a model that performs artificially well on the validation data, because that is what you optimized it for. But what you care about is performance on completely new data, not on the validation data, so you need to use a completely different, never-before-seen dataset to evaluate the model: the test dataset. Your model should not have had access to any information about the test set, even indirectly. If anything about the model has been tuned based on test set performance, then your measure of generalization will be flawed.
Splitting your data into training, validation, and test sets may seem straightforward, but there are a few advanced ways to do it that can come in handy when little data is available. Let's review three classic evaluation recipes: simple holdout validation, K-fold validation, and iterated K-fold validation with shuffling. We will also discuss the use of common-sense baselines to check that your training is going somewhere.
Simple holdout validation
Set apart some fraction of your data as your test set. Train on the remaining data, and evaluate on the test set. As noted in the previous sections, to prevent information leaks you should not tune your model based on the test set, and therefore you should also reserve a validation set.
Schematically, holdout validation simply carves a validation slice (and a test slice) out of the available data and trains on the rest. The following listing shows a simple implementation.

import numpy as np

# data, test_data, and get_model() are assumed to be defined elsewhere;
# model.fit and model.evaluate stand in for your framework's training and evaluation calls.
num_validation_samples = 10000
np.random.shuffle(data)                               # shuffling is usually appropriate (see the considerations below)
validation_data = data[:num_validation_samples]       # define the validation set
training_data = data[num_validation_samples:]         # and the training set
model = get_model()                                   # get a fresh, untrained model
model.fit(training_data)                              # train it on the training data
validation_score = model.evaluate(validation_data)    # and evaluate it on the validation data
# At this point you can tune your model,
# retrain it, evaluate it, tune it again...
model = get_model()                                   # once the hyperparameters are tuned, train a final model
model.fit(np.concatenate([training_data,              # on all non-test data available
                          validation_data]))
test_score = model.evaluate(test_data)                # and evaluate it on the test data

This is the simplest evaluation protocol, and it suffers from one flaw: if little data is available, your validation and test sets may contain too few samples to be statistically representative of the data at hand. This is easy to recognize: if different random shuffles of the data before splitting end up yielding very different measures of model performance, you are running into this problem. K-fold validation and iterated K-fold validation are two ways to address it, as described next.

K-fold validation

With this approach, you split your data into K partitions of equal size. For each partition i, you train a model on the remaining K - 1 partitions and evaluate it on partition i. Your final score is then the average of the K scores obtained. This method is helpful when the performance of your model shows significant variance based on the train-test split. Like holdout validation, this method does not exempt you from using a distinct validation set for model calibration. A schematic implementation is sketched below.
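Here is such a sketch, in the same placeholder style as the holdout listing above; data, test_data, and get_model() are again assumed to be defined elsewhere.

import numpy as np

k = 3
num_validation_samples = len(data) // k
np.random.shuffle(data)
validation_scores = []
for fold in range(k):
    # The fold used for validation in this round
    validation_data = data[num_validation_samples * fold:
                           num_validation_samples * (fold + 1)]
    # The remaining K - 1 folds are used for training
    training_data = np.concatenate([data[:num_validation_samples * fold],
                                    data[num_validation_samples * (fold + 1):]])
    model = get_model()                               # a brand-new, untrained model for each fold
    model.fit(training_data)
    validation_scores.append(model.evaluate(validation_data))
validation_score = np.average(validation_scores)      # final score: the average of the K folds
model = get_model()                                   # then train the final model on all non-test data
model.fit(data)
test_score = model.evaluate(test_data)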
Iterated K-fold validation with shuffling
This one is for situations in which you have relatively little data available and you need to evaluate your model as precisely as possible. I have found it extremely helpful in Kaggle competitions. It consists of applying K-fold validation multiple times, shuffling the data every time before splitting it K ways. The final score is the average of the scores obtained at each run of K-fold validation. Note that you end up training and evaluating P × K models (where P is the number of iterations you use), which can be very expensive.
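Schematically, and again using the same placeholder names as in the listings above, the procedure can be sketched as follows.

import numpy as np

p = 5                                                 # number of iterations (P)
k = 3                                                 # number of folds (K)
num_validation_samples = len(data) // k
all_scores = []
for iteration in range(p):
    np.random.shuffle(data)                           # reshuffle before every K-fold pass
    for fold in range(k):
        validation_data = data[num_validation_samples * fold:
                               num_validation_samples * (fold + 1)]
        training_data = np.concatenate([data[:num_validation_samples * fold],
                                        data[num_validation_samples * (fold + 1):]])
        model = get_model()                           # P * K models are trained in total
        model.fit(training_data)
        all_scores.append(model.evaluate(validation_data))
validation_score = np.average(all_scores)             # average over all P * K runs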
Beating a common-sense baseline
Besides the different evaluation protocols you have access to, the last thing you should know about is the use of common-sense baselines. Training a deep learning model is a bit like pressing a button that launches a rocket in a parallel world. You can't hear it or see it. You can't observe the learning process: it happens in a space with thousands of dimensions, and even if you projected it to 3D, you couldn't interpret it. The only feedback you have is your validation metrics, like an altimeter on your invisible rocket.
It is particularly important to be able to tell whether you are getting off the ground at all. What was your starting altitude? Your model seems to have an accuracy of 15 percent: is that any good? Before you start working with a dataset, you should always pick a trivial baseline and try to beat it. If you cross that threshold, you will know you are doing something right: your model is actually using the information in the input data to make predictions that generalize, and you can keep going. This baseline could be the performance of a random classifier, or the performance of the simplest non-machine-learning technique you can imagine.
For example, in the MNIST digit-classification example, a simple baseline would be a validation accuracy greater than 0.1 (that of a random classifier); in the IMDB example, it would be a validation accuracy greater than 0.5. In the Reuters example, it would be around 0.18-0.19, due to class imbalance. If you have a binary classification problem where 90% of samples belong to class A and 10% belong to class B, then a classifier that always predicts A already achieves a validation accuracy of 0.9, so you will need to do better than that.
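As a minimal sketch of such a baseline (the variable names here are illustrative, not from the book), the accuracy of a classifier that always predicts the majority class can be computed directly from the labels.

import numpy as np

# train_labels and val_labels are assumed to be 1-D integer class arrays.
majority_class = np.bincount(train_labels).argmax()        # most frequent class in the training data
baseline_accuracy = np.mean(val_labels == majority_class)  # accuracy of always predicting that class
print(f"Common-sense baseline to beat: {baseline_accuracy:.3f}")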
Having a common-sense baseline you can refer to is essential when you start working on a problem that no one has solved before. If you can't beat a trivial solution, your model is worthless: perhaps you are using the wrong model, or perhaps the problem you are trying to solve can't be approached with machine learning in the first place. Time to go back to the drawing board.
Considerations regarding model evaluation
When selecting an evaluation protocol, please note the following points:
- Data representativeness – You want both your training set and your test set to be representative of the data at hand. For example, if you are trying to classify images of digits and you start from an array of samples that is ordered by class, then taking the first 80% of the array as your training set and the remaining 20% as your test set will result in a training set that contains only classes 0-7 while the test set contains only classes 8 and 9. This seems like a ridiculous mistake, but it is surprisingly common. For this reason, you should usually shuffle your data randomly before splitting it into training and test sets (see the sketch after this list).
- Arrow of time – If you are trying to predict the future given the past (for example, tomorrow's weather, stock movements, and so on), you should not randomly shuffle your data before splitting it, because doing so would create a temporal leak: your model would effectively be trained on data from the future. In such situations, you should always make sure all the data in your test set comes after the data in the training set.
- Redundancy in your data – If some data points in your data appear twice (fairly common with real-world data), then shuffling the data and splitting it into a training set and a validation set will result in redundancy between the training and validation sets. In effect, you would be testing on part of your training data, which is the worst thing you can do! Make sure your training set and your validation set are disjoint.
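Here is a short sketch illustrating all three points; the variable names (samples, labels, timestamps, train_sample_ids, val_sample_ids) are hypothetical placeholders.

import numpy as np

# 1. Data representativeness: shuffle samples and labels together before splitting.
indices = np.random.permutation(len(samples))
shuffled_samples, shuffled_labels = samples[indices], labels[indices]

# 2. Arrow of time: for temporal data, split chronologically instead of shuffling,
#    so that everything in the test set comes after everything in the training set.
order = np.argsort(timestamps)
split = int(0.8 * len(samples))
train_samples, test_samples = samples[order][:split], samples[order][split:]

# 3. Redundancy: make sure no data point ends up in both the training and validation sets.
assert set(train_sample_ids).isdisjoint(val_sample_ids), "training and validation sets overlap!"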
Having a reliable way to evaluate the performance of your model is how you will be able to monitor the central tension at the heart of machine learning: between optimization and generalization, underfitting and overfitting.