Author: Rachel Thomas
Takeaway
Is creating a validation set really as simple as calling train_test_split? It is not.
A very common scenario: a machine learning model that looked great during development fails completely when it is used in a real production environment. The fallout includes managers who become sceptical of machine learning and unwilling to try it again. How does that happen?
One of the most likely causes of this disconnect between development and production results is a poorly chosen validation set (or worse, no validation set at all). Depending on the nature of the data, choosing a validation set can be the most important step. Although sklearn provides a train_test_split method, that method only takes a random subset of the data, which is a poor choice for many real-world problems.
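For context, this is the kind of one-line random split that many tutorials reach for by default. It is a minimal sketch with made-up arrays; the point of this post is that it is often the wrong tool:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1,000 rows of features and targets, purely for illustration.
X = np.random.rand(1000, 10)
y = np.random.rand(1000)

# The default approach: a purely random 80/20 split.
# Fine for i.i.d. data, but, as discussed below, a poor choice for
# time series or for data grouped by person, boat, location, etc.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```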
The definitions of training set, validation set, and test set can be quite subtle, and the terms are sometimes used inconsistently. In the deep learning community, “test-time inference” often refers to evaluating on production data, which is not the technical definition of a test set. As mentioned above, sklearn has a train_test_split method but no train_validation_test_split method. Kaggle only provides a training set and a test set, yet to do well you need to split its training set into your own training and validation sets. In addition, Kaggle’s test set is itself split into two subsets. It is no surprise that many beginners are confused! I discuss these subtleties below.
What is a validation set?
When creating a machine learning model, the ultimate goal is for it to be accurate on new data, not just on the data you used to build it. Here are three examples of different models for the same set of data:
Under-fitting and over-fitting
The error for the data points shown is lowest for the model on the right (the blue curve passes through the red points almost perfectly), but it is not the best choice. Why? If you were to collect some new data points, they would most likely not fall on the curve in the right-hand chart, but would be closer to the curve in the middle chart.
The basic idea is:
- The training set is used to train a given model
- The validation set is used to choose between models (for example, does a random forest or a neural network work better for your problem? Do you want a random forest with 40 trees or 50 trees?)
- The test set tells you how well you did. If you have tried many different models, you may by chance stumble on one that performs well on your validation set; having a test set helps make sure that is not the case. (The sketch after this list illustrates these three roles.)
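As a concrete sketch of these three roles, here is one way to carve a dataset into training, validation, and test sets with two calls to train_test_split (sklearn has no train_validation_test_split) and then use each piece as described above. The data, model choices, and metric are all made up for illustration, and the random split itself is only appropriate when the data has no time or group structure:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up data for illustration only.
X = np.random.rand(1000, 10)
y = np.random.rand(1000)

# Split twice to get 60% train, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Training set: fit the candidate models (40 trees vs. 50 trees).
candidates = {n: RandomForestRegressor(n_estimators=n, random_state=0).fit(X_train, y_train)
              for n in (40, 50)}

# Validation set: choose between the candidates.
best_n = min(candidates,
             key=lambda n: mean_squared_error(y_valid, candidates[n].predict(X_valid)))

# Test set: used once, to report how well the chosen model does.
test_mse = mean_squared_error(y_test, candidates[best_n].predict(X_test))
print(best_n, test_mse)
```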
A key property of the validation and test sets is that they must be representative of the new data you will see in the future. That may sound like an impossible requirement! By definition, you haven’t seen this data yet. But you do know a few things about it.
When is a random subset not good enough?
It’s useful to look at a few examples. While many of these examples come from Kaggle contests, they are representative of the problems you see in real work.
Time series
If your data is a time series, choosing a random subset of the data is both too easy (you get to look at dates both before and after the dates you are trying to predict) and unrepresentative of most business use cases (where you use historical data to build a model for predicting the future). If your data includes dates and you are building a model to use in the future, you will want to choose a contiguous section with the latest dates as your validation set (for example, the last two weeks or the last month of available data).
Suppose you want to divide the following time series data into training and validation sets:
Time series data
A random subset is a poor choice (it is too easy to fill in the gaps, and it doesn’t tell you what you need to know for production):
A bad choice for your training set
Instead, use the earlier data as your training set (and the later data as your validation set):
A better choice for your training set
Kaggle has a competition to predict the sales of a chain of Ecuadorian grocery stores. Kaggle’s “training data” runs from January 1, 2013 to August 15, 2017, and its test data runs from August 16, 2017 to August 31, 2017. A good approach would be to use August 1 to August 15, 2017 as your validation set, and all the earlier data as your training set.
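Here is a minimal sketch of that kind of date-based split with pandas. The DataFrame and its column names are hypothetical stand-ins, not the actual Kaggle file layout:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Kaggle's training file: one row per day with a sales value.
dates = pd.date_range("2013-01-01", "2017-08-15", freq="D")
df = pd.DataFrame({"date": dates, "unit_sales": np.random.rand(len(dates))})

# Use the last two weeks of the training data as the validation set,
# mirroring the two-week Kaggle test period (Aug 16-31, 2017).
cutoff = pd.Timestamp("2017-08-01")
train = df[df["date"] < cutoff]
valid = df[df["date"] >= cutoff]
```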
New people, new boats, new…
You also need to consider how the data you will be predicting on in production may be qualitatively different from the data you used to train your model.
In Kaggle’s distracted driver competition, the data consists of pictures of drivers at the wheel, and the dependent variable is a category such as texting, eating, or safely looking ahead. If you were an insurance company building a model from this data, note that you would be most interested in how the model performs on drivers it has not seen before (since your training data probably covers only a small number of people). This is how the Kaggle competition is set up as well: the test data consists of people who do not appear in the training set.
Two pictures of the same person talking on the phone while driving
If you put one of the images above in your training set and the other in your validation set, your model will appear to perform better than it will on new people. Another way to look at it: if you trained your model using images of all the people, the model might overfit to the particular characteristics of those people rather than just learning the states (texting, eating, etc.).
A similar dynamic was at work in the Kaggle fisheries competition, which aimed to identify which species of fish a fishing boat had caught, in order to reduce illegal fishing of endangered populations. The test set consisted of boats that do not appear in the training data, which means you would want your validation set to also include boats that are not in the training set.
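One way to build that kind of validation set in scikit-learn is GroupShuffleSplit, which keeps every group (here, a hypothetical driver or boat ID) entirely on one side of the split. The IDs and arrays below are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Made-up data: 12 images, 3 from each of 4 drivers (or boats).
X = np.random.rand(12, 5)          # features (e.g. image embeddings)
y = np.random.randint(0, 2, 12)    # labels (e.g. texting vs. looking ahead)
groups = np.repeat(["driver_a", "driver_b", "driver_c", "driver_d"], 3)

# Hold out whole groups: no driver/boat appears in both sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, valid_idx = next(splitter.split(X, y, groups=groups))

X_train, X_valid = X[train_idx], X[valid_idx]
y_train, y_valid = y[train_idx], y[valid_idx]
```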
Sometimes it may not be clear how your test data will differ. For example, if you are working with satellite imagery, you would need to gather more information to know whether your training set covers only certain geographic locations or comes from geographically dispersed data.
The dangers of cross-validation
Sklearn does not have a train_validation_test_split method because it assumes you will often use cross-validation, in which different subsets of the training set serve in turn as the validation set. For example, in 3-fold cross-validation the data is divided into three sets: A, B, and C. A model is first trained on A and B combined and evaluated on validation set C; then a model is trained on A and C and evaluated on validation set B; and finally a model is trained on B and C and evaluated on validation set A. The reported performance is the average over the three validation sets.
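For reference, here is a minimal sketch of that 3-fold procedure using scikit-learn on synthetic data (the data and model are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Made-up data.
X = np.random.rand(300, 8)
y = np.random.rand(300)

# 3 folds: each fold is used once as the validation set
# while the other two folds are used for training.
cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=cv)

# Performance is reported as the average across the three folds.
print(scores.mean())
```

Note that shuffle=True is precisely the random shuffling the next paragraph cautions about: it assumes the rows are exchangeable, which a time series is not.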
The problem with cross-validation, however, is that it is rarely applicable to real-world problems, for the reasons described in the sections above. Cross-validation only works when you can randomly shuffle the data to choose a validation set.
Kaggle’s “training data set” = your training data + validation data
One of the great things about Kaggle competitions is that they force you to think rigorously about validation sets (in order to do well). For those new to Kaggle, it is a platform for machine learning competitions. Kaggle typically breaks the data into two sets that you can download:
- A training set, which contains the independent variables as well as the dependent variable (what you are trying to predict). For example, in the Ecuadorian grocery competition you are trying to predict sales, so the independent variables include the store ID, item ID, and date, and the dependent variable is the quantity sold. In the distracted driver competition you are trying to determine whether a driver is engaging in dangerous behaviour behind the wheel, so the independent variable is a picture of the driver and the dependent variable is a category (such as texting, eating, or safely looking ahead).
- A test set, which contains only the independent variables. You make predictions for the test set and submit them to Kaggle to get your score.
This is the basic idea you need to get started with machine learning, but to do well you need to understand a bit more. You will want to create your own training and validation sets (by splitting up Kaggle’s “training” data). You build your model on the smaller training set (a subset of Kaggle’s training data) and evaluate it on your validation set (also a subset of Kaggle’s training data) before submitting it to Kaggle.
The most important reason for this is that Kaggle splits the test data into two sets, for the public and private leaderboards. The score you see on the public leaderboard is based on only a portion of your predictions (and you don’t know which portion!). How your predictions do on the private leaderboard is not revealed until after the competition ends. This matters because you can end up overfitting to the public leaderboard without realizing it, only to do badly on the private leaderboard at the end. Using a good validation set can prevent this. You can check whether your validation set is any good by seeing if your model gets a similar score on the Kaggle test set as it does on your validation set.
Another reason it is important to create your own validation set is that Kaggle limits you to two submissions per day, and you will probably want to experiment more than that. Third, it is instructive to see exactly what you are getting wrong on the validation set; Kaggle will not tell you the right answers for the test set, or even which data points you got wrong, only your overall score.
Understanding these distinctions is not just useful for Kaggle. In any predictive machine learning project, you want your model to perform well on new data.
www.fast.ai/2017/11/13/…