This article is part 6 of the Machine Learning Bible series. By the end of it, you will understand data leakage in machine learning.
From the earlier articles on evaluation metrics for classification and regression algorithms, evaluation metrics for ranking algorithms, and offline evaluation methods for machine learning models, you already know which metrics to use and which methods to apply when evaluating a machine learning model. In practice, however, model evaluation often runs into a problem: data leakage. In this article we explain what data leakage is.
What is data leakage
Let's start with a concrete example. As an excellent algorithm engineer, your ambition is to solve many practical problems with machine learning. Suppose you are working on a binary classification problem. You train a model, making the most of your data set with cross-validation, and find that it reaches an AUC of 0.99 on the test set. You happily imagine the bonus you will get this year for developing such a great model. The system engineers then productionize the model, deploy it, and start using it on real business problems. Then, while you are bragging to your colleagues about how awesome your model is and how well it works in production, your legs go limp and you would fall to the floor if your colleagues weren't holding you up.
A model that works well offline can perform poorly online for many reasons. One that comes up often is data leakage, sometimes simply called leakage. It means that the data used to train the machine learning model happens to contain the very information you are trying to predict; in other words, some information from the test data (or from the future) has leaked into the training set. This includes information about the target label, or data that is available in the training set but unavailable, or not legitimate to use, when the model is applied in the real world.
Data leakage usually occurs in very subtle, hard-to-detect ways. When it occurs, it causes the model to be "overestimated" offline: the model does well on the test set during offline evaluation, but does badly once it is deployed to production to solve real business problems. In other words, the offline evaluation overestimates the power of the model.
Some specific examples of data leakage
Although data leakage was defined above, the definition is rather abstract, so here are a few examples. An easy-to-understand example of leakage: if the training data contains the test data, the evaluation will be overly optimistic because the model has effectively overfit to the test set. Another simple example: if the prediction target itself is used as a feature, the model's conclusion is essentially "apples are apples": if an item is labeled an apple, the model predicts that it is an apple. Let's look at some more subtle examples of leakage from KDD competitions.
Predict whether potential customers will open a bank account
One feature used to predict whether a potential customer will open an account at a bank is the account number. Obviously, this field only has a value for customers who have already opened an account. There is nothing wrong at training time, but at actual prediction time the feature is empty for every customer, because you cannot know a customer's account number before they open an account; and if you already knew it, what would you need the prediction model for?
Predict whether users will leave the site
On a retail site, consider predicting whether a user will leave the site or go to a new page after browsing the current page. A feature that involves data leakage here is session length, the total number of pages a user views during a visit to the site. This feature contains future information: how many more pages the user will visit. One solution is to replace session length with page number in session, that is, the number of pages viewed in the session so far.
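As a rough sketch of the difference (not from the original article), assuming a hypothetical page-view log with `session_id` and `timestamp` columns, the leaky feature counts every page in the session, including pages the user has not yet seen, while the safe feature counts only the pages viewed so far:

```python
import pandas as pd

# Hypothetical page-view log: one row per page view (column names are assumptions).
views = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2018-10-10 22:00:05", "2018-10-10 22:00:40", "2018-10-10 22:01:10",
        "2018-10-10 23:15:00", "2018-10-10 23:15:30",
    ]),
}).sort_values(["session_id", "timestamp"])

# Leaky: total pages in the session, only known after the session has ended.
views["session_length"] = views.groupby("session_id")["timestamp"].transform("size")

# Safe: number of pages seen so far, known at prediction time.
views["page_number_in_session"] = views.groupby("session_id").cumcount() + 1
print(views)
```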
Predict whether users will buy the product
On e-commerce sites, a common task is to predict whether a user will buy a product after being exposed to it. In this problem, the product's positive-review rate is a very important feature. The model is usually trained on past data, for example data from the past week. When generating the positive-review rate for the training data, using the rate at the current time would make the feature contain future information, so you should instead use the rate at the moment of exposure. For example, suppose item i was shown to user u at 22:00:30 on October 10, 2018 and the user eventually bought it, and at the time of exposure the item's positive-review rate was 99%. A week later, at 22:00:30 on October 17, 2018, the rate is 86%. When you construct a training sample from that earlier data, the value of the positive-review-rate feature should be the 99% observed at exposure time, not the current 86%.
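To make the point concrete, here is a minimal sketch (not from the original article) of a point-in-time join with pandas. The impression log and the review-rate history are hypothetical tables; `merge_asof` picks, for each impression, the latest rate known at exposure time.

```python
import pandas as pd

# Hypothetical history of an item's positive-review rate over time.
rating_history = pd.DataFrame({
    "item_id": ["i", "i"],
    "updated_at": pd.to_datetime(["2018-10-01", "2018-10-17"]),
    "positive_rate": [0.99, 0.86],
})

# Hypothetical impression log: user u saw item i on 2018-10-10 22:00:30.
impressions = pd.DataFrame({
    "user_id": ["u"],
    "item_id": ["i"],
    "exposed_at": pd.to_datetime(["2018-10-10 22:00:30"]),
})

# As-of join: for each impression, use the latest rate known at exposure time,
# so the feature reflects what was observable then (0.99), not the current value (0.86).
features = pd.merge_asof(
    impressions.sort_values("exposed_at"),
    rating_history.sort_values("updated_at"),
    left_on="exposed_at", right_on="updated_at",
    by="item_id", direction="backward",
)
print(features[["user_id", "item_id", "exposed_at", "positive_rate"]])
```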
Predict the patient’s condition
When developing a model to diagnose a particular disease, the training set of existing patients includes a feature indicating whether the patient has had surgery for that disease. Obviously, using this feature greatly improves the accuracy of the prediction, but it is a clear data leak, because the feature cannot be known until after the patient has been diagnosed.
A related example is the patient ID, which may be assigned according to a particular diagnostic path. In other words, the ID might have been different if it had resulted from a visit to a specialist, because the original doctor had already identified a likely illness.
Types of data leakage
Data leakage can be divided into two broad categories: training data leakage and feature leakage. Training data leakage usually means that test data or future data is mixed into the training data. Feature leakage means that a feature contains information about the true label.
Training data leakage may be caused by the following situations:
- The entire data set (training set and test set) is used for some preprocessing step, so information from the test set affects what the model sees during training. This includes estimating parameters for normalization and scaling, finding minimum and maximum feature values to detect and remove outliers, and using the distribution of a variable across the whole data set to impute missing values in the training set or to perform feature selection (see the sketch after this list).
- Another key issue, especially when dealing with time series data, is that records of future events are accidentally used to compute the features for a particular prediction. The session length example above is one such case.
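Here is a minimal scikit-learn sketch of the preprocessing case from the first bullet: fitting a scaler on the full data set before cross-validation lets the validation folds influence the preprocessing, whereas putting the scaler inside a pipeline keeps each fold's preprocessing fit on its training portion only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leaky: the scaler is fit on all of X before cross-validation splits the data,
# so it has already "seen" the validation folds.
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y, cv=5)

# Safe: the scaler is part of the pipeline, so inside each fold it is fit
# only on that fold's training portion.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
safe_scores = cross_val_score(pipe, X, y, cv=5)

print(leaky_scores.mean(), safe_scores.mean())
```

For simple scaling the numerical difference is usually small, but for steps such as feature selection or missing-value imputation driven by the whole data set, the gap between the leaky and the safe estimate can be large.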
Feature leakage may occur in the following situations:
- Illegitimate features are removed, but features containing the same or similar information are overlooked (for example, in the earlier case the "had surgery" feature is removed, but the patient ID is not).
- In some cases, records in a data set are intentionally randomized, or certain fields containing specific information about the user, such as the user's name or location, are anonymized. Depending on the prediction task, undoing this anonymization can expose user or other sensitive information that would not be legitimately available in actual use.
Detecting data leaks
Once we understand what data leakage is, the next question is how to detect it.
Before building the model, we can explore the data. For example, look for features that are highly correlated with the target label or value. In the medical diagnosis example, whether the patient had surgery for the disease is highly correlated with whether the patient is eventually diagnosed with it.
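As a simple illustration (the feature names and data below are made up, not from a real data set), you can scan for features whose correlation with the label is suspiciously high:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)

# Hypothetical feature table: "had_surgery" is almost a copy of the label,
# "age" is an ordinary feature.
df = pd.DataFrame({
    "had_surgery": np.where(rng.random(200) < 0.95, target, 1 - target),
    "age": rng.normal(50, 10, size=200),
    "target": target,
})

# Absolute correlation of each feature with the target; values close to 1
# deserve a manual check: is the feature really available at prediction time?
corr = df.drop(columns="target").corrwith(df["target"]).abs().sort_values(ascending=False)
print(corr)
```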
After building the model, we can check whether the features with high weight in the model involve leakage. Likewise, if you build a model and it turns out to be unbelievably good, you should ask whether a data leak has occurred.
Another, more reliable, way to check for leakage is to perform a limited real-world deployment of the trained model and see whether there is a significant gap between its offline performance and its performance in the production environment. Note, however, that a large gap can also be caused by overfitting.
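Similarly, a quick look at feature importances after training can surface suspicious features. In this deliberately constructed sketch (synthetic data, not from the article), a near copy of the label dominates the importances:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

# Deliberately add a leaky feature that is a copy of the label.
X["leaky_copy_of_label"] = y

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# A single feature dominating the importances, or an implausibly perfect score,
# is a hint that the feature may encode the target.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())
```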
Fixing data leaks
If a data leak is detected, how can it be fixed?
First of all, during data preprocessing, computations should not use the entire data set; the preprocessing parameters should be derived from the training split only.
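A minimal sketch of this rule, using mean imputation as the preprocessing step: the statistics are fit on the training split and then merely applied to the test split.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
X[::7, 1] = np.nan          # introduce some missing values
y = rng.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The column means are computed from the training split only,
# then applied unchanged to the test split.
imputer = SimpleImputer(strategy="mean").fit(X_train)
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)
```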
When dealing with time series problems, make sure that the timestamps of the features are consistent with the time at which the prediction would be made, so that no future information ends up in the training data.
In addition, features that are particularly highly correlated with the prediction target, or that carry particularly high weight in the model, should be checked for data leakage, and removed if leakage is found.
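One simple way to respect time when splitting the data, shown here on toy data with scikit-learn's TimeSeriesSplit, is to always train on the past and validate on the future rather than splitting randomly:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Assume rows are already ordered by event time.
X = np.arange(20).reshape(-1, 1)
y = X.ravel() % 2

# Each split trains on earlier rows and validates on later rows,
# so no future records leak into training.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train up to row", train_idx[-1], "| validate rows", test_idx[0], "to", test_idx[-1])
```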
Exercises
After reading this article, try the following exercise to test what you have learned:
- Think about it: when you train a model, how do you quickly determine whether it has a data leak?
References:
[1] Daniel Gutierrez. Ask a Data Scientist: Data Leakage
[2] University of Michigan. Applied Machine Learning in Python - Data Leakage
[3] Leakage in Data Mining: Formulation, Detection, and Avoidance
[4] Data Leakage
[5] Data Leakage in Machine Learning
[6] What is feature leakage? Can you give some examples?