There is an accepted piece of wisdom in the industry that "data and features determine the upper limit of a machine learning project, and algorithms merely approach that limit as closely as possible." In practice, feature engineering takes up more than half of the work and is a very important step. Missing values, outliers, standardization, class imbalance and similar problems all need to be handled properly. In this article, we will discuss an easily overlooked pitfall: data consistency.
It is well known that most machine learning algorithms assume that the training samples and the test samples come from the same distribution. If the distribution of the test data is inconsistent with that of the training data, the performance of the model will suffer.
In some machine learning competitions, some features of the given training set and test set may themselves be inconsistently distributed. In practice, as the business develops, the distribution of incoming samples will also drift away from that of the training samples, causing the model to lose generalization ability.
Here are some methods for checking the consistency of feature distributions between the training set and the test set:
KDE (Kernel Density Estimation) distribution plot
Kernel density estimation is used in probability theory to estimate an unknown density function and is one of the non-parametric methods. A kernel density estimation plot gives an intuitive view of the distribution characteristics of a data sample.
Seaborn's kdeplot can be used to perform and visualize kernel density estimation for univariate and bivariate data.
Look at a small example:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# load the training and test sets
train_set = pd.read_csv(r'D:\...\train_set.csv')
test_set = pd.read_csv(r'D:\...\test_set.csv')

# overlay the kernel density estimates of the balance feature from both sets
plt.figure(figsize=(12, 9))
ax1 = sns.kdeplot(train_set.balance, label='train_set')
ax2 = sns.kdeplot(test_set.balance, label='test_set')
plt.legend()
plt.show()
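If the two density curves largely overlap, the feature can be regarded as roughly identically distributed in the training set and the test set; curves that are clearly shifted or differently shaped point to an inconsistent distribution.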
KS test (Kolmogorov-Smirnov)
The KS test is based on the cumulative distribution function and is used to test whether a sample conforms to a theoretical distribution, or whether two empirical distributions differ significantly. The two-sample KS test is one of the most useful and commonly used nonparametric methods for comparing two samples, because it is sensitive to differences in both the location and the shape of the two samples' empirical distribution functions.
We can use ks_2samp from the scipy.stats library to perform the KS test:
from scipy.stats import ks_2samp
ks_2samp(train_set.balance,test_set.balance)
The KS test returns two values. The first is the KS statistic, the maximum distance between the two cumulative distributions: the smaller it is, the smaller the difference between the two distributions and the more consistent they are. The second is the p-value, used to judge the outcome of the hypothesis test: the larger the p-value, the less we can reject the null hypothesis (that the two distributions being tested are identical), i.e. the more likely the two samples come from the same distribution.
Ks_2sampResult(statistic=0.005976590587342234, pvalue=0.9489915858135447)
As can be seen from the returned result, the balance feature follows the same distribution in the training set and the test set.
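To run the same check over many features at once, the test can be applied column by column. Here is a minimal sketch, assuming train_set and test_set are the DataFrames loaded earlier; the 0.05 p-value threshold and the variable names are illustrative choices, not fixed rules:

from scipy.stats import ks_2samp

# compare every shared numeric column and flag those whose p-value falls below the threshold
numeric_cols = train_set.select_dtypes(include='number').columns.intersection(test_set.columns)
inconsistent = []
for col in numeric_cols:
    statistic, p_value = ks_2samp(train_set[col], test_set[col])
    if p_value < 0.05:
        inconsistent.append((col, statistic, p_value))

print('features with inconsistent distributions:', inconsistent)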
Adversarial Validation
In addition to KDE plots and the KS test, adversarial validation is a popular approach. It is not a method for evaluating model performance, but a way to check whether the distributions of the training set and the test set differ. The procedure is as follows (see the sketch after this list):
1. Combine the training set and the test set into one dataset, add a label column, and mark training-set samples as 0 and test-set samples as 1.
2. Repartition this combined data into a new train_set and test_set (different from the original training set and test set).
3. Train a binary classifier (LR, RF, XGBoost, LightGBM, etc.) on the new train_set, using AUC as the model metric.
4. If the AUC is around 0.5, the model cannot distinguish the original training set from the test set, i.e. their distributions are consistent. If the AUC is relatively large, the original training set and test set differ considerably and their distributions are inconsistent.
5. Use the classifier from step 3 to predict scores for the original training samples and sort them in descending order of score: the higher the score, the more similar the sample is to the test set. Take the top N training samples as the validation set for the target task, so that the original data is split into a training set, a validation set and a test set.
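Here is a minimal sketch of these steps, assuming train_set and test_set are pandas DataFrames that share the same numeric feature columns; the choice of RandomForestClassifier, the 20% validation size and the variable names are illustrative, and out-of-fold cross-validation predictions stand in for a single repartition, which serves the same purpose:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# step 1: stack the two sets and label their origin (train -> 0, test -> 1)
features = train_set.columns.intersection(test_set.columns)
combined = pd.concat([train_set[features], test_set[features]], ignore_index=True)
origin = pd.Series([0] * len(train_set) + [1] * len(test_set))

# steps 2-4: train a binary classifier to predict the origin label and measure
# how well it separates the two sets via out-of-fold AUC
clf = RandomForestClassifier(n_estimators=200, random_state=42)
proba = cross_val_predict(clf, combined, origin, cv=5, method='predict_proba')[:, 1]
print('adversarial AUC:', roc_auc_score(origin, proba))  # ~0.5 means the two sets are indistinguishable

# step 5: take the most "test-like" training samples (highest predicted scores) as the validation set
train_scores = proba[:len(train_set)]
N = int(0.2 * len(train_set))
val_positions = train_scores.argsort()[::-1][:N]
validation_set = train_set.iloc[val_positions]
new_train_set = train_set.drop(train_set.index[val_positions])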
In addition to checking whether the feature distributions of the training set and the test set are consistent, adversarial validation can also be used for feature selection. Give it a thumbs up if you're interested. Next time, when we cover feature selection, we'll use an example to see how adversarial validation works.