
Whether in competitions, experiments, or engineering work, we often run into training and test sets whose distributions are inconsistent. The usual practice is to split a validation set off the training set, tune hyperparameters on it, and keep the model that performs best on it. However, if the validation set differs substantially from the test set, a model that does well on the validation set may not do equally well on the test set. How to reduce the distribution gap between the held-out validation set and the test set is therefore a topic worth studying

Two cases

To be clear, this article considers the scenario where the test set inputs are available but the test set labels are not. If the evaluation is completely closed, that is, we can only submit a model and never see the test set at all, there is little we can do. Why would the distributions of the test set and the training set differ? There are two main cases

  • One is an inconsistent label distribution. That is, looking only at the input $x$, the distributions are basically the same, but the distribution of the corresponding labels $y$ differs. A typical example is information extraction (such as relation extraction): the inputs $x$ of the training set and the test set are both text from the same domain, so their distributions are very similar, but the training set is often built by "distant supervision + coarse manual labeling" and contains many errors and omissions, while the test set may be built by repeated careful manual annotation with few errors. In this case you cannot build a good validation set simply by partitioning the data
  • The second is an inconsistent input distribution. Put bluntly, the distribution of $x$ differs, but the labels $y$ are basically correct. For example, in a classification problem the category distribution of the training set may differ from that of the test set; or in reading comprehension, the ratio of factual to non-factual questions may differ between the two. In this case we can adjust the sampling strategy so that the distribution of the validation set is closer to that of the test set (see the sketch after this list), and the validation results will then better reflect the test results
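
For the second case, one simple option is stratified sampling against an estimated test-set category distribution. The sketch below is my own illustration, not code from the original post: train_df, its 'label' column, and the proportions in test_label_dist are all placeholder assumptions.

import pandas as pd

# Hypothetical sketch: draw a validation set whose label proportions match an
# estimated test-set label distribution
def sample_validation_set(train_df, test_label_dist, val_size=1000, seed=42):
    parts = []
    for label, frac in test_label_dist.items():
        n = int(round(val_size * frac))
        pool = train_df[train_df['label'] == label]
        parts.append(pool.sample(n=min(n, len(pool)), random_state=seed))
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle

# Usage with made-up proportions:
# val_df = sample_validation_set(train_df, {'fact': 0.3, 'non_fact': 0.7})
# rest_df = train_df.drop(val_df.index)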

Adversarial Validation

Adversarial validation is not a method for evaluating a model. It is a way to check whether the training set and the test set share the same distribution, to identify the features responsible for any mismatch, and to find training samples whose distribution is close to that of the test set. In practice, however, we sometimes do not need to identify which features cause the mismatch, because the dataset may contain only a single feature; for many NLP tasks, for example, there is only one text field and therefore only one feature. The core idea of adversarial validation is as follows:

Train a discriminator to distinguish training samples from test samples, then apply the discriminator to the training set and take the Top N samples predicted most likely to be test samples as the validation set, since these are the samples the model considers most similar to the test set

The discriminator

We first label the training set as 0 and the test set as 1, and train a binary discriminator $D(x)$ by minimizing


$$-\mathbb{E}_{x\sim p(x)}[\log (1-D(x))]-\mathbb{E}_{x\sim q(x)}[\log D(x)]\tag{1}$$

where $p(x)$ denotes the distribution of the training set and $q(x)$ the distribution of the test set. Note that each batch should contain the same number of samples drawn from the training set and from the test set; in other words, we sample so that the two classes are balanced
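
This balanced sampling can be made concrete with a small batch generator. The sketch below is my own illustration, under the assumption that the training and test inputs have already been vectorized into numpy arrays X_train and X_test (both names are placeholders):

import numpy as np

def balanced_batches(X_train, X_test, batch_size=64, seed=0):
    # Each batch holds an equal number of training samples (label 0) and
    # test samples (label 1), so the discriminator always sees balanced classes
    rng = np.random.default_rng(seed)
    half = batch_size // 2
    while True:
        i = rng.choice(len(X_train), size=half, replace=False)
        j = rng.choice(len(X_test), size=half, replace=False)
        x = np.vstack([X_train[i], X_test[j]])
        y = np.concatenate([np.zeros(half), np.ones(half)])
        perm = rng.permutation(2 * half)  # shuffle within the batch
        yield x[perm], y[perm]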

Some readers may worry about overfitting here: if the discriminator separates the training set from the test set perfectly, it assigns essentially the same near-zero score to every training sample, and we can no longer pick out the Top N training samples that most resemble the test set. In fact, when training the discriminator we should hold out a validation split, just as in ordinary supervised training, and use it to choose the number of training epochs, which avoids severe overfitting. Alternatively, as in some examples found online, use a simple regression-style model as the discriminator, which is much harder to overfit
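
As a small illustration of both suggestions, one can train a simple linear discriminator and judge it on a held-out split of the combined train-vs-test data. This is only a sketch: the feature matrix X and the 0/1 train/test labels y below are random placeholders standing in for the real vectorized data.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Placeholder data: rows 0-99 play the training set (label 0), rows 100-199 the test set (label 1)
X = np.random.randn(200, 20)
y = np.array([0] * 100 + [1] * 100)

# Hold out 20% of the combined data to monitor the discriminator itself
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

# AUC on the held-out split reflects how separable the two sets really are,
# rather than how well the discriminator memorized its own training data
auc = roc_auc_score(y_holdout, clf.predict_proba(X_holdout)[:, 1])
print(f'held-out discriminator AUC: {auc:.3f}')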

Similar to the discriminator in a GAN, it is not difficult to show that the theoretical optimum of $D(x)$ is


$$D(x)=\frac{q(x)}{p(x)+q(x)}\tag{2}$$

In other words, once the discriminator is trained, $D(x)$ can be read as the relative weight of the test distribution at $x$: the larger $q(x)$ is compared with $p(x)$, the closer $D(x)$ is to 1, so the training samples with the highest $D(x)$ are exactly those that look most like test samples
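
For completeness, here is the short pointwise derivation of Eq. (2) from Eq. (1), which mirrors the standard GAN discriminator argument. For a fixed $x$, the loss in Eq. (1) contributes the term

$$-p(x)\log(1-D(x))-q(x)\log D(x)$$

Setting its derivative with respect to $D(x)$ to zero gives

$$\frac{p(x)}{1-D(x)}=\frac{q(x)}{D(x)}\quad\Longrightarrow\quad D(x)=\frac{q(x)}{p(x)+q(x)}$$

which is Eq. (2). Equivalently, $\frac{D(x)}{1-D(x)}=\frac{q(x)}{p(x)}$, so the discriminator output is a monotone function of the density ratio between the test and training distributions.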

Code

The following code uses the AUC metric to judge whether the distributions of two datasets are close: the closer the AUC is to 0.5, the more similar the distributions. Most adversarial validation code found online targets numerical (tabular) data; little of it deals with NLP text data, which first has to be converted from text into feature vectors before the discriminator can be trained. The code is not complete, for example it does not extract the Top N training samples closest to the test set (a sketch of that step follows the code)

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.feature_extraction.text import TfidfVectorizer

# For demonstration, a single file is shuffled and split 70/30 to play the roles of
# "training set" and "test set"; in practice df_test would be the real test data
df = pd.read_csv('data.csv')
df = df.sample(frac=1).reset_index(drop=True)

df_train = df[:int(len(df) * 0.7)]
df_test = df[int(len(df) * 0.7):]

col = 'text'

# Turn the raw text into TF-IDF features (fitted on the training texts only)
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=50).fit(df_train[col].values)
train_tfidf = tfidf.transform(df_train[col].values)
test_tfidf = tfidf.transform(df_test[col].values)

# Discriminator data: training samples are labeled 0, test samples are labeled 1
train_test = np.vstack([train_tfidf.toarray(), test_tfidf.toarray()])
labels = np.array([0] * len(df_train) + [1] * len(df_test))
lgb_data = lgb.Dataset(train_test, label=labels)

params = {
    'objective': 'binary',  # binary objective so the model outputs train-vs-test probabilities
    'max_bin': 10,
    'learning_rate': 0.01,
    'boosting_type': 'gbdt',
    'metric': 'auc',
}

# Cross-validated AUC of the discriminator: values near 0.5 mean the two sets
# are hard to tell apart, i.e. their distributions are similar
result = lgb.cv(params, lgb_data, num_boost_round=100, nfold=3)
print(pd.DataFrame(result))
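
To fill the gap mentioned above, here is a minimal sketch of how the Top N training samples most similar to the test set could be extracted. It reuses the variables defined in the code above, but trains a single model with lgb.train instead of lgb.cv (cross-validation does not return one final model); the value of N and the variable names are my own choices, not from the original post.

# Train one discriminator on the full train-vs-test data
booster = lgb.train(params, lgb_data, num_boost_round=100)

# Probability that each *training* sample looks like a test sample
train_scores = booster.predict(train_tfidf.toarray())

N = 100  # arbitrary size for the carved-out validation set
top_n_idx = np.argsort(train_scores)[::-1][:N]

df_val = df_train.iloc[top_n_idx]            # most test-like training samples
df_train_rest = df_train.drop(df_val.index)  # what remains for actual training
print(df_val.head())

The overfitting caveat from the previous section still applies: keep the discriminator modest (few boosting rounds, shallow trees), otherwise the scores on the training set degenerate and the ranking becomes uninformative.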
