In real competitions and machine learning projects, many people run into mismatches between their offline and online scores. Setting aside special cases, such as targets that are genuinely unpredictable or patterns that simply do not exist, I would venture that 99% of people never look carefully at their N-fold validation results.

Yes, it’s true!

You might laugh at this. Hold on, let me give you a simple example.

Suppose I have a project, say the recent "Smart Ocean Building" competition, whose evaluation metric is the macro-averaged F1 over three classes. We have two models, both using exactly the same parameters, printing the validation score every 50 iterations:

lgb_model.fit(train_x, train_y, eval_set=eval_set, eval_metric=f1_macro, early_stopping_rounds=150, verbose=50)
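
For reference, the f1_macro passed as eval_metric above is a user-defined function whose definition the original snippet does not show. Here is a minimal sketch, assuming the usual LightGBM custom-metric contract of returning (name, value, is_higher_better); the exact prediction layout varies by LightGBM version, so both common cases are handled:

import numpy as np
from sklearn.metrics import f1_score

def f1_macro(y_true, y_pred):
    # Sketch of a custom eval metric for LightGBM's sklearn API
    # (an assumption, not shown in the original post).
    if y_pred.ndim == 1:
        # older LightGBM: flat class-major array -> (n_classes, n_samples)
        y_pred = y_pred.reshape(-1, len(y_true)).argmax(axis=0)
    else:
        # newer LightGBM: (n_samples, n_classes)
        y_pred = y_pred.argmax(axis=1)
    # LightGBM expects (metric_name, value, is_higher_better)
    return 'f1_macro', f1_score(y_true, y_pred, average='macro'), True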

The offline validation results of the first model are as follows:

[50]    valid_0's f1_macro: 0.92763
[100]   valid_0's f1_macro: 0.931709
[150]   valid_0's f1_macro: 0.934416
[200]   valid_0's f1_macro: 0.931428
[250]   valid_0's f1_macro: 0.927119
[300]   valid_0's f1_macro: 0.927119
Early stopping, best iteration is:
[150]   valid_0's f1_macro: 0.934416

The offline validation results of the second model are as follows:

[50]    valid_0's f1_macro: 0.92763
[100]   valid_0's f1_macro: 0.931709
[150]   valid_0's f1_macro: 0.932416
[200]   valid_0's f1_macro: 0.933428
[250]   valid_0's f1_macro: 0.932119
[300]   valid_0's f1_macro: 0.932119
[350]   valid_0's f1_macro: 0.931119
Early stopping, best iteration is:
[222]   valid_0's f1_macro: 0.93390

When 99% of people encounter this situation, they choose the first model directly, because its f1_macro reaches 0.934416, and they hardly look at the rest of the log. But if we look carefully, we find that the second model is much more stable than the first, and much of the time the second model's online performance is also better. Why?

The bottom line is: we already cheated during validation! How so?

Because we are searching for the best stopping position with the validation labels already given; in other words, the stopping position we pick is the optimal number of iterations with all other parameters fixed. How could we ever know that in practice?

Many competitors have had this experience: when out of ideas and unwilling to waste time, they submit predictions from different iteration counts, especially in cases like the first one above, where the iteration count matters a great deal. This is exactly offline cheating. It amounts to submitting more than 300 times, receiving real feedback each time, and then picking the best result as your validation score, which is completely impossible under the submission limits of a real competition. Across those many hypothetical submissions, the second model beats the first: nearly every one of its later checkpoints scores higher than the first model's. So the second result will very likely be better than the first, and that is why, in practice, the second model's predictions are the more reliable ones to submit.
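
To see the bias numerically, here is a tiny self-contained simulation with made-up numbers (not from the competition): draw 300 noisy validation scores around a true score of 0.93 and compare an honest average against a "pick the best checkpoint" strategy.

import numpy as np

rng = np.random.default_rng(0)
true_score = 0.93
# 300 checkpoint scores = true score + validation noise (assumed scale)
checkpoints = true_score + rng.normal(0, 0.003, size=300)
print('honest estimate (mean):', checkpoints.mean())   # ~0.930
print('best checkpoint  (max):', checkpoints.max())    # noticeably higher

The maximum sits systematically above the true score; that inflated number is exactly what you report when early stopping picks the best iteration on the same validation set you score against.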

At this point a lot of people will say: in my experience, when the 5-fold score improves, the online score basically improves too, so there isn't much of a problem. True enough. After all, the situation above only arises when the validation metric is sensitive enough that a few more or a few fewer iterations visibly change the result, which is relatively rare; that is why, most of the time, a 5-fold improvement also shows up as an online improvement. But many people have also found that in the ocean competition, 5-fold validation is much worse than 10-fold or 8-fold, which beat 5-fold 90% of the time in both the preliminary and the semi-final rounds. Why?

  • Because it uses more data
  • Because of luck
  • It just seems to work out that way in this case
  • This is a trick I discovered

Actually, the reason is simple: our validation framework was wrong from the start. Most of the time we find that the correlation is still high (if the 5-fold score rises offline, the online score very likely rises too), so we never bother searching for the optimal N. So how do we fix this validation scheme?

— Make it two layers!

from sklearn.model_selection import KFold

def nfold_model(X, y):
    # Inner layer: N-fold training on the given data, returning an
    # ensemble of N models (NFoldModel is a placeholder).
    return NFoldModel(X, y)

def fivefold_validation(X, y):
    # Outer layer: 5-fold validation of the inner N-fold ensemble.
    for tr_ind, val_ind in KFold(n_splits=5).split(X):
        X_tr, X_val = X[tr_ind], X[val_ind]
        y_tr, y_val = y[tr_ind], y[val_ind]
        nfold = nfold_model(X_tr, y_tr)    # train the inner ensemble
        y_pred = nfold.predict(X_val)      # average of the N inner models
        print(score(y_pred, y_val))        # score() is a placeholder metric

The first (outer) layer is the same as usual; the only change is that the model in the second (inner) layer goes from a single model to an N-fold ensemble. Now, looking at the final 5-fold result, with the inner N-fold ensemble's predictions on each outer validation set, everything is clear at a glance. You will also find another obvious advantage: the gap between offline validation and the online score becomes smaller than before.
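
As a concrete end-to-end sketch of this two-layer scheme with LightGBM and scikit-learn: all names here (train_nfold, nested_validation) and parameter values are illustrative, and the early_stopping_rounds argument in fit follows the older LightGBM API used in the snippet at the top (newer versions configure early stopping via callbacks).

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

def train_nfold(X, y, n_folds=5):
    # Inner layer: each model early-stops on its own inner fold, so the
    # outer validation fold is never used to pick an iteration count.
    models = []
    for tr, va in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        m = lgb.LGBMClassifier(n_estimators=1000)
        m.fit(X[tr], y[tr], eval_set=[(X[va], y[va])],
              early_stopping_rounds=150, verbose=False)
        models.append(m)
    return models

def nested_validation(X, y, outer_folds=5, inner_folds=5):
    # Outer layer: an honest 5-fold estimate of the inner ensemble.
    scores = []
    for tr, va in KFold(n_splits=outer_folds, shuffle=True, random_state=1).split(X):
        models = train_nfold(X[tr], y[tr], n_folds=inner_folds)
        proba = np.mean([m.predict_proba(X[va]) for m in models], axis=0)
        scores.append(f1_score(y[va], proba.argmax(axis=1), average='macro'))
    return np.mean(scores)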

Take the ocean competition for example: if you try the method above offline, you will find that under this second validation scheme, 7 to 10 inner folds work better than 5 folds.
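
With the nested_validation sketch above (same assumptions, X and y being your training features and labels), scanning the inner fold count offline becomes a short loop:

# Compare inner fold counts offline; results will vary by dataset.
for n in (5, 7, 8, 10):
    print(n, nested_validation(X, y, inner_folds=n))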

-END-
