The principle of GBDT + LR

Algorithm principle:

Predicting Clicks on Ads at Facebook, 2014

Quinonero.net/Publication…

A binary classification model built with the idea of stacking, used for binary classification problems

The features are combined by GBDT and then passed into a linear classifier

LR classifies input data generated by GBDT (using L1 regularization to prevent overfitting)

Gradient Boosting Decision Tree

• Regression Decision Tree

GBDT consists of multiple CART regression trees, and the cumulative results of all trees are taken as the final result

The target that GBDT fits is a gradient value (a continuous real number), so the trees in GBDT are regression trees.

GBDT is used for regression prediction, and can also be used for classification after adjustment

• Gradient Iterative Boosting

Boosting means iteration: multiple trees are built iteratively and jointly make the final decision

The loss function is used to evaluate model performance (generally, goodness of fit plus a regularization term). The best approach is to make the loss function decrease along its gradient direction

Gradient Boosting principle

In Gradient Boosting, each new model is built along the gradient descent direction of the loss function of the previous model

The final result of GBDT is the sum of the outputs of all trees => each new tree learns the residual between the target and the sum of the outputs of all previous trees

For example, person A's true age is 24. The first tree predicts 18, so the residual is 6 => the second tree takes 6 as A's target and predicts 5, leaving a residual of 1 => the third tree takes the residual 1 as A's target
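A minimal NumPy/scikit-learn sketch of this residual-fitting loop (the data and tree depth are made up for illustration; for squared loss the negative gradient is exactly the residual):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: predict age from two made-up features (values are illustrative only).
X = np.array([[1.0, 20.0], [2.0, 30.0], [3.0, 42.0], [4.0, 25.0]])
y = np.array([24.0, 26.0, 35.0, 28.0])

n_trees = 3
pred = np.zeros_like(y)      # the ensemble starts from a prediction of 0
trees = []

for _ in range(n_trees):
    residual = y - pred                              # what the current sum still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    pred += tree.predict(X)                          # each tree corrects the previous sum

print(pred)  # the summed outputs of all trees approximate y
```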

GBDT is used for new feature construction

Using GBDT to construct new features:

When the trained GBDT makes predictions, the output used is not the final binary probability; instead, for each tree in the model, the position of the leaf node the sample falls into is recorded as 1 => these leaf indicators form the new training data

The figure on the right shows two decision trees with a total of five leaves

If an instance falls into the second leaf node of the first decision tree and the first leaf node of the second tree, then among the first three leaf positions the second bit is set to 1, and among the last two leaf positions the first bit is set to 1. Concatenating all the feature vectors gives [0,1,0,1,0]

GBDT is a combination of many trees. Suppose there are k trees (T1, T2, …, Tk) and the numbers of leaf nodes in the trees are n1, n2, …, nk respectively; then GBDT outputs a vector of dimension n1 + n2 + … + nk
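A minimal scikit-learn sketch of the leaf-encoding pipeline described above, using GradientBoostingClassifier as a stand-in GBDT and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# 1. Train the GBDT.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)
gbdt.fit(X, y)

# 2. apply() returns, for each sample, the index of the leaf it falls into in each
#    tree; one-hot encoding these indices gives the sparse 0/1 leaf features.
leaves = gbdt.apply(X)[:, :, 0]          # shape: (n_samples, n_trees)
encoder = OneHotEncoder()
leaf_features = encoder.fit_transform(leaves)

# 3. Train LR on the leaf features, with L1 regularization to limit overfitting.
lr = LogisticRegression(penalty="l1", solver="liblinear")
lr.fit(leaf_features, y)
print(lr.predict_proba(leaf_features[:5])[:, 1])
```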

Evaluation metrics: NE, Calibration, AUC

NE, Normalized Cross-Entropy

NE = the average log loss per impression, divided by the average log loss obtained by always predicting the background CTR (i.e., the entropy of the background CTR)

p denotes the average empirical CTR of the training data, i.e., the background CTR. The normalization makes NE insensitive to the background CTR, and the smaller the NE value, the better the prediction
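For reference, this is the definition given in the Facebook paper, with labels $y_i \in \{-1, +1\}$, predicted click probability $p_i$, and background CTR $p$:

$$NE = \frac{-\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1+y_i}{2}\log p_i + \frac{1-y_i}{2}\log(1-p_i)\right)}{-\left(p\log p + (1-p)\log(1-p)\right)}$$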

Calibration

Calibration is the ratio of the average estimated CTR to the actual (empirical) CTR. The closer this value is to 1, the better the prediction.
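In symbols:

$$\text{Calibration} = \frac{\text{average estimated CTR}}{\text{empirical CTR}}$$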

AUC

AUC is a good measure of ranking quality, but it does not reflect calibration; that is, if we want accurate probability values rather than only the best ranking, NE or Calibration should be used

E.g., if the model's estimates are all twice the actual values, NE increases; multiplying every prediction by 0.5 fixes the calibration, but the AUC stays the same

TPRate, FPRate computation
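For reference, the standard definitions in terms of confusion-matrix counts:

$$TPRate = \frac{TP}{TP + FN}, \qquad FPRate = \frac{FP}{FP + TN}$$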

ROC curve, the horizontal axis is FPRate, the vertical axis is TPRate

AUC= area under the ROC curve

If TPRate = FPRate, i.e., the line y = x, then regardless of whether a sample's true label is 1 or 0, the classifier predicts 1 with equal probability

The classifier has no ability to distinguish positive samples from negative samples, and AUC=0.5

If TPRate > FPRate, i.e., y > x, then AUC > 0.5 and the model has classification ability. In the extreme case AUC = 1, TPRate = 1 and FPRate = 0, i.e., all positive samples are predicted correctly and no negative sample is misclassified as positive
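A quick scikit-learn check of these properties on toy data (the labels and scores below are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted click probabilities, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.65, 0.8, 0.3, 0.45, 0.7, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under that curve
print(f"AUC = {auc:.3f}")

# Rescaling every prediction (e.g. multiplying by 0.5) changes NE and Calibration
# but keeps the ranking, so the AUC stays the same.
print(np.isclose(roc_auc_score(y_true, 0.5 * y_score), auc))  # True
```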

Wide & Deep model

• Deep model

Features used by the Deep model: continuous features, and discrete features after Embedding

Using the feedforward network model, the features are first converted into low-dimensional dense vectors as the input of the first hidden layer, solving the dimension explosion problem

The final loss is back-propagated to update the model. The embedding vectors are randomly initialized, and the hidden layers usually use the ReLU activation function

Features used by the Wide model: composite features generated by the Cross Product Transformation; however, composite features that never appear in the training set cannot be learned
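A small sketch of what the Cross Product Transformation can look like for categorical fields; the dict-based sample format and the field names are illustrative assumptions:

```python
from itertools import combinations

def cross_product_features(sample, fields):
    """Build pairwise AND (cross-product) features from categorical fields.

    sample: dict mapping field name -> category value (hypothetical format).
    Returns a dict of binary composite features set to 1 for this sample."""
    crossed = {}
    for f1, f2 in combinations(fields, 2):
        crossed[f"{f1}={sample[f1]}_AND_{f2}={sample[f2]}"] = 1
    return crossed

# e.g. a sample with gender=female and language=en produces the composite
# feature "gender=female_AND_language=en" = 1
print(cross_product_features({"gender": "female", "language": "en"},
                             ["gender", "language"]))
```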

Joint training of Wide and Deep

Wide & Deep joint prediction:

$$P(Y=1\mid \mathbf{x}) = \sigma\left(\mathbf{w}_{wide}^{T}\,[\mathbf{x},\ \phi(\mathbf{x})] + \mathbf{w}_{deep}^{T}\, a^{(l_f)} + b\right)$$

$\phi(\mathbf{x})$ represents the Cross Product Transformation of $\mathbf{x}$

$\mathbf{w}_{wide}$ represents the weight vector of the Wide model

$\mathbf{w}_{deep}$ represents the weight vector of the Deep model, applied to its final activations $a^{(l_f)}$

$\sigma(\cdot)$ produces the final prediction of the joint network, and $b$ is the bias term
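A minimal tf.keras sketch of this joint structure; the layer sizes, vocabulary size, feature layout, and the Adam optimizer are assumptions for illustration (the paper trains the wide part with FTRL + L1 and the deep part with AdaGrad):

```python
import tensorflow as tf

n_wide = 1000      # dimension of the cross-product / one-hot wide features (assumed)
n_fields = 10      # number of categorical fields fed to the deep part (assumed)
vocab_size = 500   # vocabulary size shared by the fields (assumed)

wide_in = tf.keras.Input(shape=(n_wide,), name="wide_features")
deep_in = tf.keras.Input(shape=(n_fields,), dtype="int32", name="deep_feature_ids")

# Deep part: embed sparse ids into dense vectors, then a ReLU feed-forward net.
emb = tf.keras.layers.Embedding(vocab_size, 16)(deep_in)      # (batch, n_fields, 16)
deep = tf.keras.layers.Flatten()(emb)
deep = tf.keras.layers.Dense(128, activation="relu")(deep)
deep = tf.keras.layers.Dense(64, activation="relu")(deep)

# Joint output: sigmoid over the wide features and the deep activations plus a bias.
joint = tf.keras.layers.concatenate([wide_in, deep])
output = tf.keras.layers.Dense(1, activation="sigmoid")(joint)

model = tf.keras.Model(inputs=[wide_in, deep_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.summary()
```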

FNN model

Most features in CTR are discrete, high dimensional and sparse, and deep learning can be realized only after embedding

FM is used to initialize the embedding layer, that is, each feature corresponds to a bias term and a K-dimensional vector
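A small sketch, assuming a pre-trained FM is available, of how its parameters could seed a Keras embedding layer in the way FNN describes (the shapes and placeholder values are made up):

```python
import numpy as np
import tensorflow as tf

# Hypothetical parameters from a pre-trained FM: for feature i,
# w[i] is its bias term and V[i] its k-dimensional latent vector.
n_features, k = 10000, 8
w = np.zeros((n_features, 1), dtype="float32")                       # placeholder values
V = np.random.normal(scale=0.01, size=(n_features, k)).astype("float32")

# FNN initializes each feature's embedding with [w_i, v_i], i.e. k + 1 numbers.
fm_init = np.concatenate([w, V], axis=1)

emb = tf.keras.layers.Embedding(input_dim=n_features, output_dim=k + 1)
emb.build((None,))              # create the weight matrix
emb.set_weights([fm_init])      # overwrite the random init with FM parameters
```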

NFM (Neural Factorization Machine) model

FNN, Wide & Deep and DeepFM all concatenate the embedded features in the DNN part, without sufficient feature-cross computation.

The NFM algorithm instead crosses the embedded features directly with an element-wise product (Bi-Interaction), then compresses the result through a DNN, and finally combines the output of the linear part and the deep part.
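A small sketch of this element-wise crossing (NFM's Bi-Interaction pooling), assuming the embedded features arrive as a (batch, n_fields, k) tensor:

```python
import tensorflow as tf

def bi_interaction_pooling(embeddings):
    """NFM's Bi-Interaction pooling: compress a (batch, n_fields, k) tensor of
    embedded features into a (batch, k) vector that sums all pairwise element-wise
    products v_i * v_j, via the square-of-sum minus sum-of-squares trick."""
    square_of_sum = tf.square(tf.reduce_sum(embeddings, axis=1))   # (batch, k)
    sum_of_square = tf.reduce_sum(tf.square(embeddings), axis=1)   # (batch, k)
    return 0.5 * (square_of_sum - sum_of_square)
```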

Two ways to combine FM and DNN:

DeepFM, parallel structure, FM and DNN are calculated separately

NFM, serial architecture, takes the results of FM as input to DNN