The principle of GBDT + LR
Algorithm principle:
Predicting Clicks on Ads at Facebook, 2014
Quinonero.net/Publication…
A binary classification model built on the idea of stacking
GBDT combines the raw features, and its output is passed to a linear classifier
LR then classifies the input generated by GBDT (with L1 regularization to prevent overfitting)
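A minimal sketch of this stacking pipeline with scikit-learn; the synthetic dataset and all hyperparameters below are illustrative assumptions, not values from the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# 1. Train GBDT to learn feature combinations.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)
gbdt.fit(X, y)

# 2. For every sample, record which leaf it falls into in each tree,
#    then one-hot encode the leaf indices to build the new features.
leaves = gbdt.apply(X)[:, :, 0]          # shape: (n_samples, n_trees)
encoder = OneHotEncoder()
X_leaves = encoder.fit_transform(leaves)

# 3. Feed the one-hot leaf features into LR with L1 regularization.
lr = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
lr.fit(X_leaves, y)
print(lr.predict_proba(X_leaves[:5])[:, 1])
```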
Gradient Boosting Decision Tree
• Regression Decision Tree
GBDT consists of multiple CART regression trees, and the cumulative results of all trees are taken as the final result
The target that GBDT fits is a gradient value (a continuous real number), so the trees in GBDT are regression trees
GBDT is used for regression prediction, and can also be used for classification after adjustment
• Gradient Iterative Boosting
Boosting means iteration: multiple trees are built one after another and jointly make the final decision
The loss function is used to evaluate model performance (generally, fit to the data + a regularization term). The best way to reduce it is to move along the negative gradient of the loss function
Gradient Boosting principle
In Gradient Boosting, each new model is built along the gradient-descent direction of the loss function of the models before it
GBDT's final result is the sum of the outputs of all trees => each tree learns the residual between the target and the sum of the outputs of all previous trees
Example: person A's true age is 24. The first tree predicts 18, so the residual is 6 => the second tree takes 6 as A's target; it predicts 5, leaving a residual of 1 => the third tree takes 1 as A's target, and so on
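A toy sketch of this residual-fitting loop, assuming squared-error loss so the negative gradient is exactly the residual; the data is made up:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # an illustrative feature
y = np.array([24.0, 27.0, 35.0, 41.0])       # true ages

pred = np.zeros_like(y)
trees = []
for _ in range(3):                           # three boosting rounds
    residual = y - pred                      # each new tree fits the remaining residual
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    trees.append(tree)
    pred += tree.predict(X)                  # final prediction = sum of all trees

print(pred, y - pred)                        # residuals shrink after every round
```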
GBDT is used for new feature construction
Using GBDT to construct new features:
When the trained GBDT makes a prediction, the output used is not the final binary probability; instead, for each tree in the model, the leaf node that the sample falls into is recorded as 1 => this is used to construct new training data
The figure on the right shows two decision trees with five leaves in total
Suppose an instance falls into the second leaf of the first tree and the first leaf of the second tree. Then among the first three positions (the leaves of tree 1) the second bit is set to 1, and among the last two positions (the leaves of tree 2) the first bit is set to 1. Concatenating gives the feature vector [0, 1, 0, 1, 0]
GBDT is an ensemble of trees; assuming there are k trees (T1, T2, …, Tk) with n1, n2, …, nk leaf nodes respectively, GBDT outputs a new feature vector of dimension n1 + n2 + … + nk
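A tiny sketch of how the leaf choices above turn into the [0, 1, 0, 1, 0] vector; the 0-based leaf indices are assumed for illustration:

```python
import numpy as np

n_leaves = [3, 2]                 # n1, n2 -> output dimension n1 + n2 = 5
leaf_hit = [1, 0]                 # 0-based leaf index chosen in each tree

vec = np.zeros(sum(n_leaves), dtype=int)
offset = 0
for n, hit in zip(n_leaves, leaf_hit):
    vec[offset + hit] = 1         # set the bit of the chosen leaf
    offset += n

print(vec)                        # [0 1 0 1 0]
```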
Evaluation metrics: NE, Calibration, AUC
NE, Normalized Cross-Entropy
NE = the average log loss per impression, divided by the average log loss that would be obtained if the model predicted the background CTR for every impression
p represents the average empirical CTR of the training data, i.e., the background CTR. This normalization makes NE insensitive to the background CTR, and the smaller the NE value, the better the prediction
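Written out (following the notation used in the Facebook paper, where labels are y_i ∈ {−1, +1}, p_i is the predicted click probability of impression i, N is the number of impressions and p is the background CTR):

```latex
NE = \frac{-\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1+y_i}{2}\log p_i + \frac{1-y_i}{2}\log(1-p_i)\right)}
          {-\bigl(p\log p + (1-p)\log(1-p)\bigr)}
```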
Calibration
Calibration is the ratio of the average estimated CTR to the actual (empirical) CTR. The closer the value is to 1, the better the prediction.
AUC
AUC is a good measure of ranking quality, but it does not reflect calibration; that is, if we want accurate probability estimates rather than just an optimal ranking, NE or Calibration should be used
E.g., if the model's estimates are twice the actual values: NE increases; the calibration is off by a factor of 2 and can be corrected by multiplying the estimates by 0.5; but AUC stays the same
TPRate, FPRate computation
ROC curve, the horizontal axis is FPRate, the vertical axis is TPRate
AUC = the area under the ROC curve
If TPRate = FPRate, i.e., the line y = x, then regardless of whether a sample's true label is 1 or 0, the classifier predicts it as 1 with the same probability
The classifier has no ability to distinguish positive samples from negative samples, and AUC = 0.5
If TPRate > FPRate, i.e., y > x, then AUC > 0.5 and the classifier has discriminative ability. In the extreme case AUC = 1: TPRate = 1 and FPRate = 0, i.e., all positive samples are predicted correctly and all negative samples are correctly classified as 0
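A minimal sketch of computing TPRate, FPRate and AUC with scikit-learn; the labels and scores below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.3, 0.8, 0.6, 0.4, 0.2, 0.7, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # FPRate (x-axis), TPRate (y-axis)
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve

print(tpr, fpr, auc)
```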
Wide & Deep model
• Deep model
Features used by the Deep model: continuous features and discrete features after embedding
A feed-forward network is used: the features are first converted into low-dimensional dense vectors that serve as the input of the first hidden layer, which avoids the dimension-explosion problem
The embedding vectors are randomly initialized and updated by back-propagating the final loss; the hidden layers usually use ReLU activations
• Wide model
Features used by the Wide model: cross features generated by the Cross Product Transformation; however, cross features that never appear in the training set cannot be learned
Joint training of Wide & Deep
The joint prediction sums both parts' outputs under one sigmoid:
P(Y = 1 | x) = σ(W_wide^T [x, φ(x)] + W_deep^T a^(lf) + b)
φ(x) represents the Cross Product Transformation of x
W_wide represents the weight vector of the Wide model
W_deep represents the weights the Deep model applies to its final activations a^(lf)
a^(lf) represents the final hidden-layer output of the neural network, and b is the bias term
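A minimal PyTorch sketch of the structure described above; the field sizes, embedding dimension and hidden-layer sizes are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    """Minimal Wide & Deep sketch (illustrative assumptions, not the paper's exact setup)."""

    def __init__(self, wide_dim, field_dims, emb_dim=8, dense_dim=0, hidden=(64, 32)):
        super().__init__()
        # Wide part: linear model over raw + cross-product features, W_wide^T [x, phi(x)]
        self.wide = nn.Linear(wide_dim, 1)
        # Deep part: one randomly initialized embedding table per sparse field
        self.embeddings = nn.ModuleList(nn.Embedding(n, emb_dim) for n in field_dims)
        layers, in_dim = [], emb_dim * len(field_dims) + dense_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]   # hidden layers use ReLU
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))               # W_deep^T a^(lf) + b
        self.deep = nn.Sequential(*layers)

    def forward(self, wide_x, sparse_x, dense_x):
        emb = [e(sparse_x[:, i]) for i, e in enumerate(self.embeddings)]
        deep_in = torch.cat(emb + [dense_x], dim=1)
        # Joint training: the two parts' logits are summed before one sigmoid
        return torch.sigmoid(self.wide(wide_x) + self.deep(deep_in))

# Illustrative forward pass
model = WideAndDeep(wide_dim=100, field_dims=[1000, 500], emb_dim=8, dense_dim=4)
p = model(torch.rand(32, 100),
          torch.stack([torch.randint(0, 1000, (32,)), torch.randint(0, 500, (32,))], dim=1),
          torch.rand(32, 4))
print(p.shape)   # torch.Size([32, 1])
```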
FNN model
• Most features in CTR prediction are discrete, high-dimensional and sparse; deep learning can only be applied after embedding them
FM is used to initialize the embedding layer, i.e., each feature corresponds to a bias term and a k-dimensional latent vector
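A minimal PyTorch sketch of this idea: the embedding layer is initialized from a hypothetical pre-trained FM's bias terms and k-dimensional latent vectors (fm_w, fm_v are assumed names), then fine-tuned together with the MLP:

```python
import torch
import torch.nn as nn

class FNN(nn.Module):
    """Minimal FNN-style sketch. Assumes fm_w (per-feature bias, shape [num_features])
    and fm_v (latent vectors, shape [num_features, k]) come from a pre-trained FM."""

    def __init__(self, fm_w, fm_v, num_fields, hidden=(64, 32)):
        super().__init__()
        k = fm_v.shape[1]
        # Each feature's embedding is [its FM bias, its k-dim FM vector],
        # used to initialize (not freeze) the embedding layer.
        init = torch.cat([fm_w.unsqueeze(1), fm_v], dim=1)
        self.emb = nn.Embedding.from_pretrained(init, freeze=False)
        layers, in_dim = [], num_fields * (k + 1)
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.Tanh()]
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))
        self.mlp = nn.Sequential(*layers)

    def forward(self, feat_idx):
        # feat_idx: (batch, num_fields) index of the active feature in each field
        x = self.emb(feat_idx).flatten(start_dim=1)
        return torch.sigmoid(self.mlp(x))

# Illustrative usage with a fake pre-trained FM over 1000 features, k = 4
fm_w, fm_v = torch.randn(1000), torch.randn(1000, 4)
model = FNN(fm_w, fm_v, num_fields=3)
print(model(torch.randint(0, 1000, (32, 3))).shape)   # torch.Size([32, 1])
```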
NFM (Neural Factorization Machine) model
In their DNN part, FNN, Wide & Deep and DeepFM all simply concatenate the embedded features, without sufficient feature crossing.
NFM instead crosses the embedded features directly by multiplying them element-wise (Bi-Interaction pooling), compresses the result through a DNN, and finally combines the outputs of the linear part and the deep part.
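A minimal PyTorch sketch of the Bi-Interaction (element-wise product) pooling that NFM inserts between the embedding layer and the DNN:

```python
import torch

def bi_interaction(emb):
    """Bi-Interaction pooling: sum of element-wise products of all embedding pairs.

    emb: (batch, num_fields, k) embedded features.
    Uses the FM identity 0.5 * ((sum v_i)^2 - sum v_i^2) to avoid an O(n^2) loop.
    """
    square_of_sum = emb.sum(dim=1) ** 2
    sum_of_square = (emb ** 2).sum(dim=1)
    return 0.5 * (square_of_sum - sum_of_square)   # (batch, k), fed into the DNN

# Illustrative shapes: 4 fields embedded into k = 8 dimensions
emb = torch.randn(32, 4, 8)
print(bi_interaction(emb).shape)   # torch.Size([32, 8])
```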
Two ways to combine FM and DNN:
DeepFM: parallel structure; the FM part and the DNN part are computed separately and their outputs are combined
NFM: serial structure; the result of the FM-style (Bi-Interaction) layer is fed into the DNN as input