Predicting Clicks on Ads at Facebook

background

Unlike search advertising, the CTR estimation work at Facebook does not depend on a search query: CTR is estimated from the targeted audience and its interests.

model

Features are encoded by GBDT and then classified by LR. The model structure is shown below.

GBDT encodes features effectively, transforming raw feature vectors into compact binary (0/1) vectors through supervised learning, as shown in the figure above. For example, suppose the first subtree has three leaf nodes and the second has two; if a sample falls into leaf 2 of the first subtree and leaf 1 of the second, the feature is encoded as [0, 1, 0, 1, 0].
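As a minimal sketch, the encoding of this toy example can be written out directly (the function below is purely illustrative and not part of the paper):

```python
# Illustrative only: concatenate one one-hot vector per tree, marking the leaf
# the sample falls into.
def encode_leaves(leaf_indices, n_leaves_per_tree):
    encoding = []
    for leaf, n_leaves in zip(leaf_indices, n_leaves_per_tree):
        one_hot = [0] * n_leaves
        one_hot[leaf] = 1
        encoding.extend(one_hot)
    return encoding

# First subtree has 3 leaves, second has 2; the sample lands in leaf 2 of the
# first subtree and leaf 1 of the second (0-based indices 1 and 0).
print(encode_leaves([1, 0], [3, 2]))  # [0, 1, 0, 1, 0]
```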

For the LR stage, the paper compares SGD-based LR with BOPR (a Bayesian online learning scheme for probit regression) and explores different learning-rate schedules. This note skips that part and does not summarize it.

When GBDT and LR are used together, they are trained separately: the GBDT is trained first, and its leaf encodings are then used as features to train the LR, so the LR must be retrained whenever the GBDT is updated. Because boosting trains trees sequentially, the GBDT cannot be updated online in real time, while the LR can. The paper also covers how the LR model is kept up to date.
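A rough sketch of this two-stage training with scikit-learn, following the same structure (the synthetic data and hyper-parameters are placeholders, not the paper's setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
# Separate splits so the LR is not fit on leaf encodings the GBDT has memorized.
X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(X, y, test_size=0.5, random_state=0)

# Stage 1: train the GBDT offline (retrained infrequently).
gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
gbdt.fit(X_gbdt, y_gbdt)

# Stage 2: map each sample to the leaf it reaches in every tree and one-hot
# encode the leaf indices to obtain the compact binary feature vector.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(gbdt.apply(X_gbdt)[:, :, 0])          # leaf indices: (n_samples, n_trees)

# Stage 3: train the LR on the encoded features; only this stage would be
# refreshed online.
lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.transform(gbdt.apply(X_lr)[:, :, 0]), y_lr)
```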

Model experiment

Training data

The paper uses one week of online data for training, splits the data offline, and then simulates the online data flow during training. The training label indicates whether the ad was clicked, i.e. $y_i \in \{-1, +1\}$.

Model evaluation metrics

  1. NE (Normalized Entropy)


$$NE = \frac{-\frac{1}{N} \sum_{i=1}^{N}\left(\frac{1+y_{i}}{2} \log p_{i}+\frac{1-y_{i}}{2} \log \left(1-p_{i}\right)\right)}{-\left(p \log p+(1-p) \log (1-p)\right)}$$

Here $y_i$ is the training label, $p_i$ is the model's estimated CTR, and $p$ is the average CTR of the training data (the background CTR). The smaller the NE, the better. The numerator is the average cross-entropy loss, and the denominator normalizes it by the entropy of the background CTR: when $p$ is close to 0 or 1, the denominator is small and the NE value is amplified. The motivation is that when $p$ is close to 0 or 1, a low log loss is easy to achieve even by a trivial model, so dividing by the background entropy gives a fairer measure of what the model has actually learned.
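For concreteness, a small helper computing NE under the conventions above (labels in {-1, +1}, predictions as click probabilities); the toy check only illustrates that predicting the background CTR gives NE ≈ 1:

```python
import numpy as np

def normalized_entropy(y, p):
    """Average cross-entropy loss normalized by the entropy of the background CTR."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    ce = -np.mean((1 + y) / 2 * np.log(p) + (1 - y) / 2 * np.log(1 - p))
    avg_ctr = np.mean((1 + y) / 2)  # background CTR of the data
    background = -(avg_ctr * np.log(avg_ctr) + (1 - avg_ctr) * np.log(1 - avg_ctr))
    return ce / background

y = np.array([1, -1, -1, -1])                                  # background CTR = 0.25
print(normalized_entropy(y, np.full(4, 0.25)))                 # ~1.0, trivial predictor
print(normalized_entropy(y, np.array([0.9, 0.1, 0.1, 0.1])))   # < 1.0, better predictor
```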

  2. Calibration


$$calibration = \frac{\text{average estimated CTR}}{\text{empirical CTR}}$$

Calibration is an important metric because accurate, well-calibrated CTR predictions are critical to online bidding and auctions. The closer the calibration is to 1, the better the model.
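In the same toy setting, calibration is just the ratio defined above:

```python
import numpy as np

def calibration(y, p):
    """Average estimated CTR divided by empirical CTR; 1 is ideal."""
    clicks = (1 + np.asarray(y, dtype=float)) / 2   # map {-1, +1} labels to {0, 1}
    return np.mean(p) / np.mean(clicks)

y = np.array([1, -1, -1, -1])               # empirical CTR = 0.25
print(calibration(y, np.full(4, 0.25)))     # 1.0, well calibrated on average
print(calibration(y, np.full(4, 0.50)))     # 2.0, over-predicting CTR
```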

The experimental results

The experimental results are shown below: GBDT+LR reduces NE by more than 3% compared with using GBDT or LR alone.

Real-time model training

The figure below shows how prediction quality changes over time: as the gap between the training data and the serving traffic grows, NE rises gradually, so the model needs to be trained and updated in (near) real time.

Because boosting trains trees sequentially and the data volume is large, the GBDT model cannot be retrained within 24 hours, so only the LR model is updated in real time. The data-flow logic for real-time model updates is shown below.

The online data joiner integrates data from different streams in near real time to build the sample features, and finally joins them with the click stream to form complete labeled samples. A few points are worth noting:

  1. Waiting window: a click is attributed to an impression only if it arrives within a fixed window after the impression. If the window is too short, some clicks have not yet arrived and the sample CTR is lower than the true value; if it is too long, the training delay grows, so the window length is a trade-off. The resulting sample CTR is therefore always somewhat lower than the true CTR.
  2. Ad impressions and ad clicks are joined online with a distributed stream-to-stream join, using the request ID as the join key. A request ID is generated each time a user performs an action that refreshes the content shown to them. Facebook uses a HashQueue to cache impressions, which works together with the waiting window (a simplified sketch follows this list).
  3. The online data joiner has a protection mechanism that monitors whether the click-through rate of the online data is anomalous; if it is, training is stopped so that the model is not corrupted.
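A highly simplified, single-process sketch of the waiting-window join described in points 1 and 2; the real online joiner is a distributed streaming system, and the class, names, and window length below are purely illustrative assumptions:

```python
import collections

WAITING_WINDOW = 300  # seconds; too short under-counts clicks, too long adds delay

class ImpressionClickJoiner:
    def __init__(self, window=WAITING_WINDOW):
        self.window = window
        # request_id -> (features, timestamp); stands in for the impression cache
        self.pending = collections.OrderedDict()

    def on_impression(self, request_id, features, ts):
        self.pending[request_id] = (features, ts)

    def on_click(self, request_id):
        # A click joined within the window turns the impression into a positive sample.
        if request_id in self.pending:
            features, _ = self.pending.pop(request_id)
            return (features, +1)
        return None  # click arrived after the impression expired (treated as lost)

    def flush_expired(self, now):
        # Impressions that outlive the window are emitted as negative samples.
        samples = []
        for rid, (features, ts) in list(self.pending.items()):
            if now - ts > self.window:
                del self.pending[rid]
                samples.append((features, -1))
        return samples
```

Emitting an impression as a negative sample only after the window closes is exactly why the sample CTR is biased slightly below the true CTR: late-arriving clicks are lost.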

Model and data analysis

Importance of features

The choice of features affects both the accuracy and the performance of the model: too few features lead to low accuracy, while too many slow down inference. The paper evaluates the importance of individual features; the results are shown below.

With the features ranked from most to least important, the horizontal axis is the number of features used, the left vertical axis is the absolute importance of each feature, and the right vertical axis is the cumulative importance. Importance is measured with boosting feature importance, i.e. the total reduction in loss attributable to each feature across all splits.
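The cumulative curve can be approximated with scikit-learn, using the impurity-based feature_importances_ of a GBDT as a stand-in for the paper's boosting feature importance (synthetic data and hyper-parameters are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=5_000, n_features=30, n_informative=10, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = np.sort(gbdt.feature_importances_)[::-1]  # ranked high to low
cumulative = np.cumsum(importances)                     # right-hand axis of the figure
for k in (1, 5, 10, 20, 30):
    print(f"top {k:2d} features cover {cumulative[k - 1]:.1%} of total importance")
```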

Historical Features

The paper compares the importance of two types of features: historical features and contextual features. Historical features are defined as previous interactions of the ad or the user, such as the number of clicks the ad received last week or the user's average click-through rate; contextual features describe the current context in which the ad is shown, such as the device the user is using or the page the user is currently on.

The comparison results are as follows

The figure shows, among the top-K features ranked by importance, the proportion that are historical features; the horizontal axis is K and the vertical axis is that proportion. Historical features clearly dominate. The paper's explanation is that historical features summarize past behavior of the ad and the user and therefore carry more information, while contextual features matter mainly for cold start, i.e. new ads and new users.

Training with each type of feature alone, the experimental results are as follows.

Training data downsampling

To control the scale of the training data, downsampling is necessary. The paper compares two methods: uniform subsampling and negative downsampling.

uniform subsampling

The whole data set is sampled directly at rates $\{0.001, 0.01, 0.1, 0.5, 1\}$; the experimental results are shown below.

Using only 10% of the data costs only about a 1% increase in NE compared with using all of it.

negative down sampling

Only negative samples are downsampled, at rates $\{0.1, 0.01, 0.001, 0.0001\}$. The results are as follows.

The results are best at a negative sampling rate of 0.025 (somewhat puzzling); one interpretation found in other write-ups is:

At a negative sampling rate of 0.025, the loss is better not only than models trained with lower sampling rates, but also than the model trained with a rate of 0.1. The paper gives no further explanation; the most likely reason is the improvement that comes from mitigating class imbalance.

Negative downsampling shifts the estimated CTR. For example, if the true CTR is 0.1%, after negative downsampling at a rate of 0.01 the observed CTR rises to roughly 10%. Accurate bidding and ROI estimation require the model to output the true (physical) CTR, so the prediction must be re-calibrated after negative sampling to bring the expected CTR back to 0.1%. The correction formula is as follows.


$$q=\frac{p}{p+(1-p)/w}$$

where $q$ is the corrected CTR, $p$ is the CTR estimated by the model trained on downsampled data, and $w$ is the negative sampling rate.
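A one-line helper applying this correction, with the worked example from the text (a downsampled-model prediction around 9-10% maps back to roughly 0.1% when $w = 0.01$):

```python
def recalibrate_ctr(p, w):
    """Map a CTR predicted on negatively downsampled data back to the true scale.
    p: predicted CTR on the downsampled distribution, w: negative sampling rate."""
    return p / (p + (1 - p) / w)

# A true CTR of ~0.1% looks like ~9-10% after keeping only 1% of negatives (w = 0.01);
# the correction brings it back to the physical scale.
print(recalibrate_ctr(0.091, 0.01))  # ~0.001, i.e. about 0.1%
```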

conclusion

The paper's use of GBDT to encode features is very interesting, and other feature-encoding methods are worth looking into later. GBDT and LR are trained and updated separately, which is of great significance for engineering practice. The paper also runs experiments on feature selection, feature importance, sampling methods, and sampling rates, all of which can serve as references for future model optimization.

Resources

  1. Review the classic Facebook CTR prediction model
  2. Summary of gradient lifting tree (GBDT) principle