2016 Bid-Aware Gradient Descent for Unbiased Learning with Censored Data in Display Advertising
The work of this paper is highly relevant to the work of 2014 Optimal real-time Bidding for Display Advertising. It is suggested to read the paper in 2014 first and then the paper in 2016. The paper notes of 2014 have also been added, see: Paper Reading (011)
background
CTR estimation, bid optimization and other predictions are operated with the full number of bids in the pre-bidding stage, but the training data is collected after the bidding, which leads to inconsistent distribution of training data and online test data. In view of the above problems, this paper has the following work
- Survival Models were used to solve the probability of winning bidding (used to reconstruct biased training data to the overall distribution)
- CTR estimation and bid optimization are modified on the basis of work 1
The training data is skewed
In the bidding process, only one advertiser won, so only that advertiser’s request received feedback, including whether to click, whether to convert, market price value. This results in an inconsistent distribution of data from ADX (training data) and data from bid request (test data), as shown in the figure below
The bidding process is equivalent to a dynamic data filter, depending on the bid value. Because failed bidding requests cannot be exposed and cannot appear in training data dependent on the model, the whole training data is biased compared with full-volume data.
Probability of winning an auction
In the bidding process, the following concepts are defined
- X: Characteristics of the request (including: user characteristics, advertising characteristics)
- Bxb_ XBX: Bid on the request
- Y: User feedback, whether users click or convert after winning the bidding
- Z: Market price: the price charged after bidding, defined in this article as lower than the bid price
- P (win ∣ x, bx) p (win | x, b_x) p (win ∣ x, bx) : during the bid for bxb_xbx winning probability
- Qx (x) Q_x (x) Qx (x) : exposure data distribution
- Px (x)p_x(x) PX (x) : Distribution of bidding data
Therefore, exposure data (x,y,z){(x, y,z)}(x,y,z) has the following relationship, which represents the offset between training data q_x(x) and bidding data data. We need to train on the training data, test on the bidding data, so solve for the offset between the two distributions, P(win∣x,bx)⏟auction Selection \underbrace{P\left(\ operatorName {win} \mid \boldsymbol{x}, B_ {\ boldSymbol {x}}\right)}_{\text {auction selection}} auction selection P(win∣x,bx)
In this paper, we use the Survial Model to solve p (win ∣ x, bx) p (win | x, b_x) p (win ∣ x, bx), PZX (z) p_ {} z ^ {\ boldsymbol {x}} (z) PZX (z) said the market price for the winning probability when z, The probability of winning the auction is the probability that market price Z is lower than the bid bXB_XBX:
To simplify the problem, assume that the market price distribution is at the advertiser granularity rather than the exposure granularity (find a W (bx)w\left(b_{\ boldSymbol {x}}\right)w(bx) for the advertiser, so estimate the corresponding pz(z)p_{z}(z)pz(z) for an advertiser, Thus w(bx)w\left(b_{\ boldSymbol {x}}\right)w(bx).
Assuming that there is no problem of data censorship, i.e. an advertising agent wins all bids and knows the market price, wo(bx)w_o\left(b_{\boldsymbol{x}}\right)wo(bx), The calculation is as follows
Which when z < BXZ < b_ {\ boldsymbol {x}} z at the delta (z < < bx bx) = 1, delta, left (z < b_ {\ boldsymbol {x}} \ right) delta (z < = 1 bx) = 1, otherwise 0.
The real problem is that advertisers can’t get market price if they don’t win the bidding
In bidding, imagine an advertiser engaging N bids, denoted as N tuples ⟨ BI, WI, ZI ⟩ I =1… N \ left \ langle b_ {I}, w_ {I}, z_ {I} \ \ right rangle_ {I = 1 \ ldots N} ⟨ bi, wi, zi ⟩ I = 1… N, where BIB_ibi indicates the size of the bid, wiw_iwi indicates whether the bid was won, and ziz_izi indicates the market price.
⟨ BJ, DJ, NJ ⟩ J =1 for converting the above format data in the log, ⟨ BJ, DJ, NJ ⟩ J =1… M \ left \ langle b_} {j, d_} {j, n_ {j} \ \ right rangle_ {j = 1 \ ldots M} ⟨ bj, DJ, nj ⟩ j = 1… M, i.e., bi < bj + 1 b_i < b_ {j + 1} bi < bj + 1
- Djd_jdj indicates that market price is the number of bids won by BJ − 1B_J-1BJ −1
Represents bid for
The number of unsuccessful bids, including- The number of winning bids at a market price not lower than Bj-1Bj-1Bj-1
- The bid price is not less than the number of failed bids bj− 1b_J-1BJ −1.
So when the bid is BXB_XBX, the probability of losing the AD auction is (survival models solved)
So when the bid is BXB_XBX, the probability of winning the AD auction is
Will ⟨ bi, wi, zi ⟩ I = 1… N \ left \ langle b_ {I}, w_ {I}, z_ {I} \ \ right rangle_ {I = 1 \ ldots N} ⟨ bi, wi, zi ⟩ I = 1… N ⟨ BJ, DJ, NJ ⟩ J =1… M \ left \ langle b_} {j, d_} {j, n_ {j} \ \ right rangle_ {j = 1 \ ldots M} ⟨ bj, DJ, nj ⟩ j = 1… M, and calculate w(bx)w\left(b_{\ boldSymbol {x}}\right)w(bx) as follows
It can be seen that wo(bx)w_{O}\left(b_{boldSymbol {x}}\right) WO (bx) has a larger value than W (bx)w\left(b_{boldSymbol {x}}\right) W (bx), that is, the training data moves to the right relative to the test data.
CTR estimates
Given training data D=(x,y,z)D={(x, y,z)}D=(x,y,z), where X obeys the training data distribution qx(x)q_x(x)qx(x), an unbiased supervised learning model requires that the loss function be minimized on the test data distribution Px (x)p_x(x)px(x).
Minimize loss function on training data
Where fθ(x)f_{\ boldSymbol {\theta}}(\ BoldSymbol {x})fθ(x) represents the prediction result of the model with parameters θ\thetaθ, L (y, theta f (x)) \ mathcal {L} \ left (y, F_ {\ boldsymbol {\ theta}} (\ boldsymbol {x}) \ right) L (y, theta f (x)) to predict the results Theta f (x) f_ {\ boldsymbol {\ theta}} (\ boldsymbol {x}) theta f (x) and the loss function of real tag is calculated in the y, Φ (theta) \ Phi (\ boldsymbol {\ theta}) Φ (theta) is a parameter regularization.
The following relationship exists
That is, loss represented by training set distribution X is transformed into Loss representation of test set distribution X. The above amendments have the following properties
- For the low probability bid bXB_XBX, 1 – ∏ bj < bxni – dini ‾ 1 – \ prod_ {b_ {j} < b_ {x}} \ frac {n_ {I} – d_ {I}} {\ underline {n_ {I}}} – 1 ∏ bj < bxnini – di in training on the estimated value is low, but the corresponding gradient is bigger, The gradient from bidding data should be higher
- If the bidding is very high, which results in a 100% probability of winning the auction, such data can be observed from the real data distribution. Therefore, this data will not be gradient reweighted (weight 1)
The adjustment of winning probability and weight is shown in the figure
Offer to optimize
CTR f(x)f(x)f(x) f(x) has been obtained through the model, and the bidding price b(f(x))b(f(x))b(f(x)). For T times of auction, when the budget is B, the optimization objective of bidding optimization is. Optimal real-time bidding for display advertising, in which budget constraint is set to ≤\ LEq ≤ and ===
Subject to T ∫ xb w (f (x)) (f (x)) (b) p (x) dx = BT \ int_ {\ boldsymbol {x}} b (f (\ boldsymbol {x})) w (b (f (\ boldsymbol {x}))) (\ boldsymbol p_ {x} {x}) d \ boldsymbol {x} = BT ∫ xb w (f (x)) (f (x)) (b) p (x) dx = b
W (b(f(x)) w(b(f(boldSymbol {x})) w(b(f(x)))w(b(f(f(x)))w(b(f(x))) indicates a given bid request x→\mathbf{x} \rightarrowx→ estimated CTR f→f \rightarrowf→ Bidding function b→ B \rightarrowb→ winning probability WWW
The actual data distribution is qx(x)q_x(x)qx(x), so transform the above equation and get as follows
Subject to T ∫ xb w (f (x)) (f (x)) (b) qx (x) w (bx) dx = BT \ int_ {\ boldsymbol {x}} b (f (\ boldsymbol {x})) w (b (f (\ boldsymbol {x}))) (\ \ frac {q_ {x} boldsymbol {x})} {w \ left (b_ {\ boldsymbol {x}} \ right)} d \ boldsymbol {x} = BT ∫ xb w (f (x)) (f (x)) (b) w (bx) qx dx = b (x)
The Lagrange multiplier method is used to express the problem as unconstrained
Euler-lagrangian conditions:
Qx f (x) (x) w (bx) partial w partial (f (x)) (b) b – lambda qx (f (x)) (x) w (bx) [w (b) (f (x)) f (\ boldsymbol {x}) \frac{q_{x}(\boldsymbol{x})}{w\left(b_{\boldsymbol{x}}\right)} \frac{\partial w(b(f(\boldsymbol{x})))}{\partial b(f(\boldsymbol{x}))}-\lambda (\ \ frac {q_ {x} boldsymbol {x})} {w \ left (b_ {\ boldsymbol {x}} \ right)} [w (b (f (\ boldsymbol {x}))) f (x) w (bx) qx (x) (f (x)) partial w partial b (b) (f (x)) – Lambda w (bx) qx (x)/w (b) (f (x)) + b (f (x)) partial w partial (f (x)) (b) b (f (x)] = 0 \ left. + b (f (\ boldsymbol {x})) \ frac {\ partial W (b (f (\ boldsymbol {x})))} {\ partial b (f (\ boldsymbol {x}))} \ right] = 0 + b (f (x)) (f (x)) partial w partial b (b) (f (x))) = 0 ⇒ lambda w (b) (f (x)) = [f (x) – lambda b (f (x)] partial w partial (f (x)) (b) b (f (x)), \ Rightarrow \ lambda w (b (f (\ boldsymbol {x}))) = [f (\ boldsymbol {x}) – \ lambda b(f(\boldsymbol{x}))] \frac{\partial w(b(f(\boldsymbol{x})))}{\partial B (f (\ boldsymbol {x}))}, ⇒ lambda w (b) (f (x)) = [f (x) – lambda b (f (x)] (f (x)) partial w partial b (b) (f (x)),
Optimal bidding function bORTBf(x)b_{ORTB}f(x)bORTBf(x) depends on winning Function W (b)w(b)w(b) W (b), e.g
Where C is a constant, so the optimal bid is
For the above unconstrained problem λ\lambdaλ, expressed by the Lagrange multiplier method, the euler-Lagrangian condition can be expressed as follows:
The following problem is convex.
Update using BGD(Bid-Aware Gradient Descent)
The above formula has the following properties
- The sample is re-weighted, as is done in the CTR estimation
- A small bid bXB_XBx produces a large weight, indicating its importance in the training data
- The historical bid of the training instance also has an effect on the gradient direction
- Ratio B / ∣ ∣ B/D | D | B / ∣ D ∣ will ensure that the budget evenly in the new offer
Observe historical bids and corresponding gradients as if to draw conclusions
- In the two parts of the data, bids 50 and 100, when historical bids are small, the gradient is positive –> λ\lambda lambda becomes large –> bids decrease
- For a data instance with a low historical bid bXB_XBX, the probability of observing the data instance is low, that is, there are no similar data instances in the training set
- If b (f (x), lambda) b (\ lambda, f (x)) (f (x), lambda) high b, b is the best offer (f (x), lambda) b (f (x), \ lambda) b (f (x), lambda) should be reduced, so as to avoid budget overruns in the full amount of data
The experiment
The experimental setup
The offline experiment data included iPinYou and TukMob, and the online experiment was conducted on Yahoo. The process of the experiment is shown below, where a truth-telling bidding strategy is executed to simulate the historical bidding process and generate win (marked but biased) impression data and lost (unmarked) bid request data.
The execution strategies for CTR estimation and bid optimization are
- Bias: CTR estimation and bid optimization for direct training on biased training data
- UOMP: Use wo(bx)w_{O}\left(B_ {boldSymbol {x}}\right) WO (Bx) to correct the data and use CTR estimation and bid optimization proposed in this paper
- KMMP: use W (bx)w\left(b_{\ boldSymbol {x}}\right) W (bx) to correct the data and use CTR estimation and bid optimization proposed in this paper
- FULL: Bid with high bids, so you can get labels for all bid request data, which is used for CTR estimation and bid optimization
The data distribution of Full and Win in the iPinYou dataset is as follows
Winning Probability Estimation
Before evaluating the actual CTR estimation and bidding optimization tasks, the performance of the comparison models in estimating the probability of winning bids was analyzed. The experimental results are shown in the figure below. The real curve is constructed from all the market price observations in the total trading volume forecast data and is regarded as the real value here
It can be concluded from the picture that there are
- When the bid is 0, the winning probability is 0, and as the bid increases, the winning probability increases, and the winning probability converges to 1
- The FULL curve is the closest to the Truth curve because FULL uses all click data and is unbiased
- UOMP overestimated the probability of winning, and KMMP was closer to Truth than UOMP
The experiment compared the KL divergence and Pearson correlation coefficient of UOMP, KMMP, FULL and Truth. The experimental results are shown in the following table. KMMP has better effect than UOMP and is close to FULL.
CTR Estimation Results
The AUC and Cross Entropy values estimated by CTR are compared in the experiment, and the experimental results are shown in the figure below
It can be concluded from the picture that there are
- UOMP and KMMP with data correction achieved better results than BIAS, indicating the effectiveness of the model in eliminating training data instance BIAS, which made the prediction model have better generalization on the predicted data
- KMMP has better correlation than UOMP, that is, the correction method proposed in this paper has better effect
This paper compares the stability of AUC and cross entropy during learning, as shown in the figure below. The conclusions are as follows: UMOP and KMMP are more stable than BIAS, and FULL is the best.
Bid Optimisation Results
Set the budget ratio to perform offline bid optimization, where the training/test budget is set to 1/64, 1/32, 1/16, 1/8, 1/4, and 1/2 of the total cost of the training/test data set of the budget. You can’t set the ratio to 1, because in that case, people might simply bid infinity to win all the shows and clicks in the data and just spend all the budget.
The following table shows the results at 1/64 and 1/4 bids on the iPinYou dataset; Click values and eCPC values on the TukMob dataset
Conclusion there are
- UMOP and KMMP have better effects than BIAS. The click value in Table 6 and 7 is higher, and the eCPC value in Table 7 is lower.
The figure below shows the percentage increase in clicks, exposure and the percentage decrease in eCPC compared to Bias. Relative Bias click through and exposure increases are both positive, and eCPC drops are both negative (except FULL is positive at 1/2).
Online A/B Testing
The experiment ran A/B tests on Yahoo, dividing traffic and budget by 50% and 50%. The control group was trained with BIAS strategy using gradient ascending tree, and the control group was trained with KMMP strategy using the same model. The experimental results are shown in the figure below (the two tables in the paper are omitted, the information in the figure is more and overlaps with that in the table).
It can be concluded from the picture that there are
- KMMP gets more clicks and fewer exposures, increasing CTR
- The boost comes from KMMP’s ability to reduce overbidding for low traffic
- CTR goes up, eCPC goes down, which means you can get more clicks and lower bid costs
discuss
The theoretical derivation is clear, the experiment is rich, very powerful.
Read the work related to this paper, is this paper bid model modeling paper, can refer to: Paper reading (011).
The resources
- Aside: DSP bidding algorithm
- Optimal real-time bidding for display advertising
- RTB advertising bid strategy