2016 Field-Aware Factorization Machines for CTR Prediction
Background
This paper studies the CTR prediction problem and builds on FM. In CTR data, features of the same nature can be grouped into a field: in the paper's example, ESPN, Vogue, and NBC belong to the Publisher field, while Nike, Gucci, and Adidas belong to the Advertiser field. FFM explicitly models the interactions between fields.
Model
The FM model is as follows (the constant and linear terms are omitted):

$$\phi_{\mathrm{FM}}(\boldsymbol{w}, \boldsymbol{x})=\sum_{j_{1}=1}^{n} \sum_{j_{2}=j_{1}+1}^{n}\left(\boldsymbol{w}_{j_{1}} \cdot \boldsymbol{w}_{j_{2}}\right) x_{j_{1}} x_{j_{2}}$$
After the reduction, it can be evaluated in linear time:

$$\phi_{\mathrm{FM}}(\boldsymbol{w}, \boldsymbol{x})=\frac{1}{2} \sum_{j=1}^{n}\left(\boldsymbol{s}-\boldsymbol{w}_{j} x_{j}\right) \cdot \boldsymbol{w}_{j} x_{j}$$
where

$$\boldsymbol{s}=\sum_{j^{\prime}=1}^{n} \boldsymbol{w}_{j^{\prime}} x_{j^{\prime}}$$
For the example above (Publisher: ESPN, Advertiser: Nike, Gender: Male), the quadratic term of FM expands to

$$\boldsymbol{w}_{\text{ESPN}} \cdot \boldsymbol{w}_{\text{Nike}}\, x_{\text{ESPN}} x_{\text{Nike}} + \boldsymbol{w}_{\text{ESPN}} \cdot \boldsymbol{w}_{\text{Male}}\, x_{\text{ESPN}} x_{\text{Male}} + \boldsymbol{w}_{\text{Nike}} \cdot \boldsymbol{w}_{\text{Male}}\, x_{\text{Nike}} x_{\text{Male}}$$
In FFM, the quadratic term becomes

$$\boldsymbol{w}_{\text{ESPN},\mathrm{A}} \cdot \boldsymbol{w}_{\text{Nike},\mathrm{P}}\, x_{\text{ESPN}} x_{\text{Nike}} + \boldsymbol{w}_{\text{ESPN},\mathrm{G}} \cdot \boldsymbol{w}_{\text{Male},\mathrm{P}}\, x_{\text{ESPN}} x_{\text{Male}} + \boldsymbol{w}_{\text{Nike},\mathrm{G}} \cdot \boldsymbol{w}_{\text{Male},\mathrm{A}}\, x_{\text{Nike}} x_{\text{Male}}$$
For the pair (ESPN, Nike), the latent vector $\boldsymbol{w}_{\text{ESPN},\mathrm{A}}$ is used because Nike belongs to the Advertiser field (and, symmetrically, $\boldsymbol{w}_{\text{Nike},\mathrm{P}}$ is used because ESPN belongs to the Publisher field). In general, the mathematical form of FFM is

$$\phi_{\mathrm{FFM}}(\boldsymbol{w}, \boldsymbol{x})=\sum_{j_{1}=1}^{n} \sum_{j_{2}=j_{1}+1}^{n}\left(\boldsymbol{w}_{j_{1}, f_{2}} \cdot \boldsymbol{w}_{j_{2}, f_{1}}\right) x_{j_{1}} x_{j_{2}}$$
where $j_1$ and $j_2$ index features, and $f_1$ and $f_2$ are the fields that $j_1$ and $j_2$ belong to, respectively.
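To make the definition concrete, below is a minimal Python sketch of the naive pairwise evaluation of $\phi_{\mathrm{FFM}}$ over the non-zero features of one instance; the array layout and toy indices are hypothetical, not the paper's C++ implementation.

```python
import numpy as np

def ffm_phi(x, W):
    """Naive FFM interaction term for one sparse instance.

    x: list of (field, feature, value) triples for the non-zero entries.
    W: array of shape (n_features, n_fields, k) holding the latent vectors w_{j, f}.
    """
    phi = 0.0
    for i in range(len(x)):
        f1, j1, v1 = x[i]
        for l in range(i + 1, len(x)):
            f2, j2, v2 = x[l]
            # latent vector of feature j1 toward field f2, and vice versa
            phi += np.dot(W[j1, f2], W[j2, f1]) * v1 * v2
    return phi

# toy example: 3 features (ESPN=0, Nike=1, Male=2), 3 fields (P=0, A=1, G=2), k=4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 3, 4))
x = [(0, 0, 1.0), (1, 1, 1.0), (2, 2, 1.0)]   # P:ESPN, A:Nike, G:Male
print(ffm_phi(x, W))
```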
Compared with LM, Poly2, and FM (these models are introduced in the paper but omitted here), the number of parameters and the per-instance prediction complexity are summarized below:
model | #variables | complexity |
---|---|---|
LM | $n$ | $O(\bar{n})$ |
Poly2 | $B$ | $O(\bar{n}^2)$ |
FM | $nk$ | $O(\bar{n}k)$ |
FFM | $nfk$ | $O(\bar{n}^2 k)$ |
Here $n$ is the number of features, $\bar{n}$ is the average number of non-zero features per instance, $B$ is the size of the hashing space used by Poly2, $k$ is the length of the latent vectors, and $f$ is the number of fields. Since each latent vector in FFM only needs to learn the interaction with one particular field, in practice $k_{\mathrm{FFM}} \ll k_{\mathrm{FM}}$.
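As a quick numeric illustration of the #variables column, a few lines of Python with hypothetical sizes (the numbers below are made up, not from the paper):

```python
n = 1_000_000          # number of features (hypothetical)
f = 40                 # number of fields (hypothetical)
k_fm, k_ffm = 100, 4   # latent dimensions; FFM gets away with a much smaller k
B = 2 ** 24            # hashing space for Poly2 (hypothetical)

for name, count in [("LM", n), ("Poly2", B), ("FM", n * k_fm), ("FFM", n * f * k_ffm)]:
    print(f"{name:5s} {count:>13,d} parameters")
```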
Data processing
In the FFM model, a field index has to be attached to every feature, so the data format changes from `label feat1:val1 feat2:val2 ...` to `label field1:feat1:val1 field2:feat2:val2 ...`. For different types of features, the paper describes how the field should be assigned.
Categorical Features
The transformation for categorical features is illustrated below. A raw record

Yes P:ESPN A:Nike G:Male

is transformed into

Yes P:P-ESPN:1 A:A-Nike:1 G:G-Male:1
Each categorical column is treated as a field, and every category value is encoded as a binary (0/1) feature; only the non-zero features need to be stored.
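A minimal sketch of this encoding in Python, mapping field and feature names to the integer indices the libffm data format expects (the dictionaries and record layout here are hypothetical):

```python
def encode_categorical(record, field_ids, feature_ids):
    """Convert {"P": "ESPN", ...} into libffm-style field:feature:value tokens."""
    tokens = []
    for field, value in record.items():
        feat = f"{field}-{value}"                   # e.g. "P-ESPN"
        fid = field_ids.setdefault(field, len(field_ids))
        jid = feature_ids.setdefault(feat, len(feature_ids))
        tokens.append(f"{fid}:{jid}:1")             # binary feature, value is 1
    return " ".join(tokens)

field_ids, feature_ids = {}, {}
line = encode_categorical({"P": "ESPN", "A": "Nike", "G": "Male"},
                          field_ids, feature_ids)
print("1", line)   # -> 1 0:0:1 1:1:1 2:2:1  (label "Yes" mapped to 1)
```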
Numerical Features
The paper illustrates this case with a record of three numerical features (AR, Hidx, Cite).

One option is to encode them directly, which the paper calls the dummy-field approach: each feature is given its own field, as shown below. The possible problem is that such dummy fields carry no real field information, because they are merely copies of the feature index.

Yes AR:AR:45.73 Hidx:Hidx:2 Cite:Cite:3

Another option is to bucket the features, which the paper calls discretization: the bucketed value becomes the feature, as shown below (and in the code sketch after the examples). The possible problems are that discretization loses information about the feature (quantization error), and that a good binning scheme has to be tuned.
Yes AR:45:1 Hidx:2:1 Cite:3:1
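A minimal sketch of the two encodings, using integer truncation as the (hypothetical) discretizer:

```python
def dummy_field(record):
    """Dummy-field encoding: each feature is its own field, raw value kept."""
    return " ".join(f"{name}:{name}:{value}" for name, value in record.items())

def discretize(record):
    """Discretization: bucket the value (here by truncation) and use the bucket
    as the feature, with value 1."""
    return " ".join(f"{name}:{int(value)}:1" for name, value in record.items())

record = {"AR": 45.73, "Hidx": 2, "Cite": 3}
print("Yes", dummy_field(record))   # Yes AR:AR:45.73 Hidx:Hidx:2 Cite:Cite:3
print("Yes", discretize(record))    # Yes AR:45:1 Hidx:2:1 Cite:3:1
```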
Single-field Features
This case arises when every feature belongs to a single field, for example a text data set where the features are the words of one sentence. The field itself carries no information here: if the single field is used directly, FFM degenerates into FM, and if each word is instead treated as its own field, the number of fields $f$ becomes very large and the $O(nfk)$ model size blows up. The paper does not offer a solution for this case.
Optimization algorithm
In this paper, AdaGrad is used for optimization, and the optimization objective is

$$\min _{\boldsymbol{w}} \sum_{i=1}^{L} \log \left(1+\exp \left\{-y_{i} \phi\left(\boldsymbol{w}, \boldsymbol{x}_{i}\right)\right\}\right)+\frac{\lambda}{2}\|\boldsymbol{w}\|^{2}$$
The gradients with respect to the two latent vectors of each feature pair are

$$\boldsymbol{g}_{j_{1}, f_{2}}=\lambda \boldsymbol{w}_{j_{1}, f_{2}}+\kappa \cdot \boldsymbol{w}_{j_{2}, f_{1}} x_{j_{1}} x_{j_{2}}, \qquad \boldsymbol{g}_{j_{2}, f_{1}}=\lambda \boldsymbol{w}_{j_{2}, f_{1}}+\kappa \cdot \boldsymbol{w}_{j_{1}, f_{2}} x_{j_{1}} x_{j_{2}}$$

where
$$\kappa=\frac{\partial \log \left(1+\exp \left(-y \phi_{\mathrm{FFM}}(\boldsymbol{w}, \boldsymbol{x})\right)\right)}{\partial \phi_{\mathrm{FFM}}(\boldsymbol{w}, \boldsymbol{x})}=\frac{-y}{1+\exp \left(y \phi_{\mathrm{FFM}}(\boldsymbol{w}, \boldsymbol{x})\right)}$$
The gradient update rule then accumulates the squared gradient and takes a per-coordinate step: for each coordinate $d$,

$$\left(G_{j_{1}, f_{2}}\right)_{d} \leftarrow\left(G_{j_{1}, f_{2}}\right)_{d}+\left(g_{j_{1}, f_{2}}\right)_{d}^{2}, \qquad \left(w_{j_{1}, f_{2}}\right)_{d} \leftarrow\left(w_{j_{1}, f_{2}}\right)_{d}-\frac{\eta}{\sqrt{\left(G_{j_{1}, f_{2}}\right)_{d}}}\left(g_{j_{1}, f_{2}}\right)_{d}$$

and analogously for $\boldsymbol{w}_{j_{2}, f_{1}}$.
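Below is a minimal NumPy sketch of this update for a single feature pair, assuming $W$ and $G$ are arrays of shape (n_features, n_fields, k); it is only an illustration, not the paper's SSE-optimized C++ implementation.

```python
import numpy as np

def ffm_pair_update(W, G, j1, f1, j2, f2, x1, x2, kappa, lam=2e-5, eta=0.2):
    """One AdaGrad step for the feature pair (j1, j2).

    W, G: arrays of shape (n_features, n_fields, k); G is usually initialized
    to ones so the division below is safe (as in the paper).
    """
    g1 = lam * W[j1, f2] + kappa * W[j2, f1] * x1 * x2  # gradient w.r.t. w_{j1,f2}
    g2 = lam * W[j2, f1] + kappa * W[j1, f2] * x1 * x2  # gradient w.r.t. w_{j2,f1}
    G[j1, f2] += g1 ** 2                                # accumulate squared gradients
    G[j2, f1] += g2 ** 2
    W[j1, f2] -= eta / np.sqrt(G[j1, f2]) * g1          # per-coordinate AdaGrad step
    W[j2, f1] -= eta / np.sqrt(G[j2, f1]) * g2
```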
Experiments
Influence of k
Early stopping is used during training, and FFM is compared across different values of k. The experimental result: k has little effect on performance within the range tested.
The paper also compares the effects of $\lambda$, $\eta$, and the number of threads; the conclusions match common engineering experience, and the specific values can be found in the paper.
On the two data sets, the different methods are compared, and FFM performs best.
Comparisons across further data sets lead to more conclusions (see the paper for the characteristics of those data sets; the conclusions are listed here):
- FFM works well on categorical features that are transformed into binary (0/1) features.
- If the transformed data is not sparse enough, the benefit of FFM is less significant.
- FFM is harder to apply to numerical features; of the two encodings for numerical features, dummy fields work better.
Conclusion
FFM is not hard to understand, and the paper is full of engineering details that are worth reading carefully. The resources below include an FFM blog post from the Meituan technical team that covers more of the engineering implementation.
Resources
- Deep into FFM principle and practice