2016 Field-Aware Factorization Machines for CTR Prediction
Background
This paper studies the CTR prediction problem and builds on FM. In CTR data, features of the same nature can be grouped into a field: in the paper's example, ESPN, Vogue, and NBC belong to the Publisher field, while Nike, Gucci, and Adidas belong to the Advertiser field. FFM explicitly models the interactions between fields.
Model
The FM model is as follows (the constant and linear terms are omitted):

$$\phi_{\mathrm{FM}}(\boldsymbol{w}, \boldsymbol{x})=\sum_{j_{1}=1}^{n} \sum_{j_{2}=j_{1}+1}^{n}\left(\boldsymbol{w}_{j_{1}} \cdot \boldsymbol{w}_{j_{2}}\right) x_{j_{1}} x_{j_{2}}$$
After the reduction, it can be evaluated in linear time:

$$\phi_{\mathrm{FM}}(\boldsymbol{w}, \boldsymbol{x})=\frac{1}{2} \sum_{j=1}^{n}\left(\boldsymbol{s}-\boldsymbol{w}_{j} x_{j}\right) \cdot \boldsymbol{w}_{j} x_{j}$$
where

$$\boldsymbol{s}=\sum_{j^{\prime}=1}^{n} \boldsymbol{w}_{j^{\prime}} x_{j^{\prime}}$$
For the example above (Publisher: ESPN, Advertiser: Nike, Gender: Male), the quadratic term of FM expands to

$$\boldsymbol{w}_{\text{ESPN}} \cdot \boldsymbol{w}_{\text{Nike}}\, x_{\text{ESPN}} x_{\text{Nike}} + \boldsymbol{w}_{\text{ESPN}} \cdot \boldsymbol{w}_{\text{Male}}\, x_{\text{ESPN}} x_{\text{Male}} + \boldsymbol{w}_{\text{Nike}} \cdot \boldsymbol{w}_{\text{Male}}\, x_{\text{Nike}} x_{\text{Male}}$$
In FFM, the quadratic term becomes

$$\boldsymbol{w}_{\text{ESPN},\mathrm{A}} \cdot \boldsymbol{w}_{\text{Nike},\mathrm{P}}\, x_{\text{ESPN}} x_{\text{Nike}} + \boldsymbol{w}_{\text{ESPN},\mathrm{G}} \cdot \boldsymbol{w}_{\text{Male},\mathrm{P}}\, x_{\text{ESPN}} x_{\text{Male}} + \boldsymbol{w}_{\text{Nike},\mathrm{G}} \cdot \boldsymbol{w}_{\text{Male},\mathrm{A}}\, x_{\text{Nike}} x_{\text{Male}}$$
For the pair (ESPN, Nike), the latent vector $\boldsymbol{w}_{\text{ESPN},\mathrm{A}}$ is used because Nike belongs to the Advertiser field (and, symmetrically, $\boldsymbol{w}_{\text{Nike},\mathrm{P}}$ is used because ESPN belongs to the Publisher field). In general, the mathematical form of FFM is

$$\phi_{\mathrm{FFM}}(\boldsymbol{w}, \boldsymbol{x})=\sum_{j_{1}=1}^{n} \sum_{j_{2}=j_{1}+1}^{n}\left(\boldsymbol{w}_{j_{1}, f_{2}} \cdot \boldsymbol{w}_{j_{2}, f_{1}}\right) x_{j_{1}} x_{j_{2}}$$
where $j_1$ and $j_2$ index features, and $f_1$ and $f_2$ are the fields that $j_1$ and $j_2$ belong to, respectively.
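To make the definition concrete, below is a minimal Python sketch of the naive pairwise evaluation of $\phi_{\mathrm{FFM}}$ over the non-zero features of one instance; the array layout and toy indices are hypothetical, not the paper's C++ implementation.

```python
import numpy as np

def ffm_phi(x, W):
    """Naive FFM interaction term for one sparse instance.

    x: list of (field, feature, value) triples for the non-zero entries.
    W: array of shape (n_features, n_fields, k) holding the latent vectors w_{j, f}.
    """
    phi = 0.0
    for i in range(len(x)):
        f1, j1, v1 = x[i]
        for l in range(i + 1, len(x)):
            f2, j2, v2 = x[l]
            # latent vector of feature j1 toward field f2, and vice versa
            phi += np.dot(W[j1, f2], W[j2, f1]) * v1 * v2
    return phi

# toy example: 3 features (ESPN=0, Nike=1, Male=2), 3 fields (P=0, A=1, G=2), k=4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 3, 4))
x = [(0, 0, 1.0), (1, 1, 1.0), (2, 2, 1.0)]   # P:ESPN, A:Nike, G:Male
print(ffm_phi(x, W))
```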
Compared with LM, Poly2, and FM (these models are introduced in the paper but omitted here), the number of parameters and the per-instance prediction complexity are summarized below:
model | #variables | complexity |
---|---|---|
LM | $n$ | $O(\bar{n})$ |
Poly2 | $B$ | $O(\bar{n}^2)$ |
FM | $nk$ | $O(\bar{n}k)$ |
FFM | $nfk$ | $O(\bar{n}^2 k)$ |
Here $n$ is the number of features, $\bar{n}$ is the average number of non-zero features per instance, $B$ is the size of the hashing space used by Poly2, $k$ is the length of the latent vectors, and $f$ is the number of fields. Since each latent vector in FFM only needs to learn the interaction with one particular field, in practice $k_{\mathrm{FFM}} \ll k_{\mathrm{FM}}$.
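As a quick numeric illustration of the #variables column, a few lines of Python with hypothetical sizes (the numbers below are made up, not from the paper):

```python
n = 1_000_000          # number of features (hypothetical)
f = 40                 # number of fields (hypothetical)
k_fm, k_ffm = 100, 4   # latent dimensions; FFM gets away with a much smaller k
B = 2 ** 24            # hashing space for Poly2 (hypothetical)

for name, count in [("LM", n), ("Poly2", B), ("FM", n * k_fm), ("FFM", n * f * k_ffm)]:
    print(f"{name:5s} {count:>13,d} parameters")
```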
Data processing
In the FFM model, a field index has to be attached to every feature, so the data format changes from `label feat1:val1 feat2:val2 ...` to `label field1:feat1:val1 field2:feat2:val2 ...`. For different types of features, the paper describes how the field should be assigned.
Categorical Features
The transformation for categorical features is illustrated below. A raw record

Yes P:ESPN A:Nike G:Male

is transformed into

Yes P:P-ESPN:1 A:A-Nike:1 G:G-Male:1
Each categorical column is treated as a field, and every category value is encoded as a binary (0/1) feature; only the non-zero features need to be stored.
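A minimal sketch of this encoding in Python, mapping field and feature names to the integer indices the libffm data format expects (the dictionaries and record layout here are hypothetical):

```python
def encode_categorical(record, field_ids, feature_ids):
    """Convert {"P": "ESPN", ...} into libffm-style field:feature:value tokens."""
    tokens = []
    for field, value in record.items():
        feat = f"{field}-{value}"                   # e.g. "P-ESPN"
        fid = field_ids.setdefault(field, len(field_ids))
        jid = feature_ids.setdefault(feat, len(feature_ids))
        tokens.append(f"{fid}:{jid}:1")             # binary feature, value is 1
    return " ".join(tokens)

field_ids, feature_ids = {}, {}
line = encode_categorical({"P": "ESPN", "A": "Nike", "G": "Male"},
                          field_ids, feature_ids)
print("1", line)   # -> 1 0:0:1 1:1:1 2:2:1  (label "Yes" mapped to 1)
```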
Numerical Features
The paper illustrates this case with a record of three numerical features (AR, Hidx, Cite).

One option is to encode them directly, which the paper calls the dummy-field approach: each feature is given its own field, as shown below. The possible problem is that such dummy fields carry no real field information, because they are merely copies of the feature index.

Yes AR:AR:45.73 Hidx:Hidx:2 Cite:Cite:3

Another option is to bucket the features, which the paper calls discretization: the bucketed value becomes the feature, as shown below (and in the code sketch after the examples). The possible problems are that discretization loses information about the feature (quantization error), and that a good binning scheme has to be tuned.
Yes AR:45:1 Hidx:2:1 Cite:3:1
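A minimal sketch of the two encodings, using integer truncation as the (hypothetical) discretizer:

```python
def dummy_field(record):
    """Dummy-field encoding: each feature is its own field, raw value kept."""
    return " ".join(f"{name}:{name}:{value}" for name, value in record.items())

def discretize(record):
    """Discretization: bucket the value (here by truncation) and use the bucket
    as the feature, with value 1."""
    return " ".join(f"{name}:{int(value)}:1" for name, value in record.items())

record = {"AR": 45.73, "Hidx": 2, "Cite": 3}
print("Yes", dummy_field(record))   # Yes AR:AR:45.73 Hidx:Hidx:2 Cite:Cite:3
print("Yes", discretize(record))    # Yes AR:45:1 Hidx:2:1 Cite:3:1
```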
Single-field Features
This case arises when every feature belongs to a single field, for example a text data set where the features are the words of one sentence. The field itself carries no information here: if the single field is used directly, FFM degenerates into FM, and if each word is instead treated as its own field, the number of fields $f$ becomes very large and the $O(nfk)$ model size blows up. The paper does not offer a solution for this case.
Optimization algorithm
In this paper, AdaGrad is used for optimization, and the optimization objective is

$$\min _{\boldsymbol{w}} \sum_{i=1}^{L} \log \left(1+\exp \left\{-y_{i} \phi\left(\boldsymbol{w}, \boldsymbol{x}_{i}\right)\right\}\right)+\frac{\lambda}{2}\|\boldsymbol{w}\|^{2}$$
The gradients with respect to the two latent vectors of each feature pair are

$$\boldsymbol{g}_{j_{1}, f_{2}}=\lambda \boldsymbol{w}_{j_{1}, f_{2}}+\kappa \cdot \boldsymbol{w}_{j_{2}, f_{1}} x_{j_{1}} x_{j_{2}}, \qquad \boldsymbol{g}_{j_{2}, f_{1}}=\lambda \boldsymbol{w}_{j_{2}, f_{1}}+\kappa \cdot \boldsymbol{w}_{j_{1}, f_{2}} x_{j_{1}} x_{j_{2}}$$

where
$$\kappa=\frac{\partial \log \left(1+\exp \left(-y \phi_{\mathrm{FFM}}(\boldsymbol{w}, \boldsymbol{x})\right)\right)}{\partial \phi_{\mathrm{FFM}}(\boldsymbol{w}, \boldsymbol{x})}=\frac{-y}{1+\exp \left(y \phi_{\mathrm{FFM}}(\boldsymbol{w}, \boldsymbol{x})\right)}$$
The gradient update rule then accumulates the squared gradient and takes a per-coordinate step: for each coordinate $d$,

$$\left(G_{j_{1}, f_{2}}\right)_{d} \leftarrow\left(G_{j_{1}, f_{2}}\right)_{d}+\left(g_{j_{1}, f_{2}}\right)_{d}^{2}, \qquad \left(w_{j_{1}, f_{2}}\right)_{d} \leftarrow\left(w_{j_{1}, f_{2}}\right)_{d}-\frac{\eta}{\sqrt{\left(G_{j_{1}, f_{2}}\right)_{d}}}\left(g_{j_{1}, f_{2}}\right)_{d}$$

and analogously for $\boldsymbol{w}_{j_{2}, f_{1}}$.
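Below is a minimal NumPy sketch of this update for a single feature pair, assuming $W$ and $G$ are arrays of shape (n_features, n_fields, k); it is only an illustration, not the paper's SSE-optimized C++ implementation.

```python
import numpy as np

def ffm_pair_update(W, G, j1, f1, j2, f2, x1, x2, kappa, lam=2e-5, eta=0.2):
    """One AdaGrad step for the feature pair (j1, j2).

    W, G: arrays of shape (n_features, n_fields, k); G is usually initialized
    to ones so the division below is safe (as in the paper).
    """
    g1 = lam * W[j1, f2] + kappa * W[j2, f1] * x1 * x2  # gradient w.r.t. w_{j1,f2}
    g2 = lam * W[j2, f1] + kappa * W[j1, f2] * x1 * x2  # gradient w.r.t. w_{j2,f1}
    G[j1, f2] += g1 ** 2                                # accumulate squared gradients
    G[j2, f1] += g2 ** 2
    W[j1, f2] -= eta / np.sqrt(G[j1, f2]) * g1          # per-coordinate AdaGrad step
    W[j2, f1] -= eta / np.sqrt(G[j2, f1]) * g2
```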
Experiments
Influence of k
Early stopping is used during training, and FFM is compared across different values of k. The experimental result: k has little effect on performance within the range tested.
The paper also compares the effects of $\lambda$, $\eta$, and the number of threads; the conclusions match common engineering experience, and the specific values can be found in the paper.
On the two data sets, the different methods are compared, and FFM performs best.
Comparisons across further data sets lead to more conclusions (see the paper for the characteristics of those data sets; the conclusions are listed here):
- FFM works well on categorical features that are transformed into binary (0/1) features.
- If the transformed data is not sparse enough, the benefit of FFM is less significant.
- FFM is harder to apply to numerical features; of the two encodings for numerical features, dummy fields work better.
Conclusion
FFM is not hard to understand, and the paper is full of engineering details that are worth reading carefully. The resources below include an FFM blog post from the Meituan technical team that covers more of the engineering implementation.
Resources
- Deep into FFM principle and practice