Paper reading – xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems
Background
Building on DeepFM and Wide & Deep, the paper proposes a Compressed Interaction Network (CIN). CIN achieves the following:
- Feature interaction is performed vector-wise instead of bit-wise
- Higher-order feature interactions are learned explicitly (the maximum order of feature crossing can be specified)
- The network complexity does not grow exponentially with the order of interaction
Bit-wise is the counterpart of vector-wise. In the original FM model, a feature cross is represented by the dot product between the latent (embedding) vectors of two features: the smallest unit of the crossing operation is a whole vector, and elements inside the same latent vector are never multiplied with each other. This is called vector-wise interaction. In later FM-derived models, especially after DNN modules were introduced, the common practice is to concatenate the embedded feature vectors into one long vector and feed it into a DNN to simulate feature crossing. The difference from vector-wise interaction is that, once the feature vectors are concatenated, the boundary between different feature vectors is erased, and in the following layers elements within the same feature vector also interact with each other. This is called bit-wise interaction.
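A minimal toy illustration of the difference (the tensors and layer sizes here are made up for illustration, not from the paper): vector-wise interaction treats each field embedding as an atomic unit and crosses fields through inner or Hadamard products, while bit-wise interaction concatenates all embeddings so an MLP can mix arbitrary elements, including those from the same field.

```python
import torch
import torch.nn as nn

D = 4                                     # embedding size per field
e1, e2 = torch.randn(D), torch.randn(D)   # embeddings of two different fields

# Vector-wise (FM-style): the smallest unit of interaction is a whole field
# embedding; elements inside the same embedding never cross each other.
vector_wise_cross = torch.dot(e1, e2)     # scalar <e1, e2>
hadamard_cross = e1 * e2                  # element i of e1 pairs only with element i of e2

# Bit-wise (DNN-style): the embeddings are concatenated into one long vector,
# so the next layer can mix any element with any other element,
# including elements that belong to the same field.
mlp = nn.Linear(2 * D, 8)
bit_wise_hidden = torch.relu(mlp(torch.cat([e1, e2])))
```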
Replacing the common bit-wise interaction with vector-wise interaction keeps the model closer to the original FM idea, which is one of the motivations of xDeepFM. Note that vector-wise interaction is also used in PNN, NFM, AFM and DeepFM, although those papers do not call it out explicitly.
The paper combines CIN, a DNN and a linear part to build the eXtreme Deep Factorization Machine (xDeepFM) model, and verifies its effectiveness on several datasets.
Preliminary knowledge
Embedding layer
When features are one-hot encoded, the resulting representation is high-dimensional and sparse, as shown in the figure below.
An embedding layer transforms the raw features into a dense embedding representation $\mathbf{e} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_m]$, where $\mathbf{e}_i \in \mathbb{R}^{D}$ is the embedding of the $i$-th field and the total embedding length is $m \times D$. The embedding layer is illustrated in the figure below.
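A minimal sketch of such an embedding layer (the field sizes and class names below are invented for illustration): each sparse field is mapped to a dense $D$-dimensional vector, and the $m$ field embeddings are stacked into the $m \times D$ matrix used by the rest of the model.

```python
import torch
import torch.nn as nn

class FieldEmbedding(nn.Module):
    """Maps m sparse categorical fields to an (m x D) embedding matrix."""
    def __init__(self, field_sizes, embed_dim):
        super().__init__()
        # one embedding table per field; field_sizes[i] = vocabulary size of field i
        self.tables = nn.ModuleList(
            [nn.Embedding(n, embed_dim) for n in field_sizes]
        )

    def forward(self, x):
        # x: (batch, m) integer ids, one id per field
        embeds = [table(x[:, i]) for i, table in enumerate(self.tables)]
        return torch.stack(embeds, dim=1)   # (batch, m, D)

# toy usage: 3 fields with vocabularies of 10, 20, 5 and D = 4
layer = FieldEmbedding([10, 20, 5], embed_dim=4)
ids = torch.tensor([[1, 7, 2], [0, 19, 4]])
print(layer(ids).shape)                     # torch.Size([2, 3, 4])
```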
Implicit higher-order interaction
In a plain DNN, feature interactions are computed as follows; note that elements belonging to the same field's embedding also influence each other. Since we cannot tell exactly what order of cross features a DNN learns, this is called implicit feature interaction.
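The forward computation referred to here is the standard fully connected layer, written roughly in the paper's notation:

$$
\mathbf{x}^{1} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_m], \qquad
\mathbf{x}^{k} = \sigma\left(\mathbf{W}^{(k)} \mathbf{x}^{k-1} + \mathbf{b}^{k}\right),
$$

where $\sigma$ is the activation function and the input is the concatenation of all field embeddings (bit-wise).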
Explicit higher-order interaction
In the Cross Network (CrossNet), explicit interaction is expressed as $\mathbf{x}_k = \mathbf{x}_0 \mathbf{x}_{k-1}^{T} \mathbf{w}_k + \mathbf{b}_k + \mathbf{x}_{k-1}$. The paper derives the recurrence
$\alpha^{i+1} = \alpha^{i}\left(\mathbf{x}_0^{T} \mathbf{w}_{i+1} + 1\right)$,
which shows that each hidden layer is a scalar multiple of $\mathbf{x}_0$ and makes the achievable interaction order explicit. CrossNet therefore has the following drawbacks:
- The output of CrossNet is limited to a special form: each hidden layer is a scalar multiple of $\mathbf{x}_0$
- Feature interaction is still carried out in a bit-wise fashion
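A short sketch of the derivation behind the first point (ignoring bias terms, as the paper's proof does): for the first layer,

$$
\mathbf{x}_1 = \mathbf{x}_0\left(\mathbf{x}_0^{T}\mathbf{w}_1\right) + \mathbf{x}_0
= \left(\mathbf{x}_0^{T}\mathbf{w}_1 + 1\right)\mathbf{x}_0 \equiv \alpha^{1}\mathbf{x}_0,
$$

and, assuming $\mathbf{x}_i = \alpha^{i}\mathbf{x}_0$,

$$
\mathbf{x}_{i+1} = \mathbf{x}_0\mathbf{x}_i^{T}\mathbf{w}_{i+1} + \mathbf{x}_i
= \alpha^{i}\left(\mathbf{x}_0^{T}\mathbf{w}_{i+1} + 1\right)\mathbf{x}_0
= \alpha^{i+1}\mathbf{x}_0,
$$

so every hidden layer is just $\mathbf{x}_0$ scaled by a scalar.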
CIN model
The embeddings are arranged as a matrix: the input is $X^{0} \in \mathbb{R}^{m \times D}$, and the output of layer $k$ of CIN is $X^{k} \in \mathbb{R}^{H_k \times D}$, where $H_k$ is the number of feature vectors in that layer and $X^{0}_{i,*}$ denotes the $i$-th row (feature) of $X^{0}$.
In CIN, $X^{k}$ is computed as follows:
$$
X^{k}_{h,*} = \sum_{i=1}^{H_{k-1}} \sum_{j=1}^{m} W^{k,h}_{ij}\left(X^{k-1}_{i,*} \circ X^{0}_{j,*}\right)
$$
where $1 \leq h \leq H_k$, $W^{k,h} \in \mathbb{R}^{H_{k-1} \times m}$, and $\circ$ denotes the Hadamard (element-wise) product, e.g. $\langle a_1, a_2, a_3 \rangle \circ \langle b_1, b_2, b_3 \rangle = \langle a_1 b_1, a_2 b_2, a_3 b_3 \rangle$.
The computation of the next layer can be depicted graphically as two steps:
- The Hadamard products between $X^{k}$ and $X^{0}$ form an intermediate tensor of shape $H_k \times m \times D$, denoted $Z^{k+1}$, as shown in the left figure below
- A weighted sum over the $H_k$ and $m$ dimensions (with weights $W^{k+1,h}$) is then performed to obtain $X^{k+1}$, as shown in the right figure below
For each $X^{k}, k \in [1, T]$, sum pooling is applied along the embedding dimension: $p^{k}_{i} = \sum_{j=1}^{D} X^{k}_{i,j}$, giving the pooled vector $\mathbf{p}^{k} = \left[p^{k}_{1}, p^{k}_{2}, \ldots, p^{k}_{H_k}\right]$. The pooled vectors of layers $1$ to $T$ are concatenated as $\mathbf{p}^{+} = \left[\mathbf{p}^{1}, \mathbf{p}^{2}, \ldots, \mathbf{p}^{T}\right] \in \mathbb{R}^{\sum_{i=1}^{T} H_i}$, and a linear model produces the prediction $y = \frac{1}{1 + \exp\left(\left(\mathbf{p}^{+}\right)^{T} \mathbf{w}^{o}\right)}$.
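A minimal code sketch of the CIN computation described above (a simplified re-implementation for illustration, not the authors' code): each layer builds the Hadamard-product tensor $Z^{k+1}$ between the current feature map and $X^{0}$, applies the weights $W^{k,h}$ as a weighted sum, and sum pooling over the embedding dimension produces $\mathbf{p}^{+}$ for the final linear unit.

```python
import torch
import torch.nn as nn

class CIN(nn.Module):
    """Simplified Compressed Interaction Network (illustrative sketch)."""
    def __init__(self, m, layer_sizes):
        super().__init__()
        # layer_sizes = [H_1, ..., H_T]; W^k has shape (H_k, H_{k-1}, m)
        self.weights = nn.ParameterList()
        prev = m
        for h in layer_sizes:
            self.weights.append(nn.Parameter(torch.randn(h, prev, m) * 0.01))
            prev = h

    def forward(self, x0):
        # x0: (batch, m, D) field embeddings
        xs, pooled = x0, []
        for w in self.weights:
            # Step 1: Hadamard products -> Z^{k+1}: (batch, H_k, m, D)
            z = xs.unsqueeze(2) * x0.unsqueeze(1)
            # Step 2: weighted sum over (H_k, m) with W^{k+1,h} -> (batch, H_{k+1}, D)
            xs = torch.einsum('bhmd,nhm->bnd', z, w)
            # Sum pooling over the embedding dimension D -> (batch, H_{k+1})
            pooled.append(xs.sum(dim=2))
        return torch.cat(pooled, dim=1)        # p^+: (batch, sum_k H_k)

# toy usage: batch of 2, m = 3 fields, D = 4, two CIN layers with H = [5, 5]
cin = CIN(m=3, layer_sizes=[5, 5])
p_plus = cin(torch.randn(2, 3, 4))
logit = nn.Linear(10, 1)(p_plus)               # final linear unit over p^+
y = torch.sigmoid(logit)
print(p_plus.shape, y.shape)                   # torch.Size([2, 10]) torch.Size([2, 1])
```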
The paper analyzes CIN in terms of space complexity, time complexity and polynomial approximation.
Space Complexity
- The number of parameters in CIN is $\sum_{k=1}^{T} H_k \times \left(1 + H_{k-1} \times m\right)$, i.e. a parameter complexity of $O(mTH^{2})$
- The number of parameters in a plain DNN is $m \times D \times H_1 + H_T + \sum_{k=2}^{T} H_k \times H_{k-1}$, i.e. a parameter complexity of $O(mDH + TH^{2})$, which depends on the embedding dimension $D$
Time Complexity
- Time complexity of CIN: $O(mH^{2}DT)$
- Time complexity of a plain DNN: $O(mHD + H^{2}T)$
Polynomial Approximation
The $h$-th feature vector of layer $k$ can be expressed as follows; in my personal understanding, it is a $k$-th order representation of the input features.
xDeepFM model
xDeepFM consists of a CIN, a DNN and a linear part, and therefore covers both low-order and high-order feature interactions.
- The network output is expressed as: $\hat{y} = \sigma\left(\mathbf{w}_{\text{linear}}^{T} \mathbf{a} + \mathbf{w}_{dnn}^{T} \mathbf{x}_{dnn}^{k} + \mathbf{w}_{cin}^{T} \mathbf{p}^{+} + b\right)$
- The loss function is the cross-entropy loss: $\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} y_i \log \hat{y}_i + \left(1 - y_i\right) \log\left(1 - \hat{y}_i\right)$. With a regularization term added, the final objective is $\mathcal{J} = \mathcal{L} + \lambda_{*} \|\Theta\|$.
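A minimal sketch of how the three outputs are combined and trained (the tensor names and sizes below are hypothetical; the raw features $\mathbf{a}$, the DNN hidden output and the CIN output $\mathbf{p}^{+}$ are assumed to be computed elsewhere):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch = 2
a      = torch.randn(batch, 50)   # raw features for the linear part
x_dnn  = torch.randn(batch, 32)   # last hidden layer of the DNN component
p_plus = torch.randn(batch, 10)   # pooled CIN output p^+

w_linear = nn.Linear(50, 1, bias=False)
w_dnn    = nn.Linear(32, 1, bias=False)
w_cin    = nn.Linear(10, 1, bias=False)
b        = nn.Parameter(torch.zeros(1))

# y_hat = sigma(w_linear^T a + w_dnn^T x_dnn + w_cin^T p^+ + b)
logit = w_linear(a) + w_dnn(x_dnn) + w_cin(p_plus) + b
y_hat = torch.sigmoid(logit).squeeze(-1)

# cross-entropy loss plus an L2 regularization term on the output weights
y = torch.tensor([1.0, 0.0])
params = list(w_linear.parameters()) + list(w_dnn.parameters()) + list(w_cin.parameters())
loss = F.binary_cross_entropy(y_hat, y) + 1e-4 * sum(p.pow(2).sum() for p in params)
```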
The network structure is shown as follows
Experiments
The experiments explore three questions:
- How does the proposed CIN perform at learning higher-order feature interactions?
- Does a recommender system need to combine implicit and explicit feature interactions?
- How do the network settings affect xDeepFM's performance?
The experimental datasets are shown below; models are evaluated with AUC (Area Under the ROC Curve) and Logloss (cross entropy). For the specific parameter settings, refer to the paper.
Question 1: Effect of CIN
The experimental results are as follows, where the number of layers is the optimal depth found by search. The conclusions are:
- High-order interactions on sparse features are necessary, which can be verified by the significantly superior performance of DNN, CrossNet and CIN over FM on the three datasets
- CIN is the best model, demonstrating its effectiveness in modeling explicit higher-order feature interactions. A CIN with $k$ layers can model feature interactions up to order $k$. It is also interesting that CIN needs 5 layers to produce the best results on the Bing News dataset.
Question 2: Combining implicit and explicit feature interactions
The results are shown below, and the conclusions are
- LR is much worse than all other models, which indicates that factorization-based models are essential for modeling sparse features
- Wide&Deep, DCN, DeepFM and xDeepFM are all significantly better than DNN, which shows that hybrid components, simple as they are, are important for improving prediction accuracy
- The proposed xDeepFM achieves the best performance on all datasets, which shows that combining explicit and implicit higher-order interactions is effective
- The depth hyper-parameter is typically set to 2 or 3; the optimal depth for xDeepFM is 3, which means interactions of at most order 4 are learned explicitly
Question 3: Hyper-parameter exploration
The following hyper-parameters are studied; the conclusions are shown in the figures below:
- Depth of Network
- Number of Neurons per Layer
- Activation Function
Conclusion
Building on DeepFM, xDeepFM pushes feature interaction modeling a step further. Personally, I find the model rather complex, so I am not sure how widely it is used in real-world business settings or how well it performs there.
References
- Recommendation system – Ranking – xDeepFM
- xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems