0. Foreword
Previously, we introduced three models in the FM family: FM, DeepFM, and NFM. DeepFM and NFM both extend FM in different ways, relying on neural networks to capture higher-order feature combinations. Today we introduce the AFM model (Attentional Factorization Machine), which applies an attention mechanism to the FM model.
Personal takeaways:
- The importance of a feature combination changes as the prediction target changes
- The attention mechanism is used to learn a distinct weight for each feature cross
Paper link:
www.ijcai.org/Proceedings…
1. Background
In the traditional FM model, as well as in DeepFM and NFM, every second-order cross term effectively has a coefficient of 1: all feature interactions are weighted equally. In many recommendation scenarios, however, different second-order cross features contribute differently to the prediction target. To let the model learn a distinct weight for each second-order cross term, AFM introduces the attention mechanism.
As with the other models, AFM is still split into a shallow part and a DNN part. The shallow part is the same as in the other models, so it is not repeated here. Below we introduce the DNN part of AFM through the overall model architecture, the pair-wise interaction layer, and the attention layer.
2. Overall architecture
The AFM model architecture is shown in Figure 1. The model consists of five parts: the sparse input layer, the embedding layer, the pair-wise interaction layer, the attention-based pooling layer, and the prediction layer. The input layer accepts sparse features; after the embedding layer, each feature field is mapped to an embedding vector of the same dimension.
The first two steps are routine for FM-style neural recommendation models and need no special treatment. Each embedding vector then enters the pair-wise interaction layer, where the feature vectors are crossed in pairs to produce the second-order feature terms. Each second-order feature vector is also fed into the attention layer, which outputs a weight for that vector; weighted sum-pooling then yields a single vector that aggregates all feature information, and the prediction is produced through a fully connected layer.
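To make the first two steps concrete, here is a minimal PyTorch sketch of the input and embedding layers. The field sizes, embedding dimension, and variable names are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

# Illustrative field vocabulary sizes; every field shares the same embedding size k
field_dims = [1000, 500, 100]
embed_dim = 8

embeddings = nn.ModuleList([nn.Embedding(n, embed_dim) for n in field_dims])

# a batch of 4 samples, one categorical id per field -> shape (4, 3)
x = torch.stack([torch.randint(0, n, (4,)) for n in field_dims], dim=1)

# look up one k-dimensional vector per field -> (batch, num_fields, embed_dim)
e = torch.stack([emb(x[:, i]) for i, emb in enumerate(embeddings)], dim=1)
print(e.shape)  # torch.Size([4, 3, 8])
```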
3. Pair-wise interaction layer
Let's first look at the implementation of this layer, whose mathematical expression is:

$$f_{PI}(\mathcal{E}) = \{ (v_i \odot v_j)\, x_i x_j \}_{(i,j) \in \mathcal{R}_x}$$

where $v_i$ is the embedding vector of feature field $i$, $x_i$ is the feature value, $\odot$ is the element-wise product, and $\mathcal{R}_x$ is the set of feature index pairs. This is similar to the Bi-Interaction layer in NFM. The pair-wise interaction layer crosses the feature vectors in pairs to obtain the second-order feature terms; the number of second-order terms is $m(m-1)/2$, where $m$ is the number of feature fields in a sample.
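A minimal sketch of the pair-wise interaction layer in PyTorch. It assumes the embeddings have already been multiplied by their feature values (which are 1 for one-hot categorical fields); the function name and shapes are my own illustration:

```python
import torch

def pairwise_interaction(e: torch.Tensor) -> torch.Tensor:
    """Element-wise product of every pair of field embeddings.

    e: (batch, m, k) embedding vectors.
    returns: (batch, m*(m-1)//2, k) second-order interaction vectors.
    """
    m = e.size(1)
    idx_i, idx_j = torch.triu_indices(m, m, offset=1)  # all pairs with i < j
    return e[:, idx_i, :] * e[:, idx_j, :]

e = torch.randn(4, 3, 8)   # batch of 4, m = 3 fields, k = 8
p = pairwise_interaction(e)
print(p.shape)             # torch.Size([4, 3, 8]); 3 = 3*(3-1)/2 pairs
```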
4. Attention-based pooling layer
Since the authors named the model Attentional FM, the attention mechanism is naturally the biggest highlight of the paper. Let's look at how the attention-based pooling layer is designed and implemented in AFM.
To estimate the weight of each second-order vector, a direct approach is to learn it by minimizing the loss function. Although this seems feasible, it runs into a familiar problem: when an interaction feature never appears in the training samples, its attention score cannot be estimated. To solve this generalization problem, AFM parameterizes the attention score with a small MLP, called the attention network. Its mathematical expression is:

$$a'_{ij} = h^{T}\,\mathrm{ReLU}\big(W (v_i \odot v_j)\, x_i x_j + b\big), \qquad a_{ij} = \frac{\exp(a'_{ij})}{\sum_{(i,j) \in \mathcal{R}_x} \exp(a'_{ij})}$$

The input to the attention network is each second-order feature vector; after one MLP layer, the scores are passed through a softmax, which normalizes them into the weights of the second-order feature vectors. The activation function is the commonly used ReLU, and the number of hidden neurons (the attention factor) is a hyperparameter of the attention layer that needs to be tuned. With the weight of each second-order vector in hand, AFM performs weighted sum-pooling over all second-order feature vectors to obtain a single vector that aggregates all feature information, and this vector is then projected through the fully connected prediction layer to produce the final output.
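Below is a minimal PyTorch sketch of the attention network, the weighted sum-pooling, and the prediction projection. It covers only the DNN part, and the class name, layer sizes, and `attn_dim` hyperparameter are my own illustrative choices:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Attention-based pooling over the second-order interaction vectors."""

    def __init__(self, embed_dim: int, attn_dim: int = 16):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(embed_dim, attn_dim),      # W (v_i ⊙ v_j) x_i x_j + b
            nn.ReLU(),
            nn.Linear(attn_dim, 1, bias=False),  # projection vector h
        )
        self.predict = nn.Linear(embed_dim, 1, bias=False)  # prediction vector p

    def forward(self, pair: torch.Tensor) -> torch.Tensor:
        # pair: (batch, num_pairs, k) second-order interaction vectors
        scores = self.attn(pair)                # (batch, num_pairs, 1)
        weights = torch.softmax(scores, dim=1)  # normalize over the pairs
        pooled = (weights * pair).sum(dim=1)    # weighted sum-pooling -> (batch, k)
        return self.predict(pooled).squeeze(-1) # DNN-part contribution to the prediction

pair = torch.randn(4, 3, 8)
print(AttentionPooling(embed_dim=8)(pair).shape)  # torch.Size([4])
```

Adding the shallow linear part mentioned earlier would give the full AFM prediction.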
5. Summary
AFM introduces an attention mechanism on top of the FM model, enabling it to effectively learn the weight of each second-order feature term, something that DeepFM, NFM, and similar models do not take into account.