As we all know, a single-valued categorical feature is usually fed into a CTR prediction model by first one-hot encoding it and then converting the one-hot vector into a multi-dimensional dense feature through multiplication with an embedding matrix, as shown in Fig. 1 below:
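A minimal NumPy sketch of this step (the vocabulary size and embedding dimension are illustrative, not from the article); it also shows that the one-hot matrix multiplication is equivalent to a plain row lookup:

```python
import numpy as np

vocab_size, emb_dim = 5, 4                    # illustrative sizes
emb_matrix = np.random.randn(vocab_size, emb_dim)

feature_id = 2                                # a single-valued categorical feature value
one_hot = np.eye(vocab_size)[feature_id]      # one-hot encoding

dense_by_matmul = one_hot @ emb_matrix        # one-hot x embedding matrix
dense_by_lookup = emb_matrix[feature_id]      # equivalent, cheaper row lookup

assert np.allclose(dense_by_matmul, dense_by_lookup)
```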
In the previous article I summarized how dense features are added to a CTR prediction model. In real problems, however, we often encounter multi-valued categorical features. For example, in the 2019 Tencent Advertising Algorithm Competition the user interest feature I worked with was multi-valued: one user can have several interests (basketball, table tennis, dance, and so on), and the number of interests differs from user to user. Similarly, in the 2019 Zhihu Mountain Cup competition, the topics a user is interested in form a multi-valued feature: one user can follow many topics, and different users follow different ones. Such features generally take the following form (using the topics a user is interested in as an example):
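A hypothetical sketch of the raw form of such a feature (the user IDs and topic names below are made up for illustration):

```python
# user_id -> topics the user follows; the lists differ in length across users
user_topics = {
    "u1": ["basketball", "table_tennis", "dance"],
    "u2": ["machine_learning"],
    "u3": ["cooking", "travel", "photography", "history"],
}
```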
Common ways of handling such multi-valued categorical features in a CTR prediction model are summarized below.
▌ Unweighted method
The simplest routine is to first one-hot encode the set of "topic" values, then convert each value of the multi-valued categorical feature into a dense feature as in Fig. 1, and finally pool the resulting dense vectors by averaging, taking the maximum, taking the minimum, and so on. The whole process is shown in Figure 2:
It can be seen that after a multi-valued categorical feature is processed in this way, every such feature is mapped into a space of the same dimension. As a result, the input to the neural network no longer has to be padded to keep its dimension consistent, which would otherwise make the input sparse, and it also becomes convenient to build cross features with other features.
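A minimal NumPy sketch of the unweighted version, assuming the topic embeddings have already been looked up (all sizes are illustrative):

```python
import numpy as np

num_topics, emb_dim = 10, 4                        # illustrative vocabulary and embedding sizes
topic_emb = np.random.randn(num_topics, emb_dim)   # shared topic embedding table

user_topic_ids = [1, 3, 7]                         # one user's multi-valued feature (variable length)
vectors = topic_emb[user_topic_ids]                # (3, emb_dim) dense vectors

mean_pooled = vectors.mean(axis=0)                 # average pooling
max_pooled = vectors.max(axis=0)                   # max pooling
min_pooled = vectors.min(axis=0)                   # min pooling
# each result is (emb_dim,), no matter how many topics the user has
```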
▌ Weighted method
On second thought, simply averaging the values of a multi-valued feature seems unreasonable: a user does not like every topic they follow to the same degree. This motivates introducing weights instead of taking a plain mean. The weighted version is shown in Figure 3:
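A short sketch of the weighted pooling, where the weights are placeholders for whatever the mining or learning methods below produce:

```python
import numpy as np

emb_dim = 4
vectors = np.random.randn(3, emb_dim)         # embeddings of the user's 3 topics
weights = np.array([0.6, 0.3, 0.1])           # placeholder interest weights, summing to 1

weighted_pooled = (weights[:, None] * vectors).sum(axis=0)   # (emb_dim,)
```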
So how are the weights obtained? The approaches can be summarized as follows:
Obtain the weight of each value in the multi-valued feature by data mining
For example, for the multi-valued feature of topics a user is interested in, the weights can be derived from the number of questions the user has answered under each topic, or the number of answers the user has "liked" under each topic: the more questions a user answers under a topic, or the more answers they "like" there, the more interested they are in that topic and the larger its weight. A small sketch follows this paragraph.
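A hypothetical example of turning per-topic answer counts (or "like" counts) into normalized weights:

```python
import numpy as np

# made-up per-topic answer counts for one user (could equally be "like" counts)
answer_counts = np.array([12.0, 3.0, 1.0])

weights = answer_counts / answer_counts.sum()  # -> [0.75, 0.1875, 0.0625]
```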
Learn the weight of each value in the multi-valued feature automatically with a neural network
1. FiBiNET[1] uses an SE module to learn a different weight for each embedding vector. The main process is shown in Figure 4:
Pooling each embedding vector into a single statistic is called squeeze in the SE[2] module. Two fully connected layers are then applied, which is called excitation in the SE module. The final output is the learned weight of each value in the multi-valued categorical feature. Since this operates on a multi-valued categorical feature, in the implementation the subsequent operations such as one-hot encoding need to be carried out after padding to the maximum length.
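A rough NumPy sketch of this SE-style reweighting over the M padded embeddings (the mean pooling, reduction ratio, activations, and random initialization are assumptions for illustration, not the paper's exact code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

M, emb_dim, r = 6, 4, 2                        # M = max length after padding, r = reduction ratio
E = np.random.randn(M, emb_dim)                # padded embeddings of one multi-valued feature

# squeeze: compress each embedding vector into one statistic (mean pooling here)
z = E.mean(axis=1)                             # shape (M,)

# excitation: two fully connected layers output one weight per embedding
W1 = np.random.randn(M, M // r)
W2 = np.random.randn(M // r, M)
a = sigmoid(np.maximum(z @ W1, 0.0) @ W2)      # ReLU then sigmoid, shape (M,)

# re-weight the original embeddings with the learned weights
E_weighted = a[:, None] * E                    # shape (M, emb_dim)
```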
2. AutoInt[3] borrows the self-attention idea of the Transformer[4] to learn, for each embedding vector $\mathbf{e}_i$, weight information in the Value space. The specific operation is shown in Figure 5, where M is the number of values of the multi-valued categorical feature:
First, each embedding vector $\mathbf{e}_i$ is projected by matrix multiplication (a linear transformation) into several subspaces, namely Query, Key and Value:

$$\mathbf{q}_i = W_{Query}\,\mathbf{e}_i,\qquad \mathbf{k}_i = W_{Key}\,\mathbf{e}_i,\qquad \mathbf{v}_i = W_{Value}\,\mathbf{e}_i$$
Then the similarity between the current $\mathbf{q}_i$ and the Key vectors $\mathbf{k}_j$ of the other embedding vectors in the multi-valued feature is computed as a vector inner product:

$$s_{ij} = \langle \mathbf{q}_i, \mathbf{k}_j \rangle$$
Next, softmax is used to normalize the computed similarities:

$$\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{m=1}^{M} \exp(s_{im})}$$
The normalized values $\alpha_{ij}$ are the learned weights of each $\mathbf{e}_i$ in the Value space, so the weighted sum is taken not over the original embeddings but over the features mapped into the Value space:

$$\tilde{\mathbf{e}}_i = \sum_{j=1}^{M} \alpha_{ij}\,\mathbf{v}_j$$
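Putting these steps together, a minimal single-head self-attention sketch in NumPy (the dimensions and the single head are simplifications for illustration, not AutoInt's full multi-head setup):

```python
import numpy as np

M, emb_dim, d = 5, 8, 4                        # M values, embedding dim, projection dim
E = np.random.randn(M, emb_dim)                # e_1 ... e_M

W_Q = np.random.randn(emb_dim, d)              # Query projection
W_K = np.random.randn(emb_dim, d)              # Key projection
W_V = np.random.randn(emb_dim, d)              # Value projection

Q, K, V = E @ W_Q, E @ W_K, E @ W_V            # project each e_i into the three subspaces

scores = Q @ K.T                               # s_ij = <q_i, k_j>, shape (M, M)
scores -= scores.max(axis=1, keepdims=True)    # for numerical stability
alpha = np.exp(scores)
alpha /= alpha.sum(axis=1, keepdims=True)      # softmax over j: the learned weights

E_tilde = alpha @ V                            # weighted sum in the Value space, (M, d)
```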
In general, weights learned by a neural network are more complex and computationally expensive to obtain than weights derived by data mining, so there is a trade-off to consider when choosing between the two.
In addition to multi-valued categorical features, there are also behavioral sequence features. Their processing methods have a lot in common and can borrow from each other; when there is time I will introduce some simple ways of handling behavioral sequence features. If you are interested, follow the public account, and see you next time ~
▌ References
[1] [FiBiNET] FiBiNET: Combining Feature Importance and Bilinear Feature Interaction for Click-Through Rate Prediction (RecSys 2019)
[2] [SENet] Squeeze-and-Excitation Networks (CVPR 2018)
[3] [AutoInt] AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks (arXiv 2018)
[4] [Transformer] Attention Is All You Need (NeurIPS 2017)