Abstract

Generally, business recommendation systems rely on two common recall paradigms: the similarity-index paradigm (e.g., I2I) and the embedding-based retrieval (EBR) paradigm (e.g., DeepMatch). The weakness of the I2I paradigm is that it generalizes poorly to item pairs with few co-occurrences and has difficulty modeling the U2I part, so it lacks accuracy and personalization. The EBR paradigm does model the U2I part, but it compresses the user's interests into a single vector and fails to model the relationship between each user action and the scored item (in the spirit of target attention), so its recall lacks diversity. To combine the advantages of both and reduce their disadvantages as much as possible, we propose a new paradigm, the Path-based Deep Network (PDN). PDN uses TriggerNet to model the U2I part and SimNet to model the I2I part, and then models U2I2I end to end. PDN has been fully deployed in the content feed on the Taobao homepage, where it has become the main source of online recall, bringing roughly 20% improvements in clicks, GMV, and diversity. PDN was also accepted by SIGIR 2021.

Background

Recommendation technology is important and pervasive in Taobao; its purpose is to build a bridge that leads users directly to the products they are interested in, improving both user experience and conversion. A typical recommendation system consists of recall, pre-ranking, ranking, and re-ranking stages. Since recall sits at the very beginning of the pipeline and determines the upper bound of the overall recommendation quality, this work focuses on optimizing the recall stage of the 'good goods' content feed. The recall stage must efficiently select a small subset (generally thousands to hundreds of thousands) of items that the user may be interested in from the full item pool, for the downstream stages to filter and rank. In industry, recall generally follows two families of algorithms: the similarity-index recall paradigm and the embedding-based retrieval (EBR) paradigm.

At present, similarity-index recall in industry is mainly based on the Item2Item (I2I) paradigm. The approach is: Step 1, in the offline stage, build an inverted index table based on some item-similarity measure (such as the Pearson correlation coefficient); Step 2, in the serving stage, look the table up directly with the items in the user's historical behavior sequence, as sketched below.
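As a minimal sketch of these two steps (the session data layout and the cosine-style similarity are illustrative assumptions, not the production implementation):

```python
from collections import defaultdict
from itertools import combinations

# Step 1 (offline): build an inverted index item -> top-k similar items from
# session co-occurrence counts. A measure such as Pearson or Swing would
# replace this toy cosine-style similarity.
def build_i2i_index(sessions, k=3):
    co_count, item_count = defaultdict(int), defaultdict(int)
    for session in sessions:
        for item in set(session):
            item_count[item] += 1
        for a, b in combinations(set(session), 2):
            co_count[(a, b)] += 1
            co_count[(b, a)] += 1
    index = defaultdict(list)
    for (a, b), c in co_count.items():
        sim = c / (item_count[a] * item_count[b]) ** 0.5
        index[a].append((b, sim))
    return {a: sorted(sims, key=lambda x: -x[1])[:k] for a, sims in index.items()}

# Step 2 (online): look up the index for each item in the user's history.
def i2i_recall(index, user_history):
    candidates = defaultdict(float)
    for trigger in user_history:
        for item, sim in index.get(trigger, []):
            candidates[item] = max(candidates[item], sim)
    return sorted(candidates, key=candidates.get, reverse=True)

sessions = [["ipad", "iphone"], ["ipad", "macbook", "iphone"], ["nike", "adidas"]]
index = build_i2i_index(sessions)
print(i2i_recall(index, ["ipad"]))  # ['iphone', 'macbook']
```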

The advantages of the Item2Item paradigm are:

  1. It preserves the relevance of user interests;
  2. For users with rich behavior, the recall is also diverse;
  3. It can capture user interests in real time.

But there are four problems:

  1. The I2I index is usually built from co-occurrence statistics, so unpopular items and new items with few co-occurrences may never make it into the index;
  2. It is hard to combine the co-occurrence information of I2I with the side information of the items at both ends of a pair;
  3. It is hard to tie the construction of such indexes to diverse business objectives;
  4. It is hard to account for the joint probability of multiple triggers pointing to the same item.

The embedding-based retrieval (EBR) paradigm can use side information and model the joint effect of multiple user behaviors, so it has attracted growing attention in recent years. In short, the algorithm learns a user representation and an item representation separately, and performs recall via nearest-neighbor search at serving time. This family also has shortcomings, two in particular. First, it represents a user with only one or a few vectors (as in MIND), so it cannot represent the user's multi-faceted interests at the fine granularity that I2I can. Second, because of the parallel two-tower architecture (a user tower and an item tower), it is difficult to introduce co-occurrence information between the target item and the interacted items.

In summary, constrained by existing recall frameworks, the two-tower model uses user information and item profile information but fails to make explicit use of item co-occurrence information; the I2I index mainly uses item co-occurrence information but ignores user and item profile information, and cannot account for the joint effect of the whole behavior sequence on the target item. Moreover, because similarity can be computed in different ways, several I2I indexes often run online simultaneously. We therefore seek a way to unify I2I similarity and to address the four problems above as far as possible.

Preliminaries

As shown in Figure 1, we formulate the recommendation problem as a link prediction problem on a two-hop graph:


$$\hat{y}_{ui} = f\big( \bm{z}_u, \bm{x}_i, \{\bm{x}_{j}\}, \{\bm{a}_{uj}\}, \{\bm{c}_{ji}\} \big) \quad \textit{with}~~ j\in N(u)$$

where $N(u)$ denotes the set of items the user has interacted with. Most existing work, including I2I collaborative filtering and the two-tower model, can be regarded as special cases of this formula. For example, regression-based item collaborative filtering can be written as:


$$\hat{y}_{ui} = f\big(\{\bm{a}_{uj}\}, \{c_{ji}\} \big) = \sum_{j\in N(u)}f_r(\bm{a}_{uj})\,c_{ji}$$

Here $f_r: \mathcal{R}^m \rightarrow \mathcal{R}^1$ is a function that predicts the user's degree of interest in an interacted item, and $c_{ji}\in \mathcal{R}^1$ denotes the similarity between interacted item $j$ and the target item $i$. The method can thus be viewed as a sum over $n$ two-hop paths, each carrying weight $f_r(\bm{a}_{uj})c_{ji}$.
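A toy numeric reading of this formula (the numbers are made up purely for illustration):

```python
# Toy evaluation of the regression-based item-CF score: a sum over two-hop
# paths, each weighted by f_r(a_uj) * c_ji. All numbers are made up.
f_r = [0.9, 0.2]   # predicted interest in each interacted item j
c_ji = [0.5, 0.7]  # similarity of each interacted item j to target i
y_hat = sum(f * c for f, c in zip(f_r, c_ji))
print(y_hat)       # 0.9*0.5 + 0.2*0.7 = 0.59
```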


Likewise, EBR methods such as MF can be written as:

$$\hat{y}_{ui} = f\big( \bm{z}_u, \bm{x}_i, \{\bm{x}_{j}\} \big)=\bm{q}_i\big(\bm{p}_u+\frac{1}{\sqrt{|N(u)|}}\sum_{j\in{N(u)}}\bm{p}_j\big)^T$$

Here $\bm{q}_i$, $\bm{p}_u$, $\bm{p}_j$ denote the embeddings of the target item, the user profile, and an interacted item, respectively. MF can be seen as the sum over $n+1$ paths on the two-hop graph: $\bm{q}_i\bm{p}_u^T$ is the weight of the direct path, and $\frac{1}{\sqrt{|N(u)|}}\bm{q}_i\bm{p}_j^T$ is the weight of a two-hop path. Similarly, YouTubeDNN, a deep version of MF, can be written as:


$$\hat{y}_{ui} = f\big( \bm{z}_u, \bm{x}_i, \{\bm{x}_{j}\}\big)=\bm{q}_i\Big(\textit{MLP}\big(\bm{p}_u, \frac{1}{|N(u)|}\sum_{j\in N(u)} \bm{p}_j\big)\Big)^T$$

Existing recall methods, limited by retrieval efficiency and model structure, find it difficult to use all the information in the graph: the I2I paradigm lacks user and item profile information, and the EBR paradigm does not explicitly model item co-occurrence. We therefore propose a new framework, the Path-based Deep Network (PDN), to make use of all this information and achieve personalized, multi-peak interest recall with low latency.

Figure 1: The user's preference for the target item is decomposed over a two-hop graph.

The first hop represents the user's preference for an interacted item, and the second hop represents the similarity between that interacted item and the target item. $\bm{z}_u$ denotes the profile (ID, gender, etc.) of user $u$; $\{\bm{x}_{j_k}\}_{k=1}^n$ denotes the profiles (ID, category, etc.) of the $n$ items the user has interacted with; $\bm{x}_i$ denotes the profile of the target item; $\{\bm{a}_{uj_k}\}_{k=1}^n$ denotes the user's behavior on the $k$-th interacted item (dwell time, purchase count, etc.); and $\{\bm{c}_{j_k i}\}_{k=1}^n$ denotes the relation information (co-occurrence count, etc.) between the $k$-th interacted item and the target item. The thickness of an edge represents its weight.

Methods

To combine fine-grained, multi-peak personalized user interests with recall, we built a new-generation recall framework based on the two-hop graph shown in Figure 1. This framework overcomes the inability of previous frameworks to use all available information, and unifies the respective strengths of I2I and the two-tower model in one jointly optimized model.

Figure 1 contains n two-hop paths (n is the length of the behavior sequence) and one direct path. In each two-hop path, the first hop represents the user's interest in an interacted item, and the second hop represents the similarity between that interacted item and the target item. Unlike the two-tower model, we therefore model the user's multi-peak interests independently and at fine granularity (each interacted item establishes its own path), solving the problem of expressing multi-faceted interests with a single vector. The direct path represents the user's direct affinity for the target item; for example, girls may be more interested in clothing, while boys may be more interested in electronics.

Specifically, for the n two-hop paths, our framework:

(1) uses a TriggerNet to model the user's degree of interest in each interacted item, based on user information, behavior information, and interacted-item information, yielding a variable-length user representation vector (dimension 1×n) whose k-th entry represents the user's interest in the k-th interacted item;

(2) uses a Similarity Net to model the similarity between each interacted item and the target item, based on the item information and the relation between them, yielding a variable-length representation of the target item whose k-th entry represents the similarity between the k-th interacted item and the target item. Finally, the weights of the n+1 paths are aggregated to predict the user's overall preference for the target item.

Figure 2: Overall framework of PDN

Overview of PDN

Figure 2 shows the proposed recall framework PDN, which consists of four modules: Embedding Layer, Trigger Net (TrigNet), Similarity Net (SimNet), and Direct & Bias Net. The forward pass of PDN can be summarized as:


$$\hat{y}_{ui} = \textit{AGG}\Big(f_{d}\big(\bm{z}_u, \bm{pos}\big), \big\{\textit{PATH}_{uji}\big\} \Big) \quad \textit{with}~~ j\in N(u)$$

$$\textit{PATH}_{uji} = \textit{MEG}\big(\textit{TrigNet}(\bm{z}_u, \bm{a}_{uj}, \bm{x}_j), \textit{SimNet}(\bm{x}_j, \bm{c}_{ji}, \bm{x}_i) \big)$$

where $f_d$ computes the direct-path weight, $\textit{PATH}_{uji}$ denotes the weight of the two-hop path through interacted item $j$, AGG is the scoring function that aggregates the $n+1$ path weights to predict the relevance between the user and the target item, and MEG is the function that merges the two hop weights within a single two-hop path.

To ensure PDN meets the latency requirements of the recall stage, MEG is defined as the dot product or the sum of two vectors, and $f_d$ is defined as a dot product. PDN can therefore be written as:


$$\hat{y}_{ui} = pbias + ubias + \sum_{j\in N(u)}\textit{MEG}\big(\textit{TrigNet}(\bm{z}_u, \bm{a}_{uj}, \bm{x}_j), \textit{SimNet}(\bm{x}_j, \bm{c}_{ji}, \bm{x}_i) \big)$$

Below, we’ll look at the individual modules in PDN in detail.

Embedding Layer

As shown in Figure 1, PDN mainly uses four types of features: user information $\bm{z}_u$, item information $\bm{x}$, behavior information $\{\bm{a}_{uj_k}\}_{k=1}^n$, and item-relation information $\{\bm{c}_{j_k i}\}_{k=1}^n$.

PDN converts these into embedding vectors via the Embedding Layer:


$$\bm{E}(\bm{z}_u) \in \mathcal{R}^{1 \times d_u}, \quad \bm{E}(\bm{x}) \in \mathcal{R}^{1 \times d_i}, \quad \bm{E}(\bm{a}_{uj}) \in \mathcal{R}^{1 \times d_a}, \quad \bm{E}(\bm{c}_{ji}) \in \mathcal{R}^{1 \times d_c}$$

where $d_u, d_i, d_a, d_c$ denote the dimensions of the respective feature embeddings.
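As a minimal embedding-layer sketch in PyTorch (vocabulary sizes, dimensions, and the raw behavior/relation features are placeholder assumptions; the post does not specify them):

```python
import torch
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    """Maps the four feature groups to dense vectors; sizes are placeholders."""
    def __init__(self, n_users=10_000, n_items=50_000, d_u=32, d_i=32, d_a=8, d_c=8):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, d_u)  # E(z_u)
        self.item_emb = nn.Embedding(n_items, d_i)  # E(x), shared by trigger and target
        # In this sketch, behavior a_uj and relation c_ji arrive as dense count
        # features and are projected linearly to d_a / d_c dimensions.
        self.beh_proj = nn.Linear(4, d_a)   # e.g. dwell time, click/buy/cart counts
        self.rel_proj = nn.Linear(2, d_c)   # e.g. co-click / co-buy counts

    def forward(self, user_id, item_ids, beh_feats, rel_feats):
        return (self.user_emb(user_id), self.item_emb(item_ids),
                self.beh_proj(beh_feats), self.rel_proj(rel_feats))
```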

Trigger Net & Similarity Net

After the embedding layer, PDN computes the weight of each two-hop path between the user and the target item. For the first hop, PDN uses TrigNet to compute the user's interest in each interacted item, capturing the user's multi-peak interests. Specifically, given user $u$ and an interacted item $j$, the calculation is:


$$t_{uj} = \textit{TrigNet}(\bm{z}_u, \bm{a}_{uj}, \bm{x}_j)= \textit{MLP}\Big( CAT\big(\bm{E}(\bm{z}_u), \bm{E}(\bm{a}_{uj}), \bm{E}(\bm{x}_{j})\big) \Big)$$

where $CAT\big(\bm{E}(\bm{z}_u), \bm{E}(\bm{a}_{uj}), \bm{E}(\bm{x}_j)\big) \in \mathcal{R}^{1 \times (d_u + d_a + d_i)}$ is the concatenation operation and $t_{uj}$ indicates how much user $u$ likes item $j$. When the user has interacted with $n$ items, $\bm{T}_u = [t_{u1}, t_{u2}, \ldots, t_{un}]$ can be regarded as a variable-length user representation. Two-tower models represent a user with one or a few fixed-length vectors, which is considered the bottleneck for capturing multiple interests, because multi-interest information is mixed into a few vectors without constraint, leading to inaccurate recall. Compared with those methods, $\bm{T}_u$ is more fine-grained and interpretable, because each entry explicitly conveys the user's degree of interest in one interacted item.

For the second hop, SimNet computes the similarity between each interacted item and the target item based on item information and co-occurrence information:


$$s_{ji} = \textit{SimNet}(\bm{x}_j, \bm{c}_{ji}, \bm{x}_i) = \textit{MLP}\Big(CAT\big(\bm{E}(\bm{x}_j), \bm{E}(\bm{c}_{ji}), \bm{E}(\bm{x}_i)\big) \Big)$$

where $s_{ji}$ denotes the similarity between items $j$ and $i$, and $\bm{S}_i = [s_{1i}, s_{2i}, \ldots, s_{ni}]$ can be regarded as a variable-length representation of the target item. Note that SimNet learns item-to-item similarity explicitly, so it can be deployed online on its own, replacing existing I2I strategies. Having obtained $t_{uj}$ and $s_{ji}$, PDN computes the relevance weight of each two-hop path:


$$\textit{PATH}_{uji}=\textit{MEG}(t_{uj}, s_{ji})=\ln(1+e^{t_{uj}}e^{s_{ji}}) = \textit{softplus}(t_{uj}+s_{ji})$$
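Putting the two hops together, a minimal PyTorch sketch of TrigNet, SimNet, and the MEG merge (hidden sizes and the batching scheme are assumptions; features are assumed to be already embedded):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(d_in, d_hidden=64):
    # Small scoring head; depth and width are assumptions.
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1))

class PathScorer(nn.Module):
    """Scores the n two-hop paths: PATH_uji = softplus(t_uj + s_ji)."""
    def __init__(self, d_u=32, d_i=32, d_a=8, d_c=8):
        super().__init__()
        self.trig_net = mlp(d_u + d_a + d_i)  # TrigNet: (z_u, a_uj, x_j) -> t_uj
        self.sim_net = mlp(d_i + d_c + d_i)   # SimNet:  (x_j, c_ji, x_i) -> s_ji

    def forward(self, z_u, a_uj, x_j, c_ji, x_i):
        # z_u: [B, d_u]; a_uj, x_j, c_ji: [B, n, .]; x_i: [B, d_i]
        n = x_j.size(1)
        z = z_u.unsqueeze(1).expand(-1, n, -1)
        xi = x_i.unsqueeze(1).expand(-1, n, -1)
        t_uj = self.trig_net(torch.cat([z, a_uj, x_j], dim=-1)).squeeze(-1)  # [B, n]
        s_ji = self.sim_net(torch.cat([x_j, c_ji, xi], dim=-1)).squeeze(-1)  # [B, n]
        return F.softplus(t_uj + s_ji)  # MEG: ln(1 + e^{t_uj} e^{s_ji})
```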

Direct & Bias Net

Selection bias, such as position bias, has been shown to be an important factor in recommendation systems. For example, a user is more likely to click on an item near the top of the page, even if it is not the most relevant one. To eliminate such bias, we train a shallow tower on the features that induce selection bias (position information, etc.). As shown in Figure 2, the output of the Bias Net, $y_{bias}$, is added to the output of the main model during training and removed at serving time to ensure unbiased scoring. The Direct Net is similar, mainly modeling user bias. Separating these two parts lets TrigNet and SimNet learn signals that are independent of user and position.

Loss function

Whether the user will click on an item can be treated as a binary classification task. PDN therefore aggregates the weights of the n+1 paths and the bias score into a user-item relevance score, and converts it into a click probability:


$$\hat{y}_{u,i} = \sum_{j=1}^n \ln(1+e^{t_{uj}}e^{s_{ji}}) + \textit{softplus}(y_{bias})$$

$$p_{u,i} = 1 - \exp(-\hat{y}_{u,i})$$

Because of the softplus, $\hat{y}_{u,i} \in [0, +\infty)$; we therefore use $1 - \exp(-\hat{y}_{u,i})$ to map the prediction into $[0, 1)$. We train the model with the cross-entropy loss:


$$l_{u,i} = -\big(y_{u,i}\log(p_{u,i}) + (1-y_{u,i})\log(1-p_{u,i})\big)$$

where $y_{u,i}$ is the sample label.
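A sketch of this aggregation and loss under the definitions above (the clamp for numerical stability is our addition):

```python
import torch
import torch.nn.functional as F

def pdn_loss(path_weights, y_bias, labels, eps=1e-7):
    """path_weights: [B, n] softplus path scores; y_bias: [B]; labels: [B] in {0,1}."""
    y_hat = path_weights.sum(dim=1) + F.softplus(y_bias)  # y_hat in (0, +inf)
    p = 1.0 - torch.exp(-y_hat)                           # map score to (0, 1)
    p = p.clamp(eps, 1.0 - eps)                           # numerical stability
    return F.binary_cross_entropy(p, labels)
```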

Constrained learning

To ensure the model converges to a good region, we carefully design the constraints on the two-hop paths. As mentioned above, in the last layer of TrigNet and SimNet we use exp() rather than other activation functions, constraining the outputs $e^{t_{uj}}$ and $e^{s_{ji}}$ to be positive. If negative path weights were allowed, PDN would search for local optima in a much broader parameter space, which easily leads to overfitting. In practice, SimNet is used to generate the index and TrigNet is used for trigger selection; the consequence of such overfitting is not merely lower accuracy, but that the learned index may become unusable.

We illustrate the possible problems of unconstrained learning (i.e., allowing negative path weights) with two examples. In the first, shown on the left of Figure 3, a user has clicked an iPad and a Huawei P40 Pro, and an iPhone appears as a negative sample. Assume the item similarities are correct but the trigger side is overfitted. The appearance of this negative sample may indicate that the user's interest in this category has been consumed; we hope the model captures this signal by learning two smaller trigger weights. However, the one-positive-one-negative configuration shown in the figure is also a suboptimal solution with a relatively small loss, and the optimizer may fall into this trap and never get out.

The second example is shown on the right of Figure 3. A user has clicked a Nike outfit and Telunsu milk, and an iPhone appears as a negative sample. Assume the trigger weights are correct but the similarities are overfitted. At this point 0.8×(−0.8) + 0.5×0.5 = −0.39, yielding a very low loss for this negative sample, but the model has wrongly learned a similarity between Telunsu and the iPhone. If the weights are constrained to be positive, the optimizer instead pushes both similarities toward zero, avoiding this one-positive-one-negative overfitting.

Figure 3: Bad cases when two-hop path weights are allowed to be negative

Online serving

Path-based retrieval

To meet the latency requirements of recall, as shown in Figure 4, we built a new recall pipeline based on a greedy strategy: path retrieval. Specifically, we decouple path retrieval into two parts:

(1) TrigNet retrieves the top-M interacted items the user is most interested in;

(2) the item-similarity index built from SimNet performs I2I retrieval for each of the top-M interacted items.

We deploy TrigNet as a real-time online service that scores each interacted item; for SimNet, we build an inverted index table offline from the item similarities it computes. The online recall steps can be summarized as follows:

Index generation: based on SimNet, we select the k most similar items for each item in the item pool to build the index, storing the similarity score $s_{ji}$. See the 'Index generation' section below for details.

Interacted-item extraction: when the user enters the scene, we use TrigNet to score every item the user has interacted with ($t_{uj}$) and return the top-M interacted items.

Top-K retrieval: we query the SimNet-based index table with the top-M interacted items, score the resulting M×K candidate items with the formula below, and return the final recall results.


$$\hat{s}_{u,i} =\sum_{j=1}^m\textit{softplus}(t_{uj}+s_{ji})$$

Position bias and user bias are no longer needed at this stage. The overall recall framework is shown in Figure 5.
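A sketch of this greedy two-stage retrieval (the TrigNet scoring service and the index are reduced to in-memory dicts with made-up values):

```python
import heapq
import math

def path_retrieval(trig_scores, sim_index, top_m=2, top_n=3):
    """trig_scores: {trigger: t_uj} from the TrigNet service;
    sim_index: {trigger: [(candidate, s_ji), ...]} built offline by SimNet."""
    # Step 1: keep the top-M triggers the user is most interested in.
    triggers = heapq.nlargest(top_m, trig_scores, key=trig_scores.get)
    # Step 2: score each reachable candidate with sum_j softplus(t_uj + s_ji).
    scores = {}
    for j in triggers:
        for i, s_ji in sim_index.get(j, []):
            scores[i] = scores.get(i, 0.0) + math.log1p(math.exp(trig_scores[j] + s_ji))
    return heapq.nlargest(top_n, scores, key=scores.get)

trig_scores = {"ipad": 2.1, "nike": 0.3, "macbook": 1.4}
sim_index = {"ipad": [("iphone", 1.8), ("pencil", 1.0)],
             "macbook": [("iphone", 0.9), ("monitor", 1.2)]}
print(path_retrieval(trig_scores, sim_index))  # "iphone" is scored via two paths
```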

Figure 4: Path retrieval

Index generation

Because the item pool is large, we need to compress the similarity matrix from $\mathcal{R}^{N \times N}$ to $\mathcal{R}^{N \times k}$ to keep offline computation and storage tractable. There are three steps.

Step 1, candidate pair enumeration: we generate candidate pairs with two strategies: items that have appeared together in the same session, and items related through item information, such as items from the same brand or store.

Step 2, candidate pair scoring: use SimNet to score each pair.

Step 3, index construction: for each item, sort the SimNet scores, truncate according to certain rules, and build an N×k index table.

Since SimNet takes as input not only co-occurrence information but also the side information of both items, it can address the new-item recall problem: in Step 1, we enumerate pairs of items that are highly similar to new items from the perspective of item attributes.
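A sketch of the three steps (the attribute-based enumeration is what covers new items; `sim_net_score` stands in for a SimNet forward pass):

```python
from collections import defaultdict
from itertools import combinations

def generate_index(sessions, item_attrs, sim_net_score, k=50):
    """Compress the N x N similarity matrix to an N x k inverted index."""
    # Step 1: enumerate candidate pairs from (a) session co-occurrence and
    # (b) shared attributes such as brand/store; (b) also covers new items
    # that have no co-occurrence statistics yet.
    pairs = set()
    for session in sessions:
        pairs.update(combinations(set(session), 2))
    by_attr = defaultdict(list)
    for item, attr in item_attrs.items():
        by_attr[attr].append(item)
    for items in by_attr.values():
        pairs.update(combinations(items, 2))
    # Step 2: score each candidate pair with SimNet (both directions, since
    # s_ji also depends on the directional features of the pair).
    scored = defaultdict(list)
    for a, b in pairs:
        scored[a].append((b, sim_net_score(a, b)))
        scored[b].append((a, sim_net_score(b, a)))
    # Step 3: per item, sort by score and truncate to the top-k neighbors.
    return {a: sorted(sims, key=lambda x: -x[1])[:k] for a, sims in scored.items()}
```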

Figure 5: Overall recall framework

Experiments

Offline validation

Table 1 shows offline validation of I2I-style recall using the online exposure-click log. The offline recall procedure uses all of a user's behaviors within 3 days as triggers, looks each trigger up in four different indexes, and retrieves the top-N (3 or 8) items per trigger.

Because all triggers are used, the number of recalled items differs across users. Considering that some similarity indexes are truncated and may return fewer than 3 or 8 items for some triggers, we additionally report Precision. RankI2I is a GBDT model that uses Swing I2I as a feature together with item profiles.

The features on the SimNet side of PDN are essentially the same as RankI2I's, and the training objective is the same: CTR on the 'good goods' first-jump page. The difference between PDNv13 and PDNv43 is that v43 adopts constrained learning.

Table 1: Offline hit-rate comparison based on the 'good goods' exposure log

                 Swing I2I   RankI2I   PDNv13   PDNv43
TOP3 Hit-Rate      7.56%      14.55%    12.08%   22.99%
TOP3 Precision     0.08%       0.15%     0.12%    0.23%
TOP8 Hit-Rate     13.09%      20.18%    22.12%   34.68%
TOP8 Precision     0.06%       0.09%     0.09%    0.14%

Online results

Table 2 shows the results of the online A/B experiment. The baseline consists of multiple recall channels, including index recall (SwingI2I, RankI2I, DeepI2I, etc.) and vector recall (single-tower, two-tower Deep Match, and Node2Vec). In the online experiment, PDNv43 replaces all index-recall channels (the triggers remain unchanged; all previous I2I indexes are replaced by the unified index). Figure 6 shows the share of online recall after PDNv43 went live: it compresses the single-/two-tower recall share to 6%, taking over almost all of the model-based recall traffic.

Table 2: Online comparison in the 'good goods' scenario

Base: RankI2I+Swing   Clicks per capita   pCTR      uCTR     DPV      Dwell time per capita   GMV       Diversity
PDNv43                +17.55%             +15.07%   +1.50%   +7.81%   +7.56%                  +21.25%   +19.12%

Figure 6: Share of online recall in the 'good goods' scenario

Effect of the number of user triggers

Recall in the I2I paradigm is affected by the number of triggers, so we compared several methods across trigger counts. Users are divided into four buckets by trigger count: at most 15, 15-30, 30-45, and more than 45. We validate in two ways: the top-N hit rate, shown in Table 3, and the category coverage (diversity) of the top-N recall against users' interests, shown in Figure 7.

PDN is much better than Swing I2I and the two-tower model in every bucket. Notably, when the number of triggers is small, the two-tower hit rate is also low, showing that when user behavior is sparse, two-tower recall struggles to predict user interests. Compared with SwingI2I, when the trigger count is at most 45, the absolute gain in hit rate is stable at about 20%; relative to the two-tower model, the absolute gain is stable at around 15%. In terms of diversity, PDNv43 maintains a 15-20% absolute gain over the two-tower model. BST [3] in Figure 7 is also a two-tower vector recall, but it uses a Transformer to model the user-sequence side.

Table 3: Offline hit-rate comparison based on 'good goods' exposure logs (users bucketed by trigger count)

Figure 7: Diversity across user buckets (bucketed by trigger count)

Comparison on public datasets

We also ran experiments on public datasets. For the EBR paradigm we chose DSSM, YouTubeDNN, and BST; the differences among the three can be understood as: no user sequence, mean pooling over the user sequence, and a Transformer over the user sequence, respectively. For the I2I paradigm we chose the classical Item-CF and SLIM. To highlight PDN's ranking ability, we also included DIN as a comparison; when computing hit rate for DIN, all candidates are scored and the top-N is taken.

Table 4: Comparison on public datasets

Discussion and Outlook

The model can also be understood as the inner product of two sparse vectors whose length equals the item corpus size (denoted N). Specifically, for a user with k behaviors, the user representation is an N-dimensional vector in which only k entries are non-zero; for a target item with m similar items, the item representation is an N-dimensional vector in which only m entries are non-zero. Since k << N and m << N, both vectors are sparse. The first formula in the 'Loss function' section can be regarded as the inner product of these two sparse vectors, except that each term is squashed by an ln function. Compared with fixing the user and item representations to 64 or 128 dimensions, the capacity of this model is considerably higher.
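A toy illustration of this sparse-inner-product reading (made-up values):

```python
import math

# Sparse user vector: trigger weights t_uj on the k items the user touched
# (k = 2 non-zeros out of the full corpus of N items).
user = {"ipad": 2.1, "nike": 0.3}
# Sparse item vector: similarities s_ji from the target's m indexed neighbors
# (m = 2 non-zeros out of N).
item = {"ipad": 1.8, "macbook": 0.9}
# The score only involves coordinates where both vectors are non-zero, with
# each overlapping path squashed by ln(1 + e^t * e^s).
score = sum(math.log1p(math.exp(user[j] + item[j])) for j in user.keys() & item.keys())
print(score)  # one overlapping path, through "ipad"
```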

PDN can also be used directly as a ranking model. We have tried this in the ranking model of the live-stream feed and obtained offline validation gains: adding PDN to the pre-ranking model increases its prediction AUC by +0.6%, bringing it close to the prediction AUC of the fine-ranking model.

                                                               Training AUC   Prediction AUC
Target Attention + mean pooling                                72.2           72.8
Pre-ranking: user-side mean pooling + two-tower inner product  71.1           72.0
Pre-ranking + PDN                                              72.0           72.63

Acknowledgments

Thanks to Deng Hongbo and Piaoxue for their guidance.

Thanks to Prof. Li Chenliang of Wuhan University for his help and guidance.

Thanks to @Yunzhi and @Weiming for their cooperation and support.

References

[1] Learning Deep Structured Semantic Models for Web Search using Clickthrough Data.

[2] Deep Neural Networks for YouTube Recommendations.

[3] Behavior sequence transformer for e-commerce recommendation in Alibaba.

[4] Item-Based Collaborative Filtering Recommendation Algorithms.

[5] Slim: Sparse linear methods for top-n recommender systems.

[6] Deep Interest Network for Click-Through Rate Prediction.