1. Foreword

Industrial recommendation systems typically adopt a two-stage architecture: recall plus ranking. The recall stage (Match) is responsible for efficiently retrieving the top-K relevant items from a very large candidate set. The ranking stage (Rank) personalizes the ordering of the recalled items according to business needs, and generally includes pre-ranking, fine ranking, and blending. In an information feed, the article pool may contain tens of millions of items; Match filters this down to thousands to tens of thousands of articles, which are then passed downstream to Rank, as shown in Figure 1-1.

The recall stage usually runs multiple recall paths, including traditional explicit recall based on tags and implicit recall based on item embeddings. The former connects users and items through tags with strict matching, giving high precision and a notable effect; moreover, because a new article only needs to be appended to the posting lists ("zippers") of its tags, explicit recall is highly real-time and can accelerate the cold start of new articles. The latter generally uses a deep network to learn feature interactions, especially higher-order ones, obtains dense user and item vectors through representation learning, and finds similar items by vector similarity, which generalizes well.

This article shares some of our team's attempts at explicit recall; implicit recall will be covered in a future article.

Explicit recall includes both recall based on basic information and content-based recommendation (CB). The former associates users with articles via geographic location, demographic attributes, device model, installed-app list, and so on. The latter mainly recommends, based on the interest points expressed in the user's profile, collections of content related to those interest points.

2. Label system and inverted index

At Yidian Zixun we have established a complete tag system for fine-grained labeling of users and content, and built an inverted index system to quickly retrieve the content associated with each tag.

2.1 Label system

Content description, user profiling, and explicit recall all rely heavily on the tag system. Defining tags is a business issue that must be coordinated globally; the tags need to satisfy requirements such as accuracy, coverage, and discrimination, which is crucial for fine-grained operations. A complete tag system usually comprises a content tag system and a user tag system. The former describes content comprehensively and discriminatively across levels, dimensions, and styles; an article's tags generally include, but are not limited to, coarse and fine categories, tags, keywords, and entities, as shown in Figure 2-1. The latter includes the user's basic attributes, activity level, and interest tags.

2.2 User interest modeling

A user's tag data comes from feedback behaviors on articles, such as clicks, views, comments, favorites, and likes. Accumulating these behaviors over article tags builds the user's interest profile: user interest modeling computes the user's interest distribution from historical behavior on articles. By the length of the statistical time window, interests divide into long-term interest, short-term interest, and dynamic features, whose time windows and interest stability decrease in turn. The general representation of a user interest is (tag, weight), where tag denotes the interest point and weight denotes the strength of the user's interest in that tag. Figure 2-2 shows part of one user's interest representation.
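
As a minimal illustration, interest weights in this (tag, weight) form might be accumulated from behavior logs with a time decay. The per-action weights and the daily decay factor below are hypothetical placeholders, not values from our system:

```python
import time
from collections import defaultdict

# Hypothetical per-action weights; the real values are a business choice.
ACTION_WEIGHTS = {"click": 1.0, "favorite": 3.0, "like": 2.0, "comment": 2.5}
DECAY_PER_DAY = 0.95  # assumed exponential decay factor per day

def build_interest_profile(events, now=None):
    """events: iterable of (timestamp, action, doc_tags); returns {tag: weight}."""
    now = now or time.time()
    profile = defaultdict(float)
    for ts, action, tags in events:
        age_days = (now - ts) / 86400.0
        decay = DECAY_PER_DAY ** age_days  # older behavior contributes less
        for tag in tags:
            profile[tag] += ACTION_WEIGHTS.get(action, 0.0) * decay
    # Normalize so weights are comparable across users.
    total = sum(profile.values()) or 1.0
    return {tag: w / total for tag, w in profile.items()}
```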

2.3 Inverted index

How do we quickly retrieve the top-K articles related to a tag from a pool of tens of millions? In engineering practice an inverted index is generally used to assist the computation: KV stores such as HBase and Redis are commonly used to build the "tag -> content" inverted index. A new article is associated with its tags as the inverted index is built and can immediately be exposed to users through explicit recall, so CB recall accelerates the rollout of new resources; this is also why explicit recall has the strong real-time performance mentioned above. As shown in Figure 2-3, when an article is ingested, we first understand its content and label it according to the tag system, then attach it to each tag's posting list in a given order.
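
Below is a minimal sketch of such a "tag -> content" index on Redis sorted sets; the key naming scheme and the per-document score are illustrative assumptions, since the article does not specify them:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def index_article(doc_id: str, tags: dict):
    """tags: {tag: score}; score orders the posting list, e.g. by quality or recency."""
    for tag, score in tags.items():
        r.zadd(f"inv:{tag}", {doc_id: score})

def recall_by_tag(tag: str, k: int = 100):
    """Return the top-k doc ids on this tag's posting list, highest score first."""
    return r.zrevrange(f"inv:{tag}", 0, k - 1)

# A new article becomes retrievable the moment it is indexed,
# which is why tag-based recall helps new-article cold start.
index_article("doc_42", {"NBA": 0.9, "basketball": 0.8})
print(recall_by_tag("NBA", k=10))
```

Sorted sets keep each posting list ordered by score, so top-K retrieval per tag is a single range query.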

2.4 Tag-based recall

An index middle layer over articles is built and matched against profile information for recall, with the tag serving as the intermediary that connects users and content. In other words, we first obtain the user's interest tags and their strengths, then query related content by those tags, as shown in Figure 2-4. Recommendation here generally proceeds in two stages: user -> interest tag (such as a brand or label), and interest tag -> item. We optimize user -> interest tag into various matching and ranking strategies, and optimize interest tag -> doc into inverted indexes.
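
Combining the two stages, a sketch of the recall pass might look like the following; the merge rule (interest weight discounted by posting-list position) is a toy choice for illustration:

```python
from collections import defaultdict
from typing import Callable, Dict, List

def tag_based_recall(profile: Dict[str, float],
                     fetch: Callable[[str, int], List[str]],
                     top_tags: int = 10, k_per_tag: int = 50) -> List[str]:
    """profile: {tag: weight} from interest modeling.
    fetch: posting-list query, e.g. the recall_by_tag helper sketched above.
    Stage 1: pick the user's strongest interest tags.
    Stage 2: query each tag's posting list and merge the candidates."""
    candidates = defaultdict(float)
    strongest = sorted(profile.items(), key=lambda kv: kv[1], reverse=True)[:top_tags]
    for tag, weight in strongest:
        for rank, doc_id in enumerate(fetch(tag, k_per_tag)):
            # Toy merge rule: interest weight discounted by list position.
            candidates[doc_id] += weight / (rank + 1)
    return sorted(candidates, key=candidates.get, reverse=True)
```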

3. Interest selection

Content-based explicit recall mainly involves tag construction, profile accumulation, posting-list ordering, and content recall. Here we describe how we realize interest-point selection in content recall, as shown in Figure 3-1. When a user enters the feed, which interest points should be recommended in real time, based on the interest representation in the user's profile? For tags the user is more interested in, the user clicks more news and devotes more attention and time to the app; in business data this appears as click-through rate, clicks per user, dwell time, and other metrics. We therefore abstract interest selection as an interest-point prediction problem, guided by improving clicks per user and other business metrics.

3.1 Multi-objective modeling

Interest-point prediction relies on rich user behavior logs and uses a deep network to build the association between user features and interest points. With reference to Figure 3-1, our technical solution is briefly as follows:

  1. Goal: increase clicks per user and dwell time.

  2. Model: multi-objective modeling, multi-task joint training.

    A single objective reflects whether content meets only one level of user need. Optimizing click-through rate alone may over-expose clickbait articles; optimizing dwell time alone makes long videos or long articles more likely to be recalled. Industry progress on recommendation systems likewise shows that CTR, duration, and completion-rate models are common practice for feed products. Meanwhile, many tasks are correlated, such as CTR and CVR. When tasks have some correlation, multi-task joint training beats training each task independently, because multi-task learning (MTL) emphasizes knowledge sharing and joint learning across tasks. Based on this, we chose MTL for interest-point prediction. For the model structure we adopted the shared-bottom architecture in MTL: different tasks share the bottom layers and diverge only at the dense layers, where each task forms its own tower connected to its own loss function. Each task's tower uses the Wide&Deep structure (see the sketch after this list).

  3. Sample: using the article as a bridge, we obtain the relationship between user and tag, i.e., (user, doc) -> article profile -> (user, tag), where tag is a tag of the doc. The click task is binary classification: a click yields a positive sample, no click a negative sample. The dwell-time task is a regression task whose target is the real dwell time after truncation and transformation.

  4. Features: we design features describing user, tag, and context attributes, including:

    User features: demographic attributes such as age and gender; a long-term profile of relatively stable interest points accumulated over a long period; and a short-term profile of interest tags accumulated from the user's behavior in the most recent N days. Tag features: currently the tag ID feature. Match features: the signal strength of the article tag in the user's profile, indicating the user's interest in that tag. Context features: for example, the user's refresh batch number and the article's display position.

    Each task's tower uses the Wide&Deep model structure. The Wide part takes as input the match features between the user profile and the article tags, and is responsible for memorization. The Deep part takes the vectorized user, tag, and context features; by automatically combining these features it mines latent patterns and improves the model's generalization.

  5. Loss: the CTR task is a classification task and uses log loss; the dwell task is a regression task and uses mean-squared-error loss. The losses of the two tasks are combined into a multi-task loss.

  6. Serving: during online prediction, candidate tags are drawn from the user's profile, and the top-N interest points by model score are selected, with N adjusted by tag granularity. The score of an interest point is the CTR task's predicted click probability multiplied by the dwell task's predicted value; how the two predictions are combined can be adjusted to business needs. Online serving shows the precise-matching and direct-efficiency advantages of explicit recall: at prediction time, the candidate interest points come from the user's profile, those interest points derive from the user's own click behavior and thus carry a certain confidence, and we directly fetch the articles related to them. Explicit recall can also clearly explain why a user was shown an article.
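
To make the structure in points 2 through 6 concrete, here is a minimal PyTorch sketch of a shared bottom with two Wide&Deep-style towers; layer sizes, feature dimensions, and the toy data are illustrative assumptions, not our production configuration:

```python
import torch
import torch.nn as nn

class SharedBottomMTL(nn.Module):
    """Shared bottom + one Wide&Deep-style tower per task (CTR and dwell)."""

    def __init__(self, deep_dim=64, wide_dim=8, hidden=32):
        super().__init__()
        # Shared bottom over the deep (embedded user/tag/context) features.
        self.bottom = nn.Sequential(nn.Linear(deep_dim, hidden), nn.ReLU())
        # Task-specific dense layers ("towers"); the wide (match) features
        # bypass the bottom and feed each tower directly, for memorization.
        self.ctr_tower = nn.Linear(hidden + wide_dim, 1)
        self.dwell_tower = nn.Linear(hidden + wide_dim, 1)

    def forward(self, deep_x, wide_x):
        shared = self.bottom(deep_x)
        h = torch.cat([shared, wide_x], dim=-1)
        ctr_logit = self.ctr_tower(h).squeeze(-1)     # click: binary classification
        dwell_pred = self.dwell_tower(h).squeeze(-1)  # dwell: regression
        return ctr_logit, dwell_pred

model = SharedBottomMTL()
bce, mse = nn.BCEWithLogitsLoss(), nn.MSELoss()

# Toy batch of 4 samples with assumed feature dimensions.
deep_x, wide_x = torch.randn(4, 64), torch.randn(4, 8)
click_y, dwell_y = torch.randint(0, 2, (4,)).float(), torch.rand(4)

ctr_logit, dwell_pred = model(deep_x, wide_x)
loss = bce(ctr_logit, click_y) + mse(dwell_pred, dwell_y)  # naive sum; see 3.2

# Serving score per point 6: predicted click probability * predicted dwell value.
score = torch.sigmoid(ctr_logit) * dwell_pred
```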

All six parts above can be adjusted and optimized for the business. Next, we share our attempts at balancing the multi-task loss.

3.2 Balancing multi-objective loss

In the MTL field there is much excellent work iterating and innovating on network structure, such as ESMM [1], MMoE [2], DUPN [3], and PLE [4]. At the same time, optimizing the multi-task loss is another direction worth exploring. The simplest approach is to add the losses of the different tasks directly; however, their magnitudes may differ, so the whole training process can become dominated by one task. A step further is a weighted sum of the task losses, but this keeps the weights constant throughout training. In reality, tasks differ in how hard they are to learn and may be at different stages of learning.

Therefore, a better combination should be dynamic, adjusted according to each task's learning stage, learning difficulty, and even learning performance.

Many researchers have worked on this. Chen et al. [5] proposed gradient normalization (GradNorm) so that different tasks learn at similar speeds: in addition to each task's own loss, a gradient loss is defined that adjusts each task's gradient-update speed according to its learning speed; since that speed is affected by the loss weights, the per-task loss weights are dynamically adjusted while optimizing the gradient loss. Liu et al. [6] likewise want tasks to learn at similar speeds; they represent training speed by the rate of change of the loss and define each task's loss weight as a monotone function of its training speed. Guo et al. [7] argue that harder tasks should receive higher loss weights; they define task difficulty via accuracy and make the loss weight positively correlated with difficulty. Sener and Koltun [8] cast MTL as a multi-objective optimization problem and seek Pareto-optimal solutions. Kendall, Gal, and Cipolla [9] adjust the weights in the loss function through task uncertainty so that the loss magnitudes of the tasks become comparable; one of the authors, Yarin Gal, works extensively in Bayesian deep learning.

We adopted the task-uncertainty loss-weighting method of [9] and added this dynamic weighting mechanism to the multi-objective model already running online.

The uncertainty here is task-dependent (homoscedastic) uncertainty. For example, suppose we have data containing students' review time, midterm scores, and final scores. Predicting the final score from the midterm score and predicting it from review time are two tasks of different difficulty, i.e., different uncertainty. How can this uncertainty be quantified? Suppose task t's model f is trained on data D; given input x, the model outputs f(x). Accounting for task uncertainty, the model's output becomes {f(x), σ²}, where σ² represents the inherent variance of data D and describes the model's uncertainty about the data. The model learns this variance in an unsupervised way.

Following the derivation in [9], for a regression task we obtain the loss function

L(W, σ) = (1/(2σ²)) · L_task(W) + log σ²,

where L_task(W) is the task's original loss. One noise parameter σ is learned in each task's loss function. The method accepts both regression and classification tasks and unifies the scales of the different losses, so we can add them directly to obtain the total loss:

L_total(W, σ₁, σ₂) = (1/(2σ₁²)) · L₁(W) + (1/(2σ₂²)) · L₂(W) + log σ₁² + log σ₂²

In this uncertainty-weighted multi-task loss, dividing each original loss by σ² absorbs, to some extent, the uncertainty of that task. The log σ² term acts as a regularizer that prevents the network from simply predicting a large σ to minimize the loss: with this term, σ grows only when the task loss is large enough, which equips the network with the ability to learn task uncertainty. Intuitively, the larger σ is, the greater the task's uncertainty and the smaller its weight; noisy, difficult tasks thus receive smaller weights.

The method adds only a handful of trainable parameters and introduces no extra weight hyperparameters; only the model's loss function needs to be rewritten.
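
A minimal sketch of that rewrite is below; following common practice, we learn s_t = log σ_t² rather than σ_t directly for numerical stability, matching the total-loss formula above:

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Weights task losses by learned task uncertainty, following [9].

    Learns s_t = log(sigma_t^2) per task; exp(-s_t) is then the
    1/sigma_t^2 precision weight, and s_t is the regularizing term.
    """

    def __init__(self, num_tasks: int):
        super().__init__()
        # One noise parameter per task, initialized to sigma^2 = 1 (s = 0).
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for s, loss in zip(self.log_vars, task_losses):
            # (1 / (2 sigma^2)) * L_t  +  log sigma^2, as in the formula above
            total = total + 0.5 * torch.exp(-s) * loss + s
        return total

# Usage: replace the naive loss sum from section 3.1 with this combination.
mtl_loss = UncertaintyWeightedLoss(num_tasks=2)
ctr_loss, dwell_loss = torch.tensor(0.7), torch.tensor(120.0)  # toy values
total = mtl_loss([ctr_loss, dwell_loss])
# The log_vars are optimized jointly with the model parameters.
```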

In MTL for interest-point prediction, we tried weighting the multi-task loss by task uncertainty. Practice shows that it helps the model optimize the CTR and dwell tasks better and brings positive gains in business metrics (clicks per user and dwell time).

4. Outlook

The above covers part of our work on interest selection. Through multi-objective joint learning and multi-task loss balancing, clicks per user and dwell time have improved significantly.

To better serve users, to understand, respect, and meet their needs, and to deliver an "interesting and useful" reading experience, we will keep making new attempts: on the algorithm side, in model structure, sample construction, and multi-objective joint optimization; on the business-metric side, in improving users' consumption depth, serendipity, satisfaction, and so on.

References:

[1] Ma X, Zhao L, Huang G, et al. Entire Space Multi-Task Model: An Effective Approach for Estimating Post-Click Conversion Rate. SIGIR 2018.

[2] Ma J, Zhao Z, Yi X, et al. Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts. KDD 2018.

[3] Ni Y, Ou D, Liu S, et al. Perceive Your Users in Depth: Learning Universal User Representations from Multiple E-commerce Tasks. KDD 2018.

[4] Tang H, Liu J, Zhao M, et al. Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations. RecSys 2020.

[5] Chen Z, Badrinarayanan V, Lee C Y, et al. GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks. ICML 2018.

[6] Liu S, Johns E, Davison A J. End-to-End Multi-Task Learning with Attention. CVPR 2019.

[7] Guo M, Haque A, Huang D A, et al. Dynamic Task Prioritization for Multitask Learning. ECCV 2018.

[8] Sener O, Koltun V. Multi-Task Learning as Multi-Objective Optimization. NeurIPS 2018.

[9] Kendall A, Gal Y, Cipolla R. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. CVPR 2018.

This article is from the Yidian Zixun algorithm team.