Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology

Original link: www.cs.toronto.edu/~cebly/Pape…

  • Main contributions:

1. A Q-learning algorithm named SlateQ is proposed. It decomposes the value of a recommended slate into item-level long-term values (LTV, long-term value), and adds this long-term value as one objective in the ranker's multi-objective optimization. (A slate, as I understand it, is the list of candidate items recommended to the user.)

Two assumptions are introduced, concerning user choice behavior and system dynamics.

With the slate decomposition, TD (temporal-difference) methods such as Q-learning and SARSA become tractable in terms of complexity, generalization, and exploration. For certain choice models, the slate optimization problem can be solved exactly with an LP-based algorithm or approximately with the greedy and top-k heuristics. The results show that SlateQ is robust and scales to large commercial systems such as YouTube.

2. Building on an existing short-term (myopic) ranking model, the paper gives a concrete implementation of the long-term-return RL model and runs experiments on YouTube.

1. Introduction

Problems with current mainstream personalized recommendation techniques:

  • The optimization target is short-term return, such as click-through rate or watch time; long-term value (LTV) is hard to model.
  • The core task is predicting users' interests, but models are trained on logged feedback. Samples and features are extremely sparse, many items have never been sufficiently exposed, and new items and new users keep arriving, all of which introduce a large amount of bias. In addition, user interests shift quickly and user behavior is diverse and noisy.
  • Feedback loops: in the short term the system tends to repeatedly recommend a user's existing preferences, while new or low-activity users tend to be served popular items as a fallback.

Reinforcement learning (RL) applied to recommendation systems

As is well known, RL typically requires massive amounts of data; the famous AlphaGo, for example, used self-play to make up for the shortage of training data. In the recommendation scenario, however, the interaction between the user and the system is dynamic and cannot easily be simulated: you do not know what a user's feedback would be for a product A that has never been recommended to them.

  • Huge action space: there are millions of items to recommend, and what is shown to the user is a slate of items, which turns the action space into a combinatorial one.
  • Because the environment is dynamic, we cannot know the user's feedback for an item they have never been shown, so the environment cannot be simulated effectively and training becomes harder.
  • The user's latent state can only be estimated; with a large number of unobserved samples and much noise, this estimation is difficult (a problem shared by RL and other approaches).
  • Long-term reward is difficult to model, and the long-/short-term reward trade-off is complicated by errors in the user-state estimate.

2. Related Work

Recommender Systems

Collaborative-filtering-based recommender systems are the most common; they are not discussed in detail here.

Sequence models and RL-based recommender systems

User Choice Behavior Models

The multinomial logit (MNL) model and its extensions, such as the conditional logit model, are the main models used to optimize recommended sequences under user choice behavior.

RL with Combinatorial Action Spaces

Sequential DQN

Slate MDPs

3. The MDP Model (Markov Decision Process)

The defining characteristic of RL is interaction with the environment through trial and error. The whole process is usually described as an MDP. In the recommendation setting, the tuple (S, A, P, R) is defined as follows:

  • S: the state space, representing the user's state
    • Static features (declared user interests, attributes) and dynamic features (user context)
    • User feedback: accepted or rejected recommendations, user responses, and user engagement
  • A: the action space, the items (slates) recommended to the user; a discrete space
  • P(s' | s, A): S × A × S → [0, 1], the state-transition probability, i.e. the probability of moving to state s' after taking action A in state s. It captures the uncertainty of the user's response and of the environment.
  • R: S × A → R, the reward function (immediate reward); the paper models it with user engagement, taking the uncertainty of the user's response into account.

Function definitions:

  • The value function: $V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s')$

  • The optimal value function: $V^*(s) = \max_{A} \big[ R(s, A) + \gamma \sum_{s'} P(s' \mid s, A)\, V^*(s') \big]$

  • The Q-function: $Q^\pi(s, A) = R(s, A) + \gamma \sum_{s'} P(s' \mid s, A)\, V^\pi(s')$

  • The optimal Q-function: $Q^*(s, A) = R(s, A) + \gamma \sum_{s'} P(s' \mid s, A) \max_{A'} Q^*(s', A')$

  • SARSA updates the Q-value on-policy, using as its target the reward plus the discounted Q-value of the next state-action pair actually taken (the predicted value at iteration t):

    $Q(s, A) \leftarrow (1 - \alpha)\, Q(s, A) + \alpha \big[ r + \gamma\, Q(s', A') \big]$

  • Q-learning is off-policy, using the maximizing next action in its target:

    $Q(s, A) \leftarrow (1 - \alpha)\, Q(s, A) + \alpha \big[ r + \gamma \max_{A'} Q(s', A') \big]$
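To make the two update rules concrete, here is a minimal tabular sketch (the state/action encoding, learning rate, and discount are illustrative assumptions, not values from the paper):

```python
from collections import defaultdict

# Tabular Q-table: Q[state][action] -> estimated long-term value.
Q = defaultdict(lambda: defaultdict(float))

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: the target uses the action a_next that the
    current policy actually takes in the next state."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target

def q_learning_update(s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """Off-policy TD update: the target uses the maximizing action in the
    next state, regardless of what the behavior policy does."""
    best_next = max((Q[s_next][a2] for a2 in next_actions), default=0.0)
    target = r + gamma * best_next
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
```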

4. SlateQ: Slate Value Decomposition

The problem with the formulas above: the slate action space is combinatorially large, so the representation must be compressed, and finding the optimal slate is itself a combinatorial optimization problem.

This paper proposes a decomposed way of computing Q-values: SlateQ.

Efficient TD learning is made possible by splitting the LTV of a slate into item-level LTVs, with the emphasis on modeling user choice (the user choice model).

This item-wise view differs from the pointwise approach in the usual sense of personalized ranking, because the item-wise decomposition concerns how the reward is expressed and is coupled with a user choice model. A pointwise method only considers each item's individual probability. Item-wise modeling also assumes that the final reward depends only on the selected item and that items do not directly affect one another, but it additionally makes explicit assumptions about user choice, which turns the objective into a polynomially solvable optimization problem.

  • SC (single choice): the user selects at most one item at a time from the slate (possibly the null item ⊥).

    • R(s, A, i) denotes the reward obtained when the user selects item i from slate A in state s.

      P(s' | s, A, i) is defined analogously for transitions. The slate-level reward and transition can then be written as choice-weighted sums: $R(s, A) = \sum_{i \in A} P(i \mid s, A)\, R(s, A, i)$ and $P(s' \mid s, A) = \sum_{i \in A} P(i \mid s, A)\, P(s' \mid s, A, i)$.

  • RTDS (reward/transition dependence on selection): the reward R(s, A) and the state transition P(s' | s, A) depend only on the item the user actually selects.

    Assuming the user's behavior is not affected by the unselected items in slate A gives the simplified forms R(s, A, i) = R(s, i) and P(s' | s, A, i) = P(s' | s, i).

An item-level auxiliary function $\bar{Q}^\pi(s, i)$ is introduced; it represents the LTV of the user consuming item i and depends only on that single item, not on the slate A.

Combining it with the SC assumption yields the decomposition:

$Q^\pi(s, A) = \sum_{i \in A} P(i \mid s, A)\, \bar{Q}^\pi(s, i)$

Update formula:

$\bar{Q}(s, i) \leftarrow (1 - \alpha)\, \bar{Q}(s, i) + \alpha \big[ r + \gamma \sum_{j \in A'} P(j \mid s', A')\, \bar{Q}(s', j) \big]$

Under these assumptions the iteration eventually converges to the optimal item-level Q-function.
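A minimal sketch of the decomposition and the item-level update, assuming a conditional-logit choice model with affinity scores v(s, i); the dictionaries standing in for $\bar{Q}$ and the function names are my own illustration (in practice $\bar{Q}$ is a neural network):

```python
def choice_probs(v_scores, v_null=1.0):
    """Conditional-logit choice probabilities over a slate, including a
    'no click' option with affinity v_null."""
    denom = v_null + sum(v_scores.values())
    return {i: v / denom for i, v in v_scores.items()}

def slate_q_value(q_bar, probs):
    """SC + RTDS decomposition: Q(s, A) = sum_i P(i | s, A) * Q_bar(s, i)."""
    return sum(p * q_bar[i] for i, p in probs.items())

def slateq_update(q_bar_s, clicked, r, q_bar_next, probs_next, alpha=0.1, gamma=0.9):
    """Item-level SARSA-style update for the clicked item:
    target = r + gamma * sum_j P(j | s', A') * Q_bar(s', j)."""
    target = r + gamma * sum(p * q_bar_next[j] for j, p in probs_next.items())
    q_bar_s[clicked] = (1 - alpha) * q_bar_s[clicked] + alpha * target
```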

5. Slate Optimization with Q-Values

To make Q-learning feasible, a combinatorial slate optimization problem must be solved: given a Q-value for each item, find the slate with maximum LTV.

Slate optimization is formulated as a fractional mixed-integer program, which can then be reformulated and relaxed into a linear program.

Two heuristic algorithms, top-k and greedy, are also proposed; although they lack theoretical guarantees, they work well in the experiments.

1. The Exact Optimization algorithm

The following quantity needs to be maximized over slates A of size k:

$\max_{A \subseteq I,\; |A| = k} \; \sum_{i \in A} P(i \mid s, A)\, \bar{Q}(s, i)$

Conclusion: under the SC assumption, this problem can be solved in polynomial time.

The concrete optimization procedure depends on the form of the choice model; the conditional logit model (CLM) is used here.

  • $P(j \mid s, A) = v(s, j) \big/ \big(v(s, \perp) + \sum_{l \in A} v(s, l)\big)$: the probability that the user selects item j from slate A, where v(s, j) is the user's affinity for item j in state s.

The system's selection of items is based on their LTV (Q-value), not only on their immediate appeal to the user.

Assume the user chooses item i in state s with (unnormalized) weight v(s, i).

Plugging this in, the optimization becomes:

$\max_{A,\; |A| = k} \; \dfrac{\sum_{i \in A} v(s, i)\, \bar{Q}(s, i)}{v(s, \perp) + \sum_{i \in A} v(s, i)}$

This is a variant of the classic product-line design problem, and it can be converted into a linear program using standard methods from operations research and product-line design.

Chen and Hausman [2000] discuss the solution method.
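For intuition, the fractional objective can be evaluated directly; the brute-force search below stands in for the LP reformulation and is only viable for tiny candidate sets (function names and the v_null default are assumptions):

```python
from itertools import combinations

def slate_score(items, v, q_bar, v_null=1.0):
    """Fractional objective: sum_i v(s,i)*Q_bar(s,i) / (v_null + sum_i v(s,i))."""
    num = sum(v[i] * q_bar[i] for i in items)
    den = v_null + sum(v[i] for i in items)
    return num / den

def exact_slate_brute_force(candidates, v, q_bar, k, v_null=1.0):
    """Exhaustive search over all size-k slates (exponential; illustration only)."""
    return max(combinations(candidates, k),
               key=lambda A: slate_score(A, v, q_bar, v_null))
```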

2. Top-k and Greedy Algorithms

These heuristics avoid solving an LP (linear program) at serving time.

The heuristic idea is simply to add the k highest-scoring items to the slate.

  • Top-k optimization: sort the items in descending order of $v(s, i)\, \bar{Q}(s, i)$ and take the first k; slate construction reduces to a sort over the m candidates.

  • Greedy optimization: re-score items given the current partial slate. Given the L items already added, the next item chosen is the one with the maximum marginal contribution to the slate value. (Both heuristics are sketched below.)
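A sketch of the two heuristics, using the same fractional slate score as in the previous sketch (names are assumptions):

```python
def top_k_slate(candidates, v, q_bar, k):
    """Top-k heuristic: take the k items with the largest v(s,i) * Q_bar(s,i)."""
    return sorted(candidates, key=lambda i: v[i] * q_bar[i], reverse=True)[:k]

def greedy_slate(candidates, v, q_bar, k, v_null=1.0):
    """Greedy heuristic: repeatedly add the item whose marginal contribution
    to the fractional slate score is largest, given the items already chosen."""
    def score(items):
        num = sum(v[i] * q_bar[i] for i in items)
        return num / (v_null + sum(v[i] for i in items))
    slate, remaining = [], set(candidates)
    for _ in range(k):
        best = max(remaining, key=lambda i: score(slate + [i]))
        slate.append(best)
        remaining.remove(best)
    return slate
```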

6. Simulated User Environment

A recommender-system simulation environment, RecSim, is proposed; it supports direct configuration of the documents (items), the user state model, and the user choice model.

1. Document and Topic models

  • D: the set of documents, i.e. the things that can be recommended.

    • Each document d has a topic vector; its j-th entry represents the degree to which d is about topic j.

    • ℓ(d): the length of d; in the experiments it is a fixed length ℓ.

    • L(d): the quality of the document, drawn from a normal distribution.

  • T: the set of topics, capturing the basic characteristics of user interests.

2. User Interest and Satisfaction model

Each user u is represented by a vector of interests over the topics.

I(u, d): the user's interest in document d, computed as the dot product of the user's interest vector and the document's topic vector.

S(u, d): the user's satisfaction with document d. The paper uses a simple function of interest and document quality; more complex functions can also be used.
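A hedged sketch of these two quantities; the blending weight used for satisfaction is an assumption, not the paper's exact function:

```python
import numpy as np

def interest(user_vec, doc_topics):
    """I(u, d): dot product of the user's topic-interest vector and the
    document's topic vector (both numpy arrays)."""
    return float(np.dot(user_vec, doc_topics))

def satisfaction(user_vec, doc_topics, doc_quality, blend=0.5):
    """S(u, d): a simple convex combination of interest and document quality;
    the simulator could use a more complex function."""
    return (1.0 - blend) * interest(user_vec, doc_topics) + blend * doc_quality
```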

3. User Choice model

Given the documents in a slate, the choice model determines which document the user selects. It is assumed that the user can observe a document's topic before selecting it, but not its quality; the quality is revealed only after the document is consumed.

The user's interest I(u, d) defines the (unnormalized) score of document d in the choice model.

The model used in the simulation experiments is the general conditional choice model.

The paper also uses another choice model, the exponential cascade model [Joachims, 2002], which is not described further here.

4. Dynamic User Model

1. Remaining-time model: each user u has an initial time budget, which the recommender system cannot observe. Consuming a document d reduces the budget by an amount derived from the document's length ℓ; however, higher-quality documents deplete the budget more slowly, because consumption adds a bonus b back to the budget, computed from the satisfaction S(u, d).

In the experiments, the user starts with a fixed initial time budget and each document consumes a fixed number of units; if no document in a slate is selected, 0.5 units of time are consumed. The reward is the user's engagement (time spent) with the consumed document.

2. Users’ interests change with the articles they read.

Let t be the topic of document d, and let u_t denote user u's interest in topic t.

After the user consumes a document on topic t, u_t is updated, nudging the user's interest toward (or away from) that topic.
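A sketch of the user dynamics described above; the bonus formula and the interest-update rule are illustrative assumptions rather than the paper's exact equations:

```python
import numpy as np

def consume_document(budget, doc_length, sat, bonus_scale=0.9):
    """Consuming a document costs its length in budget units, but a more
    satisfying document returns part of that time as a bonus (assumption:
    bonus proportional to satisfaction and length)."""
    return budget - doc_length + bonus_scale * sat * doc_length

def update_interest(user_vec, doc_topics, step=0.1):
    """Nudge the user's topic-interest vector (numpy array) toward the topics
    of the consumed document (one simple choice of update rule)."""
    return user_vec + step * (doc_topics - user_vec)
```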

5. Recommender System Dynamics

In each round of interaction with the user, m candidate documents are drawn from the corpus D, and the slate shown to the user contains k of them. In large-scale commercial systems, a small candidate set is first retrieved from a large corpus and then scored in a more precise (and computationally expensive) way.

In the simulation experiment, m = 10, k = 3.
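One round of the simulated interaction could then look like the sketch below; `recommender.build_slate` and `user.choose_and_consume` are hypothetical interfaces standing in for the slate optimizer and the user choice/consumption model:

```python
import random

def simulation_round(corpus, user, recommender, m=10, k=3):
    """One interaction round: sample m candidate documents, build a k-item
    slate, let the simulated user choose, and return the outcome."""
    candidates = random.sample(corpus, m)                  # candidate generation
    slate = recommender.build_slate(user, candidates, k)   # e.g. top-k / greedy
    chosen, reward = user.choose_and_consume(slate)        # user choice + dynamics
    return chosen, reward
```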

7. Empirical Evaluation

A standard two-tower architecture with stacked fully connected layers is used to represent the user state and the documents. The Q-models are updated online from interactions with the simulated users.

8. Implementation methods

Many item-level recommendation systems consist of the following components:

  1. Logging: log user impressions and feedback.
  2. Training: a regression model (e.g. a DNN) is trained to predict user responses, which are aggregated by some scoring function.
  3. Serving: score and rank the candidates and recommend the top-k items.

1. Constructing the state space

The state is built from features extracted from the user's history.

2. Generalization across users

User features allow the model to generalize across users; existing recommender systems already do this.

3. Modeling of user response

User responses are modeled with SlateQ from Sections 4 and 5. Existing recommender systems predict immediate returns, e.g. with a pCTR model.

4. Building on the logging, training, and serving infrastructure

The LTV model Qπ(s, A) is trained from logged user data, and real-time serving ranks recommendations by LTV score.

The later experiments use the SARSA algorithm. In the short-term (myopic) recommender, regression models predict users' immediate responses; in the long-term recommender, a label-generation step produces LTV labels that the regressor is trained on.

The model is periodically retrained and pushed to the servers. The ranker uses the latest model to recommend items and logs user feedback, which is used to train the next model. This iterative train-and-push cycle with LTV labels can be viewed as a form of generalized policy iteration: each trained DNN represents the value of the previous policy (policy evaluation), and the ranker performs policy improvement based on that value function.

LTV label generation is similar to DQN training [Mnih et al., 2015]: labels are computed with a separate label network whose parameters are periodically copied from the main network. The label network produces the target $\text{label}(s, i) = r(s, i) + \gamma \sum_{j \in A'} P(j \mid s', A')\, \bar{Q}_{\theta'}(s', j)$, where $\theta'$ are the label-network parameters.

9. Training process

The paper implements the SARSA-TS algorithm with the SlateQ decomposition for large-scale video recommendation on YouTube. The system has two components: a candidate generator, which retrieves the few hundred items that best match the user, and a ranker, which uses a DNN to score and rank the retrieved items.

Unlike a short-term recommender, the system in this paper uses users' click trajectories over N days to maximize cumulative engagement, and a time-based discount handles credit assignment over long horizons: if consecutive visits occur at times $t_i$ and $t_{i+1}$, the discount applied between them is $\gamma^{(t_{i+1} - t_i)/c}$, where c is a parameter controlling the time scale.

Specifically, a multi-task feed-forward deep network is used to jointly learn the item-level long-term value under state s, the pCTR, the short-term return v(s, i), and so on. The network has four hidden layers of sizes 2048, 1024, 512, and 256, each with a ReLU activation.

During serving, top-k is used to select the slate, and the slate is ordered by the score pCTR(s, i) · $\bar{Q}(s, i)$.
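At serving time the ranking step is then just a sort by that combined score; a minimal sketch (names assumed):

```python
def serve_slate(candidates, pctr, q_bar, k=3):
    """Rank candidates by pCTR(s, i) * Q_bar(s, i) and return the top k,
    ordered by that score."""
    return sorted(candidates, key=lambda i: pctr[i] * q_bar[i], reverse=True)[:k]
```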

The model training procedure, using stochastic gradient descent, is as follows.

1. Parameters

  • T: the number of training iterations
  • M: the interval (in iterations) at which the label network is updated
  • γ: the discount rate
  • θ: the parameters of the main neural network
  • $\bar{Q}_\theta(s, i)$: the predicted long-term value (LTV) for the user
  • θ′: the parameters of the label (target) neural network
  • the parameters of the network that predicts item pCTR (click-through rate)

2. Input:

  • s: the feature vector of the current state; s′: the feature vector of the next state
  • A: the slate recommended in the current state; A′: the slate recommended for the next state
  • click_i: whether the i-th item was clicked
  • the short-term label (immediate return)

3. Output: the trained main-network parameters θ, used to predict the user's long-term value.

4. Initialization: randomly initialize θ and θ′.
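Putting the pieces of this section together, a minimal end-to-end sketch of the training loop might look as follows; the linear $\bar{Q}$ model, the batch layout, and the learning rate are illustrative assumptions (the paper uses a multi-task DNN):

```python
import numpy as np

def q_bar(theta, s, item_feat):
    """Item-level LTV estimate: a simple linear model over the concatenated
    state and item features (a stand-in for the paper's multi-task DNN)."""
    x = np.concatenate([s, item_feat])
    return float(theta @ x), x

def train_ltv(batches, dim, num_iters, sync_interval, gamma=0.99, lr=0.01):
    """SARSA-style LTV training with a periodically synced label network.
    Each batch element: (s, clicked_item_feat, r, s_next,
                         next_slate_feats, next_choice_probs)."""
    theta = np.zeros(dim)              # main network parameters
    theta_label = theta.copy()         # label (target) network parameters
    for t in range(num_iters):
        if t % sync_interval == 0:     # every M iterations, refresh the label net
            theta_label = theta.copy()
        s, item_feat, r, s_next, slate_feats, probs = next(batches)
        # LTV label: short-term reward plus the discounted, choice-weighted
        # item-level values of the next slate, computed with the label network.
        future = sum(p * q_bar(theta_label, s_next, f)[0]
                     for p, f in zip(probs, slate_feats))
        label = r + gamma * future
        pred, x = q_bar(theta, s, item_feat)
        theta -= lr * 2.0 * (pred - label) * x   # SGD step on squared error
    return theta
```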