Tianchi – last year's Antai Cup cross-border e-commerce intelligent algorithm competition was my first contact with recommendation-related competitions. The competition gave me a fairly shallow understanding of recommendation systems, and afterwards I planned to study the field systematically. I will publish this as a [Recommendation System] series of articles, mainly covering the principles of classic recommendation algorithms; I believe each article is worth repeated study.

Antai Cup champion solution: zhuanlan.zhihu.com/p/100827940

I. Background introduction

As the first lecture of this recommendation system series, we introduce YouTube's deep neural network recommendation model based on the paper Deep Neural Networks for YouTube Recommendations, published by YouTube in 2016. Before this, YouTube had published three papers introducing its video recommendation system. Strung together, these four papers form an epitome of the continuous development and improvement of the YouTube recommendation system.

2008: The paper "Video Suggestion and Discovery for YouTube" by Shumeet Baluja et al. was published at the International World Wide Web Conference (IW3C2) in 2008. The paper spends much of its length on the ADSORPTION algorithm, which aims to propagate video labels. It is therefore a reasonable conjecture that the YouTube recommendation system at that time recommended similar videos to users according to the videos they had watched, that is, videos with the same labels.

2010: "The YouTube Video Recommendation System" by Davidson J, Liebald B, Liu J et al. was published at the fourth ACM RecSys in 2010. At that time, the core of the YouTube recommendation system was an Item-CF collaborative filtering algorithm. In other words, for the videos a user likes in the current session and in their history, find the related videos, filter out those the user has already watched, and the remaining videos are the ones the user is likely to watch. Video relatedness here is measured by the number of co-views.

2013: The paper "Label Partitioning For Sublinear Ranking" by Jason Weston et al. was published at the 30th International Conference on Machine Learning in 2013. This paper transforms the recommendation problem into a multi-class classification problem and addresses how to efficiently find the highest-probability output node in the last layer of a neural network.

In 2016, YouTube moved into the stage where deep learning dominates its recommendation system. Although 2016 now seems far from 2020, this paper should be regarded as a must-read for anyone entering the recommendation field. Its recommendation system architecture is also a classic design and the foundation of many later architectures.

Three difficulties of the YouTube recommendation system

  • Data scale: YouTube has billions of users and videos, requiring distributed learning algorithms and efficient deployment.
  • Novelty: Recommendation systems need to be responsive to new video uploads and users’ new behaviors.
  • Data noise: Historical user behavior is difficult to predict due to sparsity and unobservable external factors. Most YouTube videos have only implicit feedback (whether and how users view the video) and lack explicit feedback (how users rate the video). In addition, video meta-information is not well structured. The algorithm needs to be robust to these characteristics of the training data.

II. System overview

The overall architecture of the YouTube recommendation system

The first part is the recall (candidate generation) network: the main purpose of this stage is to retrieve a small subset of videos from millions for the subsequent ranking step. This part must process a very large amount of data quickly, so the models and features used should not be too complicated. The recall network retrieves candidates according to the user's historical information (such as watch and search history). The videos recalled at this stage satisfy the user's generalized interests, and similarity between users is expressed with coarse features such as watched video IDs, search queries, and the user profile.

The second part is the ranking network: this stage uses richer, more detailed features of users and videos to predict the user's score for each recalled video, then ranks by score and presents results to the user in order. This part is responsible for pushing videos to users precisely, so more complex models and features are needed to improve recommendation quality.

The third part is offline evaluation: metrics include precision, recall, ranking loss, etc. The final effect still needs to be verified online, where the metrics of interest include click-through rate, watch time, etc. Note that online and offline results are sometimes inconsistent.

Next, let's look at the specific designs of Matching and Ranking; concrete implementation code will also be given to help deepen understanding of the algorithm principles. Before introducing the recall model, let's first review the traditional recall approach.

III. Matching

3.1 Traditional recall approach

First, the embeddings of items and users are computed offline. At online recall time, the TopN items are found according to the inner product between the user embedding and the embeddings of all items. This traditional approach needs to solve the following problems:

  • How to ensure that the item embeddings and user embeddings lie in the same space?
  • The inner product must be computed against all items, which causes performance problems.
  • Methods such as singular value decomposition operate on the co-occurrence matrix and use relatively limited features.

3.2 Problem Modeling

During the recall phase, YouTube treats recommendation as an extreme multiclass classification problem (a softmax over classes). Based on a specific user $U$ and its context $C$, we predict the probability that the watch $w_t$ at time $t$ is a specific video $i$ from the corpus $V$, i.e. that it belongs to class $i$. The formula is as follows:

$$P(w_t = i \mid U, C) = \frac{e^{v_i u}}{\sum_{j \in V} e^{v_j u}}$$

where $u \in \mathbb{R}^N$ is a high-dimensional embedding of the (user, context) pair and $v_j \in \mathbb{R}^N$ is the embedding vector of the $j$-th candidate video; each video is represented by such an embedding vector.

So far, the goal of the DNN is to learn the user embedding vector from the user's historical behavior and context, feed it to the softmax classifier, and generate a preliminary candidate set as the video recall result.

3.3 Negative sampling (sampled softmax with importance weighting)

In each training step, the softmax denominator requires traversing the entire corpus $V$: the dot products between the user-context vector and every video vector, plus the exp operations, make the computation too expensive, so efficient training becomes a problem. The solution is to sample negative classes to speed up training and to correct the sampling with importance weighting. For each sample, the cross-entropy loss is minimized over the true label and the sampled negative classes. This is hundreds of times faster than the classic softmax.
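To make the idea concrete, here is a minimal NumPy sketch (not YouTube's implementation): the loss is computed over the positive video plus a handful of uniformly sampled negatives instead of the full corpus. With uniform sampling, the importance-weight correction reduces to a shared constant that cancels inside the softmax, so it is omitted here.

```python
import numpy as np

def sampled_softmax_loss(user_vec, video_embs, pos_id, num_neg, rng):
    """Cross-entropy over the positive class and a few sampled negatives,
    instead of over the whole corpus. Uniform sampling is assumed, so the
    -log q(j) importance correction is a shared constant and cancels."""
    corpus_size = video_embs.shape[0]
    neg_ids = rng.choice(corpus_size, size=num_neg, replace=False)
    neg_ids = neg_ids[neg_ids != pos_id]          # drop accidental hits
    ids = np.concatenate(([pos_id], neg_ids))
    logits = video_embs[ids] @ user_vec           # inner products v_j . u
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                      # positive sits at index 0

rng = np.random.default_rng(0)
video_embs = rng.normal(size=(10_000, 16))        # toy corpus of 10k videos
user_vec = rng.normal(size=16)
loss = sampled_softmax_loss(user_vec, video_embs, pos_id=42, num_neg=100, rng=rng)
```

With a non-uniform sampling distribution q (e.g. frequency-based), the correction would subtract log q(j) from each sampled logit before the softmax.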

3.4 Network structure of recall model

When doing NLP tasks, the first step is to represent text, or the words in text, in a structured form the computer can understand. A common method is Word2vec, which represents every word as a low-dimensional, dense embedding vector and feeds these vectors to a neural network for learning.

Network structure of recall model

YouTube's recall model is inspired by this and uses word-embedding techniques to compute an embedding for each video. The embeddings of the user's watched videos and searched videos are then averaged separately, features such as user attributes and video quality are added, and two fully connected ReLU layers plus a softmax function predict which video the user will watch next.

One reason for using a DNN is that continuous and categorical variables are easy to feed into the model, including demographic features, which play a very important role in the final effect. The user's region and device can be input as embedding vectors. Simple binarized features (such as gender) and numerical features (such as age) can be input directly; numerical features should first be normalized to [0,1].

(1) Input layer

  1. User watch sequence (video IDs): mean pooling over the video ID embedding vectors yields the watch vector.
  2. User search sequence: mean pooling over the embedding vectors of searched videos/queries yields the search vector.
  3. User's geographic and device features: both are discrete features. The geographic vector and device vector can be obtained by embedding, or by one-hot encoding when the discrete dimension is small.
  4. Example age: novelty is very important in recommendation; users tend to watch new videos, but a machine-learning model is trained on historical watch records, so to some extent the model and the business are inconsistent. The paper therefore builds an "example age" feature, which can simply be understood as the age of the video: the initial value is set to zero, and the age grows as time passes. After adding this feature, the effect is very obvious, as shown in the figure.

  5. Demographic features: used to provide a prior, enabling reasonable recommendations for new users. Specifically, the user's geographic region and device are embedded and the features are concatenated.
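A toy sketch of this input layer, with illustrative names and shapes (the paper does not publish exact dimensions): mean-pool the watched and searched embeddings, then concatenate the user-level features and the example-age scalar.

```python
import numpy as np

def build_input_vector(watch_ids, search_ids, video_embs, token_embs,
                       geo_vec, device_vec, example_age):
    """Sketch of the recall model's input layer: mean-pool the watched-video
    and search-token embeddings, then concatenate the user-level features.
    Names and shapes here are illustrative, not the paper's exact layout."""
    watch_vec = video_embs[watch_ids].mean(axis=0)    # 1. watch vector
    search_vec = token_embs[search_ids].mean(axis=0)  # 2. search vector
    return np.concatenate([watch_vec, search_vec,
                           geo_vec, device_vec,       # 3. geo + device
                           [example_age]])            # 4. example age scalar

rng = np.random.default_rng(0)
video_embs = rng.normal(size=(1000, 8))   # toy video embedding table
token_embs = rng.normal(size=(500, 8))    # toy search-token embedding table
x = build_input_vector([3, 7, 7], [12, 40], video_embs, token_embs,
                       geo_vec=rng.normal(size=4), device_vec=rng.normal(size=4),
                       example_age=0.0)
# x.shape == (8 + 8 + 4 + 4 + 1,) == (25,)
```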

(2) MLP layer

A common tower design is used: the bottom layer is the widest, and each layer above halves the number of neurons until the input to the softmax is 256-dimensional (1024 ReLU -> 512 ReLU -> 256 ReLU).
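A minimal forward pass through such a tower, with random (untrained) weights just to show the shapes:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tower_mlp(x, rng):
    """Forward pass through a 1024 -> 512 -> 256 ReLU tower. Weights are
    random here purely to illustrate shapes; in practice they are learned."""
    h = x
    for width in (1024, 512, 256):
        W = rng.normal(scale=0.01, size=(h.shape[0], width))
        b = np.zeros(width)
        h = relu(h @ W + b)
    return h  # 256-d user vector fed to the softmax layer

rng = np.random.default_rng(0)
u = tower_mlp(np.ones(25), rng)
# u.shape == (256,)
```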

(3) Softmax layer

The deep neural network produces video embedding vectors $v_j$ and the user embedding vector $u$; the recall task predicts the probability that the user watches video $i$ at time $t$:

$$P(w_t = i \mid U, C) = \frac{e^{v_i u}}{\sum_{j \in V} e^{v_j u}}$$

(4) Output layer

The dimension of the output layer is the same as the embedding dimension of the video IDs, and its output is the user vector $u$.

Finally, by training the network we obtain the embedding vectors of all videos as a matrix of dimension pool_size × $N$, where pool_size is the size of the training set's video corpus and $N$ is the embedding dimension. We also obtain the output vectors of all users, where each user vector has dimension $N$, consistent with the video embedding dimension.

3.5 Recall Optimization

At the online serving stage, similarity is computed between the video vectors $v_j$ and the user vector $u$. To meet latency requirements, the actual recall computation uses nearest-neighbor search: for each user vector, run a nearest-neighbor algorithm over all video vectors in the corpus and take the top-N videos as the recall result.
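A brute-force stand-in for this lookup (real systems use an approximate nearest-neighbor index such as hashing or trees; this sketch just computes all inner products and takes the top N):

```python
import numpy as np

def recall_top_n(user_vec, video_embs, n):
    """Brute-force inner-product top-N. In production this would be replaced
    by an approximate nearest-neighbor index to meet latency requirements."""
    scores = video_embs @ user_vec
    top = np.argpartition(-scores, n)[:n]    # unordered top-N indices
    return top[np.argsort(-scores[top])]     # sorted by score, descending

rng = np.random.default_rng(0)
video_embs = rng.normal(size=(10_000, 16))
video_embs[123] *= 10.0                      # make video 123 an easy winner
user_vec = video_embs[123].copy()
ids = recall_top_n(user_vec, video_embs, n=10)
# ids[0] == 123
```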

3.6 Sample selection and context selection

In supervised learning, the most important choice is the label, because the label determines what you are doing and determines your upper limit, while features and models only approach that limit. Several of the paper's design points are as follows:

  • Use a wider range of data sources: train not only with data from the recommendation scenario but also from other scenarios such as search, which provides some exploration for recommendation.
  • Generate a fixed number of training samples per user: a practical lesson is that fixing an upper limit of samples per user treats every user equally and prevents the loss from being dominated by a few highly active users, which significantly improves online performance.
  • Discard sequence information: ordering is removed and the embedding vectors of past watched videos / historical search queries are averaged. This is counterintuitive, probably because the model does not handle negative feedback well.
  • The asymmetric co-watch problem: users tend to watch videos in sequence, starting with popular videos and gradually finding niche ones. Figure (a) is the held-out approach, which uses surrounding context to predict a video in the middle; figure (b) is the predicting-next-watch approach, which uses only past information to predict the next watch. The approach in figure (b) performed better in online A/B tests. In fact, traditional collaborative filtering algorithms implicitly sample in the held-out style of figure (a), ignoring the asymmetric browsing pattern.

IV. Ranking

The recall stage has produced the candidate set; the purpose of the ranking stage is to finely rank this small candidate set. The ranking stage often also needs to effectively merge videos from multiple different recall sources, which makes it more difficult.

The network structure of the ranking model

4.1 Training Stage

The base model is a DNN plus weighted logistic regression. Negative samples are videos with exposure but no click; positive samples are videos with both exposure and click, and the model predicts the user's watch time. Each positive sample records the user's watch time on that video. To predict watch time, a weighted logistic regression model was specially designed. (Watch time is used as the target to reduce noise, because a user clicking a video often does not mean the user really likes or is satisfied with the content.)

The model is trained as a logistic regression with cross-entropy loss, but the positive samples are weighted by watch time while the negative samples keep unit weight. The odds learned by the weighted logistic regression become $\frac{\sum T_i}{N - k}$, where $N$ is the number of samples, $k$ the number of positive samples, and $T_i$ the watch time of the $i$-th positive sample. Assuming the positive set is small, the learned odds are approximately $E[T](1 + P)$, where $P$ is the click probability and $E[T]$ the expected watch time per impression. Because $P$ is small, this product is approximately $E[T]$. At serving time, the exponential function $e^{Wx + b}$ is used as the final activation to produce an estimate of the expected watch time.
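A toy numeric check of the odds derivation above (function names are illustrative): positives carry their watch time as weight, negatives carry weight 1, so the learned odds approximate the expected watch time per impression; serving applies the exponential to the logit.

```python
import numpy as np

def weighted_lr_odds(watch_times, num_samples):
    """Odds learned by watch-time-weighted logistic regression:
    sum of positive weights (watch times T_i) over sum of negative
    weights (N - k), which is close to E[T] per impression when the
    click probability k/N is small."""
    k = len(watch_times)
    return sum(watch_times) / (num_samples - k)

def serve_watch_time(logit):
    """Serving-time activation e^{Wx+b}: turns the weighted-LR logit
    into an estimate of expected watch time."""
    return np.exp(logit)

# toy check: 3 clicks out of 1000 impressions, total watch time 180s
odds = weighted_lr_odds([30.0, 60.0, 90.0], num_samples=1000)
# odds == 180 / 997, very close to E[T] per impression == 180 / 1000
```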

4.2 Feature Representation

The paper's feature taxonomy differs from the traditional categorical vs. continuous/ordinal split; features are analyzed along two dimensions:

  • Number of values: single-valued features, such as the ID of the video being scored for display; and multi-valued features, such as the N video IDs the user has watched in the past.
  • What the feature describes: features are also classified by whether they describe properties of the item ("impression" features) or properties of the user/context ("query" features). Query features are computed once per request, while impression features are computed once for each item scored.

(1) Feature engineering

Although a DNN implicitly performs feature engineering, the authors still did a lot of manual feature engineering (personally, I think this suggests the network model is not perfectly suited to this data, or the network structure is not as efficient as a CNN). The three most important features are as follows:

  • User historical behavior: the user's previous interactions with this item or similar items;
  • Time since last watch: the time since the user last watched a video from the same channel; the principle is similar to an "attention mechanism";
  • The number of times the video was previously exposed to the user: if a video was shown before and the user did not watch it, the user is more likely to skip it the next time it is shown. This principle is similar to "exploration".

(2) Discrete feature Embedding

As in candidate generation, embeddings map sparse discrete features into a dense form suitable for neural networks. Each unique video ID corresponds to a separately learned embedding whose dimension grows roughly with the logarithm of the number of unique IDs, i.e. proportional to log(unique(values)). The vocabulary is created by scanning the data once before training: video IDs are sorted in descending order of click-impression frequency, only the top-N most frequent IDs are kept, and the rest are removed from the vocabulary. IDs not in the vocabulary are simply embedded as the zero vector. As in the candidate generation phase, multi-valued discrete features are mapped to embeddings and then averaged before being input to the network.
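A small sketch of the vocabulary construction and zero-vector fallback described above (function names are mine, not the paper's):

```python
from collections import Counter
import numpy as np

def build_vocab(click_stream, top_n):
    """Keep only the top_n most frequent video IDs; everything else
    becomes out-of-vocabulary."""
    counts = Counter(click_stream)
    return {vid: i for i, (vid, _) in enumerate(counts.most_common(top_n))}

def lookup(vid, vocab, emb_table):
    """Embed a video ID; out-of-vocabulary IDs map to the zero vector."""
    idx = vocab.get(vid)
    return emb_table[idx] if idx is not None else np.zeros(emb_table.shape[1])

clicks = ["a", "b", "a", "c", "a", "b", "d"]
vocab = build_vocab(clicks, top_n=2)              # keeps "a" and "b"
emb = np.random.default_rng(0).normal(size=(2, 4))
v = lookup("d", vocab, emb)                       # OOV -> zero vector
# v == [0, 0, 0, 0]
```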

Importantly, discrete features drawn from the same ID space share the same underlying embedding. For example, one global video-ID embedding is used by many different features (the ID of the exposed video, the ID of the last watched video, the ID of the seed video for the recommendation, etc.). Although the embedding is shared, each feature is fed into the network separately, so that the upper layers can learn a specialized representation per feature. Embedding sharing is very important for improving generalization, accelerating training, and reducing memory footprint. Most of the model's parameters live in these high-dimensional embeddings: for example, mapping one million IDs into a 32-dimensional space takes more than 7 times the parameters of a fully connected layer 2048 units wide.

(3) Normalization of continuous features

Neural networks are very sensitive to the scaling and distribution of their inputs, whereas tree-based models such as decision tree ensembles are unaffected by the scaling of individual features. We found that proper normalization of continuous features is essential for convergence. A continuous feature $x$ with distribution $f$ is normalized to $\tilde{x} = \int_{-\infty}^{x} df$, i.e. mapped to its cumulative distribution value, which is uniformly distributed over $[0, 1)$. Before training, this integral is approximated by linear interpolation over feature-value quantiles computed in a single pass over the data.

In addition to the normalized feature $\tilde{x}$, we also input $\tilde{x}^2$ and $\sqrt{\tilde{x}}$; these super-linear and sub-linear transforms give the network more expressive power. Inputting powers of continuous features was shown to improve offline accuracy. Such seemingly illogical features are actually useful, perhaps because they genuinely enrich the feature representation.
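A sketch of the quantile normalization plus the square/square-root expansion. The 101-point quantile grid is my assumption; the paper only says the cumulative distribution is approximated by linear interpolation over quantiles.

```python
import numpy as np

def quantile_normalize(train_values, x):
    """Map a raw feature value to its empirical CDF value in [0, 1] by
    linear interpolation over quantiles computed once from training data.
    (101 equally spaced quantiles is an illustrative choice.)"""
    grid = np.linspace(0.0, 1.0, 101)
    qs = np.quantile(train_values, grid)
    return np.interp(x, qs, grid)

def expand_feature(x_tilde):
    """Feed the network the normalized value plus its square and square
    root, the super/sub-linear transforms described above."""
    return np.array([x_tilde, x_tilde ** 2, np.sqrt(x_tilde)])

train = np.random.default_rng(0).exponential(size=10_000)
x_tilde = quantile_normalize(train, np.median(train))
feats = expand_feature(x_tilde)
# the training median maps to approximately 0.5
```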

4.3 Hidden layer experiment

The table shows the results of different hidden-layer configurations on a held-out dataset. The value for each configuration comes from pairs of positive and negative impressions shown to a user on the same page: the model scores both samples, and if the negative sample scores higher than the positive one, the positive sample's watch time is counted as mispredicted. The per-user loss is the total amount of mispredicted watch time.

Both the width and the depth of the hidden layers were tested. The figure shows that increasing width and depth improves the model. However, for the 1024->512->256 network, the version tested without the normalized feature powers (the square and square root) increased loss by 0.2%, and replacing weighted LR with plain LR degraded the result by as much as 4.1%.

References

  • [Covington et al., 2016] Paul Covington, Jay Adams, Emre Sargin. Deep Neural Networks for YouTube Recommendations. RecSys: 191-198, 2016.
  • zhuanlan.zhihu.com/p/52
  • zhuanlan.zhihu.com/p/61
