Twenty articles have been written in this series, but there is still much to explore and learn about recommendation systems. But before we do that, let’s take a moment and review what we’ve learned!

Since it is a summary article, many details will not be involved too much, interested students can click on the links given in the article to learn.

Most of the algorithms involved in this paper are models used to calculate click-through rate estimation in advertisements. Of course, pin-WISE models such as Bayesian personality ranking and List-WISE models such as REINFORCEMENT learning recommendation model of JINGdong will also be involved.

All right, without further ado, let’s get started. Take a look at the table of contents:

1. Recommend commonly used evaluation indexes in the system

1.1 Accuracy, recall rate, F1 value
1.2 AUC
1.3 Hit thewire (HR)
1.4 Mean Average Precision (MAP)
A Normalized trial of Cummulative Gain(NDCG)
1.6 Mean Frame Rank (MRR)
1.7 skyriver

2. Data in the click rate estimation problem

3. Traditional methods

3.1 Linear model
3.2 FM model
3.3 FFM model
3.4 GBDT + LR model

4. Deep learning methods

4.1 Parallel Structure
4.1.1 Wide & Deep model
4.1.2 DeepFM model
4.1.3 Deep Cross Network
4.2 Serial Structure
2 the Product – -based Neural Network
4.2.2 Neural factorization those
Holdings Attention – -based factorization those

5. Reinforcement learning method

6. EE problems of the recommendation system

6.1 Bandit algorithm
6.2 LinUCB algorithm

7. Actual practice of recommendation system in the company

7.1 Ali MLR algorithm
7.2 Ali Deep Interest Network
7.3 Ali ESSM Model
7.4 JD reinforcement learning recommendation model

1. Recommend commonly used evaluation indexes in the system

Evaluation index is not our loss function. For CTR prediction problem, we can choose square loss function as regression problem, or logarithmic loss function as classification problem. However, the evaluation index is used to evaluate the effect of our recommendation, not to guide the training of our model. Therefore, in general models based on deep learning, the problem of inconsistent indicators in model training and evaluation is often faced. Okay, let’s start with a review of common metrics. Some of these indicators are applicable to dichotomous problems, and some are applicable to the evaluation of topK.

1.1 Accuracy, recall rate, F1 value

Let’s first take a look at the confusion matrix. For the dichotomous problem, there are two types of real sample labels, and there are two types of categories predicted by our learner, so the category combination of the two can be divided into four groups, as shown in the following table:


Based on the confusion matrix, we can obtain the following evaluation indicators:

Accuracy/recall rate

Accuracy rate represents the probability that positive samples will be correctly predicted among the positive samples predicted. Recall rate refers to the probability of being correctly predicted as a positive sample in the positive sample of the original sample.

The two are calculated by confusion matrix as follows:


F1 value

In order to compromise the results of accuracy rate and recall rate, F-1 Score is also introduced, and the calculation formula is as follows:


1.2 AUC

AUC is defined as the area under the ROC curve:

The horizontal axis of ROC curve is “True Positive Rate” (TPR), also known as “false Positive Rate”. The vertical axis is the False Positive Rate (FPR), also known as the true Positive Rate,

The following figure is a ROC curve drawn by us, and the area below the curve is the value of AUC:


Another interpretation of AUC is to test the probability that score of a positive sample is greater than score of a negative sample given randomly to a positive sample and a negative sample.

1.3 Hit thewire (HR)

In top-K recommendation, HR is a commonly used indicator to measure recall rate, and its calculation formula is as follows:


e

The denominator is the sum of all test sets and the number of tests belonging to each user’s top-K recommendation list in the formula. For a simple example, the number of goods of three users in the test set is 10, 12 and 8 respectively. In the top-10 recommendation list obtained by the model, there are 6, 5 and 4 items respectively in the test set. Then, the value of HR is (6+5+4)/(10+12+8) = 0.5.

1.4 Mean Average Precision (MAP)

Before understanding MAP(Mean Average Precision), take a look at AP(Average Precision), which is the Average accuracy. For example, for user U, we recommend some items to him, then the average accuracy of U is defined as:


To illustrate the AP calculation process, use an example:


Therefore, the AP of the user is (1 + 0.66 + 0.5) / 3 = 0.72

Then for MAP(Mean Average Precision), it is easy to know that the AP of all users u is averaged (Mean). Then the calculation formula is as follows:


A Normalized trial of Cummulative Gain(NDCG)

For NDCG, we need to uncover its mystery step by step, starting with CG: CG we start with CG(Cummulative Gain), which is called “cumulative Gain” in direct translation. In recommendation systems, CG aggregates the score of each recommendation’s relevance as the score of the whole list of recommendations. namely


Here, rel-i represents the correlation of the recommendation results at position I, and K represents the size of the recommendation list to be examined.

One disadvantage of DCG CG is that it does not consider the influence of different positions of each recommendation result on the overall recommendation effect. For example, we always hope that the results with high correlation should be ranked first. Obviously, Beijing Discounted Cummulative Gain (DCG) will seriously affect user experience if the results with low correlation rank at the top. This refers to the “discount treatment” for the recommendation effect of the lower ranking recommendation results:


image

Two conclusions can be drawn from the above formula: 1) The greater the correlation of recommendation results, the greater the DCG. 2) The better the recommendation effect is, the bigger the DCG is.

NDCG DCG still has its limitations, that is, it is difficult to evaluate horizontally between different recommendation lists. When we evaluate a recommendation system, it is not possible to use only one user’s recommendation list and corresponding results to evaluate, but to evaluate the whole test set of users and their recommendation list results. Normalized quotidian Cummulative Gain (NDCG) is required for the evaluation scores of different users’ recommendation lists.

Before introducing NDCG, one more concept needs to be understood: IDCG. IDCG, Ideal DCG, refers to the list of best recommendations returned by the recommendatory system for a user, assuming that the returned results are sorted by relevance, with the most relevant results at the top, and the DCG of this sequence is IDCG. Therefore, the value of DCG is between (0,IDCG], and the value of NDCG is between (0,1]. Then NDCG@K of user U is defined as:


Therefore, the average NDCG can be calculated as:


1.6 Mean Frame Rank (MRR)

MRR calculation formula is as follows:


The | | Q is the number of users, ranki is for the case of a user, I recommend the list the first item in the ground – way result in the arrangement of the position.

For example, if there are three users in the recommended list whose minimum rank is 3, 2, 1, then MRR=(1 + 0.5 + 0.33) / 3 = 0.61

1.7 skyriver

ILS is an indicator to measure the diversity of recommended lists. The calculation formula is as follows:


If S(bi, BJ) calculates the similarity of the two items I and J, and if the items in the recommendation list are less similar, the ILS is smaller, then the diversity of the recommendation results is better.

More knowledge on recommendation system evaluation index, we can see before summary of two articles: recommendation system meets the deep learning evaluation index AUC (9) – theory and practice: www.jianshu.com/p/4dde15a56… Recommendation system meets the deep learning (16), explanation of the recommendation system commonly used evaluation indexes: www.jianshu.com/p/665f9f168…

The relevant code implementation is here: github.com/princewen/t…

2. Data in the click rate estimation problem

The data in the CTR estimation problem are mainly divided into discrete variables and continuous variables. For continuous variables, you can directly insert them into the calculation. For discrete (category) variables, we usually adopt the one-hot form, such as the following data:


After one-hot encoding the above data, it looks like this:


After one-hot encoding, the data of the sample inevitably becomes sparse. For a very simple example, assume that the item on Taobao or JINGdong is 1 million. If one-hot coding is carried out on the dimension of item, the sparsity of data in the dimension of light is one part in a million. It can be seen that data sparsity is a very common challenge and problem we face in practical application scenarios.

3. Traditional methods

3.1 Linear model

The general linear model is:


It is easy to see from the above formula that the general linear model does not take into account correlations between features. To express the correlation between features, a polynomial model is used. In a polynomial model, the combination of features XI and xj is represented by xixj. For simplicity, we discuss second-order polynomial models. The specific model expression is as follows:


In the above formula, n represents the number of features of the sample, and xi represents the ith feature.

However, for linear models, generalization ability is weak, especially for one-hot features of the same discrete feature expansion, the product between two is always 0.

3.2 FM model

FM introduces a hidden variable for each feature, and uses the product of hidden variables as the weight of feature crossing:




The feature crossing part of FM can be simplified to simplify the calculation, and the process is as follows:


More details about the FM, refer to the article: recommendation system meets the deep learning (a) – FM model theory and practice: www.jianshu.com/p/152ae633f… Code address: github.com/princewen/t…

3.3 FFM model

On the basis of FM, FFM model introduces the concept of category, that is, field. Taking the data from last lecture as an example, let’s look at the following figure:


In the case of AD click above, “Day=26/11/15”, “Day=1/7/14” and “Day=19/2/15” all represent dates and can be placed in the same field. Similarly, Country can be placed in a field. To put it simply, the same categorical feature generated by one-hot encoding can be placed in the same field, including user nationality, advertising type, date and so on.

In FFM, every one-dimensional feature xi, for every kind of field fj of other features, learns a hidden vector v_i,fj. Therefore, hidden vectors are not only related to features, but also to fields. In other words, different implicit vectors are used when the feature “Day=26/11/15” is associated with the feature “Country” and the feature “Ad_type”, which is consistent with the internal difference between” Country “and” Ad_type”, and is also the origin of “field-aware” in FFM.

Assuming that n features of the sample belong to F fields, then the quadratic term of FFM has NF hidden vectors. In FM model, there is only one hidden vector for each one-dimensional feature. FM can be regarded as a special case of FFM, which is the FFM model when all features are attributed to a field. According to the field sensitive property of FFM, its model equation can be derived.


It can be seen that if the length of the hidden vector is K, there are NFK secondary parameters of FFM, which are far more than NK of FM model. In addition, because the hidden vector is related to field, the quadratic term of FFM cannot be simplified, and its predictive complexity is O(kN ^2).

For more details on FFM, refer to the article: recommendation system meets the deep learning (2) – FFM model theory and practice: www.jianshu.com/p/781cde3d5… Code address: github.com/princewen/t…

3.4 GBDT + LR model

In Facebook’s 2014 article, GBDT was introduced to solve LR’s feature combination problem. Later, The Kaggle competition also put this idea into practice, and the integration of GBDT and LR began to attract attention in the industry.

An example of the fusion solution of GBDT and LR is provided in FaceBook’s paper:


There are two trees in the figure, and X is an input sample. After traversing the two trees, X samples fall on leaf nodes of the two trees respectively, and each leaf node corresponds to LR one-dimensional features. Then, all LR features corresponding to this sample can be obtained by traversing the tree. The new eigenvector is constructed with a value of 0/1. For example, in the figure above, there are two trees. The left tree has three leaf nodes and the right tree has two leaf nodes. The final feature is a five-dimensional vector. For input x, it is assumed that it falls on the first node of the left tree and codes [1,0,0], while falls on the second node of the right tree and codes [0,1], so the overall code is [1,0,0,0,1]. Such codes are input into LR as features for classification.

More details about GBDT + LR, refer to the article: recommendation system meets the deep learning (10) — GBDT + LR fusion scheme of actual combat: www.jianshu.com/p/96173f2c2… Code address: github.com/princewen/t…

4. Deep learning methods

In CTR prediction, in order to solve the problem of sparse features, scholars proposed FM model to model the interaction between features. However, FM model can only express the relationship between two pairs of features, but cannot model the Deep relationship between two features or the interaction relationship between multiple features. Therefore, scholars use Deep Network to model the relationship between higher-order features.

Therefore, the combination of FM and deep network DNN has become the mainstream method for CTR prediction. There are two main methods for the combination of FM and DNN, parallel structure and serial structure. The understanding and realization of the two structures are shown in the following table:

structure describe Common model
Parallel structure FM part and DNN part are calculated separately, and only one fusion is performed in the output layer to obtain the results DeepFM, DCN, Wide&Deep
Serial structure The results of the first and second terms of FM (or either of them) are taken as the input of the DNN part, and the final results are obtained through DNN PNN,NFM,AFM

Typical network models of the two types of structures are shown as follows:




Let’s review typical models of these two types of structures.

4.1 Parallel Structure

4.1.1 Wide & Deep model

Wide & Deep models this series has not been cleaned up yet, but can be briefly introduced. The structure of Wide & Deep model is as follows:


The Wide part is a generalized linear model. The input is mainly composed of two parts, one is the original feature and the other is the interactive feature. We can construct the K-group interactive feature in the form of cross-product transformation:


The Deep part is a DNN model, with each layer calculated as follows:


Joint training

The Wide & Deep model takes the form of joint training rather than integration. The difference between the two is that the joint training uses a common loss function and then updates the parameters of each part at the same time, while the integration method trains N models independently and then integrates them. Therefore, the output of the model is:


For more details on the Wide&Deep model, you can read the original paper or follow subsequent articles in this series.

4.1.2 DeepFM model

Let’s take a look at the DeepFM model structure:


DeepFM consists of two parts: neural network part and factor factorizer part, which are responsible for low-order feature extraction and high-order feature extraction respectively. The two parts share the same input. The prediction results of DeepFM can be written as:


FM part

The detailed structure of FM is as follows:


The FM part is a factorizer. We won’t talk too much about it here. The output formula of FM is:


The depth of the part


The depth part is a feedforward neural network. Unlike input such as image or speech, which is generally continuous and dense, input for CTR is usually extremely sparse. Therefore, the network structure needs to be redesigned. In the concrete implementation, before the first hidden layer, an embedding layer is introduced to complete the compression of the input vector to the low-dimensional dense vector.


The structure of the embedding layer is shown in the figure above. The current network structure has two interesting features. 1) Although the input lengths of different fields are different, the length of vector after embedding is K. 2) For the same feature, the implicit variable of FM and vector after Embedding are the same, and the two parts share the same input.

About DeepFM more details, refer to the article: recommendation system meets the deep learning (3) – DeepFM model theory and practice: www.jianshu.com/p/6f1c2643d… Code address: github.com/princewen/t…

4.1.3 Deep Cross Network

A DCN model starts with an embedding and accumulation layer, followed by a crossover network and a parallel deep network, followed by a final composite layer that combines the outputs of the two networks. The complete network model is shown as follows:


image

Embedding and stacking layers We consider input data with discrete and continuous characteristics. In a network size recommendation system, such as CTR prediction, the input is mainly classified features, such as “country= USA”. These features are usually encoded as unique heat vectors such as “[0,1,0]”; However, this tends to lead to excessive high-dimensional feature space for large vocabularies.

To reduce the dimension, we use an embedding process to transform these discrete features into dense vectors with real values (usually called embedding vectors) :


Then, we superimpose the embedding vector and the continuous feature vector to form a vector:


The spliced vector X0 will serve as the input of our Cross Network and Deep Network

Cross Network

The core idea of crossover network is to apply explicit feature crossover in an effective way. The cross network consists of cross layers, and each layer has the following formula:


A visualization of a cross layer is shown below:


It can be seen that the special structure of crossover network makes the degree of crossover feature increase with the increase of layer depth. The highest degree of the polynomial (in terms of input X0) is L + 1 for the l-layer crossover network. If Lc is used to represent the number of cross layers, D represents the input dimension. Then, the number of parameters involved across the network parameters are: d * Lc * 2 (w and B)

The few parameters of the crossover network limit the model capacity. To capture highly nonlinear interactions, the model introduces a deep network in parallel.

Deep Network

The deep network is a fully connected feedforward neural network, and each deep layer has the following formula:


Combination Layer

The link layer connects the outputs of two parallel networks and obtains the outputs through a full link layer:


If the logarithmic loss function is adopted, the form of the loss function is as follows:


More details on the DCN, refer to the article: recommendation system meets the deep learning (5) – Deep&Cross Network model theory and practice: www.jianshu.com/p/77719fc25… Code address: github.com/princewen/t…

4.2 Serial Structure

2 the Product – -based Neural Network

PNN, also known as product-based Neural Network, believes that the expression of cross features learned by embedding into MLP is not sufficient. A Product Layer idea is proposed, which reflects the DNN Network structure of cross signs based on multiplication operation, as shown in the following figure:


We mainly focus on the product-layer here. The Product Layer can be divided into two parts, one is the linear part LZ and the other is the nonlinear part LP.


We need to know z and P which are obtained by embedding, where Z is a linear signal vector, so we can use embedding directly to get:


The equal sign plus a triangle used in this paper actually means equal. You can think that Z is the copy of embedding.

For p, we need a formula to map:




The choice of different g makes us have two calculation methods of PNN, one is called Inner PNN, or IPNN for short, and the other is called Outer PNN, or OPNN for short.

IPNN

The schematic diagram of IPNN is as follows:


The calculation of P in IPNN is as follows, that is, the inner product is used to represent PIJ:


OPNN

The schematic diagram of OPNN is as follows:


The calculation of P in OPNN is as follows:


More details on the PNN, refer to the article: recommendation system meets the deep learning (6) – PNN model theory and practice: www.jianshu.com/p/be784ab4a… Code address: github.com/princewen/t…

4.2.2 Neural factorization those

For NFM model, the prediction formula of target value becomes:


Where, F (x) is a multi-layer feedforward neural network module used to model the interaction between features, and the architecture diagram is shown as follows:


Embedding Layer is the same as several networks among us, and the vector that Embedding gets is actually the implicit variable V that we need to learn in FM.

The bi-interaction Layer has a nice name. In fact, it is the process of computing the quadratic terms in FM, so the vector dimensions obtained are the dimensions of our Embedding. The end result:


Hidden Layers is our DNN part, and the results obtained by bi-interaction Layer are connected to multi-layer neural network for training, so as to capture complex nonlinear relations between features.

After multi-layer training, the sum of the output of the last layer is added to both the first term and the bias term to obtain our predicted output:


(NFM) for more details, refer to the article: recommendation system in depth study (7) – (NFM) model theory and practice: www.jianshu.com/p/4e65723ee… Code address: github.com/princewen/t…

Holdings Attention – -based factorization those

When making prediction, FM will fix a feature with a specific vector. When this feature intersects with other features, the same vector is used for calculation. This is very unreasonable, because the intersection between different features, the degree of importance is not the same. How to reflect this degree of importance, the FFM model introduced earlier is a solution. In addition, AFM model combined with attention mechanism is also a solution.

What is attention model? Instead of going into details in this paper, what we need to know here is that the attention mechanism is equivalent to a weighted average in which the value of attention is the weight to judge the importance of interactions between different features.

As mentioned earlier, attention is equal to the weighting process, so our prediction formula becomes:


The symbol with a dot in the circle stands for element-wise product.


So what we get after summation is a K dimensional vector, and we have to multiply by some vector p to get a specific number.

As can be seen, the first two parts of AFM are the same as FM, and the latter one is obtained through the following network:


The first three parts in the figure: Sparse Iput, Embedding Layer, and pairwise Interaction Layer are the same as FM. The last two parts are the innovation of AFM, namely our Attention Net. The math behind Attention is as follows:


To sum up, it is not difficult to see that AFM only adds attention mechanism on the basis of FM. But in fact, due to the final weighted accumulation, the quadratic term does not carry out a deeper network to learn nonlinear cross features, so AFM does not give full play to the advantages of DNN. Maybe DNN can be combined to achieve better results.

More details on the AFM, refer to the article: recommendation system meets the deep learning (8) — AFM model theory and practice: www.jianshu.com/p/83d3b2a1e… Code address: github.com/princewen/t…

5. Reinforcement learning method

In DRN:A Deep Reinforcement Learning Framework for News Recommendation, A News Recommendation model based on Reinforcement Learning is proposed. Let’s review:

Problems and Solutions The method proposed in this paper mainly aims at three problems: 1. DQN is used to model the dynamic variation of user interest; 2. Recommendation algorithms usually only consider user click/non-click or user score as feedback, and user activity is taken as feedback information in this paper. 3. Current recommendation systems tend to recommend things that users repeat or have similar contents. This paper uses the Dueling Bandit Gradient Descent method to conduct effective exploration.

Therefore, the framework of this paper is as follows:


The overall framework of the model is shown in the figure below:


There are several key links: PUSH: At each moment, when users send requests, the agent will generate k stories based on the current state and recommend them to users. This recommendation result is a combination of exploitation and exploration

FEEDBACK: FEEDBACK results can be obtained through users’ click behaviors on recommended news.

MINOR UPDATE: According to the users’ information (state), recommended news (action) and rewards, the agents will evaluate exploitation network Q and exploration network Q ̃ are all working in South Africa. Exploitation network Q works better, the model will stay cremated, and if the Exploration network Q ̃ is working better, Exploitation Network Q will also usher in the municipality of Exploration Network Q.

MAJOR UPDATE: After a period of time, the parameters of the EXPLOITATION Network Q model will be updated based on the historical experience stored in the experience pool of the DQN.

The double-Dueling structure is used in the exploration model of this paper. States are composed of user features and context features, and actions are composed of news features and user-news interaction features:


To explore the model

The exploration in this paper adopts Dueling Bandit Gradient Descent algorithm, and the algorithm structure is as follows:


According to the working conditions of DQN, an exploration Network Q ̃ o is created according to the parameters of the existing Q network. When a user request comes in, the top-K news list is generated by both networks, and then the news generated by both networks is mixed to a certain extent, and then the user’s feedback is obtained. According to the municipality of Exploration Network Q, the parameters of the Q network are updated to the direction of Exploration Network Q ̃ I.

For more details about this paper, see DRN:A Deep Reinforcement Learning Framework for News Recommendation. www.jianshu.com/p/c0384b213…

6. EE problems of the recommendation system

Exploration and Exploitation(EE) is a common problem in computational advertising and recommendation systems. To put it simply, to balance the accuracy and diversity of recommendation systems.

Exploitation in the EE problem is that users’ interests are certain, and of course they should exploit them. For example, they should spend the money they have earned. Exploration means that users will soon tire of using their interests, so they must constantly explore new interests. This is like having a little money to spend, but you must continue to make money, or you will have to spend it.

6.1 Bandit algorithm

Bandit algorithm is an effective algorithm to solve EE problems. Bandit algorithm is derived from the long history of gambling science. The problem it solves is as follows: A gambler wants to shake a slot machine. When he walks into the casino, he sees a row of slot machines, all of which look exactly the same, but each machine has a different probability of spitting out money. He does not know what the probability distribution of spitting out money from each slot machine is. This is the multi-armed bandit problem (K-Armed Bandit Problem, MAB).

How does Bandit relate to EE issues in recommendation systems? Suppose that we have made some experiments to get the current payout probabilities for each slot machine. If we want to get the most bang for our money, we will keep shaking the slot machine with the highest payout probability. This is called Exploitation. However, the current information is not the true probability of slot machine payouts. There may be better slot machines with higher probability of payouts, so further Exploration is needed.

Here are some classic Bandit algorithms:

Naive Bandit algorithm: Firstly, random attempts are made several times to calculate the average return of each arm, and the arm with the largest mean is always selected.

Epsilon-greedy algorithm: select a smaller number Epsilon between (0,1) and randomly select one of all arms with the probability of Epsilon each time. With a probability of 1-Epsilon, pick the arm with the highest average return so far. The return expectation is updated according to the return value of the selection arm.

Thompson Sampling algorithm: Thompson Sampling algorithm uses Beta distribution. This method assumes that each slot machine has a probability P of payout, and the probability distribution of this probability P conforms to Beta (WINS, lose) distribution. Each arm maintains a parameter of Beta distribution, i.e. Wins, lose. After each test, select an arm and shake it. If there are gains, wins of the arm will increase by 1; otherwise, lose of the arm will increase by 1. Each arm selection method is: use the existing beta distribution of each arm to generate a random number B, and select the arm with the largest random number generated by all the arms to swing.

UCB algorithm: This algorithm always optimistically thinks that each slot machine can get a return of P ‘+ ∆ at each recommendation. The calculation formula of P ‘+ ∆ is as follows:


Where the plus sign is preceded by the average return of the JTH slot machine so far, and the following is called Bonus, which is essentially the standard deviation of the mean, where T is the current number of trials, and n is the number of trials of the slot machine.

More details on the problem of EE, refer to the article: recommendation system meets the deep learning (12) – EE problems in recommender systems and basic algorithm of Bandit: www.jianshu.com/p/95b2de50c…

6.2 LinUCB algorithm

The MAB mentioned above is context-free, that is, it does not take into account the user’s personalized problem, so it is rarely applied in practice. We mostly use Contextual Bandit algorithms in the real world. LinUCB is one of them.

Since it is an extension of the UCB algorithm, we still choose the appropriate slot machine according to P ‘+ ∆. The calculation of p’ is based on supervised learning. We maintain a feature vector D for each slot machine, while we write θ for the context feature, and then conduct supervised learning through the collected feedback:


The confidence upper bound is calculated based on the following formula:


Therefore, the flow of LinUCB algorithm is as follows:


About LinUCB more details, refer to the article: recommendation system meets the deep learning (13) — LinUCB method is analysed and implemented: www.jianshu.com/p/e0e843d78… Code address: github.com/princewen/t…

7. Actual practice of recommendation system in the company

7.1 Ali MLR algorithm

MLR can be regarded as a natural extension of LR. It adopts the idea of divide and conquer and uses the fragment linear model to fit the nonlinear classification plane of high-dimensional space. Its formal expression is as follows:


Where u is the clustering parameter, which determines the division of the space, w is the classification parameter, which determines the prediction within the space. In this case, the number of hyperparametric fragments m can balance the fitting and generalization ability of the model. When m=1, MLR degenerates into ordinary LR. The larger M is, the stronger the fitting ability of the model is. However, the model parameter scale increases linearly with M, and the corresponding training samples required also increase. Therefore, in practical application, M needs to be selected according to the actual situation. For example, in Ali’s scenario, m is generally chosen to be 12. In the following figure, the MLR model can perfectly fit the rhomboid classification surface in the data with 4 segments.


In practice, the common form of MLR algorithm is as follows, using Softmax as the sharding function:


In this case, the MLR model can be viewed as a FOE model:


As for the design of loss function, Ali adopts NEg-Likelihood loss function and L1 and L2 regularization, in the form as follows:


Due to the addition of the regular term, MLR algorithm is no longer smooth convex function, and the gradient descent method is no longer applicable. Therefore, the update of model parameters uses the combination of LBFGS and OWLQN. For specific optimization details, you can refer to the paper arxiv.org/pdf/1704.05… .

More details about the MLR, refer to the article: recommendation system meets the deep learning (17) – explore shallow of MLR algorithm and realization of ali: www.jianshu.com/p/627fc0d75… Code address: github.com/princewen/t…

7.2 Ali Deep Interest Network

By observing the online data collected, ali researchers found that there were two very important features in user behavior data: Diversity: users showed very diverse interests when browsing e-commerce websites. Local activation: Due to the diversity of user interests, only some historical data will affect whether or not a recommended item is clicked, not all historical records.

For the above two features, Ali added a Attention mechanism in the recommendation network to weight users’ historical behaviors:


Two other details are worth noting:

Evaluation index GAUC

The evaluation index used by the model is GAUC. Let’s first look at the calculation formula of GAUC:


GAUC calculation not only calculates the AUC of each user separately, but also weights the AUC of each user based on the number of user impressions or clicks. The effect of user bias on the model is further eliminated. The experiment proves that GAUC is indeed a more reasonable evaluation index.

Dice Activation function

When PRelu is used as the activation function, there is a problem, that is, we think that the segmentation point is 0, but in fact, the segmentation point should be determined by the data, so the Dice activation function is proposed in this paper.

The Dice Activation Function stands for Data Dependent Activation Function and has the following form:


Where, expectation and variance are calculated as follows:


But you can see that every yi corresponds to a probability PI. PI calculation is mainly divided into two steps: yi standardization and Sigmoid transformation.

Adaptive Regularization

For the long tail of user data, Ali proposed the method of adaptive regularization, namely: 1. According to the frequency of feature ID, the intensity of regularization should be adjusted accordingly. 2. Smaller regularization intensity is given to those with high frequency; 3. For those with low frequency, greater regularization intensity is given.

The calculation formula is as follows:


More details about the DIN, refer to the article: recommendation system meets the deep learning road ali (18) – the depth of interest network (DIN) is analysed and the implementation: www.jianshu.com/p/73b6f5d00…

7.3 Ali ESSM Model

This model mainly solves two major problems in CVR estimation: sample selection bias and sparse data. Sample selection bias: Most CVR estimation problems are trained on the sample space that the user has clicked on, while predictions are made for the entire sample space. This kind of training sample is extracted from a small set of the whole sample space, but the trained model needs to infer and predict the samples in the whole sample space, which is called sample selection bias.

Sparse data: items clicked by users only account for a very small part of the entire sample space, making model training very difficult.

The ESMM model proposed by Ali Mom’s algorithm borrowed the idea of multi-task learning and introduced two auxiliary learning tasks to fit pCTR and pCTCVR respectively, thus eliminating the two challenges mentioned above. The ESMM model can make full use of the sequential pattern of user behavior, and its model architecture is shown below:


It can be seen that the ESSM model is composed of two sub-networks, the sub-network on the left is used to fit pCVR, and the sub-network on the right is used to fit pCTR. Meanwhile, pCTCVR can be obtained after the output of the two sub-networks is multiplied. Therefore, the network structure has three sub-tasks, which are used to output pCTR, pCVR and pCTCVR respectively.

Assuming that x represents feature(impression), Y represents click, and Z represents transformation, pCTCVR = pCTR * pCVR can be obtained as follows:


By converting multiplication into division, we can obtain the calculation of pCVR:


We took the exposure events with clicking behaviors as positive samples and the exposure events without clicking behaviors as negative samples to perform CTR prediction tasks. Exposure events with both click behavior and purchase behavior were used as positive samples, while others were used as negative samples to train the estimation part of CTCVR.

So what does the model do? As you can see, the input x used to train the two tasks is actually the same, but the label is different. The CTR task predicts clicking Y, and the CTCVR predicts converting Z. Therefore, we input (x,y) into the CTR task to obtain the estimated value of CTR, input (x,z) into the CVR task to obtain the estimated value of CVR, and then multiply the estimated value of CTR and CVR to obtain the estimated value of CTCVR. Therefore, the loss function of the model can be defined as:


Where θ CTR and θ CVR are parameters of CTR network and CVR network respectively, L (⋅) is cross entropy loss function.

It is also worth noting that the embedding layers of two sub-networks share the embedding vector.

More details about the ESSM model, refer to the article: recommendation system meets the deep learning road ali (19) – the full space of multitasking model ESSM:www.jianshu.com/p/35f00299c…

7.4 JD reinforcement learning recommendation model

Jd makes list-wise recommendations through reinforcement learning.

Before the online recommendation system is built, offline training and evaluation are required, which are mainly based on historical behavior data of users. However, we only have ground-truth data and corresponding feedback. Therefore, for the entire action space (that is, all possible combinations of items), it is very sparse. This will cause two problems. First, only part of state-action pairs can be obtained for training, and all cases cannot be modeled (which may lead to over-fitting). Second, the inconsistency of online and offline environments will be caused. Therefore, an emulator is needed to simulate the reward value of the state-action that does not appear, for training and evaluating the offline model.

The construction of the emulator is mainly based on the historical data of users. The basic idea is that given a similar state and action, different users will also make similar feedback. To model the reward value for the state-action.

The model is trained by actor-Critic method and the simulator just established:


Details of the method, the reference articles: recommendation system meets the deep learning (15) — of intensive study in jingdong to explore of the recommendation: www.jianshu.com/p/b9113332e…

9. Sorting out papers

1, GBDT+LR:quinonero.net/Publication… 2. DeepFM: arxiv.org/abs/1703.04… 3, Wide&Deep: dl.acm.org/citation.cf… 4, DCN: arxiv.org/pdf/1708.05… 5, PNN: arxiv.org/pdf/1611.00… 6. AFM: www.comp.nus.edu.sg/~xiangnan/p… 7, NFM: arxiv.org/abs/1708.05… 8, LinUCB: arxiv.org/pdf/1003.01… 9. MLR: arxiv.org/pdf/1704.05… 10, DIN: arxiv.org/abs/1706.06… 11, ESSM: arxiv.org/abs/1804.07… 12, JINGdong: arxiv.org/abs/1801.00…

Welcome to personal public account: Little excavator


Add wechat SXW2251, can pull you into the small excavator technology exchange group yo!