1. What is a recommendation system

Recommendation systems are used by e-commerce websites to provide customers with product information and suggestions, help users decide what to buy, and simulate salespeople guiding customers through the purchase process. Personalized recommendation means recommending information and products of interest to users according to their interest profiles and purchasing behavior.

With the continuous expansion of e-commerce, the number and variety of goods grow rapidly, and customers need to spend a lot of time finding the goods they want to buy. Wading through large amounts of irrelevant information and products will steadily drive away consumers who are already drowning in information overload.

To solve these problems, personalized recommendation systems came into being. A personalized recommendation system is an advanced business-intelligence platform built on large-scale data mining that helps an e-commerce website provide fully personalized decision support and information services to its shopping customers.

Common recommendation scenarios include Taobao's "Guess You Like", "Viewed Again and Again", and recommended products, as well as Meituan's home-page and nearby recommendations.

A recommendation system leans toward an engineering system. To be accurate, it needs not only recommendation algorithms but also user-intent recognition, text analysis, behavior analysis, and more; it is a highly comprehensive system.

2. Overall architecture

The recommendation system architectures described in this section are not mutually exclusive; a real system may use one or several of them. In actual design, readers can take the architectures introduced here as a starting point and, combining independent thinking with their own business characteristics, design a system that suits their own business.

By how quickly they respond to user behavior, recommendation systems can be roughly divided into offline-training and online-training systems.

2.1 Offline Recommendation

The recommendation system architecture based on offline training is the most common one. Here "offline" training means training on data accumulated over a period of time (several days or weeks, for example), so a model iteration cycle is long (typically measured in hours). The model fits users' medium- and long-term interests.

As shown in the figure below, a typical recommendation system architecture based on offline training consists of data reporting, offline training, online storage, real-time computing, and A/B testing. Data reporting and offline training constitute the learning system in the supervised-learning sense, while real-time computing and A/B testing constitute the prediction system. Besides the model itself, an online storage module stores the model and the feature data the model needs, for the real-time computing module to read. The modules in the figure are connected by a training data flow and a prediction data flow: the training flow collects business data, trains the model, and writes it to online storage; the prediction flow receives the service's prediction requests and reaches the real-time computing module through the A/B testing module to obtain prediction results.

  1. Data reporting: The data reporting module collects business data to form training samples. It is generally divided into collection, validation, cleaning, and conversion steps; the collected data is converted into the sample format required for training and stored in the offline storage module.

  2. Offline training: The offline training module is subdivided into offline storage and offline computing. Recommendation systems used in real services generally need to process massive user behavior data, so the offline storage module needs a distributed file system or storage platform to hold the data. Common offline computing operations include data sampling, feature engineering, model training, and similarity calculation.

  3. Online storage: Online services have strict latency requirements. For example, when a user opens a mobile app, he expects the app to respond quickly; if it takes too long, the user experience suffers. Generally, this requires the recommendation system to process a user request and return recommendation results within tens of milliseconds. Therefore, online services need a dedicated online storage module to hold the models and feature data used online.

  4. Real-time recommendation: The real-time recommendation module serves prediction on new requests from the business. It typically: 1. obtains user features; 2. calls the recommendation model; 3. ranks the results.

    In practice, because the business item pool is so large, scoring every item in real time with a complex model could take too long and hurt user satisfaction. A common practice is therefore to split recommendation-list generation into recall and ranking steps. Recall sifts a large candidate pool (say, millions of items) down to a set of candidates (typically hundreds) the user is more likely to like. Ranking then scores this relatively small candidate set with the ranking model. Furthermore, after ranking produces the recommendation list, a third re-ranking and filtering step is often added for the sake of diversity and operations. A minimal sketch of the two-stage recall/rank split follows this list.

  5. A/B testing: A/B testing is basically a necessary module for Internet products, and recommendation systems are no exception. It helps developers evaluate the impact of new algorithms on customer behavior. Besides offline metrics, a new recommendation algorithm generally goes through an A/B test to verify its effectiveness before going fully online.
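To make the recall/rank split above concrete, here is a minimal sketch of a two-stage pipeline. Everything in it is illustrative: the dot-product recall, the random embeddings, and the linear stand-in for the ranking model are assumptions for demonstration, not a description of any particular production system.

```python
import numpy as np

def recall(user_vec, item_vecs, k=500):
    """Recall stage: cheap dot-product scoring over the full item pool,
    keeping only the top-k candidates."""
    scores = item_vecs @ user_vec          # one score per item
    return np.argpartition(-scores, k)[:k]  # O(n) partial selection

def rank(user_vec, item_vecs, candidates, weights):
    """Ranking stage: a (stand-in) richer model scores only the candidates."""
    feats = np.hstack([np.tile(user_vec, (len(candidates), 1)),
                       item_vecs[candidates]])
    scores = feats @ weights               # placeholder for LR/GBDT/DNN
    return candidates[np.argsort(-scores)]

rng = np.random.default_rng(0)
items = rng.normal(size=(100_000, 16))     # hypothetical item embeddings
user = rng.normal(size=16)                 # hypothetical user embedding
w = rng.normal(size=32)                    # hypothetical ranking weights
final = rank(user, items, recall(user, items, k=500), w)[:30]
```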

The following figure shows the flow between the components in a corresponding real system. Note that recall and ranking have already been completed by the time the recommendation list is generated, so the business layer can directly call an API to get the list.

2.2 Online Training

For the business, we want a user's feedback on the last ad (liked or not, clicked or not) to be used quickly in the recommendation of the next ad. This requires solving the problem another way, and that way is online training.

The recommendation architecture based on online training suits scenarios with high dimensionality, large data volume, and high real-time requirements, such as advertising and e-commerce. Compared with offline training, a recommendation system based on online training does not separate training and testing stages: it keeps learning round by round, adjusting its strategy through real-time feedback. Online training requires samples, features, and models to be processed in real time, so the recommended content reflects users' real-time preferences faster. On the other hand, because online training often does not need to store all the training data, there is no huge offline storage overhead, which gives the system good scalability and lets it support large data volumes and large models.

  1. Sample processing: Compared with offline training, the main difference of online training in the data reporting stage lies in sample processing. For offline training, reported data is first stored in a distributed file system and then waits for offline computing tasks to process it. For online training, sample deduplication, filtering, and sampling must be computed in real time.
  2. Real-time features: The real-time feature module builds training samples by processing sample data in real time and joining in the required features, then feeds them to the streaming training module to update the model. Its main functions are feature joining and feature engineering.
  3. Streaming training: The streaming training module updates the model with real-time training samples; the incremental-update part of the recommendation algorithm is computed by streaming. One advantage of online training is that sparse model storage can be supported. In terms of training, online models need not start from scratch: they can train incrementally on top of the parameters obtained from offline training, as in the sketch after this list.
  4. Model storage and loading: Models are typically stored in a parameter server. After a model update, the model file is pushed to online storage and dynamically loaded by the online serving module.
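As a minimal sketch of the streaming-training idea, the toy class below performs per-sample logistic-regression updates on sparse features, starting from weights that could have come from offline training. The class name, feature names, and learning rate are illustrative assumptions, not any specific system's API.

```python
import math
from collections import defaultdict

class OnlineLR:
    """Sketch of streaming logistic-regression updates on sparse weights.
    Starting from offline-trained parameters and updating one sample at a
    time mirrors the incremental training described above."""
    def __init__(self, init_weights=None, lr=0.05):
        self.w = defaultdict(float, init_weights or {})
        self.lr = lr

    def predict(self, feats):  # feats: {feature_name: value}, sparse
        z = sum(self.w[f] * v for f, v in feats.items())
        return 1.0 / (1.0 + math.exp(-max(min(z, 35.0), -35.0)))

    def update(self, feats, label):  # label in {0, 1}
        err = self.predict(feats) - label  # gradient of log loss w.r.t. z
        for f, v in feats.items():
            self.w[f] -= self.lr * err * v

model = OnlineLR(init_weights={"bias": -1.2})  # e.g. loaded from offline training
stream = [({"bias": 1.0, "user_clicked_ads": 0.3}, 1),   # stand-in sample stream
          ({"bias": 1.0, "item_pop": 0.9}, 0)]
for feats, label in stream:
    model.update(feats, label)
```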

3. Characteristic data

To train the recommendation model, user behavior data must be collected and turned into feature vectors before training. A feature vector is composed of features and their weights. The following factors should be considered when computing feature vectors from user behavior.

  1. Types of user behavior: Within a site, there are many different types of user behavior toward items. Users can browse items, click links to items, bookmark items, rate items, buy items, comment on items, tag items differently, share items with friends, search for different keywords, and more. All of these behaviors have an impact on the weight of item features, but different behaviors have different influences. In most cases, it is difficult to determine which behavior is more important. The general standard is that the behavior with higher cost paid by users has higher weight.
  2. Timing of user behavior: Generally speaking, the recent behavior of the user is important, while the behavior of the user long ago is relatively minor. Therefore, if the user has recently purchased an item, the characteristics corresponding to that item will have a higher weight.
  3. Number of user actions: Sometimes a user can act multiple times on an item. For example, users will listen to a song many times or watch many episodes of a TV series. Therefore, the frequency of the same behavior for the same item also reflects the user’s interest in the item, and the item with more behavior times has higher feature weight.
  4. Item popularity: A user's action on a very popular item often says little about his personal taste, because users tend to follow the crowd and may not be much interested in the item; this is especially true of occasional low-cost actions (such as browsing) on hot items. More likely, links to the item are everywhere and it is simply easy to click. Conversely, if a user acts on an unpopular item, that speaks to his personal needs. Therefore, when generating user features, the recommendation engine increases the weight of features corresponding to less popular items (see the sketch after this list).
  5. Data denoising: samples must be denoised. Data from cheating behaviors such as click farming should be excluded from the training data, otherwise it will directly hurt the model. Missing values in samples also need to be handled.
  6. Balance of positive and negative samples: generally, the collected behavioral data are positive samples, producing severe class imbalance. Therefore, for each user, sample some items he has not acted on as negative samples, keeping the numbers of positive and negative samples per user roughly balanced.
  7. Feature combination: we need to consider the relationships between features. For example, in Meituan's hotel search ranking, a hotel's sales volume, price, and consumption level are strongly correlated factors; the user's age and location may be weakly correlated factors; and the user ID is a completely irrelevant factor. After determining which factors may be relevant to the prediction target, we need to express this information as numeric values: this is the process of feature extraction. In addition, users' browsing, transaction, and other behavior records in the app contain a large amount of information; feature extraction distills the relevant factors from it into numeric variables. Common statistical features include count features (e.g., number of views, number of orders), ratio features (e.g., CTR, conversion rate), and distribution features (e.g., price mean, standard deviation, quantiles, skewness, kurtosis).
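The toy function below sketches how several of the factors above (behavior type, recency, repeat count, item popularity) might be folded into one feature weight. The half-life decay, logarithmic damping, and popularity penalty are illustrative choices for demonstration, not a standard formula.

```python
import math
import time

def feature_weight(base, event_ts, n_events, item_pop, now=None,
                   half_life_days=14):
    """Toy feature weight combining the factors above: behavior-type weight
    (`base`), time decay, repeat count, and a popularity penalty."""
    now = now or time.time()
    age_days = (now - event_ts) / 86400
    decay = 0.5 ** (age_days / half_life_days)   # recent behavior counts more
    repeat = 1 + math.log1p(n_events)            # diminishing returns on repeats
    penalty = 1 / math.log1p(1 + item_pop)       # damp very popular items
    return base * decay * repeat * penalty

# e.g. a purchase (base=5) made 3 days ago, once, on an item 1000 users touched
w = feature_weight(5.0, time.time() - 3 * 86400, 1, 1000)
```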

4. Collaborative filtering algorithm

Collaborative filtering originated in 1992, when Xerox used it in a customized email-filtering system. Xerox's users were asked to select three to five topics from dozens of topics, and the collaborative filtering algorithm filtered messages based on each topic to personalize them.

Collaborative filtering comes in two flavors: item-based and user-based. The output is a top-N recommendation list.

4.1 Item-based Collaborative Filtering (ItemCF)

The core idea of item-based collaborative filtering is to recommend to users items similar to those they previously liked.

The item-based algorithm first calculates the similarity between items, which can be done in the following ways:

  1. Based on a list of common favorites


    $$w_{ij} = \frac{|N(i) \cap N(j)|}{\sqrt{|N(i)|\,|N(j)|}}$$

    Here $N(i)$ in the denominator is the number of users who bought item $i$, $N(j)$ is the number of users who bought item $j$, and the numerator $|N(i) \cap N(j)|$ is the number of users who bought both item $i$ and item $j$. The core of this formula is the proportion of users who buy the two items together: the more users bought both items, the more similar the items are. Note also that the denominator uses each item's total number of buyers as a penalty: an item may be so popular that it is often bought together with everything else, so dividing by its total buyer count reduces its similarity score with other items.

  2. Similarity calculation based on cosine

    The method above calculates item similarity by directly counting the number of users who bought both items. But a user may buy an item and not like it, so if the dataset contains explicit ratings, we can further introduce user ratings into the similarity calculation:


    $$w_{ij} = \frac{\sum_{k} r_{ki}\, r_{kj}}{\sqrt{\sum_{k} r_{ki}^{2}}\,\sqrt{\sum_{k} r_{kj}^{2}}}$$

    where $r_{ki}$ is user $k$'s rating of item $i$, or 0 if there is no rating.

  3. Punishment for hot items

    The problem of hot items can be solved by using the following formula:


    $$w_{ij} = \frac{|N(i) \cap N(j)|}{|N(i)|^{1-\alpha}\,|N(j)|^{\alpha}}$$

    When $\alpha \in (0.5, 1]$, the penalty on popular items becomes heavier, which lowers the similarity scores of hot items. A small sketch implementing these similarity computations follows.
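Here is a small Python sketch of the similarity computations above: it counts co-occurrences over each user's item set and applies the $\alpha$ popularity penalty ($\alpha = 0.5$ reduces to the basic formula). The toy data is made up.

```python
from collections import defaultdict
from itertools import combinations

def item_similarity(user_items, alpha=0.5):
    """ItemCF similarity following the formulas above:
    w_ij = |N(i) & N(j)| / (|N(i)|^(1-alpha) * |N(j)|^alpha)."""
    n = defaultdict(int)     # n[i] = number of users who touched item i
    co = defaultdict(int)    # co[(i, j)] = number of users who touched both
    for items in user_items.values():
        for i in items:
            n[i] += 1
        for i, j in combinations(sorted(items), 2):
            co[(i, j)] += 1
    return {(i, j): c / (n[i] ** (1 - alpha) * n[j] ** alpha)
            for (i, j), c in co.items()}

users = {"u1": {"a", "b", "c"}, "u2": {"a", "b"}, "u3": {"b", "c"}}
print(item_similarity(users))   # e.g. ('a', 'b') -> 2 / sqrt(2 * 3)
```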

4.2 User-based Collaborative Filtering (UserCF)

The principle of user-based collaborative filtering (UserCF) is similar to that of ItemCF. The difference: in item-based CF, user U buys item A, so items B, C, and D that are similar to A are recommended to U; in user-based CF, the similarity between user U and other users is computed first, then the users most similar to U are selected and the items they have purchased are recommended to U.

To compute user similarity, we first convert the user-to-item purchase records into item-to-user records, that is, an inverted index of items:

After the inverted index of items is established, the similarity between users can be calculated according to the similarity formula:


$$w_{ab} = \frac{|N(a) \cap N(b)|}{\sqrt{|N(a)|\,|N(b)|}}$$

where $N(a)$ is the number of items purchased by user $a$, $N(b)$ is the number of items purchased by user $b$, and $|N(a) \cap N(b)|$ is the number of items purchased by both. Given the user-similarity data, select the K users most similar to user $u$ and, from the items they have purchased, recommend to $u$ the ones $u$ has not yet purchased.
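A matching sketch for UserCF, assuming small in-memory data: build the item inverted index, count common items per user pair, then normalize with the formula above.

```python
from collections import defaultdict
import math

def user_similarity(user_items):
    """UserCF via the item inverted index described above:
    w_ab = |N(a) & N(b)| / sqrt(|N(a)| * |N(b)|)."""
    item_users = defaultdict(set)        # inverted index: item -> users
    for u, items in user_items.items():
        for i in items:
            item_users[i].add(u)
    overlap = defaultdict(int)           # common-item counts per user pair
    for users in item_users.values():
        for a in users:
            for b in users:
                if a < b:
                    overlap[(a, b)] += 1
    return {(a, b): c / math.sqrt(len(user_items[a]) * len(user_items[b]))
            for (a, b), c in overlap.items()}

data = {"A": {"i1", "i2"}, "B": {"i1", "i3"}, "C": {"i2", "i3", "i4"}}
print(user_similarity(data))  # then recommend the unseen items of the top-K neighbors
```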

4.3 Matrix Decomposition

The calculations above produce a similarity matrix whose size and dimensionality are very large, so dimensionality reduction is needed; SVD is used for this. For details, please refer to the dimension-reduction methods I wrote about before: 2.5 Dimension reduction Method
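As a reminder of what the SVD reduction does, here is a minimal numpy sketch on a made-up user-item rating matrix: keep the top-k singular values and reconstruct a low-rank approximation.

```python
import numpy as np

# Hypothetical 5-user x 4-item rating matrix; 0 marks "no rating".
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                        # keep only the top-k singular values
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # low-rank reconstruction of R
```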

Matrix factorization based on sparse autoencoders

Matrix factorization is a relatively mature technique in the recommendation field, but as the previous section showed, it essentially only approximates the original matrix through one decomposition, so its level of feature mining is not deep. In addition, matrix factorization cannot use the content features of items themselves, such as book categories or music genres. With the rise of neural networks, deeper feature representations can be obtained through multi-layer perceptrons, and content classification features can be incorporated. First, we introduce the design of the sparse autoencoder network.

  1. The basic autoencoder structure

    The simplest autoencoder structure is shown in the figure below: a three-layer neural network in which the output layer is made equal to the input layer and the middle layer's dimension is much lower than that of the input and output layers, yielding a compressed feature representation in that layer.

    In short, an autoencoder tries to learn a function whose output is approximately equal to its input; in other words, it tries to approximate the identity function. If the network's input data were completely random, for example if each input were an i.i.d. Gaussian random variable independent of all other features, this compressed representation would be very difficult to learn. But if the input data has hidden structure, for example input features that are correlated with each other, then the algorithm can discover those correlations.

  2. Multilayer structure

    Based on the single-hidden-layer structure above, we can extend to a multi-layer network and learn higher-level abstract features. A toy single-hidden-layer training sketch follows this list.
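Below is a toy single-hidden-layer autoencoder trained with plain batch gradient descent in numpy, matching the structure described above (output trained to equal the input, with a bottleneck much narrower than the input). The sizes, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((256, 20))             # toy input data; rows are samples
d, h = X.shape[1], 5                  # bottleneck much smaller than input

W1 = rng.normal(0, 0.1, (d, h)); b1 = np.zeros(h)
W2 = rng.normal(0, 0.1, (h, d)); b2 = np.zeros(d)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

lr = 0.5
for _ in range(2000):                 # plain batch gradient descent
    H = sigmoid(X @ W1 + b1)          # encoder: compressed representation
    Y = H @ W2 + b2                   # decoder tries to reproduce X
    err = Y - X                       # gradient of the squared-error loss
    gW2 = H.T @ err / len(X); gb2 = err.mean(0)
    dH = err @ W2.T * H * (1 - H)     # backprop through the sigmoid
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

codes = sigmoid(X @ W1 + b1)          # the learned low-dimensional features
```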

5. Latent factor model

5.1 Basic Ideas

Latent factor modeling is an important branch of recommendation systems. The core idea of the LFM (Latent Factor Model) is to connect users' interests and items through latent features.

The process is divided into three parts: map items to latent classes, determine the user's interest in each latent class, and then select items the user is interested in from those classes and recommend them. The classes come from automatic clustering based on user behavior statistics.

Latent factor models are widely used in top-N recommendation. Common examples include LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), topic models, matrix factorization, and so on.

First, an example to build intuition. Suppose there are two users, A and B, and we currently have user A's reading list. How should we recommend books to A and B?

For UserCF, you first need to find other users who have read the same books as them (users with similar interests), and then recommend other books that those users like. For ItemCF, it is necessary to recommend books similar to those they have already read. For example, user B has read a lot of books on data mining, so he can recommend books on machine learning or pattern recognition.

Another approach is the latent factor model, which puts books and user interests into latent categories. For a user, first obtain the categories he is interested in, then pick items from those categories that he might like.

5.2 Model Understanding

  1. How do you classify items?
  2. How do you determine which classes of items a user is interested in, and to what degree?
  3. For a given class, which of its items should be selected and recommended to the user, and how should each item's weight within the class be determined?

To solve the above problems, researchers proposed starting from the data: automatically discover the classes, then personalize recommendations within them. Latent semantic analysis handles these problems well because it clusters automatically based on user behavior statistics. Since its birth, latent semantic analysis has produced many famous models and methods, including pLSA, LDA, latent class models, latent topic models, and matrix factorization.

LFM calculates user U’s interest in item I through the following formula:


$$\text{Preference}(u, i) = r_{ui} = p_u^{T} q_i = \sum_{k=1}^{K} p_{u,k}\, q_{i,k}$$

In this formula, $p_{u,k}$ and $q_{i,k}$ are the model's parameters: $p_{u,k}$ measures the relationship between user $u$'s interest and the $k$-th latent class, and $q_{i,k}$ measures the relationship between the $k$-th latent class and item $i$. So the next question is how to compute these two parameters.

Readers familiar with optimization theory or machine learning will probably already see how: the two parameter sets are learned from a dataset. For each user $u$, the training set contains items that $u$ likes and items that $u$ is not interested in; learning from this dataset yields the model parameters above, for example with the SGD sketch below.
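A minimal SGD sketch for the LFM above: learn $p_u$ and $q_i$ from (user, item, label) triples with squared loss and L2 regularization. The hyperparameters and toy data are assumptions for illustration.

```python
import numpy as np

def train_lfm(samples, n_users, n_items, k=10, lr=0.02, reg=0.01, epochs=30):
    """SGD for Preference(u, i) = p_u . q_i on (user, item, label) triples,
    where label is 1 for liked items and 0 for sampled negatives."""
    rng = np.random.default_rng(0)
    P = rng.normal(0, 0.1, (n_users, k))   # user-to-latent-class affinities
    Q = rng.normal(0, 0.1, (n_items, k))   # latent-class-to-item affinities
    for _ in range(epochs):
        for u, i, y in samples:
            pu = P[u].copy()               # snapshot so both updates use old values
            err = y - pu @ Q[i]            # squared-loss residual
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# toy data: user 0 likes item 0 but not item 1; user 1 the reverse
P, Q = train_lfm([(0, 0, 1), (0, 1, 0), (1, 1, 1), (1, 0, 0)], 2, 2, k=4)
print(P[0] @ Q[0], P[0] @ Q[1])            # the first score should be higher
```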

6. Ranking algorithms

In industrial applications, recommendation systems are usually divided into two stages: recall and ranking. Collaborative filtering is a recall algorithm: recall produces a relatively small candidate list, which is then ranked to yield the final ordered recommendation list.

This process first filters hundreds or thousands of candidates out of tens of millions of items, then selects, say, 30 items per user in the ranking stage. The ranking can be thought of as a function F(user, item, context) that takes the user, the item, and the context, and outputs a score between 0 and 1, with the highest-scoring items chosen; this is often called CTR estimation. Common forms of the function F include:

  1. Logistic Regression

    The simplest is logistic regression (LR), a generalized linear model. Take a user's profile vector, e.g. [3, 1], concatenate an item's profile vector, e.g. [4, 0], and append a context vector, e.g. [0, 1, 1], to get [3, 1, 4, 0, 0, 1, 1]. If the user has interacted with the item, the label is 1, which together forms a positive sample; items the user skipped, or popular items the user never touched, can serve as negative samples with label 0. From these inputs and outputs a ranking model can be trained. The detailed model is described in: 2. Logistic regression

  2. GBDT

    Gradient boosting decision trees (GBDT) can also solve this ranking problem, with a model different from LR. GBDT is an ensemble model that uses multiple decision trees, each tree fitting the residual of the previous trees to achieve a good fit. When a sample is fed into a tree, it descends to a leaf node according to the condition at each internal node; that leaf's value is set to 1 and the rest to 0. See 3.2 GBDT for the detailed algorithm.

  3. GBDT+LR

    Stacking GBDT and LR offers a slight improvement over GBDT alone, with the greater benefit of preventing GBDT overfitting. Upgrading to GBDT+LR improved online performance by about 5%, and iterating on new features became easier because they no longer had to be crossed by hand (a minimal sketch of this stacking follows this list).

  4. GBDT+FM

    GBDT does not support high-dimensional sparse features. If high-dimensional features were added directly to LR, they would on the one hand need manual crossing and, on the other, blow up model dimension and computational complexity at the O(N^2) level. Therefore the GBDT+FM model shown in the figure was designed, replacing LR with a Factorization Machines model.

    The Factorization Machines (FM) model is shown below:

    $$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j$$
    It has the following advantages: (1) the first two terms form a linear model, equivalent in function to the LR model; (2) the third term is a second-order cross term that automatically crosses and combines features; (3) by introducing latent vectors, the computational complexity of training and prediction drops to O(N); (4) sparse features are supported.

    These advantages give GBDT+FM good support for sparse features: FM takes the GBDT leaf nodes and the sparse (content) features as input. The model structure is shown in the schematic below.

  5. DNN+GBDT+FM

    The GBDT+FM model makes insufficient use of deep features carrying structural information, such as embeddings, whereas a deep neural network (DNN) can learn embedding features together with ordinary dense features, extract deep information, and improve model accuracy; DNNs have been applied successfully across many machine learning fields. Therefore, a DNN was introduced into the ranking model to improve overall ranking quality.

    The ensemble architecture of DNN+GBDT+FM is shown in the figure. The FM layer is the model's last layer, i.e. the fusion layer; its input consists of three parts: the last hidden layer of the DNN, the output leaf nodes of the GBDT, and the high-dimensional sparse features. This ensemble improves on GBDT+FM by 4%.

    Training used distributed TensorFlow, online prediction used TensorFlow Serving microservices, and the DNN+GBDT+FM ensemble used the Adam optimizer. Adam combines AdaGrad (Adaptive Gradient) and RMSProp (Root Mean Square Propagation): it converges faster, each variable has its own step size, and the overall step size adapts to the current gradient, coping well with noisy data. Experiments compared several optimizers, and Adam performed best.
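Since GBDT+LR is the most self-contained of the hybrids above, here is a hedged scikit-learn sketch of the stacking pattern: the trees act as feature transformers, each leaf index becomes a one-hot feature, and LR is trained on top. The data, feature shapes, and hyperparameters are synthetic and illustrative, not those of any system described above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                     # stand-in dense CTR features
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)  # synthetic click labels

gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=3).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]        # leaf index per tree per sample
enc = OneHotEncoder().fit(leaves)      # each leaf becomes a 0/1 feature
lr = LogisticRegression(max_iter=1000).fit(enc.transform(leaves), y)

# predicted CTR for a few samples, via tree leaves -> one-hot -> LR
ctr = lr.predict_proba(enc.transform(gbdt.apply(X[:5])[:, :, 0]))[:, 1]
```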

Industry DNN ranking status

  1. YouTube introduced a DNN ranking algorithm in 2016.
  2. In 2016, Shanghai Jiao Tong University and UCL proposed the Product-based Neural Network (PNN) for user click prediction. PNN essentially does feature crossing inside the DNN layers; our approach instead leaves feature crossing to FM while the DNN focuses on extracting deep information.
  3. Google launched the Wide & Deep model in 2016, which is also the basis of our current model. On top of it, FM was used to replace the cross-feature LR, simplifying computation and improving cross-feature generalization.
  4. Alibaba introduced the Deep Interest Network (DIN) this year, which uses an attention mechanism to predict item click-through rates and sharpen the embedding vectors; it is worth learning from.

7. Evaluation and testing

7.1 A/B testing

Once the new recommendation model is online, A/B testing is performed to compare it with the old algorithm.

A/B testing is a common experimental method for evaluating algorithms online. It randomly divides users into groups according to certain rules and applies different algorithms to different groups; it then compares the algorithms through per-group evaluation metrics, such as each group's click-through rate. Readers interested in A/B testing can check out http://www.abtests.com/, which provides many examples of improving user satisfaction with real A/B tests and of conducting A/B tests properly.

Traffic splitting is the key to A/B testing. Different layers, and the teams that control them, need to obtain their A/B-test traffic from one unified place, and the traffic across different layers should be orthogonal.

Orthogonality is a term borrowed from geometry. If two lines meet at right angles, they are orthogonal. In vector terms, these two lines don’t depend on each other.

Below is a simple A/B testing system. When a user enters the site, the traffic distribution system decides whether the user should be A/B tested; if so, it labels the user with the test group he belongs to. The user then browses the site, and his browsing behavior is sent back to the backend log database through the logging system; if the user carries a test-group label, the label is sent back as well. On the backend, the experimenter's job is first to configure the traffic distribution system, deciding which tests apply to which users; second, to aggregate the data in the log database, generate per-group experiment reports through the evaluation system, and compare and evaluate the results. A minimal traffic-bucketing sketch is given below.
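The deterministic bucketing below sketches the traffic-splitting step: hashing the user ID with a per-layer salt yields a stable group assignment, and independent salts keep different layers orthogonal. The layer name and the 50/50 split are hypothetical.

```python
import hashlib

def assign_bucket(user_id, layer, n_buckets=100):
    """Deterministic traffic split: hash (layer salt, user_id) into a bucket.
    Salting by layer keeps the splits of different layers orthogonal, so
    membership in one layer's groups is independent of another's."""
    key = f"{layer}:{user_id}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % n_buckets

def variant(user_id):
    # hypothetical experiment: buckets 0-49 control, 50-99 new algorithm
    return "new_algo" if assign_bucket(user_id, "rank_layer_v1") >= 50 else "control"

print(variant("user_42"))   # stable across requests for the same user
```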

When the A/B test is complete, if the metric results show the new algorithm beats the old one, the old algorithm can be replaced with the new one.

7.2 Other evaluation methods

Once the model is ready, it is usually evaluated with offline metrics before deciding whether to test it online. Common offline metrics include precision, coverage, diversity, novelty, AUC, and so on. Online testing is generally done through A/B tests, with common metrics such as click-through rate, user dwell time, and ad revenue; statistical significance must be checked. Short-term and long-term metrics should also be combined: improving a short-term metric can sometimes hurt a long-term one. For example, frequently recommending pretty faces or funny content lifts short-term click-through rate but may erode long-term user stickiness. Designers need to define evaluation metrics from their own product's perspective and needs, so as to better guide the optimization of the recommendation system. A sketch of a few common offline metrics follows.
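As an illustration, the small function below computes three of the common offline metrics (precision, recall, and catalog coverage) for top-N recommendation lists; the input format and toy data are assumptions.

```python
def precision_recall_coverage(recs, truth, catalog_size):
    """Offline top-N metrics over {user: recommended list} vs {user: liked set}:
    precision, recall, and the fraction of the catalog ever recommended."""
    hit = rec_total = truth_total = 0
    shown = set()
    for u, items in recs.items():
        liked = truth.get(u, set())
        hit += len(set(items) & liked)     # recommended items the user liked
        rec_total += len(items)
        truth_total += len(liked)
        shown.update(items)
    return hit / rec_total, hit / truth_total, len(shown) / catalog_size

recs = {"u1": ["a", "b"], "u2": ["b", "c"]}
truth = {"u1": {"a"}, "u2": {"c", "d"}}
print(precision_recall_coverage(recs, truth, catalog_size=10))
```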

8. Cold-start recommendation

In recommendation systems, cold start means the system has accumulated too little data to make personalized recommendations for new users, which is a big problem for recommendation products. Every recommendation product faces cold start. On one hand, when a new item goes on the shelf, no browsing, clicking, or purchase behavior has been collected for it, so there is no basis for deciding how to recommend it. On the other hand, when a new user arrives, his interests cannot be predicted without behavioral data in the application. If the recommendations shown to a new user are monotonous and unremarkable, the user will lose interest in the product from the start and give up on it. Therefore, both user cold start and item cold start must be considered.

Basically, cold start can be divided into the following three categories.

8.1 User Cold Start

User cold start mainly solves the problem of making personalized recommendations for new users. When a new user arrives, we have no behavior data for him, so we cannot predict his interests from history or personalize for him. Solutions include:

  1. Use the user's account registration information.
  2. Use the IMEI number of the user's phone for cold start.
  3. Provide interest-selection pages that let users pick points of interest, then generate coarse-grained recommendations in real time.

8.2 Item Cold Start

Item cold start addresses the problem of how to recommend new items to users who might be interested in them. The solutions are as follows:

  1. Use the item's content and category information.
  2. Use data annotated by experts.

8.3 System Cold Start

System cold start mainly solves the problem of designing a personalized recommendation system for a newly launched website (no users, no user behavior, only some item information) so that users can experience personalized recommendations as soon as the product launches.

9. References

  1. Recommendation System Practice, by Xiang Liang
  2. Recommendation Systems and Deep Learning
  3. Meituan Machine Learning Practice


Author: @mantchs

GitHub:github.com/NLP-LOVE/ML…

Welcome to join the discussion and help improve this project! Group number: 541954936