I. Introduction to recommendation system
1.1 Concepts and Background
-
What is a recommendation system
Users visit our service without a clear need, and the sheer number of items in the service overwhelms them (information overload). The system ranks the items according to certain rules and shows the top-ranked items to the user. Such a system is a recommendation system.
-
Information overload & unclear user needs
- Category directories: cover a small number of popular sites. Typical applications: Hao123, Yahoo
- Search engines: identify needs through search terms. Typical applications: Google, Baidu
- Recommendation systems: do not require users to state a clear need. They model users' interests by analyzing their historical behavior, and proactively recommend information that matches those interests and needs
-
Recommendation systems & search engines
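- Behavior: search is active; recommendation is passive
- Intent: search is clear; recommendation is fuzzy
- Personalization: search is weak; recommendation is strong
- Traffic distribution: search follows the Matthew effect; recommendation serves the long tail
- Goal: search satisfies a need quickly; recommendation provides continuous service
- Evaluation metrics: search is concise; recommendation is complex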
1.2 The working principle and function of the recommendation system
-
How the recommender system works
- Social recommendation
- Content-based recommendations
- Recommendations based on popularity
- Collaborative filtering based recommendations: Find users with similar historical interests
-
The role of the recommendation system
- Connect users and items efficiently
- Increase user dwell time and user activity
- Help products realize their commercial value
1.3 Differences between recommendation systems and Web projects
-
Improving metrics through information filtering vs. maintaining a stable information service
- Web projects: handle complex business logic, handle high concurrency, and build a stable information service for users
- Recommendation systems: pursue metric growth, such as retention rate, reading time, Gross Merchandise Volume (GMV), and video views
-
Deterministic vs. probabilistic thinking
- Web projects: have firm expectations of results
- Recommendation systems: the result is a matter of probability
II. Recommendation system design
2.1 Recommendation system elements
- UI and UE (user experience)
- Data (Lambda architecture)
- Business knowledge
- Algorithm
2.2 Recommendation system architecture
-
Lambda architecture for Big data
-
Lambda architecture is a real-time big data processing framework proposed by Nathan Marz, the author of the real-time processing framework Storm.
-
Lambda architecture integrates offline computing and real-time computing. It is designed to provide the key characteristics of a real-time big data system: high fault tolerance, low latency, and scalability.
-
Layered architecture
- Batch layer
- Data is immutable, computable in any way, and horizontally scalable
- High latency of several minutes to several hours (calculation and data volume vary)
- Log collection: Flume
- Distributed storage: Hadoop
- Distributed computing: Hadoop and Spark
- View storage databases
- NoSQL (HBase/Cassandra)
- Redis/memcache
- MySQL
- Real-time processing layer
- Stream processing, continuous computation
- Store and analyze data within a time window (top sales over a period, real-time hot searches, etc.)
- Real-time data collection: Flume & Kafka
- Real-time data analysis: Spark Streaming/Storm/Flink
- Serving layer
- Random read support
- Results need to be returned in a very short time
- Read and merge batch and real-time layer results
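The "read and merge" step of the serving layer can be sketched as below. This is a minimal illustration only: the batch and real-time views are assumed to be plain dictionaries of per-item counts, and the names `batch_view`, `realtime_view`, `merge_views` and `top_n` are hypothetical, not part of any framework mentioned above.

```python
from collections import Counter

# Hypothetical precomputed view from the batch layer (e.g. sales up to last night)
batch_view = {"item_a": 120, "item_b": 75, "item_c": 30}
# Hypothetical incremental view from the real-time layer (e.g. sales in the last hour)
realtime_view = {"item_b": 10, "item_d": 4}


def merge_views(batch, realtime):
    """Combine batch and real-time counts into a single query result."""
    merged = Counter(batch)
    merged.update(realtime)  # adds the counts of keys present in both views
    return dict(merged)


def top_n(view, n):
    """Return the n items with the highest merged counts."""
    return sorted(view.items(), key=lambda kv: kv[1], reverse=True)[:n]


merged = merge_views(batch_view, realtime_view)
print(top_n(merged, 2))  # [('item_a', 120), ('item_b', 85)]
```

In a real deployment the two views would live in the view stores listed above (for example HBase and Redis) rather than in memory.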
-
Recommendation algorithm Architecture
-
Recall stage (broad candidate selection)
- Recall determines the ceiling of the final recommendation results
- Commonly used algorithms
- Collaborative filtering
- Content-based
-
Ranking stage (fine selection)
- Recall determines the ceiling of the final recommendation results, while ranking approaches that ceiling and determines the final recommendation quality
- CTR estimation (for example with the LR, logistic regression, algorithm) estimates whether a user will click on an item, and requires user click data
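The ranking stage can be sketched with a tiny logistic regression trained by gradient descent. This is a toy illustration under stated assumptions: the two features, the click samples, and the candidate items are all hypothetical, and a production system would use a library implementation and far richer features.

```python
import math

# Hypothetical click log: (features, clicked)
# features = [user_likes_category, item_is_popular]
samples = [
    ([1.0, 1.0], 1),
    ([1.0, 0.0], 1),
    ([0.0, 1.0], 0),
    ([0.0, 0.0], 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_lr(samples, lr=0.5, epochs=500):
    """Fit logistic regression weights with batch gradient descent."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in samples:
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            for j, xi in enumerate(x):
                grad_w[j] += (p - y) * xi
            grad_b += p - y
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


def predict_ctr(x, weights, bias):
    """Estimated probability that the user clicks."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


weights, bias = train_lr(samples)
# Candidates produced by a (hypothetical) recall stage, with their features
candidates = {"item_1": [0.0, 1.0], "item_2": [1.0, 0.0], "item_3": [1.0, 1.0]}
ranked = sorted(candidates, key=lambda i: predict_ctr(candidates[i], weights, bias), reverse=True)
print(ranked)
```

The recall stage decides which candidates reach this step; the ranking stage only reorders them by estimated click probability.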
-
III. Recommendation algorithm
3.1 Recommendation model construction process
Data -> Features -> ML Algorithm -> Prediction Output
-
Data cleaning/data processing
- Data sources
- Explicit data
- Ratings
- Comments/reviews
- Implicit data
- Order history
- Add-to-cart events
- Page views
- Clicks
- Search records
- Data quantity/whether the data meets the requirements
-
Feature engineering
-
Extract features from the data
-
A given item may be purchased by users with similar tastes or needs
-
Use user behavior data to describe items
-
-
Represent features with data
- Combine all user actions together to form a user-item matrix
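Building the user-item matrix from raw behavior records can be sketched as follows. The behavior log and its action names are hypothetical, and every action is treated as the same binary signal, which is a simplifying assumption.

```python
# Hypothetical behavior log: (user, item, action)
behavior_log = [
    ("User1", "Item A", "buy"),
    ("User1", "Item C", "click"),
    ("User2", "Item A", "buy"),
    ("User2", "Item D", "cart"),
    ("User3", "Item C", "buy"),
]

users = sorted({u for u, _, _ in behavior_log})
items = sorted({i for _, i, _ in behavior_log})

# Start from an all-zero matrix, then mark every (user, item) pair that has any action
matrix = {u: {i: 0 for i in items} for u in users}
for user, item, _action in behavior_log:
    matrix[user][item] = 1

for user in users:
    print(user, [matrix[user][i] for i in items])
```

Each row describes a user by the items they interacted with, and each column describes an item by the users who interacted with it.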
-
-
Choose the right algorithm
- Collaborative filtering
- Content-based
-
Generate recommendation results
- Evaluate the recommendation results, and go online after they pass the evaluation
3.2 The most classic recommendation algorithm: collaborative filtering
Collaborative Filtering
Core idea: birds of a feather flock together
The basic collaborative filtering recommendation algorithm rests on the following assumptions:
- "You are likely to like what people similar to you like": user-based collaborative filtering (User-based CF)
- "You are likely to like things similar to what you already like": item-based collaborative filtering (Item-based CF)
There are several steps to implement collaborative filtering recommendations:
-
Find the most similar users or items: the top-N similar users or items
Compute the pairwise similarity and sort by it to find the top-N similar users or items
-
Generate recommendations based on similar users or items
Use the top-N results to generate the initial recommendations, then filter out items the user already has a record of or has explicitly shown no interest in
As a simple example, take a data set of users' purchase records: a tick indicates that the user has purchased the item
- A simple way to think about similarity: suppose classmate X's hobbies are football, basketball and table tennis, and classmate Y's hobbies are tennis, football, basketball and badminton. They share 2 hobbies, so their similarity can be expressed as 2/3 * 2/4 = 1/3 ≈ 0.33.
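The classmate example can be reproduced in a few lines. The function name `overlap_similarity` is only illustrative.

```python
def overlap_similarity(a, b):
    """Share of common elements in a, multiplied by the share of common elements in b."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    common = a & b
    return (len(common) / len(a)) * (len(common) / len(b))


x = {"football", "basketball", "table tennis"}
y = {"tennis", "football", "basketball", "badminton"}
print(round(overlap_similarity(x, y), 2))  # 0.33
```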
3.3 Similarity calculation
-
The calculation method of similarity
- Euclidean distance measures the distance between two points in Euclidean space. If two objects are represented as points p and q in the same space, each with n coordinates, the Euclidean distance is the straight-line distance between them. Euclidean distance is not suitable for Boolean vectors
Euclidean distance is non-negative and has no upper bound. Similarity scores are usually expected to lie in [-1, 1] or [0, 1], so the distance is generally converted into a similarity
The transformation formula is as follows:
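One commonly used conversion, assumed here rather than taken from the text, is similarity = 1 / (1 + distance), which maps a distance of 0 to a similarity of 1 and large distances towards 0. A minimal sketch with made-up vectors:

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_to_similarity(d):
    # Assumed conversion: 1 / (1 + d) lies in (0, 1] for any non-negative distance
    return 1.0 / (1.0 + d)


p = [5, 3, 4]
q = [2, 3, 0]
d = euclidean_distance(p, q)
print(d)  # 5.0
print(distance_to_similarity(d))
```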
- Cosine similarity
- It measures the angle between two vectors and uses the cosine of that angle as the similarity
- If the angle between the two vectors is 0, the cosine is 1; if the angle is 90 degrees, the cosine is 0; if the angle is 180 degrees, the cosine is -1
- Cosine similarity is commonly used to measure text similarity, user similarity and item similarity
- Cosine similarity is independent of vector length, because the calculation normalizes by the vector lengths. As long as two vectors point in the same direction, they are regarded as "similar" regardless of their magnitude.
-
Pearson correlation coefficient
- It is essentially cosine similarity applied to centered vectors: subtract the mean of each of the vectors a and b from its elements, then compute the cosine similarity
- The Pearson correlation coefficient lies in [-1, 1]: -1 means negative correlation and 1 means positive correlation
- It measures whether two variables increase and decrease together
- Because it measures whether the trends of two variables are consistent, it is not suitable for calculating the correlation between Boolean vectors
-
Jaccard similarity
-
The size of the intersection of two sets divided by the size of their union; it is well suited to Boolean vector representations
-
The numerator is the dot product of the two Boolean vectors, which gives the number of elements in the intersection
-
The denominator is the element-wise OR of the two Boolean vectors, summed over all elements
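The three measures described above can be written out in plain Python. This is a sketch for illustration: the sample vectors are made up, and the functions do not guard against zero-length or constant vectors, which would cause a division by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


def jaccard_similarity(a, b):
    """Intersection size over union size for two Boolean (0/1) vectors."""
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return intersection / union


ratings_u1 = [5, 3, 4, 4]
ratings_u2 = [3, 1, 2, 3]
print(round(pearson_correlation(ratings_u1, ratings_u2), 4))  # 0.8528

bought_u1 = [1, 0, 1, 1, 0]
bought_u2 = [1, 0, 0, 1, 1]
print(jaccard_similarity(bought_u1, bought_u2))  # 0.5
```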
-
-
How to choose a similarity measure
- Cosine similarity/Pearson correlation coefficient fits user rating data (real values)
- Jaccard similarity fits implicit feedback data (0/1 Boolean values: bookmark, click, add to cart)
3.4 Code implementation of collaborative filtering recommendation algorithm
-
Import packages
import pandas as pd
import numpy as np
-
Building a data set
users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
# Build the dataset
datasets = [
    ["buy", None, "buy", "buy", None],
    ["buy", None, None, "buy", "buy"],
    ["buy", None, "buy", None, None],
    [None, "buy", None, "buy", "buy"],
    ["buy", "buy", "buy", None, "buy"],
]
-
For computation, the data usually needs to be processed or encoded first. In this simple case we use 1 and 0 to indicate whether the user has bought the item, so the data set actually looks like this:
import pandas as pd

users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
# User purchase record data set
datasets = [
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
]
df = pd.DataFrame(datasets, columns=items, index=users)
print(df)
-
With the data set in place, we can calculate similarity. There are many similarity measures, such as cosine similarity, the Pearson correlation coefficient and Jaccard similarity. Here we use the Jaccard similarity coefficient, which lies in [0, 1]
# jaccard_score replaces jaccard_similarity_score, which newer scikit-learn versions no longer provide
from sklearn.metrics import jaccard_score
# Directly calculate the Jaccard similarity coefficient of two items:
# the similarity between Item A and Item B
print(jaccard_score(df["Item A"], df["Item B"]))

# Calculate the Jaccard similarity coefficient for all pairs
from sklearn.metrics.pairwise import pairwise_distances
# Similarity between users: Jaccard similarity = 1 - Jaccard distance
user_similar = 1 - pairwise_distances(df.values.astype(bool), metric="jaccard")
user_similar = pd.DataFrame(user_similar, columns=users, index=users)
print("Pairwise similarity between users:")
print(user_similar)
# Similarity between items
item_similar = 1 - pairwise_distances(df.T.values.astype(bool), metric="jaccard")
item_similar = pd.DataFrame(item_similar, columns=items, index=items)
print("Pairwise similarity between items:")
print(item_similar)
With pairwise similarity, you can then filter top-N similarity results and make recommendations
-
User-Based CF
import pandas as pd
import numpy as np
from pprint import pprint
from sklearn.metrics.pairwise import pairwise_distances

users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
# User purchase record data set
datasets = [
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
]
df = pd.DataFrame(datasets, columns=items, index=users)

# Similarity between users: Jaccard similarity = 1 - Jaccard distance
user_similar = 1 - pairwise_distances(df.values.astype(bool), metric="jaccard")
user_similar = pd.DataFrame(user_similar, columns=users, index=users)
print("Pairwise similarity between users:")
print(user_similar)

topN_users = {}
# Iterate over each row of the similarity matrix
for i in user_similar.index:
    # Take the row, drop the user itself, then sort by descending similarity
    _df = user_similar.loc[i].drop([i])
    _df_sorted = _df.sort_values(ascending=False)
    # Keep the two most similar users
    top2 = list(_df_sorted.index[:2])
    topN_users[i] = top2

print("Top2 similar users:")
pprint(topN_users)

# Prepare an empty dict to store the recommendations
rs_results = {}
# Iterate over every user and their most similar users
for user, sim_users in topN_users.items():
    rs_result = set()  # recommendation results for this user
    for sim_user in sim_users:
        # Build the initial recommendations: items bought by the similar users
        rs_result = rs_result.union(set(df.loc[sim_user].replace(0, np.nan).dropna().index))
    # Filter out items the user has already purchased
    rs_result -= set(df.loc[user].replace(0, np.nan).dropna().index)
    rs_results[user] = rs_result
print("Final recommendation:")
pprint(rs_results)
-
Item-Based CF
import pandas as pd
import numpy as np
from pprint import pprint
from sklearn.metrics.pairwise import pairwise_distances

users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
# User purchase record data set
datasets = [
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
]
df = pd.DataFrame(datasets, columns=items, index=users)

# Similarity between items: Jaccard similarity = 1 - Jaccard distance
item_similar = 1 - pairwise_distances(df.T.values.astype(bool), metric="jaccard")
item_similar = pd.DataFrame(item_similar, columns=items, index=items)
print("Pairwise similarity between items:")
print(item_similar)

topN_items = {}
# Iterate over each row of the similarity matrix
for i in item_similar.index:
    # Take the row, drop the item itself, then sort by descending similarity
    _df = item_similar.loc[i].drop([i])
    _df_sorted = _df.sort_values(ascending=False)
    top2 = list(_df_sorted.index[:2])
    topN_items[i] = top2

print("Top2 similar items:")
pprint(topN_items)

rs_results = {}
# Build the recommendation results for every user
for user in df.index:
    rs_result = set()
    # Go through the items the user has already purchased
    for item in df.loc[user].replace(0, np.nan).dropna().index:
        # Build the initial recommendations from the top-N items most similar to each purchased item
        rs_result = rs_result.union(topN_items[item])
    # Filter out items the user has already purchased
    rs_result -= set(df.loc[user].replace(0, np.nan).dropna().index)
    rs_results[user] = rs_result

print("Final recommendation:")
pprint(rs_results)
3.5 Data sets used by collaborative filtering
In the previous demo we only used purchase records; the same approach works for browsing records, listening records and so on. With such data, the prediction only tells us whether a user is likely to be interested in an item. It cannot predict how much the user will like it.
For this reason, collaborative filtering more often uses users' "rating" data on items. From a rating data set we can predict the rating a user would give to an item they have not rated yet. The principle is the same; only the data set changes to user-item ratings.
About the user-item rating matrix
The user-item rating matrix is handled differently depending on how sparse it is
-
Dense scoring matrix
-
Sparse scoring matrix
Only the dense rating matrix is covered here; handling a sparse matrix is considerably more complicated.
Using collaborative filtering to predict user ratings
-
The data set
Objective: To predict user 1’s rating of item E
-
Build the data set: note that missing ratings must be left as None. If they were set to 0, they would be treated as a rating of 0
users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
# User rating data set
datasets = [
    [5, 3, 4, 4, None],
    [3, 1, 2, 3, 3],
    [4, 3, 4, 3, 5],
    [3, 3, 1, 5, 4],
    [1, 5, 5, 2, 1],
]
-
Calculate similarity: for rating data we use the Pearson correlation coefficient, which lies in [-1, 1]; -1 indicates a strong negative correlation and +1 a strong positive correlation
The corr method in pandas computes Pearson correlation coefficients directly
df = pd.DataFrame(datasets, columns=items, index=users)

print("Pairwise similarity between users:")
# corr() computes the Pearson correlation between columns by default,
# so the matrix is transposed to get the similarity between users
user_similar = df.T.corr()
print(user_similar.round(4))

print("Pairwise similarity between items:")
item_similar = df.corr()
print(item_similar.round(4))
Running results:
Similarity between users:
         User1    User2    User3    User4    User5
User1   1.0000   0.8528   0.7071   0.0000  -0.7921
User2   0.8528   1.0000   0.4677   0.4900  -0.9001
User3   0.7071   0.4677   1.0000  -0.1612  -0.4666
User4   0.0000   0.4900  -0.1612   1.0000  -0.6415
User5  -0.7921  -0.9001  -0.4666  -0.6415   1.0000
Similarity between items:
         Item A   Item B   Item C   Item D   Item E
Item A   1.0000  -0.4767  -0.1231   0.5322   0.9695
Item B  -0.4767   1.0000   0.6455  -0.3101  -0.4781
Item C  -0.1231   0.6455   1.0000  -0.7206  -0.4276
Item D   0.5322  -0.3101  -0.7206   1.0000   0.5817
Item E   0.9695  -0.4781  -0.4276   0.5817   1.0000
You can see that users 2 and 3 are the most similar to user 1, and that the items most similar to item A are item E and item D.
Note: ratings are normally predicted from users or items that are positively correlated with the target. If no positively correlated neighbor exists, no rating can be predicted. This happens often with sparse rating matrices, where positive correlation coefficients are hard to obtain.
-
Rating prediction
User-based CF rating prediction: predict based on the similarity between users
There are many schemes for rating prediction. The following one works well; it predicts from users' own ratings and the similarity-weighted average of the neighboring users' ratings:
We want to predict user 1's rating of item E, so we can predict from user 2 and user 3, the users closest to user 1, and calculate as follows:
The final prediction is that user 1's rating of item E is 3.91
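The 3.91 above can be reproduced with a similarity-weighted average, pred(u, i) = Σ sim(u, v) · r(v, i) / Σ |sim(u, v)|, summed over the neighbors v of user u. This form is an assumption: it is the one that reproduces 3.91 from the similarity table and matches the numerator/denominator code in section 4.1. The function name is illustrative.

```python
def predict_user_based(neighbors):
    """neighbors: list of (similarity to the target user, neighbor's rating of the item)."""
    numerator = sum(sim * rating for sim, rating in neighbors)
    denominator = sum(abs(sim) for sim, _ in neighbors)
    return numerator / denominator


# User 2 and User 3: similarity to User 1, and their ratings of Item E
neighbors = [(0.8528, 3), (0.7071, 5)]
print(round(predict_user_based(neighbors), 2))  # 3.91
```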
Item-based CF rating prediction: predict based on the similarity between items
The calculation is the same as above. It also takes the user's own ratings into account, and predicts from the user's ratings of similar items, weighted by their similarity to the target item:
We want to predict user 1's rating of item E, so we can predict from item A and item D, the items closest to item E, and calculate as follows:
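A minimal sketch of the item-based calculation, assuming the same similarity-weighted average as in the user-based case. The similarities are the Item E row of the item similarity table, the ratings are user 1's row of the data set, and the helper name `predict_item_based` is illustrative.

```python
# Item E's similarity to the other items (from the item similarity table)
item_e_similarity = {"Item A": 0.9695, "Item B": -0.4781, "Item C": -0.4276, "Item D": 0.5817}
# User 1's known ratings
user1_ratings = {"Item A": 5, "Item B": 3, "Item C": 4, "Item D": 4}


def predict_item_based(similarities, ratings, k=2):
    """Weighted average over the k most similar, positively correlated items the user rated."""
    neighbors = sorted(
        ((sim, ratings[item]) for item, sim in similarities.items() if sim > 0 and item in ratings),
        reverse=True,
    )[:k]
    numerator = sum(sim * rating for sim, rating in neighbors)
    denominator = sum(sim for sim, _ in neighbors)
    return numerator / denominator


print(round(predict_item_based(item_e_similarity, user1_ratings), 3))  # 4.625
```

Only Item A and Item D pass the positive-correlation filter, so the prediction is (0.9695 * 5 + 0.5817 * 4) / (0.9695 + 0.5817).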
As the comparison shows, user-based CF and item-based CF predict different ratings, because strictly speaking they are two different recommendation algorithms. Each outperforms the other in certain fields and scenarios, and which one is better cannot be known in advance. In practice, a recommendation system therefore often implements both, evaluates and analyzes the recommendation quality of each, and then selects the better scheme.
IV. Case study - Movie recommendation based on collaborative filtering
4.1 User-based CF predicts movie ratings
-
Data set download
-
Download address
-
Load ratings.csv, convert it into a user-movie rating matrix and calculate the similarity between users
import os
import pandas as pd
import numpy as np

DATA_PATH = "./datasets/ml-latest-small/ratings.csv"
dtype = {"userId": np.int32, "movieId": np.int32, "rating": np.float32}
# Load the data; only the first three columns are used:
# the user ID, the movie ID, and the user's rating of the movie
ratings = pd.read_csv(DATA_PATH, dtype=dtype, usecols=range(3))
# Pivot table: movie IDs become the column names, giving a user-movie rating matrix
ratings_matrix = ratings.pivot_table(index=["userId"], columns=["movieId"], values="rating")
# Calculate the similarity between users
user_similar = ratings_matrix.T.corr()
-
Predict a user's rating of an item (taking user 1's rating of movie 1 as an example)
# 1. Find the users similar to user 1
similar_users = user_similar[1].drop([1]).dropna()
# Filter rule: keep positively correlated users only
similar_users = similar_users.where(similar_users > 0).dropna()
# 2. From user 1's similar users, keep those who have rated movie 1
ids = set(ratings_matrix[1].dropna().index) & set(similar_users.index)
finally_similar_users = similar_users.loc[list(ids)]
# 3. Predict user 1's rating of movie 1 from the similarity to the nearest neighbors
numerator = 0    # numerator of the rating prediction formula
denominator = 0  # denominator of the rating prediction formula
for sim_uid, similarity in finally_similar_users.items():
    # Ratings given by this neighbor
    sim_user_rated_movies = ratings_matrix.loc[sim_uid].dropna()
    # This neighbor's rating of movie 1
    sim_user_rating_for_item = sim_user_rated_movies[1]
    # Accumulate the numerator
    numerator += similarity * sim_user_rating_for_item
    # Accumulate the denominator
    denominator += similarity
# 4. Calculate the predicted rating
predict_rating = numerator / denominator
print("Predicted user <%d> rating of movie <%d> : %0.2f" % (1, 1, predict_rating))
-
Wrap it in a function that predicts any user's rating of any movie
def predict(uid, iid, ratings_matrix, user_similar):
    """
    Predict the rating a given user would give a given item
    :param uid: user ID
    :param iid: item ID
    :param ratings_matrix: user-item rating matrix
    :param user_similar: pairwise user similarity matrix
    :return: predicted rating
    """
    print("Start predicting user <%d> ratings for movie <%d>..." % (uid, iid))
    # 1. Find the users similar to user uid
    similar_users = user_similar[uid].drop([uid]).dropna()
    # Filter rule: keep positively correlated users only
    similar_users = similar_users.where(similar_users > 0).dropna()
    if similar_users.empty:
        raise Exception("User <%d> has no similar users" % uid)
    # 2. From the similar users, keep those who have rated item iid
    ids = set(ratings_matrix[iid].dropna().index) & set(similar_users.index)
    finally_similar_users = similar_users.loc[list(ids)]
    # 3. Predict the rating from the similarity to the nearest neighbors
    numerator = 0    # numerator of the rating prediction formula
    denominator = 0  # denominator of the rating prediction formula
    for sim_uid, similarity in finally_similar_users.items():
        # Ratings given by this neighbor
        sim_user_rated_movies = ratings_matrix.loc[sim_uid].dropna()
        # This neighbor's rating of item iid
        sim_user_rating_for_item = sim_user_rated_movies[iid]
        numerator += similarity * sim_user_rating_for_item
        denominator += similarity
    # Calculate the predicted rating and return it
    predict_rating = numerator / denominator
    print("Predicted user <%d> rating of movie <%d> : %0.2f" % (uid, iid, predict_rating))
    return round(predict_rating, 2)
-
Predict a user's ratings for all movies
def predict_all(uid, ratings_matrix, user_similar):
    """
    Predict a user's ratings for all items
    :param uid: user ID
    :param ratings_matrix: user-item rating matrix
    :param user_similar: pairwise user similarity matrix
    :return: generator that yields (uid, iid, predicted rating)
    """
    # Prepare the list of item IDs to predict
    item_ids = ratings_matrix.columns
    # Predict one by one
    for iid in item_ids:
        try:
            rating = predict(uid, iid, ratings_matrix, user_similar)
        except Exception as e:
            print(e)
        else:
            yield uid, iid, rating

if __name__ == '__main__':
    for i in predict_all(1, ratings_matrix, user_similar):
        pass
-
Recommend the top-N movies to a given user according to the predicted ratings
def top_k_rs_result(k):
    results = predict_all(1, ratings_matrix, user_similar)
    return sorted(results, key=lambda x: x[2], reverse=True)[:k]

if __name__ == '__main__':
    from pprint import pprint
    result = top_k_rs_result(20)
    pprint(result)
4.2 Item-based CF predicts movie ratings
-
Load ratings.csv, convert it into a user-movie rating matrix and calculate the similarity between items
import os
import pandas as pd
import numpy as np

DATA_PATH = "./datasets/ml-latest-small/ratings.csv"
dtype = {"userId": np.int32, "movieId": np.int32, "rating": np.float32}
# Load the data; only the first three columns are used:
# the user ID, the movie ID, and the user's rating of the movie
ratings = pd.read_csv(DATA_PATH, dtype=dtype, usecols=range(3))
# Pivot table: movie IDs become the column names, giving a user-movie rating matrix
ratings_matrix = ratings.pivot_table(index=["userId"], columns=["movieId"], values="rating")
# Calculate the similarity between items
item_similar = ratings_matrix.corr()
-
Predict a user's rating of an item (taking user 1's rating of movie 1 as an example)
# 1. Find the items similar to movie 1
similar_items = item_similar[1].drop([1]).dropna()
# Filter rule: keep positively correlated items only
similar_items = similar_items.where(similar_items > 0).dropna()
# 2. From movie 1's similar items, keep those that user 1 has rated
ids = set(ratings_matrix.loc[1].dropna().index) & set(similar_items.index)
finally_similar_items = similar_items.loc[list(ids)]
# 3. Predict user 1's rating of movie 1 from the item similarities
#    and user 1's ratings of the similar items
numerator = 0    # numerator of the rating prediction formula
denominator = 0  # denominator of the rating prediction formula
for sim_iid, similarity in finally_similar_items.items():
    # Ratings received by this similar item
    sim_item_rated_movies = ratings_matrix[sim_iid].dropna()
    # User 1's rating of this similar item
    sim_item_rating_from_user = sim_item_rated_movies[1]
    # Accumulate the numerator
    numerator += similarity * sim_item_rating_from_user
    # Accumulate the denominator
    denominator += similarity
# 4. Calculate the predicted rating
predict_rating = numerator / denominator
print("Predicted user <%d> rating of movie <%d> : %0.2f" % (1, 1, predict_rating))
-
Wrap it in a function that predicts any user's rating of any movie
def predict(uid, iid, ratings_matrix, item_similar):
    """
    Predict the rating a given user would give a given item
    :param uid: user ID
    :param iid: item ID
    :param ratings_matrix: user-item rating matrix
    :param item_similar: pairwise item similarity matrix
    :return: predicted rating
    """
    print("Start predicting user <%d> ratings for movie <%d>..." % (uid, iid))
    # 1. Find the items similar to item iid
    similar_items = item_similar[iid].drop([iid]).dropna()
    # Filter rule: keep positively correlated items only
    similar_items = similar_items.where(similar_items > 0).dropna()
    if similar_items.empty:
        raise Exception("Item <%d> has no similar items" % iid)
    # 2. From the similar items, keep those that user uid has rated
    ids = set(ratings_matrix.loc[uid].dropna().index) & set(similar_items.index)
    finally_similar_items = similar_items.loc[list(ids)]
    # 3. Predict the rating from the item similarities and the user's ratings of the similar items
    numerator = 0    # numerator of the rating prediction formula
    denominator = 0  # denominator of the rating prediction formula
    for sim_iid, similarity in finally_similar_items.items():
        # Ratings received by this similar item
        sim_item_rated_movies = ratings_matrix[sim_iid].dropna()
        # User uid's rating of this similar item
        sim_item_rating_from_user = sim_item_rated_movies[uid]
        numerator += similarity * sim_item_rating_from_user
        denominator += similarity
    # Calculate the predicted rating and return it
    predict_rating = numerator / denominator
    print("Predicted user <%d> rating of movie <%d> : %0.2f" % (uid, iid, predict_rating))
    return round(predict_rating, 2)
-
Predict a user's ratings for all movies
def predict_all(uid, ratings_matrix, item_similar):
    """
    Predict a user's ratings for all items
    :param uid: user ID
    :param ratings_matrix: user-item rating matrix
    :param item_similar: pairwise item similarity matrix
    :return: generator that yields (uid, iid, predicted rating)
    """
    # Prepare the list of item IDs to predict
    item_ids = ratings_matrix.columns
    # Predict one by one
    for iid in item_ids:
        try:
            rating = predict(uid, iid, ratings_matrix, item_similar)
        except Exception as e:
            print(e)
        else:
            yield uid, iid, rating

if __name__ == '__main__':
    for i in predict_all(1, ratings_matrix, item_similar):
        pass
-
Recommend the top-N movies to a given user according to the predicted ratings
def top_k_rs_result(k):
    results = predict_all(1, ratings_matrix, item_similar)
    return sorted(results, key=lambda x: x[2], reverse=True)[:k]

if __name__ == '__main__':
    from pprint import pprint
    result = top_k_rs_result(20)
    pprint(result)
V. Recommendation system evaluation
5.1 Evaluation metrics for recommendation systems
-
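Evaluation data sources: explicit and implicit feedback
- Examples: explicit feedback includes movie/book ratings and "do you like this recommendation" prompts; implicit feedback includes plays, clicks, comments, downloads and purchases
- Accuracy: explicit feedback is high; implicit feedback is low
- Quantity: explicit feedback is scarce; implicit feedback is plentiful
- Acquisition cost: explicit feedback is high; implicit feedback is low
-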
Common evaluation indicators
Accuracy, trust, satisfaction, real-time performance, coverage, robustness, diversity, scalability, novelty, business goals, surprise, retention
-
Accuracy (theoretical perspective)
- Rating prediction
- RMSE, MAE
- Top-N recommendation
- Recall, precision
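The accuracy metrics named above can be sketched as follows, using their standard definitions. The ratings and recommendation lists are made-up sample data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def precision_recall(recommended, relevant):
    """Precision and recall of a top-N list against the items the user actually liked."""
    hits = len(set(recommended) & set(relevant))
    return hits / len(recommended), hits / len(relevant)


actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted))  # 0.75
print(mae(actual, predicted))   # 0.625

recommended = ["A", "B", "C", "D"]
relevant = ["B", "D", "E"]
precision, recall = precision_recall(recommended, relevant)
print(precision, recall)
```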
-
Accuracy (business perspective)
-
Coverage
- The greater the information entropy of the recommended items, the better
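Coverage and the entropy of recommendations can be computed as sketched below. This assumes catalog coverage (the share of catalog items recommended at least once) and Shannon entropy over how often each item is recommended; the sample lists are hypothetical.

```python
import math
from collections import Counter


def catalog_coverage(recommendation_lists, catalog):
    """Share of catalog items that appear in at least one recommendation list."""
    recommended = set()
    for rec_items in recommendation_lists.values():
        recommended.update(rec_items)
    return len(recommended & set(catalog)) / len(catalog)


def recommendation_entropy(recommendation_lists):
    """Shannon entropy (in bits) of the distribution of recommended items."""
    counts = Counter(item for rec_items in recommendation_lists.values() for item in rec_items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


catalog = ["A", "B", "C", "D", "E"]
recs = {"User1": ["A", "B"], "User2": ["A", "C"], "User3": ["A", "B"]}
print(catalog_coverage(recs, catalog))  # 0.6
print(recommendation_entropy(recs))
```

A system that recommends the same popular item to everyone scores low on both measures; spreading recommendations evenly across the catalog raises them.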
-
Diversity & Novelty & surprise
- Diversity: the dissimilarity between pairs of items in a recommendation list (which raises the question of how similarity is measured)
- Novelty: a category or author the user has not considered before; the average popularity of the recommended results
- Surprise: dissimilar to the user's history (surprising) yet satisfying (delightful)
- Accuracy is often sacrificed
- Use historical behavior to predict how much a user will like an item
- The system overemphasizes real-time performance
-
Exploitation & Exploration (EE) problem
- Exploitation: choose the option that currently looks best
- Exploration: choose options that are uncertain now but may yield high returns in the future
- While making both kinds of decisions, keep updating the estimate of each option's uncertainty in order to optimize the long-term goal
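One classic way to trade off exploitation and exploration is epsilon-greedy, named here as an illustrative technique rather than one the text prescribes. The sketch below simulates it; the click-through rates, the number of rounds, and epsilon are made-up parameters.

```python
import random


def epsilon_greedy(true_ctr, rounds=5000, epsilon=0.1, seed=0):
    """Simulate showing one of several items per round and observing clicks."""
    rng = random.Random(seed)
    n_arms = len(true_ctr)
    shows = [0] * n_arms
    clicks = [0] * n_arms
    for _ in range(rounds):
        if rng.random() < epsilon:
            # Exploration: pick a random item regardless of current estimates
            arm = rng.randrange(n_arms)
        else:
            # Exploitation: pick the item with the best estimated CTR so far
            estimates = [clicks[i] / shows[i] if shows[i] else 0.0 for i in range(n_arms)]
            arm = estimates.index(max(estimates))
        shows[arm] += 1
        # Simulated user feedback updates what we know about this item
        if rng.random() < true_ctr[arm]:
            clicks[arm] += 1
    return shows, clicks


shows, clicks = epsilon_greedy([0.05, 0.10, 0.30])
print(shows)
print(clicks)
```

A larger epsilon explores more and learns faster, at the cost of showing more items that users are less likely to click.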
-
EE problem practice
- Interest expansion: similar topics, complementary-item recommendation
- Crowd-based algorithms: userCF, user clustering
- Balance personalized recommendations and popular recommendations
- Randomly discard part of the user's behavior history
- Randomly perturb model parameters
-
Possible problems with EE
- Exploration hurts the user experience and can lead to user churn
- The long-term benefits of exploration (such as retention) take a long time to evaluate, which creates KPI pressure
- How to balance real-time and long-term interests
- How to balance short-term product experience with the long-term ecosystem
- How to balance popular tastes and niche needs
5.2 Evaluation methods for recommendation systems
- Evaluation methods
- Questionnaire survey: high cost
- Offline evaluation:
- Can only be evaluated on the candidate set that users have already seen, which does not match the real online situation
- Only a few metrics can be assessed
- Fast, and does not harm the user experience
- Online evaluation: grayscale release & A/B testing, rolling out to part of the traffic (for example 50%) before going fully online
- Practice: combine offline evaluation with online evaluation, and run questionnaire surveys regularly
VI. Recommendation system cold start problem
6.1 The concept of cold start
- User cold start: how to make personalized recommendations for new users
- Item cold start: how to recommend new items to users (a weak point of collaborative filtering)
- System cold start: user cold start + item cold start
- In essence, a recommendation system relies on historical data, without which it cannot predict user preferences
6.2 Common approaches to the cold start problem
- User cold start
- Collect user characteristics
- User registration information: gender, age, region
- Device information: location, phone model, app list
- Social information, promotional material, installation sources
- Guide users to fill in their interests
- Use behavioral data from other sites
- Use different recommendation strategies for new and existing users
- During the cold start, new users are more likely to be attracted to popular leaderboards, while existing users are more likely to need long-tail recommendations
- Put more effort into exploration (Explore/Exploit)
- Use individual features and model predictions
- Item cold start
- Tag items
- Use the item's content information to first deliver the new item to users who have liked other items with similar content
- System cold start
- In the early days, use content-based recommendation
- Gradually transition from content-based recommendation to collaborative filtering
- Combine the results of content-based recommendation and collaborative filtering with a weighted sum to obtain the final recommendation result
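The weighted sum, and the gradual shift from content-based recommendation to collaborative filtering, can be sketched as below. The scores, the function name `hybrid_scores`, and the parameter `cf_weight` are illustrative assumptions.

```python
def hybrid_scores(content_scores, cf_scores, cf_weight):
    """Blend two score dicts; cf_weight in [0, 1] is the share given to collaborative filtering."""
    all_items = set(content_scores) | set(cf_scores)
    return {
        item: (1 - cf_weight) * content_scores.get(item, 0.0)
        + cf_weight * cf_scores.get(item, 0.0)
        for item in all_items
    }


content_scores = {"A": 0.9, "B": 0.4, "C": 0.1}
cf_scores = {"B": 0.8, "C": 0.6}

# Early on there is little behavior data, so rely on content only
early = hybrid_scores(content_scores, cf_scores, cf_weight=0.0)
# Later, as behavior data accumulates, give collaborative filtering more weight
later = hybrid_scores(content_scores, cf_scores, cf_weight=0.5)

print(max(early, key=early.get), max(later, key=later.get))  # A B
```

Raising `cf_weight` over time is one simple way to express the transition described above.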