Collaborative filtering is the most commonly used technique for building intelligent recommendation systems. As more user data is collected, a good model can be learned from it to make recommendations.

Recommendation systems are everywhere

Recently, consumer shopping festivals have multiplied. Thanks to the development of the Internet, online shopping has become mainstream and changed our traditional shopping behavior. And where there is e-commerce, a recommendation system is indispensable. When we enter a shopping website, we are surrounded by so many products that the sheer number of options makes choosing difficult. Experiments have shown that too much variety actually motivates us to buy less than a smaller selection does.

Today we are all surrounded by so much information that access to information is no longer the problem; in a day or two, a modern person receives roughly as much information as a medieval person did in a lifetime. The problem is how to find the information we want in that mass of information. So today a recommendation system is no longer icing on the cake, but essential. What, then, is a recommendation system, and what does it do? Simply put, a recommendation system uses a user's historical information and behavior to recommend content or goods the user is likely to be interested in.

Most websites, such as Amazon, YouTube and Netflix, use collaborative filtering as part of their sophisticated recommendation systems. You can use this technique to build recommendation engines that recommend products to similar users based on their common interests.

What is collaborative filtering

Collaborative filtering is a technique that filters out items a user is likely to enjoy based on the behavior of similar users. It analyzes user interests, finds the users in the crowd whose interests are similar to those of a given user, aggregates those similar users' evaluations of an item, and produces the system's prediction of how much the given user will like that item.

The data set

To explore this recommendation algorithm, let's prepare a small example. The data consists of a set of items and a set of users, and the users' ratings of the items are the data to be analyzed.

A user’s evaluation (feeling) of an item can be either explicit (e.g., rating on a scale of 1 to 5, like or dislike) or implicit (viewing an item, adding it to a shopping cart, willingness to spend time on content, etc.).

When processing such data, it is arranged as a matrix of users' ratings of items. Rows represent users and columns represent items; the cell at the intersection of a user and an item holds that user's rating of the item.

| User/Item | Item1 | Item2 | Item3 | Item4 | Item5 |
| --------- | ----- | ----- | ----- | ----- | ----- |
| User1     | 5     |       | 4     |       | 1     |
| User2     |       | 3     |       | 3     |       |
| User3     | 2     |       | 4     | 4     | 1     |
| User4     |       | 4     | 4     |       | 5     |
| User5     | 2     | 4     |       | 5     | 2     |

Above is the rating matrix for users and items, for example, User1 gives Item3 a rating of 4.

In most cases, many positions in the matrix are empty, because each user rates only a few items; it is obviously unrealistic for every user to rate every item. A matrix in which most positions are empty is usually called a sparse matrix, while the opposite (a matrix in which most positions are filled) is called a dense matrix.

Many data sets have been collected and made available to the public for research and benchmarking, so there are plenty of high-quality data sources to choose from.

The MovieLens data set, collected by GroupLens Research, will be used in the following analysis. MovieLens 100K is a stable benchmark data set with 100,000 ratings of 1,682 movies from 943 users, where each user has rated at least 20 movies.

This data set consists of multiple files that contain information about the movies, the users, and the ratings users gave to the movies they watched.

General steps for collaborative filtering

Building a recommendation system based on user behavior can be roughly divided into two steps: first, find similar users or items; second, work out which items a user has not yet rated and predict how the user would rate them. To do this, you need to answer a few questions:

  • How do you measure the similarity between users or items?
  • Once you have identified users similar to a given user, how do you predict that user's rating of an item from the ratings of those similar users?
  • How do you measure how accurately the model predicts users' ratings of items?

The answers to the first two questions are not unique. Collaborative filtering is a family of algorithms that offers many ways to compute the similarity between users or items, and many ways to predict a user's rating from the ratings of similar users. Depending on your choices, you end up with a particular flavor of collaborative filtering.

Regarding collaborative filtering, it is important to note that a purely collaborative filtering algorithm computes user similarity without using data such as the user's age, the genre of a movie, or any other characteristics of the user or item. Similarity between users or items is calculated only from users' ratings (explicit or implicit). For example, if two users give the same ratings to ten movies, they can be considered similar despite a vast difference in age.

Pearson correlation coefficient

$$sim(a,b) = \frac{\sum_{i \in I}(r_{a,i} - \overline{r}_a)(r_{b,i} - \overline{r}_b)}{\sqrt{\sum_{i \in I}(r_{a,i} - \overline{r}_a)^2}\sqrt{\sum_{i \in I}(r_{b,i} - \overline{r}_b)^2}}$$
  • $I$ is the set of items rated by both user $a$ and user $b$
  • $r_{a,i}$ is user $a$'s rating of item $i$, and $r_{b,i}$ is user $b$'s rating of item $i$
  • $\overline{r}_a$ is user $a$'s average rating
Cosine similarity

$$sim(a,b) = \frac{r_a \cdot r_b}{\|r_a\| \, \|r_b\|}$$

The third question is how to measure the accuracy of the model's (algorithm's) predictions. Here, too, there are multiple answers, including error-measurement techniques that are used in many places, not just in collaborative filtering.

One way to measure accuracy is the root mean square error (RMSE): predict ratings for the user-item pairs in a test set whose true ratings are already known; the difference between a known value and the predicted value is the error. Square all the error values for the test set, take their mean, and then take the square root of that mean to get the RMSE.

Another measure of accuracy is the mean absolute error (MAE), calculated by taking the absolute value of each error and averaging all of those values.

At this point, you don’t need to worry about the details of RMSE or MAE and how to implement them, because most Python packages provide ready-made methods that you can call.
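Still, for intuition, here is a minimal sketch of both measures using NumPy, with hypothetical known ratings and model predictions for the same user-item pairs:

import numpy as np

known = np.array([4.0, 3.0, 5.0, 2.0])      # true ratings in the test set
predicted = np.array([3.5, 3.0, 4.0, 2.5])  # the model's predictions

errors = known - predicted
rmse = np.sqrt(np.mean(errors ** 2))  # square, average, then take the root
mae = np.mean(np.abs(errors))         # average of the absolute errors

print(rmse, mae)  # ~0.61 and 0.5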

Now let’s look at the different types of algorithms in the collaborative filtering family.

Memory-based collaborative filtering

Memory-based algorithms use statistical techniques computed over the entire data set to predict how a user will rate an item.

To find the rating R that a user U would give to an item I, the method includes:

  • Find users similar to U who have rated item I
  • Score R is calculated based on the ratings of similar users found

You'll see each of these in detail in the following sections.

How to find similar users based on the rating matrix

To understand the concept of similarity, let's first create a simple data set to illustrate it. The data covers four users, A, B, C and D, who have each rated two movies. The ratings are stored in lists, each containing two numbers representing the user's rating of each movie.

A's ratings are [1.0, 2.0], B's are [2.0, 4.0], C's are [2.5, 4.0], and D's are [4.5, 5.0].

Let's plot these points on a graph for easier observation.

In the image above, each point represents a user, positioned according to that user's ratings of the two movies.

Looking at the distance between these points seems like a good way to estimate similarity, right? You can calculate the distance using the Euclidean distance formula between two points, via a function from SciPy, as shown in the following program.
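A minimal sketch of that computation, using the rating lists defined above:

from scipy.spatial import distance

a = [1.0, 2.0]
b = [2.0, 4.0]
c = [2.5, 4.0]
d = [4.5, 5.0]

print(distance.euclidean(c, a))  # 2.5
print(distance.euclidean(c, b))  # 0.5
print(distance.euclidean(c, d))  # ~2.24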

As shown above, you can use scipy.spatial.distance.euclidean to calculate the distance between two points. Here it measures how far the ratings of A, B and D are from the ratings of C: in terms of distance, C's ratings are closest to B's.

Looking at the plot confirms that user B is closest to user C. But narrowing it down to A and D, who is C closer to?

Judging by distance alone, D is closer to C than A is. But if we look at A's and C's ratings of the two movies, it is not hard to see that they are nearly proportional: both rated the second movie roughly twice as high as the first, which says both prefer the second movie. From a practical point of view, A and C should be considered more similar.

As this shows, Euclidean distance does not correctly reflect the similarity between users' ratings, so let's instead look at the angles between the lines (vectors) connecting each point to the origin, and use those angles to measure similarity.

Now we represent the four users as vectors. The vectors for A and B overlap, so the angle between them is 0. In other words, the smaller the angle between two vectors, the more similar the users.

To compute similarity from an angle, we need a function that returns higher similarity when the angle between two vectors is smaller, and lower similarity when the angle is larger. The cosine is such a function: as the angle goes from 0° to 180°, the cosine goes from 1 to -1.

So you can use the cosine of the angle to find the similarity between two users: the larger the angle, the smaller the cosine and the lower the similarity. Alternatively, you can flip it around and use the cosine distance between the users, which is 1 minus the cosine of the angle.

SciPy has a function that calculates the cosine distance between vectors; the larger the angle, the higher the value it returns.
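A minimal sketch with the same rating vectors as before:

from scipy.spatial import distance

a = [1.0, 2.0]
b = [2.0, 4.0]
c = [2.5, 4.0]
d = [4.5, 5.0]

print(distance.cosine(a, b))  # 0.0, A and B point in exactly the same direction
print(distance.cosine(c, a))  # ~0.004
print(distance.cosine(c, d))  # ~0.015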

The angle between C and A is smaller, so their cosine distance is smaller. To rank users by similarity in this way, use cosine distance.

Note: In the above example, only two movies were considered, which makes the rating vectors easy to visualize in two dimensions and a little more intuitive to explain. With more items, the rating vectors gain more dimensions. You might also want to explore the mathematics of cosine similarity.

Notice that although users A and B gave the two movies different ratings, their cosine distance is very small, i.e. their similarity is very high. It is not hard to see why: their ratings are in exact proportion. One explanation is that user A may be a harsh rater who rarely gives high marks, while user B is lenient and usually rates high; in other words, there is an offset between their rating baselines. To eliminate these different baselines and put the users on the same level, we can measure each user's preference for an item relative to that user's own average. That is the idea behind the similarity measure used next.

For user A, the average of the rating vector [1, 2] is 1.5. Subtracting 1.5 from each rating gives the vector [-0.5, 0.5]. For user B, the average of the rating vector [2, 4] is 3. Subtracting 3 from each rating gives the vector [-1, 1].

Do the same for users C and D, and you'll see that every user's ratings now average 0, which puts them all on the same level and removes their bias.
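A quick sketch of this centering step for all four users:

import numpy as np

ratings = {"A": [1.0, 2.0], "B": [2.0, 4.0], "C": [2.5, 4.0], "D": [4.5, 5.0]}

for user, r in ratings.items():
    r = np.array(r)
    print(user, r - r.mean())  # subtract each user's own average rating
# A [-0.5  0.5]
# B [-1.  1.]
# C [-0.75  0.75]
# D [-0.25  0.25]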

The cosine of the angle between the adjusted vectors is called centered cosine. This approach is normally used when there are a lot of missing values in the vectors, and you need to place a common value to fill up the missing values.

Filling the missing values of the rating matrix with a random value would lead to inaccuracy. A natural choice for filling missing values is each user's average rating; but the raw averages of users A and B are 1.5 and 3 respectively, and filling all of A's empty cells with 1.5 and all of B's with 3 would make them appear to no longer be similar users.

After centering, however, both users' averages are 0, which more accurately captures whether a given rating is above or below the user's own average, and all missing values in both users' vectors simply become 0.

Euclidean distance and cosine similarity are just two of the methods that can be used to measure similarity between users, or between items. (The function used above computes cosine distance; to get cosine similarity, subtract the distance from 1.)

Note: The formula for the centered cosine is the same as that for the Pearson correlation coefficient. You will find that many resources and libraries on recommenders refer to the implementation of centered cosine as Pearson correlation.

How to calculate the score

After you have identified a list of users similar to a user U, you need to calculate the rating R that U would give to a certain item I. As with similarity, this can be done in a variety of ways.

You can predict a user U's rating R of an item I to be close to the average of the ratings given to I by the top 5 or top 10 users most similar to U.


$$R_U = \frac{\sum_{u=1}^{n} R_u}{n}$$

This formula says that the average of n similar users' ratings of item I is taken as user U's predicted rating of I: sort the similar users by similarity and use the mean of the top n users' ratings of item I. Naturally, the ratings of the most similar users ought to contribute more to the target user's predicted rating.

The weighted average does exactly that: each rating is multiplied by a similarity coefficient, so the greater a similar user's weight (similarity), the more that user's rating counts.

The similarity coefficient used as the weight should vary inversely with the distance discussed above, because a smaller distance means higher similarity; for example, cosine similarity can be obtained by subtracting the cosine distance from 1. Given the similarity coefficient S of each user similar to the target user U, the weighted average can be calculated with this formula:


$$R_U = \frac{\sum_{u=1}^{n} R_u \cdot S_u}{\sum_{u=1}^{n} S_u}$$

In the formula above, each similar user's rating is multiplied by that user's similarity coefficient, and the final predicted rating for user U equals the sum of the weighted ratings divided by the sum of the weights, which normalizes the result.

Note: If you wonder why the weighted sum of ratings is divided by the sum of the weights rather than by n, consider that in the previous plain-average formula we divided by n because every similar user effectively had a weight of 1.

With a weighted average, the ratings of similar users are taken into account in proportion to their similarity.
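Here is a minimal sketch of both prediction formulas; the ratings and similarity coefficients are hypothetical values, not outputs of the earlier code:

def predict_simple(ratings):
    # Plain average of the similar users' ratings (every weight is 1)
    return sum(ratings) / len(ratings)

def predict_weighted(ratings, sims):
    # Similarity-weighted average: sum(R_u * S_u) / sum(S_u)
    return sum(r * s for r, s in zip(ratings, sims)) / sum(sims)

ratings = [4.0, 5.0, 3.0]  # ratings of item I by the three most similar users
sims = [0.95, 0.80, 0.40]  # their similarity to the target user U

print(predict_simple(ratings))          # 4.0
print(predict_weighted(ratings, sims))  # ~4.19, pulled toward the most similar users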

Now you know how to find similar users and how to calculate a target user's rating of an item from their ratings. There is also a variant of collaborative filtering that predicts ratings by finding items similar to each other rather than similar users. You'll learn about this variant in the next section.

User-based and item-based collaborative filtering

The technique in the example explained above, which uses the rating matrix to find similar users based on the ratings they give, is called **user-based** or user-user collaborative filtering. If the rating matrix is used instead to find similar items based on the ratings users give them, the method is called **item-based** or item-item collaborative filtering.

The two approaches are similar in mathematics and flow, but there are conceptual differences. Here’s a comparison of the two approaches.

User-based

For a user U, a list of users similar to U is computed from their ratings of items and sorted by similarity. U's likely rating of an item U has not seen is then estimated from the ratings that those similar users gave that item.

Item-based

For an item I, a list of items similar to I is built from the ratings users have given; the N items from that list that user U has already rated are selected, and U's likely rating of the unseen item I is computed from those N ratings.

Item-based collaborative filtering was developed by Amazon. In a system with many more users than items, item-based filtering is faster and more stable than user-based filtering. It works because, typically, the average rating an item receives does not change as quickly as the average rating a user gives to different items. It is also known to perform better than the user-based approach when the rating matrix is sparse.

But item-based approaches don't perform well on data sets of browsing or entertainment-related items such as MovieLens, where the recommendations they produce seem obvious to the target users. Such data sets are better served by the matrix factorization techniques covered below.

Recommendation diversity and accuracy
  • Item-based recommendations: the limited recommendation list may contain a number of unpopular long-tail items
  • User-based recommendations: greater diversity
  • Item-based collaborative filtering does not take differences between users into account, so it is less accurate, but it needs neither users' historical data nor user identification. For items, the similarity between them is much more stable, so the similarity computation, the step with the largest workload, can be completed offline, reducing the amount of online computation and improving recommendation efficiency, especially when there are more users than items.

Model-based collaborative filtering

The model-based approach involves reducing or compressing the large, sparse user-item matrix. A basic understanding of dimensionality reduction is helpful for this step.

Dimension reduction algorithm

The user-item matrix has two dimensions:

  • Number of users
  • Number of items

If the matrix is mostly empty, reducing the dimensions can improve the performance of the algorithm in both space and time. This can be done using various methods, such as matrix factorization or autoencoders.

Matrix factorization can be thought of as decomposing a large matrix into the product of two smaller ones, similar to factoring integers: 12 can be written as 6 × 2 or 4 × 3. In the case of matrices, a matrix A with dimensions m × n can be reduced to the product of two matrices X and Y with dimensions m × p and p × n respectively.

Note that in matrix multiplication, X can be multiplied by Y only if the number of columns of X equals the number of rows of Y; hence the two reduced matrices share the common dimension p, and usually p is much smaller than m or n.

After the decomposition we obtain a user matrix and an item matrix. The m-th row of the user matrix represents the m-th user, and its p columns describe that user's features. The item matrix works the same way: there are n items, each with p features. Here is an example of matrix factorization.

In the picture above, the matrix is decomposed into two matrices: the one on the left is the user matrix with m users, and the one on top is the item matrix with n items. The rating of 4 is the dot product of the corresponding user vector and item vector:

  • A user vector (2, -1)

  • An item vector (2.5, 1)
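The rating can be checked as a simple dot product of the two vectors listed above:

import numpy as np

user_vector = np.array([2, -1])    # the user's row in the user matrix
item_vector = np.array([2.5, 1])   # the item's column in the item matrix

print(np.dot(user_vector, item_vector))  # 2*2.5 + (-1)*1 = 4.0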

One way to think about it: matrix factorization finds p latent features such that users and items can each be represented by p-dimensional feature vectors, effectively mapping both into one space through their underlying relationships. The factor matrices can provide this kind of insight about users and items. In this example the two latent factors correspond to movie genres, but in the real world latent factors are rarely so interpretable, and the difficulty of explaining them is a drawback of matrix factorization. The number of latent factors influences the recommendations: the more factors, the more personalized the recommendations become, but too many factors lead to an overfitted model.

Matrix Factorization

One of the popular algorithms for matrix factorization is singular value decomposition (SVD). SVD came to the fore when matrix factorization performed well in the Netflix Prize competition. Other algorithms include PCA and its variations, NMF, and more. If you want to use neural networks, autoencoders can also be applied for dimensionality reduction.
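As a rough illustration of the idea only (recommender-style SVD variants treat missing ratings differently), here is a sketch of a rank-p truncation with NumPy's SVD on a small hypothetical rating matrix, with 0 standing in for missing ratings:

import numpy as np

# Hypothetical user-item rating matrix; 0 marks a missing rating
A = np.array([
    [5.0, 0.0, 4.0, 0.0, 1.0],
    [0.0, 3.0, 0.0, 3.0, 0.0],
    [2.0, 0.0, 4.0, 4.0, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

p = 2                      # number of latent factors to keep
X = U[:, :p] * s[:p]       # m x p user-factor matrix
Y = Vt[:p, :]              # p x n item-factor matrix

print(np.round(X @ Y, 2))  # rank-p approximation of the original matrix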

Build the recommendation engine in Python

In Python, a number of libraries and toolkits provide implementations of various algorithms, making it easy to build a recommendation engine out of packaged components. Try Surprise: Surprise is a Python scikit that provides a variety of recommendation algorithms and similarity metrics for implementing recommendation systems.

Here is how to install the required packages using pip:

pip install numpy
pip install scikit-surprise

To use Surprise, you first need to know a few of the basic modules and classes it provides.

The Dataset module is used for loading data from files or Pandas DataFrames; it also provides built-in data sets for testing and learning (MovieLens 100K is one of Surprise's built-in data sets). Some of the methods available for loading a data set are:

Dataset.load_builtin()
Dataset.load_from_file()
Dataset.load_from_df()

The Reader class is used to parse files that contain ratings. It is configured with the following parameters; a small loading example follows the list.

  • line_format is a string that gives the order of the data, with field names separated by spaces, e.g. "item user rating"
  • sep specifies the delimiter between fields, such as ','
  • rating_scale specifies the rating scale; the default is (1, 5)
  • skip_lines indicates the number of lines to skip at the beginning of the file; the default is 0
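For instance, a sketch of loading ratings from a hypothetical CSV file named ratings.csv, with one "user,item,rating" triple per line:

from surprise import Dataset
from surprise import Reader

reader = Reader(line_format="user item rating", sep=",",
                rating_scale=(1, 5), skip_lines=0)
data = Dataset.load_from_file("ratings.csv", reader=reader)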

Create a load_data.py file that can be used to load data from Pandas DataFrame or from the built-in MovieLens 100K dataset.

# load_data.py

import pandas as pd
from surprise import Dataset
from surprise import Reader

# This is the same data that was plotted for similarity earlier
# with one new user "E" who has rated only movie 1
ratings_dict = {
    "item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
    "user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
    "rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}

df = pd.DataFrame(ratings_dict)
reader = Reader(rating_scale=(1, 5))


# loading Pandas dataframe
data = Dataset.load_from_df(df[["user", "item", "rating"]], reader)
# Load the built-in Movielens-100K dataset
movielens = Dataset.load_builtin('ml-100k')

Algorithms based on k-NN

The algorithm you choose for recommendations depends on the technique you are using. For the memory-based approach discussed above, a suitable algorithm is centered k-NN, because it closely matches the centered cosine similarity formula explained above. Surprise provides an implementation called KNNWithMeans. To configure it, you pass KNNWithMeans a dictionary of options named sim_options. The relevant parameters are explained below:

  • name is the similarity measure to use: cosine, msd, pearson or pearson_baseline; the default is msd
  • user_based is a Boolean that says whether the approach will be user-based or item-based; the default is True, meaning a user-based approach is used
  • min_support is the minimum number of common items required to consider the similarity between two users; for the item-based approach, it is the minimum number of common users for two items
# recommender.py

from surprise import KNNWithMeans

# To use item-based cosine similarity
sim_options = {
    "name": "cosine",
    "user_based": False,  # compute similarities between items
}
algo = KNNWithMeans(sim_options=sim_options)

The recommender in the program above is configured to use cosine similarity and an item-based approach to find similar items.

To test the recommendation engine, you need to build a Trainset from the data. You can use the entire data set as training data, or train on only part of it; usually the data is split into a training set and a test set.
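A minimal sketch of such a split, using Surprise's own helper and reusing the data and algo modules defined above:

from surprise import accuracy
from surprise.model_selection import train_test_split

from load_data import data
from recommender import algo

# Hold out 25% of the ratings as a test set
trainset, testset = train_test_split(data, test_size=0.25)

algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)  # prints the RMSE over the held-out ratings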

For this example, let's train on the full data set instead:

from load_data import data
from recommender import algo

trainingSet = data.build_full_trainset()

algo.fit(trainingSet)
# Computing the cosine similarity matrix...
# Done computing similarity matrix.
# <surprise.prediction_algorithms.knns.KNNWithMeans object at 0x7f04fec56898>

prediction = algo.predict('E', 2)
print(prediction.est)  # 4.15

The algorithm predicts that user E would give movie 2 a rating of 4.15, which shows that we have the makings of a working recommendation engine. Although simple, this approach is often effective in real projects, and it is easy to manage, control, and debug.

You should try the different k-NN-based algorithms, the different similarity options, and the matrix factorization algorithms available in the Surprise library. Try them on the MovieLens data set and compare the results to see which algorithm, with which parameters, performs best.

Tuning the algorithm's parameters

Surprise provides a GridSearchCV class similar to scikit-learn's GridSearchCV: the search for the best parameters is handed over to the framework, and you tell it which measures to use to judge how good the model is.

from surprise import KNNWithMeans
from surprise import Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin("ml-100k")
sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}

param_grid = {"sim_options": sim_options}

gs = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])
Running the program prints the best RMSE score and the parameters that produced it:

0.9725497699737625
{'sim_options': {'name': 'msd', 'min_support': 3, 'user_based': True}}
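If you want to keep using the best configuration found, GridSearchCV also exposes the fitted algorithms; a short sketch, assuming the gs object from the program above:

# Retrieve the algorithm instance that achieved the best RMSE,
# then refit it on the full data set before making predictions
best_algo = gs.best_estimator["rmse"]
best_algo.fit(data.build_full_trainset())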