I remember guessing with friends about how NetEase Cloud Music's recommendations work. We had two theories. One: look at the music you have listened to and collected, find other users who like the same music, and recommend to you the music those users like. The other: look at what kind of music a user listens to or collects, and recommend more music of that kind. These were just wild guesses, of course, but they expose a problem: the second idea depends heavily on the attributes of the thing being recommended (how many tags a piece of music should carry, for example), and the granularity of those attributes has a big impact on recommendation accuracy. After reading about collaborative filtering today, I found the whole algorithm is basically the first idea. Its biggest feature is that it ignores the attributes of the recommended item and makes recommendations according to other users' preferences for it.

I picked a dataset at random, the MovieLens movie rating data, and chose ml-latest-small.zip (from the "recommended for education and development" section) to test with and see how it works. One article would be too long, so the next post is the hands-on part. Other datasets work too; the same site hosts test data for several recommendation engines.
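As a preview of that hands-on post, here is a minimal sketch (my own assumption of the setup, not the follow-up's actual code) of loading ml-latest-small into the kind of rating matrix used below; ratings.csv in that archive has the columns userId, movieId, rating, timestamp:

```python
import numpy as np
import pandas as pd

# read the MovieLens ratings and pivot into a user x movie matrix;
# movies a user has not rated become 0
ratings = pd.read_csv('ml-latest-small/ratings.csv')
data_mat = np.mat(ratings.pivot_table(index='userId', columns='movieId',
                                      values='rating', fill_value=0).values)
```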

I learned the algorithm itself from Machine Learning in Action, in the Turing programming series.


What is collaborative filtering

Collaborative filtering is an algorithm that makes recommendations by comparing users’ data with that of other users.


Collaborative filtering flow chart

(Figure: collaborative filtering flowchart, drawn in Grafio 3, though the resolution came out a bit low.)


Similarity

Similarity measures how alike two items (or users) are. For movies you might compare genre, director, region, and so on, but collaborative filtering ignores all of those attributes and computes similarity strictly from the ratings many users have given. For example, the following movie-user rating matrix:

(Figure: movie-user rating matrix)
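The original image is gone from the extraction; as a stand-in, here is a tiny matrix of the same kind with hypothetical ratings, including the two columns used in the Euclidean example below:

```python
import numpy as np

# hypothetical ratings: rows are users, columns are movies
# (column 0 = Moana, column 1 = A Dog's Purpose), 0 = not rated
data_mat = np.mat([[4, 4, 0],
                   [3, 3, 5],
                   [1, 2, 0]])
print(data_mat[:, 0].T)  # Moana's ratings as a column vector: [[4 3 1]]
```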

Similarity calculation

Euclidean metric

Euclidean distance is the true distance between two points in m-dimensional space, or the natural length of a vector (its distance from the origin). In two or three dimensions it is the familiar straight-line distance between two points: the square root of the sum of the squared coordinate differences. To measure the similarity between Moana and A Dog's Purpose, take Moana's column vector (4, 3, 1) and A Dog's Purpose's column vector (4, 3, 2), square the differences element by element, sum them, and take the square root.

(Figure: Euclidean distance between A Dog's Purpose and Moana)
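Written out with the two column vectors above:

$$d = \sqrt{(4-4)^2 + (3-3)^2 + (1-2)^2} = 1$$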

Python implementation code:

```python
import numpy as np

def eulid_sim(colA, colB):
    # similarity = 1 / (1 + Euclidean distance), so the result falls in (0, 1]
    return 1.0 / (1.0 + np.linalg.norm(colA - colB))
```

Code parsing: if the order of the norm is not specified, np.linalg.norm defaults to the 2-norm, the square root of the sum of the squares of the vector's elements. For example, if A = (4, 2, 2), then ||A|| = sqrt(4^2 + 2^2 + 2^2). So np.linalg.norm(colA - colB) is the Euclidean distance between vectors A and B. Wrapping it as 1.0 / (1.0 + distance) squeezes the similarity into the range 0 to 1: the more similar the vectors, the larger the value, and when the distance is 0 the similarity is 1.
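A quick check of eulid_sim with the two column vectors from the example above:

```python
# Moana = (4, 3, 1), A Dog's Purpose = (4, 3, 2): the distance is 1,
# so the similarity is 1 / (1 + 1)
moana = np.mat([[4], [3], [1]])
dogs_purpose = np.mat([[4], [3], [2]])
print(eulid_sim(moana, dogs_purpose))  # 0.5
```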

Pearson correlation coefficient

(Figure: Pearson correlation coefficient formula)
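The image is low-resolution; written out, the Pearson correlation coefficient of two vectors $x$ and $y$ is

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$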


  • The Pearson correlation coefficient can also measure the similarity between two vectors. One advantage over Euclidean distance is that it is insensitive to the magnitude of users' ratings: if a manic user rates every film 5 and a depressed user rates every film 1, the Pearson correlation coefficient treats the two rating vectors as equivalent (see the quick demonstration after this list). Look at the last formula, the cosine formula for two vectors: they look very similar, and indeed the Pearson coefficient is the cosine of the angle between the two mean-centered vectors.
  • The z-score, also known as the standard score, is the difference between a value and the mean divided by the standard deviation: z = (x - μ) / σ. It answers the question: how many standard deviations is a given score from the mean? A score above the mean gets a positive z-score, and a score below the mean gets a negative one, so a z-score tells you where a score sits relative to the distribution, measured in units of standard deviations.
  • Python code implementation
```python
def pearson_sim(colA, colB):
    # fewer than 3 points: treat the two vectors as perfectly correlated
    if len(colA) < 3:
        return 1.0
    return 0.5 + 0.5 * np.corrcoef(colA, colB, rowvar=0)[0][1]
```
  • Code parsing

    `len(colA) < 3` checks whether there are at least 3 points; if not, it returns 1.0 and treats the two vectors as perfectly correlated.

    `np.corrcoef(colA, colB, rowvar=0)` returns the correlation coefficient matrix of the variables; the [0][1] element is the correlation coefficient between the two inputs, and rowvar=0 means each column is treated as a variable.

    `0.5 + 0.5 * coefficient` again normalizes the value into the range 0 ~ 1: the Pearson correlation coefficient ranges from -1 to 1, so 0.5 + 0.5 * coefficient maps it to 0 ~ 1.
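A quick demonstration of that scale insensitivity (my own toy example, not from the book):

```python
# two users with different rating scales but identical taste
a = np.mat([[5], [4], [5]])
b = np.mat([[2], [1], [2]])
print(pearson_sim(a, b))  # 1.0: perfectly correlated despite the different scales
```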
Cosine similarity

(Figure: cosine similarity formula)
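Written out, for two vectors A and B:

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}$$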
Cosine similarity is the cosine of the angle between two vectors: if the angle is 90 degrees the similarity is 0, and if the vectors point in the same direction the similarity is 1. Since the cosine also ranges from -1 to 1, we normalize it the same way as before. As written earlier, ||A|| denotes the norm of vector A. Python code:

```python
def cos_sim(colA, colB):
    num = float(colA.T * colB)                           # dot product A . B
    denom = np.linalg.norm(colA) * np.linalg.norm(colB)  # ||A|| * ||B||
    return 0.5 + 0.5 * (num / denom)
```

Similarity selection

Computing the distance between two movies is item-based similarity, while computing the distance between two users is user-based similarity. Which one to use depends on the number of users versus items: the cost of item-based similarity grows with the number of items, and the cost of user-based similarity grows with the number of users. If there are far more users than items, prefer item-based similarity. For most recommendation engines the number of users greatly exceeds the number of items, so item-based similarity is used.


Build a rating estimation function (that is, predict the rating a user would give a movie to be recommended)

For the movies a user has never seen, we predict the rating he would give each of them, then recommend the top N movies to him.

General scoring estimation

Flow chart:

(Figure: general scoring estimation flowchart)

```python
# For each item the user has already rated, compute how similar it is to the
# item we want to predict, accumulate similarity and similarity * rating,
# then predict the user's rating as total weighted rating / total similarity.
# data_mat: rating matrix (rows are users, columns are items)
# user: user index
# item: index of the item whose rating we want to predict
# sim_meas: similarity function
def stand_est(data_mat, user, item, sim_meas):
    n = np.shape(data_mat)[1]
    sim_total = 0.0
    rat_sim_total = 0.0
    for j in range(n):
        user_rating = data_mat[user, j]
        if user_rating == 0:
            continue  # the user has not rated item j, skip it
        # rows where both the target item column and column j are non-zero,
        # i.e. the users who have rated both items
        overlap = np.nonzero(np.logical_and(data_mat[:, item].A > 0,
                                            data_mat[:, j].A > 0))[0]
        # if no user has rated both items, the similarity is 0;
        # otherwise compute it on the overlapping rows
        if len(overlap) == 0:
            similarity = 0
        else:
            # note: overlap is an index array, so these are still column vectors
            similarity = sim_meas(data_mat[overlap, item],
                                  data_mat[overlap, j])
        print('the %d and %d similarity is %f' % (item, j, similarity))
        sim_total += similarity
        rat_sim_total += similarity * user_rating
    if sim_total == 0:
        return 0
    # normalize, so the result is a rating on the original 0 ~ 5 scale
    return rat_sim_total / sim_total
```

Code parsing: the overlap line is the hard part. data_mat[:, item] is the column numbered item; the .A attribute converts the result to an ndarray, so data_mat[:, item].A > 0 produces a Boolean array of the same shape. np.logical_and takes the element-wise logical AND of the two Boolean arrays, True or False depending on whether each value is greater than zero. np.nonzero then finds the indices of the True values. The net effect is to extract the indices of the rows in which both items have been rated, for use in the similarity calculation. To figure out what the overlap line does, I ran the following tests:

```python
data_mat = np.mat([[1, 2, 0, 0, 0],
                   [6, 7, 8, 1, 10]])
a = data_mat[:, 0]
b = data_mat[:, 3]
print(a)    # [[1] [6]]
print(a.A)  # [[1] [6]]  a.A looks just like a; what's the difference? Check the types
print(type(a))     # <class 'numpy.matrixlib.defmatrix.matrix'>
print(type(a.A))   # <type 'numpy.ndarray'>  per the API: return `self` as an `ndarray` object
print(type(a.A1))  # <type 'numpy.ndarray'>  return `self` as a flattened `ndarray`
print(a.A > 0)  # [[ True] [ True]]
print(b.A > 0)  # [[False] [ True]]
print(np.logical_and(a.A > 0, b.A > 0))  # [[False] [ True]]
print(np.nonzero(np.logical_and(a.A > 0, b.A > 0)))
# (array([1]), array([0]))  np.nonzero returns the indices of the non-zero elements
print(np.nonzero(np.logical_and(a.A > 0, b.A > 0))[0])
# [1]  taking [0] keeps the row indices: only row 1 has both columns rated
```
SVD score estimation

Flow chart:

(Figure: SVD scoring prediction flowchart)

```python
# Use SVD to transform the matrix into a lower-dimensional space,
# then estimate the rating by computing similarities there.
def svd_est(data_mat, user, item, sim_meas):
    n = np.shape(data_mat)[1]
    sim_total = 0.0
    rat_sim_total = 0.0
    u, sigma, vt = np.linalg.svd(data_mat)
    # numpy stores the singular values as a row vector,
    # so turn the first 4 back into a diagonal matrix
    sig4 = np.mat(np.eye(4) * sigma[:4])
    # transform the items into the low-dimensional space using the U matrix;
    # the .I attribute is the matrix inverse
    x_formed_items = data_mat.T * u[:, :4] * sig4.I
    for j in range(n):
        user_rating = data_mat[user, j]
        if user_rating == 0 or j == item:
            continue
        similarity = sim_meas(x_formed_items[item, :].T,
                              x_formed_items[j, :].T)
        print('the %d and %d similarity is %f' % (item, j, similarity))
        sim_total += similarity
        rat_sim_total += similarity * user_rating
    if sim_total == 0:
        return 0
    return rat_sim_total / sim_total

Taking exactly 4 singular values for sig4 is not required; in general you look at the original data matrix and keep however many singular values are needed to retain 90% of the matrix's energy. As for x_formed_items: if the original data matrix has shape (m, n), then u[:, :4] has shape (m, 4) and sig4 (and its inverse) has shape (4, 4), so data_mat.T * u[:, :4] * sig4.I has shape (n, 4). Each original item column becomes a row, and since the similarity functions expect column vectors, we transpose when passing them.
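A minimal sketch of that 90%-energy rule (my own helper, not from the book), assuming sigma is the singular-value array returned by np.linalg.svd:

```python
def choose_k(sigma, energy_ratio=0.9):
    # the energy of a singular value is its square; keep the smallest k
    # whose cumulative energy reaches the requested ratio of the total
    energy = sigma ** 2
    threshold = energy.sum() * energy_ratio
    running = 0.0
    for k, e in enumerate(energy, start=1):
        running += e
        if running >= threshold:
            return k
    return len(sigma)

# usage: u, sigma, vt = np.linalg.svd(data_mat); k = choose_k(sigma)
```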


Recommendation

Flow chart:

(Figure: recommendation flowchart)

```python
# data_mat: data matrix
# user: user index
# n: return the top n items
# sim_meas: similarity function
# est_method: rating estimation function (stand_est or svd_est)
def recommend(data_mat, user, n, sim_meas, est_method):
    # indices of the items this user has not rated yet (rating == 0)
    un_rated_items = np.nonzero(data_mat[user, :].A == 0)[1]
    if len(un_rated_items) == 0:
        return 'you rated everything'
    item_scores = []
    for item in un_rated_items:
        estimate_score = est_method(data_mat, user, item, sim_meas)
        item_scores.append((item, estimate_score))
    # key=lambda jj: jj[1] sorts the (item, score) pairs by the score,
    # largest first; [:n] keeps the top n
    return sorted(item_scores, key=lambda jj: jj[1], reverse=True)[:n]
```
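To tie the pieces together, a quick run on a tiny hand-built matrix (rows are users, columns are items; the exact scores are illustrative):

```python
data_mat = np.mat([[4, 4, 0, 2, 2],
                   [4, 0, 0, 3, 3],
                   [4, 0, 0, 1, 1],
                   [1, 1, 1, 2, 0],
                   [2, 2, 2, 0, 0],
                   [1, 1, 1, 0, 0],
                   [5, 5, 5, 0, 0]])
# user 2 has not rated items 1 and 2; estimate both and keep the top 2
print(recommend(data_mat, 2, 2, cos_sim, stand_est))
# [(2, 2.5), (1, 2.02...)]  (item index, estimated rating)
```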

Evaluation

Since the recommendation engine has neither a known target value to predict nor users to survey about their satisfaction, the common approach is to hold out some known ratings, predict them, and compute the difference between the predicted and actual values. The metric usually used to evaluate recommendation engines is the Root Mean Squared Error (RMSE): compute the mean squared error, then take the square root. If ratings run from 1 to 5 stars and we get an RMSE of 1.0, our predictions are off from the real ratings by an average of one star.
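A minimal sketch of that metric (my own helper, not from the book):

```python
def rmse(predicted, actual):
    # root mean squared error between predicted and held-out actual ratings
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

print(rmse([2.5, 2.0], [3, 1]))  # ~0.79
```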