Recommendation systems have a wide range of applications, from movie recommendation to product recommendation. This paper introduces the basic principle of the collaborative filtering algorithm and, through it, the implementation principle of a recommendation system.

Description of the recommendation system

Let’s take a movie recommendation system to see how to describe a recommendation system from the perspective of machine learning. Let $y^{(i,j)}$ denote the rating that user $j$ gives to movie $i$, and let $r(i,j)$ indicate whether user $j$ has rated movie $i$: $r(i,j) = 1$ means the user has rated the movie, while $r(i,j) = 0$ indicates that the user has not watched or rated it. We assume that after a user sees a movie, they will give it a rating, and if they don’t, the default rating is zero. Thus, the task of our movie recommendation system is to predict the ratings of movies that users have not yet seen, based on the existing ratings, and to recommend to each user the movies they are likely to rate highly.
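To make the notation concrete, here is a minimal sketch in Python (all ratings are invented for illustration) showing how $y^{(i,j)}$ and $r(i,j)$ can be stored as matrices:

```python
import numpy as np

# A toy rating matrix: rows are movies, columns are users.
# All numbers are invented for illustration; 0 marks "not rated".
Y = np.array([
    [5, 4, 0],   # movie 0
    [0, 3, 1],   # movie 1
    [2, 0, 5],   # movie 2
])

# r(i, j) = 1 if user j has rated movie i, 0 otherwise.
R = (Y > 0).astype(int)
```

Storing the indicator `R` separately makes it easy later to restrict every sum to the rated entries only.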

Content-based recommendation algorithm

Again using the movie recommendation system as an example, we assume that $\theta^{(j)}$ represents the parameter vector of user $j$ and $x^{(i)}$ is the feature vector of movie $i$ (features such as the romance component, the action component, and, if you wish, perhaps the romantic-action component). Thus, user $j$’s predicted score for movie $i$ is $(\theta^{(j)})^T x^{(i)}$.

The next goal is to obtain the parameter vector $\theta^{(j)}$ for user $j$. This is actually a linear regression problem: we learn $\theta^{(j)}$ from all the ratings the user has given to existing movies. Following the cost formula of the linear regression algorithm, our goal is to find the value of $\theta^{(j)}$ that minimizes the cost function. Assume that $m^{(j)}$ is the total number of movies rated by user $j$, and $n$ is the number of movie features.


Cost function for user j

$\theta^{(j)} \in R^{n+1}$. The summation part accumulates over all the movies rated by the user.
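This cost can be sketched in Python as follows (the function name `user_cost`, the parameter `lam`, and the toy numbers in the docstring are my own illustrations, not from the original text; the constant factor in front of the sum is dropped, which does not change the minimizer):

```python
import numpy as np

def user_cost(theta_j, X, y_j, r_j, lam):
    """Regularized cost for a single user j (names are illustrative).

    theta_j : (n+1,) parameter vector, theta_j[0] is the bias term
    X       : (m, n+1) movie feature matrix, first column all ones
    y_j     : (m,) user j's ratings of the m movies
    r_j     : (m,) indicator vector, 1 where user j rated the movie
    lam     : regularization strength lambda
    """
    errors = (X @ theta_j - y_j) * r_j            # only rated movies count
    reg = (lam / 2.0) * np.sum(theta_j[1:] ** 2)  # theta_0 not regularized
    return 0.5 * np.sum(errors ** 2) + reg
```

Multiplying the errors by `r_j` is exactly the restriction of the sum to movies the user has rated.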

The process of solving for $\theta^{(j)}$ is the process of minimizing the cost function. Mathematically, we can tweak it a bit to get the objective function of the content-based recommendation algorithm.

Evaluate user j’s parameters $\theta^{(j)}$

Minimizing the following cost function yields user j’s parameters $\theta^{(j)}$:


Cost function for user j

Evaluate the parameters $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$ of all users

Summing over all users and finding the minimum of the following cost function yields the parameters of all users.


Combined cost function for all users

According to the linear regression algorithm, we do not apply the regularization term for $k=0$. Then the parameter iteration formula is:


Parameter iteration formula 1

According to the linear regression algorithm, the iteration formula for $k \ne 0$ contains the regularization term. The parameter iteration formula is:


Parameter iteration formula 2

Where $\alpha$ is the learning rate.
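The two iteration formulas can be sketched together in one update function (a minimal illustration; the name `gradient_step` and the toy numbers are assumptions, not from the original text). The $k = 0$ component skips the regularization term, while all other components include it:

```python
import numpy as np

def gradient_step(theta_j, X, y_j, r_j, alpha, lam):
    """One gradient-descent update of user j's parameters.

    The k = 0 component (the bias term) is updated without the
    regularization term; all other components add lambda * theta_k.
    """
    errors = (X @ theta_j - y_j) * r_j   # zero out unrated movies
    grad = X.T @ errors                  # sum over rated movies
    grad[1:] += lam * theta_j[1:]        # regularize k != 0 only
    return theta_j - alpha * grad
```

Repeating this step until $\theta^{(j)}$ stops changing implements the minimization described above.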

Collaborative Filtering

The content-based recommendation algorithm needs to extract features from the recommended items and construct feature vectors. Take the movie recommendation system as an example: features such as romance and action must be extracted for movies, and then feature collection must be carried out for all movies, that is, scores for the romance component and the action component must be assigned to every movie. That is a lot of engineering work.

Alternatively, we can ask users to tell us their preferences when they register, such as what kinds of movies they like. In this way, $\theta^{(j)}$ for user $j$ is known in advance through the questionnaire, and $x^{(i)}$ is the feature vector of the movie. For a movie $i$ that user $j$ has not seen, the user’s likely rating can be predicted as $(\theta^{(j)})^T x^{(i)}$, and we can decide whether to recommend the movie to the user based on the predicted rating.

How do you describe this mathematically?

Compute the feature vector $x^{(i)}$ of movie i

Select the appropriate $x^{(i)}$ to minimize the following formula. This is the process of solving for the feature vector $x^{(i)}$ of movie $i$.


Cost function for solving the features of movie i

Compute the feature vectors $x^{(1)}, x^{(2)}, \dots, x^{(n_m)}$ of all movies

Select $x^{(1)}, x^{(2)}, \dots, x^{(n_m)}$ to minimize the following formula.


Solve for the cost function of all movie features
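When the user parameters are held fixed, this minimization is an ordinary regularized least-squares problem for each movie, so one way to sketch it is with the regularized normal equations rather than gradient descent (a design choice of this sketch; the function name and data layout are assumptions):

```python
import numpy as np

def fit_movie_features(Theta, y_i, r_i, lam):
    """Estimate movie i's feature vector with user parameters held fixed,
    via the regularized normal equations (names are illustrative).

    Theta : (n_u, n) user parameter matrix, one row per user
    y_i   : (n_u,) ratings of movie i
    r_i   : (n_u,) indicator, 1 where the user rated movie i
    """
    T = Theta[r_i == 1]          # only users who rated the movie
    y = y_i[r_i == 1]
    n = Theta.shape[1]
    # Solve (T^T T + lambda * I) x = T^T y for x.
    return np.linalg.solve(T.T @ T + lam * np.eye(n), T.T @ y)
```

Running this for every movie $i$ yields all of $x^{(1)}, \dots, x^{(n_m)}$.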

Collaborative filtering algorithm

In practical engineering applications, it is difficult to know the parameters $\theta^{(j)}$ of user $j$ in advance. Given all the user parameters $\theta^{(j)}$, we can estimate the movie feature vectors $x^{(i)}$; conversely, given the movie feature vectors $x^{(i)}$, we can compute the user preference parameters $\theta^{(j)}$. It looks like a chicken-and-egg problem.

In fact, the collaborative filtering algorithm is designed to solve exactly this problem.

  1. Randomly initialize the user parameters $\theta^{(j)}$
  2. Estimate the feature vectors $x^{(i)}$ using the estimated $\theta^{(j)}$ and the rating data of the movies users have seen
  3. Using the estimated feature vectors $x^{(i)}$, re-estimate the user preference parameters $\theta^{(j)}$
  4. Repeat steps 2 and 3 until the user parameters $\theta^{(j)}$ and the feature vectors $x^{(i)}$ converge to suitable values
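The alternation above can be sketched as one round of back-and-forth estimation (a sketch under the assumption that each direction is solved by regularized normal equations; the name `als_step` and the data layout are my own):

```python
import numpy as np

def als_step(Y, R, X, Theta, lam):
    """One round of the alternating scheme: step 2 re-estimates every
    movie's features with Theta fixed, step 3 re-estimates every user's
    parameters with X fixed (regularized normal equations both ways)."""
    n = X.shape[1]
    for i in range(Y.shape[0]):          # step 2: movie features
        idx = R[i] == 1
        T = Theta[idx]
        X[i] = np.linalg.solve(T.T @ T + lam * np.eye(n), T.T @ Y[i, idx])
    for j in range(Y.shape[1]):          # step 3: user parameters
        idx = R[:, j] == 1
        F = X[idx]
        Theta[j] = np.linalg.solve(F.T @ F + lam * np.eye(n), F.T @ Y[idx, j])
    return X, Theta
```

Calling `als_step` repeatedly from a small random initialization (step 1) until `X` and `Theta` stop changing is step 4.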

These are the core steps of the collaborative filtering algorithm. An important effect of collaborative filtering is that when a user rates a movie, the rating helps the algorithm learn the characteristics of that movie, which in turn helps the system recommend suitable movies to all users and lets the algorithm better learn each user’s preferences. In this way, as the system is used, all users implicitly cooperate in filtering, helping the system learn better parameters and thus recommend users’ favorite movies more accurately.

Implementation of collaborative filtering algorithm

The collaborative filtering algorithm described in the previous section requires $\theta^{(j)}$ and $x^{(i)}$ to be re-estimated repeatedly in turn, so its efficiency is actually low. A more efficient approach is to minimize over $\theta^{(j)}$ and $x^{(i)}$ simultaneously, combining the two cost functions into a single one.

The total cost function is:


Total cost function

$\sum_{(i,j):r(i,j)=1}$ denotes the sum over all movie–user pairs for which a rating exists. Minimizing this cost function with respect to $\theta^{(j)}$ and $x^{(i)}$ simultaneously yields both the user parameters and the movie features.

In this way, we can update the implementation steps of the collaborative filtering algorithm:

  • Initialize $x^{(1)}, x^{(2)}, \dots, x^{(n_m)}, \theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$ to small random values. Why initialize with small random numbers instead of all zeros? Because these variables need different initial values, so that no two of them collapse into the same feature.
  • Minimize the cost function $J(x^{(1)}, x^{(2)}, \dots, x^{(n_m)}, \theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)})$ using gradient descent or another advanced optimization algorithm. The parameter iteration formula is


Parametric iteration formula

  • For a user with parameters $\theta^{(j)}$ and a movie with learned features $x^{(i)}$, predict a rating of $(\theta^{(j)})^T x^{(i)}$.

Note in particular that here we do not use the bias feature $x_0 = 1$, so there is no $\theta_0$.
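These steps can be sketched end to end as joint gradient descent on the combined cost (all data, shapes, and hyperparameters below are invented for illustration; note there is no bias feature, matching the remark above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (invented): 4 movies x 3 users, 0 = unrated.
Y = np.array([[5.0, 4.0, 0.0],
              [4.0, 0.0, 1.0],
              [0.0, 1.0, 5.0],
              [1.0, 1.0, 4.0]])
R = (Y > 0).astype(float)

n, alpha, lam = 2, 0.01, 0.1
# Small random initialization so that no two variables start identical.
X = rng.normal(scale=0.1, size=(Y.shape[0], n))      # movie features
Theta = rng.normal(scale=0.1, size=(Y.shape[1], n))  # user parameters

for _ in range(3000):
    E = (X @ Theta.T - Y) * R          # errors on rated entries only
    X_grad = E @ Theta + lam * X       # no unregularized bias component
    Theta_grad = E.T @ X + lam * Theta
    X -= alpha * X_grad
    Theta -= alpha * Theta_grad

pred = X @ Theta.T                     # predicted ratings for everyone
```

Because both parameter sets are updated in the same loop, there is no need to ping-pong between two separate minimizations.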

Vectorization implementation of collaborative filtering algorithm

Low Rank Matrix Factorization

The predicted rating of user $j$ on movie $i$ is $(\theta^{(j)})^T x^{(i)}$. If we wanted to calculate the ratings of all users on all movies at once, we could write the following matrix:


A matrix that predicts the ratings of all users

This is an $n_m \times n_u$ matrix, where $n_m$ is the number of movies and $n_u$ is the number of users. We write all movie features $x^{(i)}$ as an $n_m \times n$ matrix, where $n$ is the number of movie features.


Film feature matrix

Similarly, we write all user parameters $\theta^{(j)}$ as an $n_u \times n$ matrix.


User parameter matrix

Using the rule of matrix multiplication, the formula $Y = X \Theta^T$ calculates the ratings of all users on all movies at once, where $X \in R^{n_m \times n}$, $\Theta^T \in R^{n \times n_u}$, and $Y \in R^{n_m \times n_u}$.
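The vectorized formula is a single matrix product (shapes and numbers below are illustrative):

```python
import numpy as np

# Illustrative shapes: n_m = 3 movies, n_u = 2 users, n = 2 features.
X = np.array([[1.0, 0.0],      # movie feature matrix, one row per movie
              [0.0, 1.0],
              [1.0, 1.0]])
Theta = np.array([[5.0, 1.0],  # user parameter matrix, one row per user
                  [2.0, 4.0]])

# Y = X * Theta^T gives every user's predicted rating of every movie.
Y_pred = X @ Theta.T           # shape (n_m, n_u)
```

Entry $(i, j)$ of `Y_pred` is exactly the dot product $(\theta^{(j)})^T x^{(i)}$.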

Computational problem

For large movie websites, the number of movies is very large, and the number of users may be even larger. The computation of such a large matrix would be prodigious. From the core principle of the collaborative filtering algorithm, the system needs to evaluate this matrix frequently, at worst, every time a user rates a new movie. It is beyond the scope of this article to discuss this issue in detail, but a few strategies can be discussed briefly:

  • We don’t necessarily recalculate every time a user submits a new rating; instead we can recalculate every day or even every few days. As long as database read-write separation is done well, we can achieve the desired computing goals even with limited hardware resources, because the calculation is not time-sensitive: the recommendation system simply becomes more accurate after each parameter update.
  • Using distributed computing
  • Using quantum computers, which may eventually be orders of magnitude faster than current computers.

Recommend similar movies

Let’s take movies as an example and assume that we have learned all the movie features $x^{(i)}$ through the collaborative filtering algorithm. Suppose a user is browsing movie $i$, and we want to recommend five similar movies. How do we find five similar movies?

We can iterate over all movies and use the formula $\|x^{(i)} - x^{(k)}\|$ to find the five movies with the smallest distance to the movie being viewed, that is, the five movies with the highest similarity.
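A minimal sketch of this lookup (the function name `most_similar` is my own):

```python
import numpy as np

def most_similar(X, i, k=5):
    """Indices of the k movies whose feature vectors are closest in
    Euclidean distance ||x^(i) - x^(k)|| to movie i, excluding i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    dists[i] = np.inf                   # never recommend the movie itself
    return np.argsort(dists)[:k]
```

Sorting by distance and taking the first $k$ indices gives the $k$ most similar movies.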

Mean Normalization

So let’s say we have a new user $j$ who hasn’t rated any movies. Then, according to the cost function of the collaborative filtering algorithm:


The cost function

The first part will be 0, because there is no item with $r(i,j)=1$ for user $j$, and $x_k^{(i)}$ is constant because we have already learned the movie features. So for the new user $j$, the problem reduces to minimizing $\frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2$, and the calculation will make $\theta^{(j)}$ all zeros.

In this case, the user’s ratings for all movies would be predicted to be zero, and we would have nothing to recommend to him.

How to solve this problem?

Simply put, if a new user has not rated any movies, we can predict that the user’s rating of the new movie will be the average of the movie’s ratings. To describe it in mathematical terms, we need to first normalize the movie rating data.

Suppose we have the full movie rating matrix $Y \in R^{n_m \times n_u}$, and we average each row to get $\mu \in R^{n_m}$. We then use $Y - \mu$ as the training data set for our collaborative filtering algorithm. With the trained $\theta^{(j)}$ and $x^{(i)}$, the prediction formula for user $j$’s rating of movie $i$ becomes $(\theta^{(j)})^T x^{(i)} + \mu_i$.
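Mean normalization can be sketched as follows (toy ratings and the "learned" feature vector are invented for illustration; the mean is taken over rated entries only, an assumption of this sketch):

```python
import numpy as np

Y = np.array([[5.0, 4.0, 0.0],    # toy ratings, 0 = unrated
              [1.0, 2.0, 0.0]])
R = (Y > 0).astype(float)

# Per-movie mean, averaging over rated entries only.
mu = (Y * R).sum(axis=1) / R.sum(axis=1)

# Train the collaborative filtering algorithm on the de-meaned ratings.
Y_norm = (Y - mu[:, None]) * R

# A brand-new user j has theta^{(j)} = 0, so the prediction
# theta^T x + mu_i falls back to the movie's average rating mu_i.
theta_new = np.zeros(2)
x_movie0 = np.array([0.3, 0.7])   # hypothetical learned movie features
pred = theta_new @ x_movie0 + mu[0]
```

For the new user, the zero parameter vector contributes nothing, so every prediction is just the movie’s average rating.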

The other case is a new movie that no one has rated; then the movie’s learned feature values will be all zeros. We could also handle this with mean normalization, but whether that is reasonable must be judged at the business level. If a new movie has not been seen or reviewed by anyone, we should not recommend it to anyone. At the business level, a new movie might get a dedicated display area, such as a “New Movie Express” section.

Conclusion

Like magic, the collaborative filtering algorithm solves the chicken-and-egg problem, building user preferences and movie features from scratch. Of course, there is no real magic here: the key input to the collaborative filtering algorithm is the rating data that users continually submit. As users keep using the system, they improve the recommendation system in the process.