Original link:tecdat.cn/?p=10911
Original source: Tuo Duan Data Tribe public account
This article covers the latent features of users and products, writing a recommendation system, how matrix factorization works, and how latent representations can be used to find similar products.
1. Latent features of users and products
We can estimate how much a user will like a movie by assigning attributes to each user and each movie, multiplying the matching attributes together, and summing the results.
The same calculation can be expressed as matrix multiplication. First, we put the user attributes in a matrix called U; in this case the values are 5, -2, 1, -5, and 5. Then we put the movie attributes into a matrix called M, and use matrix multiplication to find the user's ratings.
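As a minimal sketch of that calculation: only the user values 5, -2, 1, -5, and 5 come from the text, and the movie's attribute values are made up for illustration.

```python
import numpy as np

# One user's five attributes, taken from the example in the text.
U = np.array([[5, -2, 1, -5, 5]])

# One movie's five attributes (hypothetical values for illustration).
M = np.array([[0.5], [-0.2], [0.3], [-0.1], [0.2]])

# Multiplying matching attributes and summing them is exactly
# a (1x5) @ (5x1) matrix multiplication.
predicted_rating = np.matmul(U, M)
print(predicted_rating[0, 0])  # -> 4.7
```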
But to do that, we must already know the user attributes and movie attributes, and assigning attribute ratings to every user and every movie by hand is not easy. We need a way to automate it. Let's look at the movie rating matrix.
It shows how all the users in our dataset rated the movies. The matrix is very sparse, but it still gives us a lot of information. For example, we know that user 2 gave movie 1 five stars. Based on this, we can guess that this user's attributes are probably similar to that movie's attributes, because the two match well. In other words, we have some clues to work with.
Let's see how we can use these clues to learn about each movie and each user. In the equation we just saw, U times M equals the movie ratings, and we already know some users' actual ratings. The rating matrix we already have is a partial solution to our equation: it is full of holes, but it is enough to work with.
In fact, we can take the movie ratings we already know and work backwards to find U and M matrices that satisfy this equation. That is the cool part: when we multiply the U and M we find, we get a complete matrix, and we can use that completed matrix to recommend movies. Let's review how we will build this recommendation system.
First, we create a matrix of all the user reviews in the dataset. Next, we factor a U matrix and an M matrix out of the known reviews. Finally, we multiply the U and M matrices we found to get a predicted rating for every user and every movie.

But there is another problem. Previously, when we created attributes for each user and each movie by hand, we knew what each attribute meant: the first attribute represented action, the second represented story, and so on. When we use matrix factorization to derive U and M, we don't know what each value means. All we know is that each value represents some characteristic that makes users feel attracted to certain movies; we just can't describe these characteristics in words. Therefore, U and M are called latent vectors. The word latent means hidden: these vectors are hidden information that we derive by looking at the review data and working backwards.
2. Writing the recommendation system
Let's write the main code for the recommendation system (chapter 5/factor_review_matrix.py). First, we use pandas' read_csv function to load the review dataset into a DataFrame named raw_dataset_df.
We then use pandas' pivot_table function to build the review matrix. At this point, ratings_df contains a sparse ratings matrix.
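These two steps can be sketched as follows. The CSV file is replaced here by a tiny hand-made DataFrame so the snippet runs on its own, and the column names user_id, movie_id, and value are assumptions about the dataset's layout.

```python
import pandas as pd

# Stand-in for raw_dataset_df = pd.read_csv("movie_ratings_data_set.csv");
# the column names here are assumptions about the dataset's layout.
raw_dataset_df = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "movie_id": [1, 2, 1, 3],
    "value":    [5, 3, 5, 4],
})

# Build the sparse user-by-movie ratings matrix; cells with no rating
# become NaN.
ratings_df = pd.pivot_table(raw_dataset_df, index="user_id",
                            columns="movie_id", values="value",
                            aggfunc="max")
print(ratings_df)
```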
Next, we want to factor the matrix to find a user attribute matrix and a movie attribute matrix that we can multiply back together to recreate the ratings data. To do this, we will use a low-rank matrix factorization algorithm. I've included an implementation in matrix_factorization_utilities.py; we'll look at how it works in the next section, but for now let's just use it. First, we pass in the ratings data, calling pandas' as_matrix() function to make sure it is passed as a NumPy matrix data type.
Next, the function takes a parameter named num_features, which controls how many latent features are generated for each user and each movie. We'll start with 15. The function also takes a regularization_amount parameter; for now we'll pass in 0.1 and discuss how to tune it in a later article.
The result is a U matrix and an M matrix with 15 features for each user and each movie. Now we can get the predicted rating for every movie by multiplying U and M. Instead of the regular multiplication operator, we use NumPy's matmul function so it knows we want matrix multiplication.
The result is stored in an array called predicted_ratings. Finally, we save predicted_ratings to a CSV file.
First, we create a new pandas DataFrame to hold the data, telling pandas to use the same row and column names as the ratings_df DataFrame. Then we use pandas' to_csv function to save the data to a file. When you run this program, it creates a new file called predicted_ratings.csv, which you can open with any spreadsheet application.
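The multiply-and-save steps can be sketched like this. The U and M stand-ins below are random, since the real ones come from the factorization function, and the user and movie IDs are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-ins for the U and M matrices returned by the factorization step
# (random values here; 15 latent features as in the article).
rng = np.random.default_rng(0)
U = rng.normal(size=(3, 15))   # 3 users x 15 features
M = rng.normal(size=(15, 4))   # 15 features x 4 movies

# Multiply U and M to get a predicted rating for every user/movie pair.
predicted_ratings = np.matmul(U, M)

# Wrap the result in a DataFrame that reuses the ratings matrix's row and
# column labels (hypothetical IDs here), then save it to CSV.
predicted_ratings_df = pd.DataFrame(predicted_ratings,
                                    index=[1, 2, 3],        # user IDs
                                    columns=[1, 2, 3, 4])   # movie IDs
predicted_ratings_df.to_csv("predicted_ratings.csv")
```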
This data looks just like our original review data, except now every cell is filled in. We can see how each individual user would rate each individual movie; for example, user 3 would give movie 4 a four-star rating. Now that we know all these ratings, we can recommend movies to each user in order of predicted rating.

Let's look at user 1 and see which movies we would recommend. Of all these movies, once we exclude the ones the user has already reviewed, movie 34 has the highest predicted rating, so that is the first movie we should recommend. When the user watches it, we ask them to rate it. If their rating doesn't match our prediction, we add the new rating and recalculate the matrix, which improves our predictions overall. The more ratings we collect, the fewer holes there are in the ratings matrix, and the better our chances of finding accurate values for U and M.
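Picking recommendations for one user then reduces to dropping already-reviewed movies and sorting. A sketch with made-up predicted scores (movie 34's high score mirrors the example above):

```python
import pandas as pd

# Hypothetical predicted ratings for one user (movie_id -> score);
# in the real script these would come from predicted_ratings.csv.
user_ratings = pd.Series({2: 4.5, 7: 3.1, 10: 4.2, 34: 4.9})
already_reviewed = [2, 7]   # movies this user has rated before

# Exclude previously reviewed movies, then rank by predicted rating.
recommendations = (user_ratings.drop(already_reviewed)
                               .sort_values(ascending=False))
print(recommendations.index[0])  # -> 34, the top recommendation
```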
3. How matrix factorization works
Since the rating matrix is the result of multiplying the user attribute matrix by the movie attribute matrix, we can use matrix factorization to work backwards and find the values of U and M. In the code, we do this with an algorithm called low-rank matrix factorization. Let's see how it works.

Matrix factorization is the idea that one large matrix can be decomposed into smaller matrices. So given a big matrix of numbers, our goal is to find two smaller matrices that multiply together to produce that big matrix. If you happen to be an expert in linear algebra, you may know there are standard ways to factor matrices, such as a process called singular value decomposition. However, our case is special and those methods won't work directly: we only know some of the values in the big matrix. Many entries are blank, because the user has not yet reviewed that particular movie. So instead of splitting the rating matrix directly into two smaller matrices, we use an iterative algorithm to estimate the values of the smaller matrices: we guess and check until we get close to the right answer.

First, we create the U and M matrices with all values set to random numbers. Because U and M are both random, multiplying them right now produces random results. The next step is to check how different the ratings computed from the current U and M are from the real rating matrix, ignoring all the spots in the rating matrix where we have no data and only looking where we have actual user reviews. We call this difference the cost: the cost is the error rate. Next, we use a numerical optimization algorithm to search for the minimum cost, adjusting the numbers in U and M a little at a time.
The goal is to bring the cost a little closer to zero with each step. The function we will use is called fmin_cg; it searches for the inputs that make a function return its smallest possible output, and it is provided by the SciPy library. The fmin_cg function loops hundreds of times until we get the smallest cost we can. When the cost is as low as we can make it, the final values of U and M are what we will use. But because they are only approximations, they won't be perfect: when we multiply these U and M matrices to compute movie ratings and compare them with the original ratings, we will still see some differences. As long as we are close, a small difference doesn't matter.
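The guess-and-check loop above can be sketched as a self-contained function. This is a sketch of the general idea, assuming a squared-error cost on the known cells plus L2 regularization; the actual matrix_factorization_utilities.py from the book may differ in its details.

```python
import numpy as np
from scipy.optimize import fmin_cg

def low_rank_matrix_factorization(ratings, num_features=15,
                                  regularization_amount=0.1):
    """Estimate U and M so that U @ M approximates the known ratings.

    `ratings` is a 2-D array with np.nan in the unrated cells.
    """
    num_users, num_movies = ratings.shape
    mask = ~np.isnan(ratings)          # True where we have a real rating
    observed = np.nan_to_num(ratings)  # NaN -> 0; masked out below anyway

    def unpack(x):
        # fmin_cg works on a flat vector, so U and M are packed into one.
        U = x[:num_users * num_features].reshape(num_users, num_features)
        M = x[num_users * num_features:].reshape(num_features, num_movies)
        return U, M

    def cost(x):
        U, M = unpack(x)
        err = (U @ M - observed) * mask     # only score the known cells
        return 0.5 * np.sum(err ** 2) + \
               0.5 * regularization_amount * (np.sum(U ** 2) + np.sum(M ** 2))

    def grad(x):
        # Gradient of the cost, so fmin_cg knows which way to adjust U and M.
        U, M = unpack(x)
        err = (U @ M - observed) * mask
        grad_U = err @ M.T + regularization_amount * U
        grad_M = U.T @ err + regularization_amount * M
        return np.concatenate([grad_U.ravel(), grad_M.ravel()])

    # Start from small random numbers and let fmin_cg minimize the cost.
    x0 = np.random.RandomState(0).normal(
        scale=0.1, size=num_users * num_features + num_features * num_movies)
    x_opt = fmin_cg(cost, x0, fprime=grad, maxiter=300, disp=False)
    return unpack(x_opt)
```

Passing `fprime` lets fmin_cg use the exact gradient instead of estimating it numerically, which is much faster for a matrix of any real size.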
4. Using latent features to find similar products
Search engines are a common way for users to discover new websites. When a user first arrives at your site from a search engine, you know nothing about them, and our recommendation system can't make personalized recommendations until the user has entered some product reviews. In this case, we can show users products that are similar to the one they are already looking at. The goal is to keep them on the site and get them to look at more products. You have probably seen this feature on online shopping sites: "if you like this product, you may also like these other products." Because matrix factorization gives us product attributes, we can use them to compute product similarity. Let's look at find_similar_products.py. First, we load the movie ratings dataset using pandas' read_csv function.
We also use read_csv to load movies.csv into a DataFrame called movies_df.
We then use pandas' pivot_table function to create the ratings matrix, and use matrix factorization to compute the U and M matrices. Now each movie is represented by a column of the matrix, so first we use NumPy's transpose function to flip the matrix, making each column a row.
This just makes the data easier to work with; it doesn't change the data itself. In the matrix, each movie now has 15 unique values that represent its characteristics, which means movies with nearly identical values should be very similar. To find movies like a given one, we just need to find the other movies whose numbers are closest to that movie's numbers. It's just a subtraction problem. Let's pick the movie the user is currently watching: movie ID 5.
You can choose another movie if you like. Now, let's look at the title and genre of movie 5. We do this by looking in the movies_df DataFrame, using pandas' loc function to find the row by its index, and printing the movie's title and genre.
Next, we get movie 5's attributes from the matrix. We have to subtract one here, because M is zero-indexed but movie IDs start at one. Let's print these attributes so we can see them; they are what we will use to find similar movies.
The first step is to subtract this movie's attributes from every other movie's attributes. This one line of code subtracts the current movie's features from each row of the matrix, giving us the attribute differences between the current movie and every other movie in the database. You could also use for loops to subtract one movie at a time, but with NumPy we can do it in a single line. The second step is to take the absolute value of the differences from step one; NumPy's abs function does this, ensuring any negative numbers come out positive. Third, we combine the 15 individual attribute differences for each movie into one total difference score per movie. NumPy's sum function does exactly that; we also pass axis=1 to tell NumPy to sum the numbers in each row and produce a separate total per row. In step four, we save the computed scores back into the list of movies so that we can print each movie's name. In step five, we sort the list of movies by the difference score so that the least different movies appear first; pandas provides the handy sort_values function for this. Finally, in step six, we print the first five movies in the sorted list. These are the movies most similar to the current one.
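The whole similarity calculation can be sketched end to end. The M matrix and movies_df below are small fabricated stand-ins for the real factorization output and movies.csv, so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Stand-ins: in the real script, M comes from matrix factorization and
# movies_df from movies.csv.
rng = np.random.default_rng(42)
M = rng.normal(size=(15, 6))                    # 15 latent features x 6 movies
movies_df = pd.DataFrame({"title": [f"Movie {i}" for i in range(1, 7)],
                          "genre": ["Crime"] * 6},
                         index=range(1, 7))     # movie IDs start at 1

movie_features = np.transpose(M)                # one row per movie
movie_id = 5
current_movie_features = movie_features[movie_id - 1]  # M is zero-indexed

# Step 1: subtract this movie's features from every movie's features.
difference = movie_features - current_movie_features
# Step 2: take absolute values so negatives don't cancel positives.
absolute_difference = np.abs(difference)
# Step 3: collapse the 15 per-attribute differences into one score per movie.
total_difference = np.sum(absolute_difference, axis=1)

# Steps 4-6: attach the scores, sort, and show the five closest movies.
movies_df["difference_score"] = total_difference
sorted_movies = movies_df.sort_values("difference_score")
print(sorted_movies.head(5))
```

The movie itself always comes out first with a difference score of zero, which is a handy sanity check.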
Okay, let's run the program. We can see the 15 attributes we computed for the movie, followed by the five most similar movies we found. The first movie is the one the user is already looking at; the next four are the similar titles we would show the user. Judging by their titles, these movies look very similar: they all seem to be about crime and investigation, and the sequel Big City Judge Three is on the list. These are movies the user might also be interested in. You can change the movie ID and run the program again to see what is similar to other movies.