Building a Spark-Based Recommendation Engine (Python)

The idea behind recommendation engines is to predict what people are likely to like, and to aid this process by finding connections between items.

When I was learning Spark machine learning, the book I followed did everything in Scala, which I was not familiar with, so I worked through it in PySpark instead, and along the way gained a deeper understanding of Spark's implementation of collaborative filtering.

Here we choose collaborative filtering as our recommendation model, and use the matrix-factorization-based implementation in Spark's MLlib recommendation library.

Collaborative Filtering

Simply put, collaborative filtering uses the preferences of a group of people with similar interests and shared experience to recommend information a user is likely to find interesting. Individuals supply their responses (for example, ratings) through a cooperative mechanism, and these responses are recorded in order to filter and screen information for others. Responses are not limited to what is particularly interesting; recording what is particularly uninteresting is also valuable.

A simple example from daily life: we often find movies by asking friends with similar tastes to recommend them. This is the idea behind collaborative filtering.

This is the basic principle of the user-based collaborative filtering recommendation mechanism.

User-based or item-based approaches score items using sets of similar users or items (i.e., neighbors), so they are often referred to as nearest-neighbor models.
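As a minimal sketch of this nearest-neighbor idea, here is user-based scoring with cosine similarity in plain Python; the preference dictionary is invented for illustration, and this is not the matrix factorization model used later:

```python
import math

# toy 'user -> {item: rating}' data, invented for illustration
user_prefs = {
    'u1': {'i1': 3.0, 'i2': 3.0},
    'u2': {'i2': 2.0, 'i3': 4.0},
    'u3': {'i2': 5.0},
}

def cosine(a, b):
    # cosine similarity between two users' rating vectors
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def predict(user, item):
    # similarity-weighted average of the neighbors' ratings for the item
    num = den = 0.0
    for other, prefs in user_prefs.items():
        if other == user or item not in prefs:
            continue
        sim = cosine(user_prefs[user], prefs)
        num += sim * prefs[item]
        den += abs(sim)
    return num / den if den else None

print(predict('u1', 'i3'))  # u1's predicted rating for i3, based on similar users
```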

Matrix decomposition

The data we need to deal with here are the users' own preference data, that is, their ratings of items.

These data can be arranged in a two-dimensional matrix with users as rows and items as columns: the rating matrix $A (m \times n)$ holds the ratings of $m$ users on $n$ items.

user\item    i1     i2     i3
u1           3.0    3.0    ?
u2           ?      2.0    4.0
u3           ?      5.0    ?

Many elements of matrix $A$ are empty; these are called missing values.

Matrix factorization was adopted in collaborative filtering precisely because it supports an incomplete rating matrix, without first estimating and filling in the missing entries.

In a recommender system we want to know how a user would rate every item; when the user has not rated an item, we need to predict whether, and how highly, the user would rate it. This is called matrix completion.

The way to factor this matrix is to find two lower-dimensional matrices whose product approximates the original matrix.

This makes it a dimensionality reduction technique as well. To approximate the 'user-item' matrix at some dimension $k$, we ultimately need an $m \times k$ matrix representing the users and a $k \times n$ matrix representing the items. These two matrices are called factor matrices.

One way to understand $k$: each movie can be evaluated from $k$ angles, i.e., it is described by $k$ feature values.

Because these models directly model the 'user-item' matrix, prediction with them is relatively straightforward: to compute the expected rating of a given user for an item, select the corresponding row from the user factor matrix and the corresponding column from the item factor matrix, and take the dot product of the two.

For example, assume that for user $a$ there is a linear relationship between the user's overall rating of a movie and the movie's feature values:

$\text{score} = a_1 d_1 + a_2 d_2 + a_3 d_3 + a_4 d_4$

where $a_1, \dots, a_4$ are the feature values of user $a$, and $d_1, \dots, d_4$ are the feature values of the movie mentioned above (so here $k = 4$).
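In code, this prediction is just a dot product between a user's factor vector and an item's factor vector. A minimal NumPy sketch with invented factor values (both factor matrices are stored here with one row per user or item):

```python
import numpy as np

# hypothetical factor matrices for k = 4: U is m x k (users), V is n x k (items)
U = np.array([[0.8, 0.1, 0.5, 0.3],   # user u1
              [0.2, 0.9, 0.4, 0.7]])  # user u2
V = np.array([[0.6, 0.3, 0.8, 0.2],   # item i1
              [0.9, 0.5, 0.1, 0.4]])  # item i2

# predicted rating of user u1 for item i2: a1*d1 + a2*d2 + a3*d3 + a4*d4
print(U[0] @ V[1])

# the full predicted rating matrix is U @ V.T (m x n)
print(U @ V.T)
```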

Alternating least squares (ALS)

Alternating Least Squares (ALS) is an optimization method for solving matrix factorization problems. It is powerful, works well, and has proved relatively easy to parallelize, which makes it a good fit for platforms like Spark.

Let $U (m \times k)$ be the user factor matrix, with row $u_i$ the factor vector of user $i$; let $V (n \times k)$ be the item factor matrix, with row $v_j$ the factor vector of item $j$; and let $a_{ij}$ be the rating of user $i$ for item $j$, with $R$ the set of $(i, j)$ pairs for which a rating is known.

The loss function of the matrix decomposition model is as follows

$\large C = \sum\limits_{(i,j)\in R}\left[(a_{ij} - u_iv_j^T)^2+\lambda(\|u_i\|^2+\|v_j\|^2)\right]$


The usual optimization methods fall into two types: alternating least squares and stochastic gradient descent. Spark uses alternating least squares (ALS) to optimize the loss function.


The idea is to generate one factor matrix (say $V$) randomly, fix it and solve for the $U$ that minimizes $C$, then fix $U$ and solve for $V$, and so on alternately until reaching the optimal solution $\min(C)$.
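A minimal single-machine NumPy sketch of this alternation, run on the small example matrix above; the rank, regularization, and iteration count are assumed values, and Spark's ALS performs this style of update in a distributed way:

```python
import numpy as np

# the example rating matrix from above; np.nan marks missing entries
A = np.array([[3.0,    3.0,    np.nan],
              [np.nan, 2.0,    4.0],
              [np.nan, 5.0,    np.nan]])
k, lam, iterations = 2, 0.01, 10       # rank, regularization, sweeps (assumed)

rng = np.random.default_rng(0)
U = rng.random((A.shape[0], k))        # user factors, m x k
V = rng.random((A.shape[1], k))        # item factors, n x k

def solve_rows(fixed, ratings, lam, k):
    # least-squares update for each row of one factor matrix,
    # using only the observed entries of its rating vector
    out = np.zeros((ratings.shape[0], k))
    for i, r in enumerate(ratings):
        mask = ~np.isnan(r)
        F = fixed[mask]                          # factor rows of observed ratings
        out[i] = np.linalg.solve(F.T @ F + lam * np.eye(k), F.T @ r[mask])
    return out

for _ in range(iterations):
    U = solve_rows(V, A, lam, k)       # fix V, solve for U
    V = solve_rows(U, A.T, lam, k)     # fix U, solve for V

print(np.round(U @ V.T, 2))            # completed rating matrix
```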

Using PySpark

Our dataset here is the MovieLens 100K dataset, which contains 100,000 ratings given by 943 users to 1,682 movies.

Download address

Read in the ratings dataset, whose fields are user ID, movie ID, rating, and timestamp, separated by tabs (\t).

The data read by sc.textFile() is an RDD; the first three fields of each line are the user ID, movie ID, and rating.

```python
rawData = sc.textFile('/home/null/hadoop/data/ml-100k/u.data')
rawData.first()
type(rawData)
```
pyspark.rdd.RDD



```python
rawRatings = rawData.map(lambda line: line.split("\t")[:3])
rawRatings.first()
```
['196', '242', '3']
```python
# import the Spark MLlib recommendation library
import pyspark.mllib.recommendation as rd
```

Generate an RDD of Rating-class records

```python
ratings = rawRatings.map(lambda line: rd.Rating(int(line[0]), int(line[1]), float(line[2])))
ratings.first()
```
Rating(user=196, product=242, rating=3.0)

Training the ALS model

  • Rank: the number of factors in the ALS model, i.e., the inner dimension $k$ of the two factor matrices, so that $A \approx UV^T$ with $k \ll m, n$
  • Iterations: the maximum number of iterations the algorithm will run
  • Lambda: controls the regularization of the model, and thereby the degree of over-fitting
```python
model = rd.ALS.train(ratings, 50, 10, 0.01)
model
```
<pyspark.mllib.recommendation.MatrixFactorizationModel at 0x7f53cc29c710>
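The positional arguments here are rank=50, iterations=10, and regularization 0.01. As a sketch, the same call with keyword arguments (PySpark spells the regularization parameter `lambda_`, since `lambda` is a reserved word in Python):

```python
model = rd.ALS.train(ratings, rank=50, iterations=10, lambda_=0.01)
```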



```python
# predict the rating of user 789 for movie 123
predictedRating = model.predict(789, 123)
predictedRating
```
3.1740832151065774



```python
# obtain the top 10 recommendations for user 789
topKRecs = model.recommendProducts(789, 10)
topKRecs
```
[Rating(user=789, product=..., rating=...), ...]

Checking the recommended content

Here, the movie data is read in first and processed into a mapping from movie ID to title.

Then the top 10 movies actually rated by a particular user are compared with the top 10 movies recommended for that user.

```python
# check the recommendations
movies = sc.textFile('/home/null/hadoop/data/ml-100k/u.item')
movies.first()
```
'1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0'



```python
titles_data = movies.map(lambda line: line.split("|")[:2]).collect()
titles = dict(titles_data)
titles
```
```python
# all ratings given by user 789
moviesForUser = ratings.keyBy(lambda rating: rating.user).lookup(789)
type(moviesForUser)
```
list



```python
# the 10 movies user 789 rated most highly
moviesForUser = sorted(moviesForUser, key=lambda r: r.rating, reverse=True)[0:10]
moviesForUser
```
[Rating(user=789, product=..., rating=...), ...]
```python
[(titles[str(r.product)], r.rating) for r in moviesForUser]
```
[('Godfather, The (1972)', 5.0), ('Trainspotting (1996)', 5.0), ('Dead Man Walking (1995)', 5.0), ('Star Wars (1977)', 5.0), ('Swingers (1996)', 5.0), ('Leaving Las Vegas (1995)', 5.0), ('Bound (1996)', 5.0), ('Fargo (1996)', 5.0), ('Last Supper, The (1995)', 5.0), ('Private Parts (1997)', 4.0)]
```python
[(titles[str(r.product)], r.rating) for r in topKRecs]
```
[('Day the Earth Stood Still, The (1951)', ...), ("It's a Wonderful Life (1946)", 5.782039583864358), ('Glory (1989)', 5.665266358968961), ('Fifth Element, The (1997)', 5.551256887914674), ('Shawshank Redemption, The (1994)', ...), ('Rear Window (1954)', ...), ('In the Name of the Father (1993)', 5.442052952711695), ('North by Northwest (1959)', ...), ('Apocalypse Now (1979)', 5.413309515550101), ('Birds, The (1963)', 5.400024900653429)]

Evaluation of the effectiveness of the recommendation model

Mean Squared Error (MSE)

It is defined as the sum of the squared errors divided by their total number, where the squared error is the square of the difference between the predicted rating and the actual rating.

It directly measures the quality of the model's predictions of the target variable.

Root Mean Squared Error (RMSE)

The RMSE is the square root of the MSE; it can be read as the standard deviation of the differences between predicted and actual ratings.
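In symbols, with $R$ the set of rated (user, item) pairs, $a_{ij}$ the actual rating, and $\hat{a}_{ij}$ the model's predicted rating:

$\large MSE = \frac{1}{|R|}\sum\limits_{(i,j)\in R}(a_{ij} - \hat{a}_{ij})^2, \qquad RMSE = \sqrt{MSE}$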

```python
# evaluation metrics
usersProducts = ratings.map(lambda r: (r.user, r.product))
predictions = model.predictAll(usersProducts).map(lambda r: ((r.user, r.product), r.rating))
predictions.first()
```
((316, 1084), 4.006135662882842)
```python
ratingsAndPredictions = ratings.map(lambda r: ((r.user, r.product), r.rating)).join(predictions)
ratingsAndPredictions.first()
```
((186, 302), (3.0, 2.7544572973050236))



```python
from pyspark.mllib.evaluation import RegressionMetrics
# each joined value is an (actual rating, predicted rating) pair
predictionsAndTrue = ratingsAndPredictions.map(lambda line: (line[1][0], line[1][1]))
predictionsAndTrue.first()
```
(3.0, 2.7544572973050236)



```python
# MSE
regressionMetrics = RegressionMetrics(predictionsAndTrue)
regressionMetrics.meanSquaredError
```
0.08509832708963357



```python
# RMSE
regressionMetrics.rootMeanSquaredError
```
0.2917161755707653
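As a quick sanity check, the same metrics can also be computed directly from ratingsAndPredictions with a plain map and mean():

```python
import math

# average squared difference between actual and predicted ratings
mse = ratingsAndPredictions.map(lambda kv: (kv[1][0] - kv[1][1]) ** 2).mean()
rmse = math.sqrt(mse)
print(mse, rmse)  # should agree with RegressionMetrics
```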



Reference:

  • In-depth understanding of Spark ML: collaborative filtering and source-code analysis based on ALS matrix factorization
  • How to explain the principles of the ALS algorithm in Spark MLlib? – Zhihu
  • Machine Learning with Spark – Posts & Telecom Press