0 About the source code
The concepts above are combined with comprehensive hands-on practice, so that what you learn is immediately applied. The recommendation system project explains the basic principles of recommendation systems and the framework for implementing one. Readers with related research and development experience can draw on that background to build their own recommendation system.
1 Introduction to the recommendation system
1.1 What is a recommendation system
1.2 Functions of the recommendation system
1.2.1 Help customers quickly locate their needs and save time
1.2.2 Significantly increase sales volume
1.3 Technical ideas of recommendation system
1.3.1 Recommendation system is an engineering application of machine learning
1.3.2 Recommendation system is based on knowledge discovery principle
1.4 Industrial implementation of recommendation system
- Apache Spark
- Apache Mahout
- SVDFeature (C++)
- LibMF (C++, Chih-Jen Lin)
2 Principles of the recommendation system
Probably the most detailed yet simple introduction to recommendation systems
Official documentation guide
Collaborative filtering
Collaborative filtering is commonly used in recommendation systems. These techniques aim to fill in the missing entries of a user-item association matrix. spark.ml currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. spark.ml uses the alternating least squares (ALS) algorithm to learn these latent factors. The implementation in spark.ml has the following parameters (a configuration sketch follows the list):
- numBlocks: the number of blocks the users and items will be partitioned into in order to parallelize computation (default: 10).
- rank: the number of latent factors in the model (default: 10).
- maxIter: the maximum number of iterations to run (default: 10).
- regParam: the regularization parameter in ALS (default: 1.0).
- implicitPrefs: whether to use the explicit feedback ALS variant or the one adapted for implicit feedback data (default: false, meaning explicit feedback is used).
- alpha: a parameter applicable to the implicit feedback variant of ALS, which governs the baseline confidence in preference observations (default: 1.0).
- nonnegative: whether to use nonnegative constraints for least squares (default: false).
Note: The DataFrame-based ALS API currently only supports integers for user and item ids. Other numeric types are supported for the user and item id columns, but the ids must be within the integer value range.
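As a quick illustration, here is a minimal sketch of how these parameters map onto the ALS builder in spark.ml. The values shown simply mirror the documented defaults, and the column names are assumptions, not tuned recommendations.

import org.apache.spark.ml.recommendation.ALS

// Values mirror the documented defaults; column names are assumptions.
val als = new ALS()
  .setNumBlocks(10)        // blocks for parallel computation
  .setRank(10)             // number of latent factors
  .setMaxIter(10)          // maximum number of iterations
  .setRegParam(1.0)        // regularization parameter
  .setImplicitPrefs(false) // explicit feedback variant
  .setAlpha(1.0)           // only used when implicitPrefs is true
  .setNonnegative(false)   // no nonnegativity constraint on least squares
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")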
Explicit and implicit feedback
The standard approach to collaborative filtering based on matrix factorization treats the entries in the user-item matrix as explicit preferences given by users to items, for example, users assigning ratings to movies.
In many real-world use cases, however, only implicit feedback is available (for example, views, clicks, purchases, likes, shares, and so on). The method used in spark.ml to handle such data is taken from Collaborative Filtering for Implicit Feedback Datasets. In essence, instead of trying to model the rating matrix directly, this approach treats the data as numbers representing the strength of observed user actions (such as the number of clicks, or the cumulative time someone spent watching a movie). These numbers are then related to the level of confidence in observed user preferences, rather than to explicit ratings of items. The model then tries to find latent factors that can be used to predict a user's expected preference for an item.
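For reference, the paper cited above converts each observed action strength r_ui into a binary preference and a confidence weight, where alpha is the ALS parameter described earlier:

p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{if } r_{ui} = 0 \end{cases},
\qquad
c_{ui} = 1 + \alpha \, r_{ui}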
Scaling regularization parameters
When solving each least-squares problem, we scale the regularization parameter regParam by the number of ratings the user generated when updating user factors, or by the number of ratings the product received when updating product factors. This approach is named "ALS-WR" and is discussed in the paper "Large-Scale Parallel Collaborative Filtering for the Netflix Prize". It makes regParam less dependent on the scale of the dataset, so we can apply the best parameters learned from a sampled subset to the full dataset and expect similar performance.
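As a sketch of the idea, the weighted-lambda regularization objective from that paper scales each regularization term by a rating count, where n_{u_i} is the number of ratings given by user i and n_{m_j} is the number of ratings received by movie j:

\min_{U,M} \sum_{(i,j) \in I} \left( r_{ij} - u_i^{\top} m_j \right)^2
+ \lambda \left( \sum_i n_{u_i} \lVert u_i \rVert^2 + \sum_j n_{m_j} \lVert m_j \rVert^2 \right)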
Cold start strategy
When making predictions with an ALS model, it is common to encounter users and/or items in the test dataset that were not present when the model was trained. This typically happens in two scenarios:
- In production, for new users or items with no rating history, on which the model has not been trained (this is the "cold start problem").
- During cross-validation, where the data is split between training and evaluation sets. When using Spark's CrossValidator or TrainValidationSplit with a simple random split, it is actually very common to encounter users and/or items in the evaluation set that are not in the training set.

By default, Spark assigns NaN predictions in ALSModel.transform when a user and/or item factor is not present in the model. This can be useful in a production system, since it indicates a new user or item, so the system can decide on some fallback to use as the prediction.
However, this is undesirable during cross-validation, because any NaN predicted values will result in NaN results for the evaluation metric (for example, when using RegressionEvaluator). This makes model selection impossible.
Spark allows users to set the coldStartStrategy parameter to "drop" in order to drop any rows in the DataFrame of predictions that contain NaN values. The evaluation metric will then be computed over the non-NaN data and will be valid. Usage of this parameter is illustrated in the example below.
Note: the currently supported cold start strategies are "nan" (the default behavior mentioned above) and "drop". Further strategies may be supported in the future.
In the following example, we load rating data from the MovieLens dataset, each row containing the user, movie, rating, and timestamp. We then train an ALS model that, by default, assumes that ratings are explicit (implicitPrefs is false). We evaluate the recommendation model by measuring the root mean square error of the rating prediction.
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS
case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
def parseRating(str: String): Rating = {
val fields = str.split("::")
assert(fields.size == 4)
Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
}
val ratings = spark.read.textFile("data/mllib/als/sample_movielens_ratings.txt")
.map(parseRating)
.toDF()
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))
// Build the recommendation model using ALS on the training data
val als = new ALS()
.setMaxIter(5)
.setRegParam(0.01)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating")
val model = als.fit(training)
// Evaluate the model by computing the RMSE on the test data
// Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
model.setColdStartStrategy("drop")
val predictions = model.transform(test)

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictions)
println(s"Root-mean-square error = $rmse")

// Generate top 10 movie recommendations for each user
val userRecs = model.recommendForAllUsers(10)
// Generate top 10 user recommendations for each movie
val movieRecs = model.recommendForAllItems(10)

// Generate top 10 movie recommendations for a specified set of users
val users = ratings.select(als.getUserCol).distinct().limit(3)
val userSubsetRecs = model.recommendForUserSubset(users, 10)
// Generate top 10 user recommendations for a specified set of movies
val movies = ratings.select(als.getItemCol).distinct().limit(3)
val movieSubSetRecs = model.recommendForItemSubset(movies, 10)
If the rating matrix is derived from another information source (that is, inferred from other signals), you can set implicitPrefs to true for better results:
val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setImplicitPrefs(true)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
3 Recommendation system hands-on coding
3.1 Splitting the dataset
- The dataset is tab-delimited
- Code to split the dataset (a sketch follows this list)
- Split results
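Here is a minimal sketch of splitting a tab-delimited, MovieLens-style ratings file into training and test sets. The file path and the four-column layout (userId, movieId, rating, timestamp separated by tabs) are assumptions for illustration.

import org.apache.spark.sql.SparkSession

// Assumed path and column layout; adjust to the actual dataset.
val spark = SparkSession.builder().appName("SplitDataset").getOrCreate()
import spark.implicits._

val ratings = spark.read.textFile("data/u.data") // hypothetical path
  .map { line =>
    val f = line.split("\t")
    (f(0).toInt, f(1).toInt, f(2).toFloat, f(3).toLong)
  }
  .toDF("userId", "movieId", "rating", "timestamp")

// 80/20 random split; the seed makes the result reproducible.
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2), seed = 42L)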
3.2 Predicting ratings
- Prediction code (a sketch follows this list)
- Prediction results
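A minimal sketch of scoring the held-out test set with a trained ALS model; "model", "test" and the column names reuse the example from section 2.

// Drop rows for users/items unseen in training instead of producing NaN.
val predictions = model
  .setColdStartStrategy("drop")
  .transform(test)

// Inspect predicted vs. actual ratings for a few rows.
predictions.select("userId", "movieId", "rating", "prediction").show(10)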
3.3 MovieLens dataset recommendations
- Recommendation code for the dataset
The MovieLens dataset was compiled by the GroupLens research group at the University of Minnesota (independently of our use of the dataset here). MovieLens is a collection of movie ratings and comes in various sizes. The datasets are named 1M, 10M, and 20M because they contain 1, 10, and 20 million ratings, respectively. The largest dataset uses data from about 140,000 users and covers 27,000 movies. In addition to ratings, the MovieLens data also contains genre information such as "Western" and user-applied tags such as "over the top" and "Arnold Schwarzenegger." These genres and tags are useful for building content vectors. A content vector encodes information about an item, such as color, shape, genre, or really any other attribute, in any form usable by content-based recommendation algorithms.
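As a small illustration (not part of the dataset tooling), a genre-based content vector can be built by one-hot encoding a movie's genres; the genre list here is hypothetical.

// Hypothetical genre vocabulary; a real one would come from the dataset.
val allGenres = Seq("Action", "Comedy", "Drama", "Western")

// One-hot encode a movie's genre set against the vocabulary.
def contentVector(movieGenres: Set[String]): Array[Double] =
  allGenres.map(g => if (movieGenres.contains(g)) 1.0 else 0.0).toArray

// e.g. a Western comedy -> Array(0.0, 1.0, 0.0, 1.0)
val vec = contentVector(Set("Comedy", "Western"))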
MovieLens data has been collected over the past 20 years from college students and people on the Internet. MovieLens has a website where you can sign up, contribute your own ratings, and receive recommendations from one of several recommender algorithms implemented by the GroupLens group.
- User ID
- Movies recommended (pushed) to that user (a listing sketch follows)
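A minimal sketch of listing, for each user ID, the movies recommended to that user; "model" and the column names follow the ALS example above.

// Top 10 movie recommendations per user.
val userRecs = model.recommendForAllUsers(10)

// Each row holds a userId and an array of (movieId, rating) structs;
// explode the array to get one row per recommended movie.
userRecs.selectExpr("userId", "explode(recommendations) as rec")
  .selectExpr("userId", "rec.movieId", "rec.rating")
  .show(20)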
Spark machine learning practice series
- Spark-based machine learning practice (I) – Introduction to machine learning
- Spark-based machine learning practice (II) – Introduction to MLlib
- Spark-based machine learning practice (III) – Hands-on environment setup
- Spark-based machine learning practice (IV) – Data visualization
- Spark-based machine learning practice (VI) – Basic statistics module
- Spark-based machine learning practice (VII) – Regression algorithms
- Spark-based machine learning practice (VIII) – Classification algorithms
- Spark-based machine learning practice (IX) – Clustering algorithms
- Spark-based machine learning practice (X) – Dimensionality reduction algorithms
- Spark-based machine learning practice (XI) – Text sentiment classification project
- Spark-based machine learning practice (XII) – Recommendation system practice