
What is a recommendation system

Almost every system, at some point in its development, runs into the same requirement: how do we recommend the content each user will find most interesting? Recommendation systems were created to solve this problem. Collaborative filtering is based on the idea of “collective intelligence”: people tend to trust the part of the crowd that has something in common with them. For example, if you want to watch a movie but don’t know which one, most people ask around and tend to follow recommendations from friends whose tastes are similar to their own.

This article starts with how a recommendation system is modeled, and then mainly introduces how to build a model with GridSearch, feed it data and algorithms, and run the algorithm and evaluation process.


Modeling of recommendation systems

The input is a data set, split into training and test data, containing users’ ratings of items. From these ratings:

  • User CF: compute the similarity between users, and recommend to a user the items that similar users have rated highly.
  • Item CF: compute the similarity between items, and recommend to a user the items that are similar to the items the user has rated highly.

Then the test set is taken out, the predicted ratings are compared with the actual ones, and the algorithm and model are evaluated.
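As a toy illustration of the User CF idea (made-up users and ratings, not CF4J code), the sketch below recommends to one user an item liked by the other user who shares the most highly-rated items with her:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ToyUserCF {
    public static void main(String[] args) {
        // toy rating table: user -> (item -> rating)
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        ratings.put("alice", Map.of("Matrix", 5.0, "Titanic", 4.0));
        ratings.put("bob", Map.of("Matrix", 5.0, "Titanic", 4.0, "Alien", 5.0));
        ratings.put("carol", Map.of("Titanic", 2.0, "Alien", 1.0));

        // find the user most similar to alice by counting items both rated >= 4
        Set<String> aliceLikes = likedItems(ratings.get("alice"));
        String mostSimilar = null;
        int bestOverlap = -1;
        for (String other : ratings.keySet()) {
            if (other.equals("alice")) continue;
            Set<String> common = new HashSet<>(likedItems(ratings.get(other)));
            common.retainAll(aliceLikes);
            if (common.size() > bestOverlap) {
                bestOverlap = common.size();
                mostSimilar = other;
            }
        }

        // recommend the items the most similar user liked that alice has not rated yet
        for (String item : likedItems(ratings.get(mostSimilar))) {
            if (!ratings.get("alice").containsKey(item)) {
                System.out.println("recommend to alice: " + item);
            }
        }
    }

    // items a user rated 4.0 or higher
    static Set<String> likedItems(Map<String, Double> userRatings) {
        Set<String> liked = new HashSet<>();
        for (Map.Entry<String, Double> e : userRatings.entrySet()) {
            if (e.getValue() >= 4.0) liked.add(e.getKey());
        }
        return liked;
    }
}

Running it prints "recommend to alice: Alien": bob shares the most liked movies with alice, so his other liked movie is recommended to her. Item CF would instead start from the items alice liked and look for items similar to them.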

Grid search architecture

As can be seen from the recommendation system modeling, the basic flow of GridSearch is as follows:

The difference is that GridSearch evaluates a range of algorithms and hyperparameters and selects the ones that work best for the situation at hand.
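As a rough sketch of that flow in plain Java (a generic grid search with a placeholder scoring function, not CF4J's implementation):

import java.util.List;

public class GridSearchSketch {
    // stand-in for "fit a model with these hyperparameters and score it on the test set"
    static double evaluate(int numberOfNeighbors, String metric) {
        return Math.random(); // placeholder score
    }

    public static void main(String[] args) {
        List<Integer> neighborOptions = List.of(25, 50, 100);
        List<String> metricOptions = List.of("cosine", "jaccard");

        double bestScore = Double.NEGATIVE_INFINITY;
        String bestParams = null;

        // try every combination of hyperparameters and keep the best-scoring one
        for (int k : neighborOptions) {
            for (String metric : metricOptions) {
                double score = evaluate(k, metric);
                if (score > bestScore) {
                    bestScore = score;
                    bestParams = "numberOfNeighbors=" + k + ", metric=" + metric;
                }
            }
        }
        System.out.println("best: " + bestParams + " (score " + bestScore + ")");
    }
}

The real GridSearch does the same thing, except that evaluate() becomes "build a recommender, fit it, and score it with a quality measure", as shown in the fit method later in this article.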

Collaborative recommendation algorithm

A lot of work has been done on collaborative recommendation algorithms. It falls roughly into three directions (a small cosine-similarity sketch follows the list):

  • K-nearest-neighbors (KNN) algorithms
    • Jaccard similarity
    • PIP similarity
    • Cosine similarity
  • Algorithms based on matrix factorization
    • PMF (Probabilistic Matrix Factorization)
  • Deep learning: recommendations based on neural network algorithms
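Taking the first direction as an example, cosine similarity treats each user's ratings as a vector and measures the angle between the two vectors. A minimal generic sketch (not CF4J's Cosine implementation; users are assumed to be dense arrays over the same item index space, with 0 meaning "not rated"):

public class CosineSketch {
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0; // a user with no ratings has no similarity
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] userA = {5, 3, 0, 4};
        double[] userB = {4, 0, 2, 5};
        System.out.println(cosine(userA, userB)); // closer to 1 = more similar tastes
    }
}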

How does the recommendation system work

Demo: UserKNNGridSearch

This demo runs a GridSearch based on the UserKNN collaborative filtering algorithm:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
// plus the CF4J imports for DataModel, BenchmarkDataModels, ParamsGrid, GridSearch,
// UserKNN, the UserSimilarityMetric implementations and Precision

public class UserKNNGridSearch {
    public static void main(String[] args) throws IOException {
        DataModel datamodel = BenchmarkDataModels.MovieLens100K();

        ParamsGrid grid = new ParamsGrid();
        grid.addParam("numberOfNeighbors", new int[] {25, 50, 75, 100, 200, 300});
        grid.addParam(
                "metric",
                new UserSimilarityMetric[] {
                        new Correlation(),
                        new Cosine(),
                        new Jaccard()
                });
        grid.addFixedParam("aggregationApproach", UserKNN.AggregationApproach.DEVIATION_FROM_MEAN);

        // Precision parameters
        Map<String, Object> precisionParams = new HashMap<>();
        precisionParams.put("numberOfRecommendations", 3);
        precisionParams.put("relevantThreshold", 4.0);

        GridSearch gs =
                new GridSearch(datamodel, grid, UserKNN.class, Precision.class, precisionParams);
        gs.fit();
        gs.printResults(5, false);
    }
}
  • BenchmarkDataModels.MovieLens100K() loads the MovieLens 100K dataset of users’ ratings of movies (about 100k ratings)
  • Three UserSimilarityMetric implementations are passed to GridSearch: Correlation, Cosine, and Jaccard
  • numberOfNeighbors {25, 50, 75, 100, 200, 300} is set as a hyperparameter for prediction
  • DEVIATION_FROM_MEAN (the deviation-from-mean method) is fixed as the prediction approach; it is introduced in detail below
  • Precision is used as the evaluation measure
  • Finally, the best algorithm + parameter combinations are printed
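With six values of numberOfNeighbors, three similarity metrics, and a fixed aggregation approach, the grid contains 6 × 3 = 18 parameter combinations for GridSearch to fit and evaluate.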

The results

0.805357 for {metric=Jaccard, numberOfNeighbors=200, aggregationApproach=DEVIATION_FROM_MEAN}
0.804167 for {metric=Jaccard, numberOfNeighbors=300, aggregationApproach=DEVIATION_FROM_MEAN}
0.786310 for {metric=Jaccard, numberOfNeighbors=75, aggregationApproach=DEVIATION_FROM_MEAN}

How GridSearch works

As you can see, the entry point of GridSearch’s work is the fit method, as follows:

public void fit() {

  Iterator<Map<String, Object>> iter = grid.getDevelopmentSetIterator(true, seed);

  int i = 0;
  while (i < this.numIters && iter.hasNext()) {
    Map<String, Object> params = iter.next();
    i++;

    Recommender recommender = null;

    try {
      recommender =
          this.recommenderClass
              .getConstructor(DataModel.class, Map.class)
              .newInstance(this.datamodel, params);
    } catch (Exception e) {
      // reflection exceptions; exception handling simplified here
      e.printStackTrace();
    }

    if (recommender != null) {
      recommender.fit();
    }

    QualityMeasure qm = null;

    try {
      if (this.qualityMeasureParams == null || this.qualityMeasureParams.isEmpty()) {
        qm = this.qualityMeasureClass.getConstructor(Recommender.class).newInstance(recommender);
      } else {
        qm =
            this.qualityMeasureClass
                .getConstructor(Recommender.class, Map.class)
                .newInstance(recommender, this.qualityMeasureParams);
      }
    } catch (Exception e) {
      // reflection exceptions; exception handling simplified here
      e.printStackTrace();
    }

    if (qm != null) {
      double score = qm.getScore();
      results.add(new Pair<>(params.toString(), score));
    }
  }
}

It is basically a while loop that, for each parameter combination, does the following:

  • Take one combination of hyperparameters and algorithm and build a recommender from it
  • The recommender fits itself, computing the similarity matrix and the neighbors matrix
  • A QualityMeasure object is constructed around the recommender
  • The QualityMeasure object predicts the test set internally, compares the predictions with the actual data, and computes a score

How the calculations are done

Both the Recommender and the QualityMeasure object involve a lot of computation and prediction, so parallel computation is generally used to make full use of the system’s cores.

The process for UserKNN is as follows:

@Override
public void fit() {
  System.out.println("\nFitting " + this.toString());
  Parallelizer.exec(this.datamodel.getUsers(), this.metric);
  Parallelizer.exec(this.datamodel.getUsers(), new UserNeighbors());
}
  • Firstly, the similarity matrix of users is calculated in parallel.
  • The neighbors matrix is then computed in parallel.
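Parallelizer itself is not shown here; purely as an illustration of the idea (not CF4J's actual implementation), per-user work can be spread over a fixed thread pool with standard Java concurrency:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelSketch {
    public static void main(String[] args) throws InterruptedException {
        List<Integer> userIndexes = List.of(0, 1, 2, 3, 4, 5, 6, 7);
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        for (int userIndex : userIndexes) {
            // each task would compute one row of the similarity (or neighbors) matrix
            pool.submit(() -> System.out.println("processing user " + userIndex));
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}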

Calculate the user’s similarity matrix

UserSimilarityMetric calculates the user’s similarity matrix as follows:

@Override
public void run(User user) {
  int userIndex = user.getUserIndex();

  for (int u = 0; u < datamodel.getNumberOfUsers(); u++) {
    User otherUser = datamodel.getUser(u);
    if (userIndex == otherUser.getUserIndex()) {
      similarities[userIndex][u] = Double.NEGATIVE_INFINITY;
    } else {
      similarities[userIndex][u] = this.similarity(user, otherUser);
    }
  }
}
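Note that a user's similarity to themselves is set to Double.NEGATIVE_INFINITY, presumably so that a user is never picked as one of their own neighbors when the top-N neighbors are selected later.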

run computes the given user’s similarity to every other user. As mentioned above, Jaccard is one of the algorithms used to calculate that similarity:


public class Jaccard extends UserSimilarityMetric {

  @Override
  public double similarity(User user, User otherUser) {

    int i = 0, j = 0, common = 0;
    while (i < user.getNumberOfRatings() && j < otherUser.getNumberOfRatings()) {
      if (user.getItemAt(i) < otherUser.getItemAt(j)) {
        i++;
      } else if (user.getItemAt(i) > otherUser.getItemAt(j)) {
        j++;
      } else {
        common++;
        i++;
        j++;
      }
    }

    // If there are no items in common, the similarity does not exist
    if (common == 0) return Double.NEGATIVE_INFINITY;

    // Return similarity
    return (double) common
        / (double) (user.getNumberOfRatings() + otherUser.getNumberOfRatings() - common);
  }
}

The Jaccard formula is:


\frac{|a \cap b|}{|a \cup b|}

Jaccard measures the similarity between two users by the items they have both rated: the more items two users have rated in common, the more similar the two users are considered to be.
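For example, if one user has rated items {1, 2, 3, 5} and another has rated items {2, 3, 4}, they have 2 items in common out of 4 + 3 - 2 = 5 distinct items, so their Jaccard similarity is 2 / 5 = 0.4.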

Calculating the user’s neighbors matrix

This task computes each user’s neighborhood, which is used later for the actual prediction. The code is as follows:

private class UserNeighbors implements Partible<User> {

  @Override
  public void beforeRun() {}

  @Override
  public void run(User user) {
    int userIndex = user.getUserIndex();
    double[] similarities = metric.getSimilarities(userIndex);
    neighbors[userIndex] = Search.findTopN(similarities, numberOfNeighbors);
  }

  @Override
  public void afterRun() {}
}

This process is simple: for each user, find the neighbors with the highest similarity and place them in the neighbors matrix. numberOfNeighbors is a hyperparameter, configured in the demo above.
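Search.findTopN itself is not shown in this article; a minimal sketch of its assumed behavior (return the indices of the N largest values, padded with -1 when there are fewer candidates; an assumption for illustration, not CF4J's actual code):

import java.util.Arrays;
import java.util.Comparator;

public class TopNSketch {
    static int[] findTopN(double[] values, int n) {
        Integer[] indices = new Integer[values.length];
        for (int i = 0; i < values.length; i++) indices[i] = i;
        // sort the indices by their value, descending
        Arrays.sort(indices, Comparator.comparingDouble((Integer i) -> values[i]).reversed());

        int[] top = new int[n];
        Arrays.fill(top, -1); // pad with -1 when there are fewer candidates than n
        for (int i = 0; i < n && i < values.length; i++) {
            // stop once the remaining candidates have no similarity at all
            if (values[indices[i]] == Double.NEGATIVE_INFINITY) break;
            top[i] = indices[i];
        }
        return top;
    }

    public static void main(String[] args) {
        double[] sims = {0.2, Double.NEGATIVE_INFINITY, 0.9, 0.5};
        System.out.println(Arrays.toString(findTopN(sims, 3))); // [2, 3, 0]
    }
}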

Prediction method

UserKNN.AggregationApproach offers a variety of prediction methods for predicting a user’s rating of an item:

  • MEAN: average the ratings that the user’s neighbors gave the item

  • WEIGHTED_MEAN: a weighted mean of the neighbors’ ratings of the item, weighted by their similarity to the user

  • DEVIATION_FROM_MEAN (deviation from mean), shown below:

private double predictDeviationFromMean(int userIndex, int itemIndex) {
  User user = this.datamodel.getUser(userIndex);
  double[] similarities = metric.getSimilarities(userIndex);

  double num = 0;
  double den = 0;

  for (int neighborIndex : this.neighbors[userIndex]) {
    if (neighborIndex == -1)
      break; // Neighbors array are filled with -1 when no more neighbors exists

    User neighbor = this.datamodel.getUser(neighborIndex);

    int pos = neighbor.findItem(itemIndex);
    if (pos != -1) {
      double similarity = similarities[neighborIndex];
      double rating = neighbor.getRatingAt(pos);
      double avg = neighbor.getRatingAverage();

      num += similarity * (rating - avg);
      den += similarity;
    }
  }

  return (den == 0) ? Double.NaN : user.getRatingAverage() + num / den;
}

The process above:

  • The user’s similarity vector is first obtained from the similarity metric
  • Then, for each of the user’s neighbors:
    • If the neighbor has rated the item, take the neighbor’s rating, the neighbor’s rating average avg, and the similarity between the user and that neighbor, then:
      • num += (rating - avg) * similarity
      • den += similarity

  • All of the user’s neighbors are traversed according to this rule
    • If, among the n neighbors, at least one has rated the item, the predicted value is:

    user.getRatingAverage() + \frac{\sum\limits_1^n (rating - avg) \times similarity}{\sum\limits_1^n similarity}
    • Otherwise, the rating cannot be predicted (Double.NaN is returned)
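For example (illustrative numbers only): suppose the user's rating average is 3.5 and two neighbors have rated the item. Neighbor 1 (similarity 0.8, rating 4, average 3) contributes 0.8 × (4 − 3) = 0.8 to num; neighbor 2 (similarity 0.5, rating 2, average 3.5) contributes 0.5 × (2 − 3.5) = −0.75. The prediction is therefore 3.5 + (0.8 − 0.75) / (0.8 + 0.5) ≈ 3.54.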

Such an aggregation approach is called DEVIATION_FROM_MEAN, and it is the prediction method used in the demo above.

Evaluation methods

The entry point to the evaluation method is the QualityMeasure class’s getScore method, which executes the following run method in parallel:

public void run(TestUser testUser) {
  int testUserIndex = testUser.getTestUserIndex();
  double[] predictions = recommender.predict(testUser);
  usersScores[testUserIndex] = getScore(testUser, predictions);
}

@Override
public void afterRun() {
  double sum = 0;
  int count = 0;
  for (double us : usersScores) {
    if (!Double.isNaN(us)) {
      sum += us;
      count++;
    }
  }
  score = sum / count;
}

Each user’s score is computed in parallel, and then the average is taken; this average is the score of the algorithm.

So how is each user’s score calculated? Through the getScore method of the Precision class, as follows:

protected double getScore(TestUser testUser, double[] predictions) {

  // Items that has been recommended and was relevant to the active user

  int[] recommendations = Search.findTopN(predictions, this.numberOfRecommendations);

  int recommendedAndRelevant = 0, recommended = 0;

  for (int pos : recommendations) {
    if (pos == -1) break;

    double rating = testUser.getTestRatingAt(pos);
    if (rating >= this.relevantThreshold) {
      recommendedAndRelevant++;
    }

    recommended++;
  }

  return (double) recommendedAndRelevant / (double) recommended;
}
  • Top N: this.numberOfRecommendations is an input parameter
  • Select the Top N predictions, and count a recommendation as relevant if the test user’s actual rating is at least relevantThreshold (4.0 in the demo)
  • The test user’s score is:

    recommendedAndRelevant / recommended

    Among them:

    • recommendedAndRelevant is the number of recommended items whose actual rating is at least the relevantThreshold hyperparameter
    • recommended is the number of valid recommendations actually made (entries equal to -1 are skipped)
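For example, with numberOfRecommendations = 3 and relevantThreshold = 4.0 as in the demo: if two of the three recommended items have a test rating of at least 4.0, that test user's precision is 2 / 3 ≈ 0.67, and the final score reported by GridSearch is the average of these per-user precisions.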

Reference

This article refers to the implementation of CF4J: github.com/ferortega/c…