Introduction to recommendation systems: study notes


I. Introduction to recommendation systems

1.1 Concepts and Background

What is a recommendation system

When users without a clear need visit a service, the sheer number of items in that service overloads them with information. A system that ranks the items according to certain rules and shows the top-ranked items to the user is a recommendation system.
Information overload & unclear user needs
- Directory (category navigation): covers only a small number of popular sites. Typical applications: Hao123, Yahoo
- Search engines: identify the user's need from search terms. Typical applications: Google, Baidu
- Recommendation systems: do not require users to state a clear need; they model users’ interests by analyzing historical behavior and proactively recommend information that matches those interests and needs

Recommendation systems & search engines

	Search	Recommendation
Behavior	Active	Passive
Intent	Clear	Fuzzy
Personalization	Weak	Strong
Traffic distribution	Matthew effect	Long tail
Goal	Satisfy the need quickly	Continuous service
Evaluation metrics	Simple	Complex

1.2 The working principle and function of the recommendation system

How the recommender system works
- Social recommendation
- Content-based recommendations
- Recommendations based on popularity
- Collaborative filtering based recommendations: Find users with similar historical interests
The role of the recommendation system
- Connect users and objects efficiently
- Increase user stay time and user activity
- Effectively help products to achieve their commercial value

1.3 Differences between recommendation systems and Web projects

Improving target metrics through information filtering vs. providing a stable information service
- Web projects: handle complex business logic and high concurrency, and build a stable information service for users
- Recommendation systems: pursue growth in metrics such as retention rate, reading time, gross merchandise volume (GMV), and video views
Deterministic vs. probabilistic thinking
- Web projects: Have firm expectations of results
- Recommendation systems: The result is a matter of probability

II. Recommendation system design

2.1 Recommendation system elements

UI and UE (user experience)
Data (Lambda architecture)
Business knowledge
Algorithms

2.2 Recommendation system architecture

Lambda architecture for big data
- The Lambda architecture is a big data processing architecture proposed by Nathan Marz, the author of the real-time stream processing framework Storm.
- It integrates offline (batch) computing with real-time computing, and is designed to provide the key properties of a real-time big data system: high fault tolerance, low latency, and scalability.
- Layered architecture
  - Batch layer
    - Data is immutable, arbitrary computations can be run on it, and it scales horizontally
    - High latency, from several minutes to several hours (depending on the computation and the data volume)
    - Log collection: Flume
    - Distributed storage: Hadoop
    - Distributed computing: Hadoop and Spark
    - View storage databases
      - NoSQL (HBase/Cassandra)
      - Redis/Memcached
      - MySQL
  - Real-time processing layer (speed layer)
    - Stream processing, continuous computation
    - Stores and analyzes data within a time window (top sales over a period, real-time hot searches, etc.)
    - Real-time data collection: Flume and Kafka
    - Real-time data analysis: Spark Streaming/Storm/Flink
  - Serving layer
    - Random read support
    - Results need to be returned in a very short time
    - Read and merge batch and real-time layer results
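
To make the "read and merge" step of the serving layer concrete, here is a minimal sketch. It assumes both layers expose simple per-item counters; the two dictionaries, the item names, and the helper functions are hypothetical stand-ins for real view stores and are not part of the original notes.

```python
# Hypothetical precomputed views; in a real system these would live in
# stores such as HBase (batch view) and Redis (real-time view).
batch_view = {"item_a": 120, "item_b": 75, "item_c": 40}   # counts up to the last batch run
realtime_view = {"item_b": 30, "item_d": 12}               # counts since the last batch run


def merged_counts(batch, realtime):
    """Serving-layer query: combine batch and real-time results per key."""
    merged = dict(batch)
    for key, count in realtime.items():
        merged[key] = merged.get(key, 0) + count
    return merged


def top_items(batch, realtime, k):
    """Return the k items with the highest merged counts."""
    merged = merged_counts(batch, realtime)
    return sorted(merged, key=merged.get, reverse=True)[:k]


hot_items = top_items(batch_view, realtime_view, 2)
print(hot_items)
```

Summing the two views works here only because counts are additive; other aggregates would need a different merge rule.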
Recommendation algorithm architecture
- Recall stage (coarse candidate selection)
  - Recall determines the ceiling of the final recommendation results
  - Commonly used algorithms
    - Collaborative filtering
    - Content-based
- Ranking stage (fine selection)
  - Recall sets the ceiling of the final recommendation results; ranking approaches that ceiling and determines the final recommendation quality
  - CTR estimation (commonly with logistic regression, LR) estimates whether a user will click on an item; it requires user click data
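
The two stages can be sketched as follows. This is only an illustration under stated assumptions: the recalled candidates, the item features, and the LR weights and bias are all made-up values (in practice the weights are learned from click logs), and the function names are hypothetical.

```python
import math

# Hypothetical data: candidates per user from a recall step, and per-item features.
recalled = {"user1": ["item_a", "item_b", "item_c"]}
item_features = {
    "item_a": [1.0, 0.2],
    "item_b": [0.3, 0.9],
    "item_c": [0.1, 0.1],
}
# Hypothetical LR parameters; in practice they are learned from click logs.
weights = [1.5, -0.5]
bias = -0.2


def predict_ctr(features):
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def rank(user, k):
    """Ranking stage: order the recalled candidates by estimated CTR."""
    candidates = recalled[user]
    scored = [(item, predict_ctr(item_features[item])) for item in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


ranked = rank("user1", 2)
print(ranked)
```

Ranking can only reorder what recall returned, which is why recall sets the ceiling on the final result.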

III. Recommendation algorithm

3.1 Recommendation model construction process

Data -> Features -> ML Algorithm -> Prediction Output

Data cleaning/data processing
- Data sources
  - Explicit data
    - Ratings
    - Comments/reviews
  - Implicit data
    - Order history
    - Add-to-cart events
    - Page views
    - Clicks
    - Search records
- Data volume, and whether the data meets the requirements
Feature engineering
- Select features from the data
  - A given item may be purchased by users with similar tastes or needs
  - Use user behavior data to describe items
- Represent the features as data
  - Combine all user behavior to form a user-item matrix
Choose a suitable algorithm
- Collaborative filtering
- Content-based
Generate recommendation results
- Evaluate the recommendation results, and go online once the evaluation passes
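
The user-item matrix mentioned under feature engineering can be sketched like this. The event log below is hypothetical sample data, and a nested dictionary is used only to keep the example dependency-free.

```python
# Hypothetical behavior log: (user, item) pairs, e.g. from purchases or clicks.
events = [
    ("User1", "Item A"), ("User1", "Item C"),
    ("User2", "Item A"), ("User2", "Item D"),
    ("User3", "Item C"),
]

users = sorted({user for user, _ in events})
items = sorted({item for _, item in events})

# matrix[user][item] is 1 if the user interacted with the item, else 0.
matrix = {user: {item: 0 for item in items} for user in users}
for user, item in events:
    matrix[user][item] = 1

for user in users:
    print(user, [matrix[user][item] for item in items])
```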

3.2 The most classical recommendation algorithm: collaborative filtering recommendation algorithm

Collaborative Filtering

Core idea: birds of a feather flock together

The basic collaborative filtering recommendation algorithm rests on the following assumptions:

“You are likely to like what people similar to you like”: user-based collaborative filtering (User-based CF)
“You are likely to like things similar to what you already like”: item-based collaborative filtering (Item-based CF)

Implementing collaborative filtering involves the following steps:

Find the most similar users or items: the top-N similar users or items

Compute the similarity of every pair and sort by it to find the top-N similar users or items
Generate recommendations from the similar users or items

Use the top-N results to generate initial recommendations, then filter out items the user already has a record of or has explicitly shown no interest in

As a simple example, take a data set of users’ purchase records: a tick indicates that the user has purchased the item

A simple idea for computing similarity: suppose classmate X's hobbies are {football, basketball, table tennis} and classmate Y's hobbies are {tennis, football, basketball, badminton}. They have 2 hobbies in common, so their similarity can be expressed as 2/3 * 2/4 = 1/3 ≈ 0.33.
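
The hobby example can be checked with a few lines of code. The function name is an illustrative choice, not an established similarity measure from a library.

```python
x_hobbies = {"football", "basketball", "table tennis"}
y_hobbies = {"tennis", "football", "basketball", "badminton"}


def overlap_similarity(a, b):
    """Share of a's elements found in b, times share of b's elements found in a."""
    common = len(a & b)
    return (common / len(a)) * (common / len(b))


similarity = overlap_similarity(x_hobbies, y_hobbies)
print(round(similarity, 2))  # 2/3 * 2/4 -> 0.33
```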

3.3 Similarity calculation

Similarity calculation methods
Euclidean distance
- Euclidean distance measures distance in Euclidean space. If two objects are represented as two points p and q in the same space, each with n coordinates, the Euclidean distance is the distance between those two points. Euclidean distance is not suitable for Boolean vectors
- Euclidean distance is non-negative and unbounded above. A similarity score is usually expected to lie in [-1,1] or [0,1], so the distance is generally converted into a similarity with a transformation formula
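
A minimal sketch of such a conversion, assuming the widely used choice $sim(p,q)=\frac{1}{1+E(p,q)}$ with $E(p,q)=\sqrt{\sum_{i=1}^{n}(p_i-q_i)^2}$. This particular transformation is an assumption and may differ from the one the notes had in mind; it maps distance 0 to similarity 1 and large distances toward 0. The function names are illustrative.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def euclidean_similarity(p, q):
    """Assumed transformation 1 / (1 + distance), which lies in (0, 1]."""
    return 1.0 / (1.0 + euclidean_distance(p, q))


print(euclidean_distance([0, 0], [3, 4]))    # 5.0
print(euclidean_similarity([1, 2], [1, 2]))  # 1.0
```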
Cosine similarity
- It measures the angle between two vectors and uses the cosine of that angle as the similarity
  - If the angle between the two vectors is 0, the cosine is 1; if the angle is 90 degrees, the cosine is 0; if the angle is 180 degrees, the cosine is -1
  - Cosine similarity is commonly used to measure text similarity, user similarity, and item similarity
  - Cosine similarity is independent of vector length, because the calculation normalizes by the vector lengths. As long as two vectors point in the same direction, they are regarded as 'similar' regardless of their magnitude.
Pearson correlation coefficient
- It is essentially cosine similarity computed on centered vectors: subtract from each of the vectors a and b its own mean, then calculate the cosine similarity
- The Pearson coefficient lies in [-1, 1]: -1 means negative correlation and 1 means positive correlation
- It measures whether two variables increase and decrease together
- Because it measures whether the trends of two variables are consistent, it is not suitable for calculating the correlation between Boolean vectors
Jaccard similarity
- The ratio of the size of the intersection of two sets to the size of their union; well suited to Boolean vector representations
- The numerator is the dot product of the two Boolean vectors, which gives the number of elements in the intersection
- The denominator is the element-wise OR of the two Boolean vectors, summed over all elements
How to choose a similarity measure
- Cosine similarity/Pearson correlation coefficient suits user rating data (real values)
- Jaccard similarity suits implicit feedback data (0/1 Boolean values: bookmarks, clicks, add-to-cart)
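
The three measures above can be written out directly from their descriptions. The following is a dependency-free sketch with illustrative function names; it does not handle degenerate inputs such as all-zero or constant vectors, which would cause a division by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_similarity(a, b):
    """Cosine similarity after subtracting each vector's own mean."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


def jaccard_similarity(a, b):
    """Intersection size over union size for two 0/1 vectors."""
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return intersection / union


print(cosine_similarity([1, 0], [0, 1]))                     # orthogonal vectors
print(pearson_similarity([1, 2, 3], [2, 4, 7]))              # same trend, close to 1
print(jaccard_similarity([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # 2 shared of 4 -> 0.5
```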

3.4 Code implementation of collaborative filtering recommendation algorithm

Import the packages

import pandas as pd
import numpy as np

Building a data set

users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
# Build the data set: "buy" marks a purchase, None means no purchase record
datasets = [
    ["buy", None, "buy", "buy", None],
    ["buy", None, None, "buy", "buy"],
    ["buy", None, "buy", None, None],
    [None, "buy", None, "buy", "buy"],
    ["buy", "buy", "buy", None, "buy"],
]

For computation, data usually has to be processed or encoded first. In this simple case we use 1 and 0 to indicate whether or not the user has bought the item, so the data set actually looks like this:

users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
# User purchase records: 1 = purchased, 0 = not purchased
datasets = [
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
]

import pandas as pd

# Rows are users, columns are items
df = pd.DataFrame(datasets,
                  columns=items,
                  index=users)
print(df)

With the data set in place, we can calculate similarity. There are many similarity measures, such as cosine similarity, the Pearson correlation coefficient, and Jaccard similarity. Here we use the Jaccard similarity coefficient, which lies in [0,1].

from sklearn.metrics import jaccard_score
# Directly calculate the Jaccard similarity coefficient of two items
# (jaccard_similarity_score was removed from scikit-learn; jaccard_score replaces it)
# Calculate the similarity between Item A and Item B
print(jaccard_score(df["Item A"], df["Item B"]))

# Calculate the Jaccard similarity coefficient for all pairs
from sklearn.metrics.pairwise import pairwise_distances
# Calculate the similarity between users: 1 - Jaccard distance = Jaccard similarity
# (the "jaccard" metric works on Boolean data, so convert the 0/1 values first)
user_similar = 1 - pairwise_distances(df.values.astype(bool), metric="jaccard")
user_similar = pd.DataFrame(user_similar, columns=users, index=users)
print("Pairwise similarity between users:")
print(user_similar)

# Calculate the similarity between items
item_similar = 1 - pairwise_distances(df.T.values.astype(bool), metric="jaccard")
item_similar = pd.DataFrame(item_similar, columns=items, index=items)
print("Pairwise similarity between items:")
print(item_similar)

With pairwise similarity, you can then filter top-N similarity results and make recommendations

User-Based CF

import pandas as pd
import numpy as np
from pprint import pprint

users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
# User purchase record data set
datasets = [
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
]

df = pd.DataFrame(datasets,
                  columns=items,
                  index=users)

# Calculate the Jaccard similarity coefficient for all pairs
from sklearn.metrics.pairwise import pairwise_distances
# Calculate the similarity between users: 1 - Jaccard distance = Jaccard similarity
user_similar = 1 - pairwise_distances(df.values.astype(bool), metric="jaccard")
user_similar = pd.DataFrame(user_similar, columns=users, index=users)
print("Pairwise similarity between users:")
print(user_similar)

topN_users = {}
# Iterate over each row of the similarity matrix
for i in user_similar.index:
    # Take the row for user i and drop the user itself
    _df = user_similar.loc[i].drop([i])
    # sort_values: sort by descending similarity
    _df_sorted = _df.sort_values(ascending=False)
    # Slice the first two (the two most similar) from the sorted results
    top2 = list(_df_sorted.index[:2])
    topN_users[i] = top2

print("Top2 similar users:")
pprint(topN_users)

# Prepare an empty dict to store the recommendations
rs_results = {}
# Iterate over all users and their most similar users
for user, sim_users in topN_users.items():
    rs_result = set()  # Recommendation results for this user
    for sim_user in sim_users:
        # Build the initial recommendations: everything the similar users bought
        # (.loc replaces .ix, which was removed from pandas)
        rs_result = rs_result.union(set(df.loc[sim_user].replace(0, np.nan).dropna().index))
    # Filter out items the user has already purchased
    rs_result -= set(df.loc[user].replace(0, np.nan).dropna().index)
    rs_results[user] = rs_result
print("Final recommendation:")
pprint(rs_results)

Item-Based CF

import pandas as pd
import numpy as np
from pprint import pprint

users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
# User purchase record data set
datasets = [
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
]

df = pd.DataFrame(datasets,
                  columns=items,
                  index=users)

# Calculate the Jaccard similarity coefficient for all pairs
from sklearn.metrics.pairwise import pairwise_distances
# Calculate the similarity between items
item_similar = 1 - pairwise_distances(df.T.values.astype(bool), metric="jaccard")
item_similar = pd.DataFrame(item_similar, columns=items, index=items)
print("Pairwise similarity between items:")
print(item_similar)

topN_items = {}
# Iterate over each row of the similarity matrix
for i in item_similar.index:
    # Take the row for item i and drop the item itself, then sort by descending similarity
    _df = item_similar.loc[i].drop([i])
    _df_sorted = _df.sort_values(ascending=False)

    top2 = list(_df_sorted.index[:2])
    topN_items[i] = top2

print("Top2 similar items:")
pprint(topN_items)

rs_results = {}
# Build recommendation results
for user in df.index:  # Iterate over all users
    rs_result = set()
    # Iterate over the items this user has already purchased
    # (.loc replaces .ix, which was removed from pandas)
    for item in df.loc[user].replace(0, np.nan).dropna().index:
        # Build the initial recommendations from the top-N items most similar to each purchased item
        rs_result = rs_result.union(topN_items[item])
    # Filter out items the user has already purchased
    rs_result -= set(df.loc[user].replace(0, np.nan).dropna().index)
    # Add to the results
    rs_results[user] = rs_result

print("Final recommendation:")
pprint(rs_results)

3.5 Data set used by collaborative filtering algorithm

The previous demo only used purchase records; the same approach works for browsing records, listening records, and so on. Predictions from such data effectively tell us only whether a user is interested in an item; they cannot predict the degree of preference well.

For this reason, collaborative filtering more often uses users' “rating” data on items. With a rating data set we can predict the rating a user would give to an item they have not yet rated. The principle and idea are the same as before; only the data set changes to user-item rating data.

About the user-item rating matrix

The user-item rating matrix is handled differently depending on how sparse it is

Dense rating matrix
Sparse rating matrix

Only the handling of dense rating matrices is covered here; handling sparse matrices is considerably more complicated.

Using collaborative filtering to predict user ratings

The data set

Objective: predict user 1’s rating of item E

Build the data set: when building the rating data, missing entries must be left as None; if they were set to 0, they would be treated as a rating of 0

users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
# User rating data set (None = not rated)
datasets = [
    [5, 3, 4, 4, None],
    [3, 1, 2, 3, 3],
    [4, 3, 4, 3, 5],
    [3, 3, 1, 5, 4],
    [1, 5, 5, 2, 1],
]

Similarity calculation: for rating data we use the Pearson correlation coefficient, which lies in [-1,1]; -1 indicates a strong negative correlation and +1 a strong positive correlation

The corr method in pandas can be used directly to calculate Pearson correlation coefficients

df = pd.DataFrame(datasets,
                  columns=items,
                  index=users)

print("Pairwise similarity between users:")
# Calculate the Pearson correlation coefficient directly
# corr() works column by column, so transpose first to get the similarity between users
user_similar = df.T.corr()
print(user_similar.round(4))

print("Pairwise similarity between items:")
item_similar = df.corr()
print(item_similar.round(4))

Running results:

Pairwise similarity between users:
        User1   User2   User3   User4   User5
User1  1.0000  0.8528  0.7071  0.0000 -0.7921
User2  0.8528  1.0000  0.4677  0.4900 -0.9001
User3  0.7071  0.4677  1.0000 -0.1612 -0.4666
User4  0.0000  0.4900 -0.1612  1.0000 -0.6415
User5 -0.7921 -0.9001 -0.4666 -0.6415  1.0000
Pairwise similarity between items:
        Item A  Item B  Item C  Item D  Item E
Item A  1.0000 -0.4767 -0.1231  0.5322  0.9695
Item B -0.4767  1.0000  0.6455 -0.3101 -0.4781
Item C -0.1231  0.6455  1.0000 -0.7206 -0.4276
Item D  0.5322 -0.3101 -0.7206  1.0000  0.5817
Item E  0.9695 -0.4781 -0.4276  0.5817  1.0000

You can see that users 2 and 3 are the most similar to user 1, and that the items most similar to item A are item E and item D.

Note: ratings are generally predicted from users or items that are positively correlated with the target. If no positively correlated users or items exist, the rating cannot be predicted. This happens particularly often with sparse rating matrices, where positive correlation coefficients are hard to obtain.

Rating prediction

User-based CF rating prediction: predict from the similarity between users

There are many schemes for rating prediction. The following one works well in practice: it predicts from the ratings of neighboring users, weighted by their similarity to the target user:
$pred(u,i)=\hat{r_{ui}}=\frac{\sum_{v\in U}sim(u,v)*r_{vi}}{\sum_{v\in U}|sim(u,v)|}$
We want to predict user 1’s rating of item E, so we base the prediction on user 2 and user 3, the users closest to user 1, and calculate as follows:
$pred(u_1,i_5)=\frac{0.85*3+0.71*5}{0.85+0.71}=3.91$
The final prediction is that user 1’s rating of item E (item 5) is 3.91
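
The arithmetic of the user-based prediction can be reproduced in a few lines. This sketch uses the two-decimal similarities from the calculation above; the function name is illustrative.

```python
def weighted_rating(neighbors):
    """neighbors: list of (similarity, rating) pairs for users who rated the item."""
    numerator = sum(sim * rating for sim, rating in neighbors)
    denominator = sum(abs(sim) for sim, _ in neighbors)
    return numerator / denominator


# User 2 (similarity 0.85) rated item E 3; user 3 (similarity 0.71) rated it 5.
prediction = weighted_rating([(0.85, 3), (0.71, 5)])
print(round(prediction, 2))  # 6.10 / 1.56 -> 3.91
```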

Item-based CF rating prediction: predict from the similarity between items

The item-based calculation works the same way: the prediction is the average of the user's ratings of items similar to the target item, weighted by item similarity:
$pred(u,i)=\hat{r_{ui}}=\frac{\sum_{j\in I_{rated}}sim(i,j)*r_{uj}}{\sum_{j\in I_{rated}}sim(i,j)}$
We want to predict user 1’s rating of item E, so we base the prediction on item A and item D, the items closest to item E, and calculate as follows:
$pred(u_1,i_5)=\frac{0.97*5+0.58*4}{0.97+0.58}=4.63$
The comparison shows that user-based CF and item-based CF predict different ratings. Strictly speaking they are two different recommendation algorithms, and each outperforms the other in certain fields and scenarios. Which one is better cannot be decided in advance, so in practice both are usually implemented, their recommendation quality is evaluated and analyzed, and the better scheme is chosen.
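
For reference, the item-based prediction can also be computed end to end from the rating data, without pandas. This is a sketch under stated assumptions: similarities are computed only over users who rated both items, only positively correlated items are kept, and k = 2 neighbors are used; the helper names are hypothetical. With unrounded similarities the result is close to the 4.63 obtained above from two-decimal values.

```python
import math

items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
ratings = {  # user -> ratings in the order of `items`; None = not rated
    "User1": [5, 3, 4, 4, None],
    "User2": [3, 1, 2, 3, 3],
    "User3": [4, 3, 4, 3, 5],
    "User4": [3, 3, 1, 5, 4],
    "User5": [1, 5, 5, 2, 1],
}


def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def item_similarity(i, j):
    """Correlate two item columns over the users who rated both items."""
    pairs = [(r[i], r[j]) for r in ratings.values()
             if r[i] is not None and r[j] is not None]
    return pearson([p[0] for p in pairs], [p[1] for p in pairs])


def predict_item_based(user, target, k=2):
    """Weighted average of the user's ratings of the k most similar items."""
    sims = []
    for j in range(len(items)):
        if j != target and ratings[user][j] is not None:
            sim = item_similarity(target, j)
            if sim > 0:  # keep only positively correlated items
                sims.append((sim, ratings[user][j]))
    top = sorted(sims, reverse=True)[:k]
    return sum(s * r for s, r in top) / sum(s for s, _ in top)


prediction = predict_item_based("User1", items.index("Item E"))
print(round(item_similarity(4, 0), 4), round(item_similarity(4, 3), 4))
print(prediction)
```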

IV. Case study: movie recommendation based on collaborative filtering

4.1 User-based CF predicts movie ratings

Data set download
Download address

Load ratings.csv, convert it into a user-movie score matrix and calculate the similarity between users

import os

import pandas as pd
import numpy as np

DATA_PATH = "./datasets/ml-latest-small/ratings.csv"

dtype = {"userId": np.int32, "movieId": np.int32, "rating": np.float32}
# Load the data; only the first three columns are used: user ID, movie ID, and the user's rating of the movie
ratings = pd.read_csv(DATA_PATH, dtype=dtype, usecols=range(3))
# Pivot table: turn the movie IDs into column names to obtain a user-movie rating matrix
ratings_matrix = ratings.pivot_table(index=["userId"], columns=["movieId"], values="rating")
# Calculate the similarity between users
user_similar = ratings_matrix.T.corr()

Predict a user’s rating of an item (taking user 1’s rating of movie 1 as an example)

Rating formula: $pred(u,i)=\hat{r_{ui}}=\frac{\sum_{v\in U}sim(u,v)*r_{vi}}{\sum_{v\in U}|sim(u,v)|}$

# 1. Find users similar to user 1
similar_users = user_similar[1].drop([1]).dropna()
# Similar-user filter rule: keep positively correlated users only
similar_users = similar_users.where(similar_users>0).dropna()
# 2. From user 1's similar users, keep those who have rated movie 1
ids = set(ratings_matrix[1].dropna().index)&set(similar_users.index)
finally_similar_users = similar_users.loc[list(ids)]
# 3. Predict user 1's rating of movie 1 from the similarity to these nearest neighbors
numerator = 0    # Numerator of the rating prediction formula
denominator = 0    # Denominator of the rating prediction formula
# (.items() and .loc replace .iteritems() and .ix, which were removed from pandas)
for sim_uid, similarity in finally_similar_users.items():
    # Rating data of the neighbor
    sim_user_rated_movies = ratings_matrix.loc[sim_uid].dropna()
    # The neighbor's rating of movie 1
    sim_user_rating_for_item = sim_user_rated_movies[1]
    # Accumulate the numerator
    numerator += similarity * sim_user_rating_for_item
    # Accumulate the denominator
    denominator += similarity
# 4. Calculate the predicted rating
predict_rating = numerator/denominator
print("Predicted rating of user <%d> for movie <%d>: %0.2f" % (1, 1, predict_rating))

Wrap this in a function that predicts any user's rating of any movie

def predict(uid, iid, ratings_matrix, user_similar):
    """
    Predict the rating of a given user for a given item
    :param uid: user ID
    :param iid: item ID
    :param ratings_matrix: user-item rating matrix
    :param user_similar: pairwise user similarity matrix
    :return: predicted rating
    """
    print("Start predicting the rating of user <%d> for movie <%d>..." % (uid, iid))
    # 1. Find users similar to the uid user
    similar_users = user_similar[uid].drop([uid]).dropna()
    # Similar-user filter rule: keep positively correlated users only
    similar_users = similar_users.where(similar_users>0).dropna()
    if similar_users.empty:
        raise Exception("User <%d> has no similar users" % uid)

    # 2. From the uid user's similar users, keep those who have rated the iid item
    ids = set(ratings_matrix[iid].dropna().index)&set(similar_users.index)
    finally_similar_users = similar_users.loc[list(ids)]

    # 3. Predict the uid user's rating of the iid item from the similarity to these nearest neighbors
    numerator = 0    # Numerator of the rating prediction formula
    denominator = 0    # Denominator of the rating prediction formula
    for sim_uid, similarity in finally_similar_users.items():
        # Rating data of the neighbor
        sim_user_rated_movies = ratings_matrix.loc[sim_uid].dropna()
        # The neighbor's rating of the iid item
        sim_user_rating_for_item = sim_user_rated_movies[iid]
        # Accumulate the numerator
        numerator += similarity * sim_user_rating_for_item
        # Accumulate the denominator
        denominator += similarity

    # Calculate the predicted rating and return it
    predict_rating = numerator/denominator
    print("Predicted rating of user <%d> for movie <%d>: %0.2f" % (uid, iid, predict_rating))
    return round(predict_rating, 2)

Predict a user's ratings for all movies

def predict_all(uid, ratings_matrix, user_similar):
    """
    Predict the ratings of a given user for all items
    :param uid: user ID
    :param ratings_matrix: user-item rating matrix
    :param user_similar: pairwise user similarity matrix
    :return: generator that yields (uid, iid, predicted rating) one at a time
    """
    # Prepare the list of item IDs to predict
    item_ids = ratings_matrix.columns
    # Predict one by one
    for iid in item_ids:
        try:
            rating = predict(uid, iid, ratings_matrix, user_similar)
        except Exception as e:
            # No prediction is possible for this item; report it and continue
            print(e)
        else:
            yield uid, iid, rating

if __name__ == '__main__':
    for i in predict_all(1, ratings_matrix, user_similar):
        pass

Recommend the top-N movies to a specified user according to the predicted ratings

def top_k_rs_result(k):
    # Predict user 1's ratings for all movies, then keep the k highest
    results = predict_all(1, ratings_matrix, user_similar)
    return sorted(results, key=lambda x: x[2], reverse=True)[:k]

if __name__ == '__main__':
    from pprint import pprint
    result = top_k_rs_result(20)
    pprint(result)

4.2 Item-based CF predicts movie ratings