
I. Model-based collaborative filtering

Model-based collaborative filtering algorithm

The following is a general classification of model-based CF algorithms:

  • Based on classification, regression, or clustering algorithms
  • Recommendation based on matrix factorization
  • Based on neural network algorithms
  • Based on graph model algorithms

2. Collaborative filtering based on regression model

If we think of ratings as continuous values rather than discrete values, then linear regression can be used to predict how a target user will rate an item. One such implementation strategy is called Baseline.

2.1 Baseline: baseline prediction

2.1.1 Baseline Design Idea

  • Some users generally give higher ratings than others, and some generally give lower ratings. For example, some users are naturally generous and easy-going with praise, while others are more demanding and never give more than 3 out of 5.
  • Some items generally receive higher ratings than others, and some generally receive lower ratings. For example, some items are simply better made than others, and some are more popular than others.

The amount by which a user or an item is generally above or below the overall average is called its bias.

Baseline target:

  • Find the bias value $b_u$ by which each user generally rates higher or lower than other users
  • Find the bias value $b_i$ by which each item is generally rated higher or lower than other items
  • Our goal is to find the best $b_u$ and $b_i$

2.1.2 Prediction step of Baseline algorithm

Take movie ratings as an example:

  • Calculate the average rating $\mu$ of all movies (the global average rating)
  • Calculate each user's bias $b_u$, the offset of that user's ratings from the average rating $\mu$
  • Calculate each movie's bias $b_i$, the offset of the ratings that movie receives from the average rating $\mu$
  • Predict a user's rating of a movie: $\hat{r}_{ui} = b_{ui} = \mu + b_u + b_i$
  • For example, predicting user A's rating of the movie “Forrest Gump” with Baseline:
    • First, the average rating $\mu$ over the whole rating data set is calculated to be 3.5
    • User A is harsh and generally rates 0.5 points below the average, so user A's bias is $b_u = -0.5$
    • “Forrest Gump” is popular and well received, with ratings generally 1.2 points above average, so its bias is $b_i = +1.2$
    • Therefore, user A's predicted rating of “Forrest Gump” is 3.5 + (−0.5) + 1.2, i.e. 4.2 points
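
The arithmetic above can be turned into a one-line prediction function. The following minimal sketch simply plugs in the illustrative numbers from the example (3.5, −0.5 and +1.2 are the example's made-up values, not biases learned from real data):

    def baseline_predict(mu, b_u, b_i):
        """Baseline prediction: global mean plus user bias plus item bias."""
        return mu + b_u + b_i

    # Example values from the text above
    print(baseline_predict(3.5, -0.5, 1.2))  # 4.2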

The average rating over all movies can be calculated directly, so the problem reduces to estimating the rating bias of each user and of each movie. Treating this as a linear regression problem, we can use the squared error to construct the loss function:


$$Cost=\sum_{u,i\in R}(r_{ui}-\hat{r}_{ui})^2=\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)^2$$

Add L2 regularization:


$$Cost=\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)^2+\lambda\left(\sum_u b_u^2+\sum_i b_i^2\right)$$

Formula analysis:

  • The first part, $\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)^2$, is used to find the $b_u$ and $b_i$ that best fit the known rating data

  • The second part, $\lambda\left(\sum_u b_u^2+\sum_i b_i^2\right)$, is the regularization term, used to avoid overfitting

To minimize this loss, we generally use the stochastic gradient descent method or the alternating least squares method.

2.2 Stochastic gradient descent optimization

Predicting the Baseline biases with the stochastic gradient descent optimization algorithm.

Step1: Derivation of gradient descent method


$$J(\theta)=Cost=f(b_u,b_i)$$

Loss function ($\lambda$ is the regularization coefficient):


$$J(\theta)=\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)^2+\lambda\left(\sum_u b_u^2+\sum_i b_i^2\right)$$

General gradient descent parameter update formula ($\alpha$ is the learning rate):


$$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$$

Taking the partial derivative with respect to $b_u$:


$$\frac{\partial}{\partial b_u}J(\theta)=\frac{\partial}{\partial b_u}f(b_u,b_i)=2\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)(-1)+2\lambda b_u=-2\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)+2\lambda b_u$$

Substituting into the gradient descent update (the constant factor 2 is absorbed into the learning rate $\alpha$):


$$b_u:=b_u-\alpha\left(-\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)+\lambda b_u\right)=b_u+\alpha\left(\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)-\lambda b_u\right)$$

Similarly, the gradient descent update for $b_i$ is:


$$b_i:=b_i+\alpha\left(\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)-\lambda b_i\right)$$

Step2: Stochastic gradient descent

Since stochastic gradient descent essentially uses the loss of a single sample to update the parameters, instead of computing the total loss each time, with SGD the per-sample error is:


$$error=r_{ui}-\hat{r}_{ui}=r_{ui}-(\mu+b_u+b_i)=r_{ui}-\mu-b_u-b_i$$

Parameter updates from a single sample's loss:


$$b_u:=b_u+\alpha\left((r_{ui}-\mu-b_u-b_i)-\lambda b_u\right)=b_u+\alpha(error-\lambda b_u)$$

$$b_i:=b_i+\alpha\left((r_{ui}-\mu-b_u-b_i)-\lambda b_i\right)=b_i+\alpha(error-\lambda b_i)$$

Step3: Algorithm implementation

  • The pandas version should be no lower than 0.24.2

  • The data load

    import pandas as pd
    import numpy as np
    dtype = [("userId", np.int32), ("movieId", np.int32), ("rating", np.float32)]
    dataset = pd.read_csv("ml-latest-small/ratings.csv", usecols=range(3), dtype=dict(dtype))
  • Data initialization

    # User rating data: group by userId and aggregate ratings into lists
    users_ratings = dataset.groupby('userId').agg([list])
    # Item rating data
    items_ratings = dataset.groupby('movieId').agg([list])
    # calculate the global average
    global_mean = dataset['rating'].mean()
    # initialization
    bu = dict(zip(users_ratings.index, np.zeros(len(users_ratings))))
    bi = dict(zip(items_ratings.index, np.zeros(len(items_ratings))))
  • About the zip

    • The zip() function takes one or more iterable objects as arguments, packs their corresponding elements into tuples, and returns an object made up of those tuples; being lazy, this saves a lot of memory. We can use list() to materialize the result as a list. If the iterables have different numbers of elements, the returned object is as long as the shortest one. Using the * operator, the zipped tuples can be unpacked back into separate sequences.

    • Syntax: zip([iterable, ...])
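
    • A quick illustration of the dict(zip(...)) pattern used in the initialization above (a minimal sketch with made-up user ids):

      import numpy as np

      uids = [1, 2, 3]                               # made-up user ids
      bu = dict(zip(uids, np.zeros(len(uids))))
      # every user id is now mapped to an initial bias of 0.0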

  • Update $b_u$ and $b_i$

    # number_epochs: number of iterations, alpha: learning rate, reg: regularization coefficient
    for i in range(number_epochs):
        print("iter%d" % i)
        for uid, iid, real_rating in dataset.itertuples(index=False):
            error = real_rating - (global_mean + bu[uid] + bi[iid])
            bu[uid] += alpha * (error - reg * bu[uid])
            bi[iid] += alpha * (error - reg * bi[iid])
  • Prediction score

    def predict(uid, iid):
        predict_rating = global_mean + bu[uid] + bi[iid]
        return predict_rating
  • The overall package

    import pandas as pd
    import numpy as np


    class BaselineCFBySGD(object):

        def __init__(self, number_epochs, alpha, reg, columns=["uid", "iid", "rating"]):
            # Maximum number of gradient descent iterations
            self.number_epochs = number_epochs
            # Learning rate
            self.alpha = alpha
            # Regularization parameter
            self.reg = reg
            # Names of the user-item-rating columns in the dataset
            self.columns = columns

        def fit(self, dataset):
            '''
            :param dataset: uid, iid, rating
            :return:
            '''
            self.dataset = dataset
            # User rating data
            self.users_ratings = dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
            # Item rating data
            self.items_ratings = dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
            # Compute the global average rating
            self.global_mean = self.dataset[self.columns[2]].mean()
            # Call SGD to train the model parameters
            self.bu, self.bi = self.sgd()

        def sgd(self):
            '''
            Optimize bu and bi by stochastic gradient descent
            :return: bu, bi
            '''
            # Initialize bu and bi to 0
            bu = dict(zip(self.users_ratings.index, np.zeros(len(self.users_ratings))))
            bi = dict(zip(self.items_ratings.index, np.zeros(len(self.items_ratings))))

            for i in range(self.number_epochs):
                print("iter%d" % i)
                for uid, iid, real_rating in self.dataset.itertuples(index=False):
                    error = real_rating - (self.global_mean + bu[uid] + bi[iid])
                    bu[uid] += self.alpha * (error - self.reg * bu[uid])
                    bi[iid] += self.alpha * (error - self.reg * bi[iid])
            return bu, bi

        def predict(self, uid, iid):
            predict_rating = self.global_mean + self.bu[uid] + self.bi[iid]
            return predict_rating


    if __name__ == '__main__':
        dtype = [("userId", np.int32), ("movieId", np.int32), ("rating", np.float32)]
        dataset = pd.read_csv("datasets/ml-latest-small/ratings.csv", usecols=range(3), dtype=dict(dtype))

        bcf = BaselineCFBySGD(20, 0.1, 0.1, ["userId", "movieId", "rating"])
        bcf.fit(dataset)

        while True:
            uid = int(input("uid: "))
            iid = int(input("iid: "))
            print(bcf.predict(uid, iid))

Step4: Accuracy index evaluation

  • Add the test method and calculate the accuracy metric using the previously implemented accuracy method

    import pandas as pd
    import numpy as np

    def data_split(data_path, x=0.8, random=False):
        '''
        Split the data set into training and test sets
        :param data_path: path of the data set
        :param x: proportion of the training set, e.g. x=0.8 means 0.2 goes to the test set
        :param random: whether to split randomly, default False
        :return: trainset, testset
        '''
        print("Start splitting the data set...")
        # Set the dtype of the fields to load
        dtype = {"userId": np.int32, "movieId": np.int32, "rating": np.float32}
        # Load the data; we only use the first three columns: user id, movie id and the user's rating of the movie
        ratings = pd.read_csv(data_path, dtype=dtype, usecols=range(3))

        testset_index = []
        # To ensure that every user has data in both the test set and the training set, aggregate by userId
        for uid in ratings.groupby("userId").any().index:
            user_rating_data = ratings.where(ratings["userId"]==uid).dropna()
            if random:
                # The index object cannot be shuffled in place, so convert it to a list first
                index = list(user_rating_data.index)
                np.random.shuffle(index)    # shuffle the list
                _index = round(len(user_rating_data) * x)
                testset_index += list(index[_index:])
            else:
                # Use proportion x of each user's data as the training set and the rest as the test set
                index = round(len(user_rating_data) * x)
                testset_index += list(user_rating_data.index.values[index:])

        testset = ratings.loc[testset_index]
        trainset = ratings.drop(testset_index)
        print("Finished splitting the data set...")
        return trainset, testset

    def accuray(predict_results, method="all"):
        '''
        Accuracy metrics
        :param predict_results: iterable of prediction results; each element is a sequence containing uid, iid, real_rating, pred_rating
        :param method: metric to compute, "rmse" or "mae"; anything else returns both rmse and mae
        :return: the requested metric(s)
        '''

        def rmse(predict_results):
            length = 0
            _rmse_sum = 0
            for uid, iid, real_rating, pred_rating in predict_results:
                length += 1
                _rmse_sum += (pred_rating - real_rating) ** 2
            return round(np.sqrt(_rmse_sum / length), 4)

        def mae(predict_results):
            length = 0
            _mae_sum = 0
            for uid, iid, real_rating, pred_rating in predict_results:
                length += 1
                _mae_sum += abs(pred_rating - real_rating)
            return round(_mae_sum / length, 4)

        def rmse_mae(predict_results):
            length = 0
            _rmse_sum = 0
            _mae_sum = 0
            for uid, iid, real_rating, pred_rating in predict_results:
                length += 1
                _rmse_sum += (pred_rating - real_rating) ** 2
                _mae_sum += abs(pred_rating - real_rating)
            return round(np.sqrt(_rmse_sum / length), 4), round(_mae_sum / length, 4)

        if method.lower() == "rmse":
            return rmse(predict_results)
        elif method.lower() == "mae":
            return mae(predict_results)
        else:
            return rmse_mae(predict_results)

    class BaselineCFBySGD(object):

        def __init__(self, number_epochs, alpha, reg, columns=["uid", "iid", "rating"]):
            # Maximum number of gradient descent iterations
            self.number_epochs = number_epochs
            # Learning rate
            self.alpha = alpha
            # Regularization parameter
            self.reg = reg
            # Names of the user-item-rating columns in the dataset
            self.columns = columns

        def fit(self, dataset):
            '''
            :param dataset: uid, iid, rating
            :return:
            '''
            self.dataset = dataset
            # User rating data
            self.users_ratings = dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
            # Item rating data
            self.items_ratings = dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
            # Compute the global average rating
            self.global_mean = self.dataset[self.columns[2]].mean()
            # Call SGD to train the model parameters
            self.bu, self.bi = self.sgd()

        def sgd(self):
            '''
            Optimize bu and bi by stochastic gradient descent
            :return: bu, bi
            '''
            # Initialize bu and bi to 0
            bu = dict(zip(self.users_ratings.index, np.zeros(len(self.users_ratings))))
            bi = dict(zip(self.items_ratings.index, np.zeros(len(self.items_ratings))))

            for i in range(self.number_epochs):
                print("iter%d" % i)
                for uid, iid, real_rating in self.dataset.itertuples(index=False):
                    error = real_rating - (self.global_mean + bu[uid] + bi[iid])
                    bu[uid] += self.alpha * (error - self.reg * bu[uid])
                    bi[iid] += self.alpha * (error - self.reg * bi[iid])
            return bu, bi

        def predict(self, uid, iid):
            '''Rating prediction'''
            if iid not in self.items_ratings.index:
                raise Exception("Cannot predict user <{uid}>'s rating of movie <{iid}>, because the training set has no data for <{iid}>.".format(uid=uid, iid=iid))

            predict_rating = self.global_mean + self.bu[uid] + self.bi[iid]
            return predict_rating

        def test(self, testset):
            '''Predict on the test set'''
            for uid, iid, real_rating in testset.itertuples(index=False):
                try:
                    pred_rating = self.predict(uid, iid)
                except Exception as e:
                    print(e)
                else:
                    yield uid, iid, real_rating, pred_rating

    if __name__ == '__main__':

        trainset, testset = data_split("datasets/ml-latest-small/ratings.csv", random=True)

        bcf = BaselineCFBySGD(20, 0.1, 0.1, ["userId", "movieId", "rating"])
        bcf.fit(trainset)

        pred_results = bcf.test(testset)

        rmse, mae = accuray(pred_results)

        print("rmse: ", rmse, "mae: ", mae)

2.3 Optimization by alternating least square method

Predicting the Baseline biases with the alternating least squares (ALS) optimization algorithm.

Step1: Alternating least squares derivation

Like gradient descent, the least squares method can be used to find the extreme values of a function.

The idea of least squares: take the partial derivative of the loss function with respect to each parameter and set that derivative to 0.

Similarly, the loss function:


$$J(\theta)=\sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i)^2+\lambda\left(\sum_u b_u^2+\sum_i b_i^2\right)$$

Using least squares (setting the partial derivative with respect to $b_u$ to 0 and solving), we obtain:


$$b_u:=\frac{\sum_{u,i\in R}(r_{ui}-\mu-b_i)}{\lambda_1+|R(u)|}$$

where $|R(u)|$ denotes the number of ratings given by user $u$.

In the same way:


$$b_i:=\frac{\sum_{u,i\in R}(r_{ui}-\mu-b_u)}{\lambda_2+|R(i)|}$$

$b_u$ and $b_i$ are the user and item offsets respectively, so their regularization parameters can be set as two separate values, $\lambda_1$ and $\lambda_2$.
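
For completeness, here is a brief sketch of how the $b_u$ formula above falls out of setting the derivative to zero (holding $b_i$ fixed; the same steps give the $b_i$ formula):

$$\frac{\partial J}{\partial b_u}=-2\sum_{i\in R(u)}(r_{ui}-\mu-b_u-b_i)+2\lambda_1 b_u=0
\;\Rightarrow\;
\sum_{i\in R(u)}(r_{ui}-\mu-b_i)=b_u\left(|R(u)|+\lambda_1\right)
\;\Rightarrow\;
b_u=\frac{\sum_{i\in R(u)}(r_{ui}-\mu-b_i)}{\lambda_1+|R(u)|}$$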

Step2: Application of alternating least square method

Least squares gives us expressions for $b_u$ and $b_i$, but each expression contains the other quantity, so we use a method called alternating least squares to compute their values:

  • To compute one of them, first fix the other unknown parameter, i.e. treat the other parameter as known
  • To compute $b_u$, treat $b_i$ as known; to compute $b_i$, treat $b_u$ as known. Repeat this, updating the two in turn, until the values converge. This is the alternating least squares method.

Step3: Algorithm implementation

  • Iteratively update bu and bi

    for i in range(number_epochs):
        print("iter%d" % i)
        for iid, uids, ratings in items_ratings.itertuples(index=True):
            _sum = 0
            for uid, rating in zip(uids, ratings):
                _sum += rating - global_mean - bu[uid]
            bi[iid] = _sum / (reg_bi + len(uids))

        for uid, iids, ratings in users_ratings.itertuples(index=True):
            _sum = 0
            for iid, rating in zip(iids, ratings):
                _sum += rating - global_mean - bi[iid]
            bu[uid] = _sum / (reg_bu + len(iids))
  • encapsulation

    import pandas as pd
    import numpy as np


    class BaselineCFByALS(object):

        def __init__(self, number_epochs, reg_bu, reg_bi, columns=["uid", "iid", "rating"]):
            # Maximum number of iterations
            self.number_epochs = number_epochs
            # Regularization parameter for bu
            self.reg_bu = reg_bu
            # Regularization parameter for bi
            self.reg_bi = reg_bi
            # Names of the user-item-rating columns in the dataset
            self.columns = columns

        def fit(self, dataset):
            '''
            :param dataset: uid, iid, rating
            :return:
            '''
            self.dataset = dataset
            # User rating data
            self.users_ratings = dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
            # Item rating data
            self.items_ratings = dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
            # Compute the global average rating
            self.global_mean = self.dataset[self.columns[2]].mean()
            # Call ALS to train the model parameters
            self.bu, self.bi = self.als()

        def als(self):
            '''
            Optimize bu and bi by alternating least squares
            :return: bu, bi
            '''
            # Initialize bu and bi to 0
            bu = dict(zip(self.users_ratings.index, np.zeros(len(self.users_ratings))))
            bi = dict(zip(self.items_ratings.index, np.zeros(len(self.items_ratings))))

            for i in range(self.number_epochs):
                print("iter%d" % i)
                for iid, uids, ratings in self.items_ratings.itertuples(index=True):
                    _sum = 0
                    for uid, rating in zip(uids, ratings):
                        _sum += rating - self.global_mean - bu[uid]
                    bi[iid] = _sum / (self.reg_bi + len(uids))

                for uid, iids, ratings in self.users_ratings.itertuples(index=True):
                    _sum = 0
                    for iid, rating in zip(iids, ratings):
                        _sum += rating - self.global_mean - bi[iid]
                    bu[uid] = _sum / (self.reg_bu + len(iids))
            return bu, bi

        def predict(self, uid, iid):
            predict_rating = self.global_mean + self.bu[uid] + self.bi[iid]
            return predict_rating


    if __name__ == '__main__':
        dtype = [("userId", np.int32), ("movieId", np.int32), ("rating", np.float32)]
        dataset = pd.read_csv("datasets/ml-latest-small/ratings.csv", usecols=range(3), dtype=dict(dtype))

        bcf = BaselineCFByALS(20, 25, 15, ["userId", "movieId", "rating"])
        bcf.fit(dataset)

        while True:
            uid = int(input("uid: "))
            iid = int(input("iid: "))
            print(bcf.predict(uid, iid))

Step4: Accuracy index evaluation

import pandas as pd
import numpy as np

def data_split(data_path, x=0.8, random=False):
    '''
    Split the data set into training and test sets
    :param data_path: path of the data set
    :param x: proportion of the training set, e.g. x=0.8 means 0.2 goes to the test set
    :param random: whether to split randomly, default False
    :return: trainset, testset
    '''
    print("Start splitting the data set...")
    # Set the dtype of the fields to load
    dtype = {"userId": np.int32, "movieId": np.int32, "rating": np.float32}
    # Load the data; we only use the first three columns: user id, movie id and the user's rating of the movie
    ratings = pd.read_csv(data_path, dtype=dtype, usecols=range(3))

    testset_index = []
    # To ensure that every user has data in both the test set and the training set, aggregate by userId
    for uid in ratings.groupby("userId").any().index:
        user_rating_data = ratings.where(ratings["userId"]==uid).dropna()
        if random:
            # The index object cannot be shuffled in place, so convert it to a list first
            index = list(user_rating_data.index)
            np.random.shuffle(index)    # shuffle the list
            _index = round(len(user_rating_data) * x)
            testset_index += list(index[_index:])
        else:
            # Use proportion x of each user's data as the training set and the rest as the test set
            index = round(len(user_rating_data) * x)
            testset_index += list(user_rating_data.index.values[index:])

    testset = ratings.loc[testset_index]
    trainset = ratings.drop(testset_index)
    print("Finished splitting the data set...")
    return trainset, testset

def accuray(predict_results, method="all"):
    '''
    Accuracy metrics
    :param predict_results: iterable of prediction results; each element is a sequence containing uid, iid, real_rating, pred_rating
    :param method: metric to compute, "rmse" or "mae"; anything else returns both rmse and mae
    :return: the requested metric(s)
    '''

    def rmse(predict_results):
        length = 0
        _rmse_sum = 0
        for uid, iid, real_rating, pred_rating in predict_results:
            length += 1
            _rmse_sum += (pred_rating - real_rating) ** 2
        return round(np.sqrt(_rmse_sum / length), 4)

    def mae(predict_results):
        length = 0
        _mae_sum = 0
        for uid, iid, real_rating, pred_rating in predict_results:
            length += 1
            _mae_sum += abs(pred_rating - real_rating)
        return round(_mae_sum / length, 4)

    def rmse_mae(predict_results):
        length = 0
        _rmse_sum = 0
        _mae_sum = 0
        for uid, iid, real_rating, pred_rating in predict_results:
            length += 1
            _rmse_sum += (pred_rating - real_rating) ** 2
            _mae_sum += abs(pred_rating - real_rating)
        return round(np.sqrt(_rmse_sum / length), 4), round(_mae_sum / length, 4)

    if method.lower() == "rmse":
        return rmse(predict_results)
    elif method.lower() == "mae":
        return mae(predict_results)
    else:
        return rmse_mae(predict_results)

class BaselineCFByALS(object):

    def __init__(self, number_epochs, reg_bu, reg_bi, columns=["uid", "iid", "rating"]):
        # Maximum number of iterations
        self.number_epochs = number_epochs
        # Regularization parameter for bu
        self.reg_bu = reg_bu
        # Regularization parameter for bi
        self.reg_bi = reg_bi
        # Names of the user-item-rating columns in the dataset
        self.columns = columns

    def fit(self, dataset):
        '''
        :param dataset: uid, iid, rating
        :return:
        '''
        self.dataset = dataset
        # User rating data
        self.users_ratings = dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # Item rating data
        self.items_ratings = dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # Compute the global average rating
        self.global_mean = self.dataset[self.columns[2]].mean()
        # Call ALS to train the model parameters
        self.bu, self.bi = self.als()

    def als(self):
        '''
        Optimize bu and bi by alternating least squares
        :return: bu, bi
        '''
        # Initialize bu and bi to 0
        bu = dict(zip(self.users_ratings.index, np.zeros(len(self.users_ratings))))
        bi = dict(zip(self.items_ratings.index, np.zeros(len(self.items_ratings))))

        for i in range(self.number_epochs):
            print("iter%d" % i)
            for iid, uids, ratings in self.items_ratings.itertuples(index=True):
                _sum = 0
                for uid, rating in zip(uids, ratings):
                    _sum += rating - self.global_mean - bu[uid]
                bi[iid] = _sum / (self.reg_bi + len(uids))

            for uid, iids, ratings in self.users_ratings.itertuples(index=True):
                _sum = 0
                for iid, rating in zip(iids, ratings):
                    _sum += rating - self.global_mean - bi[iid]
                bu[uid] = _sum / (self.reg_bu + len(iids))
        return bu, bi

    def predict(self, uid, iid):
        '''Rating prediction'''
        if iid not in self.items_ratings.index:
            raise Exception("Cannot predict user <{uid}>'s rating of movie <{iid}>, because the training set has no data for <{iid}>.".format(uid=uid, iid=iid))

        predict_rating = self.global_mean + self.bu[uid] + self.bi[iid]
        return predict_rating

    def test(self, testset):
        '''Predict on the test set'''
        for uid, iid, real_rating in testset.itertuples(index=False):
            try:
                pred_rating = self.predict(uid, iid)
            except Exception as e:
                print(e)
            else:
                yield uid, iid, real_rating, pred_rating


if __name__ == '__main__':
    trainset, testset = data_split("datasets/ml-latest-small/ratings.csv", random=True)

    bcf = BaselineCFByALS(20, 25, 15, ["userId", "movieId", "rating"])
    bcf.fit(trainset)

    pred_results = bcf.test(testset)

    rmse, mae = accuray(pred_results)

    print("rmse: ", rmse, "mae: ", mae)

CF algorithm based on matrix decomposition

History of matrix decomposition

3.1 Traditional SVD

SVD matrix factorization usually refers to the singular value decomposition technique. Here we will call it Traditional SVD (the traditional, classic form); its formula is as follows:


$$M_{m\times n}=U_{m\times k}\Sigma_{k\times k}V_{k\times n}^T$$

Traditional SVD decomposes a matrix into the product of three matrices, where the middle matrix is the singular value matrix. One prerequisite for applying SVD is that the matrix must be dense, i.e. every element of the matrix must be filled in; otherwise SVD cannot be used.

In most cases, however, our rating data is sparse. So if Traditional SVD is to be used, the matrix is generally first filled in with the mean or some other statistical value, and then decomposed and reduced in dimension with Traditional SVD. This filling obviously distorts the original data.
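
As a rough illustration of this fill-then-decompose idea, here is a minimal numpy sketch (not the author's code; the tiny matrix and the choice of mean filling are made up for demonstration):

    import numpy as np

    # A tiny user-item rating matrix with missing entries (np.nan = not rated)
    R = np.array([[5.0, 3.0, np.nan],
                  [4.0, np.nan, 1.0],
                  [np.nan, 2.0, 4.0]])

    # Fill the missing entries with the mean of the observed ratings
    mean = np.nanmean(R)
    R_filled = np.where(np.isnan(R), mean, R)

    # Traditional SVD on the dense (filled) matrix, keeping the top-k singular values
    U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
    k = 2
    R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print(R_approx.round(2))   # low-rank approximation of the filled matrix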

3.2 FunkSVD (LFM)

The Traditional SVD just described requires the matrix to be filled in before it can be decomposed and reduced in dimension, and it also has high computational complexity because it produces three matrices. The Funk SVD method was therefore proposed later: instead of three matrices, it decomposes the rating matrix into two, a user-latent-feature matrix and an item-latent-feature matrix. Funk SVD is also known as the original LFM model:


$$\sum_{i,j}(m_{ij}-q_j^Tp_i)^2$$

Using the idea of linear regression, the optimal latent vector representations of users and items are found by minimizing the squared error on the observed data. To avoid overfitting the observations, FunkSVD with an L2 regularization term was also proposed; the formula becomes:


$$\min_{q^*,p^*}\sum_{(u,i)\in K}(r_{ui}-q_i^Tp_u)^2+\lambda\left(\|q_i\|^2+\|p_u\|^2\right)$$

The above two optimization functions can be used to find the optimal solution by gradient descent or stochastic gradient descent.

3.3 BiasSVD

Since FunkSVD was proposed, there have been many variations. One relatively successful method is BiasSVD, which, as the name implies, is SVD decomposition with bias:


$$\underbrace{\arg\min}_{p_i,q_j}\sum_{i,j}(m_{ij}-\mu-b_i-b_j-q_j^Tp_i)^2+\lambda\left(\|p_i\|_2^2+\|q_j\|_2^2+\|b_i\|_2^2+\|b_j\|_2^2\right)$$

It is based on the same assumptions as the Baseline prediction, but here the Baseline bias is introduced into the matrix factorization.

3.4 SVD++

Later, people proposed an improved BiasSVD, called SVD++, which added implicit feedback from users on the basis of BiasSVD:


$$\underbrace{\arg\min}_{p_i,q_j}\sum_{i,j}\left(m_{ij}-\mu-b_i-b_j-q_j^Tp_i-q_j^T|N(i)|^{-\frac12}\sum_{s\in N(i)}y_s\right)^2+\lambda\left(\|p_i\|_2^2+\|q_j\|_2^2+\|b_i\|_2^2+\|b_j\|_2^2+\sum_{s\in N(i)}\|y_s\|_2^2\right)$$

Explicit feedback refers to behaviors such as user ratings, while implicit feedback refers to user browsing history, purchase history, listening history, etc.

SVD++ adds to BiasSVD the assumption that a user's historical browsing, purchase and listening records for items can also indirectly reflect the user's preferences.

Realization of CF algorithm based on matrix decomposition: LFM

LFM is also referred to as Funk SVD matrix factorization

4.1 LFM Principle Analysis

The core idea of LFM(Latent Factor Model) is to connect users and items through hidden features, as shown in the figure below:

  • The P matrix is the User-LF matrix, i.e. the matrix of users and latent features; in this example there are three latent features
  • The Q matrix is the LF-Item matrix, i.e. the matrix of latent features and items
  • The R matrix is the User-Item rating matrix, given by P multiplied by Q
  • This approach can handle a sparse rating matrix

Using matrix factorization, the original User-Item rating matrix (dense or sparse) is decomposed into the P and Q matrices, and P*Q is then used to reconstruct the User-Item rating matrix R. The whole process is equivalent to dimensionality reduction, where:

  • The matrix value P11 represents the weight of user 1 on latent feature 1
  • The matrix value Q11 represents the weight of latent feature 1 on item 1
  • The matrix value R11 represents the predicted rating of user 1 for item 1, with $R_{11}=\vec{P_{1,k}}\cdot\vec{Q_{k,1}}$

LFM predicts a user's rating of an item, where $K$ denotes the number of latent features:


$$\hat{r}_{ui}=\vec{p_{uk}}\cdot\vec{q_{ik}}=\sum_{k=1}^{K}p_{uk}q_{ik}$$

So ultimately our goal is to solve for every entry of the P and Q matrices, and then use them to predict user-item ratings.
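
The relationship between P, Q and the reconstructed rating matrix can be sketched in a few lines of numpy (a toy illustration with made-up sizes, not the training code):

    import numpy as np

    n_users, n_items, K = 4, 5, 3          # made-up sizes; K = number of latent features
    P = np.random.rand(n_users, K)         # user -> latent-feature weights
    Q = np.random.rand(K, n_items)         # latent-feature -> item weights

    R_hat = P @ Q                          # reconstructed User-Item rating matrix
    print(R_hat.shape)                     # (4, 5)
    # a single prediction is just the dot product of one user vector and one item vector
    print(R_hat[0, 0], np.dot(P[0, :], Q[:, 0]))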

4.2 Loss function

Similarly, for rating prediction we use the squared error to construct the loss function:


$$Cost=\sum_{u,i\in R}(r_{ui}-\hat{r}_{ui})^2=\sum_{u,i\in R}\left(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik}\right)^2$$

Add L2 regularization:


$$Cost=\sum_{u,i\in R}\left(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik}\right)^2+\lambda\left(\sum_U p_{uk}^2+\sum_I q_{ik}^2\right)$$

4.3 Stochastic gradient descent optimization

Gradient descent updates for the parameters $p_{uk}$ and $q_{ik}$ ($\alpha$ is the learning rate, $\lambda$ the regularization coefficient):


$$p_{uk}:=p_{uk}+\alpha\left[\sum_{u,i\in R}\left(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik}\right)q_{ik}-\lambda p_{uk}\right]$$

$$q_{ik}:=q_{ik}+\alpha\left[\sum_{u,i\in R}\left(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik}\right)p_{uk}-\lambda q_{ik}\right]$$

Stochastic gradient descent update (the vector multiplication multiplies corresponding components and sums them):


$$p_{uk}:=p_{uk}+\alpha\left[\left(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik}\right)q_{ik}-\lambda_1 p_{uk}\right]$$

$$q_{ik}:=q_{ik}+\alpha\left[\left(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik}\right)p_{uk}-\lambda_2 q_{ik}\right]$$

4.4 Algorithm Implementation

  • The data load

    import pandas as pd
    import numpy as np
    dtype = [("userId", np.int32), ("movieId", np.int32), ("rating", np.float32)]
    dataset = pd.read_csv("ml-latest-small/ratings.csv", usecols=range(3), dtype=dict(dtype))
  • Data initialization

    # User rating data: group by userId and aggregate ratings into lists
    users_ratings = dataset.groupby('userId').agg([list])
    # Item rating data
    items_ratings = dataset.groupby('movieId').agg([list])
    # Compute the global average rating
    global_mean = dataset['rating'].mean()
    # P and Q matrices: roughly 610 users and 9700 movies, so their shapes are 610*K and 9700*K
    # User-LF, 10 latent factors
    P = dict(zip(users_ratings.index, np.random.rand(len(users_ratings), 10).astype(np.float32)))
    # Item-LF
    Q = dict(zip(items_ratings.index, np.random.rand(len(items_ratings), 10).astype(np.float32)))
  • Gradient descent optimizes loss function

    # Gradient descent optimization of the loss function
    for i in range(15):
        print('*' * 10, i)
        for uid, iid, real_rating in dataset.itertuples(index=False):
            # Get the user vector from the user matrix by user id
            v_puk = P[uid]
            # Get the item vector from the item matrix by item id
            v_qik = Q[iid]
            # Calculate the error
            error = real_rating - np.dot(v_puk, v_qik)
            # 0.02 learning rate, 0.01 regularization coefficient
            v_puk += 0.02 * (error * v_qik - 0.01 * v_puk)
            v_qik += 0.02 * (error * v_puk - 0.01 * v_qik)

            P[uid] = v_puk
            Q[iid] = v_qik
  • Rating prediction

    def predict(self, uid, iid):
        # If the uid or iid is not in the training data, return the global average as the prediction
        if uid not in self.users_ratings.index or iid not in self.items_ratings.index:
            return self.globalMean
        p_u = self.P[uid]
        q_i = self.Q[iid]

        return np.dot(p_u, q_i)
  • encapsulation

    # LFM model

    import pandas as pd
    import numpy as np

    # Rating prediction on a 1-5 scale
    class LFM(object):

        def __init__(self, alpha, reg_p, reg_q, number_LatentFactors=10, number_epochs=10, columns=["uid", "iid", "rating"]):
            self.alpha = alpha    # learning rate
            self.reg_p = reg_p    # regularization of the P matrix
            self.reg_q = reg_q    # regularization of the Q matrix
            self.number_LatentFactors = number_LatentFactors  # number of latent factors
            self.number_epochs = number_epochs    # maximum number of iterations
            self.columns = columns

        def fit(self, dataset):
            '''
            fit dataset
            :param dataset: uid, iid, rating
            :return:
            '''

            self.dataset = pd.DataFrame(dataset)

            self.users_ratings = dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
            self.items_ratings = dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]

            self.globalMean = self.dataset[self.columns[2]].mean()

            self.P, self.Q = self.sgd()

        def _init_matrix(self):
            '''
            Initialize the P and Q matrices with random values between 0 and 1
            :return: P, Q
            '''
            # User-LF
            P = dict(zip(
                self.users_ratings.index,
                np.random.rand(len(self.users_ratings), self.number_LatentFactors).astype(np.float32)
            ))
            # Item-LF
            Q = dict(zip(
                self.items_ratings.index,
                np.random.rand(len(self.items_ratings), self.number_LatentFactors).astype(np.float32)
            ))
            return P, Q

        def sgd(self):
            '''
            Optimize P and Q with stochastic gradient descent
            :return: P, Q
            '''
            P, Q = self._init_matrix()

            for i in range(self.number_epochs):
                print("iter%d" % i)
                error_list = []
                for uid, iid, r_ui in self.dataset.itertuples(index=False):
                    # User-LF P, Item-LF Q
                    v_pu = P[uid]  # user vector
                    v_qi = Q[iid]  # item vector
                    err = np.float32(r_ui - np.dot(v_pu, v_qi))

                    v_pu += self.alpha * (err * v_qi - self.reg_p * v_pu)
                    v_qi += self.alpha * (err * v_pu - self.reg_q * v_qi)

                    P[uid] = v_pu
                    Q[iid] = v_qi

                    # for k in range(self.number_LatentFactors):
                    #     v_pu[k] += self.alpha*(err*v_qi[k] - self.reg_p*v_pu[k])
                    #     v_qi[k] += self.alpha*(err*v_pu[k] - self.reg_q*v_qi[k])

                    error_list.append(err ** 2)
                print(np.sqrt(np.mean(error_list)))
            return P, Q

        def predict(self, uid, iid):
            # If the uid or iid is not in the training data, return the global average as the prediction
            if uid not in self.users_ratings.index or iid not in self.items_ratings.index:
                return self.globalMean

            p_u = self.P[uid]
            q_i = self.Q[iid]

            return np.dot(p_u, q_i)

        def test(self, testset):
            '''Predict on the test set'''
            for uid, iid, real_rating in testset.itertuples(index=False):
                try:
                    pred_rating = self.predict(uid, iid)
                except Exception as e:
                    print(e)
                else:
                    yield uid, iid, real_rating, pred_rating

    if __name__ == '__main__':
        dtype = [("userId", np.int32), ("movieId", np.int32), ("rating", np.float32)]
        dataset = pd.read_csv("datasets/ml-latest-small/ratings.csv", usecols=range(3), dtype=dict(dtype))

        lfm = LFM(0.02, 0.01, 0.01, 10, 100, ["userId", "movieId", "rating"])
        lfm.fit(dataset)

        while True:
            uid = input("uid: ")
            iid = input("iid: ")
            print(lfm.predict(int(uid), int(iid)))

Realization of CF algorithm based on matrix decomposition: BiasSvd

BiasSvd is basically the Funk SVD matrix factorization with a bias term

5.1 BiasSvd

BiasSvd predicts a user's rating of an item, where $K$ denotes the number of latent features:


$$\hat{r}_{ui}=\mu+b_u+b_i+\vec{p_{uk}}\cdot\vec{q_{ik}}=\mu+b_u+b_i+\sum_{k=1}^{K}p_{uk}q_{ik}$$

5.2 Loss function

Again, for rating prediction we use the squared error to construct the loss function:


$$Cost=\sum_{u,i\in R}(r_{ui}-\hat{r}_{ui})^2=\sum_{u,i\in R}\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik}\right)^2$$

Add L2 regularization:


$$Cost=\sum_{u,i\in R}\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik}\right)^2+\lambda\left(\sum_U b_u^2+\sum_I b_i^2+\sum_U p_{uk}^2+\sum_I q_{ik}^2\right)$$

5.3 Stochastic gradient descent optimization

Gradient descent update for the parameter $p_{uk}$:


$$p_{uk}:=p_{uk}+\alpha\left[\sum_{u,i\in R}\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik}\right)q_{ik}-\lambda p_{uk}\right]$$

In the same way:


$$q_{ik}:=q_{ik}+\alpha\left[\sum_{u,i\in R}\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik}\right)p_{uk}-\lambda q_{ik}\right]$$

$$b_u:=b_u+\alpha\left[\sum_{u,i\in R}\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik}\right)-\lambda b_u\right]$$

$$b_i:=b_i+\alpha\left[\sum_{u,i\in R}\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik}\right)-\lambda b_i\right]$$

Stochastic gradient descent (single-sample updates):


$$p_{uk}:=p_{uk}+\alpha\left[\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik}\right)q_{ik}-\lambda_1 p_{uk}\right]$$

$$q_{ik}:=q_{ik}+\alpha\left[\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik}\right)p_{uk}-\lambda_2 q_{ik}\right]$$

$$b_u:=b_u+\alpha\left[\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik}\right)-\lambda_3 b_u\right]$$

$$b_i:=b_i+\alpha\left[\left(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik}\right)-\lambda_4 b_i\right]$$

5.4 Algorithm Implementation

'''
BiasSvd Model
'''
import math
import random
import pandas as pd
import numpy as np

class BiasSvd(object):

    def __init__(self, alpha, reg_p, reg_q, reg_bu, reg_bi, number_LatentFactors=10, number_epochs=10, columns=["uid", "iid", "rating"]):
        self.alpha = alpha    # learning rate
        self.reg_p = reg_p
        self.reg_q = reg_q
        self.reg_bu = reg_bu
        self.reg_bi = reg_bi
        self.number_LatentFactors = number_LatentFactors  # number of latent factors
        self.number_epochs = number_epochs
        self.columns = columns

    def fit(self, dataset):
        '''
        fit dataset
        :param dataset: uid, iid, rating
        :return:
        '''

        self.dataset = pd.DataFrame(dataset)

        self.users_ratings = dataset.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        self.items_ratings = dataset.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        self.globalMean = self.dataset[self.columns[2]].mean()

        self.P, self.Q, self.bu, self.bi = self.sgd()

    def _init_matrix(self):
        '''
        Initialize the P and Q matrices with random values between 0 and 1
        :return: P, Q
        '''
        # User-LF
        P = dict(zip(
            self.users_ratings.index,
            np.random.rand(len(self.users_ratings), self.number_LatentFactors).astype(np.float32)
        ))
        # Item-LF
        Q = dict(zip(
            self.items_ratings.index,
            np.random.rand(len(self.items_ratings), self.number_LatentFactors).astype(np.float32)
        ))
        return P, Q

    def sgd(self):
        '''
        Optimize P, Q, bu and bi with stochastic gradient descent
        :return: P, Q, bu, bi
        '''
        P, Q = self._init_matrix()

        # Initialize bu and bi to 0
        bu = dict(zip(self.users_ratings.index, np.zeros(len(self.users_ratings))))
        bi = dict(zip(self.items_ratings.index, np.zeros(len(self.items_ratings))))

        for i in range(self.number_epochs):
            print("iter%d" % i)
            error_list = []
            for uid, iid, r_ui in self.dataset.itertuples(index=False):
                v_pu = P[uid]
                v_qi = Q[iid]
                err = np.float32(r_ui - self.globalMean - bu[uid] - bi[iid] - np.dot(v_pu, v_qi))

                v_pu += self.alpha * (err * v_qi - self.reg_p * v_pu)
                v_qi += self.alpha * (err * v_pu - self.reg_q * v_qi)

                P[uid] = v_pu
                Q[iid] = v_qi

                bu[uid] += self.alpha * (err - self.reg_bu * bu[uid])
                bi[iid] += self.alpha * (err - self.reg_bi * bi[iid])

                error_list.append(err ** 2)
            print(np.sqrt(np.mean(error_list)))

        return P, Q, bu, bi

    def predict(self, uid, iid):

        if uid not in self.users_ratings.index or iid not in self.items_ratings.index:
            return self.globalMean

        p_u = self.P[uid]
        q_i = self.Q[iid]

        return self.globalMean + self.bu[uid] + self.bi[iid] + np.dot(p_u, q_i)


if __name__ == '__main__':
    dtype = [("userId", np.int32), ("movieId", np.int32), ("rating", np.float32)]
    dataset = pd.read_csv("datasets/ml-latest-small/ratings.csv", usecols=range(3), dtype=dict(dtype))

    # Pass the MovieLens column names so that fit() groups by the correct columns
    bsvd = BiasSvd(0.02, 0.01, 0.01, 0.01, 0.01, 10, 20, ["userId", "movieId", "rating"])
    bsvd.fit(dataset)

    while True:
        uid = input("uid: ")
        iid = input("iid: ")
        print(bsvd.predict(int(uid), int(iid)))

VI. Content-based recommendation algorithm

6.1 Introduction

The content-based recommendation method is very direct: it makes recommendations based on the content description of items, essentially by directly analysing and computing the features or attributes of the items and of the users themselves.

For example, if we know that movie A is A comedy, and we happen to know that A user likes to watch comedy movies, we can recommend movie A to the user based on such known information.

6.2 Content-based recommendation implementation steps

  • Portrait construction: As the name implies, a portrait depicts the features of an object or user. It’s essentially labeling a user or an item

    • Object portraits: for example, what tags could be used to label the movie Wolf Warrior 2?

    A series of tags can be attached, such as “action”, “Wu Jing”, “Wu Gang”, “Hans Zhang”, “Chinese film”, “Chinese”, “patriotic”, “military” and so on

    • User portraits: for example, a user’s viewing history is known to be “Wolf Warrior 1”, “Wolf Warrior 2”, “The Founding of a Party”, “The Founding of an Army”, “The Founding of a Republic”, “Operation Red Sea” and “The Fast and the Furious 1-8”; from this we can infer the user’s interest features, such as the tags “patriotic”, “war”, “racing”, “action”, “military”, “Wu Jing” and “Han Sanping”.
  • Where do the labels come from

    • PGC item portrait – Cold start
      • Attributes inherent to the item (available as soon as the item is created): movie title, director, actors, genre, etc.
      • Attributes set by the service provider (attributes attached to the item by the service provider): such as short-video topics or Weibo topics drafted by the platform
      • Other channels: such as crawlers
    • UGC cold start problem
      • Attributes of items provided by users while enjoying the service, such as user comments or Weibo topics created by users

The object portrait constructed according to the PGC content can solve the cold start problem of the object

6.3 Content-based recommendation algorithm flow

  • Construct the object portrait based on the PGC/UGC content
  • Generate user portraits based on user behavior records
  • Find the most matching top-N items from the items according to the user’s portrait and make recommendations

6.4 Cold-start treatment of articles

  • Build an object portrait based on the PGC content
  • Calculate the similarity between two items from their portraits (a minimal similarity sketch follows this list)
  • Generate the top-N most similar items for each item for related recommendations: for example, which movies are similar to this movie, or which articles are similar to this article
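
A minimal sketch of the "similarity between two item portraits" step, assuming each portrait has already been reduced to a keyword-weight vector over the same keyword order (the vectors and keywords here are made up for illustration):

    import numpy as np

    def cosine_similarity(vec_a, vec_b):
        """Cosine similarity between two item-portrait vectors."""
        return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

    # Made-up keyword weights, e.g. over ["action", "patriotic", "comedy"]
    item_a = np.array([0.8, 0.6, 0.0])
    item_b = np.array([0.7, 0.5, 0.1])
    item_c = np.array([0.0, 0.1, 0.9])

    print(cosine_similarity(item_a, item_b))   # high: similar portraits
    print(cosine_similarity(item_a, item_c))   # low: dissimilar portraits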

7. Content-based film recommendations: Object portraits

7.1 Feature extraction technology based on TF-IDF

As mentioned above, the feature tags of an object portrait mainly refer to structured data such as a film's director and actors, or a book's author and publisher. Feature extraction from such data, in particular computing the feature vector, is relatively simple; for example, a work's genre can be encoded directly as a 0/1 state.

However, a portrait also involves features such as a movie's content, movie reviews, and the text of books. These are unstructured data: they clearly carry features that also belong to the item's tags, but such tag features are hard to quantify when computing the feature vector.

Therefore, natural language processing, information retrieval and related techniques are needed to quantify unstructured data such as users' text comments and other textual content, so as to build more complete item portraits and user portraits.

The TF-IDF algorithm is widely used in natural language processing. It can extract the keywords of a target document, compute their weights in that document, and combine these weights into a feature vector.

7.2 Algorithm Principle

In natural language processing, TF-IDF computes the weight of a word or phrase in a document as the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF is the number of times a given word appears in the document; this count is usually normalized to prevent a bias toward long documents (the same word may appear more often in a long document than in a short one regardless of its importance). IDF measures the general importance of a word: the IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing that word and taking the logarithm of the quotient.

The TF-IDF algorithm is based on the assumption that if a term appears frequently in the target document and infrequently in other documents, it can be used to distinguish the target document. There are two things to know about this assumption:

  • High frequency of occurrence in this document;
  • Appears infrequently in other documents.

Therefore, the TF-IDF algorithm consists of Term Frequency (TF) and Inverse Document Frequency (IDF), and the weight of a term in a document is set by the product of TF and IDF.

TF refers to how often a word appears in a document. Suppose the document set contains $N$ documents, the number of documents containing the keyword $k_i$ is $n_i$, $f_{ij}$ is the number of occurrences of keyword $k_i$ in document $d_j$, and $f_{dj}$ is the total number of words in $d_j$. Then the term frequency of $k_i$ in $d_j$ is defined as $TF_{ij}=\frac{f_{ij}}{f_{dj}}$. Note that this value is usually normalized to prevent a bias toward long documents (the same word may appear more often in a long document than in a short one, regardless of its importance).

IDF measures the general importance of a word across the whole document set. The inverse document frequency of keyword $k_i$ is $IDF_i=\log\frac{N}{n_i}$.

The word weight computed from TF and IDF is $w_{ij}=TF_{ij}\cdot IDF_i=\frac{f_{ij}}{f_{dj}}\cdot\log\frac{N}{n_i}$.

Conclusion: TF-IDF is proportional to the number of occurrences of the word in the document, and inversely proportional to the number of occurrences of the word in the whole document set.

Purpose: to extract the keywords (feature tags) of a target document, compute and compare the TF-IDF values of all words in the document, and take the k words with the largest TF-IDF values to form the feature vector that represents the document.

Note: stop words in the document, such as “is” and “of”, carry no meaning for the document's central idea; during word segmentation they need to be filtered out before computing the TF-IDF values of the other words.

7.3 Algorithm Examples

Take computing TF-IDF for reviews of the film “Pirates of the Caribbean: The Curse of the Black Pearl” as an example. Assume there are 1000 film reviews in total, and that one of them contains 200 words. Its most frequent words are “pirate”, “captain” and “freedom”, appearing 20, 15 and 10 times respectively, and they are mentioned in 1000, 500 and 100 of all the reviews respectively. The ranking of these three words as keywords is computed as follows.

  1. Filter out the stop words in the review and calculate the frequency of the remaining words. Taking the three most frequent words as examples:
    • The frequency of the word “pirate” is 20/200 = 0.1
    • The frequency of the word “captain” is 15/200 = 0.075
    • The frequency of the word “freedom” is 10/200 = 0.05
  2. Calculate the inverse document frequency of the words (base-10 logarithm):
    • The IDF of “pirate” is log(1000/1000) = 0
    • The IDF of “captain” is log(1000/500) ≈ 0.3; the IDF of “freedom” is log(1000/100) = 1
  3. From the results of 1 and 2, the TF-IDF of “pirate” is 0, “captain” is 0.0225, and “freedom” is 0.05.

By comparison, the keywords of this review should be ranked “freedom”, “captain”, “pirate”. Arranging the TF-IDF values of these words as weights in that order gives the feature vector of this film review. We use this vector to represent the review, and the magnitude of each dimension corresponds to the importance of that attribute.
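
The arithmetic above can be reproduced in a few lines of Python (a minimal sketch using a base-10 logarithm; the counts are the made-up figures from the example):

    import math

    total_reviews = 1000                    # total number of reviews in the corpus
    words_in_review = 200                   # words in the review being analysed

    # word -> (count in this review, number of reviews containing the word)
    stats = {"pirate": (20, 1000), "captain": (15, 500), "freedom": (10, 100)}

    for word, (count, doc_count) in stats.items():
        tf = count / words_in_review
        idf = math.log10(total_reviews / doc_count)
        print(word, round(tf * idf, 4))
    # pirate 0.0, captain 0.0226 (the text rounds the IDF to 0.3 first, giving 0.0225), freedom 0.05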

In general, the vectors of all reviews of a movie are multiplied by certain coefficients and summed to obtain an overall review vector for the movie. Combined with the film's basic attributes, this builds the movie's item portrait; user portraits are built in a similar way, and then various methods are used to compute the similarity between user portraits and item portraits in order to make recommendations for the user.

7.4 Loading a Data Set

import pandas as pd
import numpy as np
"- Use tags of each movie in distags. CSV as candidate keywords of movies - use TF·IDF to calculate tFIDF value of tags of each movie, select top-N keywords as movie portrait tags - and use the classification words of movies directly as portrait tags of each movie.

def get_movie_dataset() :
    # Load tabs based on all movies
    CSV from the ML-latest dataset
    # ML-latest -small has too much tag data
    _tags = pd.read_csv("datasets/ml-latest-small/all-tags.csv", usecols=range(1.3)).dropna()
    tags = _tags.groupby("movieId").agg(list)

    Load the movie list dataset
    movies = pd.read_csv("datasets/ml-latest-small/movies.csv", index_col="movieId")
    # Separate category words
    movies["genres"] = movies["genres"].apply(lambda x: x.split("|"))
    # Match tag data for each movie, if not NAN
    movies_index = set(movies.index) & set(tags.index)
    new_tags = tags.loc[list(movies_index)]
    ret = movies.join(new_tags)

    # create movie data set, including movie Id, movie name, category, tag four fields
    # If the movie has no tag data, replace it with an empty list
    # map(fun, iterable)
    movie_dataset = pd.DataFrame(
        map(
            lambda x: (x[0], x[1], x[2], x[2]+x[3]) if x[3] is not np.nan else (x[0], x[1], x[2], []), ret.itertuples())
        , columns=["movieId"."title"."genres"."tags"]
    )

    movie_dataset.set_index("movieId", inplace=True)
    return movie_dataset

movie_dataset = get_movie_dataset()
print(movie_dataset)
  • The map function

    • Description

      map() applies the provided function to each element of the given sequence.

      The first argument, function, is called with each element of the sequence in turn, and a new sequence consisting of the return values is produced.

    • Syntax

      The map() function syntax:

      map(function, iterable, ...)

    • Parameters

      • function – the function to apply
      • iterable – one or more sequences
    • Return value

      Python 2.x returns a list

      Python 3.x returns an iterator

    • Example

      >>> def square(x):            # Compute the square
      ...     return x ** 2
      ...
      >>> map(square, [1, 2, 3, 4, 5])   # Compute the square of each element of the list
      [1, 4, 9, 16, 25]
      >>> map(lambda x: x ** 2, [1, 2, 3, 4, 5])  # Use a lambda anonymous function
      [1, 4, 9, 16, 25]

      # Provide two lists and add the elements at the same positions
      >>> map(lambda x, y: x + y, [1, 3, 5, 7, 9], [2, 4, 6, 8, 10])
      [3, 7, 11, 15, 19]
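      The outputs above are in Python 2 style; in Python 3, map() returns an iterator, so a quick sketch of the equivalent calls wraps it in list():

      >>> list(map(lambda x: x ** 2, [1, 2, 3, 4, 5]))
      [1, 4, 9, 16, 25]
      >>> list(map(lambda x, y: x + y, [1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))
      [3, 7, 11, 15, 19]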

7.5 Extracting top-N keywords based on TF-IDF to construct movie portraits

  • Gensim introduction

    • A third-party Python library for natural language processing
    • Supports a variety of topic-model algorithms, including TF-IDF and word2vec
    • Installation: pip install gensim
  • Basic concepts of Gensim

    • Corpus: a collection of raw texts. In Gensim, a corpus is usually an iterable object (such as a list); each iteration returns a (sparse) vector representing one text object.
    • Vector: a list of text features; the internal representation of a text in Gensim.
    • Model: an algorithm that transforms one vector representation of a text into another, such as the TF-IDF model used below.
  • BOW bag of Words

    There are two very important models for text feature extraction:

    • Word-set model: the set of words in a document; each element appears only once, i.e. each word is recorded exactly once.
    • Bag-of-words model: on top of the word set, if a word appears more than once in the document, its number of occurrences (frequency) is also recorded.

    The essential difference between the two is that the bag of words adds a frequency dimension to the word set: the word set only records whether a word appears, while the bag of words also records how many times it appears.
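
    A minimal pure-Python sketch of the difference, using made-up words rather than the project's data:

    from collections import Counter

    words = ["free", "pirate", "free", "captain"]

    word_set = set(words)             # word set: only records whether a word appears
    print(word_set)                   # e.g. {'free', 'pirate', 'captain'} (set order is arbitrary)

    bag_of_words = Counter(words)     # bag of words: also records how many times
    print(bag_of_words)               # Counter({'free': 2, 'pirate': 1, 'captain': 1})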

from gensim.models import TfidfModel

import pandas as pd
import numpy as np

from pprint import pprint

#...

def create_movie_profile(movie_dataset):
    """Build the portrait of each movie
    :param movie_dataset: the movie dataset
    :return: {movieId: {tag: tfidf value}}
    """
    dataset = movie_dataset["tags"].values

    from gensim.corpora import Dictionary
    # Build the bag of words from the dataset: count word frequencies and put every word into a dictionary accessed by index
    dct = Dictionary(dataset)
    # For each movie's tag list, return the corresponding (word index, word frequency) pairs
    corpus = [dct.doc2bow(line) for line in dataset]
    # Train the TF-IDF model, i.e. compute the TF-IDF values
    model = TfidfModel(corpus)

    movie_profile = {}
    for i, mid in enumerate(movie_dataset.index):
        # The TF-IDF vector of this movie's tags
        vector = model[corpus[i]]
        # Take the top-N keywords by TF-IDF value
        movie_tags = sorted(vector, key=lambda x: x[1], reverse=True)[:30]
        # Map the word indexes back to the corresponding words
        movie_profile[mid] = dict(map(lambda x: (dct[x[0]], x[1]), movie_tags))

    return movie_profile

movie_dataset = get_movie_dataset()
pprint(create_movie_profile(movie_dataset))

7.6 Improve the keywords of the portrait

from gensim.models import TfidfModel

import pandas as pd
import numpy as np

from pprint import pprint

#...

def create_movie_profile(movie_dataset):
    """Build the portrait of each movie
    :param movie_dataset: the movie dataset
    :return: a DataFrame with the fields movieId, title, profile, weights
    """
    dataset = movie_dataset["tags"].values

    from gensim.corpora import Dictionary
    # Build the bag of words from the dataset: count word frequencies and put every word into a dictionary accessed by index
    dct = Dictionary(dataset)
    # For each movie's tag list, return the corresponding (word index, word frequency) pairs
    corpus = [dct.doc2bow(line) for line in dataset]
    # Train the TF-IDF model, i.e. compute the TF-IDF values
    model = TfidfModel(corpus)

    _movie_profile = []
    for i, data in enumerate(movie_dataset.itertuples()):
        mid = data[0]
        title = data[1]
        genres = data[2]
        vector = model[corpus[i]]
        movie_tags = sorted(vector, key=lambda x: x[1], reverse=True)[:30]
        topN_tags_weights = dict(map(lambda x: (dct[x[0]], x[1]), movie_tags))
        # Add the genre words and set their weight to 1.0
        for g in genres:
            topN_tags_weights[g] = 1.0
        topN_tags = list(topN_tags_weights.keys())
        _movie_profile.append((mid, title, topN_tags, topN_tags_weights))

    movie_profile = pd.DataFrame(_movie_profile, columns=["movieId", "title", "profile", "weights"])
    movie_profile.set_index("movieId", inplace=True)
    return movie_profile

movie_dataset = get_movie_dataset()
pprint(create_movie_profile(movie_dataset))

In order to quickly match the corresponding movies for a given keyword, an inverted index from the portrait tag words to the items needs to be built.

Introduction to inverted index

Normally, stored data is indexed by the item's ID, and the ID is used to look up the item's other information.

An inverted index works the other way round: the item's other data (here, the portrait tag words) is used as the index, from which the list of matching item IDs can be retrieved.
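
A minimal sketch of the two directions, using made-up data rather than the project's tables:

# Forward index: movieId -> tag words
forward_index = {1: ["pirate", "adventure"], 2: ["pirate", "comedy"]}

# Inverted index: tag word -> list of movieIds
inverted_index = {}
for mid, tags in forward_index.items():
    for tag in tags:
        inverted_index.setdefault(tag, []).append(mid)

print(inverted_index)   # {'pirate': [1, 2], 'adventure': [1], 'comedy': [2]}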

#...

# Create an inverted index from tag words to items

def create_inverted_table(movie_profile):
    inverted_table = {}
    for mid, weights in movie_profile["weights"].iteritems():
        for tag, weight in weights.items():
            # The inverted_table dict uses the tag as the key; the default value is an empty list []
            _ = inverted_table.get(tag, [])
            # Append the movie id and its weight to the list as a tuple
            _.append((mid, weight))
            # Put the (possibly new) list back into the dict
            inverted_table.setdefault(tag, _)
    return inverted_table

inverted_table = create_inverted_table(movie_profile)
pprint(inverted_table)

Content-based movie recommendation: user portraits

User portrait construction steps:

  • Based on the user's rating history and the item portraits, take the portrait tags of the movies the user has watched as the user's initial tags
  • Calculate the weight of each initial tag by counting how many times it appears across the watched movies; after sorting, keep the top-N tags as the user's final portrait tags

8.1 User Portrait Establishment

import pandas as pd
import numpy as np
from gensim.models import TfidfModel

from functools import reduce
import collections

from pprint import pprint

#...

"""
User profile creation:
1. Extract the user's watch list
2. Collect the portrait tags of the watched movies as the user's candidate tags and count their word frequencies
3. Sort by word frequency and keep at most the top-K words (K is set to 100) as the user's tags
"""

def create_user_profile():
    watch_record = pd.read_csv("datasets/ml-latest-small/ratings.csv", usecols=range(2), dtype={"userId": np.int32, "movieId": np.int32})

    watch_record = watch_record.groupby("userId").agg(list)
    # print(watch_record)

    movie_dataset = get_movie_dataset()
    movie_profile = create_movie_profile(movie_dataset)

    user_profile = {}
    for uid, mids in watch_record.itertuples():
        record_movie_profile = movie_profile.loc[list(mids)]
        # Merge the profile tag lists of all watched movies and count each tag's occurrences
        counter = collections.Counter(reduce(lambda x, y: list(x)+list(y), record_movie_profile["profile"].values))
        # Take the 50 most common words
        interest_words = counter.most_common(50)
        # The number of occurrences of the most frequent word
        maxcount = interest_words[0][1]
        # Normalize so that the most frequent word has a weight of 1
        interest_words = [(w, round(c/maxcount, 4)) for w, c in interest_words]
        user_profile[uid] = interest_words

    return user_profile

user_profile = create_user_profile()
pprint(user_profile)

8.2 The reduce function

  • Description

    The reduce() function accumulates the elements of a sequence.

    It applies the two-argument function passed to reduce to the first and second elements of the sequence (list, tuple, etc.), then combines that result with the third element, and so on, until a single result is obtained.

  • Syntax

    The reduce() function syntax:

    reduce(function, iterable[, initializer])
  • Parameters

    • function – a function that takes two arguments
    • iterable – an iterable object
    • initializer – optional initial value
  • Return value

    Returns the accumulated result of the function calls

  • Example

    >>> def add(x, y):            # Add two numbers
    ...     return x + y
    ...
    >>> reduce(add, [1, 2, 3, 4, 5])   # Calculate the sum of the list: 1+2+3+4+5
    15
    >>> reduce(lambda x, y: x+y, [1, 2, 3, 4, 5])  # Use a lambda anonymous function
    15
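    Note that in Python 3, reduce() is no longer a built-in and must be imported from functools (as the user-profile code above already does), for example:

    >>> from functools import reduce
    >>> reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])
    15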

8.3 Use the collections.Counter class to count the occurrences of list elements

from collections import Counter
names = ["Stanley"."Lily"."Bob"."Well"."Peter"."Bob"."Well"."Peter"."Well"."Peter"."Bob"."Stanley"."Lily"."Bob"."Well"."Peter"."Bob"."Bob"."Well"."Peter"."Bob"."Well"]
names_counts = Counter(names)
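For reference, printing the counter should give roughly the following (Counter displays its elements ordered from most to least common):

print(names_counts)
# Counter({'Bob': 7, 'Well': 6, 'Peter': 5, 'Stanley': 2, 'Lily': 2})
print(names_counts.most_common(2))
# [('Bob', 7), ('Well', 6)]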

Content-based movie recommendation: generating top-N recommendation results for users

#...

user_profile = create_user_profile()

watch_record = pd.read_csv("datasets/ml-latest-small/ratings.csv", usecols=range(2),dtype={"userId": np.int32, "movieId": np.int32})

watch_record = watch_record.groupby("userId").agg(list)

for uid, interest_words in user_profile.items():
    result_table = {}  # movieId: [0.2, 0.5, 0.7]
    for interest_word, interest_weight in interest_words:
        related_movies = inverted_table[interest_word]
        for mid, related_weight in related_movies:
            _ = result_table.get(mid, [])
            _.append(interest_weight)    # Only consider the user's interest level
            # _.append(related_weight) # Consider only the degree of association between the words of interest and the movie
            # _.append(interest_weight*related_weight) # Consider both
            result_table.setdefault(mid, _)

    rs_result = map(lambda x: (x[0], sum(x[1])), result_table.items())
    rs_result = sorted(rs_result, key=lambda x: x[1], reverse=True)[:100]
    print(uid)
    pprint(rs_result)
    break

    # Historical data ==> historical interests ==> historical recommendation results: offline recommendation, computed offline
    # Online recommendation ===> entertainment (Wang Sicong) ===> me ==> Wang Sicong 100%
    # Near-line: real-time computation over the most recent 1, 3 and 7 days