This post is part of the notes and code reproduction (anomaly detection, recommender systems) of Ng's machine learning course [1].
Author: Huang Haiguang [2]
Note: Notes and assignments (including data and original assignment files) and videos can be downloaded on Github [3].
I will publish the course notes and code on the official account "Machine Learning Beginners"; please follow it. This is Part 5: Anomaly Detection and Recommender Systems, covering week 9 of the original course, and it includes both the notes and the assignment code (the original assignments were in Octave; here they are reworked in Python).
Completed parts:
Part One: Regression
Part Two: Logistic Regression
Part Three: Support Vector Machines
Part Four: Unsupervised Learning
The assignment code [4] for this article can be downloaded in full
Markdown file for notes [5]
PDF file of notes [6]
Notes section – Table of contents
XV. Anomaly Detection
15.1 Motivation for the question
15-1 – Problem Motivation (8 min).mkv
In the next series of videos, I'll introduce the problem of anomaly detection. This is a common application of machine learning. One interesting aspect of this algorithm is that although it is mainly used for unsupervised learning problems, it resembles some supervised learning problems in certain respects.
What is anomaly detection? To illustrate the concept, let me give an example:
Imagine you are an aircraft engine manufacturer. As part of QA (quality assurance) testing, as each aircraft engine rolls off the production line you measure some features of it, such as the heat generated when the engine runs, or its vibration, and so on.
So you have a data set $x^{(1)}, \dots, x^{(m)}$, one example for each of the $m$ engines you have built, and if you plot that data it looks something like this:
Every dot here, every cross, is an unlabeled example. The anomaly detection problem can then be defined as follows: suppose that some time later a new aircraft engine comes off the production line, with feature vector $x_{test}$. The anomaly detection problem is to decide whether there is something abnormal about this new engine, i.e. whether it needs further testing. If it looks like a normal engine, we can ship it straight to the customer without further testing.
Given the data set $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, which we assume to be normal, we want to know whether a new example $x_{test}$ is anomalous, i.e. how likely it is that the new example comes from the same distribution as the data set. The model $p(x)$ we build should tell us, based on where the test example lies, the probability that it belongs to that group of data.
In the figure above, data inside the blue circle is more likely to belong to the group; the further away a point is, the less likely it is to belong.
This method is called density estimation and can be written as:

$$\text{if} \quad p(x) \begin{cases} < \varepsilon & \text{anomaly} \\ \geq \varepsilon & \text{normal} \end{cases}$$
Fraud detection:
$x^{(i)}$ = features of user $i$'s activity. The model $p(x)$ gives the likelihood that an example belongs to the group of normal data; users with $p(x) < \varepsilon$ are flagged as abnormal.
Anomaly detection is mainly used to identify fraud. For example, when collecting data about users online, a feature vector might include how often a user logs in, the pages they visit, the number of posts they make on forums, even their typing speed. We try to build a model $p(x)$ from these features that can be used to identify users who do not fit the pattern.
Another example is monitoring computers in a data center. The features might include memory usage, number of disk accesses, CPU load, network traffic, and so on. From these features we can build a model $p(x)$ to judge whether a particular computer is likely to be malfunctioning.
15.2 Gaussian Distribution
15-2-Gaussian Distribution (10 min).mkv
In this video, I'm going to introduce the Gaussian distribution, also known as the normal distribution, and review its basics.
Generally, if we believe a variable $x$ follows a Gaussian distribution $x \sim N(\mu, \sigma^2)$, its probability density function is:

$$p(x;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

We can estimate $\mu$ and $\sigma^2$ from existing data as follows:

$$\mu=\frac{1}{m}\sum_{i=1}^m x^{(i)} \qquad \sigma^2=\frac{1}{m}\sum_{i=1}^m \left(x^{(i)}-\mu\right)^2$$
Example of Gaussian distribution:
Note: in machine learning we usually use $\frac{1}{m}$ in the variance formula rather than the $\frac{1}{m-1}$ used in statistics. In practice it makes little difference which version you use: as long as you have a reasonably large training set, most people in machine learning use the $\frac{1}{m}$ version. The two versions differ slightly in their theoretical and mathematical properties, but in practical use the difference is almost negligible.
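As a quick illustration of that note (a minimal sketch, not part of the original notes), NumPy's np.var uses the $\frac{1}{m}$ convention by default (ddof=0), while ddof=1 gives the $\frac{1}{m-1}$ statistical estimator:

```python
import numpy as np

x = np.random.randn(1000) * 2.0 + 5.0   # synthetic one-dimensional data

mu = x.mean()
sigma2_ml = x.var()            # divides by m (ddof=0, the default): the version used in the course
sigma2_stats = x.var(ddof=1)   # divides by m - 1: the statistics convention

print(mu, sigma2_ml, sigma2_stats)   # the two variance estimates are almost identical for large m
```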
15.3 Algorithm
15-3-algorithm (12 min).mkv
In this video, I will use the Gaussian distribution to develop an anomaly detection algorithm.
Anomaly detection algorithm:
For a given data set $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, we compute estimates of $\mu_j$ and $\sigma_j^2$ for each feature:

$$\mu_j=\frac{1}{m}\sum_{i=1}^m x_j^{(i)} \qquad \sigma_j^2=\frac{1}{m}\sum_{i=1}^m \left(x_j^{(i)}-\mu_j\right)^2$$

Once we have the estimates of the mean and variance, given a new example $x$ we compute $p(x)$ according to the model:

$$p(x)=\prod_{j=1}^n p\!\left(x_j;\mu_j,\sigma_j^2\right)=\prod_{j=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\!\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$$

When $p(x) < \varepsilon$, the example is flagged as an anomaly.
The following figure shows a training set consisting of two features and the distribution of features:
The following three-dimensional plot shows the density estimation function, with the vertical axis being $p(x)$ estimated from the values of the two features:
We choose a value of $\varepsilon$ to serve as our decision boundary: when $p(x) > \varepsilon$ the data is predicted to be normal, otherwise it is flagged as anomalous.
In this video we showed how to fit $p(x)$, the probability model of the features, to develop an anomaly detection algorithm. Given a data set, we performed parameter estimation to obtain $\mu_j$ and $\sigma_j^2$, and then evaluated $p(x)$ for a new example to decide whether it is anomalous.
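Putting those steps together, here is a minimal sketch of the whole procedure on synthetic data (the threshold $\varepsilon$ is arbitrary here; the assignment code later in this post works through the real exercise):

```python
import numpy as np

def fit_gaussian_params(X):
    """Estimate mu_j and sigma_j^2 for every feature (column) of X."""
    return X.mean(axis=0), X.var(axis=0)

def p(X, mu, sigma2):
    """p(x): product over features of the univariate Gaussian densities."""
    densities = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)

X_train = np.random.randn(500, 2)             # synthetic "normal" data around the origin
mu, sigma2 = fit_gaussian_params(X_train)

X_new = np.array([[0.1, -0.2],                # close to the cluster
                  [4.0, 4.0]])                # far away from it
epsilon = 0.01                                # arbitrary threshold for this sketch
print(p(X_new, mu, sigma2) < epsilon)         # expected: [False  True]
```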
In the rest of this chapter, we'll take a deeper look at this algorithm and at how to make it work more effectively.
15.4 Develop and evaluate an anomaly detection system
15-4 – Developing and Evaluating an Anomaly Detection System (13 minutes).mkv
The anomaly detection algorithm is an unsupervised learning algorithm, which means we cannot tell from the output alone whether a prediction is really correct; we need another way to verify that the algorithm works. When developing an anomaly detection system, we start from some labeled (anomalous or normal) data, select a portion of the normal data to build the training set, and then mix the remaining normal data with the anomalous data to form the cross-validation set and the test set.
For example, we have data for 10000 normal engines and 20 abnormal engines. We allocate data like this:
Data from 6000 normal engines as the training set
Data from 2000 normal engines and 10 anomalous engines as the cross-validation set
Data from 2000 normal engines and 10 anomalous engines as the test set
The specific evaluation methods are as follows:
- Based on the training set, estimate the mean and variance of each feature and construct the $p(x)$ function
- On the cross-validation set, try different values of $\varepsilon$ as the threshold and predict whether each example is anomalous, choosing $\varepsilon$ according to the F1 score or the precision/recall trade-off (see the sketch below)
- With $\varepsilon$ selected, make predictions on the test set and compute the F1 score (or precision and recall) of the anomaly detection system
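A compact sketch of that ε-selection step (the assignment code later in this post implements the same idea as select_threshold):

```python
import numpy as np

def f1_score(preds, y):
    tp = np.sum((preds == 1) & (y == 1))
    fp = np.sum((preds == 1) & (y == 0))
    fn = np.sum((preds == 0) & (y == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_val, y_val):
    """Scan candidate thresholds on the cross-validation set and keep the best F1."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_val.min(), p_val.max(), 1000):
        f1 = f1_score((p_val < eps).astype(int), y_val)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```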
15.5 Comparison between anomaly detection and supervised learning
See video 15-5 – Anomaly Detection vs. Supervised Learning (8 min).mkv
The anomaly detection system we built above also uses labeled data, which makes it similar to supervised learning. The following comparison helps in choosing between supervised learning and anomaly detection:
Comparison of the two:
Anomaly detection | Supervised learning |
---|---|
Very few positive examples (anomalies, $y=1$) and a large number of negative examples ($y=0$) | A large number of both positive and negative examples |
Many different types of anomalies; it is hard to train an algorithm on so few positive examples, and future anomalies may look nothing like any of the anomalies seen so far | Enough positive examples to train the algorithm, and future positive examples are likely to resemble those in the training set |
Examples: fraud detection, manufacturing (e.g. aircraft engines), monitoring computers in a data center | Examples: email spam filtering, weather prediction, tumor classification |
Hopefully this lecture gives you a sense of which characteristics of a learning problem make it better suited to anomaly detection or to supervised learning. In particular, for many problems that technology companies encounter, the number of positive examples is very small (sometimes zero), and there are many different kinds of anomalies that have never been seen before; for such problems, anomaly detection is the algorithm to use.
15.6 Selecting Features
15-6 – Choosing What Features to Use (12 min).mkv
For the anomaly detection algorithm, the features we use are very important. Here is how to select features:
Anomaly detection assumes that the features follow a Gaussian distribution. If the data is not Gaussian, the algorithm can still work, but it is better to transform the data so that it looks more Gaussian, for example with a logarithmic transformation $x = \log(x + c)$, where $c$ is a non-negative constant, or a power transformation $x = x^c$ with $c$ between 0 and 1. (Editor's note: in Python the np.log1p() function is usually used, which avoids negative results; its inverse is np.expm1().)
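For example (a quick illustrative sketch on synthetic skewed data):

```python
import numpy as np

x = np.random.exponential(scale=2.0, size=1000)   # heavily right-skewed, clearly non-Gaussian

x_log = np.log1p(x)        # log(1 + x): much more symmetric, closer to Gaussian
x_back = np.expm1(x_log)   # inverse transform recovers the original values

print(x.mean(), x_log.mean(), np.allclose(x, x_back))
```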
Error analysis:
A common problem is that some anomalous examples may also have a high $p(x)$ value and therefore be judged normal by the algorithm. Error analysis can help here: we examine the examples that the algorithm wrongly predicted as normal and look for patterns. This may show us that we need to add new features, and the algorithm built from those new features may detect anomalies better.
Anomaly detection error analysis:
We can often combine related features to obtain new and better features (ones whose values are unusually large or small for anomalous examples). For example, in the data center monitoring case, we might use the ratio of CPU load to network traffic as a new feature; if that value is unusually large, it may mean the server has run into some problem.
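A tiny sketch of that idea (the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical per-server measurements
servers = pd.DataFrame({
    'cpu_load':        [0.30, 0.35, 0.95, 0.40],
    'network_traffic': [0.60, 0.70, 0.05, 0.80],   # third server: busy CPU but almost no traffic
})

# Combined feature: unusually large when the CPU is busy but traffic is low
servers['cpu_to_traffic'] = servers['cpu_load'] / servers['network_traffic']
print(servers)
```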
In this video we showed how to select features and apply small transformations so the data looks more like a normal distribution before feeding it to the anomaly detection algorithm. We also introduced error analysis as a way to catch more kinds of anomalies. Hopefully you now know how to choose good feature variables to help your anomaly detection algorithm catch different kinds of anomalies.
15.7 Multivariate Gaussian Distribution (Optional)
15-7 – Multivariate Gaussian Distribution (Optional) (14 min).mkv
If we have two related features, and the range of the two features is relatively wide, in this case, the general Gaussian distribution model may not be able to identify abnormal data very well. The reason is that the general Gaussian distribution model tries to capture the deviation of two features at the same time, thus creating a relatively large decision boundary.
The two related features are shown in the figure below. The magenta contours (which can be larger or smaller depending on ε) are the decision boundary obtained with the general Gaussian distribution model. The data point marked with a green X is quite likely an outlier, yet its $p(x)$ value still falls within the normal range. The multivariate Gaussian distribution produces the decision boundary shown by the blue curve.
In the general Gaussian distribution model, we compute $p(x)$ by calculating the probability of each feature separately and multiplying the results together. In the multivariate Gaussian distribution model, we construct the covariance matrix of the features and compute $p(x)$ with all the features together.
We first compute the mean of all the features and then the covariance matrix:

$$\mu=\frac{1}{m}\sum_{i=1}^m x^{(i)} \qquad \Sigma=\frac{1}{m}\sum_{i=1}^m \left(x^{(i)}-\mu\right)\left(x^{(i)}-\mu\right)^T$$

Note: $\mu$ is a vector whose entries are the means of the individual features in the original feature matrix. Finally we compute the multivariate Gaussian density:

$$p(x)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$

where $|\Sigma|$ is the determinant of the covariance matrix, computed in Octave with det(Sigma), and $\Sigma^{-1}$ is its inverse. Let's see how the covariance matrix affects the model:
Here are five different models, analyzed from left to right:
- The first is a general Gaussian distribution model
- Through the covariance matrix, feature 1 has a small deviation, while maintaining the deviation of feature 2
- Through the covariance matrix, feature 2 has a large deviation, while maintaining the deviation of feature 1
- Through the covariance matrix, the positive correlation between the two features is increased without changing the original bias of the two features
- Through the covariance matrix, the negative correlation between the two features is increased without changing the original bias of the two features
Relationship between multivariate Gaussian distribution model and original Gaussian distribution model:
It can be shown that the original Gaussian distribution model is a subset of the multivariate Gaussian distribution model: as in examples 1, 2 and 3 in the figure above, if the covariance matrix has non-zero values only on its diagonal, the multivariate model reduces to the original Gaussian distribution model.
Comparison between the original Gaussian distribution model and multivariate Gaussian distribution model:
The original Gaussian distribution model | Multivariate Gaussian distribution model |
---|---|
Cannot capture correlations between features, but this can be handled by manually combining features | Automatically captures correlations between features |
Low computational cost; scales to a very large number of features | Higher computational cost |
Works even when the training set is small | Must have $m > n$, otherwise the covariance matrix $\Sigma$ is not invertible; redundant (linearly dependent) features also make $\Sigma$ non-invertible |
The original Gaussian distribution model is the more widely used of the two. If there is some correlation between features, we can capture it by constructing new combined features.
If the number of features is not too large and the training set is large enough, we can use the multivariate Gaussian distribution model.
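To illustrate the "subset" relationship mentioned above, here is a small numerical check (a sketch, not from the original notes): with a diagonal covariance matrix, the multivariate density equals the product of the per-feature univariate densities.

```python
import numpy as np
from scipy import stats

mu = np.array([2.0, -1.0])
sigma2 = np.array([1.5, 0.5])                      # per-feature variances

x = np.array([2.3, -0.4])

# Original model: product of the univariate Gaussian densities
p_original = np.prod(stats.norm(mu, np.sqrt(sigma2)).pdf(x))

# Multivariate model with a diagonal covariance matrix
p_multivariate = stats.multivariate_normal(mean=mu, cov=np.diag(sigma2)).pdf(x)

print(p_original, p_multivariate)                  # the two values agree
```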
15.8 (Optional) Using Multivariate Gaussian Distribution for Anomaly Detection
15-8 – Anomaly Detection Using the Multivariate Gaussian Distribution (Optional) (14 min).mkv
In the last video we talked about the multivariate Gaussian distribution, and we saw the various distributions you can build by varying the parameters $\mu$ and $\Sigma$. In this video, let's take those ideas and apply them to develop a different anomaly detection algorithm.
To review the multivariate Gaussian distribution (also called the multivariate normal distribution):
The distribution has two parameters, $\mu$ and $\Sigma$. Here $\mu$ is an $n$-dimensional vector and $\Sigma$, the covariance matrix, is an $n \times n$ matrix. By varying $\mu$ and $\Sigma$ you get a whole range of different distributions, like the ones we saw in the previous video.
So let’s talk about parameter fitting or parameter estimation:
I have a set of examples $x^{(1)}, \dots, x^{(m)}$, each an $n$-dimensional vector, and I believe my examples come from a multivariate Gaussian distribution. How do I estimate the parameters $\mu$ and $\Sigma$? The standard formulas are:
$\mu=\frac{1}{m}\sum_{i=1}^m x^{(i)}$, the average of the training examples, and
$\Sigma=\frac{1}{m}\sum_{i=1}^m \left(x^{(i)}-\mu\right)\left(x^{(i)}-\mu\right)^T$, the same kind of expression we wrote out when we used the PCA algorithm. Plugging in these two formulas gives you the estimates of $\mu$ and $\Sigma$ for a given data set, which we can then plug into an anomaly detection algorithm. So how do we put all this together to develop one?
First, we take the training set and fit the model: we compute $p(x)$ by setting $\mu$ and $\Sigma$ as described above.
As shown in the figure, most of the probability mass lies near the center of the distribution and falls off toward the outer contours, so a point way out here has a very low probability and is flagged as an anomaly.
The relationship between the original model and the multivariate Gaussian model: the original model corresponds to a multivariate Gaussian whose contours are axis-aligned, i.e. whose covariance matrix is diagonal:

$$\Sigma=\begin{pmatrix}\sigma_1^2 & & \\ & \ddots & \\ & & \sigma_n^2\end{pmatrix}$$
The comparison between the original model and multivariate Gaussian distribution is shown in the figure below:
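A minimal sketch of anomaly detection with the multivariate Gaussian (using SciPy on synthetic data; the threshold is arbitrary here):

```python
import numpy as np
from scipy import stats

def fit_multivariate_gaussian(X):
    """Estimate mu and the full covariance matrix Sigma from the training set."""
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / X.shape[0]
    return mu, Sigma

# Synthetic training data with two strongly correlated features
rng = np.random.default_rng(0)
X_train = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)

mu, Sigma = fit_multivariate_gaussian(X_train)
model = stats.multivariate_normal(mean=mu, cov=Sigma)

# A point whose individual features look normal but which violates their correlation
x_test = np.array([1.5, -1.5])
epsilon = 0.01                       # arbitrary threshold for this sketch
print(model.pdf(x_test) < epsilon)   # expected: True (flagged as an anomaly)
```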
XVI. Recommender Systems
16.1 Problem Formulation
Reference Video: 16-1-Problem Formulation (8 min).mkv
In the next video, I want to talk about recommendation systems. I want to talk about recommendation systems for two reasons:
First, simply because it is an important application of machine learning. Over the last few years, I’ve been visiting different technology companies in Silicon Valley, and I’ve been talking to people who are working on machine learning applications here, and I’ve been asking them, what are the most important machine learning applications, or what are the machine learning applications that you’d like to improve. The answer I hear most often is recommendation systems. Now, there are a lot of groups in Silicon Valley trying to build good recommendation systems. So, if you think about websites like Amazon, or Netflix or ebay, or iTunes Genius, there are lots of websites or systems that try to recommend new products to users. Amazon recommends new books to you, Netflix tries to recommend new movies to you, etc. These recommendation systems look at what books you’ve bought in the past, or what movies you’ve reviewed in the past. These systems will generate a big chunk of revenue, for example for Amazon and companies like Netflix. Therefore, the improvement of recommender system performance will have a substantial and direct impact on these enterprises.
Recommendation systems are also an interesting case for academic machine learning: if you go to an academic machine learning conference, recommendation system problems actually attract relatively little attention, or at least occupy a fairly small share of the academic world. But if you look at what is happening in industry, many technology companies with the ability to build such systems treat them as a very high priority. That is one of the reasons I talk about them in this lecture.
The second reason I want to talk about recommendation systems is this: what I want to do in the last few videos of this class is share some of the big ideas in machine learning. We have already seen that features matter a great deal, and that the features you choose have a big impact on the performance of your learning algorithm. So one big idea in machine learning is that, for some problems (maybe not all), there are algorithms that can automatically learn a good set of features for you, rather than you designing them by hand, which is what we have done so far. There are settings where an algorithm can learn which features to use by itself, and recommendation systems are one example of such a setting. There are many others, but through recommendation systems we will get a small taste of the idea of feature learning, and at least one example of this big idea in machine learning. So let's start by formalizing the recommendation system problem.
Let’s start by defining the problem with a recommendation system with an example.
Let’s say we’re a movie vendor, and we have five movies and four users, and we ask the users to rate the movies.
The first three movies were romances and the last two were action movies, so we can see that Alice and Bob are more interested in romance, while Carol and Dave are more interested in action movies. And no one user rated all the movies. We want to build an algorithm that predicts how much each of them is likely to rate movies they haven’t seen, and use that as a basis for recommendations.
Let's introduce some notation:
$n_u$: the number of users
$n_m$: the number of movies
$r(i,j) = 1$ if user $j$ has rated movie $i$
$y^{(i,j)}$: the rating user $j$ gave to movie $i$ (defined only when $r(i,j)=1$)
$m^{(j)}$: the total number of movies rated by user $j$
16.2 Content-based recommendation system
16-2 – Content Based Recommendations (15 min).mkv
In a content-based recommendation system algorithm, we assume that there is some data about the characteristics of the things we wish to recommend.
In our example, we can assume that each movie has two characteristics, such as the level of romance and the level of action.
Then each movie has a feature vector; for example, the feature vector of the first movie is [0.9, 0].
Now we will build a recommendation algorithm based on these features. Assuming we use a linear regression model, we can train a separate linear regression model for each user; for example, $\theta^{(1)}$ is the parameter vector of the model for the first user. We then have:
$\theta^{(j)}$: parameter vector for user $j$
$x^{(i)}$: feature vector for movie $i$
For user $j$ and movie $i$, we predict the rating $(\theta^{(j)})^T x^{(i)}$.
Cost function
For user $j$, the cost of this linear regression model is the sum of squared prediction errors plus a regularization term:

$$\min_{\theta^{(j)}}\ \frac{1}{2}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{k=1}^{n}\left(\theta_k^{(j)}\right)^2$$

Here $i:r(i,j)=1$ means we only count the movies that user $j$ has rated. In a general linear regression model the error term and the regularization term would both be multiplied by $\frac{1}{2m}$; here we drop the $\frac{1}{m}$. We also do not regularize the bias term $\theta_0$.

The cost function above is for a single user. To learn the parameters of all users, we sum the cost functions over all users:

$$\min_{\theta^{(1)},\dots,\theta^{(n_u)}}\ \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left(\theta_k^{(j)}\right)^2$$

If we use gradient descent to find the optimal parameters, the update rules after taking partial derivatives of the cost function are:

$$\theta_k^{(j)}:=\theta_k^{(j)}-\alpha\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)x_k^{(i)}\qquad(k=0)$$

$$\theta_k^{(j)}:=\theta_k^{(j)}-\alpha\left(\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)x_k^{(i)}+\lambda\theta_k^{(j)}\right)\qquad(k\neq 0)$$
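A minimal sketch of the content-based model for the toy data above (the feature values are illustrative, not from the course slides, and each user's parameters are solved here with regularized normal equations rather than the gradient descent update above, for brevity):

```python
import numpy as np

# Movie features [romance, action] (illustrative values)
X = np.array([[0.90, 0.00],
              [1.00, 0.01],
              [0.99, 0.00],
              [0.10, 1.00],
              [0.00, 0.90]])

# Ratings Y (movies x users) and indicator R (1 where a rating exists), matching the table in 16.5
Y = np.array([[5, 5, 0, 0],
              [5, 0, 0, 0],
              [0, 4, 0, 0],
              [0, 0, 5, 4],
              [0, 0, 5, 0]], dtype=float)
R = np.array([[1, 1, 1, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [1, 1, 1, 1],
              [1, 1, 1, 0]])

def fit_user(j, X, Y, R, lam=0.1):
    """Regularized least squares for one user's parameter vector theta^(j)."""
    rated = R[:, j] == 1
    A = X[rated].T @ X[rated] + lam * np.eye(X.shape[1])
    b = X[rated].T @ Y[rated, j]
    return np.linalg.solve(A, b)

theta_alice = fit_user(0, X, Y, R)   # parameters for the first user
print(X @ theta_alice)               # predicted ratings of every movie for that user
```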
16.3 Collaborative Filtering
Video: 16-3 – Collaborative Filtering (10 min).mkv
In the content-based recommendation approach above, we assumed we had the features $x^{(i)}$ for each movie and used them to train each user's parameters $\theta^{(j)}$. Conversely, if we have the users' parameters, we can learn the movies' features.
But if we have neither the users' parameters nor the movies' features, neither approach works. The collaborative filtering algorithm can learn both at the same time.
Our optimization objective is now over both $x$ and $\theta$:

$$J\!\left(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}\right)=\frac{1}{2}\sum_{(i,j):r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}\left(x_k^{(i)}\right)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left(\theta_k^{(j)}\right)^2$$

The partial derivatives of the cost function are:

$$\frac{\partial J}{\partial x_k^{(i)}}=\sum_{j:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)\theta_k^{(j)}+\lambda x_k^{(i)}$$

$$\frac{\partial J}{\partial \theta_k^{(j)}}=\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)x_k^{(i)}+\lambda\theta_k^{(j)}$$

Note: in the collaborative filtering algorithm we usually do not use the bias term ($x_0 = 1$); the algorithm will learn it by itself if it is needed. The steps of the collaborative filtering algorithm are:
- Initialize $x^{(1)},\dots,x^{(n_m)}$ and $\theta^{(1)},\dots,\theta^{(n_u)}$ to small random values
- Minimize the cost function with gradient descent (or another optimization algorithm)
- After training, predict user $j$'s rating of movie $i$ as $(\theta^{(j)})^T x^{(i)}$
The feature matrix learned in this process contains important information about the movies. It is not always interpretable by humans, but we can use it as a basis for recommending movies to users.
For example, if a user is watching movie $x^{(i)}$, we can look for another movie $x^{(j)}$ such that the distance $\|x^{(i)}-x^{(j)}\|$ between their feature vectors is small.
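A quick sketch of that similarity search, assuming X is a learned movies-by-features matrix (the random matrix here is just a placeholder):

```python
import numpy as np

def most_similar_movies(X, movie_id, top_k=5):
    """Indices of the top_k movies whose learned feature vectors are closest to movie_id's."""
    distances = np.linalg.norm(X - X[movie_id], axis=1)
    distances[movie_id] = np.inf          # exclude the movie itself
    return np.argsort(distances)[:top_k]

X = np.random.random((1682, 10))          # placeholder for a learned feature matrix
print(most_similar_movies(X, movie_id=0))
```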
16.4 Collaborative Filtering Algorithm
16-4 – Collaborative Filtering Algorithm (9 min).mkv
Collaborative filtering optimization objectives:
Given $x^{(1)},\dots,x^{(n_m)}$, estimate $\theta^{(1)},\dots,\theta^{(n_u)}$:

$$\min_{\theta^{(1)},\dots,\theta^{(n_u)}}\ \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left(\theta_k^{(j)}\right)^2$$

Given $\theta^{(1)},\dots,\theta^{(n_u)}$, estimate $x^{(1)},\dots,x^{(n_m)}$:

$$\min_{x^{(1)},\dots,x^{(n_m)}}\ \frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}\left(x_k^{(i)}\right)^2$$

Minimizing over both $x$ and $\theta$ simultaneously:

$$\min_{x,\theta}\ \frac{1}{2}\sum_{(i,j):r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}\left(x_k^{(i)}\right)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left(\theta_k^{(j)}\right)^2$$
16.5 Vectorization: Low Rank Matrix Factorization
16-5-Vectorization_ Low Rank Matrix Factorization (8 min).mkv
In the last couple of videos, we talked about collaborative filtering, and in this video I’m going to talk about vectorization implementation of that algorithm, and talk about some other things you can do with that algorithm.
For example:
- When given a product, can you find other products related to it?
- A user is interested in a product recently. Are there any other related products that you can recommend to him?
What I'm going to do is write out an alternative, vectorized way of computing the predictions of the collaborative filtering algorithm.
For example, we have a data set of five movies, and what I'm going to do is group these users' movie ratings into a matrix.
With five movies and four users, that matrix has five rows and four columns, storing the users' ratings of these movies:
Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) |
---|---|---|---|---|
Love at last | 5 | 5 | 0 | 0 |
Romance forever | 5 | ? | ? | 0 |
Cute puppies of love | ? | 4 | 0 | ? |
Nonstop car chases | 0 | 0 | 5 | 4 |
Swords vs. karate | 0 | 0 | 5 | ? |
Predicted ratings: stacking the movie feature vectors as the rows of a matrix $X$ and the user parameter vectors as the rows of a matrix $\Theta$, the full matrix of predicted ratings is $X\Theta^T$. Because this product has (relatively) low rank, the method is also called low rank matrix factorization.
Finding related movies:
Now that you have learned feature vectors for the movies, you have a convenient way of measuring how similar two movies are. For example, suppose movie $i$ has feature vector $x^{(i)}$. If you can find a different movie $j$ whose feature vector $x^{(j)}$ is a small distance $\|x^{(i)}-x^{(j)}\|$ away, that is a strong indication that the two movies are similar in some sense, at least in the sense that someone who likes movie $i$ is more likely to be interested in movie $j$. To summarize, when a user is watching movie $i$ and you want to recommend 5 new movies, you find the 5 movies whose feature vectors are closest to $x^{(i)}$ and recommend those.
Hopefully you can now see how to use a vectorized computation to predict ratings for all users and all movies at once, and how to find related movies and products by learning feature vectors.
16.6 Implementation details: mean normalization
Reference video: 16-6 – Implementational Detail_ Mean Normalization (9 min).mkv
Let’s look at the user rating data below:
If we add a new user, Eve, and Eve doesn’t rate any movies, on what basis do we recommend movies for Eve?
We first mean-normalize the ratings matrix, subtracting from each rating the average rating that the movie received from the users who rated it:
We then train the algorithm on this normalized matrix. When we use the newly trained model to predict ratings, we add the mean back in, predicting $(\theta^{(j)})^T x^{(i)} + \mu_i$. For Eve, who has no ratings, the model will then predict that her rating of each movie is simply that movie's average rating.
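A small sketch of mean normalization (the assignment code below does the same thing on the real data):

```python
import numpy as np

Y = np.array([[5, 5, 0, 0],    # ratings (movies x users); 0 here only means "unrated"
              [5, 0, 0, 0],
              [0, 4, 0, 0]], dtype=float)
R = np.array([[1, 1, 1, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]])

Ymean = np.array([Y[i, R[i] == 1].mean() for i in range(Y.shape[0])])
Ynorm = np.where(R == 1, Y - Ymean[:, None], 0)    # subtract each movie's mean where rated

# A brand-new user with no ratings is simply predicted each movie's mean rating:
print(Ymean)
```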
Code section – Anomaly detection and recommender systems
In this exercise, we will implement the anomaly detection algorithm using the Gaussian model and apply it to detect failed servers on the network. We’ll also see how to build a recommendation system using collaborative filtering and apply it to a movie recommendation dataset.
Anomaly Detection
Our first task is to use a Gaussian model to detect whether unlabeled examples in a dataset should be considered anomalies. We start with a simple two-dimensional dataset to help visualize what the algorithm is doing.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
data = loadmat('data/ex8data1.mat')
X = data['X']
X.shape
(307, 2)
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(X[:,0], X[:,1])
plt.show()
This is a very tight cluster, with several values far away from it. In this simple example, those can be considered anomalies. To detect them, we estimate a Gaussian distribution for each feature in the data. To do this, we will create a function that returns the mean and variance of each feature.
def estimate_gaussian(X):
mu = X.mean(axis=0)
sigma = X.var(axis=0)
return mu, sigma
mu, sigma = estimate_gaussian(X)
mu, sigma
(array([14.11222578, 14.99771051]), array([1.83263141, 1.70974533]))
Now that we have our model parameters, we need to determine the probability threshold below which a sample should be considered an anomaly. To do this, we use a set of labeled validation data, in which the real anomalies have been marked, and evaluate the model's performance for different threshold values.
Xval = data['Xval']
yval = data['yval']
Xval.shape, yval.shape
((307, 2), (307, 1))
We also need a way to calculate the probability that the data points are normally distributed. Fortunately, SciPy has this built-in method.
from scipy import stats
dist = stats.norm(mu[0], sigma[0])
dist.pdf(15)
0.1935875044615038
We can also pass the array to the probability density function and get the probability density for each point in the data set.
dist.pdf(X[:,0])[0:50]
array([0.183842  , 0.20221694, 0.21746136, 0.19778763, 0.20858956,
       0.21652359, 0.16991291, 0.15123542, 0.1163989 , 0.1594734 ,
       0.21716057, 0.21760472, 0.20141857, 0.20157497, 0.21711385,
       0.21758775, 0.21695576, 0.2138258 , 0.21057069, 0.1173018 ,
       0.20765108, 0.21717452, 0.19510663, 0.21702152, 0.17429399,
       0.15413455, 0.21000109, 0.20223586, 0.21031898, 0.21313426,
       0.16158946, 0.2170794 , 0.17825767, 0.17414633, 0.1264951 ,
       0.19723662, 0.14538809, 0.21766361, 0.21191386, 0.21729442,
       0.21238912, 0.18799417, 0.21259798, 0.21752767, 0.20616968,
       0.21520366, 0.1280081 , 0.21768113, 0.21539967, 0.16913173])
We calculate and store the probability density of each value in the data set given the gaussian model parameters described above.
p = np.zeros((X.shape[0], X.shape[1]))
p[:,0] = stats.norm(mu[0], sigma[0]).pdf(X[:,0])
p[:,1] = stats.norm(mu[1], sigma[1]).pdf(X[:,1])
p.shape
(307, 2)
We also need to do this for the validation set (using the same model parameters). We will use these probabilities combined with real labels to determine the optimal probability threshold for assigning data points as exceptions.
pval = np.zeros((Xval.shape[0], Xval.shape[1]))
pval[:,0] = stats.norm(mu[0], sigma[0]).pdf(Xval[:,0])
pval[:,1] = stats.norm(mu[1], sigma[1]).pdf(Xval[:,1])
pval.shape
((307, 2), (307, 1))
Next, we need a function that finds the best threshold for a given probability density value and real labels. To do this, we will calculate F1 scores for different Epsilon values. F1 is a function of the number of true positives, false positives and false negatives. The equations are in the exercise text.
def select_threshold(pval, yval):
best_epsilon = 0
best_f1 = 0
f1 = 0
    step = (pval.max() - pval.min()) / 1000
for epsilon in np.arange(pval.min(), pval.max(), step):
preds = pval < epsilon
tp = np.sum(np.logical_and(preds == 1, yval == 1)).astype(float)
fp = np.sum(np.logical_and(preds == 1, yval == 0)).astype(float)
fn = np.sum(np.logical_and(preds == 0, yval == 1)).astype(float)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = (2 * precision * recall) / (precision + recall)
if f1 > best_f1:
best_f1 = f1
best_epsilon = epsilon
return best_epsilon, best_f1
epsilon, f1 = select_threshold(pval, yval)
epsilon, f1
(0.009566706005956842, 0.7142857142857143)
Finally, we can apply thresholds to the data set and visualize the results.
# indexes of the values considered to be outliers
outliers = np.where(p < epsilon)
outliers
(array([300, 301, 301, 303, 303, 304, 306, 306], dtype=int64),
 array([1, 0, 1, 0, 1, 0, 0, 1], dtype=int64))
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(X[:,0], X[:,1])
ax.scatter(X[outliers[0], 0], X[outliers[0], 1], s=50, color='r', marker='o')
plt.show()
Red dots are the points flagged as anomalies. These seem reasonable. The point in the upper right that is somewhat separated from the cluster (but was not flagged) could also be an anomaly, though it is fairly close to the rest.
Collaborative filtering
The recommendation engine uses project-based and user-based similarity metrics to examine users’ historical preferences in order to suggest new “things” that may be of interest to users. In this exercise, we will implement a specific recommendation system algorithm called collaborative filtering and apply it to a data set of movie ratings.
We first load and examine the data we are going to use.
data = loadmat('data/ex8_movies.mat')
Y is an array containing ratings from 1 to 5 (number of movies × number of users). R is an "indicator" array containing binary values that say whether a user has rated a movie. Both should have the same dimensions.
Y = data['Y']
R = data['R']
Y.shape, R.shape
((1682, 943), (1682, 943))
We can look at the average rating of a movie by averaging over the entries of Y where it was rated.
Y[1, np.where(R[1, :] == 1)[0]].mean()
3.2061068702290076
We can also try to “visualize” the data by rendering the matrix into an image. We can’t glean much from this, but it does give us an idea of the relative density of users and movies.
fig, ax = plt.subplots(figsize=(12, 12))
ax.imshow(Y)
ax.set_xlabel('Users')
ax.set_ylabel('Movies')
fig.tight_layout()
plt.show()
Next, we will implement the cost function of collaborative filtering. Intuitively, the "cost" is the degree to which a set of predicted movie ratings deviates from the true ratings. The cost equation is given in the exercise text. It is based on two sets of parameter matrices, called X and Theta in the text. These are "unrolled" into the "params" input so that SciPy's optimization package can be used later. Note that I have included the array/matrix shapes (for the data used in this exercise) in comments to help illustrate how the matrix interactions work.
Cost function
def cost(params, Y, R, num_features):
Y = np.matrix(Y) # (1682, 943)
R = np.matrix(R) # (1682, 943)
num_movies = Y.shape[0]
num_users = Y.shape[1]
# reshape the parameter array into parameter matrices
X = np.matrix(np.reshape(params[:num_movies * num_features], (num_movies, num_features))) # (1682, 10)
Theta = np.matrix(np.reshape(params[num_movies * num_features:], (num_users, num_features))) # (943, 10)
# initializations
J = 0
# compute the cost
error = np.multiply((X * Theta.T) - Y, R) # (1682, 943)
squared_error = np.power(error, 2) # (1682, 943)
J = (1. / 2) * np.sum(squared_error)
return J
To test this, we provide a set of pre-training parameters that we can evaluate. In order to keep the evaluation time short, we will only look at a small piece of data.
params_data = loadmat('data/ex8_movieParams.mat')
X = params_data['X']
Theta = params_data['Theta']
X.shape, Theta.shape
((1682, 10), (943, 10))
users = 4
movies = 5
features = 3
X_sub = X[:movies, :features]
Theta_sub = Theta[:users, :features]
Y_sub = Y[:movies, :users]
R_sub = R[:movies, :users]
params = np.concatenate((np.ravel(X_sub), np.ravel(Theta_sub)))
cost(params, Y_sub, R_sub, features)
22.224603725685675
Next we need to implement the gradient calculation. As we did with the neural network implementation in Exercise 4, we will extend the cost function to compute the gradient.
def cost(params, Y, R, num_features):
Y = np.matrix(Y) # (1682, 943)
R = np.matrix(R) # (1682, 943)
num_movies = Y.shape[0]
num_users = Y.shape[1]
# reshape the parameter array into parameter matrices
X = np.matrix(
np.reshape(params[:num_movies * num_features],
(num_movies, num_features))) # (1682, 10)
Theta = np.matrix(
np.reshape(params[num_movies * num_features:],
(num_users, num_features))) # (943, 10)
# initializations
J = 0
X_grad = np.zeros(X.shape) # (1682, 10)
Theta_grad = np.zeros(Theta.shape) # (943, 10)
# compute the cost
error = np.multiply((X * Theta.T) - Y, R) # (1682, 943)
squared_error = np.power(error, 2) # (1682, 943)
J = (1. / 2) * np.sum(squared_error)
# calculate the gradients
X_grad = error * Theta
Theta_grad = error.T * X
# unravel the gradient matrices into a single array
grad = np.concatenate((np.ravel(X_grad), np.ravel(Theta_grad)))
return J, grad
J, grad = cost(params, Y_sub, R_sub, features)
J, grad
(22.224603725685675,
 array([ 2.52899165,  7.57570308,  1.89979026,  0.56819597,  3.35265031,
         0.52339845,  0.83240713,  4.91163297,  0.76677878,  0.38358278,
         2.26333698,  0.35334048,  0.80378006,  4.74271842,  0.74040871,
        10.5680202 ,  4.62776019,  7.16004443,  3.05099006,  1.16441367,
         3.47410789,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ]))
Our next step is to add regularization to the cost and gradient calculations. We will create a final regularized version of the functionality (note that this version contains an additional “learning rate” parameter, called “lambda” in the text).
def cost(params, Y, R, num_features, learning_rate) :
Y = np.matrix(Y) # (1682, 943)
R = np.matrix(R) # (1682, 943)
num_movies = Y.shape[0]
num_users = Y.shape[1]
# reshape the parameter array into parameter matrices
X = np.matrix(np.reshape(params[:num_movies * num_features], (num_movies, num_features))) # (1682, 10)
Theta = np.matrix(np.reshape(params[num_movies * num_features:], (num_users, num_features))) # (943, 10)
# initializations
J = 0
X_grad = np.zeros(X.shape) # (1682, 10)
Theta_grad = np.zeros(Theta.shape) # (943, 10)
# compute the cost
error = np.multiply((X * Theta.T) - Y, R) # (1682, 943)
squared_error = np.power(error, 2) # (1682, 943)
J = (1. / 2) * np.sum(squared_error)
# add the cost regularization
J = J + ((learning_rate / 2) * np.sum(np.power(Theta, 2)))
J = J + ((learning_rate / 2) * np.sum(np.power(X, 2)))
# calculate the gradients with regularization
X_grad = (error * Theta) + (learning_rate * X)
Theta_grad = (error.T * X) + (learning_rate * Theta)
# unravel the gradient matrices into a single array
grad = np.concatenate((np.ravel(X_grad), np.ravel(Theta_grad)))
return J, grad
J, grad = cost(params, Y_sub, R_sub, features, 1.5)
J, grad
(31.34405624427422,
 array([ 0.95596339,  6.97535514,  0.10861109,  0.60308088,  2.77421145,
         0.25839822,  0.12985616,  4.0898522 ,  0.89247334,  0.29684395,
         1.06300933,  0.66738144,  0.60252677,  4.90185327,  0.19747928,
        10.13985478,  2.10136256,  6.76563628,  2.29347024,  0.48244098,
         2.99791422,  0.64787484,  0.71820673,  1.27006666,  1.09289758,
         0.40784086,  0.49026541]))
Again, the result matches the expected output of the exercise code, so the regularization looks correct. Before we train the model, there is one final step: we create our own movie ratings so that we can use the model to generate personalized recommendations. The exercise provides a file that maps movie indices to their titles; we load it into a dictionary.
movie_idx = {}
f = open('data/movie_ids.txt',encoding= 'gbk')
for line in f:
tokens = line.split(' ')
    tokens[-1] = tokens[-1][:-1]
    movie_idx[int(tokens[0]) - 1] = ' '.join(tokens[1:])
movie_idx[0]
'Toy Story (1995)'
We will use the grades provided in the exercise.
ratings = np.zeros((1682, 1))
ratings[0] = 4
ratings[6] = 3
ratings[11] = 5
ratings[53] = 4
ratings[63] = 5
ratings[65] = 3
ratings[68] = 5
ratings[97] = 2
ratings[182] = 4
ratings[225] = 5
ratings[354] = 5
print('Rated {0} with {1} stars.'.format(movie_idx[0], str(int(ratings[0]))))
print('Rated {0} with {1} stars.'.format(movie_idx[6], str(int(ratings[6]))))
print('Rated {0} with {1} stars.'.format(movie_idx[11], str(int(ratings[11]))))
print('Rated {0} with {1} stars.'.format(movie_idx[53], str(int(ratings[53]))))
print('Rated {0} with {1} stars.'.format(movie_idx[63], str(int(ratings[63]))))
print('Rated {0} with {1} stars.'.format(movie_idx[65], str(int(ratings[65]))))
print('Rated {0} with {1} stars.'.format(movie_idx[68], str(int(ratings[68]))))
print('Rated {0} with {1} stars.'.format(movie_idx[97], str(int(ratings[97]))))
print('Rated {0} with {1} stars.'.format(movie_idx[182], str(int(ratings[182]))))
print('Rated {0} with {1} stars.'.format(movie_idx[225], str(int(ratings[225]))))
print('Rated {0} with {1} stars.'.format(movie_idx[354], str(int(ratings[354]))))
Rated Toy Story (1995) with 4 stars.
Rated Twelve Monkeys (1995) with 3 stars.
Rated Usual Suspects, The (1995) with 5 stars.
Rated Outbreak (1995) with 4 stars.
Rated Shawshank Redemption, The (1994) with 5 stars.
Rated While You Were Sleeping (1995) with 3 stars.
Rated Forrest Gump (1994) with 5 stars.
Rated Silence of the Lambs, The (1991) with 2 stars.
Rated Alien (1979) with 4 stars.
Rated Die Hard 2 (1990) with 5 stars.
Rated Sphere (1998) with 5 stars.
We can add our own rating vector to an existing dataset for inclusion in the model.
R = data['R']
Y = data['Y']
Y = np.append(Y, ratings, axis=1)
R = np.append(R, ratings != 0, axis=1)
Y.shape, R.shape, ratings.shape
((1682, 944), (1682, 944), (1682, 1))
We are now almost ready to train the collaborative filtering model. We just need to define a few variables and normalize the ratings.
movies = Y.shape[0] # 1682
users = Y.shape[1] # 944
features = 10
learning_rate = 10.
X = np.random.random(size=(movies, features))
Theta = np.random.random(size=(users, features))
params = np.concatenate((np.ravel(X), np.ravel(Theta)))
X.shape, Theta.shape, params.shape
((1682, 10), (944, 10), (26260,))
Ymean = np.zeros((movies, 1))
Ynorm = np.zeros((movies, users))
for i in range(movies):
    idx = np.where(R[i,:] == 1)[0]
Ymean[i] = Y[i,idx].mean()
Ynorm[i,idx] = Y[i,idx] - Ymean[i]
Ynorm.mean()
5.507036456515984e-19
from scipy.optimize import minimize
fmin = minimize(fun=cost, x0=params, args=(Ynorm, R, features, learning_rate),
method='CG', jac=True, options={'maxiter': 100})
X = np.matrix(np.reshape(fmin.x[:movies * features], (movies, features)))
Theta = np.matrix(np.reshape(fmin.x[movies * features:], (users, features)))
X.shape, Theta.shape
((1682, 10), (944, 10))
The parameters we’ve trained are X and Theta. We can use these to create some suggestions for the users we add.
predictions = X * Theta.T
my_preds = predictions[:, -1] + Ymean
my_preds.shape
(1682, 1)
sorted_preds = np.sort(my_preds, axis=0)[::-1]
sorted_preds[:10]
matrix([[5.00000277],
[5.00000236],
[5.00000172],
[5.00000167],
[5.00000018],
[4.99999996],
[4.99999993],
[4.99999963],
[4.99999936],
[4.99999933]])
That gives us the top-ranked ratings, but we have lost the indices of those ratings. We need to use the argsort function to find which movies the predicted ratings correspond to.
idx = np.argsort(my_preds, axis=0)[::-1]
idx
matrix([[1499],
        [ 813],
        [1535],
        ...,
        [ 313],
        [ 829],
        [1431]], dtype=int64)
print("Top 10 movie predictions:")
for i in range(10):
j = int(idx[i])
print('Predicted rating of {0} for movie {1}.'.format(str(float(my_preds[j])), movie_idx[j]))
Top 10 movie predictions:
Predicted rating of 5.0000027697456755 for movie Santa with Muscles (1996).
Predicted rating of 5.000002356341384 for movie Great Day in Harlem, A (1994).
Predicted rating of 5.000001717626692 for movie Aiqing wansui (1994).
Predicted rating of 5.000001669423294 for movie Star Kid (1997).
Predicted rating of 5.000000180977937 for movie Prefontaine (1997).
Predicted rating of 4.999999960309181 for movie They Made Me a Criminal (1939).
Predicted rating of 4.999999928299842 for movie Marlene Dietrich: Shadow and Light (1996).
Predicted rating of 4.999999629264883 for movie Someone Else's America (1995).
Predicted rating of 4.9999993570025705 for movie Saint of Fort Washington, The (1993).
Predicted rating of 4.999999325074885 for movie Entertaining Angels: The Dorothy Day Story (1996).
The recommended movies don't actually match those in the exercise text, and I couldn't figure out why.
Reference material

[1] Machine learning course: www.coursera.org/course/ml
[2] Huang Haiguang: https://github.com/fengdu78
[3] GitHub: github.com/fengdu78/Co…
[4] Assignment code: https://github.com/fengdu78/Coursera-ML-AndrewNg-Notes/blob/master/code/ex8-anomaly%20detection%20and%20recommendation/ML-exercise8.ipynb
[5] Markdown file: github.com/fengdu78/Co…
[6] PDF file: https://github.com/fengdu78/Coursera-ML-AndrewNg-Notes/blob/master/machine learning personal notes full version v5.4 - A4 print edition.pdf