Now it’s time to introduce our first machine learning algorithm.
And, perhaps surprisingly, this algorithm is linear regression, an algorithm you are probably already familiar with. We will understand the principles of linear regression by looking at it as a typical machine learning algorithm, with a task, a performance measure, and the other attributes of machine learning methods we described earlier.
As a slight generalization, we will consider the general multivariate case from the outset.
First, let's start with the task T. Our task is to predict the value of a scalar variable y given the predictors, or features, x_0 to x_{D-1}; together they form a vector in the D-dimensional space R^D.
Now, in order to learn from experience, we are given a paired data set of X and y, which we can split into two data sets, (X_train, y_train) and (X_test, y_test), as before. Assume that both data sets are IID samples from some data-generating distribution P_data. Next, consider the model architecture we want to use. By this, we mean defining a space of functions of X with which we try to fit our data.
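As a minimal sketch of such a split (the function name, the 80/20 fraction, and the use of NumPy here are illustrative assumptions, not something specified in the lecture):

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Randomly split a paired data set (X, y) into train and test parts."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.permutation(n)
    n_test = int(n * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```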
In this case, we choose a linear architecture. In other words, we're constraining ourselves to the class of linear functions of X. Here, ŷ is the vector of N predicted values, X is the N-by-D design matrix, and W is a vector of regression coefficients of length D.
This vector is also commonly referred to as a weight vector in machine learning.
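For reference, the linear model just described can be written compactly as follows (a reconstruction from the definitions above, matching the lecture's notation):

```latex
\hat{y} = X W , \qquad
X \in \mathbb{R}^{N \times D}, \quad
W \in \mathbb{R}^{D}, \quad
\hat{y} \in \mathbb{R}^{N} .
```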
Now, an example of this linear regression model is a model that tries to predict the daily returns of a certain stock, say Amazon stock. Here, X would be a vector of returns of various market indexes, such as the S&P 500, the NASDAQ Composite index, the VIX, and so on. Now we must specify a performance metric. We define it as the mean squared error, or MSE, calculated on the test set. To calculate it, we average the squared differences between the true values y of all points in the test set and the values ŷ predicted by the model. The same thing can be written more compactly as the squared distance between the vectors ŷ and y, rescaled by the number of test points.
Now we can replace ŷ with the model equation above and write it in this form.
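Concretely, the test-set metric described here is (again a reconstruction of the slide formula from the surrounding definitions):

```latex
\mathrm{MSE}_{\text{test}}
= \frac{1}{N_{\text{test}}} \sum_{i=1}^{N_{\text{test}}} \left( \hat{y}_i - y_i \right)^2
= \frac{1}{N_{\text{test}}} \left\| X_{\text{test}} W - y_{\text{test}} \right\|^{2} .
```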
The next question is how to find the best parameter, W.
We proceed as discussed in the previous lecture and look for the parameter W that minimizes a different objective: the MSE error on the training set. This is the same equation as here, but with X_test replaced by X_train and y_test by y_train.
The reason is that the mean squared errors on the training and test data are both estimates of the same quantity, the generalization error.
Now, to find the best value of W, we need to set the gradient of the training MSE to zero. This gives us a condition: at the best vector W, the gradient must equal zero. Let's also drop the inessential constant multiplier of 1/N_train and write the gradient as the gradient of the squared vector norm, which we express as the scalar product of the vector with itself.
Here, the superscript T stands for the transpose.
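In symbols, the condition being set up here is (reconstructed from the description, with the training labels still shown):

```latex
\nabla_{W} \Big[ \left( X_{\text{train}} W - y_{\text{train}} \right)^{\mathsf T}
                 \left( X_{\text{train}} W - y_{\text{train}} \right) \Big] = 0 .
```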
Now, let me simplify the relation a little by dropping the train labels, because we don't need them; we just remember that this calculation uses only the training data.
Next, we expand the scalar product and compute the derivative of the entire expression with respect to W.
Since the first term is quadratic in W, the second term is linear in W, and the last term doesn't depend on W at all, the last term drops out in the differentiation. As a result, we get this expression, which must equal zero at the best vector W. We now obtain the best value of W by simply inverting the equation, and we end up with this formula.
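Spelled out, the expansion and the resulting condition are (a reconstruction of the slide formulas from the surrounding description):

```latex
(XW - y)^{\mathsf T}(XW - y)
  = W^{\mathsf T} X^{\mathsf T} X W \;-\; 2\, W^{\mathsf T} X^{\mathsf T} y \;+\; y^{\mathsf T} y ,

\nabla_{W} \,(\cdot) \;=\; 2\, X^{\mathsf T} X W - 2\, X^{\mathsf T} y \;=\; 0
\quad\Longrightarrow\quad
W = \left( X^{\mathsf T} X \right)^{-1} X^{\mathsf T} y .
```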
This relationship is also called the normal equation in regression analysis. Let’s recall that X and y here refer to the values in the training set.
To understand the matrix operations involved here, it is also useful to visualize the relationship as shown here.
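As a minimal code sketch of the normal equation (NumPy with illustrative synthetic data; using np.linalg.solve instead of an explicit matrix inverse is a numerical-stability choice, not something from the lecture):

```python
import numpy as np

def fit_linear_regression(X_train, y_train):
    """Solve the normal equation (X^T X) W = X^T y for the weight vector W."""
    XtX = X_train.T @ X_train
    Xty = X_train.T @ y_train
    return np.linalg.solve(XtX, Xty)

# Illustrative synthetic data: N = 100 points, D = 3 predictors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
true_W = np.array([0.5, -1.0, 2.0])
y_train = X_train @ true_W + 0.1 * rng.normal(size=100)

W = fit_linear_regression(X_train, y_train)
mse_train = np.mean((X_train @ W - y_train) ** 2)
```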
Now, given the estimated vector W, we can do several things with it. First, we can compute the in-sample (training) fit by multiplying the vector W on the left by X_train; these fitted values give us the training, or in-sample, error.
If we replace W with the expression above, we see that the in-sample fit can be represented as the product of a matrix H, built from the data matrix X_train, and the vector y_train. This matrix is sometimes called the hat matrix. You can check that this matrix is a projection matrix.
In particular, it is symmetric and idempotent, meaning that its square is equal to the matrix itself.
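A quick, self-contained sketch of that check (the synthetic X_train here is just an illustrative assumption):

```python
import numpy as np

# Hat matrix: H = X (X^T X)^{-1} X^T, so that y_hat_train = H @ y_train.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))

H = X_train @ np.linalg.solve(X_train.T @ X_train, X_train.T)

print(np.allclose(H, H.T))    # symmetric: H = H^T
print(np.allclose(H @ H, H))  # idempotent: H^2 = H
```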
Another aspect of estimating the vector W is to use it for prediction.
For example, if we use the test data set for out-of-sample prediction, the answer is the product of X_test and the vector W.
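In symbols (same notation as above):

```latex
\hat{y}_{\text{test}} = X_{\text{test}}\, W ,
\qquad
\mathrm{MSE}_{\text{test}}
= \frac{1}{N_{\text{test}}} \left\| \hat{y}_{\text{test}} - y_{\text{test}} \right\|^{2} .
```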
Now, notice an interesting thing about this relationship. The expression for W involves the transpose of X and the inverse of the product X^T X. If the data matrix X contains nearly identical or very strongly correlated columns, this product can be a nearly degenerate (singular) matrix.
Another common situation occurs when one column can be represented as a linear combination of other columns, which again produces a nearly degenerate matrix. This is known in statistics as multicollinearity. In all of these cases, the determinant of this matrix is numerically very close to zero, which can make the estimated weights and the predicted values unstable or even infinite.
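A small sketch of the effect (the synthetic data and the near-duplicate column are illustrative assumptions):

```python
import numpy as np

# Two nearly identical columns make X^T X close to singular.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 1e-6 * rng.normal(size=200)   # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=200)

XtX = X.T @ X
print(np.linalg.det(XtX))             # determinant numerically close to zero
print(np.linalg.cond(XtX))            # huge condition number
print(np.linalg.solve(XtX, X.T @ y))  # weights can be large and unstable
```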
There are several ways to prevent this instability in linear regression. One is to screen the set of predictor variables to rule out possible multicollinearity. But there are other ways. In particular, in the next video, we'll talk about regularization, which is a way of providing better out-of-sample performance in supervised learning algorithms.