Univariate Linear Regression (Linear Regression with One Variable)
Model representation
Let’s start with an example of predicting housing prices, using a data set of home prices in Portland, Oregon. We plot the prices that houses of different sizes have sold for. Suppose your friend’s house is 1,250 square feet, and they want to know how much it could sell for. **One thing you can do is build a model from this data, perhaps a straight line; based on that model, you might tell your friend the house could sell for something like 220,000 dollars. This is an example of a supervised learning algorithm.**
It’s called supervised learning because for each piece of data we are given the “right answer,” i.e. what the house actually sold for. More specifically, it’s a regression problem: the term regression means we predict a continuous-valued output, in this case the price.
Further, in supervised learning we have a data set called a training set. We use lowercase m to denote the number of training examples.
Taking the housing price problem above as an example, suppose our training set for the problem looks like the following:
The notation we will use to describe this regression problem is as follows:
- $m$: the number of training examples
- $x$: the feature/input variable
- $y$: the target/output variable
- $(x, y)$: one training example
- $(x^{(i)}, y^{(i)})$: the $i$-th training example
- $h$: the solution or function output by the learning algorithm, also known as the hypothesis
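To make the notation concrete, here is a minimal sketch in Python; the house sizes and prices are made-up illustrative numbers, not the course’s actual data set:

```python
# Hypothetical training set: house sizes in square feet (x) and
# prices in thousands of dollars (y). The numbers are illustrative.
x = [2104, 1416, 1534, 852]   # feature / input variable
y = [460, 232, 315, 178]      # target / output variable

m = len(x)                    # m = number of training examples (here 4)

i = 0                         # indices are 0-based in Python, 1-based in the notes
example = (x[i], y[i])        # (x^(i), y^(i)): the i-th training example
print(m, example)             # -> 4 (2104, 460)
```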
So this is how a supervised learning algorithm works. We have our training set of house prices; we feed it to our learning algorithm, and the learning algorithm outputs a function, conventionally written as a lowercase h for hypothesis. The hypothesis is a function whose input is the size of a house, such as the house your friend wants to sell; h produces the y value from the input x value, and the y value corresponds to the price of the house. In other words, h is a function mapping from x to y.
So, for our housing price prediction problem, how should we express $h$?
One possible expression is $h_\theta(x) = \theta_0 + \theta_1 x$.
Since there is only one feature/input variable, problems like this are called univariate linear regression problems.
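As a quick sketch, this hypothesis can be written as a one-line Python function (the function and parameter names, as well as the example numbers, are our own):

```python
def h(x, theta0, theta1):
    """Hypothesis for univariate linear regression: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Example: with an (arbitrary) intercept of 50 and slope of 0.136,
# a 1250-square-foot house is predicted to sell for about 220 (thousand dollars).
print(h(1250, 50, 0.136))  # -> 220.0
```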
Cost function
We will now define the concept of a cost function, which will help us figure out how to fit the best possible straight line to our data, as shown in the figure:
All we need to do now is select the parameters $\theta_0$ and $\theta_1$ for our model, which in the housing price example are the y-intercept and the slope of the line, respectively.
The parameters we choose determine how well the resulting line fits our training set. The difference between the model’s predicted values and the actual values in the training set (indicated by the blue lines in the figure below) is the modeling error.
Our goal is to select model parameters that minimize the sum of squares of modeling errors.
That is, we choose parameters to minimize the cost function $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$. Referring to the figure above, this means making the difference between $h$ at each sample point and the actual value as small as possible.
**The cost function is also called the squared error function, or sometimes the squared error cost function. The reason we minimize the sum of squared errors is that the squared error cost function is a reasonable choice for most problems, especially regression problems.** Other cost functions also work well, but the squared error cost function is probably the most commonly used for regression problems.
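Here is a minimal sketch of the squared error cost function in Python (the function name `compute_cost` and the variable names are our own):

```python
def compute_cost(x, y, theta0, theta1):
    """Squared error cost: J = (1 / 2m) * sum of (h_theta(x_i) - y_i)^2."""
    m = len(x)
    total = 0.0
    for xi, yi in zip(x, y):
        error = (theta0 + theta1 * xi) - yi  # modeling error for one example
        total += error ** 2
    return total / (2 * m)

# For toy data generated from y = 2x + 1, a better line gives a smaller cost:
print(compute_cost([1, 2, 3], [3, 5, 7], 1.0, 2.0))  # perfect fit -> 0.0
print(compute_cost([1, 2, 3], [3, 5, 7], 0.0, 0.0))  # poor fit -> about 13.8
```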
Of course, what we really need is an efficient algorithm that automatically finds the values of $\theta_0$ and $\theta_1$ that minimize the cost function $J$.
Gradient descent
Gradient descent is an algorithm for minimizing functions. We will use the gradient descent algorithm to find the values that minimize the cost function $J(\theta_0, \theta_1)$.
The idea behind gradient descent is that we start with some initial combination of parameters $(\theta_0, \theta_1, \ldots, \theta_n)$, calculate the cost function, and then look for the next combination of parameters that reduces the value of the cost function the most. We keep doing this until we reach a local minimum. Since we have not tried all parameter combinations, we cannot be sure that the local minimum we reach is the global minimum; different initial parameter combinations can lead to different local minima.
Imagine you are standing at a point on the red hill in an imaginary park. In the gradient descent algorithm, what we do is turn 360 degrees, look all around us, and ask: if I want to take small steps downhill as quickly as possible, in which direction should I step? We take that step; then, from the new point, we look around again, decide in which direction we will descend fastest, take another small step, and so on, until we are near a local minimum.
Batch gradient descent
The update rule for the batch gradient descent algorithm is: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$, for $j = 0$ and $j = 1$, updated simultaneously.
Here $\alpha$ is the learning rate, which determines how large a step we take downhill in the direction that most reduces the cost function. In batch gradient descent, we update all parameters simultaneously, each time subtracting the learning rate times the partial derivative of the cost function with respect to that parameter.
To begin with, since we don’t know which of the (possibly infinitely many) values of $\theta$ is best, we pick any starting point and use gradient descent to approach the nearest minimum by repeatedly updating $\theta_0$ and $\theta_1$ together, as shown in the figure below.
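Below is a sketch of a single batch gradient descent update in Python (function and variable names are our own). Note the simultaneous update: both partial derivatives are computed from the old parameter values before either parameter is overwritten:

```python
def gradient_descent_step(x, y, theta0, theta1, alpha):
    """One batch gradient descent update for univariate linear regression."""
    m = len(x)
    errors = [(theta0 + theta1 * xi) - yi for xi, yi in zip(x, y)]

    # Partial derivatives of J with respect to theta0 and theta1,
    # both computed before either parameter is updated.
    d_theta0 = sum(errors) / m
    d_theta1 = sum(e * xi for e, xi in zip(errors, x)) / m

    # Simultaneous update: both new values use the old theta0 and theta1.
    return theta0 - alpha * d_theta0, theta1 - alpha * d_theta1
```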
An intuitive understanding of gradient descent
The gradient descent algorithm is the update rule given above: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$.
Description: update $\theta$ so that $J(\theta)$ decreases in the direction of steepest descent, iterating continuously until a local minimum is finally reached. Here $\alpha$ is the learning rate, which determines how large a step we take downhill in the direction that most reduces the cost function.
The slope of the tangent line at the current point is exactly the derivative, so with each update the parameter moves downhill and the value of $J(\theta)$ keeps decreasing.
If $\alpha$ is too small, that is, if the learning rate is too small, we take tiny baby steps toward the lowest point, and it takes a great many steps to get there. So if $\alpha$ is too small, gradient descent can be slow, because it moves only a little bit at a time and needs many steps to reach the global minimum.

If $\alpha$ is too large, gradient descent may overshoot the minimum, and may even fail to converge: the next iteration takes another big step, overshooting the minimum over and over and moving farther and farther away. So if $\alpha$ is too large, gradient descent may fail to converge, or even diverge.
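To see both failure modes numerically, here is a toy experiment of our own construction: gradient descent on the one-parameter cost $J(\theta) = \theta^2$, whose derivative is $2\theta$ and whose minimum is at $\theta = 0$:

```python
def descend(theta, alpha, steps=5):
    """Gradient descent on J(theta) = theta**2, whose derivative is 2*theta."""
    path = [theta]
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
        path.append(theta)
    return path

print(descend(1.0, 0.01))  # too small: 1.0, 0.98, 0.9604, ... crawls toward 0
print(descend(1.0, 0.5))   # well chosen: jumps straight to 0
print(descend(1.0, 1.1))   # too large: 1.0, -1.2, 1.44, -1.728, ... diverges
```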
Let’s take an example. This is the cost function $J(\theta)$.
I want to find its minimum. First I initialize my gradient descent algorithm at the magenta point. One gradient descent update might bring me to the next point, because the derivative at the magenta point is quite steep. Now, at this green point, if I take one more update step, you will see that my derivative, that is, the slope, is not as steep. As I get closer and closer to the minimum, my derivative gets closer and closer to zero, so after each step of gradient descent the new derivative is a little smaller. Taking one more gradient descent step from the green point, I naturally take a slightly smaller step than I did at the magenta point, arriving at the new red point, which is closer to the global minimum, so the derivative there is smaller still. Therefore, with each successive gradient descent step, my derivative term is smaller, and the update to $\theta_1$ is smaller. So as gradient descent runs, the step size automatically becomes smaller and smaller, until it converges to a local minimum.
Intuitively, the slope gets smaller and smaller as we descend, and at the minimum the derivative is 0.
Looking back: in gradient descent, as we approach a local minimum, the algorithm automatically takes smaller steps. This is because at a local minimum the derivative equals zero, so as we approach the local minimum the derivative automatically becomes smaller and smaller, and gradient descent automatically takes smaller steps. That is how gradient descent works, so there is actually no need to decrease $\alpha$ further.
This is the gradient descent algorithm, and you can use it to minimize any cost function $J$, not just the cost function $J$ for linear regression.
Gradient descent for linear regression
The comparison between the gradient descent algorithm and the linear regression model is shown in the figure below:
We apply gradient descent to minimize the cost function for our earlier linear regression problem. The key step is computing the partial derivatives of the cost function, namely:

$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$
That is, $\theta_0$ and $\theta_1$ are repeatedly updated using gradient descent until the cost function reaches its minimum.
The algorithm we just used is sometimes called batch gradient descent. In machine learning, algorithms are often given names, and **the name “batch gradient descent” refers to the fact that at every step of gradient descent we use all of the training samples: when computing the derivative term, we sum the errors over all m training samples.** Thus the name “batch” implies that we consider the entire batch of training samples. In fact, there are other variants of gradient descent that are not the “batch” type; instead of considering the entire training set, they focus on a small subset of the training set at each step.
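Putting the pieces together, here is a minimal sketch of training univariate linear regression with batch gradient descent; the toy data and hyperparameters are our own choices, not from the course:

```python
def fit(x, y, alpha=0.1, iterations=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(iterations):
        errors = [(theta0 + theta1 * xi) - yi for xi, yi in zip(x, y)]
        d0 = sum(errors) / m                              # dJ / dtheta0
        d1 = sum(e * xi for e, xi in zip(errors, x)) / m  # dJ / dtheta1
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1  # simultaneous
    return theta0, theta1

# Toy data generated from y = 2x + 1, so the fit should recover those values.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
print(fit(x, y))  # -> approximately (1.0, 2.0)
```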
Conclusion
Univariate linear regression fits the data with a linear equation $y = ax + b$. The two parameters $a$ and $b$ should be chosen to minimize the cost function and obtain the best-fitting line. To minimize the cost function, gradient descent can be used to repeatedly update $a$ and $b$ until the cost function converges to its minimum, yielding the optimal parameter values $a$ and $b$.
Review of linear algebra
Addition and scalar multiplication
Matrix addition: two matrices with equal numbers of rows and columns can be added element-wise.
Example: $\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 10 & 12 \end{bmatrix}$
Scalar multiplication: multiply each element of the matrix by the scalar.
Combined operations (addition and scalar multiplication together) work similarly.
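A quick NumPy sketch of these two rules (the matrices are our own examples):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A + B)  # element-wise addition:   [[ 6  8] [10 12]]
print(3 * A)  # scalar multiplication:   [[ 3  6] [ 9 12]]
```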
Matrix-vector multiplication
Matrix-vector multiplication works as follows: an $m \times n$ matrix multiplied by an $n \times 1$ vector yields an $m \times 1$ vector.
Example:
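As a NumPy sketch with our own numbers, a $2 \times 3$ matrix times a length-3 vector yields a length-2 vector:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # 2x3 matrix (m = 2, n = 3)
v = np.array([1, 0, 2])     # vector of length 3 (n x 1)

print(A @ v)                # -> [ 7 16], a vector of length 2 (m x 1)
```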
Matrix multiplication
Let’s say we have two matrices $A$ and $B$; their product can be expressed in the form shown below. An $m \times n$ matrix multiplied by an $n \times p$ matrix yields an $m \times p$ matrix.
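Again as a NumPy sketch with our own numbers:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])      # 2x2
B = np.array([[5, 6],
              [7, 8]])      # 2x2

# Each entry of A @ B is the dot product of a row of A with a column of B:
# [[1*5+2*7, 1*6+2*8], [3*5+4*7, 3*6+4*8]]
print(A @ B)                # -> [[19 22] [43 50]]
```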