Ng's Machine Learning: Linear regression
First, a little digression about the course. I chose not to take Ng's course on Coursera, partly because Coursera runs on its own course schedule, which may not suit everyone, and partly because the Coursera assignments use Octave, while I personally think Python is a better language both for learning and for later use. So in the end I chose the format of course videos plus Python implementations of the homework.
Click on the video links to continue with Ng's course. I have posted the Python code for the course assignments on GitHub; click on the code links to view it there.
Here are notes from the first week of Ng’s machine learning course.
Machine learning
What is machine learning? An informal definition given by Arthur Samuel is that a computer is able to learn to solve problems without being explicitly programmed. Machine learning algorithms include supervised learning, unsupervised learning, reinforcement learning, recommender systems, and so on. Linear regression, the topic of our first week, is a supervised learning algorithm.
How supervised learning works
The first step is a Training Set that contains the correct answers for the problem's data. Our Learning Algorithm processes this training set and finally produces a function; this is the prediction function we need, and it can make accurate predictions both for the training data and for other inputs.
For linear regression, our Hypothesis is:

$$h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$$

where $\theta$ denotes the parameters that the learning algorithm needs to learn, and $x$ denotes the features that we choose for the problem (with $x_0 = 1$ by convention).
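As a concrete illustration (my own sketch, not part of the course materials), the hypothesis can be evaluated for all samples at once with NumPy, assuming a design matrix `X` whose first column is all ones:

```python
import numpy as np

# Hypothetical example: m = 3 samples, one feature each.
# The column of ones multiplies the intercept theta_0.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
theta = np.array([0.5, 2.0])  # [theta_0, theta_1]

# h_theta(x) = theta^T x, computed for every row of X at once
predictions = X @ theta  # -> [4.5, 6.5, 8.5]
```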
Cost function
So how do we learn the parameters $\theta$ of the prediction function $h_\theta$? We need to introduce a cost function $J(\theta)$ to evaluate the difference between real and predicted values. Once you have this function, the goal of the learning algorithm is to find the $\theta$ that makes it as small as possible. For linear regression, the cost function we use is:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
where $m$ is the sample size, $y$ is the answer given in the training set, the superscript $(i)$ denotes the $i$-th training example, and the cost function $J$ is a function of $\theta$. Of course, to express it more concisely and write clearer programs, we usually use its matrix form:

$$J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)$$
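Here is a minimal sketch of the matrix form in NumPy (my own illustration; the names `X`, `y`, `theta` are assumptions, not the official assignment code):

```python
import numpy as np

def compute_cost(X, y, theta):
    """Vectorized cost: J(theta) = (1/2m) * (X@theta - y)^T (X@theta - y)."""
    m = len(y)
    residual = X @ theta - y
    return (residual @ residual) / (2 * m)
```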
And finally, to build intuition, let's look at what the cost function looks like when there is only one feature.

The picture on the right is a contour map: each line represents points where the cost function takes the same value, and the red X marks the cost function's lowest point.
Gradient descent algorithm
Now recall the single-feature cost function image we just looked at, together with the goal "find the $\theta$ that makes the cost function as small as possible". An intuitive idea is to start from an arbitrary point on the slope and walk downhill until we reach the lowest point. That is exactly the idea of gradient descent: we update $\theta$ in the direction opposite to the gradient (the steepest downhill direction) until the cost function converges to a minimum. The gradient descent algorithm updates $\theta$ as follows:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
where $\alpha$ is the learning rate and $:=$ means the value computed by the right-hand formula replaces the original value. For linear regression, we update $\theta$ as follows:

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
With this we can assemble the complete linear regression learning algorithm: set an initial value for $\theta$, then use the gradient descent algorithm to update it until $J(\theta)$ converges. As for why we move in the direction opposite to the gradient, see this article, where the author explains the reason mathematically.
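Putting the pieces together, here is a sketch of batch gradient descent for linear regression (my own Python version, not the official assignment code; the learning rate and iteration count are arbitrary assumptions):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
    """Run batch gradient descent from theta = 0 and return the result."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # Vectorized update: theta := theta - (alpha/m) * X^T (X@theta - y)
        gradient = X.T @ (X @ theta - y) / m
        theta = theta - alpha * gradient
    return theta
```

In practice you would scale the features first (see the section below) so that a single learning rate works well for every parameter.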
Normal equations
For linear regression we can also derive mathematically the value of $\theta$ at the minimum of $J(\theta)$. This involves a little calculus and linear algebra; if you're interested, you can watch the derivation in the video. The formula that solves for $\theta$ directly is:

$$\theta = (X^T X)^{-1} X^T y$$
Using the normal equations comes with certain restrictions, such as the matrix $X^T X$ having to be invertible. So why do we need gradient descent when we have a direct way to solve the problem? Because gradient descent is more general: it can be used to solve many other problems, such as nonlinear cost functions.
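For comparison, a one-line sketch of the normal equation in NumPy (my own illustration; using `np.linalg.pinv`, the pseudo-inverse, also sidesteps the invertibility restriction):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution: theta = (X^T X)^{-1} X^T y."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```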
Feature standardization
In practical applications, the features we choose, such as length, weight, and area, usually have different units and ranges, which slows down the gradient descent algorithm. So we want to scale the features into a relatively uniform range. Two common methods are Standardization and Normalization. Standardization transforms the data toward a standard normal distribution: even if the original data follow some strange distribution, by the central limit theorem they become approximately normal once there is enough data. The update formula is:

$$x_j := \frac{x_j - \mu_j}{\sigma_j}$$

where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$.
Normalization is also friendly to the gradient descent algorithm and can improve convergence, training speed, and accuracy. The update formula is as follows:

$$x_j := \frac{x_j - \min(x_j)}{\max(x_j) - \min(x_j)}$$
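A small sketch of both scaling methods in NumPy (my own illustration, assuming `X` holds the raw feature columns without the intercept column of ones):

```python
import numpy as np

def standardize(X):
    """Standardization: zero mean and unit standard deviation per column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def normalize(X):
    """Normalization (min-max scaling): map each column into [0, 1]."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```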
Polynomial regression
Sometimes a linear Hypothesis is not suitable for the data we need to fit, so we choose a polynomial fit instead. For example:

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$$
We can then convert it into a linear regression problem simply by defining new features $x_1 = x$, $x_2 = x^2$, $x_3 = x^3$. That's it.
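A sketch of that feature construction in NumPy (my own illustration): build the polynomial columns, then reuse the ordinary linear regression machinery on them, since the model is still linear in the new features.

```python
import numpy as np

def polynomial_features(x, degree=3):
    """Turn a 1-D feature x into columns [1, x, x^2, ..., x^degree]."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** d for d in range(degree + 1)])

# The result can be fed straight into gradient descent or the
# normal equation (remember to scale the higher powers first).
```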
So, that's all for this week. Thank you for your patience.
hertzcat
2018-03-24