Prerequisite knowledge:
1. Supervised learning: the input data is labeled.
2. Unsupervised learning: the input data is not labeled.
1. Model description
In the picture, there is a relationship between the size of a house and its price; each red cross marks one (size, price) observation. The question: given this known relationship between size and price, can we predict the price of a house the next time we know only its size? Observing the distribution of the scattered points, we find that they cluster around a line, so we can use a straight line to approximate this pattern, shown as the blue line (not necessarily the optimal fit). Breaking the data down further: (X, Y) corresponds to (size, price) in the rectangular coordinate system above. Predicting the value of Y from the input X is called regression in machine learning.
Here’s a look at the supervised learning process. We start from a training set (multiple training samples); each sample contains feature values and a target value. The target value is the field (or label) to be predicted, the price in the example above; the features are what remains of each sample after the target value is removed. (A feature should generally be correlated with the target value; if a feature's correlation with the target is zero, there is little point in keeping it for prediction.) Since the prediction above is simulated by a straight line, the hypothesis function can be written as hθ(x) = θ0 + θ1x, whose graph looks like the line y = ax + b.
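To make this concrete, here is a minimal Python sketch of the straight-line hypothesis; the sample data and the values of a and b below are made-up illustrations, not fitted results:

```python
# A minimal sketch of the straight-line hypothesis. The sizes, prices,
# and parameter values below are made-up numbers for illustration.
def h(x, a, b):
    """Hypothesis: predicted price = a * size + b."""
    return a * x + b

# Made-up training samples: (size in square meters, price in 10k yuan).
training_set = [(50, 150), (80, 230), (100, 290), (120, 350)]

a, b = 2.9, 5.0  # illustrative, not fitted, parameter values

# Regression: predict y (price) for a new input x (size).
print(h(90, a, b))
```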
2. Cost function
Suppose hθ(x) = θ0 + θ1x as in the figure above. θ0 and θ1 determine hθ, and the hypothesis function hθ in turn determines the accuracy of the predictions, so the accuracy of the prediction results depends on θ0 and θ1. The first formula in the figure above is the sum of the squared errors between the predicted values and the true values; adding one more factor, 1/(2m) where m is the number of training samples, turns it into the mean squared error between the predicted values and the true values (the extra 2 in the denominator is conventional and simplifies the derivative later):

J(θ0, θ1) = (1/2m) · Σ (hθ(x(i)) − y(i))², summing over the m training samples.

Our goal now is to minimize J(θ0, θ1), written min J(θ0, θ1). J(θ0, θ1) is a function of two variables, so its graph is a three-dimensional surface, which is not easy to discuss directly.
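A minimal Python sketch of this cost function, assuming the conventional 1/(2m) factor and reusing the made-up house data from above:

```python
# A sketch of the cost function J(theta0, theta1): the mean of the
# squared errors between predictions and true values. Data is made up.
def J(theta0, theta1, samples):
    m = len(samples)
    squared_errors = sum((theta0 + theta1 * x - y) ** 2 for x, y in samples)
    return squared_errors / (2 * m)  # the 1/(2m) factor is conventional

samples = [(50, 150), (80, 230), (100, 290), (120, 350)]
print(J(5.0, 2.9, samples))
```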
As shown in the figure below, consider first the value of J(θ1) alone, with θ0 fixed. The graph of J(θ1) is roughly a quadratic function of one variable (a parabola), and the figure shows that the function has a minimum (or a local minimum). So we can solve for min J(θ1).
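The bowl shape can be checked numerically, for example by fixing θ0 = 0 and sweeping θ1 over an illustrative range:

```python
# Numerically check the bowl shape of J(theta1) with theta0 fixed at 0.
# The data and the sweep range are illustrative.
samples = [(50, 150), (80, 230), (100, 290), (120, 350)]

def J(theta0, theta1, samples):
    m = len(samples)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in samples) / (2 * m)

# Sweep theta1 from 2.00 to 3.99 in steps of 0.01.
curve = [(t / 100, J(0.0, t / 100, samples)) for t in range(200, 400)]
best_theta1, best_cost = min(curve, key=lambda p: p[1])
print(best_theta1, best_cost)  # approximate location of min J(theta1)
```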
Now let’s discuss J(θ0, θ1) itself, which is a function of two variables; its graph is a three-dimensional surface. So how do we find min J(θ0, θ1) in three dimensions? The surface has contour lines, and the center of the innermost contour line is min J(θ0, θ1).
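In the same spirit, the center of the contours can be approximated by evaluating J on a coarse grid and picking the cell with the smallest value (the grid ranges below are illustrative):

```python
# A coarse grid search over (theta0, theta1): the cell where J is smallest
# corresponds to the center of the innermost contour line.
samples = [(50, 150), (80, 230), (100, 290), (120, 350)]

def J(theta0, theta1, samples):
    m = len(samples)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in samples) / (2 * m)

grid = [(float(t0), t1 / 100, J(float(t0), t1 / 100, samples))
        for t0 in range(-20, 21)
        for t1 in range(200, 400, 5)]
theta0, theta1, cost = min(grid, key=lambda g: g[2])
print(theta0, theta1, cost)
```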
3. Gradient descent algorithm
Gradient descent algorithm: its main purpose is to find the minimum of an objective function iteratively, i.e., to converge to the minimum.
A man is stuck on a mountain and needs to get down (that is, to find the lowest point of the mountain, the valley). But the mountain fog is very heavy, so visibility is very low; the path down the mountain cannot be seen, and he must use the information around him to find his way down step by step. At this point, he can use the gradient descent algorithm. How? First, take his current position as the base, find the steepest descent at that position, and take a step down; then again take the new position as the base, find the steepest descent there, and keep walking until he reaches the lowest point.

The idea of gradient descent is very similar to this walk down the mountain. We start with a differentiable function, which represents the mountain; our goal is to find the minimum of this function, which is the bottom of the mountain. Following the scenario above, the fastest way down is to find the steepest direction at the current position and step in that direction. For the function, that means finding the gradient at the given point and moving in the opposite direction of the gradient, which is the direction in which the function decreases fastest! (The gradient, grad, is introduced in calculus.) So we repeat this procedure, taking the gradient over and over, and we arrive at a local minimum, much like walking downhill. Taking the gradient determines the steepest direction, which plays the role of measuring direction in the scenario.
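This walking-downhill loop is short to write down. A minimal sketch on the one-variable function f(x) = x², whose minimum plays the role of the valley (the starting point and step size are arbitrary choices for illustration):

```python
# Gradient descent on the one-variable "mountain" f(x) = x**2,
# whose gradient is f'(x) = 2*x.
def grad(x):
    return 2 * x

x = 10.0      # current position on the mountain
alpha = 0.1   # step size
for _ in range(100):
    x = x - alpha * grad(x)  # step opposite to the gradient
print(x)      # approaches the valley at x = 0
```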
In the gradient descent algorithm, α is called the learning rate or step size. It controls the distance covered in each step, so that we neither go too fast and overshoot the lowest point (as in the green path above) nor go so slowly that the sun sets before we reach the bottom of the hill (as in the red path above). So the choice of α is very important in gradient descent! It can be neither too large nor too small: too small, and reaching the lowest point may be very slow; too large, and we may overshoot the lowest point!
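The effect of α can be seen by rerunning the same loop with different step sizes (the values are chosen purely for illustration): too small creeps along, too large overshoots and diverges.

```python
# The same descent on f(x) = x**2 with three learning rates.
def descend(alpha, steps=50, x=10.0):
    for _ in range(steps):
        x = x - alpha * (2 * x)  # gradient of x**2 is 2*x
    return x

print(descend(0.001))  # too small: after 50 steps, still far from 0
print(descend(0.1))    # reasonable: very close to the minimum
print(descend(1.1))    # too large: every step overshoots and |x| grows
```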
Now apply this to the two-variable function J(θ0, θ1), whose three-dimensional surface is:
Then the gradient descent algorithm is expressed as:

θj := θj − α · ∂J(θ0, θ1)/∂θj, for j = 0 and j = 1, updating both parameters simultaneously.
That is, we search in every direction for the way down into the valley.
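Putting the pieces together, here is a sketch of gradient descent applied to J(θ0, θ1) for the straight-line hypothesis; both partial derivatives follow from differentiating J, and both parameters are updated simultaneously. The rescaled data, α, and iteration count are illustrative choices:

```python
# Gradient descent for h(x) = theta0 + theta1 * x, minimizing
# J(theta0, theta1) = 1/(2m) * sum((h(x) - y)**2). The data is the house
# example rescaled (sizes in hundreds of square meters, prices in
# millions) so a single learning rate suits both parameters.
samples = [(0.5, 1.5), (0.8, 2.3), (1.0, 2.9), (1.2, 3.5)]
theta0, theta1 = 0.0, 0.0
alpha = 0.1
m = len(samples)

for _ in range(5000):
    # Partial derivatives of J with respect to theta0 and theta1.
    d0 = sum(theta0 + theta1 * x - y for x, y in samples) / m
    d1 = sum((theta0 + theta1 * x - y) * x for x, y in samples) / m
    # Simultaneous update: both derivatives use the old parameter values.
    theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1

print(theta0, theta1)  # near the intercept and slope of the best-fit line
```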