1. Multiple features
2. Gradient descent for multiple variables
3. Gradient descent for multiple variables – feature scaling
When gradient descent searches for the global optimum over multiple variables, features with very different value ranges make the contour plot of the cost function long and narrow. In the housing example, the house size runs from 0 to 2000 while the number of rooms runs from 1 to 5; on such elongated contours, gradient descent zigzags and takes a long time to converge.
Therefore, the features should be scaled to similar ranges, for example x1/2000 and x2/5, so that both x1 and x2 fall in [0, 1]. The contours then become more circular and the global optimum is found faster.
Mean normalization: x1 := (x1 − μ1)/s1, where μ1 is the average of x1 and s1 is the range of x1, generally the highest value of x1 minus the lowest value. Applying feature scaling in this way can greatly reduce the number of iterations needed to converge and improve computational efficiency.
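A minimal sketch of mean normalization in NumPy; the housing numbers below are hypothetical, chosen only to mirror the size/rooms example above:

```python
import numpy as np

# Mean normalization: x := (x - mu) / s, where mu is the feature mean
# and s is the feature range (max - min). Hypothetical data:
# column 0 is house size (sq ft), column 1 is number of rooms.
X = np.array([[2104., 5.],
              [1416., 3.],
              [1534., 3.],
              [852.,  2.]])

mu = X.mean(axis=0)                  # per-feature mean
s = X.max(axis=0) - X.min(axis=0)    # per-feature range
X_norm = (X - mu) / s

# Every feature now has zero mean and lies within roughly [-1, 1]
print(X_norm)
```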
4. Gradient descent – learning rate
How to choose the learning rate α in gradient descent is the problem to be discussed in this section.
When the decrease in the cost function between one iteration and the next is less than 10^-3, the algorithm is usually considered to have converged and iteration can stop. In practice, however, some algorithms converge after 30 iterations while others may need three million, depending on the scenario and the algorithm, so the number of iterations and the stopping point are hard to determine in advance.
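The stopping rule above can be sketched as a one-line check; the threshold and function name are illustrative:

```python
# Automatic convergence test: declare convergence when the cost
# decreases by less than epsilon = 1e-3 between consecutive iterations.
def has_converged(prev_cost, cost, epsilon=1e-3):
    return prev_cost - cost < epsilon

print(has_converged(10.0, 9.9995))  # decrease of 0.0005 is below 1e-3
print(has_converged(10.0, 9.0))     # cost is still falling quickly
```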
Conclusion:
1. If the learning rate is too small, convergence is slow.
2. If the learning rate is too large, the cost function may not decrease on every iteration and may fail to converge.
Therefore, to find a suitable learning rate, it is best to plot the cost function against the number of iterations and judge from the curve. Ng tries candidate values spaced roughly 3× apart (e.g. 0.001, 0.003, 0.01, 0.03, …) to bracket the largest and smallest workable learning rates, and the value he finally chooses is typically slightly smaller than the maximum workable rate.
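The sweep over learning rates can be sketched as below. The data, iteration count, and function name are assumptions for illustration; in practice one would plot the cost curve rather than only print the final cost:

```python
import numpy as np

def gradient_descent_cost(X, y, alpha, iters=200):
    """Run batch gradient descent for linear regression and return the
    final cost J(theta), for comparing learning rates."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # gradient of J(theta)
        theta -= alpha * grad
    return np.sum((X @ theta - y) ** 2) / (2 * m)

# Hypothetical data: y = 1 + 2x, with x already scaled to [0, 1]
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

# Candidate learning rates spaced roughly 3x apart, as described above
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    print(f"alpha={alpha}: final cost {gradient_descent_cost(X, y, alpha):.6f}")
```

A learning rate that is too small leaves the cost far from its minimum after the same number of iterations, which is exactly what the printed comparison shows.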
5. Features and polynomial regression
As in the example above, if a straight line fits the data poorly, you may want to fit a square-root function instead. There is no fixed method for polynomial regression: once you are familiar with the data, you can use various feature transformations to fit the equation.
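Polynomial regression reduces to linear regression on transformed features; a sketch with synthetic data (the coefficients and sizes are invented for the demo):

```python
import numpy as np

# To fit price = t0 + t1*size + t2*sqrt(size), treat size and
# sqrt(size) as two separate features of an ordinary linear model.
size = np.array([500., 1000., 1500., 2000.])
price = 50.0 + 0.1 * size + 3.0 * np.sqrt(size)   # synthetic target

X = np.column_stack([np.ones_like(size), size, np.sqrt(size)])
theta, *_ = np.linalg.lstsq(X, price, rcond=None)
print(theta)  # recovers approximately [50, 0.1, 3]
```

Note that if gradient descent were used here instead of a direct solver, feature scaling would matter even more, since size and sqrt(size) have very different ranges.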
6. Normal equation (direct solution, as distinct from the iterative method)
The normal equation computes θ = (XᵀX)⁻¹Xᵀy: the inverse of X transpose times X, multiplied by X transpose, multiplied by y. (Where this equation comes from is not derived here.) The normal equation does not require feature scaling and yields θ directly.
Comparison of advantages and disadvantages of normal equation and gradient descent:
1. The advantage of the normal equation is that no learning rate has to be chosen and no iteration is needed. The disadvantage is that it cannot handle a very large number of features n: computing (XᵀX)⁻¹ has time complexity O(n³), so the computation becomes too slow. As a rule of thumb, the normal equation should not be used when n is greater than about 10,000.
2. The advantage of gradient descent is that it handles large n and works very well there, but a learning rate α must be chosen and repeated iterative computation is required.
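The one-step solution can be sketched directly from the formula; the toy data below is hypothetical:

```python
import numpy as np

# Normal equation: theta = (X^T X)^(-1) X^T y, solved in one step
# with no learning rate and no iteration. Toy data: y = 2 + 3x.
x = np.array([1., 2., 3., 4.])
y = 2.0 + 3.0 * x
X = np.column_stack([np.ones_like(x), x])   # prepend intercept column

theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # ≈ [2, 3]
```

In numerical practice, `np.linalg.solve(X.T @ X, X.T @ y)` is preferred over forming the inverse explicitly, but the literal formula is shown here to match the equation in the text.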
7. Normal equation (solution when the matrix is non-invertible)
In practice XᵀX is rarely non-invertible. If it is, the following remedies can be used:
1. Redundant features: for example, if x1 is size in square meters and x2 is size in square feet, the two differ only by a fixed conversion factor, which can make XᵀX non-invertible. Check the features and remove the duplicate.
2. Too many features: if the number of features is so large that the matrix becomes non-invertible, delete some features before computing.
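A sketch of the redundant-feature case, using the pseudo-inverse (the data and conversion factor are illustrative):

```python
import numpy as np

# When one feature is an exact multiple of another (size stored in both
# square meters and square feet), X^T X becomes singular; np.linalg.inv
# would raise or return an unstable result, but the pseudo-inverse
# (pinv) still yields a usable solution.
sqm = np.array([50., 100., 150.])
sqft = sqm * 10.7639                      # linearly dependent on sqm
X = np.column_stack([np.ones_like(sqm), sqm, sqft])
y = 100.0 + 2.0 * sqm

theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(np.allclose(X @ theta, y, atol=1e-2))   # predictions still fit
```

This is also why course environments like Octave recommend `pinv` over `inv` for the normal equation: `pinv` handles the singular case gracefully.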