Prerequisites: matrix calculus and a little programming knowledge
I. Introduction
The previous articles introduced two binary classification algorithms, the perceptron algorithm and the pocket algorithm. These algorithms solve classification problems, but many real-world problems ask for a specific numerical answer instead: predicting housing prices in an area, deciding how large a credit line a bank should extend to someone, deciding how much of a stock to buy or sell today, and so on. Problems whose result is a specific numerical value are generally called regression problems in machine learning. Regression analysis is a statistical method for studying the relationship between groups of variables, and it is also widely used in machine learning. Below we introduce one such algorithm model, Linear Regression 1.
II. Model Introduction
Before introducing the model, let's look at an example. Suppose a certain region has tallied the income of people with different numbers of years of work experience (simulated data), as shown in the following table:
| Years of work experience | Average monthly salary |
| --- | --- |
| 1 year | 1598 yuan |
| 2 years | 3898 yuan |
| 3 years | 6220 yuan |
| 4 years | 7799 yuan |
| 5 years | 10510 yuan |
From the data in the table above, we can plot the local average monthly income against years of work experience:
From the plot it is intuitively clear that these points lie on or near a straight line, so we can reasonably guess that the local average income has a roughly linear relationship with years of work experience. We only need to find such a straight line, and then we can predict the average income of local people with, say, six years of experience. The process of finding this line is called linear regression analysis.
Definition: Given some random sample points {x1, y1}, {x2, y2}, …, find a hyperplane (a straight line in one variable, a plane in two variables) that fits these sample points. (1) The general form of the hyperplane equation. (2) As in the perceptron algorithm, b is treated as the 0th component of w, so the hyperplane equation simplifies to the dot product of the two vectors w and x.
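In symbols, the two forms described in (1) and (2) can be written as:

$$
y = w_1 x_1 + w_2 x_2 + \cdots + w_M x_M + b
$$

$$
y = \sum_{i=0}^{M} w_i x_i = w^T x \qquad (x_0 = 1,\; w_0 = b)
$$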
With one variable this is a straight line, with two variables a plane, and with more variables a hyperplane.
III. Algorithm Steps
So how do we find the best hyperplane among a bunch of sample points? In the example of years of work experience and average monthly salary above, it seems that many different lines could be drawn to fit the points.
Take the two lines above: which one intuitively fits better? The dotted lines in the figure show that each sample point is farther from line B than from line A, so line A clearly fits better than line B. This suggests that the quality of a fit can be judged by the distances between the sample points and the line. Suppose there are N sample points {x, y}, each with M independent variables {x1, x2, …}. The sum of the distances between all sample points and the hyperplane can be defined as the cost function for fitting these sample points, where w is an M-dimensional column vector, each x is also an M-dimensional column vector, and y is a real number:
Since the function contains an absolute value, we rewrite it as a square; this squared form corresponds to the Euclidean distance 2 in geometry.
Intuitively, we only need to minimize the value of the cost function, that is, the sum of the Euclidean distances from all sample points to the hyperplane; the corresponding w is then the weight vector of the hyperplane.
The argmin 3 function returns the value of the variable at which the expression inside the parentheses attains its minimum.
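Putting the last few paragraphs into symbols, the cost function and the corresponding minimization are:

$$
\operatorname{Cost}(w) = \sum_{i=1}^{N} \left| w^T x_i - y_i \right|
\quad\Longrightarrow\quad
\operatorname{Cost}(w) = \sum_{i=1}^{N} \left( w^T x_i - y_i \right)^2
$$

$$
w^* = \operatorname*{argmin}_{w} \sum_{i=1}^{N} \left( w^T x_i - y_i \right)^2
$$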
The method of solving for w by minimizing this function is called the least squares method. Since the cost function is a convex function 4, by the properties of convex functions a local minimum is also the global minimum, and the optimal analytic solution for w can be obtained directly, where X is an N x M matrix and y is an N-dimensional column vector:
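The analytic solution referred to here is:

$$
w = \left( X^T X \right)^{-1} X^T y
$$

where the rows of X are the N samples and y collects the N target values.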
IV. Proof of Principle
Convexity of the linear regression cost function. A convex function is a real-valued function f defined on a convex subset C of a vector space such that for any two vectors x1 and x2 in C the following inequality holds:
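The (midpoint) form of the convexity condition used in the steps below is:

$$
f\!\left( \frac{x_1 + x_2}{2} \right) \le \frac{f(x_1) + f(x_2)}{2}
$$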
The left-hand side of the inequality: (1) multiply the left-hand side by 2; (2) move the 2 inside the summation and write out the three terms inside the parentheses; (3) expand the square into six terms.
The right-hand side of the inequality: (1) the right-hand side is also multiplied by 2; (2) write it as the sum of the two summations; (3) expand the squared terms inside the parentheses; (4) rearrange so the result lines up with the expansion above.
Subtract the left-hand side of the inequality from the right-hand side and call the difference Δ; now we only need to prove that this difference is greater than or equal to zero. (1) Compare the two expanded results: their last three terms are identical and cancel when subtracted. (2) Factor out one half. (3) Looking at what remains inside the parentheses, it turns out to be a perfect square.
Since the difference between the right-hand side and the left-hand side of the inequality is a summation of perfect squares, the result must be greater than or equal to zero over the real numbers, so the cost function is convex.
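Carrying out the steps above for two weight vectors w_1 and w_2 gives the following reconstructed sketch of the computation:

$$
2\operatorname{Cost}\!\left(\tfrac{w_1+w_2}{2}\right) = \frac{1}{2}\sum_{i=1}^{N}\left(w_1^T x_i + w_2^T x_i - 2y_i\right)^2
$$

$$
\operatorname{Cost}(w_1) + \operatorname{Cost}(w_2) = \sum_{i=1}^{N}\left(w_1^T x_i - y_i\right)^2 + \sum_{i=1}^{N}\left(w_2^T x_i - y_i\right)^2
$$

$$
\Delta = \operatorname{Cost}(w_1) + \operatorname{Cost}(w_2) - 2\operatorname{Cost}\!\left(\tfrac{w_1+w_2}{2}\right) = \frac{1}{2}\sum_{i=1}^{N}\left(w_1^T x_i - w_2^T x_i\right)^2 \ge 0
$$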
Analytic solution of the linear regression cost function. (1) The cost function of linear regression. (2) It can be rewritten as the dot product of two N-dimensional vectors. (3) Using A to denote the first N-dimensional row vector, the cost function is simply the vector A times its own transpose.
(1) Definition of the vector A. (2) A can be written as the difference of two N-dimensional row vectors. (3) The first row vector can be written as w transpose, an M-dimensional row vector, multiplied by an M x N matrix (remember that each x is originally an M-dimensional column vector). (4) Define an N x M matrix X whose N rows correspond to the N samples and whose M columns correspond to the M variables, and an N-dimensional column vector y whose N components correspond to the N samples. Stacking the column vectors x side by side gives the transpose of X, and the row of the values y is the transpose of the column vector y.
(1) Write the cost function as the dot product of the two vectors. (2) Substitute the form of A derived above into the cost function. (3) By the properties of the transpose 5, the transposed factor at the back can be rewritten. (4) Expand the product. (5) Observe that the two middle terms are transposes of each other; since each evaluates to a real number, the two terms must be equal and can be combined.
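The corresponding chain of equalities is:

$$
\operatorname{Cost}(w) = \sum_{i=1}^{N}\left(w^T x_i - y_i\right)^2 = A A^T, \qquad A = w^T X^T - y^T
$$

$$
\operatorname{Cost}(w) = \left(w^T X^T - y^T\right)\left(X w - y\right) = w^T X^T X w - 2\, w^T X^T y + y^T y
$$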
(1) Take the partial derivative of the cost function with respect to w. By the vector derivative formulas, only the first and second terms depend on w; the last term is a constant. Because the cost function is convex, it is minimized when its partial derivative with respect to w is the zero vector. (2) Set the derivative to zero and divide by 2, move the second term to the other side, and multiply both sides by the inverse of the matrix; on the left the matrix and its inverse give the identity, leaving only the vector w.
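Differentiating and setting the gradient to the zero vector:

$$
\frac{\partial \operatorname{Cost}(w)}{\partial w} = 2 X^T X w - 2 X^T y = 0
\;\Longrightarrow\;
X^T X w = X^T y
\;\Longrightarrow\;
w = \left(X^T X\right)^{-1} X^T y
$$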
In this way, the analytic solution for w is obtained. Note that when the M x M matrix X^T X inside the parentheses is not a full rank matrix 6, it has no inverse; the underlying cause is multicollinearity 7 among the independent variables X. The next section introduces multicollinearity in linear regression and how to deal with it. The part of the following formula in front of the y vector is called the pseudo-inverse of the matrix X, which can be computed by singular value decomposition 8 (SVD).
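For reference, when X^T X is invertible the pseudo-inverse has the closed form below; in general it can be computed from the singular value decomposition X = U Σ V^T (standard results, stated here without proof):

$$
X^{+} = \left(X^T X\right)^{-1} X^T, \qquad X^{+} = V \Sigma^{+} U^T, \qquad w = X^{+} y
$$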
Recall that X is an N x M matrix and y is an N-dimensional column vector. In the example of years of work experience and average monthly salary, X is a 5 x 2 matrix (a column of ones for the intercept and a column for the years of experience), y is a 5-dimensional column vector, and the computed w is a 2-dimensional column vector, so the linear equation for this example is y = 2172.5 * x - 512.5.
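As a quick sanity check, the analytic formula can be evaluated directly on the table above (a minimal sketch; the leading column of ones in X stands in for the intercept term b, as in the definition of the hyperplane):

```python
import numpy as np

# First column: intercept term (all ones); second column: years of work experience
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([1598, 3898, 6220, 7799, 10510], dtype=float)

# w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)  # [-512.5  2172.5], i.e. y = 2172.5 * x - 512.5
```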
Geometric interpretation of the analytic solution of the linear regression cost function. The linear equation in matrix form, where y is an N-dimensional column vector, X is an N x M matrix, and w is an M-dimensional column vector:
First, look at the figure below. The gray hyperplane S is spanned by the M black N-dimensional column vectors of X. The red N-dimensional column vector y holds the y values of the actual sample points. In the example of years of work experience and average monthly salary, x1 = (1, 1, 1, 1, 1), x2 = (1, 2, 3, 4, 5), and y = (1598, 3898, 6220, 7799, 10510). We now want to find the linear combination y' of the columns of X that minimizes the length of y - y', the purple vector in the figure. As the figure shows, y - y' is shortest when it is normal to the hyperplane, which is equivalent to y - y' being perpendicular to every column vector x.
Geometric interpretation
(1) y - y' is perpendicular to each column vector of X; note that the result is the M-dimensional zero vector. (2) Substitute the linear equation for y'. (3) Expand the parentheses. (4) Move the term across and solve for w.
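Following the four steps above:

$$
X^T\left(y - y'\right) = 0
\;\Longrightarrow\;
X^T\left(y - X w\right) = 0
\;\Longrightarrow\;
X^T y - X^T X w = 0
\;\Longrightarrow\;
w = \left(X^T X\right)^{-1} X^T y
$$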
We can see that the geometric interpretation gives the same result as taking derivatives.
V. Code Implementation
Using Python to implement the linear regression algorithm:
```python
import numpy as np

def linear(X, y):
    """Linear regression.

    Args:
        X: training data set (N x M matrix)
        y: target label values (N-dimensional vector)
    Returns:
        w: weight coefficients (M-dimensional vector)
    """
    # np.linalg.pinv directly computes the pseudo-inverse of the matrix
    return np.linalg.pinv(X).dot(y)
```
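A minimal usage sketch on the salary data above (assuming the linear function just defined and an X whose first column of ones plays the role of the intercept):

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([1598, 3898, 6220, 7799, 10510], dtype=float)

w = linear(X, y)
print(w)  # approximately [-512.5, 2172.5]
```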
VI. Third-Party Library Implementation
Implementation with scikit-learn 9:
```python
from sklearn.linear_model import LinearRegression

# Initialize the linear regressor; the intercept is already included as a column of ones in X
lin = LinearRegression(fit_intercept=False)
# Fit the linear model
lin.fit(X, y)
# Weight coefficients
w = lin.coef_
```
VII. Animation Demonstration
As the linear equation takes different weight coefficients, the value of the corresponding cost function changes. The animation shows the cost first decreasing to a minimum and then gradually increasing again.
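A small parameter sweep (a hypothetical sketch, not the original animation code) reproduces this behavior by fixing the intercept at its optimum and varying the slope:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([1598, 3898, 6220, 7799, 10510], dtype=float)

# Sweep the slope while keeping the intercept fixed at its optimal value
for slope in np.linspace(1000.0, 3500.0, 6):
    w = np.array([-512.5, slope])
    cost = np.sum((X @ w - y) ** 2)
    print(f"slope = {slope:7.1f}, cost = {cost:14.1f}")
# The cost falls to a minimum near slope = 2172.5 and rises again on either side
```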
VIII. Mind Map
IX. References
- en.wikipedia.org/wiki/Linear…
- en.wikipedia.org/wiki/Euclid…
- en.wikipedia.org/wiki/Arg_ma…
- en.wikipedia.org/wiki/Convex…
- en.wikipedia.org/wiki/Transp…
- en.wikipedia.org/wiki/Rank_(…
- en.wikipedia.org/wiki/Multic…
- en.wikipedia.org/wiki/Singul…
- scikit-learn.org/stable/modu…
The full demo can be found here
Note: this article strives to be accurate and easy to understand, but the author is a beginner and his level is limited; if there are any errors or omissions in the article, readers are encouraged to point them out in the comments.