Linear regression is the most basic mathematical model in statistics. It appears in almost every field, from quantitative finance to econometrics, and deep learning, the current craze, is also partly built on it. It is therefore worth understanding the principles of linear regression.
Linear regression models existing data and can predict future data. Some people find linear regression too easy to even call it machine learning; others feel that linear regression is already wrapped up in many programming libraries, so you can just call a function without learning much of the math. In fact, linear regression is the best starting point for all machine learning techniques: many of the most sophisticated techniques, as well as the deep neural networks of the current wave, are more or less built on linear regression.
Machine learning modeling process
The most common scenario for machine learning is supervised learning: given some data, a computer learns a pattern from it and then uses that pattern to predict new data. A simple supervised learning task can be expressed as: given N data pairs (xᵢ, yᵢ), use a machine learning model to fit them and obtain a model, where each data pair (xᵢ, yᵢ) is a sample, xᵢ is the feature, and yᵢ is the true value (label).
For example, the figure above shows an income data set relating educational attainment to annual earnings. Based on our social experience and the distribution of the data in the graph, we feel that a straight line can describe the phenomenon that "income increases as years of education increase." So for this data set, one of the simplest machine learning models we can use is unary linear regression.
Unary linear regression
In high school we often used the equation y = mx + b to solve math problems; it describes how the variable y changes with the variable x. Graphically, the equation is a straight line. If we build a mathematical model like this, then given x we can obtain a predicted value ŷ. Statisticians put a small hat on the variable to indicate that it is a predicted value, as opposed to the actually observed data. The equation has only one independent variable and contains no squared, cubed, or other higher-order terms, so it is called a unary linear equation.
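As a minimal sketch of this idea (the slope and intercept below are made-up values, not fitted from any real data), the model is just one multiplication and one addition:

```python
def predict(x, m, b):
    """Unary linear model: return the prediction y_hat = m * x + b."""
    return m * x + b

# Hypothetical parameters: each extra year of education adds 5 (in thousands) to income.
m, b = 5.0, 20.0
print(predict(12, m, b))  # predicted income for 12 years of education -> 80.0
```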
When modeling the income data set, we can let the parameters m and b take different values to build different lines, thus forming a parameter family. Within the parameter family there is an optimal combination that describes the data set in a statistically optimal way. The process of supervised learning can then be defined as: given N data pairs (xᵢ, yᵢ), find the best parameters m* and b* so that the model fits the data well.
The figure above shows lines with different parameters. Which line is the best? How do we measure whether a model optimally fits the data? Machine learning uses a loss function to answer this question. The loss function, also called the cost function, measures the degree of difference between the model's predicted value ŷ and the real value y. As the name suggests, this function computes the loss, or cost, caused by the model's errors: the larger the loss function, the worse the model and the less it can fit the data. Statisticians usually use L to represent the loss function.
For linear regression, a simple and practical loss function is the square of the error between the predicted value and the true value. The figure above shows the error between predicted and true values on the income data set.
Formula 1 represents the error between the predicted value and the real value at a single sample point.
Formula 2 sums all the errors over the data set and averages them; substituting Formula 1 into it gives Formula 3.
In Formula 4, arg min is a common mathematical notation meaning: find the parameters m* and b* that minimize the function L.
In the graph, the error is a line segment; after squaring, it forms a square, and the sum and average of these square areas is the loss function of Formula 3. The smaller the average area of the squares, the smaller the loss. For a given data set, the values of x and y are known, and the parameters m and b need to be solved; the process of solving the model is the process of solving Formula 4. This is the mathematical expression of the least squares method: "squares" refers to squaring the errors, and "least" refers to minimizing the loss function. So far we have found that, with a loss function, the process of machine learning reduces to finding the optimal solution of the loss function, that is, to an optimization problem.
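As a rough sketch of Formula 3 in code (the toy data below is made up purely for illustration), the loss is just the average of the squared errors:

```python
import numpy as np

def squared_loss(m, b, x, y):
    """Mean squared error between the predictions m*x + b and the true values y."""
    y_hat = m * x + b
    return np.mean((y_hat - y) ** 2)

# Hypothetical income data: years of education vs. annual income.
x = np.array([10.0, 12.0, 14.0, 16.0])
y = np.array([65.0, 78.0, 93.0, 102.0])

# Different (m, b) pairs give different losses; the best pair is the one that minimizes it.
print(squared_loss(5.0, 20.0, x, y))
print(squared_loss(6.5, 0.0, x, y))
```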
So why take the square rather than the absolute value? As we learned in school, the square function has a minimum and is differentiable everywhere, which ensures that when solving for the parameters later we can find that minimum quickly by taking derivatives; the absolute value function is not differentiable at zero and does not have this property.
Least squares parameter solution
For the function in Formula 4, we can take the partial derivatives with respect to m and b; where the derivatives are 0, the loss function reaches its minimum.
Formulas 5 and 6 are the partial derivatives of the loss function with respect to m and b.
The results contain repeated summations over x and y. Given the data set, these sums can be computed and are constants; they can be written as Formulas 7 and 8.
Setting the derivatives to 0 gives the minimum, that is, the optimal solutions m* and b* in Formulas 9 and 10.
The above is the least squares solution process for linear regression.
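A minimal NumPy sketch of this closed-form solution follows. The grouping of the sums may look different from Formulas 9 and 10, but the centered form used here is algebraically equivalent; the data is again a made-up toy set:

```python
import numpy as np

def fit_least_squares(x, y):
    """Closed-form least-squares estimates of the slope m and intercept b."""
    x_mean, y_mean = x.mean(), y.mean()
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b = y_mean - m * x_mean
    return m, b

x = np.array([10.0, 12.0, 14.0, 16.0])
y = np.array([65.0, 78.0, 93.0, 102.0])
m, b = fit_least_squares(x, y)
print(m, b)  # the parameters that minimize the squared loss on this data
```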
Many machine learning models go through this same process: define a loss function, then find the parameters that minimize it. The solution involves some simple calculus, so reviewing partial derivatives will help you understand the mathematics of machine learning.
Multiple linear regression
Now let's extend x to the multivariate case, that is, the case where multiple factors together affect the variable y. This is common in real life: to predict house prices, for example, you need to consider factors such as whether the home is in a good school district, the number of rooms, whether the surrounding area is prosperous, and how convenient the transportation is. There are D dimensions of influence, which the machine learning field calls features. Each sample has one value y to predict and a D-dimensional feature vector. The original parameter m becomes a D-dimensional weight vector w, so a prediction ŷᵢ can be represented as the weighted sum of the sample's features plus the bias b, where x_{i,d} is the value of the i-th sample vector in the d-th feature dimension.
To make the expression simpler, the standalone bias term can be folded into the vector form: use w to represent the parameters and append b as its last element, then extend the D-dimensional feature vector to D+1 dimensions by adding a constant value of 1 at the end. This turns Formula 12 into Formula 13.
The shapes of the vectors and matrices are shown in Formula 14. The vector w represents the weight of each feature in the model; each row of the matrix X is a sample, and each sample contains D+1 feature values, the sample's values in the different dimensions; the vector y holds the true values. The summation term can be expressed as an inner product, ŷᵢ = wᵀxᵢ, and the data composed of N samples can then be expressed compactly as ŷ = Xw, which is Formula 15.
Although Formula 13 can describe the relationship between the variables, the machine learning field generally prefers the vector-multiplication form of Formula 15. This is not only because it is easier to write, but also because modern computers are heavily optimized for vector computation: both CPUs and GPUs like vector operations and process data in parallel, often achieving speedups of hundreds or thousands of times.
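Here is a small sketch of that vectorized form, using a hypothetical feature matrix; the column of ones folds the bias into the weight vector exactly as described above:

```python
import numpy as np

# Hypothetical data: 3 samples, 2 features (e.g. number of rooms, distance to transit).
X_raw = np.array([[3.0, 1.2],
                  [2.0, 0.5],
                  [4.0, 2.0]])

# Append a column of ones so the bias becomes the last entry of w (D -> D+1 features).
X = np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])

# Hypothetical weights: [w_1, w_2, b].
w = np.array([50.0, -10.0, 30.0])

# All N predictions at once: y_hat = X w  (the vector form of Formula 15).
y_hat = X @ w
print(y_hat)
```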
It is important to note that in the formulas, non-bold symbols represent scalars and bold symbols represent vectors or matrices.
Compared with unary linear regression, multiple linear regression no longer fits a straight line but a hyperplane in a multidimensional space, with the data points scattered on both sides of the hyperplane.
The loss function of multiple linear regression still uses the square of the difference between the predicted value and the true value.
Formula 16 is the vector representation of the loss function for the whole model. The pair of double vertical bars denotes a norm; here it is the square of the L2 norm. The norm is usually defined for vectors and is a mathematical notation often used in machine learning; Formula 17 shows the square of the L2 norm of a vector and its derivative.
Taking the derivative of Formula 16 with respect to the parameter vector w yields Formula 18.
Formula 19 gives the optimal parameter vector w*. According to linear algebra, finding the optimal solution amounts to solving a matrix equation, known in English as the Normal Equation.
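A minimal sketch of solving the Normal Equation with NumPy, on a made-up design matrix; solving the linear system XᵀXw = Xᵀy is numerically preferable to explicitly inverting XᵀX:

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X^T X) w = X^T y for the least-squares weight vector w."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical design matrix (last column is the constant 1 for the bias) and targets.
X = np.array([[3.0, 1.2, 1.0],
              [2.0, 0.5, 1.0],
              [4.0, 2.0, 1.0],
              [5.0, 1.0, 1.0]])
y = np.array([170.0, 125.0, 210.0, 270.0])

w = fit_normal_equation(X, y)
print(w)      # learned weights; the last entry is the bias
print(X @ w)  # fitted values for the training samples
```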
Linear regression has a definite optimal solution, and the optimal parameters determine an optimal hyperplane.
Reading this, some friends who know a little about machine learning may ask why gradient descent is not mentioned in the solution. You can in fact use gradient descent instead of solving the matrix equation. In Formula 20, computing the matrix inverse is expensive, with complexity roughly on the order of O(n³). When the feature dimension reaches the millions or the number of samples is very large, the computation takes a very long time and the parameters may not even fit in a single machine's memory, so solving the matrix equation directly is impractical and the gradient descent method must be used. Gradient descent approaches the optimal solution to a given precision; its solving speed is very advantageous when the amount of data is large, but it may not reach the exact optimum. This column will cover gradient descent in more detail in the future.
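For comparison, here is a minimal gradient-descent sketch for the same vectorized loss; the learning rate and iteration count are arbitrary illustrative choices, and real implementations tune them and add a stopping criterion:

```python
import numpy as np

def fit_gradient_descent(X, y, lr=0.01, n_iters=5000):
    """Minimize the mean squared error (1/N) * ||X w - y||^2 by gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        grad = (2.0 / n_samples) * X.T @ (X @ w - y)  # gradient of the loss w.r.t. w
        w -= lr * grad
    return w

# Hypothetical (already standardized) features plus a bias column of ones.
X = np.array([[ 0.5, -1.0, 1.0],
              [ 1.5,  0.0, 1.0],
              [-0.5,  1.0, 1.0],
              [-1.5,  0.0, 1.0]])
y = np.array([3.0, 5.0, 1.0, -1.0])

print(fit_gradient_descent(X, y))  # close to the normal-equation solution
```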
Earlier, I spent some time describing the solution process of linear regression, and many formulas appeared. Like many friends, I used to hate looking at formulas; when I saw a wall of them, I felt the material must be very hard to learn. However, if you calm down and read them carefully, you will find that these formulas only use relatively basic parts of calculus and linear algebra and require no lofty knowledge; friends with a science or engineering background should be able to understand them.
Usage scenarios for linear regression
So when exactly can you use linear regression? The statistician Anscombe gave four data sets, known as Anscombe's Quartet, and the distributions of these four data sets show that not every data set can be modeled with unary linear regression. Real-world problems tend to be more complex, and variables almost never fit a linear model perfectly. Therefore, to use linear regression, the following assumptions should hold:
- Linear regression is a regression problem.
- The relationship between the variable to predict, y, and the independent variable x is linear.
- The errors follow a normal distribution with mean 0, and their variance does not change with x.
- The distribution of the variable x must have some variability.
- In multiple linear regression, different features should be independent of each other, to avoid linear correlation between them.
Regression problem and classification problem
In contrast to regression, in a classification problem the set of values the predicted variable y can take is finite, and the predicted value can only be one element of that finite set. When the set of values of y is infinite and continuous, we call it regression. For example, a weather forecast that predicts whether it will rain tomorrow is a binary classification problem; predicting how much rain will fall tomorrow is a regression problem.
The relationship between the variables is linear
Linearity usually means that the proportion between the variables is constant: graphically, the relationship between the variables is a straight line with a constant slope. This is a very strong assumption; if the distribution of the data points forms a complex curve, the data cannot be modeled with linear regression. It can be seen that the data in the upper right corner of the Quartet is not suitable for modeling by linear regression.
The errors follow a normal distribution with a mean of zero
The concept of error has been mentioned in the previous solving process of the least square method, and error can be expressed as error = actual value – predicted value.
This assumption can be interpreted as follows: linear regression allows an error between the predicted value and the true value, and as the amount of data increases the average error of the data approaches 0. Graphically, each true value may lie above or below the line, and when there is enough data the errors cancel out. If the errors do not follow a normal distribution with mean zero, some outliers have probably appeared, and the distribution of the data is likely to look like the lower left corner of Anscombe's Quartet.
This is also a very strong assumption. If you use a linear regression model, you must assume that the errors in the data are normally distributed with a mean of zero.
The distribution of the variable x must have variability
Linear regression also requires the variable x to have some variation; the data cannot be distributed along a vertical line like the data in the lower right corner of Anscombe's Quartet.
Multiple linear regression features are independent of each other
If different features are not independent of each other, collinearity between features may occur, which leads to an inaccurate model. To take an extreme example, suppose several features are used to predict housing price: the number of rooms, the number of rooms multiplied by 2, the negative of the number of rooms, and so on. These features are linearly correlated. If the model has only these features and lacks other effective features, it can still be trained, but the resulting model is inaccurate and has little predictive power.
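A tiny sketch of why this breaks the closed-form solution above: with perfectly collinear features, XᵀX becomes rank-deficient (singular), so the Normal Equation has no unique solution. The numbers below are made up:

```python
import numpy as np

rooms = np.array([2.0, 3.0, 4.0, 5.0])
rooms_times_two = rooms * 2  # a "feature" that is exactly twice another feature

X = np.column_stack([rooms, rooms_times_two, np.ones_like(rooms)])
gram = X.T @ X

print(np.linalg.matrix_rank(gram))  # 2, not 3: the matrix is rank-deficient
print(np.linalg.cond(gram))         # enormous condition number -> no stable unique solution
```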
There are many other mathematical assumptions about linear regression, but they are not relevant to the problem at hand and will not be covered here.
Conclusion
To recap: linear regression is the most basic mathematical model in statistics, appearing in almost every field from quantitative finance to econometrics, and even today's deep learning craze is partly built on it, so it is well worth understanding how it works.
One of the most intuitive ways to solve linear regression is the least squares method: its loss function is the squared error, which has a minimum point that can be obtained by solving a matrix equation. Although the derivation involves a fair amount of mathematical notation, linear regression is not mathematically complicated, and anyone with a basic background in calculus and linear algebra can understand how it works.