Introduction of algorithm
Linear regression is the simplest of all regression algorithms. Algorithms: take a two-dimensional plane as an example. In junior high school math, you have 2 points and you can find a line. So what if I had 100 points. How do I find a straight line? It’s impossible for a straight line to fit 100 points. There must be some error, so the goal of linear programming is to fit the 100 points with a straight line. Minimize the total error. Regression is usually used to study linear relationships between objects and features. Mainly applicable to predictive analysis.
Algorithm scenario
Let’s use code to construct the usage scenarios of the algorithm. First we define n points. Let’s look at the distribution of these n points.
from sklearn.datasets import load_diabetes import matplotlib.pyplot as plt import numpy as np X = Np. Arange (0,10,0.05). Reshape (1, Seed (100) y = 0.5 * np.arange(0,10,0.05) + 3 + np.random. Rand (len(X)) plt.scatter(x.reshape (1, -1), y)Copy the code
The results are shown below:
We can see that the n points we deliberately constructed are roughly distributed around the line y=θ∗x+by = \ Theta * x+by =θ∗x+b.
So now we have a problem, if we’re given a new point, and the x-coordinate is x0x_0x0, please predict the y-coordinate y0y_0y0.
Algorithm model
We saw in the last step that given a bunch of points, in a linear pattern, we need a line to fit this bunch of samples.
So let’s say that the sample is centered around
This line is distributed. Y ^\hat yy^ indicates the predicted value. The problem of solving the linear regression model is then reduced to solving the parameters θ\thetaθ and BBB.
As we said in the last step, the goal of linear regression is to find a line that minimizes the error at all points. What criteria do we use to represent the “error” in order to solve for the optimal parameter?
We now know that the characteristics of some sample are xix_ixi. The target value yiy_iyi corresponding to sample xix_ixi is also known.
We calculate the predicted value y^ I \hat y_iy^ I using the parameters θ\thetaθ and BBB:
Then we define the loss of a single point
That’s the squared distance between the real value and the predicted value. Then the sum of the errors of n points is as follows:
Substituting the parameters, we get:
That’s the loss function we’re going to define. So now the problem has shifted to
That is, solving θ\thetaθ and ${b} minimizes the loss function.
model
There are two common solving methods for linear regression model :(1) least square method (2) gradient descent algorithm.
The least square method is easy to solve. The idea is to compute the derivatives of L with respect to θ and bL with respect to \theta and bL with respect to θ and b, so that the formula is 0. That’s when the formula has a minimum. The details will be deferred.
Since the gradient descent algorithm has a wider range of application scenarios, we also use the gradient descent algorithm here.
The process of gradient descent algorithm solving linear regression:
(1) Initialize θ\thetaθ and BBB to very small values that are not equal to 0.
(2) Calculate the predicted value y^\hat YY ^ using the current θ\thetaθ and BBB.
(3) If the error between the predicted value y^\hat YY ^ and the actual predicted value YYy is greater than ϵ\epsilonϵ, jump to (4); If the error between the predicted value y^\hat YY ^ and the actual predicted value YYy is less than ϵ\epsilonϵ, return the current θ\thetaθ and BBB. We’re done. (4) Solve the partial derivative of the loss function and update the parameters θ\thetaθ and BBB:
Implement their own gradient descent algorithm
Class LinearRegression: def __init__(self, eta=0.02, n_iters = 1e4, epsilon = 1e-3): class LinearRegression: def __init__(self, eta=0.02, n_iters = 1e4, epsilon = 1e-3) self._theta = None self._eta = eta self._n_iters = n_iters self._epsilon = epsilon def Cost(self, theta, X_b, y): try: return np.sum(np.square(y - X_b.dot(theta))) / len(X_b) except: return float('inf') def dCost(self, theta, X_b, y): return (X_b.dot(theta) - y).dot(X_b) * 2 / len(X_b) def gradient_descent(self, X_b, y, initial_theta, eta, n_iters, epsilon): theta = initial_theta eta = self._eta n_iters = self._n_iters epsilon = self._epsilon i_iter = 0 try: while i_iter < n_iters: gradient = self.dCost(theta, X_b, y) last_theta = theta theta = theta - eta * gradient diff = abs(self.Cost(theta, X_b, y) - self.Cost(last_theta, X_b, y)) if (diff > 1e100): print('eta is too large! ') break if (diff < epsilon): break i_iter += 1 self._theta = theta except: self._theta = float('inf') return self def fit(self, X_train, y_train): assert X_train.shape[0] == y_train.shape[0], 'train data has wrong dimenssion' self._X_train = X_train self._y_train = y_train X_b = np.hstack([np.ones((len(self._X_train), 1)), Self._x_train]) initial_theta = np.zeros(x_B.shape [1]) eta = 0.01 self.gradient_descent(X_b, y_train, initial_theta, self._eta, self._n_iters, self._epsilon) return self def predict(self, X_test): X_b = np.hstack([np.ones((len(X_test), 1)), X_test]) return np.dot(X_b, self._theta)Copy the code
The above code is saved as a separate file, which can then be referenced and invoked. Notice in our model
In the corresponding code
In our single-feature model, this is:
And when you do the matrix operation, you find that
The left-hand side is the representation in the model, and the right-hand side is the representation in the code, just to make it easier to do matrix operations.
Call your own linear programming model
Linear_regression = Linear_regression (Epsilon = 1E-6, ETA = 0.02) Linear_regression. Fit (X_train, y_train) y_predict = linear_regression.predict(X_test) params = linear_regression._thetaCopy the code
The results are as follows:
Array ([3.45402168, 0.51257892])Copy the code
In other words, the solution can be:
And it’s actually pretty close to what we saw when we constructed this sample. Let’s draw the points and lines in the same picture, and see what it looks like:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1) y_predict = theta * X + b plt.scatter(X, y_predict)Copy the code
The results are shown below:
It’s actually a pretty good fit for this sample.
Note:
The eta must be appropriate, because if it is too small, the result will converge too slowly. After the given number of iterations, it does not converge to the desired value. If it’s too big, it doesn’t converge, it diverges explosively.
Epsilon is the error range given by us. When the error epsilon is less than a certain value, we think that the line has met the accuracy of fitting.
Call the linear programming model in SkLearn
Let’s call the skLearn model again to see what happens:
from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)
y_predict = linear_regression.predict(X_test)
linear_regression.coef_, linear_regression.intercept_
Copy the code
The results are as follows:
(array ([0.51042252]), 3.4682654875517605)Copy the code
In other words, through the linear model in Sklearn, it can be solved as follows:
Take a look at the fitting diagram:
It turned out to be an invisible difference.
When we call the linear programming model in SkLearn, we do not set specific parameters and use the default parameters. It is found that the results obtained by our own model are very different from the sklearn model, and there is a certain error, because the optional parameter values of the two models are inconsistent. If the parameters are the same, the results will be closer.
conclusion
Linear programming is an easy algorithm to understand. This paper only introduces the gradient descent algorithm, the least square method is interested in students, you can study.