Linear regression is a statistical method that uses regression analysis to determine the quantitative relationship between two or more interdependent variables, and it is very widely used. Regression predicts the relationship between input variables and output variables: when the value of an input variable changes, the value of the output variable changes with it. A regression model is simply a function that maps input variables to output variables.
Unary linear regression
Unary linear models are very simple. Suppose we have a variable $x_i$ and a target $y_i$, where each $i$ corresponds to one data point, and we want to build a model of the form $\hat{y}_i = w x_i + b$.
$\hat{y}_i$ is the result of our prediction, and we hope to use $\hat{y}_i$ to fit the target $y_i$. In other words, the goal is to find this function and fit $y_i$ so that the error is minimized; the quantity to be minimized is expressed by the function:
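Written out (a sketch consistent with the mean squared error computed by `get_loss` in the code below, where $n$ is the number of data points):

$$L = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$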
$L$ is commonly known as the cost function; it is the average squared distance between the predicted value and the true value, generally called the MSE (mean squared error) in statistics. Substituting the previous function expression into the cost function, and treating the parameters $w$ and $b$ to be solved as the independent variables of the function $L$, we obtain
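That is (same assumptions as above, with $\hat{y}_i = w x_i + b$ substituted in):

$$L(w, b) = \frac{1}{n}\sum_{i=1}^{n}\left(w x_i + b - y_i\right)^2$$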
The current task is to solve for the values of $w$ and $b$ when $L$ is minimized; that is, the core optimization objective is
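In symbols (the constant factor $\frac{1}{n}$ can be dropped, since it does not change where the minimum lies):

$$(w^*, b^*) = \arg\min_{w,\,b} \sum_{i=1}^{n}\left(y_i - w x_i - b\right)^2$$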
There are two ways to solve this
1) Least squares method
Solving for $w$ and $b$ by minimizing the loss function is what statistics calls the least squares "parameter estimation" of a linear regression model. We can take the derivatives of $L(w, b)$ with respect to $w$ and $b$ and get (see "What is the nature of the least square method?" in the references)
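For reference, dropping the constant $\frac{1}{n}$ again, the two partial derivatives take the standard textbook form:

$$\frac{\partial L}{\partial w} = 2\left(w\sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n}\left(y_i - b\right)x_i\right), \qquad \frac{\partial L}{\partial b} = 2\left(n b - \sum_{i=1}^{n}\left(y_i - w x_i\right)\right)$$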
Setting the above two equations to 0, the closed-form solution for the optimal $w$ and $b$ can be obtained:
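Namely, with $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ denoting the mean of the inputs:

$$w = \frac{\sum_{i=1}^{n} y_i\left(x_i - \bar{x}\right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}, \qquad b = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w x_i\right)$$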
2) Gradient descent
Gradient descent is the first optimization algorithm we encounter. It is very simple but very powerful, and it is used heavily in deep learning, so let's start with a simple example of how gradient descent works.
The gradient
Mathematically, a gradient is a derivative; if the function has several variables, the gradient is made up of partial derivatives. So for a function $f(x, y)$, the gradient of $f$ is
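In symbols, the gradient collects the partial derivative with respect to each variable:

$$\nabla f(x, y) = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)$$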
It can be written $\operatorname{grad} f(x, y)$ or $\nabla f(x, y)$. At a specific point $(x_0, y_0)$, the gradient is $\nabla f(x_0, y_0)$.
What does the gradient mean? Geometrically, the gradient at a point is the direction in which the function changes most rapidly. Specifically, for the function $f(x, y)$ at the point $(x_0, y_0)$, the function increases fastest along the gradient $\nabla f(x_0, y_0)$; that is, moving along the gradient direction lets us find the maximum of the function faster, while moving in the direction opposite to the gradient lets us find the minimum faster.
Gradient descent method
With an understanding of gradients, we can understand the principle of gradient descent. We want to minimize the error, that is, to find the minimum point of the error, and we can do so by repeatedly moving in the direction opposite to the gradient.
Here is an intuitive explanation. Suppose we are standing somewhere on a mountain and do not know the way down, so we decide to take it one step at a time: at each position we compute the gradient at that position and take a step along the negative gradient direction, i.e. the steepest step downward from where we stand; then we compute the gradient at the new position and again take the steepest step down from there. We keep walking step by step until we feel we have reached the foot of the mountain. Of course, we may not actually reach the foot of the mountain, but only a low point beside some local peak, a local minimum.
Analogously, for our problem we keep changing the values of $w$ and $b$ along the direction opposite to the gradient, and eventually find the best set of $w$ and $b$ that minimizes the error.
When updating, we need to decide the size of each update; in the mountain analogy, this is the length of the step we take downhill each time. This step size is called the learning rate, denoted $\eta$, and it is very important: different learning rates lead to different results. A learning rate that is too small makes the descent very slow, while a learning rate that is too large makes the updates oscillate noticeably.
And finally, our update formula is going to be
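With learning rate $\eta$ (the code below uses $\eta = 10^{-2}$):

$$w \leftarrow w - \eta\,\frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta\,\frac{\partial L}{\partial b}$$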
By iterating over and over again, we eventually find an optimal set of $w$ and $b$; that is how gradient descent works.
Code
import torch
import numpy as np
from torch.autograd import Variable
import matplotlib.pyplot as plt
def fun1():
    torch.manual_seed(2017)
    x_train = np.array([[3.3], [4.4], [5.5], [6.71], [6.93], [4.168],
                        [9.779], [6.182], [7.59], [2.167], [7.042],
                        [10.791], [5.313], [7.997], [3.1]], dtype=np.float32)
    y_train = np.array([[1.7], [2.76], [2.09], [3.19], [1.694], [1.573],
                        [3.366], [2.596], [2.53], [1.221], [2.827],
                        [3.465], [1.65], [2.904], [1.3]], dtype=np.float32)
    plt.ion()
    plt.figure()
    plt.plot(x_train, y_train, 'bo')
    plt.show()
    # convert to Tensor
    x_train = torch.from_numpy(x_train)
    y_train = torch.from_numpy(y_train)
    # w = Variable(torch.randn(1), requires_grad=True)  # random initialization
    # b = Variable(torch.zeros(1), requires_grad=True)  # initialize with 0
    w = torch.randn(1, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    print('w:', w)
    print('b:', b)

    def linear_model(x):
        return x * w + b

    y_ = linear_model(x_train)
    plt.figure()
    plt.plot(x_train.data.numpy(), y_train.data.numpy(), 'bo', label='real')
    plt.plot(x_train.data.numpy(), y_.data.numpy(), 'ro', label='estimated')
    plt.legend()
    plt.show()

    # calculate the error
    def get_loss(y_, y):
        return torch.mean((y_ - y) ** 2)

    loss = get_loss(y_, y_train)
    print(loss)
    # automatic differentiation
    loss.backward()
    # look at the gradients of w and b
    print(w.grad)
    print(b.grad)
    # update the parameters once
    w.data = w.data - 1e-2 * w.grad.data
    b.data = b.data - 1e-2 * b.grad.data
    y_ = linear_model(x_train)
    plt.figure()
    plt.plot(x_train.data.numpy(), y_train.data.numpy(), 'bo', label='real')
    plt.plot(x_train.data.numpy(), y_.data.numpy(), 'ro', label='estimated')
    plt.legend()
    for e in range(100):  # 100 updates
        y_ = linear_model(x_train)
        loss = get_loss(y_, y_train)
        w.grad.zero_()  # remember to zero the gradient
        b.grad.zero_()  # remember to zero the gradient
        loss.backward()
        w.data = w.data - 1e-2 * w.grad.data  # update w
        b.data = b.data - 1e-2 * b.grad.data  # update b
        # print('epoch: {}, loss: {}, {}'.format(e, loss.item(), w.grad))
        print("epoch:{}, loss:{}, w:{}-{}, b:{}-{}".format(e, loss, w, w.grad, b, b.grad))
        # plt.figure()
        # plt.plot(x_train.data.numpy(), y_train.data.numpy(), 'bo', label='real')
        # plt.plot(x_train.data.numpy(), y_.data.numpy(), 'ro', label='estimated')
        # plt.show()
        # plt.pause(0.5)
        # plt.close()
        # input("Press Enter to Continue")
    y_ = linear_model(x_train)
    plt.figure()
    plt.plot(x_train.data.numpy(), y_train.data.numpy(), 'bo', label='real')
    plt.plot(x_train.data.numpy(), y_.data.numpy(), 'ro', label='estimated')
    plt.legend()
    print("w", w)
    print("b", b)
    print("-" * 10)


if __name__ == '__main__':
    fun1()
Execute output:
w: tensor([2.2691], requires_grad=True)
b: tensor([0.], requires_grad=True)
tensor(153.3520, grad_fn=<MeanBackward0>)
tensor([161.0043])
tensor([22.8730])
epoch:0, loss:3.135774850845337, w:tensor([0.4397], requires_grad=True)-tensor([21.9352]), b:tensor([0.2576], requires_grad=True)-tensor([2.8870])
epoch:1, loss:0.3550890386104584, w:tensor([0.4095], requires_grad=True)-tensor([3.0163]), b:tensor([0.2593], requires_grad=True)-tensor([0.1687])
epoch:2, loss:0.30295437574386597, w:tensor([0.4051], requires_grad=True)-tensor([0.4424]), b:tensor([0.2573], requires_grad=True)-tensor([0.2005])
epoch:3, loss:0.30131959915161133, w:tensor([0.4041], requires_grad=True)-tensor([0.0922]), b:tensor([0.2548], requires_grad=True)-tensor([0.2502])
...
epoch:93, loss:0.2522490322589874, w:tensor([0.3743], requires_grad=True)-tensor([0.0294]), b:tensor([-0.0477], requires_grad=True)-tensor([0.2048])
epoch:94, loss:0.2518215477466583, w:tensor([0.3740], requires_grad=True)-tensor([0.0294]), b:tensor([-0.0456], requires_grad=True)-tensor([0.2043])
epoch:95, loss:0.2513962388038635, w:tensor([0.3737], requires_grad=True)-tensor([0.0293]), b:tensor([-0.0436], requires_grad=True)-tensor([0.2037])
epoch:96, loss:0.250973105430603, w:tensor([0.3734], requires_grad=True)-tensor([0.0292]), b:tensor([-0.0416], requires_grad=True)-tensor([0.2032])
epoch:97, loss:0.250552237033844, w:tensor([0.3731], requires_grad=True)-tensor([0.0291]), b:tensor([-0.0395], requires_grad=True)-tensor([0.2027])
epoch:98, loss:0.25013336539268494, w:tensor([0.3728], requires_grad=True)-tensor([0.0291]), b:tensor([-0.0375], requires_grad=True)-tensor([0.2022])
epoch:99, loss:0.24971675872802734, w:tensor([0.3725], requires_grad=True)-tensor([0.0290]), b:tensor([-0.0355], requires_grad=True)-...
w tensor([0.3725], requires_grad=True)
b tensor([-0.0355], requires_grad=True)
- Initial data:
- First fitting:
- Second fitting:
- End result:
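As a side note, the same training loop can also be written with PyTorch's built-in optimizer instead of updating `w.data` and `b.data` by hand. The following is only a sketch, not part of the original script; it assumes the `x_train` and `y_train` tensors prepared above and uses `nn.Linear(1, 1)` in place of the explicit `w` and `b`:

import torch
from torch import nn, optim

def fit_with_optimizer(x_train, y_train):
    # same model as above: y = w * x + b, expressed as a 1-in, 1-out linear layer
    model = nn.Linear(1, 1)
    criterion = nn.MSELoss()  # same mean-squared-error loss as get_loss
    optimizer = optim.SGD(model.parameters(), lr=1e-2)  # lr plays the role of eta
    for e in range(100):
        y_ = model(x_train)
        loss = criterion(y_, y_train)
        optimizer.zero_grad()  # zero the gradients before backward
        loss.backward()
        optimizer.step()  # updates the weight and bias in place
    return model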
Reference:
- What is the nature of the least square method?
- Gradient descent algorithm principle explanation - Machine learning
- What is gradient descent?
- Summary of gradient descent