Linear regression
Least squares method
Use torch.lstsq() to solve the linear regression problem
Three preliminary formulas
- How a sample is dotted with the weights:
  $X[i,:] \cdot W = X[i,0]\,W[0] + X[i,1]\,W[1] + \dots + X[i,m-1]\,W[m-1]$
- How the squared 2-norm is computed:
  $\lVert Y - X \cdot W \rVert_2^2 = \sum_{i=0}^{n-1} \bigl(Y[i] - X[i,:] \cdot W\bigr)^2$
- The error (loss) expression:
  $\zeta(W; X, Y) = \frac{1}{n}\,\lVert Y - X \cdot W \rVert_2^2$
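To make these formulas concrete, here is a small sketch (with arbitrary example values of my own) that evaluates the loss $\zeta$ directly and checks it against PyTorch's built-in MSE loss, which computes the same quantity:

```python
import torch

X = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.]])
W = torch.tensor([1., 2., 3.])
Y = torch.tensor([6., 10., 16.])

n = X.shape[0]
loss_manual = ((Y - torch.mv(X, W)) ** 2).sum() / n            # (1/n) * ||Y - X.W||_2^2
loss_builtin = torch.nn.functional.mse_loss(torch.mv(X, W), Y) # same quantity
print(loss_manual, loss_builtin)                               # the two values agree
```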
import torch

x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]])
y = torch.tensor([-10., 12., 14., 16., 18.])
wr, _ = torch.lstsq(y, x)
w = wr[:3]
print(wr)
print(w)
tensor([[  4.6667],
        [  2.6667],
        [-12.0000],
        [ 10.0885],
        [  2.2110]])
tensor([[  4.6667],
        [  2.6667],
        [-12.0000]])
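Note that torch.lstsq has been deprecated in newer PyTorch releases and removed in the most recent ones. If your version no longer provides it, torch.linalg.lstsq gives the same least-squares solution; a small sketch (note the argument order A, B and the 2-D right-hand side):

```python
import torch

x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]])
y = torch.tensor([-10., 12., 14., 16., 18.])
w = torch.linalg.lstsq(x, y.unsqueeze(1)).solution   # shape (3, 1)
print(w)   # approximately [[4.6667], [2.6667], [-12.0000]]
```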
Several loss functions
MSE loss function
The result is the mean of the squared differences between the target values and the predicted values
- Advantages: the function is smooth and differentiable everywhere, which makes it easy to take derivatives, and it has a relatively stable solution
- Disadvantages: not robust to outliers; when an input is far from the target value, solving with gradient descent may lead to exploding gradients
- The corresponding class in PyTorch
torch.nn.MSELoss
MAE loss function
The result is the mean of the absolute differences between the target values and the predicted values
- Advantages: the gradient is stable for any input value, so it does not cause exploding gradients, and the solution is more robust
- Disadvantages: the function has a kink at the center point where it is not differentiable, which makes it less convenient to solve
L1 loss function
L1 norm loss function, also known as least absolute deviations (LAD) or least absolute error (LAE). In general, it minimizes the sum $S$ of the absolute differences between the target values $Y_i$ and the estimated values $f(x_i)$: $S = \sum_{i=1}^{n} \lvert Y_i - f(x_i) \rvert$
- The corresponding class in PyTorch
torch.nn.L1Loss
L2 loss function
L2 norm loss function, also known as least squares error (LSE). In general, it minimizes the sum $S$ of the squared differences between the target values $Y_i$ and the estimated values $f(x_i)$: $S = \sum_{i=1}^{n} \bigl(Y_i - f(x_i)\bigr)^2$
The advantages and disadvantages of the L1 and L2 loss functions are the same as those of the MAE and MSE loss functions, respectively
Smooth L1 loss function
If we could remove the kink of the L1 loss function and make it differentiable everywhere, we would get a better-behaved loss. This leads to the Smooth L1 loss function: $\mathrm{SmoothL1}(x) = 0.5x^2$ for $\lvert x \rvert < 1$ and $\lvert x \rvert - 0.5$ otherwise, where $x$ is the difference between the prediction and the target.
- Advantages: this is a piecewise function. On $[-1, 1]$ it is effectively the L2 loss, which fixes the non-smoothness of L1; outside $[-1, 1]$ it is effectively the L1 loss, which fixes the exploding gradients caused by outliers. Smooth L1 therefore combines the advantages of L1 and L2: stable gradients and fast convergence in the early stage of training (L1-like), and smooth convergence to the optimum in the later stage (L2-like). A quick numeric comparison of MSELoss, L1Loss, and SmoothL1Loss is shown after the class reference below.
- The corresponding class in PyTorch
torch.nn.SmoothL1Loss
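As a quick comparison (the input values here are my own illustration), note how the outlier-like entry -3.0 inflates the MSE far more than the L1 or Smooth L1 losses:

```python
import torch

pred = torch.tensor([0.5, 2.0, -3.0])
target = torch.zeros(3)

for loss_fn in (torch.nn.MSELoss(), torch.nn.L1Loss(), torch.nn.SmoothL1Loss()):
    # For these inputs: MSELoss ~ 4.42, L1Loss ~ 1.83, SmoothL1Loss ~ 1.375
    print(type(loss_fn).__name__, loss_fn(pred, target).item())
```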
An example of calling the MSE loss function
# instantiate this class
criterion = torch.nn.MSELoss()
pred = torch.arange(5, dtype=torch.float32, requires_grad=True)
y = torch.ones(5)
loss = criterion(pred, y)
print(loss)
loss.backward()
# print(loss.grad)
Output:
tensor(3., grad_fn=<MseLossBackward0>)
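A note on the commented-out print(loss.grad): loss is not a leaf tensor, so its .grad stays None; after loss.backward() the gradient accumulates on the leaf tensor pred instead, and for the MSE loss it equals 2 * (pred - y) / n. A small check with the same inputs as above:

```python
import torch

criterion = torch.nn.MSELoss()
pred = torch.arange(5, dtype=torch.float32, requires_grad=True)   # [0., 1., 2., 3., 4.]
y = torch.ones(5)
loss = criterion(pred, y)    # mean((pred - y) ** 2) = 3
loss.backward()
print(pred.grad)             # 2 * (pred - y) / 5 = tensor([-0.4000, 0.0000, 0.4000, 0.8000, 1.2000])
```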
Use an optimizer to solve linear regression
Whatever the loss function is, we can always use gradient descent to find weights W that minimize the loss. To do so we first compute the loss, then the gradient of the loss, and then update the weights W. Even with the simplest MSE loss, when there is too much data to load into memory at once, we can use stochastic (mini-batch) gradient descent and work on only part of the data in each iteration; a minimal sketch of that idea follows this paragraph. The full example after the sketch reproduces the result of the previous example. This approach is more laborious and takes more time, so use torch.lstsq() whenever you can, and fall back to an optimizer only when torch.lstsq() is not applicable (for example, when the loss is not the MSE loss, or the data cannot all be loaded into memory at once).
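A minimal mini-batch sketch of that idea on synthetic data (the data, batch size, and learning rate here are illustrative assumptions, not from the text):

```python
import torch

n, batch_size = 1000, 32
x = torch.randn(n, 3)
true_w = torch.tensor([4.6667, 2.6667, -12.0])
y = torch.mv(x, true_w) + 0.1 * torch.randn(n)     # noisy synthetic targets

w = torch.zeros(3, requires_grad=True)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD([w], lr=0.01)

for step in range(2000):
    idx = torch.randint(0, n, (batch_size,))       # use only a random subset per step
    pred = torch.mv(x[idx], w)
    loss = criterion(pred, y[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(w)   # should end up close to true_w
```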
import torch
import torch.nn
import torch.optim

x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]], device='cuda')
y = torch.tensor([-10., 12., 14., 16., 18.], device='cuda')
w = torch.zeros(3, requires_grad=True, device='cuda')
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam([w])
for step in range(30001):
    if step:
        optimizer.zero_grad()  # reset the gradients
        loss.backward()        # compute the gradients
        optimizer.step()       # update the weights according to the gradients
    pred = torch.mv(x, w)      # matrix-vector multiplication
    loss = criterion(pred, y)
    if step % 5000 == 0:
        print('step = {} loss = {:g} W = {}'.format(step, loss, w.tolist()))
Output:
step = 0 loss = 204 W = [0.0, 0.0, 0.0]
step = 5000 loss = 40.8731 W = [2.3051974773406982, 1.712536334991455, -0.6180324554443359]
step = 10000 loss = 27.9001 W = [3.6783804893493652, 1.7130744457244873, -5.2205023765563965]
step = 15000 loss = 22.31 W = [4.292291641235352, 2.293663263320923, -9.385353088378906]
step = 20000 loss = 21.3341 W = [4.655962944030762, 2.6559813022613525, -11.925154685974121]
step = 25000 loss = 21.3333 W = [4.666664123535156, 2.666663885116577, ...]
step = 30000 loss = 21.3333 W = [4.666667938232422, 2.666668176651001, -11.999998092651367]
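The weights converge to essentially the same solution that torch.lstsq() returned earlier. As a small sanity check (using the rounded values from the output above), plugging them back in reproduces the reported loss of about 21.33:

```python
import torch

x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]])
y = torch.tensor([-10., 12., 14., 16., 18.])
w = torch.tensor([4.6667, 2.6667, -12.0])                  # converged weights from above
print(torch.nn.functional.mse_loss(torch.mv(x, w), y))     # roughly 21.33
```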
Use torch.nn.Linear() to implement linear regression
import torch
import torch.nn
import torch.optim

x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]])
y = torch.tensor([-10., 12., 14., 16., 18.])
fc = torch.nn.Linear(3, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(fc.parameters())
weights, bias = fc.parameters()
fc(x)
for step in range(30001):
    if step:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    pred = fc(x)
    loss = criterion(pred, y)
    if step % 5000 == 0:
        print('step = {} loss = {:g} W = {}, bias = {}'.format(step, loss, weights[0, :].tolist(), bias.item()))
Output:
<generator object Module.parameters at 0x000001ED118B8270>
step = 5000 loss = 106.462 W = [0.4140699803829193, 0.7813165187835693, 2.938326358795166], bias = 2.9747958183288574
step = 10000 loss = 104 W = [0.007105899043381214, 0.007294247858226299, 4.956961631774902], bias = 4.993431568145752
step = 15000 loss = 104 W = [2.2107651602709666e-06, ...], bias = 5.018227577209473
step = 20000 loss = 104 W = [2.7108444555778997e-07, ..., 4.981764793395996], bias = 5.018234729766846
step = 25000 loss = 104 W = [-4.070022259838879e-05, -4.075446486240253e-05, 4.981725215911865], bias = 5.018195152282715
step = 30000 loss = 104 W = [1.3781600500806235e-06, 1.4800637018197449e-06, 4.981767177581787], bias = 5.018237113952637
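Two notes on this example, as I read it. First, fc(x) simply computes x @ fc.weight.t() + fc.bias. Second, fc(x) has shape (5, 1) while y has shape (5,), so torch.nn.MSELoss broadcasts the two into a 5x5 grid of pairwise differences; the model then ends up fitting the mean of y (which is 10: note W ≈ [0, 0, 4.98] and bias ≈ 5.02, giving predictions near 10), and the loss settles at 104 instead of the 21.33 obtained earlier. A sketch of the check and of the shape fix:

```python
import torch

fc = torch.nn.Linear(3, 1)
x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]])

# What fc(x) computes under the hood:
print(torch.allclose(fc(x), x @ fc.weight.t() + fc.bias))   # True

# Matching shapes avoids the MSELoss broadcasting described above; with this y
# the training loop above should reach a loss of about 21.33 instead of 104.
y = torch.tensor([-10., 12., 14., 16., 18.]).reshape(-1, 1)
```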
Normalization of data
Why normalization?
In some linear regression problems, the range of a feature can differ greatly from the range of the labels, or from the range of another feature. Some weights then have to be particularly large, which makes it difficult for the optimizer to learn them.
How to normalize?
Normalize a feature $A$ with mean $\mathrm{mean}(A)$ and standard deviation $\mathrm{std}(A)$: $A_{norm} = \dfrac{A - \mathrm{mean}(A)}{\mathrm{std}(A)}$
What are the features of the normalized data?
The mean of normalized data is 0 and the variance is 1
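A quick check of this property (the values below echo the first feature used in the example that follows):

```python
import torch

a = torch.tensor([1000000., 2000000., 3000000., 4000000., 5000000.])
a_norm = (a - a.mean()) / a.std()
print(a_norm)                                      # tensor([-1.2649, -0.6325,  0.0000,  0.6325,  1.2649])
print(a_norm.mean().item(), a_norm.std().item())   # approximately 0 and 1
```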
Code examples:
- Code without normalization:
import torch
import torch.nn
import torch.optim

x = torch.tensor([[1000000., 0.0001], [2000000., 0.0003], [3000000., 0.0005],
                  [4000000., 0.0002], [5000000., 0.0004]], device='cuda')
y = torch.tensor([-1000., 1200., 1400., 1600., 1800.], device='cuda').reshape(-1, 1)
fc = torch.nn.Linear(2, 1)
fc = fc.cuda()
# Get the result computed with the current (random initial) weights
pred = fc(x)
print(pred)
criterion = torch.nn.MSELoss()
criterion = criterion.cuda()
optimizer = torch.optim.Adam(fc.parameters())
for step in range(100001):
    if step:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    pred = fc(x)
    loss = criterion(pred, y)
    if step % 10000 == 0:
        print('step = {}, loss = {:g}'.format(step, loss))
Output:
tensor([[ 580872.8750],
        [1161746.1250],
        [1742619.3750],
        [2323492.5000],
        [2904365.7500]], device='cuda:0', grad_fn=<AddmmBackward0>)
step = 0, loss = 3.70667e+12
step = 10000, loss = 436096
step = 20000, loss = 435005
step = 30000, loss = 432516
step = 40000, loss = 430062
step = 50000, loss = 427641
step = 60000, loss = 425254
step = 70000, loss = 432383
step = 80000, loss = 420584
step = 90000, loss = 418410
step = 100000, loss = 416046
Convergence is clearly far too slow: the loss is still very high even after 100,000 iterations. Next, the data is normalized.
import torch
import torch.nn
import torch.optim

x = torch.tensor([[1000000., 0.0001], [2000000., 0.0003], [3000000., 0.0005],
                  [4000000., 0.0002], [5000000., 0.0004]])
y = torch.tensor([-1000., 1200., 1400., 1600., 1800.]).reshape(-1, 1)
x_mean, x_std = torch.mean(x, dim=0), torch.std(x, dim=0)
x_norm = (x - x_mean) / x_std
y_mean, y_std = torch.mean(y, dim=0), torch.std(y, dim=0)
y_norm = (y - y_mean) / y_std
fc = torch.nn.Linear(2, 1)
# Get the result computed with the current (random initial) weights
pred = fc(x)
print(pred)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(fc.parameters())
for step in range(10001):
    if step:
        optimizer.zero_grad()
        loss_norm.backward()
        optimizer.step()
    pred_norm = fc(x_norm)
    loss_norm = criterion(pred_norm, y_norm)
    # Map the prediction back to the original scale
    pred = pred_norm * y_std + y_mean
    loss = criterion(pred, y)
    if step % 1000 == 0:
        print('step = {}, loss = {:g}'.format(step, loss))
Output:
tensor([[ -599029.2500],
        [-1198058.6250],
        [-1797088.0000],
        [-2396117.5000],
        [-2995146.7500]], grad_fn=<AddmmBackward0>)
step = 0, loss = 4.38259e+06
step = 1000, loss = 654194
step = 2000, loss = 224888
step = 3000, loss = 213705
step = 4000, loss = 213341
step = 5000, loss = 213333
step = 6000, loss = 213333
step = 7000, loss = 213333
step = 8000, loss = 213333
step = 9000, loss = 213333
step = 10000, loss = 213333
Hands-on practice
Linear regression on world population with the least squares method
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
# The two lines above prevent a duplicate OpenMP library warning/error on some machines
import torch
import pandas as pd

url = "https://zh.wikipedia.org/wiki/%E4%B8%96%E7%95%8C%E4%BA%BA%E5%8F%A3"
# Get the data from Wikipedia
df = pd.read_html(url, header=0, attrs={"class": "wikitable"}, encoding="utf8")[0]
# print(df)
world_populations = df.copy().iloc[18:31, [0, 1]]
# If you cannot access Wikipedia, download the data from https://oss.xuziao.cn/blogdata/%E6%B5%8B%E8%AF%95%E6%95%B0%E6%8D%AE.csv
# world_populations.to_csv('test_data.csv')
# Convert the year column into a tensor
years = torch.tensor(world_populations.iloc[:, 0].values.astype(float), dtype=torch.float32)
# Convert the population column into a tensor
populations = torch.tensor(world_populations.iloc[:, 1].values.astype(float), dtype=torch.float32)
# Build [[year, 1], [year, 1], ...]; the matrix product then gives w1 * year + w2 * 1
x = torch.stack([years, torch.ones_like(years)], 1)
y = populations
# Use least squares
wr, _ = torch.lstsq(y, x)
# print(wr)
# Take the first two entries (i.e. w1, w2)
slope, intercept = wr[:2, 0]
result = 'population = {:.2e}*year {:.2e}'.format(slope, intercept)
print('Regression result: ' + result)

# Plot
import matplotlib.pyplot as plt
plt.scatter(years, populations, s=7, c='blue', marker='o')
estimates = [slope * yr + intercept for yr in years]
plt.plot(years, estimates, c='red')
plt.xlabel('Year')
plt.ylabel('Population')
plt.show()
Output:
population = 7.43e+03*year -1.43e+07
As can be seen from the resulting plot, the fit is quite good. In the next section, the Adam optimizer is used to perform the same linear regression.
Linear regression using the Adam optimizer
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
# The two lines above prevent a duplicate OpenMP library warning/error on some machines
import pandas as pd
import torch

url = "https://zh.wikipedia.org/wiki/%E4%B8%96%E7%95%8C%E4%BA%BA%E5%8F%A3"
# Get the data from Wikipedia
df = pd.read_html(url, header=0, attrs={"class": "wikitable"}, encoding="utf8")[0]
# print(df)
world_populations = df.copy().iloc[18:31, [0, 1]]
# If you cannot access Wikipedia, download the data from https://oss.xuziao.cn/blogdata/%E6%B5%8B%E8%AF%95%E6%95%B0%E6%8D%AE.csv
# world_populations.to_csv('test_data.csv')
# Convert the year column into a tensor
years = torch.tensor(world_populations.iloc[:, 0].values.astype(float), dtype=torch.float32)
# Convert the population column into a tensor
populations = torch.tensor(world_populations.iloc[:, 1].values.astype(float), dtype=torch.float32)
# The code above is the same as in the previous section

import torch.nn
import torch.optim

x = years.reshape(-1, 1)
# print(x)
y = populations
# Normalize the data: the magnitudes differ greatly, and normalizing lets the loss decrease quickly
x_mean, x_std = torch.mean(x, dim=0), torch.std(x, dim=0)
x_norm = (x - x_mean) / x_std
y_mean, y_std = torch.mean(y, dim=0), torch.std(y, dim=0)
y_norm = (y - y_mean) / y_std
# One input and one output; the weight is a randomly initialized 1x1 matrix
fc = torch.nn.Linear(1, 1)
# MSE loss function
criterion = torch.nn.MSELoss()
# Create the optimizer
optimizer = torch.optim.Adam(fc.parameters())
# Shallow copy? (these are references to fc's parameters, so they track updates)
weights_norm, bias_norm = fc.parameters()
for step in range(6001):
    if step:
        # Clear the gradients
        fc.zero_grad()
        # Compute the gradients
        loss_norm.backward()
        # Update the weights (attributes of fc) according to the gradients
        optimizer.step()
    # Get the output (anything with a _norm suffix here is in normalized units)
    output_norm = fc(x_norm)
    # Remove the dimensions of size one
    pred_norm = output_norm.squeeze()
    # Compute the loss with the MSE loss function
    loss_norm = criterion(pred_norm, y_norm)
    # Recover the original-scale weight from the normalized weight
    # (the derivation of this and the next formula is shown after the output below)
    weights = y_std / x_std * weights_norm
    # Recover the original-scale bias from the normalized bias
    bias = (weights_norm * (0 - x_mean) / x_std + bias_norm) * y_std + y_mean
    if step % 1000 == 0:
        print('step {}: weight = {}, bias = {}'.format(step, weights.item(), bias.item()))

# Plot
import matplotlib.pyplot as plt
plt.scatter(years, populations, s=7, c='blue', marker='o')
estimates = [weights.item() * yr + bias.item() for yr in years]
plt.plot(years, estimates, c='red')
plt.xlabel('Year')
plt.ylabel('Population')
plt.show()
Output:
step 0: weight = -4349.91064453125, bias = 9026279.0
step 1000: weight = 1948.0953369140625, bias = -3404077.75
step 2000: weight = 5750.35400390625, bias = -10932547.0
step 3000: weight = 7200.87255859375, bias = -13804574.0
step 4000: weight = 7425.09765625, bias = -14248540.0
step 5000: weight = 7432.94873046875, bias = -14264084.0
step 6000: weight = 7432.95751953125, bias = -14264102.0
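For reference, here is how the two un-normalization formulas in the loop above can be derived, by substituting the normalization definitions back into the fitted linear model (my own restatement of what the code computes):

$$
y = y_{norm}\,\mathrm{std}(y) + \mathrm{mean}(y)
  = \Bigl(w_{norm}\,\tfrac{x - \mathrm{mean}(x)}{\mathrm{std}(x)} + b_{norm}\Bigr)\mathrm{std}(y) + \mathrm{mean}(y)
  = \underbrace{\tfrac{\mathrm{std}(y)}{\mathrm{std}(x)}\,w_{norm}}_{\text{weights}}\,x
  + \underbrace{\Bigl(w_{norm}\,\tfrac{0 - \mathrm{mean}(x)}{\mathrm{std}(x)} + b_{norm}\Bigr)\mathrm{std}(y) + \mathrm{mean}(y)}_{\text{bias}}
$$

which is exactly `weights = y_std / x_std * weights_norm` and `bias = (weights_norm * (0 - x_mean) / x_std + bias_norm) * y_std + y_mean` in the code.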
Appendix: tensor construction functions
| Function | What the elements are |
|---|---|
| torch.tensor() | The data that is passed in |
| torch.zeros(), torch.zeros_like() | All elements are 0 |
| torch.ones(), torch.ones_like() | All elements are 1 |
| torch.full(), torch.full_like() | All elements are a specified value |
| torch.empty(), torch.empty_like() | Element values are unspecified (uninitialized) |
| torch.eye() | 1 on the main diagonal, 0 elsewhere |
| torch.arange(), torch.range(), torch.linspace() | Elements form an arithmetic (equally spaced) sequence |
| torch.logspace() | Elements form a geometric sequence |
| torch.rand(), torch.rand_like() | Each element independently follows the standard uniform distribution |
| torch.randn(), torch.randn_like(), torch.normal() | Each element independently follows the standard normal distribution |
| torch.randint(), torch.randint_like() | Each element independently follows a discrete uniform distribution |
| torch.bernoulli() | Two-point (Bernoulli) distribution on {0, 1} |
| torch.multinomial() | Values drawn from {0, 1, ..., n-1} |
| torch.randperm() | A random permutation of 0, 1, ..., n-1 |
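A few of these constructors in action (a quick illustrative sketch):

```python
import torch

print(torch.zeros(2, 3))             # 2x3 tensor of zeros
print(torch.full((2, 2), 7.))        # all elements equal to 7
print(torch.eye(3))                  # identity matrix
print(torch.linspace(0, 1, steps=5)) # 5 equally spaced values from 0 to 1
print(torch.rand(2, 2))              # uniform on [0, 1)
print(torch.randperm(5))             # random permutation of 0..4
```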