Linear regression

Least square method

Use torch.lstsq() to solve the linear regression problem

A few important formulas

  1. Dot product with the weights: X[i, :] ⋅ w = X[i, 0] w[0] + X[i, 1] w[1] + … + X[i, m] w[m]

  2. Squared 2-norm: ||Y - X ⋅ w||_2^2 = \sum_{i=0}^{n-1} (Y[i] - X[i, :] ⋅ w)^2

  3. Loss expression: \zeta(w; X, Y) = {1 \over n} ||Y - X ⋅ w||_2^2
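To make the notation concrete, here is a minimal sketch (the tiny x, y, w values are made up purely for illustration) that evaluates the three formulas directly:

import torch

# made-up toy data: 2 samples, 3 features (the last feature is the constant 1)
x = torch.tensor([[1., 2., 1.], [3., 4., 1.]])
y = torch.tensor([5., 6.])
w = torch.tensor([1., 1., 0.5])

pred = torch.mv(x, w)                 # pred[i] = x[i, :] . w
sq_norm = torch.sum((y - pred) ** 2)  # ||y - x . w||_2^2
loss = sq_norm / len(y)               # zeta(w; x, y)
print(pred, sq_norm, loss)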

import torch
x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]])
y = torch.tensor([-10., 12., 14., 16., 18.])
wr, _ = torch.lstsq(y, x)
w = wr[:3]
print(wr)
print(w)
tensor([[  4.6667], [  2.6667], [-12.0000], [ 10.0885], [  2.2110]])
tensor([[  4.6667], [  2.6667], [-12.0000]])
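Note that torch.lstsq has been deprecated and removed in recent PyTorch releases; if the call above is not available in your version, torch.linalg.lstsq offers the same functionality (note the swapped argument order and that the solution comes back as a named field):

# equivalent call for newer PyTorch versions, reusing x and y from above
w = torch.linalg.lstsq(x, y.reshape(-1, 1)).solution
print(w)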

Several loss functions

MSE loss function

The result is the mean of the squares of the differences between the target value and the predicted value


MSE = {1 \over n} \sum_{i=1}^n (y_i - y_i^p)^2

  • Advantages: continuous and smooth everywhere, easy to differentiate, and gives a relatively stable solution
  • Disadvantages: not robust enough; when the input is far from the center value, solving with gradient descent may lead to gradient explosion
  • The corresponding class in PyTorch is torch.nn.MSELoss

MAE loss function

The formula is the mean of the absolute values of the differences between the target value and the predicted value


MAE = {1 \over n} \sum_{i=1}^n |y_i - y_i^p|

  • Advantages: has a stable gradient for any input value, does not cause the gradient explosion problem, and gives a more robust solution
  • Disadvantages: the center point is a fold point where the function cannot be differentiated, which makes it inconvenient to solve

L1 loss function

L1 norm loss function, also known as least absolute deviations (LAD) or least absolute errors (LAE). In general, it minimizes the sum S of the absolute differences between the target values Yi and the estimated values f(xi):


L1 = \sum_{i=1}^n |Y_i - f(x_i)|

  • The corresponding class in PyTorch is torch.nn.L1Loss
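A minimal usage sketch (the numbers are arbitrary); note that torch.nn.L1Loss averages over the elements by default, so pass reduction='sum' if you want the plain sum from the formula above:

import torch

pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.5, 2.0, 1.0])

print(torch.nn.L1Loss(reduction='sum')(pred, target))  # |-0.5| + |0| + |2| = 2.5
print(torch.nn.L1Loss()(pred, target))                 # mean: 2.5 / 3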

L2 loss function

L2 norm loss function, also known as least squares error (LSE). In general, it minimizes the sum of squares S of the differences between the target values Yi and the estimated values f(xi):


L2 = \sum_{i=1}^n (Y_i - f(x_i))^2

The advantages and disadvantages of the L1 and L2 loss functions correspond to those of the MAE and MSE loss functions, respectively.

Smooth L1 loss function

If we could remove the fold point of the L1 loss function so that it becomes differentiable everywhere, we would get the smooth L1 loss function:

smooth_{L1}(x) = \begin{cases} 0.5 x^2 & |x| < 1 \\ |x| - 0.5 & |x| \geq 1 \end{cases}
  • Advantages: this is a piecewise function; on [-1, 1] it is actually the L2 loss, which solves the problem of L1 not being smooth, and outside [-1, 1] it is actually the L1 loss, which solves the problem of outliers causing gradient explosion. Smooth L1 loss therefore combines the advantages of L1 and L2: L1 provides stable gradients and fast convergence in the early stage, and L2 allows gradual convergence to the optimal solution in the later stage.
  • The corresponding class in PyTorch is torch.nn.SmoothL1Loss
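A minimal sketch (the test values are arbitrary) checking torch.nn.SmoothL1Loss against the piecewise formula above:

import torch

loss_fn = torch.nn.SmoothL1Loss()
target = torch.zeros(1)

for v in (0.5, 3.0):
    pred = torch.tensor([v])
    expected = 0.5 * v ** 2 if abs(v) < 1 else abs(v) - 0.5
    print(loss_fn(pred, target).item(), expected)  # the two values should match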
An example of calling the MSE loss function

# instantiate this class
criterion = torch.nn.MSELoss()
pred = torch.arange(5, dtype=torch.float32, requires_grad=True)
y = torch.ones(5)
loss = criterion(pred, y)
print(loss)
loss.backward()
# print(loss.grad)

Output:

tensor(3., grad_fn=<MseLossBackward0>)
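The value 3 can be checked by hand: pred = [0, 1, 2, 3, 4] and y = [1, 1, 1, 1, 1], so the squared differences are [1, 0, 1, 4, 9] and their mean is 15 / 5 = 3. A one-off sketch of that check:

import torch

pred = torch.arange(5, dtype=torch.float32)
y = torch.ones(5)
print(((pred - y) ** 2).mean())  # tensor(3.)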

Use optimizer to solve linear regression

Whatever the loss function, we can always use gradient descent to find weights W that minimize the loss. With this method we first compute the loss, then the gradient of the loss, and then update the weights W. Even with the simplest MSE loss, when there is too much data to load into memory at once, we can use stochastic gradient descent and select only part of the data at each iteration. The following example produces the same result as the previous one, but this approach takes more effort and more time, so if you can use torch.lstsq(), use torch.lstsq(). Only when torch.lstsq() really cannot be used (for example, the loss is not the MSE loss, or there is too much data to load into memory at once) should we fall back to this method.

import torch
import torch.nn
import torch.optim

x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]], device='cuda')
y = torch.tensor([-10., 12., 14., 16., 18.], device='cuda')
w = torch.zeros(3, requires_grad=True, device='cuda')

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam([w])

for step in range(30001):
    if step:
        optimizer.zero_grad()  # reset the gradients
        loss.backward()        # compute the gradient
        optimizer.step()       # update the weights according to the gradient

    pred = torch.mv(x, w)  # matrix-vector multiplication
    loss = criterion(pred, y)
    if step % 5000 == 0:
        print('step = {} loss = {:g} W = {}'.format(step, loss, w.tolist()))

Output:

step = 0 loss = 204 W = [0.0, 0.0, 0.0]
step = 5000 loss = 40.8731 W = [2.3051974773406982, 1.712536334991455, -0.6180324554443359]
step = 10000 loss = 27.9001 W = [3.6783804893493652, 1.7130744457244873, -5.2205023765563965]
step = 15000 loss = 22.31 W = [4.292291641235352, 2.293663263320923, -9.385353088378906]
step = 20000 loss = 21.3341 W = [4.655962944030762, 2.6559813022613525, -11.925154685974121]
step = 25000 loss = 21.3333 W = [4.666664123535156, 2.666663885116577, ...]
step = 30000 loss = 21.3333 W = [4.666667938232422, 2.666668176651001, -11.999998092651367]
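The paragraph before this example mentions selecting only part of the data at each iteration when the whole dataset does not fit into memory. A minimal sketch of that mini-batch idea (the use of torch.optim.SGD, the batch size of 2, the learning rate and the number of steps are illustrative assumptions, not part of the original example):

import torch
import torch.nn
import torch.optim

x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]])
y = torch.tensor([-10., 12., 14., 16., 18.])
w = torch.zeros(3, requires_grad=True)

criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD([w], lr=0.01)

for step in range(20000):
    idx = torch.randint(0, len(y), (2,))  # draw a random mini-batch of 2 samples
    pred = torch.mv(x[idx], w)
    loss = criterion(pred, y[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(w.tolist())  # a noisy estimate near the least-squares solution; a decaying learning rate would be needed to converge exactly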

Use torch.nn.Linear() for the implementation

import torch
import torch.nn
import torch.optim

x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]])
y = torch.tensor([-10., 12., 14., 16., 18.])

fc = torch.nn.Linear(3, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(fc.parameters())

weights, bias = fc.parameters()
fc(x)
for step in range(30001):
    if step:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    pred = fc(x)
    loss = criterion(pred, y)
    if step % 5000 == 0:
        print('step = {} loss = {:g} W = {}, bias = {}'.format(step, loss, weights[0, :].tolist(), bias.item()))

Output:

<generator object Module.parameters at 0x000001ED118B8270>
step = 5000 loss = 106.462 W = [0.4140699803829193, 0.7813165187835693, 2.938326358795166], bias = 2.9747958183288574
step = 10000 loss = 104 W = [0.007105899043381214, 0.007294247858226299, 4.956961631774902], bias = 4.993431568145752
step = 15000 loss = 104 W = [2.2107651602709666e-06, ...], bias = 5.018227577209473
step = 20000 loss = 104 W = [2.7108444555778997e-07, 2.585106244623603e-07, 4.981764793395996], bias = 5.018234729766846
step = 25000 loss = 104 W = [-4.070022259838879e-05, -4.075446486240253e-05, 4.981725215911865], bias = 5.018195152282715
step = 30000 loss = 104 W = [1.3781600500806235e-06, 1.4800637018197449e-06, 4.981767177581787], bias = 5.018237113952637
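Note that the loss converges to 104 here instead of 21.3333 as in the previous example. The likely cause is a shape mismatch: fc(x) has shape (5, 1) while y has shape (5,), so torch.nn.MSELoss broadcasts the two tensors to a 5x5 matrix (and prints a warning), which changes the problem being minimized. A one-line fix is to make the target the same shape as the prediction:

y = y.reshape(-1, 1)  # target shape (5, 1) now matches the prediction shape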

Normalization of data

Why normalization?

In some linear regression problems, the range of values of a feature differs greatly from the range of the labels, or from that of another feature. In this case some weights have to be particularly large or particularly small, which makes it hard for the optimizer to learn them.

How to normalize?

For a feature A with mean mean(A) and standard deviation std(A), the normalization is A_{norm} = {A - mean(A) \over std(A)}

What are the features of the normalized data?

The mean of normalized data is 0 and the variance is 1
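A quick sketch checking this property on the feature matrix used in the examples below:

import torch

x = torch.tensor([[1000000., 0.0001], [2000000., 0.0003], [3000000., 0.0005],
                  [4000000., 0.0002], [5000000., 0.0004]])
x_norm = (x - torch.mean(x, dim=0)) / torch.std(x, dim=0)
print(torch.mean(x_norm, dim=0))  # approximately 0 for each column
print(torch.std(x_norm, dim=0))   # 1 for each column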

Code examples:

  • Code without normalization:
import torch
import torch.nn
import torch.optim

x = torch.tensor([[1000000., 0.0001], [2000000., 0.0003], [3000000., 0.0005], [4000000., 0.0002], [5000000., 0.0004]], device="cuda")
y = torch.tensor([-1000., 1200., 1400., 1600., 1800.], device='cuda').reshape(-1, 1)

fc = torch.nn.Linear(2, 1)
fc = fc.cuda()
# Get the result of the current weight calculation
pred = fc(x)
print(pred)
criterion = torch.nn.MSELoss()
criterion = criterion.cuda()
optimizer = torch.optim.Adam(fc.parameters())

for step in range(100001):
    if step:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    pred = fc(x)
    loss = criterion(pred, y)
    if step % 10000 == 0:
        print('step = {}, loss = {:g}'.format(step, loss))

Output:

tensor([[ 580872.8750], [1161746.1250], [1742619.3750], [2323492.5000], [2904365.7500]], device='cuda:0', grad_fn=<AddmmBackward0>)
step = 0, loss = 3.70667e+12
step = 10000, loss = 436096
step = 20000, loss = 435005
step = 30000, loss = 432516
step = 40000, loss = 430062
step = 50000, loss = 427641
step = 60000, loss = 425254
step = 70000, loss = 432383
step = 80000, loss = 420584
step = 90000, loss = 418410
step = 100000, loss = 416046

We can see that convergence is far too slow: the loss is still very high even after 100,000 iterations, so below we normalize the data.

import torch
import torch.nn
import torch.optim

x = torch.tensor([[1000000., 0.0001], [2000000., 0.0003], [3000000., 0.0005], [4000000., 0.0002], [5000000., 0.0004]])
y = torch.tensor([-1000., 1200., 1400., 1600., 1800.]).reshape(-1, 1)

x_mean, x_std = torch.mean(x, dim=0), torch.std(x, dim=0)
x_norm = (x - x_mean) / x_std

y_mean, y_std = torch.mean(y, dim=0), torch.std(y, dim=0)
y_norm = (y - y_mean) / y_std

fc = torch.nn.Linear(2, 1)
# Get the result of the current weight calculation
pred = fc(x)
print(pred)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(fc.parameters())

for step in range(10001):
    if step:
        optimizer.zero_grad()
        loss_norm.backward()
        optimizer.step()
    pred_norm = fc(x_norm)
    loss_norm = criterion(pred_norm, y_norm)
    # Convert the normalized prediction back to the original scale
    pred = pred_norm * y_std + y_mean
    loss = criterion(pred, y)
    if step % 1000 == 0:
        print('step = {}, loss = {:g}'.format(step, loss))

Output:

tensor([[ -599029.2500], [-1198058.6250], [-1797088.0000], [-2396117.5000], [-2995146.7500]], grad_fn=<AddmmBackward0>)
step = 0, loss = 4.38259e+06
step = 1000, loss = 654194
step = 2000, loss = 224888
step = 3000, loss = 213705
step = 4000, loss = 213341
step = 5000, loss = 213333
step = 6000, loss = 213333
step = 7000, loss = 213333
step = 8000, loss = 213333
step = 9000, loss = 213333
step = 10000, loss = 213333

Hands-on practice

Linear regression of the world population using the least squares method

import os
os.environ["KMP_DUPLICATE_LIB_OK"]  =  "TRUE"
# Don't worry about the two lines above; they just prevent a possible warning
import torch
import pandas as pd
url = "https://zh.wikipedia.org/wiki/%E4%B8%96%E7%95%8C%E4%BA%BA%E5%8F%A3"
# Get data from Wikipedia
df = pd.read_html(url, header=0, attrs={"class": "wikitable"}, encoding="utf8") [0]
# print(df)
world_populations = df.copy().iloc[18:31, [0, 1]]

# If you cannot access Wikipedia, download the data from https://oss.xuziao.cn/blogdata/%E6%B5%8B%E8%AF%95%E6%95%B0%E6%8D%AE.csv
# world_populations.to_csv('test data.csv')

# convert the columns corresponding to years to tensors
years = torch.tensor(world_populations.iloc[:, 0].values.astype(float), dtype=torch.float32)
# Convert the columns corresponding to the population into tensors
populations = torch.tensor(world_populations.iloc[:, 1].values.astype(float), dtype=torch.float32)

# Build [[year, 1], [year, 1], ...] so that matrix multiplication gives w1 * year + w2 * 1
x = torch.stack([years, torch.ones_like(years)], 1)

y = populations

# Use least squares
wr, _ = torch.lstsq(y, x)
# print(wr)
# Get the first two entries (i.e. w1, w2)
slope, intercept = wr[:2, 0]
result = 'population = {:.2e}*year {:.2e}'.format(slope, intercept)
print('Regression result:'+result)

# drawing
import matplotlib.pyplot as plt
plt.scatter(years, populations, s = 7, c='blue', marker='o')
estimates = [slope * yr + intercept for yr in years]
plt.plot(years, estimates, c='red')
plt.xlabel('Year')
plt.ylabel('Population')
plt.show()

Output:

Regression result: population = 7.43e+03*year -1.43e+07

As can be seen from the plot, the fit is quite good. In the next section we use the Adam optimizer to perform the linear regression.

Linear regression was performed using Adam optimizer

import os
os.environ["KMP_DUPLICATE_LIB_OK"]  =  "TRUE"
# Don't worry about the two lines above; they just prevent a possible warning
import pandas as pd
import torch
url = "https://zh.wikipedia.org/wiki/%E4%B8%96%E7%95%8C%E4%BA%BA%E5%8F%A3"
# Get data from Wikipedia
df = pd.read_html(url, header=0, attrs={"class": "wikitable"}, encoding="utf8") [0]
# print(df)
world_populations = df.copy().iloc[18:31, [0, 1]]

# If you cannot access Wikipedia, download the data from https://oss.xuziao.cn/blogdata/%E6%B5%8B%E8%AF%95%E6%95%B0%E6%8D%AE.csv
# world_populations.to_csv('test data.csv')

# convert the columns corresponding to years to tensors
years = torch.tensor(world_populations.iloc[:, 0].values.astype(float), dtype=torch.float32)
# Convert the columns corresponding to the population into tensors
populations = torch.tensor(world_populations.iloc[:, 1].values.astype(float), dtype=torch.float32)

# The above code is copied from the previous section, there is nothing to see


import torch.nn
import torch.optim

x = years.reshape(-1, 1)
# print(x)
y = populations
# Below we normalize the data; the magnitudes differ greatly, and normalizing makes the loss decrease quickly

x_mean, x_std = torch.mean(x, dim=0), torch.std(x, dim=0)
x_norm = (x - x_mean) / x_std

y_mean, y_std = torch.mean(y, dim=0), torch.std(y, dim=0)
y_norm = (y - y_mean) / y_std

# One input and one output; the weight is a randomly initialized 1x1 matrix
fc = torch.nn.Linear(1, 1)
# MSE loss function
criterion = torch.nn.MSELoss()
# create optimizer
optimizer = torch.optim.Adam(fc.parameters())
# Shallow copy?
weights_norm, bias_norm = fc.parameters()

for step in range(6001):
    if step:
        # Clear the gradients
        fc.zero_grad()
        # Compute the gradients
        loss_norm.backward()
        # Update the weights (attributes of fc)
        optimizer.step()
    # Get the output (normalized; here any value with a _norm suffix is normalized)
    output_norm = fc(x_norm)
    # Remove all dimensions of size one
    pred_norm = output_norm.squeeze()
    # Compute the loss value with the MSE loss function
    loss_norm = criterion(pred_norm, y_norm)
    # Recover the weight for the original data from the normalized weight.
    # This formula and the next one follow from the normalization algebra (see the derivation after this example)
    weights = y_std / x_std * weights_norm
    # Recover the bias for the original data from the normalized bias
    bias = (weights_norm * (0 - x_mean) / x_std + bias_norm) * y_std + y_mean
    if step % 1000 == 0:
        print('step {}: weight = {}, bias = {}'.format(step, weights.item(), bias.item()))

# drawing
import matplotlib.pyplot as plt
plt.scatter(years, populations, s = 7, c='blue', marker='o')
estimates = [weights.item() * yr + bias.item() for yr in years]
plt.plot(years, estimates, c='red')
plt.xlabel('Year')
plt.ylabel('Population')
plt.show()

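The two formulas used in the loop to recover the weight and bias on the original scale can be derived as follows (a short derivation added here for clarity). The model fitted in normalized space is y_norm = w_norm * x_norm + b_norm. Substituting x_norm = (x - mean(x)) / std(x) and y_norm = (y - mean(y)) / std(y) gives

{y - mean(y) \over std(y)} = w_{norm} {x - mean(x) \over std(x)} + b_{norm}

and solving for y:

y = {std(y) \over std(x)} w_{norm} \cdot x + \left( w_{norm} {0 - mean(x) \over std(x)} + b_{norm} \right) std(y) + mean(y)

The coefficient of x is the weight computed in the code, and the remaining term is the bias.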

Output:

step 0: weight = -4349.91064453125, bias = 9026279.0
step 1000: weight = 1948.0953369140625, bias = -3404077.75
step 2000: weight = 5750.35400390625, bias = -10932547.0
step 3000: weight = 7200.87255859375, bias = -13804574.0
step 4000: weight = 7425.09765625, bias = -14248540.0
step 5000: weight = 7432.94873046875, bias = -14264084.0
step 6000: weight = 7432.95751953125, bias = -14264102.0

Appendix: tensor construction functions:

Function name                                        Tensor element content
torch.tensor()                                       The passed-in data
torch.zeros(), torch.zeros_like()                    All elements are 0
torch.ones(), torch.ones_like()                      All elements are 1
torch.full(), torch.full_like()                      All elements are a specified value
torch.empty(), torch.empty_like()                    Element values are uninitialized
torch.eye()                                          1 on the main diagonal, 0 elsewhere
torch.arange(), torch.range(), torch.linspace()      Elements form an arithmetic progression
torch.logspace()                                     Elements form a geometric progression
torch.rand(), torch.rand_like()                      Each element independently follows the standard uniform distribution
torch.randn(), torch.randn_like(), torch.normal()    Each element independently follows the standard normal distribution
torch.randint(), torch.randint_like()                Each element independently follows a discrete uniform distribution
torch.bernoulli()                                    Two-point (Bernoulli) distribution on {0, 1}
torch.multinomial()                                  Multinomial sampling over {0, 1, ..., n-1}
torch.randperm()                                     A random permutation of {0, 1, ..., n-1}