Linear regression
Least squares method
Use torch.lstsq() to solve the linear regression problem
Three preliminary formulas
- How a sample is dotted with the weights:
  $X[i,:] \cdot W = X[i,0]\,W[0] + X[i,1]\,W[1] + \dots + X[i,m-1]\,W[m-1]$
- How the squared 2-norm is computed:
  $\lVert Y - X \cdot W \rVert_2^2 = \sum_{i=0}^{n-1} \bigl(Y[i] - X[i,:] \cdot W\bigr)^2$
- The error (loss) expression:
  $\zeta(W; X, Y) = \frac{1}{n}\,\lVert Y - X \cdot W \rVert_2^2$
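To make these formulas concrete, here is a small sketch (with arbitrary example values of my own) that evaluates the loss $\zeta$ directly and checks it against PyTorch's built-in MSE loss, which computes the same quantity:

```python
import torch

X = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.]])
W = torch.tensor([1., 2., 3.])
Y = torch.tensor([6., 10., 16.])

n = X.shape[0]
loss_manual = ((Y - torch.mv(X, W)) ** 2).sum() / n            # (1/n) * ||Y - X.W||_2^2
loss_builtin = torch.nn.functional.mse_loss(torch.mv(X, W), Y) # same quantity
print(loss_manual, loss_builtin)                               # the two values agree
```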
import torch

x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]])
y = torch.tensor([-10., 12., 14., 16., 18.])
wr, _ = torch.lstsq(y, x)
w = wr[:3]
print(wr)
print(w)
tensor([[  4.6667],
        [  2.6667],
        [-12.0000],
        [ 10.0885],
        [  2.2110]])
tensor([[  4.6667],
        [  2.6667],
        [-12.0000]])
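Note that torch.lstsq has been deprecated in newer PyTorch releases and removed in the most recent ones. If your version no longer provides it, torch.linalg.lstsq gives the same least-squares solution; a small sketch (note the argument order A, B and the 2-D right-hand side):

```python
import torch

x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]])
y = torch.tensor([-10., 12., 14., 16., 18.])
w = torch.linalg.lstsq(x, y.unsqueeze(1)).solution   # shape (3, 1)
print(w)   # approximately [[4.6667], [2.6667], [-12.0000]]
```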
Several loss functions
MSE loss function
The result is the mean of the squared differences between the target values and the predicted values
- Advantages: the function is smooth and differentiable everywhere, which makes it easy to take derivatives, and it has a relatively stable solution
- Disadvantages: not robust to outliers; when an input is far from the target value, solving with gradient descent may lead to exploding gradients
- The corresponding class in PyTorch
torch.nn.MSELoss
MAE loss function
The result is the mean of the absolute differences between the target values and the predicted values
- Advantages: the gradient is stable for any input value, so it does not cause exploding gradients, and the solution is more robust
- Disadvantages: the function has a kink at the center point where it is not differentiable, which makes it less convenient to solve
L1 loss function
L1 norm loss function, also known as least absolute deviations (LAD) or least absolute error (LAE). In general, it minimizes the sum $S$ of the absolute differences between the target values $Y_i$ and the estimated values $f(x_i)$: $S = \sum_{i=1}^{n} \lvert Y_i - f(x_i) \rvert$
- The corresponding class in PyTorch
torch.nn.L1Loss
L2 loss function
L2 norm loss function, also known as least squares error (LSE). In general, it minimizes the sum $S$ of the squared differences between the target values $Y_i$ and the estimated values $f(x_i)$: $S = \sum_{i=1}^{n} \bigl(Y_i - f(x_i)\bigr)^2$
The advantages and disadvantages of the L1 and L2 loss functions are the same as those of the MAE and MSE loss functions, respectively
Smooth L1 loss function
If we could remove the kink of the L1 loss function and make it differentiable everywhere, we would get a better-behaved loss. This leads to the Smooth L1 loss function: $\mathrm{SmoothL1}(x) = 0.5x^2$ for $\lvert x \rvert < 1$ and $\lvert x \rvert - 0.5$ otherwise, where $x$ is the difference between the prediction and the target.
- Advantages: this is a piecewise function. On $[-1, 1]$ it is effectively the L2 loss, which fixes the non-smoothness of L1; outside $[-1, 1]$ it is effectively the L1 loss, which fixes the exploding gradients caused by outliers. Smooth L1 therefore combines the advantages of L1 and L2: stable gradients and fast convergence in the early stage of training (L1-like), and smooth convergence to the optimum in the later stage (L2-like). A quick numeric comparison of MSELoss, L1Loss, and SmoothL1Loss is shown after the class reference below.
- The corresponding class in PyTorch
torch.nn.SmoothL1Loss
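As a quick comparison (the input values here are my own illustration), note how the outlier-like entry -3.0 inflates the MSE far more than the L1 or Smooth L1 losses:

```python
import torch

pred = torch.tensor([0.5, 2.0, -3.0])
target = torch.zeros(3)

for loss_fn in (torch.nn.MSELoss(), torch.nn.L1Loss(), torch.nn.SmoothL1Loss()):
    # For these inputs: MSELoss ~ 4.42, L1Loss ~ 1.83, SmoothL1Loss ~ 1.375
    print(type(loss_fn).__name__, loss_fn(pred, target).item())
```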
An example of calling the MSE loss function
# instantiate this class
criterion = torch.nn.MSELoss()
pred = torch.arange(5, dtype=torch.float32, requires_grad=True)
y = torch.ones(5)
loss = criterion(pred, y)
print(loss)
loss.backward()
# print(loss.grad)
Output:
tensor(3., grad_fn=<MseLossBackward0>)
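A note on the commented-out print(loss.grad): loss is not a leaf tensor, so its .grad stays None; after loss.backward() the gradient accumulates on the leaf tensor pred instead, and for the MSE loss it equals 2 * (pred - y) / n. A small check with the same inputs as above:

```python
import torch

criterion = torch.nn.MSELoss()
pred = torch.arange(5, dtype=torch.float32, requires_grad=True)   # [0., 1., 2., 3., 4.]
y = torch.ones(5)
loss = criterion(pred, y)    # mean((pred - y) ** 2) = 3
loss.backward()
print(pred.grad)             # 2 * (pred - y) / 5 = tensor([-0.4000, 0.0000, 0.4000, 0.8000, 1.2000])
```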
Use an optimizer to solve linear regression
Whatever the loss function is, we can always use gradient descent to find weights W that minimize the loss. To do so we first compute the loss, then the gradient of the loss, and then update the weights W. Even with the simplest MSE loss, when there is too much data to load into memory at once, we can use stochastic (mini-batch) gradient descent and work on only part of the data in each iteration; a minimal sketch of that idea follows this paragraph. The full example after the sketch reproduces the result of the previous example. This approach is more laborious and takes more time, so use torch.lstsq() whenever you can, and fall back to an optimizer only when torch.lstsq() is not applicable (for example, when the loss is not the MSE loss, or the data cannot all be loaded into memory at once).
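A minimal mini-batch sketch of that idea on synthetic data (the data, batch size, and learning rate here are illustrative assumptions, not from the text):

```python
import torch

n, batch_size = 1000, 32
x = torch.randn(n, 3)
true_w = torch.tensor([4.6667, 2.6667, -12.0])
y = torch.mv(x, true_w) + 0.1 * torch.randn(n)     # noisy synthetic targets

w = torch.zeros(3, requires_grad=True)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD([w], lr=0.01)

for step in range(2000):
    idx = torch.randint(0, n, (batch_size,))       # use only a random subset per step
    pred = torch.mv(x[idx], w)
    loss = criterion(pred, y[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(w)   # should end up close to true_w
```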
import torch
import torch.nn
import torch.optim

x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]], device='cuda')
y = torch.tensor([-10., 12., 14., 16., 18.], device='cuda')
w = torch.zeros(3, requires_grad=True, device='cuda')
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam([w])
for step in range(30001):
    if step:
        optimizer.zero_grad()  # reset the gradients
        loss.backward()        # compute the gradients
        optimizer.step()       # update the weights according to the gradients
    pred = torch.mv(x, w)      # matrix-vector multiplication
    loss = criterion(pred, y)
    if step % 5000 == 0:
        print('step = {} loss = {:g} W = {}'.format(step, loss, w.tolist()))
Output:
step = 0 loss = 204 W = [0.0, 0.0, 0.0]
step = 5000 loss = 40.8731 W = [2.3051974773406982, 1.712536334991455, -0.6180324554443359]
step = 10000 loss = 27.9001 W = [3.6783804893493652, 1.7130744457244873, -5.2205023765563965]
step = 15000 loss = 22.31 W = [4.292291641235352, 2.293663263320923, -9.385353088378906]
step = 20000 loss = 21.3341 W = [4.655962944030762, 2.6559813022613525, -11.925154685974121]
step = 25000 loss = 21.3333 W = [4.666664123535156, 2.666663885116577, ...]
step = 30000 loss = 21.3333 W = [4.666667938232422, 2.666668176651001, -11.999998092651367]
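The weights converge to essentially the same solution that torch.lstsq() returned earlier. As a small sanity check (using the rounded values from the output above), plugging them back in reproduces the reported loss of about 21.33:

```python
import torch

x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]])
y = torch.tensor([-10., 12., 14., 16., 18.])
w = torch.tensor([4.6667, 2.6667, -12.0])                  # converged weights from above
print(torch.nn.functional.mse_loss(torch.mv(x, w), y))     # roughly 21.33
```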
Use torch.nn.Linear() to implement linear regression
import torch
import torch.nn
import torch.optim

x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]])
y = torch.tensor([-10., 12., 14., 16., 18.])
fc = torch.nn.Linear(3, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(fc.parameters())
weights, bias = fc.parameters()
fc(x)
for step in range(30001):
    if step:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    pred = fc(x)
    loss = criterion(pred, y)
    if step % 5000 == 0:
        print('step = {} loss = {:g} W = {}, bias = {}'.format(step, loss, weights[0, :].tolist(), bias.item()))
Output:
<generator object Module.parameters at 0x000001ED118B8270>
step = 5000 loss = 106.462 W = [0.4140699803829193, 0.7813165187835693, 2.938326358795166], bias = 2.9747958183288574
step = 10000 loss = 104 W = [0.007105899043381214, 0.007294247858226299, 4.956961631774902], bias = 4.993431568145752
step = 15000 loss = 104 W = [2.2107651602709666e-06, ...], bias = 5.018227577209473
step = 20000 loss = 104 W = [2.7108444555778997e-07, ..., 4.981764793395996], bias = 5.018234729766846
step = 25000 loss = 104 W = [-4.070022259838879e-05, -4.075446486240253e-05, 4.981725215911865], bias = 5.018195152282715
step = 30000 loss = 104 W = [1.3781600500806235e-06, 1.4800637018197449e-06, 4.981767177581787], bias = 5.018237113952637
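Two notes on this example, as I read it. First, fc(x) simply computes x @ fc.weight.t() + fc.bias. Second, fc(x) has shape (5, 1) while y has shape (5,), so torch.nn.MSELoss broadcasts the two into a 5x5 grid of pairwise differences; the model then ends up fitting the mean of y (which is 10: note W ≈ [0, 0, 4.98] and bias ≈ 5.02, giving predictions near 10), and the loss settles at 104 instead of the 21.33 obtained earlier. A sketch of the check and of the shape fix:

```python
import torch

fc = torch.nn.Linear(3, 1)
x = torch.tensor([[1., 1., 1.], [2., 3., 1.], [3., 5., 1.], [4., 2., 1.], [5., 4., 1.]])

# What fc(x) computes under the hood:
print(torch.allclose(fc(x), x @ fc.weight.t() + fc.bias))   # True

# Matching shapes avoids the MSELoss broadcasting described above; with this y
# the training loop above should reach a loss of about 21.33 instead of 104.
y = torch.tensor([-10., 12., 14., 16., 18.]).reshape(-1, 1)
```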
Normalization of data
Why normalization?
In some linear regression problems, the range of a feature can differ greatly from the range of the labels, or from the range of another feature. Some weights then have to be particularly large, which makes it difficult for the optimizer to learn them.
How to normalize?
Normalize a feature $A$ with mean $\mathrm{mean}(A)$ and standard deviation $\mathrm{std}(A)$: $A_{norm} = \dfrac{A - \mathrm{mean}(A)}{\mathrm{std}(A)}$
What are the features of the normalized data?
The mean of normalized data is 0 and the variance is 1
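A quick check of this property (the values below echo the first feature used in the example that follows):

```python
import torch

a = torch.tensor([1000000., 2000000., 3000000., 4000000., 5000000.])
a_norm = (a - a.mean()) / a.std()
print(a_norm)                                      # tensor([-1.2649, -0.6325,  0.0000,  0.6325,  1.2649])
print(a_norm.mean().item(), a_norm.std().item())   # approximately 0 and 1
```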
Code examples:
- Code without normalization:
import torch
import torch.nn
import torch.optim

x = torch.tensor([[1000000., 0.0001], [2000000., 0.0003], [3000000., 0.0005],
                  [4000000., 0.0002], [5000000., 0.0004]], device='cuda')
y = torch.tensor([-1000., 1200., 1400., 1600., 1800.], device='cuda').reshape(-1, 1)
fc = torch.nn.Linear(2, 1)
fc = fc.cuda()
# Get the result computed with the current (random initial) weights
pred = fc(x)
print(pred)
criterion = torch.nn.MSELoss()
criterion = criterion.cuda()
optimizer = torch.optim.Adam(fc.parameters())
for step in range(100001):
    if step:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    pred = fc(x)
    loss = criterion(pred, y)
    if step % 10000 == 0:
        print('step = {}, loss = {:g}'.format(step, loss))
Output:
tensor([[ 580872.8750],
        [1161746.1250],
        [1742619.3750],
        [2323492.5000],
        [2904365.7500]], device='cuda:0', grad_fn=<AddmmBackward0>)
step = 0, loss = 3.70667e+12
step = 10000, loss = 436096
step = 20000, loss = 435005
step = 30000, loss = 432516
step = 40000, loss = 430062
step = 50000, loss = 427641
step = 60000, loss = 425254
step = 70000, loss = 432383
step = 80000, loss = 420584
step = 90000, loss = 418410
step = 100000, loss = 416046
Convergence is clearly far too slow: the loss is still very high even after 100,000 iterations. Next, the data is normalized.
import torch
import torch.nn
import torch.optim

x = torch.tensor([[1000000., 0.0001], [2000000., 0.0003], [3000000., 0.0005],
                  [4000000., 0.0002], [5000000., 0.0004]])
y = torch.tensor([-1000., 1200., 1400., 1600., 1800.]).reshape(-1, 1)
x_mean, x_std = torch.mean(x, dim=0), torch.std(x, dim=0)
x_norm = (x - x_mean) / x_std
y_mean, y_std = torch.mean(y, dim=0), torch.std(y, dim=0)
y_norm = (y - y_mean) / y_std
fc = torch.nn.Linear(2, 1)
# Get the result computed with the current (random initial) weights
pred = fc(x)
print(pred)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(fc.parameters())
for step in range(10001):
    if step:
        optimizer.zero_grad()
        loss_norm.backward()
        optimizer.step()
    pred_norm = fc(x_norm)
    loss_norm = criterion(pred_norm, y_norm)
    # Map the prediction back to the original scale
    pred = pred_norm * y_std + y_mean
    loss = criterion(pred, y)
    if step % 1000 == 0:
        print('step = {}, loss = {:g}'.format(step, loss))
Output:
tensor([[ -599029.2500],
        [-1198058.6250],
        [-1797088.0000],
        [-2396117.5000],
        [-2995146.7500]], grad_fn=<AddmmBackward0>)
step = 0, loss = 4.38259e+06
step = 1000, loss = 654194
step = 2000, loss = 224888
step = 3000, loss = 213705
step = 4000, loss = 213341
step = 5000, loss = 213333
step = 6000, loss = 213333
step = 7000, loss = 213333
step = 8000, loss = 213333
step = 9000, loss = 213333
step = 10000, loss = 213333
Hands-on practice
Linear regression on world population with the least squares method
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
# The two lines above prevent a duplicate OpenMP library warning/error on some machines
import torch
import pandas as pd

url = "https://zh.wikipedia.org/wiki/%E4%B8%96%E7%95%8C%E4%BA%BA%E5%8F%A3"
# Get the data from Wikipedia
df = pd.read_html(url, header=0, attrs={"class": "wikitable"}, encoding="utf8")[0]
# print(df)
world_populations = df.copy().iloc[18:31, [0, 1]]
# If you cannot access Wikipedia, download the data from https://oss.xuziao.cn/blogdata/%E6%B5%8B%E8%AF%95%E6%95%B0%E6%8D%AE.csv
# world_populations.to_csv('test_data.csv')
# Convert the year column into a tensor
years = torch.tensor(world_populations.iloc[:, 0].values.astype(float), dtype=torch.float32)
# Convert the population column into a tensor
populations = torch.tensor(world_populations.iloc[:, 1].values.astype(float), dtype=torch.float32)
# Build [[year, 1], [year, 1], ...]; the matrix product then gives w1 * year + w2 * 1
x = torch.stack([years, torch.ones_like(years)], 1)
y = populations
# Use least squares
wr, _ = torch.lstsq(y, x)
# print(wr)
# Take the first two entries (i.e. w1, w2)
slope, intercept = wr[:2, 0]
result = 'population = {:.2e}*year {:.2e}'.format(slope, intercept)
print('Regression result: ' + result)

# Plot
import matplotlib.pyplot as plt
plt.scatter(years, populations, s=7, c='blue', marker='o')
estimates = [slope * yr + intercept for yr in years]
plt.plot(years, estimates, c='red')
plt.xlabel('Year')
plt.ylabel('Population')
plt.show()
Output:
population = 7.43e+03*year -1.43e+07
As can be seen from the resulting plot, the fit is quite good. In the next section, the Adam optimizer is used to perform the same linear regression.
Linear regression using the Adam optimizer
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
# The two lines above prevent a duplicate OpenMP library warning/error on some machines
import pandas as pd
import torch

url = "https://zh.wikipedia.org/wiki/%E4%B8%96%E7%95%8C%E4%BA%BA%E5%8F%A3"
# Get the data from Wikipedia
df = pd.read_html(url, header=0, attrs={"class": "wikitable"}, encoding="utf8")[0]
# print(df)
world_populations = df.copy().iloc[18:31, [0, 1]]
# If you cannot access Wikipedia, download the data from https://oss.xuziao.cn/blogdata/%E6%B5%8B%E8%AF%95%E6%95%B0%E6%8D%AE.csv
# world_populations.to_csv('test_data.csv')
# Convert the year column into a tensor
years = torch.tensor(world_populations.iloc[:, 0].values.astype(float), dtype=torch.float32)
# Convert the population column into a tensor
populations = torch.tensor(world_populations.iloc[:, 1].values.astype(float), dtype=torch.float32)
# The code above is the same as in the previous section

import torch.nn
import torch.optim

x = years.reshape(-1, 1)
# print(x)
y = populations
# Normalize the data: the magnitudes differ greatly, and normalizing lets the loss decrease quickly
x_mean, x_std = torch.mean(x, dim=0), torch.std(x, dim=0)
x_norm = (x - x_mean) / x_std
y_mean, y_std = torch.mean(y, dim=0), torch.std(y, dim=0)
y_norm = (y - y_mean) / y_std
# One input and one output; the weight is a randomly initialized 1x1 matrix
fc = torch.nn.Linear(1, 1)
# MSE loss function
criterion = torch.nn.MSELoss()
# Create the optimizer
optimizer = torch.optim.Adam(fc.parameters())
# Shallow copy? (these are references to fc's parameters, so they track updates)
weights_norm, bias_norm = fc.parameters()
for step in range(6001):
    if step:
        # Clear the gradients
        fc.zero_grad()
        # Compute the gradients
        loss_norm.backward()
        # Update the weights (attributes of fc) according to the gradients
        optimizer.step()
    # Get the output (anything with a _norm suffix here is in normalized units)
    output_norm = fc(x_norm)
    # Remove the dimensions of size one
    pred_norm = output_norm.squeeze()
    # Compute the loss with the MSE loss function
    loss_norm = criterion(pred_norm, y_norm)
    # Recover the original-scale weight from the normalized weight
    # (the derivation of this and the next formula is shown after the output below)
    weights = y_std / x_std * weights_norm
    # Recover the original-scale bias from the normalized bias
    bias = (weights_norm * (0 - x_mean) / x_std + bias_norm) * y_std + y_mean
    if step % 1000 == 0:
        print('step {}: weight = {}, bias = {}'.format(step, weights.item(), bias.item()))

# Plot
import matplotlib.pyplot as plt
plt.scatter(years, populations, s=7, c='blue', marker='o')
estimates = [weights.item() * yr + bias.item() for yr in years]
plt.plot(years, estimates, c='red')
plt.xlabel('Year')
plt.ylabel('Population')
plt.show()
Output:
step 0: weight = -4349.91064453125, bias = 9026279.0
step 1000: weight = 1948.0953369140625, bias = -3404077.75
step 2000: weight = 5750.35400390625, bias = -10932547.0
step 3000: weight = 7200.87255859375, bias = -13804574.0
step 4000: weight = 7425.09765625, bias = -14248540.0
step 5000: weight = 7432.94873046875, bias = -14264084.0
step 6000: weight = 7432.95751953125, bias = -14264102.0
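For reference, here is how the two un-normalization formulas in the loop above can be derived, by substituting the normalization definitions back into the fitted linear model (my own restatement of what the code computes):

$$
y = y_{norm}\,\mathrm{std}(y) + \mathrm{mean}(y)
  = \Bigl(w_{norm}\,\tfrac{x - \mathrm{mean}(x)}{\mathrm{std}(x)} + b_{norm}\Bigr)\mathrm{std}(y) + \mathrm{mean}(y)
  = \underbrace{\tfrac{\mathrm{std}(y)}{\mathrm{std}(x)}\,w_{norm}}_{\text{weights}}\,x
  + \underbrace{\Bigl(w_{norm}\,\tfrac{0 - \mathrm{mean}(x)}{\mathrm{std}(x)} + b_{norm}\Bigr)\mathrm{std}(y) + \mathrm{mean}(y)}_{\text{bias}}
$$

which is exactly `weights = y_std / x_std * weights_norm` and `bias = (weights_norm * (0 - x_mean) / x_std + bias_norm) * y_std + y_mean` in the code.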
Appendix: tensor construction functions
| Function | What the elements are |
|---|---|
| torch.tensor() | The data that is passed in |
| torch.zeros(), torch.zeros_like() | All elements are 0 |
| torch.ones(), torch.ones_like() | All elements are 1 |
| torch.full(), torch.full_like() | All elements are a specified value |
| torch.empty(), torch.empty_like() | Element values are unspecified (uninitialized) |
| torch.eye() | 1 on the main diagonal, 0 elsewhere |
| torch.arange(), torch.range(), torch.linspace() | Elements form an arithmetic (equally spaced) sequence |
| torch.logspace() | Elements form a geometric sequence |
| torch.rand(), torch.rand_like() | Each element independently follows the standard uniform distribution |
| torch.randn(), torch.randn_like(), torch.normal() | Each element independently follows the standard normal distribution |
| torch.randint(), torch.randint_like() | Each element independently follows a discrete uniform distribution |
| torch.bernoulli() | Two-point (Bernoulli) distribution on {0, 1} |
| torch.multinomial() | Values drawn from {0, 1, ..., n-1} |
| torch.randperm() | A random permutation of 0, 1, ..., n-1 |
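A few of these constructors in action (a quick illustrative sketch):

```python
import torch

print(torch.zeros(2, 3))             # 2x3 tensor of zeros
print(torch.full((2, 2), 7.))        # all elements equal to 7
print(torch.eye(3))                  # identity matrix
print(torch.linspace(0, 1, steps=5)) # 5 equally spaced values from 0 to 1
print(torch.rand(2, 2))              # uniform on [0, 1)
print(torch.randperm(5))             # random permutation of 0..4
```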