1. The concept of model fitting
Among all model optimization problems, the most basic and central one is the discussion and improvement of model fit. If a model can capture the general law well, it will also predict unknown data well. However, two main factors limit a model's ability to capture the general law:
- Whether the sample data reflects the general law. If the sample data itself does not reflect the general law well, then even the rules captured during modeling may not apply to unseen data. As an extreme example, in fraud detection, if we had to build a model from historical data containing no fraud cases at all, there would be no law for the model to capture. Likewise, when the perturbation term is too large, the noise will to some extent drown out the real law.
- The sample data reflects the general law, but the model fails to capture it. If the first situation calls for more effort in data collection, then when the data does reflect the general law but the model still performs poorly, the core reason lies in the model itself. Model evaluation is based mainly on performance on the test set; if the test-set result is poor, the model is considered to need improvement. There are basically two causes of poor test-set performance: either the model failed to capture the rules in the training set, or the model captured the training set too thoroughly, picking up a large number of rules that are specific to the training set (local rules) and do not apply to the whole population; since the test set is also drawn from that population, those rules do not apply to the test set either. In the former case we say the model is underfitting; in the latter, overfitting. The following examples will help us understand this further:
2. Model fitting experiments
2.1 Overfitting of the model
import random
import time
import math
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import pandas as pd
import torch
from torch import nn, optim
import torch.nn.functional as F
from torch.utils.data import Dataset, TensorDataset, DataLoader
from torch.utils.data import random_split
from torch.utils.tensorboard import SummaryWriter

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
The helper functions used below
def tensorGenReg(num_examples=1000, w=[2, -1, 1], bias=True, delta=0.01, deg=1):
    """Generate a regression dataset.
    :param num_examples: number of samples in the dataset
    :param w: coefficient vector, including the intercept (if present)
    :param bias: whether an intercept is included
    :param delta: scale of the perturbation (noise) term
    :param deg: degree of the polynomial
    :return: generated feature tensor and label tensor
    """
    if bias == True:
        num_inputs = len(w) - 1                                   # number of features
        features_true = torch.randn(num_examples, num_inputs)     # noise-free features
        w_true = torch.tensor(w[:-1]).reshape(-1, 1).float()      # true coefficients
        b_true = torch.tensor(w[-1]).float()                      # true intercept
        if num_inputs == 1:
            labels_true = torch.pow(features_true, deg) * w_true + b_true
        else:
            labels_true = torch.mm(torch.pow(features_true, deg), w_true) + b_true
        features = torch.cat((features_true, torch.ones(len(features_true), 1)), 1)
        labels = labels_true + torch.randn(size=labels_true.shape) * delta
    else:
        num_inputs = len(w)
        features = torch.randn(num_examples, num_inputs)
        w_true = torch.tensor(w).reshape(-1, 1).float()
        if num_inputs == 1:
            labels_true = torch.pow(features, deg) * w_true
        else:
            labels_true = torch.mm(torch.pow(features, deg), w_true)
        labels = labels_true + torch.randn(size=labels_true.shape) * delta
    return features, labels


def split_loader(features, labels, batch_size=10, rate=0.7):
    """Split the data into train and test sets and wrap them in DataLoaders.
    :param features: feature tensor
    :param labels: label tensor
    :param batch_size: batch size
    :param rate: proportion of samples used for training
    :return: train and test DataLoaders
    """
    data = GenData(features, labels)
    num_train = int(data.lens * rate)
    num_test = data.lens - num_train
    data_train, data_test = random_split(data, [num_train, num_test])
    train_loader = DataLoader(data_train, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(data_test, batch_size=batch_size, shuffle=False)
    return (train_loader, test_loader)
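The code above also relies on the GenData dataset wrapper and, later on, on the fit training loop and the mse_cal evaluation function, all of which were defined in earlier posts of this series. For readers who do not have them at hand, minimal stand-ins consistent with how they are called here might look like the following (assumed sketches, not the original implementations):

# Assumed minimal stand-ins for helpers from earlier posts; not the original implementations.
class GenData(Dataset):
    """Wrap feature and label tensors in a Dataset."""
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels
        self.lens = len(features)

    def __getitem__(self, index):
        return self.features[index, :], self.labels[index]

    def __len__(self):
        return self.lens


def fit(net, criterion, optimizer, batchdata, epochs=3, cla=False):
    """Run `epochs` passes of mini-batch gradient descent over `batchdata`."""
    for epoch in range(epochs):
        for X, y in batchdata:
            if cla:
                y = y.flatten().long()      # classification targets must be long integers
            yhat = net.forward(X)
            loss = criterion(yhat, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


def mse_cal(data_loader, net):
    """Average MSE of `net` over every sample served by `data_loader`."""
    se, n = 0.0, 0
    for X, y in data_loader:
        yhat = net(X)
        se += F.mse_loss(yhat, y, reduction='sum')
        n += y.numel()
    return se / n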
np.random.seed(123)

# create the data
n_dots = 20
x = np.linspace(0, 1, n_dots)
y = np.sqrt(x) + 0.2 * np.random.rand(n_dots) - 0.1
x is an ndarray of 20 evenly spaced points between 0 and 1, and y = sqrt(x) + r, where r is artificial random noise drawn from a uniform distribution on [-0.1, 0.1]. Polynomial fitting is performed with NumPy's polyfit function, which uses least squares to fit the given data with a polynomial of the specified degree and returns the fitted coefficients. We fit y with polynomials in x of degree 1, degree 3 and degree 10, and use plots to judge how well each polynomial fits. First we define a small plotting helper to make the comparison easier.
# a quick check of polyfit: fit y0 = x ** 2 with a degree-2 polynomial
y0 = x ** 2
np.polyfit(x, y0, 2)
p = np.poly1d(np.polyfit(x, y0, 2))
print(p)

# plotting helper: scatter the data, draw the fitted polynomial and the true law sqrt(t)
def plot_polynomial_fit(x, y, deg):
    p = np.poly1d(np.polyfit(x, y, deg))
    t = np.linspace(0, 1, 200)
    plt.plot(x, y, 'ro', t, p(t), '-', t, np.sqrt(t), 'r--')

plot_polynomial_fit(x, y, 3)
Here t is 200 points on [0, 1], p is the polynomial regression equation fitted with the degree given by the deg parameter, and p(t) is the fitted polynomial evaluated at the points t. The plot_polynomial_fit function therefore draws three things: the original (x, y) data as red dots, the fitted curve (t, p(t)), and the true law (t, np.sqrt(t)) as a red dashed line. We first test the degree-3 fit. Note that the red dots formed by (x, y) represent the noisy data distribution in two-dimensional space, the blue curve formed by (t, p(t)) is the result of fitting the original (x, y) dataset with a degree-3 polynomial, and since the objective law behind the data is y = sqrt(x), the red dashed line (t, np.sqrt(t)) represents that objective law. In other words, we want the fitted polynomial to follow the red dashed line representing the objective law as closely as possible, rather than being pulled away from it by the noisy points, and we also do not want it to capture the noisy point distribution perfectly. Next we plot the degree-1, degree-3 and degree-10 fits in one figure.
plt.figure(figsize=(18, 4), dpi=200)
titles = ['under-fitting', 'fitting', 'over-fitting']
for index, deg in enumerate([1, 3, 10]):
    plt.subplot(1, 3, index + 1)
    plot_polynomial_fit(x, y, deg)
    plt.title(titles[index], fontsize=20)
- Judging from the output, the blue degree-1 fitted curve captures neither the distribution of the dataset nor the objective law behind it; the degree-3 polynomial does well on both counts; and the degree-10 polynomial tracks the distribution of the data points closely but deviates far from the red dashed curve. In other words, the degree-1 polynomial is underfitting, while the degree-10 polynomial over-captures the distribution of the noisy data (here the noise is uniformly distributed). So even though the degree-10 polynomial fits the current training data very well, the useless rules it has captured do not carry over to new data, and the model will show a large error on the test dataset: the training error is small, but the generalization error is large.
- The root cause of overfitting is still the "error" between samples, i.e. the rules in different batches of data are not exactly the same.
- Model underfitting: there is a large error in the training set
- Model overfitting: the error in the training set is small, but the error in the test set is large
- Whether a model underfits or overfits is closely related to its complexity: the more complex the model, the stronger its ability to capture rules on the training set. If the model underfits, increasing its complexity lets it capture more rules, but an overly complex model then runs the risk of overfitting. (A small numeric check of this trade-off follows this list.)
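As an illustration only (not from the original post), we can make this trade-off concrete by fitting polynomials of degree 1, 3 and 10 to one noisy sample and evaluating them on a second, independent noisy sample drawn from the same law y = sqrt(x):

# illustrative only: train/test MSE of polynomial fits of increasing degree
np.random.seed(123)
x_demo = np.linspace(0, 1, 20)
y_train = np.sqrt(x_demo) + 0.2 * np.random.rand(20) - 0.1
y_test = np.sqrt(x_demo) + 0.2 * np.random.rand(20) - 0.1   # an independent second draw

for deg in [1, 3, 10]:
    p = np.poly1d(np.polyfit(x_demo, y_train, deg))
    mse_train = np.mean((p(x_demo) - y_train) ** 2)
    mse_test = np.mean((p(x_demo) - y_test) ** 2)
    print(f'deg={deg:2d}  train MSE={mse_train:.5f}  test MSE={mse_test:.5f}')

Typically the degree-10 fit attains the smallest training MSE, while its test MSE is no better (and often worse) than that of the degree-3 fit, which is exactly the pattern described above.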
2.2 Underfitting of the model
When a model underfits, the basic way to improve its performance is to increase model complexity. From the perspective of the overall structure of a neural network, there are only two ways to do this:
- One is to modify the activation function, so that the weighted sum inside each neuron undergoes a more complex transformation;
- the other is to add hidden layers, which covers both the number of hidden layers and the number of neurons in each hidden layer.
Create a polynomial regression dataset
# generate a dataset whose highest-order term is 2
features, labels = tensorGenReg(w=[2, -1], bias=False, deg=2)

# inspect the relationship between each feature and the labels
plt.subplot(121)
plt.scatter(features[:, 0], labels)
plt.subplot(122)
plt.scatter(features[:, 1], labels)
# split the data into train and test loaders
train_loader, test_loader = split_loader(features, labels)

# define a simple linear regression model
class LR_class(nn.Module):
    def __init__(self, in_features=2, out_features=1):
        super(LR_class, self).__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        out = self.linear(x)
        return out
2.2.1 Model training
torch.manual_seed(420)

# instantiate the model
LR = LR_class()

train_l = []   # store the training error
test_l = []    # store the test error

num_epochs = 20
for epochs in range(num_epochs):
    fit(net=LR,
        criterion=nn.MSELoss(),
        optimizer=optim.SGD(LR.parameters(), lr=0.03),
        batchdata=train_loader,
        epochs=epochs)
    train_l.append(mse_cal(train_loader, LR).detach().numpy())
    test_l.append(mse_cal(test_loader, LR).detach().numpy())

# plot the MSE curves
plt.plot(list(range(num_epochs)), train_l, label='train_mse')
plt.plot(list(range(num_epochs)), test_l, label='test_mse')
plt.legend(loc=1)
The model performs poorly: both the training error and the test error remain large (neither comes close to 0.0001). The model is underfitting, so we should consider increasing its complexity.
def model_train_test(model,
                     train_data,
                     test_data,
                     num_epochs=20,
                     criterion=nn.MSELoss(),
                     optimizer=optim.SGD,
                     lr=0.03,
                     cla=False,
                     eva=mse_cal):
    """Train a model and record its per-epoch errors.
    :param model: model instance
    :param train_data: training data
    :param test_data: test data
    :param num_epochs: number of training epochs
    :param criterion: loss function
    :param optimizer: optimizer class
    :param lr: learning rate
    :param cla: whether this is a classification task
    :param eva: evaluation metric
    :return: lists of training and test errors
    """
    train_l = []
    test_l = []
    # model training loop
    for epochs in range(num_epochs):
        fit(net=model,
            criterion=criterion,
            optimizer=optimizer(model.parameters(), lr=lr),
            batchdata=train_data,
            epochs=epochs,
            cla=cla)
        train_l.append(eva(train_data, model).detach())
        test_l.append(eva(test_data, model).detach())
    return train_l, test_l
2.2.2 Testing the effect of increased model complexity
train_l, test_l = model_train_test(LR,
                                   train_loader,
                                   test_loader,
                                   num_epochs=20,
                                   criterion=nn.MSELoss(),
                                   optimizer=optim.SGD,
                                   lr=0.03,
                                   cla=False,
                                   eva=mse_cal)

plt.plot(list(range(num_epochs)), train_l, label='train_mse')
plt.plot(list(range(num_epochs)), test_l, label='test_mse')
plt.legend(loc=1)
Under the basic structure of a neural network, how should we increase model complexity so that it can model polynomial regression data? First of all, when the activation function is just the identity transformation y = x, adding more layers does not noticeably improve the result, which the following experiment verifies:
class LR_class1(nn.Module):
    def __init__(self, in_features=2, n_hidden=4, out_features=1):
        super(LR_class1, self).__init__()
        self.linear1 = nn.Linear(in_features, n_hidden)
        self.linear2 = nn.Linear(n_hidden, out_features)

    def forward(self, x):
        z1 = self.linear1(x)
        out = self.linear2(z1)
        return out
# instantiate the two-layer (purely linear) model
LR1 = LR_class1()

# train and evaluate it
train_l, test_l = model_train_test(LR1,
                                   train_loader,
                                   test_loader,
                                   num_epochs=20,
                                   criterion=nn.MSELoss(),
                                   optimizer=optim.SGD,
                                   lr=0.03,
                                   cla=False,
                                   eva=mse_cal)

# plot the error curves
plt.plot(list(range(num_epochs)), train_l, label='train_mse')
plt.plot(list(range(num_epochs)), test_l, label='test_mse')
plt.legend(loc=1)
The results do not improve noticeably, although the model does become somewhat more stable. A neural network that merely stacks linear layers still applies nothing more than an affine transformation to the data, so it cannot fit higher-order terms. In other words, when increasing model complexity, we first need a nonlinear activation function to cooperate, and only then does it make sense to increase the number of layers and the number of neurons per layer. Note also that "the more complex the model, the more stable the output" is only a local observation here; in most cases, the more complex the model, the less stable its output.
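To see why stacking purely linear layers cannot add expressive power, note that a linear layer computes Wx + b, so two stacked layers give W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), which is again a single affine map. A quick numeric check (illustrative only, not from the original post):

# illustrative check: two stacked Linear layers collapse into one affine map
torch.manual_seed(0)
l1 = nn.Linear(2, 4)
l2 = nn.Linear(4, 1)

x_check = torch.randn(5, 2)
out_stacked = l2(l1(x_check))                    # two stacked linear layers

W = l2.weight @ l1.weight                        # collapsed weight, shape (1, 2)
b = l2.weight @ l1.bias + l2.bias                # collapsed bias, shape (1,)
out_single = x_check @ W.t() + b                 # one equivalent affine map

print(torch.allclose(out_stacked, out_single, atol=1e-6))   # prints True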
3. Comparison of activation function performance
3.1 Comparison of common activation functions
The choice of activation function makes a very noticeable difference. The activation function of the output layer and the activation functions of the hidden layers should be treated separately: hidden-layer activation functions apply a nonlinear transformation to the data, while the output-layer activation function is usually designed to produce output of a specific form.
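For reference, the shapes of the three hidden-layer activations compared below can be plotted directly (an illustrative snippet, not part of the original post):

# illustrative: shapes of the three activation functions compared below
z = torch.linspace(-5, 5, 200)
plt.plot(z, torch.sigmoid(z), label='sigmoid')   # saturates at 0 and 1
plt.plot(z, torch.tanh(z), label='tanh')         # saturates at -1 and 1
plt.plot(z, torch.relu(z), label='relu')         # zero for negative inputs, identity otherwise
plt.legend()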
# Sigmoid activation function
class Sigmoid_class1(nn.Module):
    def __init__(self, in_features=2, n_hidden=4, out_features=1, bias=True):
        super(Sigmoid_class1, self).__init__()
        self.linear1 = nn.Linear(in_features, n_hidden, bias=bias)
        self.linear2 = nn.Linear(n_hidden, out_features, bias=bias)

    def forward(self, x):
        z1 = self.linear1(x)
        p1 = torch.sigmoid(z1)
        out = self.linear2(p1)
        return out


# tanh activation function
class tanh_class1(nn.Module):
    def __init__(self, in_features=2, n_hidden=4, out_features=1, bias=True):
        super(tanh_class1, self).__init__()
        self.linear1 = nn.Linear(in_features, n_hidden, bias=bias)
        self.linear2 = nn.Linear(n_hidden, out_features, bias=bias)

    def forward(self, x):
        z1 = self.linear1(x)
        p1 = torch.tanh(z1)
        out = self.linear2(p1)
        return out


# ReLU activation function
class ReLU_class1(nn.Module):
    def __init__(self, in_features=2, n_hidden=4, out_features=1, bias=True):
        super(ReLU_class1, self).__init__()
        self.linear1 = nn.Linear(in_features, n_hidden, bias=bias)
        self.linear2 = nn.Linear(n_hidden, out_features, bias=bias)

    def forward(self, x):
        z1 = self.linear1(x)
        p1 = torch.relu(z1)
        out = self.linear2(p1)
        return out
3.1.1 Instantiating the models
torch.manual_seed(420)

# instantiate the models
LR1 = LR_class1()
sigmoid_model1 = Sigmoid_class1()
tanh_model1 = tanh_class1()
relu_model1 = ReLU_class1()

# put the instantiated models and their names in list containers
model_l = [LR1, sigmoid_model1, tanh_model1, relu_model1]
name_l = ['LR1', 'sigmoid_model1', 'tanh_model1', 'relu_model1']
3.1.2 Defining the core parameters and the MSE storage tensors for the training and test sets
# core parameters
num_epochs = 20
lr = 0.03

# tensors for storing the per-epoch MSE of every model
mse_train = torch.zeros(len(model_l), num_epochs)
mse_test = torch.zeros(len(model_l), num_epochs)
3.1.3 Model training
for epochs in range(num_epochs):
    for i, model in enumerate(model_l):
        fit(net=model,
            criterion=nn.MSELoss(),
            optimizer=optim.SGD(model.parameters(), lr=lr),
            batchdata=train_loader,
            epochs=epochs)
        mse_train[i][epochs] = mse_cal(train_loader, model).detach()
        mse_test[i][epochs] = mse_cal(test_loader, model).detach()
Training error plot
# plot the training error
for i, name in enumerate(name_l):
    plt.plot(list(range(num_epochs)), mse_train[i], label=name)
plt.legend(loc=1)
plt.title('mse_train')
Test error plot
# plot the test error
for i, name in enumerate(name_l):
    plt.plot(list(range(num_epochs)), mse_test[i], label=name)
plt.legend(loc=1)
plt.title('mse_test')
Compared with the other activation functions, the ReLU activation function performs noticeably better.
4. Constructing a complex neural network
Having tentatively concluded that the ReLU activation function works better than the Sigmoid and tanh activation functions, we now increase model complexity by adding hidden layers, i.e. by building a more complex neural network model. The first approach is to stack ReLU activation functions: we add more hidden layers and use the ReLU function in each of them, which is what we mean by "adding ReLU layers". Based on ReLU_class1, we create a ReLU_class2 structure:
class ReLU_class2(nn.Module):
    def __init__(self, in_features=2, n_hidden_1=4, n_hidden_2=4, out_features=1, bias=True):
        super(ReLU_class2, self).__init__()
        self.linear1 = nn.Linear(in_features, n_hidden_1, bias=bias)
        self.linear2 = nn.Linear(n_hidden_1, n_hidden_2, bias=bias)
        self.linear3 = nn.Linear(n_hidden_2, out_features, bias=bias)

    def forward(self, x):
        z1 = self.linear1(x)
        p1 = torch.relu(z1)
        z2 = self.linear2(p1)
        p2 = torch.relu(z2)
        out = self.linear3(p2)
        return out
Model test
torch.manual_seed(24)

# instantiate the models
relu_model1 = ReLU_class1()
relu_model2 = ReLU_class2()

# model list containers
model_l = [relu_model1, relu_model2]
name_l = ['relu_model1', 'relu_model2']
def model_comparison(model_l,
                     name_l,
                     train_data,
                     test_data,
                     num_epochs=20,
                     criterion=nn.MSELoss(),
                     optimizer=optim.SGD,
                     lr=0.03,
                     cla=False,
                     eva=mse_cal):
    """Model comparison function.
    :param model_l: list of models
    :param name_l: list of model names
    :param train_data: training data
    :param test_data: test data
    :param num_epochs: number of training epochs
    :param criterion: loss function
    :param optimizer: optimizer class
    :param lr: learning rate
    :param cla: whether this is a classification task
    :param eva: evaluation metric
    :return: tensors of training and test errors
    """
    train_l = torch.zeros(len(model_l), num_epochs)
    test_l = torch.zeros(len(model_l), num_epochs)

    # training loop over epochs and models
    for epochs in range(num_epochs):
        for i, model in enumerate(model_l):
            fit(net=model,
                criterion=criterion,
                optimizer=optimizer(model.parameters(), lr=lr),
                batchdata=train_data,
                epochs=epochs,
                cla=cla)
            train_l[i][epochs] = eva(train_data, model).detach()
            test_l[i][epochs] = eva(test_data, model).detach()
    return train_l, test_l
train_l, test_l = model_comparison(model_l=model_l,
                                   name_l=name_l,
                                   train_data=train_loader,
                                   test_data=test_loader,
                                   num_epochs=num_epochs,
                                   criterion=nn.MSELoss(),
                                   optimizer=optim.SGD,
                                   lr=0.03,
                                   cla=False,
                                   eva=mse_cal)
4.1 Training error plotting
# plot the training error
for i, name in enumerate(name_l):
    plt.plot(list(range(num_epochs)), train_l[i], label=name)
plt.legend(loc=1)
plt.title('mse_train')
Test error plot
# plot the test error
for i, name in enumerate(name_l):
    plt.plot(list(range(num_epochs)), test_l[i], label=name)
plt.legend(loc=1)
plt.title('mse_test')
The model's performance does not improve noticeably; instead it fluctuates more and converges more slowly. If the performance cannot improve because the model is still not complex enough, we can try adding even more hidden layers.
# regenerate the dataset
features, labels = tensorGenReg(w=[2, -1], bias=False, deg=2)

# split it into train and test loaders
train_loader, test_loader = split_loader(features, labels)
# neural network with three hidden layers
class ReLU_class3(nn.Module):
    def __init__(self, in_features=2, n_hidden1=4, n_hidden2=4, n_hidden3=4, out_features=1, bias=True):
        super(ReLU_class3, self).__init__()
        self.linear1 = nn.Linear(in_features, n_hidden1, bias=bias)
        self.linear2 = nn.Linear(n_hidden1, n_hidden2, bias=bias)
        self.linear3 = nn.Linear(n_hidden2, n_hidden3, bias=bias)
        self.linear4 = nn.Linear(n_hidden3, out_features, bias=bias)

    def forward(self, x):
        z1 = self.linear1(x)
        p1 = torch.relu(z1)
        z2 = self.linear2(p1)
        p2 = torch.relu(z2)
        z3 = self.linear3(p2)
        p3 = torch.relu(z3)
        out = self.linear4(p3)
        return out


# neural network with four hidden layers
class ReLU_class4(nn.Module):
    def __init__(self, in_features=2, n_hidden1=4, n_hidden2=4, n_hidden3=4, n_hidden4=4, out_features=1, bias=True):
        super(ReLU_class4, self).__init__()
        self.linear1 = nn.Linear(in_features, n_hidden1, bias=bias)
        self.linear2 = nn.Linear(n_hidden1, n_hidden2, bias=bias)
        self.linear3 = nn.Linear(n_hidden2, n_hidden3, bias=bias)
        self.linear4 = nn.Linear(n_hidden3, n_hidden4, bias=bias)
        self.linear5 = nn.Linear(n_hidden4, out_features, bias=bias)

    def forward(self, x):
        z1 = self.linear1(x)
        p1 = torch.relu(z1)
        z2 = self.linear2(p1)
        p2 = torch.relu(z2)
        z3 = self.linear3(p2)
        p3 = torch.relu(z3)
        z4 = self.linear4(p3)
        p4 = torch.relu(z4)
        out = self.linear5(p4)
        return out
# instantiate the models
relu_model1 = ReLU_class1()
relu_model2 = ReLU_class2()
relu_model3 = ReLU_class3()
relu_model4 = ReLU_class4()

# model list containers
model_l = [relu_model1, relu_model2, relu_model3, relu_model4]
name_l = ['relu_model1', 'relu_model2', 'relu_model3', 'relu_model4']

# core parameters
num_epochs = 20
lr = 0.03
train_l, test_l = model_comparison(model_l=model_l,
                                   name_l=name_l,
                                   train_data=train_loader,
                                   test_data=test_loader,
                                   num_epochs=num_epochs,
                                   criterion=nn.MSELoss(),
                                   optimizer=optim.SGD,
                                   lr=lr,
                                   cla=False,
                                   eva=mse_cal)
# plot the training error
for i, name in enumerate(name_l):
    plt.plot(list(range(num_epochs)), train_l[i], label=name)
plt.legend(loc=4)
plt.title('mse_train')
# plot the test error
for i, name in enumerate(name_l):
    plt.plot(list(range(num_epochs)), test_l[i], label=name)
plt.legend(loc=4)
plt.title('mse_test')
As we stack more ReLU layers, the model does not behave as hoped: not only does the MSE fail to keep decreasing, but model3 and model4 even fail outright. This clearly shows that a more complex model is not always better. As model complexity grows, convergence slows down, the training process fluctuates more, and the model may even fail completely. The problem with the complex models in this experiment is not with the underlying theory itself, but with the lack of the "technical means" needed to handle these issues, namely model optimization methods. In a sense, this also illustrates, from another angle, how important optimization algorithms are.
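One common culprit behind this kind of failure in ReLU networks is hidden units that get stuck outputting zero for every input. As a purely illustrative check (the helper below is hypothetical, not from the original post), we can count how many first-hidden-layer units never activate on the training data:

# hypothetical diagnostic, not part of the original post
def dead_relu_ratio(model, data_loader):
    """Fraction of first-hidden-layer units that output zero for every sample in the loader."""
    X = torch.cat([xb for xb, yb in data_loader])   # stack all features served by the loader
    p1 = torch.relu(model.linear1(X))               # activations of the first hidden layer
    dead = (p1 == 0).all(dim=0)                     # a unit is "dead" if it never activates
    return dead.float().mean().item()

print(dead_relu_ratio(relu_model3, train_loader))
print(dead_relu_ratio(relu_model4, train_loader))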