Ten optimizers are commonly used (construction examples follow in the Introduction below):
- torch.optim.SGD — stochastic gradient descent (momentum optional)
- torch.optim.ASGD — averaged stochastic gradient descent
- torch.optim.Rprop — resilient backpropagation; intended for full-batch training, not mini-batches
- torch.optim.Adagrad <recommended> — an adaptive method that assigns each parameter its own learning rate. The rate is adjusted according to the size of the gradients and the number of iterations: the larger the accumulated gradient, the smaller the learning rate, and vice versa. Its drawback is that late in training the learning rate becomes too small.
- torch.optim.Adadelta <recommended> — an improvement on Adagrad that avoids the learning rate becoming too small late in training
- torch.optim.RMSprop <recommended>
- torch.optim.Adam (AMSGrad) — an adaptive learning-rate method. Adam dynamically adjusts each parameter's learning rate using first- and second-moment estimates of the gradient, with additional constraints so that the learning rate remains positive throughout training.
- torch.optim.Adamax — a variant of Adam that places an upper bound on the learning rate
- torch.optim.SparseAdam
- torch.optim.LBFGS
Introduction
torch.optim is a package that implements a variety of optimization algorithms.
To use torch.optim, you construct an Optimizer object. This object holds the current parameter state and updates the parameters based on the computed gradients.
optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr = 0.0001)
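The other optimizers from the list above are constructed the same way. A minimal sketch follows; net here is an assumed placeholder model (not from the original post), and note that LBFGS additionally expects a closure that re-evaluates the loss when step() is called:

import torch

net = torch.nn.Linear(10, 2)  # placeholder model for illustration only

opt_adagrad  = torch.optim.Adagrad(net.parameters(), lr=0.01)
opt_adadelta = torch.optim.Adadelta(net.parameters())               # default lr=1.0, rarely tuned by hand
opt_rmsprop  = torch.optim.RMSprop(net.parameters(), lr=0.01, alpha=0.99)
opt_amsgrad  = torch.optim.Adam(net.parameters(), lr=1e-3, amsgrad=True)
opt_adamax   = torch.optim.Adamax(net.parameters(), lr=2e-3)
opt_lbfgs    = torch.optim.LBFGS(net.parameters(), lr=1.0)           # step() must be given a closure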
Setting options separately for each parameter group
The model.base parameters will use a learning rate of 1e-2, the model.classifier parameters will use a learning rate of 1e-3, and a momentum of 0.9 applies to all parameters.
optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3},
], lr=1e-2, momentum=0.9)
Optimization algorithms
Adagrad (Adaptive Gradient) <recommended>
For each parameter, the squares of its gradients from every iteration are accumulated and the square root of that sum is taken; the base learning rate is divided by this value, which dynamically adapts the learning rate per parameter.
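As a rough, illustrative sketch of that rule (not the actual torch.optim.Adagrad implementation; the function and state names are made up, and the arguments are assumed to be torch tensors):

def adagrad_step(param, grad, state_sum, base_lr=0.01, eps=1e-10):
    # Accumulate the square of every gradient seen so far for this parameter.
    state_sum += grad ** 2
    # Divide the base learning rate by the square root of the accumulated sum:
    # parameters with large accumulated gradients get smaller effective steps.
    param -= base_lr * grad / (state_sum.sqrt() + eps)
    return param, state_sum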
RMSprop <recommended>
To counter the rapid decay of Adagrad's learning rate, RMSprop applies exponential decay to the accumulated squared-gradient information instead of summing it indefinitely.
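An illustrative sketch of the difference (again, made-up names, torch tensors assumed — not the library implementation):

def rmsprop_step(param, grad, square_avg, lr=0.01, alpha=0.99, eps=1e-8):
    # Exponentially decayed average of squared gradients instead of an unbounded sum.
    square_avg = alpha * square_avg + (1 - alpha) * grad ** 2
    param -= lr * grad / (square_avg.sqrt() + eps)
    return param, square_avg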
Adadelta <recommended>
Adadelta is an extension of Adagrad that does not require an initial learning rate to be set; instead it uses the magnitude of previous update steps to estimate the next step size.
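A sketch of the idea (illustrative only; names are made up and torch tensors are assumed):

def adadelta_step(param, grad, square_avg, acc_delta, rho=0.9, eps=1e-6):
    # Running average of squared gradients, as in RMSprop.
    square_avg = rho * square_avg + (1 - rho) * grad ** 2
    # The step is scaled by the RMS of previous updates, so no base learning rate is needed.
    delta = (acc_delta + eps).sqrt() / (square_avg + eps).sqrt() * grad
    # Running average of squared updates, used to scale the next step.
    acc_delta = rho * acc_delta + (1 - rho) * delta ** 2
    param -= delta
    return param, square_avg, acc_delta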
Momentum
Simulates the physical concept of momentum: previous updates are accumulated into a velocity term that replaces the raw gradient. Updates in a consistent direction are reinforced, while updates in conflicting directions are damped.
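A minimal sketch of the update (illustrative, with made-up names; torch tensors assumed):

def momentum_step(param, grad, velocity, lr=0.1, momentum=0.9):
    # Accumulate past gradients into a velocity term that replaces the raw gradient.
    velocity = momentum * velocity + grad
    # Consistent gradient directions build up speed; opposing directions cancel out.
    param -= lr * velocity
    return param, velocity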
Nesterov Accelerated Gradient (NAG)
First take a big jump in the direction of the accumulated momentum, then compute the gradient at that look-ahead position and apply it as a correction.
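In torch.optim this behaviour is available through the nesterov flag of SGD (it requires a non-zero momentum); using the mynet model from the training code below:

optimizer = torch.optim.SGD(mynet.parameters(), lr=0.1, momentum=0.9, nesterov=True)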
Training code
Classification problem training code
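The code below assumes a model named mynet has already been defined; the original post does not show it, so here is one possible minimal definition (a small network with one input feature and two output logits, matching CrossEntropyLoss):

import torch

mynet = torch.nn.Sequential(
    torch.nn.Linear(1, 10),   # x has shape [100, 1], i.e. one input feature
    torch.nn.ReLU(),
    torch.nn.Linear(10, 2),   # two logits for the two classes y0 / y1
)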
#----------------data------------------
data_num = 100
x = torch.unsqueeze(torch.linspace(-1, 1, data_num), dim=1)
y0 = torch.zeros(50)
y1 = torch.ones(50)
y = torch.cat((y0, y1)).type(torch.LongTensor)
#----------------train-----------------
optimizer = torch.optim.SGD(mynet.parameters(), lr=0.1)  # optimizer
loss_func = torch.nn.CrossEntropyLoss()                  # loss function
for epoch in range(1000):
    optimizer.zero_grad()
    # forward + backward + optimize
    pred = mynet(x)
    loss = loss_func(pred, y)
    loss.backward()
    optimizer.step()
#----------------prediction---------------
test_data = torch.tensor([-1.0])
pred = mynet(test_data)
print(test_data, pred.data)
Regression problem training code
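As before, mynet is assumed rather than shown in the original; for regression it needs a single continuous output, for example:

mynet = torch.nn.Sequential(
    torch.nn.Linear(1, 10),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 1),   # one output value approximating y = x^2
)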
#--------------data--------------------
x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)
y = x.pow(2)
#--------------train-------------------
optimizer = torch.optim.SGD(mynet.parameters(), lr=0.1)  # optimizer
loss_func = torch.nn.MSELoss()                           # loss function
for epoch in range(1000):
    optimizer.zero_grad()
    # forward + backward + optimize
    pred = mynet(x)
    loss = loss_func(pred, y)
    loss.backward()
    optimizer.step()
#----------------prediction---------------
test_data = torch.tensor([-1.0])
pred = mynet(test_data)
print(test_data, pred.data)
optimizer.zero_grad() must be called to clear the gradients before each backward pass; otherwise gradients from successive iterations accumulate and training will not converge as expected.
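A tiny, self-contained illustration of that accumulation behaviour (the values in the comment are what this snippet prints):

import torch

w = torch.ones(1, requires_grad=True)
for _ in range(3):
    loss = (w * 2).sum()
    loss.backward()
    print(w.grad)       # tensor([2.]), then tensor([4.]), then tensor([6.]) -- gradients add up
    # w.grad.zero_()    # uncommenting this (or calling optimizer.zero_grad()) resets them each step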