Gradient accumulation is a trick for when GPU memory is limited, so a large batch size cannot be used.

In PyTorch, a normal training loop looks like this:

import torch
import math

def set_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Fix the random seed
set_seed(0)
dtype = torch.float
device = torch.device("cpu")

# Generate the training data
x1 = torch.linspace(-math.pi, math.pi, 500, device=device, dtype=dtype)
y1 = 2 * x1
x2 = torch.linspace(-math.pi, math.pi, 500, device=device, dtype=dtype)
y2 = 2 * x2
data = [[x1, y1], [x2, y2]]
a = torch.randn((), device=device, dtype=dtype, requires_grad=True)
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-3
optimizer = torch.optim.RMSprop([a], lr=learning_rate)
for epoch in range(500):
    for i, (x, y) in enumerate(data):
        y_pred = a * x
        loss = loss_fn(y_pred, y)
        optimizer.zero_grad()
        loss.backward()
        print(a.grad)
        optimizer.step()
    break
print(a)

This fits the function y = 2x with a single parameter a. The loop prints the gradient of a twice, once per batch:

tensor(1516.1149)
tensor(1483.0844)

If we delete the optimizer.zero_grad() line, the two printed gradients of a become:

tensor(1516.1149)
tensor(2999.1992)
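The second value is the first gradient plus the gradient of the second batch, because backward() adds into .grad instead of overwriting it. A minimal sketch that checks this behavior in isolation, assuming a scalar parameter `w` and a small `toy_loss` function (both are illustrative names, not part of the example above):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)   # hypothetical scalar parameter
x = torch.tensor([1.0, 2.0, 3.0])

def toy_loss(w):
    # Sum of squared errors against the target y = 2x
    return ((w * x - 2 * x) ** 2).sum()

toy_loss(w).backward()
first = w.grad.clone()        # gradient after one backward pass

toy_loss(w).backward()        # no zero_grad() in between
second = w.grad.clone()       # .grad now holds the sum of both passes

w.grad.zero_()                # explicit reset
toy_loss(w).backward()
after_reset = w.grad.clone()  # back to a single-pass gradient

print(first, second, after_reset)
```

Nothing updates `w` between the passes here, so the accumulated gradient is exactly twice the single-pass one. In the training loop above, optimizer.step() changes a between batches, which is why the two per-batch gradients differ slightly.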

If you don't call zero_grad() in PyTorch, gradients keep accumulating across backward passes. We can take advantage of this with a small change to the code. Rather than adapting the example above, here is a generic gradient accumulation template:

model.zero_grad()                                   # Reset gradient tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradient tensors
        if (i+1) % evaluation_steps == 0:           # Evaluate the model when we...
            evaluate_model()                        # ...have no gradients accumulated

loss.backward() computes the gradients, and optimizer.step() updates the parameters. So we call loss.backward() several times before optimizer.step(), without zero_grad() clearing the gradients in between, then update the parameters once and clear the gradients. In this way, small batches are used to simulate training with a large batch size.
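To see why the loss is divided by accumulation_steps, the sketch below compares the gradient from one large batch with the gradient accumulated over smaller chunks of the same data. It assumes a mean-reduced loss (the MSELoss default) and a toy linear model; `full_grad`, `acc_grad`, and the tensor sizes are illustrative choices made here, not part of the template. With reduction='sum', as in the first example, the division would not be needed.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)        # toy model, for illustration only
loss_fn = torch.nn.MSELoss()         # default reduction='mean'
inputs = torch.randn(8, 4)
labels = torch.randn(8, 1)
accumulation_steps = 4               # 4 micro-batches of 2 samples each

# Gradient from a single batch of 8 samples
model.zero_grad()
loss_fn(model(inputs), labels).backward()
full_grad = model.weight.grad.clone()

# Gradient accumulated over 4 micro-batches
model.zero_grad()
micro_batches = zip(inputs.chunk(accumulation_steps),
                    labels.chunk(accumulation_steps))
for x, y in micro_batches:
    loss = loss_fn(model(x), y) / accumulation_steps  # normalize the averaged loss
    loss.backward()                                   # adds into .grad
acc_grad = model.weight.grad.clone()

# The two gradients should agree up to floating-point error
print(torch.allclose(full_grad, acc_grad, atol=1e-6))
```

The match holds when the micro-batches are equal in size. Layers that depend on batch statistics, such as batch normalization, still see only the small batch, so accumulation does not reproduce large-batch behavior for them.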

For a TensorFlow version, see: gchlebus.github.io/2018/06/05/…