When training a neural network, the batch size has a great influence on the final model performance. Within limits, the larger the batch size, the more stable the training. The batch size is usually set between 8 and 32, but for computationally demanding tasks (such as semantic segmentation or GANs) or very large input images, we can often only afford a batch size of 2 or 4; otherwise, the dreaded “CUDA OUT OF MEMORY” error strikes.

Poverty, as the saying goes, is the ladder of human progress. So how can we train with a larger batch size when computing resources are limited? The answer is gradient accumulation.

Taking PyTorch as an example, the training loop of a neural network usually looks like this:

for i, (inputs, labels) in enumerate(trainloader):
    optimizer.zero_grad()                   # reset gradients to zero
    outputs = net(inputs)                   # forward pass
    loss = criterion(outputs, labels)       # compute the loss
    loss.backward()                         # backward pass, compute gradients
    optimizer.step()                        # update parameters
    if (i + 1) % evaluation_steps == 0:
        evaluate_model()

From the code, we can clearly see how the network is trained:

1. Clear the gradients left over from the previous batch.
2. Forward pass: feed the data through the network and obtain the predictions.
3. Compute the loss from the predictions and the labels.
4. Backward pass: compute the parameter gradients from the loss.
5. Update the network parameters using the computed gradients.

Here’s how gradient accumulation works:

model.zero_grad()                                   # reset gradients to zero
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # forward pass
    loss = loss_function(predictions, labels)       # compute the loss
    loss = loss / accumulation_steps                # normalize the loss (if it is averaged)
    loss.backward()                                 # backward pass, accumulate gradients
    if (i + 1) % accumulation_steps == 0:           # every accumulation_steps mini-batches
        optimizer.step()                            # update parameters
        model.zero_grad()                           # reset gradients to zero
        if (i + 1) % evaluation_steps == 0:         # evaluate the model only when...
            evaluate_model()                        # ...no gradients are accumulated
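
The division by accumulation_steps is worth a note: PyTorch loss functions usually average the loss over the batch, so scaling each mini-batch loss by 1/accumulation_steps makes the accumulated gradient match the gradient of one large batch. The following is a minimal sketch (not part of the code above; the linear model, MSE loss, and names such as big_inputs are purely illustrative) that checks this numerically:

import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
criterion = torch.nn.MSELoss()                 # averages the loss over the batch

big_inputs = torch.randn(8, 10)                # one "large" batch of 8 samples
big_labels = torch.randn(8, 1)

# Gradient of the large batch in a single backward pass
model.zero_grad()
criterion(model(big_inputs), big_labels).backward()
big_grad = model.weight.grad.clone()

# The same samples split into 4 mini-batches of 2, with loss / accumulation_steps
accumulation_steps = 4
model.zero_grad()
for inputs, labels in zip(big_inputs.chunk(accumulation_steps),
                          big_labels.chunk(accumulation_steps)):
    loss = criterion(model(inputs), labels) / accumulation_steps
    loss.backward()                            # gradients add up in .grad

print(torch.allclose(big_grad, model.weight.grad, atol=1e-6))  # expect True

If the loss is summed over the batch instead of averaged, the division by accumulation_steps is unnecessary.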

1. Forward pass: feed the data through the network and obtain the predictions.
2. Compute the loss from the predictions and the labels.
3. Backward pass: compute the parameter gradients from the loss.
4. Repeat steps 1-3 without clearing the gradients, so that the gradients accumulate.
5. After the gradients have accumulated for a fixed number of steps, update the parameters and then reset the gradients to zero.

To sum up, with gradient accumulation the gradient of every batch is still computed, but the gradients are not cleared right away; instead they keep accumulating, and only after a certain number of batches (accumulation_steps) are the network parameters updated and the gradients reset to zero.
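
In other words, the effective batch size becomes the per-step batch size multiplied by accumulation_steps. A hypothetical example of choosing the value (the variable names below are only for illustration):

# Hypothetical configuration: memory only allows 4 samples per forward/backward
# pass, but we would like to train as if the batch size were 32.
per_step_batch_size = 4
target_batch_size = 32
accumulation_steps = target_batch_size // per_step_batch_size   # -> 8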

By delaying the parameter update in this way, we can achieve an effect similar to training with a large batch size. In my own experiments I routinely use gradient accumulation, and in most cases the model trained with gradient accumulation performs much better than one trained with a small batch size.