An In-Depth Understanding of Batch Normalization

1. What is the background of BN?

An important assumption in statistical learning is that the distribution of the inputs stays relatively stable. If this assumption does not hold, the model converges slowly or not at all. For general statistical learning problems, it is therefore a common trick to normalize (whiten) the data before training.

But this problem becomes harder to solve in deep neural networks. A network is hierarchical: each layer can be regarded as a separate classifier, and the whole network as a cascade of classifiers. This means that during training, as the parameters of one layer change, the distribution of its output also changes, which makes the input distribution of the next layer unstable. Each layer has to keep adapting to a new input distribution, and this makes the model difficult to converge.

Pre-processing the data fixes the distribution of the first layer's input, but not that of the hidden layers; this is the Internal Covariate Shift problem, and it is exactly what Batch Normalization addresses.

In addition, the gradient magnitude in an ordinary neural network is often tied to the magnitude of the parameters (through the affine transformations), and it fluctuates heavily during training, so the learning rate cannot be set too large. Batch Normalization keeps the gradient magnitude relatively stable, which allows us to use higher learning rates to some extent.

Figure: no normalization applied (left) vs. Batch Normalization applied (right)

2. How does BN work?

Given a mini-batch of size N as input, the value y computed by the following four formulas is the Batch Normalization (BN) output.
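The equation images are not reproduced here, so the standard BN definitions from the original paper (reference 1 in Section 7) are written out below; the numbering matches the walkthrough that follows:

```latex
\mu_B = \frac{1}{N}\sum_{i=1}^{N} x_i \tag{2.1}

\sigma_B^2 = \frac{1}{N}\sum_{i=1}^{N} \left(x_i - \mu_B\right)^2 \tag{2.2}

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \tag{2.3}

y_i = \gamma\,\hat{x}_i + \beta \tag{2.4}
```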

After step (2.3), the data has zero mean and unit variance within the mini-batch, so it is often said to "look like" a standard Gaussian.

First, the mini-batch mean and variance are computed by (2.1) and (2.2), and then the normalization in (2.3) is applied. A small constant ε is added to the denominator to avoid division by zero. In the whole process, only the last step (2.4) introduces additional parameters, γ and β, whose size equals the number of features, i.e., the same shape as each x_i.

The BN layer is usually inserted after the linear transformation of a hidden layer and before its activation function. If we view (2.4) together with the subsequent activation function, the pair can be regarded as a complete layer (linear + activation). (Note that BN's "linear transformation" is still different from an ordinary hidden layer's: the former is element-wise, the latter is a matrix multiplication.)
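To make the placement concrete, here is a minimal sketch of a hidden layer with BN inserted between the linear transform and the activation. It assumes a ReLU activation and reuses the batchnorm_forward routine shown in Section 3; hidden_layer_forward is an illustrative helper name, not from the original article.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def hidden_layer_forward(x, W, b, gamma, beta, bn_param):
    # ordinary hidden-layer linear transform: a matrix multiplication
    h = x @ W + b
    # BN's "linear transform" (2.4) is an element-wise scale and shift
    h_bn, cache = batchnorm_forward(h, gamma, beta, bn_param)
    # the activation comes after BN
    return relu(h_bn), cache
```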

At this point, the normalized value x̂ (the output of (2.3)) can be viewed as the input to this layer, and it has a fixed mean and a fixed variance. This addresses the Internal Covariate Shift.

In addition, γ and β preserve the representational power of the data. Normalization inevitably changes the distribution, which can lose some of the expressive power of the learned features. By introducing the learnable parameters γ and β, the network can recover the original distribution: in the extreme case, it can train γ and β to the standard deviation and mean of the original distribution. This guarantees that introducing BN does not make things worse.
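A quick (hypothetical) numerical check of the "extreme case" mentioned above: if γ is set to the batch standard deviation and β to the batch mean, BN reproduces the original input exactly.

```python
import numpy as np

x = np.random.randn(8, 4) * 3.0 + 5.0      # a mini-batch with non-zero mean and variance
eps = 1e-5
mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)      # normalization step (2.3)

gamma, beta = np.sqrt(var + eps), mu       # the extreme case: undo the normalization
y = gamma * x_hat + beta                   # scale-and-shift step (2.4)

print(np.allclose(y, x))                   # True: the original distribution is recovered
```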

3. How is BN implemented?

We split the Batch Normalization implementation into two passes: forward (training mode only here) and backward.

The forward pass takes x, a mini-batch of data; gamma and beta, the BN layer parameters; and bn_param, a dictionary containing eps, momentum, and the running_mean and running_var used at inference time. It returns the BN layer output out and the intermediate variables cache needed by the backward pass, and writes the updated moving averages back into bn_param.

The backward pass takes dout, the error signal from the layer above, and cache, the intermediate variables stored during the forward pass; it returns the partial derivatives dx, dgamma, and dbeta.

The only difference between the implementation and the per-element derivation is that the implementation operates on the entire batch at once (it is vectorized).
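For reference, these are the simplified per-element gradients that the code below vectorizes over the batch (a standard derivation, as in reference 2 of Section 7; the formulas were not reproduced in the original text):

```latex
\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{N} \frac{\partial L}{\partial y_i}\,\hat{x}_i,
\qquad
\frac{\partial L}{\partial \beta} = \sum_{i=1}^{N} \frac{\partial L}{\partial y_i},
\qquad
\frac{\partial L}{\partial \hat{x}_i} = \gamma\,\frac{\partial L}{\partial y_i}

\frac{\partial L}{\partial x_i} =
\frac{1}{N\sqrt{\sigma_B^2 + \epsilon}}
\left(
    N\,\frac{\partial L}{\partial \hat{x}_i}
  - \sum_{j=1}^{N}\frac{\partial L}{\partial \hat{x}_j}
  - \hat{x}_i \sum_{j=1}^{N}\frac{\partial L}{\partial \hat{x}_j}\,\hat{x}_j
\right)
```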

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, bn_param):
    # read some useful parameters
    N, D = x.shape
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)   # default momentum for the moving averages
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    # BN forward pass
    sample_mean = x.mean(axis=0)
    sample_var = x.var(axis=0)
    x_ = (x - sample_mean) / np.sqrt(sample_var + eps)
    out = gamma * x_ + beta

    # update moving averages (used at inference time)
    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    # store variables for the backward pass
    cache = (x_, gamma, x - sample_mean, sample_var + eps)

    return out, cache


def batchnorm_backward(dout, cache):
    # extract variables
    N, D = dout.shape
    x_, gamma, x_minus_mean, var_plus_eps = cache

    # calculate gradients
    dgamma = np.sum(x_ * dout, axis=0)
    dbeta = np.sum(dout, axis=0)

    dx_ = np.matmul(np.ones((N, 1)), gamma.reshape((1, -1))) * dout
    dx = N * dx_ - np.sum(dx_, axis=0) - x_ * np.sum(dx_ * x_, axis=0)
    dx *= (1.0 / N) / np.sqrt(var_plus_eps)

    return dx, dgamma, dbeta
```
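The code above only covers the training-time behaviour mentioned earlier. At inference time, the stored moving averages are used in place of the batch statistics; a minimal sketch of that test-time pass (batchnorm_inference is a hypothetical helper, not part of the original code):

```python
def batchnorm_inference(x, gamma, beta, bn_param):
    # normalize with the moving averages accumulated during training,
    # so the result does not depend on the (possibly tiny) test batch
    eps = bn_param.get('eps', 1e-5)
    x_ = (x - bn_param['running_mean']) / np.sqrt(bn_param['running_var'] + eps)
    return gamma * x_ + beta
```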

 

4. What are the advantages of BN?

  • Faster convergence.
  • Reduced sensitivity to the initial weights.
  • More robust to hyperparameter choices.
  • Requires less data to generalize.

5. What are the disadvantages of BN?

  • Unstable when using a small batch size: Batch Normalization must compute the mean and variance over the batch in order to normalize the previous layer's output. This statistical estimate is fairly accurate when the batch size is relatively large, but its accuracy keeps dropping as the batch size shrinks.

The above describes the validation error curves for ResNet-50. If the batch size is kept at 32, the final validation error is around 23, and the error keeps growing as the batch size decreases (the batch size cannot be 1, since BN would then be averaging over a single sample). The gap is considerable (about 10 percentage points).
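A small (hypothetical) numpy experiment makes the estimation problem concrete: the spread of the batch-mean estimate of a standard-normal feature grows as the batch shrinks, so the statistics BN normalizes with become noisier for small batches.

```python
import numpy as np

rng = np.random.default_rng(0)
for batch_size in (2, 8, 32, 128):
    # draw 10,000 batches and look at how much the estimated batch mean fluctuates
    batch_means = rng.standard_normal((10_000, batch_size)).mean(axis=1)
    print(batch_size, round(batch_means.std(), 3))   # spread shrinks roughly as 1/sqrt(batch size)
```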

If batch size is an issue, why not just use a larger batch? We cannot use a larger batch in every case. When fine-tuning, we cannot use large batches, lest the overly large gradients harm the model. In distributed training, a large batch is ultimately split across the instances as a set of small batches.

  • Results in increased training time: experiments conducted by NVIDIA and Carnegie Mellon University showed that although Batch Normalization itself is not computationally expensive and reduces the total number of iterations needed for convergence, the time per iteration increases significantly, and it grows further as the batch size increases.

Batch Normalization consumes roughly a quarter of the total training time. The reason is that it requires two passes over the input data: one to compute the batch statistics, and another to normalize the output.

  • Different results in training and inference: consider, for example, object detection in the real world. When training an object detector we usually use a large batch (YOLOv4 and Faster R-CNN are both trained with a default batch size of 64). But once in production, these models do not perform as well as they did during training. This is because they were trained with a large batch, whereas at inference time the batch size is 1, since frames must be processed one by one. Given this limitation, some implementations prefer to use pre-computed means and variances based on the training set. Another possibility is to compute the mean and variance from your test-set distribution.
  • Not good for online learning: online learning is a technique in which the system is trained incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn in real time as new data arrives.

Because online learning relies on an external data source, the data may arrive individually or in batches of varying size. Since the batch size changes at every iteration, the learned scale and shift of the input data generalize poorly, which ultimately hurts performance.

  • Not good for recurrent neural networks:

    Although Batch Normalization significantly speeds up training and improves generalization in convolutional neural networks, it is hard to apply to recurrent architectures. It can be applied across the RNN stack, where normalization is applied "vertically", i.e., to the output of each RNN layer. But it cannot be applied "horizontally", i.e., between time steps, because the repeated rescaling produces exploding gradients and hurts training.

    Note: some research experiments suggest that Batch Normalization makes neural networks more prone to adversarial vulnerabilities, but we do not include this point here due to the lack of research and evidence.

6. Alternative methods:

In cases where batch normalization does not work well, there are several alternatives.

  • Layer Normalization (see the sketch after this list)
  • Instance Normalization
  • Group Normalization (+ weight standardization)
  • Synchronous Batch Normalization
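
As a quick contrast with the batch-wise statistics used above, here is a minimal sketch of Layer Normalization (layernorm_forward is an illustrative helper, not from the original article): the statistics are computed per sample across its features, so the result is independent of the batch size and works for batch size 1, online learning, and recurrent networks.

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    # normalize each sample over its own features (axis=1),
    # instead of normalizing each feature over the batch (axis=0) as BN does
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    x_ = (x - mean) / np.sqrt(var + eps)
    return gamma * x_ + beta
```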

7. Reference materials

  1. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
  2. Deriving the Gradient for the Backward Pass of Batch Normalization
  3. CS231n Convolutional Neural Networks for Visual Recognition
  4. Towardsdatascience.com/curse-of-ba…