【Deep Learning】Batch Normalization

Contents

Why do we need BN?

What does BN do?

What does BN solve?

How do we figure out the mean and variance?

BN in CNNs


BN was proposed by Google in 2015 as a technique for training deep neural networks. It not only accelerates model convergence, but also alleviates the vanishing-gradient problem in deep networks to some extent, making deep models easier and more stable to train. BN has since become a standard component of almost all convolutional neural networks.

Batch Normalization (BN), as the name suggests, normalizes each batch of data. During training, a batch of data {x1, x2, ..., xn} can be either the network input or the output of some intermediate layer. Before BN, normalization was generally applied only at the input layer: we computed the mean and variance of the input data and normalized it. BN breaks that convention and lets us normalize at any layer of the network. Because the optimization method we use is mini-batch SGD, the operation is called Batch Normalization.

Why do we need BN?

We know that as the network trains, its parameters are updated. Except for the input layer (whose samples we have already normalized by hand), the input distribution of every layer keeps changing, because updating the parameters of earlier layers changes the distribution of the inputs to later layers. Take the second layer as an example: its input is computed from the parameters and input of the first layer, and the first layer's parameters change throughout training, so the input distribution of every subsequent layer inevitably changes as well. This change of internal data distributions during training is called "Internal Covariate Shift". BN was proposed to address exactly this problem.

What does BN do?

As shown in the figure above, the BN procedure consists of four main steps:

  1. Compute the mean of the current training batch.
  2. Compute the variance of the current training batch.
  3. Use this mean and variance to normalize the batch, giving data with zero mean and unit variance; ε is a small positive constant used to avoid division by zero.
  4. Scale and shift: multiply the normalized value x̂_i by γ and add β to obtain y_i, where γ is the scale factor and β is the shift factor. This step is the essence of BN. Because the normalized x̂_i is basically restricted to a standard normal distribution, the expressive power of the network would decline; to solve this, BN introduces two new parameters, γ and β, which the network learns by itself during training. (A small code sketch of these four steps follows the list.)
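To make the four steps concrete, here is a minimal NumPy sketch of BN's forward pass for a 2-D input of shape (batch, features); the names `gamma`, `beta`, and `eps` are illustrative, not from the original post.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalize x of shape (batch, features), following the four steps."""
    mu = x.mean(axis=0)                      # 1. per-feature mean over the batch
    var = x.var(axis=0)                      # 2. per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # 3. normalize to zero mean, unit variance
    y = gamma * x_hat + beta                 # 4. scale and shift with learned gamma, beta
    return y

# Toy usage: 4 samples, 3 features, gamma/beta initialized to the identity transform
x = np.random.randn(4, 3) * 5.0 + 2.0
y = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0), y.var(axis=0))  # approximately 0 and 1 per feature
```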

What does BN solve?

A standard normalization subtracts the mean and divides by the standard deviation. What does this achieve? Consider the picture below.

The left panel shows the input data without any processing; the curve is the sigmoid function. If the data lie in a region where the gradient is very small, learning will be slow or may even stall for a long time. After subtracting the mean and dividing by the standard deviation, the data move to the central region, as shown in the panel on the right. For most activation functions the gradient in this region is at its largest, or at least non-zero (e.g. ReLU), so this can be seen as an effective way to combat vanishing gradients. If you do this not just for one layer but for every layer, the data distribution always stays in a region that is sensitive to change, so you no longer have to worry about shifting distributions, and training becomes more efficient.

So why is there a fourth step? Isn't subtracting the mean and dividing by the standard deviation enough to get the desired effect? The distribution obtained this way is roughly a standard normal distribution. But can we assume that a standard normal distribution is the best distribution for our training samples? No. The data itself may be highly asymmetric, and zero-mean, unit-variance data is not necessarily best for every activation function. For the sigmoid function, for example, the gradient changes very little between -1 and 1, so the nonlinear transformation is barely exercised; in other words, subtracting the mean and dividing by the standard deviation may actually weaken the expressive power of the network. To handle this, a fourth step is added after the first three to complete true batch normalization.

The essence of BN is thus to use learned parameters to adjust the variance and mean of the distribution, so that the new distribution better matches the true distribution of the data and the nonlinear expressive power of the model is preserved. In the extreme case, the two parameters learn exactly the mini-batch standard deviation and mean, and the batch-normalized output is identical to the original input; in practice they will of course differ. (See the short derivation below.)
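The identity-recovery claim in the extreme case follows directly from the definitions above (this derivation is added here for clarity, it is not in the original post): substituting γ = sqrt(σ_B² + ε) and β = μ_B into the scale-and-shift step gives

```latex
y_i = \gamma \hat{x}_i + \beta
    = \sqrt{\sigma_B^2 + \varepsilon}\,\frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}} + \mu_B
    = x_i
```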

How do we figure out the mean and variance?

During training we compute the mean and variance of each batch of data and then normalize. But what mean and variance do we use at prediction time? For example, when predicting a single sample, how do we compute a mean and a variance at all? In fact, the mean and variance used in the prediction phase also come from the training set. For example, during training we record the mean and variance of every batch; after training we take the expectation of those batch means and variances over the whole training set and use them as BN's mean and variance at prediction time:
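The formula image is missing here; the standard estimates from the BN paper, with m the batch size and the expectation taken over the training batches B, are:

```latex
E[x] \leftarrow E_B\!\left[\mu_B\right], \qquad
\mathrm{Var}[x] \leftarrow \frac{m}{m-1}\, E_B\!\left[\sigma_B^2\right]
```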

In the final test phase, the formula for BN is:
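Again the formula image is missing; in the BN paper the test-time transform is written as a single affine map using the population statistics above:

```latex
y = \frac{\gamma}{\sqrt{\mathrm{Var}[x] + \varepsilon}}\, x
    + \left(\beta - \frac{\gamma\, E[x]}{\sqrt{\mathrm{Var}[x] + \varepsilon}}\right)
```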

As for where BN is placed: in a CNN, BN is applied before the nonlinear activation function, i.e. the input x of the activation g(x) is the result of BN. The forward computation is therefore:
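The formula image is missing here; following the BN paper's notation, with W the weights, u the layer input, b the bias and g the activation, it reads:

```latex
z = g\big(\mathrm{BN}(Wu + b)\big)
```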

In fact, the bias parameter b becomes useless after the BN layer, because it is removed when the mean is subtracted; moreover, the β parameter of the BN layer already serves as a bias term, so b can simply be dropped. The BN layer plus activation therefore becomes:
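Reconstructing the missing formula from the BN paper, with the bias dropped:

```latex
z = g\big(\mathrm{BN}(Wu)\big)
```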

BN in CNNs

Note that everything written above applies to the general (fully connected) case; for convolutional neural networks it is slightly different. Because the features of a convolutional network correspond to entire feature maps, BN should also be done per feature map (per channel) rather than per dimension. For example, at a layer with batch size m and feature-map size w×h, the number of values normalized together for each channel is m×w×h.
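A minimal NumPy sketch of this per-channel normalization for an input of shape (N, C, H, W); the function name and the NCHW layout are assumptions for illustration, not from the original post.

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """Per-channel BN for a conv feature map x of shape (N, C, H, W).

    Each channel's statistics are computed over N*H*W values,
    i.e. over the batch and all spatial positions."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma and beta hold one scalar per channel, broadcast over N, H, W
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

# Toy usage: batch of 8, 16 channels, 32x32 feature maps
x = np.random.randn(8, 16, 32, 32)
y = batch_norm_conv(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=(0, 2, 3))[:3], y.var(axis=(0, 2, 3))[:3])  # ~0 and ~1 per channel
```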

BN plays a very visible role in deep neural networks: when training converges slowly, or runs into "gradient explosion" or other situations that make the network untrainable, BN can be tried as a remedy. In routine use, adding BN also accelerates model training and can even improve model accuracy.