Article from the WeChat public account "Machine Learning Alchemy".

In fact, I already covered the BN layer in the earlier article on gradient explosion, but since I have been asked about it many times in interviews, here is a more complete explanation.

Internal Covariate Shift (ICS)

The authors of the original Batch Normalization paper give a formal definition of Internal Covariate Shift: during the training of a deep network, the distribution of each internal node's activations changes as the network parameters change.

Here is a simple mathematical description. For a fully connected network, layer $i$ can be written as:

$Z^i = W^i \times input^i + b^i$

$input^{i+1} = g^i(Z^i)$

  • The first formula is a simple linear transformation;
  • The second formula applies the activation function.

We know that the parameters $W^i, b^i$ of each layer are updated continually by gradient descent, which means the distribution of $Z^i$ keeps changing, and therefore so does the distribution of $input^{i+1}$. In other words, except for the input of the first layer, the input distribution of every layer shifts as the model parameters are updated, and each layer must constantly adapt to these shifts. This is Internal Covariate Shift.
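To make this concrete, here is a minimal sketch (not code from the article; the random perturbation of `W1` stands in for a real gradient update) showing how the distribution of layer 2's pre-activations $Z^2$ drifts whenever layer 1's parameters change:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 10))               # a fixed batch of input data

W1 = rng.normal(scale=0.5, size=(10, 10))    # parameters of layer 1
W2 = rng.normal(scale=0.5, size=(10, 10))    # parameters of layer 2

for step in range(3):
    h1 = np.tanh(x @ W1)                     # input^2 = g(Z^1)
    z2 = h1 @ W2                             # Z^2 = W^2 * input^2
    print(f"step {step}: mean of Z^2 = {z2.mean():+.3f}, std = {z2.std():.3f}")
    # pretend a gradient step changed layer 1's parameters:
    W1 += 0.3 * rng.normal(size=W1.shape)
# layer 2 sees a different input distribution after every update: that is ICS
```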

Problems that BN solves

[Slow convergence caused by ICS] Because the parameters of every layer keep changing, the distribution of every layer's outputs keeps changing too, and later layers have to keep adapting to these distribution shifts, which slows down learning for the whole network.

[Gradient saturation problem] Because saturating activation functions such as sigmoid and tanh are common in neural networks, training risks falling into their saturated regions, where gradients are tiny. There are two ways to deal with this. The first is to use a non-saturating activation function, such as the rectified linear unit (ReLU), which avoids the saturated region to some extent. The second is to keep the input distribution of the activation function stable so that it rarely enters the saturated region, and this is what Normalization does.

Batch Normalization

As the name Batch Normalization suggests, the normalization is performed over a batch of data.

Now suppose a batch contains 3 samples, each with two features: (1,2), (2,3), (0,1).

A simple normalization computes the mean and variance of each feature, then subtracts the mean and divides by the standard deviation, giving each feature zero mean and unit variance.

For the first feature:

$\mu = \frac{1}{3}(1 + 2 + 0) = 1$

$\sigma^2 = \frac{1}{3}\left((1-1)^2 + (2-1)^2 + (0-1)^2\right) = 0.67$

[General formula]

$\mu = \frac{1}{m}\sum_{j=1}^{m} Z_j$

$\sigma^2 = \frac{1}{m}\sum_{j=1}^{m} (Z_j - \mu)^2$

$\hat{Z} = \frac{Z - \mu}{\sqrt{\sigma^2 + \epsilon}}$

  • $m$ is the batch size, and $Z_j$ is the $j$-th sample in the batch;
  • $\epsilon$ is a small constant that prevents the denominator from being 0.

So far, we have made each feature's distribution have mean 0 and variance 1. Every layer now sees inputs with the same distribution, so the ICS problem goes away.
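As a quick sanity check, the worked example above can be reproduced in a few lines of NumPy (a minimal sketch, using the same three samples):

```python
import numpy as np

batch = np.array([[1., 2.],
                  [2., 3.],
                  [0., 1.]])          # 3 samples, 2 features

mu = batch.mean(axis=0)               # per-feature mean     -> [1. 2.]
var = batch.var(axis=0)               # per-feature variance -> [0.667 0.667]
eps = 1e-5
z_hat = (batch - mu) / np.sqrt(var + eps)

print(mu, var)                                 # matches the hand calculation
print(z_hat.mean(axis=0), z_hat.var(axis=0))   # ~0 mean, ~1 variance per feature
```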

As mentioned above, Normalization mitigates ICS by stabilizing the input distribution of every layer of the network, but it also costs representational power. If the distribution at every layer is the same, for every task, what is left for the model to learn?

[Disadvantages of zero-mean, unit-variance data]

  1. The data loses representational power;
  2. With mean 0 and variance 1 at every layer, the inputs to a sigmoid or tanh activation tend to fall in the near-linear region of the nonlinearity. (Neither the linear region nor the saturated region is ideal; what we want is the nonlinear region.)

To solve this problem, the BN layer introduces two learnable parameters, $\gamma$ and $\beta$. As a result, the output of the BN layer has mean $\beta$ and variance $\gamma^2$.

So for a certain layer of the network, we now have a process like this:


  1. $Z = W \times input^i + b$

  2. $\hat{Z} = \gamma \times \frac{Z - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$

  3. $input^{i+1} = g(\hat{Z})$

(The layer index $i$ is omitted in most of the formulas above. Together they describe how layer $i$ produces the input data of layer $i+1$.)
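Putting the three steps together, here is a minimal sketch of a training-time forward pass for one layer with BN (the function name, the tanh activation, and the shapes are illustrative; in a real network $\gamma$ and $\beta$ would be updated by backpropagation):

```python
import numpy as np

def bn_layer_forward(inp, W, b, gamma, beta, eps=1e-5):
    """One layer at training time: linear transform -> BN -> activation (steps 1-3)."""
    Z = inp @ W + b                                        # 1. Z = W * input + b
    mu = Z.mean(axis=0)                                    # batch mean, per feature
    var = Z.var(axis=0)                                    # batch variance, per feature
    Z_hat = gamma * (Z - mu) / np.sqrt(var + eps) + beta   # 2. normalize, scale, shift
    return np.tanh(Z_hat)                                  # 3. input of the next layer

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))                # a batch of 32 samples, 4 features
W, b = rng.normal(size=(4, 8)), np.zeros(8)
gamma, beta = np.ones(8), np.zeros(8)       # learnable; trained by gradient descent
print(bn_layer_forward(x, W, b, gamma, beta).shape)   # (32, 8)
```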

BN for the test phase

We know that BN computes $\mu$ and $\sigma^2$ at each layer from the current batch of training data, but this raises a question: at prediction time we may have only one sample, or just a few, far fewer than during training. How should $\mu$ and $\sigma^2$ be computed then?

In fact, after the model has been trained, every BN layer has already recorded the $\mu$ and $\sigma^2$ of each training batch. These are used to estimate the test-time statistics $\mu_{test}$ and $\sigma_{test}^2$ from the whole training set:

$\mu_{test} = E(\mu_{train})$

$\sigma_{test}^2 = \frac{m}{m-1} E(\sigma_{train}^2)$

The BN layer at test time then becomes:

$\hat{Z} = \gamma \times \frac{Z - \mu_{test}}{\sqrt{\sigma_{test}^2 + \epsilon}} + \beta$

Of course, averaging the per-batch statistics is not the only way to estimate $\mu$ and $\sigma^2$ from the training set. Andrew Ng also suggests in his course that an exponentially weighted average can be used. The idea is the same: estimate the test-time mean and variance from the entire training set.
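Below is a sketch of how the running estimates could be kept with the exponentially weighted average mentioned above (the class name, the momentum value 0.9, and the method names are all illustrative, not from the article):

```python
import numpy as np

class RunningStats:
    """Exponentially weighted average of per-batch mean and variance for one BN layer."""
    def __init__(self, num_features, momentum=0.9):
        self.momentum = momentum
        self.mean = np.zeros(num_features)
        self.var = np.ones(num_features)

    def update(self, z):                     # call once per training batch with that batch's Z
        m = self.momentum
        self.mean = m * self.mean + (1 - m) * z.mean(axis=0)
        self.var = m * self.var + (1 - m) * z.var(axis=0)

    def normalize(self, z, gamma, beta, eps=1e-5):   # used at test time
        return gamma * (z - self.mean) / np.sqrt(self.var + eps) + beta
```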

What are the benefits of the BN layer

  1. BN makes the distribution of input data in each layer of the network relatively stable and accelerates the model learning speed.

By standardizing and then linearly transforming each layer's input, BN keeps the mean and variance of every layer's input within a fixed range. Later layers no longer have to keep adapting to changes in the outputs of earlier layers, which decouples the layers from one another, lets each layer learn more independently, and speeds up learning for the whole network.

  2. BN allows networks to use saturating activation functions (e.g. sigmoid, tanh) by mitigating the vanishing gradient problem

Normalization keeps the activation function's input in the unsaturated region, alleviating vanishing gradients. In addition, the learned $\gamma$ and $\beta$ let the data retain more of its original information.

  3. BN has a certain regularization effect

During Batch Normalization, the mean and variance of each mini-batch serve as estimates of the mean and variance of the whole training set. Although every batch is drawn from the same overall sample, the mean and variance differ from mini-batch to mini-batch, which adds random noise to the network's learning process.

Comparison of BN with other normalizations

[Weight Normalization] Weight Normalization (WN) normalizes the network weights themselves, using the L2 norm.

It has the following advantages over BN:

  1. WN accelerates the convergence of the network parameters by reparameterizing the network weights and does not depend on the mini-batch. Because BN is computed per mini-batch it cannot be used for RNNs, whereas WN can. BN also has to store the mean and variance of every batch, so WN saves memory;
  2. BN has a regularizing effect, but the noise it adds makes it unsuitable for noise-sensitive models such as reinforcement learning and GANs; WN introduces less noise.

However, WN requires special care in the choice of parameter initialization.
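For reference, the reparameterization behind WN is simple: each weight vector $w$ is rewritten as $w = \frac{g}{\lVert v \rVert} v$, where the scale $g$ and the direction $v$ are learned separately. A minimal sketch (the variable names are illustrative):

```python
import numpy as np

def weight_norm(v, g):
    """Weight Normalization reparameterization: w = g * v / ||v||_2."""
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])        # direction parameter (learned)
g = 2.0                         # scale parameter (learned)
w = weight_norm(v, g)
print(w, np.linalg.norm(w))     # [1.2 1.6], and ||w|| == g == 2.0
```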


[Layer Normalization] A more common comparison is between BN and Layer Normalization (LN). Compared with LN, BN has the following disadvantages:

  1. BN cannot be used for online learning, because the mini-batch size in online learning is 1; LN can.
  2. As mentioned earlier, BN cannot be used in RNNs; LN can.
  3. BN consumes extra memory to record the mean and variance; LN does not.

However, LN does not achieve better results than BN in CNNs.
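The essential difference between the two can be shown in one line each: BN normalizes each feature over the batch dimension, while LN normalizes each sample over the feature dimension (a sketch ignoring $\gamma$ and $\beta$):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(3, 4))   # (batch, features)
eps = 1e-5

# Batch Norm: statistics per feature, computed over the batch dimension (axis=0)
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Layer Norm: statistics per sample, computed over the feature dimension (axis=1)
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(bn.mean(axis=0))   # ~0 for every feature (across the batch)
print(ln.mean(axis=1))   # ~0 for every sample (across its features)
```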
