Translated from the original article: towardsdatascience.com/batch-norma…
There is a lot of content about Batch Normalization online, yet much of it defends outdated interpretations. I spent a lot of time putting the pieces together to build a solid understanding of the approach, and I thought a step-by-step walkthrough might be useful for some readers.
In particular, this article aims to provide:
- An updated interpretation of Batch Normalization, through three levels of understanding: 30 seconds, 3 minutes, and a comprehensive guide
- Coverage of the key elements needed to make full use of BN
- A PyTorch implementation of the BN layer in Google Colab, reproducing the MNIST experiment from the original paper
- Key insights into why BN is still so often misunderstood (even after reading explanations from high-quality authors!)
Let’s get started!
A) In 30 seconds
Batch-normalization (BN) is an algorithm that makes deep neural network (DNN) training faster and more stable.
It consists of normalizing the activation vectors of the hidden layers using the first and second statistical moments (mean and variance) of the current batch. This normalization step is applied just before (or just after) the nonlinear function.
A multilayer perceptron (MLP) without BN | Credit: the author – Design: Lou HD
The same MLP with BN | Credit: the author – Design: Lou HD
All the popular deep learning frameworks implement Batch Normalization. It is commonly provided as a module that can be inserted into a DNN as a standard layer.
Note: For those who prefer reading code to text, I’ve written a simple Batch Normalization implementation in this repo.
B) In 3 minutes
B.1 How it works
Batch Normalization is computed differently during training and at test time.
B.1.1 Training
On each hidden layer, Batch Normalization transforms the signal as follows:
(1) $\mu = \frac{1}{n} \sum_{i} Z^{(i)}$
(2) $\sigma^{2} = \frac{1}{n} \sum_{i} \left(Z^{(i)} - \mu\right)^{2}$
(3) $Z_{\text{norm}}^{(i)} = \frac{Z^{(i)} - \mu}{\sqrt{\sigma^{2} + \epsilon}}$
(4) $\hat{Z}^{(i)} = \gamma \cdot Z_{\text{norm}}^{(i)} + \beta$
Batch Normalization first computes the mean and variance of the batch using equations (1) and (2).
Each $Z^{(i)}$ is then normalized with equation (3), so that within the batch the output of each neuron follows a standard normal distribution ($\epsilon$ is a small constant added for numerical stability).
First step of Batch Normalization, illustrated on a hidden layer with three neurons and a batch of size B. After normalization, each neuron’s output follows a standard normal distribution across the batch | Credit: the author – Design: Lou HD
Finally, the layer output $\hat{Z}^{(i)}$ is obtained through a linear transformation with γ and β, as shown in equation (4). This step allows the model to choose the optimal distribution for each hidden layer by adjusting two parameters:
- γ adjusts the standard deviation;
- β adjusts the bias, shifting the curve to the left or to the right.
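To make equations (1)–(4) concrete, here is a minimal sketch of the training-time transform in PyTorch (my own illustration, not the repo code; `z` is assumed to have shape (batch_size, num_features)):

```python
import torch

def batch_norm_train(z, gamma, beta, eps=1e-5):
    """Training-time BN on a (batch_size, num_features) activation tensor."""
    mu = z.mean(dim=0)                          # equation (1): batch mean per neuron
    var = z.var(dim=0, unbiased=False)          # equation (2): batch variance per neuron
    z_norm = (z - mu) / torch.sqrt(var + eps)   # equation (3): standardize
    return gamma * z_norm + beta                # equation (4): learnable scale and shift

# Toy usage: a batch of 32 examples, a hidden layer of 3 neurons
z = torch.randn(32, 3)
gamma = torch.ones(3, requires_grad=True)       # trained by gradient descent
beta = torch.zeros(3, requires_grad=True)       # trained by gradient descent
out = batch_norm_train(z, gamma, beta)
```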
Benefits of the γ and β parameters. Modifying the distribution (top) allows us to use different regions of the nonlinear function (bottom). | Credit: author – Design: Lou HD
Note: There are misunderstandings and errors (even in the original paper) in explanations of why BN works. A recent paper [2] has overturned some erroneous assumptions and updated the community’s understanding of this method. We will come back to this question in section C.3: “Why does BN work?”.
At each iteration, the network computes the mean μ and the standard deviation σ of the current batch. The parameters γ and β are learned by gradient descent, while running estimates of μ and σ are maintained with an exponential moving average (EMA), which gives more weight to the most recent iterations.
B.1.2 Testing
Unlike during training, at test time we may not have a full batch to feed into the model. To address this, we compute μ_pop and σ_pop:
- μ_pop: the estimated mean of the studied population;
- σ_pop: the estimated standard deviation of the studied population.
These values are estimated from all the (μ_batch, σ_batch) computed during training, and they are plugged directly into equation (3) at test time (rather than computing equations (1) and (2) on the current input).
Note: We will discuss this issue in depth in section C.2.3, “Normalization during Evaluation”.
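As a hedged sketch of the bookkeeping described above (the momentum value 0.1 mirrors the PyTorch default; the variable names are mine), the population estimates can be maintained during training and reused at test time like this:

```python
import torch

momentum = 0.1                       # weight given to the newest batch statistics
running_mean = torch.zeros(3)        # estimate of mu_pop for a 3-neuron layer
running_var = torch.ones(3)          # estimate of sigma_pop^2

def update_running_stats(z):
    """Training: exponential moving average of the batch statistics."""
    global running_mean, running_var
    mu_batch = z.mean(dim=0)
    var_batch = z.var(dim=0, unbiased=False)
    running_mean = (1 - momentum) * running_mean + momentum * mu_batch
    running_var = (1 - momentum) * running_var + momentum * var_batch

def batch_norm_eval(z, gamma, beta, eps=1e-5):
    """Testing: plug the population estimates directly into equation (3)."""
    return gamma * (z - running_mean) / torch.sqrt(running_var + eps) + beta
```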
B.2 In practice
In practice, Batch Normalization is treated as a standard layer, just like a perceptron, convolution, activation function, or dropout layer.
Batch Normalization is implemented in all the popular frameworks, for example:
- PyTorch: torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d
- TensorFlow / Keras: tf.nn.batch_normalization, tf.keras.layers.BatchNormalization
All BN implementations let you set their parameters independently. However, the most important one is the size of the input vector, which should be set to:
- the number of neurons in the current hidden layer (for an MLP);
- the number of filters in the current hidden layer (for a convolutional network).
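For example, in PyTorch (a sketch with arbitrary layer sizes), the argument is the number of neurons of the preceding linear layer for an MLP, and the number of filters of the preceding convolution for a CNN:

```python
import torch.nn as nn

mlp_block = nn.Sequential(
    nn.Linear(784, 100),
    nn.BatchNorm1d(100),   # 100 = number of neurons in the hidden layer
    nn.Sigmoid(),
)

cnn_block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),    # 64 = number of filters in the convolutional layer
    nn.ReLU(),
)
```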
Take a look at the online documentation for your favorite frameworks and read the BN layer pages. Is there anything specific about their implementations?
B.3 Overview of the results
Even if we don’t know all the underlying mechanics of BN (see C.3), there’s one thing everyone can agree on: it works.
Let’s first look at the results from the original paper [1]:
How BN affects training. Validation accuracy on ImageNet (2012); the horizontal axis is the number of training iterations and the vertical axis is the validation accuracy. Five networks are compared: “Inception” is the plain Inception network [3]; “BN-X” is the Inception network with BN layers, for three different learning rates (x1, x5, x30); “BN-X-Sigmoid” is the Inception network with BN layers in which all ReLUs are replaced by sigmoids. | Source: [1]
The results show that BN layers make training faster and allow a much wider range of learning rates without hurting convergence.
Note: By now, you should know enough about the BN layers to use them. However, if you want to get the most out of BN, you need to dig deeper!
C) Comprehensive understanding
C.1 Implementation
I re-implemented the BN layer in PyTorch to reproduce the results of the original paper. The code is available on GitHub.
I suggest you take a look at some online implementations of the BN layer; it’s enlightening to see how it’s coded in your favorite framework!
C.2 BN in Practice
Before delving into the theory, let’s start with what we know for certain about BN. In this section, we will see:
- how BN affects training performance, and why this method has become so important in deep learning;
- which side effects of BN we need to pay attention to;
- when and how to use BN.
C.2.1 Results from the original paper
As mentioned earlier, BN is widely used because it almost always makes deep learning models perform better.
The original paper [1] describes three experiments that demonstrate the effectiveness of the method.
First, they trained a classifier on the MNIST dataset. The model consists of three fully connected layers with 100 neurons each, all followed by a sigmoid activation function. Using stochastic gradient descent (SGD) with the same learning rate (0.01), they trained the model twice (with and without BN layers) for 50,000 iterations. Note that the BN layers are placed directly before the activation functions.
You can easily reproduce these results without a GPU, which is a great way to familiarize yourself with the concept!
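As a sketch of that setup (my own reconstruction from the description above, not the paper’s exact code), the two MNIST classifiers could look like this, with BN placed just before each sigmoid:

```python
import torch
import torch.nn as nn

def mnist_classifier(with_bn: bool) -> nn.Sequential:
    layers, in_features = [], 28 * 28
    for _ in range(3):                          # three fully connected hidden layers
        layers.append(nn.Linear(in_features, 100))
        if with_bn:
            layers.append(nn.BatchNorm1d(100))  # BN directly before the nonlinearity
        layers.append(nn.Sigmoid())
        in_features = 100
    layers.append(nn.Linear(in_features, 10))   # 10 digit classes
    return nn.Sequential(*layers)

model_plain = mnist_classifier(with_bn=False)
model_bn = mnist_classifier(with_bn=True)
optimizer = torch.optim.SGD(model_bn.parameters(), lr=0.01)  # same LR for both models
```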
Figure 2: Impact of BN on training. Left: accuracy vs. number of iterations; right: training loss vs. number of iterations. | Credit: author
Looks good! BN improves our network performance, including training loss and accuracy.
The second experiment looked at activation values in the hidden layer. Here is the diagram corresponding to the last hidden layer (just before the nonlinearity) :
Figure 3: Influence of BN on the activation values of the last hidden layer | Credit: author
Without batch normalization, the activation values fluctuate significantly during the first iterations. In contrast, the activation curves are smoother with BN.
Figure 4: The activation curves of the BN model are smoother than those of the model without BN | Credit: author
The signal is also less noisy when BN layers are added. This seems to make convergence of the model easier.
This example does not show all the benefits of BN.
The original paper describes a third experiment. The authors wanted to compare the performance of models with added BN layers on a larger dataset, ImageNet (2012). To do this, they trained a then-powerful neural network called Inception [3]. Initially, the network did not have any BN layers. They added some and trained the model with several learning rates (x1, x5, x30). They also tried a variant in which every ReLU activation function is replaced by a sigmoid. They then compared the performance of these new networks to the original one. The results are as follows:
Figure 5: The effect of BN on training on ImageNet. “Inception”: the original network [3]; “BN-Baseline”: Inception plus BN, with the same learning rate as the original network; “BN-x5”: same as BN-Baseline with the learning rate multiplied by 5; “BN-x30”: same as BN-Baseline with the learning rate multiplied by 30; “BN-x5-Sigmoid”: same as BN-x5, using sigmoid instead of ReLU | Source: [1]
It can be concluded from the curve that:
- Adding BN layers results in faster and better convergence (better meaning higher accuracy)
On such a large data set, this improvement is much more significant than that observed on a small MNIST data set.
- Adding BN layers allows higher learning rates (LR) without compromising convergence
The authors successfully trained the Inception-with-BN network with a learning rate 30 times higher than that of the original network. Note that a learning rate only 5 times greater already made the original network diverge!
This makes it easier to find an acceptable learning rate: the LR interval between underfitting and gradient explosion is much larger.
At the same time, a higher learning rate helps avoid converging to poor local minima. By encouraging exploration, the optimizer converges more easily to better solutions.
- The sigmoid-based model achieves results similar to the ReLU-based model
We need to step back and look at the bigger picture here. It is clear that the ReLU-based model can reach better performance than the sigmoid-based one, but that is not the point.
To illustrate why this result matters, let me paraphrase Ian Goodfellow, inventor of GANs [6] and author of the famous Deep Learning textbook:
Before BN, we thought that it was almost impossible to efficiently train deep models using sigmoid in the hidden layers. We considered several approaches to tackle training instability, such as looking for better initialization methods. Those pieces of solution were heavily heuristic, and way too fragile to be satisfactory. Batch Normalization makes those unstable networks trainable. – Ian Goodfellow (rephrased from: source)
Now we understand why BN is so important to the field of deep learning.
These results outline the benefits of BN for network performance. However, to get the most out of BN, you should consider some side effects.
C.2.2 Regularization, a side effect of BN
BN relies on the first and second statistical moments (mean and variance) to normalize the activations of the hidden layers. The output values are therefore strongly tied to the statistics of the current batch. This transformation adds some noise, which depends on the other examples present in the batch.
Add some noise to avoid overfitting… Sounds like a regularization process, doesn’t it? 😉
In practice, we should not rely on BN to avoid overfitting due to orthogonality problems. Simply put, we should always ensure that one module solves one problem. Relying on several modules to deal with different problems makes the development process much harder than it needs to be.
However, the regularization effect is interesting to understand because it can explain some unexpected behavior in networks, especially during sanity checks.
Note: The larger the batch size, the smaller the regularization (because it reduces the noise impact).
What if we wanted to deploy a model with BN on an embedded system like this one? | Credit: Marilia Castelli on Unsplash
C.2.3 Normalization during evaluation
There are two situations in which a model is used in evaluation mode:
- during cross-validation or testing, while the model is being trained and developed;
- when the model is deployed.
In the first case, we could apply batch normalization using the current batch statistics for convenience. In the second case, however, it doesn’t make sense to do so, because we don’t necessarily predict on a whole batch at a time.
Consider the example of a robot with an embedded camera. We might have a model that uses the current frame to predict the location of upcoming obstacles, so we want to run inference on a single frame (i.e. a single RGB image) per iteration. If the training batch size was n, what should we choose for the (n-1) other inputs of the forward pass the model expects?
Remember that for each BN layer, (β, γ) are trained on the normalized signal, so we need sensible values of (μ, σ) to get meaningful results.
One solution would be to pick arbitrary values to complete the batch. By feeding this first batch to the model, we would get some prediction for the image we are interested in. But if we built a second batch using different random values, we would get a different prediction for the same image. Two different outputs for a single input is not desirable model behavior.
The trick is to use (μ_pop, σ_pop), the estimated mean and standard deviation of the target population. These parameters are computed during training as an average of the (μ_batch, σ_batch) values.
That’s how we do it!
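In PyTorch this is handled by the module’s mode: calling model.eval() makes every BN layer normalize with its stored running statistics, so a single frame can be processed without padding out a batch (a sketch with arbitrary sizes):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 1))

model.eval()                            # BN now uses its (mu_pop, sigma_pop) estimates
with torch.no_grad():
    single_frame = torch.randn(1, 10)   # a "batch" containing one example
    prediction = model(single_frame)    # deterministic, batch-independent output
    # (in train() mode, BatchNorm1d would refuse a batch of size 1)
```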
Note: This technique can lead to instability during the evaluation phase, so let’s discuss it in the next section.
C.2.4 Stability problems
Even though BN usually works well, it sometimes causes stability problems. In some cases, the BN layers make the activation values explode during the evaluation phase (and the model returns loss = NaN).
We just saw how (μ_pop, σ_pop) are determined for the evaluation phase: we average all the (μ_batch, σ_batch) values computed during training.
Now consider a model trained only on a dataset containing sneakers. What happens if there are Derby-style shoes in the test set?
If the input distribution changes too much between training and evaluation, the model may overreact to some values and the activations can explode | Credit: Grailify (left) and Jia Ye (right) on Unsplash
We can expect the distribution of the hidden layers’ activation values to differ significantly between training and evaluation, perhaps too much. In that case, the estimated (μ_pop, σ_pop) do not correctly describe the actual population mean and standard deviation, and applying them fails to bring the activations close to (μ=0, σ=1), which can leave the activation values wildly overestimated.
Note: The shift between the training set and the test set is called covariate shift. We will come back to this effect in section C.3.
This effect is amplified by a well-known property of BN: during training, the activations are normalized using their own batch statistics, whereas during inference they are normalized using the (μ_pop, σ_pop) computed during training. The coefficients used for normalization therefore do not take the actual activation values into account.
In general, the training set must be similar enough to the test set, otherwise it is impossible to train the model correctly for the target task. So in most cases (μ_pop, σ_pop) should fit the test set well. If not, we may conclude that the training set is not large enough, or not of sufficient quality, for the target task.
But sometimes, it just happens. There is not always a neat solution to this problem.
I ran into this once during the Pulmonary Fibrosis Progression Kaggle competition. The training dataset consists of metadata and a 3D lung scan for each patient. The content of these scans is complex and varied, but we only had about 100 patients split between the training and validation sets. As a result, the convolutional network I wanted to use for feature extraction returned NaN as soon as the model switched from training mode to evaluation mode. Fun debugging.
When you don’t have easy access to additional data to enrich your training set, you have to find a workaround. In my case, I forced the BN layers to use the (μ_batch, σ_batch) of the validation set, computed manually. (I agree, it’s an ugly fix, but I was running out of time ;) )
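If you ever need a similar workaround in PyTorch, one (admittedly ugly) way to force BN layers to use the statistics of the incoming batch at evaluation time is to keep only those layers in training mode (a sketch; use with care):

```python
import torch.nn as nn

def eval_with_batch_stats(model: nn.Module) -> None:
    """Put the model in eval mode, but let BN layers keep using batch statistics."""
    model.eval()
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.train()   # normalize with the current batch's mean/variance
            # note: this also keeps updating the running statistics during evaluation
```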
Adding BN layers to a network is not always the best strategy; don’t just assume it can never hurt!
C.2.5 Recurrent networks and Layer Normalization
In practice, it is generally believed that:
- For convolutional networks (CNNs), BN works better
- For recurrent networks (RNNs), Layer Normalization (LN) works better
While BN normalizes each value using the statistics of the current batch, LN uses the whole current layer to do so. In other words, the normalization is performed using the other features of a single example rather than the same feature across all the examples of the batch. For recurrent networks, this solution seems more effective. Note that it is difficult to define a consistent BN strategy for this kind of network, because it relies on repeated multiplications by the same weight matrix: should we normalize each time step independently, or compute the mean over all time steps and apply the normalization recursively? (Source of this intuitive argument: here)
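The difference in normalization axes is easy to see in PyTorch (a sketch with arbitrary sizes): BatchNorm1d computes its statistics per feature across the batch, while LayerNorm computes them per example across the features:

```python
import torch
import torch.nn as nn

x = torch.randn(32, 100)     # (batch_size, num_features)

bn = nn.BatchNorm1d(100)     # statistics per feature, computed over the 32 examples
ln = nn.LayerNorm(100)       # statistics per example, computed over the 100 features

out_bn = bn(x)
out_ln = ln(x)               # also works on a batch of size 1, unlike BN in train mode
```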
I won’t go any further on this topic because that’s not the purpose of this article.
C.2.6 Before or after nonlinearity
Historically, the BN layer was placed just before the nonlinear function, which was consistent with the authors’ goals and assumptions at the time. In their paper, they wrote:
“We would like to ensure that, for any parameter values, the network always produces activations with the desired distribution.” – Sergey Ioffe & Christian Szegedy (Source: [1])
Some experiments show that better results can be obtained by using BN layers after nonlinear functions. Here’s an example.
Keras founder Francois Chollet, an engineer who now works at Google, said:
“I haven’t gone back to check what they are suggesting in their original paper, but I can guarantee that recent code written by Christian [Szegedy] applies relu before BN. It is still occasionally a topic of debate, though.” – Francois Chollet (Source)
Nevertheless, many commonly used transfer learning architectures (ResNet, MobileNet-V2, etc.) apply BN before the nonlinearity.
Note that paper [2] challenges the hypothesis (see C.3.3) that the original paper [1] used to explain the effectiveness of BN and to justify placing the BN layer before the activation function, without settling the question of the best placement.
As far as I know, the question is still under discussion.
Further reading: here is an interesting Reddit discussion, mostly in favor of BN after the activation, even if some of the arguments are unconvincing.
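In code, the two orderings under debate are one line apart (a sketch with arbitrary sizes):

```python
import torch.nn as nn

bn_before_activation = nn.Sequential(   # ordering proposed in the original paper [1]
    nn.Linear(100, 100),
    nn.BatchNorm1d(100),
    nn.ReLU(),
)

bn_after_activation = nn.Sequential(    # ordering favored by some later experiments
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.BatchNorm1d(100),
)
```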
Why does my messy code still work? | Credit : Rohit Farmer on Unsplash
C.3 Why does BN work
In most cases, BN improves the performance of deep learning models. Great. But we would like to know what is happening inside the black box.
This is where things get a little hairy.
The problem is that we still don’t know exactly what makes BN work so well. Some hypotheses are regularly discussed in the deep learning community; we’ll explore them step by step.
Before we go any further, here’s what we’ll see:
- The original paper [1] attributed the effectiveness of BN to the reduction of what the authors called internal covariate shift (ICS). A recent paper [2] refutes this hypothesis (see C.3.1).
- That first hypothesis is now more cautiously replaced by another: BN reduces the interdependence between layers during training (see C.3.2).
- A recent MIT paper [2] highlighted the effect of BN on the smoothness of the optimization landscape, which makes training easier (see C.3.3).
I bet exploring these assumptions will help you develop a strong intuition about BN.
C.3.1 First hypothesis: confusion around internal covariate shift (ICS)
Although BN has had a fundamental impact on DNN performance, it is still easily misunderstood.
The confusion around BN is mainly due to an incorrect hypothesis put forward by the original paper [1].
Sergey Ioffe and Christian Szegedy introduced BN as follows:
“We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift. […] We propose a new mechanism, which we call Batch Normalization, that takes a step towards reducing internal covariate shift, and in doing so dramatically accelerates the training of deep neural nets.” – Sergey Ioffe & Christian Szegedy (Source: [1])
In other words, BN is effective because it partially solves the problem of internal covariate shift.
This claim has been seriously challenged by later work [2].
Note: From now on, ICS stands for internal covariate shift.
To understand the cause of this confusion, let’s first discuss what covariate shift is and how it is affected by normalization.
What is covariate shift (from the distribution-stability point of view)?
The authors of [1] define it clearly: covariate shift describes a change in the distribution of the model’s inputs. By extension, internal covariate shift describes this phenomenon occurring in the hidden layers of a deep neural network.
Let’s look at an example of why this might be a problem.
Suppose we want to train a classifier to answer the following question: Does this image contain a car? If we wanted to extract all the car images from a very large unlabeled data set, such a model would save us a lot of time.
We take an RGB image as input, apply some convolutional layers and then fully connected layers. The output is a single value fed into a logistic function, so the final value lies between 0 and 1 and describes the probability that the input image contains a car.
Figure 5: A simple CNN classifier. | Credit : author – Design : Lou HD
To train such a model, we need a large number of labeled images.
Now, suppose we only have images of regular cars for training. How would the model react if we asked it to classify Formula One cars?
As mentioned earlier, covariate shift can make the network activations diverge (see section C.2.4). Even when it doesn’t, it affects overall performance! | Credit: Dhiva Krishna (left) and Ferhat Deniz Fors (right) on Unsplash
In this example, there is a shift between the training and test distributions. More broadly, different car orientations, lighting, or weather conditions may affect the model’s performance. In such cases, the model generalizes poorly.
If we plot the extracted features in the feature space, we get the following results:
Figure 6.a: Why do we need to normalize the model inputs? Unnormalized case. During training, the input values are scattered: where the point density is high, the approximated function will be accurate; where the point density is low, it will be inaccurate and somewhat arbitrary (for example, the approximated curve could be any of the 3 lines drawn). | Credit: author – Design: Lou HD
Assume the crosses describe features of images that do not contain a car, while the circles describe features of images that do. Here, a single function can separate the two classes, but that function may not be very accurate in the upper-right part of the plot: there are not enough samples there to determine a good boundary. This can cause the classifier to make many mistakes at evaluation time.
To train our model effectively, we would need to provide images of cars in every situation we can imagine. Even though this is still essentially how we train CNNs, we want to use as few samples as possible while making sure the model generalizes well.
The problem can be summarized as follows:
From the model point-of-view, training images are – statistically – too different from testing images. There is a covariate shift.
We can see a solution to this problem on a simpler model. It is well known that linear regression models are easier to optimize when their input values are normalized (i.e. when the input distribution is close to (μ=0, σ=1)): this is why we usually normalize the input values of a model.
Figure 6.b: Why do we need to normalize the model inputs? Normalized case. During training, normalizing the input data brings the points in feature space closer together: it is now easier to find a function that generalizes well. | Credit: the author – Design: Lou HD
This solution was well known before the BN paper was published. The authors of [1] wanted to extend the approach to the hidden layers with BN, to aid training.
The hypothesis of the original paper: internal covariate shift degrades training
Diagram 7: The principle of internal covariate shift, from the distribution-stability point of view. | Credit: author – Design: Lou HD
In our car classifier, we can think of hidden units as detectors that activate when they recognize some conceptual feature related to a car: a wheel, a tire, a door. We can assume that the effect described earlier also occurs inside a hidden unit: wheels seen at a specific orientation angle activate a specific distribution of neurons. Ideally, we would like some neurons to respond similarly to wheels in any orientation, so that the model can efficiently estimate the probability that the input image contains a car.
If there is a large covariate shift in the input signal of a hidden layer, the optimizer struggles to generalize well. Conversely, if the input signal always follows a standard normal distribution, the optimizer generalizes more easily. With this in mind, the authors of [1] adopted the strategy of normalizing the signal inside the hidden layers. They believed that forcing the intermediate signal distribution towards (μ=0, σ=1) would help the network generalize at the concept level.
However, we don’t always want the hidden units to follow a standard normal distribution; that would reduce the representational power of the model:
Figure 8: Why we don’t always want standard normal distributions in the hidden units. Here, the sigmoid function would only operate in its linear region. | Credit: author – Design: Lou HD
The paper uses the sigmoid function as an example to show why normalization alone is a problem: if the normalized input values stay close to 0, the nonlinear function only operates in its (almost) linear region, which defeats the purpose of the nonlinearity.
To address this, the authors added two trainable parameters, β and γ, allowing the optimizer to define the optimal mean (through β) and standard deviation (through γ) for the task at hand.
⚠ Warning: The following hypothesis is now outdated. Many important resources about BN still cite it as the reason the method works in practice. Recent work, however, seriously challenges it.
In the years following the release of [1], the deep learning community explained the effectiveness of BN as follows:
Hypothesis 1
BN ➜ normalization of the input signal of each hidden unit ➜ two trainable parameters to adjust the distribution and make the most of the nonlinearity ➜ easier training
Here, the normalization towards (μ=0, σ=1) is presented as the main reason for BN’s effectiveness. This hypothesis has been challenged (see section C.3.3) and replaced by another:
Hypothesis 2
BN ➜ normalization of the input signal of each hidden unit ➜ reduced interdependence between hidden layers (from the distribution-stability point of view) ➜ easier training
There is a subtle but very important difference. The goal of normalization here is to reduce the interdependence between layers (from a distribution stability perspective), so the optimizer can choose the best distribution by adjusting just two parameters! Let’s explore this hypothesis further.
What’s the connection between BN and this picture? | Source: Danilo Alvesd on Unsplash
C.3.2 Second hypothesis: easing the interdependence between distributions
About this section: I couldn’t find any hard evidence for the hypothesis discussed here, so I decided to rely mostly on Ian Goodfellow’s explanation (in particular in this wonderful video).
Consider the following example:
Figure 9: A simple DNN, consisting of linear transformations. | Inspired by Ian Goodfellow
Where (a), (b), (c), (d) and (e) are adjacent layers of the network. This is a very simple example where all the layers are connected by a linear transformation. Suppose we want to train such a model with SGD.
To update the weights of layer (a), we need to compute the gradient of the loss with respect to them, backpropagated from the network output:
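With purely linear layers, that gradient is a product of terms, one per downstream layer (a sketch of the chain rule for this toy network; the exact notation is mine):

$$\frac{\partial \mathcal{L}}{\partial a} \;=\; \frac{\partial \mathcal{L}}{\partial e} \cdot \frac{\partial e}{\partial d} \cdot \frac{\partial d}{\partial c} \cdot \frac{\partial c}{\partial b} \cdot \frac{\partial b}{\partial a}$$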
Let’s first consider a network without BN layers. From the expression above, we see that if all the downstream terms are large, the gradient of (a) will be very large; conversely, if they are all small, the gradient of (a) becomes almost negligible.
Looking at the input distributions of the hidden units, the layer-to-layer dependency is easy to see: modifying the weights of (a) modifies the input distribution of (b), and ultimately the input signals of (d) and (e). This interdependence can create problems for training stability: if we want to adjust the input distribution of a particular hidden unit, we need to take the entire sequence of layers into account.
SGD, however, only considers first-order relationships between layers, so it does not take these higher-order interactions into account!
Figure 10: The principle behind hypothesis 2. The BN layer makes the signal easier to regulate by normalizing it within each hidden unit and allowing the distribution to be adjusted with β and γ. BN acts like a valve, making flow control easier at specific points without reducing the complexity of the network! | Credit: author – Design: Lou HD
During training, adding BN layers significantly reduces the interdependence between layers (from the distribution-stability point of view). BN acts as a valve that holds back the flow and lets it be regulated through β and γ. This removes the need to consider all the upstream parameters to get a handle on the distribution inside a hidden unit.
Note: The optimizer can make larger weight changes in one layer without affecting the tuning of the other hidden layers. This makes hyperparameter tuning easier!
This example sets aside the hypothesis that the effectiveness of BN comes from forcing the intermediate signal distribution towards (μ=0, σ=1). Here, the purpose of BN is to make the optimizer’s job easier by letting it adjust the distribution of each hidden layer with just two parameters at a time.
⚠ Keep in mind, though, that this is mostly speculation. These discussions should be used to build intuitive insights into BN. We still don’t know why BN works in practice!
In 2019, a research group at MIT ran some very interesting experiments on BN [2]. Their results seriously challenge hypothesis 1 (which plenty of otherwise serious blog posts and MOOCs still propagate).
If we don’t want our assumptions about the effect of BN on training to get stuck in a local minimum, we should take a look at this paper… 😉
Alright… you better initialize well. | Credit : Tracy Zhang on Unsplash
C.3.3 Third hypothesis: making the optimization landscape smoother
About this section: I have summarized the results of [2] that help build a better intuition about BN. I can’t be exhaustive; this paper is very dense, and I encourage you to read it carefully if you are interested in these concepts.
Let’s skip straight to the second experiment of [2]. Its goal was to examine the correlation between ICS and the training improvements brought by BN (hypothesis 1).
Note: From now on, we refer to this distribution-based definition of covariate shift as ICS_distrib.
To do this, the researchers trained three VGG networks on CIFAR-10:
- the first one has no BN layers;
- the second one has BN layers;
- the third one is like the second, except that some ICS_distrib is explicitly added inside the hidden units before the activation (by injecting random bias and variance).
They measured the accuracy reached by each model, as well as the variation of the distributions over the iterations. Here’s what they got:
Diagram 6: Effect of BN on ICS_distrib. The network with BN trains faster than the standard network; explicitly adding ICS_distrib to the BN network does not reduce the effectiveness of BN. | Source: [2]
We can see that, as expected, the third network has a very high ICS. However, the noisy network still trains faster than the standard network, and its performance is comparable to that of the standard BN network. This result suggests that the effectiveness of BN is independent of ICS_distrib. Oh dear!
Still, we should not discard the ICS theory too quickly: if the effectiveness of BN does not come from ICS_distrib, it may be related to another definition of ICS. After all, the intuition behind hypothesis 1 makes sense, doesn’t it?
The main problem with ICS_distrib is that it is defined in terms of the input distributions of the hidden units, and is therefore not directly tied to the optimization problem itself.
The authors of [2] propose another definition of ICS:
Let’s consider a fixed input X. From an optimization point of view, we define the internal covariate shift of a hidden layer k as the difference between the gradient computed at layer k after backpropagating the loss L(X) at iteration t, and the gradient computed at the same layer k from the loss L(X) at iteration t+1, i.e. after the weight update of iteration t.
This definition focuses on the gradients rather than on the input distributions of the hidden layers, on the assumption that it gives better clues about how ICS affects the underlying optimization problem.
Note: ICS_opti now refers to the ICS defined from this optimization perspective.
In the following experiment, the authors evaluated the effect of ICS_opti on training. To do this, they measured the evolution of ICS_opti during training for DNNs with and without BN layers. To quantify the gradient change mentioned in the definition of ICS_opti, they computed:
- the L2 difference: do the gradients keep a similar norm before and after the weight update? Ideally: 0;
- the cosine angle: do the gradients keep a similar direction before and after the weight update? Ideally: 1.
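A hedged sketch of how those two quantities could be measured for one layer (my own minimal reconstruction, not the paper’s code): compute the gradient of a layer’s weights, take one optimizer step, recompute the gradient on the same input X, then compare:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
layer_weight = model[0].weight            # layer k whose gradient shift we track

def grad_of_layer():
    opt.zero_grad()
    F.cross_entropy(model(x), y).backward()
    return layer_weight.grad.detach().clone().flatten()

g_before = grad_of_layer()                # gradient at iteration t
opt.step()                                # weight update of iteration t
g_after = grad_of_layer()                 # gradient on the same X after the update

l2_diff = torch.norm(g_after - g_before)                    # ideally close to 0
cos_angle = F.cosine_similarity(g_before, g_after, dim=0)   # ideally close to 1
```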
Figure 7: Effect of BN on ICS_opti (L2 difference and cosine angle). The results show that BN does not prevent ICS_opti (it even seems to increase it somewhat). | Source: [2]
The results are somewhat surprising: networks with BN seem to have a higher ICS_opti than standard networks. And remember, the networks with BN (blue curves) train faster than the standard networks (red curves)!
So ICS seems to have little to do with training performance… at least under this definition of ICS. Somehow, BN has another effect on the network that makes convergence easier. Let’s now look at how BN affects the optimization landscape; we may find clues there.
Here’s the final experiment of the paper:
Figure 11: Exploring the optimization landscape along the gradient direction, as in the experiments of [2] | Inspired by Andrew Ilyas – Design: Lou HD
Starting from a single gradient, we update the weights with different optimization step sizes (playing the role of the learning rate). Intuitively, we pick a direction from a point in parameter space (i.e. a network configuration ω), and then explore the optimization landscape further along this direction.
At each step, we measure the loss and the gradient, which lets us compare how the landscape varies from a given starting point. If we measure large changes, the landscape is very unstable and the gradient is unreliable: large steps can hurt the optimization. Conversely, if the measured changes are small, the landscape is stable and the gradient is trustworthy: we can take larger steps without hurting the optimization. In other words, we can use a larger learning rate and converge faster (a well-known property of BN).
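A sketch of that exploration for a single starting point (my own simplification of the experiment, not the authors’ code): fix the gradient direction, move along it with several step sizes, and record how much the loss changes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(10, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 2))
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
params = list(model.parameters())

def loss_and_grads():
    model.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return loss.item(), [p.grad.detach().clone() for p in params]

loss_start, grads_start = loss_and_grads()      # starting point and its gradient direction

for step in (0.05, 0.1, 0.5):                   # several step sizes along that direction
    with torch.no_grad():
        for p, g in zip(params, grads_start):
            p -= step * g                       # explore the landscape along the gradient
    loss_step, _ = loss_and_grads()             # how much did the loss change?
    with torch.no_grad():
        for p, g in zip(params, grads_start):
            p += step * g                       # come back to the starting point
    print(f"step {step}: loss {loss_start:.3f} -> {loss_step:.3f}")
```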
Let’s look at the results:
Figure 8: Effect of BN on the smoothness of the optimization landscape. Using BN significantly reduces the variation of the gradient along the explored direction. | Source: [2]
We can clearly see that the optimization landscape is much smoother with BN layers.
We end up with a result that could explain the effectiveness of BN: the BN layers somehow smooth the optimization landscape. This makes the optimizer’s job easier: we can use a much larger learning rate without running into vanishing or exploding gradients.
This leads to a third hypothesis, put forward by [2]:
Hypothesis 3
BN ➜ normalization of the input signal of each hidden unit ➜ smoother optimization landscape ➜ faster and more stable training
This raises another question: how does BN make the optimization landscape smoother?
The authors of [2] also study this question from a theoretical perspective. Their work is illuminating and helps build a better grasp of the smoothing effect of BN. In particular, they show that BN smooths the optimization landscape while preserving the minima of the original landscape. In other words, BN reparameterizes the underlying optimization problem, making training faster and easier!
⚠ In additional experiments, the authors of [2] observed that this effect is not unique to BN: they obtained similar training performance with other normalization schemes, such as L1 or L2 normalization of the activations. These observations suggest that the effectiveness of BN owes something to luck, exploiting underlying mechanisms that we have not yet fully identified.
Now is the time to set a very high learning rate. | Credit : Finding Dan | Dan Grinwis on Unsplash
To conclude this section: the paper seriously challenges the widely accepted view that the effectiveness of BN is mainly due to a reduction of ICS (both from the distribution-stability perspective and from the optimization perspective), and highlights instead the smoothing effect of BN on the optimization landscape.
The paper proposes a hypothesis about the effect of BN on training speed, but it does not answer why BN helps generalization.
The authors briefly point out that a smoother optimization landscape helps the model converge to flatter minima, which have better generalization properties, but they give no further details.
Their main contribution is to challenge the accepted explanation of BN’s effect through ICS, which is already important!
C.4 Conclusion: what do we know now?
- Hypothesis 1: the BN layers alleviate internal covariate shift (ICS)
❌ Wrong: [2] shows that in practice there is no correlation between ICS and training performance.
- Hypothesis 2: the BN layers make the optimizer’s job easier by allowing the input distribution of each hidden unit to be adjusted with only two parameters.
❓ Maybe: this hypothesis highlights the interdependence between parameters, which makes the optimization task harder. There is no hard evidence for it, though.
- Hypothesis 3: the BN layers reparameterize the underlying optimization problem to make it smoother and more stable.
❓ Maybe: these results are the most recent and, as far as I know, have not been challenged so far. The authors provide empirical and theoretical evidence, but some basic questions remain unanswered (e.g. “How does BN help generalization?”).
Discussion: In my opinion, the last two hypotheses are compatible. Intuitively, we can think of hypothesis 2 as a projection from a problem with many parameters onto a problem with only a few: a kind of dimensionality reduction, which would help generalization. Any thoughts?
Many questions remain open, and BN is still an active research topic. Still, discussing these hypotheses helps build a better understanding of this common method and dismiss some of the incorrect explanations we have repeated for years.
None of these open questions, however, prevents us from benefiting from BN in practice!
Conclusion
BN is one of the important advances in deep learning of recent years. The method relies on two successive linear transformations to make deep neural network (DNN) training faster and more stable.
The most widely repeated explanation of BN’s effectiveness in practice is the reduction of the interdependence between hidden layers during training. However, the smoothing effect of the normalizing transformation on the optimization landscape appears to be an important mechanism behind BN’s effectiveness.
Many popular DNNs now rely on BN (e.g. ResNet [4], EfficientNet [5], …).
If you are interested in deep learning, you must be familiar with this method!
Open questions
Even though BN has proven effective in practice over the years, many questions about its underlying mechanisms remain unanswered.
Here is a non-exhaustive list of open questions about BN:
- How does BN help generalization?
- Is BN the best normalization method to make optimization easier?
- How do β and γ affect the smoothness of the optimization landscape?
- The experiments of [2] exploring the optimization landscape focus on the short-term effect of BN on the gradients: they measure the changes in gradients and loss within a single iteration, for several step sizes. What is the long-term effect of BN on the gradients? Does the interdependence of the weights have other interesting implications for the optimization landscape?
Thank you
Many thanks to Lou Hacquet Delepine for all the charts and her comprehensive help with proofreading!
References
[1] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167.
[2] Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help optimization?, Advances in Neural Information Processing Systems
[3] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D.,… & Rabinovich, A. (2015). Going deeper with convolutions, Proceedings of the IEEE conference on computer vision and pattern recognition
[4] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition
[5] Tan, M., & Le, Q. V. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks, arXiv preprint arXiv:1905.11946.
[6] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets, Advances in neural information processing systems
Further exploration
- A wonderful talk by Ian Goodfellow, in which he discusses BN at the beginning of the course: link
- An oral presentation of paper [2] by one of its authors. The audience asks tough questions and there is a great debate about BN: link
- Should we put BN before or after the activation? On StackOverflow: link
- Should we put BN before or after the activation? On Reddit: link