Preface:
Normalization techniques have been developed over many years, and different methods now exist for different application scenarios. This article summarizes these methods and introduces their core ideas, implementations, and application scenarios, covering LRN, BN, LN, IN, GN, FRN, WN, BRN, CBN, CmBN, and others.
This article could also be titled "BN and its successors", because almost all normalization methods proposed after BN were designed to address BN's three main shortcomings, which are also discussed here. After reading this article, readers should have a more comprehensive understanding of normalization methods.
LRN(2012)
Local Response Normalization (LRN) was first proposed in AlexNet. Since the introduction of BN it has been largely abandoned, so only its origin and main idea are covered here.
The idea behind LRN comes from the neurobiological phenomenon of lateral inhibition, in which activated neurons suppress their neighbors. LRN can be summed up in one sentence: make feature maps with large responses relatively larger and feature maps with small responses relatively smaller.
Its main purpose is to reduce the correlation between the feature maps produced by different convolution kernels, so that feature maps on different channels focus on different features. For example, feature A is more prominent on one channel while feature B is more prominent on another.
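Concretely, the AlexNet formulation divides each activation by a power of the summed squares of its neighboring channels. A minimal NumPy sketch (the hyperparameters n, k, alpha, and beta follow the values reported for AlexNet; the input shape is an assumption for the example):

import numpy as np

def local_response_norm(x, n=5, k=2.0, alpha=1e-4, beta=0.75):
    # x: feature maps of one sample, shape (C, H, W).
    # Each activation is divided by a term that grows with the squared
    # activations of its n neighboring channels (lateral inhibition).
    C = x.shape[0]
    out = np.empty_like(x)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C, i + n // 2 + 1)
        denom = (k + alpha * np.sum(x[lo:hi] ** 2, axis=0)) ** beta
        out[i] = x[i] / denom
    return out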
Batch Normalization(2015)
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
The paper explains BN as follows: training deep neural networks is complicated because the distribution of each layer's inputs changes during training as the parameters of the preceding layers change. This is a problem because each layer must keep adapting to a new input distribution, and as the network gets deeper, small changes in the parameters are amplified.
This slows down training by requiring low learning rates and careful parameter initialization, and it makes models with saturating nonlinearities notoriously hard to train. The authors call this phenomenon internal covariate shift and address it by normalizing the layer inputs.
Another interpretation: suppose the input data contains multiple features X1, X2, …, Xn. Each feature may have a different range of values. For example, feature X1 might take values between 1 and 5, while feature X2 might take values between 1000 and 99999.
As shown in the figure on the left below, because the two features are on very different scales but share the same learning rate, the gradient descent trajectory oscillates back and forth along one dimension and needs more steps to reach the minimum. The learning rate is also hard to set: if it is too large, training oscillates along the small-range feature; if it is too small, there is barely any progress along the large-range feature.
As shown in the figure on the right below, after normalization the features are all on a similar scale, so the loss landscape looks like a bowl, the learning rate is easier to set, and gradient descent is more stable.
Implementation algorithm:
In each BN layer, the mean and variance of each channel are computed over all samples in the mini-batch (and over the spatial dimensions), and the data are normalized with them, giving values with zero mean and unit variance. Finally, two learnable parameters, gamma and beta, scale and shift the normalized data.
In addition, the mean and variance seen by each BN layer are recorded for every mini-batch during training, and their expected values over all mini-batches are used as that layer's mean and variance at inference time.
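A minimal NumPy sketch of the idea, not the paper's exact pseudo-code: statistics are taken per channel over an (N, C, H, W) batch, gamma and beta rescale the result, and running statistics are kept for inference. The exponential moving average below is a common implementation choice; the paper describes taking the expectation over all mini-batches.

import numpy as np

def batch_norm_train(x, gamma, beta, run_mean, run_var, momentum=0.1, eps=1e-5):
    # x: (N, C, H, W); gamma, beta, run_mean, run_var: shape (C,)
    mean = x.mean(axis=(0, 2, 3))        # per-channel mean over the batch
    var = x.var(axis=(0, 2, 3))          # per-channel variance over the batch
    x_hat = (x - mean.reshape(1, -1, 1, 1)) / np.sqrt(var.reshape(1, -1, 1, 1) + eps)
    y = gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
    run_mean = (1 - momentum) * run_mean + momentum * mean   # statistics kept for inference
    run_var = (1 - momentum) * run_var + momentum * var
    return y, run_mean, run_var

def batch_norm_infer(x, gamma, beta, run_mean, run_var, eps=1e-5):
    # At inference time the saved statistics replace the mini-batch ones.
    x_hat = (x - run_mean.reshape(1, -1, 1, 1)) / np.sqrt(run_var.reshape(1, -1, 1, 1) + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)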
Note: BN works better after the activation function than before it.
Effects in practice:
1) A larger learning rate can be used than without BN.
2) It helps prevent overfitting, so Dropout and Local Response Normalization can be removed.
3) Since the dataloader shuffles the data, the mini-batches differ from epoch to epoch, and normalizing with different mini-batch statistics acts as a mild form of data augmentation.
4) It noticeably accelerates convergence.
5) It helps avoid exploding and vanishing gradients.
Note: BN has several shortcomings, and most of the subsequent normalization papers improve on them. For ease of presentation, each shortcoming is introduced together with the paper that addresses it.
The differences and connections between BN, LN, IN and GN
The differences between them are fairly obvious in the figure below. (N denotes the samples and C the channels; for ease of illustration, the two spatial dimensions HxW are flattened into a single dimension H*W.)
The main problem the latter three address is that BN's effectiveness depends on the batch size: when the batch size is small, performance degrades significantly. As the figure shows, LN, IN, and GN are all independent of the batch size.
LN computes the mean and variance of a single sample over all channels; IN computes them for a single sample on each channel separately; GN divides the channels of each sample into G groups and computes the mean and variance within each group.
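Which axes the statistics are averaged over is the clearest way to see the difference; a small NumPy sketch with arbitrary example shapes:

import numpy as np

x = np.random.randn(8, 32, 16, 16)    # (N, C, H, W)

bn_mean = x.mean(axis=(0, 2, 3))      # BN: per channel, over batch and spatial dims -> (C,)
ln_mean = x.mean(axis=(1, 2, 3))      # LN: per sample, over all channels and spatial dims -> (N,)
in_mean = x.mean(axis=(2, 3))         # IN: per sample and per channel -> (N, C)

G = 4                                 # GN: channels split into G groups, per sample and per group
gn_mean = x.reshape(8, G, 32 // G, 16, 16).mean(axis=(2, 3, 4))   # -> (N, G)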
A comparison of their effects is shown below. (Note: this is only a single-case comparison; in practice each method has its own application area, and the latter three clearly outperform BN in their respective applications.)
Instance Normalization(2016)
Instance Normalization: The Missing Ingredient for Fast Stylization
In image and video recognition tasks, BN performs better than IN. However, for generative tasks such as GANs, style transfer, and domain adaptation, IN performs significantly better than BN.
The reason can be analyzed from the difference between BN and IN: BN computes the mean and variance over multiple samples whose domains may well differ, which amounts to normalizing data distributions from different domains together; IN normalizes each sample independently, which better preserves per-sample characteristics such as style.
Layer Normalization (2016)
Paper: Layer Normalization
The first defect of BN is its dependence on the batch size. The second is that it brings little benefit to dynamic networks such as RNNs, and problems easily arise when an inference sequence is longer than every sequence seen during training. Layer Normalization was proposed to address this.
Applying batch normalization to an RNN in the obvious way requires computing and storing separate statistics for every time step, which is problematic when a test sequence is longer than any training sequence. LN has no such problem, because its normalization term depends only on the summed inputs to the layer at the current time step. It also has a single set of gain and bias parameters shared across all time steps. (Note: the gain and bias in LN play the same role as gamma and beta in BN.)
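A minimal sketch of LN inside a vanilla RNN step; the weight names W_x and W_h and the step function are illustrative assumptions, while the normalization itself follows the description above:

import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    # a: summed inputs to the layer at one time step, shape (batch, hidden)
    mean = a.mean(axis=-1, keepdims=True)
    std = a.std(axis=-1, keepdims=True)
    return gain * (a - mean) / (std + eps) + bias

def rnn_step(x_t, h, W_x, W_h, gain, bias):
    # The same gain/bias are reused at every time step.
    a_t = x_t @ W_x + h @ W_h
    return np.tanh(layer_norm(a_t, gain, bias))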
LN applications: RNN, Transformer, etc.
Group Normalization(2018)
Paper: Group Normalization
As shown in the figure below, BN degrades significantly as the batch size decreases, while Group Normalization stays essentially constant. With a large batch size GN is slightly worse than BN, but with a small batch size it is significantly better.
GN has two defects: it is slightly worse than BN when the batch size is large, and, because it groups along the channel dimension, it requires the number of channels to be a multiple of the number of groups G.
GN application scenario: in object detection, semantic segmentation, and other tasks that need as high a resolution as possible, memory limits force a small batch size in exchange for the larger resolution, so a batch-size-independent method such as GN is a natural choice.
GN implementation algorithm
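A minimal NumPy sketch in the spirit of the grouping described above (the group count G and the shapes are example values):

import numpy as np

def group_norm(x, gamma, beta, G=32, eps=1e-5):
    # x: (N, C, H, W); channels are split into G groups and each group
    # of each sample is normalized independently. C must be divisible by G.
    N, C, H, W = x.shape
    xg = x.reshape(N, G, C // G, H, W)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mean) / np.sqrt(var + eps)
    x = xg.reshape(N, C, H, W)
    return gamma.reshape(1, C, 1, 1) * x + beta.reshape(1, C, 1, 1)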
Weight Normalization(2016)
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
The previous methods all normalize feature maps; this paper proposes normalizing the weights instead.
A full explanation would take considerable space, but the main method can be stated in one sentence: decompose the weight vector w into a scalar g, which represents the length of w, and a vector v, which represents its direction, so that w = g * v / ||v||.
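A minimal sketch of the reparameterization for a single output unit (the surrounding linear layer is an illustrative assumption):

import numpy as np

def weight_norm(v, g):
    # w = g * v / ||v||: g carries the length of w, v its direction.
    return g * v / np.linalg.norm(v)

def linear_unit(x, v, g, b):
    # One output unit; gradients are taken with respect to g and v instead of w.
    return x @ weight_norm(v, g) + b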
This reparameterization improves the conditioning of the optimization problem and accelerates the convergence of stochastic gradient descent. It is independent of the batch size and is suitable for recurrent models (such as LSTMs) and noise-sensitive applications (such as deep reinforcement learning or generative models), for which batch normalization is less well suited.
Weight Normalization also has an obvious defect: unlike BN, it does not normalize the scale of the features themselves, so initialization must be handled carefully. For this reason, the authors propose an initialization scheme for the vector v and the scalar g.
Batch Renormalization(2017)
Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models
As mentioned earlier, BN uses the expectation of the per-mini-batch means and variances collected during training as the mean and variance at inference time, on the premise that the mini-batches and the overall sample population are independent and identically distributed. The third defect of BN is therefore poor performance when the samples in a mini-batch are not independent and identically distributed.
To address the first defect (performance degradation when the batch size is too small) and this third defect, the authors proposed Batch Renormalization (BRN).
The main difference between BRN and BN: BN uses the expectation of the per-mini-batch means and variances as an estimate of the whole dataset's statistics, yet during training each mini-batch is normalized with its own mean and variance, so the statistics used at inference differ from those used during training. BRN instead continuously learns and corrects an estimate of the dataset-level mean and variance during training, keeps the normalization as close to it as possible, and finally uses it at inference.
BRN implementation algorithm is as follows:
Note: here r and d are the scaling and shifting correction terms, and they do not participate in back-propagation.
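A sketch of the training-time computation for an (N, D) input; mu and sigma stand for the running (moving-average) estimates, whose update rule and the gradual relaxation of r_max and d_max described in the paper are omitted here:

import numpy as np

def batch_renorm_train(x, gamma, beta, mu, sigma, r_max=3.0, d_max=5.0, eps=1e-5):
    mu_b = x.mean(axis=0)
    sigma_b = x.std(axis=0) + eps
    # r and d correct the mini-batch statistics toward the running ones;
    # in a real framework they are wrapped in a stop-gradient.
    r = np.clip(sigma_b / sigma, 1.0 / r_max, r_max)
    d = np.clip((mu_b - mu) / sigma, -d_max, d_max)
    x_hat = (x - mu_b) / sigma_b * r + d
    return gamma * x_hat + beta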
When training with a small batch size or with non-i.i.d. mini-batches, models trained with BRN perform significantly better than with BN. At the same time, BRN retains BN's advantages, such as insensitivity to initialization and training efficiency.
Cross-GPU BN(2018)
MegDet: A Large Mini-Batch Object Detector
In multi-GPU distributed training, the input data is split evenly across the cards, and the forward pass, backward pass, and parameter updates are carried out on each card. BN then normalizes only the samples on a single card, so the number of samples actually normalized together is not the full batch size. For example, with batchsize=32 on four cards, each card normalizes over only 32/4 = 8 samples.
The idea of Cross-GPU Batch Normalization is to compute the normalization statistics jointly across all the cards.
The specific implementation algorithm is as follows:
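The sketch below simulates the idea with a list of per-card shards standing in for the data on each GPU; a real implementation would exchange the per-card sums with all-reduce operations (the paper computes the global mean in one pass and the variance in a second, which is mathematically equivalent):

import numpy as np

def cross_gpu_batch_norm(shards, gamma, beta, eps=1e-5):
    # shards: list of (n_i, C) arrays, one per GPU.
    # Each card contributes its local sum and squared sum; the global
    # statistics then match single-card BN on the full batch.
    n = sum(s.shape[0] for s in shards)
    total = sum(s.sum(axis=0) for s in shards)             # would be all-reduce #1
    total_sq = sum((s ** 2).sum(axis=0) for s in shards)   # would be all-reduce #2
    mean = total / n
    var = total_sq / n - mean ** 2
    return [gamma * (s - mean) / np.sqrt(var + eps) + beta for s in shards]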
FRN(2019)
Paper: Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks
FRN again targets the problem that small batch sizes degrade performance.
The FRN layer consists of two components: Filter Response Normalization (FRN) and the Thresholded Linear Unit (TLU).
The former is very similar to Instance Normalization (IN) in that it operates on a single sample and a single channel; the difference is that IN subtracts the mean and then divides by the standard deviation, whereas FRN does not subtract the mean. The authors' reasoning: although mean subtraction is a standard part of normalization schemes, for a batch-independent scheme it is an arbitrary choice with no real justification.
TLU adds a learnable threshold to ReLU. Because FRN does not subtract the mean, the normalized result may be arbitrarily offset from zero; if FRN is followed by a plain ReLU, many activations may be zeroed out, which hurts training and performance.
FRN implementation algorithm
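A minimal NumPy sketch of FRN followed by TLU for an (N, C, H, W) input; gamma, beta, and tau are per-channel learnable parameters, and eps is a small constant:

import numpy as np

def frn_tlu(x, gamma, beta, tau, eps=1e-6):
    # nu2: mean squared activation per sample and per channel (no mean subtraction).
    nu2 = (x ** 2).mean(axis=(2, 3), keepdims=True)
    x_hat = x / np.sqrt(nu2 + eps)
    y = gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
    return np.maximum(y, tau.reshape(1, -1, 1, 1))   # TLU: ReLU with a learnable threshold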
The experimental results
Cross-Iteration BN(2020)
Paper: Cross-Iteration Batch Normalization
The main idea of CBN is to include the samples from the previous K-1 iterations in the computation of the current mean and variance. However, because the network weights have been updated since those iterations, their statistics cannot be used directly; the paper proposes approximating them with Taylor polynomials.
YOLOv4 also proposes an improved version (CmBN) in which statistics are accumulated only over the four mini-batches within a single batch, and the weights, scale, and offset are updated after the fourth mini-batch.
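A deliberately simplified sketch of the cross-iteration idea: keep the statistics of the most recent k mini-batches and average them with the current ones. The Taylor-polynomial compensation that CBN applies to stale statistics, and the exact CmBN update schedule, are omitted:

import numpy as np
from collections import deque

class CrossIterationStats:
    def __init__(self, k=4):
        # Rolling buffers holding per-channel statistics of the last k iterations.
        self.means = deque(maxlen=k)
        self.vars = deque(maxlen=k)

    def update(self, x):
        # x: current mini-batch of shape (N, C, H, W).
        self.means.append(x.mean(axis=(0, 2, 3)))
        self.vars.append(x.var(axis=(0, 2, 3)))
        mean = np.mean(list(self.means), axis=0)   # effective statistics over k iterations
        var = np.mean(list(self.vars), axis=0)
        return mean, var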
The experimental results
Conclusion
This article has introduced the most classic normalization methods in use today, most of which are modifications of BN, and has covered their main ideas, the way they improve on BN, and their application scenarios. For some methods the implementation details were not covered in depth; interested readers should consult the original papers.
Besides the methods above, there are many other normalization methods, such as EvalNorm, Normalization Propagation, and Normalizing the Normalizers, but they are less commonly used and are not discussed here.
Other articles
Summary of attention Mechanism
Summary of feature pyramid
Summary of data augmentation methods
CNN visualization technology summary
Summary of CNN structure evolution — classic model
Summary of CNN structure evolution — lightweight model
Summary of CNN’s structural evolution — Design principles
Summary of pooling technology
Summary of non-maximum suppression
Summary of English literature reading methods
Summary of common ideas of paper innovation
This article comes from the technical summary series of the public account "CV Technical Guide".
A PDF of all the summary-series articles above can be obtained by replying "Technical summary" in the public account "CV Technical Guide".
The reference papers
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Instance Normalization: The Missing Ingredient for Fast Stylization
- Layer Normalization
- Group Normalization
- Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
- Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models
- MegDet: A Large Mini-Batch Object Detector
- Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks
- Cross-Iteration Batch Normalization
- YOLOv4: Optimal Speed and Accuracy of Object Detection
- EvalNorm: Estimating Batch Normalization Statistics for Evaluation