Preface
This article explains why initialization is important and summarizes several commonly used initialization methods: all-zero or constant initialization, normal initialization, uniform initialization, Xavier initialization, He initialization, and pre-trained initialization. Some initialization directions are still active areas of research: data-dependent initialization, sparse weight matrix initialization, and random orthogonal matrix initialization.
This article comes from the technical summary series of the public account CV Technical Guide.
Why is initialization important
Improperly initialized weights can cause vanishing or exploding gradients, which negatively affect the training process.
With vanishing gradients, weight updates are small and convergence slows down; this makes optimizing the loss function slow and, in the worst case, may prevent the network from converging at all. Conversely, initializing with excessively large weights can cause the gradients to explode during forward or back propagation.
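To make this concrete, here is a minimal sketch (not from the original article; it assumes PyTorch and a toy 50-layer tanh stack) showing how the scale of random initialization controls whether the forward signal vanishes or saturates:

```python
# Toy experiment: push a random batch through 50 linear+tanh layers whose weights
# are drawn from N(0, std^2), and watch the spread of the final activations.
import torch

def final_activation_std(std, n_layers=50, width=512):
    x = torch.randn(1024, width)
    for _ in range(n_layers):
        w = torch.randn(width, width) * std
        x = torch.tanh(x @ w)
    return x.std().item()

for std in (0.01, 1.0, 1.0 / 512 ** 0.5):
    print(f"init std {std:.4f} -> final activation std {final_activation_std(std):.6f}")
# std=0.01: activations collapse toward zero (the signal, and hence the gradients, vanish)
# std=1.00: tanh saturates at +/-1, so gradients vanish in back propagation
# std=1/sqrt(512): a Xavier-like scale keeps the signal in a healthy range
```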
Common initialization methods
1. All-zero or Constant Initialization
Since all the weights are identical, every neuron in a layer receives the same gradient and learns the same features, which is known as the "symmetry" problem.
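A short sketch of the symmetry problem (my own illustration, assuming PyTorch; the layer sizes are arbitrary): with constant weights, every hidden unit computes the same output and receives the same gradient, so the rows of the gradient matrix are identical.

```python
import torch
import torch.nn as nn

layer1 = nn.Linear(4, 3)
layer2 = nn.Linear(3, 1)
for layer in (layer1, layer2):
    nn.init.constant_(layer.weight, 0.5)  # every weight gets the same value
    nn.init.zeros_(layer.bias)

x = torch.randn(8, 4)
loss = layer2(torch.relu(layer1(x))).mean()
loss.backward()

# All three rows of this gradient are identical, so the three hidden units
# remain exact copies of each other no matter how long we train.
print(layer1.weight.grad)
```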
2. Normal Initialization
The mean is zero, and the standard deviation is set to a small value.
The advantage of this is that the weights have no systematic bias toward positive or negative values: both are equally likely, which is a reasonable default.
Example: in 2012, AlexNet was initialized with zero-mean Gaussian (normal) noise with a standard deviation of 0.01, with the biases in some layers set to 1. However, this kind of normal random initialization is not suitable for training very deep networks, especially those using ReLU activation functions, because of the vanishing and exploding gradient problems mentioned earlier.
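As a rough sketch of this style of initialization in PyTorch (the layer sizes below are arbitrary, not AlexNet's actual architecture):

```python
import torch.nn as nn

def init_alexnet_style(module):
    # Zero-mean Gaussian noise with std 0.01 for weights; constant bias of 1,
    # as used for some layers in the original AlexNet paper.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        if module.bias is not None:
            nn.init.constant_(module.bias, 1.0)

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 30 * 30, 10),
)
model.apply(init_alexnet_style)
```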
3. Uniform Initialization
The interval of the uniform distribution is usually [-1/sqrt(fan_in), 1/sqrt(fan_in)].
Here fan_in is the number of input neurons and fan_out is the number of output neurons.
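A minimal sketch (assuming PyTorch; the layer sizes are arbitrary) of this fan_in-based uniform initialization:

```python
import math
import torch.nn as nn

layer = nn.Linear(256, 128)                    # fan_in = 256, fan_out = 128
bound = 1.0 / math.sqrt(layer.in_features)     # 1 / sqrt(fan_in)
nn.init.uniform_(layer.weight, -bound, bound)  # U(-1/sqrt(fan_in), 1/sqrt(fan_in))
nn.init.zeros_(layer.bias)
```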
4. Xavier Initialization
Paper: Understanding the difficulty of training deep feedforward neural networks
Consider the shape of the sigmoid function:
If the initial weights are very small, the variance of the signal shrinks as it passes through the layers, and the inputs to each layer become smaller and smaller. Near zero the sigmoid is almost linear, so the network loses its nonlinearity.
If the initial weights are very large, the variance grows rapidly with each layer, and the inputs become very large. For large inputs the derivative of the sigmoid approaches zero, so back propagation runs into the vanishing gradient problem.
To address this problem, Xavier Glorot and Yoshua Bengio proposed "Xavier" initialization, which takes the size of the network (the number of input and output units) into account when initializing the weights. It keeps the weights within a reasonable range by making their scale inversely proportional to the square root of the number of units in the previous layer.
There are two variants of Xavier initialization:
**Xavier Normal:** normal distribution with mean 0 and standard deviation sqrt(2/(fan_in + fan_out)).
**Xavier Uniform:** uniform distribution on the interval [-sqrt(6/(fan_in + fan_out)), sqrt(6/(fan_in + fan_out))].
Xavier initialization applies to networks that use tanh and sigmoid as activation functions.
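Both variants are available as built-in PyTorch helpers (xavier_normal_ and xavier_uniform_); the sketch below uses an arbitrary layer size and the default gain of 1 so that the formulas match those above:

```python
import torch.nn as nn

layer = nn.Linear(512, 256)

# Xavier normal: N(0, 2 / (fan_in + fan_out))
nn.init.xavier_normal_(layer.weight)

# Xavier uniform: U(-sqrt(6/(fan_in+fan_out)), sqrt(6/(fan_in+fan_out)))
# (each call overwrites the previous initialization; use one or the other)
nn.init.xavier_uniform_(layer.weight)

nn.init.zeros_(layer.bias)
```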
5. He Initialization
Paper: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
The choice of activation function plays an important role in determining how effective an initialization method is. Activation functions are differentiable and introduce the nonlinearity that neural networks need to solve complex tasks. ReLU and Leaky ReLU are commonly used because they are relatively robust to the vanishing/exploding gradient problem.
Xavier initialization performs well with tanh but poorly with ReLU and similar activation functions. Kaiming He et al. therefore introduced a more robust weight initialization method, He initialization.
There are also two variants of He Initialization:
**He Normal:** normal distribution with mean 0 and standard deviation sqrt(2/fan_in).
**He Uniform:** uniform distribution on the interval [-sqrt(6/fan_in), sqrt(6/fan_in)].
He Initialization applies to networks that use nonlinear activation functions such as ReLU and Leaky ReLU.
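PyTorch exposes He initialization under the name "kaiming"; a minimal sketch with an arbitrary layer size:

```python
import torch.nn as nn

layer = nn.Linear(512, 256)

# He normal: N(0, 2 / fan_in)
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')

# He uniform: U(-sqrt(6 / fan_in), sqrt(6 / fan_in))
# (overwrites the previous call; pick whichever variant you need)
nn.init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')

nn.init.zeros_(layer.bias)
```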
Both He initialization and Xavier initialization rely on a similar theoretical analysis: find a good variance for the distribution from which the initial parameters are drawn. This variance depends on the activation function used and is derived without explicitly assuming a particular distribution type.
A figure in Kaiming He's paper shows that the improved He initialization strategy (red) reduces the error rate faster than the Xavier method (blue) on a deep ReLU network.
For derivations of the Xavier and He initialization methods, see Pierre Ouannes' article "How to initialize deep neural networks? Xavier and Kaiming initialization".
6. Pre-trained Initialization
Using pre-trained weights as initializers results in faster convergence and a better starting point than other initializers.
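A common way to do this in practice (a sketch assuming a recent torchvision; the 10-class head is an arbitrary example) is to load ImageNet weights and replace only the task-specific head:

```python
import torch.nn as nn
import torchvision

# The backbone starts from ImageNet pre-trained weights ...
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
# ... and only the new classification head is randomly initialized.
model.fc = nn.Linear(model.fc.in_features, 10)
```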
In addition to the methods above, there is also LeCun initialization. It is similar in spirit to He and Xavier initialization but is rarely used, so it is not covered here.
Weight initialization is still an active area of research. Several interesting research directions have emerged, including data-dependent initialization, sparse weight matrix initialization, and random orthogonal matrix initialization.
Data-dependent initialization
Paper: Data-dependent Initializations of Convolutional Neural Networks
Address: arxiv.org/abs/1511.06…
Sparse weight matrix initialization
Address: openai.com/blog/block-…
Random orthogonal matrix initialization
Paper: Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks
Address: arxiv.org/abs/1312.61…
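PyTorch already ships basic initializers along these lines (not the exact methods of the papers or blog post above); a minimal sketch with an arbitrary layer size:

```python
import torch.nn as nn

layer = nn.Linear(256, 256)

# Sparse initialization: 90% of the entries in each column are zero,
# the rest are drawn from N(0, 0.01^2).
nn.init.sparse_(layer.weight, sparsity=0.9, std=0.01)

# Random orthogonal initialization: the weight matrix is (semi-)orthogonal,
# in the spirit of Saxe et al.'s analysis of deep linear networks.
nn.init.orthogonal_(layer.weight)
```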
References
1. medium.com/comet-ml/se…
2. medium.com/analytics-v…
3. He, K. et al. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
4. Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.