Gradient Centralization (GC) makes network training more stable and improves generalization by zero-centering the weight gradients. The algorithm is simple, and the paper's thorough theoretical analysis explains the principle behind GC well.

Source: Xiaofei algorithm engineering Notes public account

Gradient Centralization: A New Optimization Technique for Deep Neural Networks

  • Paper: Arxiv.org/abs/2004.01…
  • Code: Github.com/Yonghongwei…

Introduction


Optimizers such as SGD and SGDM are critical for training deep neural networks on large datasets. An optimizer has two goals: to speed up the training process and to improve the generalization ability of the model. Much work has gone into improving optimizers like SGD, for example overcoming vanishing and exploding gradients during training with effective tricks such as weight initialization, activation functions, gradient clipping, and adaptive learning rates. Other work standardizes weights and feature values from a statistical perspective to make training more stable, such as the feature-map standardization method BN and the weight standardization method WN.

Unlike standardization methods that act on weights and feature values, this paper proposes Gradient Centralization (GC), a high-performance network optimization algorithm that acts on the weight gradients. It can accelerate network training, improve generalization, and remains compatible with model fine-tuning. As shown in Figure (a), the idea of GC is simple: zero-center the gradient vectors, an operation that can easily be embedded into a variety of optimizers. The main contributions are as follows:

  • A new, general network optimization method, Gradient Centralization (GC), is proposed, which not only smooths and speeds up the training process but also improves the generalization ability of the model.
  • The theoretical properties of GC are analyzed, showing that GC constrains the loss function, regularizes the weight space and the output feature space, and improves the generalization ability of the model. Moreover, the constrained loss function has better Lipschitzness (robustness to perturbation: the function's slope is bounded by a Lipschitz constant), making training more stable and efficient.

Gradient Centralization


Motivation

BN and WS apply Z-score standardization to feature values and weights, respectively, which in fact indirectly constrains the weight gradients and thereby improves the Lipschitz property of the loss function during optimization. Inspired by this, the paper first tried applying Z-score standardization directly to the gradients, but experiments showed that it did not improve training stability. The authors then tried computing the mean of each gradient vector and zero-centering it. Experiments show that this effectively improves the Lipschitz property of the loss function, making network training more stable and more generalizable; this is the Gradient Centralization (GC) algorithm.

Notations

Define some basic notation. We use $W \in \mathbb{R}^{M \times N}$ to denote a weight matrix in general, covering the fully connected weight matrix $W_{fc} \in \mathbb{R}^{C_{in}\times C_{out}}$ and the (reshaped) convolutional weight tensor $W_{conv} \in \mathbb{R}^{(C_{in} k_1 k_2)\times C_{out}}$. $w_i \in \mathbb{R}^M$ is the $i$-th column of the weight matrix $W$, $\mathcal{L}$ is the objective function, $\nabla_{W}\mathcal{L}$ and $\nabla_{w_i}\mathcal{L}$ are the gradients of $\mathcal{L}$ with respect to $W$ and $w_i$, and $\nabla_{W}\mathcal{L}$ has the same size as $W$. If $X$ denotes the input feature map, then $W^T X$ is the output feature map. $e=\frac{1}{\sqrt{M}}\mathbf{1}$ is an $M$-dimensional unit vector and $I\in\mathbb{R}^{M\times M}$ is the identity matrix. The sketch below shows how these shapes map onto a typical convolution weight tensor.
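As a concrete illustration of these shapes (a toy sketch, assuming a PyTorch-style `(C_out, C_in, k1, k2)` convolution weight layout, which this post does not specify):

```python
import numpy as np

# Toy example of the notation above: W_conv is the reshaped, transposed view
# of a conv weight tensor, with shape (C_in*k1*k2, C_out); w_i is one column.
C_out, C_in, k1, k2 = 64, 32, 3, 3
conv_weight = np.random.randn(C_out, C_in, k1, k2)

M = C_in * k1 * k2                         # length of each column w_i
W_conv = conv_weight.reshape(C_out, M).T   # shape (M, C_out)
w_i = W_conv[:, 0]                         # one column w_i in R^M

e = np.ones(M) / np.sqrt(M)                # unit vector e = (1/sqrt(M)) * 1
I = np.eye(M)                              # identity matrix I in R^{M x M}
print(W_conv.shape, w_i.shape, e.shape)    # (288, 64) (288,) (288,)
```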

Formulation of GC

For the weight vector $w_i$ of a convolutional or fully connected layer, let $\nabla_{w_i}\mathcal{L}$ be its gradient obtained by back-propagation. As shown in Figure (b), first compute its mean $\mu_{\nabla_{w_i}\mathcal{L}} = \frac{1}{M}\sum_{j=1}^{M}\nabla_{w_{i,j}}\mathcal{L}$; the GC operator $\Phi_{GC}$ is then defined as:

$$\Phi_{GC}(\nabla_{w_i}\mathcal{L}) = \nabla_{w_i}\mathcal{L} - \mu_{\nabla_{w_i}\mathcal{L}} \tag{1}$$

Formula 1 can also be written in matrix form:

$$\Phi_{GC}(\nabla_{W}\mathcal{L}) = P\nabla_{W}\mathcal{L}, \quad P = I - ee^T \tag{2}$$

$P$ is built from the identity matrix and the matrix formed by the unit vector $e$, which are responsible for preserving the original values and computing the mean, respectively. The sketch below checks the two forms against each other.
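A minimal NumPy sketch of the operator (the function name is ours, not from the paper's released code), verifying that the per-column zero-mean form (Formula 1) and the matrix form (Formula 2) agree:

```python
import numpy as np

def gradient_centralization(grad: np.ndarray) -> np.ndarray:
    """Zero-center each column of the gradient matrix:
    Phi_GC(grad_wi) = grad_wi - mean_j(grad_wi_j)."""
    return grad - grad.mean(axis=0, keepdims=True)

# Equivalent matrix form: Phi_GC(grad) = P @ grad with P = I - e e^T
M, N = 288, 64
grad = np.random.randn(M, N)
e = np.ones((M, 1)) / np.sqrt(M)
P = np.eye(M) - e @ e.T

assert np.allclose(gradient_centralization(grad), P @ grad)
```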

Embedding of GC to SGDM/Adam

GC can be embedded straightforwardly into today's mainstream network optimizers, such as SGDM and Adam, by directly using the centralized gradient $\Phi_{GC}(\nabla_w \mathcal{L})$ for the weight update.

Algorithms 1 and 2 show how GC is embedded into SGDM and Adam, respectively. Almost no modification of the original optimizer is required: only one line that zero-centers the gradient is added, which takes about 0.6s. A minimal sketch of this embedding follows.
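A hedged sketch of that one-line embedding in PyTorch (the helper `centralize_gradients`, the toy model, and the training snippet are illustrative, not the authors' released code, which performs the same mean subtraction inside the optimizer itself):

```python
import torch
from torch import nn

def centralize_gradients(model: nn.Module) -> None:
    """Zero-center the gradient of every conv/FC weight: for each output
    channel (dim 0), subtract the mean over the remaining dimensions."""
    for p in model.parameters():
        if p.grad is None or p.grad.dim() <= 1:
            continue  # skip biases, BN parameters, etc.
        dims = tuple(range(1, p.grad.dim()))
        p.grad.sub_(p.grad.mean(dim=dims, keepdim=True))

# Usage with a standard SGDM loop: centralize right before the update step.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.Flatten(), nn.Linear(8 * 30 * 30, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
centralize_gradients(model)   # the only extra line compared to plain SGDM
optimizer.step()
```

Centering is applied per output channel (dim 0) of each conv/FC weight, which corresponds to zero-centering each column $w_i$ in the matrix view above; 1-D parameters such as biases are left untouched.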

Properties of GC


The following gives a theoretical analysis of why GC improves the generalization ability of the model and accelerates training.

Improving Generalization Performance

An important advantage of GC is improved model generalization, mainly due to its regularization of the weight space and the output feature space.

  • Weight space regularization

First, consider the physical meaning of $P$. It can be verified that:

$$e^T P = 0, \qquad P^2 = P = P^T$$

That is, $P$ can be regarded as a projection matrix that maps $\nabla_W \mathcal{L}$ onto the hyperplane whose normal vector is $e$, and $P\nabla_W \mathcal{L}$ is the projected gradient.

Taking SGD optimization as an example, projecting the weight gradient constrains the weight space to a hyperplane (or a Riemannian manifold), as shown in Figure 2. The gradient is first projected onto the hyperplane $e^T(w - w^t)=0$, and the weights are then updated along the projected direction $-P\nabla_{w^t}\mathcal{L}$. From $e^T(w - w^t)=0$ we get $e^T w^{t+1} = e^T w^t = \cdots = e^T w^0$, so the objective effectively becomes the constrained problem:

$$\min_{w}\ \mathcal{L}(w) \quad \text{s.t.} \quad e^T(w - w^0) = 0$$

This is a constrained optimization of the weight space $W$. Regularizing the solution space of $W$ reduces the risk of overfitting (overfitting usually amounts to learning overly complex weights to fit the training data) and improves the generalization ability of the network, especially when training samples are scarce. WS imposes the constraint $e^T w = 0$ on the weights, so if the initial weights do not satisfy it, they are directly modified to do so; in a fine-tuning setting WS therefore discards the benefits of the pre-trained model. GC, in contrast, accommodates any initial weights, since $e^T(w^0 - w^0)=0$ holds trivially, as the numeric sketch below illustrates.
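A small numeric check of this constraint (plain NumPy, with random vectors standing in for back-propagated gradients): under GC-projected SGD updates, $e^T w$ stays at its initial value, so any pre-trained initialization is preserved.

```python
import numpy as np

# With projected updates w <- w - lr * P @ g, the quantity e^T w never
# changes, so e^T(w^t - w^0) = 0 holds throughout training.
rng = np.random.default_rng(0)
M = 288
e = np.ones((M, 1)) / np.sqrt(M)
P = np.eye(M) - e @ e.T

w = rng.normal(size=(M, 1))      # arbitrary initial weights (e.g. pre-trained)
initial_mean_component = (e.T @ w).item()

for _ in range(100):
    g = rng.normal(size=(M, 1))  # stand-in for a back-propagated gradient
    w -= 0.1 * P @ g             # SGD step with the centralized gradient

assert np.isclose((e.T @ w).item(), initial_mean_component)
```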

  • Output feature space regularization

Taking SGD optimization as an example, the weight update is $w^{t+1} = w^t - \alpha^t P\nabla_{w^t}\mathcal{L}$, from which we can derive $w^t = w^0 - P\sum_{i=0}^{t-1}\alpha^{(i)}\nabla_{w^{(i)}}\mathcal{L}$. For any input feature vector $x$, the following theorem holds:

Theorem 4.1 shows that a constant shift of the input features changes the output by an amount that depends only on the scalar $\gamma$ and on $1^T w^0$, not on the current weights $w^t$; $\gamma 1^T w^0$ is the scaled mean of the initial weight vector. If the mean of $w^0$ is close to 0, a constant shift of the input features hardly changes the output features, meaning the output feature space is more robust to variations in the training samples.

Visualizing the mean values of different initial weights of ResNet50 shows that they are very small (less than $e^{-7}$), which indicates that when GC is used for training, the output features are not overly sensitive to shifts of the input features. This property regularizes the output feature space and improves the generalization ability of the trained network; the sketch below verifies it numerically on random data.
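A numeric sketch of this robustness property (random vectors stand in for real gradients, and the layer is a single linear map, which is all Theorem 4.1 needs): the output change caused by shifting the input by a constant $\gamma$ equals $\gamma 1^T w^0$ at every step, because $1^T P = 0$ keeps $1^T w$ fixed.

```python
import numpy as np

# (x + gamma*1)^T w^t - x^T w^t = gamma * 1^T w^t = gamma * 1^T w^0,
# independent of t, under GC-projected SGD updates.
rng = np.random.default_rng(1)
M, gamma = 288, 0.5
e = np.ones((M, 1)) / np.sqrt(M)
P = np.eye(M) - e @ e.T

w0 = rng.normal(size=(M, 1)) * 1e-3          # small initial weights
w = w0.copy()
for _ in range(50):
    w -= 0.1 * P @ rng.normal(size=(M, 1))   # GC-projected SGD steps

x = rng.normal(size=(M, 1))
shift_effect = ((x + gamma).T @ w - x.T @ w).item()   # gamma * 1^T w^t
assert np.isclose(shift_effect, gamma * w0.sum())
```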

Accelerating Training Process

  • Optimization landscape smoothing

As mentioned above, both BN and WS indirectly constrain the weight gradients so that the loss function satisfies the Lipschitz property: both $\|\nabla_w \mathcal{L}\|_2$ and $\|\nabla^2_w \mathcal{L}\|_2$ (the Hessian of $w$) have upper bounds. GC constrains the gradient directly and enjoys properties similar to those of BN and WS. Compared with the original loss function, the GC-constrained function satisfies the following theorem:

The relevant proof can be found in the appendix of the original paper. Theorem 4.2 shows that the GC-constrained function has better Lipschitzness than the original one. Better Lipschitzness means more stable gradients and a smoother optimization process, which accelerates training in a way similar to BN and WS; a small numeric check of the underlying norm inequality follows.
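A quick numeric check of the norm inequality behind this (not the full theorem, just the projection step): since $P = I - ee^T$ is an orthogonal projection, the centralized gradient is never longer than the original one.

```python
import numpy as np

# ||P g||_2 <= ||g||_2 for an orthogonal projection P = I - e e^T, which is
# the mechanism that bounds the gradient norm of the GC-constrained loss.
rng = np.random.default_rng(2)
M = 288
e = np.ones((M, 1)) / np.sqrt(M)
P = np.eye(M) - e @ e.T

for _ in range(1000):
    g = rng.normal(size=(M, 1))
    assert np.linalg.norm(P @ g) <= np.linalg.norm(g) + 1e-12
```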

  • Gradient explosion suppression

Another advantage of GC is that it suppresses gradient explosion and makes training more stable, playing a role similar to gradient clipping. Excessively large gradients cause severe loss oscillation and make convergence difficult, whereas gradient clipping suppresses large gradients and makes training more stable and faster.

Visualizing the $L_2$ norm and the maximum value of the gradients shows that both are smaller with GC than without, which is consistent with Theorem 4.2: GC makes the training process smoother and faster.

Experiment


Performance comparison combined with BN and WS.

Comparison experiments on mini-Imagenet.

Comparative experiments on CIFAR100.

Comparison experiments on ImageNet.

Performance comparison on fine-grained data sets.

Performance comparison on detection and segmentation tasks.

Conclusion


Gradient Centralization (GC) makes network training more stable and improves generalization by zero-centering the weight gradients. The algorithm is simple, and the paper's thorough theoretical analysis explains the principle behind GC well.




