This is the 11th day of my November challenge

Improvement of stochastic gradient descent

Hessian technology

Hessian technique is a method to minimize the cost function for C=C(w),w=w1,w2… C=C(w),w=w_1,w_2,… C=C(w),w=w1,w2,… . Through Taylor’s expansion (if the function meets certain conditions, Taylor’s formula can use the derivative values of each order at a certain point as coefficients to construct a polynomial to approximate the function), the cost function can be approximated at point WWW as:


C ( w + Δ w ) = C ( w ) + j partial C partial w j Δ w j + 1 2 j k Δ w j partial 2 C partial w j partial w k Δ w k + . . . C(w+\Delta w)=C(w)+\sum_j\frac{\partial C}{\partial w_j}\Delta w_j+\frac{1}{2}\sum_{jk}\Delta w_j\frac{\partial^2 C}{\partial w_j\partial w_k}\Delta w_k+…

It can be simplified as:


C ( w + Δ w ) = C ( w ) + C Δ w + 1 2 Δ w T H Δ w + . . . C(w+\Delta w)=C(w)+\nabla C\Delta w+\frac{1}{2}\Delta w^TH\Delta w+…

∇C\ Nabla C∇C represents the gradient vector, HHH is the Hessian matrix, and the JKJKJK term in the matrix is partial 2C∂wj∂wk\frac{\partial w_j\partial w_K}∂ Wj ∂ wK ∂2C.

To simplify the calculation of higher-order terms, the approximate CCC value can be omitted:


C ( w + Δ w ) material C ( w ) + C Δ w + 1 2 Δ w T H Δ w C(w+\Delta w) \approx C(w)+\nabla C\Delta w+\frac{1}{2}\Delta w^TH\Delta w

To prove that the right-hand expression can be minimized (that is, the graph of the function is concave), use the following method: O C + Δ w (w) ‘ ‘, and prove C + Δ w (w) ‘ ‘> 0 C (w + w) Delta \’, and proved that C (w + w) Delta \ ‘> 0 C + Δ w (w)’ ‘, and prove C + Δ w (w) ‘ ‘> 0, The Hessian matrix has the property that the second partial derivative of the function is always >0 when it is a positive definite matrix, and the Hessian matrix and the function are concave and convex. To:


Δ w = H 1 C \Delta w=-H^{-1}\nabla C

Is: w + Δ w = w – H – 1 ∇ the Cw + \ Delta w = w – H ^ {1} \ nabla the Cw + Δ w = w ∇ C – H – 1

In practice, Δ w = – eta H – 1 ∇ \ Delta w = C – eta H ^ \ {1} \ nabla Δ w = C – eta ∇ H – 1 C, which eta \ eta is eta for the learning rate. The Hessian method converges faster than the standard gradient descent method, and avoids the problems of multipaths commonly encountered in gradient descent by introducing the second-order variation information of cost function.

However, the calculation of Hessian matrix in Hessian method is very troublesome. For a network with 10710^7107 weights, the corresponding Hessian matrix will have 107×10710^7 times 10^7107×107 elements, which requires a large amount of calculation.

Gradient descent based on momentum

Gradient descent based on momentum improves the Hessian method to avoid the problem of too large matrix (that is, avoid the matrix of the second derivative), which is an optimization method. Momentum method introduces the concept of speed in physics, introduces the parameter VVV of velocity and the term μ\muμ (moment coefficient) representing friction, and the expression method of CCC becomes:


v v = mu v eta C   w w = w + v v\rightarrow v’=\mu v-\eta \nabla C \\ \ \\ w\rightarrow w’=w+v’

To understand this formula, consider mu = 1 \ mu = 1 mu = 1 (that is, there is no friction) what happens, v = v – eta ∇ C, w = w + v – eta ∇ Cv = v – eta \ \ nabla C, w = w + v – eta \ \ nabla Cv = v – eta ∇ C, w = w + v – eta ∇ C, As you can see in the figure below, each step increases in speed, so it reaches the bottom faster and faster, ensuring that momentum techniques run faster than standard gradient descent.

However, it has been mentioned before that too large descending step length (or speed) will lead to the shock across the bottom when descending near the bottom, which will affect the learning speed. To solve this problem, the parameter of friction μ\muμ is introduced to control the descending speed. When μ=1\mu =1 μ=1, there is no friction. The speed is entirely determined by the size of gradient ∇C\ Nabla C∇C. When μ=0\mu =0μ=0, velocity has no effect and the original gradient descent method is returned. In practice, the appropriate μ\muμ is selected by hold-out validation of the data set, similar to the previous method of selecting η\etaη.

Other models of artificial neurons

The neurons mentioned above are all S-type neurons, and s-type neurons can theoretically fit any distribution. But in practice, learning might be better for some applications of other neurons.

Tanh neurons

Use hyperbolic tangent instead of S-tangent:


t a n h ( w x + b )   t a n h ( z ) = e z e z e z + e z .   z = w x + b tanh(wx+b) \\ \ \\ tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}},\ z=wx+b

Sigma (z)\sigma(z)σ(z)


sigma ( z ) = 1 1 + e z \sigma(z)=\frac{1}{1+e^{-z}}

Practice: prove sigma (z) = 1 + tanh (z / 2) 2 \ sigma (z) = \ frac {1 + tanh (z / 2)} {2} sigma (z) = 21 + tanh (z / 2)


t a n h ( z / 2 ) = e z / 2 e z / 2 e z / 2 + e z / 2 = e z / 2 e z / 2 e z / 2 + e z / 2 e z / 2 e z / 2 e z / 2 e z / 2 = e 2 2 e 0 + e z e z e z   t a n h ( z / 2 ) + 1 = e z 2 + e z e z e z + e z e z e z e z = 2 ( e z 1 ) e z e z = 2 ( e z 1 ) ( e z 1 ) ( 1 + e z ) = 2 sigma ( z ) tanh(z/2)=\frac{e^{z/2}-e^{-z/2}}{e^{z/2}+e^{-z/2}}=\frac{e^{z/2}-e^{-z/2}}{e^{z/2}+e^{-z/2}}\frac{e^{z/2}-e^{-z/2}}{e^{ z/2}-e^{-z/2}}= \frac{e^2-2e^0+e^{-z}}{e^z-e^{-z}} \\ \ \\ tanh(z/2)+1=\frac{e^z-2+e^{-z}}{e^z-e^{-z}}+\frac{e^z-e^{-z}}{e^z-e^{-z}}=\frac{2(e^z-1)}{e^z-e^{-z}}=\frac{2(e^z-1)}{(e ^z-1)(1+e^{-z})}=2\sigma(z)

In summary, the TANh function is actually formed after the Sigmoid function changes in proportion, and its shape is similar, as shown below:

The difference between the two functions is that the output range of TANh neurons is (−1, 1) instead of (0, 1). Therefore, when using TANh neurons, it may be necessary to regularize the final output to limit the activation value to 0-1.

Differences in the use of TANH and S type neurons

In the case of s-type neurons, all activation values are positive, and the gradient is aδa\deltaaδ in the back propagation, so the activation value must be positive, so the gradient is only related to δ\deltaδ, and the ownership weight for the same neuron will either increase or decrease together. In some cases, the weight of the same neuron may need to be reversed, and TANH neuron may be a better choice. It’s just a heuristic idea, and there are no quick, accurate rules about which types of neurons learn faster or generalize best for a particular application.

ReLU neurons

ReLU neuron, namely, corrected linear neuron or corrected linear unit, whose output is:


m a x ( 0 . w x + b ) max(0,wx+b)

The image is as follows:

ReLU neurons show good results in some image recognition tasks.