Let’s think about how we can speed up our neural network training process.

In general, the more complex the neural network and the larger the training set, the longer training takes, simply because there is far more computation to do. In practice, though, complex models and massive data are often unavoidable. At that point we need to find ways to make neural network training smarter and faster.

This is why people start with the most basic method: SGD (Stochastic Gradient Descent).


Now imagine that the red squares in the figure are the data we need to train on. If we followed the original approach, we would put the entire dataset into the neural network (NN) for every round of learning, and the resources consumed in that process can be significant. A different way of thinking helps here: split the data into small batches and feed them into the NN one batch at a time. This is the proper way to use SGD. Each batch on its own cannot reflect the overall dataset, but it greatly accelerates the training of the NN and, in practice, does not noticeably reduce its accuracy. But what if, even with SGD, training still feels too slow?
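As a rough illustration of the idea, here is a minimal sketch of mini-batch SGD in NumPy. The arrays X and y, the gradient function grad_fn, and the default batch size are placeholders I am assuming for the example, not something from the article.

```python
import numpy as np

def sgd_update(w, grad, lr=0.01):
    # Plain SGD step: move the parameters against the gradient.
    return w - lr * grad

def train_sgd(w, X, y, grad_fn, lr=0.01, batch_size=32, epochs=10):
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)             # shuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]  # take one small batch at a time
            grad = grad_fn(w, X[batch], y[batch])  # gradient computed on the batch only
            w = sgd_update(w, grad, lr)
    return w
```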

All right, fair enough. Repeated research has shown that plain SGD is not the fastest method. As the figure below shows, SGD takes the longest to reach the training target, which is clearly far from the optimal path to a solution discussed in my previous article. There are many ways to speed up training, and most of them work by changing how the parameters of the neural network are updated.


Consider the formula W += -learning_rate * dx. The traditional update of the parameter W adds the negative learning rate times the gradient correction dx to the original parameter. This learning process takes a lot of twists and turns; the resulting path looks like a drunken man's wobbly walk, full of detours. So we take this fellow off the flat ground and put him on a downhill slope: once he starts going downhill a little, he keeps going down almost without thinking and takes far fewer detours. This is known as the Momentum update. Its mathematical form is:

m = b1 * m - learning_rate * dx

W += m
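A minimal sketch of this update, assuming b1 is the momentum coefficient from the formula above and the default values are just common conventions:

```python
def momentum_update(w, m, grad, lr=0.01, b1=0.9):
    # m accumulates past gradients like velocity on a slope,
    # so the update keeps rolling in a consistent direction.
    m = b1 * m - lr * grad
    w = w + m
    return w, m
```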

Another learning method is AdaGrad, which actually manipulates the learning rate itself, so that each parameter ends up with its own effective update size. It is a bit like Momentum, except that instead of giving the drunk man a downhill slope, we give him a pair of shoes that are too small: his feet hurt whenever he wanders off course, and that pain becomes a resistance that forces him to walk straight.
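A minimal sketch of AdaGrad, assuming the small eps term and the defaults, which are standard conventions rather than something from the article:

```python
import numpy as np

def adagrad_update(w, v, grad, lr=0.01, eps=1e-8):
    # v accumulates squared gradients per parameter; parameters that have
    # already moved a lot get a smaller effective learning rate.
    v = v + grad ** 2
    w = w - lr * grad / (np.sqrt(v) + eps)
    return w, v
```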

So at this point you might be wondering: wouldn't it be even better to combine the downhill slope with the tight shoes?

Yes, that is exactly the right idea, and it gives us the RMSProp method. By combining Momentum's inertia with AdaGrad's resistance to wrong directions, RMSProp gets the advantages of both. The math looks like this:

v = b1 * v + (1 - b1) * dx^2

W += -learning_rate * dx / sqrt(v)
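A minimal sketch of this update, again assuming the eps term and default coefficients as common conventions:

```python
import numpy as np

def rmsprop_update(w, v, grad, lr=0.01, b1=0.9, eps=1e-8):
    # Moving average of squared gradients instead of AdaGrad's raw sum.
    v = b1 * v + (1 - b1) * grad ** 2
    # Scale the step by the root of that average.
    w = w - lr * grad / (np.sqrt(v) + eps)
    return w, v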

However, RMSProp is not simply AdaGrad plus Momentum: it is still missing the -learning_rate * dx part of the Momentum update. Completing that piece gives us the Adam algorithm, whose formulas are:

Momentum part: m = b1 * m + (1 - b1) * dx

AdaGrad part: v = b2 * v + (1 - b2) * dx^2

W += -learning_rate * m / sqrt(v)
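A minimal sketch that follows the simplified formulas above; the eps term and default coefficients are standard conventions, and the bias correction used in the full Adam algorithm is omitted here to match the article's formulas.

```python
import numpy as np

def adam_update(w, m, v, grad, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # Momentum part: moving average of the gradients.
    m = b1 * m + (1 - b1) * grad
    # AdaGrad/RMSProp part: moving average of the squared gradients.
    v = b2 * v + (1 - b2) * grad ** 2
    # Step in the momentum direction, scaled by the adaptive denominator.
    w = w - lr * m / (np.sqrt(v) + eps)
    return w, m, v
```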

Experiments show that in most cases Adam converges quickly and solves the task well. In the training of neural networks, the Adam algorithm has become almost indispensable.

That is all this article set out to explain. Since it was written in a hurry, there are bound to be mistakes, and I hope readers will offer their comments so that we can all improve together!