This article is part of the notes for Andrew Ng's Deep Learning course [1].

Author: Huang Haiguang [2]

Main Author: Haiguang Huang, Xingmu Lin (All papers 4, Lesson 5 week 1 and 2, ZhuYanSen: (the third class, and the third, three weeks ago) all papers), He Zhiyao (third week lesson five papers), wang xiang, Hu Han, laughing, Zheng Hao, Li Huaisong, Zhu Yuepeng, Chen Weihe, the cao, LuHaoXiang, Qiu Muchen, Tang Tianze, zhang hao, victor chan, endure, jersey, Shen Weichen, Gu Hongshun, when the super, Annie, Zhao Yifan, Hu Xiaoyang, Duan Xi, Yu Chong, Zhang Xinqian

Editorial staff: Huang Haiguang, Chen Kangkai, Shi Qinglu, Zhong Boyan, Xiang Wei, Yan Fenglong, Liu Cheng, He Zhiyao, Duan Xi, Chen Yao, Lin Jianyong, Wang Xiang, Xie Shichen, Jiang Peng

Note: Notes and assignments (including data and original assignment files) and videos can be downloaded on Github [3].

I will publish the course notes on the official account "Machine Learning Beginners"; please follow it for updates.

Week 2: Optimization Algorithms

2.1 Mini-Batch Gradient Descent

This week we will learn about optimization algorithms that make your neural networks train faster. Applying machine learning is a highly empirical, iterative process: you need to train many models before you find one that works well, so fast optimization algorithms help you and your team iterate much more quickly.

One difficulty is that deep learning tends to work best when there is a lot of data: we can train a neural network on a huge data set, but training on a huge data set is slow. So you will find that fast, well-chosen optimization algorithms can greatly improve your efficiency and that of your team. Let us start with mini-batch gradient descent.

Vectorization, as you learned, lets you efficiently process all m samples, handling the entire training set without an explicit for loop over the samples. We stack the training samples into a huge matrix X = [x^(1) x^(2) ... x^(m)], and likewise Y = [y^(1) y^(2) ... y^(m)], so the dimension of X is (n_x, m) and the dimension of Y is (1, m). Vectorization lets you process all m samples relatively quickly, but if m is very large, processing is still slow. For example, if m is 5 million, or 50 million, or more, then when you run gradient descent on the whole training set you must process all of it before you can take a single gradient descent step, and then process all 5 million samples again before the next step. So your algorithm runs faster if you let gradient descent make progress after processing only part of the training set, before you have gone through all 5 million samples, and that is exactly what you can do.

You can break the training set into smaller subsets, called mini-batches. Suppose each subset contains just 1,000 samples: take x^(1) through x^(1000) and call them the first subset, also called a mini-batch; then take the next 1,000 samples, x^(1001) through x^(2000); then another 1,000 samples; and so on.

Now a new piece of notation: call x^(1) through x^(1000) X^{1}, and x^(1001) through x^(2000) X^{2}. If you have 5 million training samples and each mini-batch has 1,000 samples, then you have 5,000 mini-batches, because 5,000 times 1,000 is 5 million.

You have 5,000 mini-batches, so the last one is X^{5000}.

You do the same thing for Y, splitting it accordingly: y^(1) through y^(1000) is Y^{1}, y^(1001) through y^(2000) is Y^{2}, and so on up to Y^{5000}.

Mini-batch t consists of X^{t} and Y^{t}: 1,000 training samples with their corresponding input-output pairs.

Before we go on, let me pin down the notation. We use superscript parentheses (i) for values in the training set, so x^(i) is the i-th training sample. We use superscript square brackets [l] for the layers of the neural network, so z^[l] is the z value of layer l. Now we add superscript curly braces {t} for the different mini-batches, so we have X^{t} and Y^{t}. Check that you understand this correctly.

The dimensions of X^{t} and Y^{t}: if X^{1} contains 1,000 training samples, that is, the x values of 1,000 samples, its dimension is (n_x, 1000); the dimension of X^{2} is also (n_x, 1000), and so on. So every subset X^{t} has dimension (n_x, 1000), and every Y^{t} has dimension (1, 1000).

To explain the name of the algorithm: batch gradient descent refers to the gradient descent algorithm we discussed earlier, which processes the entire training set at once. The name comes from the fact that the whole batch of training samples is processed at the same time. It is not a great name, but that is what it is called.

Mini-batch gradient descent, by contrast, refers to the algorithm on the next slide, in which you process a single mini-batch X^{t}, Y^{t} at a time, rather than the whole X and Y at once.

So how does mini-batch gradient descent work? To run it on the training set, you run for t = 1, ..., 5000, because we have 5,000 groups of 1,000 samples each, and inside the for loop you basically perform one step of gradient descent on X^{t} and Y^{t}. It is as if you had a training set of 1,000 samples and, being already familiar with processing it all at once, you use vectorization to process those 1,000 samples almost simultaneously.

First, do forward propagation on the input X^{t}: compute Z^[1] = W^[1] X^{t} + b^[1]. Previously we had Z^[1] = W^[1] X + b^[1] because we were processing the whole training set, but now that we are processing the first mini-batch, X becomes X^{t}. Then compute A^[1] = g^[1](Z^[1]) (the capital letters indicate vectorized quantities), and so on, until A^[L] = g^[L](Z^[L]), which gives your predictions. Note that you need a vectorized implementation here, one that processes 1,000 samples at a time rather than 5 million. Next you compute the cost function J; because the subset size is 1,000, J = (1/1000) Σ_i L(ŷ^(i), y^(i)), where, to be clear, the samples come from the mini-batch X^{t}, Y^{t}.

If you are using regularization, you include the regularization term as well: J^{t} = (1/1000) Σ_i L(ŷ^(i), y^(i)) + (λ / (2 · 1000)) Σ_l ||W^[l]||_F². Because this is the loss on one mini-batch, I write it as J with a superscript t in curly braces (J^{t}).

You will also notice that everything here looks familiar; it is exactly what we did with ordinary gradient descent, except that you are now working on X^{t} and Y^{t} instead of X and Y. Next you run back propagation to compute the gradients of J^{t}, using only X^{t} and Y^{t}, and then you update the weights: W^[l] := W^[l] − α dW^[l], and likewise b^[l] := b^[l] − α db^[l]. This is one pass of training with mini-batch gradient descent; the code I have written can also be described as "1 epoch" of training, where an epoch means a single pass through the training set.

With batch gradient descent, one pass through the training set lets you take only a single gradient descent step; with mini-batch gradient descent, one pass through the training set lets you take 5,000 gradient descent steps. Of course, if you want to go through the training set multiple times, you wrap this in another for loop or a while loop, and you keep passing through the training set until the algorithm converges to a reasonable precision.
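To make the loop structure concrete, here is a minimal, self-contained numpy sketch of mini-batch gradient descent, using plain logistic regression as a stand-in for the L-layer network in these notes; the function name and defaults are illustrative, not part of the course code.

```python
import numpy as np

def train_logistic_minibatch(X, Y, alpha=0.1, batch_size=1000, num_epochs=5):
    """Mini-batch gradient descent on logistic regression (a stand-in for the full network).

    X: inputs of shape (n_x, m); Y: labels of shape (1, m).
    """
    n_x, m = X.shape
    W, b = np.zeros((1, n_x)), 0.0
    for epoch in range(num_epochs):                    # outer loop: each pass is one epoch
        for t in range(0, m, batch_size):              # inner loop over mini-batches X^{t}, Y^{t}
            X_t = X[:, t:t + batch_size]
            Y_t = Y[:, t:t + batch_size]
            A = 1 / (1 + np.exp(-(W @ X_t + b)))       # forward propagation on the mini-batch
            dZ = A - Y_t                               # back propagation for the logistic loss
            dW = dZ @ X_t.T / X_t.shape[1]
            db = dZ.mean()
            W, b = W - alpha * dW, b - alpha * db      # one gradient descent step per mini-batch
    return W, b
```

With m = 5,000,000 and batch_size = 1000, the inner loop takes 5,000 gradient descent steps per epoch, which is exactly the point made above.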

When you have a large training set, mini-batch gradient descent runs much faster than batch gradient descent, and almost everyone doing deep learning uses it when training on huge data sets. In the next video we will look more closely at mini-batch gradient descent, so you get a better understanding of what it does and why it works.

2.2 Understanding Mini-Batch Gradient Descent

In the last video, you saw how mini-batch gradient descent lets you start making progress, taking gradient descent steps even when you have processed only part of the training set, even on your first pass through it. In this video we take a closer look at how mini-batch gradient descent works and why it works.

With batch gradient descent, every iteration passes over the entire training set, and you expect the cost to decrease on every iteration. So if you plot the cost function J against the number of iterations, it should decrease at every iteration; if J ever increases, something is definitely wrong, perhaps your learning rate is too large.

With mini-batch gradient descent, if you plot the cost over the whole training process, it does not go down on every iteration. On each iteration you are processing one X^{t}, Y^{t}, so if you plot the mini-batch cost J^{t}, which depends only on X^{t} and Y^{t}, you are effectively evaluating a different sample set, a different mini-batch, on every iteration. The plot will tend to go down, but with more noise; because training runs over multiple epochs, you will probably see a noisy, downward-trending curve. It does not matter that the cost fails to decrease on every single iteration, but the overall trend should be downward. The noise arises because some mini-batches, say X^{1}, Y^{1}, happen to be easy, so the cost is lower, while by chance another mini-batch, say X^{2}, Y^{2}, is harder, perhaps containing some mislabeled or incomplete samples, so the cost is a little higher. That is why you get these oscillations when you plot the cost while running mini-batch gradient descent.

One of the variables you need to choose is the mini-batch size. Let m be the size of the training set. In one extreme case, if the mini-batch size equals m, you simply get batch gradient descent: you have a single mini-batch X^{1}, Y^{1} equal to the whole training set, so setting the mini-batch size to m recovers batch gradient descent.

At the other extreme, if the mini-batch size is 1, you get an algorithm called stochastic gradient descent, in which every sample is its own mini-batch. The first mini-batch X^{1}, Y^{1} is just your first training sample x^(1), y^(1); you take a gradient descent step on it, then the second mini-batch is the second training sample, then the third, and so on, one sample at a time.

Look at how the cost function gets optimized in these two extremes. If this is the contour plot of the cost function you want to minimize, with the minimum there, batch gradient descent starts somewhere and takes relatively low-noise, relatively large steps, heading steadily toward the minimum.

Stochastic gradient descent, by contrast, starts from some point and on every iteration takes a gradient descent step using only one sample. Most of the time you move toward the global minimum, but sometimes you move away from it, because that particular sample happens to point you in the wrong direction. Stochastic gradient descent is therefore very noisy: on average it heads toward the minimum, but sometimes it heads the wrong way. It also never converges; it keeps wandering around in the neighborhood of the minimum rather than reaching the minimum and staying there.

In practice the mini-batch size you choose lies between the two extremes, somewhere between 1 and m: 1 is too small and m is too large. If you use batch gradient descent, with a mini-batch size of m, each iteration has to process a huge number of training samples, and its main drawback is that a single iteration takes too long when the training set is large. If the training set is small, batch gradient descent works fine.

Stochastic gradient descent, by contrast, processes just one sample at a time. That is fine as a method, and the noise can be improved or reduced by using a smaller learning rate, but its big drawback is that you lose all the speed-up that vectorization gives you, because you process only a single training sample at a time, which is very inefficient. So in practice it is best to choose a mini-batch size that is neither too large nor too small, which in fact gives the fastest learning. You get two benefits. On one hand, you get a lot of vectorization: in the example from the last video, with a mini-batch size of 1,000 samples you can vectorize over 1,000 samples, which is much faster than processing them one at a time. On the other hand, you do not have to wait for the entire training set to be processed before making progress: using the numbers from the last video, each pass through the training set lets you take 5,000 gradient descent steps. So a mini-batch size somewhere in the middle actually works best.

With mini-batch gradient descent, you start from here and take one iteration, two, three, four. It will not move toward the minimum on every step, but it heads toward the minimum much more consistently than stochastic gradient descent, and it does not necessarily converge or oscillate within a very small region. If that becomes a problem, you can slowly reduce the learning rate; we will discuss learning rate decay, that is, how to reduce the learning rate, in a later video.

If the mini-batch size should be neither 1 nor m, how do you choose a value in between? Here are some guidelines.

First of all, if the training set is small, just use batch gradient descent: with a small sample set there is no need for mini-batch gradient descent, you can process the whole training set quickly, so batch gradient descent works fine. "Small" here means fewer than about 2,000 samples. Otherwise, if the training set is larger, typical mini-batch sizes are 64 to 512. Because of the way computer memory is laid out and accessed, the code runs somewhat faster if the mini-batch size is a power of 2: 64 is 2 to the 6th power, 128 is 2 to the 7th, 256 is 2 to the 8th, 512 is 2 to the 9th. So I often set the mini-batch size to a power of 2. In the last video I used a mini-batch size of 1,000; if you want something like that, you might instead try 1,024, which is 2 to the 10th power. A mini-batch size of 1,024 does get used, but it is less common; sizes from 64 to 512 are more typical.

The last thing to note is that your mini-batch X^{t}, Y^{t} should fit in CPU/GPU memory; exactly what fits depends on your application and the size of your training set. If the mini-batch you are processing does not fit in CPU/GPU memory, then no matter how you process the data, you will notice the algorithm's performance falling off dramatically. So I hope this gives you a sense of the mini-batch sizes people typically use. In fact the mini-batch size is another important hyperparameter, and you may need a quick search to find the value that reduces the cost function most effectively. I usually try several different values, several different powers of 2, and pick the one that makes gradient descent as efficient as possible. Hopefully this guides you in how to start looking for that number.
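As a sketch of the partitioning step itself, here is one way to shuffle a training set and split it into power-of-2 mini-batches; the function name, the shuffling step, and the handling of a final smaller batch are my own illustrative choices rather than anything specified in the notes.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle (X, Y) column-wise and partition them into mini-batches of size batch_size."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    permutation = rng.permutation(m)               # shuffle so each epoch sees different mini-batches
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]

    mini_batches = []
    for t in range(0, m, batch_size):
        X_t = X_shuffled[:, t:t + batch_size]      # the last batch may be smaller than batch_size
        Y_t = Y_shuffled[:, t:t + batch_size]
        mini_batches.append((X_t, Y_t))
    return mini_batches
```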

You have learned how to use mini-batch gradient descent to make your algorithm run faster, especially on large training sets. But there are even more efficient algorithms, faster than both gradient descent and mini-batch gradient descent, which we will explain in the next videos.

2.3 Exponentially Weighted Average

I want to show you a few optimization algorithms that are faster than gradient descent. To understand them, you need to be able to use exponentially weighted averages, also called exponentially weighted moving averages in statistics, so we will start there and then move on to the more sophisticated optimization algorithms.

Although I live in the United States now, I was actually born in London, England. Suppose, for example, that I have the daily temperature in London over the past year: on January 1 the temperature was 40 degrees Fahrenheit, which is 4 degrees Celsius (I know most of the world uses Celsius, but the United States uses Fahrenheit). On January 2 it was 9 degrees Celsius, and so on. Around the middle of the year, roughly day 180 of the 365, which is late May, the temperature was 60 degrees Fahrenheit, about 15 degrees Celsius, and so on. It gets warmer in summer and cooler in winter.

If you plot the data, you get the following, starting in January, beginning of summer here, end of year here, end of December.

So here we have January 1, the middle of the year heading into summer, and then the end of the year. The data look a little noisy, so if you want to compute the trend, that is, a local average or moving average of the temperature, here is what you can do.

What you do is: first set v_0 = 0. Then, each day, take 0.9 times the previous value plus 0.1 times that day's temperature, so v_1 = 0.9 v_0 + 0.1 θ_1, where θ_1 is the temperature on day 1.

On the second day you again take a weighted average, 0.9 times the previous value plus 0.1 times that day's temperature: v_2 = 0.9 v_1 + 0.1 θ_2, and so on.

On day 3, v_3 = 0.9 v_2 + 0.1 θ_3, and so on down the days. In general, the value for a given day t is v_t = 0.9 v_{t-1} + 0.1 θ_t: 0.9 times the previous day's value plus 0.1 times the current day's temperature.

So if you do that, and then you draw the red line, you get something like this.

You get the moving average, the exponentially weighted average of the daily temperature.

If you look at the formula on the last slide and generalize the constants, replacing 0.9 by β and 0.1 by (1 − β), you get v_t = β v_{t-1} + (1 − β) θ_t.
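As a small illustration of this formula, here is a hedged numpy sketch that computes v_t = β v_{t-1} + (1 − β) θ_t over a list of temperatures; the temperature values are made up for the example.

```python
import numpy as np

def exponentially_weighted_average(thetas, beta=0.9):
    """Compute v_t = beta * v_{t-1} + (1 - beta) * theta_t over a sequence."""
    v = 0.0
    averages = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta   # a single number of state, overwritten each day
        averages.append(v)
    return np.array(averages)

daily_temps = [40.0, 49.0, 45.0, 44.0, 52.0]                   # made-up Fahrenheit readings
print(exponentially_weighted_average(daily_temps, beta=0.9))   # red-line behaviour
print(exponentially_weighted_average(daily_temps, beta=0.98))  # smoother, slower green-line behaviour
```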

For reasons we will go into later, you can think of v_t as approximately averaging over the last 1/(1 − β) days of temperature. So if β is 0.9, you can think of it as an average over roughly the last 10 days, which gives the red line.

Let us try something else. Set β to a value close to 1, say 0.98; then 1/(1 − 0.98) = 50, so this is roughly an average of the temperature over the last 50 days, and plotting it gives the green line.

A few things to note about this high value of β. The curve you get is smoother, because you are averaging the temperature over more days, so it fluctuates less. The drawback is that the curve shifts further to the right: because you are averaging over more values, the exponentially weighted average adapts more slowly when the temperature changes, so there is some delay. With β = 0.98 you are effectively putting a lot of weight on the previous value and only 0.02 of the weight on the current day's value, so when the temperature moves up and down, a larger β makes the exponentially weighted average adapt more slowly.

We can try another extreme, say β = 0.5. By the 1/(1 − β) rule of thumb, this averages over only about two days of temperature.

Plotting it gives the yellow line.

Because you are averaging over only about two days of temperature, you are averaging very little data, so the resulting curve is noisier and more sensitive to outliers; on the other hand, it adapts to temperature changes much more quickly.

So exponentially weighted averages are used a great deal; again, the full name in statistics is the exponentially weighted moving average, and we will just call it the exponentially weighted average for short. By adjusting the parameter β, which will be a hyperparameter in the learning algorithms we see later, you get slightly different results, and there is usually a value in between that works best: the red curve, with β in the middle, averages the temperature better than the green or yellow lines.

Now that you know the basics of exponential weighted averages, in the next video we’ll talk a little bit more about what it does.

2.4 Understanding Exponentially Weighted Averages

In the last video we looked at exponentially weighted averages, which are a key ingredient of several optimization algorithms used to train neural networks. In this video I want to dig a little deeper into what this algorithm is really doing.

Recall the key equation for computing an exponentially weighted average: v_t = β v_{t-1} + (1 − β) θ_t.

With β = 0.9 you get the red line; if β is closer to 1, say 0.98, you get the green line; and if β is smaller, say 0.5, you get the yellow line.

Let’s take it a step further to understand how to calculate the average daily temperature.

The same formula: v_t = β v_{t-1} + (1 − β) θ_t.

Set β = 0.9 and write out a few of these equations. When the algorithm runs, t goes from 0 to 1 to 2 to 3, steadily increasing; but for the analysis I will write the equations with t decreasing and keep going downward:

v_100 = 0.9 v_99 + 0.1 θ_100
v_99 = 0.9 v_98 + 0.1 θ_99
v_98 = 0.9 v_97 + 0.1 θ_98

Look at the first equation. What is v_100? Swap the two terms: v_100 = 0.1 θ_100 + 0.9 v_99.

And what is v_99? Substitute the second equation (v_99 = 0.1 θ_99 + 0.9 v_98), so:

v_100 = 0.1 θ_100 + 0.9 (0.1 θ_99 + 0.9 v_98).

And what is v_98? It comes from the third equation (v_98 = 0.1 θ_98 + 0.9 v_97); substitute again, so:

v_100 = 0.1 θ_100 + 0.9 (0.1 θ_99 + 0.9 (0.1 θ_98 + 0.9 v_97)).

And so on. If you expand all of these parentheses, you get

v_100 = 0.1 θ_100 + 0.1 × 0.9 θ_99 + 0.1 × 0.9² θ_98 + 0.1 × 0.9³ θ_97 + 0.1 × 0.9⁴ θ_96 + ...

So v_100 is a weighted sum, a weighted average, of θ_100, the temperature on day 100, together with θ_99, θ_98, θ_97, and so on. One way to picture this: suppose we have the temperature for each day, so one axis is the data θ_t and the other is the day t; day 100 has a value, day 99 has a value, day 98 and so on, for t = 100, 99, 98, ... These are the daily temperature values.

Then we construct an exponentially decaying function that starts at 0.1, then 0.1 × 0.9, then 0.1 × 0.9², and so on, so we have this exponentially decaying set of weights.

To compute v_100 you take the corresponding elements of these two functions and sum them: θ_100 times 0.1 is the first term, θ_99 times 0.1 × 0.9 is the second term, and so on. So you take each day's temperature, multiply it by the exponentially decaying weight, and sum, and that gives you v_100.

It turns out, and we will say more about this later, that all of these coefficients (0.1, 0.1 × 0.9, 0.1 × 0.9², ...) add up to 1, or very nearly 1; this is related to the bias correction we will cover in the next video.

Finally, you might ask how many days of temperature this actually averages over. It turns out that 0.9^10 is about 0.35, which is roughly 1/e, where e is the base of the natural logarithm. More generally, write β = 1 − ε; in this example ε = 0.1, and (1 − ε)^(1/ε) ≈ 1/e, which is about 0.34 to 0.35. In other words, it takes about 10 days for the weight to decay to around 1/e of the weight given to the current day, that is, to roughly a third of the peak.

That is why, when β = 0.9, we say the exponentially weighted average effectively looks only at roughly the last 10 days of temperature: beyond 10 days the weights have dropped to less than about a third of the weight on the current day.

Conversely, if β = 0.98, how many times do you have to multiply 0.98 by itself before it gets that small? 0.98^50 is approximately 1/e, so for roughly the first 50 days the weights stay relatively large, after which they decay quickly; you can therefore view this as averaging the temperature over about 50 days. In this example ε = 1 − β = 0.02, so 1/ε = 50, which gives the rule of thumb that v_t averages over roughly 1/(1 − β) days of temperature. In other words, the constant β tells you roughly how many days of temperature are being averaged, but this is only a rough way of thinking about it, not a formal mathematical proof.
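If you want to sanity-check these numbers yourself, here is a quick calculation (a check, not a proof):

```python
import math

print(0.9 ** 10)    # ~0.349, close to 1/e ~0.368, so beta = 0.9 averages over roughly 10 days
print(0.98 ** 50)   # ~0.364, also close to 1/e, so beta = 0.98 averages over roughly 50 days
print(1 / math.e)   # ~0.368 for comparison
```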

Finally, how do you implement this in practice? Remember, we start with v = 0, then on day 1 compute v := β v + (1 − β) θ_1, then on day 2 compute v := β v + (1 − β) θ_2, and so on.

To spell out the algorithm: you could keep v_0, v_1, v_2 and so on as explicit separate variables, but in an actual implementation what you do is initialize v to 0, then on day 1 set v := β v + (1 − β) θ_1, then on day 2 update v again with θ_2, and so on. Some people add a subscript and write v_θ, to indicate that v is being used to compute the exponentially weighted average of θ.

To restate it: initialize v_θ = 0, and then on every day t take that day's data θ_t and update v_θ := β v_θ + (1 − β) θ_t.

One advantage of the exponentially weighted average formula is that it uses very little memory: you keep just one number in computer memory and keep overwriting it with the latest update. For this reason it is very efficient, it takes essentially one line of code, and computing the exponentially weighted average requires storing only a single number. Of course it is not the best or most accurate way to compute an average: if you computed a moving window instead, summing the last 10 days or the last 50 days and dividing by 10 or 50, that would often give you a better estimate. The downside is that keeping all the recent temperature data around, along with the running sum over the last 10 days, requires more memory, is more complicated to implement, and is computationally more expensive.
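To contrast the two approaches described here, the sketch below runs the one-number exponentially weighted update alongside a 10-day moving-window average; the toy data and helper name are my own.

```python
from collections import deque

def compare_averages(thetas, beta=0.9, window_size=10):
    """Print the exponentially weighted average next to a moving-window average."""
    v = 0.0                               # EWA state: one number, overwritten every step
    window = deque(maxlen=window_size)    # window average must store the last window_size readings
    for theta in thetas:
        v = beta * v + (1 - beta) * theta             # the "one line of code" update
        window.append(theta)
        window_avg = sum(window) / len(window)        # often a better estimate, but more memory
        print(f"theta={theta:5.1f}  ewa={v:6.2f}  window={window_avg:6.2f}")

compare_averages([40.0, 49.0, 45.0, 44.0, 52.0, 55.0])
```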

In the following videos we will be computing exponentially weighted averages of several variables. In terms of computation and memory it is an efficient method, which is why it is used so often in machine learning, not to mention that it takes only one line of code, which is another advantage.

Now that you know how to calculate exponential weighted averages, you also need to know a technical concept called bias correction, which we’ll talk about in the next video, and then you can use it to build better optimization algorithms than the straightforward gradient descent method.

2.5 Bias Correction in Exponentially Weighted Averages

You have learned how to compute exponentially weighted averages. There is a technique called bias correction that makes the computation of these averages more accurate; let's see how it works.

In the previous video, the red curve corresponds to β = 0.9 and the green curve to β = 0.98. If you implement the formula exactly as written, then with β = 0.98 you do not actually get the green curve but the purple one, and you can see that the purple curve starts off very low. Let us look at how to fix this.

When computing the moving average you initialize v_0 = 0, so v_1 = 0.98 v_0 + 0.02 θ_1; but since v_0 = 0, the first term (0.98 v_0) vanishes, so v_1 = 0.02 θ_1. If the temperature on day 1 is 40 degrees Fahrenheit, then v_1 = 0.02 × 40 = 0.8, which is far too small, so the estimate of the first day's temperature is not good.

Similarly v_2 = 0.98 v_1 + 0.02 θ_2; if you substitute v_1 and multiply out, v_2 = 0.0196 θ_1 + 0.02 θ_2. Assuming θ_1 and θ_2 are both positive, v_2 is much smaller than either θ_1 or θ_2, so it is not a good estimate of the temperature over the first two days of the year either.

There is a way to modify this estimate to make it better and more accurate, especially early on: instead of using v_t, use v_t / (1 − β^t), where t is the current day number. For a concrete example, when t = 2, 1 − β^t = 1 − 0.98² = 0.0396, so the estimate of the second day's temperature becomes v_2 / 0.0396 = (0.0196 θ_1 + 0.02 θ_2) / 0.0396, a weighted average of θ_1 and θ_2 with the bias removed. Notice that as t increases, β^t approaches zero, so once t is large the bias correction makes almost no difference, and that is why for large t the purple line essentially coincides with the green line. But during the initial phase, while the average is still warming up, bias correction helps you get a better estimate of the temperature; it takes you from the purple line to the green line.
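Here is a short sketch of the bias-corrected average v_t / (1 − β^t); the temperature values are made-up example numbers.

```python
def ewa_with_bias_correction(thetas, beta=0.98):
    """Exponentially weighted average, returning the bias-corrected values v_t / (1 - beta^t)."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** t))   # divide by (1 - beta^t) to remove the start-up bias
    return corrected

# Day 1 at 40°F: the raw v_1 is only 0.02 * 40 = 0.8, but the corrected estimate is 0.8 / 0.02 = 40.
print(ewa_with_bias_correction([40.0, 49.0, 45.0]))
```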

In machine learning, most of the time when calculating the exponential weighted average, people don’t care about performing bias correction, because most people would rather get past the initial period, get a biased estimate, and move on. If you are concerned with initial bias, bias correction can help you get a better estimate early on when you start calculating the exponential weighted moving average.

So you learned how to calculate the exponentially weighted moving average, let’s go on to use it to build better optimization algorithms!

2.6 Gradient Descent with Momentum

There is an algorithm called Momentum, or momentum gradient descent, which almost always runs faster than standard gradient descent. In short, the basic idea is to compute an exponentially weighted average of the gradients and then use that average to update your weights. In this video we will unpack that one-sentence description and see exactly how the computation works.

For example, suppose you want to optimize a cost function whose contours look like the figure, with the red dot marking the minimum. Suppose you start gradient descent from here (the blue dot). One iteration of gradient descent, whether batch or mini-batch, might take you to there, now on the other side of the ellipse; the next gradient descent step might land there; then another step, and another, and you can see that gradient descent takes many steps, slowly oscillating its way toward the minimum, right?

This up-and-down oscillation slows gradient descent down and prevents you from using a larger learning rate: with a larger learning rate (purple arrows) you might overshoot and diverge. To avoid oscillations that are too large, you have to use a learning rate that is not too big.

Another way to look at it: on the vertical axis you want learning to be slow, because you do not want these oscillations, but on the horizontal axis you want learning to be fast, moving quickly from left to right toward the minimum, toward the red dot. With momentum gradient descent, what you do on each iteration, specifically on iteration t, is compute the derivatives dW and db (I will omit the layer superscript [l]) on the current mini-batch. If you are using batch gradient descent, the current mini-batch is simply the whole batch, and this works just the same; it also works fine if the current mini-batch is the whole training set. Then you compute v_dW = β v_dW + (1 − β) dW, which is just like the moving average we computed before, that is, a moving average of dW, and likewise v_db = β v_db + (1 − β) db. Then you update the weights with these averages: W := W − α v_dW and b := b − α v_db, and this damps the swings of gradient descent.

For example, over the last few derivatives, the oscillations along the vertical axis average out to nearly zero, because in the average the positive and negative values cancel, so in the vertical direction the updates slow down, which is what you want. In the horizontal direction, however, all the derivatives point the same way, so the horizontal average stays large. After a few iterations you therefore find that momentum gradient descent oscillates much less in the vertical direction while moving faster in the horizontal direction, so the algorithm takes a more direct path and reduces the swinging on the way to the minimum.

One intuition for momentum gradient descent, which works for some people but not everyone, is to imagine you are minimizing a bowl-shaped function; this is the shape of the bowl, which I have not drawn very well.

As you minimize the bowl-shaped function, imagine the derivative terms as the acceleration of a ball rolling down the hill, and the momentum terms as its velocity.

Imagine a bowl and a ball: the derivative gives the ball an acceleration as it rolls downhill, so it rolls faster and faster; and because β is slightly less than 1, there is a bit of friction, so the ball does not accelerate forever. So unlike plain gradient descent, where every step is independent of the previous ones, the ball can roll downhill and gain momentum, accelerating as it goes down the bowl. I find that the rolling-ball analogy works well for people with a physics background, but not for everyone; if the analogy of a ball rolling down a bowl does not resonate with you, do not worry about it.

Finally, let's look at the details of how to implement this. Here is the algorithm: on each iteration t, compute dW and db on the current mini-batch, then v_dW := β v_dW + (1 − β) dW and v_db := β v_db + (1 − β) db, and update W := W − α v_dW, b := b − α v_db.

So you have two hyperparameters: the learning rate α and the parameter β, which controls the exponentially weighted average. The most common value for β is 0.9: just as we previously averaged the temperature over roughly the last ten days, we are now averaging the gradients over roughly the last ten iterations. In practice β = 0.9 works well; you can try other values and do some hyperparameter search, but 0.9 is a robust choice. As for bias correction, you could divide v_dW and v_db by (1 − β^t), but in practice people usually do not, because after about ten iterations the moving average is past its warm-up phase; people are rarely bothered by bias correction when using gradient descent or momentum. Of course the initial value of v_dW is 0: note that it is a matrix of zeros with the same dimensions as dW, that is, the same dimensions as W, and v_db is initialized to a zero vector with the same dimensions as db, that is, the same dimensions as b.
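As a concrete reference for the update rule above, here is a minimal sketch of one momentum step; the function wrapper and toy values are illustrative only.

```python
import numpy as np

def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    """One iteration of gradient descent with momentum.

    v_dW and v_db must start as zeros with the same shapes as W and b.
    """
    v_dW = beta * v_dW + (1 - beta) * dW    # moving average of the W-gradients
    v_db = beta * v_db + (1 - beta) * db    # moving average of the b-gradients
    W = W - alpha * v_dW                    # update with the smoothed gradients
    b = b - alpha * v_db
    return W, b, v_dW, v_db

# Toy usage; in practice dW and db come from back propagation on the current mini-batch.
W, b = np.ones((2, 2)), 0.0
v_dW, v_db = np.zeros((2, 2)), 0.0
W, b, v_dW, v_db = momentum_step(W, b, np.full((2, 2), 0.5), 0.1, v_dW, v_db)
```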

One final note: if you read other material on momentum gradient descent, you will often see the (1 − β) term dropped, giving v_dW = β v_dW + dW. The effect of that (purple) version is that v_dW ends up scaled by a factor of 1/(1 − β), so if you use it you have to rescale the learning rate α accordingly. In practice both versions work well; they only affect the best value of the learning rate α. I find that formula a little less natural, because if you later tune the hyperparameter β it changes the scale of v_dW and v_db, and you may then also have to retune α. So I prefer the formula on the left, the one that keeps the (1 − β) term. Both versions commonly set β to 0.9, a common hyperparameter choice; the difference is just how the learning rate α has to be adjusted.

So this is Momentum gradient descent, and this algorithm is definitely better than gradient descent without Momentum, and there are other things we can do to speed up the learning algorithm, and we’ll talk about those in the next video.

2.7 RMSprop

You know Momentum can speed up gradient descent. There is another algorithm, called RMSprop (root mean square prop), that can also speed up gradient descent; let's see how it works.

Remember our earlier example: if you run gradient descent, it makes progress in the horizontal direction but oscillates heavily in the vertical direction. To analyze this example, suppose the vertical axis represents the parameter b and the horizontal axis represents the parameter W; there could be W1, W2 or other important parameters, but for the sake of intuition call them b and W.

You want to slow learning down in the b direction, the vertical axis, while speeding it up, or at least not slowing it down, in the horizontal direction, and the RMSprop algorithm can do this.

On iteration t, the algorithm computes the derivatives dW and db on the current mini-batch as usual. Then it keeps an exponentially weighted average, but with a new symbol S instead of v: S_dW = β S_dW + (1 − β) dW², where, to be clear, the squaring applies element-wise to the whole quantity dW; this keeps an exponentially weighted average of the squares of the derivatives. Likewise S_db = β S_db + (1 − β) db², where again the square applies to the whole quantity.

RMSprop then updates the parameters as follows: W := W − α · dW / √S_dW and b := b − α · db / √S_db. To understand how this works, remember that in the horizontal direction, the W direction in our example, we want learning to be fast, while in the vertical direction, the b direction, we want to damp the oscillations. With S_dW and S_db, what we hope is that S_dW is relatively small, so that we divide by a small number in the W update, and that S_db is relatively large, so that we divide by a larger number in the b update, which slows the changes along the vertical axis. Indeed, if you look at the derivatives, they are much larger in the vertical direction than in the horizontal one: the slope is very large in the b direction, so db is large and dW is small, because the function is much steeper in the b direction than in the W direction. Since db² is large, S_db will be large, while by comparison dW (and dW²) is small, so S_dW will be small. The net result is that the vertical updates are divided by a relatively large number, which damps the oscillations, while the horizontal updates are divided by a smaller number.

The net effect of RMSprop is that your updates end up looking like this (green line): less swing along the vertical axis while you keep pushing forward along the horizontal axis. Another effect is that you can then use a larger learning rate and learn faster without diverging in the vertical direction.

To be clear, I have been calling the vertical and horizontal directions b and W purely for the sake of the illustration. In practice you are in a very high-dimensional parameter space, so the vertical dimension where you want to damp oscillations might really be some set of parameters W1, W2, and so on, and the horizontal dimension might be W3, W4, and so on; separating them into W and b is just for convenience. In reality dW is a high-dimensional parameter vector, and so is db. But the intuition is that in the dimensions with large oscillations you end up computing a larger value of this weighted average of the squared derivatives, so you end up damping exactly the directions that oscillate. That is RMSprop, which stands for root mean square prop, because you square the derivatives and then eventually take the square root.

Finally, a few details about this algorithm before we move on. In the next video we will combine RMSprop with Momentum. In Momentum we used the hyperparameter β; to avoid confusion, the RMSprop hyperparameter will be called β_2, so Momentum and RMSprop each keep their own hyperparameter. You also need to make sure your algorithm never divides by zero: what if the square root of S_dW is close to zero? The result would blow up. To keep things numerically stable, in practice you add a very small ε to the denominator; it does not matter much what it is, 10^-8 is a reasonable choice, it just ensures you never divide by a very, very small number for whatever reason. So RMSprop is similar to Momentum in that it damps the oscillations in gradient descent, including mini-batch gradient descent, and lets you use a larger learning rate to speed up learning.
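Putting the RMSprop pieces together, here is a minimal sketch of one update step; ε = 10^-8 follows the suggestion above, while the default β_2 = 0.999 borrows the value recommended for Adam in the next section, and the wrapper itself is illustrative.

```python
import numpy as np

def rmsprop_step(W, b, dW, db, s_dW, s_db, alpha=0.01, beta2=0.999, eps=1e-8):
    """One RMSprop update; s_dW and s_db start as zeros shaped like W and b."""
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2   # weighted average of squared W-gradients
    s_db = beta2 * s_db + (1 - beta2) * db ** 2   # weighted average of squared b-gradients
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)    # a large s means a small step, damping oscillation
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return W, b, s_dW, s_db
```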

So that is how you use RMSprop, another way to speed up your learning algorithm. One fun fact about RMSprop is that it was first proposed not in an academic research paper but, many years ago, in Geoff Hinton's Coursera course. I don't think Coursera set out to be a platform for spreading new academic research, but it had that unexpected effect: it was through Coursera that RMSprop became widely known and took off.

We talked about Momentum, we talked about RMSprop, and if you combine the two, you get a better optimization algorithm, and we’ll talk more about why in the next video.

2.8 Adam Optimization Algorithm

Over the history of deep learning, many researchers, including some very well-known ones, have proposed optimization algorithms that solved certain problems well, only for those algorithms to be shown later not to generalize to many kinds of neural networks. Over time, people in the deep learning community grew somewhat skeptical of brand-new optimization algorithms; many felt that momentum gradient descent works well and that it is hard to come up with something better. RMSprop and the Adam optimization algorithm (the topic of this video) are two of the few algorithms that have stood up to this scrutiny and have been shown to work well across a wide range of deep learning architectures. I recommend Adam without hesitation, because many people have tried it and found that it solves many problems well.

The Adam optimization algorithm is basically a combination of Momentum and RMSprop; let's see how to use it.

To use Adam, you first initialize v_dW = 0, S_dW = 0, v_db = 0, S_db = 0. On iteration t, you compute the derivatives dW and db on the current mini-batch; typically you use mini-batch gradient descent. Then you compute the Momentum exponentially weighted averages: v_dW = β_1 v_dW + (1 − β_1) dW (we use β_1 here so as not to confuse it with the hyperparameter β_2 that RMSprop will use shortly; it is the same formula we used for Momentum, only the hyperparameter is now called β_1 instead of β). Likewise v_db = β_1 v_db + (1 − β_1) db.

Then you do the RMSprop-style update with a different hyperparameter β_2: S_dW = β_2 S_dW + (1 − β_2) dW², where again the square applies element-wise to the whole derivative dW, and S_db = β_2 S_db + (1 − β_2) db².

So the Momentum-like update uses hyperparameter β_1 and the RMSprop-like update uses hyperparameter β_2. When using Adam you normally also apply bias correction. After bias correction:

v_dW^corrected = v_dW / (1 − β_1^t), and in the same way v_db^corrected = v_db / (1 − β_1^t).

S is bias-corrected as well: S_dW^corrected = S_dW / (1 − β_2^t) and S_db^corrected = S_db / (1 − β_2^t).

Finally you update the weights: W := W − α · v_dW^corrected / (√(S_dW^corrected) + ε). (If you were using Momentum alone you would use v_dW, or its bias-corrected version, directly; but now that we have added the RMSprop part, we divide by the square root of the bias-corrected S_dW plus ε.)

The bias b is updated by the analogous formula: b := b − α · v_db^corrected / (√(S_db^corrected) + ε).
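Combining all of the steps above, here is a minimal sketch of one Adam update; the dictionary-based bookkeeping for v and S is my own convention for the example.

```python
import numpy as np

def adam_step(W, b, dW, db, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; v and s hold running averages, all initialized to zeros like W and b."""
    # Momentum-style moving averages of the gradients
    v["dW"] = beta1 * v["dW"] + (1 - beta1) * dW
    v["db"] = beta1 * v["db"] + (1 - beta1) * db
    # RMSprop-style moving averages of the squared gradients
    s["dW"] = beta2 * s["dW"] + (1 - beta2) * dW ** 2
    s["db"] = beta2 * s["db"] + (1 - beta2) * db ** 2
    # Bias correction; t is the iteration number, starting at 1
    v_dW_c = v["dW"] / (1 - beta1 ** t)
    v_db_c = v["db"] / (1 - beta1 ** t)
    s_dW_c = s["dW"] / (1 - beta2 ** t)
    s_db_c = s["db"] / (1 - beta2 ** t)
    # Parameter update
    W = W - alpha * v_dW_c / (np.sqrt(s_dW_c) + eps)
    b = b - alpha * v_db_c / (np.sqrt(s_db_c) + eps)
    return W, b, v, s
```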

So the Adam algorithm combines momentum and RMSprop gradient descent, and it is an extremely widely used learning algorithm that has proved effective across many different neural networks and a wide range of architectures.

This algorithm has a number of hyperparameters. The learning rate α is very important and often needs tuning; you can try a range of values and see what works. The usual default for β_1 is 0.9: this governs the moving (weighted) average of dW and db, the Momentum term. For the hyperparameter β_2, the authors of the Adam paper, the inventors of the algorithm, recommend 0.999; this governs the moving weighted average of dW² and db². The choice of ε matters much less, and you rarely need to set it yourself because it barely affects performance. In practice, people tend to just use the default values for β_1, β_2 and ε; I do not think anyone really tunes ε and then tries different values to see what works best. You can tune β_1 and β_2 as well, but few practitioners I know do so.

Why is this algorithm called Adam? Adam stands for Adaptive Moment Estimation: β_1 computes the mean of the derivatives (dW), which is called the first moment, and β_2 computes the exponentially weighted average of the squares (dW²), which is called the second moment, hence the name.

By the way, I have an old friend and collaborator named Adam Coates. As far as I know he has nothing to do with the Adam algorithm, though I believe he uses it occasionally. I sometimes get asked about this, so in case you were wondering, that is all there is to it.

That’s all there is to Adam’s optimization algorithm, with which you can train neural networks much more quickly, and before we wrap up this week, we’ll talk a little bit about hyperparametric tuning, and a better understanding of what the optimization problems of neural networks are. In the next video, we’ll talk about learning rate decay.

2.9 Learning Rate Decay

One way to speed up a learning algorithm is to slowly reduce the learning rate over time; we call this learning rate decay. Let's see how to implement it, starting with an example of why it can help.

Suppose you are using mini-batch gradient descent with a fairly small mini-batch size, say 64 or 128 samples. There is noise in the iterations (blue line): the descent heads toward the minimum here, but it never converges exactly, so your algorithm ends up wandering around the minimum rather than truly converging, because you are using a fixed learning rate α and different mini-batches are noisy.

But if you slowly reduce the learning rate α, then early on, while α is still large, learning is still relatively fast, and as α gets smaller your steps get slower and smaller. So in the end your trajectory (green line) oscillates within a small region close to the minimum, instead of wandering far around the minimum throughout training.

The intuition behind decaying α is that in the early stages of learning you can afford to take large steps, but as you approach convergence, a smaller learning rate lets you take smaller steps.

Here is how you can implement learning rate decay. Recall that one epoch is one pass through the data; so if you have a training set like this,

you split it into mini-batches; the first pass through the training set is epoch 1, the second pass is epoch 2, and so on. You can then set the learning rate to α = α_0 / (1 + decay-rate × epoch-num), where decay-rate is the decay rate, epoch-num is the epoch number, and α_0 is the initial learning rate. Note that decay-rate is another hyperparameter you may need to tune.

Here is a concrete example. Suppose you train for several epochs, that is, several passes through the data, with α_0 = 0.2 and decay-rate = 1. Then in epoch 1, α = 0.2 / (1 + 1 × 1) = 0.1, just substituting into the formula α = α_0 / (1 + decay-rate × epoch-num) with decay rate 1 and epoch number 1. In epoch 2 the learning rate is 0.067, in epoch 3 it is 0.05, in epoch 4 it is 0.04, and so on; you can compute a few more values yourself. The point is that, as a function of the epoch number, the learning rate gradually decreases according to this formula. If you want to use learning rate decay, you try different values of the hyperparameter α_0 and of the decay-rate hyperparameter until you find ones that work well. Besides this learning rate decay formula, people also use other formulas.
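To check the numbers in this example, here is the decay formula in code (a quick verification sketch):

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """Learning rate decay: alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

# Reproduce the example above with alpha0 = 0.2 and decay_rate = 1
for epoch in range(1, 5):
    print(epoch, round(decayed_learning_rate(0.2, 1, epoch), 3))
# epochs 1..4 give 0.1, 0.067, 0.05, 0.04
```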

One example is exponential decay, α = 0.95^epoch-num × α_0, where the base is some value less than 1, such as 0.95, so your learning rate decays exponentially.

Other formulas people use include α = (k / √epoch-num) × α_0, or α = (k / √t) × α_0, where t is the mini-batch number.

Sometimes people also use a discrete staircase decay, where the learning rate stays at one value for a while, then is halved, then halved again after another while, and so on; that is what a discrete staircase decay means.

So far we have looked at formulas for how the learning rate changes over time. Another thing people sometimes do is decay it manually. If you are training only one model at a time and it takes hours or days to train, some people will simply watch the model training over the days and say, "learning seems to have slowed down, let me turn the learning rate down a bit." Manually controlling the learning rate does work, hour by hour or day by day, but it is only practical when you are training a small number of models; still, sometimes people do it.

So you now have several options for controlling the learning rate. You might be wondering: that is a lot of hyperparameters, which should I choose? I would say it is too early to worry about that; next week we will talk about how to choose hyperparameters systematically. For me, learning rate decay is not the first thing to try: setting a fixed learning rate and tuning it well has a bigger impact. Learning rate decay certainly helps and can sometimes speed up training, but it is lower on my list of things to try. Next week, when we cover hyperparameter tuning, you will learn more systematic ways to manage all of these hyperparameters and how to search over them efficiently.

That’s learning rate decay, and finally I’m going to talk a little bit about local optimality and saddle points in neural networks, so you get a better idea of the optimization problems that your algorithm is solving as you train your neural network, and we’ll talk about those in the next video.

2.10 The Problem of Local Optima

In the early days of deep learning research, people worried that optimization algorithms would get trapped in very bad local optima. As deep learning theory has developed, our understanding of local optima has changed. Let me show you how we now think about local optima and about the optimization problem in deep learning.

This is the picture people used to have in mind when they thought about local optima. Perhaps you are optimizing two parameters, call them W1 and W2, and the height of the surface is the loss function. In a picture like this, local optima seem to be everywhere, and gradient descent or some other algorithm might get stuck in one of them and never reach the global optimum. If you plot a function of just a couple of dimensions like these two, it is easy to draw surfaces with many different local optima, and these low-dimensional pictures used to shape our intuition; but that intuition is not correct. In fact, when you train a neural network, most points of zero gradient are not local optima like the ones in this picture: most points where the gradient of the cost function is zero are saddle points.

That is, at a point like this, with W1 and W2 on the axes, the height is the value of the cost function J.

But for a function over a high-dimensional space, at a point where the gradient is zero, in each direction the function may curve up (be convex) or curve down (be concave). If you are in, say, a 20,000-dimensional space, then for the point to be a local optimum it must curve up in all 20,000 directions, and the probability of that is tiny, perhaps on the order of 2^-20000. It is much more likely that in some directions the curve bends up while in others it bends down, rather than bending up in all of them, so in a high-dimensional space you are far more likely to run into a saddle point,

Like this one:

rather than a local optimum. As for why such a surface is called a saddle point: imagine a saddle sitting on a horse's back; if this is the horse, this is the horse's head and those are its eyes (forgive my poor drawing), and you are the rider sitting on the saddle. The point here where the derivative is zero is called a saddle point, and I suppose it really is the point where you would sit on the saddle, and the derivative there is zero.

So one of the lessons from the history of deep learning is that much of our intuition about low-dimensional spaces, like the surface drawn above, does not transfer to high-dimensional spaces, and the same goes for other algorithms: if you have 20,000 parameters, J is a function over a 20,000-dimensional vector, and you are far more likely to encounter saddle points than local optima.

If local optima are not the problem, what is? It turns out that plateaus can slow learning down. A plateau is a region where the derivative stays close to zero for a long time. If you are here, gradient descent moves down the surface, but because the gradient is zero or close to zero and the surface is very flat, it can take a very long time to slowly work your way across the plateau to this point; then, thanks to a random perturbation to the left or right (let me change ink color so you can see it better), your algorithm can finally escape the plateau (red pen).

We can go along this long slope, all the way to here, and then come out of the plateau.

So the first takeaway of this video is that you are actually fairly unlikely to get stuck in bad local optima when you are training a reasonably large neural network with a lot of parameters, because the cost function is defined over a very high-dimensional space.

Second, plateaus are a real problem: they can make learning very slow, and this is where algorithms like Momentum, RMSprop and Adam can help. In these cases a more sophisticated optimizer, such as Adam, can speed up how quickly you move across and off the plateau.

Because your network is solving an optimization problem over such a high-dimensional space, frankly, I do not think anyone has great intuition for what these spaces really look like, and our understanding of them is still evolving. But I hope this gives you a better sense of the problems that optimization algorithms have to deal with.

References

[1] Deep Learning courses: mooc.study.163.com/university/…
[2] Huang Haiguang: github.com/fengdu78
[3] GitHub: github.com/fengdu78/de…