This article is from the notes section of Andrew Ng’s Deep Learning course [1].
Author: Huang Haiguang [2]
Main Author: Haiguang Huang, Xingmu Lin (All papers 4, Lesson 5 week 1 and 2, ZhuYanSen: (the third class, and the third, three weeks ago) all papers), He Zhiyao (third week lesson five papers), wang xiang, Hu Han, laughing, Zheng Hao, Li Huaisong, Zhu Yuepeng, Chen Weihe, the cao, LuHaoXiang, Qiu Muchen, Tang Tianze, zhang hao, victor chan, endure, jersey, Shen Weichen, Gu Hongshun, when the super, Annie, Zhao Yifan, Hu Xiaoyang, Duan Xi, Yu Chong, Zhang Xinqian
Editorial staff: Huang Haiguang, Chen Kangkai, Shi Qinglu, Zhong Boyan, Xiang Wei, Yan Fenglong, Liu Cheng, He Zhiyao, Duan Xi, Chen Yao, Lin Jianyong, Wang Xiang, Xie Shichen, Jiang Peng
Note: Notes and assignments (including data and original assignment files) and videos can be downloaded on Github [3].
I will publish the course notes on the official account “Machine Learning Beginners”; please follow it.
Week 2: Basics of Neural Network Programming
2.1 Binary Classification
This week we’re going to look at the basics of neural network programming. One thing to note is that there are some very important techniques and tricks we need to know when implementing a neural network. For example, if you have a training set of m samples, you might be used to running a for loop over each sample in the training set, but when implementing a neural network we usually avoid an explicit for loop over the entire training set, so in this week’s lesson you’ll learn how to process training sets that way.
In addition, neural network computation usually has a step called forward propagation, followed by a step called backward propagation. So this week I’m also going to show you why a neural network’s training computation can be divided into two separate parts: forward propagation and backward propagation.
In the course I will use logistic regression to convey these ideas to make them easier to understand. Even if you’ve read about logistic regression before, I think there’s something new and interesting to discover and learn, so let’s get down to business.
Logistic regression is an algorithm for binary classification. Let’s start with a problem. Here’s an example of a binary classification problem: given an image as input, say this cat, if you recognize the image as a cat, output the label 1 as the result; if it is not recognized as a cat, output the label 0. We use the letter y to represent the output label, as shown below:
Let’s take a look at how an image is represented in a computer. To store an image, the computer saves three matrices, corresponding to the red, green, and blue color channels of the image. If your image is 64×64 pixels, you have three 64×64 matrices, corresponding to the intensity values of the red, green, and blue pixels in the picture. For ease of representation, here I have drawn three very small matrices; note that they are 5×4 rather than 64×64, as shown below:
In order to put those pixel values into a feature vector, we extract them and unroll them into a feature vector x. We take out all the pixel values, such as 255, 231, and so on, until we run out of red pixels, then 255, 134, and so on for green, and likewise for blue, until we have a feature vector that lists all the red, green, and blue pixel values in the image. If the image is 64×64 pixels, then the total dimension of the vector is 64 × 64 × 3, the total number of entries in the three pixel matrices, which in this example is 12,288. We use n_x = 12,288 to represent the dimension of the input feature vector, and sometimes I’ll just use lowercase n for brevity. In the binary classification problem, our goal is to learn a classifier that takes the feature vector of the picture as input and then predicts whether the output label y is 1 or 0, that is, whether there is a cat in the picture:
So let’s go through some of the notations that we’re going to use for the rest of the course.
Symbol definition:
x: the input data, an n_x-dimensional vector; its dimension is (n_x, 1);
y: the output result; its value is 0 or 1;
(x^(i), y^(i)): the i-th group of data, which may be training data or test data (training data by default);
X = [x^(1), x^(2), …, x^(m)]: the input values of all training samples, placed in an n_x × m matrix, where m is the number of samples;
Y = [y^(1), y^(2), …, y^(m)]: the corresponding output values of all training samples; its dimension is 1 × m.
A pair (x, y) is used to represent a single sample, where x is an n_x-dimensional feature vector and y, the label (output result), can only be 0 or 1. The training set consists of m training samples: (x^(1), y^(1)) represents the input and output of the first sample, (x^(2), y^(2)) the input and output of the second sample, and so on up to the last sample (x^(m), y^(m)); together they represent the entire training set. Sometimes we write m = m_train to emphasize the number of training samples, and m_test for the number of samples in the test set.
And finally, to write the training set a little more compactly, we define a matrix, denoted by a capital X, which consists of the input vectors x^(1), x^(2), etc. placed in its columns as shown below: we put x^(1) as the first column of the matrix, x^(2) as the second column, up to x^(m) as the m-th column, and we have the training set matrix X. So this matrix has m columns, where m is the number of training samples, and its height (number of rows) is n_x. Note that sometimes, for other reasons, the matrix may be stacked with training samples in rows instead of columns — the transposes x^(1)T through x^(m)T — but when implementing neural networks, using the column form on the left makes the whole implementation much easier:
Now, a quick review: X is a matrix of size n_x × m. When you implement it in Python, you’ll see X.shape, the Python command that shows the size of a matrix; X.shape equals (n_x, m), i.e. a matrix of size n_x times m. So, in summary, this is how to represent the training samples (the set of input vectors) as a matrix.
What about the output labels y? Similarly, to make a neural network easier to implement, placing the labels in columns makes subsequent calculations very convenient, so we define a capital Y = [y^(1), y^(2), …, y^(m)]. So here Y is a matrix of size 1 × m, and again in Python, Y.shape equals (1, m), which says this is a 1 × m matrix.
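These shape conventions can be checked with a minimal NumPy sketch; the 12,288 cat-image features are replaced by random numbers, and the sample count m = 100 is a hypothetical value chosen just for illustration:

```python
import numpy as np

n_x = 12288   # feature dimension: 64 * 64 * 3 pixel values
m = 100       # number of training samples (hypothetical)

# Stack each sample's feature vector as one COLUMN of X.
X = np.random.rand(n_x, m)
# Stack the labels in a single 1-by-m row.
Y = np.random.randint(0, 2, size=(1, m))

print(X.shape)  # (12288, 100)
print(Y.shape)  # (1, 100)
```

The column convention means one sample is one column, so slicing the i-th sample is `X[:, i]`.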
As you implement neural networks later in the course, you will find that a good notation convention helps organize data from different training samples. And the quantities I’m talking about include not only x and y but also other quantities you’ll see later. Take the data from the different training samples and, just as we did with x and y, stack them in the columns of a matrix; that is the notation we’ll use later for logistic regression and neural networks. If you ever forget the meaning of these symbols, such as what m is, or what n is, or anything else, we will also post a notation guide on the course website so you can quickly look up what each symbol means. OK, on to the next video, where we will start with logistic regression. Note: the notation is also written in the appendix.
2.2 Logistic Regression
In this video, we will review the logistic regression learning algorithm, which is suitable for binary classification problems. This section mainly introduces the Hypothesis Function of logistic regression.
For binary classification problems, given an input feature vector x, which might correspond to an image that you want to identify as a cat picture or not, you want an algorithm that produces a prediction, which we call ŷ: your estimate of the actual label y. More formally, you want ŷ to represent the probability that y equals 1, given the input features x, i.e. ŷ = P(y = 1 | x). In other words, if x is the picture we saw in the last video, you want ŷ to tell you what the odds are that it is a picture of a cat. As in the previous video, x is an n_x-dimensional vector. The parameters of logistic regression are w, also an n_x-dimensional vector (w is the feature weights, with the same dimension as the feature vector), and b, which is a real number (the bias). So, given the input x and the parameters w and b, how do we generate the prediction ŷ? One thing you could try, but that does not work here, is ŷ = w^T x + b.
What we get there is a linear function of the input x, which is in fact what you use in linear regression, but it’s not a good algorithm for binary classification, because you want ŷ to represent the probability that the actual label y equals 1, so ŷ should lie between 0 and 1. This is a problem that needs to be solved, because w^T x + b can be much larger than 1, or even negative, which makes no sense for a probability. So in logistic regression our output is instead ŷ = σ(w^T x + b): we take the linear function we got above as the argument of the sigmoid function (the formula at the bottom of the figure above), converting the linear function into a nonlinear one.
The graph below is the graph of the sigmoid function. If the horizontal axis is z, then σ(z) looks like this: it rises smoothly from 0 to 1. Let me label the vertical axis here; the intercept where the curve crosses the vertical axis is 0.5. This is the graph of σ(z), and we usually use z to represent the value w^T x + b.
The formula for the sigmoid function is σ(z) = 1 / (1 + e^(-z)), where z is a real number. A few things to note here: if z is very large, then e^(-z) is close to 0, so σ(z) is close to 1 over 1 plus something very close to 0, i.e. σ(z) is very close to 1. On the other hand, if z is very small, i.e. a very negative number with a very large absolute value, then e^(-z) is a very large number, so σ(z) is 1 over 1 plus a very, very large number, which is close to 0. Indeed, you can see that as z becomes a very negative number, σ(z) gets very close to 0. So when you implement logistic regression, your job is to make the machine learn parameters w and b so that ŷ becomes a good estimate of the probability that y = 1.
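The limiting behavior described above is easy to verify numerically; here is a minimal sketch of the sigmoid function, with ±100 chosen as stand-ins for "very large" and "very negative" z:

```python
import numpy as np

def sigmoid(z):
    """Compute sigma(z) = 1 / (1 + exp(-z)), elementwise on arrays."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))     # 0.5: the curve crosses the vertical axis at 0.5
print(sigmoid(100))   # very close to 1: large z
print(sigmoid(-100))  # very close to 0: very negative z
```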
Before moving on, I’ll mention a notation convention that keeps the parameters w and b separate. One thing to note symbolically is that when we program neural networks, we usually separate w from b, where b corresponds to a bias term. In previous machine learning classes, you may have seen a different notation for handling this: in some conventions you define an extra feature x_0 and set it equal to 1, so that x becomes an (n_x + 1)-dimensional vector, and then define ŷ = σ(θ^T x). In that alternative convention you have a single parameter vector θ, where θ_0 plays the role of b (a real number) and the rest, θ_1 through θ_{n_x}, play the role of w. It turns out that when you implement your neural network, one of the easier ways is to keep w and b separate. So we won’t use that alternative notation in this course; don’t worry about it. Now that you know what the logistic regression model looks like, the next step is to train the parameters w and b, and for that you need to define a cost function, which we’ll explain next time.
2.3 Logistic Regression Cost Function
In the last video, we looked at logistic regression models, and in this video, we’re going to look at the cost function of logistic regression.
Why we need a cost function:
In order to train the parameters w and b of the logistic regression model, we need a cost function; we obtain w and b by minimizing the cost function during training. First, look at the output function of logistic regression:
ŷ = σ(w^T x + b), where σ(z) = 1 / (1 + e^(-z))
In order for the model to learn to adjust its parameters, you are given a training set of m samples; training on this set finds parameters w and b such that your output ŷ^(i) is close to the label y^(i).
We write the prediction for a training sample as ŷ^(i), and we hope it will be close to the corresponding value y^(i) in the training set. To spell out the formula above in more detail: the definition so far is for a single training sample, and this form applies to each training sample; we use superscripts with parentheses to index samples. The prediction ŷ^(i) corresponding to training sample i is obtained by applying the sigmoid function to z^(i) = w^T x^(i) + b; you can also define z^(i) = w^T x^(i) + b directly, and we will use that notation. The superscript (i) indicates the data — x, y, z, or other quantities — associated with the i-th training sample; that’s what the superscript means.
Loss function:
The loss function, also called the error function, is used to measure how well the algorithm is doing.
We measure how close the predicted output ŷ is to the actual value y with what’s called the loss function, L(ŷ, y). In general one might use the squared error (ŷ − y)^2, or half of it, (1/2)(ŷ − y)^2, but we usually don’t do this in logistic regression, because when we learn the logistic regression parameters with that loss, the optimization objective turns out not to be convex: it has multiple local optima, so gradient descent may fail to find the global optimum. Although the squared error looks like a reasonable loss function, in the logistic regression model we define a different one.
The loss function we use in logistic regression is: L(ŷ, y) = −( y log(ŷ) + (1 − y) log(1 − ŷ) )
Why use this function as the loss function? When we use the squared error as a loss function, you want the error to be as small as possible; for the logistic regression loss function, we likewise want it to be as small as possible. To better understand how the loss function works, let’s look at two cases:
When y = 1, the loss function is L = −log(ŷ). If you want the loss to be as small as possible, you want ŷ to be as large as possible; since the sigmoid function’s output lies between 0 and 1, ŷ is pushed infinitely close to 1.
When y = 0, the loss function is L = −log(1 − ŷ). If you want the loss to be as small as possible, you want ŷ to be as small as possible; since the sigmoid output lies between 0 and 1, ŷ is pushed infinitely close to 0.
Many loss functions in this course have a similar effect: if y equals 1, we make ŷ as large as possible, and if y equals 0, we make ŷ as small as possible. The loss function is defined on a single training sample and measures how the algorithm performs on that one sample. To measure how the algorithm performs over all training samples, we define the cost function of the algorithm: the losses of the individual samples summed and divided by m, J(w, b) = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i)). The loss function applies only to a single training sample, while the cost function is the total cost over the whole training set for the parameters w and b. Therefore, when training the logistic regression model, we need to find w and b that minimize the cost function J. From this derivation of the logistic regression algorithm — the loss for a single sample and the total cost for the parameters chosen by the algorithm — it turns out that logistic regression can be viewed as a very small neural network; in the next video we’ll see what a neural network can do.
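The loss and cost defined above can be sketched directly in NumPy; the sample predictions and labels below are made-up values chosen just to show that confident correct predictions give a small loss and confident wrong ones a large loss:

```python
import numpy as np

def loss(y_hat, y):
    """Logistic loss for a single sample: -(y log yhat + (1-y) log(1-yhat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Cost J(w, b): the per-sample losses summed and divided by m."""
    m = Y.shape[1]
    return np.sum(loss(Y_hat, Y)) / m

print(loss(0.99, 1))  # small loss: confident, correct prediction for y = 1
print(loss(0.01, 1))  # large loss: confident, wrong prediction for y = 1

Y = np.array([[1, 0, 1]])          # labels (made-up)
Y_hat = np.array([[0.9, 0.2, 0.7]])  # predictions (made-up)
print(cost(Y_hat, Y))              # average loss over the m = 3 samples
```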
2.4 Gradient Descent
What can gradient descent do?
It trains the parameters w and b by minimizing the cost function J(w, b) on your training set.
In the second line of the figure, the cost function J(w, b) of the logistic regression algorithm is given as before.
A visual illustration of gradient descent
In this diagram, the horizontal axes represent the parameters w and b; in practice w can be much higher dimensional, but for the sake of drawing we treat w and b as single real numbers. The cost function J(w, b) is the surface above the horizontal axes, so the height of the surface is the value of J(w, b) at a given point. What we want to do is find the parameters w and b that make the cost function minimal.
As shown, the cost function is a convex function, like a large bowl.
This is kind of the opposite of what we just saw: a non-convex function has many different local minima. Because of this, for logistic regression we must define a cost function J(w, b) that is convex. The steps of gradient descent are: 1. Initialize w and b.
We initialize the parameters w and b at that little red dot. You can also use random initialization; for logistic regression almost any initialization method works, because the function is convex: no matter where you initialize, you should reach the same point or roughly the same point.
We initialize the parameters w and b at the coordinates of the little red dot in the figure.
2. Take a step down the steepest slope, iterating over and over
We take a step down the steepest slope, as shown here, to the second little red dot.
We could either stop here or we could go one more step down the steepest slope, and then after two iterations we get to the third little red dot.
3. Until the global optimal solution is reached or close to the global optimal solution
Through the above three steps we can find the global optimal solution, that is, the minimum point of the convex cost function J(w, b).
A detailed description of gradient descent (one parameter only)
Suppose the cost function J(w) has only one parameter, w; replacing the multidimensional surface with a one-dimensional curve makes the graph easier to draw.
To iterate is to repeat the following update over and over again: w := w − α (dJ(w)/dw)
:= denotes updating the parameter w;
α denotes the learning rate, which controls the step size; the derivative dJ(w)/dw is the slope that determines the direction and length of each downhill step. In code, we use dw to represent this derivative.
A more visual way to understand the derivative is as a slope: the derivative at a point is the height divided by the width of the small triangle tangent to the curve at that point. Suppose we start at the point shown, where the sign of the slope is positive, i.e. dJ(w)/dw > 0; the update then decreases w, so we take a step to the left.
The whole iterative process of gradient descent is to keep going left until you get to the minimum point.
Suppose instead we start at a point where the slope is negative, i.e. dJ(w)/dw < 0; the update then increases w, so we take a step to the right.
The whole iterative process of gradient descent is to keep going right, that is, towards the minimum point.
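The one-parameter update rule above can be sketched on a toy convex cost, J(w) = (w − 3)^2 — a made-up stand-in for the bowl-shaped curve in the figure, with derivative dJ/dw = 2(w − 3):

```python
# Gradient descent on a toy one-parameter cost J(w) = (w - 3)**2.
alpha = 0.1   # learning rate: controls the step size
w = 10.0      # initial value; for a convex cost, any starting point works

for _ in range(100):
    dw = 2 * (w - 3)    # derivative dJ/dw at the current w
    w = w - alpha * dw  # the update rule: w := w - alpha * dJ/dw

print(round(w, 4))  # 3.0 -- converged to the minimum point
```

With a positive slope (w > 3) the update moves w left; with a negative slope it moves w right, exactly as described above.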
Detailed description of gradient descent method (two parameters)
The cost function of logistic regression, J(w, b), has two parameters, and the updates become: w := w − α (∂J(w, b)/∂w) and b := b − α (∂J(w, b)/∂b).
The symbol ∂ (sometimes read informally as “round d”) denotes a partial derivative: the derivative of a function with respect to one of its variables. In code, we use dw to represent ∂J(w, b)/∂w and db to represent ∂J(w, b)/∂b. The lowercase letter d is used for the derivative when a function has only one parameter; the partial derivative symbol ∂ is used when a function has two or more parameters.
2.5 Derivatives
What I want to do in this video is give you an intuitive understanding of calculus and derivatives. Maybe you think you haven’t done calculus since college; depending on when you graduated, that may have been a while, but if you’re concerned about that, please don’t worry. In order to use neural networks and deep learning effectively, you don’t need to understand calculus very deeply. So if you’re watching this video or future videos and you think, “Wow, all this math is complicated for me,” my advice is: stick with the videos, do your homework, and complete the programming assignments successfully; then you can use deep learning. During the fourth week you’ll see many kinds of functions defined that, through calculus, help you put everything together, some of them called forward functions and backward functions, so you don’t need to know every function used in calculus. But to peek under the hood of deep learning, this week we will go into some of the details of calculus; all you need is an intuitive understanding of it to build and successfully apply these algorithms. And finally, if you’re in that small subset of people who are proficient in calculus, you can skip this video. The rest of you, let us dive into derivatives.
Consider the function f(a) = 3a. It’s a straight line. Let’s think about the derivative a little by looking at a couple of points on this function. Say a = 2; then f(a) = 3 × 2 = 6. Now change the value of a a little bit, just a tiny nudge to 2.001, a small shift to the right. The difference of 0.001 is too small to show at true scale in the graph, so it’s drawn a bit larger. Now f(a) equals 3 × 2.001 = 6.003, as drawn in the graph (the proportions don’t quite fit). Look at the little triangle highlighted in green: if you move a to the right by 0.001, f(a) increases by 0.003, three times as much as the shift in a. So we say the derivative of the function f(a) at a = 2 — that is, the slope — is 3. The concept of a derivative just means slope; “derivative” sounds like a scary word, but “slope” is a very friendly way of describing the same concept. So whenever you hear “derivative”, just think of the slope of the function. More formally, the slope is the height divided by the width of the little green triangle above: 0.003 divided by 0.001, which is 3. So the derivative df(a)/da equals 3, meaning that when you nudge a to the right by 0.001, f(a) increases by 3 times the amount you moved.
Now let’s look at this function at a different point. Suppose a = 5; then f(a) = 15. Nudge a to the right by the same small margin, increasing it to 5.001; then f(a) becomes 15.003. The slope is again 3: that’s what happens when you change the value of the variable a a little bit. An equivalent way to write the derivative is df(a)/da or d/da f(a); it doesn’t matter whether you write the f(a) on top or to the side. In this video I talk about nudging a by 0.001, but if you want the mathematical definition of a derivative: the derivative is defined with a shift to the right that is infinitesimally small (not 0.001, but something far, far smaller), and f(a) then increases by 3 times that infinitesimally small amount. That increase is the height of the right side of the triangle.
That’s the formal definition of a derivative. But for intuition, we’ll keep talking about nudging a to the right by 0.001, even though 0.001 is not infinitesimally small. One property of this function is that the slope is 3 everywhere: no matter what the value of a is, if you increase a by 0.001, the value of f(a) increases by three times as much. This function has the same slope everywhere. One way to see it is that no matter where you draw the little triangle, its height divided by its width is always 3.
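The nudge-by-0.001 argument above can be checked with a few lines of Python, computing the slope as the little triangle's height divided by its width:

```python
def f(a):
    """The straight line from the example: f(a) = 3a."""
    return 3 * a

a = 2.0
h = 0.001  # the small nudge to the right
slope = (f(a + h) - f(a)) / h  # height of the triangle / width
print(round(slope, 6))  # 3.0 -- the same at every a, since f is a line
```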
I want to give you a sense of what a slope — a derivative — is. For a straight line, the slope of the function in this example is 3 everywhere. In the next video, let’s look at a more complicated example, where the slope of the function varies at different points.
2.6 More Derivative Examples
In this video I’m going to give you a more complicated example, where the slope of the function is different at different points, but let’s start with an example:
Let me draw a function here: f(a) = a². If a = 2, then f(a) = 4. Let’s nudge a a little to the right, so now a = 2.001; then f(a) ≈ 4.004 (if you use a calculator, the exact value is 4.004001, but 4.004 is close enough). If you draw a little triangle here, you will see that if you move a to the right by 0.001, f(a) increases by four times as much, by 0.004. In calculus we call the slope of the hypotenuse of this triangle the derivative of f(a) at a = 2, which is 4; in derivative notation, df(a)/da = 4 when a = 2.
And there’s an intuitive way to see why the slope is different at different points: if you draw little triangles at different points on the curve, the ratio of height to width of the triangle differs from point to point. So when a = 2 the slope is 4, and when a = 5 the slope is 10. If you look in a calculus book, it will tell you that the slope (the derivative) of the function f(a) = a² is 2a. That means that at any given point a, if you nudge a a little bit, by 0.001, f(a) increases by roughly 2a — the slope, or derivative, at that point — times the distance you move to the right.
Now, one small detail to note: the amount by which f(a) actually increases is not exactly the value given by the derivative formula; the formula gives an approximation, which becomes exact as the nudge shrinks toward zero.
To summarize what we’ve learned in this lesson, let’s look at a few more examples:
Suppose f(a) = a³. If you look at a table of derivatives, you’ll see that the derivative of this function equals 3a². So what does that mean? Again, let’s take an example: set a = 2, so f(a) = 8. Now increase a a little bit, to 2.001; you can check for yourself that f(a) becomes approximately 8.012. Indeed, when a = 2, the derivative is 3 × 2² = 12. So the derivative formula says that if you move a to the right by 0.001, f(a) increases by 12 times as much, i.e. by 0.012.
Let’s look at one last example. Suppose f(a) = log(a) — some write ln(a), the natural logarithm. The slope of this function is 1/a, which we can check as follows: take a = 2 and nudge it to the right by 0.001; then f(a) increases, and if you use your calculator you’ll find that when a = 2, log(a) ≈ 0.69315, and when a = 2.001, log(a) ≈ 0.69365. So f(a) increased by about 0.0005, and if you look at the derivative formula, the derivative’s value at a = 2 is 1/2. This means that if you increase a by 0.001, f(a) increases by only half of 0.001, i.e. 0.0005. If you draw a little triangle you will see that when the horizontal axis increases by 0.001, the function on the vertical axis increases by half of 0.001, i.e. 0.0005. So 1/a — here 1/2 when a = 2 — is the slope at that point. That’s a little bit about derivatives.
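All three derivative formulas quoted above (2a for a², 3a² for a³, 1/a for log a) can be checked numerically with the same nudge-to-the-right recipe:

```python
import math

def numeric_slope(f, a, h=0.001):
    """Approximate df/da by nudging a to the right by h."""
    return (f(a + h) - f(a)) / h

# f(a) = a**2: formula 2a -> slope 4 at a = 2, slope 10 at a = 5
print(round(numeric_slope(lambda a: a**2, 2.0), 3))  # 4.001
print(round(numeric_slope(lambda a: a**2, 5.0), 3))  # 10.001
# f(a) = a**3: formula 3a**2 -> slope 12 at a = 2
print(round(numeric_slope(lambda a: a**3, 2.0), 3))  # 12.006
# f(a) = log(a): formula 1/a -> slope 0.5 at a = 2
print(round(numeric_slope(math.log, 2.0), 3))        # 0.5
```

The small excesses (4.001 instead of 4, 12.006 instead of 12) are exactly the approximation error mentioned above, which vanishes as h shrinks.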
There are only two things you need to remember in this video:
The first point is that a derivative is just a slope, and the slope of a function can be different at different points. In the first example, f(a) = 3a is a straight line with the same slope, 3, at every point. But for functions such as f(a) = a² or f(a) = log(a), the slope changes, so the derivative — the slope — differs at different points on the curve.
Second, if you want to know the derivative of a function, you can refer to your calculus book or Wikipedia, and you should be able to find the formula for the derivative of these functions.
And finally, hopefully you now have a sense of what derivatives and slopes look like. In the next video we’ll talk about computation graphs, and how to use them to take derivatives of more complicated functions.
2.7 Computation Graph
The computations of a neural network are, so to speak, organized as a forward propagation pass, in which we compute the output of the neural network, followed by a backward propagation pass, which we use to compute the corresponding gradients or derivatives. The computation graph explains why the computations are organized this way. In this video we’ll walk through an example of a computation graph, using an example simpler, or less formal, than logistic regression or a full neural network.
We’re trying to compute a function J of three variables a, b, c: the function J(a, b, c) = 3(a + bc). Computing this function actually takes three steps. First compute b times c and store it in a variable u, so u = bc; then compute v = a + u; and then the output is J = 3v — that’s the function we want to evaluate. We can draw these three steps as the following computation graph: draw the three variables a, b, c; the first step computes u = bc (put a rectangle around it; its inputs are b and c); then the second step computes v = a + u; and the last step computes J = 3v. For example, with a = 5, b = 3, c = 2: u = bc = 6, v = a + u = 5 + 6 = 11, and J = 3v = 3 × 11 = 33. And if you work out 3(5 + 3 × 2), you indeed get 33. When there is a distinguished output variable — J in this example, or the cost function you want to optimize in logistic regression — the computation graph handles these calculations conveniently. As you can see from this small example, a left-to-right pass computes the value of J; to compute derivatives, a right-to-left pass (red arrows, opposite to the blue arrows) is the most natural. So, to recap: the blue left-to-right arrows organize the computation of the value. Let’s see how to compute derivatives along the reverse red arrows (right to left) in the next video. Let’s move on.
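The forward (left-to-right) pass through this computation graph is just three lines of code:

```python
# Forward pass through the computation graph for J = 3 * (a + b*c).
a, b, c = 5, 3, 2

u = b * c   # step 1: u = bc = 6
v = a + u   # step 2: v = a + u = 11
J = 3 * v   # step 3: J = 3v = 33

print(u, v, J)  # 6 11 33
```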
2.8 Derivatives with a Computation Graph
In the last video, we looked at an example of using a flow diagram to compute the function J. Now let’s clean up the description of the flowchart and see how you can use it to compute the derivative of a function.
The following formulas are used: u = bc, v = a + u, J = 3v.
Here’s a flow chart:
Suppose you want to calculate dJ/dv. How would you do it? In other words, if we take the value of v and change it a little, how does the value of J change?
Well, by definition J = 3v, and right now v = 11. If you increase v by a little bit, say to 11.001, then J = 3v goes from 33 to 33.003: I increased v by 0.001 here, and the end result is that J went up 3 times as much. So dJ/dv = 3, because for any increment in v, J changes by 3 times that amount. This is similar to the example in the last video, where f(a) = 3a gave df(a)/da = 3; here J = 3v gives dJ/dv = 3, with J playing the role of f and v playing the role of a from the previous video.
In terms of the backpropagation algorithm: when you compute the derivative of the final output variable J — the variable you care about most — with respect to v, you’ve done one step of backpropagation, which in this flow chart is one step backwards.
Let’s look at another example. What is dJ/da? In other words, if we raise the value of a, what happens to the value of J?
Ok, so let’s look at this example. Take the variable a = 5 and increase it to 5.001. The effect on v is that v, which was 11, now becomes 11.001, and, as we saw above, J then becomes 33.003. So what we see is that if you increase a by 0.001, J increases by 0.003. By “increasing a” I mean: if you change this value 5 to some new value, the change propagates to the far right of the flow chart, so J ends up as 33.003. The change in J is 3 times the change in a, which means the derivative dJ/da = 3.
One way to explain this calculation: if you change a, then v changes, and through the change in v, J changes. So the net change in the value of J when you nudge a up by a little bit (0.001) is 0.003.
First of all, when a goes up, v goes up. How much? That depends on dv/da. Then the change in v causes J to increase as well. This is what calculus calls the chain rule: if a affects v and v affects J, then the change in J when you nudge a is the change in J per unit change in v times the change in v per unit change in a — that is, dJ/da = (dJ/dv) × (dv/da).
We saw from the earlier calculation that dJ/dv = 3. And since v = a + u, if you increase a by 0.001, v changes by the same amount, so dv/da = 1. In fact, if you plug in the value we computed before, the product (dJ/dv)(dv/da) = 3 × 1 actually gives you the right answer: dJ/da = 3.
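The chain-rule factorization can also be checked numerically, using the values from this example (a = 5, u = 6); this is an illustrative sketch, not course code:

```python
# v = a + u (with u held fixed at 6), and J = 3v
def v_of_a(a, u=6.0):
    return a + u

def J_of_v(v):
    return 3 * v

a, v, eps = 5.0, 11.0, 0.001
# Finite-difference estimates of each factor
dv_da = (v_of_a(a + eps) - v_of_a(a)) / eps   # ≈ 1
dJ_dv = (J_of_v(v + eps) - J_of_v(v)) / eps   # ≈ 3
# Chain rule: dJ/da = dJ/dv * dv/da
print(dJ_dv * dv_da)  # ≈ 3.0
```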
And this little diagram shows how having computed dJ/dv, the derivative with respect to the variable v, helps you calculate dJ/da. So this is another step of the back propagation calculation.
Now, I want to introduce a new notational convention. When you program back propagation, there is usually one final output value that you care about — the final output variable that you really want to optimize. In this case the final output variable is J, the last symbol in the flow chart. A lot of the computation is about calculating the derivative of that final output variable with respect to some intermediate variable, such as dJ/dv or dJ/da. So when you implement this in software, what should the variable be named? One thing you could do in Python is write a very long variable name, something like dFinalOutputVar_dvar, but that name is a little long. You could use dJdvar, but because you are always taking the derivative with respect to J, the final output variable, I’m going to introduce a simpler convention: in code, we will use the variable name dvar to represent the quantity dJ/dvar.
Ok, so in the program, dvar denotes the derivative of the final output variable you care about, J, with respect to one of the intermediate quantities in your code. So dJ/dv is represented in your code as dv, and dJ/da is represented as da.
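As a tiny illustration of this naming convention (hypothetical variable names, not taken from the course code):

```python
# Convention: for the final output J, the code variable "dvar" stores dJ/dvar.
# So instead of a long name like dFinalOutputVar_dv, we just write dv.
dv = 3.0        # dJ/dv, since J = 3v
da = dv * 1.0   # dJ/da = dJ/dv * dv/da, with dv/da = 1 because v = a + u
print(dv, da)   # 3.0 3.0
```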
Ok, so that’s how the backward propagation algorithm proceeds through this flow chart. Let’s look at the rest of this example on the next slide.
Let’s draw a clean copy of the flow chart and review. So far, propagating backward, we’ve calculated dv = 3 — again, dv is the variable name in code, and its real definition is dJ/dv. And we found da = 3 — again, the variable name da in code actually represents the value dJ/da.
So we’ve worked out by hand how back propagation proceeds along these two edges of the graph.
Ok, so let’s go ahead and compute the next derivative. What is dJ/du? By a calculation similar to the one before: we start with u = 6; if you increase u to 6.001, then v = a + u, which was 11, becomes 11.001, so J goes from 33 to 33.003. The increment in J is 3 times the increment in u, so dJ/du = 3. The analysis of du is very similar to the analysis of da; in fact it comes from dJ/du = (dJ/dv)(dv/du), and with dJ/dv = 3 the final result is dJ/du = 3.
So we have one more step of back propagation: we’ve figured out that du = 3, where dv/du is, of course, 1.
Now, let’s take a closer look at the next example. What about dJ/db? Imagine that you are allowed to change the value of b, and you want to nudge it a little bit to drive J toward a maximum or a minimum. What is the derivative — the slope of the function J — when you change the value of b a little? Using the chain rule of calculus, it can be written as the product of two factors: dJ/db = (dJ/du)(du/db). The reasoning is: if you change b a little bit, say from 3 to 3.001, the way it affects J is that it first affects u. How much does it affect u? Well, the definition is u = bc, so u, which was 6, now becomes 6.002 — right, because c = 2 in our example — so du/db = 2. This tells us that when you increase b by 0.001, u goes up by twice as much. So now u goes up by twice as much; how much does J increase? We already figured out that dJ/du = 3, so multiplying these two factors, we find that dJ/db = 3 × 2 = 6.
Ok, so that’s the second part of the derivation, where we want to know what happens when u increases by 0.002. The fact that dJ/du = 3 tells us that when u increases by 0.002, J goes up by 3 times that, so J should go up by 0.006. That comes from dJ/du = 3.
And if you check the math in detail, you see: if b goes to 3.001, then u = bc goes to 6.002, then v = a + u goes to 11.002, and then J = 3v goes to 33.006, right? That’s how you get dJ/db = 6.
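You can verify that 33 → 33.006 numerically with a quick finite-difference sketch (illustrative code, not from the course):

```python
# The whole forward pass of the flow chart in one expression:
# u = b*c, v = a + u, J = 3*v
def forward(a, b, c):
    return 3 * (a + b * c)

eps = 0.001
J0 = forward(5.0, 3.0, 2.0)        # 33.0
J1 = forward(5.0, 3.0 + eps, 2.0)  # 33.006

dJ_db = (J1 - J0) / eps
print(dJ_db)  # ≈ 6.0, i.e. dJ/db = 6
```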
To fill it in on the flow chart, working backward this way gives db = 6 — and again, db is just the name of the variable dJ/db in Python code.
I won’t go through the last example in great detail, but in fact, if you do the math, dJ/dc = (dJ/du)(du/dc) = 3 × 3 = 9.
I’m not going to elaborate on this example; in the last step, we derive dc = 9.
So the point of this video is: for an example like this, when you compute all of these derivatives, the most efficient way is from right to left, following the red arrows. In particular, we compute the derivative with respect to v first, and it is then used when we compute the derivatives with respect to a and u. The derivative with respect to u, in turn, helps compute the derivative with respect to b and the derivative with respect to c.
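Putting the whole example together, the left-to-right forward pass and the right-to-left backward pass can be sketched in a few lines of Python (variable names follow the dvar convention above; this is an illustrative sketch, not the course’s code):

```python
# Inputs of the flow chart
a, b, c = 5.0, 3.0, 2.0

# Forward pass (left to right): compute J
u = b * c      # u = 6
v = a + u      # v = 11
J = 3 * v      # J = 33

# Backward pass (right to left): dvar means dJ/dvar
dv = 3.0       # dJ/dv, since J = 3v
da = dv * 1.0  # dJ/da = dJ/dv * dv/da, with dv/da = 1
du = dv * 1.0  # dJ/du = dJ/dv * dv/du, with dv/du = 1
db = du * c    # dJ/db = dJ/du * du/db, with du/db = c = 2
dc = du * b    # dJ/dc = dJ/du * du/dc, with du/dc = b = 3

print(J, dv, da, du, db, dc)  # 33.0 3.0 3.0 3.0 6.0 9.0
```

Note that dv is reused to compute da and du, and du is reused to compute db and dc — exactly the right-to-left reuse that makes back propagation efficient.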
So this flow chart gives a forward, left-to-right pass to compute the cost function J, which you might want to optimize, and a reverse, right-to-left pass to compute the derivatives. If you’re not familiar with calculus or the chain rule, I know some of these details went by fast; but if you haven’t followed all the details, don’t worry. In the next video I’ll go through it again in the context of logistic regression, and show you what you need to do to write code that performs derivative calculations in a logistic regression model.
References

[1] Deep learning courses: mooc.study.163.com/university/…

[2] Huang Haiguang: github.com/fengdu78

[3] GitHub: github.com/fengdu78/de…