Problems that need to be solved

Error function and gradient descent are two of the most fundamental concepts in machine learning. Before explaining them, let’s talk about the problem machine learning tries to solve: making predictions about the future based on previous data. This raises two questions: how do we make such a prediction, and how do we evaluate its accuracy?

Let’s answer the first question: how do we make predictions? Since we predict based on previous data, we need to find some regularity in that data. In general, the things we can predict mostly have some causal correlation; for example, we can predict income from age (used here only as an example). In this case there is just one factor influencing income. We can represent these quantities with variables:

x → age

y → monthly income

What machine learning does is figure out the relationship between these two things, and because these relationships can be expressed mathematically, we can build mathematical models. Let’s say your sample data looks like this:

By looking at the approximate distribution of the data, we can see that a linear model is suitable for making predictions, so we get the following model expression:

h(x) = θ₀ + θ₁x
So the question is: how do we get reasonably accurate values of θ₀ and θ₁, and how do we judge how good they are? That is where the error function comes in. The error function measures the difference between prediction and reality; the smaller the difference, the better the parameters of the model are chosen. The other question is how to choose the right θ₀ and θ₁, and that is what gradient descent is for: it finds more appropriate parameters by repeatedly decreasing the error function.


Error function

As mentioned earlier, the error function measures the difference between the actual and the predicted value, which can be expressed as follows:

h(x) − y
Here, y is the actual value. Of course, it is not enough to write just this, because we have multiple samples, so we can write it as a sum:

Σᵢ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)
Here m means we have m samples, and (x⁽ⁱ⁾, y⁽ⁱ⁾) is the i-th piece of data; likewise, x⁽ⁱ⁾ is the age in the i-th sample and y⁽ⁱ⁾ is the actual income in the i-th sample. But there is a problem: h(x⁽ⁱ⁾) − y⁽ⁱ⁾ can be positive or negative. Both are errors, but positive and negative terms cancel each other out, which does not reflect the real situation well, so we square before summing and get the following formula:

Σᵢ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
If you think about the actual situation, you will find that the result of the above formula is positively correlated with the amount of data; that is, the error grows as the amount of data grows. When we train the model with different amounts of data, the results of the error function cannot be compared horizontally. Therefore, we use an average to describe the error function, obtaining the following formula:

J(θ₀, θ₁) = (1/2m) · Σᵢ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
Here we multiplied by 1/2 mainly for convenience when taking derivatives later. Substituting the model, we obtain the following expression of the error function:

J(θ₀, θ₁) = (1/2m) · Σᵢ (θ₀ + θ₁x⁽ⁱ⁾ − y⁽ⁱ⁾)²
Notice that the independent variables of the error function are θ₀ and θ₁, that is, the parameters of the model. We can plot it from the sample data; with three variables (θ₀, θ₁ and J), it is a three-dimensional graph:

The z-axis is J(θ₀, θ₁). We need to make J(θ₀, θ₁) as small as possible, and that is what gradient descent does.
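The error function described above can be sketched directly in code. The (age, income) dataset below is made up purely for illustration, and the parameter values are arbitrary:

```python
# Sketch of the error (cost) function J(θ0, θ1) for the linear model
# h(x) = θ0 + θ1·x. The sample data here is invented for illustration.
ages = [22, 25, 30, 35, 40, 45]                  # x values
incomes = [2000, 2500, 3500, 4200, 5000, 5600]   # y values (monthly income)

def cost(theta0, theta1):
    """J(θ0, θ1) = (1/2m) · Σ (θ0 + θ1·x⁽ⁱ⁾ − y⁽ⁱ⁾)²"""
    m = len(ages)
    total = 0.0
    for x, y in zip(ages, incomes):
        prediction = theta0 + theta1 * x
        total += (prediction - y) ** 2
    return total / (2 * m)

# A parameter pair closer to the data's trend yields a smaller error:
print(cost(0, 0))        # every prediction is 0, so the error is large
print(cost(-1300, 155))  # roughly follows the trend, much smaller error
```

Trying a few parameter pairs by hand like this is exactly the search that gradient descent automates.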


Gradient descent

There’s a great video on derivatives and gradients on YouTube; it’s only five minutes long, but it’s very visual and easy to understand, so it might help you.

Gradient descent is a very widely used algorithm for finding minima; to be precise, local minima, not necessarily global minima, as I’ll explain later. Before we talk about gradient descent, there’s another small piece of math we need: the derivative. If you studied advanced mathematics in college, you should be familiar with it; you probably know how to take derivatives, having done many problems and applied many formulas before.

Let’s take two-dimensional coordinates as an example. If we have a curve in two-dimensional coordinates, the derivative at some point on the curve is the slope at that point:

You can see that we can take the tangent line at this point, and the derivative is the slope of this tangent line. So what is the slope? The slope tells you how much the dependent variable y changes with respect to the independent variable x: for the same change in x, the bigger the change in y, the bigger the slope. The slope describes not only the magnitude of the change but also its direction: if y increases as x increases, the correlation is positive and the slope is positive; conversely, the correlation is negative and the slope is negative. Of course, only on two-dimensional curves can we use tangent lines; in higher dimensions we have to use vectors (the gradient), but the basic principle is the same.
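The slope idea can be checked numerically with a finite difference; the function used below is just an illustrative example:

```python
def slope(f, x, h=1e-6):
    """Approximate the derivative of f at x with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2  # an example curve

print(slope(f, 3))   # positive: f increases as x increases here
print(slope(f, -3))  # negative: f decreases as x increases here
```

The sign of the result tells you the direction of change, and its magnitude tells you how steep the curve is; that is all gradient descent needs.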

What is the purpose of all this? Recall that when we introduced the error function, we said we need to pick the right θ₀ and θ₁ to make J(θ₀, θ₁) as small as possible. We give θ₀ and θ₁ starting values (figuratively, we choose a point on the 3D surface shown earlier), take the derivative at this point to get the “slope”, and then adjust according to the “slope”. This can be expressed as follows:

θⱼ := θⱼ − α · ∂J(θ₀, θ₁)/∂θⱼ   (for j = 0 and j = 1, updated simultaneously)
Say the “slope” with respect to θⱼ at some point is negative, indicating that J decreases as θⱼ increases; then we want θⱼ to increase so that J decreases, and you can see from the expression that subtracting a negative term does exactly that. Here, α is the learning rate, which represents how far each step moves; we can adjust the value of α. It can be neither too large nor too small: if it is too large, we overshoot the extremum and cannot approach the minimum; if it is too small, each step changes little, so computation takes a long time and the algorithm becomes inefficient. By iterating over and over, we can make J(θ₀, θ₁) lower and lower; the change curve looks as follows:

One thing we need to know about gradient descent is that it only finds a local minimum, which is not necessarily the global minimum, as the following example shows:

Here we pick two different starting points, and we end up with two different sets of parameters. This is a property of gradient descent, and you have to be careful when you use it.
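The whole training loop can be sketched as follows. The (age, income) data, the starting point, the learning rate and the iteration count are all made-up values chosen for illustration:

```python
# Gradient descent for the model h(x) = θ0 + θ1·x.
# Data, learning rate and iteration count are invented for illustration.
ages = [22, 25, 30, 35, 40, 45]
incomes = [2000, 2500, 3500, 4200, 5000, 5600]
m = len(ages)

theta0, theta1 = 0.0, 0.0   # starting point
alpha = 0.001               # learning rate

for _ in range(150000):
    # Partial derivatives of J(θ0, θ1) = (1/2m) · Σ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
    grad0 = sum(theta0 + theta1 * x - y for x, y in zip(ages, incomes)) / m
    grad1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(ages, incomes)) / m
    # Update both parameters simultaneously
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)  # parameters that roughly fit the data's linear trend
```

Note that with a larger α this loop would diverge instead of converging, and with a much smaller α it would need far more iterations; that is the trade-off described above.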

We can use gradient descent to get more reasonable values of θ₀ and θ₁. With these two parameters, we get our model:

h(x) = θ₀ + θ₁x
Now, given a person’s age, we can use the trained model to predict that person’s income. This is the simplest model, the linear model (linear regression). Depending on the distribution of the data, we can also choose other models, such as quadratic, cubic or square-root models, and the basic training process is similar to that of the linear model.
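The prediction step itself is a one-liner. The parameter values below are hypothetical, standing in for whatever a training run would produce:

```python
# Predicting income from age with hypothetical trained parameters.
theta0, theta1 = -1400.1, 158.4   # example values, not real training output

def predict(age):
    """h(x) = θ0 + θ1·x"""
    return theta0 + theta1 * age

print(predict(28))  # predicted monthly income for a 28-year-old
```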


Conclusion

These two concepts are not difficult, but sometimes we get bogged down in the details of calculation and derivation. Understanding the concepts is much more important: only when you understand the important concepts do the calculations and derivations, such as taking derivatives, become clear and make sense. When learning these things, don’t just memorize formulas and plug them in mechanically; that makes it hard to gain a comprehensive understanding of an algorithm. You need to think repeatedly about the meaning of each variable in a formula and the meaning of each derivation step; by questioning what you find and summarizing, you can achieve mastery.

For calculations, we can use matrices as an aid. Tools such as Octave/Matlab are very good at matrix computation, which is much more efficient than writing naive for loops, because Octave/Matlab applies a lot of low-level optimization to matrix operations. Using matrices requires some knowledge of linear algebra, but in the end it is just a tool; the important thing is to understand the algorithm itself.
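In Python, NumPy plays a role similar to Octave/Matlab’s matrix facilities. A vectorized sketch of the error function, again with made-up data, might look like this:

```python
import numpy as np

# Made-up (age, income) data, stored as vectors.
ages = np.array([22, 25, 30, 35, 40, 45], dtype=float)
incomes = np.array([2000, 2500, 3500, 4200, 5000, 5600], dtype=float)

def cost(theta0, theta1):
    """J(θ0, θ1) computed over the whole dataset without an explicit loop."""
    errors = theta0 + theta1 * ages - incomes   # element-wise over all samples
    return (errors ** 2).sum() / (2 * len(ages))

print(cost(0.0, 0.0))  # error with both parameters at zero
```

The arithmetic on whole arrays replaces the per-sample for loop, which is exactly the kind of low-level optimization the text refers to.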