Function model

The linear regression function is something you probably learned back in high school, so we will not go into it in great detail in this article. Start with a scatter plot of the sample data like the following:

How do we fit a function to this data?

We can see at a glance that a straight line through the data fits it very well, with the points distributed evenly on both sides of the line. So we can write our linear function directly:

$$h(x) = \theta_0 + \theta_1 x$$
But usually we also introduce a constant feature $x_0 = 1$, so the intercept is treated like any other parameter. When the formula is expanded to $n$ features, it becomes:

$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
In machine learning we usually represent data with linear algebra, so we write it in vector form:

$$h_\theta(x) = \theta^{T} x$$
In other words, if we stack all of the samples into a design matrix $X$, the expression becomes:

$$h_\theta(X) = X\theta$$
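To make the matrix form concrete, here is a minimal NumPy sketch; the data and the parameter values are made up purely for illustration:

```python
import numpy as np

# Made-up sample data: four samples with one feature each.
x = np.array([1.0, 2.0, 3.0, 4.0])

# Prepend the constant feature x0 = 1 so the intercept theta_0
# is handled like every other parameter.
X = np.column_stack([np.ones_like(x), x])   # shape (4, 2)

# Illustrative parameters: intercept 0.5, slope 2.0.
theta = np.array([0.5, 2.0])                # shape (2,)

# The matrix form of the hypothesis: h_theta(X) = X @ theta.
predictions = X @ theta                     # shape (4,)
print(predictions)                          # [2.5 4.5 6.5 8.5]
```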
This is only the simplest kind of linear function. The linear functions widely used in neural networks are nested layer by layer, for example two layers composed as:

$$h(x) = W_2 \left( W_1 x + b_1 \right) + b_2$$
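As a rough sketch of what nesting layer by layer looks like in code (the weights here are arbitrary and only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=3)                      # a single 3-dimensional input

# Two linear layers applied one after the other.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

h = W2 @ (W1 @ x + b1) + b2                 # nested linear maps

# Without a nonlinearity in between, the composition is still a single
# linear map: W2 @ W1 acts as one combined weight matrix.
h_equivalent = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(h, h_equivalent))         # True
```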
For now we will not discuss nonlinear function models. Nonlinear models often require regularization and further tuning, which I will introduce in detail later.

Cost function

The cost function is also called the loss function, and we can try to derive it here. The cost function is used to measure the error between the predicted values and the actual values. It would be tempting to just sum the raw differences between them:

$$J(\theta) = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
But there is a problem with this function: the sign of each term $h_\theta(x^{(i)}) - y^{(i)}$ is uncertain, so positive and negative errors cancel each other, and even when the individual errors are large the sum could still be zero. We need to improve this function.

So what about adding an absolute value?

$$J(\theta) = \sum_{i=1}^{m} \left| h_\theta(x^{(i)}) - y^{(i)} \right|$$
Now the cost function clearly behaves better, but there is still a problem: the absolute value is not differentiable at zero. So we ask: what keeps the useful property that errors cannot cancel, while staying differentiable everywhere? Obviously, a quadratic is a good idea:

$$J(\theta) = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
At this point the design is essentially done, but there is one small problem: the sum can become very large as the number of samples grows, so we average it over the $m$ samples; we also usually multiply by a factor of $\frac{1}{2}$ just to make the derivative cleaner (of course, nothing goes wrong if you leave it out). The final expression for the cost function is:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
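To see why the factor of $\frac{1}{2}$ makes differentiation cleaner, differentiate a single squared term with respect to $\theta_j$: the $2$ coming down from the exponent cancels the $\frac{1}{2}$,

$$\frac{\partial}{\partial \theta_j} \, \frac{1}{2} \left( h_\theta(x) - y \right)^2 = \left( h_\theta(x) - y \right) \frac{\partial h_\theta(x)}{\partial \theta_j} = \left( h_\theta(x) - y \right) x_j$$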
I believe many readers with a stronger mathematical background have already noticed something: isn't this just a variant of the variance we use all the time? Yes, this function is called the mean squared error, and you can think of it as variance in a broad sense. Recall that the variance subtracts the mean of the data from each point; it is similar here, because the fitted function is already playing the role of the mean.
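Here is a minimal NumPy sketch of this cost function; the variable names and the toy data are mine, chosen only for illustration:

```python
import numpy as np

def mse_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

# Made-up data close to the line y = 0.5 + 2x (the last point is off the line).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.5, 4.5, 7.0])

print(mse_cost(np.array([0.5, 2.0]), X, y))  # small cost: only the last point misses
print(mse_cost(np.array([0.0, 0.0]), X, y))  # large cost: every prediction is zero
```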

Maximum likelihood estimation

It has a strange name, and it is hard to tell at first what it is for. In the original Chinese, the word for "likelihood" is a rather classical, literary way of saying "possibility", so maximum likelihood estimation simply means estimating the parameters with the greatest possibility.

The condition required for maximum likelihood estimation is easy to state: the i.i.d. condition, meaning the data are independent and identically distributed. Now suppose we are given a probability function $p(y \mid x; \theta)$. How do we evaluate a choice of parameters? We know that for a single observed data point, the best parameters are the ones that make its probability as large as possible. For the whole data set, then, we should take the product of the probability function over all of the samples:

$$L(\theta) = \prod_{i=1}^{m} p\left( y^{(i)} \mid x^{(i)} ; \theta \right)$$
This $L(\theta)$ is our likelihood function, and the value of $\theta$ that maximizes it is exactly the maximum likelihood estimate we are talking about.

Most of the time we do not like working with a product, so we look for a way to turn it into a sum, which is much easier to handle. The properties of the logarithm come to mind immediately. Taking the logarithm of $L(\theta)$ gives:

$$\log L(\theta) = \sum_{i=1}^{m} \log p\left( y^{(i)} \mid x^{(i)} ; \theta \right)$$
And again, to keep the value from growing too large we take an average, but this time the average carries a negative sign. Why? Look at the graph of the logarithm: probabilities lie in $[0, 1]$, so their logarithms are always non-positive, and we negate them to get a non-negative quantity to minimize:

$$-\frac{1}{m} \sum_{i=1}^{m} \log p\left( y^{(i)} \mid x^{(i)} ; \theta \right)$$
The logarithm also has a weakness: it is powerless when it meets zero. In some classification problems, if the true label is 0, the probability estimated by maximum likelihood should also be 0 or close to 0, and taking its logarithm makes the value blow up. Here we introduce a concept called entropy, which we will cover in detail in the next section on information theory.
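As a small sketch of the averaged negative log-likelihood above (the probabilities are made-up values, and clipping with a tiny epsilon is just one common way to keep the logarithm away from zero, not part of the original derivation):

```python
import numpy as np

def average_negative_log_likelihood(probs, eps=1e-12):
    """-(1/m) * sum(log p_i) over the per-sample probabilities p_i,
    clipped away from zero so the logarithm never blows up."""
    probs = np.clip(probs, eps, 1.0)
    return -np.mean(np.log(probs))

# Made-up probabilities that a model assigns to the observed data.
good_fit = np.array([0.90, 0.80, 0.95, 0.85])   # high probabilities -> small loss
poor_fit = np.array([0.20, 0.10, 0.05, 0.30])   # low probabilities  -> large loss
with_zero = np.array([0.90, 0.00, 0.80, 0.85])  # an exact zero would give -inf without the clip

for probs in (good_fit, poor_fit, with_zero):
    print(average_negative_log_likelihood(probs))
```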

My Juejin (Nuggets): WarrenRyan

My Jianshu: WarrenRyan

Welcome to my blog for the latest updates: blog.tity.online

My lot: StevenEco