The loss function measures the degree of inconsistency between the model's predicted value f(x) and the true value Y. It is a non-negative real-valued function, usually written L(Y, f(x)). The smaller the loss, the better the model's robustness. The loss function is the core of the empirical risk function and also an important part of the structural risk function. The structural risk of a model consists of an empirical risk term and a regularization term, and can be written as follows:
$$\theta^* = \arg \min_\theta \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i; \theta)) + \lambda\, \Phi(\theta)$$
Here the first term (the mean over the samples) is the empirical risk, L is the loss function, and the second term $\Phi$ is a regularizer or penalty term, which can be L1, L2, or another regularization function. The formula as a whole means finding the value of $\theta$ that minimizes the objective. Several common loss functions are listed below.
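Before going through them, here is a minimal NumPy sketch of the objective above, just for concreteness: it assumes a linear model $f(x;\theta) = \theta^{T}x$, a squared per-sample loss, and the squared L2 norm as $\Phi(\theta)$; the function and variable names are illustrative only.

```python
import numpy as np

def regularized_risk(theta, X, y, loss, lam):
    """Empirical risk (mean per-sample loss) plus a regularization term.

    `loss` is any per-sample loss L(y_i, f(x_i; theta)); here Phi(theta)
    is taken to be the squared L2 norm of the parameters.
    """
    predictions = X @ theta                  # f(x_i; theta) for a linear model
    empirical_risk = np.mean(loss(y, predictions))
    penalty = np.sum(theta ** 2)             # Phi(theta): L2 regularizer
    return empirical_risk + lam * penalty

# Toy usage with a squared loss and random data
squared = lambda y, f: (y - f) ** 2
X = np.random.randn(100, 3)
y = np.random.randn(100)
print(regularized_risk(np.zeros(3), X, y, squared, lam=0.1))
```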
1. Log Loss Function (Logistic Regression)
Some people think the loss function of logistic regression is the square loss, but it is not. The square loss can be derived from linear regression under the assumption that the samples follow a Gaussian distribution; logistic regression does not lead to a square loss. In the derivation of logistic regression, the samples are assumed to follow a Bernoulli (0-1) distribution; the likelihood function of that distribution is then written down, its logarithm is taken, and the extremum is found. Logistic regression does not directly seek the extremum of the likelihood function; rather, it takes maximization as the guiding idea and from it derives its empirical risk function: minimize the negative likelihood (i.e. max F(y, f(x)) → min −F(y, f(x))). Viewed as a loss function, this is exactly the log loss function.
The standard form of the log loss function is: $$L(Y, P(Y|X)) = -\log P(Y|X)$$ As just mentioned, taking the logarithm is for the convenience of computing the maximum likelihood estimate: in MLE, direct differentiation is difficult, so one usually takes the logarithm first and then differentiates to find the extremum. The loss function L(Y, P(Y|X)) expresses that, for a sample X with label Y, the probability P(Y|X) should be made as large as possible (in other words, using the known sample distribution, find the parameter values that most likely, i.e. with maximum probability, produced this distribution; or: which parameters would give us the best chance of seeing the current data). Since the log function is monotonically increasing, log P(Y|X) also reaches its maximum there, and with the minus sign in front, maximizing P(Y|X) is equivalent to minimizing L.
In logistic regression, $P(Y=y|x)$ is expressed as follows (to keep the class labels $y$ as 1 and 0, the two cases are written separately):
$$P(Y=1|x) = h_\theta(x) = \frac{1}{1+e^{-\theta^{T}x}}, \qquad P(Y=0|x) = 1 - h_\theta(x)$$
Substituting these into the equation above, the logistic loss for a single sample can be derived:
$$L(y, P(Y=y|x)) = -\log\left[h_\theta(x)^{y}\,(1-h_\theta(x))^{1-y}\right] = -y\log h_\theta(x) - (1-y)\log\left(1-h_\theta(x)\right)$$
The final objective formula obtained by logistic regression is as follows:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h_{\theta}(x^{(i)}) + (1-y^{(i)}) \log(1-h_{\theta}(x^{(i)})) \right]$$
The above is for binary classification. One point worth explaining: the reason some people think logistic regression uses a square loss is that when gradient descent is used to find the optimal solution, its update formula looks very similar to the one obtained by differentiating the square loss, which creates an intuitive illusion.
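As a quick illustration of $J(\theta)$ above, here is a small NumPy sketch that takes the predicted probabilities $h_\theta(x^{(i)})$ as given; the helper name and the clipping constant are my own additions.

```python
import numpy as np

def logistic_loss(y, h):
    """Average negative log-likelihood J(theta) for labels y in {0, 1}
    and predicted probabilities h = h_theta(x)."""
    eps = 1e-12                      # clip to avoid log(0)
    h = np.clip(h, eps, 1 - eps)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy example: confident, mostly correct predictions give a small loss
y = np.array([1, 0, 1, 1])
h = np.array([0.9, 0.2, 0.6, 0.99])
print(logistic_loss(y, h))
```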
Here is a PDF for reference: Lecture 6: Logistic Regression.pdf.
2. Square Loss Function (Least Squares)
OLS turns the fitting problem into a convex optimization problem. Linear regression assumes that both the samples and the noise follow a Gaussian distribution (there is a small knowledge point hidden here, the central limit theorem), and the least squares formula can then be derived by maximum likelihood estimation (MLE). The basic principle of least squares is that the optimal fitted line should be the one that minimizes the sum of squared distances from each point to the regression line. In other words, OLS is based on distance, and this distance is the Euclidean distance we use most often. The main reasons for choosing the Euclidean distance as the error measure (i.e. mean squared error, MSE) are:
- It is simple and convenient to compute;
- Euclidean distance is a good measure of similarity;
- The properties remain unchanged after transformations between different representation domains.
The standard form of Square loss is as follows:
$$ L(Y, f(X)) = (Y - f(X))^2 $$
When there are N samples, the loss function becomes:
$$L(Y, f(X)) = \sum_{i=1}^{N} (Y_i - f(X_i))^2$$
Here $Y - f(X)$ is the residual, and the whole expression is the residual sum of squares; our goal is to minimize this objective, i.e. to minimize the RSS.
In practical applications, the mean squared error (MSE) is usually used as the measurement index, with the formula:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (\tilde{Y_i} - Y_i)^2$$
Incidentally, the word "linear" here can mean two things: one is that the dependent variable y is a linear function of the input x, and the other is that y is a linear function of the parameters $\alpha$. In machine learning, it usually refers to the latter.
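For illustration, here is a short NumPy sketch that fits a line by ordinary least squares on synthetic data with Gaussian noise (as assumed above) and reports the MSE; the data and names are made up for the example.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, the measurement index mentioned above."""
    return np.mean((y_true - y_pred) ** 2)

# Synthetic data: y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)

# Ordinary least squares via the design matrix [x, 1];
# lstsq minimizes the residual sum of squares
A = np.column_stack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("fitted slope and intercept:", coef)
print("MSE:", mse(y, A @ coef))
```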
3. Exponential Loss Function (AdaBoost)
If you have learned AdaBoost, you know it is a special case of the forward stagewise additive modeling algorithm: it is an additive model whose loss function is the exponential function. After the m-th iteration the model $f_{m}(x)$ is:
$$f_{m}(x) = f_{m-1}(x) + \alpha_{m} G_{m}(x)$$
The goal of each AdaBoost iteration is to find the parameters $\alpha$ and $G$ that minimize:
$$\arg\min_{\alpha, G} \sum_{i=1}^{N} \exp\left[-y_{i}\left(f_{m-1}(x_{i}) + \alpha G(x_{i})\right)\right]$$
The standard form of the exponential loss function (exp-loss) is:
$$L(y, f(x)) = \exp[-y f(x)]$$
It can be seen that the objective of AdaBoost is exactly the exponential loss. With $n$ samples, the loss function of AdaBoost is:
$$L(y, f(x)) = \frac{1}{n}\sum_{i=1}^{n} \exp[-y_{i} f(x_{i})]$$
For the derivation of AdaBoost, refer to the Wikipedia article on AdaBoost or Statistical Learning Methods, p. 145.
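A minimal sketch of the exponential loss over $n$ samples, assuming labels $y_i \in \{-1, +1\}$ and real-valued scores $f(x_i)$; the function name is illustrative.

```python
import numpy as np

def exponential_loss(y, f):
    """Mean exponential loss exp(-y * f(x)) with labels y in {-1, +1}."""
    return np.mean(np.exp(-y * f))

y = np.array([1, -1, 1, -1])
f = np.array([2.0, -1.5, 0.3, 0.8])   # the last score has the wrong sign
print(exponential_loss(y, f))          # the misclassified point dominates
```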
4. Hinge Loss Function (SVM)
The hinge loss function is closely tied to the SVM. In a linear support vector machine, the optimization problem can be written equivalently as:
$$\min_{w,b}\ \sum_{i=1}^{N} \left[1 - y_{i}\left(w \cdot x_{i} + b\right)\right]_{+} + \lambda \lVert w \rVert^{2}$$
Let us transform this formula by setting:
$$\left[1 - y_{i}\left(w \cdot x_{i} + b\right)\right]_{+} = \xi_{i}$$
Then the original formula becomes:
$$\min_{w,b}\ \sum_{i=1}^{N} \xi_{i} + \lambda \lVert w \rVert^{2}$$
If we take $\lambda=\frac{1}{2C}$, this can be rewritten as:
$$\min_{w,b}\ \frac{1}{C}\left(\frac{1}{2}\lVert w \rVert^{2} + C\sum_{i=1}^{N}\xi_{i}\right)$$
which is the familiar soft-margin SVM objective up to the constant factor $\frac{1}{C}$.
As can be seen, the SVM objective is very similar to the following general regularized form:
$$\frac{1}{m}\sum_{i=1}^{m} l\left(w \cdot x_{i} + b,\ y_{i}\right) + \lVert w \rVert^{2}$$
Here $l$ in the first term is the hinge loss function, and the second term is equivalent to an L2 regularizer.
The standard form of the hinge loss function is:
$$L(y) = \max(0,\ 1 - y\tilde{y}), \quad y = \pm 1$$
It can be seen that when $y\tilde{y} \ge 1$ (the sample is correctly classified with margin at least 1), $L(y) = 0$.
For more information, see Hinge-Loss.
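A minimal sketch of the hinge loss in the standard form above, assuming labels $y = \pm 1$ and raw prediction scores $\tilde{y}$; names are illustrative.

```python
import numpy as np

def hinge_loss(y, y_tilde):
    """Hinge loss max(0, 1 - y * y_tilde) for labels y = +/-1 and raw scores."""
    return np.maximum(0, 1 - y * y_tilde)

y = np.array([1, 1, -1, -1])
y_tilde = np.array([2.0, 0.5, -0.3, 1.2])
print(hinge_loss(y, y_tilde))   # zero only where the margin y * y_tilde >= 1
```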
There are 4 kernel functions to choose from in libSVM; the corresponding values of the -t parameter are listed below (a small usage sketch follows the list):
- 0: linear kernel;
- 1: polynomial kernel;
- 2: RBF kernel;
- 3: sigmoid kernel.
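As a rough illustration of choosing among these kernels, here is a hedged sketch that uses scikit-learn's SVC (which wraps libSVM) rather than the libSVM command line; it assumes scikit-learn is installed and uses a synthetic dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Compare the four kernel types listed above on toy data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, scores.mean())
```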
5. Other Loss Functions
In addition to the above loss functions, commonly used are:
- 0-1 loss function: $$L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}$$
- Absolute loss function: $$L(Y, f(X)) = |Y - f(X)|$$
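For completeness, a tiny NumPy sketch of these two losses; the function names are my own.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0-1 loss: 1 for each misclassified sample, averaged over the samples."""
    return np.mean(y_true != y_pred)

def absolute_loss(y_true, y_pred):
    """Absolute loss |Y - f(X)|, averaged over the samples."""
    return np.mean(np.abs(y_true - y_pred))

print(zero_one_loss(np.array([1, 0, 1]), np.array([1, 1, 1])))    # 1/3 misclassified
print(absolute_loss(np.array([1.0, 2.0]), np.array([1.5, 1.0])))  # (0.5 + 1.0) / 2
```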
Finally, it helps to look at visualizations of these loss functions: pay attention to the abscissa and ordinate and to what each curve represents, and go over the plot a couple of times to digest it.
OK, I'll stop here for now and take a break. Finally, something to keep in mind: the more parameters a model has, the more complex it is, and the more complex the model, the easier it is to overfit. Overfitting means the model performs much better on the training data than on the test set. In that case regularization can be considered: by setting the hyperparameter in front of the regularization term, the loss function and the regularization term can be balanced, reducing the scale of the parameters and simplifying the model so that it generalizes better.
6. Sample analysis
The 0/1 loss function is the ideal loss function: it is 1 if a sample is misclassified (any error counts) and 0 if everything is correct. However, this function is non-convex and discontinuous, which makes it hard to use in practice, so several surrogate loss functions are used instead; they are continuous, convex functions. The other three loss functions are upper bounds of the 0/1 loss, so if we make any of them smaller, we approximately make the 0/1 loss smaller as well. The hinge loss asks for nothing more once a sample is correctly classified with enough margin, while the other two keep demanding that a correctly classified sample be classified "even more correctly". In other words, if you take a test and score 60, the hinge loss is satisfied with the pass; the other two will not be content and will keep pushing you toward 80 and beyond. Constantly pursuing perfection like this can easily lead to overfitting. In terms of how fast these losses decrease, the exponential loss drops fastest when the classification is poor, while the hinge loss drops fastest once the classification becomes good. As for the squared loss, it produces a large error whether a sample is classified very correctly or very incorrectly, which misleads the model; it is therefore not used for classification but for regression (fitting).
Take a look at the following example:
```python
import numpy as np
import matplotlib.pyplot as plt

# x plays the role of the margin y * f(x)
x = np.linspace(-2, 2, 300)

# Hinge loss: max(0, 1 - x)
hinge_loss_function = []
for i in (1 - x):
    if i > 0:
        hinge_loss_function.append(i)
    else:
        hinge_loss_function.append(0)

# Exponential loss: exp(-x)
exponential_loss_function = np.exp(-x)

# Logistic loss, rescaled so it passes through (0, 1): log2(1 + exp(-x))
logistic_loss_function = np.log(1 + np.exp(-x)) / np.log(2)

# 0-1 loss: 1 when misclassified (x < 0), 0 otherwise
l0_1_loss_function = []
for j in x:
    if j < 0:
        l0_1_loss_function.append(1)
    else:
        l0_1_loss_function.append(0)

# Squared loss: (x - 1)^2
pingfang_loss_function = (x - 1) ** 2

plt.plot(x, hinge_loss_function, 'r-')
plt.plot(x, exponential_loss_function, 'b-')
plt.plot(x, logistic_loss_function, 'g-')
plt.plot(x, l0_1_loss_function, 'k-')
plt.plot(x, pingfang_loss_function, 'c-')
plt.legend(['hinge_loss_function', 'exponential_loss_function',
            'logistic_loss_function', 'l0_1_loss_function',
            'pingfang_loss_function'])
plt.show()
```
Running the script plots the five loss curves for comparison.
References
- github.com/JohnLangfor…
- library_design/losses
- www.cs.cmu.edu/~yandongl/l…
- math.stackexchange.com/questions/7…
- Li Hang, Statistical Learning Methods.