The “loss function” is a crucial part of optimization in machine learning. Most people are familiar with the L1 and L2 loss functions, but do you know the Huber loss, the log-cosh loss, and the quantile loss commonly used to produce forecast intervals? These are the most commonly used regression loss functions in machine learning.

Every machine learning algorithm needs to maximize or minimize a function called the “objective function.” In particular, we usually minimize a class of functions called “loss functions,” which measure how well the model's predictions match the true values.

In practice, the choice of loss function is constrained by many factors: whether the data contains outliers, the choice of machine learning algorithm, the time complexity of gradient descent, how easy the derivatives are to compute, the confidence required of the predictions, and so on. No single loss function works for all types of data. This article introduces the different kinds of loss functions and what each is good for.

Loss functions can be roughly divided into two categories: loss functions for classification problems and loss functions for regression problems. In this article, I will focus on regression losses.

We have saved all the code and diagrams for this article here:

https://nbviewer.jupyter.org/github/groverpr/Machine-Learning/blob/master/notebooks/05_Loss_Functions.ipynb



Comparison of loss functions for classification and regression problems

Mean squared error (MSE)



Mean squared error (MSE) is the most commonly used regression loss function. It is computed as the mean of the squared differences between the predicted values and the true values: MSE = (1/n) * Σ (y_i - ŷ_i)².

Below is a graph of the MSE loss. The true value is 100 and the predicted value ranges from -10,000 to 10,000; the Y-axis shows the MSE, which ranges from 0 to infinity and reaches its minimum at a predicted value of 100.
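A minimal sketch of how such a curve could be reproduced, assuming numpy and matplotlib (the target value of 100 matches the description above; everything else is illustrative):

import numpy as np
import matplotlib.pyplot as plt

target = 100.0
preds = np.linspace(-10000, 10000, 2001)   # range of candidate predictions

mse_curve = (preds - target) ** 2          # squared error for a single target

plt.plot(preds, mse_curve)
plt.xlabel("Predicted value")
plt.ylabel("MSE loss")
plt.title("MSE loss, true value = 100")
plt.show()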



Mean absolute error (also known as L1 loss)


Mean absolute error (MAE) is another loss function used in regression models. MAE is the mean of the absolute differences between the target values and the predicted values: MAE = (1/n) * Σ |y_i - ŷ_i|. It measures only the average magnitude of the prediction errors, not their direction, and ranges from 0 to infinity. (If direction were taken into account, averaging the signed residuals would give the mean bias error, MBE.)



MAE loss (Y-axis) – Predicted Value (X-axis)

Comparison of MSE (L2 loss) with MAE (L1 loss)

In short, MSE is simpler to optimize, but MAE is more robust to outliers. Here is why they differ.

When we train a machine learning model, our goal is to find the point at which the loss function reaches a minimum. Both functions are minimized when the predicted value is equal to the true value.

Here is the Python code for both loss functions. You can either write your own functions or use the built-in functions from sklearn.

import numpy as np

# true: array of true target values
# pred: array of predictions

def mse(true, pred):
    return np.mean((true - pred) ** 2)

def mae(true, pred):
    return np.mean(np.abs(true - pred))

# also available in sklearn
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

Let’s compare MAE and RMSE (the square root of MSE, which is on the same scale as MAE) in two examples. In the first example, the predictions are close to the true values and the errors have small variance. In the second example, one error is very large because of an outlier.



Left: the errors are close to one another. Right: one error is much larger than the others.
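A small numerical sketch of the same effect (the numbers below are made up for illustration, not taken from the original figure):

import numpy as np

true = np.array([10.0, 12.0, 11.0, 9.0, 10.0])
pred_close   = np.array([10.5, 11.5, 11.0, 9.5, 10.0])  # all errors are small
pred_outlier = np.array([10.5, 11.5, 11.0, 9.5, 40.0])  # one error is huge

for pred in (pred_close, pred_outlier):
    rmse = np.sqrt(np.mean((true - pred) ** 2))
    mae = np.mean(np.abs(true - pred))
    print("RMSE:", round(rmse, 2), "MAE:", round(mae, 2))

# with the outlier, RMSE blows up much more than MAE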

What can we learn from these plots, and how should we choose the loss function?

MSE squares the error (let e = true value - predicted value), so whenever |e| > 1, MSE amplifies the error. If the data contains outliers, e will be very large, and e² will be far larger than |e|.

Therefore, a model trained with MSE gives more weight to outliers than a model trained with MAE. In the second example, a model trained with RMSE loss is pushed to reduce the error on the outlier at the expense of the other samples, which degrades its overall performance.

MAE is more useful when the training data is contaminated with outliers (for example, when the training data contains many erroneously high or low target values that do not appear in the test set).

Intuitively, if we had to give a single predicted value for all of the sample points, the value that minimizes MSE is the mean of all the target values, while the value that minimizes MAE is their median. Since the median is more robust to outliers than the mean, MAE is more robust to outliers than MSE.
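A quick numerical check of this claim, as a sketch with arbitrary data: searching over a grid of constant predictions, MSE is minimized at the mean and MAE at the median.

import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])        # one extreme target value
candidates = np.linspace(0.0, 100.0, 10001)      # constant predictions to try

mse_vals = [np.mean((y - c) ** 2) for c in candidates]
mae_vals = [np.mean(np.abs(y - c)) for c in candidates]

print(candidates[np.argmin(mse_vals)], np.mean(y))    # 22.0 and 22.0 (the mean)
print(candidates[np.argmin(mae_vals)], np.median(y))  # 3.0 and 3.0 (the median)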

However, MAE has a serious problem (especially for neural networks): its gradient is the same everywhere, so the gradient stays large even when the loss is already small. This is bad for learning. To work around this defect, we can use a dynamic learning rate that shrinks as the loss approaches its minimum.

MSE behaves well in this situation and converges even with a fixed learning rate: the gradient of the MSE loss grows as the loss grows and shrinks as the loss approaches 0, which makes a model trained with MSE more precise at the end of training.
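A sketch of the gradient behaviour described above, for a single prediction p and true value y (derivatives taken with respect to p; the numbers are illustrative):

import numpy as np

y = 0.0                                          # true value
p = np.array([-5.0, -1.0, -0.1, 0.1, 1.0, 5.0])  # predictions around the minimum

grad_mse = 2 * (p - y)      # derivative of (p - y)^2: shrinks near the minimum
grad_mae = np.sign(p - y)   # derivative of |p - y|: always +/-1, never shrinks

print(grad_mse)  # [-10.  -2.  -0.2   0.2   2.  10. ]
print(grad_mae)  # [-1. -1. -1.  1.  1.  1.]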



Choosing the loss function for different situations

If outliers represent anomalies that matter to the business and need to be detected, the MSE loss function should be used. If, in contrast, outliers are merely corrupted data, the MAE loss function should be used.

It is recommended that you read this article, which compares the performance of regression models using L1 and L2 losses when the data contains outliers.

Article website:

http://rishy.github.io/ml/2015/07/28/l1-vs-l2-loss/

Here L1 and L2 losses are just nicknames for MAE and MSE.

In short, the L1 loss function is more robust to outliers, but its derivative is discontinuous, which makes it less efficient to solve. The L2 loss function is more sensitive to outliers, but it yields a more stable closed-form solution, obtained by setting its derivative to zero.
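As a concrete example of that closed-form solution, in linear regression setting the gradient of the L2 loss to zero gives the normal equations. A minimal numpy sketch with synthetic data:

import numpy as np

# synthetic data: y = 3 + 2x + small noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([np.ones_like(x), x])       # intercept column + feature
y = 3.0 + 2.0 * x + rng.normal(scale=0.1, size=100)

# setting the gradient of the squared loss to zero gives X^T X beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)   # approximately [3.0, 2.0]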

The problem is that, in some cases, neither loss function is adequate. For example, suppose 90% of the samples in the data have a target value of 150, and the remaining 10% have target values between 0 and 30. A model using MAE as its loss function may ignore the 10% as outliers and predict 150 for every sample.

That’s because a model minimizing MAE predicts the median. A model using MSE, in contrast, gives many predictions in the range of 0 to 30, because it is skewed toward the outliers. Neither outcome is desirable in many business scenarios.

What should you do in these cases? The simplest fix is to transform the target variable. Another is to change the loss function, which brings us to the third loss function: the Huber loss.

Huber loss (smoothed mean absolute error)

The Huber loss is less sensitive to outliers than the squared error loss, and it is also differentiable at 0. Essentially, it is an absolute error that becomes a squared error when the error is small; how small an error must be for this to happen is controlled by the hyperparameter δ (delta). Within [-δ, δ] the Huber loss is equivalent to MSE, while on (-∞, -δ] and [δ, +∞) it is equivalent to MAE.
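Written out (this is the standard piecewise definition, consistent with the Python implementation further below):

L_\delta(y, \hat{y}) =
\begin{cases}
\frac{1}{2}(y - \hat{y})^2, & |y - \hat{y}| \le \delta \\
\delta\,|y - \hat{y}| - \frac{1}{2}\delta^2, & |y - \hat{y}| > \delta
\end{cases}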



The choice of the hyperparameter δ is important here because it determines what you treat as an outlier. Residuals larger than δ are minimized with L1 (which is less sensitive to large outliers), while residuals smaller than δ are minimized with L2.

Why use Huber losses?

One of the biggest problems with using MAE to train neural networks is its constant large gradient, which may cause gradient descent to miss the minimum near the end of training. With MSE, the gradient decreases as the loss decreases, making the result more precise.

The Huber loss is very useful in this case: its gradient decreases as the loss approaches the minimum, and it is more robust to outliers than MSE. The Huber loss thus combines the advantages of MSE and MAE. Its drawback is that we may need to keep tuning the hyperparameter δ.

Log-cosh loss

Log-cosh is another loss function used in regression problems, and it is smoother than L2. It is the logarithm of the hyperbolic cosine of the prediction error: L(y, ŷ) = Σ log(cosh(ŷ_i - y_i)).



Graph of log-cosh loss (Y-axis) versus predicted value (X-axis), with the true value at 0.

Advantages: for small x, log(cosh(x)) is approximately x²/2, and for large x it is approximately |x| - log(2). This means log-cosh behaves much like mean squared error while being less susceptible to outliers. It has all the advantages of the Huber loss, but unlike the Huber loss, log-cosh is twice differentiable everywhere.

Why do we need the second derivative? Many machine learning models, such as XGBoost, use Newton’s method to find the optimum, and Newton’s method requires the second derivative (the Hessian). Second-order differentiability of the loss function is therefore essential for frameworks like XGBoost.
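As a sketch of what this looks like in practice, here is a minimal custom log-cosh objective plugged into XGBoost’s Python API (the dataset and hyperparameters below are purely illustrative, not from the original article); the gradient of log(cosh(x)) is tanh(x) and the Hessian is 1 - tanh(x)²:

import numpy as np
import xgboost as xgb

def logcosh_objective(preds, dtrain):
    x = preds - dtrain.get_label()
    grad = np.tanh(x)             # first derivative of log(cosh(x))
    hess = 1.0 - np.tanh(x) ** 2  # second derivative, used by Newton-style boosting
    return grad, hess

# synthetic regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=logcosh_objective)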



But the log-cosh loss is not perfect. For example, when the error is very large, its gradient and Hessian become essentially constant (the Hessian approaches zero), which can leave XGBoost without useful split points.

Python code for Huber and log-cosh loss functions:

# huber loss
def huber(true, pred, delta):
    loss = np.where(np.abs(true - pred) < delta,
                    0.5 * ((true - pred) ** 2),
                    delta * np.abs(true - pred) - 0.5 * (delta ** 2))
    return np.sum(loss)

# log-cosh loss
def logcosh(true, pred):
    loss = np.log(np.cosh(pred - true))
    return np.sum(loss)


Quantile loss

In most real-world forecasting problems, we also want to understand the uncertainty of the prediction. Knowing a range of likely values, rather than only a point estimate, helps decision making in many business problems.

The quantile loss function is useful when we care about interval predictions rather than only point predictions. Prediction intervals from least-squares regression rely on the assumption that the residuals (y - ŷ) are independent and have constant variance.

When this assumption is violated, the linear regression model is no longer adequate. Nor should we assume that simply switching to a nonlinear or tree-based model will do better than the linear baseline. This is where quantile loss and quantile regression come in handy: regression based on the quantile loss gives reasonable prediction intervals even when the residuals have non-constant variance or are not normally distributed.

Let’s look at a practical example to better understand how regression based on the quantile loss works on heteroscedastic data.

Quantile regression and least square regression



The code for quantile regression shown in the figure above is attached:

https://github.com/groverpr/Machine-Learning/blob/master/notebooks/09_Quantile_Regression.ipynb

Understanding the quantile loss function

How to choose the right quantile value depends on how much weight we want to put on positive versus negative errors. Through the quantile value γ (gamma), the loss function penalizes overestimation and underestimation differently. For example, with γ = 0.25 the quantile loss penalizes overestimation more heavily, which pushes the predictions slightly below the median.



Gamma is the required quantile, with a value between 0 and 1. The loss is L_γ(y, ŷ) = Σ over y_i < ŷ_i of (1 - γ)·|y_i - ŷ_i| + Σ over y_i ≥ ŷ_i of γ·|y_i - ŷ_i|.
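A numpy sketch of this loss (the function name and example values are mine, not from the original notebook):

import numpy as np

def quantile_loss(true, pred, gamma):
    e = true - pred
    # underestimates (e > 0) are weighted by gamma, overestimates by (1 - gamma)
    return np.sum(np.maximum(gamma * e, (gamma - 1) * e))

true = np.array([1.0, 2.0, 3.0])
pred = np.array([1.5, 1.5, 1.5])
print(quantile_loss(true, pred, 0.25))  # overestimation penalized more heavily
print(quantile_loss(true, pred, 0.75))  # underestimation penalized more heavily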



This loss function can also be used to compute prediction intervals with neural networks or tree-based models. The following is an example of a gradient boosting tree regression model built with scikit-learn.
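A minimal sketch along those lines, using scikit-learn's GradientBoostingRegressor with loss="quantile" (the data and hyperparameters here are illustrative):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

# upper and lower bounds of a 90% prediction interval
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, y)
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, y)

X_test = np.linspace(0, 10, 50).reshape(-1, 1)
interval = np.column_stack([lower.predict(X_test), upper.predict(X_test)])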



The figure above shows a 90% prediction interval obtained by using the quantile loss in scikit-learn’s gradient boosting regression: the upper bound uses γ = 0.95 and the lower bound γ = 0.05.

A comparative study

To demonstrate all of the properties of the loss functions discussed above, let’s look at a comparative study. First, we build a dataset sampled from the sinc(x) function and add two kinds of artificial noise: a Gaussian noise component ε ~ N(0, σ²) and an impulse noise component ξ ~ Bern(p).

The impulse noise is added to illustrate the robustness of the models. Below are the results of fitting GBM regressors with the different loss functions.
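A sketch of how such a dataset might be generated (σ, p, and the spike scale below are my own choices, not necessarily those used in the study):

import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-10, 10, size=n)
signal = np.sinc(x / np.pi)                    # np.sinc(t) = sin(pi*t)/(pi*t), so this is sin(x)/x

gaussian_noise = rng.normal(0, 0.1, size=n)    # epsilon ~ N(0, sigma^2)
impulse_mask = rng.random(n) < 0.05            # xi ~ Bern(p): a few samples get large spikes
impulse_noise = impulse_mask * rng.normal(0, 3.0, size=n)

y = signal + gaussian_noise + impulse_noise    # noisy target for the GBM regressors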



Continuous loss functions: (A) MSE loss; (B) MAE loss; (C) Huber loss; (D) quantile loss. Fits of a smooth GBM to noisy sinc(x) data: (E) the original sinc(x) function; (F) smooth GBM fitted with MSE and MAE losses; (G) smooth GBM fitted with Huber loss, δ = {4, 2, 1}; (H) smooth GBM fitted with quantile loss, α = {0.5, 0.1, 0.9}.

Some observations of simulation comparison:

  • A model trained with MAE loss is less affected by the impulse noise, while a model trained with MSE loss is slightly biased by it.

  • The prediction results of Huber loss model are not sensitive to the selected hyperparameters.

  • Quantile loss models can give good estimates at appropriate confidence levels.

Finally, let’s put all the loss functions into one graph, which gives the beautiful picture below. The differences are not obvious at a glance.










Big Data Abstracts