
All machine learning algorithms rely on minimizing or maximizing a function, which we call the “objective function.” When the objective is being minimized, we call it the “loss function.” The loss function measures how well the prediction model performs at predicting the desired outcome. The most common way to find the minimum of a function is “gradient descent.” If the loss function is an undulating mountain range, then gradient descent is like stumbling blindly downhill, trying to reach the lowest point.

There is not just one loss function. Depending on several factors, including the presence of outliers, the machine learning algorithm chosen, the time efficiency of gradient descent, the confidence we need in the predictions, and the ease of computing derivatives, we can choose different loss functions. This article will take you through the different loss functions and how they can help us in data science and machine learning.

Loss functions can be roughly divided into two types: classification loss and regression loss.

In this article, we will focus on regression loss functions and continue to share other loss function types in future articles.

All code addresses are at the bottom of this article.

Regression loss

  • Mean squared error (MSE), quadratic loss, L2 loss

Mean squared error is the most commonly used regression loss function. It is the average of the squared differences between the target values and the predicted values.
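
For reference, for n samples with true values y_i and predictions ŷ_i, the standard definition is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$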

Graph of the mean squared error function

  • Mean absolute error (MAE), L1 loss

Mean absolute error (MAE) is another loss function used in regression models. MAE is the average of the absolute differences between the target values and the predicted values. It therefore measures the average magnitude of the errors in a set of predictions, regardless of their direction. (If we did take direction into account, we would be measuring the mean bias error, MBE, the average of the signed errors.) MAE ranges from 0 to infinity.
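
In the same notation as above:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$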

MSE vs MAE (L2 loss vs L1 loss)

In short, squared error is easier to optimize, but absolute error is more robust to outliers. Let’s see why.

Whenever we train machine learning models, our goal is to find a point where the loss function is minimized. Of course, both functions reach a minimum when the predicted value is exactly equal to the true value.

Let’s take a quick look at the Python code for these two functions. We can either write our own function or use sklearn’s built-in metric function:

# true: array of true target values
# pred: array of predicted values
import numpy as np

def mse(true, pred):
    return np.mean((true - pred) ** 2)

def mae(true, pred):
    return np.mean(np.abs(true - pred))

# The same metrics are available in sklearn
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

Let’s compare MAE and root mean squared error (RMSE, the square root of MSE, which is on the same scale as MAE) in two cases. In the first case, the predicted values are close to the true values and the errors vary little across observations. In the second case, there is one anomalous observation and its error is very large.
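
The two cases can be reproduced with a small, made-up example (the numbers below are purely illustrative and not from any real dataset):

import numpy as np

true = np.zeros(10)
pred_clean = np.full(10, 0.5)        # every prediction is off by 0.5
pred_outlier = pred_clean.copy()
pred_outlier[0] = 5.0                # one anomalous prediction

for name, pred in [("clean", pred_clean), ("with outlier", pred_outlier)]:
    mae = np.mean(np.abs(true - pred))
    rmse = np.sqrt(np.mean((true - pred) ** 2))
    print(f"{name:13s} MAE = {mae:.2f}  RMSE = {rmse:.2f}")

The single outlier raises MAE modestly (0.50 to 0.95) but roughly triples RMSE (0.50 to about 1.65), because the squared term dominates.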

What can we observe from this? How does this help us choose which loss function to use?

Because MSE squares the error (e = y − y_predicted), the loss grows much faster than the error once |e| > 1. If there are outliers in our data, e will be very large and e² >> |e|. As a result, a model trained with MSE loss assigns far more weight to outliers than a model trained with MAE loss. In the second case above, the model trained with RMSE/MSE sacrifices the many ordinary cases in order to reduce the error on that single outlier, which degrades the model’s overall performance.

MAE can be useful if the training data is corrupted by outliers (that is, we mistakenly receive large, unrealistic positive/negative values in the training environment, but not in the test environment).

We can think of it this way: if we had to give a single constant prediction for every observation, the value that minimizes MSE would be the mean of all the target values, while the value that minimizes MAE would be the median of all the target values. We know that the median is more robust to outliers than the mean, which makes MAE more robust to outliers than MSE.
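
A quick brute-force check on illustrative data with one outlier confirms this: the constant that minimizes MSE lands on the mean, and the constant that minimizes MAE lands on the median.

import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one outlier
candidates = np.linspace(0, 100, 10001)     # candidate constant predictions

mse_best = candidates[np.argmin([np.mean((y - c) ** 2) for c in candidates])]
mae_best = candidates[np.argmin([np.mean(np.abs(y - c)) for c in candidates])]

print(round(mse_best, 2), np.mean(y))    # 22.0 and 22.0 (the mean)
print(round(mae_best, 2), np.median(y))  # 3.0 and 3.0 (the median)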

A big problem with MAE loss, especially for neural networks, is that its gradient has the same magnitude everywhere, so the gradient stays large even when the loss is small. That is bad for learning: the updates can overshoot the minimum at the end of training. To correct this, we can use a decaying learning rate that shrinks as we approach the minimum. MSE behaves much better here and converges even with a fixed learning rate, because the gradient of the MSE loss is large for large loss values and decreases as the loss approaches 0, making training more precise near the end.
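
To make this concrete, compare the per-sample gradients of the two losses with respect to the prediction ŷ:

$$\frac{\partial}{\partial \hat{y}}\,(y - \hat{y})^2 = -2\,(y - \hat{y}), \qquad \frac{\partial}{\partial \hat{y}}\,\lvert y - \hat{y}\rvert = -\operatorname{sign}(y - \hat{y})$$

The squared-error gradient shrinks along with the error, while the absolute-error gradient keeps magnitude 1 no matter how small the error is (and is undefined exactly at zero).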

Decide which loss function to use

If the anomalies represented by outliers are important to the business and should be detected, then we should use MSE. On the other hand, if we think that outliers simply represent corrupt data, then MAE should be chosen as the loss function.

If you want to compare the performance of regression models using L1 and L2 loss functions with or without outliers, you are advised to read this excellent study. Remember that L1 and L2 losses are aliases for MAE and MSE, respectively.

L1 loss is more robust to outliers, but its derivative is discontinuous at zero, which can make it less efficient to optimize. L2 loss is sensitive to outliers, but it yields a more stable, closed-form solution (obtained by setting the derivative to zero).

The problem with both is that there may be situations in which neither loss function gives an ideal prediction. For example, suppose 90% of the observations in our data have a true target value of 150, while the remaining 10% have target values between 0 and 30. A model trained with MAE loss might then predict 150 for every observation and effectively ignore the 10% of outlying cases, because it moves toward the median. In the same situation, a model trained with MSE loss would produce many predictions in the range of 0 to 30, because it is pulled toward the outliers. In many business situations, neither outcome is ideal.
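
Reusing the constant-prediction view from earlier on made-up data matching this example (90 targets equal to 150 and 10 targets drawn from [0, 30]):

import numpy as np

rng = np.random.default_rng(0)
y = np.concatenate([np.full(90, 150.0), rng.uniform(0, 30, size=10)])

print(np.median(y))  # 150.0 -> roughly what an MAE-optimal constant prediction gives
print(np.mean(y))    # ~136  -> pulled toward the low outliers, as with MSE

The MAE-optimal constant ignores the low 10% entirely, while the MSE-optimal constant sits between the two groups and describes neither well.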

So what to do in this case? An easy fix is to transform the target variable. Another way is to try different loss functions. This brings us to the next part: the Huber loss function.

  • Huber loss (smooth mean absolute error) is less sensitive to outliers in the data than squared-error loss. It is also differentiable at zero. It is essentially an absolute-error loss that becomes quadratic when the error is small. How small the error has to be before the loss becomes quadratic is controlled by a hyperparameter, δ, which can be tuned. As δ → 0, Huber loss approaches MAE; as δ → ∞ (some large value), it approaches MSE.

The choice of δ is critical because it determines what you are willing to treat as an outlier. Residuals larger than δ are minimized with L1 (which is less sensitive to large outliers), while residuals smaller than δ are minimized “appropriately” with L2.
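
Written out, the Huber loss is:

$$L_{\delta}(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } \lvert y - \hat{y} \rvert \le \delta \\ \delta\,\lvert y - \hat{y} \rvert - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$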

Why Huber loss function?

A big problem with using MAE to train neural networks is its constantly large gradient, which can cause gradient descent to miss the minimum at the end of training. With MSE, the gradient decreases gradually as the loss approaches its minimum, making the final result more precise.

In these cases, the Huber loss can be really helpful, because its gradient shrinks around the minimum. It is also more robust to outliers than MSE. It therefore combines the advantages of MSE and MAE. However, the Huber loss has one drawback: we may need to tune the hyperparameter δ, which is an iterative process.

  • Log-cosh loss function

Log-cosh is another loss function applied to regression tasks, which is smoother than L2 loss. Log-cosh is the logarithm of the hyperbolic cosine of the prediction error.
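
In symbols, over n predictions:

$$L(y, \hat{y}) = \sum_{i=1}^{n} \log\!\left(\cosh\left(\hat{y}_i - y_i\right)\right)$$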

Advantages:

For small values of x, log(cosh(x)) is approximately x²/2; for large values of x, it is approximately abs(x) − log(2). This means that log-cosh works largely like mean squared error, but is not strongly affected by the occasional wildly wrong prediction. It has all the advantages of Huber loss, and unlike Huber loss it is twice differentiable everywhere.

Why do we need the second derivative? Many machine learning frameworks, such as XGBoost, use Newton’s method to find the optimum, and Newton’s method requires the second derivative (the Hessian). For such frameworks, twice-differentiable loss functions are an advantage.
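
As a concrete illustration, here is a rough sketch of how log-cosh could be supplied to XGBoost as a custom objective through its gradient and Hessian; the function name, data, and parameters here are made up for demonstration and are not from the original article:

import numpy as np
import xgboost as xgb

def logcosh_objective(preds, dtrain):
    # Per-sample gradient and Hessian of log(cosh(pred - true))
    labels = dtrain.get_label()
    e = preds - labels
    grad = np.tanh(e)               # first derivative
    hess = 1.0 - np.tanh(e) ** 2    # second derivative: 1 / cosh(e)^2
    return grad, hess

# Illustrative usage on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] + 0.1 * rng.normal(size=100)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 2}, dtrain, num_boost_round=20, obj=logcosh_objective)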

Huber loss function and log-cosh loss function

# Huber loss
def huber(true, pred, delta):
    loss = np.where(np.abs(true - pred) < delta,
                    0.5 * ((true - pred) ** 2),
                    delta * np.abs(true - pred) - 0.5 * (delta ** 2))
    return np.sum(loss)

# Log-cosh loss
def logcosh(true, pred):
    loss = np.log(np.cosh(pred - true))
    return np.sum(loss)
  • Quantile loss function

In most real-world prediction problems, we often want to know the uncertainty of our predictions. For many business problems, knowing a range of likely values, rather than only a point estimate, can significantly improve the decision process. The prediction interval from least-squares regression rests on the assumption that the residuals (y − y_hat) have constant variance across all values of the independent variables.

The quantile loss function is useful when we want to predict an interval rather than a point. Linear regression models that violate the constant-variance assumption cannot be trusted, but we also cannot simply discard the idea of fitting a linear model as a baseline by assuming that such situations are always better handled by nonlinear functions or tree-based models. This is where quantile loss and quantile regression come to the rescue: regression based on quantile loss provides sensible prediction intervals even when the residuals have very unequal variances or a non-normal distribution.

Let’s look at some cases to better understand why Quantile loss-based regression works well for heteroscedasticity problems.

  • Quantile regression VS ordinary least squares regression

The code for the quantile regression shown in the image above is available on GitHub.

Understand the Quantile loss function

A quantile-based regression model aims to predict the conditional quantiles of the response variable given specific values of the predictor variables. Quantile loss is actually an extension of MAE (when the quantile is the 50th percentile, it reduces to MAE).

The idea is to choose the quantile according to whether we want to give more weight to positive errors or to negative errors. The loss function assigns different penalties to overestimation and underestimation depending on the value of the chosen quantile (γ). For example, a quantile loss with γ = 0.25 penalizes overestimation more, which keeps the predictions slightly below the median.
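
Following the style of the earlier NumPy functions, a minimal sketch of the quantile (pinball) loss might look like this (the function name is my own, not from the article):

import numpy as np

def quantile_loss(true, pred, gamma):
    # Underestimates (true > pred) are weighted by gamma,
    # overestimates (true < pred) by (1 - gamma)
    e = true - pred
    return np.sum(np.maximum(gamma * e, (gamma - 1) * e))

With gamma = 0.25, overestimates are penalized with weight 0.75 versus 0.25 for underestimates, which is what pushes the fitted values downward.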

A comparative study

The article “Gradient Boosting Machines, a Tutorial” compares these loss functions well. To demonstrate the properties of all the loss functions above, it simulates data from a sinc(x) function with two artificial noise components: a Gaussian noise component ε ~ N(0, σ²) and an impulse noise component ξ ~ Bern(p). The impulse noise term is added to illustrate robustness. The figure below shows the results of fitting GBM regression models with different loss functions.

In the figure: (A) MSE loss function; (B) MAE loss function; (C) Huber loss function; (D) Quantile loss function; (E) the original sinc(x) function; (F) smoothed GBM fitted with MSE and MAE losses; (G) smoothed GBM fitted with Huber loss, δ = {4, 2, 1}; (H) smoothed GBM fitted with Quantile loss, α = {0.5, 0.1, 0.9}.

From the above simulation, we can observe:

  • The predictions of the model with MAE loss were barely affected by the impulse noise, while the predictions of the model with MSE loss were slightly biased by the noisy data.
  • The predicted values of the model with Huber loss are less sensitive to the values of the selected hyperparameters.
  • Quantile loss makes a good prediction at the corresponding confidence level.

Finally, we plotted all the above loss functions in a graph:

Attachment: all the code in this article

