How to generate prediction intervals for machine learning

Using gradient boosting regression to show the uncertainty in machine learning estimates

Original text: towardsdatascience.com/how-to-gene…

By Will Koehrsen

Lao Qi

“All models are wrong, but some are useful” — George Box

This is a maxim to keep in mind when we use machine learning to make predictions. All machine learning models have limitations: the features that influence the results are not in the data, or the assumptions made by the model do not match reality. When we give a precise forecast number — the house will be worth $450,300.01 — it gives the impression that we are convinced the model is rooted in reality.

A more honest way to present a model's predictions is as a range of estimates. There may be a most likely value, but there is also a wide interval in which the true value could lie. This topic is not commonly covered in data science courses, but it is vital that we show the uncertainty in our predictions and do not oversell the capabilities of machine learning. While people crave certainty, I think it is better to give a wide prediction interval that contains the true value than an exact estimate that is far from reality.

In this article, we will walk through one method for generating uncertainty intervals in Scikit-Learn. Generating prediction intervals is another tool in the data science toolbox, and one that is critical for earning the trust of non-data-scientists.

The problem

In this walkthrough, we will use real-world energy data from a machine learning competition hosted on DrivenData (Drivendata.org/). The dataset has eight features, with samples measured at 15-minute intervals.

data.head()

The objective of this project is to predict energy consumption, a practical task that we do every day at Cortex Building Intel. Undoubtedly, there are hidden features (latent variables) affecting energy consumption that are not captured in the dataset, so we want to show the uncertainty in our estimates by predicting both an upper and a lower bound for energy use.

# Use plotly + cufflinks for interactive plotting
import cufflinks as cf

data.resample('12 H').mean().iplot()

In this article, the visualization tool used by the author is Plotly. For this tool, the translator has a tutorial on how to visualize Python data.

Implementation

To generate prediction intervals in Scikit-Learn, we will use the gradient boosting regressor. The basic idea is as follows:

To predict the lower bound, use GradientBoostingRegressor(loss="quantile", alpha=lower_quantile), where lower_quantile is the lower quantile. For example, the 10th percentile is 0.1.
To predict the upper bound, use GradientBoostingRegressor(loss="quantile", alpha=upper_quantile), where upper_quantile is the upper quantile. For example, the 90th percentile is 0.9.
To predict a mid value, use GradientBoostingRegressor(loss="quantile", alpha=0.5) for the median, or the default loss="ls" (least squares; renamed "squared_error" in scikit-learn 1.0) for the mean. The example in the official documentation uses the latter approach, and so do we.

The loss parameter of GradientBoostingRegressor sets the function that the model optimizes. With loss="quantile" and alpha set to a quantile (as for the lower and upper bounds above), the model predicts that percentile, and a pair of such models gives a prediction interval.

After splitting the data into training and test sets, we build the models. In practice we have to use three separate gradient boosting regressors, because each model optimizes a different function and must be trained separately.

from sklearn.ensemble import GradientBoostingRegressor

# Set lower and upper quantile
LOWER_ALPHA = 0.1
UPPER_ALPHA = 0.9

# Each model has to be separate
lower_model = GradientBoostingRegressor(loss="quantile", alpha=LOWER_ALPHA)

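# The mid model uses the default least squares loss, which predicts the mean.
# That loss was named "ls" before scikit-learn 1.0 and "squared_error" since.
mid_model = GradientBoostingRegressor()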
upper_model = GradientBoostingRegressor(loss="quantile", alpha=UPPER_ALPHA)

Training and predicting uses the familiar Scikit-Learn syntax:

import pandas as pd

# Fit models
lower_model.fit(X_train, y_train)
mid_model.fit(X_train, y_train)
upper_model.fit(X_train, y_train)

# Record actual values on test set
predictions = pd.DataFrame(y_test)

# Predict
predictions['lower'] = lower_model.predict(X_test)
predictions['mid'] = mid_model.predict(X_test)
predictions['upper'] = upper_model.predict(X_test)

And just like that, we have our prediction interval!
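For readers without the competition data at hand, the steps above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the noisy sine wave, the unshuffled 75/25 split, and the column name "actual" are all made up here and are not taken from the original project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the energy data (an assumption): a noisy sine wave
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=1000)

# Simple unshuffled split: first 750 rows for training, last 250 for testing
X_train, X_test = X[:750], X[750:]
y_train, y_test = y[:750], y[750:]

# One model per bound, plus a mid model with the default least squares loss
models = {
    "lower": GradientBoostingRegressor(loss="quantile", alpha=0.1, random_state=0),
    "mid": GradientBoostingRegressor(random_state=0),
    "upper": GradientBoostingRegressor(loss="quantile", alpha=0.9, random_state=0),
}

# Collect the actual values and the three predictions side by side
predictions = pd.DataFrame({"actual": y_test})
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)

# Share of test points whose actual value lies inside [lower, upper]
in_bounds = predictions["actual"].between(predictions["lower"], predictions["upper"])
print(f"Share of test values inside the interval: {in_bounds.mean():.2f}")
```

With the 0.1 and 0.9 quantiles the nominal coverage is 80%, although the share observed on a test set will usually differ somewhat.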

With a little Plotly skill, we can create a nice interactive graphic.

Note: Plotly specializes in creating interactive visualizations. For details, see the Python Data Visualization Tutorial.

Calculating prediction error

As with any machine learning model, we want to quantify the prediction error on the test set (where we have the actual answers). Measuring the error of a prediction interval is more complicated than measuring the error of a point prediction. We can calculate the percentage of the time the actual value falls within the interval, but this metric is easy to game by making the interval wider, so we also want a metric that accounts for how far the predictions are from the actual value, such as the absolute error.

In the source code, I provide a function that calculates the absolute error of the lower, mid, and upper predictions, and then averages the lower and upper errors to give an absolute error for the interval. We can compute this for every data point and then draw a box plot of the errors.

Interestingly, for this model, the median absolute error of the lower-bound prediction is actually less than that of the mid prediction. The model's accuracy is not high and could probably be improved by tuning its hyperparameters. The true value falls between the lower and upper bounds just over half the time. We could increase that share by lowering the lower quantile and raising the upper quantile, but doing so would make the interval less precise.

There are probably better metrics, but I chose these because they are simple to calculate and easy to interpret. The actual evaluation criteria you use should depend on the problem you are trying to solve and your goals.
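The two metrics discussed above can be sketched as follows. This is not the function from the source code; the name interval_errors, the column names, and the four rows of toy numbers are hypothetical and chosen only so the result is easy to check by hand.

```python
import pandas as pd


def interval_errors(predictions):
    """Absolute errors per prediction plus the share of actual values in bounds.

    Expects columns 'actual', 'lower', 'mid', 'upper' (hypothetical names).
    """
    errors = pd.DataFrame(index=predictions.index)
    for col in ["lower", "mid", "upper"]:
        errors[col] = (predictions[col] - predictions["actual"]).abs()
    # Interval error: average of the lower and upper absolute errors
    errors["interval"] = (errors["lower"] + errors["upper"]) / 2
    in_bounds = predictions["actual"].between(predictions["lower"], predictions["upper"])
    return errors, in_bounds.mean()


# Toy predictions: the first two rows are inside the interval, the last two are not
predictions = pd.DataFrame(
    {
        "actual": [10.0, 12.0, 9.0, 15.0],
        "lower": [8.0, 11.0, 10.0, 11.0],
        "mid": [10.5, 12.5, 11.0, 13.0],
        "upper": [12.0, 14.0, 13.0, 14.0],
    }
)
errors, share_in_bounds = interval_errors(predictions)
print(errors)
print(f"Share in bounds: {share_in_bounds:.2f}")
```

Reporting the share in bounds together with the interval error makes the trade-off visible: widening the interval raises the first number but also raises the second.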

Prediction interval model

Training and predicting with three separate models is a bit tedious, so we can write a class that wraps the gradient boosting regressors. It is derived from a Scikit-Learn model, so we use the same syntax for training and prediction, but now everything happens in a single call:

# Instantiate the class
model = GradientBoostingPredictionIntervals(lower_alpha=0.1, upper_alpha=0.9)

# Fit and make predictions
_ = model.fit(X_train, y_train)
predictions = model.predict(X_test, y_test)

The model also comes with some drawing tools:

fig = model.plot_intervals(mid=True, start='2017-05-26', stop='2017-06-01')
iplot(fig)

Please use and adapt the model as you see fit! This is only one way to make uncertainty predictions, but I think it is useful because it uses Scikit-Learn (meaning a gentle learning curve) and we can extend it as needed. In general, this is a good approach to data science problems: start with a simple solution and add complexity only as required!
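To give a rough idea of what such a wrapper might look like, here is a minimal sketch. It is not the GradientBoostingPredictionIntervals class from the source code: the class name QuantileIntervalRegressor is hypothetical, it has no plotting helpers, and its predict method takes only the features, whereas the call shown in the article also passes y_test.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor


class QuantileIntervalRegressor:
    """Hypothetical wrapper that bundles lower, mid, and upper models."""

    def __init__(self, lower_alpha=0.1, upper_alpha=0.9, **kwargs):
        # Extra keyword arguments are passed through to every underlying model
        self.models = {
            "lower": GradientBoostingRegressor(loss="quantile", alpha=lower_alpha, **kwargs),
            "mid": GradientBoostingRegressor(**kwargs),  # default least squares loss
            "upper": GradientBoostingRegressor(loss="quantile", alpha=upper_alpha, **kwargs),
        }

    def fit(self, X, y):
        # Each model optimizes a different loss, so each is trained separately
        for model in self.models.values():
            model.fit(X, y)
        return self

    def predict(self, X):
        # One column per model: lower, mid, upper
        return pd.DataFrame({name: m.predict(X) for name, m in self.models.items()})


# Usage on synthetic data (an assumption for illustration)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 1))
y = 2 * X[:, 0] + rng.normal(0, 1, size=500)

model = QuantileIntervalRegressor(random_state=1).fit(X[:400], y[:400])
intervals = model.predict(X[400:])
print(intervals.head())
```

Keeping the three regressors in a dictionary means that adding another quantile, if it were ever needed, is a one-line change.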

Background: Quantile loss with gradient boosting regression

The gradient boosting regressor is an ensemble model composed of individual decision/regression trees. For the original explanation of the model, see Friedman's 1999 paper, Greedy Function Approximation: A Gradient Boosting Machine (statweb.stanford.edu/~jhf/ftp/tr…).

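With the default loss function, least squares, the gradient boosting regressor predicts the mean. The key point to understand is that the least squares loss penalizes low and high errors equally: the squared error (actual - predicted)^2 is the same whether the prediction falls below or above the actual value by a given amount.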

In contrast, the quantile loss penalizes errors based on the quantile and on whether the error is positive (actual > predicted) or negative (actual < predicted). This lets the gradient boosting model optimize not for the mean but for a percentile. The quantile loss is:

loss = α * (actual - predicted) if actual > predicted, otherwise (α - 1) * (actual - predicted)

where α is the quantile. Let's quickly run through an example using the actual value 10 and the quantiles 0.1 and 0.9:

If α = 0.1 and prediction = 15, then loss = (0.1 - 1) * (10 - 15) = 4.5
If α = 0.1 and prediction = 5, then loss = 0.1 * (10 - 5) = 0.5
If α = 0.9 and prediction = 15, then loss = (0.9 - 1) * (10 - 15) = 0.5
If α = 0.9 and prediction = 5, then loss = 0.9 * (10 - 5) = 4.5

For quantiles < 0.5, a prediction above the actual value (case 1) incurs a greater loss than a prediction the same distance below it. For quantiles > 0.5, a prediction below the actual value (case 4) incurs a greater loss than a prediction the same distance above it. When the quantile == 0.5, predictions above and below the actual value incur the same loss, and the model optimizes for the median. (For the mid model, we can use loss="quantile", alpha=0.5 for the median, or loss="ls" for the mean.)
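The four cases can be checked with a few lines of plain Python. The function name quantile_loss is chosen here for illustration; it simply encodes the two branches used in the worked examples.

```python
def quantile_loss(actual, predicted, alpha):
    """Quantile loss for a single prediction at quantile alpha."""
    error = actual - predicted
    # Under-prediction (error > 0) is weighted by alpha,
    # over-prediction (error <= 0) by (1 - alpha)
    return alpha * error if error > 0 else (alpha - 1) * error


# Reproduce the four cases: actual value 10, quantiles 0.1 and 0.9
for alpha in (0.1, 0.9):
    for predicted in (15, 5):
        loss = quantile_loss(10, predicted, alpha)
        print(f"alpha={alpha}, prediction={predicted}, loss={loss:.1f}")
```

The asymmetry is the whole point: at α = 0.1 the over-prediction costs nine times as much as the under-prediction of the same size, and at α = 0.9 the weights are reversed.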

The following figure illustrates the relationship between quantile loss and error:

Quantiles < 0.5 drive the predictions below the median, and quantiles > 0.5 drive the predictions above the median. This is a great reminder that the loss function of a machine learning method dictates what you are optimizing for!

Depending on the output we want, we can optimize for the mean (least squares), the median (quantile loss with α == 0.5), or any percentile (quantile loss with α == percentile / 100). This is a relatively simple explanation of the quantile loss, but it is more than enough to get you started generating prediction intervals with the model walkthrough. For more information, please read this article and check out the source code.
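One way to convince yourself that the loss decides what gets optimized is to search for the single constant prediction that minimizes each loss on a sample. The sketch below does this with a brute-force grid search over a skewed synthetic sample; the exponential data and the grid are assumptions made for illustration and have nothing to do with how gradient boosting fits its trees.

```python
import numpy as np


def mean_quantile_loss(actual, predicted, alpha):
    """Average quantile loss of a constant prediction over a sample."""
    error = actual - predicted
    return np.mean(np.where(error > 0, alpha * error, (alpha - 1) * error))


# Skewed synthetic sample, so that the mean and the median clearly differ
rng = np.random.default_rng(0)
y = rng.exponential(scale=10, size=5000)
candidates = np.linspace(y.min(), y.max(), 2001)

# Constant that minimizes the squared error: should sit next to the mean
squared_losses = [np.mean((y - c) ** 2) for c in candidates]
best_squared = candidates[np.argmin(squared_losses)]
print(f"least squares: minimizer {best_squared:.2f}, sample mean {y.mean():.2f}")

# Constant that minimizes the quantile loss: should sit next to the quantile
minimizers = {}
for alpha in (0.1, 0.5, 0.9):
    losses = [mean_quantile_loss(y, c, alpha) for c in candidates]
    minimizers[alpha] = candidates[np.argmin(losses)]
    print(
        f"alpha={alpha}: minimizer {minimizers[alpha]:.2f}, "
        f"sample quantile {np.quantile(y, alpha):.2f}"
    )
```

Up to the resolution of the grid, the least squares minimizer lands on the sample mean and each quantile-loss minimizer lands on the corresponding sample quantile.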

Conclusion

Predicting a single exact number with a machine learning model gives the illusion that we have a high level of confidence in the model. However, any model is only an approximation, so it is important to express the uncertainty in our estimates. One way to do this is to generate prediction intervals with the gradient boosting regressor in Scikit-Learn. This is only one way to predict ranges (see, for example, confidence intervals for linear regression), but it is relatively simple and can be adjusted as needed. In this article, we saw a complete implementation and learned some of the theory behind the quantile loss function.

Solving data science problems is about having many tools in your toolbox to apply as needed. Generating prediction intervals is a useful technique, and I recommend that you work through the code in this article and apply it to your own problems. (The best way to learn any skill is by doing!) We know machine learning can do some pretty incredible things, but it is not perfect, and we should not portray it as such. To gain the trust of decision makers, we often need to present not a single number as our estimate, but rather a prediction range that reflects the uncertainty inherent in all models.
