XGBoost is an efficient implementation of gradient boosting for classification and regression problems.

It is fast and efficient, performs well on a wide range of predictive modeling tasks, and is a favorite, if not the top choice, among winners of data science competitions such as those on Kaggle.

XGBoost can also be used for time series forecasting, although it requires that the time series dataset first be transformed into a supervised learning problem. It also requires the use of a specialized technique called walk-forward validation to evaluate the model, since evaluating the model with k-fold cross-validation would give optimistically biased results.

In this tutorial, you will discover how to develop an XGBoost model for time series prediction. After completing this tutorial, you will know:

1. XGBoost is an implementation of the gradient boosting ensemble algorithm for classification and regression.

2. Time series datasets can be transformed into supervised learning problems using a sliding-window representation.

3. How to fit, evaluate, and make forecasts with an XGBoost model for time series forecasting.

Tutorial overview

This tutorial is divided into three parts; they are:

1. XGBoost ensemble

2. Time series data preparation

3. XGBoost for time series forecasting

XGBoost ensemble

XGBoost, short for Extreme Gradient Boosting, is an efficient implementation of the stochastic gradient boosting machine learning algorithm. Stochastic gradient boosting (also called gradient boosting machines or tree boosting) is a powerful machine learning technique that performs well, or even best, on a wide range of challenging machine learning problems.

It is an ensemble of decision trees in which new trees correct the errors of trees that are already part of the model. Trees are added until no further improvement to the model can be made. XGBoost provides an efficient implementation of the stochastic gradient boosting algorithm, along with a set of model hyperparameters that give fine-grained control over the model training process.

XGBoost is designed for classification and regression on tabular datasets, although it can also be used for time series forecasting.

First, the XGBoost library must be installed. You can install it using pip, as follows:

sudo pip install xgboost

Once installed, you can confirm that it has been successfully installed and that you are using a modern version by running the following code:

# xgboost
import xgboost
print("xgboost", xgboost.__version__)

Run the code and you should see the following version number or higher.

xgboost 1.0.1

Although the XGBoost library has its own Python API, we can use XGBoost models with the scikit-learn API via the XGBRegressor wrapper class.

An instance of the model can be instantiated and used just like any other scikit-learn class for model evaluation. For example:

from xgboost import XGBRegressor

# define model
model = XGBRegressor()

Now that we’re familiar with XGBoost, let’s look at how to prepare a time series data set for supervised learning.

Time series data preparation

Time series data can be framed as supervised learning. Given a sequence of numbers in a time series dataset, we can restructure the data to look like a supervised learning problem. We do this by using previous time steps as input variables and the next time step as the output variable. Let's make this concrete with an example. Suppose we have the following time series:

time, measure
1, 100
2, 110
3, 108
4, 115
5, 120

We can restructure this time series dataset as a supervised learning problem by using the value at the previous time step to predict the value at the next time step. Reorganizing the dataset in this way, the data looks as follows:

X, y
?, 100
100, 110
110, 108
108, 115
115, 120
120, ?

Note that the time column has been removed, and some rows of data are unusable for training the model, such as the first and the last.

This representation is called a sliding window, because the window of inputs and expected outputs is shifted forward through time to create new “samples” for a supervised learning model.

More information on the sliding window approach to preparing time series forecasting data is covered in a separate tutorial.

We can use the shift() function in Pandas to automatically create new framings of time series problems, given the desired lengths of the input and output sequences.

This is a useful tool, as it allows us to explore different framings of a time series problem with machine learning algorithms, to see which might lead to better-performing models.
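
To make this concrete, here is a minimal illustration (assuming Pandas is installed) of how shift() builds the lagged t-1 input column for the five-value series above; the variable names are only for this sketch:

# illustrate how shift() creates the lagged (t-1) input column
from pandas import DataFrame
from pandas import concat

df = DataFrame({'measure': [100, 110, 108, 115, 120]})
# shift(1) moves each value down one row, pairing the previous value with the current one
framed = concat([df.shift(1), df], axis=1)
framed.columns = ['X (t-1)', 'y (t)']
print(framed)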

The function below takes a time series as a NumPy array with one or more columns and transforms it into a supervised learning dataset with the specified numbers of input and output columns.

from pandas import DataFrame
from pandas import concat

# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
 n_vars = 1 if type(data) is list else data.shape[1]
 df = DataFrame(data)
 cols = list()
 # input sequence (t-n, ... t-1)
 for i in range(n_in, 0, -1):
  cols.append(df.shift(i))
 # forecast sequence (t, t+1, ... t+n)
 for i in range(0, n_out):
  cols.append(df.shift(-i))
 # put it all together
 agg = concat(cols, axis=1)
 # drop rows with NaN values
 if dropnan:
  agg.dropna(inplace=True)
 return agg.values

We can use this function to prepare the time series data set for XGBoost.
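
As a quick check, applying the function to the small five-value series from earlier with a single input lag reproduces the X, y table shown above (this snippet assumes series_to_supervised() and its Pandas imports are defined as above):

# quick check of series_to_supervised() on the small example series
data = series_to_supervised([100, 110, 108, 115, 120], n_in=1)
print(data)
# [[100. 110.]
#  [110. 108.]
#  [108. 115.]
#  [115. 120.]]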

Once the data set is ready, we must be careful how we use it to fit and evaluate the model.

For example, it would not be valid to fit the model on data from the future and have it predict the past. The model must be trained on the past and predict the future. This means that methods that randomize the dataset during evaluation, such as k-fold cross-validation, cannot be used. Instead, we must use a technique called walk-forward validation. In walk-forward validation, the dataset is first split by selecting a cut point, for example, all data except the last 12 months is used for training and the last 12 months is used for testing.

If we are interested in a one-step forecast, e.g. one month ahead, we can evaluate the model by training on the training dataset and predicting the first step of the test dataset. We can then add the real observation from the test set to the training dataset, refit the model, and have the model predict the second step in the test dataset. Repeating this process for the entire test dataset gives a one-step forecast across the whole test set, from which an error measure can be calculated to evaluate the skill of the model.

The function below performs walk-forward validation. It takes the entire supervised learning version of the time series dataset and the number of rows to use as the test set as arguments. It then steps through the test set, calling the xgboost_forecast() function to make a one-step forecast. An error measure is calculated and the details are returned for analysis.

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test):
 predictions = list()
 # split dataset
 train, test = train_test_split(data, n_test)
 # seed history with training dataset
 history = [x for x in train]
 # step over each time-step in the test set
 for i in range(len(test)):
  # split test row into input and output columns
  testX, testy = test[i, :-1], test[i, -1]
  # fit model on history and make a prediction
  yhat = xgboost_forecast(history, testX)
  # store forecast in list of predictions
  predictions.append(yhat)
  # add actual observation to history for the next loop
  history.append(test[i])
  # summarize progress
  print('>expected=%.1f, predicted=%.1f' % (testy, yhat))
 # estimate prediction error
 error = mean_absolute_error(test[:, -1], predictions)
 return error, test[:, -1], predictions

The train_test_split() function is called to split the dataset into train and test sets. This function is defined below.

# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
 return data[:-n_test, :], data[-n_test:, :]

We can use the XGBRegressor class to make a one-step forecast. The xgboost_forecast() function below implements this, taking the training dataset and the test input row as input, fitting a model, and making a one-step prediction.

# fit an xgboost model and make a one step prediction
def xgboost_forecast(train, testX):
 # transform list into array
 train = asarray(train)
 # split into input and output columns
 trainX, trainy = train[:, :-1], train[:, -1]
 # fit model
 model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
 model.fit(trainX, trainy)
 # make a one-step prediction
 yhat = model.predict(asarray([testX]))
 return yhat[0]

Now that we know how to prepare time series data for prediction and evaluation of the XGBoost model, we can look at using XGBoost on a real data set.

XGBoost for time series forecasting

In this section, we will explore how to use XGBoost for time series forecasting. We will use a standard univariate time series dataset with the aim of making one-step forecasts with the model. You can use the code in this section as the starting point for your own project and easily adapt it for multivariate inputs, multivariate forecasts, and multi-step forecasts (a brief framing sketch follows this paragraph). We will use the daily female births dataset, that is, the number of daily births recorded over three years.
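
As a brief aside before turning to the dataset, here is one possible way the framing could be adapted for multi-step forecasting. This is only a sketch: it assumes series_to_supervised() is defined as above and that `values` holds the loaded series, and the rest of the pipeline (model fitting and walk-forward validation) would need matching changes, for example one model per forecast step.

# sketch: frame the series for two-step-ahead forecasting
data = series_to_supervised(values, n_in=6, n_out=2)
# the first six columns are the lagged inputs; the last two columns are y(t) and y(t+1)
X, y = data[:, :-2], data[:, -2:]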

You can download the dataset from the link below and place it in your current working directory with the file name 'daily-total-female-births.csv'.

Dataset (daily-total-female-births.csv):

https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-total-female-births.csv

Dataset description (daily-total-female-births.names):

https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-total-female-births.names

The first few lines of the dataset look like this:

"Date"."Births"
"1959-01-01".35
"1959-01-02".32
"1959-01-03".30
"1959-01-04".31
"1959-01-05".44.Copy the code

First, let’s load and plot the dataset. The complete example is listed below.

# load and plot the time series dataset
from pandas import read_csv
from matplotlib import pyplot
# load dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values
# plot dataset
pyplot.plot(values)
pyplot.show()

Running the example creates a line plot of the dataset. We can see there is no obvious trend or seasonality.

A persistence model can achieve an MAE of about 6.7 births when forecasting the last 12 observations. This provides a baseline in performance above which a model may be considered skillful.
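
For reference, a persistence forecast simply predicts the previous observation. A minimal sketch of how such a baseline could be computed over the last 12 observations is shown below; the 6.7 MAE figure above comes from the original evaluation, not from this snippet, and the snippet assumes `values` is the array loaded in the previous example.

# sketch of a persistence (naive) baseline over the last 12 observations
from sklearn.metrics import mean_absolute_error
births = values[:, 0]
# each of the last 12 values is "predicted" by the observation immediately before it
y_true = births[-12:]
y_pred = births[-13:-1]
print('Persistence MAE: %.3f' % mean_absolute_error(y_true, y_pred))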

Next, we can evaluate the XGBoost model on the dataset when making one-step forecasts for the last 12 observations of the data.

We will use only the previous six time steps as input to the model, along with default model hyperparameters, except that we change the objective to 'reg:squarederror' (to avoid a warning message) and use 1,000 trees in the ensemble (to avoid underfitting).

The complete example is listed below.

# forecast daily births with xgboost
from numpy import asarray
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor
from matplotlib import pyplot
 
# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
 n_vars = 1 if type(data) is list else data.shape[1]
 df = DataFrame(data)
 cols = list()
 # input sequence (t-n, ... t-1)
 for i in range(n_in, 0, -1):
  cols.append(df.shift(i))
 # forecast sequence (t, t+1, ... t+n)
 for i in range(0, n_out):
  cols.append(df.shift(-i))
 # put it all together
 agg = concat(cols, axis=1)
 # drop rows with NaN values
 if dropnan:
  agg.dropna(inplace=True)
 return agg.values
 
# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
 return data[:-n_test, :], data[-n_test:, :]
 
# fit an xgboost model and make a one step prediction
def xgboost_forecast(train, testX):
 # transform list into array
 train = asarray(train)
 # split into input and output columns
 trainX, trainy = train[:, :-1], train[:, -1]
 # fit model
 model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
 model.fit(trainX, trainy)
 # make a one-step prediction
 yhat = model.predict(asarray([testX]))
 return yhat[0]
 
# walk-forward validation for univariate data
def walk_forward_validation(data, n_test):
 predictions = list()
 # split dataset
 train, test = train_test_split(data, n_test)
 # seed history with training dataset
 history = [x for x in train]
 # step over each time-step in the test set
 for i in range(len(test)):
  # split test row into input and output columns
  testX, testy = test[i, :-1], test[i, -1]
  # fit model on history and make a prediction
  yhat = xgboost_forecast(history, testX)
  # store forecast in list of predictions
  predictions.append(yhat)
  # add actual observation to history for the next loop
  history.append(test[i])
  # summarize progress
  print('>expected=%.1f, predicted=%.1f' % (testy, yhat))
 # estimate prediction error
 error = mean_absolute_error(test[:, -1], predictions)
 return error, test[:, -1], predictions
 
# load the dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=6)
# evaluate
mae, y, yhat = walk_forward_validation(data, 12)
print('MAE: %.3f' % mae)
# plot expected vs predicted
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
pyplot.show()

Running the example reports the expected and predicted values for each step in the test set, followed by the MAE across all predicted values.

Note: Your results may differ due to randomness of the algorithm or evaluation program, or due to differences in numerical accuracy. Consider running the example a few times and comparing the average results.

We can see that the model performs better than the persistence model, achieving an MAE of about 5.9 births compared to 6.7 births.

>expected=42.0, predicted=44.5
>expected=53.0, predicted=42.5
>expected=39.0, predicted=40.3
>expected=40.0, predicted=32.5
>expected=38.0, predicted=41.1
>expected=44.0, predicted=45.3
>expected=34.0, predicted=40.2
>expected=37.0, predicted=35.0
>expected=52.0, predicted=32.5
>expected=48.0, predicted=41.4
>expected=55.0, predicted=46.6
>expected=50.0, predicted=47.2
MAE: 5.957

A line plot is created comparing the series of expected values and predicted values for the last 12 observations of the dataset. This gives a visual indication of how well the model performed on the test set.

(Figure: line plot comparing expected and predicted values for the last 12 observations of the test set.)

Once a final XGBoost model configuration has been chosen, the model can be finalized and used to make predictions on new data. This is called an out-of-sample forecast, i.e. making predictions beyond the training dataset. This is identical to making predictions during model evaluation: we always want to evaluate a model using the same procedure we expect to use when it makes predictions on new data. The example below demonstrates fitting a final XGBoost model on all available data and making a one-step prediction beyond the end of the dataset.

# finalize model and make a prediction for daily births with xgboost
from numpy import asarray
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from xgboost import XGBRegressor
 
# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
 n_vars = 1 if type(data) is list else data.shape[1]
 df = DataFrame(data)
 cols = list()
 # input sequence (t-n, ... t-1)
 for i in range(n_in, 0, -1):
  cols.append(df.shift(i))
 # forecast sequence (t, t+1, ... t+n)
 for i in range(0, n_out):
  cols.append(df.shift(-i))
 # put it all together
 agg = concat(cols, axis=1)
 # drop rows with NaN values
 if dropnan:
  agg.dropna(inplace=True)
 return agg.values
 
# load the dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values
# transform the time series data into supervised learning
train = series_to_supervised(values, n_in=6)
# split into input and output columns
trainX, trainy = train[:, :-1], train[:, -1]
# fit model
model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
model.fit(trainX, trainy)
# construct an input for a new prediction
row = values[-6:].flatten()
# make a one-step prediction
yhat = model.predict(asarray([row]))
print('Input: %s, Predicted: %.3f' % (row, yhat[0]))

Running the example fits an XGBoost model on all available data. A new input row is prepared using the last six observations of known data, and the value for the next time step beyond the end of the dataset is predicted.

Input: [34 37 52 48 55 50], Predicted: 42.708

Author: Yishui Hancheng, CSDN blog expert. Research interests: machine learning, deep learning, NLP, CV.

Blog: yishuihancheng.blog.csdn.net
