- How to Convert a Time Series to a Supervised Learning Problem in Python
- Original author: Dr. Jason Brownlee
- The Nuggets translation Project
- Permanent link to this article: github.com/xitu/gold-m…
- Translator: lsvih
How to convert a time series problem into a supervised learning problem in Python
Some machine learning methods, such as deep learning, can be used for time series prediction.
Before using these machine learning methods, the time series prediction problem must be transformed into a supervised learning problem. That is, you need to convert a time series into a sequence of paired inputs and outputs.
In this tutorial, you will learn how to convert univariate and multivariate time series prediction problems into supervised learning problems for use with machine learning algorithms.
After finishing this tutorial, you will know:
- How to write a function that converts a time series dataset into a supervised learning dataset.
- How to transform univariate time series data for machine learning.
- How to transform multivariate time series data for machine learning.
Let’s get started.
Time series vs supervised learning
Before we begin in earnest, let’s take a moment to better understand the data set structure of time series and supervised learning.
A single time series consists of a series of numbers sorted by time. You can think of it as a list of ordered values.
For example:
0
1
2
3
4
5
6
7
8
9
A supervised learning problem, by contrast, consists of a set of inputs (X) and a set of outputs (y), and an algorithm learns how to predict the output values from the input values.
For example:
X, y
1, 2
2, 3
3, 4
4, 5
5, 6
6, 7
7, 8
8, 9
See this article to learn more about it:
- Time Series Forecasting as Supervised Learning
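The pairing of inputs and outputs can be sketched in plain Python before any library is involved. This is only an illustration: the names series, X and y are chosen here, and the pairing also produces the pair (0, 1), which the short listing above leaves out.

```python
# Pair each observation with the one that follows it.
series = list(range(10))  # 0, 1, ..., 9

X = series[:-1]  # inputs: every value except the last
y = series[1:]   # outputs: every value except the first

for x_value, y_value in zip(X, y):
    print(x_value, y_value)
```

Each printed row is one training example: the value at one time step and the value that came next.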
Pandas’ shift() function
The key to transforming time series data into a supervised learning problem is the Pandas shift() function.
Given a DataFrame, the shift() function creates a copy of a column that is pushed forward in time (rows of NaN values are added at the front) or pulled back (rows of NaN values are added at the end).
This lets you create columns of lagged observations as well as columns of forecast observations, which together put a time series dataset into a supervised learning format.
Let’s see how the shift() function works in practice.
We can simulate a time series dataset of length 10 as a separate column in the DataFrame with the following code:
from pandas import DataFrame
df = DataFrame()
df['t'] = [x for x in range(10)]
print(df)
Running the example prints the time series data, with a row index next to each observation.
   t
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
We can shift all the observations down by one time step by inserting one new row at the top. Because the new row has no data, we can use NaN to represent “no data”.
The shift() function can do this for us, and we can insert the shifted column next to the original series.
from pandas import DataFrame
df = DataFrame()
df['t'] = [x for x in range(10)]
df['t-1'] = df['t'].shift(1)
print(df)
Running the example gives us a dataset with two columns. The first holds the original observations and the second holds the new shifted column.
As you can see, shifting the series forward one time step gives us a primitive supervised learning problem, although with X and y in the wrong order. Ignore the row labels and column headers. The first row has to be discarded because it contains a NaN value. In the second row, the value 0.0 in the second column is the input (X) and the value 1 in the first column is the output (y).
   t  t-1
0  0  NaN
1  1  0.0
2  2  1.0
3  3  2.0
4  4  3.0
5  5  4.0
6  6  5.0
7  7  6.0
8  8  7.0
9  9  8.0
If we repeat this process with shifts of 2, 3, and more, we can create long input sequences (X) that can be used to predict an output value (y).
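Repeating the shift for several lags might look like the following sketch. The choice of three lags and the names framed, X and y are assumptions made for illustration only.

```python
from pandas import DataFrame

df = DataFrame({'t': list(range(10))})

# Add one lagged column per step; three lags is an arbitrary choice.
for lag in range(1, 4):
    df['t-%d' % lag] = df['t'].shift(lag)

# The first three rows hold NaN in at least one lag column, so drop them.
framed = df.dropna()

X = framed[['t-3', 't-2', 't-1']]  # inputs, oldest observation first
y = framed['t']                    # output
print(framed)
```

Every added lag costs one more row at the top of the dataset, so long input sequences leave fewer training examples.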
The shift() function also accepts a negative integer. Doing so pulls the observations up by inserting new rows at the end. Here’s an example:
from pandas import DataFrame
df = DataFrame()
df['t'] = [x for x in range(10)]
df['t+1'] = df['t'].shift(-1)
print(df)
Running the example shows a new column with NaN as its last value.
Here the original column (t) can be taken as the input (X) and the second column (t+1) as the output (y). That is, the input value 0 can be used to predict the output value 1.
   t  t+1
0  0  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  4  5.0
5  5  6.0
6  6  7.0
7  7  8.0
8  8  9.0
9  9  NaN
Technically, in time series forecasting terminology, the current time (t) and future times (t+1, t+n) are forecast times, while past observations (t-1, t-n) are used to make forecasts.
From the examples above, we can see how positive and negative shifts can be used to create a new DataFrame from a time series, with sequences of input and output patterns for a supervised learning problem.
This permits not only classical X -> y prediction, but also X -> Y prediction where both the input and the output are sequences.
In addition, the shift() function also works on multivariate time series problems. These are problems with several observed variables (e.g. temperature and pressure) instead of one. All variables in the time series can be shifted forward or backward to create multivariate input and output sequences. We’ll explore this in more detail later.
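As a rough illustration of the multivariate case, calling shift() on a whole DataFrame moves every column at once. The column names temp and pressure and their values are made up for this sketch.

```python
from pandas import DataFrame, concat

# Two hypothetical variables observed at the same four time steps.
raw = DataFrame({'temp': [20, 21, 22, 23],
                 'pressure': [100, 101, 102, 103]})

lagged = raw.shift(1).add_suffix('(t-1)')  # both columns pushed down one step
current = raw.add_suffix('(t)')            # unshifted observations

# Place the lagged and current columns side by side, then drop the NaN row.
framed = concat([lagged, current], axis=1).dropna()
print(framed)
```

Each remaining row holds the previous values of both variables next to their current values.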
The series_to_supervised() function
We can use the shift() function to automatically create new framings of a time series problem, given the desired lengths of the input and output sequences.
This is a useful tool because it lets us explore different framings of a time series problem with machine learning algorithms and see which ones produce better-performing models.
In this section, we will define a new Python function named series_to_supervised(). It takes a univariate or multivariate time series and frames it as a supervised learning dataset.
The function takes four arguments:
- data: The sequence of observations to convert, as a list or 2-dimensional NumPy array. Required.
- n_in: The number of lag observations used as input (X). Values may be between [1..len(data)]. Optional; defaults to 1.
- n_out: The number of observations used as output (y). Values may be between [0..len(data)-1]. Optional; defaults to 1.
- dropnan: Boolean, whether or not to drop rows containing NaN values. Optional; defaults to True.
The function returns a single value:
- return: The series framed for supervised learning, as a Pandas DataFrame.
The new dataset is built as a DataFrame in which each column is named after its variable number and time step. This lets you design a variety of time-step sequence framings from a given univariate or multivariate time series.
Once the DataFrame is returned, you can decide how to split its columns into X and y components for supervised learning in whatever way you wish.
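One way to split such a DataFrame into X and y is to slice columns by position. The sketch below builds a stand-in frame by hand, with two lag columns and one output column, instead of calling the function; the name n_outputs is an assumption introduced here.

```python
from pandas import DataFrame, concat

# Stand-in for the returned DataFrame: two lag columns and one output column.
series = DataFrame({'var1': list(range(10))})
framed = concat([series.shift(2), series.shift(1), series], axis=1)
framed.columns = ['var1(t-2)', 'var1(t-1)', 'var1(t)']
framed = framed.dropna()

n_outputs = 1  # number of trailing columns treated as y
X = framed.iloc[:, :-n_outputs].values  # all columns except the last
y = framed.iloc[:, -n_outputs:].values  # the last column only
print(X.shape, y.shape)
```

Slicing by position works because the input columns come first and the output columns come last.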
The function is defined with default parameters, so if you call it with just your data, it will build a DataFrame with t-1 as X and t as y.
The function has been confirmed to work with both Python 2 and Python 3.
Here is the complete code, with comments:
from pandas import DataFrame
from pandas import concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Convert a time series into a supervised learning dataset.

    data: sequence of observations, as a list or 2-D NumPy array.
    n_in: number of lag observations used as input (X).
    n_out: number of observations used as output (y).
    dropnan: whether to drop rows containing NaN.
    Returns a Pandas DataFrame framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put all the columns together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows containing NaN
    if dropnan:
        agg.dropna(inplace=True)
    return agg
What do you think you can do to make this function more robust or readable? Leave it in the comments section.
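One possible direction, offered only as a sketch: the function as written appears to fail on 1-D NumPy arrays and pandas Series, because data.shape[1] does not exist for them. The hypothetical variant below, named series_to_supervised_any here, reads the number of variables from the DataFrame instead. It is not part of the original tutorial.

```python
import numpy as np
from pandas import DataFrame, concat

def series_to_supervised_any(data, n_in=1, n_out=1, dropnan=True):
    """Hypothetical variant that also accepts 1-D arrays and pandas objects."""
    df = DataFrame(data)   # lists, 1-D or 2-D arrays, Series and DataFrames
    n_vars = df.shape[1]   # count the variables after conversion
    cols, names = [], []
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += ['var%d(t-%d)' % (j + 1, i) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(n_out):
        cols.append(df.shift(-i))
        label = '(t)' if i == 0 else '(t+%d)' % i
        names += ['var%d%s' % (j + 1, label) for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    return agg.dropna() if dropnan else agg

result = series_to_supervised_any(np.arange(10))
print(result)
```

Returning agg.dropna() instead of modifying the frame in place is a stylistic choice; the rows that are kept are the same.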
Now that we have the entire function, let’s explore its use.
One-step univariate prediction
In time series forecasting, it is standard practice to use lagged observations (such as t-1) as input variables to predict the current time step (t).
This is called one-step forecasting.
The example below shows one lag time step (t-1) used to predict the current time step (t).
from pandas import DataFrame
from pandas import concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Convert a time series into a supervised learning dataset.

    data: sequence of observations, as a list or 2-D NumPy array.
    n_in: number of lag observations used as input (X).
    n_out: number of observations used as output (y).
    dropnan: whether to drop rows containing NaN.
    Returns a Pandas DataFrame framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put all the columns together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows containing NaN
    if dropnan:
        agg.dropna(inplace=True)
    return agg

values = [x for x in range(10)]
data = series_to_supervised(values)
print(data)
Run the sample to output the transformed time series.
   var1(t-1)  var1(t)
1        0.0        1
2        1.0        2
3        2.0        3
4        3.0        4
5        4.0        5
6        5.0        6
7        6.0        7
8        7.0        8
9        8.0        9
As you can see, the observations are named “var1”, the input observation is labelled (t-1), and the output time step is labelled (t).
In addition, you can see that the row containing NaN has been automatically removed from the DataFrame.
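To see the row that was removed, the same call can be made with dropnan=False. The sketch below repeats the function in condensed form so that it runs on its own; the name kept is chosen here for illustration.

```python
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # Condensed copy of the tutorial's function, repeated so this runs alone.
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = [], []
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += ['var%d(t-%d)' % (j + 1, i) for j in range(n_vars)]
    for i in range(n_out):
        cols.append(df.shift(-i))
        label = '(t)' if i == 0 else '(t+%d)' % i
        names += ['var%d%s' % (j + 1, label) for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    return agg.dropna() if dropnan else agg

values = [x for x in range(10)]
kept = series_to_supervised(values, dropnan=False)
print(kept.head(3))  # row 0 still holds NaN in var1(t-1)
print(len(kept), len(series_to_supervised(values)))  # row counts with and without NaN rows
```

Keeping the NaN rows can be useful for inspection, but most learning algorithms will need them removed or filled before training.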
We can repeat this example with an arbitrary number of input time steps, such as 3, by passing the length of the input sequence as an argument. For example:
data = series_to_supervised(values, 3)
A complete example is as follows:
from pandas import DataFrame
from pandas import concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Convert a time series into a supervised learning dataset.

    data: sequence of observations, as a list or 2-D NumPy array.
    n_in: number of lag observations used as input (X).
    n_out: number of observations used as output (y).
    dropnan: whether to drop rows containing NaN.
    Returns a Pandas DataFrame framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put all the columns together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows containing NaN
    if dropnan:
        agg.dropna(inplace=True)
    return agg

values = [x for x in range(10)]
data = series_to_supervised(values, 3)
print(data)
Running the example again prints the reframed series. You can see that the input sequence is in the correct left-to-right order, with the output variable to be predicted on the far right.
   var1(t-3)  var1(t-2)  var1(t-1)  var1(t)
3        0.0        1.0        2.0        3
4        1.0        2.0        3.0        4
5        2.0        3.0        4.0        5
6        3.0        4.0        5.0        6
7        4.0        5.0        6.0        7
8        5.0        6.0        7.0        8
9        6.0        7.0        8.0        9
Multi-step or sequence prediction
A different type of forecasting problem uses past observations to forecast a sequence of future observations.
This may be called sequence forecasting or multi-step forecasting.
We can frame a time series for sequence forecasting by specifying one more argument. For example, we could frame a forecast problem with an input sequence of 2 past observations to forecast 2 future observations:
data = series_to_supervised(values, 2, 2)
A complete example is as follows:
from pandas import DataFrame
from pandas import concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Convert a time series into a supervised learning dataset.

    data: sequence of observations, as a list or 2-D NumPy array.
    n_in: number of lag observations used as input (X).
    n_out: number of observations used as output (y).
    dropnan: whether to drop rows containing NaN.
    Returns a Pandas DataFrame framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put all the columns together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows containing NaN
    if dropnan:
        agg.dropna(inplace=True)
    return agg

values = [x for x in range(10)]
data = series_to_supervised(values, 2, 2)
print(data)
Running the example shows the difference between the input variables (t-n) and the output variables (t+n), with the current observation (t) counted as an output.
   var1(t-2)  var1(t-1)  var1(t)  var1(t+1)
2        0.0        1.0        2        3.0
3        1.0        2.0        3        4.0
4        2.0        3.0        4        5.0
5        3.0        4.0        5        6.0
6        4.0        5.0        6        7.0
7        5.0        6.0        7        8.0
8        6.0        7.0        8        9.0
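For a univariate series framed this way, the first n_in columns are the inputs and the remaining n_out columns are the outputs. The sketch below splits them by position and checks how many rows survive; the function is repeated in condensed form so the block runs alone, and the variable names are assumptions.

```python
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # Condensed copy of the tutorial's function, repeated so this runs alone.
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = [], []
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += ['var%d(t-%d)' % (j + 1, i) for j in range(n_vars)]
    for i in range(n_out):
        cols.append(df.shift(-i))
        label = '(t)' if i == 0 else '(t+%d)' % i
        names += ['var%d%s' % (j + 1, label) for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    return agg.dropna() if dropnan else agg

n_in, n_out = 2, 2
values = [x for x in range(10)]
data = series_to_supervised(values, n_in, n_out)

X = data.iloc[:, :n_in].values  # var1(t-2), var1(t-1)
Y = data.iloc[:, n_in:].values  # var1(t), var1(t+1)

# Rows lost to NaN: n_in at the top plus n_out - 1 at the bottom.
expected_rows = len(values) - n_in - (n_out - 1)
print(X.shape, Y.shape, expected_rows)
```

The positional split assumes a single variable; with several variables each time step occupies several columns, so the boundary would be n_in times the number of variables.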
Multivariate prediction
Another important type of time series is the multivariate time series.
Here we have observations of several different measures and want to forecast one or more of them.
For example, we may have two sets of time series observations, obs1 and obs2, and wish to forecast either or both of them.
We can call series_to_supervised() in exactly the same way. For example:
from pandas import DataFrame
from pandas import concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Convert a time series into a supervised learning dataset.

    data: sequence of observations, as a list or 2-D NumPy array.
    n_in: number of lag observations used as input (X).
    n_out: number of observations used as output (y).
    dropnan: whether to drop rows containing NaN.
    Returns a Pandas DataFrame framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put all the columns together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows containing NaN
    if dropnan:
        agg.dropna(inplace=True)
    return agg

raw = DataFrame()
raw['ob1'] = [x for x in range(10)]
raw['ob2'] = [x for x in range(50, 60)]
values = raw.values
data = series_to_supervised(values)
print(data)
Running the example prints the new framing of the data, showing an input pattern with one time step for both variables and an output pattern with one time step for both variables.
As before, the columns can be split into X and y components in whatever way the problem requires. For example, the current observation of var1 could also be provided as input, leaving only var2 to be predicted.
   var1(t-1)  var2(t-1)  var1(t)  var2(t)
1        0.0       50.0        1       51
2        1.0       51.0        2       52
3        2.0       52.0        3       53
4        3.0       53.0        4       54
5        4.0       54.0        5       55
6        5.0       55.0        6       56
7        6.0       56.0        7       57
8        7.0       57.0        8       58
9        8.0       58.0        9       59
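One possible split of this framing uses the lagged values of both variables plus the current value of var1 as input, and only the current value of var2 as output. The sketch below rebuilds the same framing by hand with shift() so that it runs on its own; selecting columns by name is an illustrative choice.

```python
from pandas import DataFrame, concat

raw = DataFrame({'ob1': list(range(10)), 'ob2': list(range(50, 60))})

# Rebuild the framing by hand: lag-1 columns followed by the time-t columns.
data = concat([raw.shift(1), raw], axis=1)
data.columns = ['var1(t-1)', 'var2(t-1)', 'var1(t)', 'var2(t)']
data = data.dropna()

# var1(t) is treated as known at prediction time; only var2(t) is predicted.
X = data[['var1(t-1)', 'var2(t-1)', 'var1(t)']]
y = data['var2(t)']
print(X.shape, y.shape)
```

Selecting columns by name makes the split explicit, which helps once a framing has many variables and time steps.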
You can see how this can easily be used for sequence forecasting with multivariate time series by specifying the lengths of the input and output sequences, as above.
For example, the following reframes the problem with 1 time step as input and 2 time steps as the forecast sequence:
from pandas import DataFrame
from pandas import concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Convert a time series into a supervised learning dataset.

    data: sequence of observations, as a list or 2-D NumPy array.
    n_in: number of lag observations used as input (X).
    n_out: number of observations used as output (y).
    dropnan: whether to drop rows containing NaN.
    Returns a Pandas DataFrame framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put all the columns together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows containing NaN
    if dropnan:
        agg.dropna(inplace=True)
    return agg

raw = DataFrame()
raw['ob1'] = [x for x in range(10)]
raw['ob2'] = [x for x in range(50, 60)]
values = raw.values
data = series_to_supervised(values, 1, 2)
print(data)
Running the sample will show the reconstructed large DataFrame.
   var1(t-1)  var2(t-1)  var1(t)  var2(t)  var1(t+1)  var2(t+1)
1        0.0       50.0        1       51        2.0       52.0
2        1.0       51.0        2       52        3.0       53.0
3        2.0       52.0        3       53        4.0       54.0
4        3.0       53.0        4       54        5.0       55.0
5        4.0       54.0        5       55        6.0       56.0
6        5.0       55.0        6       56        7.0       57.0
7        6.0       56.0        7       57        8.0       58.0
8        7.0       57.0        8       58        9.0       59.0
Experiment with your own dataset and try several different framings to see which works best.
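A first step in such an experiment could be to loop over a few framings and compare the shapes of the resulting datasets. The list of (n_in, n_out) pairs below is arbitrary, and the function is repeated in condensed form so the block runs alone.

```python
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # Condensed copy of the tutorial's function, repeated so this runs alone.
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = [], []
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += ['var%d(t-%d)' % (j + 1, i) for j in range(n_vars)]
    for i in range(n_out):
        cols.append(df.shift(-i))
        label = '(t)' if i == 0 else '(t+%d)' % i
        names += ['var%d%s' % (j + 1, label) for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    return agg.dropna() if dropnan else agg

values = [x for x in range(10)]
shapes = {}
# Candidate framings as (n_in, n_out) pairs; these are arbitrary examples.
for n_in, n_out in [(1, 1), (3, 1), (2, 2)]:
    framed = series_to_supervised(values, n_in, n_out)
    shapes[(n_in, n_out)] = framed.shape
    print(n_in, n_out, framed.shape)
```

In a real experiment, each framing would then be passed to a model and scored on held-out data; that step is left out here.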
Conclusion
In this tutorial, you learned how to use Python to reframe a time series dataset as a supervised learning problem.
Specifically, you learned:
- About the Pandas shift() function and how it can be used to automatically define supervised learning datasets from time series data.
- How to reframe a univariate time series into one-step and multi-step supervised learning problems.
- How to reframe a multivariate time series into one-step and multi-step supervised learning problems.