Original link:tecdat.cn/?p=12260

Original source:Open end data tribes

 


The ARIMA model is a popular and widely used statistical method for time series forecasting.

ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a class of models that captures a suite of different standard temporal structures in time series data.

In this tutorial, you will discover how to develop an ARIMA model for time series data using Python.

After completing this tutorial, you will know:

  • About the ARIMA model, parameters used and assumptions made by the model.
  • How to fit the ARIMA model to the data and use it for prediction.
  • How to configure the ARIMA model for your time series problem.

Learn how to prepare and visualize time series data and develop autoregressive prediction models.

Let’s get started.

Autoregressive integrated moving average model

ARIMA models are a class of statistical models for analyzing and forecasting time series data.

The model explicitly caters to a suite of standard structures in time series data, and so provides a simple yet powerful way to make skillful time series forecasts.

ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a generalization of the simpler AutoRegressive Moving Average (ARMA) model and adds the notion of integration, that is, differencing.

The acronym is descriptive. In short, they are:

  • AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.
  • I: Integrated. The raw observations are differenced (for example, by subtracting the observation at the previous time step from each observation) to make the time series stationary.
  • MA: Moving Average. A model that uses the dependency between an observation and the residual errors of a moving average model applied to lagged observations.

Each of these components is explicitly specified in the model as a parameter. The standard notation is ARIMA(p, d, q), where the parameters are replaced with integer values to quickly indicate the specific ARIMA model being used.

The parameters of the ARIMA model are defined as follows:

  • p: The number of lag observations included in the model, also known as the lag order.
  • d: The number of times the raw observations are differenced, also known as the degree of differencing.
  • q: The size of the moving average window, also called the order of the moving average.

A linear regression model is constructed that includes the specified number and type of terms, and the data is prepared by a degree of differencing to make it stationary, that is, to remove trend and seasonal structures that negatively affect the regression model.
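To make the differencing step concrete, here is a minimal sketch in plain Python. The difference() helper and the sample values are hypothetical and chosen only for illustration. Applying the helper once corresponds to d = 1, applying it twice to d = 2.

```python
def difference(values, lag=1):
    """Return the differenced series: values[t] - values[t - lag]."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]


# Made-up series with a steady upward trend (illustrative only)
trend = [10, 13, 16, 19, 22, 25]

first = difference(trend)    # d = 1: the linear trend becomes a constant
second = difference(first)   # d = 2: the constant becomes zero

print(first)   # [3, 3, 3, 3, 3]
print(second)  # [0, 0, 0, 0]
```

Each round of differencing shortens the series by one observation, which is why a model with d = 1 fitted to 36 values works with 35 differenced observations.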

The value 0 can be used as a parameter, which means that this element of the model is not used. In this way, ARIMA models can be configured to perform the functions of ARMA models or even simple AR, I, or MA models.

Adopting an ARIMA model for a time series assumes that the underlying process that generated the observations is an ARIMA process. This may seem obvious, but it helps to motivate the need to confirm the assumptions of the model, both in the raw observations and in the residual errors of the model's predictions.

Next, let’s look at how to use the ARIMA model in Python. We’ll start by loading a simple univariate time series.

 

Shampoo sales data set

This data set describes shampoo sales per month over a 3-year period.

The units are the number of units sold, and there are 36 observations.

  • Download data set

Download the dataset and place it in the current working directory with the file name “shampoo-sales.csv”.

Here is an example of loading the shampoo sales data set with pandas, using a custom function to parse the date-time field. The dates in the data set are anchored to an arbitrary base year, in this case 1900.

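from datetime import datetime
from pandas import read_csv
from matplotlib import pyplot

def parser(x):
    # Dates are stored as "<year digit>-<month>", e.g. "1-01"; prefix "190" to build a full year
    return datetime.strptime('190' + x, '%Y-%m')

# Load the CSV and reduce the single data column to a Series
series = read_csv('shampoo-sales.csv', header=0, index_col=0).squeeze('columns')
# Convert the string index to datetimes with the custom parser
series.index = series.index.map(parser)
print(series.head())
series.plot()
pyplot.show()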

Running the example prints the first five lines of the dataset.

Month
1901-01-01    266.0
1901-02-01    145.9
1901-03-01    183.1
1901-04-01    119.3
1901-05-01    180.3
Name: Sales, dtype: float64

The data is also plotted as a time series, with months on the X-axis and sales figures on the Y-axis.

 

Shampoo sales data set graph

We can see a clear trend in the shampoo sales data set.

This indicates that the time series is not stationary and will need to be differenced to make it stationary, with a difference order of at least 1.

Let's also take a quick look at an autocorrelation plot of the time series, which shows the autocorrelation of the series at each lag.

The plot shows a positive correlation over the first 10 to 12 lags, which is perhaps significant for the first 5 lags.

A good starting point for the AR parameter of the model might be 5.
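As a rough illustration of what an autocorrelation plot measures, the sketch below computes the sample autocorrelation by hand in plain Python. The autocorrelation() helper and the trending series are made up for this example (they are not the shampoo data), and the 1.96/sqrt(n) bound is the usual approximate 95% band for white noise.

```python
from math import sqrt


def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given lag."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    covariance = sum(
        (values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance


# Made-up trending series with a small alternating wobble (illustrative only)
sales = [float(10 * t + (5 if t % 2 else -5)) for t in range(36)]

# Approximate 95% significance bound for a white-noise series of this length
bound = 1.96 / sqrt(len(sales))

for lag in range(1, 13):
    r = autocorrelation(sales, lag)
    flag = "significant" if abs(r) > bound else ""
    print("lag %2d: %+.3f %s" % (lag, r, flag))
```

For a plotted version, pandas provides pandas.plotting.autocorrelation_plot, which accepts a Series and draws the same quantity with confidence bands.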

 

Autocorrelation graph of shampoo sales data

ARIMA with Python

 

You can create an ARIMA model as follows:

  1. Define the model by calling ARIMA() and passing in the p, d, and q parameters.
  2. Fit the model to the training data by calling the fit() function.
  3. Make predictions by calling the predict() function and specifying the index of the time step or steps to be predicted.

Let's start simple: we fit an ARIMA model to the entire Shampoo Sales data set and review the residual errors.

First, we fit an ARIMA(5,1,0) model. This sets the lag value of the autoregression to 5, uses a difference order of 1 to make the time series stationary, and uses a moving average order of 0.

When the model is fitted, a lot of debugging information about the fit of the linear regression model is printed. We can turn this off by setting the disp argument to 0.

Running the example displays a summary of the fitted model. This summarizes the coefficient values used as well as the quality of the fit on the in-sample observations.

                             ARIMA Model Results
==============================================================================
Dep. Variable:                D.Sales   No. Observations:                   35
Model:                 ARIMA(5, 1, 0)   Log Likelihood                -196.170
Method:                       css-mle   S.D. of innovations             64.241
Date:                Mon, 12 Dec 2016   AIC                            406.340
Time:                        11:09:13   BIC                            417.227
Sample:                    02-01-1901   HQIC                           410.098
                         - 12-01-1903
=================================================================================
                    coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
const            12.0649      3.652      3.304      0.003         4.908    19.223
ar.L1.D.Sales    -1.1082      0.183     -6.063      0.000        -1.466    -0.750
ar.L2.D.Sales    -0.6203      0.282     -2.203      0.036        -1.172    -0.068
ar.L3.D.Sales    -0.3606      0.295     -1.222      0.231        -0.939     0.218
ar.L4.D.Sales    -0.1252      0.280     -0.447      0.658        -0.674     0.424
ar.L5.D.Sales     0.1289      0.191      0.673      0.506        -0.246     0.504
                                    Roots
=============================================================================
                 Real           Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -1.0617           -0.5064j            1.1763           -0.4292
AR.2           -1.0617           +0.5064j            1.1763            0.4292
AR.3            0.0816           -1.3804j            1.3828           -0.2406
AR.4            0.0816           +1.3804j            1.3828            0.2406
AR.5            2.9315           -0.0000j            2.9315           -0.0000
-----------------------------------------------------------------------------

First, we get a line plot of the residual errors, which suggests that there may still be some trend information not captured by the model.

Line plot of ARIMA fit residual errors

Next, we get a density plot of the residual error values, which suggests that the errors are roughly Gaussian.

Density plot of ARIMA fit residual errors

Finally, the distribution of the residual errors is summarized. The results show that there is indeed a bias in the prediction (the mean residual is non-zero).

count     35.000000
mean      -5.495213
std       68.132882
min     -133.296597
25%      -42.477935
50%       -7.186584
75%       24.748357
max      133.237980

Note that although we used the entire data set for time series analysis above, ideally we would only perform this analysis on the training data set when developing the prediction model.

Next, let’s look at how to make predictions using the ARIMA model.

Rolling forecast ARIMA model

ARIMA models can be used to predict future time steps.

We can use the predict() function on the ARIMAResults object to make predictions. It accepts the indexes of the time steps to predict as arguments. These indexes are relative to the start of the training data set used to fit the model.

If we used 100 observations in the training data set to fit the model, the index of the next time step is passed to the prediction function as start=100, end=100, because observations are indexed from zero. This returns an array with one element containing the prediction.

If we performed any differencing (d > 0 when configuring the model), we would also prefer the predicted values to be on the original scale. This can be specified by setting the typ argument to the value 'levels': typ='levels'.

Alternatively, we can avoid all of these specifications by using the forecast() function, which performs a one-step forecast using the model.

We can split the data set into training and test sets, use the training set to fit the model, and generate a prediction for each element of the test set.

A rolling forecast is required because both the differencing and the AR model depend on observations at prior time steps. A crude way to perform this rolling forecast is to re-create the ARIMA model after each new observation is received.

We manually keep track of all observations in a list called history, which is seeded with the training data and to which new observations are appended at each iteration.

To sum up, here is an example of the ARIMA model doing rolling predictions in Python.

Running the example prints the predicted and expected values at each iteration.

We can also calculate a final mean squared error (MSE) score for the predictions, which provides a point of comparison for other ARIMA configurations.

predicted=349.117688, expected=342.300000
predicted=306.512968, expected=339.700000
predicted=387.376422, expected=440.400000
predicted=348.154111, expected=315.900000
predicted=386.308808, expected=439.300000
predicted=356.081996, expected=401.300000
predicted=446.379501, expected=437.400000
predicted=394.737286, expected=575.500000
predicted=434.915566, expected=407.600000
predicted=507.923407, expected=682.000000
predicted=435.483082, expected=475.300000
predicted=652.743772, expected=581.300000
predicted=546.343485, expected=646.900000
Test MSE: 6958.325
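As a sanity check on how the score is computed, the MSE can be recomputed from the thirteen predicted and expected pairs listed above. This is a plain-Python sketch; the variable names are arbitrary.

```python
# (predicted, expected) pairs copied from the output above
pairs = [
    (349.117688, 342.3), (306.512968, 339.7), (387.376422, 440.4),
    (348.154111, 315.9), (386.308808, 439.3), (356.081996, 401.3),
    (446.379501, 437.4), (394.737286, 575.5), (434.915566, 407.6),
    (507.923407, 682.0), (435.483082, 475.3), (652.743772, 581.3),
    (546.343485, 646.9),
]

# Mean of the squared differences between prediction and expectation
mse = sum((p - e) ** 2 for p, e in pairs) / len(pairs)
print("Test MSE: %.3f" % mse)
```

The result agrees with the reported Test MSE. Two of the thirteen steps (expected values 575.5 and 682.0) account for most of the squared error.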

A line plot is created showing the expected values (blue) compared to the rolling forecast predictions (red). We can see that the predictions follow the trend and are on the correct scale.

ARIMA rolling forecast line plot

The model could benefit from further tuning of the p, d, and perhaps even q parameters.

Configure the ARIMA model

The classical approach to fitting an ARIMA model is to follow the Box-Jenkins methodology.

This process uses time series analysis and diagnostics to discover good parameters of the ARIMA model.

In summary, the steps of this process are as follows:

  1. Model identification. Use plots and summary statistics to identify trends, seasonality, and autoregressive elements to get an idea of the amount of differencing and the size of the lag that will be required.
  2. Parameter estimation. Use a fitting procedure to find the coefficients of the regression model.
  3. Model checking. Use plots and statistical tests of the residual errors to determine the amount and type of temporal structure not captured by the model.

This process is repeated until an ideal level of fit is achieved on either in-sample or out-of-sample observations, such as training or test data sets.

This process is described in the classic 1970 textbook by George Box and Gwilym Jenkins, “Time Series Analysis: Forecasting and Control.” If you are interested in going deeper into this type of model and methodology, an updated fifth edition is now available.

Given that the model can be fit efficiently on modest-sized time series data sets, grid searching the parameters of the model can be a valuable approach.

Summary

In this tutorial, you discovered how to develop an ARIMA model for time series forecasting in Python.

Specifically, you learned:

  • About the ARIMA model, how it is configured, and the assumptions the model makes.
  • How to perform a quick time series analysis using the ARIMA model.
  • How to use an ARIMA model to make out-of-sample forecasts.

Do you have any questions about ARIMA or this tutorial? Post your questions in the comments below and we’ll do our best to answer them.