[TOC]

ARMA

The combination of AR(p) and MA(q) is ARMA(P,q), autoregressive moving average.

The formula is as follows:

The formula represents:

The value of the current time step is a constant plus the sum of the autoregressive lag and its multipliers, plus the sum of the moving average lag and its multipliers, plus some white noise.

Both capture lag term and residual effect, more general. The order of p and q is determined, and the coefficients of the equation are updated according to the least square or maximum likelihood estimation.

Review the time series modeling process:

  1. Stationarity test:
  • Determine whether the sequence is stationary

If it is not stationary, it is necessary to transform the sequence (usually difference);

  • Determine whether the stationary sequence is white noise

If the stationary sequence is white noise, the modeling condition is not satisfied

  1. Model estimation:
  • Determine the value of p and q

According to the historical articles, it can be determined by the autocorrelation coefficient (ACF) and partial autocorrelation coefficient (PACF), AR(p) has p-order truncation, MA(q) has Q-order truncation.

  • Information criterion

If no clear truncation can be seen in the ACF and PACF plots, information criterion is adopted for judgment, generally BIC and AIC are used

  • The combination of the two
  1. Model residual test
  • Whether the residuals are normal distributions with mean values of 0 and variance of constant (normality)
  • Test the correlation of residuals (correlation)

ARIMA

The difference between ARIMA and ARMA is that a parameter D is added to transform non-stationary series into stationary series, indicating that the d-order difference is transformed into stationary series. ARIMA model is only ARMA model on difference time series.

ARIMA models are represented by the symbol ARIMA(P, D, Q).

For example, in the ARIMA(1,1,0) model, (1,1,0) means there is an autoregressional lag, a difference is made to the data, and there is no moving average term.

  • p

In the autoregressive part of the model, the influence of the past value is incorporated into the model, that is, the historical value has an impact on the future;

  • D is the integrated part of the model.

The fraction of difference needed to make the time series stationary. For example, if the temperature difference in the past three days is very small, tomorrow’s temperature may be similar to the previous days;

  • q

For the moving average part of the model, the model error can be a linear combination of error values observed at past time points.

SARIMA

SARIMA (Seasonal AutoRegressive Integrated Moving Average Model) was a Seasonal ARIMA Model with exogenous regressive models. That is, on the basis of ARIMA, the seasonal part is added. Seasonality is a pattern that repeats in the data with a fixed frequency: daily, fortnightly, four-month, etc.

SARIMA model can be expressed as SARIMA (P, D, q) x (p, D, q) S, which satisfies the multiplication principle. The first part represents the non-seasonal part, the second part represents the seasonal part, and S represents the seasonal frequency.

Seasonal components may capture long-term patterns, while non-seasonal components adjust predictions of short-term changes.

SARIMA of actual combat

First, go through the modeling process of time series analysis one by one.

1. Sequence stationarity test

The unit root test is used here.

Unit root test: The test of unit root of time series is the test of the stationarity of time series. If there is unit root in non-stationary time series, the difference method can generally be used to eliminate the unit root and obtain the stationary series.

Unit root T test:

  • The original hypothesis: there is a unit root
  • P < significance level, reject the null hypothesis, indicating that the unit root is stable
def test_stationarity(timeseries,
                      maxlag=None, regression=None, autolag=None,
                      window=None, plot=False, verbose=False) :
    Unit root test
    
    # set defaults (from function page)
    if regression is None:
        regression = 'c'
    
    if verbose:
        print('Running Augmented Dickey-Fuller test with paramters:')
        print('maxlag: {}'.format(maxlag))
        print('regression: {}'.format(regression))
        print('autolag: {}'.format(autolag))
    
    if plot:
        if window is None:
            window = 4
        #Determing rolling statistics
        rolmean = timeseries.rolling(window=window, center=False).mean()
        rolstd = timeseries.rolling(window=window, center=False).std()
        
        #Plot rolling statistics:
        orig = plt.plot(timeseries, color='blue', label='Original')
        mean = plt.plot(rolmean, color='red', label='Rolling Mean ({})'.format(window))
        std = plt.plot(rolstd, color='black', label='Rolling Std ({})'.format(window))
        plt.legend(loc='best')
        plt.title('Rolling Mean & Standard Deviation')
        plt.show(block=False)
    
    #Perform Augmented Dickey-Fuller test:
    dftest = smt.adfuller(timeseries, maxlag=maxlag, regression=regression, autolag=autolag)
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic'.'p-value'.'#Lags Used'.'Number of Observations Used',]),for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    if verbose:
        print('Results of Augmented Dickey-Fuller Test:')
        print(dfoutput)
    return dfoutput

Copy the code

2. Acf and PACF diagrams

Draw the original sequence diagram, ACF and PACF diagram, and roughly judge the historical data trend and p and Q order of the sequence

def tsplot(y, lags=None, title=' ', figsize=(14.8)) :

    fig = plt.figure(figsize=figsize)
    layout = (2.2)
    ts_ax   = plt.subplot2grid(layout, (0.0))
    hist_ax = plt.subplot2grid(layout, (0.1))
    acf_ax  = plt.subplot2grid(layout, (1.0))
    pacf_ax = plt.subplot2grid(layout, (1.1))
    
    y.plot(ax=ts_ax)
    ts_ax.set_title(title)
    y.plot(ax=hist_ax, kind='hist', bins=25)
    hist_ax.set_title('Histogram')
    smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
    smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
    [ax.set_xlim(0) for ax in [acf_ax, pacf_ax]]
    sns.despine()
    fig.tight_layout()
    return ts_ax, acf_ax, pacf_ax
Copy the code

3. Model residual statistics

Test the normality of standardized residuals (Jarque-Bera normality test) :

  • Null hypothesis: is normal
  • P >alpha, accept the null hypothesis, the residual is normal

Residual sequence correlation (LJung-box test) :

  • Null hypothesis: No sequence correlation
  • P >alpha, accept the null hypothesis, residual sequence has no correlation

Test residual sequence correlation (Durbin-Watson test) :

  • The closer the statistic value is to 2, the better. Generally, it is no problem if it is between 1 and 3.
  • Less than 1 indicates that the residual is autocorrelation.

def model_resid_stats(model_results,
                      het_method='breakvar',
                      norm_method='jarquebera',
                      sercor_method='ljungbox',
                      verbose=True.) :

    
    (het_stat, het_p) = model_results.test_heteroskedasticity(het_method)[0]
    # Jarque-Bera normality test
    norm_stat, norm_p, skew, kurtosis = model_results.test_normality(norm_method)[0] 
    Ljung-box test correlation test
    sercor_stat, sercor_p = model_results.test_serial_correlation(method=sercor_method)[0] 
    sercor_stat = sercor_stat[-1] # last number for the largest lag
    sercor_p = sercor_p[-1] # last number for the largest lag

    # Durbin-Watson test correlation test
    dw_stat = sm.stats.stattools.durbin_watson(model_results.filter_results.standardized_forecasts_error[0, model_results.loglikelihood_burn:])

    # check whether roots are outside the unit circle (we want them to be);
    # will be True when AR is not used (i.e., AR order = 0)
    arroots_outside_unit_circle = np.all(np.abs(model_results.arroots) > 1)
    # will be True when MA is not used (i.e., MA order = 0)
    maroots_outside_unit_circle = np.all(np.abs(model_results.maroots) > 1)
    
    if verbose:
        print('Test heteroskedasticity of residuals ({}): stat={:.3f}, p={:.3f}'.format(het_method, het_stat, het_p));
        print('\nTest normality of residuals ({}): stat={:.3f}, p={:.3f}'.format(norm_method, norm_stat, norm_p));
        print('\nTest serial correlation of residuals ({}): stat={:.3f}, p={:.3f}'.format(sercor_method, sercor_stat, sercor_p));
        print('\nDurbin-Watson test on residuals: d={:.2f}\n\t(NB: 2 means no serial correlation, 0=pos, 4=neg)'.format(dw_stat))
        print('\nTest for all AR roots outside unit circle (>1): {}'.format(arroots_outside_unit_circle))
        print('\nTest for all MA roots outside unit circle (>1): {}'.format(maroots_outside_unit_circle))
    
    stat = {'het_method': het_method,
            'het_stat': het_stat,
            'het_p': het_p,
            'norm_method': norm_method,
            'norm_stat': norm_stat,
            'norm_p': norm_p,
            'skew': skew,
            'kurtosis': kurtosis,
            'sercor_method': sercor_method,
            'sercor_stat': sercor_stat,
            'sercor_p': sercor_p,
            'dw_stat': dw_stat,
            'arroots_outside_unit_circle': arroots_outside_unit_circle,
            'maroots_outside_unit_circle': maroots_outside_unit_circle,
            }
    return stat
Copy the code

4. Grid search for model parameters

There are 6 parameters of SARIMA model. If you choose them manually, you have to adjust bald, so you have to adjust on grid search. In fact, you can also use Bayesian estimation to adjust parameters.

Here we focus on the SARIMAX function call and the meaning of its parameters.

  • Order: ARIMA corresponding to (p,d,q)
  • Seasonal_order: (P, D, Q, s)
def model_gridsearch(ts,
                     p_min,
                     d_min,
                     q_min,
                     p_max,
                     d_max,
                     q_max,
                     sP_min,
                     sD_min,
                     sQ_min,
                     sP_max,
                     sD_max,
                     sQ_max,
                     trends,
                     s=None,
                     enforce_stationarity=True,
                     enforce_invertibility=True,
                     simple_differencing=False,
                     plot_diagnostics=False,
                     verbose=False,
                     filter_warnings=True.) :
    '''Run grid search of SARIMAX models and save results. '''
    
    cols = ['p'.'d'.'q'.'sP'.'sD'.'sQ'.'s'.'trend'.'enforce_stationarity'.'enforce_invertibility'.'simple_differencing'.'aic'.'bic'.'het_p'.'norm_p'.'sercor_p'.'dw_stat'.'arroots_gt_1'.'maroots_gt_1'.'datetime_run']

    # Initialize a DataFrame to store the results
    df_results = pd.DataFrame(columns=cols)


    mod_num=0
    for trend,p,d,q,sP,sD,sQ in itertools.product(trends,
                                                  range(p_min,p_max+1),
                                                  range(d_min,d_max+1),
                                                  range(q_min,q_max+1),
                                                  range(sP_min,sP_max+1),
                                                  range(sD_min,sD_max+1),
                                                  range(sQ_min,sQ_max+1)) :# initialize to store results for this parameter set
        this_model = pd.DataFrame(index=[mod_num], columns=cols)

        if p==0 and d==0 and q==0:
            continue

        try:
            model = sm.tsa.SARIMAX(ts,
                                   trend=trend,
                                   order=(p, d, q),
                                   seasonal_order=(sP, sD, sQ, s),
                                   enforce_stationarity=enforce_stationarity,
                                   enforce_invertibility=enforce_invertibility,
                                   simple_differencing=simple_differencing,
                                  )
            
            if filter_warnings is True:
                with warnings.catch_warnings():
                    warnings.filterwarnings("ignore")
                    model_results = model.fit(disp=0)
            else:
                model_results = model.fit()

            if verbose:
                print(model_results.summary())

            if plot_diagnostics:
                model_results.plot_diagnostics();

            stat = model_resid_stats(model_results,
                                     verbose=verbose)

            this_model.loc[mod_num, 'p'] = p
            this_model.loc[mod_num, 'd'] = d
            this_model.loc[mod_num, 'q'] = q
            this_model.loc[mod_num, 'sP'] = sP
            this_model.loc[mod_num, 'sD'] = sD
            this_model.loc[mod_num, 'sQ'] = sQ
            this_model.loc[mod_num, 's'] = s
            this_model.loc[mod_num, 'trend'] = trend
            this_model.loc[mod_num, 'enforce_stationarity'] = enforce_stationarity
            this_model.loc[mod_num, 'enforce_invertibility'] = enforce_invertibility
            this_model.loc[mod_num, 'simple_differencing'] = simple_differencing

            this_model.loc[mod_num, 'aic'] = model_results.aic
            this_model.loc[mod_num, 'bic'] = model_results.bic

            # this_model.loc[mod_num, 'het_method'] = stat['het_method']
            # this_model.loc[mod_num, 'het_stat'] = stat['het_stat']
            this_model.loc[mod_num, 'het_p'] = stat['het_p']
            # this_model.loc[mod_num, 'norm_method'] = stat['norm_method']
            # this_model.loc[mod_num, 'norm_stat'] = stat['norm_stat']
            this_model.loc[mod_num, 'norm_p'] = stat['norm_p']
            # this_model.loc[mod_num, 'skew'] = stat['skew']
            # this_model.loc[mod_num, 'kurtosis'] = stat['kurtosis']
            # this_model.loc[mod_num, 'sercor_method'] = stat['sercor_method']
            # this_model.loc[mod_num, 'sercor_stat'] = stat['sercor_stat']
            this_model.loc[mod_num, 'sercor_p'] = stat['sercor_p']
            this_model.loc[mod_num, 'dw_stat'] = stat['dw_stat']
            this_model.loc[mod_num, 'arroots_gt_1'] = stat['arroots_outside_unit_circle']
            this_model.loc[mod_num, 'maroots_gt_1'] = stat['maroots_outside_unit_circle']

            this_model.loc[mod_num, 'datetime_run'] = pd.to_datetime('today').strftime('%Y-%m-%d %H:%M:%S')

            df_results = df_results.append(this_model)
            mod_num+=1
        except:
            continue
    return df_results

Copy the code

5. Build a model

5.1 Importing Data

Here, test sets and verification sets are split, which is different from machine learning modeling in the form of random sampling, because the sequential data is ordered.


liquor = pd.read_csv('liquor.csv', header=0, index_col=0, parse_dates=[0])

n_sample = liquor.shape[0]
n_train=int(0.95*n_sample)+1
n_forecast=n_sample-n_train

# Split test set sequence and validation set sequence
liquor_train = liquor.iloc[:n_train]['Value']
liquor_test  = liquor.iloc[n_train:]['Value']
print(liquor_train.shape)
print(liquor_test.shape)
print("Training Series:"."\n", liquor_train.tail(), "\n")
print("Testing Series:"."\n", liquor_test.head())

Copy the code

5.2 Visualization of original sequence, ACF and PACF

tsplot(liquor_train, title='Liquor Sales (in millions of dollars)', lags=40);
Copy the code

From the original sequence diagram, it was found that the sequence was not a stationary sequence, and there was no obvious truncation in the ACF and PACF diagrams, so p and Q could not be determined.

5.3 Non-stationary sequence to stationary sequence

# Test for stationarity

test_stationarity(liquor_train)
Copy the code

Unit root test, P >0.05, can not reject the null hypothesis (with unit root), the sequence is not stationary.

# difference
test_stationarity(liquor_train.diff().dropna())
Copy the code

The first-order difference, P <0.05, rejects the null hypothesis and the sequence is stable, so the first-order difference is enough for the sequence.

5.4 Grid search for model parameters


p_min = 0
d_min = 0
q_min = 0
p_max = 2
d_max = 1
q_max = 2

sP_min = 0
sD_min = 0
sQ_min = 0
sP_max = 1
sD_max = 1
sQ_max = 1

# Cycle by year
s=12 

# trends=['n', 'c']
trends=['n']

enforce_stationarity=True
enforce_invertibility=True
simple_differencing=False

plot_diagnostics=False

verbose=False

df_results = model_gridsearch(liquor['Value'],
                              p_min,
                              d_min,
                              q_min,
                              p_max,
                              d_max,
                              q_max,
                              sP_min,
                              sD_min,
                              sQ_min,
                              sP_max,
                              sD_max,
                              sQ_max,
                              trends,
                              s=s,
                              enforce_stationarity=enforce_stationarity,
                              enforce_invertibility=enforce_invertibility,
                              simple_differencing=simple_differencing,
                              plot_diagnostics=plot_diagnostics,
                              verbose=verbose,
                              )

Copy the code

5.5 Model selection and construction


df_results.sort_values(by='bic').head(10)
Copy the code

BIC is selected as the evaluation index of the model.Select the parameter corresponding to the minimum BIC(p,d,q)=(2,1,0), (p,d,q) s =(0,1,1)12.

Substitute the above optimal parameters into the model:


mod = sm.tsa.statespace.SARIMAX(liquor_train, order=(2.1.0), seasonal_order=(0.1.1.12))
sarima_fit2 = mod.fit()
print(sarima_fit2.summary())

Copy the code

Let’s look at the resulting statistics of the training set model:

  • Coef: Coefficient of regression
  • P > | z | : coefficient is significant
  • JB: Normal test of residuals
  • LB: Correlation test of residual sequences

Visual test of model residuals:

  • Randomness: Whether it is white noise
  • Normality: Whether it is normally distributed
  • Correlation: Whether the correlation between residuals is low

Only when the above conditions are met can the model be established successfully.

sarima_fit2.plot_diagnostics(figsize=(16.12));
Copy the code

5.6 predict

fig, ax1 = plt.subplots(nrows=1, ncols=1, figsize=(12.8))
    
ax1.plot(liquor_train, label='In-sample data', linestyle=The '-')
# subtract 1 only to connect it to previous point in the graph
ax1.plot(liquor_test, label='Held-out data', linestyle=The '-')

# yes DatetimeIndex
pred_begin = liquor_train.index[sarima_fit2.loglikelihood_burn]
pred_end = liquor_test.index[-1]
pred = sarima_fit2.get_prediction(start=pred_begin.strftime('%Y-%m-%d'),
                                    end=pred_end.strftime('%Y-%m-%d'))
pred_mean = pred.predicted_mean
pred_ci = pred.conf_int(alpha=0.05)

ax1.plot(pred_mean, 'r', alpha=6., label='Predicted values')
ax1.fill_between(pred_ci.index,
                 pred_ci.iloc[:, 0],
                 pred_ci.iloc[:, 1], color='k', alpha=2.)
ax1.set_xlabel("Year")
ax1.set_ylabel("Liquor Sales")
ax1.legend(loc='best');
fig.tight_layout();

Copy the code

Zoom in, the fitting effect is very good ~ (visible stability)


Reference links:

  1. Statsmodels.sourceforge.net/devel/gener…
  2. Wiki.mbalib.com/wiki/%E5%8D…
  3. www.statsmodels.org/dev/generat…
  4. www.statsmodels.org/dev/generat…
  5. www.statsmodels.org/dev/generat…
  6. www.statsmodels.org/dev/generat…
  7. Cloud.tencent.com/developer/a…
  8. Blog.csdn.net/qifeidemumu…

Welcome to pay attention to personal public account:Distinct whom