Original link:tecdat.cn/?p=5202


Unlike many advanced analytics techniques, time series modeling is a low-cost approach that can still deliver powerful insights.

This article introduces three basic steps for building a quality time series model: making the data stationary, selecting the right model, and evaluating model accuracy. The examples use historical page-view data from a major automotive marketing company.

Step 1: Make the data stationary

A time series is data indexed by time interval (minutes, hours, days, weeks, etc.). Because of how it is collected, time series data often embeds seasonal and/or trend elements. The first step in time series modeling is to account for any existing seasonality (patterns that repeat over a fixed time period) and/or trend (a general upward or downward movement in the data). Accounting for these embedded patterns is what we mean by making the data stationary. Examples of trending and seasonal data can be seen in Figures 1 and 2 below.

Figure 1: Sample upward trending data

Figure 2: Sample seasonal data

What is stationarity?

As mentioned above, the first step in time series modeling is to remove the effects of any trend or seasonality in the data so that the series is stationary. We keep using the term stationary, but what does it actually mean?

A stationary series is one whose mean is no longer a function of time. With trending data, the mean of the series rises or falls over time (think of house prices steadily increasing over the years). With seasonal data, the mean of the series fluctuates with the season (think of temperature rising and falling over each 24-hour period).

How do we achieve stationarity?

Two methods are commonly used to achieve stationarity: differencing the data or linear regression. To difference the data, you compute the difference between consecutive observations. To use linear regression, you include binary indicator (dummy) variables for the seasonal component in the model. Before deciding which method to apply, let's explore the data. We used SAS Visual Analytics to plot historical daily page views.
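The article's analysis was done in SAS Visual Analytics; as an illustration, the two approaches can be sketched in Python with NumPy. The simulated series below (length, trend slope, and a seven-day season) is an assumption for demonstration, not the article's page-view data:

```python
import numpy as np

# Hypothetical daily page-view series with an upward trend and a 7-day season
rng = np.random.default_rng(0)
days = np.arange(84)  # 12 weeks of daily data
views = 1000 + 2 * days + 50 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 5, days.size)

# Option 1: differencing -- subtract each observation from the previous one
diffed = np.diff(views)

# Option 2: regress on day-of-week dummy variables to remove the weekly season
dow = days % 7
dummies = np.eye(7)[dow]                  # one-hot indicator, one column per weekday
coefs, *_ = np.linalg.lstsq(dummies, views, rcond=None)
deseasonalized = views - dummies @ coefs  # residuals with the weekly pattern removed
```

Differencing shortens the series by one observation; the regression residuals keep the original length but have the day-of-week means removed.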

Figure 3: Time series plot of the raw page views

The raw data appears to repeat every seven days, indicating weekly seasonality. The continued growth in page views over time indicates a slight upward trend. With this general picture of the data, we apply a statistical test of stationarity, the augmented Dickey-Fuller (ADF) test. The ADF test is a unit-root test of stationarity. We won't go into the details here, but a unit root indicates that the series is non-stationary, so we use this test to determine the appropriate way to handle the trend and seasonality (differencing or regression). Based on the ADF test for this data, we removed the seven-day seasonality by regressing on dummy variables for day of the week, and removed the trend by differencing the data. The resulting stationary data can be seen in the figure below.

Figure 4: Stationary data after removing seasonality and trend

Step 2: Build your time series model

Now that the data is stationary, the second step in time series modeling is to establish a base-level forecast. Note that most base-level forecasts do not require the first step of making the data stationary; that is only needed for more advanced models, such as the ARIMA model discussed below.

Base-level forecasts

There are several types of time series models. To build a model that accurately predicts future page views (or whatever you’re interested in predicting), it’s necessary to determine the type of model that fits your data.

The simplest option, the naive model, is to assume that the future value of Y (the variable you are interested in predicting) is equal to its most recent observed value.

The second type of model is the average model. In this model, all observations in the data set are given equal weight, and the future forecast of Y is the average of the observed data. If the data is level, the resulting forecast can be quite accurate, but if the data has a trend or a seasonal component, the forecast can be very poor. Below you can see the forecast of the page-view data using the average model.
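The two base-level models above reduce to one-liners. A minimal sketch (the page-view numbers are made up for illustration):

```python
import numpy as np

# Hypothetical week of daily page views
views = np.array([120, 150, 130, 160, 155, 170, 165], dtype=float)

# Naive model: tomorrow's forecast equals the most recent observation
naive_forecast = views[-1]

# Average model: every observation gets equal weight
mean_forecast = views.mean()
```

On trending data the naive model at least tracks the latest level, while the average model lags behind, which is why the average forecast in Figure 5 undershoots.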

Figure 5: Average model prediction

If the data has a seasonal or trend element, a better choice of base-level model is an exponential smoothing model (ESM). An ESM strikes a balance between the naive model and the average model: the most recent observation receives the greatest weight, and the weights on earlier observations decay exponentially into the past. An ESM can also include seasonal and/or trend components. The table below shows an example in which an initial weight of 0.7 declines at a rate of 0.3 per step into the past.

Observation                 Weight
Yt (current observation)    0.7
Yt−1                        0.21
Yt−2                        0.063
Yt−3                        0.0189
Yt−4                        0.00567

Table 1: Example of exponentially declining weights on past observations of Y
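For a simple ESM with smoothing parameter α, the weight on the observation k steps in the past is α(1−α)^k; with α = 0.7 the decline rate is 1 − 0.7 = 0.3, which reproduces the weights in Table 1:

```python
# Weights decline geometrically from alpha = 0.7 by factor 0.3 per step
alpha = 0.7        # weight on the current observation
rate = 1 - alpha   # decline factor applied for each step into the past
weights = [alpha * rate**k for k in range(5)]
# -> 0.7, 0.21, 0.063, 0.0189, 0.00567
```

Note that the weights sum toward 1 as more past observations are included, so the forecast is a weighted average dominated by recent data.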

Several types of ESM can be implemented for time series forecasting. The ideal model depends on the kind of data you have. The table below provides a quick guide to which type of ESM to use, based on the combination of trend and seasonality in the data.

Exponential smoothing model type    Trend    Season
Simple ESM                          –        –
Linear ESM                          X        –
Seasonal ESM                        –        X
Winters ESM                         X        X

Table 2: Model selection table

Because the data has a strong seven-day seasonality and an upward trend, we chose a Winters ESM as the new base-level model. The resulting forecast does a decent job of continuing the slight upward trend and capturing the seven-day seasonality.

Figure 6: ESM forecast

ARIMA model

After identifying the model that best accounts for the trend and seasonality in the data, you may already have enough information to generate a decent forecast, as in Figure 6 above. However, these models are still limited in that they do not account for the correlation of the variable of interest with itself over previous time periods. We call this correlation autocorrelation, and it is commonly found in time series data. If autocorrelation is present in the data, additional modeling can further improve the base-level forecast.

To capture the effects of autocorrelation in a time series model, you can implement an autoregressive integrated moving average (ARIMA) model. An ARIMA model includes parameters that account for seasonality and trend (for example, day-of-week dummy variables and differencing), and also allows autoregressive and/or moving average terms to handle the autocorrelation embedded in the data. With an appropriate ARIMA model, we can further improve the accuracy of our page-view forecast, as shown in Figure 7.

Figure 7: Seasonal ARIMA model prediction

Step 3: Evaluate the accuracy of the model

While each successive model shows an improvement in accuracy, visually judging which model is most accurate is not always reliable. Calculating MAPE (mean absolute percentage error) is a quick and easy way to compare the overall forecast accuracy of the candidate models: the lower the MAPE, the better the accuracy. Comparing the MAPE of the models discussed above, it is easy to see that the seasonal ARIMA model provides the best forecast accuracy. Note that several other comparison statistics are also available for model comparison.
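MAPE is simple enough to compute by hand; a minimal sketch (the actual and predicted values below are made up for illustration):

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent: lower means better accuracy."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Per-period errors: 10/100 = 0.10, 10/200 = 0.05, 0/150 = 0 -> mean 0.05 -> 5%
actual = [100.0, 200.0, 150.0]
predicted = [110.0, 190.0, 150.0]
```

One caveat: MAPE is undefined when an actual value is zero and penalizes over-forecasts more than under-forecasts, which is why alternatives such as MAE or RMSE are sometimes preferred.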

Model             MAPE (%)
Winters ESM       6.9
Seasonal ARIMA    4.4

Table 3: Comparison of model error rates

Summary

In summary, the trick to building a robust time series forecasting model is to remove as much of the noise (trend, seasonality, and autocorrelation) as possible, so that the only movement left in the data is pure randomness. For our data, we found that a seasonal ARIMA model with regression variables provided the most accurate forecast.