Original link:tecdat.cn/?p=18860
Original source: Tecdat data tribe WeChat public account
Introduction
Time series analysis is a major branch of statistics, which mainly focuses on analyzing data sets to study the characteristics of the data and extract meaningful statistical information to predict the future value of the series. There are two methods of time series analysis, namely frequency domain and time domain. The former is mainly based on Fourier transform, while the latter studies the autocorrelation of sequences and uses Box-Jenkins and ARCH/GARCH methods to predict sequences.
This paper presents a workflow for analyzing and modeling financial time series in an R environment using the time-domain approach. The first part covers stationary time series. The second part covers ARIMA and ARCH/GARCH modeling. Next, it examines the combined model and its performance and effectiveness in modeling and forecasting time series. Finally, the time series analysis methods are summarized.
Stationarity and differencing of time series data:
1. Stationarity:
The first step in modeling time series data is to turn a non-stationary series into a stationary one. This matters because many statistical and econometric methods rest on the stationarity assumption and can only be applied to stationary time series. A non-stationary series is unstable and unpredictable, while a stationary process is mean-reverting: it fluctuates around a constant mean with constant variance. In addition, stationarity is closely related to the independence of random variables, because many results that hold for independent random variables carry over to stationary time series. Most of these methods assume that the random variables are independent (or uncorrelated), that the noise is independent (or uncorrelated), and that the variables and the noise are independent of each other. So what is a stationary time series?
Roughly speaking, stationary time series have no long-term trend, mean and variance are constant. More specifically, there are two definitions of stationarity: weak stationarity and strict stationarity.
A. Weak stationarity: a time series {Xt, t∈Z} (where Z is the set of integers) is said to be stationary if the following conditions are met: (i) E[Xt²] < ∞ for all t; (ii) E[Xt] = μ, a constant independent of t; (iii) the autocovariance γ(t, s) = γ(t + h, s + h) for all t, s, h, i.e., it depends only on the lag t − s.
B. Strict stationarity: the time series {Xt, t∈Z} is said to be strictly stationary if the joint distribution of (Xt1, Xt2, …, Xtk) is the same as that of (Xt1+h, Xt2+h, …, Xtk+h) for all k, all time points t1, …, tk, and all shifts h.
In the statistical literature, stationarity usually refers to weak stationarity: the series satisfies three conditions — constant mean, constant variance, and an autocovariance function that depends only on the lag (t − s), not on t or s themselves. Strict stationarity, on the other hand, means that the probability distribution of the time series does not change over time.
For example, white noise is stationary: its random variables are uncorrelated but not necessarily independent. Strict white noise, however, requires independence between variables. Since a Gaussian distribution is fully characterized by its first two moments, Gaussian white noise is strictly stationary; in the Gaussian case, uncorrelatedness does imply independence.
In strict white noise, the noise term {et} cannot be predicted linearly or nonlinearly. In general white noise, it may not be predicted linearly, but it can be predicted nonlinearly by the ARCH/GARCH model discussed later. There are three points to note:
• Strict stationarity does not imply weak stationarity, because it does not require finite variance
• Weak stationarity does not imply strict stationarity, because strict stationarity requires the whole probability distribution to be unchanged over time
• A nonlinear function of a strictly stationary series is still strictly stationary; this does not hold for weakly stationary series
2. Differencing:
To convert a non-stationary series to a stationary one, the difference method can be used: subtract the one-period-lagged value from the original series, for example Yt = Xt − Xt−1.
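As a toy illustration (hypothetical numbers, not the Apple data), R's built-in `diff` computes exactly this first difference:

```r
# First difference: y_t = x_t - x_(t-1); one observation is lost
x <- c(100, 102, 105, 103, 106)
d <- diff(x)
print(d)  # 2  3 -2  3
```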
In financial time series, it is common to transform the series first and then difference it. This is because financial series usually exhibit exponential growth, so a logarithmic transformation can smooth (linearize) the series, while differencing helps stabilize its variance. Here is an example with Apple's stock price:
• The chart at the top left is the original time series of Apple stock prices from Jan. 1, 2007, to July 24, 2012, showing exponential growth.
• The chart at the bottom left shows the differenced Apple stock price. As you can see, the variance of the differenced series increases with the level of the original series, so it is not stationary.
• Apple’s log price chart is displayed in the upper right corner. This sequence is more linear than the original sequence.
• The differenced log price of Apple is shown at the bottom right. This series looks much more mean-reverting, and its variance is roughly constant, no longer changing with the level of the original series.
To perform the difference in R, perform the following steps:
• Read the data file in R and store it in a variable
appl.close = appl$Adjclose  # read and store the adjusted closing prices from the file
• Plot the original stock price
plot(appl.close, type='l')
• Difference the original series
diff.appl = diff(appl.close)
• Plot the differenced series
plot(diff.appl,type='l')
• Take the logarithm of the original series and plot the log price
log.appl=log(appl.close)
• Difference the log prices and plot the result
difflog.appl=diff(log.appl)
The differenced log price represents the log return, which is close to the percentage change of the stock price.
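This closeness is easy to verify: for small moves, log(1 + r) ≈ r, so the differenced log price is nearly the simple percentage return. A quick sketch with made-up prices (not the Apple series):

```r
p       <- c(100, 101, 102.5, 101.8)     # hypothetical prices
log.ret <- diff(log(p))                  # log returns (differenced log price)
pct.ret <- diff(p) / head(p, -1)         # simple percentage returns
round(cbind(log.ret, pct.ret), 5)        # the two columns nearly coincide
```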
ARIMA model:
Model identification:
The time-domain method is built on the autocorrelation of the time series, so the autocorrelation and partial autocorrelation functions are the core of the ARIMA model. The Box-Jenkins method provides a way to identify an ARIMA model from the autocorrelation and partial autocorrelation plots of the series. The ARIMA specification has three parts: p (the autoregressive order), d (the degree of differencing), and q (the moving average order).
There are three rules for identifying an ARIMA model:
• If the ACF (autocorrelation plot) cuts off after lag q and the PACF (partial autocorrelation plot) tails off: ARIMA(0, d, q) — identify MA(q)
• If the ACF tails off and the PACF cuts off after lag p: ARIMA(p, d, 0) — identify AR(p)
• If both the ACF and the PACF tail off: a mixed ARIMA model, and p and q must be determined together
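These cutoff patterns are easy to reproduce on simulated series (a sketch with assumed coefficients, not the Apple fit): an AR(1) shows a PACF spike at lag 1 only, while an MA(1) shows an ACF spike at lag 1 only.

```r
set.seed(42)
ar1 <- arima.sim(model = list(ar = 0.7), n = 1000)  # AR(1) with phi = 0.7
ma1 <- arima.sim(model = list(ma = 0.7), n = 1000)  # MA(1) with theta = 0.7

pacf.ar1 <- pacf(ar1, plot = FALSE)$acf   # entries are lags 1, 2, ...
acf.ma1  <- acf(ma1,  plot = FALSE)$acf   # entry 1 is lag 0
# AR(1): large PACF at lag 1, then cutoff; MA(1): large ACF at lag 1, then cutoff
print(round(c(pacf.ar1[1], pacf.ar1[8], acf.ma1[2], acf.ma1[8]), 3))
```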
Note that the same model can be written with different differencing counts: for example, ARIMA(1,1,0) on the original series is the same as ARIMA(1,0,0) on the differenced series. It is also necessary to check for over-differencing, whose symptom is a negative lag-1 autocorrelation in the differenced series (typically below −0.5). Differencing too many times also inflates the standard deviation.
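The over-differencing symptom can be seen on a simulated random walk (an illustration, not the Apple series): differencing once gives white noise, but differencing a second time manufactures an MA(1) whose lag-1 autocorrelation is about −0.5 and whose standard deviation is inflated.

```r
set.seed(1)
rw <- cumsum(rnorm(1000))  # random walk: needs exactly one difference
d1 <- diff(rw)             # correctly differenced: white noise
d2 <- diff(d1)             # over-differenced
acf1.d1 <- acf(d1, plot = FALSE)$acf[2]  # lag-1 autocorrelation
acf1.d2 <- acf(d2, plot = FALSE)$acf[2]
print(round(c(acf1.d1, acf1.d2), 3))     # d2 shows a strong negative value
print(c(sd(d1), sd(d2)))                 # sd is also inflated by over-differencing
```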
Here’s an example of an Apple time series:
• The top left shows the ACF of the log Apple stock price, which decays very slowly rather than cutting off. The series likely needs differencing.
• The bottom left is the PACF of the log Apple price, with a significant spike at lag 1 and a cutoff afterwards. A candidate model for the log price is therefore ARIMA(1,0,0).
• The top right shows the ACF of the differenced log Apple price, with no significant lags (ignoring lag 0).
• The bottom right shows the PACF of the differenced log price, also with no significant lags. The differenced log series therefore looks like white noise, and the original series resembles a random walk: ARIMA(0,1,0).
The principle of parsimony is important when fitting ARIMA models: the model should have as few parameters as possible while still explaining the series (p and q should each be at most 2, or the total number of parameters at most 3, per the Box-Jenkins method). The more parameters, the more noise gets into the model, and the larger the standard deviation.
Therefore, when comparing models by AICc, it suffices to examine models with p and q of 2 or less. To compute the ACF and PACF in R:
• Logarithmic ACF and PACF
acf.appl=acf(log.appl)
pacf.appl=pacf(log.appl,main='PACF Apple',lag.max=100)
• ACF and PACF of the differenced log price
acf.appl=acf(difflog.appl,main='ACF Differenced log Apple',lag.max=100)
pacf.appl=pacf(difflog.appl,main='PACF Differenced log Apple',lag.max=100)
In addition to the Box-Jenkins approach, the AICc provides another way to examine and identify models. AICc is the corrected Akaike Information Criterion, calculated as follows:
AICc = N·log(SS/N) + 2(p + q + 1)·N/(N − p − q − 2), if there is no constant term in the model
AICc = N·log(SS/N) + 2(p + q + 2)·N/(N − p − q − 3), if there is a constant term in the model
N: the number of observations after differencing (N = n − d)
SS: the sum of squared residuals of the differenced series
p, q: the orders of the autoregressive and moving average parts
According to this method, the model with the lowest AICc will be selected. When time series analysis is performed in R, the program provides AICc as part of the results. In other software, however, it may be necessary to calculate the numbers manually by calculating the sum of squares and following the formula above. The numbers may vary slightly when different software is used.
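The SS-based formulas above are a one-line function in R (names are mine; note this variant can differ slightly from the likelihood-based AICc that R's own routines report):

```r
# AICc from the residual sum of squares SS, after differencing N observations
aicc.ss <- function(N, SS, p, q, constant = FALSE) {
  k <- if (constant) 2 else 1   # one extra parameter when a constant term is present
  N * log(SS / N) + 2 * (p + q + k) * N / (N - p - q - k - 1)
}
aicc.ss(100, 100, 1, 1)        # no constant: 0 + 2*(1+1+1)*100/96 = 6.25
aicc.ss(100, 100, 1, 1, TRUE)  # with constant: 0 + 2*(1+1+2)*100/95
```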
Model (p d q)   AICc
0 1 0          -6493
1 1 0          -6491.02
0 1 1          -6493.02
1 1 1          -6489.01
0 1 2          -6492.84
1 1 2          -6488.89
2 1 0          -6491.1
2 1 1          -6489.14
2 1 2          -6501.86
Based on AICc, we should choose ARIMA(2,1,2). The two approaches may sometimes give different results, so once all the estimates are in hand, the model must be checked and tested. Here is the code to fit the ARIMA in R:
arima212 = arima(log.appl, order=c(2,1,2))
summary(arima212)
Parameter estimation
To estimate the parameters, execute the same code as shown earlier. The result will provide an estimate for each element of the model. Using ARIMA (2,1,2) as the selected model, the results are as follows:
Series: log.appl
ARIMA(2,1,2)
Coefficients:
          ar1      ar2     ma1     ma2
      -0.0015  -0.9231  0.0032  0.8803
s.e.   0.0532   0.0400  0.0661  0.0488
sigma^2 estimated as 0.000559: log likelihood = 3255.95
AIC = -6501.9   AICc = -6501.86   BIC = -6475.68
Complete model:
Yt − Yt−1 = −0.0015(Yt−1 − Yt−2) − 0.9231(Yt−2 − Yt−3) + 0.0032εt−1 + 0.8803εt−2 + εt
Note that when executing the ARIMA model with difference, R ignores the mean. Here’s Minitab’s output:
Final Estimates of Parameters
Type        Coef      SE Coef       T      P
AR 1      0.0007      0.0430     0.02  0.988
AR 2     -0.9259      0.0640   -14.47  0.000
MA 1      0.0002      0.0534     0.00  0.998
MA 2     -0.8829      0.0768   -11.50  0.000
Constant  0.002721    0.001189    2.29  0.022

Differencing: 1 regular difference
Number of observations: Original series 1401, after differencing 1400
Residuals: SS = 0.779616 (backforecasts excluded), MS = 0.000559, DF = 1395

Modified Box-Pierce (Ljung-Box) Chi-Square statistic
Lag          12     24     36     48
Chi-Square   6.8   21.2   31.9   42.0
DF            7     19     31     43
P-Value    0.452  0.328  0.419  0.516
Note that R will give different estimates for the same model depending on how the code is written:
arima(log.appl, order=c(2,1,2))
arima(difflog.appl, order=c(2,0,2))
The parameter estimates for ARIMA (2,1,2) derived from these two lines of code will differ in R, even though it references the same model. However, in Minitab, the results are similar, so there is less confusion for users.
Diagnostic checking
The process consists of examining the residual plot along with its ACF and PACF, and checking the Ljung-Box test results.
If the ACF and PACF of the model residuals show no significant lags, the selected model is appropriate.
The residual ACF and PACF have no significant lags, indicating that ARIMA(2,1,2) is a good model for this series.
In addition, the Ljung-Box test provides another way to scrutinize the model. Essentially, it is a test of autocorrelation: the null hypothesis is that the autocorrelations of the series are zero. If the test fails to reject this hypothesis, the residuals can be treated as independent (uncorrelated); otherwise, serial correlation remains in the series and the model needs to be modified.
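In R the test is available as `Box.test`; a sketch on simulated white noise (for residuals of a fitted ARIMA one would also pass `fitdf = p + q` to correct the degrees of freedom):

```r
set.seed(7)
e  <- rnorm(500)  # white-noise stand-in for model residuals
lb <- Box.test(e, lag = 12, type = "Ljung-Box")
print(lb$p.value)  # a large p-value: no evidence of autocorrelation
# on real residuals: Box.test(residuals(arima212), lag = 12, type = "Ljung-Box", fitdf = 4)
```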
Modified Box-Pierce (Ljung-Box) Chi-Square statistic
Lag 12 24 36 48
Chi-Square 6.8 21.2 31.9 42.0
DF 7 19 31 43
P-Value 0.452 0.328 0.419 0.516
The Minitab output shows that all p-values are greater than 0.05, so we cannot reject the null hypothesis that the residual autocorrelations are zero. The selected model is therefore an adequate model for the Apple stock price.
ARCH/GARCH model
Although the ACF and PACF of the residuals show no significant lags, the time series plot of the residuals shows some volatility clusters. It is important to remember that ARIMA models the data linearly and its forecast interval width stays constant, because the model does not reflect recent changes or incorporate new information. It provides the best linear forecast of the series and thus contributes little to nonlinear prediction. To model the volatility, the ARCH/GARCH method is used. How do we know whether ARCH/GARCH is needed for the series at hand?
First, check whether the residual plot shows any volatility clusters. Next, look at the squared residuals: if they show volatility, ARCH/GARCH should be used to model the volatility of the series, so as to reflect its more recent changes and fluctuations. Finally, the ACF and PACF of the squared residuals help confirm whether the residuals (noise terms) are independent and predictable. As mentioned earlier, strict white noise can be predicted neither linearly nor nonlinearly, while ordinary white noise may not be linearly predictable but can still be nonlinearly predictable. If the residuals are strict white noise, they are independent, with zero mean and a normal-like distribution, and the ACF and PACF of the squared residuals show no significant lags.
Here is the graph of square residuals:
• Residual squares show volatility at certain points in time
• The PACF of the squared residuals still cuts off after lag 10, even though some later lags remain fairly large
Therefore, the residuals show patterns that can be modeled, and ARCH/GARCH is needed to model the volatility. As the name implies, the method models the conditional variance of the series. The general form of ARCH(q) is: εt = zt·√ht, with ht = a0 + a1·ε²t−1 + … + aq·ε²t−q, where zt is white noise. The residuals and their squares are extracted as follows:
res.arima212=arima212$res
squared.res.arima212=res.arima212^2
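Why the squared residuals are informative can be seen by simulating an ARCH(1) process with assumed parameters (a0 = 0.1, a1 = 0.5; purely illustrative): the series itself is nearly uncorrelated, but its squares are strongly autocorrelated.

```r
set.seed(123)
n  <- 2000
a0 <- 0.1; a1 <- 0.5
z  <- rnorm(n)
eps <- numeric(n); h <- numeric(n)
h[1]   <- a0 / (1 - a1)              # start at the unconditional variance
eps[1] <- sqrt(h[1]) * z[1]
for (t in 2:n) {
  h[t]   <- a0 + a1 * eps[t - 1]^2   # conditional variance depends on the last shock
  eps[t] <- sqrt(h[t]) * z[t]
}
acf.eps  <- acf(eps,   plot = FALSE)$acf[2]  # lag-1 ACF of the series: near zero
acf.eps2 <- acf(eps^2, plot = FALSE)$acf[2]  # lag-1 ACF of the squares: clearly positive
print(round(c(acf.eps, acf.eps2), 3))
```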
ARCH/GARCH order and parameters are selected according to AICc, as shown below:
AICc = −2·logLik + 2(q + 1)·N/(N − q − 2), if there is no constant term in the model
AICc = −2·logLik + 2(q + 2)·N/(N − q − 3), if there is a constant term in the model
To calculate the AICc, we need to fit the ARCH/GARCH model to the residuals and then calculate the logarithmic likelihood using the logLik function in R. Note that since we only wish to model the noise of the ARIMA model, we fit ARCH to the residual of the previously selected ARIMA model, not the original sequence or logarithmic or differential logarithmic sequence.
Model       N     q   Log-likelihood   AICc              AICc
ARCH(0)     1400  0   3256.488         -6510.973139      -6508.96741
ARCH(1)     1400  1   —                —                 —
ARCH(2)     1400  2   3331.168         -6656.318808      -6654.307326
ARCH(3)     1400  3   3355.06          -6702.091326      —
ARCH(4)     1400  4   3370.881         -6731.718958      -6729.701698
ARCH(5)     1400  5   3394.885         -6777.709698      -6775.68954
ARCH(6)     1400  6   —                —                 —
ARCH(7)     1400  7   3403.227         -6790.350477      -6788.324504
ARCH(8)     1400  8   3410.242         —                 —
ARCH(9)     1400  9   3405.803         -6791.447613      -6789.415798
ARCH(10)    1400 10   3409.187         -6796.183798      —
GARCH(1,1)  1400  2   3425.365         -6844.712808      -6842.701326
The AICc values for the no-constant and constant cases are given above. Note that the AICc decreases from ARCH(1) to ARCH(8), then increases at ARCH(9) and ARCH(10). Why does this happen? It signals that we need to check model convergence. For the first models, R reports "relative function convergence", while ARCH(9) and ARCH(10) report "false convergence". When the output contains false convergence, the predictive power of the model is questionable and it should be excluded from selection. Although GARCH(1,1) has the lowest AICc, it also converged falsely and is therefore excluded. ARCH(8) is the chosen model.
In addition, we include ARCH(0) in the analysis because it can be used to check whether there is any ARCH effect, i.e., whether the residuals are independent.
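Since the garch routine cannot fit q = 0, the ARCH(0) log-likelihood must be computed by hand: with constant variance estimated by mean(x²), the Gaussian log-likelihood collapses to a one-liner (a sketch of the calculation, where x stands for the residual series):

```r
# ARCH(0): constant conditional variance sigma^2 = mean(x^2)
# logLik = -0.5 * N * (1 + log(2 * pi * mean(x^2)))
loglik.arch0 <- function(x) {
  N <- length(x)
  -0.5 * N * (1 + log(2 * pi * mean(x^2)))
}
loglik.arch0(c(1, -1))  # toy check: -(1 + log(2*pi)), about -2.8379
```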
R code to fit the ARCH/GARCH model (using the garch function from the tseries package):
arch08 = garch(res.arima212, order=c(0,8), trace=FALSE)
loglik08 = logLik(arch08)
summary(arch08)
Note that R does not allow the order q = 0, so the log-likelihood of ARCH(0) cannot be obtained from R directly; we compute it manually as −0.5·N·(1 + log(2π·mean(x²))), where N is the number of observations after differencing (N = n − d) and x is the residual series. For this data set, the ARCH(8) output is:
Call:
Model: GARCH(0,8)
Residuals:
     Min       1Q   Median       3Q      Max
-4.40329 -0.48569  0.08897  0.69723  4.07181
Coefficient(s):
    Estimate   Std. Error  t value   Pr(>|t|)
a0  1.472e-04  1.432e-05   10.282    < 2e-16  ***
a1  1.284e-01  3.532e-02    3.636    0.000277 ***
a2  1.335e-01  2.839e-02    4.701    2.59e-06 ***
a3  9.388e-02  3.688e-02    2.545    0.010917 *
a4  8.678e-02  2.824e-02    3.073    0.002116 **
a5  5.667e-02  2.431e-02    2.331    0.019732 *
a6  3.972e-02  2.383e-02    1.667    0.095550 .
a7  9.034e-02  2.907e-02    3.108    0.001885 **
a8  1.126e-01  2.072e-02    5.437    5.41e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Diagnostic Tests:
Jarque Bera Test
data: Residuals
X-squared = 75.0928, df = 2, p-value < 2.2e-16
Box-Ljung test
data: Squared.Residuals
X-squared = 0.1124, df = 1, p-value = 0.7374
Except for the sixth parameter (a6), the p-values of all parameters are less than 0.05, indicating that they are statistically significant. In addition, the Box-Ljung test p-value is greater than 0.05, so we cannot reject the hypothesis that the autocorrelations of the squared residuals are zero. The model is therefore sufficient to represent the residuals. The complete ARCH(8) model:
ht = 1.472e-04 + 1.284e-01·ε²t−1 + 1.335e-01·ε²t−2 + 9.388e-02·ε²t−3 + 8.678e-02·ε²t−4 + 5.667e-02·ε²t−5 + 3.972e-02·ε²t−6 + 9.034e-02·ε²t−7 + 1.126e-01·ε²t−8
ARIMA-ARCH / GARCH
In this section we compare the results of the ARIMA model with those of the combined ARIMA-ARCH/GARCH model. As noted above, the selected models for the Apple log-price series are ARIMA(2,1,2) and ARCH(8). We also look at the Minitab results and compare them with R's. Remember that R drops the constant when fitting an ARIMA that involves differencing, so our earlier R result is an ARIMA(2,1,2) with no constant. Using the forecast function, we make a 1-step-ahead forecast of the series from ARIMA(2,1,2):
Point Forecast Lo 95 Hi 95
1402 6.399541 6.353201 6.445882
Complete ARIMA(2,1,2)-ARCH(8) model: the ARIMA(2,1,2) equation above describes the mean of the series, and the ARCH(8) equation describes the conditional variance ht of its noise term εt.
The following table summarizes all the models, edited and calculated in Excel for point predictions and prediction intervals:
Model                                            Forecast     Lower (95%)  Upper (95%)  Actual
ARIMA(2,1,2) in R                                6.399541     6.353201     6.445882     6.354317866
ARIMA(2,1,2) in Minitab (constant)               6.40099      6.35465      6.44734
ARIMA(2,1,2) in Minitab (no constant)            6.39956      6.35314      6.44597
ARIMA(2,1,2) + ARCH(8) in R                      6.39974330   6.35340330   6.44608430
ARIMA(2,1,2) in Minitab (constant) + ARCH(8)     6.40119230   6.35485230   6.44754230
ARIMA(2,1,2) in Minitab (no constant) + ARCH(8)  6.39976230   6.35334230   6.44617230
By converting logarithmic price to price, we obtain the prediction of the original sequence:
Model                                            Forecast      Lower (95%)   Upper (95%)   Actual
ARIMA(2,1,2) in R                                601.5688544   574.3281943   630.1021821   574.9700003
ARIMA(2,1,2) in Minitab (constant)               602.4411595   575.1609991   631.0215411
ARIMA(2,1,2) in Minitab (no constant)            —             —             —
ARIMA(2,1,2) + ARCH(8) in R                      601.6905666   574.4443951   630.2296673
ARIMA(2,1,2) in Minitab (constant) + ARCH(8)     602.5630482   575.2773683   631.1492123
ARIMA(2,1,2) in Minitab (no constant) + ARCH(8)  601.7019989   574.409355    630.28513
On July 25, 2012, Apple released lower-than-expected earnings. The announcement hit the company's stock price, which fell from $600.92 on July 24, 2012 to $574.97 on July 25, 2012. Surprises of this kind are a common risk when a company releases good or bad news. Nevertheless, since the actual price falls within our 95% confidence interval, very close to its lower bound, our model appears to have captured this risk successfully.
It should be noted that the 95% confidence interval of ARIMA(2,1,2) is wider than that of the combined ARIMA(2,1,2)-ARCH(8) model. This is because the latter reflects and incorporates recent changes and volatility in the stock price by analyzing the residuals and their conditional variance (the variance as affected by newly arrived information). How, then, is the conditional variance ht of ARCH(8) calculated?
• Generate the 1-step and 100-step forecasts and the forecast chart:
forecast212step1 = forecast(arima212, h=1, level=95)
• Calculate ht, the conditional variance:
ht.arch08 = arch08$fit[,1]^2  # conditional variance
• Plot the log price with the 95% lower and upper bounds
plot(log.appl,type='l',main='Log Apple,Low,High')
lines(low,col='red')
lines(high,col='blue')
To calculate ht by hand, we list all the model parameters in a column, find the residuals corresponding to those coefficients, square them, multiply each coefficient by its matching squared residual, and sum the results (together with the constant) to obtain ht. For example, to estimate point 1402 (the data set has 1401 observations), we need the residuals of the last 8 days, because our model is ARCH(8). The resulting table:
Term      coeff      res         squared res   ht component
const     1.47e-04                             1.47e-04
a1        1.28e-01   -5.18e-03   2.69e-05      3.45e-06
a2        1.34e-01    4.21e-04   1.77e-07      2.37e-08
a3        9.39e-02   -1.68e-02   2.84e-04      2.66e-05
a4        8.68e-02    1.57e-02   —             1.36e-05
a5        5.67e-02   -7.4e-04    5.49e-07      3.11e-08
a6        3.97e-02    8.33e-04   6.93e-07      2.75e-08
a7        9.03e-02    2.92e-03   8.54e-06      7.72e-07
a8        1.13e-01    9.68e-03   9.37e-05      1.05e-05
ht (sum)                                       2.02e-04
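That recipe is easy to script. With the fitted coefficients and the last eight residuals read off the table (signs as reconstructed; table rounding makes the total differ slightly from the printed 2.02e-04):

```r
a0 <- 1.47e-4                                # constant term
a  <- c(1.28e-1, 1.34e-1, 9.39e-2, 8.68e-2,  # coefficients a1..a8
        5.67e-2, 3.97e-2, 9.03e-2, 1.13e-1)
eps.last8 <- c(-5.18e-3, 4.21e-4, -1.68e-2, 1.57e-2,  # eps[t-1], ..., eps[t-8]
               -7.4e-4, 8.33e-4, 2.92e-3, 9.68e-3)
ht <- a0 + sum(a * eps.last8^2)  # h_t = a0 + sum_i a_i * eps_{t-i}^2
print(ht)                        # about 2.1e-04, close to the table total
```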
To produce the 1-step forecast and 95% confidence interval of the mixed model described above, we take the ARIMA forecast obtained from R or Minitab and then add ht into the forecast variance. Plotting the log price and the conditional variance:
• The conditional variance plot successfully reflects the volatility across the time series
• High volatility is closely correlated with the periods in which the stock price collapsed
95% price forecast interval:
The final check of the model is the QQ plot of the standardized residuals of the ARIMA-ARCH model, i.e., et = εt/sqrt(ht) = residual / sqrt(conditional variance). We can compute it directly in R and then draw the QQ plot to check the normality of the residuals. Here is the code and the QQ plot:
archres = res.arima212/sqrt(ht.arch08)
qqnorm(archres)
qqline(archres)
The figure shows that the residuals appear to be roughly normally distributed, although some points are not on straight lines. However, compared with the residual of ARIMA model, the residual of the hybrid model is closer to the normal distribution.
Conclusion
The time-domain method is a useful approach for analyzing financial time series. Some points to keep in mind when forecasting with ARIMA-ARCH/GARCH models: first, the ARIMA model focuses on linear time series analysis and cannot reflect recent changes as new information arrives; to update the model, users need to incorporate the new data and re-estimate the parameters. The variance in an ARIMA model is the unconditional variance and remains constant. ARIMA applies to stationary series, so non-stationary series should be transformed first (e.g., by a logarithmic transformation). In addition, ARIMA is often combined with an ARCH/GARCH model. ARCH/GARCH is a method for modeling the volatility of a series, or more precisely the noise term of the ARIMA model. ARCH/GARCH incorporates new information and analyzes the series in terms of conditional variance, letting users use the latest information to forecast future values. The prediction interval of the hybrid model is narrower than that of a pure ARIMA model.