Contact
- Author: sunhailin-Leo
- E-mail: [email protected]
- Code: warehouse address -> github.com/sunhailin-L…
Introduction
- The first article is here: Python crawls the Bank of China foreign exchange rate (crawler + PyFlux simple prediction analysis)
- If you do not have the data yet, please refer to the first article (this article assumes the data is already stored in the database).
Getting down to business
- At the end of the last article, the PyFlux library was used to analyze and predict the data
- Two aspects are summarized:
- Advantages:
- The PyFlux model documentation "hits the nail on the head" (it is written for people who already have some basic knowledge of time series analysis and can understand the core formulas)
- Disadvantages:
- It provides only a small number of data analysis APIs, unlike statsmodels, which provides methods such as residual analysis for model validation and tuning
- This article will therefore use statsmodels to analyze the same data.
Data preparation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Keep only the date part of the timestamp (drop the trailing " HH:MM:SS")
df['Query time'] = df['Query time'].apply(lambda x: x[:-9])
df['Query time'] = pd.to_datetime(df['Query time'], format="%Y-%m-%d")
# Daily mean of the spot asking price
df = df.groupby('Query time')['Spot asking price'].mean()
df = df.to_frame()
print(df)
Data stationarity check
- Straight to the code (plot the differenced series to judge stationarity)
- First- and second-order differences were tested (higher-order differencing is not recommended, since it easily destroys information in the data)
# Difference plot
fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(111)
# The 1 here is the difference order
diff1 = df.diff(1)
diff1.plot(ax=ax1)
plt.show()
# fig.savefig('./picture/1.jpg')
- First-order difference result
- Second-order difference result
- Conclusion: from the plots, the first-order difference fluctuates less and is relatively stationary (the modeling below therefore uses the first-order differenced series).
Autocorrelation and partial autocorrelation
- A brief introduction to autocorrelation and partial autocorrelation:
- Autocorrelation (ACF): the correlation between a time series and lagged copies of itself; for residuals, correlation among the random error terms is also called serial correlation
- Partial autocorrelation (PACF): the correlation between an observation and an observation at an earlier time step, after the influence of the intervening observations has been removed
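To make the ACF concrete, here is a minimal hand-rolled sketch of the sample autocorrelation at a single lag (this is what `plot_acf` computes for every lag); the function name and the toy series are mine, not from the article:

```python
import numpy as np

def sample_acf(x, lag):
    """Sample autocorrelation at `lag`: covariance between the series
    and its lagged copy, divided by the series variance."""
    x = np.asarray(x, dtype=float)
    xm = x - x.mean()
    return float(np.sum(xm[lag:] * xm[:-lag]) / np.sum(xm ** 2))

trend = np.arange(50, dtype=float)     # a trending series
print(round(sample_acf(trend, 1), 2))  # → 0.94, strong lag-1 autocorrelation
```

A stationary white-noise series would instead give values close to 0 at every nonzero lag, which is exactly what the confidence bands in the ACF plot test for.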
- Next, 40 lags are used for the autocorrelation and partial autocorrelation analysis
# statsmodels ACF / PACF plots
diff1 = diff1.dropna()
fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(diff1, lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(diff1, lags=40, ax=ax2)
fig.show()
# fig.savefig('./picture/acf_pacf.jpg')
- Next, the ARMA parameters p (AR order, suggested by the PACF) and q (MA order, suggested by the ACF) are chosen by inspecting the plots:
- The autocorrelation plot shows lags up to order 3 exceeding the confidence boundary
- The partial autocorrelation plot shows the coefficient exceeding the confidence boundary at lag 2 (the fluctuation is not completely eliminated after that, but it is most prominent at lag 2)
- Models with the candidate p and q parameters are fitted below, and the AIC, BIC and HQIC results compared:
- AIC: Akaike Information Criterion -> -2ln(L) + 2k
- BIC: Bayesian Information Criterion -> -2ln(L) + k * ln(n)
- HQIC: Hannan-Quinn Information Criterion -> -2ln(L) + 2k * ln(ln(n))
- where L is the maximized likelihood of the model, n is the number of observations, and k is the number of model parameters.
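The three criteria above are cheap to compute by hand from the formulas. A small sketch; the log-likelihood, sample size and parameter count here are made-up illustration numbers, not from the exchange-rate fit:

```python
import math

def info_criteria(log_lik, n, k):
    """AIC, BIC and HQIC from the maximized log-likelihood,
    sample size n and number of estimated parameters k."""
    aic = -2 * log_lik + 2 * k
    bic = -2 * log_lik + k * math.log(n)
    hqic = -2 * log_lik + 2 * k * math.log(math.log(n))
    return aic, bic, hqic

# Hypothetical fit: log-likelihood 96.5, 278 observations, 5 parameters
aic, bic, hqic = info_criteria(96.5, 278, 5)
print(aic, hqic, bic)  # for this n, the penalties order as AIC < HQIC < BIC
```

Lower values are better under all three criteria; BIC penalizes extra parameters hardest, which is why it tends to pick smaller models than AIC.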
- The results (AIC, BIC, HQIC) are as follows:
- p=2, q=0 -> -194.77200890459767, -179.81283725588074, -188.79262385129292
- p=2, q=1 -> -197.01722373566554, -178.31825917476937, -189.54299241903462
- p=2, q=2 -> -195.11076756537022, -172.6720100922948, -186.1416899854131
- p=3, q=2 -> -201.37730533090803, -175.19875494565338, -190.91338148762475
- Model selection:
- Given the above results, p=2 and q=2 are selected. This may not be the absolute best fit, but combined with the differencing results and the ACF/PACF plots, the (2, 2) parameters are a reasonable choice for modeling
Modeling
- The modeling code is short, so without going into much detail, see the code below
arma_mod22 = sm.tsa.ARMA(diff1, (2, 2)).fit()
# Output the AIC, BIC and HQIC results
print(arma_mod22.aic, arma_mod22.bic, arma_mod22.hqic)
Model validation
- Residual verification
# Residuals (DataFrame)
resid = arma_mod22.resid
# ACF and PACF plots of the residuals
fig = plt.figure(figsize=(12, 8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(resid.values.squeeze(), lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(resid, lags=40, ax=ax2)
fig.show()
# fig.savefig('./picture/resid_acf_pacf.jpg')
- From the figure above, although a few lags still exceed the confidence interval, the residual series is essentially white noise.
- D-W test
- The Durbin-Watson (D-W) test checks residuals for autocorrelation (it only detects first-order autocorrelation; higher orders require a different approach)
# Durbin-Watson test on the residuals
resid_dw_result = sm.stats.durbin_watson(arma_mod22.resid.values)
# 1.9933441709003574 is close to 2, so the residual series has no autocorrelation
print(resid_dw_result)
- The result equals 1.9933441709003574
- Conclusion: there is no first-order autocorrelation in the residual series
- Reason: a D-W statistic close to 2 indicates almost no autocorrelation (see: www.doc88.com/p-543541848…)
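The statistic itself is simple: DW = Σ(e_t − e_{t−1})² / Σe_t², which is near 2 for uncorrelated residuals, near 0 for strong positive autocorrelation, and near 4 for strong negative autocorrelation. A hand-rolled sketch of the same formula that `sm.stats.durbin_watson` implements (the toy series are mine):

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic of a residual series: sum of squared
    successive differences divided by the sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

rng = np.random.default_rng(7)
white = rng.standard_normal(500)        # uncorrelated -> DW near 2
alternating = np.tile([1.0, -1.0], 50)  # negatively correlated -> DW near 4
print(round(durbin_watson(white), 2), round(durbin_watson(alternating), 2))
```

This makes the "close to 2" reading above concrete: the exchange-rate residuals behave like the `white` series, not the `alternating` one.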
- Normality check
# Normality check -> the residuals basically follow a normal distribution
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111)
fig = sm.qqplot(resid, line='q', ax=ax, fit=True)
fig.show()
# fig.savefig('./picture/normal_distribution.jpg')
- Except for some outliers in the upper right corner (those points show up again in the Q test below)
- According to the plot, most of the data conform to a normal distribution
- Q test
# Ljung-Box test (also known as the Q test) on the residual series
r, q, p = sm.tsa.acf(resid.values.squeeze(), qstat=True)
data = np.c_[range(1, 41), r[1:], q, p]
table = pd.DataFrame(data, columns=['lag', 'AC', 'Q', 'Prob(>Q)'])
temp_df = table.set_index('lag')
print(temp_df)
# Prob(>Q) min: 0.025734615668093132, max: 0.9874705305611844, mean: 0.2782013984159408
prob_q_min = temp_df['Prob(>Q)'].min()
prob_q_max = temp_df['Prob(>Q)'].max()
prob_q_mean = temp_df['Prob(>Q)'].mean()
print("Prob(>Q) min: {}, max: {}, mean: {}".format(prob_q_min, prob_q_max, prob_q_mean))
- Results: Prob(>Q) minimum 0.025734615668093132; maximum 0.9874705305611844; mean 0.2782013984159408
- From the table, the values for roughly the last 10 lags hover around 0.05, which corresponds to the outlier points in the normal Q-Q plot above
- By the Q test, at the 95% confidence level a Prob(>Q) greater than 0.05 means the residual series has no significant autocorrelation at that lag, so across the 40 lags the model's residuals are essentially free of autocorrelation.
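For reference, the Q statistic behind those Prob(>Q) values is Q = n(n+2) · Σ_{k=1..h} ρ̂_k² / (n−k), where ρ̂_k is the sample autocorrelation at lag k. A minimal numpy sketch (function names and toy series are mine; `sm.tsa.acf(..., qstat=True)` computes this plus the chi-squared p-values):

```python
import numpy as np

def acf_at(x, lag):
    """Sample autocorrelation at a single lag."""
    xm = x - x.mean()
    return float(np.sum(xm[lag:] * xm[:-lag]) / np.sum(xm ** 2))

def ljung_box_q(x, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag; larger Q means
    stronger evidence of autocorrelation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return n * (n + 2) * sum(acf_at(x, k) ** 2 / (n - k)
                             for k in range(1, max_lag + 1))

rng = np.random.default_rng(3)
noise = rng.standard_normal(300)  # white noise -> small Q
trended = np.cumsum(noise)        # heavily autocorrelated -> huge Q
print(ljung_box_q(noise, 10) < ljung_box_q(trended, 10))  # True
```

Good residuals should look like `noise` here, which is what the large Prob(>Q) values in the table indicate.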
- Forecast (predicting the exchange rate movement from November 9 to 14)
- Since the data was first-order differenced earlier, the model's predictions are also predictions of the first-order difference
# Prediction
predict_data = arma_mod22.predict('2018-11-09', '2018-11-14', dynamic=True)
# Plot the prediction
fig, ax = plt.subplots(figsize=(12, 8))
ax = diff1.loc['2018-01-01':].plot(ax=ax)
fig = arma_mod22.plot_predict('2018-11-09', '2018-11-14', dynamic=True, ax=ax, plot_insample=False)
fig.show()
# Restore the differenced predictions to actual prices,
# starting from the last observed value
last_data = df['Spot asking price'].values[-1]
predict_data_list = predict_data.values.tolist()
restore_list = []
for d in predict_data_list:
    last_data = last_data + d
    restore_list.append(last_data)
predict_data = pd.DataFrame(restore_list, index=predict_data.index, columns=['Spot asking price'])
print(predict_data)
- Note: the predicted values above are daily average exchange rates
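The restore loop above is just a running sum, so the same undifferencing can also be written with a cumulative sum. A toy sketch with made-up numbers (not the real forecast values):

```python
import numpy as np

last_price = 6.10                            # last observed actual value
diff_preds = np.array([0.02, -0.01, 0.03])   # predicted first differences

# Loop-style restore, as in the article...
restored = []
level = last_price
for d in diff_preds:
    level += d
    restored.append(level)

# ...is equivalent to adding the cumulative sum of the differences
restored_cumsum = last_price + np.cumsum(diff_preds)
print(np.allclose(restored, restored_cumsum))  # True
```

Either form works; the cumsum version is just the vectorized spelling of the same undifferencing step.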
Conclusion
- The analysis, modeling, validation and prediction above are only a simple workflow. A forecasting model's accuracy needs continuous validation; the key code above just provides a reference pipeline.
- Time series forecasting and analysis still take study: different data types and multivariate data call for different models, and the steps above are for reference only.
- So far PyFlux and statsmodels have been used as time series forecasting libraries; machine learning or deep learning methods will be considered for prediction next.