Project repository
- Project address: github.com/sunhailin-L…
- Developer: Sunhailin-Leo
Project introduction
- The crawler's core code is in the GitHub repository, go take a look ~
- Brief introduction of project features:
- Fetches the Bank of China foreign-exchange quotations (the Hong Kong dollar is the base currency of this project template)
- The crawl time range is customizable (setting a start time that spans too long a period is not recommended)
- Crawled data can be stored in MySQL, MongoDB, or CSV (controlled by the cmdline_start_spider startup parameters; a sketch follows this list)
- Incremental crawl mode added (November 12, 2018)
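To illustrate the storage switch, here is a minimal sketch of how a startup script like cmdline_start_spider might hand the choice to the spider. The spider name boc_exchange_rate and the storage argument are my assumptions, not the project's actual names:

```python
# Hypothetical startup script -- names are assumptions, not the project's actual ones.
import sys

from scrapy import cmdline

if __name__ == '__main__':
    # e.g. `python cmdline_start_spider.py mysql` -> the pipeline picks the MySQL writer
    storage = sys.argv[1] if len(sys.argv) > 1 else 'csv'
    cmdline.execute(('scrapy crawl boc_exchange_rate -a storage=%s' % storage).split())
```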
- Data source: srh.bankofchina.com/search/whpj…
- Why crawl this:
- Just out of curiosity, I want to predict the exchange rate trend over the next few days
- And visualize it along the way
- Project technical points:
- Scrapy fetches the page source and parses the data
- You need to install Selenium and configure ChromeDriver
- Since the page numbers are loaded via JS, the temporary solution was to render pages with Selenium in headless mode (later switched to scrapy-splash); a rough sketch of that kind of middleware follows
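As an illustration of the headless-rendering approach (my sketch, not the project's actual middleware), a Scrapy downloader middleware that hands each request to headless Chrome could look roughly like this:

```python
# A minimal sketch, assuming ChromeDriver is installed -- not the project's actual middleware.
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class SeleniumRenderMiddleware:
    def __init__(self):
        opts = Options()
        opts.add_argument('--headless')
        self.driver = webdriver.Chrome(options=opts)

    def process_request(self, request, spider):
        # Let Chrome execute the JS-built pagination, then hand Scrapy the final HTML.
        self.driver.get(request.url)
        return HtmlResponse(url=request.url, body=self.driver.page_source,
                            encoding='utf-8', request=request)
```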
About the data
- The data below are exchange-rate records from January 1 to November 6, 2018 (about 48,000 rows after deduplication).
- Data exploration:
- BOC's website quotes four prices (spot buying rate, cash buying rate, spot selling rate, cash selling rate) plus a BOC conversion rate; the conversion rate is not considered for now.
- Spot buying rate: the bank's price for converting foreign exchange held in an account into RMB.
- Cash buying rate: the bank's price for converting foreign-currency cash into RMB.
- Spot selling rate: the bank's price for selling foreign exchange, i.e. converting RMB into foreign exchange for outbound payments.
- Cash selling rate: the bank's price for selling foreign-currency notes, i.e. converting RMB into foreign-currency cash.
- If you are thinking of a shopping spree in Hong Kong, use the spot selling rate or the cash selling rate for visualization and prediction.
- I chose the spot selling rate; all the visualization and statistical analysis below use it (for the other types, only the column name changes).
- Data cleaning:
- It is easy to spot duplicate rows when browsing the page or inspecting the crawled data.
- Cleaning methods (I chose the latter, which is gentler and puts no extra load on the database):
- MySQL: a SELECT DISTINCT statement
- Pandas: DataFrame.drop_duplicates (keeping the first value per publication time is fine)
- The code looks like this:

```python
import pandas as pd

header = ['Spot buying rate', 'Cash buying rate', 'Spot selling rate',
          'Cash selling rate', 'BOC conversion rate', 'Query time']

# DataFrame display configuration: show all columns
pd.set_option('display.max_columns', None)
# Show all rows
# pd.set_option('display.max_rows', None)
# Set the display width of each value to 100 (default is 50)
pd.set_option('max_colwidth', 100)

# Get the data from MySQL and reset the header (sql_conn() is the project's own connection helper)
sql = "SELECT buying_rate, cash_buying_rate, selling_rate, cash_selling_rate, boe_conversion_rate, rate_time " \
      "FROM exchange_rate.t_exchange_rate " \
      "WHERE currency_name = 'hk'"
df = pd.read_sql(sql=sql, con=sql_conn())
df.columns = header
df = df.sort_values(by='Query time')

# Convert data types
df['Spot buying rate'] = df['Spot buying rate'].astype('float')
df['Cash buying rate'] = df['Cash buying rate'].astype('float')
df['Spot selling rate'] = df['Spot selling rate'].astype('float')
df['Cash selling rate'] = df['Cash selling rate'].astype('float')
df['BOC conversion rate'] = df['BOC conversion rate'].astype('float')

# Deduplicate, keeping the first row per publication time
df = df.drop_duplicates(subset='Query time', keep='first')
print(df[['Spot selling rate', 'Query time']])
```
- Data visualization (line chart and K-line chart)
- The code below relies on the pandas and pyecharts libraries
- Line chart
```python
from pyecharts import Line, Style

def draw_line_pic(title: str, labels: list, data_package: list, x_axis: list):
    """Line chart
    :param title:
    :param labels:
    :param data_package:
    :param x_axis:
    :return:
    """
    style = Style(
        title_top="#fff",
        title_pos="left",
        width=1920,
        height=900
    )
    line = Line(title=title, **style.init_style)
    for i, d in enumerate(labels):
        line.add(d, x_axis, data_package[i],
                 is_stack=False,
                 is_label_show=True,
                 is_smooth=True,
                 yaxis_min=78,
                 yaxis_max=90,
                 yaxis_formatter="yuan",
                 mark_point=["max", "min"],
                 mark_line=["average"],
                 is_datazoom_show=True,
                 datazoom_type="both",
                 datazoom_range=[80, 100])
    line.render(path='./file/line.html')

# Line chart data: one series per quoted price
total_data = [
    df['Spot buying rate'].tolist(), df['Cash buying rate'].tolist(),
    df['Spot selling rate'].tolist(), df['Cash selling rate'].tolist()
]
draw_line_pic(
    title='Exchange rate between RMB and HKD (per HK$100)',
    labels=header[0:4],
    data_package=total_data,
    x_axis=df['Query time'].tolist()
)
```
- The rendered result looks like this:
- K-line chart
- Data note: the spot selling rate is used; the data are grouped by day with groupby, and each day's open, close, low, and high are derived for plotting
```python
from pyecharts import Kline, Style

def draw_kline_pic(title: str, labels: list, data_package: list):
    """K-line chart
    :param title:
    :param labels:
    :param data_package:
    :return:
    """
    style = Style(
        title_top="#fff",
        title_pos="left",
        width=1920,
        height=900
    )
    kline = Kline(title=title, **style.init_style)
    kline.add('day K', labels, data_package,
              yaxis_min=78,
              yaxis_max=90,
              yaxis_formatter="yuan",
              mark_line=["min", "max"],
              mark_point=["min", "max"],
              is_datazoom_show=True,
              datazoom_type="both",
              datazoom_range=[80, 100])
    kline.render('./file/k_line.html')

# K-line data: keep only the date part of the query time and group by day
df['Query time'] = df['Query time'].apply(lambda x: x[:-9])
df['Query time'] = pd.to_datetime(df['Query time'], format="%Y-%m-%d")
grouped = df.groupby('Query time')['Spot selling rate']

labels = []
values = []
for d in grouped:
    temp_data = d[1].tolist()
    # [open, close, lowest, highest] for each day
    k_data = [temp_data[0], temp_data[-1], min(temp_data), max(temp_data)]
    labels.append(str(d[0])[:-9])
    values.append(k_data)

draw_kline_pic(title='Exchange rate between RMB and HKD (per HK$100)',
               labels=labels, data_package=values)
```
- The rendered result looks like this:
Prediction
- Before predicting, it helps to first read up on the ARIMA model (autoregressive integrated moving average).
- The full time-series analysis is not posted here (the steps run long: stationarity, autocorrelation, partial autocorrelation, and so on, and that part is too messy to show); a quick ARIMA refresher and my final code (using the pyflux library) follow.
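For reference (the standard definition, not something from the original post), an ARIMA(p, d, q) model applies autoregressive and moving-average terms to the d-times differenced series:

$$\Bigl(1 - \sum_{i=1}^{p} \phi_i L^i\Bigr)(1 - L)^d\, y_t = \Bigl(1 + \sum_{j=1}^{q} \theta_j L^j\Bigr)\varepsilon_t$$

where $L$ is the lag operator and $\varepsilon_t$ is white noise. The code below fits ARIMA(2, 0, 2), i.e. ar=2, ma=2, integ=0, on the undifferenced daily means.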
```python
import pandas as pd
import pyflux as pf

def model_training_1(df: pd.DataFrame):
    # Daily mean of the spot selling rate
    df['Query time'] = df['Query time'].apply(lambda x: x[:-9])
    df['Query time'] = pd.to_datetime(df['Query time'], format="%Y-%m-%d")
    df = df.groupby('Query time')['Spot selling rate'].mean()
    df = df.to_frame()
    print(df)

    # ARIMA(2, 0, 2)
    model = pf.ARIMA(data=df, ar=2, ma=2, integ=0, target='Spot selling rate', family=pf.Normal())
    x = model.fit("MLE")
    x.summary()
    model.plot_z(figsize=(15, 5))
    model.plot_fit(figsize=(15, 10))
    model.plot_predict_is(h=50, figsize=(15, 5))
    model.plot_predict(h=2, past_values=50, figsize=(15, 5))
    res = model.predict(h=5)
    print(res)
```
- First, look at the fit plot, i.e. plot_fit
- Then the model's estimated parameters
- The forecast plot (the next few days, predicted from the last 50 days of data)
- The predicted results
Data validation
- I have not had time to do proper validation yet, but a lazy glance at the fit suggests it is decent
- I will fill this in when I have time (a rough hold-out check is sketched below)
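As a placeholder, here is a minimal sketch of what such a check might look like: hold out the last few days, refit, and compare the forecast with the actual values. This is my own addition, assuming daily is the daily-mean DataFrame built in model_training_1; it is not code from the original project.

```python
import numpy as np
import pandas as pd
import pyflux as pf

# Hypothetical hold-out check -- not part of the original project.
def holdout_check(daily: pd.DataFrame, h: int = 5):
    train, test = daily.iloc[:-h], daily.iloc[-h:]
    model = pf.ARIMA(data=train, ar=2, ma=2, integ=0,
                     target='Spot selling rate', family=pf.Normal())
    model.fit("MLE")
    pred = model.predict(h=h)
    # Mean absolute error between the h-step forecast and the held-out days
    mae = np.mean(np.abs(pred.values.ravel() - test['Spot selling rate'].values))
    print('MAE over the last %d days: %.4f' % (h, mae))
```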
Digression
- pyflux repository: github.com/RJT1990/pyf…
- pyflux is a library dedicated to time-series analysis; unfortunately its official website has been down for a long time
- The project used to have a slick official site with examples and pictures; now only the docs remain
- Docs: pyflux.readthedocs.io/en/latest/g…
- Note: almost no diagrams are left in the docs, only formulas and code blocks, so make do with that. If you are interested in time-series analysis and prediction, consider the statsmodels library instead, which has rich documentation and solid tooling; a tiny sketch follows.
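For example, a minimal statsmodels counterpart to the ARIMA(2, 0, 2) fit above might look like this (my sketch, assuming a recent statsmodels version and the same daily-mean series):

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# daily: a pd.Series of daily mean spot selling rates indexed by date (assumption)
def statsmodels_forecast(daily: pd.Series, h: int = 5):
    res = ARIMA(daily, order=(2, 0, 2)).fit()
    print(res.summary())          # parameter estimates and diagnostics
    return res.forecast(steps=h)  # point forecasts for the next h days
```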
Future directions
- Wherever the mood takes it ~ Peace!