Preface
Hello everyone, Xiao Zhi here, bringing you some hands-on material today. I wrote the earlier post on a first attempt at machine learning stock price prediction when I had just gotten into quantitative trading; I was struggling to find a data source, so I turned to a third-party platform to obtain stock data.
Later I became interested in the IPython Notebook used on that platform. I haven't been learning Python for very long, so I was genuinely delighted to come across such a distinctive and elegant interactive environment. IPython's mix of code, text, and charts makes it easy to document your work, and code runs instantly, so it is simply fun to use.
So I did some research and learned that we can actually run IPython Notebook as a local editor ourselves, which made me very happy. On top of that, this Friday, tomorrow, I'm giving a talk in front of the entire company on artificial intelligence and quantitative trading, so I took the opportunity to write the demo code in an IPython Notebook. Tomorrow I'll run the code first and then walk through it while showing the charts, which is very comfortable. Later in this article you will also get a feel for the charm of the IPython Notebook.
I will share the slides, and I will write an article covering what I said and what I was thinking during the talk. I hope to exchange ideas with you.
Of course, the style of the slides may not meet your expectations. I'm just a programmer and really not good at this sort of thing, so black text on a white background feels just fine to me.
URL analysis
In the previous article I analyzed stock index data. For Friday's meeting, however, I need to use a product that our company actually deals in (spot precious metals), so I asked our CTO for a way to get gold price data, and he gave me an address: the Wallstreetcn (Wall Street News) site. I clicked on the chart for the category I wanted and, using Chrome's developer tools, easily captured the data URL. It looks something like this.
The request URL for this data looks like this
https://forexdata.wallstreetcn.com/kline?prod_code=XAUUSD&candle_period=8&data_count=1000&end_time=1413158399&fields=time_stamp%2Copen_px%2Cclose_px%2Chigh_px%2Clow_px
The format of the data is clear. We can guess that the request parameter data_count is the number of records requested and end_time is a Unix timestamp; together they mean "return the data_count trading days of data leading up to end_time".
In each record, time_stamp is the timestamp of that data point, close is the closing price, open is the opening price, high is the highest price, and low is the lowest price. These five fields are the basic data we need to draw a K-line chart, the so-called candlestick chart. If you're not familiar with K-lines, you can look them up; I won't elaborate here.
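For reference, based on how the code below parses the response, the JSON presumably nests the candles under data -> candle -> XAUUSD, with each row ordered as in the fields parameter. A quick way to confirm this yourself (a sketch, not from the original post; end_time is just the example timestamp from the captured URL):

import requests
import json

# Hypothetical quick check of the response shape; the parameters mirror the URL captured above
url = ("https://forexdata.wallstreetcn.com/kline?prod_code=XAUUSD"
       "&candle_period=8&data_count=10&end_time=1413158399"
       "&fields=time_stamp%2Copen_px%2Cclose_px%2Chigh_px%2Clow_px")
response = requests.get(url)
body = json.loads(response.text)
print(list(body.keys()))                      # top-level keys; 'data' is the one we care about
print(body["data"]["candle"]["XAUUSD"][:2])   # first two rows: [time_stamp, open, close, high, low]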
Crawl data
After analyzing the URL, we can formally crawl the data. I want to get 10 years of gold data (strictly speaking gold/USD, ticker XAUUSD, a foreign-exchange instrument). Note that, from my experiments, data_count in this URL returns at most 1000 records; if you ask for more than 1000, it still returns only 1000 by default. Naturally, our request parameter end_time will therefore have to change dynamically.
For convenience, I decided to crawl only one year at a time, so data_count is fixed at 365, and end_time is filled in from the function argument via format, as follows:
import requests
import json
import time
import pandas as pd

def get_data(end_time, count):
    url = "https://forexdata.wallstreetcn.com/kline?prod_code=XAUUSD&candle_period=8&data_count=365&end_time="\
          "{end_time}"\
          "&fields=time_stamp%2Copen_px%2Cclose_px%2Chigh_px%2Clow_px".format(end_time=end_time)
    response = requests.get(url)           # request the data
    data_list = json.loads(response.text)  # parse the JSON body
    data = data_list.get("data").get("candle").get("XAUUSD")
    # convert to a DataFrame; count..count+364 keeps a unique running index across batches
    df = pd.DataFrame(data, columns=['date', 'open', 'close', 'high', 'low'],
                      index=list(range(count, count + 365)))
    return df
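As a quick sanity check (a sketch, not from the original post; 1413158399 is the example end_time from the URL captured earlier), you can call it once and inspect the result:

# Hypothetical one-off call to verify the function works
sample = get_data(1413158399, 0)
print(sample.shape)    # expect (365, 5)
print(sample.head())   # date is a Unix timestamp; open/close/high/low are prices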
Here we use the third-party requests package to fetch the data, parse it with json, and convert it into a pandas DataFrame. These are routine operations, so I think everyone should be fine with them.
Once the data-fetching function is written, we call get_data in a loop 10 times and concatenate the resulting DataFrame objects, which gives us our 10 years of gold data. Note that each iteration must include a certain delay, so that the anti-crawler mechanism doesn't block our IP.
init_time = 1237507200        # March 20, 2009
window = 60 * 60 * 24 * 365   # get data for 365 days at a time
df = pd.DataFrame()
for i in range(10):
    df = pd.concat([df, get_data(init_time + i * window, i * 365)])
    print("get data success", i)
    time.sleep(0.5)           # small delay to avoid the anti-crawler mechanism
Ok, after executing the code, let's look at the df data, with a screenshot in IPython Notebook style.
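If you are following along without the screenshot, a couple of quick calls (a sketch, not from the original post) show what landed in the DataFrame:

print(df.shape)    # expect (3650, 5): 10 batches of 365 daily rows each
print(df.head())   # the date column is still a raw Unix timestamp at this point
print(df.tail())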
Working in IPython this way is an experience I can sum up in one word: delightful.
Play with the data
Well, now that our DataFrame holds 3,650 rows of data, it's time to play with it, since Python is a master of data analysis. (Remember to import matplotlib.pyplot as plt.)
Let's plot the closing price of gold in three lines of code:
df['close'].plot(figsize=(15, 10))
plt.grid(True)
plt.show()
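One small caveat: with the integer index we built above, the x-axis of this plot is just a row number. If you would rather see dates on the x-axis, one option (a sketch, not part of the original post) is to plot against a datetime index built from the raw timestamps:

# Hypothetical variant: index the closing prices by real dates instead of row numbers
close = pd.Series(df['close'].values, index=pd.to_datetime(df['date'], unit='s'))
close.plot(figsize=(15, 10))
plt.grid(True)
plt.show()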
Since we have the five basic fields for K-line data, it would be a shame not to draw a K-line chart. The K-line code is a bit more involved, mainly because of the time on the x-axis: we need to convert the timestamp to a %Y-%m-%d string, and then convert that format into the date representation that matplotlib supports.
import matplotlib.finance as mpf   # removed in newer matplotlib; the standalone mpl_finance package offers the same API
from matplotlib.pylab import date2num
import datetime

# convert the Unix timestamps in the date column to '%Y-%m-%d' strings
r = map(lambda x: time.strftime('%Y-%m-%d', time.localtime(x)), df['date'])
df['date'] = list(r)

def date_to_num(dates):
    # convert '%Y-%m-%d' strings into matplotlib date numbers
    num_time = []
    for date in dates:
        date_time = datetime.datetime.strptime(date, '%Y-%m-%d')
        num_date = date2num(date_time)
        num_time.append(num_date)
    return num_time

fig, ax = plt.subplots(figsize=(15, 10))
mat_data = df.values               # as_matrix() is gone in newer pandas; .values does the same job
num_time = date_to_num(mat_data[:, 0])
mat_data[:, 0] = num_time
fig.subplots_adjust(bottom=0.2)
ax.xaxis_date()
# each row is (date, open, close, high, low), which is exactly what candlestick_ochl expects
mpf.candlestick_ochl(ax, mat_data, width=0.6, colorup='r', colordown='g')
plt.grid(True)
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()
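A note if you are on a recent matplotlib: matplotlib.finance has been removed, but the same candlestick_ochl function lives in the standalone mpl_finance package, so the only change needed should be the import (a sketch under that assumption):

# Hypothetical drop-in on newer matplotlib: same function, different package
# (ax and mat_data are the objects built in the block above)
from mpl_finance import candlestick_ochl
candlestick_ochl(ax, mat_data, width=0.6, colorup='r', colordown='g')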
We can also plot the daily ups and downs over the decade, which shows gold's bull runs, bear runs, and sideways swings:
rate_of_return = (df['close']-df['open'])/df['open']
rate_of_return.plot(kind='line', style='k--', figsize=(15, 10))
plt.show()
As you can see, gold spent most of the time oscillating, with some anomalies near the beginning and in the middle, which I guess correspond to the correction in the period after the financial crisis, when the dollar plummeted.
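If you want to put numbers on those anomalies (a sketch, not from the original post), the return series itself makes it easy:

# A quick look at the distribution of daily moves and the most extreme days
print(rate_of_return.describe())
print(rate_of_return.abs().nlargest(10))   # the ten largest single-day swings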
Oh, and here I want to correct a mistake from my last post. Remember that wobbly bar chart? Yes, that one.
I thought it was an IPython bug, but it turned out it wasn't: I had added this line to the code.
with plt.xkcd():
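In full it is used as a context manager wrapping the plotting calls; for example (a minimal sketch reusing the closing-price plot from earlier, not the bar chart from the last post):

# Everything plotted inside the with-block is rendered in the hand-drawn xkcd style
with plt.xkcd():
    df['close'].plot(figsize=(15, 10))
    plt.grid(True)
    plt.show()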
xkcd is the name of a webcomic, and this function draws the figure in a style that mimics the comic's hand-drawn look. So what does the comic look like?
Emm… it does look hand-drawn.
Wrapping up
Well, that's it for the data crawling. Most crawling work is similar; market data just has a few quirks of its own.
In the next post I'll play around with some machine learning code on this data and tune the parameters with you.
Recommended reading
Machine learning stock price prediction
What’s up with quantitative trading and artificial intelligence
Share some tips for learning AI