1 Introduction
In the previous articles we introduced how to obtain the fund list and how to obtain a fund's basic information. Today we continue from there and obtain the fund's change information. This time the information will be retrieved using a combination of page parsing and API calls.
2 Capturing change information
By observing the fund's basic information page, we can see that the information about fund changes is made up of the four parts shown in the figures below.
Next, let's talk about the approach to capturing the data. The first figure already contains the fund's basic information, the change information and the stage increases, but the stage increases are also shown in the second figure, so from the first figure we only need the real-time rise and fall and the fund's unit net value from the previous day.
2.1 Capturing fund change information
```
# Fund change information: we again start from a single fund's page; other funds
# are accessed the same way, just change the fund code in the address.
http://fund.eastmoney.com/005585.html
```
Obtaining the change information is divided into two parts. The first is the fund's real-time change: you will notice that the estimated net value changes every so often, and by monitoring the browser's network requests I captured the following API call, which made me instantly happy.
```js
// http://fundgz.1234567.com.cn/js/005585.js
{
    "fundcode": "005585",
    "name": "...",                 // fund name (Chinese characters)
    "jzrq": "2021-11-16",
    "dwjz": "1.6718",
    "gsz": "1.6732",
    "gszzl": "0.08",
    "gztime": "2021-11-17 15:00"
}
```
The fund code and name are clear from the returned JSON, but what do jzrq, dwjz, gsz, gszzl and gztime mean? I puzzled over this for quite a while, combining what is displayed on the page with DFCF's habit of naming fields after the initials of the Chinese pinyin, and guessed that these fields roughly mean net value date, unit net value, estimated value, estimated growth rate and estimate time. I was rather pleased with myself for figuring it out.
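To keep these guesses in one place, here is a small lookup table; the English descriptions are my own interpretation of the pinyin abbreviations, not official API documentation:

```python
# My interpretation of the API's pinyin-abbreviated fields (not official documentation)
FIELD_MEANINGS = {
    "fundcode": "fund code",
    "name": "fund name",
    "jzrq": "net value date (净值日期)",
    "dwjz": "unit net value (单位净值)",
    "gsz": "estimated net value (估算值)",
    "gszzl": "estimated growth rate, in % (估算增长率)",
    "gztime": "time of the estimate (估值时间)",
}
```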
The second part is obtaining the fund's unit net value. Analysis shows that this data is contained in the HTML element <dl class="dataItem02">, so we use bs4 to parse the returned page, grab that element and walk its DOM subtree to extract the data.
To sum up, we use an API call to obtain the fund's real-time change information, and we parse the returned HTML and its DOM tree to obtain the fund's unit net value. Here is the code for this first part of the information capture.
```python
import json

import requests
from bs4 import BeautifulSoup

# Fund code to query; change it to capture another fund
code = "005585"

# Capture the real-time change information of the fund
resp = requests.get("http://fundgz.1234567.com.cn/js/{}.js".format(code))
# Strip the jsonp wrapper so the data can be parsed as JSON
data = resp.text.replace("jsonpgz(", "").replace(");", "")
body = json.loads(data)
# Output the obtained result data
print("fund code {} name {} estimated value {} estimated change {} estimate time {}".format(
    body["fundcode"], body["name"], body["gsz"], body["gszzl"], body["gztime"]))

# Request the fund page
response = requests.get("http://fund.eastmoney.com/{}.html".format(code))
# Print the encoding detected from the original response
# print(response.apparent_encoding)
# Set the encoding of the response content to avoid garbled console output
response.encoding = "UTF-8"
resp_body = response.text
# Convert and parse the data
soup = BeautifulSoup(resp_body, "lxml")
# There is only one such element, so find is enough: the <dl> tag with class="dataItem02"
dl_con = soup.find("dl", class_="dataItem02")
# Get the update date of the fund's unit net value
value_date = dl_con.find("p").get_text()
# Keep only the date: strip the Chinese label text ("单位净值") and the parentheses
value_date = value_date.replace("单位净值", "").replace("(", "").replace(")", "")
# The net value and the daily change percentage are in the two <span> tags
# under the <dd class="dataNums"> tag
value_con = dl_con.find("dd", class_="dataNums")
data_list = value_con.find_all("span")
val_data = data_list[0].get_text()
per_data = data_list[1].get_text()
print("net value date {} unit net value {} daily change {}".format(value_date, val_data, per_data))
```
Finally, through the operations above, we can obtain the fund's change information.
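For reuse later, the two steps above could be wrapped into a single helper. This is only a minimal sketch under the same assumptions as the code above; query_fund_change is a name I made up, not something from the original series:

```python
import json

import requests
from bs4 import BeautifulSoup


def query_fund_change(code):
    """Return the real-time estimate and the latest unit net value for one fund."""
    # Part 1: real-time estimate from the jsonp API
    resp = requests.get("http://fundgz.1234567.com.cn/js/{}.js".format(code))
    body = json.loads(resp.text.replace("jsonpgz(", "").replace(");", ""))

    # Part 2: unit net value parsed from the fund page
    page = requests.get("http://fund.eastmoney.com/{}.html".format(code))
    page.encoding = "UTF-8"
    soup = BeautifulSoup(page.text, "lxml")
    dl_con = soup.find("dl", class_="dataItem02")
    spans = dl_con.find("dd", class_="dataNums").find_all("span")

    return {
        "fundcode": body["fundcode"],
        "name": body["name"],
        "estimate": body["gsz"],
        "estimate_change": body["gszzl"],
        "estimate_time": body["gztime"],
        "unit_net_value": spans[0].get_text(),
        "daily_change": spans[1].get_text(),
    }


if __name__ == "__main__":
    print(query_fund_change("005585"))
```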
2.2 Capturing fund stage information
The fund's stage information is also captured by parsing the page data with bs4. It is split across three figures: the first shows the stage rise and fall information, while the second and third show the quarterly and annual change information. Because we ultimately want to store the data in a structured way, the first table can be stored in row mode, so that daily changes can be displayed, while the second and third will be stored in column mode and used for statistical queries. The two are parsed differently: the header fields of the first table already exist as fields in the database, so we do not need to care about them, whereas the headers of the second and third tables need to be captured for storage, and the statistical periods themselves are also data we store. Finally, besides the fund's own information, we also need to grab the corresponding CSI 300 data, which will later serve as a benchmark for judging a fund's strength when screening. This part is not particularly difficult; the main work lies in how to parse the data and how to store it afterwards.
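To make the row-mode versus column-mode distinction concrete, here is a rough sketch of the two record shapes; the field names are illustrative assumptions, not the actual schema used in this series:

```python
# Row mode (stage table): one record per fund per day,
# each stage change is its own column
stage_record = {
    "fund_code": "005585",
    "stage_week": None,    # change over the last week, in %
    "stage_month": None,   # change over the last month, in %
    "stage_month3": None,  # ... and similarly for 6 months, YTD, 1/2/3 years
}

# Column mode (quarterly / annual tables): one record per fund per period,
# so a new quarter or year becomes a new row instead of a new column
period_record = {
    "fund_code": "005585",
    "period": "2021-3",    # quarter or year label parsed from the table header
    "change": None,        # change over that period, in %
}
```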
I simply fetch all the table elements of the page, loop through them and print the output, and then grab the data I need at the corresponding index. I will just post the code here to illustrate:
```python
import requests
from bs4 import BeautifulSoup
from prettytable import PrettyTable


# Print a one-row table with the given header
def print_table(head, body):
    tb = PrettyTable()
    tb.field_names = head
    tb.add_row(body)
    print(tb)


# Query the quarterly / annual data from one of the stage tables
def query_year_quarter(data_list, num):
    # The first <tr> holds the <th> header cells
    stage_list = data_list.find_all("tr")[0].find_all("th")
    head_list = []
    for nd in stage_list:
        val = nd.get_text().strip()
        # Normalise the Chinese period labels, e.g. "2021年3季度" -> "2021-3"
        val = val.replace("季度", "").replace("年度", "").replace("年", "-")
        if val:
            # print(nd.get_text())
            head_list.append(val)
    body_list = []
    # Row `num` holds the values we want
    stage_list = data_list.find_all("tr")[num].find_all("td")
    for nd in stage_list:
        val = nd.get_text()
        # Skip the text label cells; the exact strings checked here were lost in
        # the source, "涨幅" and "排名" are assumptions
        if "涨幅" in val or "排名" in val:
            continue
        body_list.append(val.replace("%", ""))
    # Print the table
    print_table(head_list, body_list)


# Query the fund's stage information
# (hsFlag presumably switches to the CSI 300 benchmark row; its handling was lost in the source)
def query_fund_basic(code="005585", hsFlag=False):
    # Fetch and parse the fund page; body_list collects every <table> element
    # (this setup was truncated in the source and is reconstructed here)
    response = requests.get("http://fund.eastmoney.com/{}.html".format(code))
    response.encoding = "UTF-8"
    soup = BeautifulSoup(response.text, "lxml")
    body_list = soup.find_all("table")

    # Column names used when storing the stage table
    stage_head_list = ["stage_week", "stage_month", "stage_month3", "stage_month6",
                       "stage_year", "stage_year1", "stage_year2", "stage_year3"]
    # Table 11 is the stage rise/fall table, 12 the quarterly one, 13 the annual one
    stage_list = body_list[11].find_all("tr")
    num = 3
    tmp_list = []
    for nd in stage_list[num].find_all("td"):
        val = nd.get_text()
        # Skip label cells (same assumption about the lost label strings as above)
        if "涨幅" in val or "排名" in val:
            continue
        tmp_list.append(val.replace("%", ""))
    # Print the stage rise/fall table
    print("\t------ stage rise/fall ------")
    print_table(stage_head_list, tmp_list)
    print("\t------ quarterly rise/fall ------")
    query_year_quarter(body_list[12], num)
    print("\t------ annual rise/fall ------")
    query_year_quarter(body_list[13], num)
```
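The code above reads the stage, quarterly and annual tables at indexes 11, 12 and 13. A quick way to discover those indexes, as described at the start of this section, is to enumerate every table on the page and print a short preview. This is my own sketch, reusing the soup object built in the code of section 2.1:

```python
# Print the index of every <table> on the fund page together with a text preview,
# so you can see which index holds the stage, quarterly and annual tables
for idx, table in enumerate(soup.find_all("table")):
    preview = " ".join(table.get_text().split())[:40]
    print(idx, preview)
```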
3 Presentation of final results
Due to limited space, the full code is not shown in this article; I will maintain the complete content on GitHub in the future.