Data Science Club

Chinese data scientist community

Author: Suke, learning Python crawlers and data analysis from scratch

Blog: www.makcyun.top

Abstract: Many web pages today are rendered dynamically with JavaScript, including pages that use Ajax. Some Ajax pages encrypt their interface parameters, so the data cannot be requested directly; Taobao is one example. Other dynamic pages use JavaScript but not Ajax, such as the Echarts website. When you run into these two kinds of pages you need a different approach, and one clean, direct and easy-to-use option is Selenium. The financial statement pages of Eastmoney.com are also loaded dynamically with JavaScript, so this article uses Selenium to crawl the financial statement data of listed companies from that site.

1. Practical background
2. Web page analysis
3. Selenium basics
4. Coding implementation
 4.1. Idea
 4.2. Crawl a single-page table
 4.3. Paging crawl
 4.4. Universal crawler construction
 4.5. Complete code

1. Practical background

Many websites provide financial and investment information on listed companies, such as Tencent Finance, NetEase Finance, Sina Finance and Eastmoney (Oriental Fortune). Among them, the data on Eastmoney's site is particularly complete.

Eastmoney.com has a data center at data.eastmoney.com/center/, where the data…

Take the annual/quarterly report category as an example. Click into it and select the 2018 interim report (see the figure below); under this category there are 7 statements, including the performance report, the performance express report and the income statement. Take the performance report as an example: it contains the performance data of all 3,000-plus stocks, spread over more than 70 pages.

Suppose we want to get the performance data of all stocks for the 2018 interim period and then run some analysis on it. Scraping 70-odd pages by manual copy-and-paste is still doable. But if you want data for any year, any quarter or any statement, copying by hand quickly becomes an enormous job. For example, to capture all 7 statements over a 10-year period (40 quarters), the amount of manual copying would be roughly 40 × 7 × 70 = 19,600 page copies, about 20,000 in total (each statement runs to roughly 70 pages). That is practically impossible to do by hand. The goal of this article is therefore to use Selenium automation to crawl any financial statement under the annual/quarterly report category, for any period available on the site. Isn't it nice that all we need to do is type in a few characters, hand the rest over to the computer, and later open Excel to find the data already sitting there?

All right, let's get started. First, we need to analyze the web page we want to crawl.

2. Web page analysis

We have crawled tabular data before, so its structure should be familiar. If you have forgotten, take another look at this article: www.makcyun.top/web_scrapin…

Let's take the 2018 performance report above as an example and look at the form of the table.

The URL is data.eastmoney.com/bbsj/201806… Here bbsj stands for the annual/quarterly report category, 201803 denotes the 2018 Q1 report and 201806 denotes the 2018 interim report; lrb stands for the income statement and yjbb for the performance report. As you can see, the URL format is very simple and easy to construct.
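Based on this pattern, a URL can be assembled from the period and the statement abbreviation. A minimal sketch (the variable names are just for illustration):

```python
# URL pattern observed above: period (YYYYMM) + statement abbreviation
period = '201806'   # 2018 interim report
report = 'lrb'      # income statement
url = 'http://data.eastmoney.com/bbsj/{}/{}.html'.format(period, report)
print(url)          # -> http://data.eastmoney.com/bbsj/201806/lrb.html
```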

Next, click the Next Page button: the table refreshes but the URL does not change, so we can tell the page is rendered with JavaScript. Let's first check whether it is loaded via Ajax. The method is simple: right-click and inspect (or press F12), switch to the Network tab, select XHR, then press F5 to refresh. Only one Ajax request appears, and clicking Next Page does not generate any new Ajax requests. So this page is not the kind that fires a new Ajax request on each page turn or scroll, and we cannot paginate the crawl by constructing Ajax URLs.

Since the XHR tab does not show the request we need, let's look for the table's data request under JS. Switch the filter to JS and press F5 again; a large number of JS requests appear. Click Next Page a few times and a new request pops up each time. Its URL is very long and looks complicated. OK, we will stop the analysis here.

As you can see, crawling this dynamic page by analyzing its background requests is fairly involved. Is there a clean, straightforward way to grab the table contents? There is: the Selenium method, introduced next.

3. Selenium basics

What is Selenium? In a word, an automated testing tool. It was born for testing, but in the web-crawling field that has become so popular in recent years it has turned into a powerful weapon for crawlers. Put bluntly, Selenium drives a browser and "surfs the web" the way a person would: it can turn pages automatically, log in to websites, send emails, download images/music/videos, and so on. For example, with just a few lines of Python, Selenium can log in to a site such as IT Juzi and then browse its pages.
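As a taste of what that looks like, here is a minimal, hedged sketch of driving a browser with Selenium. The site, form selectors and credentials are hypothetical placeholders, not the actual IT Juzi login flow:

```python
from selenium import webdriver
import time

browser = webdriver.Chrome()
browser.get('https://example.com/login')   # hypothetical login page

# the selectors and credentials below are placeholders for illustration only
browser.find_element_by_css_selector('#username').send_keys('my_user')
browser.find_element_by_css_selector('#password').send_keys('my_password')
browser.find_element_by_css_selector('button[type="submit"]').click()

time.sleep(2)            # crude wait for the post-login page to load
print(browser.title)     # we are now browsing as a logged-in user
browser.quit()
```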

Isn’t it amazing that you can automatically access the Internet with just a few lines of code? Of course, this is just the simplest function of Selenium. There are many more rich operations available in the following tutorials:

Reference sites:

Selenium official documentation: selenium-python.readthedocs.io/

Selenium Python documentation (English): selenium-python.readthedocs.org/index.html

Selenium Python documentation (Chinese): selenium-python-zh.readthedocs.io/en/latest/f…

Selenium basic operations: www.yukunweb.com/2017/7/pyth…

Selenium crawler tutorial: cuiqingcai.com/2852.html

Just remember one important thing: with Selenium, what you can see, you can crawl. That is, Selenium can scrape almost anything you can see in a browser, including the Eastmoney.com financial statements mentioned above, and it does so simply and directly, without having to figure out what JavaScript technique or Ajax parameters the page uses behind the scenes. Let's put it into practice.

4. Coding implementation

4.1. Idea

  • Install and configure the environment Selenium needs. The browser can be Chrome, Firefox, PhantomJS, etc.; I use Chrome.
  • The financial statement data on Eastmoney.com can be accessed without logging in, which makes Selenium crawling easier.
  • First, take the financial statement on a single page as an example. The table structure is simple, so we can locate the whole table first and then grab the contents of all its td nodes (the table cells) in one go.
  • Then crawl the data of all listed companies page by page in a loop and save it as a CSV file.
  • Finally, build a flexible URL so that we can crawl the data of any statement for any period.

Following this plan, the code below implements it step by step.

4.2. Crawl a single-page table

We first take the income statement of the 2018 interim report as an example and grab the table data on its first page. URL: data.eastmoney.com/bbsj/201806…

Looking at the page source, the entire table sits in a node whose id is dt_1, so we can locate it directly by that id:

```python
from selenium import webdriver

browser = webdriver.Chrome()
# Headless mode (optional): 1 use Chrome headless, 2 use PhantomJS
# chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('--headless')
# browser = webdriver.Chrome(chrome_options=chrome_options)
# browser = webdriver.PhantomJS()
# browser.maximize_window()  # maximize the window (optional)

browser.get('http://data.eastmoney.com/bbsj/201806/lrb.html')
element = browser.find_element_by_css_selector('#dt_1')   # locate the table node
td_content = element.find_elements_by_tag_name('td')      # all td cells of the table
lst = []                                                   # store the cell text in a list
for td in td_content:
    lst.append(td.text)
print(lst)
```

Here we construct a WebDriver object with the Chrome browser, assign it to the variable browser, and call its get() method to request the page we want to grab. Then we use find_element_by_css_selector to find the node the table sits in: '#dt_1'.

By the way, SelectorGadget, a handy Chrome extension for quickly finding CSS/XPath selectors, saves you from locating nodes in the source code by hand.

Plug-in address: chrome.google.com/webstore/de…

Next we drill down to the td nodes. Since there are many td nodes on the page, we use the find_elements method, then iterate over them and store each cell's text in a list. Print it to see the result:

```python
# the list
['1', '002161', 'yuanwang valley', ..., '79.6 million', '09-29',
 '2', '002316', 'the union', ..., '179 million', '09-29',
 '3', ...,
 '50', '002683', 'Grand Explosion', ..., '137 million', '09-01']
```

Isn't that convenient? A few lines of code grab the whole page of the table; the only drawback is that it is a bit slow.

For later storage, we convert the list to a DataFrame. First, the big flat list needs to be split into sub-lists, one per row, as follows:

```python
import pandas as pd

# determine the number of columns from the td count of the first row
col = len(element.find_elements_by_css_selector('tr:nth-child(1) td'))
# split the flat list into sub-lists of col values each (one per row)
lst = [lst[i:i + col] for i in range(0, len(lst), col)]

# the "details" link in each row points to more detailed data; extract its url
lst_link = []
links = element.find_elements_by_css_selector('#dt_1 a.red')
for link in links:
    url = link.get_attribute('href')
    lst_link.append(url)
lst_link = pd.Series(lst_link)

# convert the list of rows to a DataFrame and add the url column
df_table = pd.DataFrame(lst)
df_table['url'] = lst_link
print(df_table.head())
```

To split the list into sub-lists we only need to know how many columns the table has, then cut off a sub-list every that many values. Counting the columns of this table gives 16, but that number should not be hard-coded: statements other than the income statement do not have 16 columns, so a fixed value would break when we crawl the other tables later. Instead, find_elements_by_css_selector is used again to count the td nodes of the first row, which gives the column count, and the list is split accordingly. In addition, the "details" link in each row of the original page leads to more detailed data, so we extract that URL and add it as an extra column of the DataFrame for later reference. Print to see the output:

As you can see, we’ve captured all the data in the table, and now we just need to do a paging loop.

Note that the table header is not captured here, because the header contains merged cells that are cumbersome to handle; it is easier to paste the header in Excel after the table has been crawled. If you really want to handle it in code, see this article: blog.csdn.net/weixin_3946…
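If you prefer to keep everything in the script, one simple workaround is to type the header names yourself and assign them to the DataFrame before saving. A minimal sketch, with hypothetical column names and file name:

```python
# hypothetical header names typed by hand; adjust to the statement you crawl
columns = ['rank', 'code', 'name', 'eps', 'revenue', 'revenue_yoy',
           'net_profit', 'net_profit_yoy', 'announce_date', 'url']

# only assign if the count matches the actual number of columns in df_table
if len(columns) == df_table.shape[1]:
    df_table.columns = columns
df_table.to_csv('lrb_201806_page1.csv', index=False, encoding='utf-8-sig')
```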

4.3. Paging crawl

Now that we’ve done single-page table crawls, let’s implement paging crawls.

First we use Selenium to simulate the page-turning operation; once that works, we crawl the table on each page.

```python
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import time

browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)   # explicit wait of up to 10 seconds

def index_page(page):
    try:
        browser.get('http://data.eastmoney.com/bbsj/201806/lrb.html')
        print('Crawling page: %s' % page)
        # wait until the table has loaded
        wait.until(EC.presence_of_element_located((By.ID, 'dt_1')))
        if page > 1:
            # locate the page-number input box, clear it and enter the target page
            input = wait.until(EC.presence_of_element_located(
                (By.XPATH, '//*[@id="PageContgopage"]')))
            input.click()
            input.clear()
            input.send_keys(page)
            submit = wait.until(EC.element_to_be_clickable(
                (By.CSS_SELECTOR, '#PageCont > a.btn_link')))
            submit.click()
            time.sleep(2)
        # confirm that the current page number matches the requested page
        wait.until(EC.text_to_be_present_in_element(
            (By.CSS_SELECTOR, '#PageCont > span.at'), str(page)))
    except Exception:
        return None

def main():
    for page in range(1, 5):
        index_page(page)

if __name__ == '__main__':
    main()
```

Here we import the packages and create a WebDriverWait object that gives the page up to 10 seconds to load the table. The EC.presence_of_element_located condition is used to check whether the table has loaded. Once it has, we check the page number: for page 1 we just wait for the table to finish loading; for pages greater than 1 we jump to that page.

To perform the jump, we grab the input node of the page-number box, clear it with the clear() method, type the target page number with send_keys(), and then click the jump button with submit.click() to complete the page jump.

Here we test the jump over the first 4 pages and can see that the page jumps successfully. Now the first-page crawling method can be applied to every page: grab each page's table, convert it to a DataFrame and store it in a CSV file.
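As a rough illustration of how the two parts might be combined, here is a hedged sketch; the parse_table helper and the output file name are my own naming, not taken from the original complete code:

```python
import pandas as pd

def parse_table():
    # reuse the single-page extraction from section 4.2
    element = browser.find_element_by_css_selector('#dt_1')
    td_content = element.find_elements_by_tag_name('td')
    lst = [td.text for td in td_content]
    col = len(element.find_elements_by_css_selector('tr:nth-child(1) td'))
    rows = [lst[i:i + col] for i in range(0, len(lst), col)]
    return pd.DataFrame(rows)

def main():
    for page in range(1, 5):
        index_page(page)    # jump to the page (section 4.3)
        df = parse_table()  # grab its table
        # append every page to one CSV file; the header is pasted in Excel later
        df.to_csv('lrb_201806.csv', mode='a', index=False, header=False,
                  encoding='utf-8-sig')

if __name__ == '__main__':
    main()
```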

4.4. Universal crawler construction

Above, we crawled one page: the income statement of the 2018 interim report at data.eastmoney.com/bbsj/201806… But what if we want the table of any statement for any period, say the income statement for Q3 2017, the performance report for the full year 2016, or the cash flow statement for Q1 2015? The code above cannot do that, so let's make it more general. As the figure shows, the annual/quarterly report category on Eastmoney.com contains 7 statements, and the financial statements go back to 2007 at the earliest, published quarterly. Along these two dimensions we can reconstruct the URL and then crawl the table data. Let's implement that in code:

```python
# 1. set the period of the table to fetch
year = int(float(input('Please enter the year to query (2007-2018):\n')))
while (year < 2007 or year > 2018):
    year = int(float(input('Invalid year, please re-enter:\n')))
quarter = int(float(input('Please enter the quarter (1: Q1 report, 2: interim report, 3: Q3 report, 4: annual report):\n')))
while (quarter < 1 or quarter > 4):
    quarter = int(float(input('Invalid quarter, please re-enter:\n')))

# convert the quarter to the two-digit month used in the url (pad with 0 if needed)
quarter = '{:02d}'.format(quarter * 3)
# quarter = '%02d' % (int(quarter) * 3)   # alternative way
date = '{}{}'.format(year, quarter)

# 2. set the type of financial statement
tables = int(input(
    'Please enter the number of the statement type to query (1-performance report; '
    '2-performance express report; 3-performance forecast; 4-scheduled disclosure timetable; '
    '5-balance sheet; 6-income statement; 7-cash flow statement):'))
dict_tables = {1: 'performance report', 2: 'performance express report', 3: 'performance forecast',
               4: 'scheduled disclosure timetable', 5: 'balance sheet',
               6: 'income statement', 7: 'cash flow statement'}
dict_category = {1: 'yjbb', 2: 'yjkb/13', 3: 'yjyg',
                 4: 'yysj', 5: 'zcfz', 6: 'lrb', 7: 'xjll'}
category = dict_category[tables]

# 3. set the url
url = 'http://data.eastmoney.com/{}/{}/{}.html'.format('bbsj', date, category)
print(url)   # check the constructed url
```


With this setup, we can enter the desired period and statement type and get back the corresponding URL. Feeding that URL into the crawler above lets us crawl the corresponding statement.

Besides crawling from the first page to the last, we can also choose how many pages to crawl, for example starting at page 1 and crawling only the first 10 pages.

```python
# enter the number of pages to download; press Enter to download all pages
nums = input('Please enter the number of pages to download (press Enter to download all):\n')

# determine the last page: locate the page-number node instead of hard-coding it,
# because the page count differs between periods and statements
browser.get(url)
try:
    page = browser.find_element_by_css_selector('.next+ a')   # node after the "next" button
except Exception:
    page = browser.find_element_by_css_selector('.at+ a')
    # print('page-number node not found')
end_page = int(page.text)

start_page = 1
if nums.isdigit():
    end_page = start_page + int(nums)
elif nums == '':
    end_page = end_page
else:
    print('Invalid number of pages')
# print the download info
print('Ready to download: {}-{}'.format(date, dict_tables[tables]))
```

With these settings in place, we can crawl the table of any statement for any period and any number of pages. Tidying the code up a little gives the crawler effect shown in the video screenshot below; a rough sketch of the tidied-up driver comes first.
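The tidied-up driver might look roughly like the following sketch. The parse_table helper and the CSV file naming are my own assumptions, stitched together from the pieces above rather than copied from the complete code:

```python
def main():
    # note: index_page should load the constructed `url` here rather than the
    # hard-coded income-statement address used in section 4.3
    for page in range(start_page, end_page):
        index_page(page)     # jump to the target page
        df = parse_table()   # extract the table (see the sketch in section 4.3)
        # write all pages of this statement into one CSV file
        filename = '{}_{}.csv'.format(date, dict_tables[tables])
        df.to_csv(filename, mode='a', index=False, header=False, encoding='utf-8-sig')

if __name__ == '__main__':
    main()
    browser.quit()   # close the browser when everything is downloaded
```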

Video screenshot:

The matrix code rain in the background is actually a dynamic web page effect. The footage comes from this website, which also has a lot of cool dynamic backgrounds to download.

Here, I’ve downloaded some of the reports from all the public companies.

2018 Intermediate Report Performance Statement:

Income Statement of 2017 Annual Report:

The crawler could be improved further; for example, it could also crawl listed companies' announcements, or be set to crawl the data of any single company (or a few companies, or an industry) instead of all of them.

One more caveat: Selenium crawling is slow and memory-hungry, so it is usually better to try the requests library first and fall back to Selenium only when requests cannot get the data. At the beginning of this article, during the web page analysis, we took a preliminary look at the JS request behind the table. Can the table data we need be found in that request? We will try crawling it in a different way in a follow-up article.

To obtain the complete code of this article, scan the QR code below to follow "Data Science Club" and reply "Report".
