Python is an easy and efficient way to crawl web data, but it is most often used to crawl static pages with the Requests + BeautifulSoup combination (i.e. pages where the data shown in the browser can be found in the HTML source, rather than being loaded asynchronously by the site via JS or Ajax). That kind of site is easy to crawl. However, on sites where the data is updated by executing JS code, the traditional approach is not well suited. There are a couple of ways to handle this:
Open the browser's developer tools, clear the network panel, refresh the page, and observe the requests the page sends. On some sites you can reconstruct the request parameters this way and call the data interface directly, which greatly simplifies the crawler, but the approach does not apply everywhere (a minimal sketch follows after the next item).
Use Selenium to simulate browser behavior and grab the updated data after each refresh. The rest of this article focuses on this approach.
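Before moving on, here is a minimal sketch of the first approach. The endpoint and the paging parameters below are hypothetical; the real ones are whatever the browser's network panel shows:

import requests

# Hypothetical data interface and paging parameters, as read from the
# browser's network panel; many such interfaces return JSON directly.
resp = requests.get('http://example.com/data/list.action',
                    params={'page': 1, 'pageSize': 20})
print(resp.json())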
First, preparation
Emulating a browser requires two tools:
Selenium, which can be installed with pip install selenium.
PhantomJS, a headless (no-interface), scriptable WebKit browser engine, which you can download from its official website (phantomjs.org). It needs no installation after downloading; you only specify the path to its executable when using it. A quick setup check is sketched right after this list.
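A minimal sketch to verify that both tools are wired up. Note that webdriver.PhantomJS was removed in Selenium 4, so this assumes an older Selenium release of the kind this article uses, and the executable path is just an example:

from selenium import webdriver

# Point executable_path at wherever you unpacked PhantomJS (example path).
driver = webdriver.PhantomJS(executable_path="C:/phantomjs.exe")
driver.get("http://example.com")
print(driver.title)  # prints the page title if everything works
driver.quit()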
Second, use Selenium to simulate the browser
The example site crawled in this article is: datacenter.mep.gov.cn:8099/ths-report/… (the full URL appears in the code below).
Don't crawl too many pages while studying the example; just walk through the process once to learn how the data is grabbed.
When you open the site, you can see that the data to crawl is a regular table, but it spans many pages. On this site, clicking "next page" does not change the URL; instead, the page is updated by executing a piece of JS code. The idea of this article is therefore to use Selenium to simulate a browser click on "next page", let the page data update, and then grab the updated page. Here is the complete code:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import json
import time
from selenium import webdriver
import sys

reload(sys)
sys.setdefaultencoding("utf-8")  # Python 2: allow writing utf-8 text
curpath = sys.path[0]
print curpath

def getData(url):
    # Use the PhantomJS I downloaded. Some people online use Firefox or Chrome
    # instead, but I didn't get those to work.
    driver = webdriver.PhantomJS(executable_path="C:/phantomjs.exe")
    driver.set_page_load_timeout(30)
    driver.get(url[0])
    time.sleep(3)
    for page in range(5):  # crawl only a few pages while studying the example; adjust as needed
        html = driver.page_source  # the page source after the JS has run
        soup = BeautifulSoup(html, 'lxml')
        table = soup.find('table', class_="report-table")
        name = []
        for th in table.find_all('tr')[0].find_all('th'):
            name.append(th.get_text())  # table field names, used as the dictionary keys
        flag = 0  # 0 while still on the header row, 1 once we reach the data rows
        for tr in table.find_all('tr'):
            if flag == 1:
                dic = {}
                i = 0
                for td in tr.find_all('td'):
                    dic[name[i]] = td.get_text()
                    i += 1
                jsonDump(dic, url[1])
            flag = 1
        # Locate "next page" by its link text and click it; the page then updates itself.
        driver.find_element_by_link_text(u"下一页").click()  # "下一页" means "next page"
        time.sleep(2)  # give the JS a moment to refresh the table

def jsonDump(_json, name):
    """Store the json data."""
    with open(curpath + '/' + name + '.json', 'a') as outfile:
        json.dump(_json, outfile, ensure_ascii=False)
    with open(curpath + '/' + name + '.json', 'a') as outfile:
        outfile.write(',\n')

if __name__ == '__main__':
    # url[1] is the output file name. Entering a Chinese name here reported an
    # error; even after adding a u prefix I had to save it and rename the file manually......
    url = ['http://datacenter.mep.gov.cn:8099/ths-report/report!list.action?xmlname=1465594312346', 'yzc']
    getData(url)  # call the function
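One thing to note about the output: because jsonDump appends each record followed by a comma, the resulting file is a comma-separated series of JSON objects rather than one valid JSON document. A minimal sketch of reading it back, assuming the yzc.json name used above:

import json

with open('yzc.json') as f:
    text = f.read().rstrip().rstrip(',')  # drop trailing whitespace and the final comma
records = json.loads('[' + text + ']')   # wrap in brackets to form a JSON array
print(len(records))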
This article uses the driver.find_element_by_link_text method to locate the "next page" link. The reason is that on this page the link tag has no uniquely identifying id or class, and if you locate it with xpath, the xpath of the first page differs from that of the other pages, so an extra if-branch would be needed. Locating the link directly by its text avoids this, and the click() method then simulates a click in the browser. The two strategies compare as follows:
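A short illustration of the two locating strategies (the xpath is hypothetical, only to show the shape of the problem; driver is the PhantomJS driver created in the code above):

# Locate by link text: the "下一页" (next page) link reads the same on every page.
driver.find_element_by_link_text(u"下一页").click()

# Locate by xpath (hypothetical path): brittle here, because the first page and
# the later pages render the pager differently, so an if-branch would be needed
# to pick the right path on each page.
# driver.find_element_by_xpath('//div[@class="pager"]/a[last()]').click()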
Selenium is very powerful. Used in a crawler, it solves many problems an ordinary crawler cannot: it can simulate clicks, mouse movement, form submission, and so on. When you run into site data that is hard to crawl by conventional means, try Selenium + PhantomJS.
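As a small taste of those capabilities, a minimal sketch (the form page and field names are hypothetical):

from selenium import webdriver
from selenium.webdriver import ActionChains

driver = webdriver.PhantomJS(executable_path="C:/phantomjs.exe")
driver.get("http://example.com/login")  # hypothetical form page

# Fill in and submit a form (hypothetical field names).
driver.find_element_by_name("username").send_keys("user")
driver.find_element_by_name("password").send_keys("secret")
driver.find_element_by_name("password").submit()

# Simulate mouse movement followed by a click (hypothetical link text).
link = driver.find_element_by_link_text("Next")
ActionChains(driver).move_to_element(link).click(link).perform()

driver.quit()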