Summary: Now that we have covered the basics of crawlers, the next step is to write them with a framework, which makes the job much simpler. In this post we will get to know the PySpider framework; once you understand it, writing crawlers is nothing to worry about.
Preparation:
1. Install PySpider: pip3 install pyspider
2. Install PhantomJS: download it from the official website and extract phantomjs.exe into the Scripts folder of your Python installation.
Download address: https://phantomjs.org/download.html
The official API address: http://www.pyspider.cn/book/pyspider/self.crawl-16.html
2. Usage (only a brief introduction here; see the official documentation for more):
1. Start PySpider first
Type pyspider all in the command window (the black window) and you should see the following.
It tells us to open http://localhost:5000 in a browser. (Be careful not to close this window!)
Open that address and click Create on the right to create a project: Project Name is the name of the project,
and Start URL(s) is the address you want to crawl. Click Create to create the project.
2. Understand the format and process of PySpider
After creation, we should see the following:
The Handler class is the core of a PySpider project. It inherits from BaseHandler, which provides all of the crawling functionality, so a single Handler is enough to drive the whole crawl.
crawl_config = {} holds the global configuration for every request, such as the headers you learned to define earlier.
@every(minutes=24 * 60): the every decorator sets the crawl interval; minutes=24 * 60 means the crawl runs once a day.
on_start method: the entry point of the crawl; it sends requests by calling the crawl method.
callback=self.index_page: callback specifies the callback function, i.e. the response of the request is passed to index_page for processing.
@config(age=10 * 24 * 60 * 60) sets the validity period of the task to 10 days, i.e. the same task will not be crawled again within 10 days.
index_page function: this is the processing function; the response parameter is the content returned by the requested page.
The rest should be easy to follow: all links are extracted with the PyQuery parser (which PySpider can use directly), each link is requested in turn, and detail_page is called to process the responses and pull out what we want.
@config(priority=2): priority sets the crawl priority of the task; if not set, the default is 0.
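To tie the walkthrough together, here is a rough sketch of what the generated script looks like (the Maoyan URL and the headers entry in crawl_config are illustrative additions, not part of the default template):
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    # Global configuration applied to every request, e.g. default headers (illustrative values).
    crawl_config = {
        'headers': {
            'User-Agent': 'Mozilla/5.0',
        },
    }

    # Run the crawl once a day.
    @every(minutes=24 * 60)
    def on_start(self):
        # Entry point: send the first request and hand the response to index_page.
        self.crawl('http://maoyan.com/board/4', callback=self.index_page)

    # A task stays valid for 10 days and will not be re-crawled within that time.
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # response.doc is a PyQuery object; request every link found on the page.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    # Higher-priority tasks are scheduled first (the default priority is 0).
    @config(priority=2)
    def detail_page(self, response):
        return {
            'url': response.url,
            'title': response.doc('title').text(),
        }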
Other arguments to crawl:
exetime: the time at which the task should be executed, as a Unix timestamp. For example, the following schedules the task to run one hour from now:
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, exetime=time.time()+60*60)
retries: the number of retries; the default is 3.
auto_recrawl: when set to True, the task is crawled again every time it expires, i.e. when its age runs out.
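For example (a sketch; the age value simply reuses the 10-day setting from above):
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, age=10 * 24 * 60 * 60, auto_recrawl=True)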
method: the HTTP request method; the default is GET.
params: GET query parameters, passed as a dictionary:
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, params={'a': '123', 'b': '456'})
data: POST form data, also passed as a dictionary.
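For example (a sketch; the field names are placeholders):
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, method='POST', data={'a': '123', 'b': '456'})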
files: files to upload:
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, method='POST', files={'filename': 'xxx'})
user_agent: the User-Agent of the request.
headers: request headers, as a dictionary.
cookies: cookies, as a dictionary.
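For example (a sketch; the header and cookie values are placeholders):
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, user_agent='Mozilla/5.0', headers={'Referer': 'http://maoyan.com/'}, cookies={'name': 'value'})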
connect_timeout: the maximum time to wait while establishing the connection; the default is 20 seconds.
timeout: the maximum time to wait while fetching the page; the default is 120 seconds.
proxy: the proxy to crawl through, in the form username:password@hostname:port (only HTTP proxies are supported).
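For example (a sketch; the proxy address is a placeholder):
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, proxy='127.0.0.1:8888')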
fetch_type: set to 'js' to fetch JavaScript-rendered pages; PhantomJS is required.
js_script: a JavaScript script of your own to run on the page, for example:
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, js_script='''
function() {
    alert('123')
}
''')
js_run_at: used together with js_script above; sets whether the script runs at the start or the end of the document load (the default is the end).
load_images: whether to load images when fetching JavaScript pages; the default is False.
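Putting the JavaScript-related options together (a sketch; the script is just the alert example from above):
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, fetch_type='js', js_run_at='document-end', load_images=False, js_script='''
function() {
    alert('123')
}
''')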
save: passes parameters between different callback methods:
def on_start(self):
    self.crawl('http://maoyan.com/board/4', callback=self.index_page, save={'a': '1'})

def index_page(self, response):
    return response.save['a']
3. After saving, the project looks like this
status shows the project status:
TODO: the project has just been created.
STOP: the project is stopped.
CHECKING: a running project has been modified and is waiting to be checked.
DEBUG/RUNNING: both actually run the project; DEBUG is meant for testing.
PAUSE: the project is paused automatically after repeated errors.
How do I delete a project?
Set its group to delete and its status to STOP; the system deletes it automatically 24 hours later.
actions:
Run starts the crawl.
Active Tasks shows recent requests.
Results shows the crawl results.
rate/burst:
1/3 means a rate of one request per second, with a burst (concurrency allowance) of 3.
Progress: indicates the crawl progress.
4. That's it for now.
(You can also add your own methods to suit your needs. Examples of PySpider crawls will follow.)