Summary: Now that we have covered the basics of crawlers, the next step is to write them with a framework, which makes the job much simpler. In this post we will get to know the PySpider framework; once you understand it, writing crawlers is nothing to worry about.
Preparation:
1. Install PySpider: pip3 install pyspider
2. Install PhantomJS: download it from the official website and extract phantomjs.exe into the Scripts folder of your Python installation.
Download address: https://phantomjs.org/download.html
The official API address: http://www.pyspider.cn/book/pyspider/self.crawl-16.html
2. Usage (only a brief introduction here; see the official documentation for more):
1. Start PySpider first
Type pyspider all in the command window (the black window) and you should see the following.
It tells us to open http://localhost:5000 in a browser. (Be careful not to close this window!)
Open that address and click Create on the right to create a project: Project Name is the name of the project,
and Start URL(s) is the address you want to crawl. Click Create to create the project.
2. Understand the format and process of PySpider
After creation, we should see the following:
The Handler class is the core of a PySpider project. It inherits from BaseHandler, which provides all of the crawling functionality, so a single Handler is enough to drive the whole crawl.
crawl_config = {} holds the global configuration for every request, such as the headers you learned to define earlier.
@every(minutes=24 * 60): the every decorator sets the crawl interval; minutes=24 * 60 means the crawl runs once a day.
on_start method: the entry point of the crawl; it sends requests by calling the crawl method.
callback=self.index_page: callback specifies the callback function, i.e. the response of the request is passed to index_page for processing.
@config(age=10 * 24 * 60 * 60) sets the validity period of the task to 10 days, i.e. the same task will not be crawled again within 10 days.
index_page function: this is the processing function; the response parameter is the content returned by the requested page.
The rest should be easy to follow: all links are extracted with the PyQuery parser (which PySpider can use directly), each link is requested in turn, and detail_page is called to process the responses and pull out what we want.
@config(priority=2): priority sets the crawl priority of the task; if not set, the default is 0.
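To tie the walkthrough together, here is a rough sketch of what the generated script looks like (the Maoyan URL and the headers entry in crawl_config are illustrative additions, not part of the default template):
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    # Global configuration applied to every request, e.g. default headers (illustrative values).
    crawl_config = {
        'headers': {
            'User-Agent': 'Mozilla/5.0',
        },
    }

    # Run the crawl once a day.
    @every(minutes=24 * 60)
    def on_start(self):
        # Entry point: send the first request and hand the response to index_page.
        self.crawl('http://maoyan.com/board/4', callback=self.index_page)

    # A task stays valid for 10 days and will not be re-crawled within that time.
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # response.doc is a PyQuery object; request every link found on the page.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    # Higher-priority tasks are scheduled first (the default priority is 0).
    @config(priority=2)
    def detail_page(self, response):
        return {
            'url': response.url,
            'title': response.doc('title').text(),
        }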
Other arguments to crawl:
exetime: the time at which the task should be executed, as a Unix timestamp. For example, the following schedules the task to run one hour from now:
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, exetime=time.time()+60*60)
retries: the number of retries; the default is 3.
auto_recrawl: when set to True, the task is crawled again every time it expires, i.e. when its age runs out.
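For example (a sketch; the age value simply reuses the 10-day setting from above):
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, age=10 * 24 * 60 * 60, auto_recrawl=True)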
method: the HTTP request method; the default is GET.
params: GET query parameters, passed as a dictionary:
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, params={'a': '123', 'b': '456'})
data: POST form data, also passed as a dictionary.
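For example (a sketch; the field names are placeholders):
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, method='POST', data={'a': '123', 'b': '456'})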
files: files to upload:
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, method='POST', files={'filename': 'xxx'})
user_agent: the User-Agent of the request.
headers: request headers, as a dictionary.
cookies: cookies, as a dictionary.
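For example (a sketch; the header and cookie values are placeholders):
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, user_agent='Mozilla/5.0', headers={'Referer': 'http://maoyan.com/'}, cookies={'name': 'value'})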
connect_timeout: the maximum time to wait while establishing the connection; the default is 20 seconds.
timeout: the maximum time to wait while fetching the page; the default is 120 seconds.
proxy: the proxy to crawl through, in the form username:password@hostname:port (only HTTP proxies are supported).
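For example (a sketch; the proxy address is a placeholder):
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, proxy='127.0.0.1:8888')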
fetch_type: set to 'js' to fetch JavaScript-rendered pages; PhantomJS is required.
js_script: a JavaScript script of your own to run on the page, for example:
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, js_script='''
function() {
    alert('123')
}
''')
js_run_at: used together with js_script above; sets whether the script runs at the start or the end of the document load (the default is the end).
load_images: whether to load images when fetching JavaScript pages; the default is False.
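Putting the JavaScript-related options together (a sketch; the script is just the alert example from above):
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, fetch_type='js', js_run_at='document-end', load_images=False, js_script='''
function() {
    alert('123')
}
''')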
save: passes parameters between different callback methods:
def on_start(self):
    self.crawl('http://maoyan.com/board/4', callback=self.index_page, save={'a': '1'})

def index_page(self, response):
    return response.save['a']
3. After saving, the project looks like this
status shows the project status:
TODO: the project has just been created.
STOP: the project is stopped.
CHECKING: a running project has been modified and is waiting to be checked.
DEBUG/RUNNING: both actually run the project; DEBUG is meant for testing.
PAUSE: the project is paused automatically after repeated errors.
How do I delete a project?
Set its group to delete and its status to STOP; the system deletes it automatically 24 hours later.
actions:
Run starts the crawl.
Active Tasks shows recent requests.
Results shows the crawl results.
rate/burst:
1/3 means a rate of one request per second, with a burst (concurrency allowance) of 3.
Progress: indicates the crawl progress.
4. That's it for now.
(You can also add your own methods to suit your needs. Examples of PySpider crawls will follow.)