0x01: Install Python 3
Download the Python installer that matches your system from www.python.org/downloads/w… After the installation completes, run python in CMD to check the Python version.
Note: during installation, check the options to install pip and to add Python to the PATH environment variable; otherwise third-party libraries cannot be installed later.
0x02: Install the common libraries a crawler needs
- Install Selenium (browser automation). Open CMD in any directory and run:
pip install selenium
- Install PyMySQL (MySQL connector). Open CMD in any directory and run:
pip install pymysql
- Install Pillow (image processing):
pip install pillow
Note: Pillow's documentation is at pillow.readthedocs.io/en/latest/I…
- Install pypiwin32 (Windows API bindings):
pip install pypiwin32
- Install Requests:
pip install requests
- Install Scrapy:
pip install scrapy
Note: the Scrapy crawler framework depends on the third-party Twisted library. When installing with pip, the Twisted download may fail, which aborts the whole installation. In that case, download the Twisted installation file first, then install it with pip:
pip install <absolute path to the downloaded Twisted file>
The download address is as follows (pick the .whl file matching your Python version):
www.lfd.uci.edu/~gohlke/pyt…
- Install BeautifulSoup4:
pip install bs4
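Once everything is installed, a quick way to confirm the libraries are importable is to check each one from Python. This is a sketch, not part of the original tutorial: missing_packages is a hypothetical helper, and the package-to-module mapping reflects common conventions (Pillow imports as PIL, pypiwin32 provides win32api on Windows).

```python
import importlib.util

# Map pip package names to the module name each one provides (assumed mapping)
PACKAGES = {
    'selenium': 'selenium',
    'pymysql': 'pymysql',
    'pillow': 'PIL',          # Pillow installs as the PIL module
    'pypiwin32': 'win32api',  # only available on Windows
    'requests': 'requests',
    'scrapy': 'scrapy',
    'bs4': 'bs4',
}

def missing_packages(packages=PACKAGES):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in packages.items()
            if importlib.util.find_spec(module) is None]

if __name__ == '__main__':
    missing = missing_packages()
    if missing:
        print('Missing:', ', '.join(missing))
    else:
        print('All crawler libraries are installed.')
```

If anything prints as missing, re-run the corresponding pip install command above.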
0x03: Verify that Scrapy installed successfully
Open CMD and type scrapy to check whether Scrapy was installed successfully.
0x04: Create a crawler project
Create a Scrapy project called tutorial with a single command:
scrapy startproject tutorial
The directory structure of the Tutorial project looks something like this:
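A freshly generated Scrapy project has roughly this layout (the exact set of files can vary slightly with the Scrapy version):

```
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where spiders live
            __init__.py
```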
In the /tutorial/tutorial directory, execute: scrapy genspider QuoteSpider "www.baidu.com". Here QuoteSpider is the file name and www.baidu.com is the domain to crawl; the command generates a QuoteSpider.py file with a default spider template in the ./tutorial/tutorial/spiders directory.
Modify the QuoteSpider.py file:
```python
import scrapy

class QuotespiderSpider(scrapy.Spider):
    name = 'QuoteSpider'
    # allowed_domains = ['landchina.mnr.gov.cn']
    start_urls = ['http://landchina.mnr.gov.cn/scjy/tdzr/index_1.htm']

    def parse(self, response):
        # Use the last segment of the response URL as the local file name
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            # Save the returned content to that file
            f.write(response.body)
        self.log('Saved file %s.' % fname)  # self.log is optional
```
This code simply crawls one page and saves it to a file.
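The heart of parse() is just two steps: derive a file name from the last URL segment and write the raw bytes to disk. That logic can be exercised without Scrapy; in this sketch, save_body is a hypothetical stand-in for the spider's parse method, not part of the tutorial's code.

```python
import os
import tempfile

def save_body(url, body):
    """Mimic the spider's parse(): name the file after the last
    URL segment and write the response bytes to it."""
    fname = url.split('/')[-1]      # e.g. 'index_1.htm'
    with open(fname, 'wb') as f:
        f.write(body)               # body is raw bytes, like response.body
    return fname

if __name__ == '__main__':
    os.chdir(tempfile.mkdtemp())    # write somewhere disposable
    name = save_body('http://landchina.mnr.gov.cn/scjy/tdzr/index_1.htm',
                     b'<html></html>')
    print('Saved file %s.' % name)  # Saved file index_1.htm.
```

This also shows why the crawled page below ends up as index_1.htm: it is simply the last path segment of the start URL.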
In CMD, change into the tutorial project directory and run the crawler:
scrapy crawl QuoteSpider
Scrapy prints an execution log as it runs.
An index_1.htm file then appears in the tutorial directory; it contains the crawled content.