0x01: Install Python 3
Download the Python installer that matches your system from www.python.org/downloads/w… After the installation completes, run python in CMD to check the Python version.
Note: during installation, check the options to install pip and to add Python to the PATH environment variable; otherwise third-party libraries cannot be installed later.
0x02: Install the common libraries a crawler needs
- Install Selenium (browser automation). Open CMD in any directory and run:
pip install selenium
- Install PyMySQL (MySQL connector). Open CMD in any directory and run:
pip install pymysql
- Install Pillow (image processing):
pip install pillow
Note: Pillow's documentation is at pillow.readthedocs.io/en/latest/I…
- Install pypiwin32 (Windows API bindings):
pip install pypiwin32
- Install Requests:
pip install requests
- Install Scrapy:
pip install scrapy
Note: the Scrapy crawler framework depends on the third-party Twisted library. When installing with pip, the Twisted download may fail, which aborts the whole installation. In that case, download the Twisted installation file first, then install it with pip:
pip install <absolute path to the downloaded Twisted file>
The download address is as follows (pick the .whl file matching your Python version):
www.lfd.uci.edu/~gohlke/pyt…
- Install BeautifulSoup4:
pip install bs4
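Once everything is installed, a quick way to confirm the libraries are importable is to check each one from Python. This is a sketch, not part of the original tutorial: missing_packages is a hypothetical helper, and the package-to-module mapping reflects common conventions (Pillow imports as PIL, pypiwin32 provides win32api on Windows).

```python
import importlib.util

# Map pip package names to the module name each one provides (assumed mapping)
PACKAGES = {
    'selenium': 'selenium',
    'pymysql': 'pymysql',
    'pillow': 'PIL',          # Pillow installs as the PIL module
    'pypiwin32': 'win32api',  # only available on Windows
    'requests': 'requests',
    'scrapy': 'scrapy',
    'bs4': 'bs4',
}

def missing_packages(packages=PACKAGES):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in packages.items()
            if importlib.util.find_spec(module) is None]

if __name__ == '__main__':
    missing = missing_packages()
    if missing:
        print('Missing:', ', '.join(missing))
    else:
        print('All crawler libraries are installed.')
```

If anything prints as missing, re-run the corresponding pip install command above.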
0x03: Verify that Scrapy installed successfully
Open CMD and type scrapy to check whether Scrapy was installed successfully.
0x04: Create a crawler project
Create a Scrapy project called tutorial with a single command:
scrapy startproject tutorial
The directory structure of the Tutorial project looks something like this:
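A freshly generated Scrapy project has roughly this layout (the exact set of files can vary slightly with the Scrapy version):

```
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where spiders live
            __init__.py
```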
In the /tutorial/tutorial directory, execute: scrapy genspider QuoteSpider "www.baidu.com". Here QuoteSpider is the file name and www.baidu.com is the domain to crawl; the command generates a QuoteSpider.py file with a default spider template in the ./tutorial/tutorial/spiders directory.
Modify the QuoteSpider.py file:
```python
import scrapy

class QuotespiderSpider(scrapy.Spider):
    name = 'QuoteSpider'
    # allowed_domains = ['landchina.mnr.gov.cn']
    start_urls = ['http://landchina.mnr.gov.cn/scjy/tdzr/index_1.htm']

    def parse(self, response):
        # Use the last segment of the response URL as the local file name
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            # Save the returned content to that file
            f.write(response.body)
        self.log('Saved file %s.' % fname)  # self.log is optional
```
This code simply crawls one page and saves it to a file.
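The heart of parse() is just two steps: derive a file name from the last URL segment and write the raw bytes to disk. That logic can be exercised without Scrapy; in this sketch, save_body is a hypothetical stand-in for the spider's parse method, not part of the tutorial's code.

```python
import os
import tempfile

def save_body(url, body):
    """Mimic the spider's parse(): name the file after the last
    URL segment and write the response bytes to it."""
    fname = url.split('/')[-1]      # e.g. 'index_1.htm'
    with open(fname, 'wb') as f:
        f.write(body)               # body is raw bytes, like response.body
    return fname

if __name__ == '__main__':
    os.chdir(tempfile.mkdtemp())    # write somewhere disposable
    name = save_body('http://landchina.mnr.gov.cn/scjy/tdzr/index_1.htm',
                     b'<html></html>')
    print('Saved file %s.' % name)  # Saved file index_1.htm.
```

This also shows why the crawled page below ends up as index_1.htm: it is simply the last path segment of the start URL.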
In CMD, change into the tutorial project directory and run the crawler:
scrapy crawl QuoteSpider
Scrapy prints an execution log as it runs.
An index_1.htm file then appears in the tutorial directory; it contains the crawled content.