Scrapy is an asynchronous processing framework built on Twisted and implemented in pure Python. You only need to customize a few modules to build a crawler that scrapes web content and all kinds of images, which is very convenient.

I needed crawler technology for work, so I looked into Scrapy and found it quite good and very well suited to crawler beginners. Let's get to know it together.
Installation
Windows installation
Install pywin32
- Go to sourceforge.net/projects/py… or github.com/mhammond/py… and download the pywin32 package for your version of Python
- Run the installer and click through the defaults
- Check the result: in a Python shell, run

```python
import win32api
```

If no error message appears, the installation succeeded
Install Twisted
- Go to www.lfd.uci.edu/~gohlke/pyt… and download the Twisted and lxml wheel files, then install each one:

```
pip install ******.whl
```

(If pip is not installed yet, install it first.)
- Check that pip is available:

```
pip --version
```
Install Scrapy

From the command line:

```
pip install scrapy
```
Learning the Scrapy framework
Framework diagram
Directory structure

- items.py: the data models for scraped items
- middlewares.py: middleware
- pipelines.py: persists the collected data
- settings.py: the crawler's configuration
- scrapy.cfg: the project's configuration file
- spiders/: the crawler scripts
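For reference, a freshly generated project (named myproject here as an arbitrary example) looks roughly like this:

```
myproject/
    scrapy.cfg            # project configuration / deployment file
    myproject/
        __init__.py
        items.py          # data models
        middlewares.py    # middleware
        pipelines.py      # data persistence
        settings.py       # crawler configuration
        spiders/          # crawler scripts
            __init__.py
```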
Basic usage
1. Create a new project:

```
scrapy startproject <project name>
```
2. scrapy.cfg is the project's package/deployment configuration file
3. Create a crawler (the skeleton it generates is shown after the notes below):

```
scrapy genspider <name> <domain>
```
4. Notes:
- The crawler name cannot be the same as the project name
- The domain restricts which website the spider is allowed to crawl
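For reference, this is roughly the skeleton that `scrapy genspider` produces with its default basic template (the names here are placeholders):

```python
# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'                      # crawler name, used by `scrapy crawl`
    allowed_domains = ['example.com']     # requests outside these domains are filtered out
    start_urls = ['http://example.com/']  # the first URLs the spider fetches

    def parse(self, response):
        # extraction logic goes here; each response from start_urls lands in parse()
        pass
```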
5. Location of the crawler file:

```
<project name>/spiders/<name>.py
```
6. List the spider templates Scrapy ships with:

```
scrapy genspider -l
```
7. Selectors (a sketch comparing all three follows this list):
- regular expressions
- XPath expressions
- CSS selectors
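A minimal sketch, inside a spider's parse() method, comparing the three styles; it assumes the page contains a title tag and links with class="a-test":

```python
def parse(self, response):
    # XPath expression
    title = response.xpath('//title/text()').get()
    # CSS selector (::attr() extracts an attribute value)
    hrefs = response.css('a.a-test::attr(href)').getall()
    # a regex applied on top of a selector
    digits = response.xpath('//title/text()').re(r'\d+')
```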
8. XPath expression rules

Demo:

```python
# extract the title text
response.xpath('//title/text()').get()
```
9. Run a crawler:

```
scrapy crawl <name>
```
10. Run a crawler and save the results:

```
scrapy crawl <name> -o xxx.csv
```

(or xxx.json)
11. Joining URLs:

```python
response.urljoin(uri)
```
12. Pagination (a fuller sketch follows below):

```python
yield scrapy.Request(url, callback=self.parse)
```
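Putting 11 and 12 together, a minimal pagination sketch; the listing URL and the //a[@class="next"]/@href next-page link here are made-up placeholders:

```python
import scrapy


class PagingSpider(scrapy.Spider):
    name = 'paging_demo'
    start_urls = ['https://example.com/list']

    def parse(self, response):
        # ... yield items from the current page here ...
        next_href = response.xpath('//a[@class="next"]/@href').get()
        if next_href:
            # urljoin resolves the relative href against the current page URL
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)
```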
Storing data with pipelines
Enable it in settings.py
```python
# the smaller the number, the higher the priority
ITEM_PIPELINES = {
    'cxianshengSpider.pipelines.CxianshengspiderPipeline': 300,
}
```
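The pipeline class itself lives in pipelines.py. A minimal sketch (the class name must match the entry registered above):

```python
class CxianshengspiderPipeline:
    def process_item(self, item, spider):
        # every item the spider yields passes through here;
        # return it to hand it to the next pipeline,
        # or raise scrapy.exceptions.DropItem to discard it
        return item
```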
Exporting data as JSON

```python
from scrapy.exporters import JsonItemExporter, JsonLinesItemExporter
```
Usage:

```python
self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
# then, for each item:
self.exporter.export_item(item)
```
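A fuller sketch of a JSON-exporting pipeline, assuming the output should land in a file named items.json:

```python
from scrapy.exporters import JsonItemExporter


class JsonExportPipeline:
    def open_spider(self, spider):
        self.fp = open('items.json', 'wb')  # exporters write bytes
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()
```

JsonItemExporter buffers everything into one JSON array; JsonLinesItemExporter writes one JSON object per line instead, which streams better for large crawls.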
CrawlSpider
CrawlSpider lets you build more flexible crawlers: you declare crawl rules and it extracts and follows links for you.
Create one (a rule-based sketch follows the command):

```
scrapy genspider -t crawl <name> <domain>
```
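A minimal CrawlSpider sketch; the domain and the URL patterns are made-up placeholders:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleCrawlSpider(CrawlSpider):
    name = 'example_crawl'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        # follow pagination links without parsing them
        Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
        # hand article pages to parse_item
        Rule(LinkExtractor(allow=r'/article/\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'title': response.xpath('//title/text()').get()}
```

Note that a CrawlSpider must not override parse(), which it uses internally.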
Where to find up-to-date request headers (User-Agent strings):
useragentstring.com/pages/usera…
Set up the downloader middleware
1. Add the following code to middlewares.py:
```python
import random


class HttpbinUserAgentMiddleware(object):
    user_agent = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

    def process_request(self, request, spider):
        # pick a random User-Agent for every outgoing request
        user_agent = random.choice(self.user_agent)
        request.headers['User-Agent'] = user_agent
```
2. Enable it in settings.py:

```python
# HttpbinUserAgentMiddleware is the name of your downloader middleware class
DOWNLOADER_MIDDLEWARES = {
    'httpbin.middlewares.HttpbinUserAgentMiddleware': 543,
}
```
Also enable a delay between requests:

```python
# wait 3 seconds between requests
DOWNLOAD_DELAY = 3
```
3. Crawler code:

```python
# -*- coding: utf-8 -*-
import scrapy
import json


class UseragentdemoSpider(scrapy.Spider):
    name = 'userAgentDemo'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        # httpbin echoes the request's User-Agent back as JSON
        data = json.loads(response.text)['user-agent']
        print('=' * 30)
        print(data)
        print('=' * 30)
        # request the page again to see a different User-Agent:
        # yield scrapy.Request(self.start_urls[0], dont_filter=True)
```
Learning XPath rules

HTML:

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
    <a class="a-test" href="/next/2">email me</a>
</body>
</html>
```
XPath expressions

Basic usage:
- text() extracts a tag's text content
- @ is the attribute locator used inside an expression
1. Get the text content of a given tag:

```
//title/text()
```
2. Get content by HTML attribute with the @ locator:

```
//a[@class="a-test"]/@href
```
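To try these expressions without running a spider, Scrapy's Selector can be fed the sample HTML directly (a quick sketch):

```python
from scrapy.selector import Selector

html = '''
<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8"><title>Document</title></head>
<body><a class="a-test" href="/next/2">email me</a></body>
</html>
'''

sel = Selector(text=html)
print(sel.xpath('//title/text()').get())              # Document
print(sel.xpath('//a[@class="a-test"]/@href').get())  # /next/2
```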
Scrapy with XPath

Code example (PS: this crawls my own blog, tested working):

```python
# -*- coding: utf-8 -*-
import scrapy


class CxianshengSpider(scrapy.Spider):
    name = 'cxiansheng'
    allowed_domains = ['cxiansheng.cn']
    start_urls = ['https://cxiansheng.cn/']

    def return_default_str(self, s):
        # guard against None before stripping whitespace
        return s.strip() if s else ""

    def parse(self, response):
        selectors = response.xpath('//section/article')
        for selector in selectors:
            article_title = selector.xpath('./header/h1/a/text()').get()
            article_url = selector.xpath('./div/p[@class="more"]/a/@href').get()
            article_title = self.return_default_str(article_title)
            article_url = self.return_default_str(article_url)
            yield {'article_title': article_title, 'article_url': article_url}
        # follow the "next page" link, if any
        next_url = response.xpath('//nav[@class="pagination"]/a[@class="extend next"]/@href').get()
        if next_url:
            yield scrapy.Request(next_url, callback=self.parse)
```
Crawling it made me realize how few posts my blog actually has... cry.
Python scrapy demo
Learning resources referenced in this article:
- Get started with the Python crawler framework Scrapy and learn to handle 80% of websites!