
Scrapy is an asynchronous processing framework based on Twisted and implemented in pure Python. You only need to customize a few modules to build a crawler that scrapes web content and all kinds of images, very convenient ~

I needed crawling for work, so I looked into Scrapy and found it quite good, and very suitable for crawler beginners. Let's walk through it together.

Installation

Windows installation

Install pywin32

  1. Go to sourceforge.net/projects/py… or github.com/mhammond/py… and download the pywin32 package for your Python version
  2. Run the installer (a simple next-next-finish install)
  3. Verify the result: at the Python prompt, enter `import win32api`; if no error message is displayed, the installation succeeded

Install Twisted

  1. Go to www.lfd.uci.edu/~gohlke/pyt… and download the Twisted and lxml wheel files
  2. `pip install ******.whl` (if pip is not installed, see how to install it first)
  3. Enter `pip --version` to check whether the installation succeeded

Install scrapy

Command line: `pip install scrapy`
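A quick way to check that Scrapy installed correctly (assuming pip placed the `scrapy` entry point on your PATH):

```
scrapy version
```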

Learning the Scrapy framework

Framework diagram

Directory structure

  • items.py — holds the data model for the scraped items (a minimal sketch follows this list)
  • middlewares.py — middleware
  • pipelines.py — saves the collected data
  • settings.py — the crawler's configuration
  • scrapy.cfg — the project's configuration file
  • spiders/ — the crawler scripts
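As an illustration, a data model in items.py is a scrapy.Item subclass whose attributes are Field objects. This is a minimal sketch; the class and field names are invented for the example:

```python
import scrapy

class ArticleItem(scrapy.Item):
    # one Field per attribute you plan to scrape
    title = scrapy.Field()
    url = scrapy.Field()
```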

Basic usage

1. Create a new project

```
scrapy startproject <project name>
```

2. scrapy.cfg is the package deployment file

3. Create a crawler

```
scrapy genspider <name> <domain>
```

4. Note:

  • The crawler name cannot be the same as the project name
  • The second argument is the domain name of the target website

5. Location of the crawler file (the generated file looks roughly like the sketch below)

```
<project name>/spiders/<name>.py
```
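After running genspider, the generated file should look roughly like this (a sketch of the default `basic` template; the name and domain are placeholders):

```python
# -*- coding: utf-8 -*-
import scrapy

class NameSpider(scrapy.Spider):
    name = 'name'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # handle the downloaded response here
        pass
```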

6. List the spider templates Scrapy provides

```
scrapy genspider -l
```
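On a typical installation the command lists the built-in templates, roughly:

```
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
```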

7. Selectors (all three styles are compared in the sketch after this list)

  1. Regular expressions
  2. XPath expressions
  3. CSS selectors
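A minimal sketch extracting the same title element with each of the three styles (`response` is the object Scrapy passes to your parse method):

```python
# XPath
response.xpath('//title/text()').get()
# CSS
response.css('title::text').get()
# regular expression applied on top of an XPath selection
response.xpath('//title/text()').re_first(r'(.+)')
```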

8. XPath expression rules

Demo:

```python
# extract the page title
response.xpath('//title/text()').get()
```

9. Run a crawler

```
scrapy crawl <name>
```

10. Run the crawler and save the results

```
scrapy crawl <name> -o xxx.csv   # or xxx.json
```

11. Joining URLs

```python
response.urljoin(uri)
```
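For example, joining a relative href scraped from the page back onto the current page's URL (a sketch; the link and class name are made up):

```python
# e.g. next_href == '/next/2'
next_href = response.xpath('//a[@class="next"]/@href').get()
# becomes an absolute URL such as 'https://example.com/next/2'
next_url = response.urljoin(next_href)
```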

12. Pagination

```python
yield scrapy.Request(url, callback=self.parse)
```

Storing data in pipelines

Enable it in settings.py

```python
# the smaller the number, the higher the priority (here, 300)
ITEM_PIPELINES = {
    'cxianshengSpider.pipelines.CxianshengspiderPipeline': 300,
}
```

To store the data as JSON, import the exporters:

```python
from scrapy.exporters import JsonItemExporter, JsonLinesItemExporter
```

Usage:

```python
self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
# then, for each scraped item:
self.exporter.export_item(item)
```
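Putting the pieces together, a minimal pipeline sketch (assuming the pipeline class name from the settings above; the output filename is invented):

```python
from scrapy.exporters import JsonItemExporter

class CxianshengspiderPipeline(object):
    def open_spider(self, spider):
        # open the output file once when the spider starts (hypothetical path)
        self.fp = open('articles.json', 'wb')
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # close the JSON array and the file when the spider finishes
        self.exporter.finish_exporting()
        self.fp.close()
```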

CrawlSpider

CrawlSpider lets you create more flexible crawlers by customizing crawl rules and more.

Create a CrawlSpider

```
scrapy genspider -t crawl <name> <domain>
```
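A minimal CrawlSpider sketch (the names and URL patterns here are illustrative, not from a real site):

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BlogCrawlSpider(CrawlSpider):
    name = 'blogcrawl'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']

    rules = (
        # follow pagination links without a callback
        Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
        # parse article pages with parse_item
        Rule(LinkExtractor(allow=r'/post/\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'title': response.xpath('//title/text()').get()}
```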

A source of up-to-date User-Agent strings

Useragentstring.com/pages/usera…

Setting up downloader middleware

1. Add the following code to middlewares.py:

```python
import random

class HttpbinUserAgentMiddleware(object):
    user_agent = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

    def process_request(self, request, spider):
        user_agent = random.choice(self.user_agent)
        # set the User-Agent header on every outgoing request
        request.headers['User-Agent'] = user_agent
```

2. Enable it in settings.py:

```python
# HttpbinUserAgentMiddleware is the name of your downloader middleware class
DOWNLOADER_MIDDLEWARES = {
    'httpbin.middlewares.HttpbinUserAgentMiddleware': 543,
}
```

Also enable a delay between download requests:

```python
DOWNLOAD_DELAY = 3  # a 3-second delay between requests
```

3. Crawler code

```python
# -*- coding: utf-8 -*-
import scrapy
import json

class UseragentdemoSpider(scrapy.Spider):
    name = 'userAgentDemo'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        # httpbin echoes back the User-Agent header it received
        data = json.loads(response.text)['user-agent']
        print('=' * 30)
        print(data)
        print('=' * 30)
        # scrapy.Request(self.start_urls[0], dont_filter=True)
        pass
```

Learning XPath rules

HTML:

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
    <a class="a-test" href="/next/2">email me</a>
</body>
</html>
```

XPath expressions

Basic usage

  1. text() gets a tag's text content
  2. @ locates a tag by attribute

Expressions

1. Obtain the content of a specified tag:

```
//title/text()
```

2. Obtain content by HTML attribute (the @ locator):

```
//a[@class="a-test"]/@href
```

Using XPath with Scrapy

Code example (PS: this crawls my own blog and has been tested to work):

```python
# -*- coding: utf-8 -*-
import scrapy

class CxianshengSpider(scrapy.Spider):
    name = 'cxiansheng'                        # spider name
    allowed_domains = ['cxiansheng.cn']        # domain restriction
    start_urls = ['https://cxiansheng.cn/']    # first page to crawl

    def return_default_str(self, s):
        # strip whitespace, falling back to an empty string for None
        return s.strip() if s else ''

    def parse(self, response):
        selectors = response.xpath('//section/article')
        for selector in selectors:
            article_title = selector.xpath('./header/h1/a/text()').get()
            article_url = selector.xpath('./div/p[@class="more"]/a/@href').get()
            article_title = self.return_default_str(article_title)
            article_url = self.return_default_str(article_url)
            yield {'article_title': article_title, 'article_url': article_url}
        # follow the next-page link, if any
        next_url = response.xpath('//nav[@class="pagination"]/a[@class="extend next"]/@href').get()
        if next_url:
            yield scrapy.Request(next_url, callback=self.parse)
```

Crawling my own site made me realize just how little I blog, cry

Python scrapy demo

Learning resources referenced in this article:

  • Get started with the Python crawler framework Scrapy and learn how to handle 80% of websites!