
Scrapy is an asynchronous processing framework based on Twisted and implemented in pure Python. You only need to customize a few modules to build a crawler that scrapes web content and all kinds of images, very convenient ~

I needed crawling for work, so I looked into Scrapy and found it quite good, and very suitable for crawler beginners. Let's walk through it together.

Installation

Windows installation

Install pywin32

  1. Go to sourceforge.net/projects/py… or github.com/mhammond/py… and download the pywin32 package for your Python version
  2. Run the installer (a simple next-next-finish install)
  3. Verify the result: at the Python prompt, enter `import win32api`; if no error message is displayed, the installation succeeded

Install Twisted

  1. Go to www.lfd.uci.edu/~gohlke/pyt… and download the Twisted and lxml wheel files
  2. `pip install ******.whl` (if pip is not installed, see how to install it first)
  3. Enter `pip --version` to check whether the installation succeeded

Install scrapy

Command line: `pip install scrapy`
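A quick way to check that Scrapy installed correctly (assuming pip placed the `scrapy` entry point on your PATH):

```
scrapy version
```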

Learning the Scrapy framework

Framework diagram

Directory structure

  • items.py — holds the data model for the scraped items (a minimal sketch follows this list)
  • middlewares.py — middleware
  • pipelines.py — saves the collected data
  • settings.py — the crawler's configuration
  • scrapy.cfg — the project's configuration file
  • spiders/ — the crawler scripts
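As an illustration, a data model in items.py is a scrapy.Item subclass whose attributes are Field objects. This is a minimal sketch; the class and field names are invented for the example:

```python
import scrapy

class ArticleItem(scrapy.Item):
    # one Field per attribute you plan to scrape
    title = scrapy.Field()
    url = scrapy.Field()
```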

Basic usage

1. Create a new project

```
scrapy startproject <project name>
```

2. scrapy.cfg is the package deployment file

3. Create a crawler

```
scrapy genspider <name> <domain>
```

4. Note:

  • The crawler name cannot be the same as the project name
  • The second argument is the domain name of the target website

5. Location of the crawler file (the generated file looks roughly like the sketch below)

```
<project name>/spiders/<name>.py
```
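After running genspider, the generated file should look roughly like this (a sketch of the default `basic` template; the name and domain are placeholders):

```python
# -*- coding: utf-8 -*-
import scrapy

class NameSpider(scrapy.Spider):
    name = 'name'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # handle the downloaded response here
        pass
```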

6. List the spider templates Scrapy provides

```
scrapy genspider -l
```
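On a typical installation the command lists the built-in templates, roughly:

```
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
```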

7. Selectors (all three styles are compared in the sketch after this list)

  1. Regular expressions
  2. XPath expressions
  3. CSS selectors
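A minimal sketch extracting the same title element with each of the three styles (`response` is the object Scrapy passes to your parse method):

```python
# XPath
response.xpath('//title/text()').get()
# CSS
response.css('title::text').get()
# regular expression applied on top of an XPath selection
response.xpath('//title/text()').re_first(r'(.+)')
```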

8. XPath expression rules

Demo:

```python
# extract the page title
response.xpath('//title/text()').get()
```

9. Run a crawler

```
scrapy crawl <name>
```

10. Run the crawler and save the results

```
scrapy crawl <name> -o xxx.csv   # or xxx.json
```

11. Joining URLs

```python
response.urljoin(uri)
```
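For example, joining a relative href scraped from the page back onto the current page's URL (a sketch; the link and class name are made up):

```python
# e.g. next_href == '/next/2'
next_href = response.xpath('//a[@class="next"]/@href').get()
# becomes an absolute URL such as 'https://example.com/next/2'
next_url = response.urljoin(next_href)
```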

12. Pagination

```python
yield scrapy.Request(url, callback=self.parse)
```

Storing data in pipelines

Enable it in settings.py

```python
# the smaller the number, the higher the priority (here, 300)
ITEM_PIPELINES = {
    'cxianshengSpider.pipelines.CxianshengspiderPipeline': 300,
}
```

To store the data as JSON, import the exporters:

```python
from scrapy.exporters import JsonItemExporter, JsonLinesItemExporter
```

Usage:

```python
self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
# then, for each scraped item:
self.exporter.export_item(item)
```
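Putting the pieces together, a minimal pipeline sketch (assuming the pipeline class name from the settings above; the output filename is invented):

```python
from scrapy.exporters import JsonItemExporter

class CxianshengspiderPipeline(object):
    def open_spider(self, spider):
        # open the output file once when the spider starts (hypothetical path)
        self.fp = open('articles.json', 'wb')
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # close the JSON array and the file when the spider finishes
        self.exporter.finish_exporting()
        self.fp.close()
```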

CrawlSpider

CrawlSpider lets you create more flexible crawlers by customizing crawl rules and more.

Create a CrawlSpider

```
scrapy genspider -t crawl <name> <domain>
```
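A minimal CrawlSpider sketch (the names and URL patterns here are illustrative, not from a real site):

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BlogCrawlSpider(CrawlSpider):
    name = 'blogcrawl'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']

    rules = (
        # follow pagination links without a callback
        Rule(LinkExtractor(allow=r'/page/\d+'), follow=True),
        # parse article pages with parse_item
        Rule(LinkExtractor(allow=r'/post/\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'title': response.xpath('//title/text()').get()}
```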

A source of up-to-date User-Agent strings

Useragentstring.com/pages/usera…

Setting up downloader middleware

1. Add the following code to middlewares.py:

```python
import random

class HttpbinUserAgentMiddleware(object):
    user_agent = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

    def process_request(self, request, spider):
        user_agent = random.choice(self.user_agent)
        # set the User-Agent header on every outgoing request
        request.headers['User-Agent'] = user_agent
```

2. Enable it in settings.py:

```python
# HttpbinUserAgentMiddleware is the name of your downloader middleware class
DOWNLOADER_MIDDLEWARES = {
    'httpbin.middlewares.HttpbinUserAgentMiddleware': 543,
}
```

Also enable a delay between download requests:

```python
DOWNLOAD_DELAY = 3  # a 3-second delay between requests
```

3. Crawler code

```python
# -*- coding: utf-8 -*-
import scrapy
import json

class UseragentdemoSpider(scrapy.Spider):
    name = 'userAgentDemo'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        # httpbin echoes back the User-Agent header it received
        data = json.loads(response.text)['user-agent']
        print('=' * 30)
        print(data)
        print('=' * 30)
        # scrapy.Request(self.start_urls[0], dont_filter=True)
        pass
```

Learning XPath rules

HTML:

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
    <a class="a-test" href="/next/2">email me</a>
</body>
</html>
```

XPath expressions

Basic usage

  1. text() gets a tag's text content
  2. @ locates a tag by attribute

Expressions

1. Obtain the content of a specified tag:

```
//title/text()
```

2. Obtain content by HTML attribute (the @ locator):

```
//a[@class="a-test"]/@href
```

Using XPath with Scrapy

Code example (PS: this crawls my own blog and has been tested to work):

```python
# -*- coding: utf-8 -*-
import scrapy

class CxianshengSpider(scrapy.Spider):
    name = 'cxiansheng'                        # spider name
    allowed_domains = ['cxiansheng.cn']        # domain restriction
    start_urls = ['https://cxiansheng.cn/']    # first page to crawl

    def return_default_str(self, s):
        # strip whitespace, falling back to an empty string for None
        return s.strip() if s else ''

    def parse(self, response):
        selectors = response.xpath('//section/article')
        for selector in selectors:
            article_title = selector.xpath('./header/h1/a/text()').get()
            article_url = selector.xpath('./div/p[@class="more"]/a/@href').get()
            article_title = self.return_default_str(article_title)
            article_url = self.return_default_str(article_url)
            yield {'article_title': article_title, 'article_url': article_url}
        # follow the next-page link, if any
        next_url = response.xpath('//nav[@class="pagination"]/a[@class="extend next"]/@href').get()
        if next_url:
            yield scrapy.Request(next_url, callback=self.parse)
```

Crawling my own site made me realize just how little I blog, cry

Python scrapy demo

Learning resources referenced in this article:

  • Get started with the Python crawler framework Scrapy and learn how to handle 80% of websites!