Abstract

These notes come from Qiye's book "Python Crawler Development and Project Practice" (《Python爬虫开发与项目实战》). I bought it years ago and learned most of my crawling basics from it, but I never managed to get a foothold in the Scrapy framework part. A while ago I brushed up on object-oriented programming, and today it suddenly clicked, so I am writing up my learning process as notes.

1. Installing Scrapy

```bash
# I added the Tsinghua mirror source with the -i parameter to speed up the download;
# other pip packages can be installed the same way
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
```

The following figure indicates that the installation is successful

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/785b87761b9f45a897e0e61625e5b9e0)

Installing Scrapy on Win7

2. Introduction to the related commands

Scrapy commands fall into two categories:

  • Global commands: can be run anywhere;
  • Project commands: can only be run inside a Scrapy project;

2.1 Global Commands

The global commands are the ones listed during the installation test above:

startproject, genspider, settings, runspider, shell, fetch, view, version

The most commonly used ones are:

```bash
# run a spider
scrapy crawl spider_name
# open the scrapy shell for debugging
scrapy shell "https://blog.csdn.net/qq_35866846"
```

Global commands do not depend on the existence of a project, i.e. they can be run whether or not you have a project. For example, startproject is the command that creates a project, so of course it must also work when no project exists yet.

Detailed usage instructions:

  • startproject: creates a project, e.g. `scrapy startproject cnblogSpider`

  • genspider

```bash
# generate a spider from a template
scrapy genspider example example.com
```
![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/d254a88cf461406ab17f3e53b66852e4)

  • settings

```bash
# check the project settings, e.g. the bot name
scrapy settings --get BOT_NAME
```

  • runspider

    Unlike `scrapy crawl xx`, which runs inside a project, runspider runs a single file: you write a .py file in the Scrapy spider format, and if you do not want to create a project you can run it with runspider. For example, if test.py is a spider, you run it directly (a minimal sketch of what such a file might contain follows the command below):

```bash
scrapy runspider test.py
```
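As noted above, here is a rough sketch of what test.py could contain. This is my own placeholder file, not one from the book; the target URL is reused from the examples later in this article:

```python
# test.py -- a minimal standalone spider (hypothetical example)
import scrapy


class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = ["https://www.cnblogs.com/qiyeboy/default.html?page=1"]

    def parse(self, response):
        # yield one small dict item containing the page title
        yield {"title": response.css("title::text").get()}
```

Running `scrapy runspider test.py` executes this spider directly, no project directory required.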

  • shell

```bash
# this command is used for debugging, e.g. to check whether a selector really matches the element we want
scrapy shell "https://www.cnblogs.com/qiyeboy/default.html?page=1"
# we can then operate on the response directly, e.g. test whether the title selector is correct
response.css("title").extract_first()
# and test whether the xpath selection is correct
response.xpath("//*[@id='mainContent']/div/div/div[2]/a/span").extract()
```
  • fetch

    fetch is also a debugging command. It simulates the way our spider downloads a page: the page fetched with this command is exactly the page the spider would download when it runs. The benefit is that you can check precisely whether the HTML structure really is what you see in the browser, and adjust your crawling strategy in time. Take a Taobao detail page, for example: you can see it normally in the browser, yet you cannot crawl it the conventional way. Why? Because it is loaded asynchronously. So when you find that you cannot get the content, be alert: fetch the HTML and check whether it contains the tag nodes you want. If it does not, you know you will need JavaScript rendering or a similar technique.

    scrapy fetch www.scrapyd.cn

    That's it. If you want to save the downloaded page to an HTML file for analysis, you can use the output redirection of Windows or Linux. Saving under Windows:

    scrapy fetch www.scrapyd.cn > d:/3.html

  • view

    Similar to fetch, but view opens the downloaded page in your browser, so you can check whether the spider sees the same thing you do; this makes troubleshooting easier

    scrapy view blog.csdn.net/qq_35866846

  • version

    Check out the Scrapy version

    scrapy version

2.2 Project Commands

Project commands can only be run inside a Scrapy project. The most commonly used one is `scrapy crawl`, which runs a spider belonging to the project (see the example in section 5.5).

3. Introduction to the Scrapy framework

Scrapy is a crawler framework written in Python. It is simple, lightweight, and very convenient. It uses Twisted, an asynchronous networking library, to handle network communication; it has a clear architecture and a variety of middleware interfaces, so it can flexibly fulfil all kinds of requirements. The overall architecture is shown below:

![](https://p9-tt-ipv6.byteimg.com/large/pgc-image/15ee6ac0382648e7820ddfdc7c6c2e59)
  • Scrapy Engine: the engine controls the flow of data among all components of the system and triggers events when certain actions occur.
  • Scheduler: the scheduler receives requests from the engine and enqueues them so that it can supply them back when the engine asks for them later.
  • Downloader: the downloader fetches the page data and feeds it to the engine, which then feeds it to the Spider.
  • Spiders: spiders are classes written by the user to analyze a Response and extract Items (i.e. the scraped items) or additional URLs to follow. Each spider is responsible for one specific website (or a group of them).
  • Item Pipeline: the Item Pipeline processes the Items extracted by the spider. Typical processing includes cleaning, validation, and persistence (such as storing them in a database).
  • Downloader middlewares: downloader middlewares are specific hooks between the engine and the downloader that process the Responses the downloader passes to the engine. They provide a simple mechanism to extend Scrapy by inserting custom code; a minimal sketch of such a middleware follows this list.
  • Spider middlewares: spider middlewares are specific hooks between the engine and the spiders that process the spider's input (responses) and output (items and requests). They also provide a simple mechanism to extend Scrapy by inserting custom code.
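To make the "hooks" idea a bit more concrete, here is a minimal downloader middleware sketch. It is my own illustration (the class name and header value are made up), not part of the project in this article; it sets a default User-Agent on every outgoing request and logs the status code of every response:

```python
# an illustrative downloader middleware, not from the original project
import logging

logger = logging.getLogger(__name__)


class CustomHeaderMiddleware:
    """Hooks between the engine and the downloader."""

    def process_request(self, request, spider):
        # called for every request before it reaches the downloader
        request.headers.setdefault("User-Agent", "my-scrapy-bot/0.1")
        return None  # None means: continue handling this request normally

    def process_response(self, request, response, spider):
        # called for every response before it is handed back to the engine
        logger.debug("Got %s for %s", response.status, request.url)
        return response
```

Like an Item Pipeline, such a middleware is enabled through an entry in settings.py (the DOWNLOADER_MIDDLEWARES setting).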

4. Data flow in Scrapy

  1. The engine opens a domain, finds the Spider that handles the site, and requests the first URL to crawl from that Spider
  2. The engine retrieves the first URL to crawl from the Spider and schedules it as a Request through the scheduler
  3. The engine asks the scheduler for the next URL to crawl
  4. The scheduler returns the next URL to crawl to the engine, and the engine forwards it to the Downloader through the downloader middleware (request direction)
  5. Once the page is downloaded, the downloader generates a Response to the page and sends it to the engine through the download middleware (in the Response direction)
  6. The engine receives the Response from the downloader and sends it through the Spider middleware (input direction) to the Spider for processing
  7. The Spider handles the Response and returns the retrieved Item and (following up) new Request to the engine
  8. The engine feeds the retrieved Item to the Item Pipeline and the Request to the scheduler
  9. The process repeats (from step 2) until there are no more requests in the scheduler, at which point the engine closes the spider
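Steps 7 and 8 are where your own code plugs into this flow: everything a spider yields is routed by the engine either to the Item Pipeline (items) or back to the scheduler (requests). A minimal sketch of what that looks like inside a parse method (a placeholder spider, not the one built in section 5):

```python
import scrapy


class FlowDemoSpider(scrapy.Spider):
    # placeholder spider used only to illustrate steps 7-8 of the data flow
    name = "flow_demo"
    start_urls = ["https://example.com"]  # steps 1-2: the first URL goes to the scheduler

    def parse(self, response):
        # step 7: the spider processes the Response...
        yield {"url": response.url}               # ...items go on to the Item Pipeline (step 8)
        next_url = response.css("a::attr(href)").get()
        if next_url:
            # ...and new Requests go back to the scheduler (step 8), so the cycle repeats
            yield response.follow(next_url, callback=self.parse)
```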

5. The first Scrapy crawler

This case project comes from Qiye's book. Since I bought the book some time ago, its code is still Python 2, so I re-implemented it myself in a Python 3 environment.

5.1 Creating a Project

```bash
# create the project cnblogSpider
scrapy startproject cnblogSpider
```
![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/814e0f0c51d7426291df018549f120eb)

Once the project has been created, open it directly with PyCharm and keep working there; the structural files are generated automatically and you just fill in the framework.

![](https://p26-tt.byteimg.com/large/pgc-image/737cd097548a4d4685222b8a80c97b28)
  • scrapy.cfg: the project deployment file
  • cnblogSpider/: the project's Python module; you will add your code here later
  • cnblogSpider/items.py: the Item definitions of the project
  • cnblogSpider/pipelines.py: the pipelines of the project
  • cnblogSpider/settings.py: the settings file of the project
  • cnblogSpider/spiders/: the directory that holds the spider code

5.2 Creating the crawler module

```python
import scrapy
from scrapy.selector import Selector

from ..items import CnblogspiderItem


class CnblogsSpider(scrapy.Spider):
    name = "cnblogs"                      # spider name, used by "scrapy crawl cnblogs"
    allowed_domains = ["cnblogs.com"]     # domains the spider is allowed to crawl
    start_urls = [
        "https://www.cnblogs.com/qiyeboy/default.html?page=1"
    ]

    def parse(self, response):
        # each blog post sits in a block with class "day"
        papers = response.xpath("//*[@class='day']")
        # extract the data of every post
        for paper in papers:
            url = paper.xpath(".//*[@class='postTitle']/a/@href").extract()[0]
            title = paper.xpath(".//*[@class='postTitle']/a/span/text()").extract()[0]
            time = paper.xpath(".//*[@class='dayTitle']/a/text()").extract()[0]
            content = paper.xpath(".//*[@class='postCon']/div/text()").extract()[0]
            # print(url, title, time, content)
            item = CnblogspiderItem(url=url, title=title, time=time, content=content)
            yield item
        # follow the "next page" (下一页) link, if there is one
        next_page = Selector(response).re(r'<a href="(\S*)">下一页</a>')
        if next_page:
            yield scrapy.Request(url=next_page[0], callback=self.parse)
```

5.3 Defining the Item

```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CnblogspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    time = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()


class newCnblogsItem(CnblogspiderItem):
    body = scrapy.Field()
    # title = scrapy.Field(CnblogspiderItem.fields['title'], serializer=my_serializer)
```
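Items behave like dictionaries with a fixed set of keys. A quick way to sanity-check the definition, for example from a scrapy shell or a small script (the values below are just placeholders), is:

```python
from cnblogSpider.items import CnblogspiderItem

# construct an item exactly the way the spider does (placeholder values)
item = CnblogspiderItem(url="https://example.com/post", title="demo",
                        time="2021/01", content="...")
print(item["title"])   # field access works like a dict
print(dict(item))      # converts cleanly to a plain dict, which is what
                       # the pipeline's json.dumps(dict(item)) relies on
```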

5.4 Building the Item Pipeline

```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

import json

from scrapy.exceptions import DropItem
from .items import CnblogspiderItem


class CnblogspiderPipeline(object):
    def __init__(self):
        # the file is opened in text mode with UTF-8 encoding
        # (the book's Python 2 version opened it with "wb" and wrote line.encode())
        self.file = open('papers.json', 'w', encoding='UTF-8')

    def process_item(self, item, spider):
        if item['title']:
            # serialize the item as one JSON object per line
            line = json.dumps(dict(item)) + '\n'
            self.file.write(line)
            return item
        else:
            # drop items that have no title
            raise DropItem(f"Missing title in {item}")
```
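One thing the pipeline above never does is close papers.json. As an optional refinement of my own (not from the book), the same pipeline can use Scrapy's close_spider hook, which is called once when the spider finishes:

```python
import json

from scrapy.exceptions import DropItem


class CnblogspiderPipeline(object):
    def __init__(self):
        self.file = open('papers.json', 'w', encoding='UTF-8')

    def process_item(self, item, spider):
        if item['title']:
            self.file.write(json.dumps(dict(item)) + '\n')
            return item
        raise DropItem(f"Missing title in {item}")

    def close_spider(self, spider):
        # called once when the spider is closed; flush and release the file handle
        self.file.close()
```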

5.5 Activating the Item Pipeline

To enable an Item Pipeline component, you must add its class to the ITEM_PIPELINES variable in settings.py; for the pipeline generated with the project, it is enough to uncomment the corresponding lines in settings.py.

Multiple Item Pipeline components can be configured in ITEM_PIPELINES. The integer value assigned to each class determines the order in which they run: items pass through the pipelines from the lowest value to the highest, and the values are conventionally chosen in the range 0-1000.

```python
ITEM_PIPELINES = {
   'cnblogSpider.pipelines.CnblogspiderPipeline': 300,
}
```
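As an illustration of the ordering rule, an ITEM_PIPELINES with two components would run CnblogspiderPipeline first because 300 < 800 (the second pipeline name here is hypothetical, not part of this project):

```python
ITEM_PIPELINES = {
    'cnblogSpider.pipelines.CnblogspiderPipeline': 300,   # lower value, runs first
    'cnblogSpider.pipelines.SomeOtherPipeline': 800,      # hypothetical, runs second
}
```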

After activating the pipeline, switch the command line into the project directory and run:

```bash
scrapy crawl cnblogs
```
![](https://p9-tt-ipv6.byteimg.com/large/pgc-image/ec2754300f1741edaaaad7b8fa5c267d)
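After the run, papers.json contains one JSON object per line. A quick way to check the output (a small sketch of my own, assuming the file was produced by the pipeline above):

```python
import json

# read back the JSON-lines file written by CnblogspiderPipeline
with open('papers.json', encoding='UTF-8') as f:
    for line in f:
        post = json.loads(line)
        print(post['time'], post['title'], post['url'])
```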

This article is a repost; the copyright belongs to the original author. If there is any infringement, please contact the editor to have it removed.

Original address: blog.csdn.net/qq_35866846…
