The Scrapy framework

Scrapy is an application framework, written in pure Python, designed to crawl websites and extract structured data from them.

The power of the framework is that users only need to customize a few modules to build a working crawler that grabs web content and all kinds of images, which is very convenient.

Scrapy uses the Twisted asynchronous networking framework (a rival of Tornado) to handle network communication, which speeds up downloads without you having to implement the asynchronous machinery yourself, and it exposes a variety of middleware interfaces that are flexible enough to meet most needs.
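Because downloading is asynchronous, concurrency is tuned through configuration rather than threads. A minimal settings.py sketch (the values are illustrative, and the middleware class name is a hypothetical example, not part of Scrapy):

    # settings.py -- the Twisted event loop handles concurrency, so tuning it
    # is just configuration (the values below are illustrative)
    CONCURRENT_REQUESTS = 32             # total requests in flight at once
    CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-domain cap, to stay polite
    DOWNLOAD_DELAY = 0.25                # seconds between requests to one site

    # Middleware hooks are plugged in the same way; the number is the priority.
    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.RandomUserAgentMiddleware": 543,  # hypothetical
    }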

Scrapy architecture diagram (the green lines show the data flow):

  • Scrapy Engine: Responsible for communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, Scheduler, and so on.
  • Scheduler: Receives Requests sent by the engine, arranges them and enqueues them in a particular order, and hands them back to the engine when the engine needs them.
  • Downloader: Downloads all Requests sent by the Scrapy Engine and returns the Responses it obtains to the Scrapy Engine, which hands them to the Spider for processing.
  • Spider: Processes all Responses, analyzes them and extracts the data needed to fill the Item fields, and submits any URLs that need to be followed up to the engine, where they enter the Scheduler again.
  • Item Pipeline: The place where Items retrieved by the Spider are post-processed (detailed analysis, filtering, storage, and so on).
  • Downloader Middlewares: A component you can customize to extend the download functionality.
  • Spider Middlewares: A functional component that extends and manipulates the communication between the engine and the Spider (e.g., Responses going into the Spider and Requests coming out of the Spider). A sketch of both middleware types follows this list.
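Both kinds of middleware are plain classes with hook methods. A minimal sketch (the class names and the logic are illustrative; only the hook signatures come from Scrapy):

    # middlewares.py
    import random


    class RandomUserAgentMiddleware:
        """Downloader middleware: sits between the Engine and the Downloader."""

        USER_AGENTS = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Mozilla/5.0 (X11; Linux x86_64)",
        ]

        def process_request(self, request, spider):
            # Called for every Request before it reaches the Downloader.
            request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
            return None  # None means: keep processing this request normally


    class LogEmptyResponseMiddleware:
        """Spider middleware: sits between the Engine and the Spider."""

        def process_spider_input(self, response, spider):
            # Called for every Response on its way into the Spider.
            if not response.body:
                spider.logger.warning("Empty response from %s", response.url)
            return None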

How Scrapy works

The code is written and the program starts running…

  1. Engine: Hi! Spider, which website are you dealing with?
  2. Spider: The boss wants me to handle xxxx.com.
  3. Engine: Give me the first URL that needs to be processed.
  4. Spider: Here you go. The first URL is xxxxxxx.com.
  5. Engine: Hi! Scheduler, I have a Request here; please sort it into the queue for me.
  6. Scheduler: OK, it is being processed. Hold on.
  7. Engine: Hi! Scheduler, give me the Request you have processed.
  8. Scheduler: Here you are, this is the Request I have processed.
  9. Engine: Hi! Downloader, please download this Request for me according to the boss's downloader middleware settings.
  10. Downloader: OK! Here you go, it downloaded fine. (If it failed: Sorry, this Request failed to download. The engine then tells the Scheduler: "This Request failed to download. Record it and we will download it again later.")
  11. Engine: Hi! Spider, here is the downloaded content, already processed according to the boss's downloader middlewares. You deal with it. (Attention! By default the Responses here are handed to the def parse() method.)
  12. Spider (after processing the data and finding URLs to follow up): Hi! Engine, I have two results here: this is the URL I need to follow up, and this is the Item data I got. (See the spider sketch below.)
  13. Engine: Hi! Pipeline, I have an Item here, please handle it for me! Scheduler, this is a URL to follow up, please handle it. Then the loop starts again from step 4 until the boss has all the information he needs.
  14. Pipeline and Scheduler: OK, doing it now!

Attention! The whole program stops only when there are no Requests left in the Scheduler. (That also means Scrapy will re-download URLs that failed to download.)
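The dialogue above maps directly onto spider code. A minimal sketch, patterned on the quotes.toscrape.com example from the official Scrapy tutorial (the site and the CSS selectors are illustrative): parse() receives each Response (step 11) and yields both Item data and follow-up Requests (step 12).

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]  # the "first URL" from step 4

        def parse(self, response):
            # "the Item data that I got" -- yielded dicts/Items go to the Item Pipeline
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }

            # "the URL that I need to follow up" -- yielded Requests go back to the Scheduler
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

The Scheduler keeps handing these follow-up Requests back to the Downloader until none are left, which is exactly when the program stops.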

There are 4 steps to making a crawler:

  • Create the project (scrapy startproject xxx): create a new crawler project
  • Define the target (write items.py): specify the data you want to crawl
  • Make the spider (spiders/xxspider.py): write the spider and start crawling pages
  • Store the content (write pipelines.py): design a pipeline to store the crawled data (see the sketch after this list)
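Steps 2 and 4 are just two more small files in the project. A minimal sketch (the item fields and the JSON-lines output are illustrative choices, not the only way to do it):

    # items.py (step 2) -- declare the fields you want to crawl
    import scrapy


    class QuoteItem(scrapy.Item):
        text = scrapy.Field()
        author = scrapy.Field()

    # pipelines.py (step 4) -- decide what happens to every crawled item
    import json


    class JsonWriterPipeline:
        def open_spider(self, spider):
            self.file = open("quotes.jl", "w", encoding="utf-8")

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            # Write each item as one JSON line, then pass it on unchanged.
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
            return item

The pipeline only runs after it is enabled in settings.py through the ITEM_PIPELINES setting (e.g. {"myproject.pipelines.JsonWriterPipeline": 300}, where myproject stands for your project's package name).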


Official Scrapy documentation: doc.scrapy.org/en/latest
Chinese-language documentation site: http://scrapychs.readthedocs.io/zh_CN/latest/index.html