1. Scrapy concept

Scrapy is an open-source web crawling framework written in Python, designed to crawl web data and extract structured data.

Scrapy uses the Twisted asynchronous networking framework to speed up downloads.

Chinese documentation: https://scrapy-chs.readthedocs.io/zh_CN/1.0/

2. The role of the Scrapy framework

With a small amount of code, data can be crawled quickly.
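For a sense of how little code is involved, here is a minimal spider sketch; the target site (quotes.toscrape.com, a public practice site) and the CSS selectors are illustrative, not part of the original text.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """A minimal spider: a name, a start URL, and a parse callback."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]  # placeholder practice site

    def parse(self, response):
        # Extract data from the response with CSS selectors
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

Such a file can be run on its own with `scrapy runspider quotes_spider.py -o quotes.json`, without a full project.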

3. Scrapy workflow

3.1 Review the previous crawler process

3.2 The above process can be rewritten as follows

3.3 Scrapy process

The process can be described as follows:
  1. The start URL in the spider is constructed into a request object --> spider middleware --> engine --> scheduler
  2. The scheduler takes out a request --> engine --> downloader middleware --> downloader
  3. The downloader sends the request and obtains a response --> downloader middleware --> engine --> spider middleware --> spider
  4. The spider extracts URL addresses and assembles them into request objects --> spider middleware --> engine --> scheduler, and step 2 repeats
  5. The spider extracts data --> engine --> pipeline, which processes and stores the data (a pipeline sketch follows the notes below)
Note:
  • The Chinese labels in the figure are added for ease of understanding
  • The green lines in the figure represent the transfer of data
  • Note where the middlewares sit in the figure; their position determines their role
  • Note the position of the engine: all modules are independent of each other and interact only with the engine
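Step 5 hands the extracted data to the pipeline via the engine. A minimal item pipeline might look like the sketch below; the class name and output file are assumptions for illustration, and the pipeline still has to be enabled in settings.py (see the summary).

```python
import json


class JsonWriterPipeline:
    """Item pipeline sketch: receives items from the engine and writes them to a JSON Lines file."""

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open("items.jl", "w", encoding="utf-8")  # placeholder output path

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields; must return the item
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```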

3.4 Scrapy’s three built-in objects

  • Request object: consists of url, method, post_data, headers, etc.
  • Response object: consists of url, body, status, headers, etc.
  • Item object: essentially a dictionary
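As a sketch of how these three objects appear in practice, the spider below logs Response attributes, yields a dict as an item, and builds a follow-up Request; the site, selectors, and header value are placeholder assumptions.

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]  # placeholder practice site

    def parse(self, response):
        # Response object: url, body, status, headers
        self.logger.info("Fetched %s with status %s", response.url, response.status)

        # Item: a plain dict works (a scrapy.Item subclass is also common)
        for title in response.css("h3 a::attr(title)").getall():
            yield {"title": title}

        # Request object: url, method, headers (use body or FormRequest for POST data)
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield scrapy.Request(
                response.urljoin(next_page),
                method="GET",
                headers={"User-Agent": "example-bot"},  # placeholder header
                callback=self.parse,
            )
```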

3.5 The specific role of each Scrapy module

Note:
  • Spider middleware and downloader middleware differ only in where they run in the data flow; their functions overlap, e.g. replacing the UA (User-Agent)
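For example, replacing the UA is typically done in a downloader middleware. The sketch below is one possible implementation; the class name and UA strings are assumptions, and the middleware must be enabled in settings.py to take effect.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: replaces the User-Agent header on every request."""

    # A small pool of placeholder UA strings
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    def process_request(self, request, spider):
        # Runs before the downloader sends the request
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # None means: keep processing this request normally
```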

Summary

  1. The concept of Scrapy: Scrapy is an application framework for crawling website data and extracting structured data
  2. Scrapy framework and data transfer process:
    1. The start URL in the spider is constructed into a request object --> spider middleware --> engine --> scheduler
    2. The scheduler takes out a request --> engine --> downloader middleware --> downloader
    3. The downloader sends the request and obtains a response --> downloader middleware --> engine --> spider middleware --> spider
    4. The spider extracts URL addresses and assembles them into request objects --> spider middleware --> engine --> scheduler, and step 2 repeats
    5. The spider extracts data --> engine --> pipeline, which processes and stores the data
  3. The Scrapy framework enables fast crawling with a small amount of code
  4. The role of each module:
    • Engine: passes data and signals between all the other modules
    • Scheduler: implements a queue that stores the request objects handed over by the engine
    • Downloader: sends the requests handed over by the engine, obtains the responses, and returns them to the engine
    • Spider: processes the responses sent over by the engine, extracts data, extracts URLs, and hands them to the engine
    • Pipeline: processes the data handed over by the engine, e.g. stores it
    • Downloader middleware: customizable download extensions, e.g. setting a proxy IP
    • Spider middleware: customizable filtering of requests and responses; its functions overlap with the downloader middleware
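To tie the modules together, custom pipelines and downloader middlewares are activated in the project's settings.py. The sketch below assumes the hypothetical project name `myproject` and the class names used in the earlier sketches.

```python
# settings.py (sketch) -- module paths and priority numbers are placeholders

# Enable the custom item pipeline; items pass through lower-numbered pipelines first
ITEM_PIPELINES = {
    "myproject.pipelines.JsonWriterPipeline": 300,
}

# Enable the custom downloader middleware; lower numbers sit closer to the engine
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomUserAgentMiddleware": 543,
}
```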