Life is short, I use Python.
Previous posts in this series:
Python crawler (1): The beginning
Python crawler (2): Pre-preparation (1) Basic library installation
Python crawler (3): Pre-preparation (2) Linux basics
Python crawler (4): Pre-preparation (3) Docker basics
Python crawler (5): Pre-preparation (4) Database basics
Python crawler (6): Pre-preparation (5) Crawler framework installation
Python crawler (7): HTTP basics
Python crawler (8): Web basics
Python crawler (9): Crawler basics
Python crawler (10): Sessions and Cookies
Python crawler (11): Urllib (1)
Python crawler (12): Urllib (2)
Python crawler (13): Urllib (3)
Python crawler (14): Urllib (4)
Python crawler (15): Urllib (5)
Python crawler (16): Urllib (6)
Python crawler (17): Basic usage of Requests
Python crawler (18): Advanced Requests operations
Python crawler (19): Basic XPath operations
Python crawler (20): Advanced XPath
Python crawler (21): Parsing library Beautiful Soup (Part 1)
Python crawler (22): Parsing library Beautiful Soup (Part 2)
Python crawler (23): Getting started with the parsing library pyquery
Python crawler (24): 2019 Douban movie rankings
Python crawler (25): Crawling stock information
Python crawler (26): You can't even afford a second-hand house in Shanghai
Python crawler (27): Automated testing framework Selenium, from getting started to giving up (Part 1)
Python crawler (28): Automated testing framework Selenium, from getting started to giving up (Part 2)
Python crawler (29): Selenium fetches product information from a large e-commerce site
Python crawler (30): Proxy basics
Python crawler (31): Build your own simple proxy pool
Python crawler (32): Introduction to the asynchronous request library AIOHTTP
Python crawler (33): Crawler framework Scrapy introduction (1)
Python crawler (34): Crawler framework Scrapy introduction (2)
Python crawler (35): Crawler framework Scrapy introduction (3) Selector
Python crawler (36): Crawler framework Scrapy introduction (4) Downloader Middleware
Introduction
Spider Middleware is a framework of hooks into Scrapy's spider processing mechanism, where you can plug in custom functionality to process the responses that are sent to spiders, as well as the requests and items that spiders generate.
Built-in Spider Middleware
Like Downloader Middleware, Scrapy also ships with a set of built-in Spider Middleware. These built-in Spider Middleware are defined in the SPIDER_MIDDLEWARES_BASE setting as follows:
```python
{
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}
```
Also like Downloader Middleware, custom Spider Middleware is added through the SPIDER_MIDDLEWARES setting. This setting is merged with the SPIDER_MIDDLEWARES_BASE defined inside Scrapy and sorted by value: the middleware with the lowest value sits closest to the engine, and the one with the highest value sits closest to the spider.
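As a minimal sketch, enabling a custom middleware in settings.py might look like the following. The path 'myproject.middlewares.MySpiderMiddleware' is hypothetical; adjust it to your own project layout:

```python
# settings.py -- a minimal sketch
SPIDER_MIDDLEWARES = {
    # lower values sit closer to the engine, higher values closer to the spider
    'myproject.middlewares.MySpiderMiddleware': 543,  # hypothetical class
    # setting a middleware's value to None disables it
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
```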
Custom Spider Middleware
Scrapy’s built-in Spider Middleware provides only basic functionality. If you want to extend it with a custom Spider Middleware, you only need to implement some of the following methods.
The core methods are as follows:
- process_spider_input(response, spider)
- process_spider_output(response, result, spider)
- process_spider_exception(response, exception, spider)
- process_start_requests(start_requests, spider)
Implementing any one of these methods is enough to define a Spider Middleware, as the skeleton below shows.
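A minimal skeleton, with a hypothetical middleware name; each method body just shows the expected return behavior:

```python
class MySpiderMiddleware:

    def process_spider_input(self, response, spider):
        # Called for each response that enters the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the spider.
        # Must return an iterable of Request, dict, or Item objects.
        for element in result:
            yield element

    def process_spider_exception(self, response, exception, spider):
        # Called when the spider, or a previous middleware's
        # process_spider_output(), raises an exception.
        # Should return None or an iterable of Request, dict, or Item objects.
        return None

    def process_start_requests(self, start_requests, spider):
        # Called with the spider's start requests.
        # Must return an iterable of Request objects.
        for request in start_requests:
            yield request
```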
process_spider_input(response, spider)
Parameters:
- response (Response object) – the response being processed
- spider (Spider object) – the spider this response is intended for
This method is called for every response that passes through the Spider Middleware on its way into the spider for processing.
process_spider_input() should return None or raise an exception.
If None is returned, Scrapy continues processing the response and executes all other middleware until the response is finally handed to the spider for processing.
If it raises an exception, Scrapy won’t bother calling the process_spider_input() of any other Spider Middleware, and will call the request’s errback if there is one; otherwise it will start the process_spider_exception() chain. The output of the errback is chained back in the other direction for process_spider_output() to process, or for process_spider_exception() if it raised an exception.
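For illustration, here is a hedged sketch of a process_spider_input() that refuses to hand non-HTML responses to the spider. Both the middleware class and the exception type are made-up names, not part of Scrapy:

```python
class NonHtmlResponseError(Exception):
    """Raised for responses we don't want the spider to parse (illustrative)."""


class HtmlOnlySpiderMiddleware:
    """Sketch: raise for non-HTML responses so they never reach the spider;
    the exception goes to the errback or the process_spider_exception() chain."""

    def process_spider_input(self, response, spider):
        content_type = response.headers.get(b'Content-Type', b'')
        if b'text/html' not in content_type:
            raise NonHtmlResponseError(f'non-HTML response from {response.url}')
        return None  # None: continue processing the response normally
```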
process_spider_output(response, result, spider)
Parameters:
- response (Response object) – the response which generated this output from the spider
- result (an iterable of Request, dict, or Item objects) – the result returned by the spider
- spider (Spider object) – the spider whose result is being processed
This method is called with the results returned from the spider, after the spider has processed the response.
process_spider_output() must return an iterable of Request, dict, or Item objects.
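As a hedged sketch of filtering a spider’s output: requests pass through untouched, while dict items missing a 'title' field are dropped. The 'title' field is a made-up example, not something Scrapy defines:

```python
import scrapy


class RequireTitleSpiderMiddleware:
    """Sketch: let requests through, drop dict items without a 'title' field."""

    def process_spider_output(self, response, result, spider):
        for element in result:
            if isinstance(element, scrapy.Request):
                yield element  # requests are passed along unchanged
            elif isinstance(element, dict) and not element.get('title'):
                # drop the item, leaving a trace in the logs
                spider.logger.debug('Dropping item without title from %s', response.url)
            else:
                yield element  # everything else passes the check
```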
process_spider_exception(response, exception, spider)
Parameters:
- response (Response object) – the response being processed when the exception was raised
- exception (Exception object) – the exception raised
- spider (Spider object) – the spider which raised the exception
This method is called when a spider, or the process_spider_output() method of a previous Spider Middleware, raises an exception.
process_spider_exception() should return either None or an iterable of Request, dict, or Item objects.
If it returns None, Scrapy continues processing the exception, executing the process_spider_exception() methods of the following middleware components, until no middleware components are left and the exception reaches the engine (where it is logged and discarded).
If it returns an iterable, the process_spider_output() pipeline kicks in, starting from the next Spider Middleware, and no other process_spider_exception() will be called.
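A hedged sketch of that behavior: instead of letting a parsing error propagate, turn it into a small error record so the failure shows up in the scraped output. The dict fields here are assumptions for illustration:

```python
class ExceptionToItemSpiderMiddleware:
    """Sketch: convert an exception from the spider into an 'error item'.
    Returning an iterable stops further process_spider_exception() calls
    and hands the result to the next middleware's process_spider_output()."""

    def process_spider_exception(self, response, exception, spider):
        spider.logger.warning('Parse failed for %s: %r', response.url, exception)
        # a plain dict is a valid item; 'error' and 'url' are made-up fields
        return [{'error': repr(exception), 'url': response.url}]
```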
process_start_requests(start_requests, spider)
Parameters:
- start_requests (an iterable of Request objects) – the start requests
- spider (Spider object) – the spider the start requests belong to
This method is called with the start requests of the spider and works similarly to process_spider_output(), except that it has no response associated and must return only requests (not items).
It receives an iterable (in the start_requests parameter) and must return another iterable of Request objects.
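A hedged sketch of process_start_requests() that tags every start request and caps their number; both the meta key and the cap are assumptions for illustration:

```python
class CappedStartRequestsSpiderMiddleware:
    """Sketch: mark each start request via request.meta and stop after
    an arbitrary cap, e.g. to smoke-test a large seed list."""

    MAX_START_REQUESTS = 100  # made-up cap, purely illustrative

    def process_start_requests(self, start_requests, spider):
        for i, request in enumerate(start_requests):
            if i >= self.MAX_START_REQUESTS:
                break
            request.meta['is_start_request'] = True  # made-up meta key
            yield request
```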
Spider Middleware is not used as often as Downloader Middleware, but it can be used for data processing when necessary.
Reference
https://docs.scrapy.org/en/latest/topics/spider-middleware.html