Life is short, I use Python.
Previous posts in this series:
Python crawler (1): The beginning
Python crawler (2): Pre-preparation (1) Basic library installation
Python crawler (3): Pre-preparation (2) Linux basics
Python crawler (4): Pre-preparation (3) Docker basics
Python crawler (5): Pre-preparation (4) Database basics
Python crawler (6): Pre-preparation (5) Crawler framework installation
Python crawler (7): HTTP basics
Python crawler (8): Web basics
Python crawler (9): Crawler basics
Python crawler (10): Sessions and Cookies
Python crawler (11): Urllib (1)
Python crawler (12): Urllib (2)
Python crawler (13): Urllib (3)
Python crawler (14): Urllib (4)
Python crawler (15): Urllib (5)
Python crawler (16): Urllib (6)
Python crawler (17): Basic usage of Requests
Python crawler (18): Advanced Requests operations
Python crawler (19): Basic XPath operations
Python crawler (20): Advanced XPath
Python crawler (21): Parsing library Beautiful Soup (Part 1)
Python crawler (22): Parsing library Beautiful Soup (Part 2)
Python crawler (23): Getting started with the parsing library pyquery
Python crawler (24): 2019 Douban movie rankings
Python crawler (25): Crawling stock information
Python crawler (26): You can't even afford a second-hand house in Shanghai
Python crawler (27): Automated testing framework Selenium, from getting started to giving up (Part 1)
Python crawler (28): Automated testing framework Selenium, from getting started to giving up (Part 2)
Python crawler (29): Selenium fetches product information from a large e-commerce site
Python crawler (30): Proxy basics
Python crawler (31): Build your own simple proxy pool
Python crawler (32): Introduction to the asynchronous request library AIOHTTP
Python crawler (33): Crawler framework Scrapy introduction (1)
Python crawler (34): Crawler framework Scrapy introduction (2)
Python crawler (35): Crawler framework Scrapy introduction (3) Selector
Python crawler (36): Crawler framework Scrapy introduction (4) Downloader Middleware
Introduction
Spider Middleware is a framework of hooks into Scrapy's spider processing mechanism, where you can plug in custom functionality to process the responses that are sent to spiders, as well as the requests and items that spiders generate.
Built-in Spider Middleware
Like Downloader Middleware, Scrapy also ships with a set of built-in Spider Middleware. These built-in Spider Middleware are defined in the SPIDER_MIDDLEWARES_BASE setting as follows:
```python
{
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}
```
Also like Downloader Middleware, custom Spider Middleware is added through the SPIDER_MIDDLEWARES setting. This setting is merged with the SPIDER_MIDDLEWARES_BASE defined inside Scrapy and sorted by value: the middleware with the lowest value sits closest to the engine, and the one with the highest value sits closest to the spider.
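As a minimal sketch, enabling a custom middleware in settings.py might look like the following. The path 'myproject.middlewares.MySpiderMiddleware' is hypothetical; adjust it to your own project layout:

```python
# settings.py -- a minimal sketch
SPIDER_MIDDLEWARES = {
    # lower values sit closer to the engine, higher values closer to the spider
    'myproject.middlewares.MySpiderMiddleware': 543,  # hypothetical class
    # setting a middleware's value to None disables it
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
```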
Custom Spider Middleware
Scrapy’s built-in Spider Middleware provides only basic functionality. If you want to extend it with a custom Spider Middleware, you only need to implement some of the following methods.
The core methods are as follows:
- process_spider_input(response, spider)
- process_spider_output(response, result, spider)
- process_spider_exception(response, exception, spider)
- process_start_requests(start_requests, spider)
Implementing any one of these methods is enough to define a Spider Middleware, as the skeleton below shows.
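A minimal skeleton, with a hypothetical middleware name; each method body just shows the expected return behavior:

```python
class MySpiderMiddleware:

    def process_spider_input(self, response, spider):
        # Called for each response that enters the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the spider.
        # Must return an iterable of Request, dict, or Item objects.
        for element in result:
            yield element

    def process_spider_exception(self, response, exception, spider):
        # Called when the spider, or a previous middleware's
        # process_spider_output(), raises an exception.
        # Should return None or an iterable of Request, dict, or Item objects.
        return None

    def process_start_requests(self, start_requests, spider):
        # Called with the spider's start requests.
        # Must return an iterable of Request objects.
        for request in start_requests:
            yield request
```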
process_spider_input(response, spider)
Parameters:
- response (Response object) – the response being processed
- spider (Spider object) – the spider this response is intended for
This method is called for every response that passes through the Spider Middleware on its way into the spider for processing.
process_spider_input() should return None or raise an exception.
If None is returned, Scrapy continues processing the response and executes all other middleware until the response is finally handed to the spider for processing.
If it raises an exception, Scrapy won’t bother calling the process_spider_input() of any other Spider Middleware, and will call the request’s errback if there is one; otherwise it will start the process_spider_exception() chain. The output of the errback is chained back in the other direction for process_spider_output() to process, or for process_spider_exception() if it raised an exception.
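For illustration, here is a hedged sketch of a process_spider_input() that refuses to hand non-HTML responses to the spider. Both the middleware class and the exception type are made-up names, not part of Scrapy:

```python
class NonHtmlResponseError(Exception):
    """Raised for responses we don't want the spider to parse (illustrative)."""


class HtmlOnlySpiderMiddleware:
    """Sketch: raise for non-HTML responses so they never reach the spider;
    the exception goes to the errback or the process_spider_exception() chain."""

    def process_spider_input(self, response, spider):
        content_type = response.headers.get(b'Content-Type', b'')
        if b'text/html' not in content_type:
            raise NonHtmlResponseError(f'non-HTML response from {response.url}')
        return None  # None: continue processing the response normally
```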
process_spider_output(response, result, spider)
Parameters:
- response (Response object) – the response which generated this output from the spider
- result (an iterable of Request, dict, or Item objects) – the result returned by the spider
- spider (Spider object) – the spider whose result is being processed
This method is called with the results returned from the spider, after the spider has processed the response.
process_spider_output() must return an iterable of Request, dict, or Item objects.
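As a hedged sketch of filtering a spider’s output: requests pass through untouched, while dict items missing a 'title' field are dropped. The 'title' field is a made-up example, not something Scrapy defines:

```python
import scrapy


class RequireTitleSpiderMiddleware:
    """Sketch: let requests through, drop dict items without a 'title' field."""

    def process_spider_output(self, response, result, spider):
        for element in result:
            if isinstance(element, scrapy.Request):
                yield element  # requests are passed along unchanged
            elif isinstance(element, dict) and not element.get('title'):
                # drop the item, leaving a trace in the logs
                spider.logger.debug('Dropping item without title from %s', response.url)
            else:
                yield element  # everything else passes the check
```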
process_spider_exception(response, exception, spider)
Parameters:
- response (Response object) – the response being processed when the exception was raised
- exception (Exception object) – the exception raised
- spider (Spider object) – the spider which raised the exception
This method is called when a spider, or the process_spider_output() method of a previous Spider Middleware, raises an exception.
process_spider_exception() should return either None or an iterable of Request, dict, or Item objects.
If it returns None, Scrapy continues processing the exception, executing the process_spider_exception() methods of the following middleware components, until no middleware components are left and the exception reaches the engine (where it is logged and discarded).
If it returns an iterable, the process_spider_output() pipeline kicks in, starting from the next Spider Middleware, and no other process_spider_exception() will be called.
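A hedged sketch of that behavior: instead of letting a parsing error propagate, turn it into a small error record so the failure shows up in the scraped output. The dict fields here are assumptions for illustration:

```python
class ExceptionToItemSpiderMiddleware:
    """Sketch: convert an exception from the spider into an 'error item'.
    Returning an iterable stops further process_spider_exception() calls
    and hands the result to the next middleware's process_spider_output()."""

    def process_spider_exception(self, response, exception, spider):
        spider.logger.warning('Parse failed for %s: %r', response.url, exception)
        # a plain dict is a valid item; 'error' and 'url' are made-up fields
        return [{'error': repr(exception), 'url': response.url}]
```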
process_start_requests(start_requests, spider)
Parameters:
- start_requests (an iterable of Request objects) – the start requests
- spider (Spider object) – the spider the start requests belong to
This method is called with the start requests of the spider and works similarly to process_spider_output(), except that it has no response associated and must return only requests (not items).
It receives an iterable (in the start_requests parameter) and must return another iterable of Request objects.
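A hedged sketch of process_start_requests() that tags every start request and caps their number; both the meta key and the cap are assumptions for illustration:

```python
class CappedStartRequestsSpiderMiddleware:
    """Sketch: mark each start request via request.meta and stop after
    an arbitrary cap, e.g. to smoke-test a large seed list."""

    MAX_START_REQUESTS = 100  # made-up cap, purely illustrative

    def process_start_requests(self, start_requests, spider):
        for i, request in enumerate(start_requests):
            if i >= self.MAX_START_REQUESTS:
                break
            request.meta['is_start_request'] = True  # made-up meta key
            yield request
```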
Spider Middleware is not used as often as Downloader Middleware, but it can be used for data processing when necessary.
Reference
https://docs.scrapy.org/en/latest/topics/spider-middleware.html