It takes about 15 minutes to read this article.
How does Scrapy work? In the previous article, we broke down the core logic of how Scrapy runs, that is, everything it does before it actually starts scraping.
This time we'll answer: what are Scrapy's core components, what is each of them responsible for, and how are they implemented internally to accomplish those responsibilities?
Spider
Let's pick up where we left off. Last time we saw that Scrapy's startup ends with a call to the Crawler's crawl method, so let's take a look at that method:
@defer.inlineCallbacks
def crawl(self, *args, **kwargs):
    assert not self.crawling, "Crawling already taking place"
    self.crawling = True
    try:
        # Find the spider class via the SpiderLoader and instantiate it
        self.spider = self._create_spider(*args, **kwargs)
        # Create the engine
        self.engine = self._create_engine()
        # Call the spider's start_requests method to get the seed URLs
        start_requests = iter(self.spider.start_requests())
        # Call the engine's open_spider, passing in the spider instance and the initial requests
        yield self.engine.open_spider(self.spider, start_requests)
        yield defer.maybeDeferred(self.engine.start)
    except Exception:
        if six.PY2:
            exc_info = sys.exc_info()
        self.crawling = False
        if self.engine is not None:
            yield self.engine.close()
        if six.PY2:
            six.reraise(*exc_info)
        raise
At this point we can see that the spider instance is created first, then the engine, and finally the spider is handed over to the engine for processing.
In the last article, we also mentioned that when the Crawler is instantiated, it creates a SpiderLoader, which locates our spider code based on the settings.py configuration file we defined.
The SpiderLoader then scans those code files, finds every class whose parent is scrapy.Spider, and builds a {spider_name: spider_class} mapping. When we run scrapy crawl <spider_name>, it looks up our spider class by that name and instantiates it in the _create_spider method:
def _create_spider(self, *args, **kwargs):
    # Call the from_crawler class method to instantiate the spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
So the spider is not created by calling the constructor directly; instead, the from_crawler class method is used. Tracing it, we find the method defined on the base scrapy.Spider class:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = cls(*args, **kwargs)
    spider._set_crawler(crawler)
    return spider

def _set_crawler(self, crawler):
    self.crawler = crawler
    # Assign the Settings object to the spider instance
    self.settings = crawler.settings
    crawler.signals.connect(self.close, signals.spider_closed)
So this class method simply calls the constructor to create the spider instance, and at the same time attaches the Crawler and its Settings object to it. What does the constructor itself do?
class Spider(object_ref):
    name = None
    custom_settings = None

    def __init__(self, name=None, **kwargs):
        # name is required
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        # If start_urls is not set, default to []
        if not hasattr(self, 'start_urls'):
            self.start_urls = []
Does this look familiar? These are the attributes we use most often when writing a spider: name, start_urls, and custom_settings:
- name: used to look up the spider class we wrote when we run the crawler;
- start_urls: the crawl entry points, also known as seed URLs;
- custom_settings: spider-specific configuration that overrides the items in the configuration file.
A minimal spider using these attributes is sketched below.
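To make these attributes concrete, here is a minimal example spider; the class name, site, and selectors are purely illustrative and are not part of the source we are reading:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    # name is required: `scrapy crawl quotes` finds this class by that name
    name = 'quotes'
    # seed URLs: the crawl starts from requests to these addresses
    start_urls = ['http://quotes.toscrape.com/']
    # spider-specific settings that override settings.py
    custom_settings = {'DOWNLOAD_DELAY': 1}

    def parse(self, response):
        # default callback for responses to the seed requests
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}
```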
Engine
Having analyzed how the spider is initialized, let's go back to the Crawler's crawl method, where the engine object is created next via _create_engine. What happens when the engine is initialized?
class ExecutionEngine(object):
    """Engine"""

    def __init__(self, crawler, spider_closed_callback):
        self.crawler = crawler
        # Save the Settings configuration on the engine
        self.settings = crawler.settings
        # Signals
        self.signals = crawler.signals
        # Log formatter
        self.logformatter = crawler.logformatter
        self.slot = None
        self.spider = None
        self.running = False
        self.paused = False
        # Find the Scheduler class from the settings
        self.scheduler_cls = load_object(self.settings['SCHEDULER'])
        # Likewise, find the Downloader class
        downloader_cls = load_object(self.settings['DOWNLOADER'])
        # Instantiate the Downloader
        self.downloader = downloader_cls(crawler)
        # The Scraper is the bridge between the engine and the spider
        self.scraper = Scraper(crawler)
        self._spider_closed_callback = spider_closed_callback
Here the engine defines and initializes the other core components: the Scheduler, the Downloader, and the Scraper. Note that the Scheduler is only resolved as a class at this point; it is not yet instantiated.
In other words, the engine is the heart of Scrapy: it manages and schedules these components so that they work together.
So how is each of these core components initialized?
Scheduler
The scheduler is actually instantiated later, in the engine's open_spider method, but let's look at its initialization here in advance.
class Scheduler(object):
    """Scheduler"""

    def __init__(self, dupefilter, jobdir=None, dqclass=None, mqclass=None,
                 logunser=False, stats=None, pqclass=None):
        # Request fingerprint filter
        self.df = dupefilter
        # Task queue directory
        self.dqdir = self._dqdir(jobdir)
        # Priority task queue class
        self.pqclass = pqclass
        # Disk task queue class
        self.dqclass = dqclass
        # Memory task queue class
        self.mqclass = mqclass
        # Whether to log unserializable requests
        self.logunser = logunser
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        # Get the fingerprint filter class from the configuration file
        dupefilter_cls = load_object(settings['DUPEFILTER_CLASS'])
        # Instantiate the fingerprint filter
        dupefilter = dupefilter_cls.from_settings(settings)
        # Get the priority queue, disk queue and memory queue classes from the configuration file
        pqclass = load_object(settings['SCHEDULER_PRIORITY_QUEUE'])
        dqclass = load_object(settings['SCHEDULER_DISK_QUEUE'])
        mqclass = load_object(settings['SCHEDULER_MEMORY_QUEUE'])
        # Switch for logging unserializable requests
        logunser = settings.getbool('LOG_UNSERIALIZABLE_REQUESTS',
                                    settings.getbool('SCHEDULER_DEBUG'))
        return cls(dupefilter, jobdir=job_dir(settings), logunser=logunser,
                   stats=crawler.stats, pqclass=pqclass, dqclass=dqclass,
                   mqclass=mqclass)
As you can see, the scheduler's initialization does two main things:
- instantiate the request fingerprint filter, which is used to filter out duplicate requests;
- define the different types of task queues: a priority queue, a disk-based queue, and a memory-based queue.
So what is the request fingerprint filter?
In the configuration file, we can see that the default fingerprint filter defined is RFPDupeFilter:
class RFPDupeFilter(BaseDupeFilter):
    """Request fingerprint filter"""

    def __init__(self, path=None, debug=False):
        self.file = None
        # The fingerprint collection is an in-memory set
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        # Request fingerprints can also be persisted to disk
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
    def from_settings(cls, settings):
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)
When the fingerprint filter is initialized, it creates a fingerprint collection backed by an in-memory set, and it can optionally persist those fingerprints to disk so they can be reused on the next run.
In other words, the fingerprint filter is responsible for filtering out duplicate requests, and you can customize the filtering rules if you need to.
In the next article we'll look at how each request's fingerprint is generated and how the duplicate-filtering logic is implemented; for now it's enough to know what it does. A simplified sketch of the idea follows.
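To make the role of the fingerprint set concrete before the next article, here is a toy sketch of the idea, not Scrapy's actual implementation: hash the request, and skip it if the hash has been seen before.

```python
import hashlib

class TinyDupeFilter:
    """Toy duplicate filter: remembers one fingerprint per request."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method, url):
        # The real filter also normalizes the URL and may hash body/headers;
        # hashing the method plus the URL is enough to show the idea.
        fp = hashlib.sha1(('%s %s' % (method, url)).encode('utf-8')).hexdigest()
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

df = TinyDupeFilter()
print(df.request_seen('GET', 'http://example.com/'))  # False, first time seen
print(df.request_seen('GET', 'http://example.com/'))  # True, duplicate request
```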
And what about the task queues the scheduler defines?
By default the scheduler works with two kinds of queues:
- a disk-based task queue: if a storage path is configured, pending tasks are saved to disk when the crawler exits;
- a memory-based task queue: tasks live only in memory and are lost on the next startup.
The default configuration file definition is as follows:
# Disk-based task queue (last in, first out)
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
# Memory-based task queue (last in, first out)
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
# Priority queue
SCHEDULER_PRIORITY_QUEUE = 'queuelib.PriorityQueue'
If we define the JOBDIR configuration item, the task queue is saved to disk whenever the crawler exits, so the next time the crawler starts it can reload those tasks and continue where it left off; see the example below.
If this option is not defined, the memory queue is used by default.
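For example, persisting the queue only requires pointing JOBDIR at a directory; the path below is illustrative:

```python
# settings.py (or pass it per run: scrapy crawl myspider -s JOBDIR=crawls/myspider-1)
JOBDIR = 'crawls/myspider-1'
```

Stop the crawler gracefully and start it again with the same JOBDIR, and the scheduler reloads the pending requests from disk and carries on.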
If you're careful, you might have noticed that the default queue structure is LIFO (last in, first out). What does that mean?
It means that while the crawler is running, whenever a new request is generated and put into the task queue, that newest request is the first one taken back out and executed.
What does this imply? Scrapy's default crawling order is depth-first.
So how do we change this mechanism to breadth-first crawling? This is where the scrapy.squeues module comes in; it defines queues of various types:
# FIFO disk queue (pickle serialization)
PickleFifoDiskQueue = _serializable_queue(queue.FifoDiskQueue, \
    _pickle_serialize, pickle.loads)
# LIFO disk queue (pickle serialization)
PickleLifoDiskQueue = _serializable_queue(queue.LifoDiskQueue, \
    _pickle_serialize, pickle.loads)
# FIFO disk queue (marshal serialization)
MarshalFifoDiskQueue = _serializable_queue(queue.FifoDiskQueue, \
    marshal.dumps, marshal.loads)
# LIFO disk queue (marshal serialization)
MarshalLifoDiskQueue = _serializable_queue(queue.LifoDiskQueue, \
    marshal.dumps, marshal.loads)
# FIFO memory queue
FifoMemoryQueue = queue.FifoMemoryQueue
# LIFO memory queue
LifoMemoryQueue = queue.LifoMemoryQueue
So if we want to switch crawling to breadth-first, all we have to do is change the queue classes to the FIFO variants in the configuration file, as shown below. As you can see, the coupling between Scrapy's components is very low, and each module is customizable.
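A minimal sketch of that change, assuming we only swap the two queue settings discussed in this article:

```python
# settings.py: use FIFO queues so the oldest pending request is fetched first (breadth-first)
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```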
If you want to explore how these queues are implemented, check out the Scrapy authors' queuelib project on GitHub.
Downloader
Going back to the initialization of the engine, let’s look at how the downloader is initialized.
In the default configuration file default_settings.py, the downloader is configured as follows:
DOWNLOADER = 'scrapy.core.downloader.Downloader'
Let’s look at the initialization of the Downloader class:
class Downloader(object):
    """Downloader"""

    def __init__(self, crawler):
        # Also keep a reference to the Settings object
        self.settings = crawler.settings
        self.signals = crawler.signals
        self.slots = {}
        self.active = set()
        # Initialize the DownloadHandlers
        self.handlers = DownloadHandlers(crawler)
        # Read the configured total concurrency
        self.total_concurrency = self.settings.getint('CONCURRENT_REQUESTS')
        # Number of concurrent requests for the same domain
        self.domain_concurrency = self.settings.getint('CONCURRENT_REQUESTS_PER_DOMAIN')
        # Number of concurrent requests for the same IP address
        self.ip_concurrency = self.settings.getint('CONCURRENT_REQUESTS_PER_IP')
        # Whether to randomize the download delay
        self.randomize_delay = self.settings.getbool('RANDOMIZE_DOWNLOAD_DELAY')
        # Initialize the downloader middleware manager
        self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
        self._slot_gc_loop = task.LoopingCall(self._slot_gc)
        self._slot_gc_loop.start(60)
During this process, the download handlers, the downloader middleware manager, and the parameters that control request concurrency are all initialized from the configuration file; the relevant settings are shown below.
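For reference, these are the concurrency-related settings the downloader reads; the values shown are the usual Scrapy defaults at the time of writing and are meant as an illustration, not a recommendation:

```python
# settings.py: the knobs Downloader.__init__ reads
CONCURRENT_REQUESTS = 16             # total concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # limit per domain
CONCURRENT_REQUESTS_PER_IP = 0       # limit per IP (0 means the per-domain limit applies)
RANDOMIZE_DOWNLOAD_DELAY = True      # add jitter to DOWNLOAD_DELAY between requests
```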
So what does a download handler do? And what does the downloader middleware do?
First, the DownloadHandlers:
class DownloadHandlers(object):
    """Download handlers"""

    def __init__(self, crawler):
        self._crawler = crawler
        self._schemes = {}   # maps each scheme to a class path, used later for instantiation
        self._handlers = {}  # maps each scheme to an instantiated download handler
        self._notconfigured = {}
        # Build the download handlers from DOWNLOAD_HANDLERS_BASE in the configuration
        # Note: getwithbase() merges the XXXX and XXXX_BASE settings
        handlers = without_none_values(
            crawler.settings.getwithbase('DOWNLOAD_HANDLERS'))
        # Store the class path for each scheme; instantiation happens later
        for scheme, clspath in six.iteritems(handlers):
            self._schemes[scheme] = clspath

        crawler.signals.connect(self._close, signals.engine_stopped)
Download handlers are configured like this in the default configuration file:
# User-defined download handlers
DOWNLOAD_HANDLERS = {}
# Default download handlers
DOWNLOAD_HANDLERS_BASE = {
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
    'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}
From this we can see that the download handlers pick the appropriate handler for the resource being downloaded based on the URL scheme; the most commonly used are the HTTP and HTTPS handlers.
Note, however, that these handlers are not instantiated here. They are initialized lazily, only when a network request for that scheme is actually made, and only once; this will be covered in a later article. A sketch of the lookup idea follows.
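Conceptually, picking a handler works roughly like the hypothetical sketch below; the real code stores class paths and instantiates the handler lazily on first use, which this toy version skips:

```python
from urllib.parse import urlparse

# Hypothetical registry: URL scheme -> download handler (real handlers are objects, not strings)
handlers = {
    'http': 'HTTPDownloadHandler',
    'https': 'HTTPDownloadHandler',
    'ftp': 'FTPDownloadHandler',
}

def pick_handler(url):
    # The scheme of the request URL decides which handler performs the download
    scheme = urlparse(url).scheme
    if scheme not in handlers:
        raise RuntimeError('no download handler configured for scheme %r' % scheme)
    return handlers[scheme]

print(pick_handler('https://example.com/page'))  # -> 'HTTPDownloadHandler'
```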
Next, let's look at how the downloader middleware manager, DownloaderMiddlewareManager, is initialized. As before, it is created through the from_crawler class method. DownloaderMiddlewareManager inherits from MiddlewareManager, so let's see what work that base class does during initialization:
class MiddlewareManager(object):
    """Base class of all middleware managers, providing the common middleware methods."""

    component_name = 'foo middleware'

    @classmethod
    def from_crawler(cls, crawler):
        # Delegate to from_settings
        return cls.from_settings(crawler.settings, crawler)

    @classmethod
    def from_settings(cls, settings, crawler=None):
        # Call the subclass's _get_mwlist_from_settings to get all middleware class paths
        mwlist = cls._get_mwlist_from_settings(settings)
        middlewares = []
        enabled = []
        # Instantiate them in order
        for clspath in mwlist:
            try:
                # Load the middleware class
                mwcls = load_object(clspath)
                # If the middleware class defines from_crawler, instantiate it with that
                if crawler and hasattr(mwcls, 'from_crawler'):
                    mw = mwcls.from_crawler(crawler)
                # Otherwise, if it defines from_settings, use that
                elif hasattr(mwcls, 'from_settings'):
                    mw = mwcls.from_settings(settings)
                # If neither method exists, call the constructor directly
                else:
                    mw = mwcls()
                middlewares.append(mw)
                enabled.append(clspath)
            except NotConfigured as e:
                if e.args:
                    clsname = clspath.split('.')[-1]
                    logger.warning("Disabled %(clsname)s: %(eargs)s",
                                   {'clsname': clsname, 'eargs': e.args[0]},
                                   extra={'crawler': crawler})
        logger.info("Enabled %(componentname)ss:\n%(enabledlist)s",
                    {'componentname': cls.component_name,
                     'enabledlist': pprint.pformat(enabled)},
                    extra={'crawler': crawler})
        # Call the constructor with the instantiated middlewares
        return cls(*middlewares)

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Which middleware classes to load is defined by the subclass
        raise NotImplementedError

    def __init__(self, *middlewares):
        self.middlewares = middlewares
        # Registry of middleware methods; each entry is a chain that gets called in turn
        self.methods = defaultdict(list)
        for mw in middlewares:
            self._add_middleware(mw)

    def _add_middleware(self, mw):
        # Default registration; subclasses can override this
        # If the middleware defines open_spider, append it to the chain
        if hasattr(mw, 'open_spider'):
            self.methods['open_spider'].append(mw.open_spider)
        # If the middleware defines close_spider, prepend it to the chain
        if hasattr(mw, 'close_spider'):
            self.methods['close_spider'].insert(0, mw.close_spider)
DownloaderMiddlewareManager instantiation:
class DownloaderMiddlewareManager(MiddlewareManager):
    """Downloader middleware manager"""

    component_name = 'downloader middleware'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Get all downloader middlewares from DOWNLOADER_MIDDLEWARES_BASE and DOWNLOADER_MIDDLEWARES
        return build_component_list(
            settings.getwithbase('DOWNLOADER_MIDDLEWARES'))

    def _add_middleware(self, mw):
        # Register the downloader middleware's request, response and exception method chains
        if hasattr(mw, 'process_request'):
            self.methods['process_request'].append(mw.process_request)
        if hasattr(mw, 'process_response'):
            self.methods['process_response'].insert(0, mw.process_response)
        if hasattr(mw, 'process_exception'):
            self.methods['process_exception'].insert(0, mw.process_exception)
So DownloaderMiddlewareManager inherits from MiddlewareManager and overrides _add_middleware to register hooks for before a download (process_request), after a download (process_response), and when an exception occurs (process_exception).
Here we can pause and think: what is the benefit of organizing middleware this way?
As data travels from one component to another, it passes through a chain of middlewares, each defining its own processing step. It works like a pipeline: a request can be transformed on its way to the next component, and once that component has done its work, the response can be transformed by the same chain on the way back, producing the final output. A minimal custom downloader middleware is sketched below.
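As an illustration, here is a minimal, hypothetical downloader middleware; the class name is made up, and the settings path used to enable it is only an example, not something shown in the Scrapy source above:

```python
import logging

logger = logging.getLogger(__name__)

class CustomHeaderMiddleware(object):
    """Hypothetical downloader middleware: tweak outgoing requests, inspect incoming responses."""

    def process_request(self, request, spider):
        # Runs before the request reaches the download handler
        request.headers.setdefault('X-Example', 'demo')
        return None  # None lets the request continue down the middleware chain

    def process_response(self, request, response, spider):
        # Runs after the download, on the way back to the engine
        logger.debug('Got %s for %s', response.status, request.url)
        return response

# Enabled in settings.py (path and priority are illustrative):
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.CustomHeaderMiddleware': 543}
```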
Scraper
After the downloader is instantiated, we return to the engine's initialization method, where the Scraper is created next. As mentioned in the architecture overview article, this class does not appear in Scrapy's architecture diagram, yet it sits between the Engine, the Spiders, and the Pipelines and acts as a bridge between these three components.
Let’s take a look at its initialization:
class Scraper(object):

    def __init__(self, crawler):
        self.slot = None
        # Instantiate the spider middleware manager
        self.spidermw = SpiderMiddlewareManager.from_crawler(crawler)
        # Load the Pipeline manager class from the configuration file
        itemproc_cls = load_object(crawler.settings['ITEM_PROCESSOR'])
        # Instantiate the Pipeline manager
        self.itemproc = itemproc_cls.from_crawler(crawler)
        # Read from the configuration how many items may be processed in parallel
        self.concurrent_items = crawler.settings.getint('CONCURRENT_ITEMS')
        self.crawler = crawler
        self.signals = crawler.signals
        self.logformatter = crawler.logformatter
The Scraper creates a SpiderMiddlewareManager, which is initialized as follows:
class SpiderMiddlewareManager(MiddlewareManager):
    """Spider middleware manager"""

    component_name = 'spider middleware'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Get the default spider middleware classes from SPIDER_MIDDLEWARES_BASE and SPIDER_MIDDLEWARES
        return build_component_list(settings.getwithbase('SPIDER_MIDDLEWARES'))

    def _add_middleware(self, mw):
        super(SpiderMiddlewareManager, self)._add_middleware(mw)
        # Register the spider middleware processing methods
        if hasattr(mw, 'process_spider_input'):
            self.methods['process_spider_input'].append(mw.process_spider_input)
        if hasattr(mw, 'process_spider_output'):
            self.methods['process_spider_output'].insert(0, mw.process_spider_output)
        if hasattr(mw, 'process_spider_exception'):
            self.methods['process_spider_exception'].insert(0, mw.process_spider_exception)
        if hasattr(mw, 'process_start_requests'):
            self.methods['process_start_requests'].insert(0, mw.process_start_requests)
The spider middleware manager's initialization is similar to the downloader middleware manager's: it loads the default spider middleware classes from the configuration file and then registers their processing methods in turn. The default spider middleware classes defined in the configuration file are as follows:
SPIDER_MIDDLEWARES_BASE = {
    # Default spider middleware classes
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}
Here's what the default spider middlewares do:
- HttpErrorMiddleware: handles non-200 error responses;
- OffsiteMiddleware: if allowed_domains is defined on the Spider, automatically filters out requests to other domains;
- RefererMiddleware: adds the Referer header to requests;
- UrlLengthMiddleware: filters out requests whose URL exceeds the configured maximum length;
- DepthMiddleware: filters out requests that exceed the configured maximum crawl depth.
Of course, you can also define your own spider middleware to implement custom logic; a minimal sketch follows.
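For instance, a hypothetical spider middleware that drops scraped items missing a 'title' field could look like this; the field name and class are invented for illustration, items are assumed to be plain dicts, and the settings path is only an example:

```python
class RequireTitleMiddleware(object):
    """Hypothetical spider middleware: filter what the spider yields before it reaches the engine."""

    def process_spider_output(self, response, result, spider):
        # result is the iterable of requests and items produced by the spider callback
        for element in result:
            if isinstance(element, dict) and not element.get('title'):
                spider.logger.debug('Dropping item without title from %s', response.url)
                continue
            yield element

# Enabled in settings.py (path and priority are illustrative):
# SPIDER_MIDDLEWARES = {'myproject.middlewares.RequireTitleMiddleware': 550}
```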
After the spider middleware manager is initialized, the Pipeline component is initialized next. The default Pipeline manager is ItemPipelineManager:
class ItemPipelineManager(MiddlewareManager):

    component_name = 'item pipeline'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Load the pipeline classes from ITEM_PIPELINES_BASE and ITEM_PIPELINES
        return build_component_list(settings.getwithbase('ITEM_PIPELINES'))

    def _add_middleware(self, pipe):
        super(ItemPipelineManager, self)._add_middleware(pipe)
        # Register the pipeline's item-processing method
        if hasattr(pipe, 'process_item'):
            self.methods['process_item'].append(pipe.process_item)

    def process_item(self, item, spider):
        # Call every pipeline's process_item method in turn
        return self._process_chain('process_item', item, spider)
ItemPipelineManager is also a subclass of MiddlewareManager; it behaves very much like middleware, but because of its independent role it is counted as one of the core components.
From the Scraper's initialization we can see that it manages the data interaction between the Spiders and the Pipelines; a minimal pipeline sketch follows.
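To round this off, here is a minimal, hypothetical item pipeline; process_item is exactly the method ItemPipelineManager chains together, items are again assumed to be plain dicts, and the settings path is only an example:

```python
from scrapy.exceptions import DropItem

class DropEmptyTextPipeline(object):
    """Hypothetical pipeline: discard items whose 'text' field is empty."""

    def process_item(self, item, spider):
        # Called for every item the spider yields, in pipeline priority order
        if not item.get('text'):
            raise DropItem('missing text in an item from spider %s' % spider.name)
        return item

# Enabled in settings.py (path and priority are illustrative):
# ITEM_PIPELINES = {'myproject.pipelines.DropEmptyTextPipeline': 300}
```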
Conclusion
At this point, all the core components have been initialized: the engine, the downloader, the scheduler, the spider, and the scraper, each with its own submodules responsible for a specific piece of work.
These components each play their own role and coordinate with one another to complete the crawling task. We can also see from the code that every component class is specified in the configuration file, which means we can implement our own logic and swap these components out; see the sketch below. This design pattern is well worth learning from.
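As a hypothetical example of such a replacement, swapping a component is just a matter of pointing the corresponding setting at your own class; the module paths below are invented, and your class must implement the same interface as the component it replaces:

```python
# settings.py: replace built-in components with your own implementations (illustrative paths)
SCHEDULER = 'myproject.scheduler.MyScheduler'
DUPEFILTER_CLASS = 'myproject.dupefilters.MyDupeFilter'
DOWNLOADER = 'myproject.downloader.MyDownloader'
```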
In the next article, I’ll take a look at the core of Scrapy and how the components work together to accomplish our scraping tasks.
Crawler series:
- Scrapy source code analysis (1): architecture overview
- Scrapy source code analysis (2): how does Scrapy run?
- Scrapy source code analysis (3): what are Scrapy's core components?
- Scrapy source code analysis (4): how does Scrapy complete the scraping task?
- How to build a crawler proxy service?
- How to build a universal vertical crawler platform?
My advanced Python series:
- Python Advanced – How to implement a decorator?
- Python Advanced – How to use magic methods correctly? (Part 1)
- Python Advanced – How to use magic methods correctly? (Part 2)
- Python Advanced – What is a metaclass?
- Python Advanced – What is a context manager?
- Python Advanced – What is an iterator?
- Python Advanced – How to use yield correctly?
- Python Advanced – What is a descriptor?
- Python Advanced – Why does the GIL make multithreading so useless?
Want to read more hardcore technical articles? Follow the "Water Drop and Silver Bullet" public account to get high-quality technical content as soon as it is published. I'm a back-end developer with 7 years of experience, and I explain technology clearly in plain, simple terms.