In the field of crawler development, Java and Python are the two most commonly used languages. If you develop crawlers in Python, you have almost certainly heard of Scrapy, an open source crawler framework written in Python.

Scrapy is one of the most popular open source crawler frameworks, and almost everyone who writes crawlers in Python has used it. Moreover, many open source crawler frameworks in the industry imitate or borrow from Scrapy's ideas and structure. If you want to learn crawlers in depth, reading the Scrapy source code is well worth the effort.

Starting with this article, I will share my thoughts and experience from reading the Scrapy source code.

In this article, we will look at the overall structure of Scrapy and learn, at a macro level, how it works. In the next few articles, I'll take you through each module and break down the implementation details of the framework.

Introduction

First, let's take a look at Scrapy's official introduction. On the official site, Scrapy is described as follows:

Scrapy is a Python crawler framework that lets you quickly and easily build crawlers and extract the data you need from websites.

In other words, with Scrapy you can quickly and easily build a crawler that pulls the data you need from a website.

This article will not cover the installation and basic usage of Scrapy; this series focuses on how Scrapy is implemented, by reading its source code. For installation and usage details, please refer to the official website and official documentation. (Note: at the time of writing, the Scrapy version used is 1.2, which is somewhat old, but it does not differ much from the latest version.)

Creating a crawler with Scrapy is very simple. Here is an example of what a Scrapy crawler looks like:
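A minimal spider might look like the sketch below. This is only an illustration, not code from the original article: the target site (quotes.toscrape.com), the spider name, and the CSS selectors are placeholder choices.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Illustrative minimal spider; the site and selectors are placeholders."""

    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one dict per quote block found on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }
```

Saved as, say, quotes_spider.py, it can be run directly with scrapy runspider quotes_spider.py, without creating a full project.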

Simply put, writing and running a crawler requires the following steps:

  1. Use the scrapy startproject command to create a crawler template, or write your own crawler code based on the template;
  2. Define a crawler class that inherits from scrapy.Spider and overrides its parse method;
  3. In the parse method, write the page parsing logic and define which pages to crawl next;
  4. Use scrapy runspider <spider_file.py> to run the crawler.

With just a few lines of Scrapy code, you can conveniently collect data from a website's pages.

But what’s really going on behind the scenes? How exactly does Scrapy help us do our job?

Architecture

To understand how Scrapy works, let's first look at its architecture and get a macro-level view of how the pieces fit together:

The core modules

As you can see from the architecture diagram, Scrapy mainly includes the following five modules:

  • Scrapy Engine: the core engine, responsible for controlling and scheduling all the components and keeping the data flowing;
  • Scheduler: the scheduler, responsible for managing tasks; queuing, filtering (deduplication), storing, and handing out requests are all controlled here;
  • Downloader: the downloader, responsible for fetching data from the network; it takes the URLs to be downloaded as input and outputs the download results;
  • Spiders: where we write our own crawler logic and define what to crawl and how to parse it;
  • Item Pipeline: responsible for outputting structured data, with customizable format and output destination (see the minimal sketch after this list);
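To give a feel for how pluggable the Item Pipeline is, here is a minimal sketch of a pipeline that writes each item to a JSON Lines file. The class name and output file name are made up for this example; it is not a pipeline shipped with Scrapy.

```python
import json


class JsonLinesWriterPipeline:
    """Illustrative pipeline: writes each scraped item as one JSON line.
    The output file name "items.jl" is an arbitrary choice for this sketch."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

A pipeline like this would be enabled through the ITEM_PIPELINES setting of the project.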

If you look closely, there are two more modules:

  • Downloader middlewares: sit between the engine and the downloader; they let you hook custom logic in before a page is downloaded and after it is downloaded (a minimal sketch follows this list);
  • Spider middlewares: sit between the engine and the spiders; they process download results before they reach the spider, and process the requests/data the spider outputs afterwards;
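As a simplified example of what a downloader middleware can do, the sketch below sets a default User-Agent header before each request is sent and logs the status of each response after it is downloaded. The class name, header value, and log message are illustrative only; this is not a middleware from the Scrapy source.

```python
import logging

logger = logging.getLogger(__name__)


class UserAgentLoggingMiddleware:
    """Illustrative downloader middleware showing the two main hooks."""

    def process_request(self, request, spider):
        # Runs before the request reaches the downloader.
        request.headers.setdefault("User-Agent", "my-scrapy-bot/0.1")
        return None  # returning None lets the request continue as usual

    def process_response(self, request, response, spider):
        # Runs after the downloader has produced a response.
        logger.debug("Downloaded %s with status %s", response.url, response.status)
        return response
```

Such a middleware would be registered through the DOWNLOADER_MIDDLEWARES setting.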

With these core modules in mind, let’s take a look at how the collection process flows when using Scrapy, that is, how the modules work together to complete the scraping task.

The execution flow

The data flow when Scrapy runs looks something like this:

  1. The engine gets the initial requests (also called seed URLs) from the custom crawler;
  2. The engine puts these requests into the scheduler and asks the scheduler for the next requests to download;
  3. The scheduler sends the requests to be downloaded back to the engine;
  4. The engine sends the requests to the downloader, passing them through a series of downloader middlewares;
  5. Once a request has been downloaded, the downloader generates a response object and returns it to the engine, again passing through the downloader middlewares;
  6. After the engine receives the response from the downloader, it sends it to the crawler through a series of spider middlewares, and the crawler's custom parsing logic is executed;
  7. After the crawler runs its parsing logic, it hands result objects and/or new request objects back to the engine, again through the spider middlewares (see the sketch after this list);
  8. The engine sends the result objects returned by the crawler to the Item Pipeline, and sends the new requests to the scheduler;
  9. Steps 2-8 repeat until there are no more requests left in the scheduler, at which point the task ends.
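Steps 7 and 8 are easiest to see inside a spider's parse method, which can yield both result objects (routed to the Item Pipeline) and new Request objects (routed back to the scheduler). The sketch below is a generic illustration with placeholder selectors, not code taken from Scrapy itself.

```python
import scrapy


class FollowLinksSpider(scrapy.Spider):
    """Illustrative spider: parse() yields both items and follow-up requests."""

    name = "follow_links"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Result objects: handed to the engine and then to the Item Pipeline.
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").extract_first()}

        # New Request objects: handed to the engine and then to the scheduler.
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```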

Collaboration of core modules

As you can see, Scrapy's architecture diagram is fairly clear: the modules cooperate with each other to complete the scraping task.

After reading the source code, I have put together a more detailed interaction diagram of the core modules, which shows more module-level details; you can use it as a reference:

It is worth noting that one core component of Scrapy, the Scraper, is not shown in the official architecture diagram. It sits between the Spiders, the Engine, and the Item Pipeline and acts as a bridge between these three modules. I will cover it in detail in a later source-analysis article.

The core class diagram

In addition, while reading the source code, I also put together class diagrams of these core modules, which will be very helpful when you study the source yourself.

A quick explanation of the core class diagram:

  • Plain black text indicates the core attributes of a class;
  • Text highlighted in yellow indicates the core methods of a class.

You can focus on these core attributes and methods as you read the source code.

Combining the official architecture diagram with my core module interaction diagram and core class diagram, we can see that Scrapy involves the following components:

  • Five core classes: Scrapy Engine, Scheduler, Downloader, Spiders, Item Pipeline;
  • Four middleware manager classes: DownloaderMiddlewareManager, SpiderMiddlewareManager, ItemPipelineManager, ExtensionManager;
  • Other auxiliary classes: Request, Response, Selector (a small usage sketch follows).
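As a small taste of the auxiliary classes before diving into the source, the snippet below uses Selector on a standalone HTML string and builds a Request by hand. The HTML snippet and URL are placeholders for illustration.

```python
import scrapy
from scrapy.selector import Selector

# Selector can be used on its own, outside of a running crawl.
html = "<html><body><h1>Hello Scrapy</h1></body></html>"
title = Selector(text=html).css("h1::text").extract_first()
print(title)  # -> Hello Scrapy

# A Request pairs a URL with the callback that will handle its Response.
request = scrapy.Request("http://example.com/", callback=lambda response: None)
print(request.url, request.method)  # -> http://example.com/ GET
```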

That wraps up our first look at Scrapy. In the next articles, I'll dig into the source code of each of these classes and methods.

