This article takes about 5 minutes to read.

Why use a crawler framework

When we write an ordinary crawler, simply using requests, XPath and other crawling libraries falls far short of what a crawler framework provides. Even a prototype crawler framework should contain a scheduler, a queue, request objects and so on, while the crawlers we usually write lack even this most basic structure.
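
For contrast, here is roughly what such a bare-bones crawler looks like without any framework (a minimal sketch; the URL and XPath expressions are placeholders for illustration):

```python
# A framework-free crawler: just fetch and parse, nothing else.
# The URL and XPath expressions below are placeholders.
import requests
from lxml import etree

def crawl(url):
    # Fetch the page; there is no scheduling, retrying or exception
    # handling here -- exactly the things a framework would provide.
    response = requests.get(url, timeout=10)
    html = etree.HTML(response.text)
    # Extract data and candidate "next requests" with XPath.
    titles = html.xpath('//h2/a/text()')
    next_urls = html.xpath('//h2/a/@href')
    return titles, next_urls

if __name__ == '__main__':
    print(crawl('https://example.com/news'))
```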

Such an architecture and its modules are still too simple to count as a framework. But if we isolate each component and define it as a separate module, a framework gradually takes shape.

With a framework, we no longer need to care about the whole crawling workflow: exception handling, task scheduling and so on are built into the framework. We only need to focus on the core logic of the crawler, such as extracting information from pages and generating the next requests. This way, development is not only much faster, but the resulting crawler is also more robust.

In real projects we therefore often use a crawler framework for scraping, which improves development efficiency and saves development time. PySpider is an excellent crawling framework: it is convenient to operate and powerful, and with it we can complete the development of a crawler quickly and easily.

Introduction to the PySpider framework

PySpider is a powerful web crawling system written by Binux. It ships with a WebUI, a script editor, a task monitor, a project manager and a result processor, supports a variety of database backends and message queues, and can crawl JavaScript-rendered pages. It is very convenient to use.

Its GitHub address is:

https://github.com/binux/pyspider

Official Document Address:

http://docs.pyspider.org/

Basic features of PySpider

PySpider provides the following features:

  • 1 Provides a convenient, easy-to-use WebUI for writing and debugging crawlers visually.

  • 2 Provides functions such as monitoring crawl progress, viewing crawl results, and managing crawler projects.

  • 3 Supports multiple back-end databases, such as MySQL, MongoDB, Redis, SQLite, Elasticsearch, and PostgreSQL.

  • 4 Supports multiple message queues, such as RabbitMQ, Beanstalk, Redis, and Kombu.

  • 5 Provides priority control, retry on failure, and scheduled crawling (illustrated in the sketch after this list).

  • 6 Integrates with PhantomJS to crawl JavaScript-rendered pages.

  • 7 Supports single-node deployment, distributed deployment, and Docker deployment.
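
To make features 5 and 6 concrete, here is a minimal sketch of how priority, retries, scheduled crawling and PhantomJS rendering are expressed in a PySpider script (the URL is a placeholder; the arguments shown are standard self.crawl() parameters):

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    # Run on_start once a day -- "scheduled crawling".
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://example.com/list', callback=self.index_page,
                   priority=2,               # higher priority is scheduled first
                   retries=3,                # retry after a failure
                   age=10 * 24 * 60 * 60,    # treat the page as valid for 10 days
                   fetch_type='js')          # render the page with PhantomJS

    def index_page(self, response):
        # response.doc is a PyQuery object for parsing the page.
        return {'url': response.url, 'title': response.doc('title').text()}
```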

If you want to crawl pages quickly and easily, PySpider is a good choice, for example to quickly grab news content from an ordinary news site. However, for large-scale scraping of sites with strong anti-crawling measures, such as IP blocking, account banning and frequent captcha verification, Scrapy is recommended.

PySpider architecture

The architecture of PySpider is mainly divided into three parts: the Scheduler, the Fetcher and the Processor. The whole crawl process is monitored by the Monitor, and the crawl results are handled by the Result Worker.

The Scheduler initiates task scheduling, the Fetcher is responsible for fetching web page content, and the Processor is responsible for parsing it. Newly generated Requests are then sent back to the Scheduler for scheduling, while the extracted results are output and saved.

PySpider's task execution process follows a clear logic:

  • 1 Each PySpider project corresponds to a Python script that defines a Handler class with an on_start() method. Crawling begins by calling on_start() to generate the initial crawl tasks, which are then sent to the Scheduler (a minimal Handler sketch follows this list).

  • 2 The Scheduler dispatches the fetch task to the Fetcher, which executes it, obtains the response, and passes the response on to the Processor.

  • 3 The Processor processes the response, extracts new URLs, and generates new fetch tasks. It then notifies the Scheduler via the message queue that the current task has been executed and sends the new fetch tasks to the Scheduler. If a new Result is extracted, it is sent to the result queue to be handled by the Result Worker.

  • 4 The Scheduler receives the new fetch task, queries the database to determine whether it is a new task or one that needs to be retried, and then sends it to the Fetcher for fetching.

  • 5 The above steps repeat until all tasks are completed and the crawl ends.

  • 6 After the crawl is complete, the program calls the on_finished() method, in which you can define post-processing steps.
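
The following sketch, based on PySpider's default sample script, shows how these steps map onto a Handler class (the URL is a placeholder, and the on_finished() signature follows recent PySpider versions):

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    def on_start(self):
        # Step 1: generate the initial crawl task; it is sent to the Scheduler.
        self.crawl('https://example.com/', callback=self.index_page)

    def index_page(self, response):
        # Step 3: the Processor parses the response and emits new fetch tasks.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is a Result, handled by the Result Worker.
        return {
            'url': response.url,
            'title': response.doc('title').text(),
        }

    def on_finished(self, response, task):
        # Step 6: called when all tasks are finished; put post-processing here.
        pass
```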


Today we walked through the basic features and architecture of PySpider to build an overall understanding of it. Next, in the PySpider hands-on project, you will learn more about how to use it.


This article was first published on the WeChat public account "Crazy Sea", where we share practical Python content every day. Reply "1024" — you know what I mean.