Want to develop a distributed crawler with Scrapy? What's the fastest way to do it? Can you really build, or convert, a distributed crawler in a minute?

Without further ado, let's see how it works, and then talk about the details.

Quick start

Step 0:

Install scrapy-distributed first:

pip install scrapy-distributed

(Optional) If you don't have RabbitMQ or RedisBloom available, you can start two Docker containers for testing (RabbitMQ and RedisBloom):

# Pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3
# Pull and run a RedisBloom container.
docker run -d --name redis-redisbloom -p 0.0.0.0:6379:6379 redislabs/rebloom:latest
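If you want to confirm that both containers came up, a quick sanity check is enough. This is just a sketch, assuming the pika and redis Python packages are installed (pip install pika redis):

# Quick connectivity check for the two test containers (a sketch, not part of scrapy-distributed).
import pika
import redis

# RabbitMQ: open and close a connection on the default AMQP port.
conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", port=5672))
print("RabbitMQ reachable:", conn.is_open)
conn.close()

# Redis/RedisBloom: a PING confirms the server is up.
print("Redis reachable:", redis.Redis(host="localhost", port=6379).ping())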

Step 1 (Optional):

If you have an existing crawler, you can skip this Step and go straight to Step 2.

Create a crawler project; I'll use a sitemap crawler as the example:

scrapy startproject simple_example

Then modify the spider file under the spiders folder:

from scrapy_distributed.spiders.sitemap import SitemapSpider
from scrapy_distributed.queues.amqp import QueueConfig
from scrapy_distributed.dupefilters.redis_bloom import RedisBloomConfig

class MySpider(SitemapSpider):
    name = "example"
    sitemap_urls = ["http://www.people.com.cn/robots.txt"]
    # Declare a durable RabbitMQ queue in lazy mode with a maximum priority of 255.
    queue_conf: QueueConfig = QueueConfig(
        name="example", durable=True, arguments={"x-queue-mode": "lazy", "x-max-priority": 255}
    )
    # Point the dupefilter at a RedisBloom key.
    redis_bloom_conf: RedisBloomConfig = RedisBloomConfig(key="example:dupefilter")

    def parse(self, response):
        self.logger.info(f"parse response, url: {response.url}")

Step 2:

By changing SCHEDULER and DUPEFILTER_CLASS in settings.py and adding the RabbitMQ and Redis configuration, you get a distributed crawler. scrapy-distributed will initialize a default RabbitMQ queue and a default RedisBloom filter.

# Integrate the Scheduler for RabbitMQ and RedisBloom.
# To use only the RabbitMQ scheduler, set SCHEDULER to the RabbitMQ-only RabbitScheduler class instead.
SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.amqp.RabbitQueue"
RABBITMQ_CONNECTION_PARAMETERS = "amqp://guest:guest@localhost:5672/example/?heartbeat=0"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
# RedisBloom client configuration; you can copy this as-is.
REDIS_BLOOM_PARAMS = {
    "redis_cls": "redisbloom.client.Client"
}
# Bloom filter error-rate configuration; defaults to 0.001 if omitted.
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
# Bloom filter capacity configuration; defaults to 100_0000 (1,000,000) if omitted.
BLOOM_DUPEFILTER_CAPACITY = 100_0000
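One detail worth noting: heartbeat=0 in the connection URL disables AMQP heartbeats in pika, which keeps long-running crawls from being disconnected by the broker during idle stretches.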

You can also add two class attributes to your Spider class to initialize your RabbitMQ queue or RedisBloom filter:

class MySpider(SitemapSpider):
    ...
    # Extra queue arguments can be passed via `arguments`; here lazy mode and a maximum priority are configured.
    queue_conf: QueueConfig = QueueConfig(
        name="example", durable=True, arguments={"x-queue-mode": "lazy", "x-max-priority": 255}
    )
    # `key`, `error_rate`, and `capacity` set the Redis key, error rate, and capacity of the Bloom filter.
    redis_bloom_conf: RedisBloomConfig = RedisBloomConfig(
        key="example:dupefilter", error_rate=0.001, capacity=100_0000
    )
    ...

Step 3:

scrapy crawl <your_spider>
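Since the queue and the dupefilter live in RabbitMQ and Redis rather than inside the process, distributing the crawl is just a matter of starting the same spider in several shells or on several machines. With the spider from Step 1, that would look like:

# Worker 1 (any machine that can reach RabbitMQ and Redis):
scrapy crawl example
# Worker 2, started in parallel; it consumes from the same queue and shares the Bloom filter:
scrapy crawl example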

Check your RabbitMQ queues and RedisBloom filters. Are they working?
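If you used the Docker containers from Step 0, one way to peek at both from the command line (container names as in Step 0):

# List RabbitMQ queues and their message counts.
docker exec rabbitmq rabbitmqctl list_queues name messages
# Inspect the Bloom filter (capacity, number of items inserted, and so on).
docker exec redis-redisbloom redis-cli BF.INFO example:dupefilter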

As you can see, an existing Scrapy crawler can be switched to RabbitMQ queues and RedisBloom filters just by modifying the configuration file. With RabbitMQ and RedisBloom already running, changing the configuration really does take only a minute. 😂

About scrapy-distributed

At present, scrapy-distributed mainly draws on the scrapy-redis and scrapy-rabbitmq libraries.

If you've had any experience with Scrapy, you probably know scrapy-redis, a very popular distributed-crawling library, and if you've tried using RabbitMQ as a crawler queue, you've probably come across scrapy-rabbitmq. scrapy-redis is handy, and scrapy-rabbitmq can likewise use RabbitMQ as the task queue, but both have a few drawbacks, which I'll briefly go through here.

  1. scrapy-redis uses a Redis set for deduplication; the more links you crawl, the more memory the set occupies, which doesn't suit distributed crawlers with very large numbers of tasks (a rough comparison follows this list).
  2. scrapy-redis uses a Redis list as its queue. In many scenarios tasks pile up and memory is consumed too quickly; for example, when crawling a site's sitemap, links enter the queue much faster than they are consumed.
  3. RabbitMQ-based components such as scrapy-rabbitmq don't expose the parameters RabbitMQ supports when declaring queues, so you can't control things like queue durability.
  4. The Scheduler in RabbitMQ-based frameworks such as scrapy-rabbitmq doesn't come with a distributed DupeFilter.
  5. scrapy-redis, scrapy-rabbitmq, and similar frameworks are invasive: to build a distributed crawler with them, you have to modify your crawler code and inherit from the framework's Spider class to get distributed behavior.
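To make point 1 concrete, here is an illustrative comparison of the memory needed to remember one million URLs (the library's default capacity) with a Bloom filter at the default 0.001 error rate, versus storing 40-byte hex fingerprints (Scrapy's SHA1 fingerprints) in a Redis set. These are back-of-envelope estimates, not benchmarks:

import math

def bloom_bits_per_item(error_rate: float) -> float:
    # Optimal Bloom filter sizing: m/n = -ln(p) / (ln 2)^2 bits per element.
    return -math.log(error_rate) / (math.log(2) ** 2)

n = 100_0000   # 1,000,000 items, the default capacity
p = 0.001      # the default error rate
bloom_mb = bloom_bits_per_item(p) * n / 8 / 1024 / 1024
set_mb = 40 * n / 1024 / 1024  # 40-byte fingerprints, payload only (a real set adds overhead)
print(f"Bloom filter: ~{bloom_mb:.1f} MB vs Redis set: ~{set_mb:.1f} MB and up")

For these defaults that works out to roughly 1.7 MB for the Bloom filter versus 38 MB or more for the set, and the gap grows linearly with the number of links.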

This is where the scrapy-distributed framework comes in. With a non-invasive design, you simply modify the settings in settings.py, and the framework distributes your crawlers according to the default configuration.

To address some of the pain points of scrapy-redis and scrapy-rabbitmq, scrapy-distributed does the following:

  1. It uses RedisBloom's Bloom filter, which takes far less memory than a Redis set.
  2. It supports all the parameters RabbitMQ accepts when declaring queues, so queues can run in lazy mode to reduce memory usage.
  3. RabbitMQ queue declaration is more flexible: different crawlers can share one queue configuration or use different ones.
  4. The Scheduler is designed to support multiple combinations of components: the RedisBloom DupeFilter and the RabbitMQ Scheduler module can each be used independently.
  5. The design is non-invasive: you only need to modify the configuration to make an ordinary crawler distributed.

Many more features are being added to the framework. If you're interested, keep an eye on the project repository and feel free to share your ideas.

GitHub: scrapy-distributed

Blog: insutanto.net/posts/scrap…