What is the fastest way to develop a distributed crawler with Scrapy? Can you really build or convert a distributed crawler in a minute?
Without further ado, let’s see how it works and then talk about the details.
Quick start
Step 0:
Install scrapy-distributed first:
pip install scrapy-distributed
If you don’t have RabbitMQ and RedisBloom available, you can start two Docker containers to test with:
# pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3
# pull and run a RedisBloom container.
docker run -d --name redis-redisbloom -p 0.0.0.0:6379:6379 redislabs/rebloom:latest
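Before moving on, you can quickly confirm both services are reachable. Here is a minimal sanity check, assuming the pika and redisbloom client packages are installed (pip install pika redisbloom); the healthcheck key is just a throwaway name:
import pika
from redisbloom.client import Client

# RabbitMQ: open and close a connection using the image's default guest account.
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost", 5672))
print("RabbitMQ reachable:", conn.is_open)
conn.close()

# RedisBloom: create a tiny Bloom filter and test membership.
rb = Client(host="localhost", port=6379)
rb.bfCreate("healthcheck", 0.001, 1000)  # 0.1% error rate, capacity 1000; errors if run twice
rb.bfAdd("healthcheck", "hello")
print("RedisBloom reachable:", rb.bfExists("healthcheck", "hello") == 1)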
Step 1 (Optional):
If you have an existing crawler, you can skip this step and go straight to Step 2.
To create a crawler project, I’ll use a Sitemap crawler as an example:
scrapy startproject simple_example
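This generates the standard Scrapy project layout:
simple_example/
├── scrapy.cfg
└── simple_example/
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py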
Then modify the spider file under the spiders folder:
from scrapy_distributed.spiders.sitemap import SitemapSpider
from scrapy_distributed.queues.amqp import QueueConfig
from scrapy_distributed.dupefilters.redis_bloom import RedisBloomConfig


class MySpider(SitemapSpider):
    name = "example"
    sitemap_urls = ["http://www.people.com.cn/robots.txt"]
    queue_conf: QueueConfig = QueueConfig(
        name="example", durable=True, arguments={"x-queue-mode": "lazy", "x-max-priority": 255}
    )
    redis_bloom_conf: RedisBloomConfig = RedisBloomConfig(key="example:dupefilter")

    def parse(self, response):
        self.logger.info(f"parse response, url: {response.url}")
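For reference, the arguments in QueueConfig are handed to RabbitMQ when the queue is declared. In raw pika terms, the declaration above corresponds roughly to the following sketch (illustrative only, not scrapy-distributed's actual code):
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(
    queue="example",
    durable=True,               # the queue survives a broker restart
    arguments={
        "x-queue-mode": "lazy",     # keep messages on disk rather than in RAM
        "x-max-priority": 255,      # enable per-message priorities up to 255
    },
)
conn.close()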
Step 2:
By changing SCHEDULER and DUPEFILTER_CLASS in settings.py and adding the RabbitMQ and Redis configurations below, you get a distributed crawler. scrapy-distributed will initialize a default RabbitMQ queue and a default RedisBloom filter.
If you only want to use the RabbitMQ Scheduler on its own, you can set SCHEDULER to scrapy_distributed.schedulers.amqp.RabbitScheduler instead.
SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.amqp.RabbitQueue"
RABBITMQ_CONNECTION_PARAMETERS = "amqp://guest:guest@localhost:5672/example/?heartbeat=0"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
REDIS_BLOOM_PARAMS = {"redis_cls": "redisbloom.client.Client"}
# Bloom filter error-rate configuration, defaults to 0.001 if not set
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
# Bloom filter capacity configuration, defaults to 100_0000 if not set
BLOOM_DUPEFILTER_CAPACITY = 100_0000
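If you only want distributed scheduling and prefer to keep Scrapy's built-in dupefilter, a minimal sketch of that scheduler-only variant (using the RabbitScheduler path mentioned above, with DUPEFILTER_CLASS left at Scrapy's default) would be:
SCHEDULER = "scrapy_distributed.schedulers.amqp.RabbitScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.amqp.RabbitQueue"
RABBITMQ_CONNECTION_PARAMETERS = "amqp://guest:guest@localhost:5672/example/?heartbeat=0"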
You can also add two class attributes to your Spider class to initialize your RabbitMQ queue or RedisBloom filter:
class MySpider(SitemapSpider):
    ...
    queue_conf: QueueConfig = QueueConfig(
        name="example", durable=True, arguments={"x-queue-mode": "lazy", "x-max-priority": 255}
    )
    redis_bloom_conf: RedisBloomConfig = RedisBloomConfig(
        key="example:dupefilter", error_rate=0.001, capacity=100_0000
    )
    ...
Step 3:
scrapy crawl example
Check your RabbitMQ queues and RedisBloom filters. Are they working?
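One way to check from Python (a sketch; adjust the queue name, vhost and dupefilter key to match your settings):
import pika
from redisbloom.client import Client

conn = pika.BlockingConnection(
    pika.URLParameters("amqp://guest:guest@localhost:5672/example/?heartbeat=0")
)
channel = conn.channel()
# passive=True only inspects an existing queue and never creates one.
q = channel.queue_declare(queue="example", passive=True)
print("messages waiting in 'example':", q.method.message_count)
conn.close()

rb = Client(host="localhost", port=6379)
print(rb.bfInfo("example:dupefilter"))  # capacity and number of items inserted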
As you can see, a Scrapy crawler can be made to use RabbitMQ queues and RedisBloom filters simply by modifying its configuration. With RabbitMQ and RedisBloom already running, changing the configuration takes only a minute.
About Scrapy-Distributed
At present, scrapy-distributed mainly draws on the scrapy-redis and scrapy-rabbitmq libraries.
If you have any experience with Scrapy, you probably know scrapy-redis, a very fast distributed crawling library; and if you have tried using RabbitMQ as a crawler queue, you have probably come across scrapy-rabbitmq. scrapy-redis is handy, and scrapy-rabbitmq likewise lets RabbitMQ serve as the task queue, but both have a few drawbacks, which I’ll briefly go through here.
- scrapy-redis uses a Redis set for deduplication; the more links there are, the more memory it occupies, which makes it unsuitable for distributed crawlers with very large numbers of tasks.
- scrapy-redis uses a Redis list as its queue. In many scenarios this creates a backlog of tasks that consumes memory too quickly. For example, when crawling a site’s sitemap, links are pushed onto the queue much faster than they are popped off it.
- RabbitMQ-based components such as scrapy-rabbitmq do not expose the parameters RabbitMQ supports when declaring queues, so you cannot control things like queue durability.
- The Scheduler in RabbitMQ-based frameworks such as scrapy-rabbitmq does not support a distributed DupeFilter.
- scrapy-redis, scrapy-rabbitmq and similar frameworks are intrusive: to build a distributed crawler with them, we have to modify our crawler code and inherit from the framework’s Spider class.
This is where the scrapy-distributed framework comes in. Thanks to its non-intrusive design, you only need to modify the settings in settings.py, and the framework will distribute your crawler according to the default configuration.
To address some of the pain points of scrapy-redis and scrapy-rabbitmq, scrapy-distributed does the following:
- Uses RedisBloom’s Bloom filter for a much smaller memory footprint (see the back-of-the-envelope sketch after this list).
- Supports all the parameters RabbitMQ accepts when declaring queues, so queues can run in lazy mode, which further reduces memory usage.
- Makes RabbitMQ queue declarations more flexible: different crawlers can use the same or different queue configurations.
- Designs the Scheduler to support combinations of components, so RedisBloom’s DupeFilter and RabbitMQ’s Scheduler module can each be used independently.
- Keeps the design non-intrusive: only the configuration needs to change for an ordinary crawler to become distributed.
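To get a feel for the memory difference the Bloom filter makes, here is a back-of-the-envelope comparison using the default error rate and capacity from Step 2 (the 80-byte average URL length is an assumption, and Redis set overhead is ignored):
import math

n = 100_0000          # number of URLs to remember (the default capacity above)
p = 0.001             # false-positive rate (the default error rate above)
avg_url_bytes = 80    # assumed average URL length stored in a Redis set

# Optimal Bloom filter size in bits: m = -n * ln(p) / (ln 2)^2
m_bits = -n * math.log(p) / (math.log(2) ** 2)
print(f"Bloom filter: ~{m_bits / 8 / 2**20:.1f} MiB")          # ~1.7 MiB
print(f"Redis set:    ~{n * avg_url_bytes / 2**20:.1f} MiB")   # ~76 MiB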
More features are currently being added to the framework. If you’re interested, keep an eye on the project repository and feel free to discuss your ideas there.
Scrapy-Distributed GitHub repository: https://github.com/Insutanto/scrapy-distributed
Blog: https://insutanto.net/
This article was contributed by its author, Insutanto (a programmer and photographer), on Zhihu: zhihu.com/people/momentxu