Scrapy is a general-purpose crawling framework, but it does not support distributed crawling out of the box. To make distributed crawling with Scrapy easier, scrapy-redis provides a set of Redis-based components (components only, not a separate framework).
Based on Redis, scrapy-redis extends the following four components:
- Scheduler
- Duplication Filter
- Item Pipeline
- Base Spider
Scrapy-redis architecture
Scheduler
Scrapy's own request queue cannot be shared by multiple spiders. Scrapy-redis solves this by moving the queue into Redis, so that several spider processes can share a single queue.
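As a rough illustration of the idea (not scrapy-redis's actual internals, which serialize full Request objects), a Redis list already behaves like a queue that several processes can push to and pop from:

    import redis  # pip install redis; assumes a Redis server on 127.0.0.1:6379

    r = redis.Redis(host="127.0.0.1", port=6379)

    # Any producer process can push work (here just a URL string) onto the shared queue...
    r.lpush("shared:requests", "https://example.com/?page=1")

    # ...and any consumer process, possibly on another machine, can pop from the same queue.
    print(r.rpop("shared:requests"))  # b'https://example.com/?page=1'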
Duplication Filter
In Scrapy, request-fingerprint deduplication is implemented with a Python set. In scrapy-redis, deduplication is handled by the Duplication Filter component, which cleverly uses the no-duplicates property of a Redis set.
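A minimal sketch of that idea, assuming a local Redis server (this only hashes the URL; the real RFPDupeFilter fingerprints the whole request):

    import hashlib
    import redis

    r = redis.Redis()

    def seen_before(url: str) -> bool:
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        # SADD returns 1 if the member is new, 0 if it was already in the set (a duplicate).
        return r.sadd("dupefilter:fingerprints", fp) == 0

    print(seen_before("https://example.com/a"))  # False: first time seen
    print(seen_before("https://example.com/a"))  # True: duplicate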
Item Pipeline
The engine hands crawled items to the Item Pipeline. The scrapy-redis Item Pipeline stores crawled items in a Redis items queue. With this modified pipeline, items can easily be pulled from the items queue by key, which makes it possible to run a cluster of item-processing workers.
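For example, a separate worker process could drain that items queue like this (a sketch; the key name cnblog:items follows scrapy-redis's default "<spider>:items" pattern, but check your own configuration):

    import json
    import redis

    r = redis.Redis()

    # scrapy-redis's RedisPipeline pushes each serialized item onto a Redis list;
    # any number of workers can pop from that list and post-process the items.
    while True:
        _, raw = r.blpop("cnblog:items")  # blocks until an item is available
        item = json.loads(raw)
        print(item.get("title"), item.get("url"))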
Base Spider
RedisSpider no longer inherits from scrapy.Spider alone: it inherits from both Spider and RedisMixin, the class that reads URLs from Redis. When we generate a spider that inherits from RedisSpider, setup_redis is called, which connects to the Redis database and registers two signals. One is the spider-idle signal, which calls the spider_idle function; this function calls schedule_next_requests to keep the spider alive and then raises a DontCloseSpider exception. The other is the item-scraped signal, which calls the item_scraped function; this function in turn calls schedule_next_requests to fetch the next request.
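A stripped-down sketch of that mechanism (the real code lives in scrapy_redis.spiders.RedisMixin; this is only meant to show how the idle signal keeps the spider alive):

    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider

    class RedisMixinSketch:
        """Illustrative only, not the actual scrapy-redis implementation."""

        def setup_redis(self, crawler):
            # Connect to Redis (omitted here) and hook the spider_idle signal.
            crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

        def spider_idle(self):
            # When Scrapy would normally close the idle spider, try to pull more
            # start URLs from Redis and refuse to close.
            self.schedule_next_requests()
            raise DontCloseSpider

        def schedule_next_requests(self):
            # In scrapy-redis this pops URLs from the redis_key list and schedules them.
            pass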
Install scrapy-redis
    python3.6 -m pip install scrapy-redis
Project practice
First, modify the configuration file, settings.py:
    BOT_NAME = 'cnblogs'

    SPIDER_MODULES = ['cnblogs.spiders']
    NEWSPIDER_MODULE = 'cnblogs.spiders'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    DOWNLOAD_DELAY = 2  # wait 2 seconds between requests

    MY_USER_AGENT = [
        "Mozilla/5.0+(Windows+NT+6.2;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/45.0.2454.101+Safari/537.36",
        "Mozilla/5.0+(Windows+NT+5.1)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/28.0.1500.95+Safari/537.36+SE+2.X+MetaSr+1.0",
        "Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/50.0.2657.3+Safari/537.36"
    ]

    # Enable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
        'cnblogs.middlewares.UserAgentMiddleware': 543,
    }

    LOG_LEVEL = "ERROR"

    ITEM_PIPELINES = {
        'cnblogs.pipelines.MongoPipeline': 300,
    }

    MONGO_PORT = 27017            # MongoDB port
    MONGO_DB = "spider_data"      # database name
    MONGO_COLL = "cnblogs_title"  # collection name

    # Replace the scheduler and the dedup class with the scrapy-redis versions
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    REDIS_HOST = '127.0.0.1'
    REDIS_PORT = 7001  # port of one of the nodes in the Redis cluster

    # By default scrapy-redis empties the crawl queue and the fingerprint set when the crawl finishes.
    #SCHEDULER_PERSIST = True
    #SCHEDULER_FLUSH_ON_START = True
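The settings above reference cnblogs.pipelines.MongoPipeline and cnblogs.middlewares.UserAgentMiddleware, which are not shown in this article. For completeness, here is a minimal sketch of what they might look like (assumed code, not the original project's; the Mongo host is hard-coded because the settings define no MONGO_HOST):

    # Sketches of the two classes referenced in settings.py; in a real project they would
    # live in cnblogs/pipelines.py and cnblogs/middlewares.py respectively.
    import random

    import pymongo


    class MongoPipeline:
        """Stores each crawled item in MongoDB, using the MONGO_* settings above."""

        def __init__(self, port, db, coll):
            self.port = port
            self.db_name = db
            self.coll_name = coll

        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                port=crawler.settings.getint("MONGO_PORT", 27017),
                db=crawler.settings.get("MONGO_DB", "spider_data"),
                coll=crawler.settings.get("MONGO_COLL", "cnblogs_title"),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient("127.0.0.1", self.port)  # assumed host
            self.coll = self.client[self.db_name][self.coll_name]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            self.coll.insert_one(dict(item))
            return item


    class UserAgentMiddleware:
        """Attaches a random User-Agent from MY_USER_AGENT to every outgoing request."""

        def __init__(self, user_agents):
            self.user_agents = user_agents

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings.getlist("MY_USER_AGENT"))

        def process_request(self, request, spider):
            request.headers["User-Agent"] = random.choice(self.user_agents)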
Only two changes are needed in the spider code: first, it inherits from RedisSpider instead of Spider; second, start_urls is replaced by redis_key.
    # -*- coding: utf-8 -*-
    import scrapy
    import datetime
    from scrapy_redis.spiders import RedisSpider


    class CnblogSpider(RedisSpider):
        name = 'cnblog'
        redis_key = "myspider:start_urls"
        # start_urls = [f'https://www.cnblogs.com/c-x-a/default.html?page={i}' for i in range(1, 2)]

        def parse(self, response):
            main_info_list_node = response.xpath('//div[@class="forFlow"]')
            content_list_node = main_info_list_node.xpath(".//a[@class='postTitle2']/text()").extract()
            for item in content_list_node:
                url = response.url
                title = item
                crawl_date = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                item = {}
                item['url'] = url
                item['title'] = title.strip() if title else title
                item['crawl_date'] = crawl_date
                yield item
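Note that nothing in this spider is tied to a single machine: as long as every copy points at the same Redis (and, here, the same MongoDB), you can start it with the usual scrapy crawl cnblog command on as many machines or in as many processes as you like, and they will all pull URLs from the shared myspider:start_urls queue without duplicating work.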
Because scrapy-redis shares tasks through a Redis queue, the tasks must be inserted into the database in advance, under the key we specified, "myspider:start_urls". To insert tasks into the previously created Redis cluster, first connect to the database in cluster mode:
    redis-cli -c -p 7000  # port of one of the master nodes in my Redis cluster
Then run the following commands to insert the tasks:
    lpush myspider:start_urls https://www.cnblogs.com/c-x-a/default.html?page=1
    lpush myspider:start_urls https://www.cnblogs.com/c-x-a/default.html?page=2
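If you would rather seed the queue from Python than from redis-cli, a short script does the same thing (this assumes redis-py 4.0 or later, which ships a cluster-aware client; against a single Redis instance, plain redis.Redis would do):

    from redis.cluster import RedisCluster  # redis-py >= 4.0

    rc = RedisCluster(host="127.0.0.1", port=7000)  # any reachable node of the cluster

    # Push the same two start URLs onto the shared task queue.
    for page in range(1, 3):
        rc.lpush("myspider:start_urls",
                 f"https://www.cnblogs.com/c-x-a/default.html?page={page}")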
Then check the queue:
    lrange myspider:start_urls 0 10
You can see our tasks, so the insertion succeeded. Now run the spider. After it starts, check three places. First, look at the task queue in Redis: there are no tasks left:
    (empty list or set)
Second, look at the MongoDB database: the results have been saved successfully.
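A quick way to verify this from Python, assuming pymongo and the MONGO_* settings used above:

    import pymongo

    client = pymongo.MongoClient("127.0.0.1", 27017)
    coll = client["spider_data"]["cnblogs_title"]

    print(coll.count_documents({}))  # number of saved items
    print(coll.find_one())           # a sample document with url / title / crawl_date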
Third, you will notice that the spider does not exit. This is normal: because we are using scrapy-redis, the spider keeps asking Redis for tasks. If there are none, it waits; as soon as a new task is pushed into Redis, it continues crawling and then goes back to waiting.
First published in my Zhihu column: zhuanlan.zhihu.com/p/106605401