Scrapy is a general-purpose crawling framework, but it does not support distributed crawling out of the box. To make distributed crawling with Scrapy easier, scrapy-redis provides a set of Redis-based components (components only, not a separate framework).
Based on Redis, scrapy-redis extends the following four components:
- Scheduler
- Duplication Filter
- Item Pipeline
- Base Spider
Scrapy-redis architecture
Scheduler
Scrapy's own request queue cannot be shared by multiple spiders. Scrapy-redis solves this by moving the queue into Redis, so that several spider processes can share a single queue.
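As a rough illustration of the idea (not scrapy-redis's actual internals, which serialize full Request objects), a Redis list already behaves like a queue that several processes can push to and pop from:

    import redis  # pip install redis; assumes a Redis server on 127.0.0.1:6379

    r = redis.Redis(host="127.0.0.1", port=6379)

    # Any producer process can push work (here just a URL string) onto the shared queue...
    r.lpush("shared:requests", "https://example.com/?page=1")

    # ...and any consumer process, possibly on another machine, can pop from the same queue.
    print(r.rpop("shared:requests"))  # b'https://example.com/?page=1'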
Duplication Filter
In Scrapy, request-fingerprint deduplication is implemented with a Python set. In scrapy-redis, deduplication is handled by the Duplication Filter component, which cleverly uses the no-duplicates property of a Redis set.
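A minimal sketch of that idea, assuming a local Redis server (this only hashes the URL; the real RFPDupeFilter fingerprints the whole request):

    import hashlib
    import redis

    r = redis.Redis()

    def seen_before(url: str) -> bool:
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        # SADD returns 1 if the member is new, 0 if it was already in the set (a duplicate).
        return r.sadd("dupefilter:fingerprints", fp) == 0

    print(seen_before("https://example.com/a"))  # False: first time seen
    print(seen_before("https://example.com/a"))  # True: duplicate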
Item Pipeline
The engine hands crawled items to the Item Pipeline. The scrapy-redis Item Pipeline stores crawled items in a Redis items queue. With this modified pipeline, items can easily be pulled from the items queue by key, which makes it possible to run a cluster of item-processing workers.
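For example, a separate worker process could drain that items queue like this (a sketch; the key name cnblog:items follows scrapy-redis's default "<spider>:items" pattern, but check your own configuration):

    import json
    import redis

    r = redis.Redis()

    # scrapy-redis's RedisPipeline pushes each serialized item onto a Redis list;
    # any number of workers can pop from that list and post-process the items.
    while True:
        _, raw = r.blpop("cnblog:items")  # blocks until an item is available
        item = json.loads(raw)
        print(item.get("title"), item.get("url"))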
Base Spider
RedisSpider no longer inherits from scrapy.Spider alone: it inherits from both Spider and RedisMixin, the class that reads URLs from Redis. When we generate a spider that inherits from RedisSpider, setup_redis is called, which connects to the Redis database and registers two signals. One is the spider-idle signal, which calls the spider_idle function; this function calls schedule_next_requests to keep the spider alive and then raises a DontCloseSpider exception. The other is the item-scraped signal, which calls the item_scraped function; this function in turn calls schedule_next_requests to fetch the next request.
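A stripped-down sketch of that mechanism (the real code lives in scrapy_redis.spiders.RedisMixin; this is only meant to show how the idle signal keeps the spider alive):

    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider

    class RedisMixinSketch:
        """Illustrative only, not the actual scrapy-redis implementation."""

        def setup_redis(self, crawler):
            # Connect to Redis (omitted here) and hook the spider_idle signal.
            crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

        def spider_idle(self):
            # When Scrapy would normally close the idle spider, try to pull more
            # start URLs from Redis and refuse to close.
            self.schedule_next_requests()
            raise DontCloseSpider

        def schedule_next_requests(self):
            # In scrapy-redis this pops URLs from the redis_key list and schedules them.
            pass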
Install scrapy-redis
    python3.6 -m pip install scrapy-redis
Project practice
First, modify the configuration file, settings.py:
    BOT_NAME = 'cnblogs'

    SPIDER_MODULES = ['cnblogs.spiders']
    NEWSPIDER_MODULE = 'cnblogs.spiders'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    DOWNLOAD_DELAY = 2  # wait 2 seconds between requests

    MY_USER_AGENT = [
        "Mozilla/5.0+(Windows+NT+6.2;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/45.0.2454.101+Safari/537.36",
        "Mozilla/5.0+(Windows+NT+5.1)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/28.0.1500.95+Safari/537.36+SE+2.X+MetaSr+1.0",
        "Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/50.0.2657.3+Safari/537.36"
    ]

    # Enable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
        'cnblogs.middlewares.UserAgentMiddleware': 543,
    }

    LOG_LEVEL = "ERROR"

    ITEM_PIPELINES = {
        'cnblogs.pipelines.MongoPipeline': 300,
    }

    MONGO_PORT = 27017            # MongoDB port
    MONGO_DB = "spider_data"      # database name
    MONGO_COLL = "cnblogs_title"  # collection name

    # Replace the scheduler and the dedup class with the scrapy-redis versions
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    REDIS_HOST = '127.0.0.1'
    REDIS_PORT = 7001  # port of one of the nodes in the Redis cluster

    # By default scrapy-redis empties the crawl queue and the fingerprint set when the crawl finishes.
    #SCHEDULER_PERSIST = True
    #SCHEDULER_FLUSH_ON_START = True
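The settings above reference cnblogs.pipelines.MongoPipeline and cnblogs.middlewares.UserAgentMiddleware, which are not shown in this article. For completeness, here is a minimal sketch of what they might look like (assumed code, not the original project's; the Mongo host is hard-coded because the settings define no MONGO_HOST):

    # Sketches of the two classes referenced in settings.py; in a real project they would
    # live in cnblogs/pipelines.py and cnblogs/middlewares.py respectively.
    import random

    import pymongo


    class MongoPipeline:
        """Stores each crawled item in MongoDB, using the MONGO_* settings above."""

        def __init__(self, port, db, coll):
            self.port = port
            self.db_name = db
            self.coll_name = coll

        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                port=crawler.settings.getint("MONGO_PORT", 27017),
                db=crawler.settings.get("MONGO_DB", "spider_data"),
                coll=crawler.settings.get("MONGO_COLL", "cnblogs_title"),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient("127.0.0.1", self.port)  # assumed host
            self.coll = self.client[self.db_name][self.coll_name]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            self.coll.insert_one(dict(item))
            return item


    class UserAgentMiddleware:
        """Attaches a random User-Agent from MY_USER_AGENT to every outgoing request."""

        def __init__(self, user_agents):
            self.user_agents = user_agents

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings.getlist("MY_USER_AGENT"))

        def process_request(self, request, spider):
            request.headers["User-Agent"] = random.choice(self.user_agents)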
Only two changes are needed in the spider code: first, it inherits from RedisSpider instead of Spider; second, start_urls is replaced by redis_key.
    # -*- coding: utf-8 -*-
    import scrapy
    import datetime
    from scrapy_redis.spiders import RedisSpider


    class CnblogSpider(RedisSpider):
        name = 'cnblog'
        redis_key = "myspider:start_urls"
        # start_urls = [f'https://www.cnblogs.com/c-x-a/default.html?page={i}' for i in range(1, 2)]

        def parse(self, response):
            main_info_list_node = response.xpath('//div[@class="forFlow"]')
            content_list_node = main_info_list_node.xpath(".//a[@class='postTitle2']/text()").extract()
            for item in content_list_node:
                url = response.url
                title = item
                crawl_date = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                item = {}
                item['url'] = url
                item['title'] = title.strip() if title else title
                item['crawl_date'] = crawl_date
                yield item
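Note that nothing in this spider is tied to a single machine: as long as every copy points at the same Redis (and, here, the same MongoDB), you can start it with the usual scrapy crawl cnblog command on as many machines or in as many processes as you like, and they will all pull URLs from the shared myspider:start_urls queue without duplicating work.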
Because scrapy-redis shares tasks through a Redis queue, the tasks must be inserted into the database in advance, under the key we specified, "myspider:start_urls". To insert tasks into the previously created Redis cluster, first connect to the database in cluster mode:
    redis-cli -c -p 7000  # port of one of the master nodes in my Redis cluster
Then run the following commands to insert the tasks:
    lpush myspider:start_urls https://www.cnblogs.com/c-x-a/default.html?page=1
    lpush myspider:start_urls https://www.cnblogs.com/c-x-a/default.html?page=2
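If you would rather seed the queue from Python than from redis-cli, a short script does the same thing (this assumes redis-py 4.0 or later, which ships a cluster-aware client; against a single Redis instance, plain redis.Redis would do):

    from redis.cluster import RedisCluster  # redis-py >= 4.0

    rc = RedisCluster(host="127.0.0.1", port=7000)  # any reachable node of the cluster

    # Push the same two start URLs onto the shared task queue.
    for page in range(1, 3):
        rc.lpush("myspider:start_urls",
                 f"https://www.cnblogs.com/c-x-a/default.html?page={page}")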
Then check the queue:
    lrange myspider:start_urls 0 10
You can see our tasks, so the insertion succeeded. Now run the spider. After it starts, check three places. First, look at the task queue in Redis: there are no tasks left:
    (empty list or set)
Second, look at the MongoDB database: the results have been saved successfully.
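A quick way to verify this from Python, assuming pymongo and the MONGO_* settings used above:

    import pymongo

    client = pymongo.MongoClient("127.0.0.1", 27017)
    coll = client["spider_data"]["cnblogs_title"]

    print(coll.count_documents({}))  # number of saved items
    print(coll.find_one())           # a sample document with url / title / crawl_date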
Third, you will notice that the spider does not exit. This is normal: because we are using scrapy-redis, the spider keeps asking Redis for tasks. If there are none, it waits; as soon as a new task is pushed into Redis, it continues crawling and then goes back to waiting.
First published in my Zhihu column: zhuanlan.zhihu.com/p/106605401