This is the 25th day of my participation in the August Genwen Challenge. More challenges in August.
A simple use of distributed crawling: Douban movie information (limited here to four pages of movie data, 100 entries in total; remove the limit to crawl all 10 pages, 250 entries!):
Project source code: link: pan.baidu.com/s/13akXDxNb… extraction code: BcuY
(Goal: run two identical Douban projects on this machine and download the Douban movies in a distributed way!) In fact, the only things we need to change are settings.py and the spider file.
(1) Configuration in settings.py (both projects do this):
# Step 1: scrapy-redis scheduler, dupefilter and Redis connection settings

# 1. Enable the scrapy-redis scheduler so that requests are stored in Redis
# from scrapy_redis.scheduler import Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# 2. Make sure all spiders share the same duplicate filtering, via Redis
# from scrapy_redis.dupefilter import RFPDupeFilter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# 3. Host and port used to connect to the Redis database
REDIS_HOST = "localhost"
REDIS_PORT = 6379

# Do not clean up the Redis queues, allowing crawls to be paused and resumed (optional).
# Because the Redis data is not lost, this also makes resuming from a breakpoint possible!
SCHEDULER_PERSIST = True

# Step 2: open the pipeline that pushes data to the shared Redis area and the one that stores a local TXT file
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'film.pipelines.FilmPipeline': 300,           # pipeline that stores a local TXT file
    'scrapy_redis.pipelines.RedisPipeline': 100,  # pipeline that pushes data to the shared Redis area
    # 'film.pipelines.DoubanSqlPipeline': 200,
}
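One detail worth stressing: in this walkthrough both projects run on the same machine, so REDIS_HOST = "localhost" works for both. In a real distributed setup the second project would sit on another machine and only its Redis connection settings would change, so that both projects point at the same Redis server. A sketch of that second machine's settings.py follows (the IP address is a made-up placeholder):

# settings.py on the other machine - only the Redis connection differs (sketch; 192.168.1.100 is a placeholder IP)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True

REDIS_HOST = "192.168.1.100"   # placeholder: IP of the machine that actually runs the Redis server
REDIS_PORT = 6379
# scrapy-redis also accepts a single connection URL instead of host/port:
# REDIS_URL = "redis://192.168.1.100:6379"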
(2) Spider file changes (the two projects are slightly different!)
1. Import the RedisSpider class (you must import it before you can use it!): from scrapy_redis.spiders import RedisSpider
2. Have the spider class inherit from RedisSpider: class DbSpider(RedisSpider):
3. Now that the initial request lives in Redis, the start_urls initial request is no longer needed in the spider: # start_urls = ['movie.douban.com/top250']
4. Set a key used to find the starting URL (the spider looks this key up in Redis, so all we need to do is push the starting request into Redis!): redis_key = "db:start_urls"
The full crawler file. First, the crawler file under the first project:
# -*- coding: utf-8 -*-
import scrapy
import re

from ..items import FilmItem                    # imported because we use the item class whose fields we defined
from scrapy_redis.spiders import RedisSpider    # 1. Import the RedisSpider class


class DbSpider(RedisSpider):                    # 2. Inherit from the RedisSpider class
    name = 'db'
    allowed_domains = ['movie.douban.com']
    # start_urls = ['https://movie.douban.com/top250']   # 3. The initial request now lives in Redis
    redis_key = "db:start_urls"                 # 4. Key used to look up the starting URL in Redis

    page_num = 0   # class variable

    def parse(self, response):                  # parse and extract the data
        print('First project:', response.url)
        print('First project:', response.url)
        print('First project:', response.url)

        # Get the movie info nodes
        # films_name = response.xpath('//div[@class="info"]/div/a/span[1]/text()').extract()
        node_list = response.xpath('//div[@class="info"]')   # 25 nodes per page

        if node_list:   # after page 10 there is nothing left; if no data is extracted, stop (return None)
            self.logger.info("How handsome you are!")
            for node in node_list:
                # Movie title
                film_name = node.xpath('./div/a/span[1]/text()').extract()[0]
                print(film_name)

                # Take the tag text, then match it with a regular expression
                con_star_name = node.xpath('./div/p[1]/text()').extract()[0]
                if "主演" in con_star_name:       # "主演" means "starring"
                    star_name = re.findall("主演?:? ?(.*)", con_star_name)[0]
                else:
                    star_name = "空"              # "空" means "empty"

                # Score (use node, not node_list, otherwise every item gets the first movie's score)
                score = node.xpath('./div/div/span[@property="v:average"]/text()').extract()[0]

                # Collect the data into the item fields
                item = FilmItem()
                item["film_name"] = film_name
                item["star_name"] = star_name
                item["score"] = score
                # e.g. {"film_name": "The Shawshank Redemption", "star_name": "Tim Robbins ...", "score": "9.7"}

                # yield item   # not used here: we still need the detail page before yielding the full item
                #              # (yielding now would only pass the basic info to the pipeline)

                # URL of the movie's detail page:
                detail_url = node.xpath('./div/a/@href').extract()[0]
                yield scrapy.Request(detail_url, callback=self.get_detail, meta={"info": item})

            # The page number is passed along via meta={"num": self.page_num} so that the page_num
            # shared by the two projects is updated correctly without conflicts!!
            if response.meta.get("num"):
                self.page_num = response.meta["num"]
            self.page_num += 1
            if self.page_num == 4:               # limit to four pages (remove to crawl all 10 pages)
                return
            print("page_num:", self.page_num)
            page_url = "https://movie.douban.com/top250?start={}&filter=".format(self.page_num * 25)
            yield scrapy.Request(page_url, callback=self.parse, meta={"num": self.page_num})
            # Note: every request yielded here goes to the engine, which passes it through a series of
            # components; in the end the engine delivers the response back to the spider's own callback
            # (callback=self.parse), where the data we really want is parsed out.
        else:
            return

    def get_detail(self, response):
        item = FilmItem()
        # Get the movie synopsis
        # 1. The meta dict comes back attached to the response; item.update() copies its fields into the new item
        info = response.meta["info"]             # receive the movie's basic information
        item.update(info)                        # add the basic-information fields
        # Add the synopsis to its field
        description = response.xpath('//div[@id="link-report"]//span[@property="v:summary"]/text()').extract()[0].strip()
        item['description'] = description
        yield item
Crawler file under the second project:
# -*- coding: utf-8 -*-
import scrapy
import re

from ..items import FilmItem                    # imported because we use the item class whose fields we defined
from scrapy_redis.spiders import RedisSpider    # 1. Import the RedisSpider class


class DbSpider(RedisSpider):                    # 2. Inherit from the RedisSpider class
    name = 'db'
    allowed_domains = ['movie.douban.com']
    # start_urls = ['https://movie.douban.com/top250']   # 3. The initial request now lives in Redis
    redis_key = "db:start_urls"                 # 4. Key used to look up the starting URL in Redis

    page_num = 0

    def parse(self, response):                  # parse and extract the data
        print('Second project:', response.url)
        print('Second project:', response.url)
        print('Second project:', response.url)

        # Get the movie info nodes
        # films_name = response.xpath('//div[@class="info"]/div/a/span[1]/text()').extract()
        node_list = response.xpath('//div[@class="info"]')   # 25 nodes per page

        if node_list:   # after page 10 there is nothing left; if no data is extracted, stop (return None)
            self.logger.info("How handsome you are!")
            for node in node_list:
                # Movie title
                film_name = node.xpath('./div/a/span[1]/text()').extract()[0]
                print(film_name)

                # Take the tag text, then match it with a regular expression
                con_star_name = node.xpath('./div/p[1]/text()').extract()[0]
                if "主演" in con_star_name:       # "主演" means "starring"
                    star_name = re.findall("主演?:? ?(.*)", con_star_name)[0]
                else:
                    star_name = "空"              # "空" means "empty"

                # Score (use node, not node_list, otherwise every item gets the first movie's score)
                score = node.xpath('./div/div/span[@property="v:average"]/text()').extract()[0]

                # Collect the data into the item fields
                item = FilmItem()
                item["film_name"] = film_name
                item["star_name"] = star_name
                item["score"] = score
                # e.g. {"film_name": "The Shawshank Redemption", "star_name": "Tim Robbins ...", "score": "9.7"}

                # yield item   # not used here: we still need the detail page before yielding the full item
                #              # (yielding now would only pass the basic info to the pipeline)

                # URL of the movie's detail page:
                detail_url = node.xpath('./div/a/@href').extract()[0]
                yield scrapy.Request(detail_url, callback=self.get_detail, meta={"info": item})

            if response.meta.get("num"):
                self.page_num = response.meta["num"]
            self.page_num += 1
            if self.page_num == 4:               # limit to four pages (remove to crawl all 10 pages)
                return
            print("page_num:", self.page_num)
            page_url = "https://movie.douban.com/top250?start={}&filter=".format(self.page_num * 25)
            yield scrapy.Request(page_url, callback=self.parse, meta={"num": self.page_num})
            # Note: every request yielded here goes to the engine, which passes it through a series of
            # components; in the end the engine delivers the response back to the spider's own callback
            # (callback=self.parse), where the data we really want is parsed out.
        else:
            return

    def get_detail(self, response):
        item = FilmItem()
        # Get the movie synopsis
        # 1. The meta dict comes back attached to the response; item.update() copies its fields into the new item
        info = response.meta["info"]             # receive the movie's basic information
        item.update(info)                        # add the basic-information fields
        # Add the synopsis to its field
        description = response.xpath('//div[@id="link-report"]//span[@property="v:summary"]/text()').extract()[0].strip()
        item['description'] = description
        yield item
(3) The items.py file (both projects are the same!):
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class FilmItem(scrapy.Item):   # the name must match the FilmItem imported in the spiders
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Fields we extract and later store (local TXT / database):
    # films_name = scrapy.Field()
    film_name = scrapy.Field()
    star_name = scrapy.Field()
    score = scrapy.Field()
    description = scrapy.Field()
(4) The pipelines.py file (both projects are the same!):
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json


class FilmPipeline(object):            # the name must match 'film.pipelines.FilmPipeline' in ITEM_PIPELINES
    def open_spider(self, spider):     # called once when the spider is opened
        self.f = open("films.txt", "w", encoding="utf-8")   # open the output file

    def process_item(self, item, spider):
        # If the file were opened and closed here, that would happen once per item (25 times per page);
        # instead it is opened once in open_spider and closed once in close_spider.
        # json.dumps serializes the dict into a JSON string
        json_str = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.f.write(json_str)         # the file is already open, so just write the data
        return item

    def close_spider(self, spider):    # called once when the spider is closed
        # By now the engine has passed all data to the pipeline, so the file can be closed
        self.f.close()
Distributed implementation effect:
(1) If you just run the projects directly, you will find that they simply wait. First, a few simple settings to make this easier to observe:
Open the two Scrapy projects, one in each terminal.
You will find that both projects wait and make no progress. This is because we have not yet given the shared Redis area an initial request; both projects keep asking Redis for the starting URL and never get one!
In settings.py of both projects, send the logs to a .log file instead of the console, to make observation easier:
LOG_FILE = "db.log"    # write the log to db.log instead of showing it on the console
LOG_ENABLED = False
(2) Run the two projects; both just sit there waiting. Open another terminal and push the starting URL into the shared Redis area:
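The "operation" is simply seeding the redis_key that both spiders are watching. A minimal sketch is shown below, assuming the redis-py package is installed; the same thing can be done directly in redis-cli with an LPUSH on db:start_urls:

import redis

# Push the starting URL into the shared Redis area
# (sketch; equivalent to: lpush db:start_urls https://movie.douban.com/top250)
r = redis.StrictRedis(host="localhost", port=6379)
r.lpush("db:start_urls", "https://movie.douban.com/top250")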
We will then find that both projects run successfully (and together they fetch exactly four pages of movie information, 100 entries in total).
Tidying up the loose ends:
1. Solve the idle-crawling ("crawling empty") problem: (do the following in both projects!)
(1) Use an extension (this file exists precisely to solve the idle-crawling problem):
With this anti-idle setting added to both projects, the crawler stops automatically once the data has been fully crawled and the configured idle time has passed!! (file name: extensions.py) Complete project code after adding the extension: link: pan.baidu.com/s/1Naie1HsW… extraction code: E30P
# -*- coding: utf-8 -*-

# Define here the models for your scraped Extensions
import logging

from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)


class RedisSpiderSmartIdleClosedExensions(object):

    def __init__(self, idle_number, crawler):
        self.crawler = crawler
        self.idle_number = idle_number
        self.idle_list = []
        self.idle_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured
        if 'redis_key' not in crawler.spidercls.__dict__.keys():
            raise NotConfigured('Only supports RedisSpider')

        # get the idle limit from the settings
        idle_number = crawler.settings.getint('IDLE_NUMBER', 360)

        # instantiate the extension object
        ext = cls(idle_number, crawler)

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)

        return ext

    def spider_opened(self, spider):
        spider.logger.info("opened spider {}, Allow waiting time:{} second".format(spider.name, self.idle_number * 5))

    def spider_closed(self, spider):
        spider.logger.info(
            "closed spider {}, Waiting time exceeded {} second".format(spider.name, self.idle_number * 5))

    def spider_idle(self, spider):
        # This method is called when the spider first becomes idle and then roughly every 5 seconds afterwards
        # If spider.redis_key stays empty longer than the configured idle limit, close the crawler
        # Check whether redis_key still exists
        if not spider.server.exists(spider.redis_key):
            self.idle_count += 1
        else:
            self.idle_count = 0

        if self.idle_count > self.idle_number:
            # close the crawler
            self.crawler.engine.close_spider(spider, 'Waiting time exceeded')
(2) Set this extension in settings.py:
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
    # 'scrapy.extensions.telnet.TelnetConsole': None,
    'film.extensions.RedisSpiderSmartIdleClosedExensions': 500,   # enable extensions.py
}

MYEXT_ENABLED = True   # enable the extension
IDLE_NUMBER = 10       # idle limit: 10 units of 5 seconds each (the idle signal fires roughly every 5 s), i.e. about 50 seconds
(Note: the data stored in Redis:
spidername:items - a list holding the data scraped by the crawlers; each entry is a JSON string.
spidername:dupefilter - a set used to de-duplicate requests; each entry is a 40-character fingerprint of a visited URL.
spidername:start_urls - the list that receives the first URL when a RedisSpider starts.
spidername:requests - a zset holding the requests waiting to be scheduled; each entry is a serialized Request object.)
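For reference, these keys can be inspected from Python while the crawl runs. The snippet below is only a sketch; it assumes the redis-py package is installed and uses the spider name "db" from above:

import redis

# Peek at the shared Redis area of the distributed crawl (sketch; assumes redis-py and spider name "db")
r = redis.StrictRedis(host="localhost", port=6379, decode_responses=True)

print("pending requests :", r.zcard("db:requests"))            # zset of serialized requests waiting to be scheduled
print("seen fingerprints:", r.scard("db:dupefilter"))          # set of 40-character request fingerprints
print("scraped items    :", r.llen("db:items"))                # list of items as JSON strings (RedisPipeline)
print("start urls       :", r.lrange("db:start_urls", 0, -1))  # the seeded starting URL(s), if still present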
Conclusion:
Whichever project prints from its parse function first is the one that grabbed the start_urls request from the shared Redis area, and it starts running. While its spider code executes, its loop yields 25 detail-page requests (plus the next-page request) to the engine; the engine hands them to the scheduler, which in turn stores them in the Redis database, the shared area. From then on, the schedulers of both projects compete for the requests in this shared area, and each keeps pushing the new requests produced by its own spider back into Redis, so the two projects keep scrambling for work until the area is empty. This is exactly the distributed crawl we wanted!!
Effect: (note: because the idle-crawl problem had not been solved at this point, the projects do not shut themselves down when they finish; even after the crawl is over they keep polling Redis endlessly, so while they are still running the 100 records saved locally by the two projects may look incomplete. Once both projects have started crawling empty, close them and you will see that the data is complete!!!)
(1) The movie information stored in the local TXT files of the two projects adds up to exactly all the target data we wanted to crawl: four pages, 100 movies in total.
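To double-check that claim, the two films.txt files can be merged and counted with a few lines of Python. This is only a sketch; the two paths are hypothetical and depend on where each project was run:

import json

# Merge the films.txt of both projects and verify that together they hold 100 unique movies
films = []
for path in ["project1/films.txt", "project2/films.txt"]:   # hypothetical paths to the two output files
    with open(path, encoding="utf-8") as f:
        films.extend(json.loads(line) for line in f if line.strip())

names = {film["film_name"] for film in films}
print("total records:", len(films), "| unique movies:", len(names))   # expected: 100 and 100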
🔆 In The End!
Start now and stick to it; a little progress every day, and in the near future you will thank yourself for your efforts!
This blogger will keep updating the crawler basics column and the crawler-in-action column. If you read this article carefully, feel free to like, bookmark, and leave a comment with your thoughts. You can also follow this blogger to read more crawler articles in the days ahead!
If there are mistakes or inappropriate wording, please point them out in the comment section, thank you! If you want to reprint this article, please contact me for consent first, and credit the source and the blogger's name, thank you!