Crawling a novel
spider
import scrapy
from xiaoshuo.items import XiaoshuoItem


class XiaoshuoSpiderSpider(scrapy.Spider):
    name = 'xiaoshuo_spider'
    allowed_domains = ['zy200.com']
    url = 'http://www.zy200.com/5/5943/'
    start_urls = [url + '11667352.html']

    def parse(self, response):
        # Extract the chapter text and the link to the next page
        info = response.xpath("/html/body/div[@id='content']/text()").extract()
        href = response.xpath("//div[@class='zfootbar']/a[3]/@href").extract_first()

        xs_item = XiaoshuoItem()
        xs_item['content'] = info
        yield xs_item

        # Keep requesting the next page until the link points back to the index
        if href != 'index.html':
            new_url = self.url + href
            yield scrapy.Request(new_url, callback=self.parse)
items
import scrapy


class XiaoshuoItem(scrapy.Item):
    # define the fields for your item here like:
    content = scrapy.Field()
    href = scrapy.Field()
pipeline
class XiaoshuoPipeline(object):
    def __init__(self):
        self.filename = open("dp1.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # item["content"] is a list of text nodes, so join it into one string
        content = ''.join(item["content"]) + '\n'
        self.filename.write(content)
        self.filename.flush()
        return item

    def close_spider(self, spider):
        self.filename.close()
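For the pipeline to actually run, it also has to be enabled in the project's settings.py. A minimal sketch, assuming the project is named xiaoshuo as in the imports above:

ITEM_PIPELINES = {
    # the number is the pipeline's execution priority (lower runs first)
    'xiaoshuo.pipelines.XiaoshuoPipeline': 300,
}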
1. CrawlSpiders
Schematic diagram
sequenceDiagram
    start_urls ->> Scheduler: initial URL
    Scheduler ->> Downloader: request
    Downloader ->> Rules: response
    Rules ->> Data extraction: response
    Rules ->> Scheduler: new URL
You can quickly generate a spider from the CrawlSpider template with the following command
scrapy genspider -t crawl <spider_name> <allowed_domain>
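For example, a command like the following (the spider name here is only an illustration) would create a spider file with the CrawlSpider skeleton and an empty rules list already filled in:

scrapy genspider -t crawl xiaoshuo_crawl zy200.com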
Spider is the base class of all crawlers, and CrawlSpider is derived from Spider. The plain Spider class is designed to crawl only the pages in the start_urls list; CrawlSpider, which extracts links from the crawled pages and keeps crawling them, is better suited to following a whole site.
2. Rule object
The Rule and CrawlSpider classes are both located in the scrapy.contrib.spiders module (in newer Scrapy versions, scrapy.spiders)
class scrapy.contrib.spiders.Rule(
    link_extractor, callback=None, cb_kwargs=None, follow=None,
    process_links=None, process_request=None
)
Parameter Meanings:
- link_extractor: a LinkExtractor object that defines which links to extract
- callback: the function to call for each link extracted by link_extractor
  - Note on the callback parameter: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses its parse method to implement its own logic, so overriding parse breaks the CrawlSpider
- follow: whether links extracted from the response by this rule should also be followed; when callback is None, it defaults to True
- process_links: mainly used to filter the links extracted by link_extractor
- process_request: used to filter the requests generated by this rule
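A minimal sketch of how these parameters fit together (the regular expression, callback name, and helper function below are illustrative, not part of the original project):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


def drop_index_links(links):
    # process_links hook: discard links that point back to the index page
    return [link for link in links if not link.url.endswith('index.html')]


rules = (
    Rule(
        LinkExtractor(allow=(r'\d+\.html',)),  # which links to extract
        callback='parse_page',                 # method name as a string, never 'parse'
        follow=True,                           # keep applying the rule to new pages
        process_links=drop_index_links,        # filter the extracted links
    ),
)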
3.LinkExtractors
3.1 concept
As the name suggests, a LinkExtractor extracts links from responses
3.2 role
The only public method of every LinkExtractor is extract_links(), which receives a Response object and returns a list of scrapy.link.Link objects
3.3 the use of
class scrapy.linkextractors.LinkExtractor(
allow = (),
deny = (),
allow_domains = (),
deny_domains = (),
deny_extensions = None,
restrict_xpaths = (),
tags = ('a','area'),
attrs = ('href',),
canonicalize = True,
unique = True,
process_value = None
)
Main parameters:
- allow: only URLs matching the regular expression (or list of regular expressions) in the tuple are extracted; if empty, all URLs match.
- deny: URLs matching the regular expression (or list of regular expressions) are never extracted.
- allow_domains: only links whose domains are in this list are extracted.
- deny_domains: links whose domains are in this list are never extracted.
- restrict_xpaths: XPath expressions used together with allow to restrict where links are extracted from (selects nodes, not attributes).
- restrict_css: CSS expressions used together with allow to restrict where links are extracted from (selects nodes, not attributes).
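A minimal sketch combining a few of these parameters (the regular expressions and domain below are illustrative):

from scrapy.linkextractors import LinkExtractor

# Extract chapter-style links such as .../17829388.shtml, skip index pages,
# and stay on a single domain
link = LinkExtractor(
    allow=(r'/read/\d+/\d+/\d+\.shtml',),
    deny=(r'index\.html',),
    allow_domains=('fhxiaoshuo.com',),
)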
3.3.1 Viewing the Effect (Verification in Shell)
First, run
scrapy shell http://www.fhxiaoshuo.com/read/33/33539/17829387.shtml
Then import the relevant module:
from scrapy.linkextractors import LinkExtractor
Create a LinkExtractor to extract the target link from the current page
link = LinkExtractor(restrict_xpaths=(r'//div[@class="bottem"]/a[4]'))
Call the extract_links() method of the LinkExtractor instance to see the matching result
link.extract_links(response)
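extract_links() returns a list of scrapy.link.Link objects; a quick way to inspect the matched URL in the shell (a minimal sketch):

links = link.extract_links(response)
if links:
    print(links[0].url)  # absolute URL of the first matched link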
3.3.2 Viewing the Effect (CrawlSpider version)
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from xiaoshuo.items import XiaoshuoItem


class XiaoshuoSpiderSpider(CrawlSpider):
    name = 'xiaoshuo_spider'
    allowed_domains = ['fhxiaoshuo.com']
    start_urls = ['http://www.fhxiaoshuo.com/read/33/33539/17829387.shtml']
    rules = [
        # follow=True keeps applying the rule to the newly crawled pages
        # (with a callback set, follow would otherwise default to False)
        Rule(LinkExtractor(restrict_xpaths=(r'//div[@class="bottem"]/a[4]')),
             callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        info = response.xpath("//div[@id='TXT']/text()").extract()
        it = XiaoshuoItem()
        it['info'] = info
        yield it
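Note that parse_item stores the chapter text under it['info'], so the item class used here needs an info field. A minimal sketch (this field is an assumption; the items.py shown earlier only defined content and href):

import scrapy


class XiaoshuoItem(scrapy.Item):
    info = scrapy.Field()  # chapter text filled in by parse_item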
Note:
rules = [
    Rule(LinkExtractor(restrict_xpaths=(r'//div[@class="bottem"]/a[4]')), callback='parse_item'),
]
- callback is followed by the function name as a string (in quotes)
- the callback function cannot be named parse
- watch the formatting of the rules list