Crawling a novel
spider
import scrapy
from xiaoshuo.items import XiaoshuoItem


class XiaoshuoSpiderSpider(scrapy.Spider):
    name = 'xiaoshuo_spider'
    allowed_domains = ['zy200.com']
    url = 'http://www.zy200.com/5/5943/'
    start_urls = [url + '11667352.html']

    def parse(self, response):
        # Extract the chapter text and the link to the next page
        info = response.xpath("/html/body/div[@id='content']/text()").extract()
        href = response.xpath("//div[@class='zfootbar']/a[3]/@href").extract_first()

        xs_item = XiaoshuoItem()
        xs_item['content'] = info
        yield xs_item

        # Keep requesting the next page until the link points back to the index
        if href != 'index.html':
            new_url = self.url + href
            yield scrapy.Request(new_url, callback=self.parse)
items
import scrapy


class XiaoshuoItem(scrapy.Item):
    # define the fields for your item here like:
    content = scrapy.Field()
    href = scrapy.Field()
pipeline
class XiaoshuoPipeline(object):
    def __init__(self):
        self.filename = open("dp1.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # item["content"] is a list of text nodes, so join it into one string
        content = ''.join(item["content"]) + '\n'
        self.filename.write(content)
        self.filename.flush()
        return item

    def close_spider(self, spider):
        self.filename.close()
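For the pipeline to actually run, it also has to be enabled in the project's settings.py. A minimal sketch, assuming the project is named xiaoshuo as in the imports above:

ITEM_PIPELINES = {
    # the number is the pipeline's execution priority (lower runs first)
    'xiaoshuo.pipelines.XiaoshuoPipeline': 300,
}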
1. CrawlSpiders
Schematic diagram
sequenceDiagram
    start_urls ->> Scheduler: initial URL
    Scheduler ->> Downloader: request
    Downloader ->> Rules: response
    Rules ->> Data extraction: response
    Rules ->> Scheduler: new URL
You can quickly generate a spider from the CrawlSpider template with the following command
scrapy genspider -t crawl <spider_name> <allowed_domain>
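For example, a command like the following (the spider name here is only an illustration) would create a spider file with the CrawlSpider skeleton and an empty rules list already filled in:

scrapy genspider -t crawl xiaoshuo_crawl zy200.com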
Spider is the base class of all crawlers, and CrawlSpider is derived from Spider. The plain Spider class is designed to crawl only the pages in the start_urls list; CrawlSpider, which extracts links from the crawled pages and keeps crawling them, is better suited to following a whole site.
2. Rule object
The Rule and CrawlSpider classes are both located in the scrapy.contrib.spiders module (in newer Scrapy versions, scrapy.spiders)
class scrapy.contrib.spiders.Rule(
    link_extractor, callback=None, cb_kwargs=None, follow=None,
    process_links=None, process_request=None
)
Parameter Meanings:
- link_extractor: a LinkExtractor object that defines which links to extract
- callback: the function to call for each link extracted by link_extractor
  - Note on the callback parameter: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses its parse method to implement its own logic, so overriding parse breaks the CrawlSpider
- follow: whether links extracted from the response by this rule should also be followed; when callback is None, it defaults to True
- process_links: mainly used to filter the links extracted by link_extractor
- process_request: used to filter the requests generated by this rule
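A minimal sketch of how these parameters fit together (the regular expression, callback name, and helper function below are illustrative, not part of the original project):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


def drop_index_links(links):
    # process_links hook: discard links that point back to the index page
    return [link for link in links if not link.url.endswith('index.html')]


rules = (
    Rule(
        LinkExtractor(allow=(r'\d+\.html',)),  # which links to extract
        callback='parse_page',                 # method name as a string, never 'parse'
        follow=True,                           # keep applying the rule to new pages
        process_links=drop_index_links,        # filter the extracted links
    ),
)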
3.LinkExtractors
3.1 concept
As the name suggests, a LinkExtractor extracts links from responses
3.2 role
The only public method of every LinkExtractor is extract_links(), which receives a Response object and returns a list of scrapy.link.Link objects
3.3 the use of
class scrapy.linkextractors.LinkExtractor(
allow = (),
deny = (),
allow_domains = (),
deny_domains = (),
deny_extensions = None,
restrict_xpaths = (),
tags = ('a','area'),
attrs = ('href',),
canonicalize = True,
unique = True,
process_value = None
)
Main parameters:
- allow: only URLs matching the regular expression (or list of regular expressions) in the tuple are extracted; if empty, all URLs match.
- deny: URLs matching the regular expression (or list of regular expressions) are never extracted.
- allow_domains: only links whose domains are in this list are extracted.
- deny_domains: links whose domains are in this list are never extracted.
- restrict_xpaths: XPath expressions used together with allow to restrict where links are extracted from (selects nodes, not attributes).
- restrict_css: CSS expressions used together with allow to restrict where links are extracted from (selects nodes, not attributes).
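A minimal sketch combining a few of these parameters (the regular expressions and domain below are illustrative):

from scrapy.linkextractors import LinkExtractor

# Extract chapter-style links such as .../17829388.shtml, skip index pages,
# and stay on a single domain
link = LinkExtractor(
    allow=(r'/read/\d+/\d+/\d+\.shtml',),
    deny=(r'index\.html',),
    allow_domains=('fhxiaoshuo.com',),
)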
3.3.1 Viewing the Effect (Verification in Shell)
First, run
scrapy shell http://www.fhxiaoshuo.com/read/33/33539/17829387.shtml
Then import the relevant module:
from scrapy.linkextractors import LinkExtractor
Create a LinkExtractor to extract the target link from the current page
link = LinkExtractor(restrict_xpaths=(r'//div[@class="bottem"]/a[4]'))
Call the extract_links() method of the LinkExtractor instance to see the matching result
link.extract_links(response)
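extract_links() returns a list of scrapy.link.Link objects; a quick way to inspect the matched URL in the shell (a minimal sketch):

links = link.extract_links(response)
if links:
    print(links[0].url)  # absolute URL of the first matched link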
3.3.2 Viewing the Effect (CrawlSpider version)
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from xiaoshuo.items import XiaoshuoItem


class XiaoshuoSpiderSpider(CrawlSpider):
    name = 'xiaoshuo_spider'
    allowed_domains = ['fhxiaoshuo.com']
    start_urls = ['http://www.fhxiaoshuo.com/read/33/33539/17829387.shtml']
    rules = [
        # follow=True keeps applying the rule to the newly crawled pages
        # (with a callback set, follow would otherwise default to False)
        Rule(LinkExtractor(restrict_xpaths=(r'//div[@class="bottem"]/a[4]')),
             callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        info = response.xpath("//div[@id='TXT']/text()").extract()
        it = XiaoshuoItem()
        it['info'] = info
        yield it
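Note that parse_item stores the chapter text under it['info'], so the item class used here needs an info field. A minimal sketch (this field is an assumption; the items.py shown earlier only defined content and href):

import scrapy


class XiaoshuoItem(scrapy.Item):
    info = scrapy.Field()  # chapter text filled in by parse_item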
Note:
rules = [
    Rule(LinkExtractor(restrict_xpaths=(r'//div[@class="bottem"]/a[4]')), callback='parse_item'),
]
- callback is followed by the function name as a string (in quotes)
- the callback function cannot be named parse
- watch the formatting of the rules list