This is the 20th day of my participation in the August Text Challenge. More challenges in August.

Now complete the fourth step: parse each novel's chapter list page to get the URL of every chapter, send requests to those URLs, and extract the chapter content from the responses.

First, write the crawler file. (Note: 1. We need to define Rule objects that extract the URLs of the specific chapter pages of a book and parse the corresponding responses. 2. We need to write the callback function that extracts the chapter content of the book.)

# -*- coding: utf-8 -*-
import datetime

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Bh3Spider(CrawlSpider):
    name = 'bh3'
    allowed_domains = ['book.zongheng.com']
    start_urls = ['http://book.zongheng.com/store/c0/c0/b0/u1/p1/v0/s1/t0/u0/i1/ALL.html']

    rules = (
        # Example: Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True)
        # 1. LinkExtractor is a class defined by the Scrapy framework that specifies how links are
        #    extracted from each crawled page; allow=r'Items/' behaves roughly like
        #    re.findall(r'Items/', response.text) against the page.
        # 2. callback='parse_item' names the callback function that handles the responses of the
        #    extracted links.
        # 3. follow=True: the responses generated from the extracted URLs are fed back into the
        #    rules, so further links can be matched from them.
        # 4. process_links names a function used to filter the links matched by the LinkExtractor.

        # Match the URL of each book
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/book/\d+.html', restrict_xpaths=("//div[@class='bookname']")), callback='parse_book', follow=True, process_links="process_booklink"),

        # Match the URL of the chapter list (catalog) page
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/showchapter/\d+.html', restrict_xpaths=('//div[@class="fr link-group"]')), callback='parse_catalog', follow=True),

        # From the response of the catalog page, match the URLs of the individual chapter pages;
        # each chapter response is handed to the callback below
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/chapter/\d+/\d+.html', restrict_xpaths=('//ul[@class="chapter-list clearfix"]')), callback='get_content', follow=False, process_links="process_chapterlink"),
        # restrict_xpaths is a LinkExtractor parameter that limits extraction to the given region of
        # the page: only URLs inside that region that also match `allow` pass the rule.
    )

    def process_booklink(self, links):
        for index, link in enumerate(links):
            # Limit the crawl to one book
            if index == 0:
                print("Limited to one book:", link.url)
                yield link
            else:
                return
                
    def process_chapterlink(self, links):
        for index, link in enumerate(links):
            # Limit to the first 21 chapters
            if index <= 20:
                print("Restricted chapter:", link.url)
                yield link
            else:
                return

    def parse_book(self, response):
        print("Parsing book_url")
        # word count:
        book_nums=response.xpath('//div[@class="nums"]/span/i/text()').extract()[0]
        # book name:
        book_name=response.xpath('//div[@class="book-name"]/text()').extract()[0].strip()
        category=response.xpath('//div[@class="book-label"]/a/text()').extract()[1]
        author=response.xpath('//div[@class="au-name"]/a/text()').extract()[0]
        status=response.xpath('//div[@class="book-label"]/a/text()').extract()[0]
        description="".join(response.xpath('//div[@class="book-dec Jbook-dec hide"]/p/text()').extract())
        c_time=datetime.datetime.now()
        book_url=response.url
        catalog_url=response.css("a").re(r"http://book.zongheng.com/showchapter/\d+.html")[0]
        print(book_nums,book_name,category,author,status,description,c_time,book_url,catalog_url)


    def parse_catalog(self, response):
        print("Parsing chapter table of contents", response.url)            # response.url is the URL the data came from
        # Note: chapter titles and chapter URLs must correspond one to one
        a_tags=response.xpath('//ul[@class="chapter-list clearfix"]/li/a')
        chapter_list=[]
        for index,a in enumerate(a_tags):
            title=a.xpath("./text()").extract()[0]
            chapter_url=a.xpath("./@href").extract()[0]
            ordernum=index+1
            c_time=datetime.datetime.now()
            catalog_url=response.url
            chapter_list.append([title,ordernum,c_time,chapter_url,catalog_url])
        print('Section Contents:',chapter_list)

    def get_content(self, response):
        content = "".join(response.xpath('//div[@class="content"]/p/text()').extract())
        chapter_url = response.url
        print("Chapter details:", content)

Then, run the spider and you will see that the data is extracted correctly.
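For reference, the spider is started from the project's root directory with the standard Scrapy command (bh3 is the name defined in the spider above):

scrapy crawl bh3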

Above, we obtained all the target data, but restricted the crawl to the first novel only, in order to avoid getting banned. The restrict_xpaths parameter of LinkExtractor limits extraction to the specified region of the page, so only the URLs inside that region that also match allow pass the rule. This prevents multiple rules from matching the same URL and producing duplicate data.
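As a minimal standalone sketch (not part of the project) of how restrict_xpaths narrows what a LinkExtractor returns; the HTML snippet and its links are made up for illustration:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A tiny fake page: one book link inside div.bookname and another one outside it
html = b"""
<div class='bookname'><a href='http://book.zongheng.com/book/123456.html'>Book A</a></div>
<div class='other'><a href='http://book.zongheng.com/book/654321.html'>Book B</a></div>
"""
response = HtmlResponse(url='http://book.zongheng.com/store/c0/c0/b0/u1/p1/v0/s1/t0/u0/i1/ALL.html',
                        body=html, encoding='utf8')

extractor = LinkExtractor(allow=r'http://book.zongheng.com/book/\d+.html',
                          restrict_xpaths="//div[@class='bookname']")
print([link.url for link in extractor.extract_links(response)])
# Prints only the link inside div.bookname; the second link is filtered out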

3. Persist the data and write it to the MySQL database
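The pipeline written later assumes that the novel and chapter tables already exist in the database. A one-off helper sketch that creates them is shown here; it is an assumption, not part of the original project: the column names are taken from the SQL used in the pipeline, while the types and lengths are guesses, and the connection parameters must match your local MySQL setup (they mirror the DATABASE_CONFIG shown at the end).

import pymysql

# Hypothetical one-off helper: creates the two tables the pipeline expects.
conn = pymysql.connect(host="localhost", port=3306, user="root",
                       password="123456", db="spider39", charset="utf8")
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS novel (
            id INT PRIMARY KEY AUTO_INCREMENT,
            category VARCHAR(50), book_name VARCHAR(255), author VARCHAR(100),
            status VARCHAR(20), book_nums VARCHAR(50), description TEXT,
            c_time DATETIME, book_url VARCHAR(255), catalog_url VARCHAR(255)
        )""")
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS chapter (
            id INT PRIMARY KEY AUTO_INCREMENT,
            title VARCHAR(255), ordernum INT, c_time DATETIME,
            chapter_url VARCHAR(255), catalog_url VARCHAR(255), content TEXT
        )""")
conn.commit()
conn.close()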

① Define structured fields (write items.py):
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class NovelItem(scrapy.Item):
    """Fields for the information parsed from each book's detail page."""
    # define the fields for your item here like:
    # name = scrapy.Field()
    category = scrapy.Field()
    book_name = scrapy.Field()
    author = scrapy.Field()
    status = scrapy.Field()
    book_nums = scrapy.Field()
    description = scrapy.Field()
    c_time = scrapy.Field()
    book_url = scrapy.Field()
    catalog_url = scrapy.Field()

class ChapterItem(scrapy.Item):
    """Fields for the chapter list parsed from each novel's chapter list page."""
    # define the fields for your item here like:
    # name = scrapy.Field()
    chapter_list = scrapy.Field()

class ContentItem(scrapy.Item):
    """Fields for the chapter content parsed from each specific chapter page."""
    # define the fields for your item here like:
    # name = scrapy.Field()
    content = scrapy.Field()
    chapter_url = scrapy.Field()

② Modify the crawler file so that it builds and yields the items defined above:
# -*- coding: utf-8 -*-
import datetime

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import NovelItem,ChapterItem,ContentItem


class Bh3Spider(CrawlSpider):
    name = 'bh3'
    allowed_domains = ['book.zongheng.com']
    start_urls = ['http://book.zongheng.com/store/c0/c0/b0/u1/p1/v0/s1/t0/u0/i1/ALL.html']

    rules = (
        # Example: Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True)
        # 1. LinkExtractor is a class defined by the Scrapy framework that specifies how links are
        #    extracted from each crawled page; allow=r'Items/' behaves roughly like
        #    re.findall(r'Items/', response.text) against the page.
        # 2. callback='parse_item' names the callback function that handles the responses of the
        #    extracted links.
        # 3. follow=True: the responses generated from the extracted URLs are fed back into the
        #    rules, so further links can be matched from them.
        # 4. process_links names a function used to filter the links matched by the LinkExtractor.

        # Match the URL of each book
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/book/\d+.html', restrict_xpaths=("//div[@class='bookname']")), callback='parse_book', follow=True, process_links="process_booklink"),

        # Match the URL of the chapter list (catalog) page
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/showchapter/\d+.html', restrict_xpaths=('//div[@class="fr link-group"]')), callback='parse_catalog', follow=True),

        # From the response of the catalog page, match the URLs of the individual chapter pages;
        # each chapter response is handed to the callback below
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/chapter/\d+/\d+.html', restrict_xpaths=('//ul[@class="chapter-list clearfix"]')), callback='get_content', follow=False, process_links="process_chapterlink"),
        # restrict_xpaths is a LinkExtractor parameter that limits extraction to the given region of
        # the page: only URLs inside that region that also match `allow` pass the rule.
    )

    def process_booklink(self, links):
        for index, link in enumerate(links):
            # Limit the crawl to one book
            if index == 0:
                print("Limited to one book:", link.url)
                yield link
            else:
                return
    def process_chapterlink(self, links):
        for index, link in enumerate(links):
            # Limit to the first 21 chapters
            if index <= 20:
                print("Restricted chapter:", link.url)
                yield link
            else:
                return

    def parse_book(self, response):
        print("Parsing book_url")
        # word count:
        book_nums=response.xpath('//div[@class="nums"]/span/i/text()').extract()[0]
        # book name:
        book_name=response.xpath('//div[@class="book-name"]/text()').extract()[0].strip()
        category=response.xpath('//div[@class="book-label"]/a/text()').extract()[1]
        author=response.xpath('//div[@class="au-name"]/a/text()').extract()[0]
        status=response.xpath('//div[@class="book-label"]/a/text()').extract()[0]
        description="".join(response.xpath('//div[@class="book-dec Jbook-dec hide"]/p/text()').extract())
        c_time=datetime.datetime.now()
        book_url=response.url
        catalog_url=response.css("a").re(r"http://book.zongheng.com/showchapter/\d+.html")[0]

        item=NovelItem()
        item["category"]=category
        item["book_name"]=book_name
        item["author"]=author
        item["status"]=status
        item["book_nums"]=book_nums
        item["description"]=description
        item["c_time"]=c_time
        item["book_url"]=book_url
        item["catalog_url"]=catalog_url
        yield item


    def parse_catalog(self, response):
        print("Parsing chapter table of contents", response.url)            # response.url is the URL the data came from
        # Note: chapter titles and chapter URLs must correspond one to one
        a_tags=response.xpath('//ul[@class="chapter-list clearfix"]/li/a')
        chapter_list=[]
        for index,a in enumerate(a_tags):
            title=a.xpath("./text()").extract()[0]
            chapter_url=a.xpath("./@href").extract()[0]
            ordernum=index+1
            c_time=datetime.datetime.now()
            catalog_url=response.url
            chapter_list.append([title,ordernum,c_time,chapter_url,catalog_url])
            
        item=ChapterItem()
        item["chapter_list"]=chapter_list
        yield item


    def get_content(self, response):
        content = "".join(response.xpath('//div[@class="content"]/p/text()').extract())
        chapter_url = response.url

        item=ContentItem()
        item["content"]=content
        item["chapter_url"]=chapter_url
        yield item
③ Write the pipeline file (pipelines.py) to store the data in MySQL:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
import logging
from .items import NovelItem,ChapterItem,ContentItem
logger=logging.getLogger(__name__)      # Create a logger named after the current module; used to log errors

class ZonghengPipeline(object):
    def open_spider(self, spider):
        # Connect to the database
        data_config = spider.settings["DATABASE_CONFIG"]
        if data_config["type"] == "mysql":
            self.conn = pymysql.connect(**data_config["config"])
            self.cursor = self.conn.cursor()
    def process_item(self, item, spider):
        # Write to the database
        if isinstance(item,NovelItem):
            # Write book information
            sql="select id from novel where book_name=%s and author=%s"
            self.cursor.execute(sql,(item["book_name"], item["author"]))
            if not self.cursor.fetchone():          # fetchone() fetches one row of the query result; it is None if there is no matching row
                try:
                    # if you don't get an ID, the novel doesn't exist
                    sql="insert into novel(category,book_name,author,status,book_nums,description,c_time,book_url,catalog_url)"\
                        "values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
                    self.cursor.execute(sql,(
                        item["category"],
                        item["book_name"],
                        item["author"],
                        item["status"],
                        item["book_nums"],
                        item["description"],
                        item["c_time"],
                        item["book_url"],
                        item["catalog_url"],
                    ))
                    self.conn.commit()
                except Exception as e:          # Catch exceptions and log them
                    self.conn.rollback()
                    logger.warning("Book information error! url=%s %s" % (item["book_url"], e))
            return item
        elif isinstance(item,ChapterItem):
            # Write chapter information
            try:
                sql="insert into chapter (title,ordernum,c_time,chapter_url,catalog_url)"\
                    "values(%s,%s,%s,%s,%s)"
                # Note: item["chapter_list"] is a list of tuples: [(title, ordernum, c_time, chapter_url, catalog_url), ...]
                chapter_list=item["chapter_list"]
                self.cursor.executemany(sql,chapter_list)     # executemany() inserts multiple rows in one call, e.g. executemany(sql, [(...), (...)])
                self.conn.commit()
            except Exception as e:
                self.conn.rollback()
                logger.warning("Section information error! %s"%e)
            return item
        elif isinstance(item,ContentItem):
            try:
                sql="update chapter set content=%s where chapter_url=%s"
                content=item["content"]
                chapter_url=item["chapter_url"]
                self.cursor.execute(sql,(content,chapter_url))
                self.conn.commit()
            except Exception as e:
                self.conn.rollback()
                logger.warning("Chapter content error! url=%s %s") % (item["chapter_url"], e)
            return item

    def close_spider(self, spider):
        # Close the database connection
        self.cursor.close()
        self.conn.close()


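As a minimal standalone sketch of what executemany() does with the chapter rows (the connection parameters mirror the DATABASE_CONFIG shown below, and the example rows are made up for illustration):

import datetime
import pymysql

conn = pymysql.connect(host="localhost", port=3306, user="root",
                       password="123456", db="spider39", charset="utf8")
cursor = conn.cursor()

sql = ("insert into chapter (title, ordernum, c_time, chapter_url, catalog_url) "
       "values (%s, %s, %s, %s, %s)")
# Each tuple has the same shape as the entries in item["chapter_list"]
rows = [
    ("Chapter 1", 1, datetime.datetime.now(), "http://book.zongheng.com/chapter/1/1.html",
     "http://book.zongheng.com/showchapter/1.html"),
    ("Chapter 2", 2, datetime.datetime.now(), "http://book.zongheng.com/chapter/1/2.html",
     "http://book.zongheng.com/showchapter/1.html"),
]
cursor.executemany(sql, rows)   # one call inserts every row in the list
conn.commit()
cursor.close()
conn.close()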
④ Auxiliary configuration (modify settings.py):

First, disable the robots protocol; second, set a download delay; third, add default request headers; fourth, enable the item pipeline:
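A sketch of the corresponding settings.py entries (these are standard Scrapy settings; the pipeline path assumes the project is named zongheng, matching the ZonghengPipeline class above, and the delay and header values are just examples):

ROBOTSTXT_OBEY = False            # 1. do not obey robots.txt
DOWNLOAD_DELAY = 3                # 2. delay between requests, in seconds
DEFAULT_REQUEST_HEADERS = {       # 3. default request headers
    'User-Agent': 'Mozilla/5.0',
}
ITEM_PIPELINES = {                # 4. enable the pipeline
    'zongheng.pipelines.ZonghengPipeline': 300,
}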

Fifth, configure the connection to the local MySQL database:

DATABASE_CONFIG={
    "type": "mysql",
    "config": {
        "host": "localhost",
        "port": 3306,
        "user": "root",
        "password": "123456",
        "db": "spider39",
        "charset": "utf8"
    }
}

🔆 In The End!

Start now and stick with it; a little progress every day, and in the near future you will thank yourself for your efforts!

This blogger will continue to update the crawler basics column and the crawler practice column. If you read this article carefully, feel free to like, bookmark, and share your thoughts in a comment. You can also follow this blogger to read more crawler articles in the days ahead!

If there are any mistakes or inappropriate wording, please point them out in the comments, thank you! If you want to reprint this article, please contact me for consent first and credit the source and the blogger's name, thank you!