This is the 20th day of my participation in the August Text Challenge. More challenges in August.
Now complete the fourth step: parse the chapter-list page of the novel to obtain the URL of each chapter, send a request for each URL, and extract the chapter content from the corresponding response.
First, write the crawler file. (Note: 1. We need to define a rule that extracts the URLs of the specific chapter pages of the book and parses the corresponding responses. 2. We need to write the callback function that extracts the actual chapter content.)
# -*- coding: utf-8 -*-
import datetime
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Bh3Spider(CrawlSpider):
    name = 'bh3'
    allowed_domains = ['book.zongheng.com']
    start_urls = ['http://book.zongheng.com/store/c0/c0/b0/u1/p1/v0/s1/t0/u0/i1/ALL.html']

    rules = (
        # Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        # 1. LinkExtractor is a class defined by the Scrapy framework that describes how to extract
        #    links from each crawled page; allow=r'Items/' works roughly like re.findall(r'Items/', response.text).
        # 2. callback='parse_item' names the callback function that handles the matched responses.
        # 3. The effect of follow=True: the responses generated from the extracted URLs are fed back
        #    into rules, so every rule is matched against them again.
        # 4. process_links names the method used to filter the links matched by the LinkExtractor.

        # Match the URL of each book
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/book/\d+.html',
                           restrict_xpaths="//div[@class='bookname']"),
             callback='parse_book', follow=True, process_links="process_booklink"),
        # Match the URL of the chapter catalog page
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/showchapter/\d+.html',
                           restrict_xpaths='//div[@class="fr link-group"]'),
             callback='parse_catalog', follow=True),
        # From the catalog page response, match the URLs of the individual chapters;
        # their responses are handed to the get_content callback.
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/chapter/\d+/\d+.html',
                           restrict_xpaths='//ul[@class="chapter-list clearfix"]'),
             callback='get_content', follow=False, process_links="process_chapterlink"),
        # restrict_xpaths is a LinkExtractor parameter: only URLs found inside the matching area
        # of the page (and also matching allow) pass this rule.
    )

    def process_booklink(self, links):
        for index, link in enumerate(links):
            # Limit the crawl to one book
            if index == 0:
                print("Limited to one book:", link.url)
                yield link
            else:
                return

    def process_chapterlink(self, links):
        for index, link in enumerate(links):
            # Limit the crawl to the first 21 chapters
            if index <= 20:
                print("Limited to 21 chapters:", link.url)
                yield link
            else:
                return

    def parse_book(self, response):
        print("Parsing book_url")
        # word count:
        book_nums = response.xpath('//div[@class="nums"]/span/i/text()').extract()[0]
        # title:
        book_name = response.xpath('//div[@class="book-name"]/text()').extract()[0].strip()
        category = response.xpath('//div[@class="book-label"]/a/text()').extract()[1]
        author = response.xpath('//div[@class="au-name"]/a/text()').extract()[0]
        status = response.xpath('//div[@class="book-label"]/a/text()').extract()[0]
        description = "".join(response.xpath('//div[@class="book-dec Jbook-dec hide"]/p/text()').extract())
        c_time = datetime.datetime.now()
        book_url = response.url
        catalog_url = response.css("a").re(r"http://book.zongheng.com/showchapter/\d+.html")[0]
        print(book_nums, book_name, category, author, status, description, c_time, book_url, catalog_url)

    def parse_catalog(self, response):
        print("Parsing chapter catalog", response.url)  # response.url is the URL the data came from
        # Note: chapter titles and chapter URLs must correspond one to one
        a_tags = response.xpath('//ul[@class="chapter-list clearfix"]/li/a')
        chapter_list = []
        for index, a in enumerate(a_tags):
            title = a.xpath("./text()").extract()[0]
            chapter_url = a.xpath("./@href").extract()[0]
            ordernum = index + 1
            c_time = datetime.datetime.now()
            catalog_url = response.url
            chapter_list.append([title, ordernum, c_time, chapter_url, catalog_url])
        print('Chapter list:', chapter_list)

    def get_content(self, response):
        content = "".join(response.xpath('//div[@class="content"]/p/text()').extract())
        chapter_url = response.url
        print("Chapter details:", content)
Then run the crawler and you will see that the data is fetched normally.
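For reference, the spider is started from the Scrapy project root with the standard command, using the name declared in the spider:

scrapy crawl bh3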
With this, we have obtained all of the target data, but restricted the crawl to the first novel only, in order to avoid getting banned! The restrict_xpaths parameter of LinkExtractor limits link extraction to the specified area of the page, so only URLs inside that area that also match allow pass the rule. This prevents several rules from matching the same URL and producing duplicate data!
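To make the role of restrict_xpaths concrete, here is a small standalone sketch (the HTML snippet is made up for illustration) showing that the LinkExtractor only returns links found inside the restricted area:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

html = b"""
<html><body>
<div class="bookname"><a href="http://book.zongheng.com/book/123456.html">Book A</a></div>
<div class="other"><a href="http://book.zongheng.com/book/654321.html">Book B</a></div>
</body></html>
"""
response = HtmlResponse(url="http://book.zongheng.com/store/", body=html, encoding="utf-8")

link_extractor = LinkExtractor(
    allow=r'http://book.zongheng.com/book/\d+.html',
    restrict_xpaths="//div[@class='bookname']",
)
# Only the link inside div.bookname is returned; the one in div.other is ignored.
print([link.url for link in link_extractor.extract_links(response)])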
3. Persist the data by writing it to the MySQL database
① Define the structured fields (write the items.py file):
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class NovelItem(scrapy.Item):
    """Fields for the book information parsed from each matched book URL."""
    # define the fields for your item here like:
    # name = scrapy.Field()
    category = scrapy.Field()
    book_name = scrapy.Field()
    author = scrapy.Field()
    status = scrapy.Field()
    book_nums = scrapy.Field()
    description = scrapy.Field()
    c_time = scrapy.Field()
    book_url = scrapy.Field()
    catalog_url = scrapy.Field()


class ChapterItem(scrapy.Item):
    """Fields for the chapter list parsed from each novel's chapter-list page."""
    # define the fields for your item here like:
    # name = scrapy.Field()
    chapter_list = scrapy.Field()


class ContentItem(scrapy.Item):
    """Fields for the chapter content parsed from each specific chapter page."""
    # define the fields for your item here like:
    # name = scrapy.Field()
    content = scrapy.Field()
    chapter_url = scrapy.Field()
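As a side note, a scrapy.Item behaves like a dictionary, but only the fields declared on the class may be assigned. A quick illustration (DemoItem is a throwaway example, not part of this project):

import scrapy

class DemoItem(scrapy.Item):
    name = scrapy.Field()

item = DemoItem()
item["name"] = "ok"          # declared field: assignment works
try:
    item["other"] = "oops"   # undeclared field: raises KeyError
except KeyError as e:
    print(e)                 # 'DemoItem does not support field: other'
print(dict(item))            # {'name': 'ok'}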
② Modify the crawler file (bh3.py) so that it packages the parsed data into the items defined above and yields them:
# -*- coding: utf-8 -*-
import datetime
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import NovelItem, ChapterItem, ContentItem


class Bh3Spider(CrawlSpider):
    name = 'bh3'
    allowed_domains = ['book.zongheng.com']
    start_urls = ['http://book.zongheng.com/store/c0/c0/b0/u1/p1/v0/s1/t0/u0/i1/ALL.html']

    rules = (
        # Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        # 1. LinkExtractor is a class defined by the Scrapy framework that describes how to extract
        #    links from each crawled page; allow=r'Items/' works roughly like re.findall(r'Items/', response.text).
        # 2. callback='parse_item' names the callback function that handles the matched responses.
        # 3. The effect of follow=True: the responses generated from the extracted URLs are fed back
        #    into rules, so every rule is matched against them again.
        # 4. process_links names the method used to filter the links matched by the LinkExtractor.

        # Match the URL of each book
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/book/\d+.html',
                           restrict_xpaths="//div[@class='bookname']"),
             callback='parse_book', follow=True, process_links="process_booklink"),
        # Match the URL of the chapter catalog page
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/showchapter/\d+.html',
                           restrict_xpaths='//div[@class="fr link-group"]'),
             callback='parse_catalog', follow=True),
        # From the catalog page response, match the URLs of the individual chapters;
        # their responses are handed to the get_content callback.
        Rule(LinkExtractor(allow=r'http://book.zongheng.com/chapter/\d+/\d+.html',
                           restrict_xpaths='//ul[@class="chapter-list clearfix"]'),
             callback='get_content', follow=False, process_links="process_chapterlink"),
        # restrict_xpaths is a LinkExtractor parameter: only URLs found inside the matching area
        # of the page (and also matching allow) pass this rule.
    )

    def process_booklink(self, links):
        for index, link in enumerate(links):
            # Limit the crawl to one book
            if index == 0:
                print("Limited to one book:", link.url)
                yield link
            else:
                return

    def process_chapterlink(self, links):
        for index, link in enumerate(links):
            # Limit the crawl to the first 21 chapters
            if index <= 20:
                print("Limited to 21 chapters:", link.url)
                yield link
            else:
                return

    def parse_book(self, response):
        print("Parsing book_url")
        # word count:
        book_nums = response.xpath('//div[@class="nums"]/span/i/text()').extract()[0]
        # title:
        book_name = response.xpath('//div[@class="book-name"]/text()').extract()[0].strip()
        category = response.xpath('//div[@class="book-label"]/a/text()').extract()[1]
        author = response.xpath('//div[@class="au-name"]/a/text()').extract()[0]
        status = response.xpath('//div[@class="book-label"]/a/text()').extract()[0]
        description = "".join(response.xpath('//div[@class="book-dec Jbook-dec hide"]/p/text()').extract())
        c_time = datetime.datetime.now()
        book_url = response.url
        catalog_url = response.css("a").re(r"http://book.zongheng.com/showchapter/\d+.html")[0]
        item = NovelItem()
        item["category"] = category
        item["book_name"] = book_name
        item["author"] = author
        item["status"] = status
        item["book_nums"] = book_nums
        item["description"] = description
        item["c_time"] = c_time
        item["book_url"] = book_url
        item["catalog_url"] = catalog_url
        yield item

    def parse_catalog(self, response):
        print("Parsing chapter catalog", response.url)  # response.url is the URL the data came from
        # Note: chapter titles and chapter URLs must correspond one to one
        a_tags = response.xpath('//ul[@class="chapter-list clearfix"]/li/a')
        chapter_list = []
        for index, a in enumerate(a_tags):
            title = a.xpath("./text()").extract()[0]
            chapter_url = a.xpath("./@href").extract()[0]
            ordernum = index + 1
            c_time = datetime.datetime.now()
            catalog_url = response.url
            chapter_list.append([title, ordernum, c_time, chapter_url, catalog_url])
        item = ChapterItem()
        item["chapter_list"] = chapter_list
        yield item

    def get_content(self, response):
        content = "".join(response.xpath('//div[@class="content"]/p/text()').extract())
        chapter_url = response.url
        item = ContentItem()
        item["content"] = content
        item["chapter_url"] = chapter_url
        yield item
③ Write the pipeline file (pipelines.py) to save the items to the database:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
import logging

from .items import NovelItem, ChapterItem, ContentItem

logger = logging.getLogger(__name__)  # Logger named after the current module, used to report errors.


class ZonghengPipeline(object):
    def open_spider(self, spider):
        # Connect to the database
        data_config = spider.settings["DATABASE_CONFIG"]
        if data_config["type"] == "mysql":
            self.conn = pymysql.connect(**data_config["config"])
            self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Write to the database
        if isinstance(item, NovelItem):
            # Write the book information
            sql = "select id from novel where book_name=%s and author=%s"
            self.cursor.execute(sql, (item["book_name"], item["author"]))
            if not self.cursor.fetchone():  # fetchone() returns one row of the last query, or None if there is none
                try:
                    # No id was found, so the novel does not exist yet
                    sql = "insert into novel(category,book_name,author,status,book_nums,description,c_time,book_url,catalog_url)" \
                          " values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
                    self.cursor.execute(sql, (
                        item["category"],
                        item["book_name"],
                        item["author"],
                        item["status"],
                        item["book_nums"],
                        item["description"],
                        item["c_time"],
                        item["book_url"],
                        item["catalog_url"],
                    ))
                    self.conn.commit()
                except Exception as e:  # Catch exceptions and log them
                    self.conn.rollback()
                    logger.warning("Book information error! url=%s %s" % (item["book_url"], e))
            return item
        elif isinstance(item, ChapterItem):
            # Write the chapter information
            try:
                sql = "insert into chapter (title,ordernum,c_time,chapter_url,catalog_url)" \
                      " values(%s,%s,%s,%s,%s)"
                # Note the form of this item: item["chapter_list"] == [(title, ordernum, c_time, chapter_url, catalog_url), ...]
                chapter_list = item["chapter_list"]
                self.cursor.executemany(sql, chapter_list)  # executemany() inserts multiple rows in one call, e.g. executemany(sql, [(), ()])
                self.conn.commit()
            except Exception as e:
                self.conn.rollback()
                logger.warning("Chapter information error! %s" % e)
            return item
        elif isinstance(item, ContentItem):
            try:
                sql = "update chapter set content=%s where chapter_url=%s"
                content = item["content"]
                chapter_url = item["chapter_url"]
                self.cursor.execute(sql, (content, chapter_url))
                self.conn.commit()
            except Exception as e:
                self.conn.rollback()
                logger.warning("Chapter content error! url=%s %s" % (item["chapter_url"], e))
            return item

    def close_spider(self, spider):
        # Close the database connection
        self.cursor.close()
        self.conn.close()
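The pipeline assumes the spider39 database already contains novel and chapter tables with the columns used in its SQL statements. The schema is not shown in the original post; the following is a hedged sketch with assumed column types:

import pymysql

conn = pymysql.connect(host="localhost", port=3306, user="root",
                       password="123456", db="spider39", charset="utf8")
with conn.cursor() as cursor:
    # Book information written by the NovelItem branch of the pipeline
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS novel (
            id INT PRIMARY KEY AUTO_INCREMENT,
            category VARCHAR(255),
            book_name VARCHAR(255),
            author VARCHAR(255),
            status VARCHAR(255),
            book_nums VARCHAR(255),
            description TEXT,
            c_time DATETIME,
            book_url VARCHAR(255),
            catalog_url VARCHAR(255)
        )""")
    # Chapter rows written by the ChapterItem branch; content filled in later by the ContentItem branch
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS chapter (
            id INT PRIMARY KEY AUTO_INCREMENT,
            title VARCHAR(255),
            ordernum INT,
            c_time DATETIME,
            chapter_url VARCHAR(255),
            catalog_url VARCHAR(255),
            content TEXT
        )""")
conn.commit()
conn.close()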
④ Auxiliary configuration (modify settings.py):
Four changes are needed: first, turn off the robots protocol; second, set a download delay; third, add default request headers; fourth, enable the item pipeline. A hedged sketch of these settings is shown below.
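A minimal sketch of those four changes, assuming the Scrapy project/module is named zongheng (inferred from the ZonghengPipeline class in step ③); the delay and header values are placeholders:

ROBOTSTXT_OBEY = False      # 1. stop obeying robots.txt

DOWNLOAD_DELAY = 3          # 2. delay between requests, to reduce the risk of being banned

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'   # 3. browser-like identity
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

ITEM_PIPELINES = {          # 4. enable the pipeline from step ③
    'zongheng.pipelines.ZonghengPipeline': 300,
}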
In addition, add a custom DATABASE_CONFIG setting used by the pipeline to connect to the local MySQL server:
DATABASE_CONFIG = {
    "type": "mysql",
    "config": {
        "host": "localhost",
        "port": 3306,
        "user": "root",
        "password": "123456",
        "db": "spider39",
        "charset": "utf8"
    }
}
🔆 In The End!
Start now and stick with it. A little progress every day, and in the near future you will thank yourself for the effort you put in!
This blogger will keep updating the crawler basics column and the crawler practice column. If you read this article carefully, feel free to like, bookmark, and comment with your thoughts, and follow this blogger to read more crawler articles in the days ahead!
If there are mistakes or inappropriate wording, please point them out in the comments, thank you! If you want to reprint this article, please contact me for consent and credit the source and the blogger's name, thanks!