Preface
Reading novels is a habit I have kept for many years. “Coiling Dragon”, “Battle Through the Heavens”, “Renegade Immortal”, “A Record of a Mortal’s Journey to Immortality” and the like accompanied me all through my school days. Recently I found that the novel-reading apps on iOS offer a poor experience: frequent pop-up ads, delayed updates and forced sharing. So one rainy night, I decided to stop putting up with the apps and throw together a novel crawler myself.
Introducing Scrapy
Scrapy is a mainstream Python crawler framework. It makes it easy to grab information from web pages by URL, and it provides more tooling and higher concurrency than the traditional Requests library. The official documentation is the recommended place to learn it.
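For a sense of what the framework saves you, fetching a single page with the plain Requests library looks roughly like the sketch below; everything beyond the bare request (scheduling, concurrency, parsing helpers) you would have to build yourself, which is what Scrapy provides out of the box. The URL here is only a placeholder.
import requests

# Fetch one page by hand; Scrapy layers scheduling, concurrency and
# parsing helpers on top of this kind of request.
response = requests.get('https://example.com/some-page')
print(response.status_code)
print(len(response.text))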
That said, it doesn’t matter if you know nothing about Scrapy yet; you should still be able to pull this off after reading this article.
Scrapy in practice
Before we start, we need to have the following things ready:
- The URL of the novel you want to crawl
- A working Python environment
- Basic Python syntax
Selecting a site
Here I chose m.book9.net/. I picked this site for three reasons:
- Fast update speed (stable service)
- Simple page structure (easy to parse)
- No anti-crawler measures (easy to work with)
Next, find the home page of the novel you want to follow.
For example, Chen Dong’s The Holy Ruins.
If we were following updates by hand, we would have to open the site each time and click the first entry under the latest-chapter tab to reach the specific chapter page.
Following the steps above, I sketched out a flow roughly like this:
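- Open the novel’s home page
- Find the link of the first entry under the latest chapter list
- Follow that link to the chapter page
- Parse out the chapter title and content
- Save the chapter as a local HTML file and open it in the browser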
So next, we just need to translate these steps into code.
Creating the project
We are going to create a Scrapy project, but before doing so, make sure you have Python installed. The framework itself works with both Python 2 and 3, so you don’t need to worry about version differences.
My local environment is Python 3, so things may differ slightly from Python 2.
1. Install Scrapy
> pip3 install scrapy
2. Create a crawler project and name it NovelCrawler
> scrapy startproject NovelCrawler
3. Create a crawler service for the target domain
> scrapy genspider novel m.book9.net
That covers the basic project creation. Once these commands have run, you can use
> scrapy crawl novel
to start the crawler service. However, our crawler does not implement any rules yet, so running the command does nothing; we need to add some crawling rules to the project.
Writing the crawler
Next we open the project we just created in PyCharm.
All of Scrapy’s crawlers live under the spiders directory, and that is where we add our crawler file, Novel.py.
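For reference, a project generated with the commands above has roughly the following layout (the standard startproject skeleton); only the spider file is edited in this article:
NovelCrawler/
    scrapy.cfg            # deployment configuration
    NovelCrawler/
        __init__.py
        items.py          # item definitions (unused here)
        middlewares.py    # middlewares (unused here)
        pipelines.py      # item pipelines (unused here)
        settings.py       # project settings
        spiders/
            __init__.py
            Novel.py      # the crawler written in this article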
Requesting the novel home page
# encoding:utf-8
import scrapy


class NovelSpider(scrapy.Spider):
    # The crawler name, the same as the service created above
    name = 'novel'
    # The domains the crawler is allowed to visit
    allowed_domains = ['m.book9.net']
    # URL to start from: the home page of The Holy Ruins
    start_urls = ['https://m.book9.net/wapbook/10.html']

    # Default callback invoked when the request succeeds
    def parse(self, response):
        pass
In the code above, the input argument to the parse function, the response object, is something we know nothing about in advance, and this is one of the more troublesome aspects of learning Python. One way to deal with it is to use PyCharm’s Debug feature to inspect the argument.
As you can see from the figure above, the response contains the HTML of the requested page. So we just need to work on it a little and extract the parts we need.
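If you would rather not attach a debugger, Scrapy also ships an interactive shell that lets you poke at the response directly; a quick sketch using the home page URL from this article:
> scrapy shell "https://m.book9.net/wapbook/10.html"
Inside the shell a response object is already available, so you can try expressions such as response.xpath('//title/text()').extract_first() and see the result immediately.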
Getting the latest chapter URL
So how do we parse out the nodes we need? The response provides an xpath method: we only need to pass in an XPath expression to locate the corresponding HTML tag node.
It doesn’t matter if you don’t know XPath syntax; Chrome can copy an element’s XPath in one click (right-click -> Inspect -> Copy -> Copy XPath), as shown below:
Using XPath, we can get the address of the latest chapter from this page:
# encoding:utf-8
import scrapy


class NovelSpider(scrapy.Spider):
    name = 'novel'
    allowed_domains = ['m.book9.net']
    start_urls = ['https://m.book9.net/wapbook/10.html']

    def parse(self, response):
        # XPath of the tag that holds the jump link
        context = response.xpath('/html/body/div[3]/div[2]/p[1]/a/@href')
        # The first result in the list is the URL of the latest chapter
        url = context.extract_first()
        print(url)
Requesting chapter information
Once we have the link, we can jump to the next page. The response also provides a follow method, which makes it convenient to follow relative links within the site.
# encoding:utf-8
import scrapy


class NovelSpider(scrapy.Spider):
    name = 'novel'
    allowed_domains = ['m.book9.net']
    start_urls = ['https://m.book9.net/wapbook/10.html']

    def parse(self, response):
        context = response.xpath('/html/body/div[3]/div[2]/p[1]/a/@href')
        url = context.extract_first()
        # Follow the relative link and hand the result to the specified callback
        yield response.follow(url=url, callback=self.parse_article)

    # Custom callback method
    def parse_article(self, response):
        # The response here is the specific chapter page
        print(response)
(If the yield keyword in the code confuses you, see the short sketch below.)
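In short, yield turns a function into a generator that produces values lazily, one at a time, and Scrapy simply iterates over whatever a callback yields. A minimal standalone sketch, independent of Scrapy:
def count_up_to(n):
    i = 1
    while i <= n:
        # Execution pauses here and resumes on the next loop iteration
        yield i
        i += 1


for value in count_up_to(3):
    print(value)  # prints 1, 2, 3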
With the chapter page in hand, we just need to parse its HTML. This part is very detail-oriented and only applies to this site, so I won’t go through it line by line. Here is the commented code:
# encoding:utf-8
import re
import os
import scrapy


class NovelSpider(scrapy.Spider):
    name = 'novel'
    allowed_domains = ['m.book9.net']
    start_urls = ['https://m.book9.net/wapbook/10.html']

    def parse(self, response):
        # XPath of the tag that holds the jump link
        context = response.xpath('/html/body/div[3]/div[2]/p[1]/a/@href')
        # Follow the relative link and hand the result to the specified callback
        url = context.extract_first()
        yield response.follow(url=url, callback=self.parse_article)

    def parse_article(self, response):
        # Get the chapter title
        title = self.generate_title(response)
        # Build the HTML for the chapter
        html = self.build_article_html(title, response)
        # Save the chapter HTML locally
        self.save_file(title + ".html", html)
        # Open the local HTML in the default browser (macOS `open`);
        # spaces in the title are escaped for the shell
        os.system("open " + title.replace(" ", "\\ ") + ".html")

    @staticmethod
    def build_article_html(title, response):
        # Get the chapter content
        context = response.xpath('//*[@id="chaptercontent"]').extract_first()
        # Strip the <a> tags (in-article jump links) from the content
        re_c = re.compile(r'<\s*a[^>]*>[^<]*<\s*/\s*a\s*>')
        article = re_c.sub("", context)
        # Splice the chapter HTML together: a centered, bold title followed by the content
        html = '<html><div align="center"><b><font size="5">' \
               + title + '</font></b></div>' + article + "</html>"
        return html

    @staticmethod
    def generate_title(response):
        title = response.xpath('//*[@id="read"]/div[1]/text()').extract()
        return "".join(title).strip()

    @staticmethod
    def save_file(file_name, context):
        fh = open(file_name, 'wb')
        fh.write(context.encode(encoding="utf-8"))
        fh.close()
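The trickiest part of build_article_html is the regular expression that strips the in-article <a> tags; here is a quick standalone check of what it does (the sample HTML is made up):
import re

re_c = re.compile(r'<\s*a[^>]*>[^<]*<\s*/\s*a\s*>')
sample = '<p>Chapter text <a href="/next.html">Next page</a> and more text</p>'
# The <a>...</a> link is removed, the surrounding text is kept
print(re_c.sub("", sample))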
Now we can run the following command in the current directory:
> scrapy crawl novel
Demo
Reflections
After writing the whole thing, I found it hard to make one piece of code fit multiple sites. So when you need to crawl several sites, it is better to write a separate crawler file for each site.
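Concretely, that means one spider file per site under spiders/, each with its own name, allowed domains and parsing rules. A hypothetical second spider might look like the sketch below (the site and URL are placeholders, not a real target):
# spiders/other_site.py (hypothetical example)
import scrapy


class OtherSiteSpider(scrapy.Spider):
    # Each spider needs a unique name so it can be run on its own,
    # e.g. `scrapy crawl other_site`
    name = 'other_site'
    allowed_domains = ['m.example-novels.com']
    start_urls = ['https://m.example-novels.com/book/1.html']

    def parse(self, response):
        # Site-specific XPath rules go here
        pass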
Source code address (I hear the stars will double by the end of the year)