In several previous articles I introduced the client-side crawling platform (DSpider). Today we start from scratch and implement the ability to crawl any novel on the Apex Novel (顶点小说) site.
If you are not familiar with client-side crawling, take a look at my previous posts:
Crawler technology (1): Understanding the state of crawler technology
Crawler technology (2): Client-side crawlers
Crawler technology (3): The client-side crawler Android SDK is released
The client-side crawler iOS SDK is out!
Client-side crawling: answers to user questions
DSpider overview
Integrating the SDK
DSpider's official website has detailed integration documentation and provides a demo. The examples in this article are based on that demo, so please download the demo for your platform (GitHub) first.
Ios: dspider.dtworkroom.com/document/io…
Android: dspider.dtworkroom.com/document/an…
The documentation above covers integration in detail, so here we focus on how to write the crawl script.
Writing the crawl script
Analyzing the pages
Since we are crawling the mobile version of the site, open the Apex Novels home page, m.23us.com/. For simplicity, we only start crawling once the user has reached the pages of a specific novel. Taking the novel Ze Tian Ji (《择天记》) as an example: its introduction page URL is m.23us.com/book/52234, from which we extract the substring "/book/" as the feature of introduction pages. Its table-of-contents page URL is m.23us.com/html/52/522…, from which we extract the pattern "html/<number>/<number>/" as the feature of table-of-contents pages.
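To make the two URL features concrete, here is a small standalone sketch (plain JavaScript; the sample URLs are illustrative) of how the two page types can be told apart:

```javascript
// Hypothetical sample URLs following the patterns described above
var introUrl = "https://m.23us.com/book/52234";
var tocUrl = "https://m.23us.com/html/52/52234/";

// Introduction page: the URL contains "/book/"
function isIntroPage(url) {
  return url.indexOf("/book/") !== -1;
}

// Table-of-contents page: the URL ends with "html/<number>/<number>/"
function isTocPage(url) {
  return /.+html\/\d+\/\d+\/$/.test(url);
}

console.log(isIntroPage(introUrl)); // true
console.log(isTocPage(tocUrl));     // true
console.log(isTocPage(introUrl));   // false
```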
The basic crawl flow is as follows:

```javascript
// We use the novel title as the sessionKey
var sessionKey = "Novel title";
dSpider(sessionKey, function (session, env, $) {
  if (location.href.indexOf("/book/") != -1) {
    // Introduction page: add a crawl button
  } else if (/.+html\/\d+\/\d+\/$/.test(location.href)) {
    // Table-of-contents page: start crawling automatically
  }
});
```
dSpider is the script's crawl entry function, similar to the main function in C; see the DSpider JavaScript API documentation for details.
Adding a crawl button on the introduction page
To let the user start the crawl, we add a crawl button on the introduction page, which looks like this:
Since crawling starts automatically when the table-of-contents page is opened, we only need to change the text of the existing button that leads to the chapter list, and change its background to green so it stands out.
The code:

```javascript
$(".more a").text("Crawl the book").css("background", "#1ca72b");
```
Crawling the table-of-contents page
- On the table-of-contents page, collect the links to all chapters:

```javascript
var list = $(".chapter li a");
```

- Then request each chapter URL via Ajax, and parse the returned page:

```javascript
$.get(e.attr("href")).done(function (data) {
  // Get the chapter title
  var text = e.text().trim() + "\r\n";
  // Get the body text and normalize its formatting
  text += $(data).find("#txt").html()
    .replace(/ /g, "").replace(/<br>/g, "\n") + "\r\n";
  // Push the data to the host app
  session.push(text);
});
```
During the crawl we also report progress messages to the host app. The complete code is as follows:
```javascript
// Use the novel title as the sessionKey
var sessionKey = dQuery(".index_block h1").text();
dSpider(sessionKey, function (session, env, $) {
  if (location.href.indexOf("/book/") != -1) {
    // Introduction page: restyle the button
    $(".more a").text("Crawl the book").css("background", "#1ca72b");
  } else if (/.+html\/\d+\/\d+\/$/.test(location.href)) {
    log(sessionKey);
    var list = $(".chapter li a");
    session.showProgress();
    session.setProgressMax(list.length);
    var curIndex = 0;
    function getText() {
      // Walk the chapter list from the end
      var e = list.eq(list.length - curIndex - 1);
      $.get(e.attr("href")).done(function (data) {
        // Chapter title, then the body text with formatting normalized
        var text = e.text().trim() + "\r\n";
        text += $(data).find("#txt").html()
          .replace(/ /g, "").replace(/<br>/g, "\n") + "\r\n";
        session.push(text);
      }).always(function () {
        if (++curIndex < list.length) {
          session.setProgress(curIndex);
          session.setProgressMsg("Crawling 《" + sessionKey + "》 " + e.text());
          getText();
        } else {
          session.setProgress(curIndex);
          session.finish();
        }
      });
    }
    getText();
  }
});
```
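The recursive getText above requests chapters strictly one at a time, continuing in `.always()` so a failed chapter does not stop the crawl. The same sequencing pattern can be sketched with plain Promises (this is a generic illustration, not part of the DSpider API; fetchOne and onProgress are placeholder names):

```javascript
// Sequentially process a list of items, one request at a time.
// fetchOne: any function returning a Promise; onProgress: receives the count done.
function crawlSequentially(items, fetchOne, onProgress) {
  var results = [];
  var i = 0;
  function next() {
    if (i >= items.length) return Promise.resolve(results);
    return fetchOne(items[i])
      .then(function (text) { results.push(text); })
      .catch(function () { /* skip failed items, like .always() */ })
      .then(function () {
        i++;
        onProgress(i);
        return next(); // only start the next request when this one settles
      });
  }
  return next();
}

// Example usage with a fake fetcher:
var progress = [];
crawlSequentially(
  ["ch1", "ch2", "ch3"],
  function (name) { return Promise.resolve(name + " body"); },
  function (done) { progress.push(done); }
).then(function (results) {
  console.log(results.join("|")); // "ch1 body|ch2 body|ch3 body"
  console.log(progress.join(",")); // "1,2,3"
});
```

Keeping requests sequential is deliberate: firing all chapter requests at once could hammer the site and deliver chapters out of order.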
Simple and powerful, isn't it?
Let's see it in action:
After a successful crawl, I save the data as a TXT file and open it with QQ Reader.
The script has been released to the crawler store: dspider.dtworkroom.com/spider/12
Integration notes
The downloaded demo's default package name is wendu.dspiderdemo, and its appId is 5, which belongs to an app under the official account. If you change the package name, you must first create your own app in the DSpider console and obtain its appId, then replace the appId used in SDK initialization with your own. Next, find the "Apex Novel" spider in the crawler store and add it to your app. This step is important: without it, your app will not have permission to execute the spider. Finally, take the spider's id (12) from the "Apex Novel" detail page and set the sid in the demo to 12. The crawled chunks can then be concatenated and saved as a TXT file.
P.S. Don't forget to star the repo.