In previous articles I introduced the client-side crawling platform DSpider. Today we will start from scratch and implement the ability to crawl any novel from the Apex Novels site.

If you don’t know about client-side crawls, take a look at my previous posts:

Crawler Technology (1): Understanding the State of Crawler Technology

Crawler Technology (2): Client-Side Crawlers

Crawler Technology (3): The Client Crawler Android SDK Is Released

The Client Crawler iOS SDK Is Out!

Client Crawling: Answering Users' Questions

An Introduction to DSpider

Integrating the SDK

DSpider's official website has detailed integration documentation and provides a demo; the examples in this article are based on that demo. Please first download the demo for your platform from GitHub:

iOS: dspider.dtworkroom.com/document/io…

Android: dspider.dtworkroom.com/document/an…

The integration documents above cover the setup in detail, so let's look at how to write a crawl script.

Writing a Crawl Script

Analyzing the Pages

Since we are crawling the mobile version of the site, open the Apex Novels home page at m.23us.com/. For simplicity, the crawl will only start once we reach the pages of a specific novel. Taking the novel Ze Tian Ji as an example: its introduction page URL is m.23us.com/book/52234, from which we extract the string "/book/" as the page's feature; its directory (table of contents) page URL is m.23us.com/html/52/522…, from which we extract the pattern "html/number/number/" as that page's feature.
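Before writing the script, we can sanity-check the two URL features against sample URLs in a browser console. A quick sketch (the directory URL below is made up to match the "html/number/number/" pattern, since the real one is truncated above):

// Test the two URL features from the analysis above.
var introUrl = "http://m.23us.com/book/52234";
var dirUrl = "http://m.23us.com/html/52/52234/"; // hypothetical example URL

console.log(introUrl.indexOf("/book/") != -1);   // true  -> introduction page
console.log(/.+html\/\d+\/\d+\/$/.test(dirUrl)); // true  -> directory page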

The basic crawl flow is as follows:

// We use the novel title as the sessionKey
var sessionKey = "Novel title";
dSpider(sessionKey, function (session, env, $) {
    if (location.href.indexOf("/book/") != -1) {
        // Introduction page: add a crawl button
    } else if (/.+html\/\d+\/\d+\/$/.test(location.href)) {
        // Directory page: start crawling automatically
    }
});

dSpider is the crawl script's entry function, similar to the main function in C; please refer to the dSpider JavaScript API documentation for details.
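For reference, here are the session methods this article's script relies on; the descriptions are inferred from how the demo uses them, so consult the dSpider API docs for the authoritative definitions:

// session.showProgress()      - show the progress UI on the client
// session.setProgressMax(n)   - set the total number of progress steps
// session.setProgress(i)      - update the current progress
// session.setProgressMsg(msg) - update the progress message text
// session.push(data)          - deliver a piece of crawled data to the client
// session.finish()            - signal that the crawl session is complete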

Adding a Crawl Button to the Introduction Page

To start the crawl, we add a crawl button to the introduction page, which looks like this:

[Image: the original introduction page]

Since crawling starts automatically once we enter the table of contents page, we only need to repurpose the existing button that navigates to the chapter list: change its text, and turn its background green so it stands out.

[Image: the modified introduction page]

The code:

$(".more a").text("Crawl a book").css("background"."#1ca72b");Copy the code

Crawling the Directory Page

  1. On the table of contents page, we first get the URLs of all chapters:

    var list = $(".chapter li a");Copy the code
  2. Each URL is then requested via Ajax, and the returned page is parsed, as illustrated in the sketch after this list:

     $.get(e.attr("href")).done(function (data) {
         // Get the chapter title
         var text = e.text().trim() + "\r\n";
         // Get the chapter body and format it: strip the indent spaces
         // and convert <br> tags to newlines
         text += $(data).find("#txt").html()
                 .replace(/ /g, "").replace(/<br>/g, "\n") + "\r\n";
         // Deliver the data to the client
         session.push(text);
     });
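To make the parsing step concrete, here is a minimal illustration: $(data) parses the fetched HTML string into DOM nodes, and .find("#txt") selects the chapter body. The markup below is a made-up stand-in for a real chapter page:

// Hypothetical markup standing in for a fetched chapter page.
var data = "<div class='page'><div id='txt'>Line one<br>Line two</div></div>";
var html = $(data).find("#txt").html();  // "Line one<br>Line two"
var text = html.replace(/<br>/g, "\n");  // "Line one\nLine two"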

During the crawl we also report progress messages to the client. The complete code is as follows:

var sessionKey = dQuery(".index_block h1").text();
dSpider(sessionKey, function (session, env, $) {
    if (location.href.indexOf("/book/") != -1) {
        // Introduction page: restyle the existing button
        $(".more a").text("Crawl this book").css("background", "#1ca72b");
    } else if (/.+html\/\d+\/\d+\/$/.test(location.href)) {
        log(sessionKey);
        var list = $(".chapter li a");
        session.showProgress();
        session.setProgressMax(list.length);
        var curIndex = 0;
        function getText() {
            // Walk the chapter list from its last element to its first
            var e = list.eq(list.length - curIndex - 1);
            $.get(e.attr("href")).done(function (data) {
                var text = e.text().trim() + "\r\n";
                text += $(data).find("#txt").html()
                        .replace(/ /g, "").replace(/<br>/g, "\n") + "\r\n";
                session.push(text);
            }).always(function () {
                if (++curIndex < list.length) {
                    session.setProgress(curIndex);
                    session.setProgressMsg("Crawling 《" + sessionKey + "》 " + e.text());
                    getText();
                } else {
                    session.setProgress(curIndex);
                    session.finish();
                }
            });
        }
        getText();
    }
});

Note that getText fires the requests one at a time: the next chapter is only requested from the previous request's always() callback, which keeps the chapters in order in the output. Simple yet powerful, isn't it?
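One caveat: because the next request is launched from .always(), a chapter whose $.get fails is silently skipped. Below is a minimal sketch of one way to add retries; retryCount is a new variable that is not part of the original script, and progress reporting is omitted for brevity:

var retryCount = 0;
function getText() {
    var e = list.eq(list.length - curIndex - 1);
    $.get(e.attr("href")).done(function (data) {
        retryCount = 0; // the chapter loaded, so reset the retry budget
        var text = e.text().trim() + "\r\n";
        text += $(data).find("#txt").html()
                .replace(/ /g, "").replace(/<br>/g, "\n") + "\r\n";
        session.push(text);
    }).fail(function () {
        if (retryCount < 2) {
            retryCount++;
            curIndex--; // stay on the same chapter for another attempt
        } else {
            retryCount = 0; // give up on this chapter and move on
        }
    }).always(function () {
        if (++curIndex < list.length) {
            getText();
        } else {
            session.finish();
        }
    });
}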

Let’s see how it works:

[Screenshot: the crawl in progress]

After a successful crawl, the data is saved as a TXT file, which we can then open in QQ Reader:

[Image: the table of contents in QQ Reader]

[Image: the chapter text in QQ Reader]

The script has been released to the crawler store: dspider.dtworkroom.com/spider/12

Integration considerations

The downloaded demo's default package name is wendu.dspiderdemo and its appId is 5, which belongs to an app registered under the official account. If you want to change the package name, first create your own app in the management console; once it is created you will get an appId. Replace the appId passed to the SDK initialization with your own, then find "Apex Novels" in the crawler store and add it to your app. This step is important: without it, your app will not have permission to execute the crawler. On the crawler's detail page we can see that its ID is 12, so set the sid in the demo to 12. The crawl results can then be concatenated and saved as a TXT file.

P.S. Don't forget to star the project on GitHub.