British developer Robert Pitt posted his crawler script on GitHub, making it easy for anyone to access a large number of publicly available user IDs on Google Plus. About 225 million user IDs have been exposed so far.

The nice thing about this is that it’s a NodeJS script and it’s very short, with only 71 lines including comments.

There’s no question that NodeJS has changed the whole front-end development ecosystem. This article walks through a Promise-based NodeJS crawler that collects information about every article a given author has published on Jianshu, and finally sends the crawled results to a target mailbox automatically by email. Don’t be intimidated by NodeJS’s appearance: even if you’re new to the front end, or new to NodeJS, you can read and understand this article.

All the code for the crawler can be found in my GitHub repository. The crawler will be continuously upgraded and updated in the future, so feel free to follow the repository.

NodeJS vs. Python for implementing crawlers

Let’s start with crawlers themselves and, by way of comparison, discuss why NodeJS is or is not suitable as a crawler language. First, to sum up:

NodeJS’s single-threaded, event-driven nature can achieve great throughput on a single machine, making it well suited to writing network I/O-intensive programs such as web crawlers.

However, some more complex scenarios require more comprehensive consideration. The following is a summary of related answers from Zhihu; thanks to the Zhihu users who contributed them.

  • If you’re crawling a handful of pages in a targeted way, doing some simple page parsing, and crawling efficiency is not a core requirement, then the choice of language doesn’t make much difference.

  • If it is a targeted crawl whose main goal is to parse content dynamically generated by JS: here the page content is produced by JS/Ajax at runtime, so the usual request-the-page-then-parse approach does not work; you need a JS engine like the one in Firefox or Chrome to evaluate the page’s JS and render it dynamically (see the sketch after this list).

  • If the crawler involves large-scale site crawling, then efficiency, scalability, maintainability and other factors must be considered: 1) PHP: not recommended; its multithreading and asynchronous support are weak. 2) NodeJS: fine for crawling some vertical sites, but its support for distributed crawling, message passing and so on is weaker, so weigh it against your own situation. 3) Python: recommended; it has good support for all of the issues above.

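For the dynamic-rendering case above, one common option in the NodeJS ecosystem is a headless browser such as Puppeteer. This is only a minimal illustrative sketch and is not part of today’s crawler; the target URL is a placeholder:

const puppeteer = require("puppeteer");

(async function () {
    // Launch a headless Chromium instance and open a page
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto("https://example.com");
    // page.content() returns the HTML after client-side JS has run
    const html = await page.content();
    console.log(html.length);
    await browser.close();
})();
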
Of course, what we build today is a simple crawler that puts no real pressure on the target site and has no negative impact on personal privacy. After all, its goal is simply to get familiar with the NodeJS environment; it is suitable for beginners and for practice.

Likewise, any crawler with malicious intent is a bad thing; we should do our best to avoid such impact and jointly keep the network environment healthy.

The crawler example

The purpose of today’s crawler is to collect all the information about the articles that LucasHC (myself) has published on the Jianshu platform, including, for each article:

  • Publication date;
  • Word count;
  • Number of comments;
  • Number of views and likes; and so on.

The final crawl output looks like this:

Crawl output

At the same time, the script needs to automatically email the above results to a specified mailbox. The received content looks like this:

Email content

All operations can be completed with one click.

The crawler design

The crawler itself relies on three modules/libraries (the nodemailer module used for sending mail is introduced later):

const http = require("http");
const Promise = require("promise");
const cheerio = require("cheerio");

Send the request

http is a native NodeJS module that can be used on its own to build a server. The http module is implemented in C++ and its performance is reliable. We issue a GET request for each of the author’s article pages:

http.get(url, function(res) {
    var html = "";
    res.on("data", function(data) {
        html += data;
    });

    res.on("end", function() {
        // ...
    });
}).on("error", function(e) {
    reject(e);
    console.log("Error getting information!");
});

This works because I found that each article link on Jianshu has the following form: full form: “www.jianshu.com/p/ab2741f78…”, that is, www.jianshu.com/p/ + article ID.

So, in the code above, each article URL of the author in question is built by concatenating baseUrl with the corresponding article ID:

articleIds.forEach(function(item) {
    url = baseUrl + item;
});

articleIds, of course, is an array holding the ID of each of the author’s articles.

Finally, we store the HTML content of each article in the html variable.

Asynchronous Promise encapsulation

Since the author may have multiple articles, we should fetch and parse each article asynchronously. Here I use a Promise to wrap the code above:

function getPageAsync (url) {
    return new Promise(function(resolve, reject) {
        http.get(url, function(res) {
            // ...
        }).on("error", function(e) {
            reject(e);
            console.log("Error getting information!");
        });
    });
};
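
The body elided above (“...”) is where the response is accumulated and the Promise is settled. A filled-in sketch of how this presumably looks (calling resolve(html) on “end” is my assumption, inferred from how the HTML is consumed later):

function getPageAsync (url) {
    return new Promise(function(resolve, reject) {
        http.get(url, function(res) {
            var html = "";
            res.on("data", function(data) {
                html += data;
            });
            res.on("end", function() {
                // Hand the complete page HTML to the Promise's consumers
                resolve(html);
            });
        }).on("error", function(e) {
            reject(e);
            console.log("Error getting information!");
        });
    });
}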

Let’s say I’ve written 14 original articles. Then the request and processing of each article corresponds to one Promise object. We store them in a predefined array:

const articlePromiseArray = [];
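
How each Promise gets into this array is not shown in the excerpts above; presumably it is something along these lines (baseUrl and articleIds come from the earlier snippets, the exact wiring is my assumption):

articleIds.forEach(function(item) {
    // One pending request/parse per article
    articlePromiseArray.push(getPageAsync(baseUrl + item));
});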

Next, I use the Promise.all method to process them.

The Promise.all method wraps multiple Promise instances into a new Promise instance.

This method takes an array of Promise instances as its argument; the new instance becomes resolved only when every instance in the array has resolved (and rejects as soon as any of them rejects). Promise.all then passes an array of those instances’ values to the fulfillment callback.

In other words, my requests for the 14 articles correspond to 14 Promise instances. After every instance has completed its request, the following logic runs:

Promise.all(articlePromiseArray).then(function onFulfilled (pages) {
    pages.forEach(function(html) {
        let info = filterArticles(html);
        printInfo(info);        
    });
}, function onRejected (e) {
    console.log(e);
});

Its job is to pass each returned value, namely the HTML content of a single article, to filterArticles; the processed result is then output by the printInfo method. Next, let’s look at what the filterArticles method does.

HTML parsing

Obviously, if you have followed along so far: the filterArticles method extracts the valuable information from the HTML content of a single article. That information includes: 1) the article title; 2) publication time; 3) word count; 4) number of views; 5) number of comments; 6) number of likes.

function filterArticles (html) {
    let $ = cheerio.load(html);
    let title = $(".article .title").text();
    let publishTime = $('.publish-time').text();
    let textNum = $('.wordage').text().split(' ')[1];
    let views = $('.views-count').text().split('read')[1];
    let commentsNum = $('.comments-count').text();
    let likeNum = $('.likes-count').text();

    let articleData = {
        title: title,
        publishTime: publishTime,
        textNum: textNum,
        views: views,
        commentsNum: commentsNum,
        likeNum: likeNum
    };

    return articleData;
};

You may be wondering why I can use $, jQuery-style, to query the HTML. That is thanks to the cheerio library.

The filterArticles method returns what we are interested in for each article. This content is stored in the articleData object and is eventually printed by printInfo.
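
printInfo itself is not shown in the original excerpts; a possible minimal version, assuming the articleData shape above, might look like this:

function printInfo (info) {
    console.log('Title: ' + info.title);
    console.log('Published: ' + info.publishTime + ', word count: ' + info.textNum);
    console.log('Views: ' + info.views + ', comments: ' + info.commentsNum + ', likes: ' + info.likeNum);
}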

Automatic mail sending

At this point, the design and implementation of the crawler has come to an end. The next step is to send the crawled content by email. Here I use the nodemailer module to send the mail. The relevant logic lives in the Promise.all callback:

Promise.all(articlePromiseArray).then(function onFulfilled (pages) {
    let mailContent = '';
    var transporter = nodemailer.createTransport({
        host: 'smtp.sina.com',
        secureConnection: true, // use SSL to prevent the information from being intercepted
        auth: {
            user: '**@sina.com',
            pass: '***'
        }
    });
    var mailOptions = {
        // ...
    };
    transporter.sendMail(mailOptions, function(error, info) {
        if (error) {
            console.log(error);
        } else {
            console.log('Message sent: ' + info.response);
        }
    });
}, function onRejected (e) {
    console.log(e);
});

The configuration of the mail service has been properly hidden. Readers can configure their own.
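
The mailOptions object above is elided in the original. For reference, a typical nodemailer message object looks roughly like the following; the addresses and subject are placeholders, and mailContent would be assembled from the pages array:

var mailOptions = {
    from: '**@sina.com',        // sender, usually the same account as auth.user
    to: 'someone@example.com',  // recipient of the crawl report (placeholder)
    subject: 'Jianshu article stats',
    html: mailContent           // the crawl results, assembled as HTML
};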

Conclusion

In this article, we implemented a crawler step by step. The main topics covered include basic NodeJS module usage and the Promise concept. To extend it, we could also connect NodeJS to a database and store the crawled content there, or use node-schedule to run the script on a timer. Of course, the purpose of this crawler is simply to get you started; its implementation is relatively simple and the target is not large-scale data.
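
As a hint for the node-schedule extension mentioned above, a minimal sketch, assuming the whole crawl-and-mail flow is wrapped in a function called runCrawler (a hypothetical name):

const schedule = require("node-schedule");

// Run the crawler every day at 09:00 (cron-style rule)
schedule.scheduleJob("0 9 * * *", function() {
    runCrawler(); // hypothetical wrapper around the crawl-and-mail logic above
});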

This is just the tip of the iceberg of what NodeJS can do, and I hope we can explore it together. If you are interested in the full code, click here.

Happy Coding!