
Environment setup

Node must be installed; I'm running version 8.11.2 on macOS.

Some third-party libraries used:

  • Back-end service: express
  • Making HTTP requests: superagent
  • Controlling concurrent requests: async + eventproxy
  • Parsing web content: cheerio

Configure package.json directly:

{"name": "spider", "version": "0.0.0", "description": "learn nodejs on github", "scripts": {"start": "Node app. Js"}, "dependencies" : {" async ":" ^ 2.0.0 - rc. 6 ", "cheerio" : "^ 0.20.0", "eventproxy" : "^ 0.3.4", "express" : ^2.3.0", "superagent": "^2.3.0"}Copy the code

Run npm install once package.json is configured.

So let’s start writing the crawler.

Background service

Its job is to receive the front-end request, start the crawler, and return the crawled information to the front end once crawling is complete. For the background service I use the Express framework, which is fairly simple; you could also use the native HTTP module. The basic skeleton is as follows:

var express = require('express');
var app = express();
app.get('/', function (req, res, next) {
    // your code here
});
app.listen(3000, function () {
    console.log('app is running at port 3000');
});

Inside the GET handler we insert our response code: starting the crawler, outputting the result information, and so on.
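As a minimal sketch of what that handler could look like once everything is wired up (startCrawl is a hypothetical placeholder for the crawling logic described in the following sections, not part of the original code):

// Minimal sketch; startCrawl is a hypothetical stand-in for the
// crawling steps explained below.
app.get('/', function (req, res, next) {
    startCrawl(function (err, result) {   // kick off the crawler
        if (err) { return next(err); }
        res.send(result);                 // return the crawled data to the front end
    });
});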

Crawling article links

superagent.get(Url).end(function (err, res) {
    if (err) { return next(err); }
    // your code here
});

Here Url is the address we are requesting. Issuing a GET request for the Url has the same effect as opening it in a browser. The returned data is stored in res, and we can extract the data we want by analyzing res.

Data processing

Here we use the cheerio library, which lets us manipulate the returned HTML in a jQuery-like way, which is very convenient.

Note: the syntax is basically the same as jQuery; the API can be viewed at cnodejs.org/topic/5203a…

var $ = cheerio.load(sres.text);
$('.blog_list').each(function (i, e) {
    var u = $('.nickname', e).attr('href');
    if (authorUrls.indexOf(u) === -1) {
        authorUrls.push(u);
    }
});

Crawling article author information

superagent.get(myurl)
    .end(function (err, ssres) {
        if (err) {
            callback(err, myurl + ' error happened!');
        }
        var $ = cheerio.load(ssres.text);
        var result = {
            userId: url.parse(myurl).pathname.substring(1),
            userName: $(".name #uid").text(),
            blogTitle: $(".title-blog a").text(),
            visitCount: $('.grade-box dl').eq(1).children('dd').attr("title"),
            score: $('.grade-box dl').eq(2).children('dd').attr("title"),
            /*
            oriCount: parseInt($('#blog_statistics>li').eq(0).text().split(/[：:]/)[1]),
            copyCount: parseInt($('#blog_statistics>li').eq(1).text().split(/[：:]/)[1]),
            trsCount: parseInt($('#blog_statistics>li').eq(2).text().split(/[：:]/)[1]),
            */
            cmtCount: parseInt($('#blog_statistics>li').eq(3).text().split(/[：:]/)[1])
        };
        callback(null, result);
    });

You can filter your data according to your needs.

Concurrency control

Because our requests are asynchronous, the next operation has to run inside the success callback, and with multiple concurrent requests we need a counter to determine whether they have all completed successfully. We use the eventproxy library to manage these concurrent results for us.

We will crawl the first 3 pages of the web front-end category on CSDN, so the article-link crawl is performed 3 times. With eventproxy it is written like this:

var baseUrl = 'http://blog.csdn.net/web/index.html';
var pageUrls = [];
for (var _i = 1; _i < 4; _i++) {
    pageUrls.push(baseUrl + '?&page=' + _i);
}
ep.after('get_topic_html', pageUrls.length, function (eps) {
    // runs once every page has been crawled
});
pageUrls.forEach(function (page) {
    superagent.get(page).end(function (err, sres) {
        // article link crawling goes here
        ep.emit('get_topic_html', 'get authorUrls successful');
    });
});

In simple terms, eventproxy counts how many times the 'get_topic_html' event has been emitted; once it has occurred the specified number of times, the callback passed to ep.after is invoked.

Control the number of concurrent requests

That would be enough, but because everything is asynchronous, we may fire dozens or even hundreds of requests at the target site at once. For safety, the target site may reject or block such traffic, so we need to limit the number of concurrent requests, which we do with the async library.

Here authorUrls is the array of author links we crawled in the previous step; async works through all of them while keeping the number of in-flight requests within the limit. In the author-crawling section above we already used a callback function to return the data, which is exactly the interface async expects. Once every element in the array has been processed, the data returned through the callbacks is collected into a result array, which is sent back to the front end.
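As a minimal sketch, assuming a concurrency limit of 5 and a hypothetical helper crawlAuthor(myurl, callback) wrapping the author-info request shown earlier, limiting concurrency with async could look roughly like this:

// Minimal sketch, not the exact original code.
// crawlAuthor is a hypothetical wrapper around the author-info request
// above; it must finish by calling callback(err) or callback(null, result).
async.mapLimit(authorUrls, 5, function (myurl, callback) {
    crawlAuthor(myurl, callback);     // at most 5 of these run at the same time
}, function (err, result) {
    if (err) { return next(err); }
    res.send(result);                 // result holds one entry per author link
});

mapLimit collects each callback's second argument into result in the same order as authorUrls, so the front end receives one record per author.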

The effect

Run the background service with node app.js, then request http://localhost:3000 in Postman to view the result:

Once you've done that, you can use the same approach to crawl whatever data you want.