Introduction

While working through a crawler exercise after reading AlsoTang's Node.js tutorial, I ran into a lot of problems and learned a lot.

Here is the tutorial address: github.com/alsotang/no… .

Below is a screenshot of my own crawl results.

I also recommend a Chrome extension called JSONView, which formats JSON data the way it appears above.

ImoocSpider practice source code

Setting up the server

First, set up an HTTP service

var http = require('http');
var express = require('express');

var app = express();

http.createServer(app).listen(8080);

app.get('/', function(req, res) {
  // code here...
});

This is built with Express; of course you could also use the native http module, but I'm more used to Express.
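
For comparison, here is a minimal sketch of the same service built with only the native http module (routing is handled by hand, which is exactly the chore Express takes care of):

var http = require('http');

http.createServer(function(req, res) {
  // Handle GET / ourselves; Express would do this routing for us
  if (req.method === 'GET' && req.url === '/') {
    res.writeHead(200, { 'Content-Type': 'text/plain; charset=utf-8' });
    res.end('hello');
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8080);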

Crawling the pages

Here, superagent and cheerio are used to crawl the pages. For reference, the CNode community provides Chinese documentation for both superagent and cheerio; if you are comfortable with English, you can also read the original docs. Those are the only two documents linked here.
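
As a quick taste of how the two fit together, here is a minimal sketch (the URL and selector are placeholders, not the ones used by the real crawler): superagent fetches a page, and cheerio loads the returned HTML so it can be queried with jQuery-style selectors.

var superagent = require('superagent');
var cheerio = require('cheerio');

superagent
  .get('https://example.com') // placeholder URL
  .end(function(err, res) {
    if (err) return console.error(err);
    var $ = cheerio.load(res.text);  // load the HTML string into cheerio
    console.log($('title').text()); // query it like jQuery
  });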

The page to crawl is https://www.imooc.com/course/list?c=fe

We want to collect information from all eight pages of front-end courses on IMOOC, but when we open this link we find that each listing page only shows the course name, without the teacher's name or other key course details. So we also need to grab each course's URL and crawl its detail page.

Obtaining the course detail page links

So let's start by crawling the URLs of all the course detail pages across the eight listing pages.

Clicking a pagination button sends a new GET request whose URL points to the corresponding page; only the page parameter differs. So we can simulate clicking each page and fetch its content simply by changing the page parameter dynamically.

var superagent = require('superagent');
var cheerio = require('cheerio');
var url = require('url');

var page = 1;
var homeUrl = 'https://www.imooc.com'; // site root, used to resolve relative course links
var baseUrl = 'https://www.imooc.com/course/list/';

var params = {
  c: 'fe',
  page: page
};

superagent
  .get(baseUrl)
  .query(params)
  .end(function(err, content) {
    var topicUrls = [];
    var $ = cheerio.load(content.text);
    var courseCard = $('.course-card-container');
    courseCard.each(function(index, element) {
      var $element = $(element);
      var href = url.resolve(
        homeUrl,
        $element.find('.course-card').attr('href'));
      topicUrls.push(href);
    });
    console.log(topicUrls);
  });

This gives us the 25 course detail page URLs from the first page. How do we get all eight pages?

async

Most websites have security restrictions and won't allow too many concurrent requests to the same domain, so we need to limit the number of concurrent requests. Here we use the async library (see its GitHub page).
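
Before wiring it into the crawler, here is a minimal sketch of how async.mapLimit works (fakeFetch is a made-up task, just to show the flow):

var async = require('async');

// A made-up async task: wait a bit, then "return" the input doubled
var fakeFetch = function(n, callback) {
  setTimeout(function() {
    callback(null, n * 2);
  }, 100);
};

// Run at most 2 tasks at a time; results come back in input order
async.mapLimit([1, 2, 3, 4, 5], 2, fakeFetch, function(err, results) {
  console.log(results); // [2, 4, 6, 8, 10]
});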

Let’s start by wrapping the previous code into a function

// superagent, cheerio, url and homeUrl are the same as above
var count = 0; // number of requests currently in flight
var baseUrl = 'https://www.imooc.com/course/list/';

var fetchUrl = function(page, callback) {
  count++;
  console.log('Current concurrency', count);

  var params = {
    c: 'fe',
    page: page
  };

  superagent
    .get(baseUrl)
    .query(params)
    .end(function(err, content) {
      var topicUrls = [];
      var $ = cheerio.load(content.text);
      var courseCard = $('.course-card-container');
      courseCard.each(function(index, element) {
        var $element = $(element);
        var href = url.resolve(
          homeUrl,
          $element.find('.course-card').attr('href'));
        topicUrls.push(href);
      });
      callback(err, topicUrls);
      count--;
      console.log('Current concurrency after release', count);
    });
};

async is then used to limit the concurrency while crawling all eight pages.

var async = require('async');

var pages = [1, 2, 3, 4, 5, 6, 7, 8];
async.mapLimit(
  pages,
  5,
  function(page, callback) {
    fetchUrl(page, callback);
  },
  function(err, result) {
    if (err) console.log(err);

    console.log(result);
  }
);

All the URLs are printed out. Note that async.mapLimit automatically collects the value each task passes to its callback (the third argument of mapLimit) into an array, which becomes the result argument of the final callback (the fourth argument). When I first wrote this, I declared topicUrls globally and got back a different set of data.
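
To illustrate that pitfall, here is a simplified sketch of the difference (not the real crawler code, just the shape of the bug):

// Problematic: one shared global array, so every callback hands back
// the same ever-growing array
var shared = [];
var badTask = function(n, callback) {
  shared.push(n);
  callback(null, shared);
};

// Correct: a fresh array per call, like topicUrls inside fetchUrl above
var goodTask = function(n, callback) {
  var local = [];
  local.push(n);
  callback(null, local);
};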

Crawling the course detail pages for information

Now that we have the URLs of all the course detail pages, we can start crawling their content. Let's first define a function.

var fetchMsg = function(topicUrl, callback) {
  console.log('Open a new fetch');
  superagent
    .get(topicUrl)
    .end(function(err, content) {
      var Item = [];
      var $ = cheerio.load(content.text);
      var title = $('.hd .l').text().trim();            // course name
      var teacher = $('.tit a').text().trim();          // teacher's name
      var level = $('.meta-value').eq(0).text().trim(); // difficulty
      var time = $('.meta-value').eq(1).text().trim();  // duration
      var grade = $('.meta-value').eq(3).text().trim(); // rating

      Item.push({
        title: title,
        teacher: teacher,
        level: level,
        time: time,
        grade: grade,
        href: topicUrl
      });

      callback(null, Item);
    });
};

Async is then used to control concurrent crawls

// result is the array from the previous step; the code below lives inside
// the fourth argument of the mapLimit call above

var topicUrls = result; // all the URLs, but as 8 small arrays inside one large array

var Urls = [];
// Merge into one large array
for (let i = 0, l = topicUrls.length; i < l; i++) {
  Urls = Urls.concat(topicUrls[i]);
}

async.mapLimit(
  Urls,
  5,
  function(url, callback) {
    fetchMsg(url, callback);
  },
  function(err, result) {
    // Set the charset to avoid garbled characters
    res.writeHead(200, { 'Content-Type': 'text/plain; charset=utf-8' });
    res.end(JSON.stringify(result));
  }
);

One small thing to note: the result from the previous step is a large array containing eight smaller arrays (one per listing page), so the small arrays need to be merged into one large array first.
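
The for loop above does the job; on a Node version with spread support, an equivalent one-liner would be:

// Flatten the array of 8 per-page arrays into a single array of URLs
var Urls = [].concat(...topicUrls);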

Finally

Program source code

The original address

Explore a little every day, make a little progress every day.