Introduction
While working through a crawler exercise after studying alsotang's Node.js tutorial, I ran into a lot of problems and learned a lot.
The tutorial is here: github.com/alsotang/no… .
Below is a screenshot of my own crawl results.
I also recommend a Chrome extension called JSONView, which renders JSON data in the format shown above.
ImoocSpider practice source code
Setting up the server
First, set up an HTTP service
var http = require('http');
var express = require('express');
var app = express();

http.createServer(app).listen(8080);

app.get('/', function(req, res) {
  // code here...
});
I built the server with Express; of course, you can also use the native http module, but I'm more comfortable with Express.
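For comparison, here is a minimal sketch of what the same entry point could look like with only the native http module (my own illustration, not code from the original project):

var http = require('http');

http.createServer(function(req, res) {
  // only handle GET /; everything else gets a 404
  if (req.method === 'GET' && req.url === '/') {
    // code here...
    res.end('ok');
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8080);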
Online crawler
Here, superagent and cheerio are used to crawl the pages. For reference, the Chinese documentation for both superagent and cheerio comes from the CNode community; if you're comfortable with English, you can also read the original docs. Only these two documents are linked here.
The page to crawl is https://www.imooc.com/course/list?c=fe
We want to collect information about imooc's front-end courses, which span eight list pages. When we open this link, however, each list page only shows the course names, without the teacher's name or other key course details. So we also need to visit and crawl each course's own detail page URL.
Obtaining the course detail page URLs
So let's start by collecting the URLs of all the course detail pages across the eight list pages.
By clicking the buttons for the individual pages, we can see that each click sends a new GET request whose URL corresponds to that page; only the page parameter differs. So we can simulate clicking each page button, and fetch that page's content, simply by changing page dynamically.
var superagent = require('superagent');
var cheerio = require('cheerio');
var url = require('url');

var page = 1;
var baseUrl = 'https://www.imooc.com/course/list/';
var homeUrl = 'https://www.imooc.com'; // used to resolve relative course links
var params = {
  c: 'fe',
  page: page
};

superagent
  .get(baseUrl)
  .query(params)
  .end(function(err, content) {
    var topicUrls = [];
    var $ = cheerio.load(content.text);
    var courseCard = $('.course-card-container');
    courseCard.each(function(index, element) {
      var $element = $(element);
      // resolve the relative href on each course card into an absolute URL
      var href = url.resolve(
        homeUrl,
        $element.find('.course-card').attr('href')
      );
      topicUrls.push(href);
    });
    console.log(topicUrls);
  });
This gives us the 25 course detail page URLs from the first list page. How do we get all eight pages?
async
Websites usually have security restrictions and won't allow too many concurrent requests from the same client, so we need to limit the number of concurrent requests. Here we use the async library; its documentation is on GitHub.
Let’s start by wrapping the previous code into a function
var count = 0; // number of requests currently in flight
var baseUrl = 'https://www.imooc.com/course/list/';

var fetchUrl = function(page, callback) {
  count++;
  console.log('Current concurrency', count);
  var params = {
    c: 'fe',
    page: page
  };
  superagent
    .get(baseUrl)
    .query(params)
    .end(function(err, content) {
      var topicUrls = [];
      var $ = cheerio.load(content.text);
      var courseCard = $('.course-card-container');
      courseCard.each(function(index, element) {
        var $element = $(element);
        var href = url.resolve(
          homeUrl,
          $element.find('.course-card').attr('href')
        );
        topicUrls.push(href);
      });
      callback(err, topicUrls);
      count--;
      console.log('Current concurrency after release', count);
    });
};
async is then used to limit the concurrency while crawling all eight pages:
var async = require('async');

var pages = [1, 2, 3, 4, 5, 6, 7, 8];
async.mapLimit(
  pages,
  5, // at most 5 concurrent requests
  function(page, callback) {
    fetchUrl(page, callback);
  },
  function(err, result) {
    if (err) console.log(err);
    console.log(result);
  }
);
All the URLs are printed out. Note that async.mapLimit automatically collects the value each iteratee (the third argument) passes to its callback into an array, which becomes the result argument of the final callback (the fourth argument). When I first wrote this, I declared topicUrls globally instead and got back the data set shown below.
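To make that result aggregation concrete, here is a tiny, self-contained sketch (my own illustration, not part of the crawler) showing how async.mapLimit gathers each callback's value into the final results array:

var async = require('async');

// Each item is "processed" asynchronously; the value passed to callback
// is collected, in input order, into the results array below.
async.mapLimit(
  [1, 2, 3, 4],
  2, // at most 2 tasks running at once
  function(n, callback) {
    setTimeout(function() {
      callback(null, n * 10);
    }, 100);
  },
  function(err, results) {
    console.log(results); // [ 10, 20, 30, 40 ]
  }
);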
Crawling the course detail pages
Now that we have the URLs of all the course detail pages, we can start crawling their content. Let's first define a function:
var fetchMsg = function(topicUrl, callback) {
  console.log('Open a new fetch');
  superagent
    .get(topicUrl)
    .end(function(err, content) {
      var Item = [];
      var $ = cheerio.load(content.text);
      var title = $('.hd .l').text().trim();            // course name
      var teacher = $('.tit a').text().trim();          // teacher's name
      var level = $('.meta-value').eq(0).text().trim(); // difficulty
      var time = $('.meta-value').eq(1).text().trim();  // length
      var grade = $('.meta-value').eq(3).text().trim(); // score
      Item.push({
        title: title,
        teacher: teacher,
        level: level,
        time: time,
        grade: grade,
        href: topicUrl
      });
      callback(null, Item);
    });
};
async is then used again to control the concurrent crawls:
// result comes from the async.mapLimit above; this code lives inside
// its final callback (the fourth argument).
var topicUrls = result; // all URLs, but as an array of 8 smaller arrays
var Urls = [];

// merge the 8 small arrays into one flat array
for (let i = 0, l = topicUrls.length; i < l; i++) {
  Urls = Urls.concat(topicUrls[i]);
}

async.mapLimit(
  Urls,
  5, // at most 5 concurrent requests
  function(url, callback) {
    fetchMsg(url, callback);
  },
  function(err, result) {
    // set the charset so the response isn't garbled
    res.writeHead(200, { 'Content-Type': 'text/plain; charset=utf-8' });
    res.end(JSON.stringify(result));
  }
);
One small thing to note here: result is a large array containing eight smaller arrays (one per list page), so the smaller arrays have to be merged into a single flat array first.
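As a side note, the same flattening can be written more compactly (a small alternative sketch, not the code used above):

// Flatten the array of 8 arrays into one flat array of URLs.
var Urls = [].concat.apply([], topicUrls);
// On newer Node.js versions, Array.prototype.flat works too:
// var Urls = topicUrls.flat();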
Finally
Program source code
Original article
Explore a little every day, make a little progress every day.