Background: Niao took part in a Jianshu writing training camp and served as the monitor of a small class. Every weekend she had to count how many articles each student had written in the previous week and which articles they were. To keep my girlfriend happy, I promised to write a Jianshu crawler for her to handle the statistics; after all, for someone in this line of work, anything that can be solved with a program should never be done as repetitive manual labor.
Here’s how this crawler came to be written.
I. Preliminary analysis:
First, let’s analyze what we want to crawl, as shown in the picture:
Step 1:
We need to go to each participant’s user center, with a link like: www.jianshu.com/u/eeb221fb7…
This part can only be done by my girlfriend, who collects the user center link of each student in her class for us. Once we have these user center links, the first thing to crawl is the user center.
Step 2:
When crawling the user center, we need to collect some data. For example, our statistics cover the last week, so we need the set of article detail links within a certain time period, plus the user name.
How do we get this data? To get the username, we need the text of $('.nickname'); to get the articles, we need $('.note-list li') and the $('.title') inside each li.
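As a rough cheerio sketch of how those selectors would be used (the HTML snippet below is a made-up placeholder that mimics the Jianshu page structure, not taken from the site):

```js
const cheerio = require('cheerio');

// `html` would be the fetched user-center page; a shortened placeholder is used here
const html = '<a class="nickname">Niao</a><ul class="note-list"><li><a class="title" href="/p/xxx">Some post</a></li></ul>';
const $ = cheerio.load(html);

const nickname = $('.nickname').text();             // the user's name
$('.note-list li').each((i, li) => {
  const title = $(li).find('.title').text();        // article title
  const href  = $(li).find('.title').attr('href');  // relative link to the article detail page
  console.log(nickname, title, href);
});
```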
Step 3:
With the article detail links obtained in step 2, crawl the content of each article detail page; an article detail link looks like: www.jianshu.com/p/aeaa1f2a0…
On the article detail page, grab the data we need, such as the title, word count, number of views, etc., as shown in the figure below.
Step 4:
Generate an Excel file from the data, so your girlfriend will admire you even more. Wow, sugoi.
II. Preliminary preparation:
With the analysis above done, let’s prepare the tools the crawler needs:
- cheerio: lets you work with the crawled pages as if you were using jQuery
- superagent-charset: solves the encoding problems of crawled pages
- superagent: used to make requests
- async: used for asynchronous flow and concurrency control
- ejsexcel: used to generate Excel files
- express, node >= 6.0.0
The specific usage of these modules can be found at www.npmjs.com; some of them are described below.
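As a rough sketch (the file layout and variable names are my own, not from the original post), pulling in these dependencies might look like this:

```js
// Dependencies used by the crawler (sketch; names assumed)
const express = require('express');
const cheerio = require('cheerio');
const superagent = require('superagent');
require('superagent-charset')(superagent); // patches superagent with a .charset() method
const async = require('async');
const ejsExcel = require('ejsexcel');
const fs = require('fs');
const config = require('./config');
```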
III. Start crawling
1. We put the Jianshu user center links of all students in the configuration file config.js, and define the storage path of the generated Excel file:
```js
const path = require('path');

module.exports = {
  excelFile: {
    path: path.join(__dirname, 'public/excel/')
  },
  data: [
    { name: "small cake cake", url: "http://www.jianshu.com/u/eeb221fb7dac" },
  ]
};
```
To protect other people’s information, only one entry is listed here; you can go to Jianshu user centers yourself and add more entries.
2. We first define some global variables, such as the base URL, the current concurrency count, and the set of URLs that failed to be crawled:
```js
let _baseUrl = "http://www.jianshu.com",
    _currentCount = 0,
    _errorUrls = [];
```
3. Encapsulate some functions:
```js
// Fetch a page and hand its HTML to the callback
const fetchUrl = (url, callback) => {
  let fetchStart = new Date().getTime();
  superagent
    .get(url)
    .charset('utf-8')
    .end((err, ssres) => {
      if (err) {
        _errorUrls.push(url);
        console.log('fetch [' + url + '] error');
        return false;
      }
      let spendTime = new Date().getTime() - fetchStart;
      console.log('fetch [' + url + '] succeeded, time: ' + spendTime + ' ms, current concurrency: ' + _currentCount);
      _currentCount--;
      callback(ssres.text);
    });
}

// Remove duplicate articles by title
const removeSame = (arr) => {
  const newArr = [];
  const obj = {};
  arr.forEach((item) => {
    if (!obj[item.title]) {
      newArr.push(item);
      obj[item.title] = item.title;
    }
  });
  return newArr;
}
```
4. Start by crawling the user center to get the article detail links within a certain time period:
```js
// Crawl the user centers; startTime and endTime come from the user's Ajax request
const crawlUserCenter = (res, startTime, endTime) => {
  const centerUrlArr = config.data;
  async.mapLimit(centerUrlArr, 5, (elem, callback) => {
    _currentCount++;
    fetchUrl(elem.url, (html) => {
      const $ = cheerio.load(html);
      const detailUrlArr = getDetailUrlCollections($, startTime, endTime);
      callback(null, detailUrlArr); // the callback must be called
    });
  }, (err, detailUrlArr) => {
    // result after all concurrent tasks finish: [[abc,def],[hij,xxx]] => needs flattening to [abc,def,hij,xxx]
    _currentCount = 0;
    crawArticleDetail(detailUrlArr, res);
    return false;
  });
}
```
Here the user center links come from the config file, and we use async.mapLimit to control the concurrency of our fetching, with a maximum of 5 concurrent requests.
async.mapLimit signature: mapLimit(arr, limit, iterator, callback). arr: the array to iterate over; limit: the maximum number of concurrent tasks; iterator: the function applied to each item (each user center), which must call its callback with the result; callback: called once everything completes, receiving the combined results of all the fetches.
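A tiny standalone illustration of mapLimit (not from the original crawler; the data and delay are made up):

```js
const async = require('async');

// Process four items with at most 2 running at once
async.mapLimit(['a', 'b', 'c', 'd'], 2, (item, callback) => {
  // pretend async work, then hand back the result
  setTimeout(() => callback(null, item.toUpperCase()), 100);
}, (err, results) => {
  console.log(results); // => [ 'A', 'B', 'C', 'D' ], in the original order
});
```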
Get the collection of article detail links within a time period from the user center page:
```js
// Collect article detail URLs published between startTime and endTime
const getDetailUrlCollections = ($, startTime, endTime) => {
  let articleList = $('#list-container .note-list li'),
      detailUrlCollections = [];
  for (let i = 0, len = articleList.length; i < len; i++) {
    let createAt = articleList.eq(i).find('.author .time').attr('data-shared-at');
    let createTime = new Date(createAt).getTime();
    if (createTime >= startTime && createTime <= endTime) {
      let articleUrl = articleList.eq(i).find('.title').attr('href');
      let url = _baseUrl + articleUrl;
      detailUrlCollections.push(url);
    }
  }
  return detailUrlCollections;
}
```
5. From step 4 we got all the article detail links, so now let’s crawl the article detail pages in the same way as in step 4:
```js
// Crawl each article detail page and extract the data we need
const crawArticleDetail = (detailUrls, res) => {
  const detailUrlArr = spreadDetailUrl(detailUrls);
  async.mapLimit(detailUrlArr, 5, (url, callback) => {
    _currentCount++;
    fetchUrl(url, (html) => {
      const $ = cheerio.load(html, { decodeEntities: false });
      const data = {
        title: $('.article .title').html(),
        wordage: $('.article .wordage').html(),
        publishTime: $('.article .publish-time').html(),
        author: $('.author .name a').html()
      };
      callback(null, data);
    });
  }, (err, resData) => {
    let result = removeSame(resData);
    const sumUpData = sumUpResult(result);
    res.json({ data: result, sumUpData: sumUpData });
    createExcel(result, sumUpData);
    console.info('Crawled ' + result.length + ' articles, with ' + _errorUrls.length + ' failed URLs');
    if (_errorUrls.length > 0) {
      console.info('Failed URLs: ' + _errorUrls.join(','));
    }
    return false;
  });
}

// [[abc,def],[hij,xxx]] => [abc,def,hij,xxx]
const spreadDetailUrl = (urls) => {
  const urlCollections = [];
  urls.forEach((item) => {
    item.forEach((url) => {
      urlCollections.push(url);
    });
  });
  return urlCollections;
}
```
From the article detail page we get the title, the word count, and the publish date; of course, you can grab whatever other information you want from that page, but I don’t need more here. The data obtained can contain duplicates, so the array has to be deduplicated, as in the small example below.
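A minimal illustration of the deduplication with removeSame (the sample data is made up):

```js
// Two entries share the same title, so only the first is kept
const resData = [
  { title: 'Hello', wordage: '1200' },
  { title: 'Hello', wordage: '1200' },
  { title: 'World', wordage: '800' }
];
console.log(removeSame(resData));
// => [ { title: 'Hello', wordage: '1200' }, { title: 'World', wordage: '800' } ]
```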
6. At this point we have obtained all the data we need to crawl. The last step is to generate the Excel file.
For generating Excel files with Node, after looking around I found that ejsexcel gets pretty good reviews.
The way to use it: we need an Excel template, and inside that template we can use EJS syntax to render the table the way we want.
```js
// Generate the Excel file
const createExcel = (dataArr, sumUpData) => {
  // Read the Excel template
  const exlBuf = fs.readFileSync(config.excelFile.path + "/report.xlsx");
  // Data source; formatTime() is a date-formatting helper defined elsewhere
  const data = [
    [{ "table_name": "7", "date": formatTime() }],
    dataArr,
    sumUpData
  ];
  // Render the Excel template
  ejsExcel.renderExcel(exlBuf, data)
    .then(function(exlBuf2) {
      fs.writeFileSync(config.excelFile.path + "/report2.xlsx", exlBuf2);
      console.log("Excel generated successfully");
    })
    .catch(function(err) {
      console.error("Failed to generate the Excel file");
    });
}
```
We read our template, then write our data into it to generate a new Excel file.
7. Since the time range to crawl is not fixed, we expose the whole crawling process as a front-end Ajax request that passes the startTime and endTime values. The front-end interface is fairly simple.
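The post doesn’t show the server side of that request, but a minimal Express sketch (the route path and the way the parameters are passed are assumptions) might look like this:

```js
// Sketch only: an Express route that receives the time range from the
// front-end Ajax request and kicks off the crawl (route name assumed)
const express = require('express');
const app = express();

app.get('/crawl', (req, res) => {
  // startTime / endTime expected as millisecond timestamps (assumption)
  const startTime = Number(req.query.startTime);
  const endTime = Number(req.query.endTime);
  crawlUserCenter(res, startTime, endTime); // responds via res.json() when the crawl finishes
});

app.listen(3000, () => console.log('crawler listening on port 3000'));
```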
The generated Excel file:
The crawling process looks like this:
At this point, our Jianshu crawler is complete. Pretty simple.
Note:
Since only the first page of data can be captured from the user center, it is best to limit the range to the last week. Also, when generating the Excel file, make sure the file is not currently open in Excel, otherwise generation will fail.
The code is already on GitHub, welcome to use it (I hope Jianshu doesn’t block me): go to GitHub.
By the way, this crawler is named Bumblebee.