I've been working with Node for a little over two months, but an urgent project has kept me from learning its modules properly. I had written crawlers in Python before, so today I tried using Node to write a small crawler. The Node modules used in this article are:


Node modules

const cheerio = require("cheerio"); // jQuery-like HTML parsing
const request = require("request"); // HTTP requests
const fs = require("fs");           // file system operations
const path = require("path");       // path handling
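
The snippets below also reference some module-level state that the excerpts don't show. Judging from how it is used, app.js presumably declares something along these lines (the catalog URL here is just a placeholder):

// Assumed module-level state (not shown in the excerpts below)
const url = 'http://www.example-novel-site.com/book/12345/'; // catalog page; chapter hrefs are appended to it (placeholder)
let booksName = ''; // novel title, filled in by booksQuery
let list = [];      // chapter URLs collected from the catalog page
let count = 0;      // index of the chapter currently being fetched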

steps

Pull the project

npm install
node app.js
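
If you are starting from scratch instead of pulling the project, a minimal package.json along these lines (the versions are assumptions) is enough for npm install to fetch the two third-party modules; fs and path are built into Node:

{
  "name": "node-novel-crawler",
  "version": "1.0.0",
  "main": "app.js",
  "dependencies": {
    "cheerio": "^1.0.0-rc.12",
    "request": "^2.88.2"
  }
}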

implementation

  • Select a novel catalog (table of contents) page
  • Get the URLs of all of the novel's chapter pages
  • Use request to fetch each chapter page
  • Use cheerio to extract the chapter title and chapter content
  • Use fs to save the novel content to a TXT file

Pick a novel site at random
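
The excerpts below don't show the call that kicks the crawl off; presumably it just requests the catalog page (the url above) and hands the HTML to booksQuery, roughly like this:

// Assumed entry point: fetch the catalog page and parse it
request(url, function (err, res, body) {
  if (!err && res.statusCode == 200) {
    booksQuery(body); // collect the title and chapter list, then start fetching chapters
  } else {
    console.log('err:' + err);
  }
});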

Get the URLs of all of the novel's chapter pages

/**
 * Parse the catalog page: grab the novel title and every chapter URL
 * @param {*} body
 */
const booksQuery = function (body) {
  $ = cheerio.load(body);
  booksName = $('.btitle').find('h1').text(); // novel title
  $('.chapterlist').find('a').each(function (i, e) {
    list.push($(e).attr('href')); // collect each chapter URL
  });
  createFolder(path.join(__dirname, `/book/${booksName}.txt`));
  fs.createWriteStream(path.join(__dirname, `/book/${booksName}.txt`)); // create the TXT file
  console.log(`Start writing "${booksName}"......`);
  getBody(); // fetch the chapter contents
};
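
booksQuery calls a createFolder helper that isn't shown in the excerpt; presumably it just makes sure the book directory exists before the file is created. A minimal sketch:

// Assumed helper: make sure the parent folder of the target file exists
const createFolder = function (filePath) {
  const dir = path.dirname(filePath);
  if (!fs.existsSync(dir)) {
    fs.mkdirSync(dir, { recursive: true });
  }
};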

Use request to fetch each chapter page

/**
 * Request the current chapter page and hand the HTML to toQuery
 */
const getBody = function () {
  let primUrl = url + list[count]; // full URL of the current chapter
  // console.log(primUrl)
  request(primUrl, function (err, res, body) {
    if (!err && res.statusCode == 200) {
      toQuery(body);
    } else {
      console.log('err:' + err);
    }
  });
};

Use cheerio to extract the chapter title and content

/**
 * Extract the chapter title and chapter content from the page
 * @param {any} body
 */
const toQuery = function (body) {
  $ = cheerio.load(body);
  const title = $('h1').text();                    // chapter title
  const content = trim($('#content').text(), 'g'); // chapter text with whitespace stripped
  writeFs(title, content);
};
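
toQuery also relies on a trim helper that isn't shown; judging from the comment, it strips the whitespace from the chapter text, with 'g' passed through as the regular-expression flag. A minimal sketch:

// Assumed helper: strip whitespace from the text ('g' = replace globally)
const trim = function (str, flag) {
  return str.replace(new RegExp('\\s+', flag), '');
};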

Use fs to save the novel content to a TXT file

/**
 * Append the chapter title and content to the TXT file
 * @param {*} title
 * @param {*} content
 */
const writeFs = function (title, content) {
  fs.appendFile(path.join(__dirname, `/book/${booksName}.txt`), title, function (err) {
    if (err) throw err;
  });
  fs.appendFile(path.join(__dirname, `/book/${booksName}.txt`), content, function (err) {
    if (err) {
      console.log(err);
    } else {
      console.log(title + '......');
      if (count + 1 < list.length) {
        count = count + 1; // move on to the next chapter
        getBody();
      }
    }
  });
};
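
Note that the two fs.appendFile calls above are asynchronous, so in principle nothing guarantees the title lands in the file before its content. A slightly safer variant (a sketch, not the original code) appends both in a single call and only then moves on to the next chapter:

// Sketch of a variant: append title and content together in one call
const writeFs = function (title, content) {
  fs.appendFile(path.join(__dirname, `/book/${booksName}.txt`), title + '\n' + content + '\n', function (err) {
    if (err) {
      console.log(err);
    } else {
      console.log(title + '......');
      if (count + 1 < list.length) {
        count = count + 1;
        getBody(); // fetch the next chapter
      }
    }
  });
};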

extension

To crawl other novel sites, you only need to change the URL and adjust the DOM selectors in the booksQuery and toQuery methods; the rest of the code can be reused as-is.
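
For example, if a hypothetical site listed its chapters under a #list element and put the chapter text in a .read-content div, only the selectors would change, along these lines:

// Hypothetical selectors for a different site layout (example only)
const booksQuery = function (body) {
  $ = cheerio.load(body);
  booksName = $('#info').find('h1').text(); // novel title on this site
  $('#list').find('a').each(function (i, e) {
    list.push($(e).attr('href'));           // chapter URLs on this site
  });
  // ...the rest (createFolder, createWriteStream, getBody) stays the same
};

const toQuery = function (body) {
  $ = cheerio.load(body);
  const title = $('.bookname').find('h1').text();        // chapter title on this site
  const content = trim($('.read-content').text(), 'g');  // chapter text on this site
  writeFs(title, content);
};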

conclusion

In an era where anything that can be done in JS will eventually be done in JS, Node really is this simple and convenient.