This article is participating in the "Node.js Advanced Road" technical topic essay event.
As a runtime that lets JavaScript run on the server side, NodeJS has been popular for years, and most front-end developers have had at least some contact with it. Today I would like to share what I learned while writing a novel crawler with NodeJS.
Preface
Because I like reading novels, Biquge has become an important place for me to read them. Although Biquge is a free platform, it does not offer a way to download novels, so I wondered whether I could use my NodeJS knowledge to crawl the novels on it. Hence this article.
Related technologies
Besides NodeJS itself, you also need some third-party libraries to help with the coding, because you can see further by standing on the shoulders of giants. In this case I chose the axios and cheerio libraries.
- axios: this library needs no introduction; anyone who writes front-end code knows it.
- cheerio: in short, this library is jQuery for the server side (a short example follows).
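To make the "server-side jQuery" point concrete, here is a minimal sketch (the HTML string is invented for the example):

const cheerio = require('cheerio')

// Load an HTML string, then query it with jQuery-style selectors
const $ = cheerio.load('<div id="info"><h1>Some Novel</h1></div>')
console.log($('#info h1').text()) // Prints: Some Novel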
Why not use the newer Puppeteer library? It is widely recommended online and makes it easier to crawl asynchronously rendered pages. However, while writing a crawler with Puppeteer I ran into a problem: once too many pages are requested, its memory footprint becomes very large. Even a simple novel site involves thousands of requests, so some optimization would be needed, otherwise the crawl would be very slow.
Approach
Since Biquge is a server-rendered site, it is easy to crawl with axios + cheerio.
- Find the target novel's url and parse the novel's title and table of contents (the table of contents contains links to the corresponding chapters);
- Traverse the table of contents, requesting each chapter's path and parsing out the chapter content;
- Write the parsed content to a file with the fs module.
With this simple plan in mind, you can start coding the implementation.
Implementation
- Create an index.js file, initialize the project with npm init -y, and then install the dependencies we will use:
npm install --save axios cheerio
- Request the target novel's page and get its content:
const Axios = require('axios')
const cheerio = require('cheerio')
const baseURL = 'https://www.xbiquge.la/' // The Biquge website
const outDir = 'dist' // Directory the files are written to
const targetURL = '7/7877/' // The novel's pathname, found from the site's URL pattern
const axios = Axios.create({
  timeout: 0 // No request timeout
})
const res = await axios.get(baseURL + targetURL)
console.log(res.data)
Logging the response shows that requesting the address directly through axios returns the page's HTML, and from that content you can extract anything you need with cheerio.
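One detail worth noting: with require() (CommonJS), await cannot be used at the top level of an ordinary script, so in practice the requests shown in this article would be wrapped in an async function, roughly like this:

// Minimal wrapper so that `await` works in a CommonJS script
async function main () {
  const res = await axios.get(baseURL + targetURL)
  console.log(res.data) // The raw HTML of the novel's index page
}

main().catch(err => console.error(err))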
- Analyze the page's node structure, then use cheerio to extract the information you need:
After finding useful node information, start coding:
const res = await axios.get(baseURL + targetURL)
const $ = cheerio.load(res.data)
const title = $('#info h1').text()
const list = []
$('#list a').each((idx, item) => {
  list.push({
    index: idx,
    url: item.attribs.href,
    title: item.children[0].data
  })
})
console.log('Book name:', title)
console.log('Table of contents:', list)
As you can see, the cheerio API is basically as convenient as jQuery, so getting the table of contents is easy. Once you have the chapter information from the table of contents, you can iterate over it and request each chapter to fetch its content, as in the sketch below. After that, you can consider the next step: saving the crawled content to local files.
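As an example of that iteration, fetching a single chapter could look roughly like this; the '#content' selector is the one used later in this article, but treat it as an assumption about Biquge's markup:

// Fetch one chapter and extract its text (sketch; assumes '#content' holds the chapter body)
async function fetchChapter (chapter) {
  const res = await axios.get(baseURL + chapter.url)
  const $ = cheerio.load(res.data)
  return $('#content').text()
}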
- Save the article content to a local file:
In NodeJS, you can read and write files through the fs module, which is very easy to use.
const fs = require('fs')
const { resolve } = require('path')

const ws = fs.createWriteStream(resolve(outDir, `${title}.txt`), {
  flags: 'a' // Append mode
})
for (const b of list) {
  const content = '... ' // Chapter content extracted with cheerio
  ws.write(b.title + '\n') // Chapter title, followed by a newline
  ws.write(content + '\n')
}
ws.end()
Under normal circumstances, you should now see a .txt file named after the novel in the dist directory, and the novel has been crawled successfully.
Optimization
Although the purpose of downloading the novel was achieved, several serious problems were found:
- There are many chapters, and the download is slow;
- When requests are too frequent, some of them fail, which leaves the content incomplete;
- The program runs for a long time with no feedback, so it looks as if it has hung.
Once the problems are identified, you need to find ways to solve them. After some thought, I came up with the following:
- Use setTimeout to download chunks asynchronously and speed up the download;
- Use try/catch together with recursion to retry until a request succeeds (a retry limit can be added);
- Use cli-progress to give feedback on the download progress.
Downloading in chunks
When using setTimeout to download chapters asynchronously, you need to keep the chapters in order. The idea is to define a step size, split the table of contents into chunks, download each chunk into its own file, and merge the chunk files at the end. The small sketch below shows how the splitting works, followed by the full downloadBook function:
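Here is the splitting logic in isolation (a toy example with 10 made-up chapter names; the same Math.ceil / slice logic appears in downloadBook below):

// Split a list of chapters into p chunks, as downloadBook does
const chapters = Array.from({ length: 10 }, (_, i) => `chapter ${i + 1}`)
const p = 5
const step = Math.ceil(chapters.length / p) // 2 chapters per chunk
const chunks = []
for (let i = 0; i < p; i++) {
  chunks.push(i === p - 1 ? chapters.slice(i * step) : chapters.slice(i * step, (i + 1) * step))
}
console.log(chunks.map(chunk => chunk.length)) // [ 2, 2, 2, 2, 2 ]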
async function downloadBook (title, book, p = 5) {
  let c = 0 // Number of chunks that have finished downloading
  const d = {
    book: [], // Chapter chunks
    ws: []    // One write stream per chunk
  }
  // Create the output directory, or empty it if it already exists
  if (!isExist(outDir)) {
    fs.mkdirSync(outDir)
  } else {
    fs.readdirSync(outDir).forEach(file => {
      if (!fs.statSync(pathFor(file)).isDirectory()) {
        fs.unlinkSync(pathFor(file))
      }
    })
  }
  for (let i = 0; i < p; i++) {
    const step = Math.ceil(book.length / p)
    d.book.push(i === p - 1 ? book.slice(i * step) : book.slice(i * step, (i + 1) * step))
    d.ws.push(fs.createWriteStream(pathFor(`${title + i}.txt`), {
      flags: 'a'
    }))
    d.ws[i].on('finish', async () => {
      c++
      if (c === p) {
        console.log('\nDownload complete, merging the files...')
        for (let i = 0; i < p; i++) {
          const target = fs.createWriteStream(pathFor(`${title}.txt`), {
            flags: 'a'
          })
          const rs = fs.createReadStream(pathFor(`${title + i}.txt`))
          await pipeSync(rs, target)
        }
        console.log('Done!')
        process.exit()
      }
    })
  }
  console.log(`Start downloading ${title}:`)
  const bar = new cliProgress.SingleBar({
    format: 'download progress: [{bar}] {percentage}% | ETA: {eta}s | {value}/{total}',
    barIncompleteChar: '-'
  }, cliProgress.Presets.rect)
  bar.start(book.length, 0)
  for (let i = 0; i < p; i++) {
    setTimeout(async () => {
      for (const b of d.book[i]) {
        const { $c } = await getHTMLContent(baseURL + b.url, {
          $c: '#content' // Selector for the chapter body
        }, b.title)
        const content = $c.text()
        d.ws[i].write(b.title + '\n')
        d.ws[i].write(content + '\n')
        bar.increment()
      }
      d.ws[i].end()
    })
  }
}
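The downloadBook function above relies on a few small helpers (isExist, pathFor, pipeSync) that the snippet does not define. Here is one possible way to write them, inferred from how they are used; treat these as assumptions rather than the original implementations:

const fs = require('fs')
const { resolve } = require('path')

// Check whether a path exists
function isExist (p) {
  return fs.existsSync(p)
}

// Resolve a file name inside the output directory
function pathFor (file) {
  return resolve(outDir, file)
}

// Pipe a readable stream into a writable stream and resolve once the read side has ended
function pipeSync (rs, ws) {
  return new Promise((resolvePipe, reject) => {
    rs.on('error', reject)
    ws.on('error', reject)
    rs.on('end', resolvePipe)
    rs.pipe(ws)
  })
}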
That is a simple implementation of asynchronous, chunked downloading. Next comes exception handling:
Exception handling
With so many requests, some of them are bound to fail, which would leave the content incomplete, so this needs to be handled:
async function getHTMLContent (url, selectors, title = '') {
  try {
    const result = {}
    const res = await axios.get(url)
    const $ = cheerio.load(res.data)
    for (const k in selectors) {
      result[k] = $(selectors[k])
    }
    return result
  } catch (err) {
    console.log('\n' + (title || url) + ' hit a network error, retrying...')
    return getHTMLContent(url, selectors, title)
  }
}
When a network error occurs, it is caught by the try/catch, and in the catch block we recursively request the failed chapter again, so no content is lost. The snippet below shows how a retry limit could be added.
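The version above retries forever; if you want the retry limit mentioned earlier, one possible variant looks like this (the retries parameter and its default value are my own addition, not part of the original code):

// Variant of getHTMLContent with a bounded number of retries (hypothetical)
async function getHTMLContentWithRetry (url, selectors, title = '', retries = 3) {
  try {
    const result = {}
    const res = await axios.get(url)
    const $ = cheerio.load(res.data)
    for (const k in selectors) {
      result[k] = $(selectors[k])
    }
    return result
  } catch (err) {
    if (retries <= 0) throw err // Give up once the retries are used up
    console.log('\n' + (title || url) + ' failed, ' + retries + ' retries left...')
    return getHTMLContentWithRetry(url, selectors, title, retries - 1)
  }
}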
Progress bar feedback
To show a progress bar, install the cli-progress library:
npm install --save cli-progress
It’s very simple to use:
const bar = new cliProgress.SingleBar({
  format: 'download progress: [{bar}] {percentage}% | ETA: {eta}s | {value}/{total}',
  barIncompleteChar: '-'
}, cliProgress.Presets.rect)
bar.start(200, 0) // Initialize: total count, current count
bar.increment() // Advance the completed count by 1
The parameters mean the following:
- format: the format of the progress bar output;
- barIncompleteChar: the character displayed for the part that has not finished downloading;
- start: initializes the progress bar; the first argument is the total count and the second is the current count;
- increment: increases the completed count by 1.
Wrapping up
Now that the optimization is complete, you can run the crawler and watch it work. When the download finishes, the chunk files are merged into a single .txt file named after the novel.