Recently one of my projects needed to pull in news-feed data, and since the project is written in Node.js, it was natural to write the crawler in Node.js as well.
Project address: github.com/mrtanweijie… The project crawls the news feeds of Readhub, Open Source China, Developer Toutiao and 36Kr. There is no multi-page handling yet: the crawler runs once a day, and fetching only the latest items each run is enough for now; pagination can be added later.
The crawling process boils down to downloading the target site's HTML locally and then extracting the data from it.
1. Download the page
Node.js has plenty of HTTP request libraries; here we use request. The main code is as follows:
requestDownloadHTML () {
  const options = {
    url: this.url,
    headers: {
      'User-Agent': this.randomUserAgent()
    }
  }
  return new Promise((resolve, reject) => {
    request(options, (err, response, body) => {
      if (!err && response.statusCode === 200) {
        return resolve(body)
      } else {
        return reject(err)
      }
    })
  })
}
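The snippet above calls this.randomUserAgent(), which is not shown in the article; a minimal sketch of such a helper could look like the following (the user-agent strings are placeholders, not the ones actually used in the project):

// Hypothetical helper: pick a random User-Agent so repeated requests
// do not all present the same browser fingerprint
randomUserAgent () {
  const userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'
  ]
  return userAgents[Math.floor(Math.random() * userAgents.length)]
}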
Wrapping the request in a Promise makes it easy to use async/await later. Many sites are rendered on the client, so the downloaded page may not contain the HTML we actually want; in that case we can use Google's Puppeteer to download the page after client-side rendering. For well-known reasons, npm i puppeteer may fail because it needs to download the Chromium binary, just retry a few times 🙂
puppeteerDownloadHTML () {
  return new Promise(async (resolve, reject) => {
    try {
      const browser = await puppeteer.launch({ headless: true })
      const page = await browser.newPage()
      await page.goto(this.url)
      const bodyHandle = await page.$('body')
      const bodyHTML = await page.evaluate(body => body.innerHTML, bodyHandle)
      // Close the browser so each run does not leak a Chromium process
      await browser.close()
      return resolve(bodyHTML)
    } catch (err) {
      console.log(err)
      return reject(err)
    }
  })
}
Of course, for a client-rendered page the best approach is to call the site's data interface directly, which removes the need for HTML parsing afterwards. After a simple encapsulation, the downloader can be used like this :)
await new Downloader('http://36kr.com/newsflashes', DOWNLOADER.puppeteer).downloadHTML()
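The Downloader wrapper itself is not shown in the article; judging from the usage above, a minimal sketch could dispatch between the two download strategies roughly like this (the DOWNLOADER values and the class shape are assumptions, not the project's actual code):

import request from 'request'
import puppeteer from 'puppeteer'

// Hypothetical strategy flags, inferred from the usage above
const DOWNLOADER = {
  request: 'request',
  puppeteer: 'puppeteer'
}

class Downloader {
  constructor (url, type = DOWNLOADER.request) {
    this.url = url
    this.type = type
  }

  downloadHTML () {
    // Pick the plain HTTP download or the headless-browser download
    return this.type === DOWNLOADER.puppeteer
      ? this.puppeteerDownloadHTML()
      : this.requestDownloadHTML()
  }

  // requestDownloadHTML (), puppeteerDownloadHTML () and randomUserAgent ()
  // are the methods shown earlier in this article
}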
2. HTML content extraction
HTML content extraction is done with Cheerio, which exposes the same interface as jQuery and is very simple to use. Open the page in the browser, press F12 to inspect the element nodes you need, and then extract the content accordingly.
readHubExtract () {
  let nodeList = this.$('#itemList').find('.enableVisited')
  nodeList.each((i, e) => {
    let a = this.$(e).find('a')
    this.extractData.push(
      this.extractDataFactory(
        a.attr('href'),
        a.text(),
        ' ',
        SOURCECODE.Readhub
      )
    )
  })
  return this.extractData
}
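The extractor above assumes this.$ has already been loaded with the downloaded HTML and that an extractDataFactory helper exists; neither is shown here, so the following is only a rough sketch of what they might look like (the class name and field names are assumptions based on the call sites above):

import cheerio from 'cheerio'

class Extractor {
  constructor (html) {
    // Parse the downloaded HTML once; this.$ can then be queried like jQuery
    this.$ = cheerio.load(html)
    this.extractData = []
  }

  // Normalize one extracted item into a plain object
  extractDataFactory (url, title, summary, source) {
    return { url, title, summary, source }
  }

  // readHubExtract () { ... } as shown above
}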
3. Scheduled tasks
A cron job runs the crawler once a day:
function job () {
  let cronJob = new cron.CronJob({
    cronTime: cronConfig.cronTime,
    onTick: () => {
      spider()
    },
    start: false
  })
  cronJob.start()
}
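cronConfig.cronTime is not shown in the article; with the cron package it is a standard cron expression (six fields, the first being seconds). A sketch of such a config, assuming the crawler should fire at 08:00 every day (the actual schedule is not stated):

// Hypothetical config: run the onTick callback at 08:00:00 every day
export const cronConfig = {
  cronTime: '0 0 8 * * *'
}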
4. Data persistence
Strictly speaking, data persistence is outside the crawler's scope, but we need it anyway, so we use Mongoose. First create the Model:
import mongoose from 'mongoose'
const Schema = mongoose.Schema
const NewsSchema = new Schema(
  {
    title: { type: 'String', required: true },
    url: { type: 'String', required: true },
    summary: String,
    recommend: { type: Boolean, default: false },
    source: { type: Number, required: true, default: 0 },
    status: { type: Number, required: true, default: 0 },
    createdTime: { type: Date, default: Date.now }
  },
  {
    collection: 'news'
  }
)
export default mongoose.model('news', NewsSchema)
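The model assumes a Mongoose connection has already been opened somewhere during startup; a minimal connection sketch looks like this (the connection string is a placeholder, not the project's real database):

import mongoose from 'mongoose'

// Open the shared connection before any model is used;
// the URI below is only a placeholder for the real database address
mongoose.connect('mongodb://localhost:27017/spider')
mongoose.connection.on('error', err => console.error(err))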
Basic operations
import { OBJ_STATUS } from '../../Constants'

class BaseService {
  constructor (ObjModel) {
    this.ObjModel = ObjModel
  }

  saveObject (objData) {
    return new Promise((resolve, reject) => {
      this.ObjModel(objData).save((err, result) => {
        if (err) {
          return reject(err)
        }
        return resolve(result)
      })
    })
  }
}
export default BaseService
News service
import BaseService from './BaseService'
import News from '../models/News'

class NewsService extends BaseService {}
export default new NewsService(News)
Now we can save the data happily:
await newsService.batchSave(newsListTem)
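batchSave is not part of the BaseService shown above; a minimal sketch of such a method, assuming it simply wraps Mongoose's insertMany (the project's real implementation may differ):

// Hypothetical addition to BaseService: save a list of documents in one call
batchSave (objDataList) {
  return new Promise((resolve, reject) => {
    this.ObjModel.insertMany(objDataList, (err, result) => {
      if (err) {
        return reject(err)
      }
      return resolve(result)
    })
  })
}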
Clone the project on GitHub for more details.