Recently one of my projects needed to pull in news-feed data, and since the project is written in Node.js, it was natural to write the crawler in Node.js as well.
Project address: github.com/mrtanweijie… The project crawls the news feeds of Readhub, Open Source China, Developer Toutiao and 36Kr. There is no multi-page handling yet: the crawler runs once a day, and fetching only the latest items each run is enough for now; pagination can be added later.
The crawling process boils down to downloading the target site's HTML locally and then extracting the data from it.
1. Download the page
Node.js has plenty of HTTP request libraries; here we use request. The main code is as follows:
requestDownloadHTML () {
  const options = {
    url: this.url,
    headers: {
      'User-Agent': this.randomUserAgent()
    }
  }
  return new Promise((resolve, reject) => {
    request(options, (err, response, body) => {
      if (!err && response.statusCode === 200) {
        return resolve(body)
      } else {
        return reject(err)
      }
    })
  })
}
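The snippet above calls this.randomUserAgent(), which is not shown in the article; a minimal sketch of such a helper could look like the following (the user-agent strings are placeholders, not the ones actually used in the project):

// Hypothetical helper: pick a random User-Agent so repeated requests
// do not all present the same browser fingerprint
randomUserAgent () {
  const userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'
  ]
  return userAgents[Math.floor(Math.random() * userAgents.length)]
}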
Wrapping the request in a Promise makes it easy to use async/await later. Many sites are rendered on the client, so the downloaded page may not contain the HTML we actually want; in that case we can use Google's Puppeteer to download the page after client-side rendering. For well-known reasons, npm i puppeteer may fail because it needs to download the Chromium binary, just retry a few times 🙂
puppeteerDownloadHTML () {
  return new Promise(async (resolve, reject) => {
    try {
      const browser = await puppeteer.launch({ headless: true })
      const page = await browser.newPage()
      await page.goto(this.url)
      const bodyHandle = await page.$('body')
      const bodyHTML = await page.evaluate(body => body.innerHTML, bodyHandle)
      // Close the browser so each run does not leak a Chromium process
      await browser.close()
      return resolve(bodyHTML)
    } catch (err) {
      console.log(err)
      return reject(err)
    }
  })
}
Of course, for a client-rendered page the best approach is to call the site's data interface directly, which removes the need for HTML parsing afterwards. After a simple encapsulation, the downloader can be used like this :)
await new Downloader('http://36kr.com/newsflashes', DOWNLOADER.puppeteer).downloadHTML()
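The Downloader wrapper itself is not shown in the article; judging from the usage above, a minimal sketch could dispatch between the two download strategies roughly like this (the DOWNLOADER values and the class shape are assumptions, not the project's actual code):

import request from 'request'
import puppeteer from 'puppeteer'

// Hypothetical strategy flags, inferred from the usage above
const DOWNLOADER = {
  request: 'request',
  puppeteer: 'puppeteer'
}

class Downloader {
  constructor (url, type = DOWNLOADER.request) {
    this.url = url
    this.type = type
  }

  downloadHTML () {
    // Pick the plain HTTP download or the headless-browser download
    return this.type === DOWNLOADER.puppeteer
      ? this.puppeteerDownloadHTML()
      : this.requestDownloadHTML()
  }

  // requestDownloadHTML (), puppeteerDownloadHTML () and randomUserAgent ()
  // are the methods shown earlier in this article
}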
2. HTML content extraction
HTML content extraction is done with Cheerio, which exposes the same interface as jQuery and is very simple to use. Open the page in the browser, press F12 to inspect the element nodes you need, and then extract the content accordingly.
readHubExtract () {
  let nodeList = this.$('#itemList').find('.enableVisited')
  nodeList.each((i, e) => {
    let a = this.$(e).find('a')
    this.extractData.push(
      this.extractDataFactory(
        a.attr('href'),
        a.text(),
        ' ',
        SOURCECODE.Readhub
      )
    )
  })
  return this.extractData
}
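The extractor above assumes this.$ has already been loaded with the downloaded HTML and that an extractDataFactory helper exists; neither is shown here, so the following is only a rough sketch of what they might look like (the class name and field names are assumptions based on the call sites above):

import cheerio from 'cheerio'

class Extractor {
  constructor (html) {
    // Parse the downloaded HTML once; this.$ can then be queried like jQuery
    this.$ = cheerio.load(html)
    this.extractData = []
  }

  // Normalize one extracted item into a plain object
  extractDataFactory (url, title, summary, source) {
    return { url, title, summary, source }
  }

  // readHubExtract () { ... } as shown above
}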
3. Scheduled tasks
A cron job runs the crawler once a day:
function job () {
  let cronJob = new cron.CronJob({
    cronTime: cronConfig.cronTime,
    onTick: () => {
      spider()
    },
    start: false
  })
  cronJob.start()
}
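cronConfig.cronTime is not shown in the article; with the cron package it is a standard cron expression (six fields, the first being seconds). A sketch of such a config, assuming the crawler should fire at 08:00 every day (the actual schedule is not stated):

// Hypothetical config: run the onTick callback at 08:00:00 every day
export const cronConfig = {
  cronTime: '0 0 8 * * *'
}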
4. Data persistence
Strictly speaking, data persistence is outside the crawler's scope, but we need it anyway, so we use Mongoose. First create the Model:
import mongoose from 'mongoose'
const Schema = mongoose.Schema
const NewsSchema = new Schema(
  {
    title: { type: 'String', required: true },
    url: { type: 'String', required: true },
    summary: String,
    recommend: { type: Boolean, default: false },
    source: { type: Number, required: true, default: 0 },
    status: { type: Number, required: true, default: 0 },
    createdTime: { type: Date, default: Date.now }
  },
  {
    collection: 'news'
  }
)
export default mongoose.model('news', NewsSchema)
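The model assumes a Mongoose connection has already been opened somewhere during startup; a minimal connection sketch looks like this (the connection string is a placeholder, not the project's real database):

import mongoose from 'mongoose'

// Open the shared connection before any model is used;
// the URI below is only a placeholder for the real database address
mongoose.connect('mongodb://localhost:27017/spider')
mongoose.connection.on('error', err => console.error(err))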
Basic operations
import { OBJ_STATUS } from '../../Constants'

class BaseService {
  constructor (ObjModel) {
    this.ObjModel = ObjModel
  }

  saveObject (objData) {
    return new Promise((resolve, reject) => {
      this.ObjModel(objData).save((err, result) => {
        if (err) {
          return reject(err)
        }
        return resolve(result)
      })
    })
  }
}
export default BaseService
News service
import BaseService from './BaseService'
import News from '../models/News'

class NewsService extends BaseService {}
export default new NewsService(News)
Now we can save the data happily:
await newsService.batchSave(newsListTem)
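batchSave is not part of the BaseService shown above; a minimal sketch of such a method, assuming it simply wraps Mongoose's insertMany (the project's real implementation may differ):

// Hypothetical addition to BaseService: save a list of documents in one call
batchSave (objDataList) {
  return new Promise((resolve, reject) => {
    this.ObjModel.insertMany(objDataList, (err, result) => {
      if (err) {
        return reject(err)
      }
      return resolve(result)
    })
  })
}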
Clone the project on GitHub for more details.