Writing a crawler is not as complicated as it might seem. The principle is simple: send a request to the target URL, then parse the response into the data format you want. Things only get more involved when token authentication is required.
For a Node.js crawler, two libraries are recommended: request and cheerio.
npm install request
npm install cheerio
request is used to send HTTP requests. cheerio is a fast, lean, and flexible implementation of core jQuery, so you can perform jQuery-style DOM operations directly on the HTML that request returns.
(1) DOM crawling
Here is a simple crawler example: grabbing the username from a Jianshu profile page.
- Open the page you want to crawl and find the DOM node that holds the username.
- With a jQuery selector, that is:
$('.main-top>.title>a').text()
Here is the code:
const request = require('request')
const cheerio = require('cheerio')

request('https://www.jianshu.com/u/5b23cf5114a1', (err, res) => {
  if (err) {
    console.log(err.code)
  } else {
    // Load the returned HTML and query it jQuery-style
    let $ = cheerio.load(res.body)
    console.log($('.main-top>.title>a').text())
  }
})
(2) List crawling
If you want to crawl a list, such as my Jianshu article list, use jQuery's each method, which iterates over every DOM node matched by the selector. As before, locate the DOM node first, then parse it.
Here is the reference code:
const request = require('request')
const cheerio = require('cheerio')

request('https://www.jianshu.com/u/5b23cf5114a1', (err, res) => {
  if (err) {
    console.log(err.code)
  } else {
    let $ = cheerio.load(res.body)
    let data = []
    // Iterate over every matched list item and collect its title text
    $('.note-list>li').each(function (i) {
      data.push($(this).find('.title').text())
    })
    console.log(data)
  }
})
That covers static-page crawling; a follow-up will explain how to crawl dynamic web pages.