
Recently I was working on a small project that needed to crawl third-party data, so I picked crawlers up again. Crawling with front-end (JavaScript) tooling actually works quite well, but many pages are now SPAs, and I kept falling into one pitfall after another. I am writing this post as some consolation.

Crawling a regular website: Cheerio

It is pretty simple: one toolkit and a few lines of code.

const cheerio = require('cheerio')
const requestPromise = require('request-promise')
const url = '';

requestPromise(url)
    .then((html) => {
        // Use cheerio to parse the page and select every booklet's description node
        const $ = cheerio.load(html)
        const books = $('.info')
        let totalSold = 0
        let totalSale = 0
        const totalBooks = books.length
        // Walk through the list and add up the number of buyers and the total sales
        books.each(function () {
            const book = $(this)
            // Adjust the '$' and 'people' strings to match the text on the actual page
            const price = book.find('.price-text').text().replace('$', '')
            const count = book.find('.message').last().find('span').text().split('people')[0]
            totalSale += Number(price) * Number(count)
            totalSold += Number(count)
        })
        console.log(
            `${totalBooks} booklets in total`,
            `${totalSold} purchases in total`,
            `about ${Math.round(totalSale / 10000)} x 10,000 in sales`
        )
    })

However, the example above does not work, because Juejin ("Nuggets") is a classic SPA: the server returns only an empty mount node with no data in it, and the content is rendered later by JavaScript in the browser. That is where a headless browser, Puppeteer, comes in.
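A quick way to confirm that a response is just an empty mount node is to check whether any visible text is left once scripts, styles, and tags are stripped from the raw HTML. The sketch below is a rough heuristic rather than a robust detector; the function name `looksLikeEmptyShell` and the two sample documents are made up for illustration.

```javascript
// Hypothetical heuristic (not from the original post): treat a response as an
// "empty SPA shell" when its <body> has no visible text left after removing
// scripts, styles, and tags.
function looksLikeEmptyShell(html) {
    const bodyMatch = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i)
    const body = bodyMatch ? bodyMatch[1] : html
    const visibleText = body
        .replace(/<script[\s\S]*?<\/script>/gi, '') // drop script elements
        .replace(/<style[\s\S]*?<\/style>/gi, '')   // drop style blocks
        .replace(/<[^>]+>/g, '')                    // drop the remaining tags
        .trim()
    return visibleText.length === 0
}

// Made-up sample responses for illustration.
const spaShell = '<html><body><div id="app"></div><script src="/app.js"></script></body></html>'
const serverRendered = '<html><body><div class="info"><span class="price-text">$20</span></div></body></html>'

console.log(looksLikeEmptyShell(spaShell))
console.log(looksLikeEmptyShell(serverRendered))
```

If the HTML fetched with a plain HTTP request looks like the first sample, a selector-based parser will find nothing to select, no matter how the selectors are written.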

Crawling an SPA page: Puppeteer

const cheerio = require('cheerio');
const puppeteer = require('puppeteer');
const url = '';

async function run() {
    // Open a browser
    const browser = await puppeteer.launch();
    // Open a page
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });
    const html = await page.content();
    // Close the browser so the Node process can exit
    await browser.close();
    const $ = cheerio.load(html);
    const books = $('.info');
    let totalSold = 0
    let totalSale = 0
    const totalBooks = books.length
    // Walk through the list and add up the number of buyers and the total sales
    books.each(function () {
        const book = $(this)
        // Adjust the '$' and 'people' strings to match the text on the actual page
        const price = book.find('.price-text').text().replace('$', '')
        const count = book.find('.message').last().find('span').text().split('people')[0]
        totalSale += Number(price) * Number(count)
        totalSold += Number(count)
    })
    console.log(
        `${totalBooks} booklets in total`,
        `${totalSold} purchases in total`,
        `about ${Math.round(totalSale / 10000)} x 10,000 in sales`
    )
}

run()
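As a variation on the script above, the extraction can also run inside the page itself with `page.$$eval`, so the HTML does not have to be serialized and parsed a second time with Cheerio. This is only a sketch under assumptions: the selectors and the '$' / 'people' strings are copied from the example above, `scrapeBooks` and `summarize` are hypothetical names, and the `try`/`finally` is added here so the browser is closed even when a step throws.

```javascript
// Pure helper: sums purchases and revenue over rows of { price, count }.
function summarize(rows) {
    let totalSold = 0
    let totalSale = 0
    for (const { price, count } of rows) {
        totalSold += count
        totalSale += price * count
    }
    return { totalBooks: rows.length, totalSold, totalSale }
}

// Hypothetical wrapper: opens the page, waits for the cards to render,
// and extracts price/count pairs inside the browser context.
async function scrapeBooks(url) {
    // Required lazily so summarize() can be used without Puppeteer installed.
    const puppeteer = require('puppeteer')
    const browser = await puppeteer.launch()
    try {
        const page = await browser.newPage()
        await page.goto(url, { waitUntil: 'networkidle0' })
        // Wait until at least one card has been rendered by the SPA.
        await page.waitForSelector('.info')
        // This callback runs in the page, so it can only use DOM APIs.
        const rows = await page.$$eval('.info', (cards) => cards.map((card) => {
            const priceText = card.querySelector('.price-text')?.textContent ?? ''
            const messages = card.querySelectorAll('.message')
            const last = messages[messages.length - 1]
            const countText = last
                ? Array.from(last.querySelectorAll('span'), (s) => s.textContent).join('')
                : ''
            return {
                price: Number(priceText.replace('$', '')),
                count: Number(countText.split('people')[0]),
            }
        }))
        return summarize(rows)
    } finally {
        // Always release the browser process.
        await browser.close()
    }
}
```

It can be called as `scrapeBooks(url).then(console.log)` with the same `url` as in the script above. Keeping `summarize` separate from the browser code means the arithmetic can be checked without launching Chromium.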

The data from the page is as follows