Preface
Recently I was working on a mini program project that needed to crawl third-party data, so I picked up web crawling again. Crawling with front-end tooling actually works quite well, but many pages are now SPAs, so I ran into one pitfall after another. I am writing this article partly to console myself.
Crawling an ordinary website: Cheerio
This case is quite simple: one toolkit and a few lines of code.
const cheerio = require('cheerio')
const requestPromise = require('request-promise')
const url = 'https://juejin.cn/books';

requestPromise(url)
  .then((html) => {
    // Use cheerio to parse the page content and select every booklet's info node
    const $ = cheerio.load(html)
    const books = $('.info')
    let totalSold = 0
    let totalSale = 0
    const totalBooks = books.length
    // Walk through the booklet nodes, summing the number of purchases and the sales revenue
    books.each(function () {
      const book = $(this)
      const price = book.find('.price-text').text().replace('$', '')
      const count = book.find('.message').last().find('span').text().split('people')[0]
      totalSale += Number(price) * Number(count)
      totalSold += Number(count)
    })
    console.log(`${totalBooks} booklets in total`, `${totalSold} purchases in total`, `about ${Math.round(totalSale / 10000)} × 10,000 in sales`)
  })
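The counting step in the crawler is easier to check when it is separated from the network request. The following is a minimal sketch, assuming each booklet has already been reduced to two text strings. The `summarize` helper, the field names `priceText` and `countText`, and the sample values are all hypothetical and only illustrate the arithmetic.

```javascript
// Hypothetical helper: sums purchases and revenue from already-extracted text.
// Assumes price text like '$10' and count text like '100people', mirroring
// the replace/split calls used in the crawler.
function summarize(rows) {
  let totalSold = 0;
  let totalSale = 0;
  for (const { priceText, countText } of rows) {
    const price = Number(priceText.replace('$', ''));
    const count = Number(countText.split('people')[0]);
    // Skip rows whose text could not be parsed into numbers.
    if (Number.isNaN(price) || Number.isNaN(count)) continue;
    totalSale += price * count;
    totalSold += count;
  }
  return { totalBooks: rows.length, totalSold, totalSale };
}

// Made-up sample data, not real figures from the site.
const sample = [
  { priceText: '$10', countText: '100people' },
  { priceText: '$20', countText: '50people' },
];
const summary = summarize(sample);
console.log(summary);
```

Keeping the arithmetic in a pure function like this means it can be tested with fixed input, whichever tool ends up fetching the HTML.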
However, the request-promise example above does not work against this site, because Juejin is a typical SPA: the server returns only an empty mount node with no data. This is where the headless browser Puppeteer comes in.
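One way to see this for yourself is to look for the target class in the raw HTML before parsing it. The sketch below is dependency-free and assumes a heavily simplified shell page. The `hasClass` helper and both HTML strings are made up for illustration and are not copied from the real site.

```javascript
// Hypothetical helper: reports whether any element in the raw HTML carries
// the given class name. A regex is a rough check, not a real HTML parser.
function hasClass(html, className) {
  const classAttr = /class\s*=\s*["']([^"']*)["']/g;
  for (const match of html.matchAll(classAttr)) {
    if (match[1].split(/\s+/).includes(className)) return true;
  }
  return false;
}

// Roughly what a SPA server sends: a mount node and a script tag, no data.
const spaShell =
  '<html><body><div id="app"></div><script src="/app.js"></script></body></html>';
// Roughly what a server-rendered page sends: the data is already in the markup.
const rendered =
  '<html><body><div class="info card"><span class="price-text">$10</span></div></body></html>';

console.log(hasClass(spaShell, 'info')); // false
console.log(hasClass(rendered, 'info')); // true
```

If a check like this comes back false for the raw response while the nodes are visible in the browser, the content is most likely rendered by client-side JavaScript.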
Crawling a SPA page: Puppeteer
const cheerio = require('cheerio');
const puppeteer = require('puppeteer');
const url = 'https://juejin.cn/books';

async function run() {
  // Open a browser
  const browser = await puppeteer.launch();
  // Open a page and wait until the network is idle, so the SPA has had time to render
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  const html = await page.content();
  await browser.close(); // close the browser so the Node.js process can exit
  const $ = cheerio.load(html);
  const books = $('.info');
  let totalSold = 0
  let totalSale = 0
  const totalBooks = books.length
  // Walk through the booklet nodes, summing the number of purchases and the sales revenue
  books.each(function () {
    const book = $(this)
    const price = book.find('.price-text').text().replace('$', '')
    const count = book.find('.message').last().find('span').text().split('people')[0]
    totalSale += Number(price) * Number(count)
    totalSold += Number(count)
  })
  console.log(`${totalBooks} booklets in total`, `${totalSold} purchases in total`, `about ${Math.round(totalSale / 10000)} × 10,000 in sales`)
}

run()
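Waiting for `networkidle0` and then handing the HTML to Cheerio is one approach. Another possible variant, shown here only as a sketch and not as the method this article uses, is to wait for the booklet nodes themselves with `page.waitForSelector` and read their text inside the page with `page.$$eval`. The `collectRows` helper and the returned field names are hypothetical; the selectors are the ones used in the code above.

```javascript
// Hypothetical helper: waits for the booklet nodes to be rendered, then reads
// their text inside the page. `page` is expected to be a Puppeteer Page that
// has already navigated to the target URL.
async function collectRows(page) {
  // Resolves once at least one '.info' element exists in the DOM.
  await page.waitForSelector('.info');
  // The callback runs in the browser context, so it must not reference
  // variables from the surrounding Node.js scope.
  return page.$$eval('.info', (nodes) =>
    nodes.map((node) => {
      const price = node.querySelector('.price-text');
      const messages = node.querySelectorAll('.message');
      const last = messages[messages.length - 1];
      return {
        priceText: price ? price.textContent.trim() : '',
        messageText: last ? last.textContent.trim() : '',
      };
    })
  );
}
```

Inside `run()`, this would be called as `const rows = await collectRows(page)` after `page.goto(url)`. Whether it is more reliable than `networkidle0` depends on how the page loads its data, so treat it as something to try rather than a guaranteed improvement.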