I have been looking into web crawlers lately, so I wrote a simple one in Node, which I know well.
I started with PhantomJS to fetch the HTML, but its documentation showed that it had not been maintained for a long time, so I gave it up.
Then I looked around and found Puppeteer, which is developed by Google. I tried it right away and found it far more capable than PhantomJS.
Enough talk. Here is the project address:
Github.com/Huoshendame…
Project introduction
Technology stack
Node, Puppeteer, Cheerio (Puppeteer has its own selector methods such as page.$, but Cheerio was already installed, so I used it)
Installation notes
Running npm install fails while Puppeteer is being installed, because the Chrome download it triggers fails. So skip the Chrome download for now:
npm install puppeteer --ignore-scripts
After that installation succeeds, run the install again:
npm install
Once everything is installed, manually copy the chrome-win folder from the project to the root directory of drive D.
PS: You can also put it in a directory of your own, but then set the absolute path to chrome.exe in the executablePath option of puppeteer.launch in the project's reptile.js.
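A minimal sketch of how the Chrome location could be made configurable instead of fixed to drive D. The helper name buildLaunchOptions and the CHROME_PATH environment variable are assumptions made for illustration and are not part of the project; executablePath and headless are the launch options that appear in the source code.

```javascript
// Hypothetical helper (not in the original project): builds the options
// object that would be handed to puppeteer.launch().
// CHROME_PATH is an assumed environment variable name.
function buildLaunchOptions(env = process.env) {
  return {
    // Fall back to the location the article uses when nothing is configured
    executablePath: env.CHROME_PATH || 'D:\\chrome-win\\chrome.exe',
    // false shows the Chrome window, true hides it
    headless: false
  }
}

console.log(buildLaunchOptions({ CHROME_PATH: '/opt/chrome/chrome' }))
```

With Puppeteer installed, the result would be passed straight through, for example puppeteer.launch(buildLaunchOptions()).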
Features
1. Open the Toutiao news page https://www.toutiao.com/ch/news_game/ with Puppeteer.
2. Get the page instance and scroll the page by injecting JS.
3. Parse the DOM structure with Cheerio to get each item's title, picture, and link address.
4. Save the result to a local file. (It could also go into a DB. Here the interface returns the data directly, and I save it to a local file along the way.)
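Steps 3 and 4 can be sketched without a browser. The example below assumes the scraped items have already been reduced to plain objects with src, title, source and href fields; the helper names toRecords and saveList, the sample data, and the temporary file name are all hypothetical. It keeps only items that have a picture and writes them out as JSON.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Hypothetical helper: keep only items that have a picture,
// and copy just the four fields that get saved.
function toRecords(rawItems) {
  return rawItems
    .filter(item => item.src !== undefined)
    .map(({ src, title, source, href }) => ({ src, title, source, href }))
}

// Hypothetical helper: wrap the records in { list: [...] } and save them as JSON.
function saveList(records, filePath) {
  const data = { list: records }
  fs.writeFileSync(filePath, JSON.stringify(data))
  return data
}

// Made-up sample input; the second item has no picture and is dropped.
const sample = [
  { src: 'a.png', title: 'First', source: 'Demo', href: '/a' },
  { title: 'No picture', source: 'Demo', href: '/b' }
]
const outFile = path.join(os.tmpdir(), 'tt-sketch.json')
const saved = saveList(toRecords(sample), outFile)
console.log(saved.list.length)
```

Keeping the filtering separate from the file write makes it easy to send the same records to a DB instead.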
The source code
/* Import the required modules */
const fs = require('fs')
const cheerio = require('cheerio')
const puppeteer = require('puppeteer')

/* Define the function */
let getListData = async function (Category) {
  /* Initialize puppeteer */
  const browser = await puppeteer.launch({
    // Put the chrome-win folder from the project into the root directory of drive D
    executablePath: 'D:\\chrome-win\\chrome.exe',
    // Whether to hide the Chrome window: true hides it, false shows it
    headless: false
  })
  // Get the page instance
  const page = await browser.newPage()
  // Choose the page to crawl from the category passed in
  let url = ''
  switch (Category) {
    case '0':
      url = 'https://www.toutiao.com/ch/news_game/'
      break
    case '1':
      url = 'https://www.toutiao.com/ch/news_entertainment/'
      break
    case '2':
      url = 'https://www.toutiao.com/ch/news_history/'
      break
    case '3':
      url = 'https://www.toutiao.com/ch/news_finance/'
      break
  }
  // Open the page
  await page.goto(url)
  // Page content and the Cheerio (jQuery-style) handle
  let content, $
  /* Page scroll method */
  async function scrollPage(i) {
    content = await page.content()
    $ = cheerio.load(content)
    /* Run JavaScript in the page (scroll it). i is passed as an argument
       because this function executes in the browser, not in Node. */
    await page.evaluate(function (i) {
      /* Scroll incrementally: a single jump does not trigger the listener
         that loads new data. The step size was lost in the original
         listing; 100 is an assumed value. */
      for (var y = 0; y <= 1000 * i; y += 100) {
        window.scrollTo(0, y)
      }
    }, i)
    // Get the data list
    const li = $($('.feedBox').find('ul')[0]).find('li')
    return li
  }
  let i = 0
  let li = await scrollPage(++i)
  // Keep scrolling until the list has at least 30 items
  while (li.length < 30) {
    li = await scrollPage(++i)
  }
  let data = { list: [] }
  /* Build the returned data, skipping items without a picture */
  li.each(function (index, item) {
    const src = $(item).find('img').attr('src')
    if (src !== undefined) {
      data.list.push({
        src: src,
        title: $(item).find('.title').text(),
        source: $(item).find('.source').text(),
        href: $(item).find('.title').attr('href')
      })
    }
  })
  // Save to local files along the way
  fs.writeFileSync('tt.JSON', JSON.stringify(data))
  fs.writeFileSync('tt.html', content)
  /* Close puppeteer */
  await browser.close()
  return data
}
module.exports = getListData
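The comment in scrollPage says the scroll has to be incremental, because a single jump does not trigger loading of new data. A minimal sketch of the offsets such a loop visits is shown below. The helper name scrollOffsets is hypothetical, and the 100-pixel step is an assumed value; only the 1000-pixels-per-round figure comes from the source.

```javascript
// Hypothetical helper (not in the original project): lists the y offsets
// visited in a given round, from 0 up to roundPx * round inclusive.
function scrollOffsets(round, roundPx = 1000, stepPx = 100) {
  const offsets = []
  for (let y = 0; y <= roundPx * round; y += stepPx) {
    offsets.push(y)
  }
  return offsets
}

// Round 1 covers 0..1000, round 2 covers 0..2000, and so on.
console.log(scrollOffsets(1))
console.log(scrollOffsets(2).length)
```

Because page.evaluate accepts serializable arguments after the page function, an array like this could be passed into the page and replayed there with window.scrollTo(0, y) for each offset.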