A summary of tips for using Puppeteer, from the following perspectives:

  • Browser startup and requests
  • Page loading and rendering
  • Performance optimization and state management

Browser startup and requests

Custom Chromium/Chrome path

By default, Puppeteer downloads a Chromium build matching the current module version at install time, and the download sometimes fails due to network problems.

As of v1.7.0, puppeteer-core is shipped as a lightweight version of Puppeteer that lets you specify the Chromium/Chrome executable path. This allows you to use a Chrome already installed on your system (internally, Puppeteer launches the specified executable as a child process via child_process.spawn()).

Note the following points:

  • If you point to a system-installed Chrome, check whether its version meets Puppeteer's requirements
  • puppeteer-core does not automatically download Chromium
  • puppeteer-core ignores all PUPPETEER_* environment variables

import puppeteer from 'puppeteer-core'

// Return the default Chrome install path for the current OS
const getDefaultOsPath = () => {
    if (process.platform === 'win32') {
        return 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe'
    } else {
        return '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'
    }
}

let browser = await puppeteer.launch({
    executablePath: getDefaultOsPath()
})
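
To verify the first point above, you can ask the launched browser which build it is actually running; a minimal sketch using browser.version() (the path helper is the one defined above):

const browser = await puppeteer.launch({ executablePath: getDefaultOsPath() })
// browser.version() resolves to a string such as "Chrome/68.0.3440.75"
console.log(await browser.version())
await browser.close()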

UA related

Get UA

const puppeteer = require('puppeteer');

// Launch a throwaway browser just to read its default user agent
async function getPuppeteerChromeUA() {
  const browser = await puppeteer.launch();
  const ua = await browser.userAgent();
  await browser.close();
  return ua;
}
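A quick usage sketch: in headless mode the returned UA contains the "HeadlessChrome" token, which is exactly what the anonymizing helper in the next section rewrites.

getPuppeteerChromeUA().then(ua => {
    // true when launched headless; sites can use this token to detect automation
    console.log(ua.includes('HeadlessChrome'))
})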

Using an anonymized UA

Encapsulate a function to set anonymous UA:

async function setAnonymizeUA (page, opts) {
    let ua = await page.browser().userAgent()
    // 1. Strip the headless identifier
    if (opts.stripHeadless) {
        ua = ua.replace('HeadlessChrome/', 'Chrome/')
    }
    // 2. Replace the platform section with a Windows one
    if (opts.makeWindows) {
        ua = ua.replace(/\(([^)]+)\)/, '(Windows NT 10.0; Win64; x64)')
    }
    // 3. Apply a custom transform if provided
    if (opts.customFn) {
        ua = opts.customFn(ua)
    }
    await page.setUserAgent(ua)
}
  • puppeteer-extra-plugin-anonymize-ua
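
A minimal usage sketch for the helper above (the opts field names follow the branches in the function; the page setup is assumed):

const browser = await puppeteer.launch()
const page = await browser.newPage()
// Strip "HeadlessChrome" and report a Windows platform
await setAnonymizeUA(page, { stripHeadless: true, makeWindows: true })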

Page loading and rendering

Block requests for resources of specified types

Use setRequestInterception() to intercept requests and abort those of the specified resource types, which speeds up loading:

...
const blockedTypes = new Set(['image', 'media', 'font'])
await page.setRequestInterception(true)
page.on('request', request => {
    const type = request.resourceType()
    const shouldBlock = blockedTypes.has(type)
    this.debug('onRequest', { type, shouldBlock })
    return shouldBlock ? request.abort() : request.continue()
})
...

Note: enabling request interception makes the page cache unavailable.

  • puppeteer-extra-plugin-block-resources
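
The plugin listed above offers the same behavior for puppeteer-extra; a sketch assuming the plugin takes a blockedTypes set (check its README for the exact option shape):

const puppeteer = require('puppeteer-extra')
const blockResources = require('puppeteer-extra-plugin-block-resources')

// Assumed option: a Set of resource types the plugin should abort
puppeteer.use(blockResources({ blockedTypes: new Set(['image', 'media', 'font']) }))

const browser = await puppeteer.launch()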

Control the stage at which the page is considered loaded

Set the waitUntil option of goto() so that navigation resolves when the DOMContentLoaded event fires instead of waiting for the load event; this saves the time otherwise spent waiting for the render tree to be built and the page to be painted.

The waitUntil lifecycle events correspond to Page.lifecycleEvent in the CDP (Chrome DevTools Protocol).

...
let page = await browser.newPage()
await page.goto('http://some.site', { waitUntil: 'domcontentloaded' })
...
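
For reference, waitUntil also accepts 'load', 'networkidle0', and 'networkidle2', and an array of events may be passed to wait for several at once:

// Resolve only after DOMContentLoaded has fired AND the network has been
// idle (no connections) for at least 500 ms
await page.goto('http://some.site', {
    waitUntil: ['domcontentloaded', 'networkidle0']
})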

Performance optimization and state management

Use a singleton browser instance

When running multiple crawlers in the same program, you can often reuse a single browser instance rather than creating a new one for every crawler you start.

// instance.js
const pptr = require('puppeteer');

let instance = null;

// Lazily launch the browser once and reuse it afterwards
module.exports.getBrowserInstance = async function() {
    if (!instance) instance = await pptr.launch();
    return instance;
}

Use:

const {getBrowserInstance} = require('./instance');

async function doWork() {
  // ....
  const browser = await getBrowserInstance(); // this will reuse single browser
  // ....
}
  • Github.com/GoogleChrom…
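
A hypothetical companion for clean shutdown, living in the same instance.js module (the function name is an assumption, not from the linked example):

// instance.js (continued)
module.exports.closeBrowserInstance = async function() {
    if (instance) {
        await instance.close();
        instance = null;
    }
}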

You can also use the following simpler approach:

let browserInstance = null

const getSingleBrowser = async option => {
    if (!browserInstance) {
        browserInstance = await puppeteer.launch(option)
    }
    return browserInstance
}

Note that if several crawlers start at the same time, the later tasks should run only after the first has finished creating the instance; otherwise, executing them simultaneously will launch multiple browser instances (a fix for fully concurrent callers is sketched after the snippet). For example:

async function searchHandle() {
    await bing('hello world') // creates the browser instance
    duckduckgo('hello world') // reuses the browser instance above
    google('hello world')     // reuses the browser instance above
}
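
If the crawlers really do need to start in parallel, one common pattern (an addition here, not from the original snippet) is to cache the launch promise itself, so concurrent callers all await the same pending launch:

let instancePromise = null

// Every caller receives the same promise, so only one browser is launched
// even when getSingleBrowser() is called concurrently
const getSingleBrowser = option => {
    if (!instancePromise) instancePromise = puppeteer.launch(option)
    return instancePromise
}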

Use a Transform stream to track crawler execution progress

If you want to know which step a crawler is on after wrapping it in a promise, you can use a custom Transform stream to receive all the status information in one place. When using Electron, the status information can also be forwarded to the renderer process.

Initialize the Stream

// main.js
import { Transform } from 'stream'

export const statusStream = new Transform({
    // Enable object mode
    writableObjectMode: true,
    readableObjectMode: true,
    transform(chunk, encoding, callback) {
        callback(null, chunk)
    }
})

// Forward status messages to a handler (and, in Electron, to the renderer)
const initStatusPipe = (stream, win) => {
    stream.setEncoding('utf-8')
    stream.on('data', chunk => {
        handle_func(chunk) // process the data
        // If used in Electron, send it to the renderer process
        win.webContents.send(IPC_RENDERER_SIGNAL.MESSAGE, { message: chunk })
    })
}

app.on('ready', () => {
    let mainWindow = new BrowserWindow(...)
    ...
    initStatusPipe(statusStream, mainWindow)
})

If used in Electron, the corresponding event can be listened for in the renderer process:

this.$electron.ipcRenderer.on(IPC_RENDERER_SIGNAL.MESSAGE, (e, arg) => {
    console.log(arg.message)
})

Using the stream in crawlers

Pass in the previously defined stream object and write the status information to the stream using the write method.

// crawler.js
const google = (pipe, option) => {
    return new Promise(async (resolve, reject) => {
        try {
            ...
            await page.goto(url, {waitUntil: 'domcontentloaded'})
            pipe.write(`page: open ${url}`)
            ...
            pipe.write(`page: crawled ${number} results from google`)
            ...
            await page.close()
            pipe.write('page: closed')
            // return results
            resolve(...)
        } catch (err) {
            reject(...)
        }
    })
}

export default google

In this way, you can keep track of what is going on inside the crawler. To manage crawler state more effectively, you can design message formats to suit your situation, as sketched below.
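
For instance, a hypothetical structured status message (the field names are illustrative, not from the code above); since statusStream is in object mode, whole objects can be written instead of strings (drop the setEncoding call in initStatusPipe in that case):

statusStream.write({
    crawler: 'google',
    stage: 'crawled',         // e.g. 'opened' | 'crawled' | 'closed'
    detail: { url, number },  // payload fields relevant to this stage
    timestamp: Date.now()
})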