Puppeteer is a Node library that provides a high-level API for controlling Chromium or Chrome via the DevTools protocol.

Puppeteer allows us to write scripts that simulate browser behavior and perform the following functions:

  • Take a screenshot of the web page and save it as an image or PDF.
  • Simulate DOM operations such as form submission, keyboard input, button clicking and slider movement.
  • Implement automated testing of UI.
  • Capture network traffic and performance data for debugging and analyzing web pages.
  • Write custom crawlers for single-page applications (SPAs), whose asynchronously loaded content is hard for traditional HTTP crawlers to handle.

When doing this, a few tricks can make a Puppeteer program more efficient.

Filtering requests

When Puppeteer is used to parse the DOM of an asynchronously rendered page, scripts usually have to wait until rendering has finished. Rendering, however, also loads many static resources such as images, audio, video and style sheets. We can call page.setRequestInterception to intercept the page's requests and abort those for static resources, which speeds up page rendering. A code example is as follows:

    // Enable request interception
    await page.setRequestInterception(true);

    page.on('request', async req => {
        // Filter by resource type
        const resourceType = req.resourceType();
        if (resourceType === 'image') {
            req.abort();
        } else {
            req.continue();
        }
    });

Recommended types of requests to intercept:

    const blockedResourceTypes = [
        'image', 'media', 'font', 'texttrack', 'object',
        'beacon', 'csp_report', 'imageset',
    ];

    const skippedResources = [
        'quantserve', 'adzerk', 'doubleclick', 'adition', 'exelator',
        'sharethrough', 'cdn.api.twitter', 'google-analytics',
        'googletagmanager', 'google', 'fontawesome', 'facebook',
        'analytics', 'optimizely', 'clicktale', 'mixpanel', 'zedo',
        'clicksor', 'tiqcdn',
    ];
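The two lists can be combined into a single check inside the request handler. The following is a minimal sketch, not the author's implementation: `shouldBlock` and `blockStaticResources` are hypothetical helper names, the lists are shortened copies of the ones above, and `page` is assumed to be a Puppeteer page.

```javascript
// Shortened copies of the lists above, for illustration only.
const blockedResourceTypes = ['image', 'media', 'font', 'texttrack', 'object', 'beacon', 'csp_report', 'imageset'];
const skippedResources = ['quantserve', 'adzerk', 'doubleclick', 'google-analytics', 'googletagmanager', 'optimizely', 'mixpanel'];

// Hypothetical helper: returns true when a request should be aborted,
// either because of its resource type or because its URL contains
// one of the skipped fragments.
function shouldBlock(resourceType, url) {
  if (blockedResourceTypes.includes(resourceType)) {
    return true;
  }
  return skippedResources.some((fragment) => url.includes(fragment));
}

// Hypothetical wiring: `page` is assumed to be a Puppeteer Page.
async function blockStaticResources(page) {
  await page.setRequestInterception(true);
  page.on('request', (req) => {
    if (shouldBlock(req.resourceType(), req.url())) {
      req.abort();
    } else {
      req.continue();
    }
  });
}
```

Substring matching is deliberately coarse: an entry such as 'google' matches every URL that contains that string, so the list should be trimmed to fit the target site.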

Proxying requests

In addition to filtering requests, we can also proxy the requests made while a page renders. Some crawler projects do this to avoid being blocked by anti-crawling measures. A code example is as follows:

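    // Request interception must be enabled before req.respond() can be used
    await page.setRequestInterception(true);

    page.on('request', async req => {
        // Re-issue the request through a proxy. `fetch` here is the author's
        // HTTP client (its options resemble request-promise), not the Fetch API.
        const response = await fetch({
            url: req.url(),
            method: req.method(),
            headers: req.headers(),
            body: req.postData(),
            proxy: getProxyIp(),
            resolveWithFullResponse: true,
        });
        // Respond to the intercepted request with the proxied response
        req.respond({
            status: response.statusCode,
            contentType: response.headers['content-type'],
            headers: response.headers || req.headers(),
            body: response.body,
        });
    });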
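The snippet above calls `getProxyIp()` without showing it. One possible shape, assuming a fixed pool of proxy addresses that are handed out in round-robin order, is sketched below. `createProxyRotator` is a hypothetical name and the addresses are placeholders; a real project might instead fetch proxies from a provider or drop ones that fail.

```javascript
// Hypothetical factory: builds a getProxyIp() function that cycles
// through the given pool in round-robin order.
function createProxyRotator(pool) {
  if (!Array.isArray(pool) || pool.length === 0) {
    throw new Error('proxy pool must be a non-empty array');
  }
  let next = 0;
  return function getProxyIp() {
    const proxy = pool[next];
    next = (next + 1) % pool.length;
    return proxy;
  };
}

// Placeholder addresses, not real proxies.
const getProxyIp = createProxyRotator([
  'http://10.0.0.1:8080',
  'http://10.0.0.2:8080',
  'http://10.0.0.3:8080',
]);
```

Round-robin spreads requests evenly across the pool, which keeps any single address from carrying all of the traffic.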

Reusing the browser

Connecting to a running browser with puppeteer.connect is much faster than starting a new one with puppeteer.launch (see below), so if you need multiple Browser instances, you can cache the wsEndpoint and reuse it:

    let wsEndpoint = await cache.get(Parser.WS_KEY);
    let browser;
    try {
        // Launch a new browser if no endpoint is cached; otherwise reconnect
        browser = !wsEndpoint
            ? await puppeteer.launch(config)
            : await puppeteer.connect({
                  browserWSEndpoint: wsEndpoint,
              });
    } catch (err) {
        // The cached endpoint may be stale, so fall back to a fresh launch
        browser = await puppeteer.launch(config);
    } finally {
        // Cache the current endpoint so later runs can reconnect
        wsEndpoint = browser.wsEndpoint();
        await cache.set(Parser.WS_KEY, 60 * 60 * 1000, wsEndpoint);
    }
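The same connect-or-launch logic can be wrapped in a reusable function. This is a sketch under stated assumptions rather than the author's code: `getBrowser` is a hypothetical name, `puppeteer` is the imported Puppeteer module passed in as an argument, and `cache` is assumed to be an async key-value store exposing `get(key)` and `set(key, ttlMs, value)`, mirroring the argument order used above. A real cache client may differ.

```javascript
const ONE_HOUR_MS = 60 * 60 * 1000;

// Hypothetical helper: reconnects to a cached browser when possible
// and launches a new one otherwise.
async function getBrowser(puppeteer, cache, cacheKey, launchConfig) {
  const cachedEndpoint = await cache.get(cacheKey);
  let browser = null;

  if (cachedEndpoint) {
    try {
      // Reconnect to a browser that is already running.
      browser = await puppeteer.connect({ browserWSEndpoint: cachedEndpoint });
    } catch (err) {
      // The cached endpoint is stale (for example, the browser has exited).
      browser = null;
    }
  }

  if (!browser) {
    browser = await puppeteer.launch(launchConfig);
  }

  // Refresh the cached endpoint so later callers can reconnect.
  await cache.set(cacheKey, ONE_HOUR_MS, browser.wsEndpoint());
  return browser;
}
```

Callers that share a browser this way would normally finish with browser.disconnect() rather than browser.close(), so that the browser stays available to the next caller.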

Disable unnecessary browser functions

Puppeteer provides a sophisticated browser environment, but in practice, there are many features enabled by default that are not required by the project itself. In this case, we can disable additional features by setting browser startup parameters:

    puppeteer.launch({
        args: [
            '--no-sandbox', // Disable the Chromium sandbox
            '--disable-setuid-sandbox', // Disable the setuid sandbox
            '--disable-dev-shm-usage', // Use /tmp instead of /dev/shm for shared memory files
            '--disable-accelerated-2d-canvas', // Disable GPU-accelerated 2D canvas rendering
            '--disable-gpu', // Disable GPU hardware acceleration
        ],
    });

References

  • Official documentation for Puppeteer
  • Connecting Puppeteer to Existing Chrome Window w/ reCAPTCHA
  • Puppeteer set proxy for page