Puppeteer, a powerful front-end tool

Puppeteer is a Node.js package released by the Chrome development team in 2017, alongside Headless Chrome, and is used to drive Chrome programmatically. It provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol, and it can also be configured to use full (non-headless) Chrome or Chromium.

Before learning about Puppeteer, let’s take a look at the Chrome DevTools Protocol (CDP) and Headless Chrome.

What is the Chrome DevTools Protocol

CDP runs over WebSocket, which gives tools a fast data channel to the browser kernel.
CDP is divided into multiple domains (DOM, Debugger, Network, Profiler, Console…). Each domain defines its own Commands and Events.
Tools built on CDP can debug and analyze Chrome. Chrome DevTools itself is implemented on top of CDP.
Other useful tools built on CDP include chrome-remote-interface and Puppeteer.
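
To make the split between Commands and Events concrete, here is a minimal sketch of the JSON messages that travel over the CDP WebSocket. The helper names createCommandFactory and classifyMessage are made up for this illustration and are not part of any library; Page.navigate and Network.requestWillBeSent are real CDP names.

```javascript
// A command carries a unique id, a "Domain.method" name and optional params.
function createCommandFactory() {
  let nextId = 0;
  return (method, params = {}) => JSON.stringify({ id: ++nextId, method, params });
}

// Incoming messages are either responses (they echo the id of a command)
// or events (no id, only a "Domain.eventName" method and params).
function classifyMessage(raw) {
  const message = JSON.parse(raw);
  if (typeof message.id === 'number') {
    return { kind: 'response', id: message.id, result: message.result, error: message.error };
  }
  const [domain, name] = message.method.split('.');
  return { kind: 'event', domain, name, params: message.params };
}

const command = createCommandFactory();
console.log(command('Page.navigate', { url: 'https://example.com' }));
console.log(classifyMessage('{"method":"Network.requestWillBeSent","params":{"requestId":"1"}}'));
```

Puppeteer hides this bookkeeping: it assigns the ids, matches responses to pending commands, and turns events into the higher-level events described later.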

What is Headless Chrome

It runs Chrome in an environment without a graphical interface.
Chrome can be operated from the command line or from a programming language.
With no human intervention required, operation is more stable.
Start Chrome in headless mode by adding the --headless flag when launching it.
Chrome accepts many other command-line parameters at startup as well.

Headless Chrome is Chrome without a user interface: it lets you run applications with all of Chrome’s supported features without opening a browser window.

What is Puppeteer

Puppeteer is a Node.js library.
Puppeteer provides a set of APIs that control the behavior of Chromium/Chrome through the Chrome DevTools Protocol.
By default, Puppeteer starts Chrome in headless mode. A launch option starts Chrome with a visible interface instead.
By default, each Puppeteer release is bound to a recent Chromium version, and you can bind it to a different version yourself.
Puppeteer lets us communicate with the browser without knowing much about the underlying CDP.

What can Puppeteer do

According to the official documentation, most things you can do manually in a browser can be done with Puppeteer. For example:

Generate screenshots and PDFs of pages.
Crawl SPA or SSR sites.
Automate form submission, UI testing, keyboard input, etc.
Create an up-to-date automated testing environment. Run tests directly in the latest version of Chrome, using the latest JavaScript and browser features.
Capture a timeline trace of the site to help diagnose performance problems.
Test Chrome extensions.

The Puppeteer API is layered

The API hierarchy in Puppeteer is basically the same as that in the browser. Here are some of the classes that are commonly used:

Browser: corresponds to a browser instance. A Browser can contain multiple BrowserContexts
BrowserContext: corresponds to a browser session. Each BrowserContext has its own independent session (cookies and cache are not shared with other contexts), just like opening a normal Chrome window and then opening another window in incognito mode. A BrowserContext can contain multiple Pages
Page: corresponds to a tab, created with browserContext.newPage() or browser.newPage(). browser.newPage() creates the page in the default BrowserContext. A Page can contain multiple Frames
Frame: each page has one main frame (page.mainFrame()) and may have multiple child frames, created mainly by iframe tags
ExecutionContext: the JavaScript execution environment. Each Frame has a default JavaScript execution context
ElementHandle: corresponds to an element node in the DOM. Through this instance you can click the element, fill in a form, and so on. Elements can be obtained with selectors, XPath, etc
JSHandle: corresponds to a JavaScript object in the page; ElementHandle inherits from JSHandle. Because objects in the page cannot be operated on directly, they are wrapped as JSHandles to provide the related functionality
CDPSession: communicates directly with native CDP, sending messages with session.send and receiving them with session.on, which makes it possible to use features that the Puppeteer API does not cover
Coverage: gets JavaScript and CSS code coverage
Tracing: captures performance data for analysis
Response: a response received by the page
Request: a request made by the page
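
The containment chain Browser, BrowserContext, Page, Frame can be walked in a few lines. This is a minimal sketch: describeBrowser is a hypothetical helper name, and it relies only on browser.browserContexts(), browserContext.pages(), page.url(), page.frames() and frame.url().

```javascript
// Walks Browser -> BrowserContext -> Page -> Frame and returns a plain summary.
async function describeBrowser(browser) {
  const summary = [];
  for (const context of browser.browserContexts()) {
    const pages = [];
    // pages() is asynchronous; frames() and url() are synchronous.
    for (const page of await context.pages()) {
      pages.push({
        url: page.url(),
        frames: page.frames().map((frame) => frame.url()),
      });
    }
    summary.push({ pages });
  }
  return summary;
}

// Usage inside an async function (assumes puppeteer is installed):
//   const browser = await require('puppeteer').launch();
//   console.log(JSON.stringify(await describeBrowser(browser), null, 2));
```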

Puppeteer Installation and Environment

Note: Before v1.18.1, Puppeteer required at least Node v6.4.0. Versions from v1.18.1 to v2.1.0 depend on Node 8.9.0+. Starting with v3.0.0, Puppeteer depends on Node 10.18.1+. async/await is only supported in Node v7.6.0 or later.

Puppeteer is a Node.js package, so installing Puppeteer is simple:

npm install puppeteer
# or
yarn add puppeteer

npm may report an error while installing Puppeteer. This is usually caused by restricted access to servers outside the local network; going through a proxy or installing from the Taobao mirror with cnpm solves it.

When Puppeteer is installed, it downloads a recent version of Chromium. Starting with version 1.7.0, the puppeteer-core package is also published officially. It does not download a browser by default and is meant for launching an existing browser or connecting to a remote one. Make sure the puppeteer-core version you install is compatible with the browser you intend to connect to.
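
As a rough sketch of the two puppeteer-core use cases, the helper below either attaches to a running browser or launches a locally installed one. openBrowser is a hypothetical name, and the endpoint and path in the usage comment are placeholders; connect({browserWSEndpoint}) and launch({executablePath}) are the Puppeteer calls involved.

```javascript
// puppeteerCore is passed in so the sketch does not assume the package is installed.
// options is expected to contain either browserWSEndpoint or executablePath.
async function openBrowser(puppeteerCore, options) {
  if (options.browserWSEndpoint) {
    // Attach to a browser that is already running with remote debugging enabled.
    return puppeteerCore.connect({ browserWSEndpoint: options.browserWSEndpoint });
  }
  if (options.executablePath) {
    // Launch a locally installed Chrome/Chromium instead of a downloaded one.
    return puppeteerCore.launch({ executablePath: options.executablePath });
  }
  throw new Error('Provide either browserWSEndpoint or executablePath');
}

// Usage inside an async function (placeholder path):
//   const browser = await openBrowser(require('puppeteer-core'), {
//     executablePath: '/path/to/chrome',
//   });
```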

Using Puppeteer

Case1: Screenshots

Puppeteer can take screenshots of both a whole page and a single element in the page:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Set the viewport. The default page size is 800x600
  await page.setViewport({width: 1920, height: 800});
  await page.goto('https://www.baidu.com/');
  // Take a screenshot of the entire page (the ./files directory must already exist)
  await page.screenshot({
    path: './files/baidu_home.png', // Image save path
    type: 'png',
    fullPage: true // Capture the full scrollable page
    // clip: {x: 0, y: 0, width: 1920, height: 800}
  });
  // Take a screenshot of a single element on the page
  const element = await page.$('#s_lg_img');
  await element.screenshot({
    path: './files/baidu_logo.png'
  });
  await page.close();
  await browser.close();
})();

How do we get an element in a page?

page.$('#uniqueId'): Gets the first element corresponding to a selector
page.$$('div'): Gets all elements corresponding to a selector
page.$x('//img'): Gets all elements matching an XPath expression
page.waitForXPath('//img'): Waits for an element matching an XPath expression to appear
page.waitForSelector('#uniqueId'): Waits for the element corresponding to a selector to appear

Case2: Simulating user operations

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        slowMo: 100, // Slow down each operation by 100 ms
        headless: false, // Show the browser window
        defaultViewport: {width: 1440, height: 780},
        ignoreHTTPSErrors: false, // Whether to ignore HTTPS errors
        args: ['--start-fullscreen'] // Open the page in full screen
    });
    const page = await browser.newPage();
    await page.goto('https://www.baidu.com/');
    // Enter text
    const inputElement = await page.$('#kw');
    await inputElement.type('hello world', {delay: 20});
    // Click the search button
    const okButtonElement = await page.$('#su');
    // When a click triggers a navigation, await page.waitForNavigation() together with the click
    // so that the script continues only after the navigation has completed
    await Promise.all([
        okButtonElement.click(),
        page.waitForNavigation()
    ]);
    await page.close();
    await browser.close();
})();

What functions does ElementHandle provide to manipulate elements?

elementHandle.click(): Click on an element
elementHandle.tap(): Simulates finger touch and click
elementHandle.focus(): Focuses on an element
elementHandle.hover(): Hovers the mouse over an element
elementHandle.type('hello'): Enters text in the input box

Case3: Executing JavaScript code in the page

The most powerful feature of Puppeteer is that you can execute any JavaScript code you want in the browser. As an example, the following code crawls the recommended news items on the Baidu home page.

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.baidu.com/');
    // Use page.evaluate to execute code in the browser
    const resultData = await page.evaluate(async () => {
      const listEle = [...document.querySelectorAll('#hotsearch-content-wrapper .hotsearch-item')];
      const data = listEle.map((ele) => {
        const urlEle = ele.querySelector('a.c-link');
        const titleEle = ele.querySelector('.title-content-title');
        return {
          href: urlEle.href,
          title: titleEle.innerText,
        };
      });
      return data;
    });
    console.log(resultData);
    await page.close();
    await browser.close();
})();

What functions are available to execute code in the browser environment?

page.evaluate(pageFunction[, ...args]): Executes a function in the browser environment
page.evaluateHandle(pageFunction[, ...args]): Executes a function in the browser environment and returns a JSHandle object
page.$$eval(selector, pageFunction[, ...args]): Passes all elements matching the selector to the function and executes it in the browser environment
page.$eval(selector, pageFunction[, ...args]): Passes the first element matching the selector to the function and executes it in the browser environment
page.evaluateOnNewDocument(pageFunction[, ...args]): Executes the function in the browser environment whenever a new document is created, before any of the page's scripts run
page.exposeFunction(name, puppeteerFunction): Registers a function on the window object. The function itself executes in the Node environment, which lets code in the browser call Node.js libraries
Case4: Request interception

In some scenarios it is worth intercepting unnecessary requests to improve performance. We can listen for the page's request event and intercept requests; the prerequisite is to enable request interception with page.setRequestInterception(true).

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const blockTypes = new Set(['image', 'media', 'font']);
    await page.setRequestInterception(true); // Enable request interception
    page.on('request', request => {
        const type = request.resourceType();
        const shouldBlock = blockTypes.has(type);
        if (shouldBlock) {
            // Abort the request directly
            return request.abort();
        } else {
            // Continue the request with overrides
            return request.continue({
                // url, method, postData and headers can be overridden
                headers: Object.assign({}, request.headers(), {
                    'puppeteer-test': 'true'
                })
            });
        }
    });
    await page.goto('https://www.baidu.com/');
    await page.close();
    await browser.close();
})();

What events are available on the Page?

page.on('close'): The page is closed
page.on('console'): A console API method is called in the page
page.on('error'): The page crashes
page.on('load'): The page has finished loading
page.on('request'): The page issues a request
page.on('requestfailed'): A request fails
page.on('requestfinished'): A request finishes successfully
page.on('response'): A response is received
page.on('workercreated'): A WebWorker is created
page.on('workerdestroyed'): A WebWorker is destroyed
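
A few of these events are enough to build a simple diagnostic log. The sketch below is one possible arrangement, not a prescribed pattern: attachPageLogging is a hypothetical helper, and the log format is arbitrary. It uses msg.type(), msg.text(), request.url(), request.failure(), response.status() and response.url().

```javascript
// Subscribes to console, requestfailed and response events on a page
// and forwards a one-line description of each to `log`.
function attachPageLogging(page, log = console.log) {
  page.on('console', (msg) => log(`[console.${msg.type()}] ${msg.text()}`));
  page.on('requestfailed', (request) => {
    const failure = request.failure(); // null when no failure details are available
    const reason = failure ? failure.errorText : 'unknown error';
    log(`[requestfailed] ${request.url()} (${reason})`);
  });
  page.on('response', (response) => {
    // Only report error responses to keep the log short.
    if (response.status() >= 400) {
      log(`[response] ${response.status()} ${response.url()}`);
    }
  });
}

// Usage inside a Puppeteer script: call attachPageLogging(page) before page.goto(...)
```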

Case5: Getting WebSocket responses

Puppeteer does not currently provide a native API for handling WebSockets, but the data is available through the lower-level Chrome DevTools Protocol (CDP):

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Create a CDP session
    const cdpSession = await page.target().createCDPSession();
    // Enable the Network domain so that Chrome DevTools Protocol Network events are emitted
    await cdpSession.send('Network.enable');
    // Listen for the webSocketFrameReceived event to get the corresponding data
    cdpSession.on('Network.webSocketFrameReceived', frame => {
        const payloadData = frame.response.payloadData;
        if (payloadData.includes('push:query')) {
            // Parse payloadData to get the data pushed by the server
            const res = JSON.parse(payloadData.match(/\{.*\}/)[0]);
            if (res.code !== 200) {
                console.log(`Error calling websocket interface: code=${res.code}, message=${res.message}`);
            } else {
                console.log('Get webSocket data:', res.result);
            }
        }
    });
    await page.goto('https://netease.youdata.163.com/dash/142161/reportExport?pid=700209493');
    await page.waitForFunction('window.renderdone', {polling: 20});
    await page.close();
    await browser.close();
})();

Case6: Getting elements inside an iframe

A Frame contains an execution context, and functions cannot be executed across Frames. A page can have multiple Frames, mainly created by embedded iframe tags. Most of the functions on page are actually shorthand for page.mainFrame().xx. Frames form a tree, and we can traverse all of them through frame.childFrames(). To execute a function in another Frame, you must first get that Frame and then operate on it.

When logging in to the 188 mailbox, the login window is actually an embedded iframe. The following code finds that iframe and logs in:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({headless: false, slowMo: 50});
    const page = await browser.newPage();
    await page.goto('https://www.188.com');

    for (const frame of page.mainFrame().childFrames()) {
        // Find the iframe corresponding to the login page based on the URL
        if (frame.url().includes('passport.188.com')) {
            // Fill in the account and password (placeholder values)
            await frame.type('.dlemail', '[email protected]');
            await frame.type('.dlpwd', '123456');
            await Promise.all([
                frame.click('#dologin'),
                page.waitForNavigation()
            ]);
            break;
        }
    }
    await page.close();
    await browser.close();
})();
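
The loop above only inspects the direct children of the main frame. Because frames form a tree, a recursive search also reaches nested iframes. The following is a minimal sketch; findFrame is a hypothetical helper name, not part of the Puppeteer API, and it relies only on frame.childFrames() plus whatever the predicate checks.

```javascript
// Depth-first search over the frame tree. Returns the first frame that
// satisfies the predicate, or null when no frame matches.
function findFrame(frame, predicate) {
  if (predicate(frame)) return frame;
  for (const child of frame.childFrames()) {
    const match = findFrame(child, predicate);
    if (match) return match;
  }
  return null;
}

// Usage inside a Puppeteer script (page is an open Page):
//   const loginFrame = findFrame(page.mainFrame(), (f) => f.url().includes('passport.188.com'));
//   if (loginFrame) { /* call loginFrame.type(...), loginFrame.click(...) */ }
```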

Case7: Page performance analysis

Puppeteer provides a Tracing tool for page performance analysis. It is currently fairly limited: it only captures a page's performance data, and analyzing that data is left to us.

A browser can run only one trace at a time
The generated JSON file can be uploaded to the DevTools Performance panel to view the analysis results
We can write scripts that parse the data in trace.json for automated analysis
Tracing shows page loading speed and script execution performance

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.tracing.start({path: './files/trace.json'});
    await page.goto('https://www.google.com');
    await page.tracing.stop();
    /* continue analysis from 'trace.json' */
    await browser.close();
})();
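
As one way to continue the analysis from trace.json, the sketch below totals the time spent per event name. It assumes the file follows the Chrome trace event format, that is, an object with a traceEvents array (or a bare array) whose complete events have ph equal to 'X' and a dur in microseconds. summarizeTrace and summarizeTraceFile are hypothetical names.

```javascript
const fs = require('fs');

// Sums the duration of complete events (ph === 'X') per event name and
// returns the topN names by total time, in milliseconds.
function summarizeTrace(trace, topN = 5) {
  const events = Array.isArray(trace) ? trace : trace.traceEvents;
  const totals = new Map();
  for (const event of events) {
    if (event.ph !== 'X' || typeof event.dur !== 'number') continue;
    totals.set(event.name, (totals.get(event.name) || 0) + event.dur);
  }
  return [...totals.entries()]
    .map(([name, micros]) => ({ name, ms: micros / 1000 }))
    .sort((a, b) => b.ms - a.ms)
    .slice(0, topN);
}

// Convenience wrapper that reads and parses a trace file from disk.
function summarizeTraceFile(path, topN) {
  return summarizeTrace(JSON.parse(fs.readFileSync(path, 'utf8')), topN);
}

// Usage: console.table(summarizeTraceFile('./files/trace.json'));
```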

Case8: File upload and download

Automated tests often need to upload and download files. How is this done in Puppeteer?

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Set the download path through a CDP session
    const cdp = await page.target().createCDPSession();
    await cdp.send('Page.setDownloadBehavior', {
        behavior: 'allow', // Allow all download requests
        downloadPath: 'path/to/download' // Set the download path
    });
    // Click the button to trigger the download
    await (await page.waitForSelector('#someButton')).click();
    // Wait for the file to appear by polling for it.
    // waitForFile is not a Puppeteer API; it is a helper you define yourself
    await waitForFile('path/to/download/filename');

    // For uploads, inputElement must be an <input type="file"> element
    const inputElement = await page.waitForXPath('//input[@type="file"]');
    await inputElement.uploadFile('/path/to/file');
    await browser.close();
})();
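
The waitForFile function called above is not provided by Puppeteer, so the script needs its own implementation. A minimal polling sketch, with assumed default values for the timeout and polling interval, could look like this:

```javascript
const fs = require('fs');

// Resolves with filePath once the file exists, checking every intervalMs.
// Rejects if the file has not appeared after timeoutMs.
function waitForFile(filePath, { timeoutMs = 30000, intervalMs = 200 } = {}) {
  return new Promise((resolve, reject) => {
    const deadline = Date.now() + timeoutMs;
    const check = () => {
      if (fs.existsSync(filePath)) {
        resolve(filePath);
      } else if (Date.now() >= deadline) {
        reject(new Error(`Timed out waiting for ${filePath}`));
      } else {
        setTimeout(check, intervalMs);
      }
    };
    check();
  });
}

// Usage: await waitForFile('path/to/download/filename', { timeoutMs: 60000 });
```

This only checks that the file exists. If a download must be fully written before the next step, an additional check (for example, that the file size has stopped changing) may be needed.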

Case9: Switching to a new tab

When clicking a button opens a new tab, a new Page is created. How do we get the Page instance of that new tab? Listen for the targetcreated event on the Browser, which indicates that a new page has been created:

// Assumes browser and url are already defined and that this code runs inside an async function
let page = await browser.newPage();
await page.goto(url);
let btn = await page.waitForSelector('#btn');
// Before clicking the button, define a Promise that resolves to the Page object of the new tab
const newPagePromise = new Promise(res =>
  browser.once('targetcreated', target => res(target.page()))
);
await btn.click();
// After clicking the button, wait for the Page object of the new tab
let newPage = await newPagePromise;

Case10: Simulating different devices

Puppeteer can emulate different devices. puppeteer.devices defines the configuration of many devices, mainly their viewport and userAgent:

const puppeteer = require('puppeteer');
const iPhone = puppeteer.devices['iPhone 6'];
puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto('https://www.baidu.com');
  await browser.close();
});

Performance and optimization

About shared memory:

Chrome uses /dev/shm shared memory by default, but Docker limits /dev/shm to 64 MB by default, which is often not enough. There are two ways to deal with this:

Start the Docker container with --shm-size=1gb to enlarge /dev/shm. Note that swarm does not support the shm-size parameter
Start Chrome with --disable-dev-shm-usage so that it does not use /dev/shm shared memory

Reuse the same browser instance whenever possible so that the cache can be shared
Use request interception to block resources that do not need to be loaded
Just as opening many tabs in Chrome makes it sluggish, too many open pages slow Puppeteer down, so limit the number of tabs
A Chrome instance that has been running for a long time will inevitably suffer from memory leaks, page crashes and similar problems, so restart the Chrome instance periodically
To improve performance, turn off features you do not need, for example with --no-sandbox (disables the sandbox) and --disable-extensions
Avoid page.waitFor(1000) as much as possible; it is better to let the program itself determine when to continue
Because Puppeteer connects to the Chrome instance over WebSocket, sticky-session problems can occur with the WebSocket connection
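
The advice about sharing one browser instance and limiting the number of tabs can be sketched as a small worker pool. runWithTabLimit is a hypothetical helper, not a Puppeteer API; it assumes only browser.newPage() and page.close(), and the task callback is supplied by the caller.

```javascript
// Runs task(page, item) for every item on a single shared browser,
// never keeping more than maxTabs pages open at the same time.
async function runWithTabLimit(browser, items, maxTabs, task) {
  const results = new Array(items.length);
  let next = 0;

  async function worker() {
    // Each worker repeatedly claims the next unprocessed index.
    while (next < items.length) {
      const index = next++;
      const page = await browser.newPage();
      try {
        results[index] = await task(page, items[index]);
      } finally {
        await page.close(); // Always release the tab, even if the task throws
      }
    }
  }

  const workerCount = Math.min(maxTabs, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage inside a Puppeteer script (urls is an array of strings):
//   const titles = await runWithTabLimit(browser, urls, 4, async (page, url) => {
//     await page.goto(url);
//     return page.title();
//   });
```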
