Puppeteer is a Node.js package released by the Chrome development team in 2017, along with Headless Chrome, and is used to drive Chrome programmatically. It provides a high-level API to control headless Chrome or Chromium via the DevTools Protocol, and it can also be configured to use full (non-headless) Chrome or Chromium.
Before learning about Puppeteer, let's take a look at the Chrome DevTools Protocol and Headless Chrome.
What is the Chrome DevTools Protocol
- The Chrome DevTools Protocol (CDP) runs over WebSocket, which gives tools a fast data channel to the browser engine.
- CDP is divided into multiple domains (DOM, Debugger, Network, Profiler, Console…). Each domain defines related Commands and Events.
- Tools built on CDP can debug and analyze Chrome. Chrome Developer Tools itself is implemented on top of CDP.
- Other useful tools built on CDP include chrome-remote-interface and Puppeteer.
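To make the command/event split concrete, here is a minimal sketch of the JSON messages that travel over the CDP WebSocket. The helper names `buildCommand` and `classifyMessage` are invented for illustration and belong to no library. The message shapes are the protocol's: a command carries an `id` and a `Domain.method` name, its response echoes the `id`, and an event has a `method` but no `id`.

```javascript
// Sketch of CDP message shapes. buildCommand and classifyMessage are
// hypothetical helpers, not part of Puppeteer or any other library.
let nextId = 0;

// A command names a domain and a method ("Domain.method") and carries a unique id.
function buildCommand(method, params = {}) {
  nextId += 1;
  return JSON.stringify({ id: nextId, method, params });
}

// Responses echo the command id; events have a method but no id.
function classifyMessage(raw) {
  const msg = JSON.parse(raw);
  if ('id' in msg) return 'error' in msg ? 'error' : 'response';
  if ('method' in msg) return 'event';
  return 'unknown';
}

console.log(buildCommand('Page.navigate', { url: 'https://example.com' }));
console.log(classifyMessage('{"id":1,"result":{"frameId":"abc"}}'));
console.log(classifyMessage('{"method":"Network.requestWillBeSent","params":{}}'));
```

Puppeteer builds and matches these messages for you, which is why most scripts never touch them directly.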
What is Headless Chrome
- You can run Chrome in an environment without a graphical interface.
- You can operate Chrome from the command line or from a programming language.
- With no human intervention, operation is more stable.
- Start Chrome in headless mode by adding the --headless parameter when you launch it.
- Chrome accepts many other command-line parameters at startup.
Headless Chrome is the Chrome browser without a visible user interface: it lets you run applications using all of Chrome's supported features without having to open a browser window.
What is Puppeteer
- Puppeteer is a Node.js library.
- Puppeteer provides a series of APIs that control the behavior of Chromium/Chrome through the Chrome DevTools Protocol.
- By default, Puppeteer starts Chrome in headless mode. You can also start Chrome with a visible window by passing a launch option (headless: false).
- Each Puppeteer release is tied to a recent Chromium version by default, and can be pointed at a different browser version.
- Puppeteer allows us to communicate with the browser without knowing too much about the underlying CDP protocol.
What can Puppeteer do
According to the official documentation, most things you can do manually in a browser can be done with Puppeteer. For example:
- Generate screenshots and PDFs of pages.
- Crawl SPA or SSR sites.
- Automate form submission, UI testing, keyboard input, etc.
- Create an up-to-date automated test environment. Run tests directly in the latest version of Chrome, using the latest JavaScript and browser features.
- Capture a timeline trace of the site to help diagnose performance problems.
- Test Chrome extensions.
The Puppeteer API is layered
The API hierarchy in Puppeteer is basically the same as that in the browser. Here are some of the classes that are commonly used:
- Browser: corresponds to a browser instance. A Browser can contain multiple BrowserContexts
- BrowserContext: each BrowserContext has a separate session (cookies and cache are not shared), just like opening a normal Chrome window and then opening another one in incognito mode. A BrowserContext can contain multiple Pages
- Page: represents a tab, created with browserContext.newPage() or browser.newPage(). browser.newPage() creates the page in the default BrowserContext. A Page can contain multiple Frames
- Frame: each page has one main frame (page.mainFrame()) and may have multiple child frames, created primarily by iframe tags
- ExecutionContext: the JavaScript execution environment. Each Frame has a default JavaScript execution environment
- ElementHandle: corresponds to an element node in the DOM. An instance can be used to click the element or fill in a form. Elements can be obtained with selectors, XPath, etc.
- JSHandle: corresponds to a JavaScript object in the page. ElementHandle inherits from JSHandle. Because Node cannot operate on in-page objects directly, they are wrapped as JSHandles to provide the related functions
- CDPSession: communicates with native CDP directly, sending messages with session.send and receiving events with session.on, so you can reach functionality that the Puppeteer API does not cover
- Coverage: gathers JavaScript and CSS code coverage
- Tracing: captures performance data for analysis
- Response: represents a response received by the page
- Request: represents a request made by the page
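The layering can be seen by walking it from the top. The following is a minimal sketch, not an official recipe: `describeBrowser` is a hypothetical helper name, and it takes an already launched `browser` as a parameter so that it only relies on the methods `browserContexts()`, `pages()`, `url()` and `frames()`.

```javascript
// Walks Browser -> BrowserContext -> Page -> Frame and returns a plain summary.
// describeBrowser is a hypothetical helper, not part of Puppeteer.
async function describeBrowser(browser) {
  const summary = [];
  for (const context of browser.browserContexts()) {
    const pages = await context.pages(); // pages() returns a Promise
    summary.push({
      pages: pages.map((page) => ({
        url: page.url(),
        // frames() lists the main frame plus every child frame of the page
        frameUrls: page.frames().map((frame) => frame.url()),
      })),
    });
  }
  return summary;
}
```

With a real browser this would be called as `console.log(await describeBrowser(await puppeteer.launch()))` inside an async function.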
Puppeteer Installation and Environment
Note: Puppeteer versions before v1.18.1 need at least Node v6.4.0. Versions v1.18.1 through v2.1.0 need Node 8.9.0+. Starting with v3.0.0, Puppeteer needs Node 10.18.1+. async/await is only supported from Node v7.6.0 onward.
Puppeteer is a Node.js package, so installing it is simple:
npm install puppeteer
# or
yarn add puppeteer
npm may fail while installing Puppeteer. The cause is that the Chromium download is hosted outside the local network; using a proxy or installing through the Taobao mirror (cnpm) solves it.
When Puppeteer is installed, it downloads a recent version of Chromium. Since version 1.7.0 the puppeteer-core package is also officially available. It downloads no browser by default and is meant for launching an existing browser or connecting to a remote one. Make sure the puppeteer-core version you install is compatible with the browser you intend to connect to.
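As a rough sketch of the two puppeteer-core modes described above, the helper below either launches a locally installed browser or attaches to a running one. `openBrowser` is a hypothetical name, and the module is passed in as a parameter so the sketch does not assume a particular installation. `launch({executablePath})` and `connect({browserWSEndpoint})` are the Puppeteer calls it relies on.

```javascript
// openBrowser is a hypothetical helper. `puppeteer` is the puppeteer-core module,
// e.g. require('puppeteer-core'), passed in by the caller.
async function openBrowser(puppeteer, { executablePath, browserWSEndpoint } = {}) {
  if (browserWSEndpoint) {
    // Attach to a browser that is already running with remote debugging enabled
    return puppeteer.connect({ browserWSEndpoint });
  }
  if (!executablePath) {
    throw new Error('puppeteer-core downloads no browser: pass executablePath or browserWSEndpoint');
  }
  // Launch a locally installed Chrome or Chromium
  return puppeteer.launch({ executablePath });
}
```

The executable path and the WebSocket endpoint depend on your machine, so both are left to the caller.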
Puppeteer use cases
Case1: Screenshots
Puppeteer can take screenshots of both a whole page and a single element in the page:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Set the viewport size. The default is 800x600
  await page.setViewport({width: 1920, height: 800});
  await page.goto('https://www.baidu.com/');
  // Take a screenshot of the entire page
  await page.screenshot({
    path: './files/baidu_home.png', // Image save path
    type: 'png',
    fullPage: true // Capture the full scrollable page
    // clip: {x: 0, y: 0, width: 1920, height: 800}
  });
  // Take a screenshot of a single element on the page
  const element = await page.$('#s_lg_img');
  await element.screenshot({
    path: './files/baidu_logo.png'
  });
  await page.close();
  await browser.close();
})();
How do we get an element in a page?
- page.$('#uniqueId'): Gets the first element matching a selector
- page.$$('div'): Gets all elements matching a selector
- page.$x('//img'): Gets all elements matching an XPath expression
- page.waitForXPath('//img'): Waits for an element matching an XPath expression to appear
- page.waitForSelector('#uniqueId'): Waits for an element matching a selector to appear
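A common combination is to wait for an element and then read something from it. The sketch below shows one way to do that. `textOf` is a hypothetical helper name, and the 5000 ms default timeout is an arbitrary choice for illustration.

```javascript
// textOf is a hypothetical helper: wait for a selector, then read its text.
async function textOf(page, selector, timeout = 5000) {
  // waitForSelector resolves with an ElementHandle once the element is in the DOM
  const handle = await page.waitForSelector(selector, { timeout });
  // The callback runs in the browser; the handle arrives there as the DOM element
  return page.evaluate((el) => el.textContent.trim(), handle);
}
```

Waiting first avoids the null that page.$ returns when the element has not been rendered yet.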
Case2: Simulate user operations
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    slowMo: 100, // Slow down each operation by 100 ms
    headless: false, // Show the browser window
    defaultViewport: {width: 1440, height: 780},
    ignoreHTTPSErrors: false, // Set to true to ignore HTTPS errors
    args: ['--start-fullscreen'] // Open the window in full screen
  });
  const page = await browser.newPage();
  await page.goto('https://www.baidu.com/');
  // Enter text
  const inputElement = await page.$('#kw');
  await inputElement.type('hello world', {delay: 20});
  // Click the search button
  const okButtonElement = await page.$('#su');
  // When a click triggers a navigation, start page.waitForNavigation() together
  // with the click so the navigation cannot finish before we begin waiting for it
  await Promise.all([
    okButtonElement.click(),
    page.waitForNavigation()
  ]);
  await page.close();
  await browser.close();
})();
What functions does ElementHandle provide to manipulate elements?
- elementHandle.click(): Clicks an element
- elementHandle.tap(): Simulates a finger tap
- elementHandle.focus(): Focuses an element
- elementHandle.hover(): Hovers the mouse over an element
- elementHandle.type('hello'): Types text into an input box
Case3: Embed JavaScript code
The most powerful feature of Puppeteer is that you can execute any JavaScript code you want in the browser. The following code crawls the hot-search recommendations on the Baidu home page.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.baidu.com/');
  // Use page.evaluate to execute code in the browser
  const resultData = await page.evaluate(() => {
    const listEle = [...document.querySelectorAll('#hotsearch-content-wrapper .hotsearch-item')];
    return listEle.map((ele) => {
      const urlEle = ele.querySelector('a.c-link');
      const titleEle = ele.querySelector('.title-content-title');
      return {
        href: urlEle.href,
        title: titleEle.innerText,
      };
    });
  });
  console.log(resultData);
  await page.close();
  await browser.close();
})();
What functions are available to execute code in the browser environment?
- page.evaluate(pageFunction[, ...args]): Executes a function in the browser environment
- page.evaluateHandle(pageFunction[, ...args]): Executes a function in the browser environment and returns a JSHandle
- page.$$eval(selector, pageFunction[, ...args]): Passes all elements matching the selector to the function and executes it in the browser environment
- page.$eval(selector, pageFunction[, ...args]): Passes the first element matching the selector to the function and executes it in the browser environment
- page.evaluateOnNewDocument(pageFunction[, ...args]): Executes in the browser environment whenever a new document is created, before any of the page's scripts run
- page.exposeFunction(name, puppeteerFunction): Registers a function on the page's window object. The function itself runs in Node, so code in the browser gets a way to call Node.js libraries
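As an illustration of passing extra arguments into the browser, here is a small sketch built on page.$$eval. `collectLinks` is a hypothetical helper name and the default limit of 10 is arbitrary.

```javascript
// collectLinks is a hypothetical helper built on page.$$eval.
async function collectLinks(page, selector, limit = 10) {
  // The callback runs in the browser. Extra arguments (here `max`) are serialized
  // and passed after the matched elements, and only serializable values come back.
  return page.$$eval(
    selector,
    (elements, max) =>
      elements.slice(0, max).map((a) => ({ href: a.href, text: a.textContent.trim() })),
    limit
  );
}
```

Because the callback is shipped to the browser, it cannot close over Node variables. Anything it needs has to travel through the argument list, as `limit` does here.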
Case4: Request interception
In some scenarios it is worth intercepting requests that are not needed in order to improve performance. We can listen for the page's request event and intercept requests there. The prerequisite is to enable interception with page.setRequestInterception(true).
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const blockTypes = new Set(['image', 'media', 'font']);
  await page.setRequestInterception(true); // Enable request interception
  page.on('request', request => {
    const type = request.resourceType();
    const shouldBlock = blockTypes.has(type);
    if (shouldBlock) {
      // Abort the request directly
      return request.abort();
    } else {
      // Continue with an overridden request
      return request.continue({
        // url, method, postData and headers can be overridden
        headers: Object.assign({}, request.headers(), {
          'puppeteer-test': 'true'
        })
      });
    }
  });
  await page.goto('https://www.baidu.com/');
  await page.close();
  await browser.close();
})();
What events are available on the Page?
- page.on('close'): The page is closed
- page.on('console'): The console API is called
- page.on('error'): The page crashed
- page.on('load'): The page finished loading
- page.on('request'): The page issued a request
- page.on('requestfailed'): A request failed
- page.on('requestfinished'): A request finished successfully
- page.on('response'): A response was received
- page.on('workercreated'): A WebWorker was created
- page.on('workerdestroyed'): A WebWorker was destroyed
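These events are handy for debugging a script that behaves differently from what you see by hand. The sketch below wires three of them to a logger. `attachPageLogging` is a hypothetical helper name, and the log line formats are made up for the example.

```javascript
// attachPageLogging is a hypothetical helper; `log` defaults to console.log.
function attachPageLogging(page, log = console.log) {
  // Messages the page writes through the console API
  page.on('console', (msg) => log(`[console.${msg.type()}] ${msg.text()}`));
  // Requests that failed, for example aborted or unreachable ones
  page.on('requestfailed', (request) => {
    const failure = request.failure(); // null when no failure info is available
    log(`[requestfailed] ${request.url()} ${failure ? failure.errorText : ''}`.trim());
  });
  // Responses with an error status code
  page.on('response', (response) => {
    if (response.status() >= 400) log(`[response ${response.status()}] ${response.url()}`);
  });
}
```

Call it right after browser.newPage() and before page.goto() so that nothing emitted during the first load is missed.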
Case5: Get WebSocket responses
Puppeteer does not currently provide a native API for handling WebSockets, but the data is available through the lower-level Chrome DevTools Protocol (CDP).
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Create a CDP session
  const cdpSession = await page.target().createCDPSession();
  // Enable the Network domain so that CDP network events are delivered
  await cdpSession.send('Network.enable');
  // Listen for the webSocketFrameReceived event to get the frame data
  cdpSession.on('Network.webSocketFrameReceived', frame => {
    const payloadData = frame.response.payloadData;
    if (payloadData.includes('push:query')) {
      // Parse payloadData to get the data pushed by the server
      const res = JSON.parse(payloadData.match(/\{.*\}/)[0]);
      if (res.code !== 200) {
        console.log(`Error calling websocket interface: code=${res.code}, message=${res.message}`);
      } else {
        console.log('Get webSocket data:', res.result);
      }
    }
  });
  await page.goto('https://netease.youdata.163.com/dash/142161/reportExport?pid=700209493');
  await page.waitForFunction('window.renderdone', {polling: 20});
  await page.close();
  await browser.close();
})();
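The frame handler above throws if a payload contains no JSON object or malformed JSON. A more defensive parsing step could look like the following sketch. `parsePushFrame` is a hypothetical helper, and it assumes the payload is text that wraps a single JSON object, as the example above does.

```javascript
// parsePushFrame is a hypothetical helper. It assumes the frame payload is text
// that wraps a single JSON object.
function parsePushFrame(payloadData) {
  const start = payloadData.indexOf('{');
  const end = payloadData.lastIndexOf('}');
  if (start === -1 || end <= start) return null; // no JSON object in this frame
  try {
    return JSON.parse(payloadData.slice(start, end + 1));
  } catch (err) {
    return null; // malformed JSON: let the caller skip the frame
  }
}
```

Returning null instead of throwing keeps one bad frame from crashing the event listener.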
Case6: Access elements in an iframe
Each Frame has its own execution context, and a function cannot be executed across frames. A page can have multiple frames, which are created by embedded iframe tags. Most functions on the page are actually shorthand for page.mainFrame().xx. Frames form a tree, and we can walk through all of them with frame.childFrames(). To execute a function in another frame, you first have to get the corresponding Frame object.
On the 188 mail login page (188.com), the login window is actually an embedded iframe. The following code finds that iframe and logs in:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({headless: false, slowMo: 50});
  const page = await browser.newPage();
  await page.goto('https://www.188.com');
  for (const frame of page.mainFrame().childFrames()) {
    // Find the iframe of the login form based on its URL
    if (frame.url().includes('passport.188.com')) {
      await frame.type('.dlemail', '[email protected]');
      await frame.type('.dlpwd', '123456');
      await Promise.all([
        frame.click('#dologin'),
        page.waitForNavigation()
      ]);
      break;
    }
  }
  await page.close();
  await browser.close();
})();
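The loop above only inspects the direct children of the main frame. Since frames form a tree, a nested iframe needs a recursive search. The following is a minimal sketch of one. `findFrame` is a hypothetical helper name, and it only relies on frame.childFrames().

```javascript
// findFrame is a hypothetical helper: depth-first search of the frame tree.
function findFrame(frame, predicate) {
  if (predicate(frame)) return frame;
  for (const child of frame.childFrames()) {
    const match = findFrame(child, predicate);
    if (match) return match;
  }
  return null; // no frame in this subtree satisfies the predicate
}
```

For the login example it would be called as `findFrame(page.mainFrame(), (f) => f.url().includes('passport.188.com'))`.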
Case7: Page performance analysis
Puppeteer provides a Tracing tool for page performance analysis. It is currently fairly limited: it only produces a file of page performance data, and analyzing that data is left to us.
- A browser can run only one trace at a time
- The resulting JSON file can be loaded into the DevTools Performance panel to view the analysis
- We can write scripts that parse trace.json for automated analysis
- Tracing shows page loading speed and script execution performance
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.tracing.start({path: './files/trace.json'});
  await page.goto('https://www.google.com');
  await page.tracing.stop();
  /* continue analysis from 'trace.json' */
  await browser.close();
})();
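As a starting point for scripted analysis of trace.json, the sketch below totals the time spent per event name. `summarizeTrace` and `summarizeTraceFile` are hypothetical helpers. The sketch assumes the file is a JSON object with a `traceEvents` array whose complete events (`ph === 'X'`) carry a duration `dur` in microseconds. Check that against a trace produced by your own Chrome version before relying on it.

```javascript
const fs = require('fs');

// summarizeTrace is a hypothetical helper. Assumed input shape:
// { traceEvents: [{ name, ph, dur, ... }] } with `dur` in microseconds.
function summarizeTrace(trace, top = 5) {
  const totals = new Map();
  for (const event of trace.traceEvents || []) {
    // Only complete events ('X') carry a duration
    if (event.ph !== 'X' || typeof event.dur !== 'number') continue;
    totals.set(event.name, (totals.get(event.name) || 0) + event.dur);
  }
  return [...totals.entries()]
    .map(([name, micros]) => ({ name, ms: micros / 1000 }))
    .sort((a, b) => b.ms - a.ms)
    .slice(0, top);
}

// Reads a trace file from disk and summarizes it.
function summarizeTraceFile(path, top) {
  return summarizeTrace(JSON.parse(fs.readFileSync(path, 'utf8')), top);
}
```

For the trace written above, `summarizeTraceFile('./files/trace.json')` would list the event names that took the most total time.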
Case8: File upload and download
Uploading and downloading files comes up often in automated testing. How is this done in Puppeteer?
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Set the download path through a CDP session
  const cdp = await page.target().createCDPSession();
  await cdp.send('Page.setDownloadBehavior', {
    behavior: 'allow', // Allow all download requests
    downloadPath: 'path/to/download' // Set the download path
  });
  // Click the button to trigger the download
  await (await page.waitForSelector('#someButton')).click();
  // Wait for the file to appear by polling for it.
  // waitForFile is a user-defined helper, not part of Puppeteer
  await waitForFile('path/to/download/filename');
  // Upload: inputElement must be an input element of type file
  const inputElement = await page.waitForXPath('//input[@type="file"]');
  await inputElement.uploadFile('/path/to/file');
  await browser.close();
})();
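The download code above calls waitForFile, which Puppeteer does not provide. A minimal polling implementation could look like the sketch below. The name, the option names and the default timeout and interval are all choices made for this example.

```javascript
const fs = require('fs');

// waitForFile is a hypothetical helper (not part of Puppeteer): it polls until
// the path exists, or rejects once the timeout has passed.
function waitForFile(filePath, { timeout = 30000, interval = 200 } = {}) {
  const deadline = Date.now() + timeout;
  return new Promise((resolve, reject) => {
    const check = () => {
      if (fs.existsSync(filePath)) return resolve(filePath);
      if (Date.now() >= deadline) {
        return reject(new Error(`Timed out waiting for ${filePath}`));
      }
      setTimeout(check, interval); // not there yet, look again later
    };
    check();
  });
}
```

Chrome typically writes a download under a temporary .crdownload name and renames it when finished, so polling for the final file name usually also means the download is complete.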
Case9: Switch to a new tab
When clicking a button opens a new tab, a new Page is created. How do we get the Page instance of that new tab? We can listen for the targetcreated event on the Browser, which signals that a new target (here, a page) has been created:
let page = await browser.newPage();
await page.goto(url);
let btn = await page.waitForSelector('#btn');
// Before clicking the button, define a Promise that resolves to the Page object of the new tab
const newPagePromise = new Promise(res =>
  browser.once('targetcreated', target => res(target.page()))
);
await btn.click();
// After clicking the button, wait for the Page object of the new tab
let newPage = await newPagePromise;
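The same pattern can be wrapped into a reusable function that also ignores non-page targets and gives up after a timeout. The following is a sketch under those assumptions. `waitForNewPage` is a hypothetical helper, and the 10 second default is arbitrary.

```javascript
// waitForNewPage is a hypothetical helper. `trigger` is any async action that
// opens a tab, for example () => btn.click().
async function waitForNewPage(browser, trigger, timeout = 10000) {
  let onTarget;
  let timer;
  const opened = new Promise((resolve, reject) => {
    onTarget = (target) => {
      // Ignore workers and other non-page targets
      if (target.type() === 'page') target.page().then(resolve, reject);
    };
    timer = setTimeout(() => reject(new Error('No new tab opened in time')), timeout);
  });
  browser.on('targetcreated', onTarget); // subscribe before triggering the click
  try {
    await trigger();
    return await opened;
  } finally {
    clearTimeout(timer);
    browser.off('targetcreated', onTarget);
  }
}
```

Usage would be `const newPage = await waitForNewPage(browser, () => btn.click());`.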
Case10: Simulate different devices
Puppeteer can emulate different devices. puppeteer.devices defines the configuration of many devices, mainly their viewport and userAgent:
const puppeteer = require('puppeteer');
const iPhone = puppeteer.devices['iPhone 6'];
puppeteer.launch().then(async browser => {
const page = await browser.newPage();
await page.emulate(iPhone);
await page.goto('https://www.baidu.com');
await browser.close();
});
Performance and optimization
- About shared memory: Chrome uses /dev/shm shared memory by default, but Docker limits /dev/shm to 64MB by default, which is not enough. There are two ways to solve this:
  - Start Docker with the parameter --shm-size=1gb to enlarge /dev/shm. Note that swarm does not support the shm-size parameter
  - Start Chrome with the parameter --disable-dev-shm-usage so that it does not use /dev/shm shared memory
- Try to reuse the same browser instance so that the cache can be shared
- Use request interception to skip resources that do not need to be loaded
- Just as in everyday Chrome use, too many open tabs will make things sluggish, so limit the number of tabs
- A Chrome instance that has been running for a long time will inevitably run into memory leaks, page crashes and similar problems, so restart the Chrome instance periodically
- To speed things up, turn off features you do not need, for example with --no-sandbox (disables the sandbox), --disable-extensions, etc.
- Avoid page.waitFor(1000) as much as possible. It is better to wait for a condition that the page itself signals
- Because Puppeteer connects to a Chrome instance over WebSocket, WebSocket sticky session problems can arise
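One way to keep the number of open tabs under control is to run page tasks through a small concurrency limiter. The sketch below is generic JavaScript rather than a Puppeteer feature. `runWithLimit` is a hypothetical helper, and each task is assumed to be an async function that opens, uses and closes its own page.

```javascript
// runWithLimit is a hypothetical helper: runs async tasks with at most `limit`
// in flight, e.g. one task per page so the number of open tabs stays bounded.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    // Each worker keeps pulling the next unstarted task until none are left
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker);
  await Promise.all(workers);
  return results; // same order as the input tasks
}
```

A typical task would look like `async () => { const page = await browser.newPage(); try { /* work */ } finally { await page.close(); } }`, which also reuses a single browser instance as recommended above.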