In this article, you will learn about Puppeteer, a library published and maintained by Google, including its basic concepts and its most commonly used functions.

A brief introduction to Puppeteer

Puppeteer is a Node library that provides a high-level API for controlling Chromium or Chrome via the DevTools protocol. With Puppeteer you can access page DOM nodes, inspect network requests and responses, manipulate page behavior programmatically, monitor and optimize page performance, capture screenshots and PDFs, and automate many other tasks in Chrome.

The Puppeteer core structure

Puppeteer’s structure mirrors that of the browser, and its core structure is as follows:

  1. Browser: An instance of a browser, which can own one or more browser contexts. A Browser object is created through puppeteer.launch or puppeteer.connect.
  2. BrowserContext: A browser context, which can own multiple pages. A default browser context is created together with the browser instance and cannot be closed. In addition, browser.createIncognitoBrowserContext() creates an incognito browser context that does not share cookies or cache with other browser contexts.
  3. Page: Contains at least one frame, the main frame. Other frames, such as those created by iframe elements, may exist alongside it.
  4. Frame: A frame in the page. At any point in time, the page exposes its current frame tree through the page.mainFrame() and frame.childFrames() methods. Each frame has at least one execution context.
  5. ExecutionContext: Represents a JavaScript execution context.
  6. Worker: Has a single execution context and makes it easy to interact with WebWorkers.
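
To make the Page and Frame relationship above concrete, here is a minimal sketch that walks a frame tree and returns an indented outline. The helper name describeFrameTree is hypothetical; it relies only on the frame.url() and frame.childFrames() methods, so it accepts any object shaped like a Puppeteer Frame.

```javascript
// Sketch: collect an indented outline of a page's frame tree.
// `frame` is anything exposing url() and childFrames(), as Puppeteer's Frame does.
function describeFrameTree(frame, depth = 0) {
  const lines = [`${'  '.repeat(depth)}${frame.url()}`];
  for (const child of frame.childFrames()) {
    // Recurse into nested frames, indenting one level deeper
    lines.push(...describeFrameTree(child, depth + 1));
  }
  return lines;
}

// Usage inside an async function, given a Puppeteer `page`:
//   console.log(describeFrameTree(page.mainFrame()).join('\n'));
```

Starting from page.mainFrame() should list the main frame first, followed by any iframes nested beneath it.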

Basic use and common functions

Puppeteer is fairly simple to use overall. The sections below walk through its most common operations.

3.1 Starting the Browser

Start the browser by calling puppeteer.launch() asynchronously. It creates a Browser instance according to the configuration options you pass.

    const path = require('path');
    const puppeteer = require('puppeteer');

    // Path to a local Chromium executable
    const chromiumPath = path.join(__dirname, '../', 'chromium/chromium/chrome.exe');

    async function main() {
      // Start the Chrome browser
      const browser = await puppeteer.launch({
        // Specify the browser path
        executablePath: chromiumPath,
        // Whether to run in headless mode; headless is the default
        headless: false
      });
    }

    main();
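
The launch step can also be wrapped so that the browser is always closed afterwards. The following is a minimal sketch, not the article's own code: buildLaunchOptions and withBrowser are hypothetical helper names, and the sketch assumes the full puppeteer package, which normally uses the browser it downloaded at install time when executablePath is omitted.

```javascript
// Hypothetical helper: build launch options, leaving out executablePath when
// no custom browser path is given so Puppeteer uses its own default browser.
function buildLaunchOptions({ executablePath, headless = true } = {}) {
  const options = { headless };
  if (executablePath) {
    options.executablePath = executablePath;
  }
  return options;
}

// Hypothetical helper: launch a browser, run a task, and always clean up.
async function withBrowser(task, config) {
  // Loaded lazily so buildLaunchOptions can be used without Puppeteer installed
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch(buildLaunchOptions(config));
  try {
    return await task(browser);
  } finally {
    // Close the browser even if the task throws
    await browser.close();
  }
}

// Usage:
//   withBrowser(async browser => { /* open pages here */ }, { headless: false });
```

Closing the browser in a finally block helps avoid leaving stray Chrome processes behind when a script fails partway through.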

3.2 Accessing the Page

To access a page, you first create a browser context, then create a new page based on that context, and finally specify the URL to visit.

    async function main() {
      // Start the Chrome browser
      // ...

      // Create a new page in the default browser context
      const page1 = await browser.newPage();
      // A new page is blank; navigate it to the specified URL
      await page1.goto('https://51yangsheng.com');

      // Create an incognito browser context
      const browserContext = await browser.createIncognitoBrowserContext();
      // Create a new page in that context
      const page2 = await browserContext.newPage();
      await page2.goto('https://www.baidu.com');
    }

    main();

3.3 Device Simulation

When you need to see how a page behaves on a particular device, you can use device simulation. The example below simulates the browser of an iPhone X.

    async function main() {
      // Start the browser and open a page (omitted)

      // Device simulation: simulate an iPhone X
      // User agent
      await page1.setUserAgent('Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1');
      // Viewport size
      await page1.setViewport({ width: 375, height: 812 });
    }

    main();
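
The user agent and viewport can also be applied together with page.emulate, which takes a descriptor containing both. The sketch below uses a hand-written descriptor; the deviceScaleFactor, isMobile, and hasTouch values are assumptions chosen for illustration, and emulateIPhoneX is a hypothetical helper name.

```javascript
// Hand-written descriptor for illustration. The user agent and size come from
// the iPhone X example; deviceScaleFactor, isMobile and hasTouch are assumed.
const iPhoneX = {
  userAgent: 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
  viewport: {
    width: 375,
    height: 812,
    deviceScaleFactor: 3,
    isMobile: true,
    hasTouch: true
  }
};

// Hypothetical helper: apply the descriptor to a Puppeteer page.
async function emulateIPhoneX(page) {
  // page.emulate sets the user agent and the viewport in one call
  await page.emulate(iPhoneX);
}

// Usage inside an async function, before navigating:
//   await emulateIPhoneX(page1);
//   await page1.goto('https://www.baidu.com');
```

Applying the emulation before page.goto is usually the safer order, since many sites decide which layout to serve from the first request.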

3.4 Obtaining a DOM Node

There are two ways to get a DOM node. One is to call the methods that the Page object provides directly, such as page.$eval; the other is to run JavaScript inside the page with page.evaluate.

    async function main() {
      // Start the Chrome browser
      const browser = await puppeteer.launch({
        // Specify the browser path
        executablePath: chromiumPath,
        // Whether to run in headless mode; headless is the default
        headless: false
      });

      // Create a new page in the default browser context
      const page1 = await browser.newPage();
      // A new page is blank; navigate it to the specified URL
      await page1.goto('https://www.baidu.com');

      // Wait for the title node to appear
      await page1.waitForSelector('title');

      // Method 1: use the Page object's own $eval method
      const titleDomText1 = await page1.$eval('title', el => el.innerText);
      console.log(titleDomText1);

      // Method 2: run JavaScript in the page with evaluate
      const titleDomText2 = await page1.evaluate(() => {
        const titleDom = document.querySelector('title');
        return titleDom.innerText;
      });
      console.log(titleDomText2);
    }

    main();
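
Both approaches above read a single node. When several nodes match, page.$$eval passes all of them to the callback as an array. A minimal sketch, where collectTexts is a hypothetical helper name and the default 'a' selector is only an example:

```javascript
// Hypothetical helper: return the non-empty, trimmed text of every element
// matching `selector`. The callback runs inside the page, not in Node.
async function collectTexts(page, selector = 'a') {
  return page.$$eval(selector, elements =>
    elements
      .map(el => el.textContent.trim())
      .filter(text => text.length > 0)
  );
}

// Usage inside an async function:
//   const linkTexts = await collectTexts(page1, 'a');
//   console.log(linkTexts);
```

Because the callback is serialized and executed in the browser, it cannot reference variables from the surrounding Node scope; extra values need to be passed as additional arguments to $$eval.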

3.5 Listening for Requests and responses

The example below listens for the request and response of one JS script loaded by the Baidu home page. The request event fires when the page issues a request, and the response event fires when a response is received.

    async function main() {
      // Start the Chrome browser
      const browser = await puppeteer.launch({
        // Specify the browser path
        executablePath: chromiumPath,
        // Whether to run in headless mode; headless is the default
        headless: false
      });

      // Create a new page in the default browser context
      const page1 = await browser.newPage();

      // Log details of the request for the target script
      page1.on('request', request => {
        if (request.url() === 'https://s.bdstatic.com/common/openjs/amd/eslx.js') {
          console.log(request.resourceType());
          console.log(request.method());
          console.log(request.headers());
        }
      });

      // Log details of the response for the target script
      page1.on('response', response => {
        if (response.url() === 'https://s.bdstatic.com/common/openjs/amd/eslx.js') {
          console.log(response.status());
          console.log(response.headers());
        }
      });

      // Register the listeners first, then navigate to the specified URL
      await page1.goto('https://www.baidu.com');
    }

    main();
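
The same request event can be used to summarize everything a page loads rather than a single URL. The sketch below tallies requests by resource type; countRequestsByType is a hypothetical helper name, and it uses only page.on and request.resourceType().

```javascript
// Hypothetical helper: count the requests a page issues, grouped by resource
// type (for example 'document', 'script', 'image').
function countRequestsByType(page) {
  const counts = {};
  page.on('request', request => {
    const type = request.resourceType();
    counts[type] = (counts[type] || 0) + 1;
  });
  // The returned object keeps updating as further requests are observed
  return counts;
}

// Usage inside an async function:
//   const counts = countRequestsByType(page1);
//   await page1.goto('https://www.baidu.com');
//   console.log(counts);
```

As in the example above, the listener has to be registered before page.goto is called, otherwise the earliest requests are missed.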

3.6 Intercepting a Request

By default, the requests delivered by the request event are read-only and cannot be intercepted. To intercept requests, first enable request interception with page.setRequestInterception(value), then use the request.abort, request.continue, and request.respond methods to decide what happens to each request.

    async function main() {
      // Start the Chrome browser
      const browser = await puppeteer.launch({
        // Specify the browser path
        executablePath: chromiumPath,
        // Whether to run in headless mode; headless is the default
        headless: false
      });

      // Create a new page in the default browser context
      const page1 = await browser.newPage();

      // Enable request interception: true to enable, false to disable
      await page1.setRequestInterception(true);
      page1.on('request', request => {
        if (request.url() === 'https://s.bdstatic.com/common/openjs/amd/eslx.js') {
          // Abort the request
          request.abort();
          console.log('This request was aborted!!');
        } else {
          // Continue the request
          request.continue();
        }
      });

      // A new page is blank; navigate it to the specified URL
      await page1.goto('https://www.baidu.com');
    }

    main();
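
Interception is often keyed on the resource type instead of an exact URL, for example to skip images and fonts and speed up page loads. A minimal sketch under that assumption; BLOCKED_TYPES, handleRequest, and blockHeavyResources are hypothetical names, and the set of blocked types is only an example.

```javascript
// Resource types to block; this particular list is an assumption for illustration.
const BLOCKED_TYPES = new Set(['image', 'font', 'media']);

// Decide the fate of a single intercepted request.
function handleRequest(request) {
  if (BLOCKED_TYPES.has(request.resourceType())) {
    return request.abort();
  }
  return request.continue();
}

// Hypothetical helper: enable interception and attach the handler.
async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on('request', handleRequest);
}

// Usage inside an async function, before navigating:
//   await blockHeavyResources(page1);
//   await page1.goto('https://www.baidu.com');
```

Once interception is enabled, every request stalls until it is continued, aborted, or responded to, so the handler needs a branch that covers all requests, as the else branch does in the example above.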

3.7 Taking a Screenshot

Screenshots are very useful: saving a snapshot of the page makes later troubleshooting easier. (Note: take screenshots in headless mode; otherwise the screenshot may have problems.)

    async function main() {
      // Start the browser and open a page (omitted)

      // Take screenshots with page.screenshot
      // Full-page screenshot: set the fullPage option to true
      await page1.screenshot({ path: '../imgs/fullScreen.png', fullPage: true });

      // Screenshot of a region of the page
      await page1.screenshot({
        path: '../imgs/partScreen.jpg',
        type: 'jpeg',
        quality: 80,
        clip: { x: 0, y: 0, width: 375, height: 300 }
      });

      await browser.close();
    }

    main();
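
The two calls above differ in their type and quality options. Since quality applies to JPEG output and not to PNG, it can help to derive the options from the file name. The following is a sketch only: buildScreenshotOptions is a hypothetical helper, and treating every non-JPEG extension as PNG is a simplifying assumption.

```javascript
const path = require('path');

// Hypothetical helper: derive page.screenshot options from the target file name.
function buildScreenshotOptions(filePath, { fullPage = false, quality } = {}) {
  const ext = path.extname(filePath).toLowerCase();
  // Simplification: anything that is not .jpg/.jpeg is written as PNG
  const type = ext === '.jpg' || ext === '.jpeg' ? 'jpeg' : 'png';
  const options = { path: filePath, type, fullPage };
  // Only pass quality for JPEG; it does not apply to PNG images
  if (type === 'jpeg' && quality !== undefined) {
    options.quality = quality;
  }
  return options;
}

// Usage inside an async function:
//   await page1.screenshot(buildScreenshotOptions('../imgs/fullScreen.png', { fullPage: true }));
//   await page1.screenshot(buildScreenshotOptions('../imgs/partScreen.jpg', { quality: 80 }));
```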

3.8 Generating a PDF

In addition to preserving snapshots using screenshots, you can also preserve snapshots using PDF.

    async function main() {
      // Start the browser and open a page (omitted)

      // Generate a PDF file from the page content with page.pdf
      // Note: PDF generation has traditionally required headless mode
      await page1.pdf({ path: '../pdf/baidu.pdf' });

      await browser.close();
    }

    main();
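
page.pdf accepts further options beyond path. The sketch below sets a paper format, background printing, and margins; savePdf is a hypothetical helper name and the specific values are assumptions chosen for illustration.

```javascript
// Hypothetical helper: save the page as an A4 PDF with backgrounds and margins.
async function savePdf(page, filePath) {
  return page.pdf({
    path: filePath,
    // Paper size; this takes the place of explicit width/height values
    format: 'A4',
    // Include CSS background colors and images
    printBackground: true,
    margin: { top: '1cm', right: '1cm', bottom: '1cm', left: '1cm' }
  });
}

// Usage inside an async function:
//   await savePdf(page1, '../pdf/baidu.pdf');
```

By default the PDF is rendered with print CSS media, so the result may look different from a screenshot of the same page.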
