Puppeteer is a Node library that provides a high-level API for controlling Chromium or Chrome via the DevTools protocol.
Puppeteer allows us to write scripts that simulate browser behavior and perform the following functions:
- Take a screenshot of the web page and save it as an image or PDF.
- Simulate DOM operations such as form submission, keyboard input, button clicking and slider movement.
- Implement automated testing of UI.
- Capture network traffic for performance debugging and analysis of web pages.
- Write a customized crawler for SPA pages, whose asynchronous requests are difficult for a traditional HTTP crawler to handle.
When building these with Puppeteer, we can use a few tricks to make the program more efficient.
Filtering requests
When Puppeteer is used to parse the DOM of an asynchronously rendered page, we often have to wait until the page has finished rendering before our scripts can run. However, rendering also loads many static resources such as images, audio, video, and style files. We can use the page.setRequestInterception method to intercept and abort these static resource requests, which speeds up page rendering. A code example is as follows:
// Enable request interception
await page.setRequestInterception(true);
page.on('request', async req => {
  // Filter by resource type
  const resourceType = req.resourceType();
  if (resourceType === 'image') {
    req.abort();
  } else {
    req.continue();
  }
});
Recommended resource types and third-party hosts to block:
const blockedResourceTypes = ['image', 'media', 'font', 'texttrack', 'object', 'beacon', 'csp_report', 'imageset'];
const skippedResources = [
  'quantserve', 'adzerk', 'doubleclick', 'adition', 'exelator', 'sharethrough', 'cdn.api.twitter',
  'google-analytics', 'googletagmanager', 'google', 'fontawesome', 'facebook', 'analytics',
  'optimizely', 'clicktale', 'mixpanel', 'zedo', 'clicksor', 'tiqcdn',
];
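One way to apply both lists in a single request handler is sketched below. The helper names `shouldBlock` and `blockUnneededRequests`, and the choice to match `skippedResources` entries as URL substrings, are assumptions made for illustration; the lists are abbreviated so the sketch runs on its own.

```javascript
// Abbreviated copies of the lists above, so the sketch is self-contained.
const blockedResourceTypes = ['image', 'media', 'font', 'texttrack'];
const skippedResources = ['doubleclick', 'google-analytics', 'googletagmanager'];

// Hypothetical helper: decide whether a request should be aborted.
function shouldBlock(resourceType, url) {
  if (blockedResourceTypes.includes(resourceType)) return true;
  // Assumption: entries in skippedResources are matched as URL substrings.
  return skippedResources.some(fragment => url.includes(fragment));
}

// Wire the helper into a page; `page` is a Puppeteer Page supplied by the caller.
async function blockUnneededRequests(page) {
  await page.setRequestInterception(true);
  page.on('request', req => {
    if (shouldBlock(req.resourceType(), req.url())) {
      req.abort();
    } else {
      req.continue();
    }
  });
}
```

Keeping the decision in a pure function makes the blocking rules easy to check without starting a browser.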
Proxying requests
In addition to filtering requests, we can also proxy the requests made during page rendering. Some crawler projects do this to avoid being blocked by anti-crawling measures. A code example is as follows:
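// Request interception must already be enabled with page.setRequestInterception(true)
page.on('request', async req => {
  // Proxy the request
  // Note: `fetch` here is assumed to be a request-promise-style HTTP client,
  // since it takes `proxy` and `resolveWithFullResponse` options
  const response = await fetch({
    url: req.url(),
    method: req.method(),
    headers: req.headers(),
    body: req.postData(),
    proxy: getProxyIp(),
    resolveWithFullResponse: true,
  });
  // Respond to the request
  req.respond({
    status: response.statusCode,
    contentType: response.headers['content-type'],
    headers: response.headers || req.headers(),
    body: response.body,
  });
});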
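The snippet above calls `getProxyIp()` without showing it. A minimal sketch of one possible implementation follows, assuming a fixed pool of proxy URLs that is rotated round-robin. The factory name `createProxyRotator` and the addresses are placeholders, not part of the original project.

```javascript
// Hypothetical factory: returns a function that cycles through a proxy pool.
function createProxyRotator(proxies) {
  if (!Array.isArray(proxies) || proxies.length === 0) {
    throw new Error('At least one proxy is required');
  }
  let index = 0;
  return function getProxyIp() {
    const proxy = proxies[index];
    // Advance to the next proxy, wrapping around at the end of the pool.
    index = (index + 1) % proxies.length;
    return proxy;
  };
}

// Placeholder addresses from a documentation-only IP range, not real proxies.
const getProxyIp = createProxyRotator([
  'http://192.0.2.10:8080',
  'http://192.0.2.11:8080',
]);
```

A real crawler would likely load the pool from a proxy provider and drop entries that fail, which this sketch does not attempt.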
Reusing the browser
Connecting to a running browser with puppeteer.connect is much faster than starting a new one with puppeteer.launch, so if you need multiple Browser instances, you can cache the browser's wsEndpoint and reuse it:
let wsEndpoint = await cache.get(Parser.WS_KEY);
let browser;
try {
  // Reuse the running browser if an endpoint is cached; otherwise launch a new one
  browser = !wsEndpoint
    ? await puppeteer.launch(config)
    : await puppeteer.connect({ browserWSEndpoint: wsEndpoint });
} catch (err) {
  // The cached endpoint may be stale, so fall back to a fresh launch
  browser = await puppeteer.launch(config);
} finally {
  // Cache the endpoint for later reuse
  wsEndpoint = browser.wsEndpoint();
  await cache.set(Parser.WS_KEY, 60 * 60 * 1000, wsEndpoint);
}
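The same connect-or-launch logic can be wrapped in a reusable function. The sketch below is one way to do it, assuming a cache object with async `get(key)` and `set(key, ttl, value)` methods as suggested by the snippet above. The function name `getBrowser` and its injected parameters are illustrative choices, not part of the original code.

```javascript
// Hypothetical wrapper around the connect-or-launch logic.
// `puppeteer` and `cache` are passed in so the function has no hidden globals.
async function getBrowser({ puppeteer, cache, key, config, ttl = 60 * 60 * 1000 }) {
  const cachedEndpoint = await cache.get(key);
  let browser;
  if (cachedEndpoint) {
    try {
      browser = await puppeteer.connect({ browserWSEndpoint: cachedEndpoint });
    } catch (err) {
      // The cached endpoint is stale (for example, the browser has exited).
      browser = undefined;
    }
  }
  if (!browser) {
    browser = await puppeteer.launch(config);
  }
  // Assumption: cache.set takes (key, ttl, value), following the snippet above.
  await cache.set(key, ttl, browser.wsEndpoint());
  return browser;
}
```

When a caller is finished, `browser.disconnect()` rather than `browser.close()` leaves the shared browser running for the next caller.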
Disable unnecessary browser functions
Puppeteer provides a sophisticated browser environment, but in practice, there are many features enabled by default that are not required by the project itself. In this case, we can disable additional features by setting browser startup parameters:
puppeteer.launch({
  args: [
    '--no-sandbox', // Disable the sandbox
    '--disable-setuid-sandbox', // Disable the setuid sandbox
    '--disable-dev-shm-usage', // Write shared memory files to /tmp instead of /dev/shm
    '--disable-accelerated-2d-canvas', // Disable accelerated 2D canvas rendering
    '--disable-gpu', // Disable GPU hardware acceleration
  ],
});
References
- Official documentation for Puppeteer
- Connecting Puppeteer to Existing Chrome Window w/ reCAPTCHA
- Puppeteer set proxy for page