The following is based on the official documentation (zhaoqize.github.io/puppeteer-a…):
Puppeteer is a Node library that provides a high-level API for controlling Chromium or Chrome over the DevTools Protocol. Puppeteer runs headless by default, but it can be configured to run a full (non-headless) browser.
What can it do? Most things you can do manually in a browser can also be done with Puppeteer! Here are some examples:
- Generate screenshots and PDFs of pages.
- Crawl an SPA (single-page application) and generate pre-rendered content (i.e. "SSR", server-side rendering).
- Automate form submission, UI testing, keyboard input, and more.
- Create an always-up-to-date automated testing environment. Run tests directly in the latest version of Chrome, using the latest JavaScript and browser features.
- Capture a timeline trace of a site to help diagnose performance issues.
- Test browser extensions.
Getting started
Installation
To use Puppeteer in your project:
npm i puppeteer
# or "yarn add puppeteer"
Note: when you install Puppeteer, it downloads a recent version of Chromium (~170 MB on macOS, ~282 MB on Linux, ~280 MB on Windows) that is guaranteed to work with the API. To skip the download, see Environment variables.
puppeteer-core
Since version 1.7.0 we publish the puppeteer-core package, which does not download Chromium by default.
npm i puppeteer-core
# or "yarn add puppeteer-core"
puppeteer-core is a lightweight version of Puppeteer for launching an existing browser installation or connecting to a remote one.
For details, see Puppeteer vs. puppeteer-core.
Usage
Note: Puppeteer requires at least Node v6.4.0, but the examples below use async/await, which is only supported in Node v7.6.0 or later.
Puppeteer's usage is similar to other testing frameworks: you create a Browser instance, open pages, and then drive them with Puppeteer's API.
Example – navigate to example.com and save a screenshot as example.png:
Save as example.js:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
Execute on the command line
node example.js
Puppeteer initializes pages at 800×600px by default. The viewport size can be changed with page.setViewport().
Example – create a PDF.
Save as hn.js:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
  await page.pdf({path: 'hn.pdf', format: 'A4'});
  await browser.close();
})();
Execute on the command line
node hn.js
See page.pdf() for more information.
Example – execute a script in the context of the page
Save as get-dimensions.js:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Get the "viewport" of the page, as reported by the page.
  const dimensions = await page.evaluate(() => {
    return {
      width: document.documentElement.clientWidth,
      height: document.documentElement.clientHeight,
      deviceScaleFactor: window.devicePixelRatio
    };
  });

  console.log('Dimensions:', dimensions);
  await browser.close();
})();
Execute on the command line
node get-dimensions.js
See page.evaluate() for more information. Related APIs include page.evaluateOnNewDocument() and page.exposeFunction().
Default settings
1. Uses headless mode
Puppeteer launches Chromium in headless mode. To launch a full ("headful") version of Chromium, set the headless option:
const browser = await puppeteer.launch({headless: false}); // default is true
2. Runs a bundled version of Chromium
By default, Puppeteer downloads and uses a specific version of Chromium so that its API is guaranteed to work out of the box. To use Puppeteer with a different version of Chrome or Chromium, pass in the path to the executable when creating the Browser instance:
const browser = await puppeteer.launch({executablePath: '/path/to/Chrome'});
See puppeteer.launch() for more information.
Read this article to learn more about the differences between Chromium and Chrome, and this article for usage notes specific to Linux users.
3. Creates a fresh user profile
Puppeteer creates its own Chromium user profile, which it cleans up on every run.
What follows is the implementation I have used in practice (feedback on any shortcomings is welcome).
1. Main project dependencies (all installable via npm):
- express – a common Node HTTP framework
- ioredis – a Redis client for Node
- isbot – detects whether a user agent is a crawler (optional if you are not doing this for SEO)
- node-schedule – scheduled tasks for Node (cron expressions)
- puppeteer – a headless browser, used here for rendering HTML
- redlock – a Redis-based mutex that prevents concurrent rendering from exhausting server resources
- pm2 – a multi-process cluster runner for Node
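For reference, the dependency block of a package.json for this setup might look roughly as follows. The version ranges are assumptions (callback-style redlock, pre-2.x puppeteer), not tested values; compression and connect-history-api-fallback are listed because the reference code below uses them, and pm2 is usually installed globally (npm i -g pm2) rather than as a project dependency:

```json
{
  "dependencies": {
    "compression": "^1.7.4",
    "connect-history-api-fallback": "^1.6.0",
    "express": "^4.17.1",
    "ioredis": "^4.14.1",
    "isbot": "^2.5.4",
    "node-schedule": "^1.3.2",
    "puppeteer": "^1.19.0",
    "redlock": "^4.1.0"
  }
}
```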
2. Approach
Beforehand:
- Place the built project files in the project's dist folder (any name works).
- If possible, pre-render and cache all pages with a scheduled task.
On each request:
- When a user requests a page, check whether rendered HTML is cached; on a hit, return it directly.
- On a miss, acquire the global lock first, then check the cache a second time (this eliminates duplicate rendering under concurrency). If it is still missing, render with Puppeteer and release the lock when done.
- Serve static resource requests directly, without SSR (via Express).
- For resources unrelated to page rendering (the WeChat JS-SDK, Google Firebase, and so on), configure Puppeteer to block those requests. See the Puppeteer API documentation.
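The check → lock → re-check sequence described above can be sketched without Redis. In the sketch below, an in-memory Map stands in for the Redis cache and a promise chain stands in for Redlock; `cache`, `withLock`, `renderPage`, and `getHtml` are illustrative names, not part of the actual implementation:

```javascript
// Minimal sketch of double-checked caching in a single process.
// In production the cache is Redis and the lock is Redlock.
const cache = new Map();
let lockChain = Promise.resolve(); // serializes the critical sections

// Illustrative stand-in for the Puppeteer render step.
let renders = 0;
async function renderPage(url) {
  renders++;
  return `<html>rendered ${url}</html>`;
}

function withLock(fn) {
  // Queue fn behind all previously queued critical sections.
  const result = lockChain.then(fn);
  lockChain = result.catch(() => {}); // keep the chain alive on errors
  return result;
}

async function getHtml(url) {
  const hit = cache.get(url); // 1st check: outside the lock
  if (hit) return hit;
  return withLock(async () => {
    const again = cache.get(url); // 2nd check: inside the lock
    if (again) return again;
    const html = await renderPage(url);
    cache.set(url, html);
    return html;
  });
}
```

Under concurrent requests for the same URL, only the first queued critical section actually renders; every later one is caught by the second check.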
Afterwards:
- Pages can be re-rendered by scheduled tasks or by an automatic trigger.
3. Reference code
Express + Redis gating layer:
var express = require('express');
var compression = require('compression');
var app = express();
var server = require('http').createServer(app);
var history = require('connect-history-api-fallback');
var isBot = require('isbot');
import SSR from '../utils/ssr'
import Config from '../Config'
import redisUtils from '../utils/RedisUtils'
import utils from '../utils/Utils'
import { updateSsrHtml, updateAllSsrHtml, refreshSiteMap } from '../utils/EventEmitter'
import { scheduleRefreshAllSSrHtml, scheduleRefreshSiteMap } from '../utils/schedule'
const Redis = require('ioredis'); // added: redlock needs a Redis client instance
const Redlock = require('redlock');
const redis = new Redis(); // connects to 127.0.0.1:6379 by default

const redlock = new Redlock([redis], {
  retryDelay: 200, // time in ms between retries
  retryCount: 10
});

var listenPort = Config.port;
const staticFileMiddleware = express.static('dist');

// Initialize global flags
global.updateAllSsrHtmlFlag = false
global.refreshSiteMapFlag = false

// For local testing: match a fixed mobile user agent
function isLocalHost(ua) {
  return ua == "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1"
}

// Use compression before the other middleware (gzip)
app.use(compression());

app.use(function (req, res, next) {
  var UA = req.headers['user-agent'];
  // Detect requests made by our own headless browser,
  // so SSR never triggers recursively
  var is_puppeteer = false
  if (UA.indexOf("HeadlessChrome/") != -1) {
    is_puppeteer = true
  }
  let staticResource = ['static/', 'manifest.json', 'favicon.ico', 'img/', 'sitemap.xml', 'robots.txt', 'googlef58764e7fe61073e.html', 'umi.css', 'umi.js']
  var isStaticDir = false
  if (staticResource.find(regex => req.url.match(regex))) {
    isStaticDir = true
  }
  let params = utils.getUrlParams(req.url)
  if (!utils.isEmptyValue(params['updateAllSsrHtml']) && params['updateAllSsrHtml'] == "true" && !isStaticDir) {
    updateAllSsrHtml()
  }
  if (!utils.isEmptyValue(params['refreshSiteMap']) && params['refreshSiteMap'] == "true" && !isStaticDir) {
    refreshSiteMap()
  }
  // Only SSR for crawlers; exclude static resources and our own headless requests
  //if (UA && isBot(UA) && !isStaticDir) {
  if (UA && isBot(UA) && !isStaticDir && !is_puppeteer) {
    // Build the local URL to render
    var requestUrl = Config.baseUrl + req.url;
    (async () => {
      let html = await redisUtils.get(req.url)
      if (!utils.isEmptyValue(html)) {
        console.log("Got cached HTML: " + req.url)
        res.send(html);
      } else {
        var resource = `locks:gethtml`;
        // the maximum amount of time you want the resource locked,
        // keeping in mind that you can extend the lock up until
        // the point when it expires
        var ttl = 1000;
        redlock.lock(resource, ttl, async function (err, lock) {
          if (err) {
            // we failed to lock the resource
            // ...
          } else {
            // we have the lock
            try {
              // Second cache check: another request may have rendered this
              // page while we were waiting for the lock
              let html = await redisUtils.get(req.url)
              if (!utils.isEmptyValue(html)) {
                console.log("Got cached HTML: " + req.url)
                res.send(html);
              } else {
                var results = await SSR(requestUrl);
                redisUtils.set(req.url, results.html)
                res.send(results.html);
              }
            } catch (e) {
              console.log('ssr failed', e);
              res.status(500).send('Server error');
            }
            // unlock the resource when done
            lock.unlock(function (err) {
              // if we weren't able to reach redis, the lock will eventually
              // expire, but you probably want to log this error
              //console.error(err);
            });
          }
        });
      }
    })();
    return;
  }
  next();
});

// Serve static resources directly
app.use(staticFileMiddleware);
// If nothing matched, continue after the history rewrite
app.use(history({
  disableDotRule: true,
  verbose: true
}));
// Process the rewritten request
app.use(staticFileMiddleware);

server.listen(listenPort);
console.log("Started successfully on port " + listenPort)
ssr.js:
const puppeteer = require('puppeteer');

let browserWSEndpoint = null;

async function SSR(url) {
  let browser = null;
  if (browserWSEndpoint) {
    console.log("browserWSEndpoint!")
    try {
      // Reuse the browser that is already running
      browser = await puppeteer.connect({browserWSEndpoint});
    } catch (e) {
      // Connecting may fail, e.g. if the browser has exited
      browserWSEndpoint = null;
      browser = null;
    }
  }
  if (!browserWSEndpoint) {
    browser = await puppeteer.launch({
      headless: true,
      ignoreHTTPSErrors: true,
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--single-process', // <- this one doesn't work on Windows
        '--disable-gpu'
      ],
      devtools: false // don't open DevTools automatically (only relevant when headful)
    });
    browserWSEndpoint = await browser.wsEndpoint();
  }
  const start = Date.now();
  const page = await browser.newPage();

  // 1. Intercept network requests.
  await page.setRequestInterception(true);
  page.on('request', async req => {
    // 2. Ignore requests for resources that don't contribute to the DOM
    // (analytics, ads, third-party SDKs).
    const blacklist = ['google', 'firebase', '/gtag/js', 'ga.js', 'analytics.js', 'jweixin-1.6.0.js', 'adsbygoogle', '3gimg.qq.com'];
    if (blacklist.find(regex => req.url().match(regex))) {
      return req.abort();
    }
    // 3. Only allow resource types that produce DOM.
    const whitelist = ['document', 'script', 'xhr', 'fetch'];
    if (!whitelist.includes(req.resourceType())) {
      return req.abort();
    }
    req.continue();
  });
  try {
    await page.goto(url, {
      //await page.goto('your spa link', {
      // the page counts as rendered once 'load' has fired and no network
      // request has been in flight for 500ms
      waitUntil: ['load', 'networkidle0']
    });
    const html = await page.content();
    // Close only the page; the browser keeps running so that later calls
    // can reconnect through browserWSEndpoint
    await page.close();
    const ttRenderMs = Date.now() - start;
    console.info(`Headless rendered page ${url}: ${ttRenderMs}ms`);
    return {html, err: false};
  } catch (err) {
    console.error("Error requesting " + url);
    await page.close();
    return {html: '', err: true};
  }
}

export default SSR;
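The interception rules in ssr.js can also be factored into a pure function, which makes the blacklist/whitelist decision easy to unit-test. This sketch uses substring matching rather than the String.match() regex test above, and `shouldAbort` is an illustrative name, not part of the code:

```javascript
// Decide whether a request should be aborted during SSR:
// abort anything matching the blacklist, then abort anything
// whose resource type is not on the whitelist.
const blacklist = ['google', 'firebase', '/gtag/js', 'ga.js', 'analytics.js', 'adsbygoogle'];
const whitelist = ['document', 'script', 'xhr', 'fetch'];

function shouldAbort(url, resourceType) {
  if (blacklist.some(pattern => url.includes(pattern))) return true;
  if (!whitelist.includes(resourceType)) return true;
  return false;
}
```

For example, an analytics script and a stylesheet are both aborted, while the document itself and API calls go through.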
Scheduled Task Reference:
const schedule = require('node-schedule');
import { updateAllSsrHtml, refreshSiteMap } from './EventEmitter'
import HttpUtils from './HttpUtils'

const scheduleRefreshAllSSrHtml = () => {
  console.log("Starting scheduled task: scheduleRefreshAllSSrHtml")
  // Execute the task every 2 hours
  schedule.scheduleJob('0 0 0 1/1 * ?', () => {
    updateAllSsrHtml()
  });
  //updateAllSsrHtml()
}

const scheduleRefreshSiteMap = () => {
  console.log("Starting scheduled task: scheduleRefreshSiteMap")
  // Execute the task once a day
  schedule.scheduleJob('0 0 0 1/1 * ?', () => {
    refreshSiteMap()
  });
  //refreshSiteMap()
}

module.exports = {
  scheduleRefreshAllSSrHtml,
  scheduleRefreshSiteMap
}