Preface
Recently, while working on a project, I needed to generate PDFs on the back end. The PDFs had many pages, relatively complex styling, and had to reproduce the design with high fidelity. After researching community solutions, I settled on Puppeteer, a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. In headless mode it can convert HTML to PDF. It is currently the most popular solution for Node server applications, but there are still many pitfalls and caveats in practice. This article therefore summarizes headless browsers, Puppeteer, and HTML-to-PDF conversion.
Headless browser
1.1 Basic Understanding of headless browsers
According to Wikipedia, a headless browser is a web browser without a graphical user interface (GUI).
Headless browsers provide automated control of a web page in an environment similar to regular web browsers, but because there is no graphical user interface they are driven through a command-line interface or over network communication.
Headless browsers are particularly useful for testing web pages, web crawling, and similar scenarios, because they render and understand HTML the same way a browser does, including page layout, color, fonts, and the execution of JavaScript and Ajax, which are usually not available with other testing methods.
To sum up, the key characteristics of a headless browser are:
- It does not render content to a screen; everything is drawn in memory.
- It consumes less memory and runs faster, because it does not have to draw a graphical interface or display anything on a physical screen; it is meant to run on the back end.
- It has a programming interface (API) for controlling it. For example, Puppeteer provides a relatively high-level API to control Chrome or Chromium over the DevTools Protocol.
- An important feature is that it can be installed on a bare Linux server. On a freshly installed Ubuntu or CentOS server, you only need to install the compiled binary to use the headless browser.
One caveat: Chrome and Chromium are two different browsers. Broadly, Chromium is an open-source browser project that forms the foundation of the Chrome web browser. See the following article for the specific differences.
Article: What’s the difference between Chromium and Chrome
1.2 Application Scenarios of Headless Browser
Headless browsers are usually used for:
- Test automation in web applications, and running automated tests for JavaScript libraries
- Taking screenshots of web pages and converting web pages to PDF
- Using the debugging and performance analysis tools that ship with the browser to help analyze problems
- Crawling single-page applications (SPAs) by executing and rendering them, which solves the difficulty traditional HTTP crawlers have with asynchronous requests in SPAs
- Capturing a timeline trace of a site to help diagnose performance problems
- Collecting website data (crawler applications)
- Automating web interaction and simulating user behavior (e.g. keyboard input, form submission)
- Launching malicious attacks
Here is an article on attack and defense around detecting headless browsers controlled by Puppeteer:
Detection of headless browser Puppeteer attack and defense
Of course, there is more than one headless browser. Detection of the others works in a similar way, and you can search for the details.
1.3 Common headless browsers
- Headless Chrome (driven by Puppeteer): based on Blink, a fork of WebKit
- PhantomJS: based on WebKit
- SlimerJS: based on Gecko
- HtmlUnit: based on Rhino
- TrifleJS: based on Trident
- Splash: based on WebKit
One thing to note: as I understand it, Puppeteer is not itself a headless browser. The official definition reads:
Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol.
It is therefore a Node library for driving headless browsers such as Headless Chrome, and in theory such a browser can be driven in other ways as well.
But for PhantomJS, the official definition is:
A headless WebKit scriptable with JavaScript.
PhantomJS is therefore a headless browser in its own right.
The rest of this article focuses on Puppeteer.
Basic applications of Puppeteer
2.1 Official hands-on materials
The overall structure of Puppeteer is as follows:
Basically, it mirrors the layered architecture of Chrome. A BrowserContext is a browser session (if that is hard to picture, think of it as a private window that does not share cookies, cache data, and so on with other contexts). A Page is a browser page, created when a new tab is opened, and a Frame corresponds to a document within a page.
This chapter briefly lists common Puppeteer APIs and operations, which are relatively easy to use. You can refer to the documentation:
Basic use of Puppeteer
The Puppeteer detailed API
A website is available for trying Puppeteer demos and seeing Puppeteer in action:
Demo: Try Puppeteer
To test locally, install Puppeteer directly with npm i puppeteer.
2.2 Application 1: Save the page at a specified URL as an image
Local execution:
// example.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();
// Run on the command line: node example.js
This example can also be run online with Try Puppeteer.
2.3 Application 2: Save a web page (HTML) as a PDF
Local execution:
// example.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.all1024.com', {
    waitUntil: 'networkidle2',
  });
  await page.pdf({ path: '1024.pdf', format: 'a4' });
  await browser.close();
})();
// Run on the command line: node example.js
This example can also be run online with Try Puppeteer.
2.4 Application 3: Execute scripts in the context of a page
Local execution:
// example.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // Get the "viewport" of the page, as reported by the page.
  const dimensions = await page.evaluate(() => {
    return {
      width: document.documentElement.clientWidth,
      height: document.documentElement.clientHeight,
      deviceScaleFactor: window.devicePixelRatio,
    };
  });
  console.log('Dimensions:', dimensions);
  await browser.close();
})();
// Run on the command line: node example.js
This example can also be run online with Try Puppeteer.
2.5 Application 4: Proxy
// example.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    // Launch chromium using a proxy server on port 9876.
    // More on proxying:
    // https://www.chromium.org/developers/design-documents/network-settings
    args: ['--proxy-server=127.0.0.1:9876'],
  });
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await browser.close();
})();
// Run on the command line: node example.js
2.6 Application 5: Automatic form submission
// example.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Open the web address
  await page.goto('https://baidu.com/', {
    waitUntil: 'networkidle2',
  });
  // Type the search keyword
  await page.type('#kw', 'Tencent', {
    delay: 1000, // Delay in milliseconds between key presses
  });
  // Press Enter to submit
  await page.keyboard.press('Enter');
  // Close the browser so the Node process can exit
  await browser.close();
})();
// Run on the command line: node example.js
There are many more applications for you to discover and explore.
Using Puppeteer to convert HTML to PDF
In this chapter, I summarize the common problems in HTML-to-PDF conversion.
The project scenario is as follows. In an application with separate front and back ends, the back end is built on Koa and the HTML-to-PDF feature is based on Puppeteer. The HTML does not come from a URL; it is an HTML string rendered from an EJS template. We need to export dozens of PDF reports at the same time, each generated dynamically from data aggregated on the back end, and each with relatively complex UI requirements, custom charts, and cross-page handling. In theory, this covers most PDF generation scenarios.
3.1 Use of EJS template engine
Why use EJS in this project? We need to render data dynamically while the overall structure and style stay fixed, so a template engine is required. EJS is a mature and fairly standard choice. The official documentation is linked below.
EJS official documentation
When combining EJS with Puppeteer, there are two possible approaches:
One is to merge the EJS template and the data directly with EJS's renderFile API, then pass the generated HTML string to Puppeteer's page API to generate the PDF.
The other is to save the EJS-rendered HTML string as an HTML file and serve it as a Koa static resource, so that the HTML is reachable through a URL, which is then passed to Puppeteer's page API to generate the PDF.
Puppeteer supports both: one API accepts an HTML string and the other accepts a URL. However, the first approach is much more efficient than the second. Its core code is as follows:
// TypeScript environment
import ejs from "ejs";
import path from "path";

// Type definition of the incoming data
interface PDFDataObj {
  [propName: string]: any;
}

async function getHTML(pdfReportData: PDFDataObj): Promise<string> {
  // Render the EJS template to an HTML string
  const EJS2HTML = await new Promise<string>((resolve, reject) => {
    ejs.renderFile(
      path.resolve(__dirname, "../../../", "public/htmlModel/", "report.ejs"), // EJS template file path
      pdfReportData, // Render data passed to EJS
      (err, str) => {
        // Callback invoked when rendering finishes
        if (err) {
          reject(err); // Reject with the error, since no string is produced on failure
        } else {
          resolve(str);
        }
      }
    );
  });
  return EJS2HTML;
}
3.2 Importing external resource files in EJS (CSS, JS, and image files)
If resources are referenced by path in the EJS template, the static resources fail to load when Puppeteer generates the PDF. For example:
<script type="text/javascript" src="/public/js/echarts.min.js"></script>
<script type="text/javascript" src="./js/echarts.min.js"></script>
This happens because the path context has changed: the HTML string is no longer served from the location its paths were written against. There are two ways to solve the problem:
- Upload the resource files to a CDN or an object storage service (such as Tencent Cloud COS or Alibaba Cloud OSS), then replace the paths with the resulting resource links. This only works if the project is allowed to load resources from the public internet.
- When the project cannot call external resources (as in this project), the only option is to serve the static resource files from the current server.
In Koa, koa-static can serve static resources for us.
This project needs multiple static resource paths in Koa: one for the packaged front-end files, and another for static files used by the back end (such as the external files referenced by EJS). This requires another npm package, koa-mount. If your project enforces path-based permission checks, remember to exempt these static resource paths. Part of the core setup code is as follows:
// TypeScript environment
import { Server } from "http" // Assumed source of the Server type; the original omits this import
import path from "path"
import Koa from "koa"
import koaJwt from "koa-jwt"
import koaMount from "koa-mount"
import koaStatic from "koa-static"
import { Config } from "./config"

export class App {
  public app: Koa
  private server!: Server
  private config: Config

  public constructor() {
    this.app = new Koa()
    this.config = new Config()
  }

  private addRouter() {
    const staticPath = path.resolve(__dirname, "../client/dist")
    const publicPath = path.resolve(__dirname, "../public")
    // Skip the JWT check for the login route and the static resource paths
    this.app.use(
      koaJwt({ secret: this.config.config.jwt.secretKey, key: "jwt", cookie: "jwt_token" }).unless({
        path: [/^\/(v1|login|js|img|css|font|images|public)/],
      })
    )
    // Front-end build output
    this.app.use(koaStatic(staticPath, { index: "index.html", maxage: 24 * 3600 * 1000, defer: true }))
    // Mount a second static directory for the back-end resources
    this.app.use(koaMount("/public", koaStatic(publicPath)))
  }
}
This project is built with TypeScript; a JavaScript setup is similar. See the configuration in addRouter.
The end result is that the path xxx.com/login maps to the front-end entry point, while xxx.com/public/images/xxx.png maps to a back-end static resource. CSS, JS, and font files are served the same way.
The reason for this setup lies in the project's directory structure. For development convenience, the front-end source lives inside the back-end directory tree, so that after npm run build the output lands directly in the front-end dist directory the back end points to, with no need to update dist manually.
The EJS-related static resource files are used for template rendering on the back end, so they cannot be placed in the default dist static directory; otherwise they would be deleted every time the front end is rebuilt. They are therefore kept independent of the front end in a separate public static directory, which is why Koa needs two static resource directories.
The overall directory structure is roughly as follows: client holds the front-end source code and contains the dist build output, and public holds the static resources the back end depends on.
Once configured, resources are referenced by URL in the EJS files. To keep this configurable, the site origin (here https://www.xxx.com) is passed in as resourcesUrl:
<script type="text/javascript" src="<%= resourcesUrl %>/public/js/echarts.min.js"></script>
3.3 Fonts not taking effect in Puppeteer
When writing HTML and CSS, a web page often needs a specific font.
If we simply set font-family to Microsoft YaHei and do nothing else, the page renders in Microsoft YaHei only where that font is available. Where it is not, the browser falls back to its default font.
To avoid this, we usually store the font files in the resources folder and reference them in a manner similar to the following:
@font-face {
  font-family: 'MyWebFont';
  src: url('../font/webfont.woff') format('woff'),
       url('../font/webfont.ttf') format('truetype');
}
.targetDom {
  font-family: MyWebFont;
}
In Puppeteer, however, this does not work as expected, because Puppeteer ultimately relies on the operating system's font library to generate PDFs. In other words, any font installed on the system can be used in CSS by that name. This sounds absurd, but it is what I ran into in practice: I tried many approaches without success before finding that the problem was tied to the system fonts.
Judging from the documentation, this is presumably because Puppeteer depends on Chromium, which in turn depends directly on the underlying OS.
In other words, the problem can be solved simply by installing the fonts at the system level.
3.4 Problems deploying Puppeteer on Linux, and using Docker
The new problem is that in most cases the server we deploy to runs Linux, where font installation is very different from Windows or macOS, while the development environment is usually Windows or macOS.
For font installation on Linux, refer to the following article, or to the steps in the Dockerfile below:
Linux Chinese font installation
When deploying on the company's in-house cloud platform, it is easy to get different results, or even errors, between local development and the online deployment, which is why Docker was introduced. Before Puppeteer was added, the project did not seem to need Docker, but experience has shown that using Docker makes long-term development much more convenient.
In addition to the font problem, Puppeteer on Linux also runs into Chromium errors, and Chromium has to be installed separately.
Putting all of the above together yields the following Dockerfile. The commands are annotated for reference in real project development. For other configuration, see 3.9.
# Replace with the address of your base image
FROM mirrors.tencent.com/xxxxx/xxxxxx
ARG NODEJS_VERSION=v14.1.0
LABEL MAINTAINER="Alexzhli"
# Install
# Install Chromium
RUN yum -y install chromium \
# Get and install Node.js (via nvm)
&& wget https://github.com/nvm-sh/nvm/archive/v0.35.1.zip \
&& unzip v0.35.1.zip -d /root/ \
&& rm v0.35.1.zip \
&& mv /root/nvm-0.35.1 /root/.nvm \
&& echo ". /root/.nvm/nvm.sh" >> /root/.bashrc \
&& echo ". /root/.nvm/bash_completion" >> /root/.bashrc \
&& source /root/.bashrc \
&& nvm install $NODEJS_VERSION \
# Install TypeScript and ts-node
&& npm install -g typescript ts-node \
# Install and set up Linux Chinese fonts
# Install Chinese font support
&& yum -y groupinstall chinese-support \
# Set the Linux locale
&& LANG=zh_CN.UTF-8 \
# Download font from COS
&& wget https://xxx.com./xxx/TencentSans-W7.ttf \
# Download font from COS
&& wget https://xxx.com./xxx/msyh.ttf \
# Install fonts
&& cp TencentSans-W7.ttf /usr/share/fonts \
&& cp msyh.ttf /usr/share/fonts \
&& cd /usr/share/fonts \
&& mkfontscale \
&& mkfontdir \
# Update the font cache
&& fc-cache
WORKDIR /usr/local/app
This way, the development and production environments are identical.
3.5 Key points for headers, footers, and page numbers
- Puppeteer provides a header and footer mechanism: set headerTemplate and footerTemplate, both as HTML strings, and pass them to page.pdf().
- Set the page margins with the margin parameter. The margin reserved here is the space in which headerTemplate and footerTemplate are displayed.
- headerTemplate and footerTemplate do not support loading images by path or URL. To display an img, compress the image as much as possible, convert it to base64, and put it in src; it then displays normally.
- headerTemplate and footerTemplate do not support the CSS background property. To create a richly styled header or footer, convert the background to an img and embed it.
- The header and footer set through headerTemplate and footerTemplate are not in the HTML DOM flow. They belong to neither <html/> nor <body/>, and the HTML DOM automatically skips this area, similar to Word. Their style therefore cannot be controlled by CSS in the HTML file; it can only be written as inline styles in the template string.
- The header and footer have some default offsets, so extra margin-top and margin-bottom values must be specified in the inline styles to adjust their position.
- In practice so far, there is no way in Puppeteer to display page numbers anywhere on the page other than in headerTemplate and footerTemplate. The templates provide built-in page number support: set the class of a span tag to totalPages for the total number of pages, or to pageNumber for the current page number.
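The points above can be sketched as a small helper that builds a footer template string. This is a minimal illustration rather than the project's actual template: buildFooterTemplate and its options are hypothetical names, and the concrete font size and margin values are assumptions. Only the pageNumber and totalPages class names and the need for inline styles come from the notes above.

```javascript
// Hypothetical helper (not part of Puppeteer): builds a footer template string.
// The option names and default values are assumptions for illustration.
function buildFooterTemplate({ fontSizePx = 10, marginBottomPx = 0 } = {}) {
  // Styles have to be inline, because the template is rendered outside
  // the page's own DOM and is not affected by the page's CSS.
  const style = [
    `font-size: ${fontSizePx}px`,
    'width: 100%',
    'text-align: center',
    `margin-bottom: ${marginBottomPx}px`,
  ].join('; ');
  // Spans with these class names get the current page number and
  // the total page count injected when the PDF is generated.
  return (
    `<div style="${style}">` +
    '<span class="pageNumber"></span> / <span class="totalPages"></span>' +
    '</div>'
  );
}

const footerTemplate = buildFooterTemplate({ fontSizePx: 9, marginBottomPx: 4 });
```

The resulting string would then be passed to page.pdf() as footerTemplate, together with displayHeaderFooter: true and a bottom margin large enough to leave room for it.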
3.6 Handling ECharts charts or images that break across pages
When generating whole pages, it is hard to avoid long images breaking across pages. Even mature software such as Word hands the problem to the user by not allowing long images to break across pages, so users have to crop or shrink the image manually.
A broken page in Puppeteer looks like this (with header, footer, and margin configured):
For complex dynamic PDF generation, however, cropping or shrinking is not something we can do by hand, and doing so may hurt the desired result or drive up development cost. A Word document is essentially static: the user fixes the content of each page, and there are no pages of uncertain length. By contrast, when generating vertical ECharts charts, a large number of data items can occupy an indefinite number of pages. There are three general solutions:
- Agree with the product team that the total page length should be as static as possible, meaning that the position and size of everything on each page are fixed. This is the safest option.
- If it is acceptable for images or ECharts charts to break across pages, no action is needed (provided the header, footer, and margin are set), but an image may be split between two pages.
- If ECharts charts must be split cleanly, handle the places with dynamic height. Where previously only one ECharts instance was created, create as many instances as the dynamic scenario needs, strictly calculate the height of each item in the chart, and combine that with the page height to handle the page breaks. For example, if a chart has 40 items stacked vertically, each page can hold 15, and the current page has room for 8 more, then 4 chart instances are created across 4 pages, holding 8, 15, 15, and 2 items respectively. The right height is then set dynamically on each chart's DOM element to achieve a clean result. All of this can be controlled by setting style dynamically in JS together with the CSS properties page-break-after: always; and page-break-before: always;. The code differs considerably between scenarios, so it is not listed here. The figure gives a rough idea of the result, in which the ECharts chart is split fairly cleanly.
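The item-counting step in the third solution can be sketched as follows. This is a minimal sketch under stated assumptions: splitItemsAcrossPages is a hypothetical helper, every item is assumed to have the same height, and the per-page capacity is assumed to be already known. Creating the ECharts instances and setting the DOM heights is left out.

```javascript
// Hypothetical helper: decides how many chart items go into each chart instance.
// Assumes all items have equal height, so capacity can be counted in items.
function splitItemsAcrossPages(totalItems, itemsPerPage, remainingOnCurrentPage) {
  if (itemsPerPage <= 0) {
    throw new Error('itemsPerPage must be positive');
  }
  const chunks = [];
  let left = totalItems;
  // The first chunk fills whatever room is left on the current page.
  const first = Math.min(left, Math.max(0, remainingOnCurrentPage));
  if (first > 0) {
    chunks.push(first);
    left -= first;
  }
  // Every following chunk fills a whole page, except possibly the last one.
  while (left > 0) {
    const count = Math.min(left, itemsPerPage);
    chunks.push(count);
    left -= count;
  }
  return chunks;
}

// The example from the text: 40 items, 15 per page, room for 8 on the current page.
const chunks = splitItemsAcrossPages(40, 15, 8); // [8, 15, 15, 2]
```

Each entry would correspond to one chart instance, with a page break forced between consecutive instances.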
3.7 Cross-page handling for native tables
As above, if a table is not handled across pages, it looks like this (with header, footer, and margin set):
Cross-page handling for tables is relatively simple, using thead:
table thead {
display: table-header-group;
break-inside: avoid;
}
The table head is then repeated automatically on each page, and looks like this:
3.8 WYSIWYG debugging
Because a headless browser is used, we cannot see the page the browser draws. We can save the buffer as a PDF file and open it, which gives the most faithful result, but debugging this way is very slow.
We can add a small script to the relevant EJS template:
setTimeout(() => {
  window.print()
}, 2000)
Then add a new route on the back end that serves the resulting HTML string (including data), so it can be viewed directly in the browser.
window.print is used because PDF output differs from the normal HTML DOM flow in several ways, such as the header and footer and the page width. In normal display, HTML does not even have the concept of a "page".
In our application, an A4 PDF page measures 794px x 1124px (including header and footer).
window.print roughly simulates how the PDF will render. Combined with checks against the final generated PDF, it greatly improves debugging efficiency.
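One way to keep the print script out of the production template is to inject it only in the debug route. The helper below is a sketch of that idea, not the project's code: withPrintPreview is a hypothetical name, and placing the script just before the closing body tag is an assumption.

```javascript
// Hypothetical helper: adds the delayed window.print() call to an HTML string
// so a debug route can serve a print preview of the rendered template.
function withPrintPreview(html, delayMs = 2000) {
  const script = `<script>setTimeout(() => { window.print(); }, ${delayMs});</script>`;
  const idx = html.lastIndexOf('</body>');
  // Append at the end if the document has no closing body tag.
  if (idx === -1) {
    return html + script;
  }
  return html.slice(0, idx) + script + html.slice(idx);
}

const previewHtml = withPrintPreview('<html><body><p>report</p></body></html>');
```

A debug route would respond with withPrintPreview(html), while the PDF path keeps passing the untouched HTML string to Puppeteer.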
3.9 Best practices for HTML to PDF (creation and generation of PDF only)
Because this project generates dozens of PDFs at the same time, we use Promise.all() for asynchronous processing.
I also ran a simple test on browser shutdown timing and the number of instances in Puppeteer, with the following result:
Processing speed: multiple Browser instances > a single Browser instance with multiple processes > a single Browser with a single process.
Of course, this also depends on the business scenario and the server environment, and the test above was not systematic. The feature as a whole does not need to handle much concurrency, so the speed requirement is not high.
In this project each PDF is large and laying out the HTML takes a long time, so multiple Browser instances are advantageous. Server configuration also affects processing speed, so upgrading the server where possible helps, since Puppeteer essentially runs multiple browsers and consumes a lot of server resources.
Therefore, the project creates one Browser instance per PDF, trading resources for time.
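The one-browser-per-PDF pattern with Promise.all() could be sketched like this. It is a minimal sketch: generateAllPDFs is a hypothetical name, and renderOne is an assumed async function that launches its own browser, renders one report to a PDF buffer, and closes the browser again.

```javascript
// Hypothetical helper: starts one PDF job per report and waits for all of them.
// `renderOne` is an assumed async function (report) => Buffer that owns its
// own browser instance for the duration of the job.
async function generateAllPDFs(reports, renderOne) {
  // All jobs start immediately; Promise.all resolves with the buffers in
  // input order, or rejects as soon as any single job fails.
  return Promise.all(reports.map((report) => renderOne(report)));
}
```

Because every job starts at once, the number of reports in a batch is also the number of browsers running concurrently, which is where the server resource cost mentioned above comes from.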
After reading several articles and working through the pitfalls, I arrived at a relative best practice. It is of course based on this project's scenario. It covers only creating and generating PDFs with Puppeteer, not UI development. Additional notes are in the code comments.
The browser args configuration is documented at:
Chrome Launch Parameters
import puppeteer from "puppeteer";

// HTMLStr is the HTML string rendered by EJS (the original read it from this.HTMLStr)
async function getPDF(HTMLStr: string) {
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      "--no-sandbox", // Must be enabled on Linux
      "--no-zygote",
      // "--single-process", // Single-process mode is turned off here
      "--disable-setuid-sandbox",
      "--disable-gpu",
      "--disable-dev-shm-usage",
      "--no-first-run",
      "--disable-extensions",
      "--disable-file-system",
      "--disable-background-networking",
      "--disable-default-apps",
      "--disable-sync", // Disable synchronization
      "--disable-translate",
      "--hide-scrollbars",
      "--metrics-recording-only",
      "--mute-audio",
      "--safebrowsing-disable-auto-update",
      "--ignore-certificate-errors",
      "--ignore-ssl-errors",
      "--ignore-certificate-errors-spki-list",
      "--font-render-hinting=medium",
    ],
  });
  // try... catch...
  try {
    const page = await browser.newPage();
    // Header template (the image uses base64; the base64 data in src is a placeholder)
    const headerTemplate = `
      <img style="max-width: 100%; clear: both; min-height: 1em; height: auto;" src="data:image/png;base64,iVBORw0KGgoAAAxxxxxx" />
    `;
    // Footer template (pageNumber automatically injects the current page number)
    const footerTemplate = `
    `;
    // Large PDFs can take a long time to generate, so the timeout is disabled here
    await page.setDefaultNavigationTimeout(0);
    // Set the HTML content
    await page.setContent(HTMLStr, { waitUntil: "networkidle2" });
    // Wait for the fonts to finish loading
    await page.evaluateHandle("document.fonts.ready");
    let pdfbuf = await page.pdf({
      // Page scaling
      scale: 1,
      // Whether to display the header and footer
      displayHeaderFooter: true,
      // Page size of the PDF
      format: "a4",
      // Template for the header
      headerTemplate,
      // Template for the footer
      footerTemplate,
      // Page margins
      margin: {
        top: 50,
        bottom: 50,
        left: 0,
        right: 0,
      },
      // The output page range
      pageRanges: "",
      // Give priority to the page size declared in CSS @page
      preferCSSPageSize: true,
      // Render background colors and images. Puppeteer is based on Chrome, which by
      // default leaves them out of printed output to save ink.
      // Pitfall: this must be enabled
      printBackground: true,
    });
    // Close the browser
    await browser.close();
    // The buffer does not need to be saved as a PDF file. It is sent straight back
    // to the front end for download, which improves processing speed.
    return pdfbuf;
  } catch (e) {
    await browser.close();
    throw e;
  }
}
This article summarizes these topics in outline; I will add more details later when time permits.
Because this is a company project, I cannot post the full code or the final output. If you have any questions, feel free to contact me.
References
- Chrome DevTools Protocol
- Wikipedia: Headless browser
- Advantages and disadvantages of headless browser testing
- How headless Chrome works
- A preliminary study on the headless browser Puppeteer
- Puppeteer GitHub
- Linux Chinese font installation
- EJS official documentation
- Chrome Launch Parameters