A guide to PDF generation in headless browsers and Puppeteer

Preface

In a recent project I needed to generate PDFs on the back end. The documents ran to many pages, the styling was fairly complex, and the output had to match the design closely. After surveying what the community uses, I settled on Puppeteer, a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. In headless mode it can convert HTML to PDF. This is currently the most popular approach in Node server applications, but there are still plenty of pitfalls and things to watch for in practice. This article therefore covers headless browsers, Puppeteer, HTML-to-PDF conversion, and related topics.

Headless browser

1.1 Basic Understanding of headless browsers

According to Wikipedia, a headless browser is a web browser without a graphical user interface (GUI).

Headless browsers provide automated control of a web page in an environment similar to regular web browsers, but because there is no graphical user interface they are driven through a command-line interface or over network communication.

Headless browsers are particularly useful for testing web pages, web crawling, and similar scenarios, because they render and interpret HTML the same way a regular browser does, including page layout, colors, fonts, and the execution of JavaScript and Ajax, which are often unavailable with other testing methods.

To sum up, the basics of a headless browser are:

It does not render content to a real display; everything is drawn in memory.
It consumes less memory and runs faster because it does not have to draw a visible graphical interface or display anything on an actual screen, which makes it well suited to running on the back end.
It comes with a programming interface (API) for controlling it. For example, Puppeteer provides a relatively high-level API to control Chrome or Chromium over the DevTools Protocol.
An important feature is that it can be installed on a bare Linux server. On a freshly installed Ubuntu or CentOS server, you only need to compile and install the browser binary, and the headless browser is ready to use.

One caveat: Chrome and Chromium are two different browsers. Broadly, Chromium is the open-source browser project that forms the foundation of the Chrome web browser. See the following article for the specific differences.

Article: What’s the difference between Chromium and Chrome

1.2 Application Scenarios of Headless Browser

Headless browsers are usually used for:

Test automation for web applications, and running automated tests for JavaScript libraries
Taking screenshots of web pages and converting web pages to PDF
Using the debugging and performance analysis tools built into the browser to help analyze problems
Crawling single-page applications (SPAs) by executing and rendering them (traditional HTTP crawlers struggle with SPAs because their content is loaded through asynchronous requests)
Capturing a timeline trace of a site to help diagnose performance problems
Collecting website data (crawler applications)
Automating web interaction and simulating user behavior (e.g. keyboard input, form submission)
Launching malicious attacks, unfortunately

Here is an article on detecting and defending against crawlers built on a headless browser controlled by Puppeteer:

Detection of headless browser Puppeteer attack and defense

Of course, there is more than one headless browser, and detection works similarly for the others; a web search will turn up the details.

1.3 Common headless browsers

Headless Chrome (the browser Puppeteer drives), based on Blink, a fork of WebKit
PhantomJS, based on WebKit
SlimerJS, based on Gecko
HtmlUnit, based on Rhino
TrifleJS, based on Trident
Splash, based on WebKit

One thing to note: as I understand it, Puppeteer is not a headless browser in itself. The official definition says:

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol.

It is therefore a Node library for controlling headless browsers such as Headless Chrome, and in theory it could drive more than one kind of browser.

But for PhantomJS, the official definition is:

A headless WebKit scriptable with JavaScript.

PhantomJS is therefore a headless browser in its own right.

The rest of this article focuses on Puppeteer.

The basic application of Puppeteer

2.1 Official hands-on materials

The overall structure of Puppeteer is as follows:

Basically, it mirrors the layered architecture of Chrome. A BrowserContext is a browser session (if that is hard to picture, think of it as a private window that does not share cookies, cached data, and so on with other contexts). A Page is a page created by opening a new tab, and a Frame corresponds to a document within the Page.

This chapter briefly lists common Puppeteer APIs and operations, which are relatively easy to use. For details, refer to the documentation:

Basic use of Puppeteer

The Puppeteer detailed API

An online playground is available where you can see Puppeteer in action:

Demo: Try Puppeteer

To test locally, install Puppeteer with npm i puppeteer.

2.2 Application 1: Save the page at a given URL as an image

Local execution:

// example.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // Save a screenshot of the page next to the script
  await page.screenshot({ path: 'example.png' });

  await browser.close();
})();

// Run with: node example.js

Try Puppeteer

2.3 Application 2: Save a web page as a PDF

Local execution:

// example.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.all1024.com', {
    waitUntil: 'networkidle2',
  });
  // Write the rendered page to an A4 PDF file
  await page.pdf({ path: '1024.pdf', format: 'a4' });

  await browser.close();
})();

// Run with: node example.js

Try Puppeteer

2.4 Application 3: Execute scripts in the context of a page

Local execution:

// example.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Get the "viewport" of the page, as reported by the page.
  const dimensions = await page.evaluate(() => {
    return {
      width: document.documentElement.clientWidth,
      height: document.documentElement.clientHeight,
      deviceScaleFactor: window.devicePixelRatio,
    };
  });

  console.log('Dimensions:', dimensions);

  await browser.close();
})();

// Run with: node example.js

Try Puppeteer

2.5 Application 4: Proxy

// example.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    // Launch chromium using a proxy server on port 9876.
    // More on proxying:
    // https://www.chromium.org/developers/design-documents/network-settings
    args: ['--proxy-server=127.0.0.1:9876'],
  });
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await browser.close();
})();

// Run with: node example.js

2.6 Application 5: Automatic form submission

// example.js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Go to the web address
  await page.goto('https://baidu.com/', {
    waitUntil: 'networkidle2',
  });
  // Type the search keyword
  await page.type('#kw', 'Tencent', {
    delay: 1000, // Delay between key presses, in milliseconds
  });

  // Press Enter to submit
  await page.keyboard.press('Enter');
})();

// Run with: node example.js

There are many other applications as well, which you can explore on your own.

Using Puppeteer to convert HTML to PDF

In this chapter, I will summarize the common problems in HTML-to-PDF conversion.

The project scenario is as follows. The application has separate front and back ends; the back end is Koa, and HTML-to-PDF conversion is built on Puppeteer. The HTML does not come from a URL; it is an HTML string produced by rendering an EJS template. We need to export dozens of PDF reports at the same time, each generated dynamically from data aggregated on the back end, and each with relatively complex UI requirements, including customized charts and cross-page handling. In theory, this covers most PDF generation scenarios.

3.1 Use of EJS template engine

Why use EJS in this project? We need to render data dynamically while the overall structure and styling stay fixed, so a template engine is the natural fit. EJS is a mature, well-established choice. The official EJS documentation is linked below.

Ejs official document

There are, in theory, two ways to combine EJS with Puppeteer:

One is to merge the EJS template and the data directly through EJS's renderFile API, then pass the generated HTML string to Puppeteer's page API for PDF generation.

The other is to save the EJS-rendered HTML string as an HTML file and mount it as a Koa static resource, so that the HTML can be accessed through a URL, which is then passed to Puppeteer's page API for PDF generation.

Puppeteer supports both inputs: it can receive an HTML string or a URL. However, the first approach is much more efficient than the second. The core code for the first approach is as follows:

// TypeScript environment
import ejs from "ejs";
import path from "path";

// The type definition of the incoming data
interface PDFDataObj {
  [propName: string]: any;
}

async function getHTML(pdfReportData: PDFDataObj): Promise<string> {
  // Render the EJS template to an HTML string
  const EJS2HTML = await new Promise<string>((resolve, reject) => {
    ejs.renderFile(
      path.resolve(__dirname, "../../../", "public/htmlModel/", "report.ejs"), // EJS template file path
      pdfReportData, // Render data passed to EJS
      (err, str) => {
        // Callback: reject with the error, or resolve with the rendered HTML
        if (err) {
          reject(err);
        } else {
          resolve(str);
        }
      }
    );
  });
  return EJS2HTML;
}
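The two ways of handing HTML to Puppeteer described above can be sketched as follows. This is a minimal sketch, assuming a Puppeteer page object is passed in; loadIntoPage and the shape of its source argument are hypothetical names chosen for illustration, while page.setContent and page.goto are Puppeteer's own methods.

```javascript
// Hypothetical helper: feed either an HTML string or a URL to a Puppeteer page.
// `page` is expected to be a Puppeteer Page (created with browser.newPage()).
async function loadIntoPage(page, source) {
  const options = { waitUntil: 'networkidle2' };
  if (typeof source.html === 'string') {
    // Approach 1: pass the EJS-rendered HTML string directly.
    await page.setContent(source.html, options);
    return 'html';
  }
  if (typeof source.url === 'string') {
    // Approach 2: navigate to a URL where the rendered HTML is served.
    await page.goto(source.url, options);
    return 'url';
  }
  throw new TypeError('source must contain either an html string or a url string');
}
```

With the getHTML function above, the first approach would be called as loadIntoPage(page, { html: await getHTML(data) }), which avoids writing a file and making an extra HTTP round trip.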

3.2 Introduction of external resource files in EJS (CSS, JS files and image files)

If the EJS template references resources by path, the static resources fail to load when Puppeteer generates the PDF. For example:

<script type="text/javascript" src="/public/js/echarts.min.js"></script>
<script type="text/javascript" src="./js/echarts.min.js"></script>

This is because the path context has changed: the HTML string is no longer loaded from the location those paths are relative to. There are two ways to solve this:

Upload the resource files to a CDN or an object storage service (such as Tencent Cloud COS or Alibaba Cloud OSS), then replace the paths with the resulting resource links. This only works if the project is allowed to load resources from the public internet.
When the project cannot call external resources (as in this project), the only option is to mount the static resource files on the current server.

In Koa, koa-static can mount static resources for us.

In this project, Koa needs to serve multiple static resource paths: one for the files produced by the front-end build, and another for static files used by the back end (such as external files referenced by EJS). This requires another npm package, koa-mount. If your project enforces path-based authentication, remember to exempt these static resource paths from the check. Part of the core setup code is as follows:

// TypeScript environment
import path from "path"
import { Server } from "http"
import Koa from "koa"
import koaJwt from "koa-jwt"
import koaMount from "koa-mount"
import koaStatic from "koa-static"
import { Config } from "./config"

export class App {
    public app: Koa
    private server!: Server
    private config: Config
    public constructor() {
        this.app = new Koa()
        this.config = new Config()
    }
    private addRouter() {
        const staticPath = path.resolve(__dirname, "../client/dist")
        const publicPath = path.resolve(__dirname, "../public")
        // Skip the JWT check for the listed path prefixes, including the static resource paths
        this.app.use(
            koaJwt({ secret: this.config.config.jwt.secretKey, key: "jwt", cookie: "jwt_token" })
                .unless({ path: [/^\/(v1|login|js|img|css|font|images|public)/] })
        )
        // Front-end build output
        this.app.use(koaStatic(staticPath, { index: "index.html", maxage: 24 * 3600 * 1000, defer: true }))
        // Mount multiple static directories
        this.app.use(koaMount("/public", koaStatic(publicPath)))
    }
}

This project is built in TypeScript; a JavaScript setup is similar. See the configuration in addRouter.

The end result is that xxx.com/login maps to the front-end entry point, while xxx.com/public/images/xxx.png maps to a back-end static resource. CSS, JS, and font files work the same way.

The reason for this setup is the project's directory structure. For convenience during development, the front-end source files live inside the back-end project directory, so that after npm run build the output lands directly in the front-end dist directory that the back end serves, with no need to update dist manually.

The static resource files used by EJS are needed for template rendering on the back end, so they cannot live in the default dist static directory; otherwise they would be deleted every time the front end is rebuilt. They are independent of the front end and belong in a separate public static directory, which is why Koa needs two static resource directories.

The overall directory structure is roughly as follows: client holds the front-end source code and contains the dist build output, and public holds the static resources the back end depends on.

Once this is configured, resources are referenced by URL in the EJS files. To keep this configurable, the site origin (here https://www.xxx.com) is passed to the template as resourcesUrl:

<script type="text/javascript" src="<%= resourcesUrl %>/public/js/echarts.min.js"></script>

3.3 Invalid fonts in Puppeteer

When writing HTML and CSS, a web page often needs to specify a font.

If we simply set font-family to Microsoft YaHei and do nothing else, the page renders in Microsoft YaHei wherever the font is available. Where it is not available, the browser falls back to its default font.

To avoid this, we usually store the font files in the resources folder and reference them like this:

@font-face {
    font-family: 'MyWebFont';
    src: url('../font/webfont.woff') format('woff'),
         url('../font/webfont.ttf') format('truetype');
}

.targetDom {
  font-family: MyWebFont;
}

In Puppeteer, however, this did not work for me. In this project, Puppeteer ultimately relied on the operating system's font library when generating PDFs, which means that any font installed on the system can be used in CSS by its name. This sounds odd, but it is what I ran into in practice: I tried many approaches without success and finally found that the problem was tied to the system fonts.

Judging from the documentation, this is presumably because Puppeteer depends on Chromium, which in turn relies directly on the underlying operating system.

In other words, the problem can be solved simply by installing the fonts at the system level.

3.4 Problems in Puppeteer deployment in Linux and Docker application

This raises a new problem: in most cases the server we deploy to runs Linux, where installing fonts works very differently from Windows or macOS, while the development environment is usually Windows or macOS.

For installing fonts on Linux, see the following article, or the corresponding steps in the Dockerfile below:

Linux Chinese font installation

When deploying to the company's in-house cloud platform, it is easy to get different results, or even errors, between local development and production, which is what led us to Docker. Before Puppeteer was introduced, the project did not seem to need Docker, but experience has shown that using Docker makes long-term development much more convenient.

Besides the font problem, Puppeteer on Linux can also fail with Chromium errors, so Chromium needs to be installed separately.

Putting all of the above together gives the following Dockerfile. The commands are annotated for reference in real project development. For the remaining configuration, see 3.9.

# Enter the address of the base image here
FROM mirrors.tencent.com/xxxxx/xxxxxx

ARG NODEJS_VERSION=v14.1.0

LABEL MAINTAINER="Alexzhli"

# Install
# Install chromium
RUN yum -y install chromium \
    # Get and install nodejs
    && wget https://github.com/nvm-sh/nvm/archive/v0.35.1.zip \
    && unzip v0.35.1.zip -d /root/ \
    && rm v0.35.1.zip \
    && mv /root/nvm-0.35.1 /root/.nvm \
    && echo ". /root/.nvm/nvm.sh" >> /root/.bashrc \
    && echo ". /root/.nvm/bash_completion" >> /root/.bashrc \
    && source /root/.bashrc \
    && nvm install $NODEJS_VERSION \
    # Install TS and TS-Node
    && npm install -g typescript ts-node \
    # Install and set Linux Chinese fonts
    # Install Chinese font support
    && yum -y groupinstall chinese-support \
    # Set the Linux locale
    && LANG=zh_CN.UTF-8 \
    # Download font from COS
    && wget https://xxx.com./xxx/TencentSans-W7.ttf \
    # Download font from COS
    && wget https://xxx.com./xxx/msyh.ttf \
    # Install fonts
    && cp TencentSans-W7.ttf /usr/share/fonts \
    && cp msyh.ttf /usr/share/fonts \
    && cd /usr/share/fonts \
    && mkfontscale \
    && mkfontdir \
    # Update the font cache
    && fc-cache

WORKDIR /usr/local/app

This way, the development and production environments are exactly the same.

3.5 Key points of header, footer and page number

Puppeteer supports headers and footers: set headerTemplate and footerTemplate to HTML strings and pass them to page.pdf().
Use the margin option to set the page margins. The margin you leave is the space in which headerTemplate and footerTemplate are displayed.
headerTemplate and footerTemplate do not support loading images by path or URL. To display an img, compress the image as much as possible, convert it to base64, and put it in src; it then displays normally.
headerTemplate and footerTemplate do not support the CSS background property. To build a richly styled header or footer, convert the background to an img and embed it.
The header and footer are not part of the HTML DOM flow. They belong to neither <html> nor <body>, and the page content automatically skips this area, much as in Word. The CSS in the HTML file therefore cannot style them; their styles must be written as inline styles inside the template strings.
The header and footer have some default offsets, so you need to specify margin-top and margin-bottom in the inline styles to adjust their position.
In practice so far, I have found no way in Puppeteer to place page numbers anywhere other than headerTemplate and footerTemplate. Both templates provide built-in page number support: a span with class totalPages is filled with the total number of pages, and one with class pageNumber is filled with the current page number.
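The points above can be combined into a small template builder. This is only a sketch under stated assumptions: buildHeaderFooter, logoBase64, and fontSizePx are hypothetical names, and the pixel values for height, padding, and margins are illustrative and will need tuning for a real layout. The pageNumber and totalPages classes are the ones Puppeteer fills in.

```javascript
// Hypothetical helper that builds header and footer templates for page.pdf().
// All styling is inline because the page's own CSS does not reach these templates.
function buildHeaderFooter({ logoBase64, fontSizePx = 10 }) {
  // Images must be embedded as base64 data URIs; paths and URLs are not loaded.
  const headerTemplate = `
    <div style="width: 100%; margin-top: -4px; padding: 0 20px; font-size: ${fontSizePx}px;">
      <img style="height: 24px;" src="data:image/png;base64,${logoBase64}" />
    </div>`;
  // Puppeteer injects values into spans with these class names.
  const footerTemplate = `
    <div style="width: 100%; margin-bottom: -4px; text-align: center; font-size: ${fontSizePx}px;">
      <span class="pageNumber"></span> / <span class="totalPages"></span>
    </div>`;
  return { headerTemplate, footerTemplate };
}
```

The returned strings would be passed to page.pdf() together with displayHeaderFooter: true and a top and bottom margin large enough to hold them.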

3.6 Solution when Echarts or images break across pages

When generating a multi-page document, it is hard to avoid long images being split across pages. Even mature software like Word leaves the problem to the user by simply not allowing a long image to break across pages, so users have to crop or shrink the image manually.

An image broken across pages in Puppeteer looks like this (with header, footer, and margin configured):

However, for complex dynamic PDF generation there is no opportunity to crop or shrink by hand, and doing so could compromise the desired result or make development too costly. Word documents are essentially static: the user fixes the content of each page, so the page count is never uncertain. By contrast, when generating a vertical Echarts chart, a large number of data items can take up an unpredictable number of pages. There are three general solutions:

Agree with the product team to keep the page layout as static as possible, so that the position and size of everything on each page is fixed. This is the safest option.
If it is acceptable for images or Echarts charts to break across pages, no special handling is needed (as long as header, footer, and margin are set), but an image may be split between two pages.
If Echarts charts must be split cleanly, handle the dynamic height explicitly. Where you previously created a single Echarts instance, create as many instances as the data requires, calculate exactly how many items fit in each chart, and combine that with the page height to handle page breaks. For example, suppose a chart has 40 items stacked vertically, each page holds 15, and the current page has room for 8 more. That takes four chart instances over four pages, holding 8, 15, 15, and 2 items respectively, with the right height set dynamically on each chart's DOM element. All of this can be controlled with styles set dynamically from JS together with the CSS properties page-break-after: always; and page-break-before: always;. The code differs considerably from scenario to scenario, so it is not listed here; the picture gives a rough idea of the result, in which the Echarts chart is split cleanly.
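The item arithmetic in the third option can be sketched as a small function. This is an illustrative sketch only: splitItemsAcrossPages and its parameter names are hypothetical, and it covers just the counting step, not the creation of the Echarts instances or the page-break styling.

```javascript
// Split chart items into per-page chunks.
// totalItems: number of items in the chart
// itemsPerPage: how many items fit on a full page
// roomOnCurrentPage: how many items still fit on the page where the chart starts
function splitItemsAcrossPages(totalItems, itemsPerPage, roomOnCurrentPage) {
  if (!Number.isInteger(itemsPerPage) || itemsPerPage <= 0) {
    throw new RangeError('itemsPerPage must be a positive integer');
  }
  const counts = [];
  let remaining = totalItems;
  // The first chart instance fills whatever space is left on the current page.
  const first = Math.min(remaining, Math.max(0, roomOnCurrentPage));
  if (first > 0) {
    counts.push(first);
    remaining -= first;
  }
  // Every following instance gets a full page, except possibly the last one.
  while (remaining > 0) {
    const n = Math.min(remaining, itemsPerPage);
    counts.push(n);
    remaining -= n;
  }
  return counts;
}

console.log(splitItemsAcrossPages(40, 15, 8)); // [ 8, 15, 15, 2 ]
```

Each entry in the result corresponds to one chart instance; multiplying the entry by the per-item height gives the height to set on that instance's DOM element.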

3.7 Native Table cross-page solution

As above, if a table that spans pages is not handled, it looks like this (with header, footer, and margin set):

Handling tables across pages is relatively simple, using thead:

table thead {
   display: table-header-group;
   break-inside: avoid;
}

The table header is then repeated automatically on each page, and the result looks like this:

3.8 How to debug in a what-you-see-is-what-you-get way

Because we are working with a headless browser, we cannot see the page the browser draws. We can save the buffer as a PDF file and open it; this is the most faithful check, but it makes debugging very slow.

Instead, we can add a small script to the relevant EJS template:

// Open the print preview two seconds after the page loads
setTimeout(() => {
  window.print()
}, 2000)

Then add a new route on the back end that serves the resulting HTML string (including data), so it can be viewed directly in the browser.
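A back-end route of this kind might look like the following sketch. It uses Node's built-in http module rather than a Koa router so that it stays self-contained; createPreviewHandler, createPreviewServer, and renderHtml are hypothetical names, and renderHtml stands for whatever produces the HTML string (for example the EJS rendering step).

```javascript
const http = require('http');

// Hypothetical preview handler: serves the rendered HTML so it can be opened
// in a regular browser. `renderHtml` is an assumed async function that returns
// the same HTML string that would be passed to Puppeteer.
function createPreviewHandler(renderHtml) {
  return async (req, res) => {
    try {
      const html = await renderHtml();
      res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
      res.end(html);
    } catch (err) {
      // Surface template errors in the browser while debugging.
      res.writeHead(500, { 'Content-Type': 'text/plain; charset=utf-8' });
      res.end(String(err));
    }
  };
}

// Wrap the handler in an HTTP server; the caller decides which port to listen on.
function createPreviewServer(renderHtml) {
  return http.createServer(createPreviewHandler(renderHtml));
}
```

Calling createPreviewServer(() => getHTML(data)).listen(3001) and opening the page in a browser would then show the template with real data; the port number here is arbitrary.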

The reason for using window.print is that PDF output differs from the normal HTML DOM flow in several ways, such as the header and footer and the page width; in normal display, HTML does not even have the concept of a page.

In our application, an A4 PDF page measures 794px x 1124px (including header and footer).

window.print roughly simulates how the PDF will render; combined with checks of the final generated PDF, it greatly improves debugging efficiency.
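The A4 pixel size quoted above can be cross-checked with a little arithmetic. The sketch below assumes the CSS convention of 96 pixels per inch and the ISO A4 size of 210mm x 297mm; mmToCssPx is a hypothetical helper name.

```javascript
// CSS defines 1in = 96px, and 1in = 25.4mm.
const CSS_PX_PER_INCH = 96;
const MM_PER_INCH = 25.4;

// Convert a length in millimetres to CSS pixels.
function mmToCssPx(mm) {
  return (mm / MM_PER_INCH) * CSS_PX_PER_INCH;
}

// ISO A4 is 210mm x 297mm.
const a4 = {
  width: Math.round(mmToCssPx(210)),
  height: Math.round(mmToCssPx(297)),
};

console.log(a4); // { width: 794, height: 1123 }
```

The width matches the figure observed in the project, and the height differs by one pixel, which is presumably down to rounding or to how the page was measured.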

3.9 Best practices for HTML to PDF (creation and generation of PDF only)

Because this project generates dozens of PDFs at the same time, we use Promise.all() to process them concurrently.

I also ran a simple test on when to close the browser and on the number of Browser instances in Puppeteer, with the following result:

Processing speed: multiple Browser instances > a single Browser instance in multi-process mode > a single Browser instance in single-process mode.

Of course, this depends on the business scenario and the server environment. The test above was not systematic, and the feature does not need to handle high concurrency, so the speed requirements are modest.

In this project each PDF is large and rendering the HTML takes a long time, so multiple Browser instances have the advantage. Server specifications also affect processing speed, so raising them where possible helps, since Puppeteer is essentially running several browsers and consumes a lot of server resources.

Therefore, in this project a Browser instance is created for each PDF, trading resources for time.
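The one-Browser-per-PDF approach combined with Promise.all() can be sketched as follows. This is a simplified illustration, not the project's code: generateAllPdfs and launchBrowser are hypothetical names, and the launcher is injected so the sketch does not depend on how the browser is configured. With Puppeteer, launchBrowser would be something like () => puppeteer.launch().

```javascript
// Hypothetical sketch: one Browser instance per PDF, all generated concurrently.
// `htmlList` is an array of HTML strings; `launchBrowser` returns a Promise of a
// Puppeteer-style Browser (for example () => puppeteer.launch()).
async function generateAllPdfs(htmlList, launchBrowser) {
  return Promise.all(
    htmlList.map(async (html) => {
      const browser = await launchBrowser();
      try {
        const page = await browser.newPage();
        await page.setContent(html, { waitUntil: 'networkidle2' });
        // Resolves to the PDF data for this document.
        return await page.pdf({ format: 'a4', printBackground: true });
      } finally {
        // Always release the browser, even if rendering fails.
        await browser.close();
      }
    })
  );
}
```

Note that Promise.all() starts every browser at once, so the number of documents per batch has to fit the memory and CPU available on the server.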

After reading a number of articles and running into various pitfalls, I arrived at what works best for us. It is of course specific to this project's scenario, and it only covers creating and generating PDFs with Puppeteer, not UI development. The code contains some additional notes.

The args available when launching the browser are listed in:

Chrome Launch Parameters

const puppeteer = require("puppeteer");

// htmlStr: the HTML string rendered from the EJS template
async function getPDF(htmlStr) {
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      "--no-sandbox", // Must be enabled on Linux
      "--no-zygote",
      // "--single-process", // Single-process mode is turned off here
      "--disable-setuid-sandbox",
      "--disable-gpu",
      "--disable-dev-shm-usage",
      "--no-first-run",
      "--disable-extensions",
      "--disable-file-system",
      "--disable-background-networking",
      "--disable-default-apps",
      "--disable-sync", // Disable synchronization
      "--disable-translate",
      "--hide-scrollbars",
      "--metrics-recording-only",
      "--mute-audio",
      "--safebrowsing-disable-auto-update",
      "--ignore-certificate-errors",
      "--ignore-ssl-errors",
      "--ignore-certificate-errors-spki-list",
      "--font-render-hinting=medium",
    ],
  });
  // try... catch...
  try {
    const page = await browser.newPage();
    // Header template (the image uses base64; the base64 value in src is a placeholder)
    const headerTemplate = `
      <div>
        <img src="data:image/png;base64,iVBORw0KGgoAAAxxxxxx" />
      </div>
    `;
    // Footer template (the span with class pageNumber is filled with the current page number;
    // the inline styles are illustrative)
    const footerTemplate = `
      <div style="width: 100%; text-align: center; font-size: 10px;">
        <span class="pageNumber"></span>
      </div>
    `;
    // Large PDFs can take a long time to generate, so the navigation timeout is disabled
    page.setDefaultNavigationTimeout(0);
    // Set the HTML content
    await page.setContent(htmlStr, { waitUntil: "networkidle2" });
    // Wait for fonts to finish loading
    await page.evaluateHandle("document.fonts.ready");
    const pdfbuf = await page.pdf({
      // Page scaling
      scale: 1,
      // Whether to display the header and footer
      displayHeaderFooter: true,
      // Page size of the PDF
      format: "a4",
      // Header template
      headerTemplate,
      // Footer template
      footerTemplate,
      // Page margins
      margin: { top: 50, bottom: 50, left: 0, right: 0 },
      // The page ranges to output (an empty string means all pages)
      pageRanges: "",
      // Give any CSS @page size priority over format
      preferCSSPageSize: true,
      // Render backgrounds. Chrome omits background images and colors when printing by default, to save ink.
      // Easy to miss: this must be enabled
      printBackground: true,
    });
    // Close the browser
    await browser.close();
    // Return the buffer directly to the front end for download instead of first saving it as a PDF file, which is faster
    return pdfbuf;
  } catch (e) {
    await browser.close();
    throw e;
  }
}

This article only gives a rough summary of these topics; I will add more details later when time allows.

Since this is a company project, I cannot post the actual code or the final output. If you have any questions, feel free to contact me.

References

Chrome DevTools Protocol
Wiki Headless_browser
Advantages and disadvantages of headless browser testing
How headless Chrome works
A preliminary study on the headless browser Puppeteer
Puppeteer Github
Linux Chinese font installation
Ejs official document
Chrome Launch Parameters
