I’m a Gopher, so from a Gopher’s point of view I’m going to list every PDF processing scenario I’ve run into. For example:

  • PDF Render
  • PDF Verify
  • PDF Watermark
  • PDF Get Pages
  • PDF Merge
  • PDF Split
  • Fix Damaged PDF
  • PDF Convert to PNG
  • Identify font in PDF
  • PDF Decrypt
  • ...

This article is mostly a list of scenarios and the problems that come with them. Use the section titles to jump to the parts you care about.

I’m not an expert on many PDF topics, so feel free to get in touch if you have questions or corrections.


One, HTML page rendering PDF

To render PDF from HTML page, I have used the following two schemes:

  • wkhtmltopdf
  • chromedp

1. Render PDF using wkhtmltopdf

wkhtmltopdf is a command-line tool that renders HTML pages to PDF, based on the Qt WebKit rendering engine.

Basic usage:

# Print a static HTML page as a PDF
$ wkhtmltopdf input.html output.pdf

# Print a web page as a PDF
$ wkhtmltopdf https://www.google.com output.pdf

wkhtmltopdf has many options. For example, it can send HTTP POST fields, which is useful when the page to render is a custom-built endpoint that expects form data:

$ wkhtmltopdf --help
...
--post <name> <value>           Add an additional post field (repeatable)
...

It can also run JavaScript to modify the HTML before the PDF is rendered:

$ wkhtmltopdf --run-script "javascript:(function(){document.getElementsByClassName('dom_class_name')[0].style.display = 'none'}())" page input.html output.pdf

See the official documentation for the full list of options.

If you are using Go, there is a third-party package that wraps wkhtmltopdf: go-wkhtmltopdf
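For simple cases you can also call the binary directly through os/exec instead of pulling in a wrapper package. The following is only a minimal sketch: it assumes wkhtmltopdf is installed and on PATH, and the helper names wkhtmltopdfArgs and renderPDF are made up for this example. The only flag it uses is --post, as listed in the help output above.

```go
package main

import (
	"fmt"
	"os/exec"
)

// wkhtmltopdfArgs builds the argument list for one wkhtmltopdf call.
// Each post field becomes a "--post name value" triple, followed by
// the input and output paths. (Hypothetical helper, not part of any library.)
func wkhtmltopdfArgs(input, output string, post [][2]string) []string {
	var args []string
	for _, f := range post {
		args = append(args, "--post", f[0], f[1])
	}
	return append(args, input, output)
}

// renderPDF runs the wkhtmltopdf binary found on PATH and returns its
// combined output as part of the error when the command fails.
func renderPDF(input, output string, post [][2]string) error {
	out, err := exec.Command("wkhtmltopdf", wkhtmltopdfArgs(input, output, post)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("wkhtmltopdf failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	if err := renderPDF("input.html", "output.pdf", nil); err != nil {
		fmt.Println(err)
	}
}
```

Keeping the argument building separate from the exec call makes the command easy to log and to unit test without the binary installed.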

2. Render PDF with chromedp

chromedp is a Go package that offers a faster, simpler way to drive browsers that speak the Chrome DevTools Protocol, without external dependencies such as Selenium or PhantomJS. It does need a Chrome or Chromium binary installed on the machine.

Usage:

package main

import (
	"context"
	"fmt"
	"os"

	"github.com/chromedp/cdproto/page"
	"github.com/chromedp/chromedp"
)

func main() {
	if err := ChromedpPrintPdf("https://www.google.com", "/path/to/file.pdf"); err != nil {
		fmt.Println(err)
		return
	}
}

// ChromedpPrintPdf loads url in headless Chrome and writes the printed PDF to the file path "to".
func ChromedpPrintPdf(url string, to string) error {
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	var buf []byte
	err := chromedp.Run(ctx, chromedp.Tasks{
		chromedp.Navigate(url),
		// Wait until <body> is present before printing
		chromedp.WaitReady("body"),
		chromedp.ActionFunc(func(ctx context.Context) error {
			var err error
			// PrintToPDF returns the PDF bytes; the stream handle is not needed here
			buf, _, err = page.PrintToPDF().Do(ctx)
			return err
		}),
	})
	if err != nil {
		return fmt.Errorf("chromedp Run failed,err:%+v", err)
	}

	if err := os.WriteFile(to, buf, 0644); err != nil {
		return fmt.Errorf("write to file failed,err:%+v", err)
	}

	return nil
}

Two, PDF watermark

Tools I’ve looked at that support PDF watermarking:

  • unidoc/unipdf
  • pdfcpu

1. unidoc/unipdf

unipdf is a PDF library written in Go. It can be used through its API or through a CLI, and supports the following commands:

$ unipdf -h
...
Available Commands:
  decrypt     Decrypt PDF files
  encrypt     Encrypt PDF files
  explode     Explodes the input file into separate single page PDF files
  extract     Extract PDF resources
  form        PDF form operations
  grayscale   Convert PDF to grayscale
  help        Help about any command
  info        Output PDF information
  merge       Merge PDF files
  optimize    Optimize PDF files
  passwd      Change PDF passwords
  rotate      Rotate PDF file pages
  search      Search text in PDF files
  split       Split PDF files
  version     Output version information and exit
  watermark   Add watermark to PDF files
...

Adding a watermark in CLI mode

$ unipdf watermark in.pdf watermark.png -o out.pdf

Watermark successfully applied to in.pdf
Output file saved to out.pdf

To add a watermark through the API, see the examples in the unipdf GitHub repository.

Note: UniDoc products require a purchased license.

2. pdfcpu

pdfcpu is a PDF processing library written in Go that can be used through its API or through a CLI.

The following functions are supported:

$ pdfcpu help
...
The commands are:

   attachments list, add, remove, extract embedded file attachments
   changeopw   change owner password
   changeupw   change user password
   decrypt     remove password protection
   encrypt     set password protection
   extract     extract images, fonts, content, pages, metadata
   fonts       install, list supported fonts
   grid        rearrange pages or images for enhanced browsing experience
   import      import/convert images to PDF
   info        print file info
   merge       concatenate 2 or more PDFs
   nup         rearrange pages or images for reduced number of pages
   optimize    optimize PDF by getting rid of redundant page resources
   pages       insert, remove selected pages
   paper       print list of supported paper sizes
   permissions list, set user access permissions
   rotate      rotate pages
   split       split multi-page PDF into several PDFs according to split span
   stamp       add, remove, update text, image or PDF stamps for selected pages
   trim        create trimmed version of selected pages
   validate    validate PDF against PDF 32000-1:2008 (PDF 1.7)
   version     print version
   watermark   add, remove, update text, image or PDF watermarks for selected pages
...

Use the CLI to add an image watermark:

$ pdfcpu watermark add -mode image 'voucher_watermark.png' 's:1 abs, rot:0' in.pdf out.pdf

Call the API to add a watermark

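package main

import (
	"log"

	"github.com/pdfcpu/pdfcpu/pkg/api"
	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu"
)

// Signatures as in the pdfcpu release this article was written against;
// check the pdfcpu docs for the version you have installed.
func main() {
	onTop := false // false: watermark under the page content; true: stamp on top
	wm, err := pdfcpu.ParseImageWatermarkDetails("watermark.png", "s:1 abs, rot:0", onTop)
	if err != nil {
		log.Fatal(err)
	}
	// nil page selection means all pages; nil configuration means the defaults
	if err := api.AddWatermarksFile("in.pdf", "out.pdf", nil, wm, nil); err != nil {
		log.Fatal(err)
	}
}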

Three, PDF merge

  • cpdf
  • unipdf
  • pdfcpu

1. Use cpdf to merge PDFs

cpdf is a free, open-source command-line PDF tool with a rich feature set, such as:

  • Merge PDF files together, or split them apart
  • Encrypt and decrypt
  • Scale, crop and rotate pages
  • Read and set document info and metadata
  • Copy, add or remove bookmarks
  • Stamp logos, text, dates, page numbers
  • Add or remove attachments
  • Losslessly compress PDF files

Merge PDFs:

$ cpdf -merge input1.pdf input2.pdf -o output.pdf

2. Use unipdf to merge PDFs

$ unipdf merge output.pdf input1.pdf input2.pdf

To merge PDFs through the API, see the examples in the unipdf GitHub repository.

3. Use pdfcpu to merge PDFs

$ pdfcpu merge output.pdf input1.pdf input2.pdf

Note: pdfcpu only supports PDF versions up to PDF 1.7

Four, PDF split

  • cpdf
  • unipdf
  • pdfcpu

1. Split PDFs using cpdf

## Split each page into a single PDF
$ cpdf -split in.pdf 1 even -chunk 1 -o ./out%%%.pdf

2. Split PDFs using unipdf

## Split off the first page
$ unipdf split input.pdf out.pdf 1-1

To split PDFs through the API, see the examples in the unipdf GitHub repository.

3. Split PDFs using pdfcpu

$ pdfcpu split in.pdf .

Five, PDF to image

  • mupdf
  • xpdf

1. Convert PDF to images with MuPDF

MuPDF is a lightweight PDF, XPS, and e-book viewer.

MuPDF consists of a software library, command-line tools, and viewers for various platforms.

After downloading MuPDF, you get a set of tools, such as:

mupdf
pdfdraw
pdfinfo
pdfclean
pdfextract
pdfshow
xpsdraw

Of these, pdfdraw converts PDF pages to images. In newer MuPDF releases these tools are subcommands of mutool (for example, mutool draw and mutool clean).

$ pdfdraw -o out%d.png in.pdf

2. Convert PDF to images with Xpdf

Xpdf is a free PDF toolkit that covers text extraction, image conversion, HTML conversion, and more.

After downloading the package, you get a series of tools:

pdfdetach
pdffonts
pdfimages
pdfinfo
pdftohtml
pdftopng
pdftoppm
pdftops
pdftotext

The names give a rough idea of what each tool does.

## Convert PDF to PNG using pdftopng
$ pdftopng in.pdf out-prefix

Six, PDF decryption

It’s not uncommon to try to read a PDF file and get an error saying the file is encrypted.

How do you deal with that when you don’t have the password?

  • Use qpdf to decrypt

qpdf is a command-line PDF tool. Forcing decryption with it succeeds in some cases and fails in others: it works when only an owner (permissions) password is set and the user password is empty, but a file that needs a password just to open cannot be decrypted this way.

$ qpdf --decrypt in.pdf out.pdf

  • Use pdfcpu to decrypt

$ pdfcpu decrypt encrypted.pdf output.pdf

When a password is available, decrypt with it:

  • Use unipdf to decrypt

$ unipdf decrypt -p pass -o output.pdf input.pdf

Seven, PDF identification

Some scenarios come up again and again: checking whether a file really is a PDF, extracting the text in a PDF, extracting the images in a PDF, and so on.
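For the first of these, checking whether a file is a PDF at all, a quick test is to look for the "%PDF-" header at the start of the file. The sketch below is a strict version of that check, using only the standard library. The function names are made up for this example, and it will reject files that some viewers still open, such as a PDF with junk bytes before the header. It also says nothing about whether the rest of the file is well formed.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// looksLikePDF reports whether data begins with the "%PDF-" header.
func looksLikePDF(data []byte) bool {
	return bytes.HasPrefix(data, []byte("%PDF-"))
}

// isPDFFile reads the first five bytes of path and checks them
// against the PDF header.
func isPDFFile(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	head := make([]byte, 5)
	if _, err := io.ReadFull(f, head); err != nil {
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return false, nil // file is shorter than the header: not a PDF
		}
		return false, err
	}
	return looksLikePDF(head), nil
}

func main() {
	ok, err := isPDFFile("input.pdf")
	fmt.Println(ok, err)
}
```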

1. Identify the text in the PDF

Use Xpdf to extract the text from the PDF, then apply string manipulation or regular expressions for the business-specific analysis.

  • Use xpdf/pdftotext to extract the text in the PDF

$ pdftotext input.pdf output.txt

  • Use unipdf to extract the text in the PDF

$ unipdf extract text input.pdf

To extract PDF text through the API, see the examples in the unipdf GitHub repository.
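Once the text is extracted, the "business analysis" step is ordinary string processing. As an illustration only, the sketch below pulls one field out of extracted text with a regular expression. The "Invoice No:" label, the sample text, and the function name are all invented for this example; in practice the input would be the contents of the output.txt written by pdftotext.

```go
package main

import (
	"fmt"
	"regexp"
)

// invoiceNoRe matches a made-up "Invoice No: <id>" line.
// Adapt the pattern to the layout of your own documents.
var invoiceNoRe = regexp.MustCompile(`(?m)^Invoice No:[ \t]*(\S+)`)

// findInvoiceNo returns the first invoice number found in text
// extracted from a PDF, and false when there is none.
func findInvoiceNo(text string) (string, bool) {
	m := invoiceNoRe.FindStringSubmatch(text)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Stand-in for text read from the file produced by pdftotext.
	text := "ACME Ltd\nInvoice No: INV-0042\nTotal: 10.00\n"
	no, ok := findInvoiceNo(text)
	fmt.Println(no, ok)
}
```

This works well when the label and value land on the same line of the extracted text. When the layout scatters them, the coordinate-based approach is usually a better fit.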

  • Parse PDF data using coordinate information

The approach above extracts all the text first and then processes it according to business needs.

Another way is to extract text by its coordinates on the page, which is more flexible and more general. PDFlib TET supports this.

# Pass a rectangle of coordinates to extract the data inside it
$ tet --pageopt "includebox={{38 707.93 243.91 716.93}}" input.pdf

To find the coordinates, have TET produce a TETML file, which contains position information:

$ tet --tetml input.pdf

There are of course other ways to get the coordinates of data in a PDF, such as Node.js tools.

Note: PDFlib TET is paid software, but according to the official documentation its basic functions work without a license on PDF files of less than 10 pages or less than 1 MB.

The PDFlib TET SDK ships a command-line tool and bindings for C, C++, Java, .NET, Perl, PHP, Python, Ruby, and Swift, but not Go. For a Gopher there are currently only two options: the CLI or cgo.

Eight, Repair damaged PDF files

Some PDF files open normally in a desktop viewer but fail when read from code. For example, parsing a (damaged) PDF with a third-party Go library:

package main

import (
	"fmt"

	"rsc.io/pdf"
)

func main() {
	filePath := "path/to/your/broken.pdf"
	// pdf.Open parses the cross-reference table, so a damaged file fails here
	_, err := pdf.Open(filePath)
	if err != nil {
		fmt.Println("open pdf failed,err:", err.Error())
		return
	}
}

When you run it, you get this:

open pdf failed,err: malformed PDF: cross-reference table not found: {5 0 obj}<</Contents 6 0 R /Group <</CS /DeviceRGB /S /Transparency /Type /Group>> /MediaBox [0 0 595.27600098 841.89001465] /Parent 3 0 R /Type /Page>>

The file opens fine on the desktop, yet the program fails to read it!

If you open the PDF in a desktop viewer, save it as a new PDF file, and check the new file with the same code, you will find that it is fixed!

Great, problem solved!

Wait a minute. If I have 1000 PDF files, do I have to open and save each one by hand? That’s not acceptable, so a batch repair feature would be nice.

After a long search online, I found three candidate solutions:

  • Use the Acrobat SDK and call its save-as function, which has the same effect as opening and re-saving on the desktop
  • Repair the PDF with Ghostscript
  • Repair the PDF with MuPDF

I have only verified the third one, using mupdf-0.9-linux-amd64.

The downloaded package contains an executable called pdfclean:

$ pdfclean broken.pdf repaired.pdf

+ pdf/pdf_xref.c:160: pdf_read_trailer(): cannot recognize xref format: '%'
| pdf/pdf_xref.c:481: pdf_load_xref(): cannot read trailer
\ pdf/pdf_xref.c:537: pdf_open_xref_with_stream(): trying to repair

Judging from the output, MuPDF could not read the cross-reference table and attempted a repair.

Open the new PDF file with the earlier Go code and it now reads without errors.

All that’s left is to write a bash script to repair the files in batch, and that’s it!
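The batch step can just as well be written in Go, which keeps it in the same language as the rest of the tooling here. The sketch below runs pdfclean on every .pdf file directly inside a directory. It assumes pdfclean is on PATH; the directory names and the function names are placeholders for this example. With a newer MuPDF the command would be mutool clean instead.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// repairedPath maps a source file to a file of the same name in outDir.
func repairedPath(src, outDir string) string {
	return filepath.Join(outDir, filepath.Base(src))
}

// repairAll runs "pdfclean <src> <dst>" for every .pdf file directly
// inside inDir and returns the files for which the command failed.
func repairAll(inDir, outDir string) ([]string, error) {
	entries, err := os.ReadDir(inDir)
	if err != nil {
		return nil, err
	}
	if err := os.MkdirAll(outDir, 0o755); err != nil {
		return nil, err
	}

	var failed []string
	for _, e := range entries {
		if e.IsDir() || !strings.EqualFold(filepath.Ext(e.Name()), ".pdf") {
			continue // skip subdirectories and non-PDF files
		}
		src := filepath.Join(inDir, e.Name())
		if err := exec.Command("pdfclean", src, repairedPath(src, outDir)).Run(); err != nil {
			failed = append(failed, src)
		}
	}
	return failed, nil
}

func main() {
	failed, err := repairAll("broken", "repaired")
	if err != nil {
		fmt.Println("batch repair failed:", err)
		return
	}
	fmt.Println("could not repair:", failed)
}
```

Writing the repaired files to a separate directory leaves the originals untouched, so a failed run can simply be repeated.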

Nine, Identify the font information of a PDF file

Sometimes, to keep the text fonts consistent across several PDFs, you need to find out which fonts a PDF uses. Xpdf’s pdffonts does this:

$ pdffonts input.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
NimbusSanL-Regu                      CID TrueType      Identity-H       yes no  yes     10  0
NimbusSanL-Bold                      CID TrueType      Identity-H       yes no  yes     20  0
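To compare fonts across several PDFs from Go, one option is to run pdffonts for each file and keep only the name column. The sketch below parses that column out of the command output. It is a rough approach that assumes font names contain no spaces and that the dashed separator row is present; the function name is made up for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// fontNames extracts the first column (the font name) from pdffonts
// output. It skips everything up to and including the dashed separator
// row and assumes font names contain no spaces.
func fontNames(output string) []string {
	var names []string
	seenSeparator := false
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := sc.Text()
		if !seenSeparator {
			seenSeparator = strings.HasPrefix(line, "---")
			continue
		}
		if fields := strings.Fields(line); len(fields) > 0 {
			names = append(names, fields[0])
		}
	}
	return names
}

func main() {
	// Stand-in for the captured output of "pdffonts input.pdf".
	out := "name type encoding emb sub uni object ID\n" +
		"------ ---- -------- --- --- --- ---------\n" +
		"NimbusSanL-Regu CID TrueType Identity-H yes no yes 10 0\n" +
		"NimbusSanL-Bold CID TrueType Identity-H yes no yes 20 0\n"
	fmt.Println(fontNames(out))
}
```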

Ten, Other libraries

  • PDF-Writer

An open-source C++ library that supports creating PDFs, merging PDFs, adding image watermarks, text operations, and more.

To use it from Go, you need to wrap it in a layer of cgo code.

  • rsc/pdf

A PDF library implemented in Go that can read information from a PDF, such as its content, pages, and fonts. See its documentation for details.


With so many third-party libraries around, the choice is wide and each has its own strengths. Many features are duplicated across most of them, so which one to pick depends on your actual situation.

I hope these summaries will be helpful to readers


Reference:

  • wkhtmltopdf
  • xpdf
  • cpdf
  • qpdf
  • unidoc
  • pdflib/tet
  • pdfwriter
  • mupdf
  • pdfcpu