I'm a Gopher, so this article lists, from a Gopher's point of view, every PDF processing scenario I have run into. For example:
PDF Render
PDF Verify
PDF Watermark
PDF Get Pages
PDF Merge
PDF Split
Fix Damaged PDF
PDF Convert To PNG
Identify font in PDF
PDF Decrypt
...
This article is mostly a list of scenarios and their problems, so you can use the headings to jump to the parts that interest you.
I am not an expert on many of these PDF topics. Please feel free to get in touch if you have any questions.
One, Rendering an HTML page to PDF
To render an HTML page to PDF, I have used the following two tools:
- wkhtmltopdf
- chromedp
1. Render PDF using wkhtmltopdf
wkhtmltopdf is a command-line tool for rendering HTML pages to PDFs, based on the Qt WebKit rendering engine.
Basic usage:
Print a static HTML page as a PDF
$ wkhtmltopdf input.html output.pdf
Print a web page as a PDF
$ wkhtmltopdf https://www.google.com output.pdf
wkhtmltopdf has many options, for example:
It can send HTTP POST requests, which suits rendering custom-built web pages into PDF files:
$ wkhtmltopdf --help
...
--post <name> <value> Add an additional post field (repeatable)
...
It can run JavaScript to modify the HTML before rendering the PDF:
$ wkhtmltopdf --run-script "javascript:(function(){document.getElementsByClassName('dom_class_name')[0].style.display = 'none'}())" page input.html output.pdf
For more detailed options, see the documentation on the official website.
If you are using Go, there is a third-party package that wraps wkhtmltopdf: go-wkhtmltopdf
2. Render PDF with chromedp
chromedp is a Go package that provides a faster and simpler way to drive browsers that support the Chrome DevTools Protocol, without external dependencies (such as Selenium or PhantomJS).
Usage:
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/chromedp/cdproto/page"
	"github.com/chromedp/chromedp"
)

func main() {
	err := ChromedpPrintPdf("https://www.google.com", "/path/to/file.pdf")
	if err != nil {
		fmt.Println(err)
		return
	}
}

// ChromedpPrintPdf opens url in a headless browser and writes the printed PDF to the path "to".
func ChromedpPrintPdf(url string, to string) error {
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	var buf []byte
	err := chromedp.Run(ctx, chromedp.Tasks{
		chromedp.Navigate(url),
		chromedp.WaitReady("body"),
		chromedp.ActionFunc(func(ctx context.Context) error {
			var err error
			// The second return value is a stream handle, unused here.
			buf, _, err = page.PrintToPDF().Do(ctx)
			return err
		}),
	})
	if err != nil {
		return fmt.Errorf("chromedp Run failed,err:%+v", err)
	}
	if err := os.WriteFile(to, buf, 0644); err != nil {
		return fmt.Errorf("write to file failed,err:%+v", err)
	}
	return nil
}
Two, PDF watermark
Tools I have come across that support PDF watermarking:
- unidoc/unipdf
- pdfcpu
1. unidoc/unipdf
unipdf is a PDF library written in Go. It offers both an API and a CLI, and supports the following functions:
$ unipdf -h
...
Available Commands:
  decrypt     Decrypt PDF files
  encrypt     Encrypt PDF files
  explode     Explodes the input file into separate single page PDF files
  extract     Extract PDF resources
  form        PDF form operations
  grayscale   Convert PDF to grayscale
  help        Help about any command
  info        Output PDF information
  merge       Merge PDF files
  optimize    Optimize PDF files
  passwd      Change PDF passwords
  rotate      Rotate PDF file pages
  search      Search text in PDF files
  split       Split PDF files
  version     Output version information and exit
  watermark   Add watermark to PDF files
...
Adding a watermark in CLI mode
$ unipdf watermark in.pdf watermark.png -o out.pdf
Watermark successfully applied to in.pdf
Output file saved to out.pdf
To add a watermark using the API, see the examples in the unipdf GitHub repository.
Note: UniDoc products require a purchased license.
2. pdfcpu
pdfcpu is a PDF processing library written in Go that offers both an API and a CLI.
The following functions are supported:
$ pdfcpu help
...
The commands are:
attachments list, add, remove, extract embedded file attachments
changeopw change owner password
changeupw change user password
decrypt remove password protection
encrypt set password protection
extract extract images, fonts, content, pages, metadata
fonts install, list supported fonts
grid rearrange pages or images for enhanced browsing experience
import import/convert images to PDF
info print file info
merge concatenate 2 or more PDFs
nup rearrange pages or images for reduced number of pages
optimize optimize PDF by getting rid of redundant page resources
pages insert, remove selected pages
paper print list of supported paper sizes
permissions list, set user access permissions
rotate rotate pages
split split multi-page PDF into several PDFs according to split span
stamp add, remove, update text, image or PDF stamps for selected pages
trim create trimmed version of selected pages
validate validate PDF against PDF 32000-1:2008 (PDF 1.7)
version print version
watermark add, remove, update text, image or PDF watermarks for selected pages
...
Use the CLI to add an image watermark:
$ pdfcpu watermark add -mode image 'voucher_watermark.png' 's:1 abs, rot:0' in.pdf out.pdf
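Call the API to add a watermark:
package main

import (
	"log"

	"github.com/pdfcpu/pdfcpu/pkg/api"
	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu"
)

func main() {
	onTop := false // false: watermark behind the page content; true: stamp on top
	wm, err := pdfcpu.ParseImageWatermarkDetails("watermark.png", "s:1 abs, rot:0", onTop)
	if err != nil {
		log.Fatal(err)
	}
	// nil page selection means all pages; nil configuration means the defaults.
	if err := api.AddWatermarksFile("in.pdf", "out.pdf", nil, wm, nil); err != nil {
		log.Fatal(err)
	}
}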
Three, PDF merge
- cpdf
- unipdf
- pdfcpu
1. Use cpdf to merge PDFs
cpdf is a free, open-source command-line PDF tool with rich features, such as:
- Merge PDF files together, or split them apart
- Encrypt and decrypt
- Scale, crop and rotate pages
- Read and set document info and metadata
- Copy, add or remove bookmarks
- Stamp logos, text, dates, page numbers
- Add or remove attachments
- Losslessly compress PDF files
Merge PDFs:
$ cpdf -merge input1.pdf input2.pdf -o output.pdf
2. Use unipdf to merge PDFs
$ unipdf merge output.pdf input1.pdf input2.pdf
To merge PDFs using the API, see the unipdf GitHub examples.
3. Use pdfcpu to merge PDFs
$ pdfcpu merge output.pdf input1.pdf input2.pdf
Note: pdfcpu only supports PDF versions up to PDF 1.7
Four, Split PDF
- cpdf
- unipdf
- pdfcpu
1. Split PDFs using cpdf
## Split each page into a single PDF
$ cpdf -split in.pdf 1 even -chunk 1 -o ./out%%%.pdf
2. Split PDFs using unipdf
## Split off the first page
$ unipdf split input.pdf out.pdf 1-1
To split PDFs using the API, see the unipdf GitHub examples.
3. Split PDFs using pdfcpu
$ pdfcpu split in.pdf .
Five, PDF to image
- mupdf
- xpdf
1. Use MuPDF to convert PDF to images
MuPDF is a lightweight PDF, XPS, and e-book viewer.
It consists of a software library, command-line tools, and viewers for various platforms.
After downloading MuPDF, you get several tools, such as:
mupdf
pdfdraw
pdfinfo
pdfclean
pdfextract
pdfshow
xpsdraw
Of these, pdfdraw can be used to convert a PDF to images:
$ pdfdraw -o out%d.png in.pdf
2. Use xpdf to convert PDF to images
xpdf is a free PDF toolkit that includes text extraction, image conversion, HTML conversion, and more.
After downloading the package, a series of tools are available:
pdfdetach
pdffonts
pdfimages
pdfinfo
pdftohtml
pdftopng
pdftoppm
pdftops
pdftotext
The names give a rough idea of what each tool does.
## Convert PDF to PNG using pdftopng
$ pdftopng in.pdf out-prefix
Six, PDF decryption
It is not uncommon to read a PDF file and get an error saying that the file is encrypted.
How do you handle this without a password?
- Use qpdf to decrypt
qpdf is a command-line PDF tool. Forcing decryption with it succeeds in some cases and fails in others.
$ qpdf --decrypt in.pdf out.pdf
- Decrypt using pdfcpu
$ pdfcpu decrypt encrypted.pdf output.pdf
When a password is available, you can decrypt with it:
- Use unipdf to decrypt the PDF
$ unipdf decrypt -p pass -o output.pdf input.pdf
Seven, PDF identification
Identification comes up in several forms: checking whether a file is a PDF at all, recognizing the text in a PDF, recognizing the images in a PDF, and so on.
1. Identify the text in the PDF
A tool such as xpdf extracts the text from the PDF, and then string operations or regular expressions do the business-level analysis.
- Use xpdf/pdftotext to parse the text in the PDF
$ pdftotext input.pdf output.txt
- Use unipdf to parse the text in the PDF
$ unipdf extract text input.pdf
To parse PDF text using the API, see the unipdf GitHub examples.
- Parse PDF data using coordinate information
The approaches above extract the whole PDF text first and then process it according to business needs.
Another way is to parse the PDF by coordinate position, which is more flexible and generic, using PDFlib/TET:
# Enter a set of coordinates to parse the data in the PDF
$ tet --pageopt "Includebox = {{38 707.93 243.91 716.93}}" input.pdf
The coordinates can be found by using TET to produce a TETML file, which contains coordinate information:
$ tet --tetml input.pdf
Of course, there are other ways to get the coordinates of data in a PDF, for example with Node.js.
Note: PDFlib/TET is paid software, but according to the official documentation, TET provides basic functions, and you do not need to purchase a license to process PDF files of less than 10 pages or less than 1 MB.
The PDFlib/TET SDK provides a command-line tool and bindings for many languages, such as C/C++/Java/.NET/Perl/PHP/Python/Ruby/Swift, but Go is not among them, so gophers currently have only two options: the CLI or cgo.
Eight, Repair damaged PDF files
Some PDF files open normally on the computer but fail when checked with code. For example, parsing a (damaged) PDF with a third-party Go library:
package main

import (
	"fmt"

	"rsc.io/pdf"
)

func main() {
	filePath := "path/to/your/broken.pdf"
	// Open parses the file structure, so a damaged file fails here.
	_, err := pdf.Open(filePath)
	if err != nil {
		fmt.Println("open pdf failed,err:", err.Error())
		return
	}
}
When you run it, you get this:
open pdf failed,err: malformed PDF: cross-reference table not found: {50 obj}<</Contents 60 R /Group <</CS /DeviceRGB /S /Transparency /Type /Group>> /MediaBox [0 0 595.27600098 841.89001465] /Parent 3 0 R /Type /Page>>
The file opens fine on the computer, yet the program fails to read it!
If you open the PDF on the computer, save it as a new PDF file, and check the new file with the code, you will find that it is fixed.
Great, problem solved!
But wait: if I have 1000 PDF files, do I have to open and save each one by hand? That is not acceptable, so a batch repair feature would be nice.
After a long search online, I came up with three solutions:
- Use the Acrobat SDK and call its save-as function, which has the same effect as opening and re-saving the file on the computer
- Repair the PDF using Ghostscript
- Repair the PDF using MuPDF
I have only verified that the third method works, using the mupdf-0.9-linux-amd64 build.
The downloaded package includes an executable called pdfclean:
$ pdfclean broken.pdf repaired.pdf
+ pdf/pdf_xref.c:160: pdf_read_trailer(): cannot recognize xref format: '%'
| pdf/pdf_xref.c:481: pdf_load_xref(): cannot read trailer
\ pdf/pdf_xref.c:537: pdf_open_xref_with_stream(): trying to repair
Judging from the output, MuPDF attempted a repair.
Open the new PDF file with the previous Go code and it now works.
All that is left is to write a bash script to repair the files in batch, and that's it!
Nine, Identify the font information of a PDF file
Sometimes, to keep the text fonts of multiple PDFs consistent, you need to find out which fonts a PDF uses. In this case, you can use xpdf/pdffonts for font analysis:
$ pdffonts input.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
NimbusSanL-Regu                      CID TrueType      Identity-H       yes no  yes     10  0
NimbusSanL-Bold                      CID TrueType      Identity-H       yes no  yes     20  0
Ten, Other libraries
- PDF-Writer
This is an open-source C++ library that supports creating PDFs, merging PDFs, image watermarks, text operations, and more.
To use this library from Go, you need to wrap it in a layer of cgo code.
- rsc/pdf
This is a PDF library implemented in Go. It can read PDF information such as content, pages, and fonts. For details, refer to its documentation.
This article has introduced many third-party libraries, each with its own strengths. Some features are duplicated across most of them, so which one to choose depends on your actual situation.
I hope this summary is helpful to readers.
Reference:
- wkhtmltopdf
- xpdf
- cpdf
- qpdf
- unidoc
- pdflib/tet
- pdfwriter
- mupdf
- pdfcpu