This tutorial teaches you how to manipulate PDF files in Python

Preface

Hello everyone. We have already published one Python PDF case: batch-merging PDFs. That case was only meant to give you a handy script, so it did not explain much about how it works. It relied on PyPDF2, a very practical module for PDF processing. This article takes a closer look at that module and mainly covers:

Combined use of the os module
Combined use of the glob module
PyPDF2 module operations

Basic operation

The usual import statement for PyPDF2 is:

from PyPDF2 import PdfFileReader, PdfFileWriter

Two classes are imported here:

PdfFileReader can be understood as a reader
PdfFileWriter can be understood as a writer

Let's work through a few examples to see what these two tools can do, using five invoice PDFs.

Each invoice PDF consists of two pages.

Merge

The first task is to combine the 5 invoice PDFs into a single 10-page PDF. How do the reader and the writer work together here?

The logic is as follows:

The reader reads all the PDFs
The reader passes what it has read to the writer
The writer outputs everything to one new PDF

One more important point: the reader can only hand what it reads to the writer one page at a time.

So steps 1 and 2 of the logic are not really independent of each other. Instead, the reader opens one PDF and loops over its pages, handing each page to the writer. The output is written only after all the reading is finished.

Take a look at the code to make things clearer:

from PyPDF2 import PdfFileReader, PdfFileWriter

path = r'C:\Users\xxxxxx'
pdf_writer = PdfFileWriter()  # one writer collects the pages of all invoices

for i in range(1, 6):
    pdf_reader = PdfFileReader(path + '/INV{}.pdf'.format(i))
    # hand the pages to the writer one at a time
    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))

# write once, after every PDF has been read
with open(path + r'\merge PDF\merge.pdf', 'wb') as out:
    pdf_writer.write(out)

Because every page must be handed to the same writer for the final output, the writer must be initialized outside the loop body.

If it were initialized inside the loop body, a new writer would be created for every PDF that is read, so the pages handed over earlier would be discarded each time and the merge would fail!

The code at the beginning of the loop body:

for i in range(1, 6):
    pdf_reader = PdfFileReader(path + '/INV{}.pdf'.format(i))

Its purpose is to bring in one PDF file per iteration and hand it to the reader for the operations that follow. This way of writing it is actually not recommended: it only works because the PDF names happen to be very regular, so the loop can name them by number. A better approach is to use the glob module:

import glob

for file in glob.glob(path + '/*.pdf'):
    # open the file found by glob, not the folder path
    pdf_reader = PdfFileReader(file)
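The folder scan and the merge loop can be put together in one sketch. It assumes a PyPDF2 release that still provides the PdfFileReader and PdfFileWriter names used in this article, and the helper names list_pdfs and merge_pdfs are made up for illustration. glob makes no promise about the order of its results, so the list is sorted to keep the invoices in sequence.

```python
import glob
import os


def list_pdfs(folder):
    """Return the PDF files directly inside folder, sorted by name."""
    # glob gives no ordering guarantee, so sort for a stable page order.
    return sorted(glob.glob(os.path.join(folder, '*.pdf')))


def merge_pdfs(files, output_path):
    """Append every page of every file to one writer, then write it out."""
    # Imported here so list_pdfs can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_writer = PdfFileWriter()  # one writer shared by all input files
    for file in files:
        pdf_reader = PdfFileReader(file)
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))

    # Create the output folder if it does not exist yet.
    os.makedirs(os.path.dirname(output_path) or '.', exist_ok=True)
    with open(output_path, 'wb') as out:
        pdf_writer.write(out)
```

A call such as merge_pdfs(list_pdfs(path), os.path.join(path, 'merged', 'merge.pdf')) writes the result into a subfolder (the name 'merged' is only an example), so that a second run does not pick up the merged file as an input.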

pdf_reader.getNumPages() returns the number of pages in the reader; combined with range, it lets you iterate over all of the reader's pages.

pdf_writer.addPage(pdf_reader.getPage(page)) hands the current page to the writer.

Finally, the writer's pdf_writer.write(out) method creates the new PDF.

Split

If you understood how the reader and the writer cooperate in the merge, splitting is easy to follow. Here we split INV1.pdf into two separate PDF documents, one per page.

The reader reads the PDF document
The reader gives the writer one page at a time
The writer outputs as soon as it receives a page

This logic also shows that the writer must be initialized, and its output written, inside the loop over the pages of the PDF, not outside it.

The code is simple:

from PyPDF2 import PdfFileReader, PdfFileWriter

path = r'C:\Users\xxx'
pdf_reader = PdfFileReader(path + r'\INV1.pdf')

for page in range(pdf_reader.getNumPages()):
    # a fresh writer for every page, so each output holds exactly one page
    pdf_writer = PdfFileWriter()
    pdf_writer.addPage(pdf_reader.getPage(page))
    with open(path + r'\INV1-{}.pdf'.format(page + 1), 'wb') as out:
        pdf_writer.write(out)
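The same split can be written without hard-coding the file name, using os.path to derive each output name from the source file. This is only a sketch: page_output_path and split_pdf are hypothetical helper names, and it assumes the same older PyPDF2 API used in this article.

```python
import os


def page_output_path(source_path, page_number):
    """Build 'name-N.pdf' next to the source, e.g. INV1.pdf -> INV1-1.pdf."""
    root, ext = os.path.splitext(source_path)
    return '{}-{}{}'.format(root, page_number, ext)


def split_pdf(source_path):
    """Write each page of source_path to its own single-page PDF."""
    # Imported here so page_output_path works without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_reader = PdfFileReader(source_path)
    outputs = []
    for page in range(pdf_reader.getNumPages()):
        pdf_writer = PdfFileWriter()  # a fresh writer for every page
        pdf_writer.addPage(pdf_reader.getPage(page))
        target = page_output_path(source_path, page + 1)
        with open(target, 'wb') as out:
            pdf_writer.write(out)
        outputs.append(target)
    return outputs
```

split_pdf returns the list of files it wrote, which makes it easy to chain with other steps such as encrypting each page afterwards.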

Watermark

The task is to add an image as a watermark to INV1.pdf.

The first step is preparation. Insert the image to be used as the watermark into a Word document, adjust its position, and save the document as a PDF file. Then you can write the code, which also needs the standard copy module, as explained in detail below.

First initialize the reader and the writer, and read the watermark PDF page for later use. The core logic is slightly harder to understand:

Watermarking essentially means merging the watermark PDF page with every page that needs a watermark.

The PDF to be watermarked may have many pages, while the watermark PDF has only one. If the watermark page were merged directly, you can think of it as being used up on the first page: once it has been merged, the clean watermark page is gone.

So the watermark page cannot be merged directly. Instead, copy it into a fresh new_page for each page, use the .mergePage method to merge it with that page, and hand the merged page to the writer for the final output!

On the use of .mergePage: the call has the form (page that appears underneath).mergePage(page that appears on top).
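The copy-then-merge idea described above can be sketched as follows. This is an illustration under assumptions, not the article's original code: it uses the older PyPDF2 names (PdfFileReader, PdfFileWriter, mergePage), the function names watermark_pages and add_watermark are hypothetical, and the watermark PDF is assumed to contain a single page.

```python
from copy import copy


def watermark_pages(pages, watermark_page):
    """Return new pages: a copy of the watermark with each page merged on top."""
    result = []
    for page in pages:
        # Copy first, so the original watermark page is never modified.
        new_page = copy(watermark_page)
        # The watermark copy is underneath, the document page goes on top.
        new_page.mergePage(page)
        result.append(new_page)
    return result


def add_watermark(source_path, watermark_path, output_path):
    """Watermark every page of source_path and write the result."""
    # Imported here so watermark_pages can be tried without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    watermark_reader = PdfFileReader(watermark_path)
    watermark_page = watermark_reader.getPage(0)  # assumed single-page PDF

    pdf_reader = PdfFileReader(source_path)
    pages = [pdf_reader.getPage(i) for i in range(pdf_reader.getNumPages())]

    pdf_writer = PdfFileWriter()
    for new_page in watermark_pages(pages, watermark_page):
        pdf_writer.addPage(new_page)

    with open(output_path, 'wb') as out:
        pdf_writer.write(out)
```

A call might look like add_watermark('INV1.pdf', 'watermark.pdf', 'INV1-marked.pdf'), where the last two file names are placeholders.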

Encryption

Encryption is easy. Just remember one thing: encryption is applied to the writer.

So you only need to call pdf_writer.encrypt(password) once the other operations are done and before the output is written.

Take encrypting a single PDF as an example:
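A minimal sketch of the single-PDF case, assuming the same older PyPDF2 API as the rest of this article. The helper names fill_and_encrypt and encrypt_pdf are made up for illustration; the only encryption call involved is the writer's encrypt method mentioned above.

```python
def fill_and_encrypt(pdf_reader, pdf_writer, password):
    """Copy every page from the reader to the writer, then encrypt the writer."""
    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))
    # Encryption is set on the writer, after the pages are added.
    pdf_writer.encrypt(password)
    return pdf_writer


def encrypt_pdf(source_path, output_path, password):
    """Write a password-protected copy of source_path to output_path."""
    # Imported here so fill_and_encrypt can be tried without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    pdf_writer = fill_and_encrypt(
        PdfFileReader(source_path), PdfFileWriter(), password)
    with open(output_path, 'wb') as out:
        pdf_writer.write(out)
```

The reader is left untouched; only the file produced by the writer asks for the password when opened.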

Final words

Of course, besides merging, splitting, encrypting, and watermarking PDFs, Python can also be combined with Excel and Word to meet more automation needs. Those are left for the reader to explore.

Finally, I hope you take away that one of the core ideas of Python office automation is batch operation: freeing your hands and automating tedious tasks!
